Curious on LLM from Scratch - Chapter 5 Training Example
Running the Chapter 5 Training Example.
This is an interesting example, and without fully understanding it yet, I asked GitHub Copilot and ChatGPT about the code.
...
Here it is running, along with some notes and updates on the code.
The model has approximately 124 million parameters (hence the name), which come primarily from:
Token Embeddings: vocab_size × emb_dim = 50,257 × 768 parameters
Position Embeddings: context_length × emb_dim = 256 × 768 parameters
Transformer Layers: 12 layers, each with multi-head attention and a feed-forward network (a rough count is sketched below)
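As a sanity check on that figure, here is a back-of-the-envelope count I put together. It is my own arithmetic for the 124M configuration (50,257-token vocabulary, 256-token context, 768-dim embeddings, 12 layers), assuming the output head shares the token-embedding weights as in the original GPT-2; it is not output from the book's code.
'''
# Rough parameter arithmetic for the "124M" GPT-2 configuration (my own estimate,
# assuming the output head is weight-tied to the token embeddings as in GPT-2).
vocab_size, context_length, emb_dim, n_layers = 50_257, 256, 768, 12
ffn_dim = 4 * emb_dim  # 3,072

tok_emb = vocab_size * emb_dim      # 38,597,376
pos_emb = context_length * emb_dim  #    196,608

attn = 3 * (emb_dim * emb_dim + emb_dim)  # Q, K, V projections (weights + biases)
attn += emb_dim * emb_dim + emb_dim       # output projection
ffn = (emb_dim * ffn_dim + ffn_dim) + (ffn_dim * emb_dim + emb_dim)
norms = 2 * 2 * emb_dim                   # two LayerNorms per block (scale + shift)
per_block = attn + ffn + norms            # about 7.1M per transformer block

total = tok_emb + pos_emb + n_layers * per_block + 2 * emb_dim  # plus final LayerNorm
print(f"{total:,}")  # 123,849,984 -> roughly 124 million
'''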
Even though "The Verdict" is only 3,600 words:
The model is designed to learn general language patterns, not just memorize the text
The vocabulary size (50,257 tokens) matches GPT-2's full vocabulary
The architecture follows the smallest GPT-2 configuration (768-dimensional embeddings, 12 layers)
That vocabulary size is the total number of unique tokens the model can recognize and generate (the snippet below confirms it)
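As a quick check (my own snippet, not code from the book's repository), the tiktoken "gpt2" encoding used throughout the book reports exactly that vocabulary size:
'''
import tiktoken

# The same BPE tokenizer the book uses for GPT-2 style models.
tokenizer = tiktoken.get_encoding("gpt2")
print(tokenizer.n_vocab)  # 50257 unique token IDs
'''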
'''
'''
The full vocabulary (50,257 tokens) gives it the capacity to represent many words it never sees in training (see the tokenizer sketch after this block).
It's based on the GPT-2 design, which was trained on millions of documents.
'''
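To make the "words it never sees in training" point concrete, here is a small sketch of my own (the example phrase is made up and does not appear in "The Verdict"): the byte-pair-encoding tokenizer falls back to subword pieces that are in the 50,257-token vocabulary, so unfamiliar words can still be represented.
'''
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

# An illustrative phrase the training text never contains; BPE still encodes it
# by splitting it into subword tokens from the fixed 50,257-entry vocabulary.
ids = tokenizer.encode("blockchain interoperability")
print(ids)
print([tokenizer.decode([i]) for i in ids])  # the subword pieces each ID maps back to
'''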
'''
The word "Every" is converted to its corresponding token ID from the GPT-2 vocabulary.
Embedding Layer: The model has an embedding layer for all 50,257 tokens, including "Every".
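A small sketch of those two steps, using the same tiktoken tokenizer and a standalone nn.Embedding layer as a stand-in for the model's token-embedding matrix (my own illustration, not the book's GPTModel code):
'''
import tiktoken
import torch

tokenizer = tiktoken.get_encoding("gpt2")

# The chapter's example prompt starts with "Every"; each word or word piece
# becomes one of the 50,257 possible token IDs.
ids = tokenizer.encode("Every effort moves you")
print(ids)  # the first ID is the token for "Every"

# A stand-in for the model's token-embedding layer: one 768-dim row per vocabulary entry.
tok_emb = torch.nn.Embedding(num_embeddings=50_257, embedding_dim=768)
vectors = tok_emb(torch.tensor(ids))
print(vectors.shape)  # (number_of_tokens, 768) -- one embedding vector per token
'''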
See:
https://github.com/rasbt/LLMs-from-scratch/tree/main/ch05