Curious on LLM from Scratch - Chapter 5 Training Example
Running the Chapter 5 Training Example.
This is an interesting example, and without fully understanding it yet, I asked GitHub Copilot and ChatGPT about the code.
...
Here it is running, along with some notes and updates on the code.
The model has approximately 124 million parameters (hence the name), which come primarily from:
Token Embeddings: vocab_size × emb_dim = 50,257 × 768 parameters
Position Embeddings: context_length × emb_dim = 256 × 768 parameters
Transformer Layers: 12 layers, each with multi-head attention and a feed-forward network (a rough count is sketched below)
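As a sanity check on that figure, here is a back-of-the-envelope count I put together. It is my own arithmetic for the 124M configuration (50,257-token vocabulary, 256-token context, 768-dim embeddings, 12 layers), assuming the output head shares the token-embedding weights as in the original GPT-2; it is not output from the book's code.
'''
# Rough parameter arithmetic for the "124M" GPT-2 configuration (my own estimate,
# assuming the output head is weight-tied to the token embeddings as in GPT-2).
vocab_size, context_length, emb_dim, n_layers = 50_257, 256, 768, 12
ffn_dim = 4 * emb_dim  # 3,072

tok_emb = vocab_size * emb_dim      # 38,597,376
pos_emb = context_length * emb_dim  #    196,608

attn = 3 * (emb_dim * emb_dim + emb_dim)  # Q, K, V projections (weights + biases)
attn += emb_dim * emb_dim + emb_dim       # output projection
ffn = (emb_dim * ffn_dim + ffn_dim) + (ffn_dim * emb_dim + emb_dim)
norms = 2 * 2 * emb_dim                   # two LayerNorms per block (scale + shift)
per_block = attn + ffn + norms            # about 7.1M per transformer block

total = tok_emb + pos_emb + n_layers * per_block + 2 * emb_dim  # plus final LayerNorm
print(f"{total:,}")  # 123,849,984 -> roughly 124 million
'''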
Even though "The Verdict" is only 3,600 words:
The model is designed to learn general language patterns, not just memorize the text
The vocabulary size (50,257 tokens) matches GPT-2's full vocabulary
The architecture follows the smallest GPT-2 configuration (768-dimensional embeddings, 12 layers)
That vocabulary size is the total number of unique tokens the model can recognize and generate (the snippet below confirms it)
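As a quick check (my own snippet, not code from the book's repository), the tiktoken "gpt2" encoding used throughout the book reports exactly that vocabulary size:
'''
import tiktoken

# The same BPE tokenizer the book uses for GPT-2 style models.
tokenizer = tiktoken.get_encoding("gpt2")
print(tokenizer.n_vocab)  # 50257 unique token IDs
'''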
'''
'''
The full vocabulary (50,257 tokens) gives it the capacity to represent many words it never sees in training (see the tokenizer sketch after this block).
It's based on the GPT-2 design, which was trained on millions of documents.
'''
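To make the "words it never sees in training" point concrete, here is a small sketch of my own (the example phrase is made up and does not appear in "The Verdict"): the byte-pair-encoding tokenizer falls back to subword pieces that are in the 50,257-token vocabulary, so unfamiliar words can still be represented.
'''
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

# An illustrative phrase the training text never contains; BPE still encodes it
# by splitting it into subword tokens from the fixed 50,257-entry vocabulary.
ids = tokenizer.encode("blockchain interoperability")
print(ids)
print([tokenizer.decode([i]) for i in ids])  # the subword pieces each ID maps back to
'''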
'''
The word "Every" is converted to its corresponding token ID from the GPT-2 vocabulary.
Embedding Layer: The model has an embedding layer for all 50,257 tokens, including "Every".
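A small sketch of those two steps, using the same tiktoken tokenizer and a standalone nn.Embedding layer as a stand-in for the model's token-embedding matrix (my own illustration, not the book's GPTModel code):
'''
import tiktoken
import torch

tokenizer = tiktoken.get_encoding("gpt2")

# The chapter's example prompt starts with "Every"; each word or word piece
# becomes one of the 50,257 possible token IDs.
ids = tokenizer.encode("Every effort moves you")
print(ids)  # the first ID is the token for "Every"

# A stand-in for the model's token-embedding layer: one 768-dim row per vocabulary entry.
tok_emb = torch.nn.Embedding(num_embeddings=50_257, embedding_dim=768)
vectors = tok_emb(torch.tensor(ids))
print(vectors.shape)  # (number_of_tokens, 768) -- one embedding vector per token
'''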
See:
https://github.com/rasbt/LLMs-from-scratch/tree/main/ch05