Curious About LLMs from Scratch - Chapter 5 Training Example

Running the Chapter 5 Training Example

This is an interesting example, and without fully understanding it yet, I asked GitHub Copilot and ChatGPT about the code.

...


Here it is running, along with some updates to the code.



The model has approximately 124 million parameters (hence the "124M" name), primarily from:

Token Embeddings: vocab_size × emb_dim = 50,257 × 768 ≈ 38.6M parameters

Position Embeddings: context_length × emb_dim = 256 × 768 ≈ 0.2M parameters

Transformer Layers: 12 layers, each with multi-head attention and feed-forward networks, ≈ 85M parameters
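As a rough back-of-the-envelope check, here is a small sketch of where those 124 million parameters come from, assuming the book's GPT_CONFIG_124M values (vocab_size=50257, context_length=256, emb_dim=768, n_layers=12) and ignoring bias terms and the output head (which GPT-2 ties to the token embeddings):

```python
# Approximate parameter count for the chapter's 124M configuration.
# Assumes the GPT_CONFIG_124M values from the book; biases and the
# (weight-tied) output head are ignored for simplicity.
vocab_size, context_length, emb_dim, n_layers = 50_257, 256, 768, 12

tok_emb = vocab_size * emb_dim            # token embedding table
pos_emb = context_length * emb_dim        # learned position embeddings

# Per transformer block: Q, K, V and output projections in attention,
# a feed-forward network that expands to 4 * emb_dim, and two layer norms.
attn  = 4 * emb_dim * emb_dim
ff    = 2 * (emb_dim * 4 * emb_dim)
norms = 2 * 2 * emb_dim                   # scale + shift for two layer norms
block = attn + ff + norms

total = tok_emb + pos_emb + n_layers * block + 2 * emb_dim  # + final layer norm
print(f"~{total / 1e6:.1f}M parameters")  # prints roughly 123.8M
```

The two largest pieces are the token embeddings (~38.6M) and the 12 transformer blocks (~85M); the position embeddings are comparatively tiny.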


 Even though "The Verdict" is only 3,600 words:

 The model is designed to learn general language patterns, not just memorize the text

  The vocabulary size (50,257 tokens) matches GPT-2's full vocabulary

The architecture follows the small GPT-2 configuration (768-dimensional embeddings, 12 layers)

That vocabulary size is the total number of unique tokens the model can recognize and generate
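As a quick sanity check (assuming the tiktoken package the book uses for the GPT-2 byte-pair encoding), the tokenizer reports that vocabulary size directly:

```python
import tiktoken

# GPT-2 byte-pair-encoding tokenizer used throughout the book's examples
tokenizer = tiktoken.get_encoding("gpt2")
print(tokenizer.n_vocab)  # 50257
```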


The full vocabulary (50,257 tokens) gives it the capacity to represent many words it never sees in training

It's based on the GPT-2 design, which was trained on millions of documents
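As a small illustration (the example word below is my own choice and is assumed not to appear in "The Verdict"), the BPE vocabulary can still represent unseen words by breaking them into known sub-word tokens:

```python
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

# "astronaut" is assumed not to appear in "The Verdict", yet the GPT-2
# vocabulary can still encode it as one or more sub-word token IDs.
ids = tokenizer.encode("astronaut")
print(ids)
print(tokenizer.decode(ids))  # round-trips back to "astronaut"
```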


The word "Every" is converted to its corresponding token ID from the GPT-2 vocabulary.

Embedding Layer: The model's token embedding layer has an entry for each of the 50,257 tokens, including "Every".
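Here is a minimal sketch of those two steps, using tiktoken for the token IDs and a stand-in torch.nn.Embedding with the same shape as the model's token embedding table; the real, trained weights live inside the chapter's GPTModel:

```python
import tiktoken
import torch

tokenizer = tiktoken.get_encoding("gpt2")

# Convert the chapter's start context into GPT-2 token IDs;
# the ID for "Every" comes first in the list.
ids = tokenizer.encode("Every effort moves you")
print(ids)

# Stand-in embedding layer with the model's dimensions (50,257 x 768);
# in the real model these weights are learned during training.
tok_emb = torch.nn.Embedding(50_257, 768)
vectors = tok_emb(torch.tensor(ids))
print(vectors.shape)  # (number of tokens, 768)
```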

See:

https://github.com/rasbt/LLMs-from-scratch/tree/main/ch05

And:

https://github.com/berlinbrown/berlin-learn-ml-dl-capstone-projects/tree/main/basic-exercises/basic-rastb-fork-chap5
