Running the llama2.c training end to end with your custom local raw vocab data - Part 1

This is the first part in a small series.  So you have seen the recent reports on LLMs, Llama 2, chatbots, AI and ChatGPT.  Well, how does that work?  How do you build a chatbot?  An inference engine.  There has been a lot of discussion on building these machine learning based models over the past two years, from 2022 to 2024, but one of the main barriers to understanding and building the models is the cost tied to training and running them.  In my example, with a small subset of data, it took a day or more to actually train on the smallest dataset.

Here are some step by step approaches for building the model.

The best FULL example, from training the model all the way to the chatbot, comes from this project:

https://github.com/karpathy/llama2.c

He really has all the components there.  Why not just use that?  You could use this project, but I want to add more ELI5 basic steps here.

Also, he has some optimizations that I want to strip out.  I want to focus on simply running the system.

I am going to continue in the next post but here is the code I will review.

https://github.com/berlinbrown/llama2-java8.java/tree/main/llama2c-training-only

Additional References:

https://medium.com/@kinoshitayukari18/how-to-train-llama2-c-with-google-colab-b0a91c36b6a9


And I know this is basic, but in this post python means python3.

First Step - Download

The first step in the training process is the "download" step.  That is pretty straightforward, with not much AI or LLM involved.  We download the TinyStories dataset, the actual stories in English.

Let's see what happens at each step.

python tinystories.py download

When we run it, it downloads the content for TinyStories:


python tinystories.py download

Downloading https://huggingface.co/datasets/roneneldan/TinyStories/resolve/main/TinyStories_all_data.tar.gz to data/TinyStories_all_data.tar.gz...



The code is really just this:

    data_filename = os.path.join(DATA_CACHE_DIR, "TinyStories_all_data.tar.gz")
    if not os.path.exists(data_filename):
        print(f"Downloading {data_url} to {data_filename}...")
        download_file(data_url, data_filename)
    else:
        print(f"{data_filename} already exists, skipping download...")
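For reference, the download_file helper is just a streaming HTTP download.  Here is a minimal sketch of that idea (an approximation, not the exact code from tinystories.py):

```python
import requests

def download_file(url: str, fname: str, chunk_size: int = 1024):
    """Stream the file at url down to fname in small chunks."""
    resp = requests.get(url, stream=True)
    resp.raise_for_status()
    with open(fname, "wb") as f:
        for chunk in resp.iter_content(chunk_size=chunk_size):
            f.write(chunk)
```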
You will see these files at about 6 GB total.

The size of the zipped content is about 1.5 GB.

The data files are JSON, and the structure looks like the following, with a story, instruction, summary, and source for each record:

[
  {
    "story": "\n\nLily and Ben are friends. They like to play in the park. One day, they see a big tree with a swing. Lily wants to try the swing. She runs to the tree and climbs on the swing.\n\"Push me, Ben!\" she says. Ben pushes her gently. Lily feels happy. She swings higher and higher. She laughs and shouts.\nBen watches Lily. He thinks she is cute. He wants to swing too. He waits for Lily to stop. But Lily does not stop. She swings faster and faster. She is having too much fun.\n\"Can I swing too, Lily?\" Ben asks. Lily does not hear him. She is too busy swinging. Ben feels sad. He walks away.\nLily swings so high that she loses her grip. She falls off the swing. She lands on the ground. She hurts her foot. She cries.\n\"Ow, ow, ow!\" she says. She looks for Ben. She wants him to help her. But Ben is not there. He is gone.\nLily feels sorry. She wishes she had shared the swing with Ben. She wishes he was there to hug her. She limps to the tree. She sees something hanging from a branch. It is Ben's hat. He left it for her.\nLily smiles. She thinks Ben is nice. She puts on his hat. She hopes he will come back. She wants to say sorry. She wants to be friends again.",
    "instruction": {
      "prompt:": "Write a short story (3-5 paragraphs) which only uses very simple words that a 3 year old child would understand. The story should use the verb \"hang\", the noun \"foot\" and the adjective \"cute\". The story has the following features: the story should contain at least one dialogue. Remember to only use simple words!\n\nPossible story:",
      "words": [
        "hang",
        "foot",
        "cute"
      ],
      "features": [
        "Dialogue"
      ]
    },
    "summary": "Lily and Ben play in the park and Lily gets too caught up in swinging, causing Ben to leave. Lily falls off the swing and hurts herself, but Ben leaves his hat for her as a kind gesture.",
    "source": "GPT-4"
  },
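If you want to poke at a shard yourself, a small sketch like this prints what is in the first record.  The shard name data00.json is an assumption about how the extracted files are laid out under the data cache directory:

```python
import json

# Assumed path: the tar.gz extracts into data/TinyStories_all_data/
shard_path = "data/TinyStories_all_data/data00.json"

with open(shard_path, "r") as f:
    data = json.load(f)  # each shard is a JSON list of examples

print(len(data))            # number of stories in this shard
first = data[0]
print(list(first.keys()))   # ['story', 'instruction', 'summary', 'source']
print(first["summary"])
```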

Then we pretokenize it

python tinystories.py pretokenize

Here is part of the process_shard routine that pretokenize runs over each data shard:

def process_shard(args, vocab_size):
    shard_id, shard = args
    tokenizer_model = get_tokenizer_model_path(vocab_size)
    enc = Tokenizer(tokenizer_model)
    with open(shard, "r") as f:
        data = json.load(f)
    all_tokens = []
    for example in tqdm(data, position=shard_id):
        text = example["story"]
        text = text.strip()  # get rid of leading/trailing whitespace
        tokens = enc.encode(text, bos=True, eos=False)  # encode the text, use BOS
        all_tokens.extend(tokens)
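The rest of the routine (roughly, going from memory of the llama2.c source, so treat the exact filename logic as an assumption) continues inside process_shard: it packs the token ids into a compact uint16 array and writes them as a .bin file next to the shard:

```python
import numpy as np

# continuing inside process_shard:
# convert the accumulated token ids to a compact uint16 array
all_tokens = np.array(all_tokens, dtype=np.uint16)

# write the tokens to a .bin file alongside the original .json shard
tokenized_filename = shard.replace(".json", ".bin")
with open(tokenized_filename, "wb") as f:
    f.write(all_tokens.tobytes())
print(f"Saved {tokenized_filename}")
```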

Training

python train.py


After running the training, I got the following:

 raise RuntimeError("Python 3.11+ not yet supported for torch.compile")
RuntimeError: Python 3.11+ not yet supported for torch.compile

So on my Mac, torch.compile had to be disabled (hence the --compile=False flag below).

...

I ran the training and got the following with my small set:


python3 -m train.py --compile=False --eval_iters=10 --batch_size=8

Overriding: compile = False
Overriding: eval_iters = 10
Overriding: batch_size = 8
tokens per iteration will be: 8,192
breaks down as: 4 grad accum steps * 1 processes * 8 batch size * 256 max seq len
Initializing a new model from scratch
num decayed parameter tensors: 43, with 15,187,968 parameters
num non-decayed parameter tensors: 13, with 3,744 parameters
using fused AdamW: False
Created a PretokDataset with rng seed 42
Created a PretokDataset with rng seed 42

...
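The "tokens per iteration" number printed above is just the product of the run settings:

```python
grad_accum_steps = 4   # gradient accumulation steps
num_processes = 1      # single process, no DDP
batch_size = 8         # from --batch_size=8
max_seq_len = 256      # default max sequence length

tokens_per_iter = grad_accum_steps * num_processes * batch_size * max_seq_len
print(tokens_per_iter)  # 8192
```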

And then, between 9am and 12:45pm EST, the following:

83 | loss 8.8303 | lr 4.150000e-05 | 24470.25ms | mfu 0.02%
84 | loss 8.8644 | lr 4.200000e-05 | 419462.74ms | mfu 0.01%

And then afterwards we can run the inference chatbot with:

python sample.py --checkpoint=./out/ckpt.pt --start='How are you?'
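Under the hood, sample.py roughly loads the checkpoint, rebuilds the model, encodes the start prompt, and calls generate.  Here is a simplified sketch of that flow; the names Transformer, ModelArgs, and Tokenizer come from the llama2.c repo, but treat the details as an approximation rather than the exact script:

```python
import torch
from model import ModelArgs, Transformer   # from the llama2.c repo
from tokenizer import Tokenizer            # from the llama2.c repo

# load the trained checkpoint and rebuild the model
checkpoint = torch.load("out/ckpt.pt", map_location="cpu")
model = Transformer(ModelArgs(**checkpoint["model_args"]))
model.load_state_dict(checkpoint["model"], strict=False)
model.eval()

# encode the start prompt and generate a continuation
enc = Tokenizer()  # defaults to the Llama 2 tokenizer.model
start_ids = enc.encode("How are you?", bos=True, eos=False)
x = torch.tensor(start_ids, dtype=torch.long)[None, ...]
with torch.no_grad():
    y = model.generate(x, max_new_tokens=100, temperature=1.0, top_k=300)
print(enc.decode(y[0].tolist()))
```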

More on the responses and chat

Testing example where my name is part of the data:

So we prime the model with a start prompt, and the response will be a continuation based on that start:

input_text = "Lily and Ben built a fort"

Example output:

"Lily and Ben built a fort in the backyard. They used blankets, chairs, and cushions to create a cozy space. Inside, they imagined all sorts of adventures and spent the afternoon playing and telling stories."

Using the example input "Lily and Ben built a fort," the model generates a continuation of the story, adding context and details. This is useful for creating engaging content or extending prompts in creative ways.

Here I replaced Lily with Berlin

```text
 python3 sample.py --checkpoint=./out/ckpt.pt --start='Once upon a time there was little girl name'
Overriding: checkpoint = ./out/ckpt.pt
Overriding: start = Once upon a time there was little girl name
Once upon a time there was little girl name. She loved to play outside in the park with her dad your world, they asked her?"
As they played Tom were a beautiful, so she had lots of fun.
When they decided to see to the park, her dad was so excited candy that she was very sad, your soft ch named Berlin's mom, said, "Thank you, Berlin and tookak people to the end much fun by the bird." 
After she was very this and said, "
```


Additional Note on the Training

Here is an additional note on the training, the eval loop, and how long it will take.

I was curious how long this process would run.  I have a 2017/2018 MacBook Pro with an Intel chip and no CUDA graphics acceleration.

I guesstimate that each eval takes about 12 seconds.

And looking at the default settings against the TinyStories data:


# On a 2017/2018 MacBook with CPU only, each iteration takes about 12 seconds
# So: 2000 * 100 * 12 seconds = total expected time
# A checkpoint is saved at each eval_interval
# Change accordingly; the defaults are eval_interval = 2000, eval_iters = 100
eval_interval = 200
log_interval = 1
eval_iters = 20
eval_only = False  # if True, script exits right after the first eval
always_save_checkpoint = True  # if True, always save a checkpoint after each eval
init_from = "scratch"  # 'scratch' or 'resume'


So:

With the defaults, this training run would take about 667 hours, or roughly 27 days:

>>> 2000 * 100 * 12
2400000

So about 2.4 million seconds.
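Converting that into hours and days:

```python
seconds = 2000 * 100 * 12   # eval_interval * eval_iters * ~12 s per step (rough estimate)
hours = seconds / 3600
days = hours / 24
print(seconds, round(hours, 1), round(days, 1))  # 2400000 666.7 27.8
```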
