This is the first part in a small series. So you have seen the recent reports on LLMs, Llama 2, chatbots, AI, and ChatGPT. Well, how does that work? How do you build a chatbot? An inference engine? There has been a lot of discussion on building these machine-learning-based models in the past two years, from 2022 to 2024, but one of the main barriers to understanding and building them comes from the cost tied to training and running the models. In my example, with a small subset of data, it took a day or more to actually train on the smallest dataset.
Here is a step-by-step approach for building the model.
The best FULL example, from training the model to the chatbot, comes from this project:
https://github.com/karpathy/llama2.c
He really has all the components here. Why not just use that? You could use this project, but I want to add more ELI5 basic steps here. Also, he has some optimizations; I want to strip all of that out and focus on just running the system.
I am going to continue in the next post, but here is the code I will review:
https://github.com/berlinbrown/llama2-java8.java/tree/main/llama2c-training-only
Additional References:
https://medium.com/@kinoshitayukari18/how-to-train-llama2-c-with-google-colab-b0a91c36b6a9
And I know this is basic, but in this post python means python3.
First Step - Download
The first step in the training process is the "download" step. That is pretty straightforward, with not much AI or LLM involved; it downloads the TinyStories dataset, the actual words in English.
Let's see what happens at each step.
python tinystories.py download
When we run it, it downloads the content for TinyStories:
```text
python tinystories.py download
Downloading https://huggingface.co/datasets/roneneldan/TinyStories/resolve/main/TinyStories_all_data.tar.gz to data/TinyStories_all_data.tar.gz...
```
The code is really just this:
```python
data_filename = os.path.join(DATA_CACHE_DIR, "TinyStories_all_data.tar.gz")
if not os.path.exists(data_filename):
    print(f"Downloading {data_url} to {data_filename}...")
    download_file(data_url, data_filename)
else:
    print(f"{data_filename} already exists, skipping download...")
```
You will see these files at about 6 GB total once extracted; the zipped archive itself is about 1.5 GB.
The data files are JSON, with story, instruction, summary, and source fields for each example. The structure looks like the following:
```json
[
  {
    "story": "\n\nLily and Ben are friends. They like to play in the park. One day, they see a big tree with a swing. Lily wants to try the swing. She runs to the tree and climbs on the swing.\n\"Push me, Ben!\" she says. Ben pushes her gently. Lily feels happy. She swings higher and higher. She laughs and shouts.\nBen watches Lily. He thinks she is cute. He wants to swing too. He waits for Lily to stop. But Lily does not stop. She swings faster and faster. She is having too much fun.\n\"Can I swing too, Lily?\" Ben asks. Lily does not hear him. She is too busy swinging. Ben feels sad. He walks away.\nLily swings so high that she loses her grip. She falls off the swing. She lands on the ground. She hurts her foot. She cries.\n\"Ow, ow, ow!\" she says. She looks for Ben. She wants him to help her. But Ben is not there. He is gone.\nLily feels sorry. She wishes she had shared the swing with Ben. She wishes he was there to hug her. She limps to the tree. She sees something hanging from a branch. It is Ben's hat. He left it for her.\nLily smiles. She thinks Ben is nice. She puts on his hat. She hopes he will come back. She wants to say sorry. She wants to be friends again.",
    "instruction": {
      "prompt:": "Write a short story (3-5 paragraphs) which only uses very simple words that a 3 year old child would understand. The story should use the verb \"hang\", the noun \"foot\" and the adjective \"cute\". The story has the following features: the story should contain at least one dialogue. Remember to only use simple words!\n\nPossible story:",
      "words": [
        "hang",
        "foot",
        "cute"
      ],
      "features": [
        "Dialogue"
      ]
    },
    "summary": "Lily and Ben play in the park and Lily gets too caught up in swinging, causing Ben to leave. Lily falls off the swing and hurts herself, but Ben leaves his hat for her as a kind gesture.",
    "source": "GPT-4"
  },
```
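Once the tarball is extracted, you can sanity-check a shard yourself. A small sketch, assuming the archive was unpacked to data/TinyStories_all_data (that path is my assumption from the download step; adjust to your layout):

```python
import glob
import json

# peek at the first extracted shard (path assumed, adjust as needed)
shard_files = sorted(glob.glob("data/TinyStories_all_data/*.json"))
with open(shard_files[0], "r") as f:
    data = json.load(f)

example = data[0]
print(example["story"][:200])           # first 200 characters of the story
print(example["summary"])               # one-sentence summary
print(example["instruction"]["words"])  # the required vocabulary words
```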
Then we pretokenize it
python tinystories.py pretokenize
Here is one of the routines behind pretokenize:
```python
def process_shard(args, vocab_size):
    shard_id, shard = args
    tokenizer_model = get_tokenizer_model_path(vocab_size)
    enc = Tokenizer(tokenizer_model)
    with open(shard, "r") as f:
        data = json.load(f)
    all_tokens = []
    for example in tqdm(data, position=shard_id):
        text = example["story"]
        text = text.strip()  # get rid of leading/trailing whitespace
        tokens = enc.encode(text, bos=True, eos=False)  # encode the text, use BOS
        all_tokens.extend(tokens)
```
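The snippet above stops after collecting the tokens. In the full script, each shard's tokens end up in a flat binary file that the training dataloader can read later. Roughly, the tail of the routine continues like this (a sketch; the exact filename handling in the repo may differ):

```python
import numpy as np

# flatten all token ids for this shard into one uint16 array
# (the small vocab fits in 16 bits, halving the disk footprint vs int32)
all_tokens = np.array(all_tokens, dtype=np.uint16)

# write the raw bytes next to the source shard as a .bin file
tokenized_filename = shard.replace(".json", ".bin")
with open(tokenized_filename, "wb") as f:
    f.write(all_tokens.tobytes())
```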
Training
python train.py
After running the training, I got the following:
```text
raise RuntimeError("Python 3.11+ not yet supported for torch.compile")
RuntimeError: Python 3.11+ not yet supported for torch.compile
```
With my Mac running Python 3.11+, torch.compile had to be disabled.
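The simple fix is passing --compile=False, as in the run below. If you want to keep the speedup where it is supported, one option is to gate the call on the Python version; a sketch, where build_model is a hypothetical stand-in for however you construct your nn.Module:

```python
import sys
import torch

model = build_model()  # hypothetical: construct your nn.Module here

# torch.compile did not support Python 3.11+ at the time of this writing,
# so only enable it where it can actually work
if sys.version_info < (3, 11):
    model = torch.compile(model)
```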
...
I ran the training and got the following with my small set:

```text
python3 train.py --compile=False --eval_iters=10 --batch_size=8
Overriding: compile = False
Overriding: eval_iters = 10
Overriding: batch_size = 8
tokens per iteration will be: 8,192
breaks down as: 4 grad accum steps * 1 processes * 8 batch size * 256 max seq len
Initializing a new model from scratch
num decayed parameter tensors: 43, with 15,187,968 parameters
num non-decayed parameter tensors: 13, with 3,744 parameters
using fused AdamW: False
Created a PretokDataset with rng seed 42
Created a PretokDataset with rng seed 42
```
...
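The "tokens per iteration" line in that log is just arithmetic over the settings, which is worth checking once:

```python
grad_accum_steps = 4   # gradient accumulation steps
processes = 1          # single process, no DDP
batch_size = 8         # from --batch_size=8
max_seq_len = 256      # default context length

tokens_per_iter = grad_accum_steps * processes * batch_size * max_seq_len
print(tokens_per_iter)  # 8192, matching the log line above
```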
And then, between 9am and 12:45pm EST, the following:
```text
83 | loss 8.8303 | lr 4.150000e-05 | 24470.25ms | mfu 0.02%
84 | loss 8.8644 | lr 4.200000e-05 | 419462.74ms | mfu 0.01%
```
Afterwards, we can run the inference chatbot with:
python sample.py --checkpoint=./out/ckpt.pt --start='How are you?'
More on the responses and chat
Here is a testing example where my name is part of the data. We prime the model with a start string, and the response will be a continuation based on it:
input_text = "Lily and Ben built a fort"
Example output:
"Lily and Ben built a fort in the backyard. They used blankets, chairs, and cushions to create a cozy space. Inside, they imagined all sorts of adventures and spent the afternoon playing and telling stories."
Using the example input "Lily and Ben built a fort," the model generates a continuation of the story, adding context and details. This is useful for creating engaging content or extending prompts in creative ways.
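Under the hood, this continuation is plain autoregressive sampling: encode the prompt, sample one next token at a time, and append it. A minimal sketch of that loop, assuming a model whose forward pass returns logits of shape (batch, seq, vocab); this is my own simplification, not sample.py's exact code:

```python
import torch

@torch.no_grad()
def generate(model, enc, start, max_new_tokens=100, temperature=1.0):
    # encode the prompt with a BOS token, shape (1, prompt_len)
    ids = torch.tensor(enc.encode(start, bos=True, eos=False))[None, :]
    for _ in range(max_new_tokens):
        logits = model(ids)[:, -1, :]                      # next-token logits only
        probs = torch.softmax(logits / temperature, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)  # sample instead of argmax
        ids = torch.cat([ids, next_id], dim=1)             # append and feed back in
    return enc.decode(ids[0].tolist())
```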
Here I replaced Lily with Berlin
```text
python3 sample.py --checkpoint=./out/ckpt.pt --start='Once upon a time there was little girl name'
Overriding: checkpoint = ./out/ckpt.pt
Overriding: start = Once upon a time there was little girl name
Once upon a time there was little girl name. She loved to play outside in the park with her dad your world, they asked her?"
As they played Tom were a beautiful, so she had lots of fun.
When they decided to see to the park, her dad was so excited candy that she was very sad, your soft ch named Berlin's mom, said, "Thank you, Berlin and tookak people to the end much fun by the bird."
After she was very this and said, "
```
Additional Note on the Training
An additional note on the training and eval steps, and how long they will take.
I was curious how long this process would run. I have a 2017/2018 MacBook Pro with an Intel chip and no CUDA graphics acceleration.
I guesstimate that each iteration takes about 12 seconds.
And looking at the default settings against the TinyStories dataset:
```python
# On a 2017/2018 MacBook with CPU only, each iteration takes about 12 seconds
# So: 2000 * 100 * 12 seconds = total expected time
# Looks like a checkpoint is saved at each eval_interval
# Change accordingly
# Default eval_interval = 2000, eval_iters = 100
eval_interval = 200
log_interval = 1
eval_iters = 20
eval_only = False  # if True, script exits right after the first eval
always_save_checkpoint = True  # if True, always save a checkpoint after each eval
init_from = "scratch"  # 'scratch' or 'resume'
```
So:
This training run will take about 667 hours, or roughly 28 days:
```text
>>> 2000 * 100 * 12
2400000
```
So about 2.4 million seconds.
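Converting that into more readable units:

```python
total_seconds = 2000 * 100 * 12  # iterations estimate * ~12 s each
hours = total_seconds / 3600     # ~666.7 hours
days = hours / 24                # ~27.8 days
print(f"{hours:.0f} hours, about {days:.0f} days")
```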