Calculating the Parameters of an LLM

The configuration data below comes from Sebastian Raschka's book on building large language models.

The 124 million parameter GPT model config:


GPT_CONFIG_124M = {
    "vocab_size": 50257,     # Vocabulary size
    "context_length": 1024,  # Context length
    "emb_dim": 768,          # Embedding dimension
    "n_heads": 12,           # Number of attention heads
    "n_layers": 12,          # Number of layers
    "drop_rate": 0.1,        # Dropout rate
    "qkv_bias": False        # Query-Key-Value bias
}
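
As a quick sanity check on the 124M figure, the parameter count of a GPT-style model can be approximated as vocab_size × emb_dim for the token embedding plus roughly 12 × n_layers × emb_dim² for the transformer blocks (each block contributes about 4·emb_dim² for attention and 8·emb_dim² for the feed-forward network). The approx_params helper below is my own minimal sketch, not code from the book; it ignores positional embeddings, biases, and layer norms:

def approx_params(cfg):
    """Rough estimate: token embedding + ~12 * emb_dim^2 per transformer block.
    Ignores positional embeddings, biases, and layer norms, so it slightly undercounts."""
    emb = cfg["vocab_size"] * cfg["emb_dim"]
    blocks = 12 * cfg["n_layers"] * cfg["emb_dim"] ** 2
    return emb + blocks

print(f"{approx_params(GPT_CONFIG_124M):,}")  # 123,532,032 -- about 124M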


The 1.5 billion parameter GPT model config:


GPT_CONFIG_1558M = {
    "vocab_size": 50257,     # Vocabulary size
    "context_length": 1024,  # Context length
    "emb_dim": 1600,         # Embedding dimension (increased from 768)
    "n_heads": 25,           # Number of attention heads
    "n_layers": 48,          # Number of layers
    "drop_rate": 0.1,        # Dropout rate
    "qkv_bias": False        # Query-Key-Value bias
}
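
The same rough estimate (using the approx_params sketch above) lands near the advertised 1.5 billion:

print(f"{approx_params(GPT_CONFIG_1558M):,}")  # 1,554,971,200 -- about 1.55B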


We can define a Llama 2 config for the 7B (7 billion parameter) model:


import torch  # needed for torch.bfloat16 below

LLAMA2_CONFIG_7B = {
    "vocab_size": 32000,     # Vocabulary size
    "context_length": 4096,  # Context length
    "emb_dim": 4096,         # Embedding dimension (higher dimension)
    "n_heads": 32,           # Number of attention heads
    "n_layers": 32,          # Number of layers
    "hidden_dim": 11008,     # NEW: Size of the intermediate dimension in FeedForward
    "dtype": torch.bfloat16  # NEW: Lower-precision dtype to save memory
}
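
A similar back-of-the-envelope count works for the Llama 2 config, with two assumptions that differ from GPT-2: Llama 2's feed-forward block uses three weight matrices (SwiGLU-style gate, up, and down projections) sized by hidden_dim, and its input and output embeddings are separate (untied). The approx_llama_params helper below is my own sketch, not code from the book:

def approx_llama_params(cfg):
    """Rough Llama 2 estimate: untied input/output embeddings, four attention
    matrices per layer, and a SwiGLU feed-forward block with three
    emb_dim x hidden_dim matrices. RMSNorm vectors are ignored as negligible."""
    emb = 2 * cfg["vocab_size"] * cfg["emb_dim"]      # input + output embeddings
    attn = 4 * cfg["emb_dim"] ** 2                    # Wq, Wk, Wv, Wo per layer
    ffn = 3 * cfg["emb_dim"] * cfg["hidden_dim"]      # gate, up, down projections
    return emb + cfg["n_layers"] * (attn + ffn)

print(f"{approx_llama_params(LLAMA2_CONFIG_7B):,}")  # 6,738,149,376 -- about 6.7B

That comes out to roughly 6.74 billion parameters, which is why the model is marketed as 7B.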


The following parameter breakdown is straight from ChatGPT:

To calculate the total number of parameters for the GPT model configuration provided, we need to account for the parameters in the embedding layer, transformer layers (multi-head attention and feed-forward networks), and any other associated weights. Let's break down each component and how the total parameter count is obtained:


### 1. Embedding Layer

The embedding layer maps each token to a vector of size `emb_dim`. The number of parameters is:

\[

\text{vocab\_size} \times \text{emb\_dim} = 50257 \times 768 = 38,597,376

\]
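
A quick check with PyTorch confirms the figure (nn.Embedding stores exactly one vocab_size × emb_dim weight matrix):

import torch.nn as nn

tok_emb = nn.Embedding(50257, 768)
print(f"{tok_emb.weight.numel():,}")  # 38,597,376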


### 2. Transformer Layers

Each of the `n_layers` transformer layers has two main components: the multi-head attention mechanism and the feed-forward network.


#### a. Multi-Head Attention

- **Query, Key, and Value Matrices:** For multi-head attention, we compute the queries, keys, and values using linear transformations. Each transformation involves a weight matrix of shape \(\text{emb\_dim} \times \text{emb\_dim}\), repeated for the query, key, and value matrices:

  \[

  3 \times (\text{emb\_dim} \times \text{emb\_dim}) = 3 \times (768 \times 768) = 1,769,472

  \]

- **Output Projection:** The result of the attention mechanism is projected back to the original embedding dimension using another weight matrix of shape \(\text{emb\_dim} \times \text{emb\_dim}\):

  \[

  \text{emb\_dim} \times \text{emb\_dim} = 768 \times 768 = 589,824

  \]

- **Total Attention Parameters per Layer:** Since `qkv_bias` is `False` in this config, we only count the weights:

  \[

  1,769,472 + 589,824 = 2,359,296

  \]
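
The same number can be reproduced with four bias-free linear layers. This sketch counts only the per-layer attention weight matrices; it is not a full attention implementation:

import torch.nn as nn

emb_dim = 768
W_query = nn.Linear(emb_dim, emb_dim, bias=False)
W_key = nn.Linear(emb_dim, emb_dim, bias=False)
W_value = nn.Linear(emb_dim, emb_dim, bias=False)
out_proj = nn.Linear(emb_dim, emb_dim, bias=False)

total = sum(p.numel() for m in (W_query, W_key, W_value, out_proj) for p in m.parameters())
print(f"{total:,}")  # 2,359,296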


#### b. Feed-Forward Network

The feed-forward network consists of two linear transformations with an intermediate dimension, typically \(4 \times \text{emb\_dim}\):

- **First Layer:** Weights of size \(\text{emb\_dim} \times (4 \times \text{emb\_dim})\):

  \[

  768 \times 3072 = 2,359,296

  \]

- **Second Layer:** Weights of size \((4 \times \text{emb\_dim}) \times \text{emb\_dim}\):

  \[

  3072 \times 768 = 2,359,296

  \]

- **Total Feed-Forward Parameters per Layer:** 

  \[

  2,359,296 + 2,359,296 = 4,718,592

  \]
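
As a sketch, the same count using two bias-free linear layers (a typical implementation also learns bias vectors, which are omitted here to match the breakdown):

import torch.nn as nn

emb_dim = 768
ff = nn.Sequential(
    nn.Linear(emb_dim, 4 * emb_dim, bias=False),  # 768 -> 3072
    nn.GELU(),
    nn.Linear(4 * emb_dim, emb_dim, bias=False),  # 3072 -> 768
)
print(f"{sum(p.numel() for p in ff.parameters()):,}")  # 4,718,592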


#### c. Layer Norm Parameters

Each layer also has two layer normalization layers, each with parameters equal to the embedding dimension:

\[

2 \times \text{emb\_dim} = 2 \times 768 = 1,536

\]
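
One caveat: PyTorch's nn.LayerNorm learns both a scale and a shift vector, so a framework implementation carries 2 × emb_dim values per normalization layer rather than the single emb_dim-sized vector counted above; the simplified total that follows is therefore slightly lower than the real model's.

import torch.nn as nn

ln = nn.LayerNorm(768)
print(sum(p.numel() for p in ln.parameters()))  # 1536 (768 scale + 768 shift)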


#### d. Total Parameters per Transformer Layer

Combining attention, feed-forward, and layer norm:

\[

2,359,296 + 4,718,592 + 1,536 = 7,079,424

\]


#### e. Total for All Transformer Layers

For `n_layers` = 12:

\[

7,079,424 \times 12 = 84,953,088

\]


### 3. Final Layer Norm

There is an additional layer normalization at the output, adding:

\[

\text{emb\_dim} = 768

\]


### 4. Total Parameters

Adding up all the components:

\[

38,597,376 \text{ (embedding)} + 84,953,088 \text{ (transformer layers)} + 768 \text{ (final layer norm)} = 123,551,232

\]


This value rounds to approximately 124 million parameters, matching the expected model size.
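
Putting the breakdown into code makes it easy to re-run for other configs. The count_gpt_params helper below is a minimal sketch that follows the simplified accounting above (weights only, one vector per layer norm); it therefore omits positional embeddings, biases, and layer-norm shift terms, which push the full GPT-2 small implementation to roughly 124.4 million parameters when qkv_bias is disabled.

def count_gpt_params(cfg):
    """Parameter count following the simplified breakdown above."""
    d = cfg["emb_dim"]
    tok_emb = cfg["vocab_size"] * d                   # token embedding
    attn = 3 * d * d + d * d                          # Q/K/V + output projection
    ffn = d * (4 * d) + (4 * d) * d                   # two feed-forward matrices
    norms = 2 * d                                     # two layer norms (scale only)
    per_layer = attn + ffn + norms
    return tok_emb + cfg["n_layers"] * per_layer + d  # plus final layer norm

print(f"{count_gpt_params(GPT_CONFIG_124M):,}")   # 123,551,232
print(f"{count_gpt_params(GPT_CONFIG_1558M):,}")  # 1,555,126,400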




In the context of a neural network, a **parameter** refers to a value that the model learns during training. Parameters are typically the weights and biases used in the model's layers to transform the input data and generate predictions. In the breakdown above, parameters appear in several components of the GPT model:


1. **Embedding Layer Parameters:** These parameters map each word (or token) in the vocabulary to a vector representation of size `emb_dim`. The embedding matrix's weights are parameters that the model learns to represent each token in a meaningful way.


2. **Transformer Layer Parameters:**

   - **Multi-Head Attention Weights:** The parameters here include the weights for the Query, Key, and Value projections, which are learned to help the model focus on different parts of the input sequence.

   - **Output Projection Weights:** After the attention mechanism calculates the weighted sum of values, a weight matrix is used to project the output back to the original embedding size.

   - **Feed-Forward Network Weights:** The parameters in this part consist of weights for two linear transformations that help the model capture complex relationships in the data.

   - **Layer Normalization Weights:** These include scale and shift parameters, which normalize the outputs of the layers.


3. **Final Layer Norm Parameters:** These are also learnable parameters that help stabilize the output of the model.


Overall, each parameter is a value that influences how the model processes input data and makes predictions, and during training, these parameters are adjusted to minimize the error in the model's predictions.
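
In PyTorch terms, every nn.Module exposes these learnable values through .parameters(), so counting them is a one-liner; here is a small illustrative sketch on a single linear layer:

import torch.nn as nn

layer = nn.Linear(768, 3072)  # weight: 3072 x 768, bias: 3072
n_params = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"{n_params:,}")  # 2,362,368 (weights plus biases)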
