3. Token Embeddings

Reading time: 6 minutes

Token Embeddings

After tokenizing text data, the next critical step in preparing data for training large language models (LLMs) like GPT is creating token embeddings. Token embeddings transform discrete tokens (such as words or subwords) into continuous numerical vectors that the model can process and learn from. This explanation breaks down token embeddings, their initialization, usage, and the role of positional embeddings in enhancing model understanding of token sequences.

tip

The goal of this third phase is very simple: Assign each of the previous tokens in the vocabulary a vector of the desired dimensions to train the model. Each word in the vocabulary will a point in a space of X dimensions.
Note that initially the position of each word in the space is just initialised "randomly" and these positions are trainable parameters (will be improved during the training).

Moreover, during the token embedding another layer of embeddings is created which represents (in this case) the absolute possition of the word in the training sentence. This way a word in different positions in the sentence will have a different representation (meaning).

What Are Token Embeddings?

Token Embeddings are numerical representations of tokens in a continuous vector space. Each token in the vocabulary is associated with a unique vector of fixed dimensions. These vectors capture semantic and syntactic information about the tokens, enabling the model to understand relationships and patterns in the data.

Vocabulary Size: The total number of unique tokens (e.g., words, subwords) in the model’s vocabulary.
Embedding Dimensions: The number of numerical values (dimensions) in each token’s vector. Higher dimensions can capture more nuanced information but require more computational resources.

Example:

Vocabulary Size: 6 tokens [1, 2, 3, 4, 5, 6]
Embedding Dimensions: 3 (x, y, z)

Initializing Token Embeddings

At the start of training, token embeddings are typically initialized with small random values. These initial values are adjusted (fine-tuned) during training to better represent the tokens' meanings based on the training data.

PyTorch Example:

python

import torch

# Set a random seed for reproducibility
torch.manual_seed(123)

# Create an embedding layer with 6 tokens and 3 dimensions
embedding_layer = torch.nn.Embedding(6, 3)

# Display the initial weights (embeddings)
print(embedding_layer.weight)

Output:

lua

luaCopy codeParameter containing:
tensor([[ 0.3374, -0.1778, -0.1690],
        [ 0.9178,  1.5810,  1.3010],
        [ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096]], requires_grad=True)

Explanation:

Each row corresponds to a token in the vocabulary.
Each column represents a dimension in the embedding vector.
For example, the token at index 3 has an embedding vector [-0.4015, 0.9666, -1.1481].

Accessing a Token’s Embedding:

python

# Retrieve the embedding for the token at index 3
token_index = torch.tensor([3])
print(embedding_layer(token_index))

Output:

lua

tensor([[-0.4015,  0.9666, -1.1481]], grad_fn=<EmbeddingBackward0>)

Interpretation:

The token at index 3 is represented by the vector [-0.4015, 0.9666, -1.1481].
These values are trainable parameters that the model will adjust during training to better represent the token's context and meaning.

How Token Embeddings Work During Training

During training, each token in the input data is converted into its corresponding embedding vector. These vectors are then used in various computations within the model, such as attention mechanisms and neural network layers.

Example Scenario:

Batch Size: 8 (number of samples processed simultaneously)
Max Sequence Length: 4 (number of tokens per sample)
Embedding Dimensions: 256

Data Structure:

Each batch is represented as a 3D tensor with shape (batch_size, max_length, embedding_dim).
For our example, the shape would be (8, 4, 256).

Visualization:

css

cssCopy codeBatch
┌─────────────┐
│ Sample 1    │
│ ┌─────┐     │
│ │Token│ → [x₁₁, x₁₂, ..., x₁₂₅₆]
│ │ 1   │     │
│ │...  │     │
│ │Token│     │
│ │ 4   │     │
│ └─────┘     │
│ Sample 2    │
│ ┌─────┐     │
│ │Token│ → [x₂₁, x₂₂, ..., x₂₂₅₆]
│ │ 1   │     │
│ │...  │     │
│ │Token│     │
│ │ 4   │     │
│ └─────┘     │
│ ...         │
│ Sample 8    │
│ ┌─────┐     │
│ │Token│ → [x₈₁, x₈₂, ..., x₈₂₅₆]
│ │ 1   │     │
│ │...  │     │
│ │Token│     │
│ │ 4   │     │
│ └─────┘     │
└─────────────┘

Explanation:

Each token in the sequence is represented by a 256-dimensional vector.
The model processes these embeddings to learn language patterns and generate predictions.

Positional Embeddings: Adding Context to Token Embeddings

While token embeddings capture the meaning of individual tokens, they do not inherently encode the position of tokens within a sequence. Understanding the order of tokens is crucial for language comprehension. This is where positional embeddings come into play.

Why Positional Embeddings Are Needed:

Token Order Matters: In sentences, the meaning often depends on the order of words. For example, "The cat sat on the mat" vs. "The mat sat on the cat."
Embedding Limitation: Without positional information, the model treats tokens as a "bag of words," ignoring their sequence.

Types of Positional Embeddings:

Absolute Positional Embeddings:
- Assign a unique position vector to each position in the sequence.
- Example: The first token in any sequence has the same positional embedding, the second token has another, and so on.
- Used By: OpenAI’s GPT models.
Relative Positional Embeddings:
- Encode the relative distance between tokens rather than their absolute positions.
- Example: Indicate how far apart two tokens are, regardless of their absolute positions in the sequence.
- Used By: Models like Transformer-XL and some variants of BERT.

How Positional Embeddings Are Integrated:

Same Dimensions: Positional embeddings have the same dimensionality as token embeddings.
Addition: They are added to token embeddings, combining token identity with positional information without increasing the overall dimensionality.

Example of Adding Positional Embeddings:

Suppose a token embedding vector is [0.5, -0.2, 0.1] and its positional embedding vector is [0.1, 0.3, -0.1]. The combined embedding used by the model would be:

css

Combined Embedding = Token Embedding + Positional Embedding
                   = [0.5 + 0.1, -0.2 + 0.3, 0.1 + (-0.1)]
                   = [0.6, 0.1, 0.0]

Benefits of Positional Embeddings:

Contextual Awareness: The model can differentiate between tokens based on their positions.
Sequence Understanding: Enables the model to understand grammar, syntax, and context-dependent meanings.

Code Example

Following with the code example from https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/ch02.ipynb:

python

# Use previous code...

# Create dimensional emdeddings
"""
BPE uses a vocabulary of 50257 words
Let's supose we want to use 256 dimensions (instead of the millions used by LLMs)
"""

vocab_size = 50257
output_dim = 256
token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

## Generate the dataloader like before
max_length = 4
dataloader = create_dataloader_v1(
    raw_text, batch_size=8, max_length=max_length,
    stride=max_length, shuffle=False
)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)

# Apply embeddings
token_embeddings = token_embedding_layer(inputs)
print(token_embeddings.shape)
torch.Size([8, 4, 256]) # 8 x 4 x 256

# Generate absolute embeddings
context_length = max_length
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)

pos_embeddings = pos_embedding_layer(torch.arange(max_length))

input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape) # torch.Size([8, 4, 256])

References

https://www.manning.com/books/build-a-large-language-model-from-scratch

HackTricks