2. Data Sampling
Data Sampling
Data sampling is a crucial step in preparing data for training large language models (LLMs) such as GPT. It involves organizing text data into the input and target sequences the model uses to learn how to predict the next word (or token) from the preceding ones. Proper data sampling ensures that the model effectively captures language patterns and dependencies.
Tip: The goal of this second phase is very simple: sample the input data and prepare it for the training phase, usually by separating the dataset into sequences of a specific length and also generating the expected responses.
Why Data Sampling Matters
LLMs such as GPT generate or predict text by understanding the context provided by the preceding words. To achieve this, the training data must be structured so that the model can learn the relationship between a sequence of words and the words that follow it. This structured approach allows the model to generalize and produce coherent, contextually relevant text.
Key Concepts in Data Sampling
- Tokenization: Breaking text into smaller units called tokens (e.g., words, subwords, or characters).
- Sequence Length (max_length): The number of tokens in each input sequence.
- Sliding Window: A method of creating overlapping input sequences by moving a window over the tokenized text.
- Stride: The number of tokens the sliding window moves forward to create the next sequence.
Step-by-Step Example
Let's walk through data sampling with an example.
Example Text
```
"Lorem ipsum dolor sit amet, consectetur adipiscing elit."
```
Tokenization
Assume we use a basic tokenizer that splits the text into words and punctuation marks:
```
Tokens: ["Lorem", "ipsum", "dolor", "sit", "amet,", "consectetur", "adipiscing", "elit."]
```
Parameters
- Maximum Sequence Length (max_length): 4 tokens
- Sliding Window Stride: 1 token
Creating Input and Target Sequences
- Sliding Window Approach:
  - Input Sequences: Each input sequence consists of max_length tokens.
  - Target Sequences: Each target sequence consists of the tokens that immediately follow the corresponding input sequence.
- Generating Sequences:
| Window Position | Input Sequence | Target Sequence |
|---|---|---|
| 1 | ["Lorem", "ipsum", "dolor", "sit"] | ["ipsum", "dolor", "sit", "amet,"] |
| 2 | ["ipsum", "dolor", "sit", "amet,"] | ["dolor", "sit", "amet,", "consectetur"] |
| 3 | ["dolor", "sit", "amet,", "consectetur"] | ["sit", "amet,", "consectetur", "adipiscing"] |
| 4 | ["sit", "amet,", "consectetur", "adipiscing"] | ["amet,", "consectetur", "adipiscing", "elit."] |
- Resulting Input and Target Arrays:
  - Inputs:

```python
[
  ["Lorem", "ipsum", "dolor", "sit"],
  ["ipsum", "dolor", "sit", "amet,"],
  ["dolor", "sit", "amet,", "consectetur"],
  ["sit", "amet,", "consectetur", "adipiscing"],
]
```

  - Targets:

```python
[
  ["ipsum", "dolor", "sit", "amet,"],
  ["dolor", "sit", "amet,", "consectetur"],
  ["sit", "amet,", "consectetur", "adipiscing"],
  ["amet,", "consectetur", "adipiscing", "elit."],
]
```
Visual Representation
| Token Position | Token |
|---|---|
| 1 | Lorem |
| 2 | ipsum |
| 3 | dolor |
| 4 | sit |
| 5 | amet, |
| 6 | consectetur |
| 7 | adipiscing |
| 8 | elit. |
Sliding Window with Stride 1:
- First Window (positions 1-4): ["Lorem", "ipsum", "dolor", "sit"] → Target: ["ipsum", "dolor", "sit", "amet,"]
- Second Window (positions 2-5): ["ipsum", "dolor", "sit", "amet,"] → Target: ["dolor", "sit", "amet,", "consectetur"]
- Third Window (positions 3-6): ["dolor", "sit", "amet,", "consectetur"] → Target: ["sit", "amet,", "consectetur", "adipiscing"]
- Fourth Window (positions 4-7): ["sit", "amet,", "consectetur", "adipiscing"] → Target: ["amet,", "consectetur", "adipiscing", "elit."]
Understanding Stride
- Stride of 1: The window moves forward by one token each time, producing highly overlapping sequences. This can lead to better learning of contextual relationships but may increase the risk of overfitting, since very similar data points are repeated.
- Stride of 2: The window moves forward by two tokens each time, reducing the overlap. This lowers redundancy and computational load but may miss some contextual nuances.
- Stride Equal to max_length: The window moves forward by the full window size, producing non-overlapping sequences. This minimizes data redundancy but may limit the model's ability to learn dependencies that span sequence boundaries.
Example with a Stride of 2:
Using the same tokenized text and a max_length of 4:
- First Window (positions 1-4): ["Lorem", "ipsum", "dolor", "sit"] → Target: ["ipsum", "dolor", "sit", "amet,"]
- Second Window (positions 3-6): ["dolor", "sit", "amet,", "consectetur"] → Target: ["sit", "amet,", "consectetur", "adipiscing"]
- Third Window (positions 5-8): ["amet,", "consectetur", "adipiscing", "elit."] → Target: ["consectetur", "adipiscing", "elit.", "sed"] (assuming the text continues)
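Reusing the sketch from above, only the stride value changes. Note that with just the eight tokens of the sample sentence the loop yields two windows; the third window listed above assumes the text continues past "elit.":

```python
tokens = ["Lorem", "ipsum", "dolor", "sit", "amet,",
          "consectetur", "adipiscing", "elit."]
max_length, stride = 4, 2

# With stride=2 the window jumps two tokens at a time, halving the overlap
# between consecutive input sequences compared to stride=1.
for i in range(0, len(tokens) - max_length, stride):
    print(tokens[i : i + max_length], "->", tokens[i + 1 : i + max_length + 1])
# ['Lorem', 'ipsum', 'dolor', 'sit'] -> ['ipsum', 'dolor', 'sit', 'amet,']
# ['dolor', 'sit', 'amet,', 'consectetur'] -> ['sit', 'amet,', 'consectetur', 'adipiscing']
```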
Code Example
Let's understand this better with a code example from https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/ch02.ipynb:
```python
# Download the text to pre-train the LLM
import urllib.request
url = ("https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt")
file_path = "the-verdict.txt"
urllib.request.urlretrieve(url, file_path)

with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

"""
Create a class that will receive some params like tokenizer and text
and will prepare the input chunks and the target chunks to prepare
the LLM to learn which next token to generate
"""
import torch
from torch.utils.data import Dataset, DataLoader

class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]


"""
Create a data loader which given the text and some params will
prepare the inputs and targets with the previous class and
then create a torch DataLoader with the info
"""
import tiktoken

def create_dataloader_v1(txt, batch_size=4, max_length=256,
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0):

    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # Create dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )

    return dataloader


"""
Finally, create the data loader with the params we want:
- The text used for training
- batch_size: The size of each batch
- max_length: The size of each entry in each batch
- stride: The sliding window (how many tokens the next entry advances compared to the previous one). The smaller it is, the more overfitting; usually this is equal to max_length so the same tokens aren't repeated.
- shuffle: Re-order randomly
"""
dataloader = create_dataloader_v1(
    raw_text, batch_size=8, max_length=4, stride=1, shuffle=False
)

data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)

# Note the batch_size of 8, the max_length of 4 and the stride of 1
[
# Input
tensor([[   40,   367,  2885,  1464],
        [  367,  2885,  1464,  1807],
        [ 2885,  1464,  1807,  3619],
        [ 1464,  1807,  3619,   402],
        [ 1807,  3619,   402,   271],
        [ 3619,   402,   271, 10899],
        [  402,   271, 10899,  2138],
        [  271, 10899,  2138,   257]]),
# Target
tensor([[  367,  2885,  1464,  1807],
        [ 2885,  1464,  1807,  3619],
        [ 1464,  1807,  3619,   402],
        [ 1807,  3619,   402,   271],
        [ 3619,   402,   271, 10899],
        [  402,   271, 10899,  2138],
        [  271, 10899,  2138,   257],
        [10899,  2138,   257,  7026]])
]

# With stride=4 this will be the result:
[
# Input
tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]]),
# Target
tensor([[  367,  2885,  1464,  1807],
        [ 3619,   402,   271, 10899],
        [ 2138,   257,  7026, 15632],
        [  438,  2016,   257,   922],
        [ 5891,  1576,   438,   568],
        [  340,   373,   645,  1049],
        [ 5975,   284,   502,   284],
        [ 3285,   326,    11,   287]])
]
```
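As a rough sketch of how these batches are consumed later on: each input batch is passed through the model and the loss is computed against the corresponding shifted target batch. The GPT model itself is not defined in this chapter, so the tiny embedding-plus-linear stack below is only a stand-in assumption to make the snippet runnable, not the real architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size = 50257               # GPT-2 tokenizer vocabulary size
model = nn.Sequential(           # stand-in for a real GPT model:
    nn.Embedding(vocab_size, 64),  # embed token ids...
    nn.Linear(64, vocab_size),     # ...and project back to vocabulary logits
)

for input_batch, target_batch in dataloader:
    logits = model(input_batch)          # (batch, seq_len, vocab_size)
    loss = F.cross_entropy(
        logits.flatten(0, 1),            # (batch * seq_len, vocab_size)
        target_batch.flatten(),          # (batch * seq_len,)
    )
    print(loss.item())
    # loss.backward() and the optimizer step belong to the later training chapters
    break
```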