2. Data Sampling
Data Sampling
Data sampling is a crucial step in preparing data for training large language models (LLMs) like GPT. It involves organizing text data into input and target sequences that the model uses to learn how to predict the next word (or token) based on the preceding words. Proper data sampling ensures that the model effectively captures language patterns and dependencies.
tip
The goal of this second phase is very simple: Sample the input data and prepare it for the training phase, usually by splitting the dataset into sequences of a specific length and also generating the expected response.
Why Data Sampling Matters
LLMs like GPT are trained to generate or predict text by understanding the context provided by previous words. To achieve this, the training data must be structured so that the model can learn the relationship between sequences of words and the words that follow them. This structured approach allows the model to generalize and produce coherent, contextually appropriate text.
Key Concepts in Data Sampling
- Tokenization: Breaking text down into smaller units called tokens (e.g., words, subwords, or characters).
- Sequence Length (max_length): The number of tokens in each input sequence.
- Sliding Window: A method of creating overlapping input sequences by moving a window over the tokenized text.
- Stride: The number of tokens the sliding window moves forward to create the next sequence.
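To see how these pieces fit together before the full example, here is a rough, self-contained sketch (the helper name sliding_window_samples is made up for illustration and is not from the book's code): it slides a window of max_length tokens over a list of token IDs, advancing by stride each step, and pairs every window with the same window shifted one position to the right as the target.
# Hypothetical sketch, not the book's code: sliding-window sampling over token IDs
token_ids = [101, 102, 103, 104, 105, 106, 107, 108]  # pretend these came from a tokenizer

def sliding_window_samples(token_ids, max_length, stride):
    samples = []
    # Advance the window start by `stride` tokens each step
    for i in range(0, len(token_ids) - max_length, stride):
        input_ids = token_ids[i:i + max_length]            # window of max_length tokens
        target_ids = token_ids[i + 1:i + max_length + 1]   # same window shifted by one
        samples.append((input_ids, target_ids))
    return samples

for inp, tgt in sliding_window_samples(token_ids, max_length=4, stride=1):
    print(inp, "->", tgt)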
Step-by-Step Example
Let's walk through an example to illustrate data sampling.
Example Text
"Lorem ipsum dolor sit amet, consectetur adipiscing elit."
Tokenization
Assume we use a basic tokenizer that splits the text into words and punctuation marks:
Tokens: ["Lorem", "ipsum", "dolor", "sit", "amet,", "consectetur", "adipiscing", "elit."]
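For this toy example, a plain whitespace split is enough to reproduce the token list above (punctuation stays attached to the preceding word; a real subword tokenizer such as GPT-2's BPE would split differently):
text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit."
tokens = text.split()  # naive word-level tokenization
print(tokens)
# ['Lorem', 'ipsum', 'dolor', 'sit', 'amet,', 'consectetur', 'adipiscing', 'elit.']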
Parameters
- Maximum Sequence Length (max_length): 4 tokens
- Sliding Window Stride: 1 token
Creating Input and Target Sequences
- Sliding Window Approach:
- Input Sequences: Each input sequence consists of max_length tokens.
- Target Sequences: Each target sequence consists of the tokens that immediately follow the corresponding input sequence.
- Generating the Sequences:
Window Position | Input Sequence | Target Sequence |
---|---|---|
1 | ["Lorem", "ipsum", "dolor", "sit"] | ["ipsum", "dolor", "sit", "amet,"] |
2 | ["ipsum", "dolor", "sit", "amet,"] | ["dolor", "sit", "amet,", "consectetur"] |
3 | ["dolor", "sit", "amet,", "consectetur"] | ["sit", "amet,", "consectetur", "adipiscing"] |
4 | ["sit", "amet,", "consectetur", "adipiscing"] | ["amet,", "consectetur", "adipiscing", "elit."] |
- Resulting Input and Target Arrays:
- Input:
[
["Lorem", "ipsum", "dolor", "sit"],
["ipsum", "dolor", "sit", "amet,"],
["dolor", "sit", "amet,", "consectetur"],
["sit", "amet,", "consectetur", "adipiscing"],
]
- Target:
[
["ipsum", "dolor", "sit", "amet,"],
["dolor", "sit", "amet,", "consectetur"],
["sit", "amet,", "consectetur", "adipiscing"],
["amet,", "consectetur", "adipiscing", "elit."],
]
Visual Representation
Token Position | Token |
---|---|
1 | Lorem |
2 | ipsum |
3 | dolor |
4 | sit |
5 | amet, |
6 | consectetur |
7 | adipiscing |
8 | elit. |
Sliding Window with Stride 1:
- First Window (Positions 1-4): ["Lorem", "ipsum", "dolor", "sit"] → Target: ["ipsum", "dolor", "sit", "amet,"]
- Second Window (Positions 2-5): ["ipsum", "dolor", "sit", "amet,"] → Target: ["dolor", "sit", "amet,", "consectetur"]
- Third Window (Positions 3-6): ["dolor", "sit", "amet,", "consectetur"] → Target: ["sit", "amet,", "consectetur", "adipiscing"]
- Fourth Window (Positions 4-7): ["sit", "amet,", "consectetur", "adipiscing"] → Target: ["amet,", "consectetur", "adipiscing", "elit."]
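The same four input/target windows can be reproduced with a few lines of Python (using the whitespace-split tokens from above):
tokens = ["Lorem", "ipsum", "dolor", "sit", "amet,", "consectetur", "adipiscing", "elit."]
max_length, stride = 4, 1
# Pair each window with the same window shifted one token to the right
pairs = [(tokens[i:i + max_length], tokens[i + 1:i + max_length + 1])
         for i in range(0, len(tokens) - max_length, stride)]
for inp, tgt in pairs:
    print(inp, "→", tgt)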
Understanding Stride
- Stride of 1: The window moves forward by one token each time, producing highly overlapping sequences. This can lead to better learning of contextual relationships but may increase the risk of overfitting, since very similar data points are repeated.
- Stride of 2: The window moves forward by two tokens each time, reducing the overlap. This decreases redundancy and computational load but might miss some contextual nuances.
- Stride equal to max_length: The window moves forward by the entire window size, producing non-overlapping sequences. This minimizes data redundancy but may limit the model's ability to learn dependencies across sequences.
Example with Stride 2:
Using the same tokenized text and a max_length of 4:
- First Window (Positions 1-4): ["Lorem", "ipsum", "dolor", "sit"] → Target: ["ipsum", "dolor", "sit", "amet,"]
- Second Window (Positions 3-6): ["dolor", "sit", "amet,", "consectetur"] → Target: ["sit", "amet,", "consectetur", "adipiscing"]
- Third Window (Positions 5-8): ["amet,", "consectetur", "adipiscing", "elit."] → Target: ["consectetur", "adipiscing", "elit.", "sed"] (Assuming the text continues)
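Re-running the windowing loop with different stride values makes the trade-off concrete: stride=1 keeps every overlapping window, stride=2 keeps every other one, and a stride equal to max_length yields non-overlapping windows. Note that this sketch simply stops once a full target window can no longer be formed, instead of assuming the text continues:
tokens = ["Lorem", "ipsum", "dolor", "sit", "amet,", "consectetur", "adipiscing", "elit."]
max_length = 4
for stride in (1, 2, 4):
    # Keep only windows whose shifted target still fits inside the token list
    windows = [tokens[i:i + max_length]
               for i in range(0, len(tokens) - max_length, stride)]
    print(f"stride={stride}: {len(windows)} window(s)")
    for w in windows:
        print("  ", w)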
Code Example
Let's understand this better through a code example from https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/ch02.ipynb:
# Download the text to pre-train the LLM
import urllib.request
url = ("https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt")
file_path = "the-verdict.txt"
urllib.request.urlretrieve(url, file_path)
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
"""
Create a class that will receive some params like tokenizer and text
and will prepare the input chunks and the target chunks to prepare
the LLM to learn which next token to generate
"""
import torch
from torch.utils.data import Dataset, DataLoader
class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        # Number of (input, target) pairs generated from the text
        return len(self.input_ids)

    def __getitem__(self, idx):
        # Return one (input, target) pair of tensors
        return self.input_ids[idx], self.target_ids[idx]
"""
Create a data loader which given the text and some params will
prepare the inputs and targets with the previous class and
then create a torch DataLoader with the info
"""
import tiktoken
def create_dataloader_v1(txt, batch_size=4, max_length=256,
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0):
    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # Create dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )

    return dataloader
"""
Finally, create the data loader with the params we want:
- The text used for training
- batch_size: The size of each batch
- max_length: The size of each entry in each batch
- stride: The sliding window step (how many tokens the next entry advances compared to the previous one). The smaller it is, the more overlap and the higher the risk of overfitting; usually this is set equal to max_length so the same tokens aren't repeated.
- shuffle: Re-order randomly
"""
dataloader = create_dataloader_v1(
raw_text, batch_size=8, max_length=4, stride=1, shuffle=False
)
data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)
# Note the batch_size of 8, the max_length of 4 and the stride of 1
[
# Input
tensor([[ 40, 367, 2885, 1464],
[ 367, 2885, 1464, 1807],
[ 2885, 1464, 1807, 3619],
[ 1464, 1807, 3619, 402],
[ 1807, 3619, 402, 271],
[ 3619, 402, 271, 10899],
[ 402, 271, 10899, 2138],
[ 271, 10899, 2138, 257]]),
# Target
tensor([[ 367, 2885, 1464, 1807],
[ 2885, 1464, 1807, 3619],
[ 1464, 1807, 3619, 402],
[ 1807, 3619, 402, 271],
[ 3619, 402, 271, 10899],
[ 402, 271, 10899, 2138],
[ 271, 10899, 2138, 257],
[10899, 2138, 257, 7026]])
]
# With stride=4 this will be the result:
[
# Input
tensor([[ 40, 367, 2885, 1464],
[ 1807, 3619, 402, 271],
[10899, 2138, 257, 7026],
[15632, 438, 2016, 257],
[ 922, 5891, 1576, 438],
[ 568, 340, 373, 645],
[ 1049, 5975, 284, 502],
[ 284, 3285, 326, 11]]),
# Target
tensor([[ 367, 2885, 1464, 1807],
[ 3619, 402, 271, 10899],
[ 2138, 257, 7026, 15632],
[ 438, 2016, 257, 922],
[ 5891, 1576, 438, 568],
[ 340, 373, 645, 1049],
[ 5975, 284, 502, 284],
[ 3285, 326, 11, 287]])
]
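For reference, the stride=4 output above comes from the same call as before, with the stride set equal to max_length so that consecutive entries no longer overlap:
dataloader = create_dataloader_v1(
    raw_text, batch_size=8, max_length=4, stride=4, shuffle=False
)
print(next(iter(dataloader)))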