2. Data Sampling

Data sampling is a crucial step in preparing data for training large language models (LLMs) like GPT. It involves organizing text data into input and target sequences that the model uses to learn how to predict the next word (or token) based on the preceding words. Proper data sampling ensures that the model effectively captures language patterns and dependencies.

tip

The goal of this second phase is very simple: Sample the input data and prepare it for the training phase, usually by splitting the dataset into sequences of a specific length and also generating the expected response.

Why Data Sampling Matters

LLMs like GPT are trained to generate or predict text by understanding the context provided by the preceding words. To achieve this, the training data must be structured so that the model can learn the relationship between sequences of words and the words that follow them. This structured approach allows the model to generalize and produce coherent, contextually appropriate text.

Key Concepts in Data Sampling

  1. Tokenization: Breaking text into smaller units called tokens (e.g., words, subwords, or characters).
  2. Sequence Length (max_length): The number of tokens in each input sequence.
  3. Sliding Window: A method of creating overlapping input sequences by moving a window over the tokenized text.
  4. Stride: The number of tokens the sliding window moves forward to create the next sequence (see the sketch below).
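
To make these four ideas concrete, here is a minimal sketch (plain Python over a hypothetical word-level token list; real pipelines use a proper tokenizer) of how a sliding window with a given max_length and stride produces input/target pairs:

python
# Illustrative only: a toy token list and a sliding window over it
tokens = ["Lorem", "ipsum", "dolor", "sit", "amet,",
          "consectetur", "adipiscing", "elit."]
max_length = 4   # tokens per input sequence
stride = 1       # how far the window advances each step

for i in range(0, len(tokens) - max_length, stride):
    input_seq = tokens[i : i + max_length]           # current window
    target_seq = tokens[i + 1 : i + max_length + 1]  # same window shifted one token
    print(input_seq, "->", target_seq)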

Step-by-Step Example

Let's walk through an example to illustrate data sampling.

Example Text

"Lorem ipsum dolor sit amet, consectetur adipiscing elit."

Tokenization

Assume we use a basic tokenizer that splits the text into words and punctuation marks:

Tokens: ["Lorem", "ipsum", "dolor", "sit", "amet,", "consectetur", "adipiscing", "elit."]

Parameters

  • Maximum Sequence Length (max_length): 4 tokens
  • Sliding Window Stride: 1 token

Creating Input and Target Sequences

  1. Sliding Window Approach:
  • Input Sequences: Each input sequence consists of max_length tokens.
  • Target Sequences: Each target sequence consists of the tokens that immediately follow the corresponding input sequence.
  2. Generating the Sequences:

Window Position | Input Sequence | Target Sequence
1 | ["Lorem", "ipsum", "dolor", "sit"] | ["ipsum", "dolor", "sit", "amet,"]
2 | ["ipsum", "dolor", "sit", "amet,"] | ["dolor", "sit", "amet,", "consectetur"]
3 | ["dolor", "sit", "amet,", "consectetur"] | ["sit", "amet,", "consectetur", "adipiscing"]
4 | ["sit", "amet,", "consectetur", "adipiscing"] | ["amet,", "consectetur", "adipiscing", "elit."]
  3. Resulting Input and Target Arrays:
  • Input:
python
[
["Lorem", "ipsum", "dolor", "sit"],
["ipsum", "dolor", "sit", "amet,"],
["dolor", "sit", "amet,", "consectetur"],
["sit", "amet,", "consectetur", "adipiscing"],
]
  • Target:
python
[
["ipsum", "dolor", "sit", "amet,"],
["dolor", "sit", "amet,", "consectetur"],
["sit", "amet,", "consectetur", "adipiscing"],
["amet,", "consectetur", "adipiscing", "elit."],
]
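
The same arrays can also be built programmatically; a small sketch assuming the toy token list from above:

python
# Build the input and target arrays shown above (illustrative)
tokens = ["Lorem", "ipsum", "dolor", "sit", "amet,",
          "consectetur", "adipiscing", "elit."]
max_length, stride = 4, 1

inputs = [tokens[i : i + max_length]
          for i in range(0, len(tokens) - max_length, stride)]
targets = [tokens[i + 1 : i + max_length + 1]
           for i in range(0, len(tokens) - max_length, stride)]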

Visual Representation

Token Position | Token
1 | Lorem
2 | ipsum
3 | dolor
4 | sit
5 | amet,
6 | consectetur
7 | adipiscing
8 | elit.

Sliding Window with Stride 1:

  • First Window (Positions 1-4): ["Lorem", "ipsum", "dolor", "sit"] → Target: ["ipsum", "dolor", "sit", "amet,"]
  • Second Window (Positions 2-5): ["ipsum", "dolor", "sit", "amet,"] → Target: ["dolor", "sit", "amet,", "consectetur"]
  • Third Window (Positions 3-6): ["dolor", "sit", "amet,", "consectetur"] → Target: ["sit", "amet,", "consectetur", "adipiscing"]
  • Fourth Window (Positions 4-7): ["sit", "amet,", "consectetur", "adipiscing"] → Target: ["amet,", "consectetur", "adipiscing", "elit."]

Understanding Stride

  • Stride of 1: The window moves forward by one token each time, producing highly overlapping sequences. This can lead to better learning of contextual relationships but may increase the risk of overfitting because similar data is repeated.
  • Stride of 2: The window moves forward by two tokens each time, reducing the overlap. This decreases repetition and computational load but may miss some contextual nuances.
  • Stride equal to max_length: The window moves forward by the full window size, producing non-overlapping sequences. This minimizes data repetition but may limit the model's ability to learn dependencies across sequences.

Example with Stride 2:

Using the same text with a max_length of 4:

  • First Window (Positions 1-4): ["Lorem", "ipsum", "dolor", "sit"] → Target: ["ipsum", "dolor", "sit", "amet,"]
  • Second Window (Positions 3-6): ["dolor", "sit", "amet,", "consectetur"] → Target: ["sit", "amet,", "consectetur", "adipiscing"]
  • Third Window (Positions 5-8): ["amet,", "consectetur", "adipiscing", "elit."] → Target: ["consectetur", "adipiscing", "elit.", "sed"] (assuming the text continues; see the sketch below)
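
A quick sketch of the stride-2 case (the trailing token "sed" is only an assumed continuation of the sample text):

python
# Stride of 2: the window advances two tokens at a time, reducing overlap
tokens = ["Lorem", "ipsum", "dolor", "sit", "amet,",
          "consectetur", "adipiscing", "elit.", "sed"]  # "sed" = assumed continuation
max_length = 4
stride = 2

for i in range(0, len(tokens) - max_length, stride):
    print(tokens[i : i + max_length], "->", tokens[i + 1 : i + max_length + 1])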

Code Example

Let's understand this better through a code example from https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/ch02.ipynb:

python
# Download the text to pre-train the LLM
import urllib.request
url = ("https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt")
file_path = "the-verdict.txt"
urllib.request.urlretrieve(url, file_path)

with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

"""
Create a class that will receive some params like tokenizer and text
and will prepare the input chunks and the target chunks to prepare
the LLM to learn which next token to generate
"""
import torch
from torch.utils.data import Dataset, DataLoader

class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]


"""
Create a data loader which given the text and some params will
prepare the inputs and targets with the previous class and
then create a torch DataLoader with the info
"""

import tiktoken

def create_dataloader_v1(txt, batch_size=4, max_length=256,
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0):

    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # Create dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )

    return dataloader


"""
Finally, create the data loader with the params we want:
- The used text for training
- batch_size: The size of each batch
- max_length: The size of each entry on each batch
- stride: The sliding window (how many tokens should the next entry advance compared to the previous one). The smaller the more overfitting, usually this is equals to the max_length so the same tokens aren't repeated.
- shuffle: Re-order randomly
"""
dataloader = create_dataloader_v1(
    raw_text, batch_size=8, max_length=4, stride=1, shuffle=False
)

data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)

# Note the batch_size of 8, the max_length of 4 and the stride of 1
[
# Input
tensor([[   40,   367,  2885,  1464],
[  367,  2885,  1464,  1807],
[ 2885,  1464,  1807,  3619],
[ 1464,  1807,  3619,   402],
[ 1807,  3619,   402,   271],
[ 3619,   402,   271, 10899],
[  402,   271, 10899,  2138],
[  271, 10899,  2138,   257]]),
# Target
tensor([[  367,  2885,  1464,  1807],
[ 2885,  1464,  1807,  3619],
[ 1464,  1807,  3619,   402],
[ 1807,  3619,   402,   271],
[ 3619,   402,   271, 10899],
[  402,   271, 10899,  2138],
[  271, 10899,  2138,   257],
[10899,  2138,   257,  7026]])
]

# With stride=4 this will be the result:
[
# Input
tensor([[   40,   367,  2885,  1464],
[ 1807,  3619,   402,   271],
[10899,  2138,   257,  7026],
[15632,   438,  2016,   257],
[  922,  5891,  1576,   438],
[  568,   340,   373,   645],
[ 1049,  5975,   284,   502],
[  284,  3285,   326,    11]]),
# Target
tensor([[  367,  2885,  1464,  1807],
[ 3619,   402,   271, 10899],
[ 2138,   257,  7026, 15632],
[  438,  2016,   257,   922],
[ 5891,  1576,   438,   568],
[  340,   373,   645,  1049],
[ 5975,   284,   502,   284],
[ 3285,   326,    11,   287]])
]
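
To sanity-check what those ID tensors contain, you can decode a row back to text with the same tiktoken encoder (a quick illustrative check, reusing first_batch from the stride=1 run above):

python
# Decode the first input/target pair of the batch back into readable text
tokenizer = tiktoken.get_encoding("gpt2")
inputs, targets = first_batch
print(tokenizer.decode(inputs[0].tolist()))   # first 4-token window as text
print(tokenizer.decode(targets[0].tolist()))  # the same window shifted one token ahead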

References