2. Campionamento Dati

Tip

Impara e pratica il hacking AWS:HackTricks Training AWS Red Team Expert (ARTE)
Impara e pratica il hacking GCP: HackTricks Training GCP Red Team Expert (GRTE) Impara e pratica il hacking Azure: HackTricks Training Azure Red Team Expert (AzRTE)

Supporta HackTricks

Controlla i piani di abbonamento!

Unisciti al 💬 gruppo Discord o al gruppo telegram o seguici su Twitter 🐦 @hacktricks_live.

Condividi trucchi di hacking inviando PR ai HackTricks e HackTricks Cloud repos github.

Campionamento Dati

Il Campionamento Dati è un processo cruciale nella preparazione dei dati per l’addestramento di modelli di linguaggio di grandi dimensioni (LLM) come GPT. Comporta l’organizzazione dei dati testuali in sequenze di input e target che il modello utilizza per imparare a prevedere la parola successiva (o token) basandosi sulle parole precedenti. Un corretto campionamento dei dati assicura che il modello catturi efficacemente i modelli linguistici e le dipendenze.

Tip

L’obiettivo di questa seconda fase è molto semplice: Campionare i dati di input e prepararli per la fase di addestramento solitamente separando il dataset in frasi di una lunghezza specifica e generando anche la risposta attesa.

Perché il Campionamento Dati è Importante

Gli LLM come GPT sono addestrati a generare o prevedere testo comprendendo il contesto fornito dalle parole precedenti. Per raggiungere questo obiettivo, i dati di addestramento devono essere strutturati in modo che il modello possa apprendere la relazione tra sequenze di parole e le loro parole successive. Questo approccio strutturato consente al modello di generalizzare e generare testo coerente e contestualmente rilevante.

Concetti Chiave nel Campionamento Dati

Tokenizzazione: Suddividere il testo in unità più piccole chiamate token (ad es., parole, sottoparole o caratteri).
Lunghezza della Sequenza (max_length): Il numero di token in ciascuna sequenza di input.
Finestra Scorrevole: Un metodo per creare sequenze di input sovrapposte spostando una finestra sul testo tokenizzato.
Passo: Il numero di token che la finestra scorrevole si sposta in avanti per creare la sequenza successiva.

Esempio Passo-Passo

Esaminiamo un esempio per illustrare il campionamento dei dati.

Testo di Esempio

"Lorem ipsum dolor sit amet, consectetur adipiscing elit."

Tokenizzazione

Assumiamo di utilizzare un tokenizer di base che suddivide il testo in parole e segni di punteggiatura:

Tokens: ["Lorem", "ipsum", "dolor", "sit", "amet,", "consectetur", "adipiscing", "elit."]

Parametri

Lunghezza Massima della Sequenza (max_length): 4 token
Passo della Finestra Scorrevole: 1 token

Creazione di Sequenze di Input e Target

Approccio della Finestra Scorrevole:

Sequenze di Input: Ogni sequenza di input è composta da max_length token.
Sequenze di Target: Ogni sequenza di target è composta dai token che seguono immediatamente la corrispondente sequenza di input.

Generazione delle Sequenze:

Posizione della Finestra	Sequenza di Input	Sequenza di Target
1	["Lorem", "ipsum", "dolor", "sit"]	["ipsum", "dolor", "sit", "amet,"]
2	["ipsum", "dolor", "sit", "amet,"]	["dolor", "sit", "amet,", "consectetur"]
3	["dolor", "sit", "amet,", "consectetur"]	["sit", "amet,", "consectetur", "adipiscing"]
4	["sit", "amet,", "consectetur", "adipiscing"]	["amet,", "consectetur", "adipiscing", "elit."]

Array di Input e Target Risultanti:

Input:

[
["Lorem", "ipsum", "dolor", "sit"],
["ipsum", "dolor", "sit", "amet,"],
["dolor", "sit", "amet,", "consectetur"],
["sit", "amet,", "consectetur", "adipiscing"],
]

Target:

[
["ipsum", "dolor", "sit", "amet,"],
["dolor", "sit", "amet,", "consectetur"],
["sit", "amet,", "consectetur", "adipiscing"],
["amet,", "consectetur", "adipiscing", "elit."],
]

Rappresentazione Visiva

Posizione del Token	Token
1	Lorem
2	ipsum
3	dolor
4	sit
5	amet,
6	consectetur
7	adipiscing
8	elit.

Finestra Scorrevole con Passo 1:

Prima Finestra (Posizioni 1-4): [“Lorem”, “ipsum”, “dolor”, “sit”] → Target: [“ipsum”, “dolor”, “sit”, “amet,”]
Seconda Finestra (Posizioni 2-5): [“ipsum”, “dolor”, “sit”, “amet,”] → Target: [“dolor”, “sit”, “amet,”, “consectetur”]
Terza Finestra (Posizioni 3-6): [“dolor”, “sit”, “amet,”, “consectetur”] → Target: [“sit”, “amet,”, “consectetur”, “adipiscing”]
Quarta Finestra (Posizioni 4-7): [“sit”, “amet,”, “consectetur”, “adipiscing”] → Target: [“amet,”, “consectetur”, “adipiscing”, “elit.”]

Comprendere il Passo

Passo di 1: La finestra si sposta in avanti di un token ogni volta, risultando in sequenze altamente sovrapposte. Questo può portare a un miglior apprendimento delle relazioni contestuali ma può aumentare il rischio di overfitting poiché punti dati simili vengono ripetuti.
Passo di 2: La finestra si sposta in avanti di due token ogni volta, riducendo la sovrapposizione. Questo diminuisce la ridondanza e il carico computazionale ma potrebbe perdere alcune sfumature contestuali.
Passo Uguale a max_length: La finestra si sposta in avanti per l’intera dimensione della finestra, risultando in sequenze non sovrapposte. Questo minimizza la ridondanza dei dati ma può limitare la capacità del modello di apprendere dipendenze tra le sequenze.

Esempio con Passo di 2:

Utilizzando lo stesso testo tokenizzato e max_length di 4:

Prima Finestra (Posizioni 1-4): [“Lorem”, “ipsum”, “dolor”, “sit”] → Target: [“ipsum”, “dolor”, “sit”, “amet,”]
Seconda Finestra (Posizioni 3-6): [“dolor”, “sit”, “amet,”, “consectetur”] → Target: [“sit”, “amet,”, “consectetur”, “adipiscing”]
Terza Finestra (Posizioni 5-8): [“amet,”, “consectetur”, “adipiscing”, “elit.”] → Target: [“consectetur”, “adipiscing”, “elit.”, “sed”] (Assumendo continuazione)

Esempio di Codice

Cerchiamo di comprendere meglio questo attraverso un esempio di codice da https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/ch02.ipynb:

# Download the text to pre-train the LLM
import urllib.request
url = ("https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt")
file_path = "the-verdict.txt"
urllib.request.urlretrieve(url, file_path)

with open("the-verdict.txt", "r", encoding="utf-8") as f:
raw_text = f.read()

"""
Create a class that will receive some params lie tokenizer and text
and will prepare the input chunks and the target chunks to prepare
the LLM to learn which next token to generate
"""
import torch
from torch.utils.data import Dataset, DataLoader

class GPTDatasetV1(Dataset):
def __init__(self, txt, tokenizer, max_length, stride):
self.input_ids = []
self.target_ids = []

# Tokenize the entire text
token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

# Use a sliding window to chunk the book into overlapping sequences of max_length
for i in range(0, len(token_ids) - max_length, stride):
input_chunk = token_ids[i:i + max_length]
target_chunk = token_ids[i + 1: i + max_length + 1]
self.input_ids.append(torch.tensor(input_chunk))
self.target_ids.append(torch.tensor(target_chunk))

def __len__(self):
return len(self.input_ids)

def __getitem__(self, idx):
return self.input_ids[idx], self.target_ids[idx]


"""
Create a data loader which given the text and some params will
prepare the inputs and targets with the previous class and
then create a torch DataLoader with the info
"""

import tiktoken

def create_dataloader_v1(txt, batch_size=4, max_length=256,
stride=128, shuffle=True, drop_last=True,
num_workers=0):

# Initialize the tokenizer
tokenizer = tiktoken.get_encoding("gpt2")

# Create dataset
dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

# Create dataloader
dataloader = DataLoader(
dataset,
batch_size=batch_size,
shuffle=shuffle,
drop_last=drop_last,
num_workers=num_workers
)

return dataloader


"""
Finally, create the data loader with the params we want:
- The used text for training
- batch_size: The size of each batch
- max_length: The size of each entry on each batch
- stride: The sliding window (how many tokens should the next entry advance compared to the previous one). The smaller the more overfitting, usually this is equals to the max_length so the same tokens aren't repeated.
- shuffle: Re-order randomly
"""
dataloader = create_dataloader_v1(
raw_text, batch_size=8, max_length=4, stride=1, shuffle=False
)

data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)

# Note the batch_size of 8, the max_length of 4 and the stride of 1
[
# Input
tensor([[   40,   367,  2885,  1464],
[  367,  2885,  1464,  1807],
[ 2885,  1464,  1807,  3619],
[ 1464,  1807,  3619,   402],
[ 1807,  3619,   402,   271],
[ 3619,   402,   271, 10899],
[  402,   271, 10899,  2138],
[  271, 10899,  2138,   257]]),
# Target
tensor([[  367,  2885,  1464,  1807],
[ 2885,  1464,  1807,  3619],
[ 1464,  1807,  3619,   402],
[ 1807,  3619,   402,   271],
[ 3619,   402,   271, 10899],
[  402,   271, 10899,  2138],
[  271, 10899,  2138,   257],
[10899,  2138,   257,  7026]])
]

# With stride=4 this will be the result:
[
# Input
tensor([[   40,   367,  2885,  1464],
[ 1807,  3619,   402,   271],
[10899,  2138,   257,  7026],
[15632,   438,  2016,   257],
[  922,  5891,  1576,   438],
[  568,   340,   373,   645],
[ 1049,  5975,   284,   502],
[  284,  3285,   326,    11]]),
# Target
tensor([[  367,  2885,  1464,  1807],
[ 3619,   402,   271, 10899],
[ 2138,   257,  7026, 15632],
[  438,  2016,   257,   922],
[ 5891,  1576,   438,   568],
[  340,   373,   645,  1049],
[ 5975,   284,   502,   284],
[ 3285,   326,    11,   287]])
]

Strategie di Campionamento Avanzate (2023-2025)

1. Pesatura della Miscela Basata sulla Temperatura

I LLM all’avanguardia sono raramente addestrati su un singolo corpus. Invece, campionano da diverse fonti di dati eterogenee (codice, web, articoli accademici, forum…). La proporzione relativa di ciascuna fonte può influenzare fortemente le prestazioni a valle. Modelli open-source recenti come Llama 2 hanno introdotto uno schema di campionamento basato sulla temperatura in cui la probabilità di estrarre un documento dal corpus i diventa

p(i) = \frac{w_i^{\alpha}}{\sum_j w_j^{\alpha}}

• w_i – percentuale di token grezzi del corpus i
• α (“temperatura”) – un valore in (0,1]. α < 1 appiattisce la distribuzione, dando più peso a corpus di alta qualità più piccoli.

Llama 2 ha utilizzato α = 0.7 e ha dimostrato che diminuire α ha aumentato i punteggi di valutazione su compiti ad alta conoscenza mantenendo stabile il mix di addestramento. Lo stesso trucco è adottato da Mistral (2023) e Claude 3.

from collections import Counter

def temperature_sample(corpus_ids, alpha=0.7):
counts = Counter(corpus_ids)           # number of tokens seen per corpus
probs  = {c: c_count**alpha for c, c_count in counts.items()}
Z = sum(probs.values())
probs = {c: p/Z for c, p in probs.items()}
# Now draw according to probs to fill every batch


### 2. Sequence Packing / Dynamic Batching
GPU memory is wasted when every sequence in a batch is padded to the longest example.  "Packing" concatenates multiple shorter sequences until the **exact** `max_length` is reached and builds a parallel `attention_mask` so that tokens do not attend across segment boundaries.  Packing can improve throughput by 20–40 % with no gradient change and is supported out-of-the-box in

* PyTorch `torchtext.experimental.agents.PackedBatch`
* HuggingFace `DataCollatorForLanguageModeling(pad_to_multiple_of=…)`

Dynamic batching frameworks (e.g. FlashAttention 2, vLLM 2024) combine sequence packing with just-in-time kernel selection, enabling thousand-token context training at 400+ K tokens/s on A100-80G.

### 3. Deduplication & Quality Filtering
Repeated passages cause memorization and provide an easy channel for data-poisoning.  Modern pipelines therefore:

1. MinHash/FAISS near-duplicate detection at **document** and **128-gram** level.
2. Filter documents whose perplexity under a small reference model is > µ + 3σ (noisy OCR, garbled HTML).
3. Block-list documents that contain PII or CWE keywords using regex & spaCy NER.

The Llama 2 team deduplicated with 8-gram MinHash and removed ~15 % of CommonCrawl before sampling.  OpenAI’s 2024 "Deduplicate Everything" paper demonstrates ≤0.04 duplicate ratio reduces over-fitting and speeds convergence.

## Security & Privacy Considerations During Sampling

### Data-Poisoning / Backdoor Attacks
Researchers showed that inserting <1 % backdoored sentences can make a model obey a hidden trigger ("PoisonGPT", 2023).  Recommended mitigations:

* **Shuffled mixing** – make sure adjacent training examples originate from different sources; this dilutes gradient alignment of malicious spans.
* **Gradient similarity scoring** – compute cosine similarity of example gradient to batch average; outliers are candidates for removal.
* **Dataset versioning & hashes** – freeze immutable tarballs and verify SHA-256 before each training run.

### Membership-Inference & Memorization
Long overlap between sliding-window samples increases the chance that rare strings (telephone numbers, secret keys) are memorized.  OpenAI’s 2024 study on ChatGPT memorization reports that raising stride from 1 × `max_length` to 4 × reduces verbatim leakage by ≈50 % with negligible loss in perplexity.

Practical recommendations:

* Use **stride ≥ max_length** except for <1B parameter models where data volume is scarce.
* Add random masking of 1-3 tokens per window during training; this lowers memorization while preserving utility.

---

## References

- [Build a Large Language Model from Scratch (Manning, 2024)](https://www.manning.com/books/build-a-large-language-model-from-scratch)
- [Llama 2: Open Foundation and Fine-Tuned Chat Models (2023)](https://arxiv.org/abs/2307.09288)
- [PoisonGPT: Assessing Backdoor Vulnerabilities in Large Language Models (BlackHat EU 2023)](https://arxiv.org/abs/2308.12364)

> [!TIP]
> Impara e pratica il hacking AWS:<img src="../../../../../images/arte.png" alt="" style="width:auto;height:24px;vertical-align:middle;">[**HackTricks Training AWS Red Team Expert (ARTE)**](https://training.hacktricks.xyz/courses/arte)<img src="../../../../../images/arte.png" alt="" style="width:auto;height:24px;vertical-align:middle;">\
> Impara e pratica il hacking GCP: <img src="../../../../../images/grte.png" alt="" style="width:auto;height:24px;vertical-align:middle;">[**HackTricks Training GCP Red Team Expert (GRTE)**](https://training.hacktricks.xyz/courses/grte)<img src="../../../../../images/grte.png" alt="" style="width:auto;height:24px;vertical-align:middle;">
> Impara e pratica il hacking Azure: <img src="../../../../../images/azrte.png" alt="" style="width:auto;height:24px;vertical-align:middle;">[**HackTricks Training Azure Red Team Expert (AzRTE)**](https://training.hacktricks.xyz/courses/azrte)<img src="../../../../../images/azrte.png" alt="" style="width:auto;height:24px;vertical-align:middle;">
>
> <details>
>
> <summary>Supporta HackTricks</summary>
>
> - Controlla i [**piani di abbonamento**](https://github.com/sponsors/carlospolop)!
> - **Unisciti al** 💬 [**gruppo Discord**](https://discord.gg/hRep4RUj7f) o al [**gruppo telegram**](https://t.me/peass) o **seguici** su **Twitter** 🐦 [**@hacktricks_live**](https://twitter.com/hacktricks_live)**.**
> - **Condividi trucchi di hacking inviando PR ai** [**HackTricks**](https://github.com/carlospolop/hacktricks) e [**HackTricks Cloud**](https://github.com/carlospolop/hacktricks-cloud) repos github.
>
> </details>