6. Pre-training & Loading models
Tip

Learn & practice AWS Hacking: HackTricks Training AWS Red Team Expert (ARTE)
Learn & practice GCP Hacking: HackTricks Training GCP Red Team Expert (GRTE)
Learn & practice Azure Hacking: HackTricks Training Azure Red Team Expert (AzRTE)

Support HackTricks

- Check the subscription plans!
- Join the 💬 Discord group or the Telegram group or follow us on Twitter 🐦 @hacktricks_live.
- Share hacking tricks by submitting PRs to the HackTricks and HackTricks Cloud github repos.
Text Generation
To train a model we need that model to be able to generate new tokens. Then we compare the generated tokens with the expected ones in order to train the model into learning the tokens it needs to generate.

As in the previous examples we already predicted some tokens, it's possible to reuse that function for this purpose.
Tip
The goal of this sixth phase is very simple: **Train the model from scratch**. For this the previous LLM architecture will be used with some loops going over the data sets using the defined loss functions and optimizer.
Text Evaluation
In order to perform a correct training it's needed to measure the predictions obtained for the expected token. The goal of the training is to maximize the likelihood of the correct token, which involves increasing its probability relative to the other tokens.

In order to maximize the probability of the correct token, the weights of the model must be modified so that probability is maximised. The updates of the weights are done via backpropagation. This requires a loss function to optimize. In this case, the function will be the difference between the performed prediction and the desired one.

However, instead of working with the raw predictions, it works with a logarithm with base n. So if the current prediction of the expected token was 7.4541e-05, the natural logarithm (base e) of 7.4541e-05 is approximately -9.5042.

Then, for each entry with a context length of 5 tokens for example, the model will need to predict 5 tokens, with the first 4 tokens being the last ones of the input and the fifth the predicted one. Therefore, for each entry we will have 5 predictions in that case (even if the first 4 were in the input, the model doesn't know this) with 5 expected tokens and therefore 5 probabilities to maximize.

Therefore, after performing the natural logarithm on each prediction, the average is calculated and the minus symbol removed (this is called the _cross entropy loss_), and that's the number to reduce as close to 0 as possible, because the natural logarithm of 1 is 0:
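A minimal sketch with toy numbers (only the 7.4541e-05 value comes from the example above; the rest are made up) of how that cross entropy value is obtained:

```python
import torch

# Probabilities the model assigned to the *expected* token at 5 positions of one entry
target_probas = torch.tensor([7.4541e-05, 3.1e-03, 1.2e-01, 8.5e-01, 4.0e-02])

log_probas = torch.log(target_probas)   # natural log, e.g. log(7.4541e-05) ~= -9.5042
avg_log_proba = torch.mean(log_probas)  # average of the logs (a negative number)
cross_entropy_loss = -avg_log_proba     # remove the minus sign -> value to push towards 0

print(cross_entropy_loss)               # 0 would mean every expected token got probability 1
```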
Another way to measure how good the model is is called perplexity. **Perplexity** is a metric used to evaluate how well a probability model predicts a sample. In language modelling, it represents the **model's uncertainty** when predicting the next token in a sequence.

For example, a perplexity value of 48725 means that when it needs to predict a token it isn't sure which among 48,725 tokens in the vocabulary is the good one.
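As a rough sketch of the relationship (reusing the 48,725 figure from the example above), perplexity is just the exponential of the cross entropy loss, which is also how the plotting code later in this page computes it with math.exp(loss):

```python
import math

# Worst case on a 48,725-token vocabulary: a uniform guess gives a cross entropy of ln(48725)
cross_entropy = math.log(48725)       # ~= 10.79
perplexity = math.exp(cross_entropy)  # ~= 48725 -> the model is unsure among ~48,725 tokens

print(perplexity)
```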
Pre-Train Example
This is the initial code proposed in https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/01_main-chapter-code/ch05.ipynb, sometimes slightly modified.
<details>

<summary>Previous code used here but already explained in previous sections</summary>

```python
"""
This is code explained before so it won't be explained
"""
import tiktoken
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader


class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]


def create_dataloader_v1(txt, batch_size=4, max_length=256,
                         stride=128, shuffle=True, drop_last=True, num_workers=0):
    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # Create dataloader
    dataloader = DataLoader(
        dataset, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last, num_workers=num_workers)

    return dataloader


class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert d_out % num_heads == 0, "d_out must be divisible by n_heads"

        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads  # Reduce the projection dim to match desired output dim

        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)  # Linear layer to combine head outputs
        self.dropout = nn.Dropout(dropout)
        self.register_buffer("mask", torch.triu(torch.ones(context_length, context_length), diagonal=1))

    def forward(self, x):
        b, num_tokens, d_in = x.shape

        keys = self.W_key(x)  # Shape: (b, num_tokens, d_out)
        queries = self.W_query(x)
        values = self.W_value(x)

        # We implicitly split the matrix by adding a num_heads dimension
        # Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
        keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
        values = values.view(b, num_tokens, self.num_heads, self.head_dim)
        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)

        # Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)
        keys = keys.transpose(1, 2)
        queries = queries.transpose(1, 2)
        values = values.transpose(1, 2)

        # Compute scaled dot-product attention (aka self-attention) with a causal mask
        attn_scores = queries @ keys.transpose(2, 3)  # Dot product for each head

        # Original mask truncated to the number of tokens and converted to boolean
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]

        # Use the mask to fill attention scores
        attn_scores.masked_fill_(mask_bool, -torch.inf)

        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Shape: (b, num_tokens, num_heads, head_dim)
        context_vec = (attn_weights @ values).transpose(1, 2)

        # Combine heads, where self.d_out = self.num_heads * self.head_dim
        context_vec = context_vec.reshape(b, num_tokens, self.d_out)
        context_vec = self.out_proj(context_vec)  # optional projection

        return context_vec


class LayerNorm(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift


class GELU(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return 0.5 * x * (1 + torch.tanh(
            torch.sqrt(torch.tensor(2.0 / torch.pi)) *
            (x + 0.044715 * torch.pow(x, 3))
        ))


class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
            GELU(),
            nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),
        )

    def forward(self, x):
        return self.layers(x)


class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.att = MultiHeadAttention(
            d_in=cfg["emb_dim"],
            d_out=cfg["emb_dim"],
            context_length=cfg["context_length"],
            num_heads=cfg["n_heads"],
            dropout=cfg["drop_rate"],
            qkv_bias=cfg["qkv_bias"])
        self.ff = FeedForward(cfg)
        self.norm1 = LayerNorm(cfg["emb_dim"])
        self.norm2 = LayerNorm(cfg["emb_dim"])
        self.drop_shortcut = nn.Dropout(cfg["drop_rate"])

    def forward(self, x):
        # Shortcut connection for attention block
        shortcut = x
        x = self.norm1(x)
        x = self.att(x)   # Shape [batch_size, num_tokens, emb_size]
        x = self.drop_shortcut(x)
        x = x + shortcut  # Add the original input back

        # Shortcut connection for feed-forward block
        shortcut = x
        x = self.norm2(x)
        x = self.ff(x)
        x = self.drop_shortcut(x)
        x = x + shortcut  # Add the original input back

        return x


class GPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])

        self.trf_blocks = nn.Sequential(
            *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])

        self.final_norm = LayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False)

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)
        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
        x = tok_embeds + pos_embeds  # Shape [batch_size, num_tokens, emb_size]
        x = self.drop_emb(x)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x)
        return logits
```
</details>
```python
# Download contents to train the data with
import os
import urllib.request
file_path = "the-verdict.txt"
url = "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt"
if not os.path.exists(file_path):
with urllib.request.urlopen(url) as response:
text_data = response.read().decode('utf-8')
with open(file_path, "w", encoding="utf-8") as file:
file.write(text_data)
else:
with open(file_path, "r", encoding="utf-8") as file:
text_data = file.read()
total_characters = len(text_data)
tokenizer = tiktoken.get_encoding("gpt2")
total_tokens = len(tokenizer.encode(text_data))
print("Data downloaded")
print("Characters:", total_characters)
print("Tokens:", total_tokens)
# Model initialization
GPT_CONFIG_124M = {
"vocab_size": 50257, # Vocabulary size
"context_length": 256, # Shortened context length (orig: 1024)
"emb_dim": 768, # Embedding dimension
"n_heads": 12, # Number of attention heads
"n_layers": 12, # Number of layers
"drop_rate": 0.1, # Dropout rate
"qkv_bias": False # Query-key-value bias
}
torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
model.eval()
print ("Model initialized")
# Functions to transform from tokens to ids and from ids to tokens
def text_to_token_ids(text, tokenizer):
encoded = tokenizer.encode(text, allowed_special={'<|endoftext|>'})
encoded_tensor = torch.tensor(encoded).unsqueeze(0) # add batch dimension
return encoded_tensor
def token_ids_to_text(token_ids, tokenizer):
flat = token_ids.squeeze(0) # remove batch dimension
return tokenizer.decode(flat.tolist())
# Define loss functions
def calc_loss_batch(input_batch, target_batch, model, device):
input_batch, target_batch = input_batch.to(device), target_batch.to(device)
logits = model(input_batch)
loss = torch.nn.functional.cross_entropy(logits.flatten(0, 1), target_batch.flatten())
return loss
def calc_loss_loader(data_loader, model, device, num_batches=None):
total_loss = 0.
if len(data_loader) == 0:
return float("nan")
elif num_batches is None:
num_batches = len(data_loader)
else:
# Reduce the number of batches to match the total number of batches in the data loader
# if num_batches exceeds the number of batches in the data loader
num_batches = min(num_batches, len(data_loader))
for i, (input_batch, target_batch) in enumerate(data_loader):
if i < num_batches:
loss = calc_loss_batch(input_batch, target_batch, model, device)
total_loss += loss.item()
else:
break
return total_loss / num_batches
# Apply Train/validation ratio and create dataloaders
train_ratio = 0.90
split_idx = int(train_ratio * len(text_data))
train_data = text_data[:split_idx]
val_data = text_data[split_idx:]
torch.manual_seed(123)
train_loader = create_dataloader_v1(
train_data,
batch_size=2,
max_length=GPT_CONFIG_124M["context_length"],
stride=GPT_CONFIG_124M["context_length"],
drop_last=True,
shuffle=True,
num_workers=0
)
val_loader = create_dataloader_v1(
val_data,
batch_size=2,
max_length=GPT_CONFIG_124M["context_length"],
stride=GPT_CONFIG_124M["context_length"],
drop_last=False,
shuffle=False,
num_workers=0
)
# Sanity checks
if total_tokens * (train_ratio) < GPT_CONFIG_124M["context_length"]:
print("Not enough tokens for the training loader. "
"Try to lower the `GPT_CONFIG_124M['context_length']` or "
"increase the `training_ratio`")
if total_tokens * (1-train_ratio) < GPT_CONFIG_124M["context_length"]:
print("Not enough tokens for the validation loader. "
"Try to lower the `GPT_CONFIG_124M['context_length']` or "
"decrease the `training_ratio`")
print("Train loader:")
for x, y in train_loader:
print(x.shape, y.shape)
print("\nValidation loader:")
for x, y in val_loader:
print(x.shape, y.shape)
train_tokens = 0
for input_batch, target_batch in train_loader:
train_tokens += input_batch.numel()
val_tokens = 0
for input_batch, target_batch in val_loader:
val_tokens += input_batch.numel()
print("Training tokens:", train_tokens)
print("Validation tokens:", val_tokens)
print("All tokens:", train_tokens + val_tokens)
# Indicate the device to use
if torch.cuda.is_available():
device = torch.device("cuda")
elif torch.backends.mps.is_available():
device = torch.device("mps")
else:
device = torch.device("cpu")
print(f"Using {device} device.")
model.to(device) # no assignment model = model.to(device) necessary for nn.Module classes
# Pre-calculate losses without starting yet
torch.manual_seed(123) # For reproducibility due to the shuffling in the data loader
with torch.no_grad(): # Disable gradient tracking for efficiency because we are not training, yet
train_loss = calc_loss_loader(train_loader, model, device)
val_loss = calc_loss_loader(val_loader, model, device)
print("Training loss:", train_loss)
print("Validation loss:", val_loss)
# Functions to train the data
def train_model_simple(model, train_loader, val_loader, optimizer, device, num_epochs,
eval_freq, eval_iter, start_context, tokenizer):
# Initialize lists to track losses and tokens seen
train_losses, val_losses, track_tokens_seen = [], [], []
tokens_seen, global_step = 0, -1
# Main training loop
for epoch in range(num_epochs):
model.train() # Set model to training mode
for input_batch, target_batch in train_loader:
optimizer.zero_grad() # Reset loss gradients from previous batch iteration
loss = calc_loss_batch(input_batch, target_batch, model, device)
loss.backward() # Calculate loss gradients
optimizer.step() # Update model weights using loss gradients
tokens_seen += input_batch.numel()
global_step += 1
# Optional evaluation step
if global_step % eval_freq == 0:
train_loss, val_loss = evaluate_model(
model, train_loader, val_loader, device, eval_iter)
train_losses.append(train_loss)
val_losses.append(val_loss)
track_tokens_seen.append(tokens_seen)
print(f"Ep {epoch+1} (Step {global_step:06d}): "
f"Train loss {train_loss:.3f}, Val loss {val_loss:.3f}")
# Print a sample text after each epoch
generate_and_print_sample(
model, tokenizer, device, start_context
)
return train_losses, val_losses, track_tokens_seen
def evaluate_model(model, train_loader, val_loader, device, eval_iter):
model.eval()
with torch.no_grad():
train_loss = calc_loss_loader(train_loader, model, device, num_batches=eval_iter)
val_loss = calc_loss_loader(val_loader, model, device, num_batches=eval_iter)
model.train()
return train_loss, val_loss
def generate_and_print_sample(model, tokenizer, device, start_context):
model.eval()
context_size = model.pos_emb.weight.shape[0]
encoded = text_to_token_ids(start_context, tokenizer).to(device)
with torch.no_grad():
token_ids = generate_text(
model=model, idx=encoded,
max_new_tokens=50, context_size=context_size
)
decoded_text = token_ids_to_text(token_ids, tokenizer)
print(decoded_text.replace("\n", " ")) # Compact print format
model.train()
# Start training!
import time
start_time = time.time()
torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=0.0004, weight_decay=0.1)
num_epochs = 10
train_losses, val_losses, tokens_seen = train_model_simple(
model, train_loader, val_loader, optimizer, device,
num_epochs=num_epochs, eval_freq=5, eval_iter=5,
start_context="Every effort moves you", tokenizer=tokenizer
)
end_time = time.time()
execution_time_minutes = (end_time - start_time) / 60
print(f"Training completed in {execution_time_minutes:.2f} minutes.")
# Show graphics with the training process
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
import math
def plot_losses(epochs_seen, tokens_seen, train_losses, val_losses):
fig, ax1 = plt.subplots(figsize=(5, 3))
ax1.plot(epochs_seen, train_losses, label="Training loss")
ax1.plot(
epochs_seen, val_losses, linestyle="-.", label="Validation loss"
)
ax1.set_xlabel("Epochs")
ax1.set_ylabel("Loss")
ax1.legend(loc="upper right")
ax1.xaxis.set_major_locator(MaxNLocator(integer=True))
ax2 = ax1.twiny()
ax2.plot(tokens_seen, train_losses, alpha=0)
ax2.set_xlabel("Tokens seen")
fig.tight_layout()
plt.show()
# Compute perplexity from the loss values
train_ppls = [math.exp(loss) for loss in train_losses]
val_ppls = [math.exp(loss) for loss in val_losses]
# Plot perplexity over tokens seen
plt.figure()
plt.plot(tokens_seen, train_ppls, label='Training Perplexity')
plt.plot(tokens_seen, val_ppls, label='Validation Perplexity')
plt.xlabel('Tokens Seen')
plt.ylabel('Perplexity')
plt.title('Perplexity over Training')
plt.legend()
plt.show()
epochs_tensor = torch.linspace(0, num_epochs, len(train_losses))
plot_losses(epochs_tensor, tokens_seen, train_losses, val_losses)
torch.save({
"model_state_dict": model.state_dict(),
"optimizer_state_dict": optimizer.state_dict(),
},
"/tmp/model_and_optimizer.pth"
)
```
Functions to transform text <--> ids

These are some simple functions that can be used to transform from texts to ids and vice versa. This is needed at the beginning of the handling of the text and at the end of the predictions:
```python
# Functions to transform from tokens to ids and from ids to tokens
def text_to_token_ids(text, tokenizer):
encoded = tokenizer.encode(text, allowed_special={'<|endoftext|>'})
encoded_tensor = torch.tensor(encoded).unsqueeze(0) # add batch dimension
return encoded_tensor
def token_ids_to_text(token_ids, tokenizer):
flat = token_ids.squeeze(0) # remove batch dimension
return tokenizer.decode(flat.tolist())
```

Text generation functions

In a previous section there was a function that just got the most probable token after getting the logits. However, this means that for each input the same output is always going to be generated, which makes it very deterministic.

The following generate_text function will apply the top-k, temperature and multinomial concepts.

- **top-k** means that we will start reducing to -inf all the values (logits) of all the tokens except the top k tokens. So, if k=3, before making a decision only the 3 most probable tokens will have a value different from -inf.
- **temperature** means that every logit will be divided by the temperature value. A value of 0.1 will boost the highest probability compared with the lowest one, while a temperature of 5 for example will make the distribution flatter. This helps to introduce the variation in responses we would like the LLM to have.
- After applying the temperature, a **softmax** function is applied again so all the remaining tokens have a total probability of 1.
- Finally, instead of choosing the token with the biggest probability, **multinomial** is applied to sample the next token according to the final probabilities. So if token 1 had 70% of the probability, token 2 had 20% and token 3 had 10%, then 70% of the time token 1 will be selected, 20% of the time token 2 and 10% of the time token 3.
```python
# Generate text function
def generate_text(model, idx, max_new_tokens, context_size, temperature=0.0, top_k=None, eos_id=None):
# For-loop is the same as before: Get logits, and only focus on last time step
for _ in range(max_new_tokens):
idx_cond = idx[:, -context_size:]
with torch.no_grad():
logits = model(idx_cond)
logits = logits[:, -1, :]
# New: Filter logits with top_k sampling
if top_k is not None:
# Keep only top_k values
top_logits, _ = torch.topk(logits, top_k)
min_val = top_logits[:, -1]
logits = torch.where(logits < min_val, torch.tensor(float("-inf")).to(logits.device), logits)
# New: Apply temperature scaling
if temperature > 0.0:
logits = logits / temperature
# Apply softmax to get probabilities
probs = torch.softmax(logits, dim=-1) # (batch_size, context_len)
# Sample from the distribution
idx_next = torch.multinomial(probs, num_samples=1) # (batch_size, 1)
# Otherwise same as before: get idx of the vocab entry with the highest logits value
else:
idx_next = torch.argmax(logits, dim=-1, keepdim=True) # (batch_size, 1)
if idx_next == eos_id: # Stop generating early if end-of-sequence token is encountered and eos_id is specified
break
# Same as before: append sampled index to the running sequence
idx = torch.cat((idx, idx_next), dim=1) # (batch_size, num_tokens+1)
return idx
```

Tip

A common alternative to top-k is **top-p**, also known as nucleus sampling, which instead of taking the k samples with the most probability, organizes all the resulting vocabulary by probability and sums those probabilities from the highest to the lowest until a threshold is reached.

Then, only those words of the vocabulary will be considered according to their relative probabilities.

This removes the need to select a number of k samples, as the optimal k might be different in each case; only a threshold is needed.

Note that this improvement isn't included in the previous code.
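A minimal sketch (not part of the referenced code) of how such a nucleus filter could be applied to the last-step logits inside a function like generate_text; the top_p=0.9 threshold is just an illustrative value:

```python
import torch

def apply_top_p(logits, top_p=0.9):
    # Sort the vocabulary by logit (highest first) and get cumulative probabilities
    sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
    sorted_probas = torch.softmax(sorted_logits, dim=-1)
    cumulative = torch.cumsum(sorted_probas, dim=-1)

    # Drop tokens whose cumulative probability *before* including them already exceeds top_p
    # (so the most probable token is always kept)
    mask = (cumulative - sorted_probas) > top_p
    sorted_logits = sorted_logits.masked_fill(mask, float("-inf"))

    # Put the filtered logits back into the original vocabulary order
    filtered = torch.full_like(logits, float("-inf"))
    filtered.scatter_(dim=-1, index=sorted_idx, src=sorted_logits)
    return filtered
```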
Tip

Another way to improve the generated text is by using **Beam search** instead of the greedy search used in this example.

Unlike greedy search, which selects the most probable next word at each step and builds a single sequence, beam search keeps track of the top k highest-scoring partial sequences (called "beams") at each step. By exploring multiple possibilities simultaneously, it balances efficiency and quality, increasing the chances of finding a better overall sequence that might be missed by the greedy approach due to early, suboptimal choices.

Note that this improvement isn't included in the previous code.
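A rough sketch of the idea (not in the referenced code), reusing the same model and batch-of-1 token-id tensors as generate_text; beam_width is an illustrative parameter:

```python
import torch

def generate_beam(model, idx, max_new_tokens, context_size, beam_width=3):
    # Each beam is a tuple: (token ids so far, cumulative log-probability)
    beams = [(idx, 0.0)]
    for _ in range(max_new_tokens):
        candidates = []
        for seq, score in beams:
            with torch.no_grad():
                logits = model(seq[:, -context_size:])[:, -1, :]
            log_probas = torch.log_softmax(logits, dim=-1)
            top_lp, top_ids = torch.topk(log_probas, beam_width)
            for lp, tok in zip(top_lp[0], top_ids[0]):
                new_seq = torch.cat((seq, tok.view(1, 1)), dim=1)
                candidates.append((new_seq, score + lp.item()))
        # Keep only the beam_width best-scoring partial sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]  # best-scoring full sequence
```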
Loss functions
The **calc_loss_batch** function calculates the cross entropy of the prediction of a single batch.

The **calc_loss_loader** function gets the cross entropy of all the batches and calculates the average cross entropy.
```python
# Define loss functions
def calc_loss_batch(input_batch, target_batch, model, device):
input_batch, target_batch = input_batch.to(device), target_batch.to(device)
logits = model(input_batch)
loss = torch.nn.functional.cross_entropy(logits.flatten(0, 1), target_batch.flatten())
return loss
def calc_loss_loader(data_loader, model, device, num_batches=None):
total_loss = 0.
if len(data_loader) == 0:
return float("nan")
elif num_batches is None:
num_batches = len(data_loader)
else:
# Reduce the number of batches to match the total number of batches in the data loader
# if num_batches exceeds the number of batches in the data loader
num_batches = min(num_batches, len(data_loader))
for i, (input_batch, target_batch) in enumerate(data_loader):
if i < num_batches:
loss = calc_loss_batch(input_batch, target_batch, model, device)
total_loss += loss.item()
else:
break
return total_loss / num_batches
```

Tip

**Gradient clipping** is a technique used to enhance training stability in large neural networks by setting a maximum threshold for gradient magnitudes. When gradients exceed this predefined `max_norm`, they are scaled down proportionally to ensure that the updates to the model's parameters stay within a manageable range, preventing issues like exploding gradients and ensuring more controlled and stable training.

Note that this improvement isn't included in the previous code.

Check the following example:
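A minimal sketch (not part of the original training loop) of where clipping with PyTorch's torch.nn.utils.clip_grad_norm_ could be slotted into train_model_simple; max_norm=1.0 is just an illustrative threshold:

```python
# Inside the inner loop of train_model_simple, between backward() and step():
loss = calc_loss_batch(input_batch, target_batch, model, device)
loss.backward()  # Calculate loss gradients
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # Rescale gradients if their global norm exceeds 1.0
optimizer.step()  # Update model weights using the (possibly clipped) gradients
```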
Load Data

The function **create_dataloader_v1** was already discussed in a previous section.

Note here that 90% of the text is going to be used for training while 10% will be used for validation, and both sets are stored in 2 different data loaders.

Note that sometimes part of the data set is also left as a test set in order to evaluate the performance of the model better.

Both data loaders are using the same batch size, maximum length, stride and number of workers (0 in this case).

The main differences are the data used by each, and that the validation loader is not dropping the last data nor shuffling the data, as that isn't needed for validation purposes.

Also the fact that the **stride is as big as the context length** means that there won't be overlapping between the contexts used to train the data (this reduces overfitting, but also the training data set).

Moreover, note that the batch size in this case is 2, dividing the data into 2 batches; the main goal of this is to allow parallel processing and to reduce the consumption per batch.
```python
train_ratio = 0.90
split_idx = int(train_ratio * len(text_data))
train_data = text_data[:split_idx]
val_data = text_data[split_idx:]
torch.manual_seed(123)
train_loader = create_dataloader_v1(
train_data,
batch_size=2,
max_length=GPT_CONFIG_124M["context_length"],
stride=GPT_CONFIG_124M["context_length"],
drop_last=True,
shuffle=True,
num_workers=0
)
val_loader = create_dataloader_v1(
val_data,
batch_size=2,
max_length=GPT_CONFIG_124M["context_length"],
stride=GPT_CONFIG_124M["context_length"],
drop_last=False,
shuffle=False,
num_workers=0
)
```

Sanity Checks

The goal is to check that there are enough tokens for training, that the shapes are the expected ones, and to get some info about the number of tokens used for training and for validation:
```python
# Sanity checks
if total_tokens * (train_ratio) < GPT_CONFIG_124M["context_length"]:
print("Not enough tokens for the training loader. "
"Try to lower the `GPT_CONFIG_124M['context_length']` or "
"increase the `training_ratio`")
if total_tokens * (1-train_ratio) < GPT_CONFIG_124M["context_length"]:
print("Not enough tokens for the validation loader. "
"Try to lower the `GPT_CONFIG_124M['context_length']` or "
"decrease the `training_ratio`")
print("Train loader:")
for x, y in train_loader:
print(x.shape, y.shape)
print("\nValidation loader:")
for x, y in val_loader:
print(x.shape, y.shape)
train_tokens = 0
for input_batch, target_batch in train_loader:
train_tokens += input_batch.numel()
val_tokens = 0
for input_batch, target_batch in val_loader:
val_tokens += input_batch.numel()
print("Training tokens:", train_tokens)
print("Validation tokens:", val_tokens)
print("All tokens:", train_tokens + val_tokens)
```

Select device for training & pre calculations

The following code selects the device to use and calculates a training loss and a validation loss (without having trained anything yet) as a starting point.
```python
# Indicate the device to use
if torch.cuda.is_available():
device = torch.device("cuda")
elif torch.backends.mps.is_available():
device = torch.device("mps")
else:
device = torch.device("cpu")
print(f"Using {device} device.")
model.to(device) # no assignment model = model.to(device) necessary for nn.Module classes
# Pre-calculate losses without starting yet
torch.manual_seed(123) # For reproducibility due to the shuffling in the data loader
with torch.no_grad(): # Disable gradient tracking for efficiency because we are not training, yet
train_loss = calc_loss_loader(train_loader, model, device)
val_loss = calc_loss_loader(val_loader, model, device)
print("Training loss:", train_loss)
print("Validation loss:", val_loss)
```

Training functions

The function **generate_and_print_sample** will just get a context and generate some tokens in order to get a feeling about how good the model is at that point. This is called by **train_model_simple** on each step.

The function **evaluate_model** is called as frequently as indicated to the training function and it's used to measure the train loss and the validation loss at that point of the model training.

Then the big function **train_model_simple** is the one that actually trains the model. It expects:

- The train data loader (with the data already separated and prepared for training)
- The validation loader
- The optimizer to use during training: this is the function that will use the gradients and update the parameters to reduce the loss. In this case, as you will see, **AdamW** is used, but there are many more.
  - **optimizer.zero_grad()** is called to reset the gradients on each round so they don't accumulate.
  - The **lr** param is the learning rate, which determines the size of the steps taken during the optimization process when updating the model's parameters. A smaller learning rate means the optimizer makes smaller updates to the weights, which can lead to more precise convergence but might slow down training. A larger learning rate can speed up training but risks overshooting the minimum of the loss function (jumping over the point where the loss function is minimized).
  - **Weight Decay** modifies the loss calculation step by adding an extra term that penalizes large weights. This encourages the optimizer to find solutions with smaller weights, balancing between fitting the data well and keeping the model simple, preventing overfitting in machine learning models by discouraging the model from assigning too much importance to any single feature.
  - Traditional optimizers like SGD with L2 regularization couple weight decay with the gradient of the loss function. However, **AdamW** (a variant of the Adam optimizer) decouples weight decay from the gradient update, leading to more effective regularization.
- The device to use for training
- The number of epochs: the number of times to go over the training data
- The evaluation frequency: how often to call **evaluate_model**
- The evaluation iterations: the number of batches to use when evaluating the current state of the model in **evaluate_model**
- The start context: the starting sentence to use when calling **generate_and_print_sample**
- The tokenizer
```python
# Functions to train the data
def train_model_simple(model, train_loader, val_loader, optimizer, device, num_epochs,
eval_freq, eval_iter, start_context, tokenizer):
# Initialize lists to track losses and tokens seen
train_losses, val_losses, track_tokens_seen = [], [], []
tokens_seen, global_step = 0, -1
# Main training loop
for epoch in range(num_epochs):
model.train() # Set model to training mode
for input_batch, target_batch in train_loader:
optimizer.zero_grad() # Reset loss gradients from previous batch iteration
loss = calc_loss_batch(input_batch, target_batch, model, device)
loss.backward() # Calculate loss gradients
optimizer.step() # Update model weights using loss gradients
tokens_seen += input_batch.numel()
global_step += 1
# Optional evaluation step
if global_step % eval_freq == 0:
train_loss, val_loss = evaluate_model(
model, train_loader, val_loader, device, eval_iter)
train_losses.append(train_loss)
val_losses.append(val_loss)
track_tokens_seen.append(tokens_seen)
print(f"Ep {epoch+1} (Step {global_step:06d}): "
f"Train loss {train_loss:.3f}, Val loss {val_loss:.3f}")
# Print a sample text after each epoch
generate_and_print_sample(
model, tokenizer, device, start_context
)
return train_losses, val_losses, track_tokens_seen
def evaluate_model(model, train_loader, val_loader, device, eval_iter):
model.eval() # Set in eval mode to avoid dropout
with torch.no_grad():
train_loss = calc_loss_loader(train_loader, model, device, num_batches=eval_iter)
val_loss = calc_loss_loader(val_loader, model, device, num_batches=eval_iter)
model.train() # Back to training model applying all the configurations
return train_loss, val_loss
def generate_and_print_sample(model, tokenizer, device, start_context):
model.eval() # Set in eval mode to avoid dropout
context_size = model.pos_emb.weight.shape[0]
encoded = text_to_token_ids(start_context, tokenizer).to(device)
with torch.no_grad():
token_ids = generate_text(
model=model, idx=encoded,
max_new_tokens=50, context_size=context_size
)
decoded_text = token_ids_to_text(token_ids, tokenizer)
print(decoded_text.replace("\n", " ")) # Compact print format
model.train() # Back to training model applying all the configurations
```

Tip

To improve the learning rate there are a couple of relevant techniques called **linear warmup** and **cosine decay**.

**Linear warmup** consists of defining an initial learning rate and a maximum one and consistently updating it after each step. This is because starting the training with smaller weight updates decreases the risk of the model encountering large, destabilizing updates during its training phase.

**Cosine decay** is a technique that gradually reduces the learning rate following a half-cosine curve after the warmup phase, slowing the weight updates to minimize the risk of overshooting the loss minima and ensure training stability in its later phases.

Note that these improvements aren't included in the previous code.
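A minimal sketch (not in the referenced code) of a per-step learning rate with linear warmup followed by cosine decay; peak_lr, initial_lr, min_lr, warmup_steps and total_steps are illustrative values:

```python
import math

initial_lr, peak_lr, min_lr = 1e-5, 4e-4, 1e-5
warmup_steps, total_steps = 20, 1000

def lr_at_step(step):
    if step < warmup_steps:
        # Linear warmup from initial_lr up to peak_lr
        return initial_lr + (peak_lr - initial_lr) * step / warmup_steps
    # Cosine decay from peak_lr down to min_lr over the remaining steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + (peak_lr - min_lr) * 0.5 * (1 + math.cos(math.pi * progress))

# Inside the training loop this would be applied before optimizer.step():
#   for param_group in optimizer.param_groups:
#       param_group["lr"] = lr_at_step(global_step)
```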
Start training
```python
import time
start_time = time.time()
torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=0.0004, weight_decay=0.1)
num_epochs = 10
train_losses, val_losses, tokens_seen = train_model_simple(
model, train_loader, val_loader, optimizer, device,
num_epochs=num_epochs, eval_freq=5, eval_iter=5,
start_context="Every effort moves you", tokenizer=tokenizer
)
end_time = time.time()
execution_time_minutes = (end_time - start_time) / 60
print(f"Training completed in {execution_time_minutes:.2f} minutes.")
```

Print training evolution

With the following function it's possible to print the evolution of the model while it was being trained.
```python
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
import math
def plot_losses(epochs_seen, tokens_seen, train_losses, val_losses):
fig, ax1 = plt.subplots(figsize=(5, 3))
ax1.plot(epochs_seen, train_losses, label="Training loss")
ax1.plot(
epochs_seen, val_losses, linestyle="-.", label="Validation loss"
)
ax1.set_xlabel("Epochs")
ax1.set_ylabel("Loss")
ax1.legend(loc="upper right")
ax1.xaxis.set_major_locator(MaxNLocator(integer=True))
ax2 = ax1.twiny()
ax2.plot(tokens_seen, train_losses, alpha=0)
ax2.set_xlabel("Tokens seen")
fig.tight_layout()
plt.show()
# Compute perplexity from the loss values
train_ppls = [math.exp(loss) for loss in train_losses]
val_ppls = [math.exp(loss) for loss in val_losses]
# Plot perplexity over tokens seen
plt.figure()
plt.plot(tokens_seen, train_ppls, label='Training Perplexity')
plt.plot(tokens_seen, val_ppls, label='Validation Perplexity')
plt.xlabel('Tokens Seen')
plt.ylabel('Perplexity')
plt.title('Perplexity over Training')
plt.legend()
plt.show()
epochs_tensor = torch.linspace(0, num_epochs, len(train_losses))
plot_losses(epochs_tensor, tokens_seen, train_losses, val_losses)
```

Save the model

It's possible to save the model + optimizer if you want to continue training later:
```python
# Save the model and the optimizer for later training
torch.save({
"model_state_dict": model.state_dict(),
"optimizer_state_dict": optimizer.state_dict(),
},
"/tmp/model_and_optimizer.pth"
)
# Note that this model with the optimizer occupied close to 2GB
# Restore model and optimizer for training
checkpoint = torch.load("/tmp/model_and_optimizer.pth", map_location=device)
model = GPTModel(GPT_CONFIG_124M)
model.load_state_dict(checkpoint["model_state_dict"])
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.1)
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
model.train(); # Put in training mode
```

Or just the model if you only plan on using it:
```python
# Save the model
torch.save(model.state_dict(), "model.pth")
# Load it
model = GPTModel(GPT_CONFIG_124M)
model.load_state_dict(torch.load("model.pth", map_location=device))
model.eval() # Put in eval mode
```

Load GPT2 weights

There are 2 quick scripts to load the GPT2 weights locally. For both you can clone the repository https://github.com/rasbt/LLMs-from-scratch locally, then:

- The script https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/01_main-chapter-code/gpt_generate.py will download all the weights and transform the formats from OpenAI to the ones expected by our LLM. The script is also prepared with the needed configuration and with the prompt: "Every effort moves you"
- The script https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/02_alternative_weight_loading/weight-loading-hf-transformers.ipynb will allow you to load the GPT2 weights locally (just change the CHOOSE_MODEL var) and predict text from some prompts (a rough loading sketch is shown below).
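As a rough illustration of the second approach (assuming the transformers package is installed), the pretrained GPT2 weights can be downloaded from Hugging Face and inspected before copying them into the GPTModel defined above:

```python
# Sketch only: download the official GPT-2 (124M) weights via Hugging Face
from transformers import GPT2Model

gpt_hf = GPT2Model.from_pretrained("gpt2")  # "gpt2" is the 124M-parameter model
gpt_hf.eval()

# The state dict holds the token/positional embeddings and the transformer block weights
# that the referenced notebook copies into the corresponding GPTModel layers
print(gpt_hf.state_dict().keys())
```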