6. Pre-training & Loading models


Text Generation

๋ชจ๋ธ์„ ํ›ˆ๋ จํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” ํ•ด๋‹น ๋ชจ๋ธ์ด ์ƒˆ๋กœ์šด ํ† ํฐ์„ ์ƒ์„ฑํ•  ์ˆ˜ ์žˆ์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ๊ทธ๋Ÿฐ ๋‹ค์Œ ์ƒ์„ฑ๋œ ํ† ํฐ์„ ์˜ˆ์ƒ๋œ ํ† ํฐ๊ณผ ๋น„๊ตํ•˜์—ฌ ๋ชจ๋ธ์ด ์ƒ์„ฑํ•ด์•ผ ํ•  ํ† ํฐ์„ ํ•™์Šตํ•˜๋„๋ก ํ›ˆ๋ จํ•ฉ๋‹ˆ๋‹ค.

์ด์ „ ์˜ˆ์ œ์—์„œ ์ด๋ฏธ ์ผ๋ถ€ ํ† ํฐ์„ ์˜ˆ์ธกํ–ˆ์œผ๋ฏ€๋กœ, ์ด ๋ชฉ์ ์„ ์œ„ํ•ด ํ•ด๋‹น ๊ธฐ๋Šฅ์„ ์žฌ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Tip

์ด ์—ฌ์„ฏ ๋ฒˆ์งธ ๋‹จ๊ณ„์˜ ๋ชฉํ‘œ๋Š” ๋งค์šฐ ๊ฐ„๋‹จํ•ฉ๋‹ˆ๋‹ค: ๋ชจ๋ธ์„ ์ฒ˜์Œ๋ถ€ํ„ฐ ํ›ˆ๋ จ์‹œํ‚ค๊ธฐ. ์ด๋ฅผ ์œ„ํ•ด ์ด์ „ LLM ์•„ํ‚คํ…์ฒ˜๊ฐ€ ์‚ฌ์šฉ๋˜๋ฉฐ, ์ •์˜๋œ ์†์‹ค ํ•จ์ˆ˜์™€ ์ตœ์ ํ™”๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋ฐ์ดํ„ฐ ์„ธํŠธ๋ฅผ ๋ฐ˜๋ณตํ•˜๋Š” ๋ฃจํ”„๊ฐ€ ํฌํ•จ๋ฉ๋‹ˆ๋‹ค.

Text Evaluation

์˜ฌ๋ฐ”๋ฅธ ํ›ˆ๋ จ์„ ์ˆ˜ํ–‰ํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” ์˜ˆ์ƒ๋œ ํ† ํฐ์— ๋Œ€ํ•ด ์–ป์€ ์˜ˆ์ธก์„ ์ธก์ •ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ํ›ˆ๋ จ์˜ ๋ชฉํ‘œ๋Š” ์˜ฌ๋ฐ”๋ฅธ ํ† ํฐ์˜ ๊ฐ€๋Šฅ์„ฑ์„ ๊ทน๋Œ€ํ™”ํ•˜๋Š” ๊ฒƒ์œผ๋กœ, ์ด๋Š” ๋‹ค๋ฅธ ํ† ํฐ์— ๋น„ํ•ด ๊ทธ ํ™•๋ฅ ์„ ์ฆ๊ฐ€์‹œํ‚ค๋Š” ๊ฒƒ์„ ํฌํ•จํ•ฉ๋‹ˆ๋‹ค.

์˜ฌ๋ฐ”๋ฅธ ํ† ํฐ์˜ ํ™•๋ฅ ์„ ๊ทน๋Œ€ํ™”ํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” ๋ชจ๋ธ์˜ ๊ฐ€์ค‘์น˜๋ฅผ ์ˆ˜์ •ํ•˜์—ฌ ๊ทธ ํ™•๋ฅ ์ด ๊ทน๋Œ€ํ™”๋˜์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ๊ฐ€์ค‘์น˜์˜ ์—…๋ฐ์ดํŠธ๋Š” ์—ญ์ „ํŒŒ๋ฅผ ํ†ตํ•ด ์ด๋ฃจ์–ด์ง‘๋‹ˆ๋‹ค. ์ด๋Š” ๊ทน๋Œ€ํ™”ํ•  ์†์‹ค ํ•จ์ˆ˜๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. ์ด ๊ฒฝ์šฐ, ํ•จ์ˆ˜๋Š” ์ˆ˜ํ–‰๋œ ์˜ˆ์ธก๊ณผ ์›ํ•˜๋Š” ์˜ˆ์ธก ๊ฐ„์˜ ์ฐจ์ด๊ฐ€ ๋ฉ๋‹ˆ๋‹ค.

๊ทธ๋Ÿฌ๋‚˜ ์›์‹œ ์˜ˆ์ธก์œผ๋กœ ์ž‘์—…ํ•˜๋Š” ๋Œ€์‹ , n์„ ๋ฐ‘์œผ๋กœ ํ•˜๋Š” ๋กœ๊ทธ๋กœ ์ž‘์—…ํ•ฉ๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ ์˜ˆ์ƒ๋œ ํ† ํฐ์˜ ํ˜„์žฌ ์˜ˆ์ธก์ด 7.4541e-05๋ผ๋ฉด, 7.4541e-05์˜ ์ž์—ฐ ๋กœ๊ทธ(๋ฐ‘ e)๋Š” ๋Œ€๋žต -9.5042์ž…๋‹ˆ๋‹ค.
๊ทธ๋Ÿฐ ๋‹ค์Œ, ์˜ˆ๋ฅผ ๋“ค์–ด 5๊ฐœ์˜ ํ† ํฐ์˜ ์ปจํ…์ŠคํŠธ ๊ธธ์ด๋ฅผ ๊ฐ€์ง„ ๊ฐ ํ•ญ๋ชฉ์— ๋Œ€ํ•ด ๋ชจ๋ธ์€ 5๊ฐœ์˜ ํ† ํฐ์„ ์˜ˆ์ธกํ•ด์•ผ ํ•˜๋ฉฐ, ์ฒซ 4๊ฐœ์˜ ํ† ํฐ์€ ์ž…๋ ฅ์˜ ๋งˆ์ง€๋ง‰ ๊ฒƒ์ด๊ณ  ๋‹ค์„ฏ ๋ฒˆ์งธ๋Š” ์˜ˆ์ธก๋œ ๊ฒƒ์ž…๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ ๊ฐ ํ•ญ๋ชฉ์— ๋Œ€ํ•ด 5๊ฐœ์˜ ์˜ˆ์ธก์ด ์žˆ๊ฒŒ ๋ฉ๋‹ˆ๋‹ค(์ฒซ 4๊ฐœ๊ฐ€ ์ž…๋ ฅ์— ์žˆ์—ˆ๋”๋ผ๋„ ๋ชจ๋ธ์€ ์ด๋ฅผ ์•Œ์ง€ ๋ชปํ•ฉ๋‹ˆ๋‹ค)์™€ 5๊ฐœ์˜ ์˜ˆ์ƒ ํ† ํฐ์ด ์žˆ์œผ๋ฉฐ ๋”ฐ๋ผ์„œ ๊ทน๋Œ€ํ™”ํ•  5๊ฐœ์˜ ํ™•๋ฅ ์ด ์žˆ์Šต๋‹ˆ๋‹ค.

๋”ฐ๋ผ์„œ ๊ฐ ์˜ˆ์ธก์— ์ž์—ฐ ๋กœ๊ทธ๋ฅผ ์ˆ˜ํ–‰ํ•œ ํ›„, ํ‰๊ท ์ด ๊ณ„์‚ฐ๋˜๊ณ , ๋งˆ์ด๋„ˆ์Šค ๊ธฐํ˜ธ๊ฐ€ ์ œ๊ฑฐ๋ฉ๋‹ˆ๋‹ค(์ด๋ฅผ _๊ต์ฐจ ์—”ํŠธ๋กœํ”ผ ์†์‹ค_์ด๋ผ๊ณ  ํ•จ) ๊ทธ๋ฆฌ๊ณ  ๊ทธ๊ฒƒ์ด 0์— ์ตœ๋Œ€ํ•œ ๊ฐ€๊น๊ฒŒ ์ค„์—ฌ์•ผ ํ•  ์ˆซ์ž์ž…๋‹ˆ๋‹ค. ์™œ๋ƒํ•˜๋ฉด 1์˜ ์ž์—ฐ ๋กœ๊ทธ๋Š” 0์ด๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค:

https://camo.githubusercontent.com/3c0ab9c55cefa10b667f1014b6c42df901fa330bb2bc9cea88885e784daec8ba/68747470733a2f2f73656261737469616e72617363686b612e636f6d2f696d616765732f4c4c4d732d66726f6d2d736372617463682d696d616765732f636830355f636f6d707265737365642f63726f73732d656e74726f70792e776562703f313233

๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์„ ์ธก์ •ํ•˜๋Š” ๋˜ ๋‹ค๋ฅธ ๋ฐฉ๋ฒ•์€ ๋‹นํ˜น๊ฐ(perplexity)์ด๋ผ๊ณ  ํ•ฉ๋‹ˆ๋‹ค. Perplexity๋Š” ํ™•๋ฅ  ๋ชจ๋ธ์ด ์ƒ˜ํ”Œ์„ ์˜ˆ์ธกํ•˜๋Š” ์ •๋„๋ฅผ ํ‰๊ฐ€ํ•˜๋Š” ๋ฐ ์‚ฌ์šฉ๋˜๋Š” ๋ฉ”ํŠธ๋ฆญ์ž…๋‹ˆ๋‹ค. ์–ธ์–ด ๋ชจ๋ธ๋ง์—์„œ ์ด๋Š” ์‹œํ€€์Šค์—์„œ ๋‹ค์Œ ํ† ํฐ์„ ์˜ˆ์ธกํ•  ๋•Œ ๋ชจ๋ธ์˜ ๋ถˆํ™•์‹ค์„ฑ์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.
์˜ˆ๋ฅผ ๋“ค์–ด, 48725์˜ perplexity ๊ฐ’์€ ํ† ํฐ์„ ์˜ˆ์ธกํ•ด์•ผ ํ•  ๋•Œ 48,725๊ฐœ์˜ ์–ดํœ˜ ์ค‘ ์–ด๋–ค ๊ฒƒ์ด ์ข‹์€ ๊ฒƒ์ธ์ง€ ํ™•์‹ ํ•˜์ง€ ๋ชปํ•œ๋‹ค๋Š” ๊ฒƒ์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค.

Pre-Train Example

์ด๊ฒƒ์€ https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/01_main-chapter-code/ch05.ipynb์—์„œ ์ œ์•ˆ๋œ ์ดˆ๊ธฐ ์ฝ”๋“œ๋กœ, ๋•Œ๋•Œ๋กœ ์•ฝ๊ฐ„ ์ˆ˜์ •๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

์—ฌ๊ธฐ์„œ ์‚ฌ์šฉ๋œ ์ด์ „ ์ฝ”๋“œ์ง€๋งŒ ์ด์ „ ์„น์…˜์—์„œ ์ด๋ฏธ ์„ค๋ช…๋จ ```python """ This is code explained before so it won't be exaplained """

import tiktoken import torch import torch.nn as nn from torch.utils.data import Dataset, DataLoader

class GPTDatasetV1(Dataset): def init(self, txt, tokenizer, max_length, stride): self.input_ids = [] self.target_ids = []

Tokenize the entire text

token_ids = tokenizer.encode(txt, allowed_special={โ€œ<|endoftext|>โ€})

Use a sliding window to chunk the book into overlapping sequences of max_length

for i in range(0, len(token_ids) - max_length, stride): input_chunk = token_ids[i:i + max_length] target_chunk = token_ids[i + 1: i + max_length + 1] self.input_ids.append(torch.tensor(input_chunk)) self.target_ids.append(torch.tensor(target_chunk))

def len(self): return len(self.input_ids)

def getitem(self, idx): return self.input_ids[idx], self.target_ids[idx]

def create_dataloader_v1(txt, batch_size=4, max_length=256, stride=128, shuffle=True, drop_last=True, num_workers=0):

Initialize the tokenizer

tokenizer = tiktoken.get_encoding(โ€œgpt2โ€)

Create dataset

dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

Create dataloader

dataloader = DataLoader( dataset, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last, num_workers=num_workers)

return dataloader

class MultiHeadAttention(nn.Module): def init(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False): super().init() assert d_out % num_heads == 0, โ€œd_out must be divisible by n_headsโ€

self.d_out = d_out self.num_heads = num_heads self.head_dim = d_out // num_heads # Reduce the projection dim to match desired output dim

self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias) self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias) self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias) self.out_proj = nn.Linear(d_out, d_out) # Linear layer to combine head outputs self.dropout = nn.Dropout(dropout) self.register_buffer(โ€˜maskโ€™, torch.triu(torch.ones(context_length, context_length), diagonal=1))

def forward(self, x): b, num_tokens, d_in = x.shape

keys = self.W_key(x) # Shape: (b, num_tokens, d_out) queries = self.W_query(x) values = self.W_value(x)

We implicitly split the matrix by adding a num_heads dimension

Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)

keys = keys.view(b, num_tokens, self.num_heads, self.head_dim) values = values.view(b, num_tokens, self.num_heads, self.head_dim) queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)

Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)

keys = keys.transpose(1, 2) queries = queries.transpose(1, 2) values = values.transpose(1, 2)

Compute scaled dot-product attention (aka self-attention) with a causal mask

attn_scores = queries @ keys.transpose(2, 3) # Dot product for each head

Original mask truncated to the number of tokens and converted to boolean

mask_bool = self.mask.bool()[:num_tokens, :num_tokens]

Use the mask to fill attention scores

attn_scores.masked_fill_(mask_bool, -torch.inf)

attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1) attn_weights = self.dropout(attn_weights)

Shape: (b, num_tokens, num_heads, head_dim)

context_vec = (attn_weights @ values).transpose(1, 2)

Combine heads, where self.d_out = self.num_heads * self.head_dim

context_vec = context_vec.reshape(b, num_tokens, self.d_out) context_vec = self.out_proj(context_vec) # optional projection

return context_vec

class LayerNorm(nn.Module): def init(self, emb_dim): super().init() self.eps = 1e-5 self.scale = nn.Parameter(torch.ones(emb_dim)) self.shift = nn.Parameter(torch.zeros(emb_dim))

def forward(self, x): mean = x.mean(dim=-1, keepdim=True) var = x.var(dim=-1, keepdim=True, unbiased=False) norm_x = (x - mean) / torch.sqrt(var + self.eps) return self.scale * norm_x + self.shift

class GELU(nn.Module): def init(self): super().init()

def forward(self, x): return 0.5 * x * (1 + torch.tanh( torch.sqrt(torch.tensor(2.0 / torch.pi)) * (x + 0.044715 * torch.pow(x, 3)) ))

class FeedForward(nn.Module): def init(self, cfg): super().init() self.layers = nn.Sequential( nn.Linear(cfg[โ€œemb_dimโ€], 4 * cfg[โ€œemb_dimโ€]), GELU(), nn.Linear(4 * cfg[โ€œemb_dimโ€], cfg[โ€œemb_dimโ€]), )

def forward(self, x): return self.layers(x)

class TransformerBlock(nn.Module): def init(self, cfg): super().init() self.att = MultiHeadAttention( d_in=cfg[โ€œemb_dimโ€], d_out=cfg[โ€œemb_dimโ€], context_length=cfg[โ€œcontext_lengthโ€], num_heads=cfg[โ€œn_headsโ€], dropout=cfg[โ€œdrop_rateโ€], qkv_bias=cfg[โ€œqkv_biasโ€]) self.ff = FeedForward(cfg) self.norm1 = LayerNorm(cfg[โ€œemb_dimโ€]) self.norm2 = LayerNorm(cfg[โ€œemb_dimโ€]) self.drop_shortcut = nn.Dropout(cfg[โ€œdrop_rateโ€])

def forward(self, x):

Shortcut connection for attention block

shortcut = x x = self.norm1(x) x = self.att(x) # Shape [batch_size, num_tokens, emb_size] x = self.drop_shortcut(x) x = x + shortcut # Add the original input back

Shortcut connection for feed-forward block

shortcut = x x = self.norm2(x) x = self.ff(x) x = self.drop_shortcut(x) x = x + shortcut # Add the original input back

return x

class GPTModel(nn.Module): def init(self, cfg): super().init() self.tok_emb = nn.Embedding(cfg[โ€œvocab_sizeโ€], cfg[โ€œemb_dimโ€]) self.pos_emb = nn.Embedding(cfg[โ€œcontext_lengthโ€], cfg[โ€œemb_dimโ€]) self.drop_emb = nn.Dropout(cfg[โ€œdrop_rateโ€])

self.trf_blocks = nn.Sequential( *[TransformerBlock(cfg) for _ in range(cfg[โ€œn_layersโ€])])

self.final_norm = LayerNorm(cfg[โ€œemb_dimโ€]) self.out_head = nn.Linear(cfg[โ€œemb_dimโ€], cfg[โ€œvocab_sizeโ€], bias=False)

def forward(self, in_idx): batch_size, seq_len = in_idx.shape tok_embeds = self.tok_emb(in_idx) pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device)) x = tok_embeds + pos_embeds # Shape [batch_size, num_tokens, emb_size] x = self.drop_emb(x) x = self.trf_blocks(x) x = self.final_norm(x) logits = self.out_head(x) return logits

</details>
```python
# Download contents to train the data with
import os
import urllib.request

file_path = "the-verdict.txt"
url = "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt"

if not os.path.exists(file_path):
    with urllib.request.urlopen(url) as response:
        text_data = response.read().decode('utf-8')
    with open(file_path, "w", encoding="utf-8") as file:
        file.write(text_data)
else:
    with open(file_path, "r", encoding="utf-8") as file:
        text_data = file.read()

total_characters = len(text_data)
tokenizer = tiktoken.get_encoding("gpt2")
total_tokens = len(tokenizer.encode(text_data))

print("Data downloaded")
print("Characters:", total_characters)
print("Tokens:", total_tokens)

# Model initialization
GPT_CONFIG_124M = {
    "vocab_size": 50257,   # Vocabulary size
    "context_length": 256, # Shortened context length (orig: 1024)
    "emb_dim": 768,        # Embedding dimension
    "n_heads": 12,         # Number of attention heads
    "n_layers": 12,        # Number of layers
    "drop_rate": 0.1,      # Dropout rate
    "qkv_bias": False      # Query-key-value bias
}

torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
model.eval()
print("Model initialized")


# Functions to transform from tokens to ids and from ids to tokens
def text_to_token_ids(text, tokenizer):
    encoded = tokenizer.encode(text, allowed_special={'<|endoftext|>'})
    encoded_tensor = torch.tensor(encoded).unsqueeze(0) # add batch dimension
    return encoded_tensor

def token_ids_to_text(token_ids, tokenizer):
    flat = token_ids.squeeze(0) # remove batch dimension
    return tokenizer.decode(flat.tolist())


# Define loss functions
def calc_loss_batch(input_batch, target_batch, model, device):
    input_batch, target_batch = input_batch.to(device), target_batch.to(device)
    logits = model(input_batch)
    loss = torch.nn.functional.cross_entropy(logits.flatten(0, 1), target_batch.flatten())
    return loss


def calc_loss_loader(data_loader, model, device, num_batches=None):
    total_loss = 0.
    if len(data_loader) == 0:
        return float("nan")
    elif num_batches is None:
        num_batches = len(data_loader)
    else:
        # Reduce the number of batches to match the total number of batches in the data loader
        # if num_batches exceeds the number of batches in the data loader
        num_batches = min(num_batches, len(data_loader))
    for i, (input_batch, target_batch) in enumerate(data_loader):
        if i < num_batches:
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            total_loss += loss.item()
        else:
            break
    return total_loss / num_batches


# Apply Train/validation ratio and create dataloaders
train_ratio = 0.90
split_idx = int(train_ratio * len(text_data))
train_data = text_data[:split_idx]
val_data = text_data[split_idx:]

torch.manual_seed(123)

train_loader = create_dataloader_v1(
    train_data,
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=True,
    shuffle=True,
    num_workers=0
)

val_loader = create_dataloader_v1(
    val_data,
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=False,
    shuffle=False,
    num_workers=0
)


# Sanity checks
if total_tokens * (train_ratio) < GPT_CONFIG_124M["context_length"]:
    print("Not enough tokens for the training loader. "
          "Try to lower the `GPT_CONFIG_124M['context_length']` or "
          "increase the `training_ratio`")

if total_tokens * (1-train_ratio) < GPT_CONFIG_124M["context_length"]:
    print("Not enough tokens for the validation loader. "
          "Try to lower the `GPT_CONFIG_124M['context_length']` or "
          "decrease the `training_ratio`")

print("Train loader:")
for x, y in train_loader:
    print(x.shape, y.shape)

print("\nValidation loader:")
for x, y in val_loader:
    print(x.shape, y.shape)

train_tokens = 0
for input_batch, target_batch in train_loader:
    train_tokens += input_batch.numel()

val_tokens = 0
for input_batch, target_batch in val_loader:
    val_tokens += input_batch.numel()

print("Training tokens:", train_tokens)
print("Validation tokens:", val_tokens)
print("All tokens:", train_tokens + val_tokens)


# Indicate the device to use
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

print(f"Using {device} device.")

model.to(device) # no assignment model = model.to(device) necessary for nn.Module classes


# Pre-calculate losses without starting yet
torch.manual_seed(123) # For reproducibility due to the shuffling in the data loader

with torch.no_grad(): # Disable gradient tracking for efficiency because we are not training, yet
    train_loss = calc_loss_loader(train_loader, model, device)
    val_loss = calc_loss_loader(val_loader, model, device)

print("Training loss:", train_loss)
print("Validation loss:", val_loss)


# Functions to train the data
def train_model_simple(model, train_loader, val_loader, optimizer, device, num_epochs,
                       eval_freq, eval_iter, start_context, tokenizer):
    # Initialize lists to track losses and tokens seen
    train_losses, val_losses, track_tokens_seen = [], [], []
    tokens_seen, global_step = 0, -1

    # Main training loop
    for epoch in range(num_epochs):
        model.train()  # Set model to training mode

        for input_batch, target_batch in train_loader:
            optimizer.zero_grad() # Reset loss gradients from previous batch iteration
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            loss.backward() # Calculate loss gradients
            optimizer.step() # Update model weights using loss gradients
            tokens_seen += input_batch.numel()
            global_step += 1

            # Optional evaluation step
            if global_step % eval_freq == 0:
                train_loss, val_loss = evaluate_model(
                    model, train_loader, val_loader, device, eval_iter)
                train_losses.append(train_loss)
                val_losses.append(val_loss)
                track_tokens_seen.append(tokens_seen)
                print(f"Ep {epoch+1} (Step {global_step:06d}): "
                      f"Train loss {train_loss:.3f}, Val loss {val_loss:.3f}")

        # Print a sample text after each epoch
        generate_and_print_sample(
            model, tokenizer, device, start_context
        )

    return train_losses, val_losses, track_tokens_seen


def evaluate_model(model, train_loader, val_loader, device, eval_iter):
    model.eval()
    with torch.no_grad():
        train_loss = calc_loss_loader(train_loader, model, device, num_batches=eval_iter)
        val_loss = calc_loss_loader(val_loader, model, device, num_batches=eval_iter)
    model.train()
    return train_loss, val_loss


def generate_and_print_sample(model, tokenizer, device, start_context):
    model.eval()
    context_size = model.pos_emb.weight.shape[0]
    encoded = text_to_token_ids(start_context, tokenizer).to(device)
    with torch.no_grad():
        token_ids = generate_text(
            model=model, idx=encoded,
            max_new_tokens=50, context_size=context_size
        )
    decoded_text = token_ids_to_text(token_ids, tokenizer)
    print(decoded_text.replace("\n", " "))  # Compact print format
    model.train()


# Start training!
import time
start_time = time.time()

torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=0.0004, weight_decay=0.1)

num_epochs = 10
train_losses, val_losses, tokens_seen = train_model_simple(
    model, train_loader, val_loader, optimizer, device,
    num_epochs=num_epochs, eval_freq=5, eval_iter=5,
    start_context="Every effort moves you", tokenizer=tokenizer
)

end_time = time.time()
execution_time_minutes = (end_time - start_time) / 60
print(f"Training completed in {execution_time_minutes:.2f} minutes.")


# Show graphics with the training process
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
import math

def plot_losses(epochs_seen, tokens_seen, train_losses, val_losses):
    fig, ax1 = plt.subplots(figsize=(5, 3))
    ax1.plot(epochs_seen, train_losses, label="Training loss")
    ax1.plot(
        epochs_seen, val_losses, linestyle="-.", label="Validation loss"
    )
    ax1.set_xlabel("Epochs")
    ax1.set_ylabel("Loss")
    ax1.legend(loc="upper right")
    ax1.xaxis.set_major_locator(MaxNLocator(integer=True))
    ax2 = ax1.twiny()
    ax2.plot(tokens_seen, train_losses, alpha=0)
    ax2.set_xlabel("Tokens seen")
    fig.tight_layout()
    plt.show()

# Compute perplexity from the loss values
train_ppls = [math.exp(loss) for loss in train_losses]
val_ppls = [math.exp(loss) for loss in val_losses]
# Plot perplexity over tokens seen
plt.figure()
plt.plot(tokens_seen, train_ppls, label='Training Perplexity')
plt.plot(tokens_seen, val_ppls, label='Validation Perplexity')
plt.xlabel('Tokens Seen')
plt.ylabel('Perplexity')
plt.title('Perplexity over Training')
plt.legend()
plt.show()

epochs_tensor = torch.linspace(0, num_epochs, len(train_losses))
plot_losses(epochs_tensor, tokens_seen, train_losses, val_losses)


torch.save({
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    },
    "/tmp/model_and_optimizer.pth"
)
```

Functions to transform text <โ€“> ids

์ด๊ฒƒ์€ ์–ดํœ˜์—์„œ ํ…์ŠคํŠธ๋ฅผ ID๋กœ, ๊ทธ๋ฆฌ๊ณ  ๊ทธ ๋ฐ˜๋Œ€๋กœ ๋ณ€ํ™˜ํ•˜๋Š” ๋ฐ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋Š” ๋ช‡ ๊ฐ€์ง€ ๊ฐ„๋‹จํ•œ ํ•จ์ˆ˜์ž…๋‹ˆ๋‹ค. ์ด๋Š” ํ…์ŠคํŠธ ์ฒ˜๋ฆฌ์˜ ์‹œ์ž‘๊ณผ ์˜ˆ์ธก์˜ ๋์—์„œ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค:

```python
# Functions to transform from tokens to ids and from ids to tokens
def text_to_token_ids(text, tokenizer):
    encoded = tokenizer.encode(text, allowed_special={'<|endoftext|>'})
    encoded_tensor = torch.tensor(encoded).unsqueeze(0) # add batch dimension
    return encoded_tensor

def token_ids_to_text(token_ids, tokenizer):
    flat = token_ids.squeeze(0) # remove batch dimension
    return tokenizer.decode(flat.tolist())
```

ํ…์ŠคํŠธ ์ƒ์„ฑ ํ•จ์ˆ˜

์ด์ „ ์„น์…˜์—์„œ๋Š” ๊ฐ€์žฅ ๊ฐ€๋Šฅ์„ฑ์ด ๋†’์€ ํ† ํฐ์„ ๋กœ์ง“์„ ์–ป์€ ํ›„์— ๊ฐ€์ ธ์˜ค๋Š” ํ•จ์ˆ˜๊ฐ€ ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ์ด๋Š” ๊ฐ ์ž…๋ ฅ์— ๋Œ€ํ•ด ํ•ญ์ƒ ๋™์ผํ•œ ์ถœ๋ ฅ์„ ์ƒ์„ฑํ•˜๊ฒŒ ๋˜์–ด ๋งค์šฐ ๊ฒฐ์ •์ ์ž…๋‹ˆ๋‹ค.

๋‹ค์Œ generate_text ํ•จ์ˆ˜๋Š” top-k, temperature ๋ฐ multinomial ๊ฐœ๋…์„ ์ ์šฉํ•ฉ๋‹ˆ๋‹ค.

  • **top-k** means that the logits of all tokens except the top k are set to -inf. So, if k=3, only the 3 most likely tokens keep a value different from -inf before the decision is made.
  • **temperature** means that every logit is divided by the temperature value. A value of 0.1 will sharpen the highest probability compared with the lowest one, while a temperature of 5, for example, will make the distribution flatter. This helps to improve the variation in the responses we would like the LLM to have.
  • After applying the temperature, a **softmax** function is applied again so the probabilities of all the remaining tokens sum to 1.
  • Finally, instead of choosing the token with the biggest probability, the function **multinomial** is applied to sample the next token according to the final probabilities. So if token 1 had 70% probability, token 2 20% and token 3 10%, then 70% of the time token 1 will be selected, 20% of the time token 2, and 10% of the time token 3.
```python
# Generate text function
def generate_text(model, idx, max_new_tokens, context_size, temperature=0.0, top_k=None, eos_id=None):

    # For-loop is the same as before: Get logits, and only focus on last time step
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -context_size:]
        with torch.no_grad():
            logits = model(idx_cond)
        logits = logits[:, -1, :]

        # New: Filter logits with top_k sampling
        if top_k is not None:
            # Keep only top_k values
            top_logits, _ = torch.topk(logits, top_k)
            min_val = top_logits[:, -1]
            logits = torch.where(logits < min_val, torch.tensor(float("-inf")).to(logits.device), logits)

        # New: Apply temperature scaling
        if temperature > 0.0:
            logits = logits / temperature

            # Apply softmax to get probabilities
            probs = torch.softmax(logits, dim=-1)  # (batch_size, context_len)

            # Sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)  # (batch_size, 1)

        # Otherwise same as before: get idx of the vocab entry with the highest logits value
        else:
            idx_next = torch.argmax(logits, dim=-1, keepdim=True)  # (batch_size, 1)

        if idx_next == eos_id:  # Stop generating early if end-of-sequence token is encountered and eos_id is specified
            break

        # Same as before: append sampled index to the running sequence
        idx = torch.cat((idx, idx_next), dim=1)  # (batch_size, num_tokens+1)

    return idx
```

Tip

top-k์˜ ์ผ๋ฐ˜์ ์ธ ๋Œ€์•ˆ์€ top-p๋กœ, ํ•ต์‹ฌ ์ƒ˜ํ”Œ๋ง์ด๋ผ๊ณ ๋„ ํ•˜๋ฉฐ, ๊ฐ€์žฅ ๋†’์€ ํ™•๋ฅ ์„ ๊ฐ€์ง„ k ์ƒ˜ํ”Œ์„ ์–ป๋Š” ๋Œ€์‹ , ๊ฒฐ๊ณผ๋กœ ๋‚˜์˜จ ์–ดํœ˜๋ฅผ ํ™•๋ฅ ์— ๋”ฐ๋ผ ์ •๋ฆฌํ•˜๊ณ  ๊ฐ€์žฅ ๋†’์€ ํ™•๋ฅ ๋ถ€ํ„ฐ ๊ฐ€์žฅ ๋‚ฎ์€ ํ™•๋ฅ ๊นŒ์ง€ ํ•ฉ์‚ฐํ•˜์—ฌ ์ž„๊ณ„๊ฐ’์— ๋„๋‹ฌํ•  ๋•Œ๊นŒ์ง€ ์ง„ํ–‰ํ•ฉ๋‹ˆ๋‹ค.

๊ทธ๋Ÿฐ ๋‹ค์Œ, ์ƒ๋Œ€ ํ™•๋ฅ ์— ๋”ฐ๋ผ ์–ดํœ˜์˜ ๋‹จ์–ด๋“ค๋งŒ ๊ณ ๋ ค๋ฉ๋‹ˆ๋‹ค.

์ด๋Š” ๊ฐ ๊ฒฝ์šฐ์— ๋”ฐ๋ผ ์ตœ์ ์˜ k๊ฐ€ ๋‹ค๋ฅผ ์ˆ˜ ์žˆ์œผ๋ฏ€๋กœ k ์ƒ˜ํ”Œ์˜ ์ˆ˜๋ฅผ ์„ ํƒํ•  ํ•„์š” ์—†์ด ์˜ค์ง ์ž„๊ณ„๊ฐ’๋งŒ ํ•„์š”ํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค.

์ด ๊ฐœ์„  ์‚ฌํ•ญ์€ ์ด์ „ ์ฝ”๋“œ์— ํฌํ•จ๋˜์–ด ์žˆ์ง€ ์•Š์Œ์„ ์œ ์˜ํ•˜์„ธ์š”.

Tip

์ƒ์„ฑ๋œ ํ…์ŠคํŠธ๋ฅผ ๊ฐœ์„ ํ•˜๋Š” ๋˜ ๋‹ค๋ฅธ ๋ฐฉ๋ฒ•์€ ์ด ์˜ˆ์ œ์—์„œ ์‚ฌ์šฉ๋œ ํƒ์š•์  ๊ฒ€์ƒ‰ ๋Œ€์‹  Beam search๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.
ํƒ์š•์  ๊ฒ€์ƒ‰๊ณผ ๋‹ฌ๋ฆฌ, ๊ฐ ๋‹จ๊ณ„์—์„œ ๊ฐ€์žฅ ํ™•๋ฅ ์ด ๋†’์€ ๋‹ค์Œ ๋‹จ์–ด๋ฅผ ์„ ํƒํ•˜๊ณ  ๋‹จ์ผ ์‹œํ€€์Šค๋ฅผ ๊ตฌ์ถ•ํ•˜๋Š” ๋Œ€์‹ , beam search๋Š” ๊ฐ ๋‹จ๊ณ„์—์„œ ์ƒ์œ„ ๐‘˜ k์˜ ์ ์ˆ˜๊ฐ€ ๋†’์€ ๋ถ€๋ถ„ ์‹œํ€€์Šค(์ด๋ฅธ๋ฐ” โ€œbeamsโ€)๋ฅผ ์ถ”์ ํ•ฉ๋‹ˆ๋‹ค. ์—ฌ๋Ÿฌ ๊ฐ€๋Šฅ์„ฑ์„ ๋™์‹œ์— ํƒ์ƒ‰ํ•จ์œผ๋กœ์จ ํšจ์œจ์„ฑ๊ณผ ํ’ˆ์งˆ์˜ ๊ท ํ˜•์„ ๋งž์ถ”์–ด, ํƒ์š•์  ์ ‘๊ทผ ๋ฐฉ์‹์œผ๋กœ ์ธํ•ด ์กฐ๊ธฐ ๋น„์ตœ์  ์„ ํƒ์œผ๋กœ ๋†“์น  ์ˆ˜ ์žˆ๋Š” ๋” ๋‚˜์€ ์ „์ฒด ์‹œํ€€์Šค๋ฅผ ์ฐพ์„ ๊ฐ€๋Šฅ์„ฑ์„ ๋†’์ž…๋‹ˆ๋‹ค.

์ด ๊ฐœ์„  ์‚ฌํ•ญ์€ ์ด์ „ ์ฝ”๋“œ์— ํฌํ•จ๋˜์–ด ์žˆ์ง€ ์•Š์Œ์„ ์œ ์˜ํ•˜์„ธ์š”.

Loss functions

The calc_loss_batch function calculates the cross entropy of the predictions of a single batch.
The calc_loss_loader function gets the cross entropy of all the batches and calculates the average cross entropy.

```python
# Define loss functions
def calc_loss_batch(input_batch, target_batch, model, device):
    input_batch, target_batch = input_batch.to(device), target_batch.to(device)
    logits = model(input_batch)
    loss = torch.nn.functional.cross_entropy(logits.flatten(0, 1), target_batch.flatten())
    return loss

def calc_loss_loader(data_loader, model, device, num_batches=None):
    total_loss = 0.
    if len(data_loader) == 0:
        return float("nan")
    elif num_batches is None:
        num_batches = len(data_loader)
    else:
        # Reduce the number of batches to match the total number of batches in the data loader
        # if num_batches exceeds the number of batches in the data loader
        num_batches = min(num_batches, len(data_loader))
    for i, (input_batch, target_batch) in enumerate(data_loader):
        if i < num_batches:
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            total_loss += loss.item()
        else:
            break
    return total_loss / num_batches
```

Tip

๊ทธ๋ž˜๋””์–ธํŠธ ํด๋ฆฌํ•‘์€ ํ›ˆ๋ จ ์•ˆ์ •์„ฑ์„ ํ–ฅ์ƒ์‹œํ‚ค๊ธฐ ์œ„ํ•ด ํฐ ์‹ ๊ฒฝ๋ง์—์„œ ๊ทธ๋ž˜๋””์–ธํŠธ ํฌ๊ธฐ์— ๋Œ€ํ•œ ์ตœ๋Œ€ ์ž„๊ณ„๊ฐ’์„ ์„ค์ •ํ•˜๋Š” ๊ธฐ์ˆ ์ž…๋‹ˆ๋‹ค. ๊ทธ๋ž˜๋””์–ธํŠธ๊ฐ€ ์ด ๋ฏธ๋ฆฌ ์ •์˜๋œ max_norm์„ ์ดˆ๊ณผํ•˜๋ฉด, ๋ชจ๋ธ์˜ ๋งค๊ฐœ๋ณ€์ˆ˜์— ๋Œ€ํ•œ ์—…๋ฐ์ดํŠธ๊ฐ€ ๊ด€๋ฆฌ ๊ฐ€๋Šฅํ•œ ๋ฒ”์œ„ ๋‚ด์— ์œ ์ง€๋˜๋„๋ก ๋น„๋ก€์ ์œผ๋กœ ์ถ•์†Œ๋˜์–ด ํญ๋ฐœํ•˜๋Š” ๊ทธ๋ž˜๋””์–ธํŠธ์™€ ๊ฐ™์€ ๋ฌธ์ œ๋ฅผ ๋ฐฉ์ง€ํ•˜๊ณ  ๋ณด๋‹ค ํ†ต์ œ๋˜๊ณ  ์•ˆ์ •์ ์ธ ํ›ˆ๋ จ์„ ๋ณด์žฅํ•ฉ๋‹ˆ๋‹ค.

์ด ๊ฐœ์„  ์‚ฌํ•ญ์€ ์ด์ „ ์ฝ”๋“œ์— ํฌํ•จ๋˜์–ด ์žˆ์ง€ ์•Š์Œ์„ ์œ ์˜ํ•˜์‹ญ์‹œ์˜ค.

๋‹ค์Œ ์˜ˆ์ œ๋ฅผ ํ™•์ธํ•˜์‹ญ์‹œ์˜ค:

๋ฐ์ดํ„ฐ ๋กœ๋”ฉ

ํ•จ์ˆ˜ create_dataloader_v1์™€ create_dataloader_v1๋Š” ์ด์ „ ์„น์…˜์—์„œ ์ด๋ฏธ ๋…ผ์˜๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

์—ฌ๊ธฐ์„œ 90%์˜ ํ…์ŠคํŠธ๊ฐ€ ํ›ˆ๋ จ์— ์‚ฌ์šฉ๋˜๊ณ  10%๊ฐ€ ๊ฒ€์ฆ์— ์‚ฌ์šฉ๋œ๋‹ค๋Š” ์ ์„ ์ฃผ๋ชฉํ•˜์‹ญ์‹œ์˜ค. ๋‘ ์„ธํŠธ๋Š” 2๊ฐœ์˜ ์„œ๋กœ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ ๋กœ๋”์— ์ €์žฅ๋ฉ๋‹ˆ๋‹ค.
๋•Œ๋•Œ๋กœ ๋ฐ์ดํ„ฐ ์„ธํŠธ์˜ ์ผ๋ถ€๋Š” ๋ชจ๋ธ ์„ฑ๋Šฅ์„ ๋” ์ž˜ ํ‰๊ฐ€ํ•˜๊ธฐ ์œ„ํ•ด ํ…Œ์ŠคํŠธ ์„ธํŠธ๋กœ ๋‚จ๊ฒจ์ง€๊ธฐ๋„ ํ•ฉ๋‹ˆ๋‹ค.

๋‘ ๋ฐ์ดํ„ฐ ๋กœ๋”๋Š” ๋™์ผํ•œ ๋ฐฐ์น˜ ํฌ๊ธฐ, ์ตœ๋Œ€ ๊ธธ์ด, ์ŠคํŠธ๋ผ์ด๋“œ ๋ฐ ์ž‘์—…์ž ์ˆ˜(์ด ๊ฒฝ์šฐ 0)๋ฅผ ์‚ฌ์šฉํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.
์ฃผ์š” ์ฐจ์ด์ ์€ ๊ฐ ๋ฐ์ดํ„ฐ ๋กœ๋”์—์„œ ์‚ฌ์šฉํ•˜๋Š” ๋ฐ์ดํ„ฐ์ด๋ฉฐ, ๊ฒ€์ฆ์ž๋Š” ๋งˆ์ง€๋ง‰ ๋ฐ์ดํ„ฐ๋ฅผ ๋ฒ„๋ฆฌ์ง€ ์•Š์œผ๋ฉฐ ๊ฒ€์ฆ ๋ชฉ์ ์— ํ•„์š”ํ•˜์ง€ ์•Š๊ธฐ ๋•Œ๋ฌธ์— ๋ฐ์ดํ„ฐ๋ฅผ ์„ž์ง€ ์•Š์Šต๋‹ˆ๋‹ค.

๋˜ํ•œ ์ŠคํŠธ๋ผ์ด๋“œ๊ฐ€ ์ปจํ…์ŠคํŠธ ๊ธธ์ด๋งŒํผ ํฌ๋‹ค๋Š” ๊ฒƒ์€ ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ์— ์‚ฌ์šฉ๋˜๋Š” ์ปจํ…์ŠคํŠธ ๊ฐ„์— ๊ฒน์นจ์ด ์—†์Œ์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค(๊ณผ์ ํ•ฉ์„ ์ค„์ด์ง€๋งŒ ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ ์„ธํŠธ๋„ ์ค„์ž…๋‹ˆ๋‹ค).

๋”์šฑ์ด, ์ด ๊ฒฝ์šฐ ๋ฐฐ์น˜ ํฌ๊ธฐ๊ฐ€ 2๋กœ ์„ค์ •๋˜์–ด ๋ฐ์ดํ„ฐ๋ฅผ 2๊ฐœ์˜ ๋ฐฐ์น˜๋กœ ๋‚˜๋ˆ„๋ฉฐ, ์ด์˜ ์ฃผ์š” ๋ชฉํ‘œ๋Š” ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ๋ฅผ ํ—ˆ์šฉํ•˜๊ณ  ๋ฐฐ์น˜๋‹น ์†Œ๋น„๋ฅผ ์ค„์ด๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

```python
train_ratio = 0.90
split_idx = int(train_ratio * len(text_data))
train_data = text_data[:split_idx]
val_data = text_data[split_idx:]

torch.manual_seed(123)

train_loader = create_dataloader_v1(
    train_data,
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=True,
    shuffle=True,
    num_workers=0
)

val_loader = create_dataloader_v1(
    val_data,
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=False,
    shuffle=False,
    num_workers=0
)
```

Sanity Checks

๋ชฉํ‘œ๋Š” ํ›ˆ๋ จ์„ ์œ„ํ•œ ์ถฉ๋ถ„ํ•œ ํ† ํฐ์ด ์žˆ๋Š”์ง€, ํ˜•ํƒœ๊ฐ€ ์˜ˆ์ƒํ•œ ๋Œ€๋กœ์ธ์ง€ ํ™•์ธํ•˜๊ณ , ํ›ˆ๋ จ ๋ฐ ๊ฒ€์ฆ์— ์‚ฌ์šฉ๋œ ํ† ํฐ ์ˆ˜์— ๋Œ€ํ•œ ์ •๋ณด๋ฅผ ์–ป๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค:

```python
# Sanity checks
if total_tokens * (train_ratio) < GPT_CONFIG_124M["context_length"]:
    print("Not enough tokens for the training loader. "
          "Try to lower the `GPT_CONFIG_124M['context_length']` or "
          "increase the `training_ratio`")

if total_tokens * (1-train_ratio) < GPT_CONFIG_124M["context_length"]:
    print("Not enough tokens for the validation loader. "
          "Try to lower the `GPT_CONFIG_124M['context_length']` or "
          "decrease the `training_ratio`")

print("Train loader:")
for x, y in train_loader:
    print(x.shape, y.shape)

print("\nValidation loader:")
for x, y in val_loader:
    print(x.shape, y.shape)

train_tokens = 0
for input_batch, target_batch in train_loader:
    train_tokens += input_batch.numel()

val_tokens = 0
for input_batch, target_batch in val_loader:
    val_tokens += input_batch.numel()

print("Training tokens:", train_tokens)
print("Validation tokens:", val_tokens)
print("All tokens:", train_tokens + val_tokens)
```

ํ›ˆ๋ จ ๋ฐ ์‚ฌ์ „ ๊ณ„์‚ฐ์„ ์œ„ํ•œ ์žฅ์น˜ ์„ ํƒ

๋‹ค์Œ ์ฝ”๋“œ๋Š” ์‚ฌ์šฉํ•  ์žฅ์น˜๋ฅผ ์„ ํƒํ•˜๊ณ  ํ›ˆ๋ จ ์†์‹ค ๋ฐ ๊ฒ€์ฆ ์†์‹ค์„ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค(์•„์ง ์•„๋ฌด๊ฒƒ๋„ ํ›ˆ๋ จํ•˜์ง€ ์•Š์€ ์ƒํƒœ์—์„œ) ์‹œ์ž‘์ ์œผ๋กœ ์‚ผ์Šต๋‹ˆ๋‹ค.

```python
# Indicate the device to use
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

print(f"Using {device} device.")

model.to(device) # no assignment model = model.to(device) necessary for nn.Module classes

# Pre-calculate losses without starting yet
torch.manual_seed(123) # For reproducibility due to the shuffling in the data loader

with torch.no_grad(): # Disable gradient tracking for efficiency because we are not training, yet
    train_loss = calc_loss_loader(train_loader, model, device)
    val_loss = calc_loss_loader(val_loader, model, device)

print("Training loss:", train_loss)
print("Validation loss:", val_loss)
```

Training functions

ํ•จ์ˆ˜ generate_and_print_sample๋Š” ์ปจํ…์ŠคํŠธ๋ฅผ ๋ฐ›์•„ ๋ชจ๋ธ์ด ๊ทธ ์‹œ์ ์—์„œ ์–ผ๋งˆ๋‚˜ ์ข‹์€์ง€์— ๋Œ€ํ•œ ๋А๋‚Œ์„ ์–ป๊ธฐ ์œ„ํ•ด ์ผ๋ถ€ ํ† ํฐ์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค. ์ด๋Š” train_model_simple์— ์˜ํ•ด ๊ฐ ๋‹จ๊ณ„์—์„œ ํ˜ธ์ถœ๋ฉ๋‹ˆ๋‹ค.

ํ•จ์ˆ˜ evaluate_model์€ ํ›ˆ๋ จ ํ•จ์ˆ˜์— ์ง€์‹œ๋œ ๋งŒํผ ์ž์ฃผ ํ˜ธ์ถœ๋˜๋ฉฐ, ๋ชจ๋ธ ํ›ˆ๋ จ ์‹œ์ ์—์„œ ํ›ˆ๋ จ ์†์‹ค๊ณผ ๊ฒ€์ฆ ์†์‹ค์„ ์ธก์ •ํ•˜๋Š” ๋ฐ ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค.

๊ทธ๋Ÿฐ ๋‹ค์Œ ํฐ ํ•จ์ˆ˜ train_model_simple์ด ์‹ค์ œ๋กœ ๋ชจ๋ธ์„ ํ›ˆ๋ จํ•ฉ๋‹ˆ๋‹ค. ์ด ํ•จ์ˆ˜๋Š” ๋‹ค์Œ์„ ๊ธฐ๋Œ€ํ•ฉ๋‹ˆ๋‹ค:

  • ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ ๋กœ๋”(ํ›ˆ๋ จ์„ ์œ„ํ•ด ์ด๋ฏธ ๋ถ„๋ฆฌ๋˜๊ณ  ์ค€๋น„๋œ ๋ฐ์ดํ„ฐ)
  • ๊ฒ€์ฆ์ž ๋กœ๋”
  • ํ›ˆ๋ จ ์ค‘ ์‚ฌ์šฉํ•  ์ตœ์ ํ™”๊ธฐ: ์ด๋Š” ๊ทธ๋ž˜๋””์–ธํŠธ๋ฅผ ์‚ฌ์šฉํ•˜๊ณ  ์†์‹ค์„ ์ค„์ด๊ธฐ ์œ„ํ•ด ๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ ์—…๋ฐ์ดํŠธํ•˜๋Š” ํ•จ์ˆ˜์ž…๋‹ˆ๋‹ค. ์ด ๊ฒฝ์šฐ, ๋ณด์‹œ๋‹ค์‹œํ”ผ AdamW๊ฐ€ ์‚ฌ์šฉ๋˜์ง€๋งŒ ๋” ๋งŽ์€ ๊ฒƒ์ด ์žˆ์Šต๋‹ˆ๋‹ค.
  • optimizer.zero_grad()๋Š” ๊ฐ ๋ผ์šด๋“œ์—์„œ ๊ทธ๋ž˜๋””์–ธํŠธ๋ฅผ ์žฌ์„ค์ •ํ•˜๊ธฐ ์œ„ํ•ด ํ˜ธ์ถœ๋˜์–ด ๋ˆ„์ ๋˜์ง€ ์•Š๋„๋ก ํ•ฉ๋‹ˆ๋‹ค.
  • lr ๋งค๊ฐœ๋ณ€์ˆ˜๋Š” ํ•™์Šต๋ฅ ๋กœ, ๋ชจ๋ธ์˜ ๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ ์—…๋ฐ์ดํŠธํ•  ๋•Œ ์ตœ์ ํ™” ๊ณผ์ •์—์„œ ์ทจํ•˜๋Š” ๋‹จ๊ณ„์˜ ํฌ๊ธฐ๋ฅผ ๊ฒฐ์ •ํ•ฉ๋‹ˆ๋‹ค. ์ž‘์€ ํ•™์Šต๋ฅ ์€ ์ตœ์ ํ™”๊ธฐ๊ฐ€ ๊ฐ€์ค‘์น˜์— ๋Œ€ํ•œ ์ž‘์€ ์—…๋ฐ์ดํŠธ๋ฅผ ์ˆ˜ํ–‰ํ•˜๊ฒŒ ํ•˜์—ฌ ๋” ์ •ํ™•ํ•œ ์ˆ˜๋ ด์„ ์ด๋Œ ์ˆ˜ ์žˆ์ง€๋งŒ ํ›ˆ๋ จ ์†๋„๋ฅผ ๋А๋ฆฌ๊ฒŒ ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ํฐ ํ•™์Šต๋ฅ ์€ ํ›ˆ๋ จ ์†๋„๋ฅผ ๋†’์ผ ์ˆ˜ ์žˆ์ง€๋งŒ ์†์‹ค ํ•จ์ˆ˜์˜ ์ตœ์†Œ๊ฐ’์„ ๋„˜์–ด๋ฒ„๋ฆด ์œ„ํ—˜์ด ์žˆ์Šต๋‹ˆ๋‹ค(์†์‹ค ํ•จ์ˆ˜๊ฐ€ ์ตœ์†Œํ™”๋˜๋Š” ์ง€์ ์„ ๋„˜์–ด ์ ํ”„).
  • Weight Decay๋Š” ํฐ ๊ฐ€์ค‘์น˜์— ๋Œ€ํ•ด ํŒจ๋„ํ‹ฐ๋ฅผ ๋ถ€์—ฌํ•˜๋Š” ์ถ”๊ฐ€ ํ•ญ์„ ์ถ”๊ฐ€ํ•˜์—ฌ ์†์‹ค ๊ณ„์‚ฐ ๋‹จ๊ณ„๋ฅผ ์ˆ˜์ •ํ•ฉ๋‹ˆ๋‹ค. ์ด๋Š” ์ตœ์ ํ™”๊ธฐ๊ฐ€ ๋ฐ์ดํ„ฐ์— ์ž˜ ๋งž์ถ”๋ฉด์„œ ๋ชจ๋ธ์„ ๋‹จ์ˆœํ•˜๊ฒŒ ์œ ์ง€ํ•˜์—ฌ ๋จธ์‹ ๋Ÿฌ๋‹ ๋ชจ๋ธ์—์„œ ๊ณผ์ ํ•ฉ์„ ๋ฐฉ์ง€ํ•˜๊ธฐ ์œ„ํ•ด ์–ด๋–ค ๋‹จ์ผ ํŠน์„ฑ์— ๋„ˆ๋ฌด ๋งŽ์€ ์ค‘์š”์„ฑ์„ ๋ถ€์—ฌํ•˜์ง€ ์•Š๋„๋ก ์œ ๋„ํ•ฉ๋‹ˆ๋‹ค.
  • L2 ์ •๊ทœํ™”๊ฐ€ ์žˆ๋Š” SGD์™€ ๊ฐ™์€ ์ „ํ†ต์ ์ธ ์ตœ์ ํ™”๊ธฐ๋Š” ์†์‹ค ํ•จ์ˆ˜์˜ ๊ทธ๋ž˜๋””์–ธํŠธ์™€ ํ•จ๊ป˜ ๊ฐ€์ค‘์น˜ ๊ฐ์†Œ๋ฅผ ๊ฒฐํ•ฉํ•ฉ๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ AdamW(Adam ์ตœ์ ํ™”๊ธฐ์˜ ๋ณ€ํ˜•)๋Š” ๊ฐ€์ค‘์น˜ ๊ฐ์†Œ๋ฅผ ๊ทธ๋ž˜๋””์–ธํŠธ ์—…๋ฐ์ดํŠธ์™€ ๋ถ„๋ฆฌํ•˜์—ฌ ๋” ํšจ๊ณผ์ ์ธ ์ •๊ทœํ™”๋ฅผ ์ด๋Œ์–ด๋ƒ…๋‹ˆ๋‹ค.
  • ํ›ˆ๋ จ์— ์‚ฌ์šฉํ•  ์žฅ์น˜
  • ์—ํฌํฌ ์ˆ˜: ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ๋ฅผ ๋ฐ˜๋ณตํ•˜๋Š” ํšŸ์ˆ˜
  • ํ‰๊ฐ€ ๋นˆ๋„: evaluate_model์„ ํ˜ธ์ถœํ•˜๋Š” ๋นˆ๋„
  • ํ‰๊ฐ€ ๋ฐ˜๋ณต: generate_and_print_sample์„ ํ˜ธ์ถœํ•  ๋•Œ ๋ชจ๋ธ์˜ ํ˜„์žฌ ์ƒํƒœ๋ฅผ ํ‰๊ฐ€ํ•˜๋Š” ๋ฐ ์‚ฌ์šฉํ•  ๋ฐฐ์น˜ ์ˆ˜
  • ์‹œ์ž‘ ์ปจํ…์ŠคํŠธ: generate_and_print_sample์„ ํ˜ธ์ถœํ•  ๋•Œ ์‚ฌ์šฉํ•  ์‹œ์ž‘ ๋ฌธ์žฅ
  • ํ† ํฌ๋‚˜์ด์ €
```python
# Functions to train the data
def train_model_simple(model, train_loader, val_loader, optimizer, device, num_epochs,
                       eval_freq, eval_iter, start_context, tokenizer):
    # Initialize lists to track losses and tokens seen
    train_losses, val_losses, track_tokens_seen = [], [], []
    tokens_seen, global_step = 0, -1

    # Main training loop
    for epoch in range(num_epochs):
        model.train()  # Set model to training mode

        for input_batch, target_batch in train_loader:
            optimizer.zero_grad() # Reset loss gradients from previous batch iteration
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            loss.backward() # Calculate loss gradients
            optimizer.step() # Update model weights using loss gradients
            tokens_seen += input_batch.numel()
            global_step += 1

            # Optional evaluation step
            if global_step % eval_freq == 0:
                train_loss, val_loss = evaluate_model(
                    model, train_loader, val_loader, device, eval_iter)
                train_losses.append(train_loss)
                val_losses.append(val_loss)
                track_tokens_seen.append(tokens_seen)
                print(f"Ep {epoch+1} (Step {global_step:06d}): "
                      f"Train loss {train_loss:.3f}, Val loss {val_loss:.3f}")

        # Print a sample text after each epoch
        generate_and_print_sample(
            model, tokenizer, device, start_context
        )

    return train_losses, val_losses, track_tokens_seen


def evaluate_model(model, train_loader, val_loader, device, eval_iter):
    model.eval() # Set in eval mode to avoid dropout
    with torch.no_grad():
        train_loss = calc_loss_loader(train_loader, model, device, num_batches=eval_iter)
        val_loss = calc_loss_loader(val_loader, model, device, num_batches=eval_iter)
    model.train() # Back to training mode applying all the configurations
    return train_loss, val_loss


def generate_and_print_sample(model, tokenizer, device, start_context):
    model.eval() # Set in eval mode to avoid dropout
    context_size = model.pos_emb.weight.shape[0]
    encoded = text_to_token_ids(start_context, tokenizer).to(device)
    with torch.no_grad():
        token_ids = generate_text(
            model=model, idx=encoded,
            max_new_tokens=50, context_size=context_size
        )
    decoded_text = token_ids_to_text(token_ids, tokenizer)
    print(decoded_text.replace("\n", " "))  # Compact print format
    model.train() # Back to training mode applying all the configurations
```

Tip

ํ•™์Šต ์†๋„๋ฅผ ๊ฐœ์„ ํ•˜๊ธฐ ์œ„ํ•ด ์„ ํ˜• ์›Œ๋ฐ์—… ๋ฐ ์ฝ”์‚ฌ์ธ ๊ฐ์†Œ๋ผ๋Š” ๋ช‡ ๊ฐ€์ง€ ๊ด€๋ จ ๊ธฐ์ˆ ์ด ์žˆ์Šต๋‹ˆ๋‹ค.

์„ ํ˜• ์›Œ๋ฐ์—…์€ ์ดˆ๊ธฐ ํ•™์Šต ์†๋„์™€ ์ตœ๋Œ€ ํ•™์Šต ์†๋„๋ฅผ ์ •์˜ํ•˜๊ณ  ๊ฐ ์—ํฌํฌ ํ›„์— ์ผ๊ด€๋˜๊ฒŒ ์—…๋ฐ์ดํŠธํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ์ด๋Š” ํ›ˆ๋ จ์„ ์ž‘์€ ๊ฐ€์ค‘์น˜ ์—…๋ฐ์ดํŠธ๋กœ ์‹œ์ž‘ํ•˜๋ฉด ๋ชจ๋ธ์ด ํ›ˆ๋ จ ๋‹จ๊ณ„์—์„œ ํฐ ๋ถˆ์•ˆ์ •ํ•œ ์—…๋ฐ์ดํŠธ๋ฅผ ๋งŒ๋‚  ์œ„ํ—˜์ด ์ค„์–ด๋“ค๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค.
์ฝ”์‚ฌ์ธ ๊ฐ์†Œ๋Š” ์›Œ๋ฐ์—… ๋‹จ๊ณ„ ์ดํ›„์— ๋ฐ˜ ์ฝ”์‚ฌ์ธ ๊ณก์„ ์„ ๋”ฐ๋ผ ํ•™์Šต ์†๋„๋ฅผ ์ ์ง„์ ์œผ๋กœ ์ค„์ด๋Š” ๊ธฐ์ˆ ๋กœ, ๊ฐ€์ค‘์น˜ ์—…๋ฐ์ดํŠธ๋ฅผ ๋А๋ฆฌ๊ฒŒ ํ•˜์—ฌ ์†์‹ค ์ตœ์†Œ๊ฐ’์„ ์ดˆ๊ณผํ•  ์œ„ํ—˜์„ ์ตœ์†Œํ™”ํ•˜๊ณ  ํ›„์† ๋‹จ๊ณ„์—์„œ ํ›ˆ๋ จ์˜ ์•ˆ์ •์„ฑ์„ ๋ณด์žฅํ•ฉ๋‹ˆ๋‹ค.

์ด๋Ÿฌํ•œ ๊ฐœ์„  ์‚ฌํ•ญ์€ ์ด์ „ ์ฝ”๋“œ์— ํฌํ•จ๋˜์–ด ์žˆ์ง€ ์•Š๋‹ค๋Š” ์ ์— ์œ ์˜ํ•˜์„ธ์š”.

Start training

```python
import time
start_time = time.time()

torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=0.0004, weight_decay=0.1)

num_epochs = 10
train_losses, val_losses, tokens_seen = train_model_simple(
    model, train_loader, val_loader, optimizer, device,
    num_epochs=num_epochs, eval_freq=5, eval_iter=5,
    start_context="Every effort moves you", tokenizer=tokenizer
)

end_time = time.time()
execution_time_minutes = (end_time - start_time) / 60
print(f"Training completed in {execution_time_minutes:.2f} minutes.")
```

๋‹ค์Œ ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด ๋ชจ๋ธ์ด ํ›ˆ๋ จ๋˜๋Š” ๋™์•ˆ์˜ ์ง„ํ™”๋ฅผ ์ถœ๋ ฅํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

```python
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
import math

def plot_losses(epochs_seen, tokens_seen, train_losses, val_losses):
    fig, ax1 = plt.subplots(figsize=(5, 3))
    ax1.plot(epochs_seen, train_losses, label="Training loss")
    ax1.plot(
        epochs_seen, val_losses, linestyle="-.", label="Validation loss"
    )
    ax1.set_xlabel("Epochs")
    ax1.set_ylabel("Loss")
    ax1.legend(loc="upper right")
    ax1.xaxis.set_major_locator(MaxNLocator(integer=True))
    ax2 = ax1.twiny()
    ax2.plot(tokens_seen, train_losses, alpha=0)
    ax2.set_xlabel("Tokens seen")
    fig.tight_layout()
    plt.show()

# Compute perplexity from the loss values
train_ppls = [math.exp(loss) for loss in train_losses]
val_ppls = [math.exp(loss) for loss in val_losses]
# Plot perplexity over tokens seen
plt.figure()
plt.plot(tokens_seen, train_ppls, label='Training Perplexity')
plt.plot(tokens_seen, val_ppls, label='Validation Perplexity')
plt.xlabel('Tokens Seen')
plt.ylabel('Perplexity')
plt.title('Perplexity over Training')
plt.legend()
plt.show()

epochs_tensor = torch.linspace(0, num_epochs, len(train_losses))
plot_losses(epochs_tensor, tokens_seen, train_losses, val_losses)
```

๋ชจ๋ธ ์ €์žฅ

๋‚˜์ค‘์— ํ›ˆ๋ จ์„ ๊ณ„์†ํ•˜๊ณ  ์‹ถ๋‹ค๋ฉด ๋ชจ๋ธ + ์˜ตํ‹ฐ๋งˆ์ด์ €๋ฅผ ์ €์žฅํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

```python
# Save the model and the optimizer for later training
torch.save({
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    },
    "/tmp/model_and_optimizer.pth"
)
# Note that this model with the optimizer occupied close to 2GB

# Restore model and optimizer for training
checkpoint = torch.load("/tmp/model_and_optimizer.pth", map_location=device)

model = GPTModel(GPT_CONFIG_124M)
model.load_state_dict(checkpoint["model_state_dict"])
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.1)
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
model.train() # Put in training mode
```

๋ชจ๋ธ๋งŒ ์‚ฌ์šฉํ•˜๋ ค๋Š” ๊ฒฝ์šฐ:

```python
# Save the model
torch.save(model.state_dict(), "model.pth")

# Load it
model = GPTModel(GPT_CONFIG_124M)
model.load_state_dict(torch.load("model.pth", map_location=device))
model.eval() # Put in eval mode
```

GPT2 ๊ฐ€์ค‘์น˜ ๋กœ๋“œ

๋กœ์ปฌ์—์„œ GPT2 ๊ฐ€์ค‘์น˜๋ฅผ ๋กœ๋“œํ•˜๋Š” ๋‘ ๊ฐœ์˜ ๊ฐ„๋‹จํ•œ ์Šคํฌ๋ฆฝํŠธ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค. ๋‘ ๊ฒฝ์šฐ ๋ชจ๋‘ ๋กœ์ปฌ์—์„œ ๋ฆฌํฌ์ง€ํ† ๋ฆฌ https://github.com/rasbt/LLMs-from-scratch์„ ํด๋ก ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๊ทธ๋Ÿฐ ๋‹ค์Œ:

  • ์Šคํฌ๋ฆฝํŠธ https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/01_main-chapter-code/gpt_generate.py๋Š” ๋ชจ๋“  ๊ฐ€์ค‘์น˜๋ฅผ ๋‹ค์šด๋กœ๋“œํ•˜๊ณ  OpenAI์˜ ํ˜•์‹์„ ์šฐ๋ฆฌ LLM์—์„œ ๊ธฐ๋Œ€ํ•˜๋Š” ํ˜•์‹์œผ๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค. ์ด ์Šคํฌ๋ฆฝํŠธ๋Š” ํ•„์š”ํ•œ ๊ตฌ์„ฑ๊ณผ ํ”„๋กฌํ”„ํŠธ โ€œEvery effort moves youโ€œ๋กœ ์ค€๋น„๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค.
  • ์Šคํฌ๋ฆฝํŠธ https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/02_alternative_weight_loading/weight-loading-hf-transformers.ipynb๋Š” ๋กœ์ปฌ์—์„œ GPT2 ๊ฐ€์ค‘์น˜๋ฅผ ๋กœ๋“œํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•ด์ค๋‹ˆ๋‹ค(๋‹จ์ง€ CHOOSE_MODEL ๋ณ€์ˆ˜๋ฅผ ๋ณ€๊ฒฝํ•˜๋ฉด ๋ฉ๋‹ˆ๋‹ค) ๊ทธ๋ฆฌ๊ณ  ๋ช‡ ๊ฐ€์ง€ ํ”„๋กฌํ”„ํŠธ์—์„œ ํ…์ŠคํŠธ๋ฅผ ์˜ˆ์ธกํ•ฉ๋‹ˆ๋‹ค.

์ฐธ๊ณ ๋ฌธํ—Œ
