5. LLM Architecture


LLM Architecture

Tip

이 λ‹€μ„― 번째 λ‹¨κ³„μ˜ λͺ©ν‘œλŠ” 맀우 κ°„λ‹¨ν•©λ‹ˆλ‹€: 전체 LLM의 μ•„ν‚€ν…μ²˜λ₯Ό κ°œλ°œν•˜λŠ” κ²ƒμž…λ‹ˆλ‹€. λͺ¨λ“  것을 κ²°ν•©ν•˜κ³ , λͺ¨λ“  λ ˆμ΄μ–΄λ₯Ό μ μš©ν•˜λ©°, ν…μŠ€νŠΈλ₯Ό μƒμ„±ν•˜κ±°λ‚˜ ν…μŠ€νŠΈλ₯Ό ID둜 λ³€ν™˜ν•˜κ³  κ·Έ λ°˜λŒ€λ‘œ λ³€ν™˜ν•˜λŠ” λͺ¨λ“  κΈ°λŠ₯을 λ§Œλ“­λ‹ˆλ‹€.

이 μ•„ν‚€ν…μ²˜λŠ” ν›ˆλ ¨ ν›„ ν…μŠ€νŠΈλ₯Ό μ˜ˆμΈ‘ν•˜λŠ” 데 μ‚¬μš©λ©λ‹ˆλ‹€.

LLM μ•„ν‚€ν…μ²˜ μ˜ˆμ‹œλŠ” https://github.com/rasbt/LLMs-from-scratch/blob/main/ch04/01_main-chapter-code/ch04.ipynbμ—μ„œ 확인할 수 μžˆμŠ΅λ‹ˆλ‹€:

높은 μˆ˜μ€€μ˜ ν‘œν˜„μ€ λ‹€μŒκ³Ό 같이 κ΄€μ°°ν•  수 μžˆμŠ΅λ‹ˆλ‹€:

https://camo.githubusercontent.com/6c8c392f72d5b9e86c94aeb9470beab435b888d24135926f1746eb88e0cc18fb/68747470733a2f2f73656261737469616e72617363686b612e636f6d2f696d616765732f4c4c4d732d66726f6d2d736372617463682d696d616765732f636830345f636f6d707265737365642f31332e776562703f31

  1. Input (Tokenized Text): The process begins with tokenized text, which is converted into numerical representations (token IDs).
  2. Token Embedding and Positional Embedding Layer: The token IDs pass through a token embedding layer and a positional embedding layer. The positional embedding captures the position of each token in the sequence, which is critical for understanding word order.
  3. Transformer Blocks: The model contains 12 transformer blocks, each made up of several layers. Each block repeats the following sequence:
  β€’ Masked Multi-Head Attention: Allows the model to focus on different parts of the input text at once.
  β€’ Layer Normalization: A normalization step that stabilizes and improves training.
  β€’ Feed Forward Layer: Processes the information coming from the attention layer at each position, refining the representation that is later used to predict the next token.
  β€’ Dropout Layers: Prevent overfitting by randomly dropping units during training.
  4. Final Output Layer: The model outputs a 4x50,257 tensor (one row per input token in the diagram's 4-token example), where 50,257 is the size of the vocabulary. Each row is the vector the model uses to predict the token that follows that position.
  5. Goal: The objective is to take these vectors and convert them back into text. Specifically, the last row of the output is used to generate the next word, shown as "forward" in this diagram.
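To make the flow above concrete, the following minimal sketch traces only the tensor shapes through the listed stages, without running any model. The function name `trace_shapes` and the stage labels are illustrative assumptions; the default sizes are the GPT-2 small values used in this section.

```python
def trace_shapes(batch_size, seq_len, emb_dim=768, vocab_size=50257):
    """Return (stage, shape) pairs for one forward pass. Shapes only, no math."""
    hidden = (batch_size, seq_len, emb_dim)
    return [
        ("token IDs", (batch_size, seq_len)),
        ("token + positional embeddings", hidden),
        ("after the transformer blocks", hidden),  # every block preserves the shape
        ("logits", (batch_size, seq_len, vocab_size)),
    ]


# The 4-token example from the diagram, as a batch of one sequence
for stage, shape in trace_shapes(batch_size=1, seq_len=4):
    print(f"{stage:32s} {shape}")
```

Only the last stage changes the final dimension, from `emb_dim` to `vocab_size`; this is where the 4x50,257 output in the diagram comes from.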

Code representation

import torch
import torch.nn as nn
import tiktoken

class GELU(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return 0.5 * x * (1 + torch.tanh(
            torch.sqrt(torch.tensor(2.0 / torch.pi)) *
            (x + 0.044715 * torch.pow(x, 3))
        ))

class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
            GELU(),
            nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),
        )

    def forward(self, x):
        return self.layers(x)

class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert d_out % num_heads == 0, "d_out must be divisible by num_heads"

        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads # Reduce the projection dim to match desired output dim

        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)  # Linear layer to combine head outputs
        self.dropout = nn.Dropout(dropout)
        self.register_buffer('mask', torch.triu(torch.ones(context_length, context_length), diagonal=1))

    def forward(self, x):
        b, num_tokens, d_in = x.shape

        keys = self.W_key(x) # Shape: (b, num_tokens, d_out)
        queries = self.W_query(x)
        values = self.W_value(x)

        # We implicitly split the matrix by adding a `num_heads` dimension
        # Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
        keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
        values = values.view(b, num_tokens, self.num_heads, self.head_dim)
        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)

        # Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)
        keys = keys.transpose(1, 2)
        queries = queries.transpose(1, 2)
        values = values.transpose(1, 2)

        # Compute scaled dot-product attention (aka self-attention) with a causal mask
        attn_scores = queries @ keys.transpose(2, 3)  # Dot product for each head

        # Original mask truncated to the number of tokens and converted to boolean
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]

        # Use the mask to fill attention scores
        attn_scores.masked_fill_(mask_bool, -torch.inf)

        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Shape: (b, num_tokens, num_heads, head_dim)
        context_vec = (attn_weights @ values).transpose(1, 2)

        # Combine heads, where self.d_out = self.num_heads * self.head_dim
        context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
        context_vec = self.out_proj(context_vec) # optional projection

        return context_vec

class LayerNorm(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift

class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.att = MultiHeadAttention(
            d_in=cfg["emb_dim"],
            d_out=cfg["emb_dim"],
            context_length=cfg["context_length"],
            num_heads=cfg["n_heads"],
            dropout=cfg["drop_rate"],
            qkv_bias=cfg["qkv_bias"])
        self.ff = FeedForward(cfg)
        self.norm1 = LayerNorm(cfg["emb_dim"])
        self.norm2 = LayerNorm(cfg["emb_dim"])
        self.drop_shortcut = nn.Dropout(cfg["drop_rate"])

    def forward(self, x):
        # Shortcut connection for attention block
        shortcut = x
        x = self.norm1(x)
        x = self.att(x)  # Shape [batch_size, num_tokens, emb_size]
        x = self.drop_shortcut(x)
        x = x + shortcut  # Add the original input back

        # Shortcut connection for feed forward block
        shortcut = x
        x = self.norm2(x)
        x = self.ff(x)
        x = self.drop_shortcut(x)
        x = x + shortcut  # Add the original input back

        return x


class GPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])

        self.trf_blocks = nn.Sequential(
            *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])

        self.final_norm = LayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(
            cfg["emb_dim"], cfg["vocab_size"], bias=False
        )

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)
        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
        x = tok_embeds + pos_embeds  # Shape [batch_size, num_tokens, emb_size]
        x = self.drop_emb(x)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x)
        return logits

GPT_CONFIG_124M = {
    "vocab_size": 50257,    # Vocabulary size
    "context_length": 1024, # Context length
    "emb_dim": 768,         # Embedding dimension
    "n_heads": 12,          # Number of attention heads
    "n_layers": 12,         # Number of layers
    "drop_rate": 0.1,       # Dropout rate
    "qkv_bias": False       # Query-Key-Value bias
}

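# Example input batch: two 4-token texts encoded with the GPT-2 BPE tokenizer
tokenizer = tiktoken.get_encoding("gpt2")
batch = torch.stack([
    torch.tensor(tokenizer.encode("Every effort moves you")),
    torch.tensor(tokenizer.encode("Every day holds a")),
], dim=0)
torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
out = model(batch)
print("Input batch:\n", batch)
print("\nOutput shape:", out.shape)
print(out)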

GELU Activation Function

# From https://github.com/rasbt/LLMs-from-scratch/tree/main/ch04
class GELU(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        # Tanh approximation of GELU
        return 0.5 * x * (1 + torch.tanh(
            torch.sqrt(torch.tensor(2.0 / torch.pi)) *
            (x + 0.044715 * torch.pow(x, 3))
        ))

λͺ©μ  및 κΈ°λŠ₯

  • GELU (κ°€μš°μ‹œμ•ˆ 였λ₯˜ μ„ ν˜• λ‹¨μœ„): λͺ¨λΈμ— λΉ„μ„ ν˜•μ„±μ„ λ„μž…ν•˜λŠ” ν™œμ„±ν™” ν•¨μˆ˜μž…λ‹ˆλ‹€.
  • λΆ€λ“œλŸ¬μš΄ ν™œμ„±ν™”: 음수 μž…λ ₯을 0으둜 λ§Œλ“œλŠ” ReLU와 달리, GELUλŠ” 음수 μž…λ ₯에 λŒ€ν•΄ μž‘κ³  0이 μ•„λ‹Œ 값을 ν—ˆμš©ν•˜λ©° μž…λ ₯을 좜λ ₯으둜 λΆ€λ“œλŸ½κ²Œ λ§€ν•‘ν•©λ‹ˆλ‹€.
  • μˆ˜ν•™μ  μ •μ˜:

Tip

FeedForward λ ˆμ΄μ–΄ λ‚΄μ˜ μ„ ν˜• λ ˆμ΄μ–΄ 후에 이 ν•¨μˆ˜λ₯Ό μ‚¬μš©ν•˜λŠ” λͺ©μ μ€ μ„ ν˜• 데이터λ₯Ό λΉ„μ„ ν˜•μœΌλ‘œ λ³€κ²½ν•˜μ—¬ λͺ¨λΈμ΄ λ³΅μž‘ν•˜κ³  λΉ„μ„ ν˜•μ μΈ 관계λ₯Ό ν•™μŠ΅ν•  수 μžˆλ„λ‘ ν•˜λŠ” κ²ƒμž…λ‹ˆλ‹€.
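As a quick numerical check of the behaviour described above, here is a minimal standard-library sketch (no PyTorch) that compares the tanh approximation used by the GELU class with the exact erf-based definition and with ReLU. The function names are illustrative assumptions, not part of the original code.

```python
import math


def gelu_tanh(x: float) -> float:
    """Tanh approximation of GELU (the formula used by the GELU class)."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def gelu_exact(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def relu(x: float) -> float:
    return max(0.0, x)


for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  "
          f"gelu_tanh={gelu_tanh(x):.4f}  gelu_exact={gelu_exact(x):.4f}")
```

For negative inputs ReLU returns exactly 0, while both GELU variants return a small negative value, which is the smooth behaviour the bullets describe.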

FeedForward Neural Network

Shapes have been added as comments to make the matrix shapes easier to follow:

# From https://github.com/rasbt/LLMs-from-scratch/tree/main/ch04
class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
            GELU(),
            nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),
        )

    def forward(self, x):
        # x shape: (batch_size, seq_len, emb_dim)

        x = self.layers[0](x)  # x shape: (batch_size, seq_len, 4 * emb_dim)
        x = self.layers[1](x)  # x shape remains: (batch_size, seq_len, 4 * emb_dim)
        x = self.layers[2](x)  # x shape: (batch_size, seq_len, emb_dim)
        return x  # Output shape: (batch_size, seq_len, emb_dim)

λͺ©μ  및 κΈ°λŠ₯

  • μœ„μΉ˜λ³„ FeedForward λ„€νŠΈμ›Œν¬: 각 μœ„μΉ˜μ— λŒ€ν•΄ λ³„λ„λ‘œ λ™μΌν•˜κ²Œ 두 개의 μ™„μ „ μ—°κ²° λ„€νŠΈμ›Œν¬λ₯Ό μ μš©ν•©λ‹ˆλ‹€.
  • λ ˆμ΄μ–΄ 세뢀사항:
  • 첫 번째 μ„ ν˜• λ ˆμ΄μ–΄: 차원을 emb_dimμ—μ„œ 4 * emb_dim으둜 ν™•μž₯ν•©λ‹ˆλ‹€.
  • GELU ν™œμ„±ν™”: λΉ„μ„ ν˜•μ„±μ„ μ μš©ν•©λ‹ˆλ‹€.
  • 두 번째 μ„ ν˜• λ ˆμ΄μ–΄: 차원을 λ‹€μ‹œ emb_dim으둜 μ€„μž…λ‹ˆλ‹€.

Tip

λ³΄μ‹œλ‹€μ‹œν”Ό, Feed Forward λ„€νŠΈμ›Œν¬λŠ” 3개의 λ ˆμ΄μ–΄λ₯Ό μ‚¬μš©ν•©λ‹ˆλ‹€. 첫 λ²ˆμ§ΈλŠ” μ„ ν˜• λ ˆμ΄μ–΄λ‘œ, μ„ ν˜• κ°€μ€‘μΉ˜(λͺ¨λΈ λ‚΄λΆ€μ—μ„œ ν›ˆλ ¨ν•  λ§€κ°œλ³€μˆ˜)λ₯Ό μ‚¬μš©ν•˜μ—¬ 차원을 4배둜 κ³±ν•©λ‹ˆλ‹€. 그런 λ‹€μŒ, GELU ν•¨μˆ˜κ°€ λͺ¨λ“  μ°¨μ›μ—μ„œ μ‚¬μš©λ˜μ–΄ 더 ν’λΆ€ν•œ ν‘œν˜„μ„ ν¬μ°©ν•˜κΈ° μœ„ν•œ λΉ„μ„ ν˜• λ³€ν™”λ₯Ό μ μš©ν•˜κ³ , λ§ˆμ§€λ§‰μœΌλ‘œ 또 λ‹€λ₯Έ μ„ ν˜• λ ˆμ΄μ–΄κ°€ μ›λž˜ 차원 크기둜 되돌리기 μœ„ν•΄ μ‚¬μš©λ©λ‹ˆλ‹€.

닀쀑 ν—€λ“œ 주의 λ©”μ»€λ‹ˆμ¦˜

이것은 이전 μ„Ήμ…˜μ—μ„œ 이미 μ„€λͺ…λ˜μ—ˆμŠ΅λ‹ˆλ‹€.

λͺ©μ  및 κΈ°λŠ₯

  • 닀쀑 ν—€λ“œ 자기 주의: λͺ¨λΈμ΄ 토큰을 인코딩할 λ•Œ μž…λ ₯ μ‹œν€€μŠ€ λ‚΄μ˜ λ‹€μ–‘ν•œ μœ„μΉ˜μ— 집쀑할 수 있게 ν•©λ‹ˆλ‹€.
  • μ£Όμš” ꡬ성 μš”μ†Œ:
  • 쿼리, ν‚€, κ°’: μž…λ ₯의 μ„ ν˜• ν”„λ‘œμ μ…˜μœΌλ‘œ, 주의 점수λ₯Ό κ³„μ‚°ν•˜λŠ” 데 μ‚¬μš©λ©λ‹ˆλ‹€.
  • ν—€λ“œ: λ³‘λ ¬λ‘œ μ‹€ν–‰λ˜λŠ” μ—¬λŸ¬ 주의 λ©”μ»€λ‹ˆμ¦˜(num_heads), 각기 μΆ•μ†Œλœ 차원(head_dim)을 κ°€μ§‘λ‹ˆλ‹€.
  • 주의 점수: 쿼리와 ν‚€μ˜ 내적을 κ³„μ‚°ν•˜μ—¬ μŠ€μΌ€μΌλ§ 및 λ§ˆμŠ€ν‚Ήν•©λ‹ˆλ‹€.
  • λ§ˆμŠ€ν‚Ή: λͺ¨λΈμ΄ 미래의 토큰에 주의λ₯Ό κΈ°μšΈμ΄μ§€ μ•Šλ„λ‘ ν•˜λŠ” 인과 λ§ˆμŠ€ν¬κ°€ μ μš©λ©λ‹ˆλ‹€(자기 νšŒκ·€ λͺ¨λΈμΈ GPT에 μ€‘μš”).
  • 주의 κ°€μ€‘μΉ˜: λ§ˆμŠ€ν‚Ήλ˜κ³  μŠ€μΌ€μΌλœ 주의 점수의 μ†Œν”„νŠΈλ§₯μŠ€μž…λ‹ˆλ‹€.
  • μ»¨ν…μŠ€νŠΈ 벑터: 주의 κ°€μ€‘μΉ˜μ— 따라 κ°’μ˜ 가쀑 ν•©μž…λ‹ˆλ‹€.
  • 좜λ ₯ ν”„λ‘œμ μ…˜: λͺ¨λ“  ν—€λ“œμ˜ 좜λ ₯을 κ²°ν•©ν•˜λŠ” μ„ ν˜• λ ˆμ΄μ–΄μž…λ‹ˆλ‹€.

Tip

이 λ„€νŠΈμ›Œν¬μ˜ λͺ©ν‘œλŠ” λ™μΌν•œ μ»¨ν…μŠ€νŠΈ λ‚΄μ—μ„œ 토큰 κ°„μ˜ 관계λ₯Ό μ°ΎλŠ” κ²ƒμž…λ‹ˆλ‹€. λ˜ν•œ, 토큰은 과적합을 λ°©μ§€ν•˜κΈ° μœ„ν•΄ μ„œλ‘œ λ‹€λ₯Έ ν—€λ“œλ‘œ λ‚˜λ‰˜μ§€λ§Œ, 각 ν—€λ“œμ—μ„œ 발견된 μ΅œμ’… κ΄€κ³„λŠ” 이 λ„€νŠΈμ›Œν¬μ˜ λμ—μ„œ κ²°ν•©λ©λ‹ˆλ‹€.

λ˜ν•œ, ν›ˆλ ¨ 쀑에 인과 λ§ˆμŠ€ν¬κ°€ μ μš©λ˜μ–΄ λ‚˜μ€‘μ˜ 토큰이 νŠΉμ • ν† ν°κ³Όμ˜ 관계λ₯Ό 찾을 λ•Œ κ³ λ €λ˜μ§€ μ•ŠμœΌλ©°, 과적합을 λ°©μ§€ν•˜κΈ° μœ„ν•΄ 일뢀 λ“œλ‘­μ•„μ›ƒλ„ μ μš©λ©λ‹ˆλ‹€.

λ ˆμ΄μ–΄ μ •κ·œν™”

# From https://github.com/rasbt/LLMs-from-scratch/tree/main/ch04
class LayerNorm(nn.Module):
def __init__(self, emb_dim):
super().__init__()
self.eps = 1e-5 # Prevent division by zero during normalization.
self.scale = nn.Parameter(torch.ones(emb_dim))
self.shift = nn.Parameter(torch.zeros(emb_dim))

def forward(self, x):
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)
norm_x = (x - mean) / torch.sqrt(var + self.eps)
return self.scale * norm_x + self.shift

λͺ©μ  및 κΈ°λŠ₯

  • Layer Normalization: 배치의 각 κ°œλ³„ μ˜ˆμ œμ— λŒ€ν•΄ νŠΉμ§•(μž„λ² λ”© 차원) μ „λ°˜μ— 걸쳐 μž…λ ₯을 μ •κ·œν™”ν•˜λŠ” 데 μ‚¬μš©λ˜λŠ” κΈ°μˆ μž…λ‹ˆλ‹€.
  • ꡬ성 μš”μ†Œ:
  • eps: μ •κ·œν™” 쀑 0으둜 λ‚˜λˆ„λŠ” 것을 λ°©μ§€ν•˜κΈ° μœ„ν•΄ 뢄산에 μΆ”κ°€λ˜λŠ” μž‘μ€ μƒμˆ˜(1e-5)μž…λ‹ˆλ‹€.
  • scale 및 shift: μ •κ·œν™”λœ 좜λ ₯을 μŠ€μΌ€μΌν•˜κ³  이동할 수 μžˆλ„λ‘ ν•˜λŠ” ν•™μŠ΅ κ°€λŠ₯ν•œ λ§€κ°œλ³€μˆ˜(nn.Parameter)μž…λ‹ˆλ‹€. 각각 1κ³Ό 0으둜 μ΄ˆκΈ°ν™”λ©λ‹ˆλ‹€.
  • μ •κ·œν™” κ³Όμ •:
  • 평균 계산(mean): μž„λ² λ”© 차원(dim=-1)에 걸쳐 μž…λ ₯ x의 평균을 κ³„μ‚°ν•˜λ©°, λΈŒλ‘œλ“œμΊμŠ€νŒ…μ„ μœ„ν•΄ 차원을 μœ μ§€ν•©λ‹ˆλ‹€(keepdim=True).
  • λΆ„μ‚° 계산(var): μž„λ² λ”© 차원에 걸쳐 x의 뢄산을 κ³„μ‚°ν•˜λ©°, 차원을 μœ μ§€ν•©λ‹ˆλ‹€. unbiased=False λ§€κ°œλ³€μˆ˜λŠ” 뢄산이 편ν–₯ μΆ”μ •κΈ°λ₯Ό μ‚¬μš©ν•˜μ—¬ κ³„μ‚°λ˜λ„λ‘ 보μž₯ν•©λ‹ˆλ‹€(μƒ˜ν”Œμ΄ μ•„λ‹Œ νŠΉμ§•μ— λŒ€ν•΄ μ •κ·œν™”ν•  λ•Œ μ ν•©ν•œ N으둜 λ‚˜λˆ„κΈ°).
  • μ •κ·œν™”(norm_x): xμ—μ„œ 평균을 λΉΌκ³  뢄산에 epsλ₯Ό λ”ν•œ κ°’μ˜ 제곱근으둜 λ‚˜λˆ•λ‹ˆλ‹€.
  • μŠ€μΌ€μΌ 및 이동: μ •κ·œν™”λœ 좜λ ₯에 ν•™μŠ΅ κ°€λŠ₯ν•œ scale 및 shift λ§€κ°œλ³€μˆ˜λ₯Ό μ μš©ν•©λ‹ˆλ‹€.

Tip

λͺ©ν‘œλŠ” λ™μΌν•œ ν† ν°μ˜ λͺ¨λ“  μ°¨μ›μ—μ„œ 평균이 0이고 뢄산이 1이 λ˜λ„λ‘ ν•˜λŠ” κ²ƒμž…λ‹ˆλ‹€. μ΄λŠ” λ”₯ λ‰΄λŸ΄ λ„€νŠΈμ›Œν¬μ˜ ν›ˆλ ¨μ„ μ•ˆμ •ν™”ν•˜κΈ° μœ„ν•΄ λ‚΄λΆ€ κ³΅λ³€λŸ‰ 이동을 μ€„μ΄λŠ” 것을 λͺ©ν‘œλ‘œ ν•˜λ©°, μ΄λŠ” ν›ˆλ ¨ 쀑 λ§€κ°œλ³€μˆ˜ μ—…λ°μ΄νŠΈλ‘œ μΈν•œ λ„€νŠΈμ›Œν¬ ν™œμ„±ν™”μ˜ 뢄포 변화와 관련이 μžˆμŠ΅λ‹ˆλ‹€.

Transformer Block

ν–‰λ ¬μ˜ ν˜•νƒœλ₯Ό 더 잘 μ΄ν•΄ν•˜κΈ° μœ„ν•΄ μ£Όμ„μœΌλ‘œ μΆ”κ°€λ˜μ—ˆμŠ΅λ‹ˆλ‹€:

# From https://github.com/rasbt/LLMs-from-scratch/tree/main/ch04

class TransformerBlock(nn.Module):
def __init__(self, cfg):
super().__init__()
self.att = MultiHeadAttention(
d_in=cfg["emb_dim"],
d_out=cfg["emb_dim"],
context_length=cfg["context_length"],
num_heads=cfg["n_heads"],
dropout=cfg["drop_rate"],
qkv_bias=cfg["qkv_bias"]
)
self.ff = FeedForward(cfg)
self.norm1 = LayerNorm(cfg["emb_dim"])
self.norm2 = LayerNorm(cfg["emb_dim"])
self.drop_shortcut = nn.Dropout(cfg["drop_rate"])

def forward(self, x):
# x shape: (batch_size, seq_len, emb_dim)

# Shortcut connection for attention block
shortcut = x  # shape: (batch_size, seq_len, emb_dim)
x = self.norm1(x)  # shape remains (batch_size, seq_len, emb_dim)
x = self.att(x)    # shape: (batch_size, seq_len, emb_dim)
x = self.drop_shortcut(x)  # shape remains (batch_size, seq_len, emb_dim)
x = x + shortcut   # shape: (batch_size, seq_len, emb_dim)

# Shortcut connection for feedforward block
shortcut = x       # shape: (batch_size, seq_len, emb_dim)
x = self.norm2(x)  # shape remains (batch_size, seq_len, emb_dim)
x = self.ff(x)     # shape: (batch_size, seq_len, emb_dim)
x = self.drop_shortcut(x)  # shape remains (batch_size, seq_len, emb_dim)
x = x + shortcut   # shape: (batch_size, seq_len, emb_dim)

return x  # Output shape: (batch_size, seq_len, emb_dim)

λͺ©μ  및 κΈ°λŠ₯

  • 측의 ꡬ성: 닀쀑 ν—€λ“œ 주의, ν”Όλ“œν¬μ›Œλ“œ λ„€νŠΈμ›Œν¬, μΈ΅ μ •κ·œν™” 및 μž”μ—¬ 연결을 κ²°ν•©ν•©λ‹ˆλ‹€.
  • μΈ΅ μ •κ·œν™”: μ•ˆμ •μ μΈ ν›ˆλ ¨μ„ μœ„ν•΄ 주의 및 ν”Όλ“œν¬μ›Œλ“œ μΈ΅ 전에 μ μš©λ©λ‹ˆλ‹€.
  • μž”μ—¬ μ—°κ²° (단좕): 측의 μž…λ ₯을 좜λ ₯에 μΆ”κ°€ν•˜μ—¬ κ·Έλž˜λ””μ–ΈνŠΈ 흐름을 κ°œμ„ ν•˜κ³  κΉŠμ€ λ„€νŠΈμ›Œν¬μ˜ ν›ˆλ ¨μ„ κ°€λŠ₯ν•˜κ²Œ ν•©λ‹ˆλ‹€.
  • λ“œλ‘­μ•„μ›ƒ: μ •κ·œν™”λ₯Ό μœ„ν•΄ 주의 및 ν”Όλ“œν¬μ›Œλ“œ μΈ΅ 후에 μ μš©λ©λ‹ˆλ‹€.

단계별 κΈ°λŠ₯

  1. 첫 번째 μž”μ—¬ 경둜 (자기 주의):
  • μž…λ ₯ (shortcut): μž”μ—¬ 연결을 μœ„ν•΄ μ›λž˜ μž…λ ₯을 μ €μž₯ν•©λ‹ˆλ‹€.
  • μΈ΅ μ •κ·œν™” (norm1): μž…λ ₯을 μ •κ·œν™”ν•©λ‹ˆλ‹€.
  • 닀쀑 ν—€λ“œ 주의 (att): 자기 주의λ₯Ό μ μš©ν•©λ‹ˆλ‹€.
  • λ“œλ‘­μ•„μ›ƒ (drop_shortcut): μ •κ·œν™”λ₯Ό μœ„ν•΄ λ“œλ‘­μ•„μ›ƒμ„ μ μš©ν•©λ‹ˆλ‹€.
  • μž”μ—¬ μΆ”κ°€ (x + shortcut): μ›λž˜ μž…λ ₯κ³Ό κ²°ν•©ν•©λ‹ˆλ‹€.
  1. 두 번째 μž”μ—¬ 경둜 (ν”Όλ“œν¬μ›Œλ“œ):
  • μž…λ ₯ (shortcut): λ‹€μŒ μž”μ—¬ 연결을 μœ„ν•΄ μ—…λ°μ΄νŠΈλœ μž…λ ₯을 μ €μž₯ν•©λ‹ˆλ‹€.
  • μΈ΅ μ •κ·œν™” (norm2): μž…λ ₯을 μ •κ·œν™”ν•©λ‹ˆλ‹€.
  • ν”Όλ“œν¬μ›Œλ“œ λ„€νŠΈμ›Œν¬ (ff): ν”Όλ“œν¬μ›Œλ“œ λ³€ν™˜μ„ μ μš©ν•©λ‹ˆλ‹€.
  • λ“œλ‘­μ•„μ›ƒ (drop_shortcut): λ“œλ‘­μ•„μ›ƒμ„ μ μš©ν•©λ‹ˆλ‹€.
  • μž”μ—¬ μΆ”κ°€ (x + shortcut): 첫 번째 μž”μ—¬ 경둜의 μž…λ ₯κ³Ό κ²°ν•©ν•©λ‹ˆλ‹€.

Tip

λ³€ν™˜κΈ° 블둝은 λͺ¨λ“  λ„€νŠΈμ›Œν¬λ₯Ό ν•¨κ»˜ κ·Έλ£Ήν™”ν•˜κ³  ν›ˆλ ¨ μ•ˆμ •μ„±κ³Ό κ²°κ³Όλ₯Ό κ°œμ„ ν•˜κΈ° μœ„ν•΄ 일뢀 μ •κ·œν™” 및 λ“œλ‘­μ•„μ›ƒμ„ μ μš©ν•©λ‹ˆλ‹€.
λ“œλ‘­μ•„μ›ƒμ΄ 각 λ„€νŠΈμ›Œν¬ μ‚¬μš© 후에 μˆ˜ν–‰λ˜κ³  μ •κ·œν™”κ°€ 이전에 μ μš©λ˜λŠ” 방식을 μ£Όλͺ©ν•˜μ„Έμš”.

λ˜ν•œ, λ„€νŠΈμ›Œν¬μ˜ 좜λ ₯을 μž…λ ₯κ³Ό λ”ν•˜λŠ” 단좕을 μ‚¬μš©ν•©λ‹ˆλ‹€. μ΄λŠ” 초기 측이 λ§ˆμ§€λ§‰ 측만큼 β€œλ§Žμ΄β€ κΈ°μ—¬ν•˜λ„λ‘ ν•˜μ—¬ μ†Œμ‹€ κ·Έλž˜λ””μ–ΈνŠΈ 문제λ₯Ό λ°©μ§€ν•˜λŠ” 데 도움이 λ©λ‹ˆλ‹€.

GPTModel

ν–‰λ ¬μ˜ ν˜•νƒœλ₯Ό 더 잘 μ΄ν•΄ν•˜κΈ° μœ„ν•΄ μ£Όμ„μœΌλ‘œ ν˜•νƒœκ°€ μΆ”κ°€λ˜μ—ˆμŠ΅λ‹ˆλ‹€:

# From https://github.com/rasbt/LLMs-from-scratch/tree/main/ch04
class GPTModel(nn.Module):
def __init__(self, cfg):
super().__init__()
self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
# shape: (vocab_size, emb_dim)

self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
# shape: (context_length, emb_dim)

self.drop_emb = nn.Dropout(cfg["drop_rate"])

self.trf_blocks = nn.Sequential(
*[TransformerBlock(cfg) for _ in range(cfg["n_layers"])]
)
# Stack of TransformerBlocks

self.final_norm = LayerNorm(cfg["emb_dim"])
self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False)
# shape: (emb_dim, vocab_size)

def forward(self, in_idx):
# in_idx shape: (batch_size, seq_len)
batch_size, seq_len = in_idx.shape

# Token embeddings
tok_embeds = self.tok_emb(in_idx)
# shape: (batch_size, seq_len, emb_dim)

# Positional embeddings
pos_indices = torch.arange(seq_len, device=in_idx.device)
# shape: (seq_len,)
pos_embeds = self.pos_emb(pos_indices)
# shape: (seq_len, emb_dim)

# Add token and positional embeddings
x = tok_embeds + pos_embeds  # Broadcasting over batch dimension
# x shape: (batch_size, seq_len, emb_dim)

x = self.drop_emb(x)  # Dropout applied
# x shape remains: (batch_size, seq_len, emb_dim)

x = self.trf_blocks(x)  # Pass through Transformer blocks
# x shape remains: (batch_size, seq_len, emb_dim)

x = self.final_norm(x)  # Final LayerNorm
# x shape remains: (batch_size, seq_len, emb_dim)

logits = self.out_head(x)  # Project to vocabulary size
# logits shape: (batch_size, seq_len, vocab_size)

return logits  # Output shape: (batch_size, seq_len, vocab_size)

λͺ©μ  및 κΈ°λŠ₯

  • μž„λ² λ”© λ ˆμ΄μ–΄:
  • 토큰 μž„λ² λ”© (tok_emb): 토큰 인덱슀λ₯Ό μž„λ² λ”©μœΌλ‘œ λ³€ν™˜ν•©λ‹ˆλ‹€. 이듀은 μ–΄νœ˜μ˜ 각 ν† ν°μ˜ 각 차원에 μ£Όμ–΄μ§„ κ°€μ€‘μΉ˜μž…λ‹ˆλ‹€.
  • μœ„μΉ˜ μž„λ² λ”© (pos_emb): μž„λ² λ”©μ— μœ„μΉ˜ 정보λ₯Ό μΆ”κ°€ν•˜μ—¬ ν† ν°μ˜ μˆœμ„œλ₯Ό μΊ‘μ²˜ν•©λ‹ˆλ‹€. 이듀은 ν…μŠ€νŠΈμ—μ„œμ˜ μœ„μΉ˜μ— 따라 토큰에 μ£Όμ–΄μ§„ κ°€μ€‘μΉ˜μž…λ‹ˆλ‹€.
  • λ“œλ‘­μ•„μ›ƒ (drop_emb): μ •κ·œν™”λ₯Ό μœ„ν•΄ μž„λ² λ”©μ— μ μš©λ©λ‹ˆλ‹€.
  • 트랜슀포머 블둝 (trf_blocks): μž„λ² λ”©μ„ μ²˜λ¦¬ν•˜κΈ° μœ„ν•œ n_layers 트랜슀포머 λΈ”λ‘μ˜ μŠ€νƒμž…λ‹ˆλ‹€.
  • μ΅œμ’… μ •κ·œν™” (final_norm): 좜λ ₯ λ ˆμ΄μ–΄ 전에 λ ˆμ΄μ–΄ μ •κ·œν™”κ°€ μ μš©λ©λ‹ˆλ‹€.
  • 좜λ ₯ λ ˆμ΄μ–΄ (out_head): μ΅œμ’… 은닉 μƒνƒœλ₯Ό μ–΄νœ˜ 크기둜 ν”„λ‘œμ μ…˜ν•˜μ—¬ μ˜ˆμΈ‘μ„ μœ„ν•œ λ‘œμ§“μ„ μƒμ„±ν•©λ‹ˆλ‹€.

Tip

이 클래슀의 λͺ©ν‘œλŠ” μ‹œν€€μŠ€μ—μ„œ λ‹€μŒ 토큰을 μ˜ˆμΈ‘ν•˜κΈ° μœ„ν•΄ μ–ΈκΈ‰λœ λͺ¨λ“  λ‹€λ₯Έ λ„€νŠΈμ›Œν¬λ₯Ό μ‚¬μš©ν•˜λŠ” κ²ƒμž…λ‹ˆλ‹€. μ΄λŠ” ν…μŠ€νŠΈ 생성과 같은 μž‘μ—…μ— κΈ°λ³Έμ μž…λ‹ˆλ‹€.

μ–Όλ§ˆλ‚˜ λ§Žμ€ 트랜슀포머 블둝이 μ‚¬μš©λ  것인지 λͺ…μ‹œλœ λŒ€λ‘œ μ‚¬μš©ν•  것인지 μ£Όλͺ©ν•˜μ‹­μ‹œμ˜€. 각 트랜슀포머 블둝은 ν•˜λ‚˜μ˜ 닀쀑 ν—€λ“œ 주의 λ„€νŠΈμ›Œν¬, ν•˜λ‚˜μ˜ ν”Όλ“œ ν¬μ›Œλ“œ λ„€νŠΈμ›Œν¬ 및 μ—¬λŸ¬ μ •κ·œν™”λ₯Ό μ‚¬μš©ν•©λ‹ˆλ‹€. λ”°λΌμ„œ 12개의 트랜슀포머 블둝이 μ‚¬μš©λ˜λ©΄ 이λ₯Ό 12둜 κ³±ν•©λ‹ˆλ‹€.

λ˜ν•œ, 좜λ ₯ 전에 μ •κ·œν™” λ ˆμ΄μ–΄κ°€ μΆ”κ°€λ˜κ³ , λ§ˆμ§€λ§‰μ— μ μ ˆν•œ μ°¨μ›μ˜ κ²°κ³Όλ₯Ό μ–»κΈ° μœ„ν•΄ μ΅œμ’… μ„ ν˜• λ ˆμ΄μ–΄κ°€ μ μš©λ©λ‹ˆλ‹€. 각 μ΅œμ’… λ²‘ν„°μ˜ 크기가 μ‚¬μš©λœ μ–΄νœ˜μ˜ 크기와 κ°™λ‹€λŠ” 점에 μœ μ˜ν•˜μ‹­μ‹œμ˜€. μ΄λŠ” μ–΄νœ˜ λ‚΄μ˜ κ°€λŠ₯ν•œ 각 토큰에 λŒ€ν•œ ν™•λ₯ μ„ μ–»μœΌλ €λŠ” κ²ƒμž…λ‹ˆλ‹€.

ν›ˆλ ¨ν•  λ§€κ°œλ³€μˆ˜ 수

GPT ꡬ쑰가 μ •μ˜λ˜λ©΄ ν›ˆλ ¨ν•  λ§€κ°œλ³€μˆ˜ 수λ₯Ό μ•Œμ•„λ‚Ό 수 μžˆμŠ΅λ‹ˆλ‹€:

GPT_CONFIG_124M = {
"vocab_size": 50257,    # Vocabulary size
"context_length": 1024, # Context length
"emb_dim": 768,         # Embedding dimension
"n_heads": 12,          # Number of attention heads
"n_layers": 12,         # Number of layers
"drop_rate": 0.1,       # Dropout rate
"qkv_bias": False       # Query-Key-Value bias
}

model = GPTModel(GPT_CONFIG_124M)
total_params = sum(p.numel() for p in model.parameters())
print(f"Total number of parameters: {total_params:,}")
# Total number of parameters: 163,009,536

단계별 계산

1. Embedding Layers: Token Embedding and Position Embedding

  β€’ Layer: nn.Embedding(vocab_size, emb_dim)
  β€’ Parameters: vocab_size * emb_dim
token_embedding_params = 50257 * 768 = 38,597,376
  β€’ Layer: nn.Embedding(context_length, emb_dim)
  β€’ Parameters: context_length * emb_dim
position_embedding_params = 1024 * 768 = 786,432

Total Embedding Parameters

embedding_params = token_embedding_params + position_embedding_params
embedding_params = 38,597,376 + 786,432 = 39,383,808

2. Transformer Blocks

There are 12 transformer blocks, so we calculate the parameters for one block and then multiply by 12.

Parameters per Transformer Block

a. Multi-Head Attention

  β€’ Components:
  β€’ Query Linear Layer (W_query): nn.Linear(emb_dim, emb_dim, bias=False)
  β€’ Key Linear Layer (W_key): nn.Linear(emb_dim, emb_dim, bias=False)
  β€’ Value Linear Layer (W_value): nn.Linear(emb_dim, emb_dim, bias=False)
  β€’ Output Projection (out_proj): nn.Linear(emb_dim, emb_dim)
  β€’ Calculations:
  β€’ Each of W_query, W_key, W_value:
qkv_params = emb_dim * emb_dim = 768 * 768 = 589,824

Since there are three such layers:

total_qkv_params = 3 * qkv_params = 3 * 589,824 = 1,769,472
  β€’ Output Projection (out_proj):
out_proj_params = (emb_dim * emb_dim) + emb_dim = (768 * 768) + 768 = 589,824 + 768 = 590,592
  β€’ Total Multi-Head Attention Parameters:
mha_params = total_qkv_params + out_proj_params
mha_params = 1,769,472 + 590,592 = 2,360,064
b. FeedForward Network

  β€’ Components:
  β€’ First Linear Layer: nn.Linear(emb_dim, 4 * emb_dim)
  β€’ Second Linear Layer: nn.Linear(4 * emb_dim, emb_dim)
  β€’ Calculations:
  β€’ First Linear Layer:
ff_first_layer_params = (emb_dim * 4 * emb_dim) + (4 * emb_dim)
ff_first_layer_params = (768 * 3072) + 3072 = 2,359,296 + 3,072 = 2,362,368
  β€’ Second Linear Layer:
ff_second_layer_params = (4 * emb_dim * emb_dim) + emb_dim
ff_second_layer_params = (3072 * 768) + 768 = 2,359,296 + 768 = 2,360,064
  β€’ Total FeedForward Parameters:
ff_params = ff_first_layer_params + ff_second_layer_params
ff_params = 2,362,368 + 2,360,064 = 4,722,432

c. Layer Normalizations

  β€’ Components:
  β€’ Two LayerNorm instances per block.
  β€’ Each LayerNorm has 2 * emb_dim parameters (scale and shift).
  β€’ Calculations:
layer_norm_params_per_block = 2 * (2 * emb_dim) = 2 * 768 * 2 = 3,072

d. Total Parameters per Transformer Block

params_per_block = mha_params + ff_params + layer_norm_params_per_block
params_per_block = 2,360,064 + 4,722,432 + 3,072 = 7,085,568

Total Parameters for All Transformer Blocks

total_transformer_blocks_params = params_per_block * n_layers
total_transformer_blocks_params = 7,085,568 * 12 = 85,026,816

3. Final Layers

a. Final Layer Normalization

  β€’ Parameters: 2 * emb_dim (scale and shift)
final_layer_norm_params = 2 * 768 = 1,536

b. Output Projection Layer (out_head)

  β€’ Layer: nn.Linear(emb_dim, vocab_size, bias=False)
  β€’ Parameters: emb_dim * vocab_size
output_projection_params = 768 * 50257 = 38,597,376

4. Summing Up All Parameters

total_params = (
    embedding_params +
    total_transformer_blocks_params +
    final_layer_norm_params +
    output_projection_params
)
total_params = (
    39,383,808 +
    85,026,816 +
    1,536 +
    38,597,376
)
total_params = 163,009,536
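The arithmetic above can be reproduced from the configuration alone, without instantiating the model. The following is a minimal sketch; the function `count_gpt_params` is illustrative, and its `tie_output_weights` flag is a hypothetical option that models sharing the output projection with the token embedding matrix, as the original GPT-2 does, which is why the configuration is named "124M".

```python
GPT_CONFIG_124M = {
    "vocab_size": 50257,
    "context_length": 1024,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False,
}


def count_gpt_params(cfg, tie_output_weights=False):
    """Count trainable parameters following the step-by-step calculation."""
    d, v = cfg["emb_dim"], cfg["vocab_size"]
    embeddings = v * d + cfg["context_length"] * d
    qkv = 3 * (d * d + (d if cfg["qkv_bias"] else 0))
    out_proj = d * d + d
    feed_forward = (d * 4 * d + 4 * d) + (4 * d * d + d)
    layer_norms = 2 * (2 * d)
    per_block = qkv + out_proj + feed_forward + layer_norms
    final_norm = 2 * d
    out_head = 0 if tie_output_weights else d * v
    return embeddings + cfg["n_layers"] * per_block + final_norm + out_head


print(f"{count_gpt_params(GPT_CONFIG_124M):,}")
print(f"{count_gpt_params(GPT_CONFIG_124M, tie_output_weights=True):,}")
```

Note that the number of heads does not appear in the count: splitting `emb_dim` across heads changes how the projections are used, not how many weights they contain.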

Generate Text

λͺ¨λΈμ΄ 이전과 같은 λ‹€μŒ 토큰을 μ˜ˆμΈ‘ν•˜λŠ” 경우, 좜λ ₯μ—μ„œ λ§ˆμ§€λ§‰ 토큰 값을 κ°€μ Έμ˜€κΈ°λ§Œ ν•˜λ©΄ λ©λ‹ˆλ‹€(예츑된 ν† ν°μ˜ 값이 될 κ²ƒμ΄λ―€λ‘œ). μ΄λŠ” μ–΄νœ˜μ˜ 각 ν•­λͺ©μ— λŒ€ν•œ 값이 될 것이며, 그런 λ‹€μŒ softmax ν•¨μˆ˜λ₯Ό μ‚¬μš©ν•˜μ—¬ 차원을 ν™•λ₯ λ‘œ μ •κ·œν™”ν•˜μ—¬ 합이 1이 λ˜λ„λ‘ ν•˜κ³ , κ°€μž₯ 큰 ν•­λͺ©μ˜ 인덱슀λ₯Ό κ°€μ Έμ˜΅λ‹ˆλ‹€. 이 μΈλ±μŠ€λŠ” μ–΄νœ˜ λ‚΄μ˜ 단어 μΈλ±μŠ€κ°€ λ©λ‹ˆλ‹€.

Code from https://github.com/rasbt/LLMs-from-scratch/blob/main/ch04/01_main-chapter-code/ch04.ipynb:

def generate_text_simple(model, idx, max_new_tokens, context_size):
    # idx is (batch, n_tokens) array of indices in the current context
    for _ in range(max_new_tokens):

        # Crop current context if it exceeds the supported context size
        # E.g., if LLM supports only 5 tokens, and the context size is 10
        # then only the last 5 tokens are used as context
        idx_cond = idx[:, -context_size:]

        # Get the predictions
        with torch.no_grad():
            logits = model(idx_cond)

        # Focus only on the last time step
        # (batch, n_tokens, vocab_size) becomes (batch, vocab_size)
        logits = logits[:, -1, :]

        # Apply softmax to get probabilities
        probas = torch.softmax(logits, dim=-1)  # (batch, vocab_size)

        # Get the idx of the vocab entry with the highest probability value
        idx_next = torch.argmax(probas, dim=-1, keepdim=True)  # (batch, 1)

        # Append sampled index to the running sequence
        idx = torch.cat((idx, idx_next), dim=1)  # (batch, n_tokens+1)

    return idx


start_context = "Hello, I am"

# GPT-2 BPE tokenizer (tiktoken is imported in the first code block)
tokenizer = tiktoken.get_encoding("gpt2")
encoded = tokenizer.encode(start_context)
print("encoded:", encoded)

encoded_tensor = torch.tensor(encoded).unsqueeze(0)
print("encoded_tensor.shape:", encoded_tensor.shape)

model.eval() # disable dropout

out = generate_text_simple(
    model=model,
    idx=encoded_tensor,
    max_new_tokens=6,
    context_size=GPT_CONFIG_124M["context_length"]
)

print("Output:", out)
print("Output length:", len(out[0]))
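The greedy loop in generate_text_simple can be followed step by step with a toy stand-in for the model. The sketch below is unbatched plain Python under stated assumptions: `toy_model` is a hypothetical function that returns logits for the last position only and always favours the token after the last one (modulo a vocabulary of 5), and `generate_greedy` is an illustrative name.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_model(context, vocab_size=5):
    """Stand-in for the LLM: favours the token after the last one (mod vocab)."""
    target = (context[-1] + 1) % vocab_size
    return [3.0 if tok == target else 0.0 for tok in range(vocab_size)]


def generate_greedy(model, ids, max_new_tokens, context_size):
    ids = list(ids)
    for _ in range(max_new_tokens):
        context = ids[-context_size:]     # crop to the supported context size
        probas = softmax(model(context))  # probabilities for the next token
        next_id = max(range(len(probas)), key=probas.__getitem__)  # argmax
        ids.append(next_id)
    return ids


print(generate_greedy(toy_model, [0, 1], max_new_tokens=4, context_size=2))
# prints [0, 1, 2, 3, 4, 0]
```

Because softmax preserves the ordering of its inputs, taking the argmax of the raw logits would select the same token; the softmax step matters once tokens are sampled from the probabilities instead of chosen greedily.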
