5. LLM Architecture
LLM Architecture
Tip
The goal of this fifth phase is very simple: develop the architecture of the full LLM. Put everything together, apply all the layers, and create all the functions to generate text or to transform text into IDs and back.
This architecture will be used both for training and for predicting text once it has been trained.
An example of an LLM architecture can be found at https://github.com/rasbt/LLMs-from-scratch/blob/main/ch04/01_main-chapter-code/ch04.ipynb:
A high-level representation can be observed in:
(Figure: high-level diagram of the GPT architecture, from tokenized input through the embedding layers and 12 transformer blocks to the final output layer.)
- Input (Tokenized Text): The process begins with tokenized text, which is converted into numerical representations.
- Token Embedding and Positional Embedding Layer: The tokenized text passes through a token embedding layer and a positional embedding layer, which captures the position of each token in the sequence, critical for understanding word order.
- Transformer Blocks: The model contains 12 transformer blocks, each made up of multiple layers. These blocks repeat the following sequence:
- Masked Multi-Head Attention: Allows the model to focus on different parts of the input text at the same time.
- Layer Normalization: A normalization step to stabilize and improve training.
- Feed Forward Layer: Responsible for processing the information from the attention layers and making predictions about the next token.
- Dropout Layers: These layers prevent overfitting by randomly dropping units during training.
- Final Output Layer: The model outputs a 4x50,257-dimensional tensor, where 50,257 is the vocabulary size. Each row of this tensor corresponds to a vector the model uses to predict the next word in the sequence.
- Goal: The objective is to take these embeddings and convert them back into text. Specifically, the last row of the output is used to generate the next word, shown as "forward" in the diagram.
Code representation
import torch
import torch.nn as nn
import tiktoken
class GELU(nn.Module):
def __init__(self):
super().__init__()
def forward(self, x):
return 0.5 * x * (1 + torch.tanh(
torch.sqrt(torch.tensor(2.0 / torch.pi)) *
(x + 0.044715 * torch.pow(x, 3))
))
class FeedForward(nn.Module):
def __init__(self, cfg):
super().__init__()
self.layers = nn.Sequential(
nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
GELU(),
nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),
)
def forward(self, x):
return self.layers(x)
class MultiHeadAttention(nn.Module):
def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
super().__init__()
assert d_out % num_heads == 0, "d_out must be divisible by num_heads"
self.d_out = d_out
self.num_heads = num_heads
self.head_dim = d_out // num_heads # Reduce the projection dim to match desired output dim
self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
self.out_proj = nn.Linear(d_out, d_out) # Linear layer to combine head outputs
self.dropout = nn.Dropout(dropout)
self.register_buffer('mask', torch.triu(torch.ones(context_length, context_length), diagonal=1))
def forward(self, x):
b, num_tokens, d_in = x.shape
keys = self.W_key(x) # Shape: (b, num_tokens, d_out)
queries = self.W_query(x)
values = self.W_value(x)
# We implicitly split the matrix by adding a `num_heads` dimension
# Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
values = values.view(b, num_tokens, self.num_heads, self.head_dim)
queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)
# Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)
keys = keys.transpose(1, 2)
queries = queries.transpose(1, 2)
values = values.transpose(1, 2)
# Compute scaled dot-product attention (aka self-attention) with a causal mask
attn_scores = queries @ keys.transpose(2, 3) # Dot product for each head
# Original mask truncated to the number of tokens and converted to boolean
mask_bool = self.mask.bool()[:num_tokens, :num_tokens]
# Use the mask to fill attention scores
attn_scores.masked_fill_(mask_bool, -torch.inf)
attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
attn_weights = self.dropout(attn_weights)
# Shape: (b, num_tokens, num_heads, head_dim)
context_vec = (attn_weights @ values).transpose(1, 2)
# Combine heads, where self.d_out = self.num_heads * self.head_dim
context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
context_vec = self.out_proj(context_vec) # optional projection
return context_vec
class LayerNorm(nn.Module):
def __init__(self, emb_dim):
super().__init__()
self.eps = 1e-5
self.scale = nn.Parameter(torch.ones(emb_dim))
self.shift = nn.Parameter(torch.zeros(emb_dim))
def forward(self, x):
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)
norm_x = (x - mean) / torch.sqrt(var + self.eps)
return self.scale * norm_x + self.shift
class TransformerBlock(nn.Module):
def __init__(self, cfg):
super().__init__()
self.att = MultiHeadAttention(
d_in=cfg["emb_dim"],
d_out=cfg["emb_dim"],
context_length=cfg["context_length"],
num_heads=cfg["n_heads"],
dropout=cfg["drop_rate"],
qkv_bias=cfg["qkv_bias"])
self.ff = FeedForward(cfg)
self.norm1 = LayerNorm(cfg["emb_dim"])
self.norm2 = LayerNorm(cfg["emb_dim"])
self.drop_shortcut = nn.Dropout(cfg["drop_rate"])
def forward(self, x):
# Shortcut connection for attention block
shortcut = x
x = self.norm1(x)
x = self.att(x) # Shape [batch_size, num_tokens, emb_size]
x = self.drop_shortcut(x)
x = x + shortcut # Add the original input back
# Shortcut connection for feed forward block
shortcut = x
x = self.norm2(x)
x = self.ff(x)
x = self.drop_shortcut(x)
x = x + shortcut # Add the original input back
return x
class GPTModel(nn.Module):
def __init__(self, cfg):
super().__init__()
self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
self.drop_emb = nn.Dropout(cfg["drop_rate"])
self.trf_blocks = nn.Sequential(
*[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])
self.final_norm = LayerNorm(cfg["emb_dim"])
self.out_head = nn.Linear(
cfg["emb_dim"], cfg["vocab_size"], bias=False
)
def forward(self, in_idx):
batch_size, seq_len = in_idx.shape
tok_embeds = self.tok_emb(in_idx)
pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
x = tok_embeds + pos_embeds # Shape [batch_size, num_tokens, emb_size]
x = self.drop_emb(x)
x = self.trf_blocks(x)
x = self.final_norm(x)
logits = self.out_head(x)
return logits
GPT_CONFIG_124M = {
"vocab_size": 50257, # Vocabulary size
"context_length": 1024, # Context length
"emb_dim": 768, # Embedding dimension
"n_heads": 12, # Number of attention heads
"n_layers": 12, # Number of layers
"drop_rate": 0.1, # Dropout rate
"qkv_bias": False # Query-Key-Value bias
}
torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)

# Build a small sample batch with the GPT-2 BPE tokenizer so the forward pass can run
# (the original notebook defines `batch` this way earlier in the chapter)
tokenizer = tiktoken.get_encoding("gpt2")
batch = torch.stack([
    torch.tensor(tokenizer.encode("Every effort moves you")),
    torch.tensor(tokenizer.encode("Every day holds a")),
], dim=0)

out = model(batch)
print("Input batch:\n", batch)
print("\nOutput shape:", out.shape)
print(out)
GELU Activation Function
# From https://github.com/rasbt/LLMs-from-scratch/tree/main/ch04
class GELU(nn.Module):
def __init__(self):
super().__init__()
def forward(self, x):
return 0.5 * x * (1 + torch.tanh(
torch.sqrt(torch.tensor(2.0 / torch.pi)) *
(x + 0.044715 * torch.pow(x, 3))
))
Purpose and Functionality
- GELU (Gaussian Error Linear Unit): An activation function that introduces non-linearity into the model.
- Smooth Activation: Unlike ReLU, which zeroes out negative inputs, GELU smoothly maps inputs to outputs, allowing small, non-zero values for negative inputs.
- Mathematical Definition:
GELU(x) ≈ 0.5 · x · (1 + tanh(√(2/π) · (x + 0.044715 · x³)))
Tip
The purpose of using this function after the linear layers inside the FeedForward layer is to turn the linear data non-linear, allowing the model to learn complex, non-linear relationships.
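As a minimal illustration (not part of the referenced notebook), the sketch below compares the GELU class defined above with PyTorch's built-in tanh-approximated GELU and with ReLU, showing that GELU keeps small non-zero values where ReLU outputs exact zeros:
import torch
import torch.nn as nn
gelu = GELU()                        # the tanh-approximation class defined above
ref = nn.GELU(approximate="tanh")    # PyTorch's built-in tanh approximation
relu = nn.ReLU()
x = torch.tensor([-3.0, -1.0, -0.1, 0.0, 0.1, 1.0, 3.0])
print(gelu(x))   # small negative outputs for negative inputs
print(ref(x))    # closely matches the class above
print(relu(x))   # hard zeros for every negative input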
FeedForward Neural Network
Shapes have been added as comments to better understand the shapes of the matrices:
# From https://github.com/rasbt/LLMs-from-scratch/tree/main/ch04
class FeedForward(nn.Module):
def __init__(self, cfg):
super().__init__()
self.layers = nn.Sequential(
nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
GELU(),
nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),
)
def forward(self, x):
# x shape: (batch_size, seq_len, emb_dim)
x = self.layers[0](x)# x shape: (batch_size, seq_len, 4 * emb_dim)
x = self.layers[1](x) # x shape remains: (batch_size, seq_len, 4 * emb_dim)
x = self.layers[2](x) # x shape: (batch_size, seq_len, emb_dim)
return x # Output shape: (batch_size, seq_len, emb_dim)
Purpose and Functionality
- Position-wise FeedForward Network: Applies a two-layer fully connected network to each position separately and identically.
- Layer Details:
- First Linear Layer: Expands the dimensionality from emb_dim to 4 * emb_dim.
- GELU Activation: Applies non-linearity.
- Second Linear Layer: Reduces the dimensionality back to emb_dim.
Tip
As you can see, the FeedForward network uses 3 layers. The first is a linear layer that multiplies the dimensionality by 4 using linear weights (parameters trained inside the model). Then the GELU function is applied across all those dimensions to introduce non-linear variations and capture richer representations, and finally another linear layer brings the dimensionality back to its original size.
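As a quick sanity check (a minimal sketch, assuming the FeedForward class and GPT_CONFIG_124M defined in the code above), the intermediate expansion to 4 * emb_dim can be observed by running the sub-layers one at a time:
import torch
ff = FeedForward(GPT_CONFIG_124M)    # emb_dim = 768
x = torch.rand(2, 3, 768)            # (batch_size, seq_len, emb_dim)
h = ff.layers[0](x)                  # first linear layer: expansion
print(h.shape)                       # torch.Size([2, 3, 3072]) -> 4 * emb_dim
print(ff(x).shape)                   # torch.Size([2, 3, 768])  -> back to emb_dim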
Multi-Head Attention Mechanism
This was already explained in an earlier section.
Purpose and Functionality
- Multi-Head Self-Attention: Allows the model to focus on different positions within the input sequence when encoding a token.
- Key Components:
- Queries, Keys, and Values: Linear projections of the input, used to compute attention scores.
- Heads: Multiple attention mechanisms running in parallel (num_heads), each with a reduced dimension (head_dim).
- Attention Scores: Computed as the dot product of queries and keys, then scaled and masked.
- Masking: A causal mask is applied so the model cannot attend to future tokens (important for autoregressive models like GPT).
- Attention Weights: Softmax of the masked and scaled attention scores.
- Context Vector: Weighted sum of the values, according to the attention weights.
- Output Projection: A linear layer that combines the outputs of all heads.
Tip
The goal of this network is to find the relations between tokens in the same context. Moreover, the tokens are split across different heads in order to prevent overfitting, although the final relations found per head are combined at the end of this network.
Also, during training a causal mask is applied so that later tokens are not taken into account when looking for the relations to a given token, and some dropout is applied as well to prevent overfitting.
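Since this section has no code of its own, here is a minimal usage sketch (assuming the MultiHeadAttention class from the code representation above). It shows that the output keeps the input shape and that the registered causal mask marks future positions above the diagonal:
import torch
torch.manual_seed(123)
mha = MultiHeadAttention(d_in=768, d_out=768, context_length=1024,
                         dropout=0.0, num_heads=12)
x = torch.rand(2, 5, 768)            # (batch_size, num_tokens, d_in)
print(mha(x).shape)                  # torch.Size([2, 5, 768])
print(mha.mask[:5, :5])              # upper-triangular 1s = masked (future) positions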
Layer Normalization
# From https://github.com/rasbt/LLMs-from-scratch/tree/main/ch04
class LayerNorm(nn.Module):
def __init__(self, emb_dim):
super().__init__()
self.eps = 1e-5 # Prevent division by zero during normalization.
self.scale = nn.Parameter(torch.ones(emb_dim))
self.shift = nn.Parameter(torch.zeros(emb_dim))
def forward(self, x):
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)
norm_x = (x - mean) / torch.sqrt(var + self.eps)
return self.scale * norm_x + self.shift
Purpose and Functionality
- Layer Normalization: A technique used to normalize the inputs across the features (the embedding dimensions) for each individual example in a batch.
- Components:
- eps: A small constant (1e-5) added to the variance to prevent division by zero during normalization.
- scale and shift: Learnable parameters (nn.Parameter) that allow the model to scale and shift the normalized output. They are initialized to ones and zeros, respectively.
- Normalization Process:
- Compute Mean (mean): Calculates the mean of the input x across the embedding dimension (dim=-1), keeping the dimension for broadcasting (keepdim=True).
- Compute Variance (var): Calculates the variance of x across the embedding dimension, also keeping the dimension. The unbiased=False argument makes the variance use the biased estimator (dividing by N instead of N-1), which is appropriate when normalizing over features rather than samples.
- Normalize (norm_x): Subtracts the mean from x and divides by the square root of the variance plus eps.
- Scale and Shift: Applies the learnable scale and shift parameters to the normalized output.
Tip
The goal is to ensure a mean of 0 and a variance of 1 across all dimensions of the same token. This stabilizes the training of deep neural networks by reducing internal covariate shift, i.e. the change in the distribution of network activations caused by parameter updates during training.
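As a quick numeric check (a minimal sketch, assuming the LayerNorm class defined above), the normalized output indeed has a per-token mean of about 0 and variance of about 1 before the learnable scale and shift change it:
import torch
torch.manual_seed(123)
ln = LayerNorm(emb_dim=768)
x = torch.rand(2, 4, 768) * 5 + 3              # arbitrary mean and scale
out = ln(x)
print(out.mean(dim=-1))                        # ~0 for every token
print(out.var(dim=-1, unbiased=False))         # ~1 for every token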
Transformer Block
Shapes have been added as comments to better understand the shapes of the matrices:
# From https://github.com/rasbt/LLMs-from-scratch/tree/main/ch04
class TransformerBlock(nn.Module):
def __init__(self, cfg):
super().__init__()
self.att = MultiHeadAttention(
d_in=cfg["emb_dim"],
d_out=cfg["emb_dim"],
context_length=cfg["context_length"],
num_heads=cfg["n_heads"],
dropout=cfg["drop_rate"],
qkv_bias=cfg["qkv_bias"]
)
self.ff = FeedForward(cfg)
self.norm1 = LayerNorm(cfg["emb_dim"])
self.norm2 = LayerNorm(cfg["emb_dim"])
self.drop_shortcut = nn.Dropout(cfg["drop_rate"])
def forward(self, x):
# x shape: (batch_size, seq_len, emb_dim)
# Shortcut connection for attention block
shortcut = x # shape: (batch_size, seq_len, emb_dim)
x = self.norm1(x) # shape remains (batch_size, seq_len, emb_dim)
x = self.att(x) # shape: (batch_size, seq_len, emb_dim)
x = self.drop_shortcut(x) # shape remains (batch_size, seq_len, emb_dim)
x = x + shortcut # shape: (batch_size, seq_len, emb_dim)
# Shortcut connection for feedforward block
shortcut = x # shape: (batch_size, seq_len, emb_dim)
x = self.norm2(x) # shape remains (batch_size, seq_len, emb_dim)
x = self.ff(x) # shape: (batch_size, seq_len, emb_dim)
x = self.drop_shortcut(x) # shape remains (batch_size, seq_len, emb_dim)
x = x + shortcut # shape: (batch_size, seq_len, emb_dim)
return x # Output shape: (batch_size, seq_len, emb_dim)
Purpose and Functionality
- Composition of Layers: Combines multi-head attention, a feedforward network, layer normalization, and residual connections.
- Layer Normalization: Applied before the attention and feedforward layers for stable training.
- Residual Connections (Shortcuts): Add the input of a layer to its output to improve gradient flow and enable the training of deep networks.
- Dropout: Applied after the attention and feedforward layers for regularization.
Step-by-Step Functionality
- First Residual Path (Self-Attention):
- Input (shortcut): Save the original input for the residual connection.
- Layer Norm (norm1): Normalize the input.
- Multi-Head Attention (att): Apply self-attention.
- Dropout (drop_shortcut): Apply dropout for regularization.
- Add Residual (x + shortcut): Combine with the original input.
- Second Residual Path (FeedForward):
- Input (shortcut): Save the updated input for the next residual connection.
- Layer Norm (norm2): Normalize the input.
- FeedForward Network (ff): Apply the feedforward transformation.
- Dropout (drop_shortcut): Apply dropout.
- Add Residual (x + shortcut): Combine with the input from the first residual path.
Tip
The transformer block groups all the networks together and applies some normalization and dropout to improve training stability and results.
Note how dropout is applied after each network is used, while normalization is applied before.
It also uses shortcuts, which consist of adding a network's output to its input. This helps prevent the vanishing gradient problem by making sure that early layers contribute "as much" as the last ones.
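A minimal usage sketch (assuming the TransformerBlock class and GPT_CONFIG_124M defined above): the block preserves the (batch_size, seq_len, emb_dim) shape, which is what allows n_layers of them to be stacked back to back:
import torch
torch.manual_seed(123)
block = TransformerBlock(GPT_CONFIG_124M)
x = torch.rand(2, 4, 768)            # (batch_size, seq_len, emb_dim)
out = block(x)
print(x.shape, "->", out.shape)      # both torch.Size([2, 4, 768])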
GPTModel
Shapes have been added as comments to better understand the shapes of the matrices:
# From https://github.com/rasbt/LLMs-from-scratch/tree/main/ch04
class GPTModel(nn.Module):
def __init__(self, cfg):
super().__init__()
self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
# shape: (vocab_size, emb_dim)
self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
# shape: (context_length, emb_dim)
self.drop_emb = nn.Dropout(cfg["drop_rate"])
self.trf_blocks = nn.Sequential(
*[TransformerBlock(cfg) for _ in range(cfg["n_layers"])]
)
# Stack of TransformerBlocks
self.final_norm = LayerNorm(cfg["emb_dim"])
self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False)
# shape: (emb_dim, vocab_size)
def forward(self, in_idx):
# in_idx shape: (batch_size, seq_len)
batch_size, seq_len = in_idx.shape
# Token embeddings
tok_embeds = self.tok_emb(in_idx)
# shape: (batch_size, seq_len, emb_dim)
# Positional embeddings
pos_indices = torch.arange(seq_len, device=in_idx.device)
# shape: (seq_len,)
pos_embeds = self.pos_emb(pos_indices)
# shape: (seq_len, emb_dim)
# Add token and positional embeddings
x = tok_embeds + pos_embeds # Broadcasting over batch dimension
# x shape: (batch_size, seq_len, emb_dim)
x = self.drop_emb(x) # Dropout applied
# x shape remains: (batch_size, seq_len, emb_dim)
x = self.trf_blocks(x) # Pass through Transformer blocks
# x shape remains: (batch_size, seq_len, emb_dim)
x = self.final_norm(x) # Final LayerNorm
# x shape remains: (batch_size, seq_len, emb_dim)
logits = self.out_head(x) # Project to vocabulary size
# logits shape: (batch_size, seq_len, vocab_size)
return logits # Output shape: (batch_size, seq_len, vocab_size)
Purpose and Functionality
- Embedding Layers:
- Token Embeddings (tok_emb): Converts token indices into embeddings. These are the weights given to each dimension of each token in the vocabulary.
- Positional Embeddings (pos_emb): Adds positional information to the embeddings to capture the order of tokens. These are the weights given to a token according to its position in the text.
- Dropout (drop_emb): Applied to the embeddings for regularization.
- Transformer Blocks (trf_blocks): A stack of n_layers transformer blocks that processes the embeddings.
- Final Normalization (final_norm): Layer normalization applied before the output layer.
- Output Layer (out_head): Projects the final hidden states to the vocabulary size to produce the logits used for prediction.
Tip
The goal of this class is to use all the other networks mentioned above to predict the next token in a sequence, which is fundamental for tasks like text generation.
Note how it uses as many transformer blocks as indicated, and that each transformer block uses one multi-head attention net, one feed forward net and several normalizations. So if 12 transformer blocks are used, multiply these counts by 12.
Moreover, a normalization layer is added before the output, and a final linear layer is applied at the end to obtain results of the proper dimensions. Note how each final vector has the size of the vocabulary used. This is because the model tries to get a probability for every possible token in the vocabulary.
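A quick shape check (a minimal sketch, assuming the GPTModel class and GPT_CONFIG_124M from above): the model emits one logit per vocabulary entry for every input position, and only the logits of the last position are needed to predict the next token:
import torch
torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
idx = torch.randint(0, GPT_CONFIG_124M["vocab_size"], (2, 4))  # (batch_size, seq_len)
logits = model(idx)
print(logits.shape)            # torch.Size([2, 4, 50257]) -> (batch, seq_len, vocab_size)
print(logits[:, -1, :].shape)  # torch.Size([2, 50257])    -> used for next-token prediction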
Number of Parameters to Train
Once the GPT structure is defined, it is possible to work out the number of parameters to train:
GPT_CONFIG_124M = {
"vocab_size": 50257, # Vocabulary size
"context_length": 1024, # Context length
"emb_dim": 768, # Embedding dimension
"n_heads": 12, # Number of attention heads
"n_layers": 12, # Number of layers
"drop_rate": 0.1, # Dropout rate
"qkv_bias": False # Query-Key-Value bias
}
model = GPTModel(GPT_CONFIG_124M)
total_params = sum(p.numel() for p in model.parameters())
print(f"Total number of parameters: {total_params:,}")
# Total number of parameters: 163,009,536
Step-by-Step Calculation
1. Embedding Layers: Token Embedding & Positional Embedding
- Layer: nn.Embedding(vocab_size, emb_dim)
- Parameters: vocab_size * emb_dim
token_embedding_params = 50257 * 768 = 38,597,376
- Layer: nn.Embedding(context_length, emb_dim)
- Parameters: context_length * emb_dim
position_embedding_params = 1024 * 768 = 786,432
Total Embedding Parameters
embedding_params = token_embedding_params + position_embedding_params
embedding_params = 38,597,376 + 786,432 = 39,383,808
2. Transformer Blocks
Since there are 12 transformer blocks, compute the parameters for a single block and then multiply by 12.
Parameters per Transformer Block
a. Multi-Head Attention
- Components:
- Query Linear Layer (W_query): nn.Linear(emb_dim, emb_dim, bias=False)
- Key Linear Layer (W_key): nn.Linear(emb_dim, emb_dim, bias=False)
- Value Linear Layer (W_value): nn.Linear(emb_dim, emb_dim, bias=False)
- Output Projection (out_proj): nn.Linear(emb_dim, emb_dim)
- Calculations:
- Each of W_query, W_key, and W_value:
qkv_params = emb_dim * emb_dim = 768 * 768 = 589,824
Since there are three such layers:
total_qkv_params = 3 * qkv_params = 3 * 589,824 = 1,769,472
- Output Projection (out_proj):
out_proj_params = (emb_dim * emb_dim) + emb_dim = (768 * 768) + 768 = 589,824 + 768 = 590,592
- Total Multi-Head Attention Parameters:
mha_params = total_qkv_params + out_proj_params
mha_params = 1,769,472 + 590,592 = 2,360,064
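This subtotal can be checked directly against the MultiHeadAttention class from the code above (a quick sketch using the 124M-model dimensions):
mha = MultiHeadAttention(d_in=768, d_out=768, context_length=1024,
                         dropout=0.1, num_heads=12, qkv_bias=False)
print(sum(p.numel() for p in mha.parameters()))  # 2,360,064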
b. FeedForward Network
- Components:
- First Linear Layer: nn.Linear(emb_dim, 4 * emb_dim)
- Second Linear Layer: nn.Linear(4 * emb_dim, emb_dim)
- Calculations:
- First Linear Layer:
ff_first_layer_params = (emb_dim * 4 * emb_dim) + (4 * emb_dim)
ff_first_layer_params = (768 * 3072) + 3072 = 2,359,296 + 3,072 = 2,362,368
- Second Linear Layer:
ff_second_layer_params = (4 * emb_dim * emb_dim) + emb_dim
ff_second_layer_params = (3072 * 768) + 768 = 2,359,296 + 768 = 2,360,064
- Total FeedForward Parameters:
ff_params = ff_first_layer_params + ff_second_layer_params
ff_params = 2,362,368 + 2,360,064 = 4,722,432
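The FeedForward subtotal can be checked the same way (a quick sketch, reusing GPT_CONFIG_124M from above):
ff = FeedForward(GPT_CONFIG_124M)
print(sum(p.numel() for p in ff.parameters()))  # 4,722,432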
c. Layer Normalizations
- Components:
- Two LayerNorm instances per block.
- Each LayerNorm has 2 * emb_dim parameters (scale and shift).
- Calculation:
layer_norm_params_per_block = 2 * (2 * emb_dim) = 2 * (2 * 768) = 3,072
d. Total Parameters per Transformer Block
params_per_block = mha_params + ff_params + layer_norm_params_per_block
params_per_block = 2,360,064 + 4,722,432 + 3,072 = 7,085,568
Total Parameters for All Transformer Blocks
total_transformer_blocks_params = params_per_block * n_layers
total_transformer_blocks_params = 7,085,568 * 12 = 85,026,816
3. Final Layers
a. Final Layer Normalization
- Parameters: 2 * emb_dim (scale and shift)
final_layer_norm_params = 2 * 768 = 1,536
b. Output Projection Layer (out_head)
- Layer: nn.Linear(emb_dim, vocab_size, bias=False)
- Parameters: emb_dim * vocab_size
output_projection_params = 768 * 50257 = 38,597,376
4. Summing Up All Parameters
total_params = (
embedding_params +
total_transformer_blocks_params +
final_layer_norm_params +
output_projection_params
)
total_params = (
39,383,808 +
85,026,816 +
1,536 +
38,597,376
)
total_params = 163,009,536
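The same breakdown can be reproduced programmatically (a minimal sketch reusing the model instance created in the snippet above); each line matches one of the hand-computed subtotals:
def count(module):
    return sum(p.numel() for p in module.parameters())
print(count(model.tok_emb) + count(model.pos_emb))  # 39,383,808  (embeddings)
print(count(model.trf_blocks))                      # 85,026,816  (12 transformer blocks)
print(count(model.final_norm))                      # 1,536       (final LayerNorm)
print(count(model.out_head))                        # 38,597,376  (output projection)
print(count(model))                                 # 163,009,536 in total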
Generate Text
With a model that predicts the next token, like the one above, it is just a matter of taking the values for the last token from the output (since those correspond to the predicted token), which gives one value per vocabulary entry, then applying the softmax function to normalize that dimension into probabilities that sum to 1, and finally taking the index of the largest entry, which is the index of the word inside the vocabulary.
Code from https://github.com/rasbt/LLMs-from-scratch/blob/main/ch04/01_main-chapter-code/ch04.ipynb:
def generate_text_simple(model, idx, max_new_tokens, context_size):
# idx is (batch, n_tokens) array of indices in the current context
for _ in range(max_new_tokens):
# Crop current context if it exceeds the supported context size
# E.g., if LLM supports only 5 tokens, and the context size is 10
# then only the last 5 tokens are used as context
idx_cond = idx[:, -context_size:]
# Get the predictions
with torch.no_grad():
logits = model(idx_cond)
# Focus only on the last time step
# (batch, n_tokens, vocab_size) becomes (batch, vocab_size)
logits = logits[:, -1, :]
# Apply softmax to get probabilities
probas = torch.softmax(logits, dim=-1) # (batch, vocab_size)
# Get the idx of the vocab entry with the highest probability value
idx_next = torch.argmax(probas, dim=-1, keepdim=True) # (batch, 1)
# Append sampled index to the running sequence
idx = torch.cat((idx, idx_next), dim=1) # (batch, n_tokens+1)
return idx
start_context = "Hello, I am"
tokenizer = tiktoken.get_encoding("gpt2")  # GPT-2 BPE tokenizer (tiktoken is imported in the code block above)
encoded = tokenizer.encode(start_context)
print("encoded:", encoded)
encoded_tensor = torch.tensor(encoded).unsqueeze(0)
print("encoded_tensor.shape:", encoded_tensor.shape)
model.eval() # disable dropout
out = generate_text_simple(
model=model,
idx=encoded_tensor,
max_new_tokens=6,
context_size=GPT_CONFIG_124M["context_length"]
)
print("Output:", out)
print("Output length:", len(out[0]))
References