4. Attention Mechanisms
Attention Mechanisms and Self-Attention in Neural Networks
Attention mechanisms allow neural networks to focus on specific parts of the input when generating each part of the output. They assign different weights to different inputs, helping the model decide which inputs are most relevant to the task at hand. This is crucial in tasks like machine translation, where understanding the context of the entire sentence is necessary for accurate translation.
Tip
The goal of this fourth phase is very simple: apply some attention mechanisms. These are going to be a lot of repeated layers that capture the relation of a word in the vocabulary with its neighbours in the sentence currently being used to train the LLM. A lot of layers are used for this, so a lot of trainable parameters end up capturing this information.
Understanding Attention Mechanisms
In traditional sequence-to-sequence models used for language translation, the model encodes an input sequence into a fixed-size context vector. However, this approach struggles with long sentences because the fixed-size context vector may not capture all necessary information. Attention mechanisms address this limitation by allowing the model to consider all input tokens when generating each output token.
Example: Machine Translation
Consider translating the German sentence โKannst du mir helfen diesen Satz zu รผbersetzenโ into English. A word-by-word translation would not produce a grammatically correct English sentence due to differences in grammatical structures between languages. An attention mechanism enables the model to focus on relevant parts of the input sentence when generating each word of the output sentence, leading to a more accurate and coherent translation.
Introduction to Self-Attention
Self-attention, or intra-attention, is a mechanism where attention is applied within a single sequence to compute a representation of that sequence. It allows each token in the sequence to attend to all other tokens, helping the model capture dependencies between tokens regardless of their distance in the sequence.
Key Concepts
- Tokens: Individual elements of the input sequence (e.g., words in a sentence).
- Embeddings: Vector representations of tokens that capture semantic information.
- Attention Weights: Values that determine the importance of each token relative to the others.
Calculating Attention Weights: A Step-by-Step Example
Letโs consider the sentence โHello shiny sun!โ and represent each word with a 3-dimensional embedding:
- Hello: [0.34, 0.22, 0.54]
- shiny: [0.53, 0.34, 0.98]
- sun: [0.29, 0.54, 0.93]
Our goal is to compute the context vector for the word โshinyโ using self-attention.
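To make the following steps easy to reproduce, here is a minimal sketch that stores these toy embeddings in a tensor (the variable names `embeddings` and `query` are just illustrative choices for this example):

```python
import torch

# Toy 3-dimensional embeddings for "Hello", "shiny", "sun"
embeddings = torch.tensor([
    [0.34, 0.22, 0.54],  # Hello
    [0.53, 0.34, 0.98],  # shiny
    [0.29, 0.54, 0.93],  # sun
])

# The embedding of "shiny" plays the role of the query
query = embeddings[1]
```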
Step 1: Compute Attention Scores
Tip
Just multiply each dimension value of the query by the corresponding value of each token and add the results. You get 1 value per pair of tokens.
For each word in the sentence, compute the attention score with respect to โshinyโ by calculating the dot product of their embeddings.
Attention Score between "Hello" and "shiny":

score(Hello, shiny) = 0.34·0.53 + 0.22·0.34 + 0.54·0.98 = 0.1802 + 0.0748 + 0.5292 = 0.7842

Attention Score between "shiny" and "shiny":

score(shiny, shiny) = 0.53·0.53 + 0.34·0.34 + 0.98·0.98 = 0.2809 + 0.1156 + 0.9604 = 1.3569

Attention Score between "sun" and "shiny":

score(sun, shiny) = 0.29·0.53 + 0.54·0.34 + 0.93·0.98 = 0.1537 + 0.1836 + 0.9114 = 1.2487
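These three dot products can be reproduced with the hypothetical `embeddings` and `query` tensors sketched above:

```python
# Dot product of every token embedding with the query ("shiny")
attn_scores = embeddings @ query
print(attn_scores)  # ≈ tensor([0.7842, 1.3569, 1.2487])
```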
Step 2: Normalize Attention Scores to Obtain Attention Weights
Tip
Don't get lost in the mathematical terms, the goal of this function is simple: normalize all the weights so they sum to 1 in total. Moreover, the softmax function is used because it accentuates differences due to the exponential part, making it easier to detect useful values.
Apply the softmax function to the attention scores to convert them into attention weights that sum to 1.
w_i = e^(score_i) / Σ_j e^(score_j)

Calculating the exponentials:

e^0.7842 ≈ 2.1906, e^1.3569 ≈ 3.8841, e^1.2487 ≈ 3.4858

Calculating the sum:

2.1906 + 3.8841 + 3.4858 ≈ 9.5605

Calculating attention weights:

w(Hello) = 2.1906 / 9.5605 ≈ 0.2291
w(shiny) = 3.8841 / 9.5605 ≈ 0.4063
w(sun) = 3.4858 / 9.5605 ≈ 0.3646
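Continuing the same sketch, the normalization is a single softmax call:

```python
# Softmax turns the raw scores into weights that sum to 1
attn_weights = torch.softmax(attn_scores, dim=-1)
print(attn_weights)        # ≈ tensor([0.2291, 0.4063, 0.3646])
print(attn_weights.sum())  # ≈ 1.0
```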
Step 3: Compute the Context Vector
Tip
Just take each attention weight, multiply it by the dimensions of the related token, and then sum all the weighted dimensions to get just one vector (the context vector).
The context vector is computed as the weighted sum of the embeddings of all words, using the attention weights.
context vector = w(Hello)·E(Hello) + w(shiny)·E(shiny) + w(sun)·E(sun)
               ≈ 0.2291·[0.34, 0.22, 0.54] + 0.4063·[0.53, 0.34, 0.98] + 0.3646·[0.29, 0.54, 0.93]
Calculating each component:
- Weighted Embedding of "Hello": 0.2291 · [0.34, 0.22, 0.54] ≈ [0.0779, 0.0504, 0.1237]
- Weighted Embedding of "shiny": 0.4063 · [0.53, 0.34, 0.98] ≈ [0.2153, 0.1381, 0.3982]
- Weighted Embedding of "sun": 0.3646 · [0.29, 0.54, 0.93] ≈ [0.1057, 0.1969, 0.3391]
Summing the weighted embeddings:
context vector = [0.0779 + 0.2153 + 0.1057, 0.0504 + 0.1381 + 0.1969, 0.1237 + 0.3982 + 0.3391] ≈ [0.3989, 0.3854, 0.8610]

This context vector represents the enriched embedding for the word "shiny", incorporating information from all words in the sentence.
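Continuing the sketch, the weighted sum is a single matrix-vector product:

```python
# Weighted sum of all token embeddings = context vector for "shiny"
context_vec = attn_weights @ embeddings
print(context_vec)  # ≈ tensor([0.3990, 0.3854, 0.8610]), matching the manual sum up to rounding
```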
Summary of the Process
- Compute Attention Scores: Use the dot product between the embedding of the target word and the embeddings of all words in the sequence.
- Normalize Scores to Get Attention Weights: Apply the softmax function to the attention scores to obtain weights that sum to 1.
- Compute Context Vector: Multiply each wordโs embedding by its attention weight and sum the results.
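Putting the three steps together, the context vectors for every token of the sentence can be computed at once. This is a minimal sketch of simplified self-attention (still without trainable weights), reusing the toy `embeddings` tensor from above:

```python
import torch

embeddings = torch.tensor([
    [0.34, 0.22, 0.54],  # Hello
    [0.53, 0.34, 0.98],  # shiny
    [0.29, 0.54, 0.93],  # sun
])

# 1. Attention scores: dot product of every token with every other token
attn_scores = embeddings @ embeddings.T            # shape (3, 3)

# 2. Attention weights: softmax per row, so each row sums to 1
attn_weights = torch.softmax(attn_scores, dim=-1)

# 3. Context vectors: weighted sum of all embeddings for each token
context_vecs = attn_weights @ embeddings           # shape (3, 3)

print(context_vecs[1])  # context vector for "shiny", as computed by hand above
```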
Self-Attention with Trainable Weights
In practice, self-attention mechanisms use trainable weights to learn the best representations for queries, keys, and values. This involves introducing three weight matrices:
W_query, W_key and W_value, each of shape (d_in × d_out).

The query plays the same role as the token data used before, while the key and value matrices are simply additional randomly initialized, trainable matrices.
Step 1: Compute Queries, Keys, and Values
Each token gets its own query, key and value vector by multiplying its embedding values by the corresponding weight matrix:

q_i = x_i · W_query,  k_i = x_i · W_key,  v_i = x_i · W_value
These matrices transform the original embeddings into a new space suitable for computing attention.
Example
Assuming:
- Input dimension d_in = 3 (embedding size)
- Output dimension d_out = 2 (desired dimension for queries, keys, and values)
Initialize the weight matrices:
```python
import torch
import torch.nn as nn

d_in = 3
d_out = 2

W_query = nn.Parameter(torch.rand(d_in, d_out))
W_key = nn.Parameter(torch.rand(d_in, d_out))
W_value = nn.Parameter(torch.rand(d_in, d_out))
```

Compute the queries, keys, and values:

```python
# `inputs` is the (num_tokens, d_in) matrix of token embeddings
queries = torch.matmul(inputs, W_query)
keys = torch.matmul(inputs, W_key)
values = torch.matmul(inputs, W_value)
```
Step 2: Compute Scaled Dot-Product Attention
Compute Attention Scores

Similar to the example from before, but this time, instead of using the dimension values of the tokens, we use the key vector of each token (already calculated from the dimensions). So, for each query q_i and key k_j:

score(i, j) = q_i · k_j
Scale the Scores

To prevent the dot products from becoming too large, we scale the scores by the square root of the key dimension d_k:

scaled score(i, j) = (q_i · k_j) / √d_k
Tip
The scores are divided by the square root of the dimensions because dot products might become very large, and this helps to regulate them.
Apply Softmax to Obtain Attention Weights: As in the initial example, normalize all the values so they sum to 1.

attention weights = softmax(Q · Kᵀ / √d_k)   (applied row-wise, so each row sums to 1)
Step 3: Compute Context Vectors

As in the initial example, just sum all the value vectors, multiplying each one by its attention weight:

context vector_i = Σ_j attention weight_ij · v_j,  i.e. context = attention weights · V
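Continuing the manual `W_query`/`W_key`/`W_value` sketch from the example above (so `queries`, `keys` and `values` are assumed to already exist), the two remaining steps look like this; the class-based version from the reference notebook follows below:

```python
d_k = keys.shape[-1]  # key dimension (d_out)

# Step 2: scaled dot-product scores, normalized row-wise with softmax
attn_scores = queries @ keys.T                                # (num_tokens, num_tokens)
attn_weights = torch.softmax(attn_scores / d_k**0.5, dim=-1)

# Step 3: weighted sum of the value vectors
context_vecs = attn_weights @ values                          # (num_tokens, d_out)
print(context_vecs)
```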
Code Example
Grabbing an example from https://github.com/rasbt/LLMs-from-scratch/blob/main/ch03/01_main-chapter-code/ch03.ipynb you can check this class implementing the self-attention functionality we talked about:
```python
import torch
import torch.nn as nn

inputs = torch.tensor(
    [[0.43, 0.15, 0.89],  # Your     (x^1)
     [0.55, 0.87, 0.66],  # journey  (x^2)
     [0.57, 0.85, 0.64],  # starts   (x^3)
     [0.22, 0.58, 0.33],  # with     (x^4)
     [0.77, 0.25, 0.10],  # one      (x^5)
     [0.05, 0.80, 0.55]]  # step     (x^6)
)

class SelfAttention_v2(nn.Module):

    def __init__(self, d_in, d_out, qkv_bias=False):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)

    def forward(self, x):
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

        attn_scores = queries @ keys.T
        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)

        context_vec = attn_weights @ values
        return context_vec

d_in = 3
d_out = 2

torch.manual_seed(789)
sa_v2 = SelfAttention_v2(d_in, d_out)
print(sa_v2(inputs))
```
Tip
Instead of initializing the matrices with random values, nn.Linear is used to mark all the weights as parameters to train.

Causal Attention: Hiding Future Words

For LLMs we want the model to consider only the tokens that appear before the current position in order to predict the next token. Causal attention, also known as masked attention, achieves this by modifying the attention mechanism to prevent access to future tokens.

Applying a Causal Attention Mask

To implement causal attention, we apply a mask to the attention scores before the softmax operation so that the remaining scores still sum to 1. The mask sets the attention scores of future tokens to negative infinity, ensuring that after the softmax their attention weights are zero.

Steps
- Compute Attention Scores: Same as before.
- Apply Mask: Use an upper-triangular mask to set the scores above the diagonal (the future tokens) to negative infinity.

```python
# Boolean mask: True above the diagonal marks future positions
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
masked_scores = attention_scores.masked_fill(mask, float('-inf'))
```

- Apply Softmax: Compute the attention weights using the masked scores.

```python
attention_weights = torch.softmax(masked_scores, dim=-1)
```
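A tiny end-to-end illustration of the mask, using random scores just for demonstration (the exact numbers are irrelevant):

```python
import torch

seq_len = 4
torch.manual_seed(0)
attention_scores = torch.rand(seq_len, seq_len)

# True above the diagonal = future positions that must be hidden
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
masked_scores = attention_scores.masked_fill(mask, float('-inf'))

attention_weights = torch.softmax(masked_scores, dim=-1)
print(attention_weights)              # upper triangle is 0
print(attention_weights.sum(dim=-1))  # every row still sums to 1
```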
Masking Additional Attention Weights with Dropout

To prevent overfitting, we can apply dropout to the attention weights after the softmax operation. Dropout randomly zeroes some of the attention weights during training.

```python
dropout = nn.Dropout(p=0.5)
attention_weights = dropout(attention_weights)
```

A regular dropout rate is about 10-20%. Note that during training nn.Dropout also rescales the surviving weights by 1/(1 - p), so the rows no longer sum to exactly 1.

Code Example

The code example is taken from https://github.com/rasbt/LLMs-from-scratch/blob/main/ch03/01_main-chapter-code/ch03.ipynb:
```python
import torch
import torch.nn as nn

inputs = torch.tensor(
    [[0.43, 0.15, 0.89],  # Your     (x^1)
     [0.55, 0.87, 0.66],  # journey  (x^2)
     [0.57, 0.85, 0.64],  # starts   (x^3)
     [0.22, 0.58, 0.33],  # with     (x^4)
     [0.77, 0.25, 0.10],  # one      (x^5)
     [0.05, 0.80, 0.55]]  # step     (x^6)
)

batch = torch.stack((inputs, inputs), dim=0)
print(batch.shape)

class CausalAttention(nn.Module):

    def __init__(self, d_in, d_out, context_length,
                 dropout, qkv_bias=False):
        super().__init__()
        self.d_out = d_out
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.dropout = nn.Dropout(dropout)
        self.register_buffer('mask', torch.triu(torch.ones(context_length, context_length), diagonal=1))  # New

    def forward(self, x):
        b, num_tokens, d_in = x.shape
        # b is the number of batches
        # num_tokens is the number of tokens per batch
        # d_in is the number of dimensions per token

        keys = self.W_key(x)  # This generates the keys of the tokens
        queries = self.W_query(x)
        values = self.W_value(x)

        attn_scores = queries @ keys.transpose(1, 2)  # Swaps the token and dimension axes so the matrices can be multiplied
        attn_scores.masked_fill_(  # New, _ ops are in-place
            self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)  # `:num_tokens` to account for cases where the number of tokens in the batch is smaller than the supported context_size
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1
        )
        attn_weights = self.dropout(attn_weights)

        context_vec = attn_weights @ values
        return context_vec

torch.manual_seed(123)

context_length = batch.shape[1]
d_in = 3
d_out = 2
ca = CausalAttention(d_in, d_out, context_length, 0.0)

context_vecs = ca(batch)

print(context_vecs)
print("context_vecs.shape:", context_vecs.shape)
```
Extending Single-Head Attention to Multi-Head Attention

Multi-head attention in practical terms consists of running multiple instances of the self-attention function, each of them with its own weights, so different final vectors are calculated.

Code Example

It would be possible to reuse the previous code and just add a wrapper that launches it several times (a naive sketch of that idea follows below). The class after it is a more optimised version from https://github.com/rasbt/LLMs-from-scratch/blob/main/ch03/01_main-chapter-code/ch03.ipynb that processes all the heads at the same time (reducing the number of expensive for loops). As you can see in the code, the dimensions of each token are divided among the heads: if each token has 8 dimensions and we use 2 heads, the dimensions are split into 2 arrays of 4 dimensions and each head works on one of them (d_out must be divisible by num_heads, as the code asserts).
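As a reference point, the naive wrapper idea could look like the following sketch, which simply stacks several of the CausalAttention heads defined earlier and concatenates their outputs (the class name MultiHeadAttentionWrapper and the reuse of `batch` are illustrative assumptions; here d_out is the per-head size, so the final dimension is num_heads * d_out):

```python
class MultiHeadAttentionWrapper(nn.Module):
    """Naive multi-head attention: one independent CausalAttention module per head."""

    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        self.heads = nn.ModuleList(
            [CausalAttention(d_in, d_out, context_length, dropout, qkv_bias)
             for _ in range(num_heads)]
        )

    def forward(self, x):
        # Each head returns (b, num_tokens, d_out); concatenate along the last dimension
        return torch.cat([head(x) for head in self.heads], dim=-1)

torch.manual_seed(123)
mha_naive = MultiHeadAttentionWrapper(d_in=3, d_out=2, context_length=batch.shape[1],
                                      dropout=0.0, num_heads=2)
print(mha_naive(batch).shape)  # torch.Size([2, 6, 4]) -> last dim is num_heads * d_out
```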
```python
# Continues from the previous example: `torch`, `nn` and `batch` are already defined
class MultiHeadAttention(nn.Module):

    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert (d_out % num_heads == 0), \
            "d_out must be divisible by num_heads"

        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads  # Reduce the projection dim to match desired output dim

        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)  # Linear layer to combine head outputs
        self.dropout = nn.Dropout(dropout)
        self.register_buffer(
            "mask",
            torch.triu(torch.ones(context_length, context_length),
                       diagonal=1)
        )

    def forward(self, x):
        b, num_tokens, d_in = x.shape
        # b is the number of batches
        # num_tokens is the number of tokens per batch
        # d_in is the number of dimensions per token

        keys = self.W_key(x)  # Shape: (b, num_tokens, d_out)
        queries = self.W_query(x)
        values = self.W_value(x)

        # We implicitly split the matrix by adding a `num_heads` dimension
        # Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
        keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
        values = values.view(b, num_tokens, self.num_heads, self.head_dim)
        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)

        # Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)
        keys = keys.transpose(1, 2)
        queries = queries.transpose(1, 2)
        values = values.transpose(1, 2)

        # Compute scaled dot-product attention (aka self-attention) with a causal mask
        attn_scores = queries @ keys.transpose(2, 3)  # Dot product for each head

        # Original mask truncated to the number of tokens and converted to boolean
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]

        # Use the mask to fill attention scores
        attn_scores.masked_fill_(mask_bool, -torch.inf)

        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Shape: (b, num_tokens, num_heads, head_dim)
        context_vec = (attn_weights @ values).transpose(1, 2)

        # Combine heads, where self.d_out = self.num_heads * self.head_dim
        context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
        context_vec = self.out_proj(context_vec)  # optional projection

        return context_vec

torch.manual_seed(123)

batch_size, context_length, d_in = batch.shape
d_out = 2
mha = MultiHeadAttention(d_in, d_out, context_length, 0.0, num_heads=2)

context_vecs = mha(batch)

print(context_vecs)
print("context_vecs.shape:", context_vecs.shape)
```

For another compact and efficient implementation you could use the torch.nn.MultiheadAttention class in PyTorch.
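A minimal sketch of how torch.nn.MultiheadAttention could be wired up with a causal mask (the embedding size, head count and sequence length are arbitrary illustrative values; embed_dim must be divisible by num_heads):

```python
import torch
import torch.nn as nn

embed_dim, num_heads, seq_len, batch_size = 8, 2, 6, 2

mha = nn.MultiheadAttention(embed_dim, num_heads, dropout=0.0, batch_first=True)

x = torch.rand(batch_size, seq_len, embed_dim)

# Boolean causal mask: True marks positions that may NOT be attended to
causal_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()

# Self-attention: the same tensor is used as query, key and value
context, attn_weights = mha(x, x, x, attn_mask=causal_mask)
print(context.shape)  # torch.Size([2, 6, 8])
```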
Tip
Short answer of ChatGPT about why it is better to divide the dimensions of the tokens among the heads instead of having each head check all the dimensions of all the tokens:

While allowing each head to process all embedding dimensions might seem advantageous because each head would have access to the full information, the standard practice is to divide the embedding dimensions among the heads. This approach balances computational efficiency with model performance and encourages each head to learn diverse representations. Therefore, splitting the embedding dimensions is generally preferred over having each head check all dimensions.