4. Attention Mechanisms

tip

Learn and practice AWS Hacking: HackTricks Training AWS Red Team Expert (ARTE)
Learn and practice GCP Hacking: HackTricks Training GCP Red Team Expert (GRTE)
Learn and practice Azure Hacking: HackTricks Training Azure Red Team Expert (AzRTE)

Support HackTricks

Attention Mechanisms and Self-Attention in Neural Networks

Attention mechanisms allow neural networks to focus on specific parts of the input when generating each part of the output. They assign different weights to different inputs, helping the model decide which inputs are most relevant to the task at hand. This is crucial in tasks like machine translation, where understanding the context of the entire sentence is necessary for an accurate translation.

tip

The goal of this fourth phase is very simple: apply some attention mechanisms. These will be many repeated layers that capture the relationship of a word in the vocabulary with its neighbours in the current sentence being used to train the LLM.
Many layers are used for this, so a large number of trainable parameters will be capturing this information.

Understanding Attention Mechanisms

In traditional sequence-to-sequence models used for language translation, the model encodes the input sequence into a fixed-size context vector. However, this approach struggles with long sentences because a fixed-size context vector may not capture all the necessary information. Attention mechanisms address this limitation by allowing the model to consider all input tokens when generating each output token.

Example: Machine Translation

Consider translating the German sentence "Kannst du mir helfen diesen Satz zu übersetzen" into English. A word-by-word translation would not produce a grammatically correct English sentence because the two languages differ in grammatical structure. An attention mechanism lets the model focus on the relevant parts of the input sentence when generating each word of the output sentence, leading to a more accurate and coherent translation.

Introduction to Self-Attention

Self-attention, or intra-attention, is a mechanism where attention is applied within a single sequence to compute a representation of that sequence. It allows each token in the sequence to attend to all other tokens, helping the model capture dependencies between tokens regardless of their distance in the sequence.

Key Concepts

  • Tokens: Individual elements of the input sequence (e.g., words in a sentence).
  • Embeddings: Vector representations of tokens, capturing semantic information.
  • Attention Weights: Values that determine the importance of each token relative to the others.

Calculating Attention Weights: A Step-by-Step Example

Consider the sentence "Hello shiny sun!" and represent each word with a 3-dimensional embedding:

  • Hello: [0.34, 0.22, 0.54]
  • shiny: [0.53, 0.34, 0.98]
  • sun: [0.29, 0.54, 0.93]

Our goal is to compute the context vector for the word "shiny" using self-attention.

Step 1: Compute Attention Scores

tip

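Multiply each dimension value of the query by the corresponding dimension of each token and add the results. You get one value per pair of tokens.

For each word in the sentence, compute its attention score with respect to "shiny" by taking the dot product of their embeddings.

Attention Score between "Hello" and "shiny"

score(Hello, shiny) = (0.34 × 0.53) + (0.22 × 0.34) + (0.54 × 0.98) = 0.1802 + 0.0748 + 0.5292 = 0.7842

Attention Score between "shiny" and "shiny"

score(shiny, shiny) = (0.53 × 0.53) + (0.34 × 0.34) + (0.98 × 0.98) = 0.2809 + 0.1156 + 0.9604 = 1.3569

Attention Score between "sun" and "shiny"

score(sun, shiny) = (0.29 × 0.53) + (0.54 × 0.34) + (0.93 × 0.98) = 0.1537 + 0.1836 + 0.9114 = 1.2487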
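
The dot products in this step can be checked with a few lines of plain Python. This is only a sketch of the arithmetic; the `dot` helper and the `embeddings` dictionary are names introduced here for illustration.

```python
# Embeddings from the worked example.
embeddings = {
    "Hello": [0.34, 0.22, 0.54],
    "shiny": [0.53, 0.34, 0.98],
    "sun":   [0.29, 0.54, 0.93],
}

def dot(a, b):
    """Multiply matching dimensions and add the results."""
    return sum(x * y for x, y in zip(a, b))

query = embeddings["shiny"]  # the token we are computing attention for
scores = {word: dot(vec, query) for word, vec in embeddings.items()}

for word, score in scores.items():
    print(f"{word} . shiny = {score:.4f}")
```

The three scores come out as 0.7842, 1.3569 and 1.2487. The highest score is the one between "shiny" and itself, which is expected because a vector is always well aligned with itself.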

Step 2: Normalize Attention Scores to Obtain Attention Weights

tip

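Don't get lost in the mathematical terms; the goal of this function is simple: normalize all the weights so that they sum to 1.

Moreover, softmax is used because it accentuates differences thanks to the exponential part, making it easier to pick out the most relevant values.

Apply the softmax function to the attention scores to convert them into attention weights that sum to 1: each weight is exp(score) divided by the sum of exp(score) over all tokens.

Calculating the exponentials:

Hello: exp(0.7842) ≈ 2.1907
shiny: exp(1.3569) ≈ 3.8841
sun: exp(1.2487) ≈ 3.4858

Calculating the sum:

2.1907 + 3.8841 + 3.4858 = 9.5606

Calculating the attention weights:

α(Hello) = 2.1907 / 9.5606 ≈ 0.2291
α(shiny) = 3.8841 / 9.5606 ≈ 0.4063
α(sun) = 3.4858 / 9.5606 ≈ 0.3646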
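
As a rough check of the normalization step, the sketch below applies softmax to the three attention scores for "shiny" (0.7842, 1.3569, 1.2487). The `softmax` helper is written here for illustration; subtracting the maximum before exponentiating is a common stability trick and does not change the result.

```python
import math

# Attention scores of "Hello", "shiny" and "sun" against the query "shiny".
scores = [0.7842, 1.3569, 1.2487]

def softmax(values):
    """Exponentiate each value and divide by the total so the results sum to 1."""
    shift = max(values)  # avoids overflow for large scores; the output is unchanged
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

weights = softmax(scores)
print([round(w, 4) for w in weights])  # [0.2291, 0.4063, 0.3646]
print(sum(weights))                    # 1.0 up to floating-point error
```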

Step 3: Compute the Context Vector

tip

Take each attention weight, multiply it by the dimensions of the corresponding token, and then add up the results dimension by dimension to get a single vector (the context vector).

The context vector is computed as the weighted sum of the embeddings of all words, using the attention weights.

Calculating each component:

  • Weighted Embedding of "Hello": 0.2291 × [0.34, 0.22, 0.54] ≈ [0.0779, 0.0504, 0.1237]
  • Weighted Embedding of "shiny": 0.4063 × [0.53, 0.34, 0.98] ≈ [0.2153, 0.1381, 0.3982]
  • Weighted Embedding of "sun": 0.3646 × [0.29, 0.54, 0.93] ≈ [0.1057, 0.1969, 0.3391]

Summing the weighted embeddings:

context vector = [0.0779 + 0.2153 + 0.1057, 0.0504 + 0.1381 + 0.1969, 0.1237 + 0.3982 + 0.3391] ≈ [0.3989, 0.3854, 0.8610]

This context vector represents the enriched embedding for the word "shiny," incorporating information from all words in the sentence.
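
The weighted sum can be reproduced in plain Python. This sketch assumes the attention weights for "shiny" rounded to four decimals (0.2291, 0.4063, 0.3646), so the last digit of the result may differ slightly from a hand calculation that rounds at other points.

```python
embeddings = [
    [0.34, 0.22, 0.54],  # Hello
    [0.53, 0.34, 0.98],  # shiny
    [0.29, 0.54, 0.93],  # sun
]
# Attention weights for the query "shiny" (softmax of the Step 1 scores).
weights = [0.2291, 0.4063, 0.3646]

# Scale each embedding by its weight, then add the scaled vectors per dimension.
weighted = [[w * x for x in vec] for w, vec in zip(weights, embeddings)]
context = [sum(column) for column in zip(*weighted)]

print([round(c, 3) for c in context])  # [0.399, 0.385, 0.861]
```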

Summary of the Process

  1. Compute Attention Scores: Take the dot product between the embedding of the target word and the embeddings of all words in the sequence.
  2. Normalize Scores to Get Attention Weights: Apply the softmax function to the attention scores to obtain weights that sum to 1.
  3. Compute Context Vector: Multiply each word's embedding by its attention weight and sum the results.
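
The three steps above can also be written in matrix form, which computes the context vector of every token at once instead of only the one for "shiny". This is a minimal PyTorch sketch with no trainable weights; the variable names are chosen here for illustration.

```python
import torch

inputs = torch.tensor([
    [0.34, 0.22, 0.54],  # Hello
    [0.53, 0.34, 0.98],  # shiny
    [0.29, 0.54, 0.93],  # sun
])

attn_scores = inputs @ inputs.T                    # (3, 3): every pairwise dot product
attn_weights = torch.softmax(attn_scores, dim=-1)  # each row sums to 1
context_vecs = attn_weights @ inputs               # (3, 3): one context vector per token

print(context_vecs[1])  # context vector for "shiny", roughly [0.399, 0.385, 0.861]
```

Row i of `attn_weights` holds the weights that token i assigns to every token, so row 1 reproduces the hand calculation for "shiny".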

Self-Attention with Trainable Weights

In practice, self-attention mechanisms use trainable weights to learn the best representations for queries, keys, and values. This involves introducing three weight matrices:

  • W_q: weight matrix for queries
  • W_k: weight matrix for keys
  • W_v: weight matrix for values

The query plays the same role as the target token did in the previous example, while the key and value matrices are trainable matrices that start out with random values. The query matrix is trained in the same way.

Step 1: Compute Queries, Keys, and Values

Each token gets its own query, key, and value vector, obtained by multiplying its embedding x_i by the matrices defined above:

q_i = x_i W_q,   k_i = x_i W_k,   v_i = x_i W_v

These matrices transform the original embeddings into a new space suited to computing attention.

Example

Assuming:

  • Input dimension d_in = 3 (embedding size)
  • Output dimension d_out = 2 (desired size for queries, keys, and values)

Initialize the weight matrices:

python
import torch
import torch.nn as nn

d_in = 3
d_out = 2

W_query = nn.Parameter(torch.rand(d_in, d_out))
W_key = nn.Parameter(torch.rand(d_in, d_out))
W_value = nn.Parameter(torch.rand(d_in, d_out))

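Compute queries, keys, and values:

python
# inputs: tensor of shape (num_tokens, d_in) holding one embedding per row
queries = torch.matmul(inputs, W_query)  # (num_tokens, d_out)
keys = torch.matmul(inputs, W_key)       # (num_tokens, d_out)
values = torch.matmul(inputs, W_value)   # (num_tokens, d_out)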

Step 2: Compute Scaled Dot-Product Attention

Compute Attention Scores

As in the previous example, but this time, instead of using the raw dimension values of the tokens, we use each token's key vector (already computed from those dimensions). So, for each query q_i and key k_j:

score_ij = q_i · k_j

Scale the Scores

To prevent the dot products from becoming too large, scale them by dividing by the square root of the key dimension d_k:

scaled_score_ij = score_ij / sqrt(d_k)

tip

The score is divided by the square root of the key dimension because dot products can become very large as the dimension grows. Large scores push softmax toward an almost one-hot output, and the scaling keeps them in a more moderate range.
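
To illustrate why the scaling helps, the sketch below compares softmax on unscaled and scaled scores. The key dimension `d_k = 64` and the raw scores are hypothetical values picked for the example, not numbers from the text.

```python
import math

def softmax(values):
    """Plain softmax with the usual max-subtraction for numerical stability."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64                          # hypothetical key dimension
raw_scores = [8.0, 16.0, 24.0]    # hypothetical unscaled dot products
scaled_scores = [s / math.sqrt(d_k) for s in raw_scores]  # [1.0, 2.0, 3.0]

raw_weights = softmax(raw_scores)
scaled_weights = softmax(scaled_scores)

print([round(w, 4) for w in raw_weights])     # [0.0, 0.0003, 0.9997]
print([round(w, 4) for w in scaled_weights])  # [0.09, 0.2447, 0.6652]
```

Without scaling, almost all of the weight lands on a single token; after dividing by sqrt(d_k) the other tokens still contribute, which also keeps gradients from vanishing during training.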

Apply Softmax to Obtain Attention Weights: As in the initial example, normalize all the values so that they sum to 1. For each query i, the weight α_ij is the softmax of scaled_score_ij taken over all keys j.

Step 3: Compute Context Vectors

As in the initial example, sum all the value vectors, multiplying each one by its attention weight:

context_i = Σ_j α_ij v_j

Code Example

Taking an example from https://github.com/rasbt/LLMs-from-scratch/blob/main/ch03/01_main-chapter-code/ch03.ipynb you can check this class, which implements the self-attention functionality discussed above:

python
import torch
import torch.nn as nn

inputs = torch.tensor(
    [[0.43, 0.15, 0.89], # Your     (x^1)
     [0.55, 0.87, 0.66], # journey  (x^2)
     [0.57, 0.85, 0.64], # starts   (x^3)
     [0.22, 0.58, 0.33], # with     (x^4)
     [0.77, 0.25, 0.10], # one      (x^5)
     [0.05, 0.80, 0.55]] # step     (x^6)
)

class SelfAttention_v2(nn.Module):

    def __init__(self, d_in, d_out, qkv_bias=False):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)

    def forward(self, x):
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

        attn_scores = queries @ keys.T
        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)

        context_vec = attn_weights @ values
        return context_vec

d_in=3
d_out=2
torch.manual_seed(789)
sa_v2 = SelfAttention_v2(d_in, d_out)
print(sa_v2(inputs))

tip

Note that instead of creating the matrices by hand with random values, nn.Linear is used: it registers all the weights as trainable parameters and initializes them with a scheme better suited to training.
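
A small sketch of what nn.Linear does under the hood may help here. The seed and the random input `x` are arbitrary choices for the illustration; the point is that the layer stores its weight with shape (d_out, d_in) and, without a bias, computes `x @ weight.T`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)     # arbitrary seed, only for reproducibility
d_in, d_out = 3, 2
x = torch.rand(6, d_in)  # six hypothetical token embeddings

linear = nn.Linear(d_in, d_out, bias=False)

# The weight is stored transposed relative to a (d_in, d_out) matrix.
print(linear.weight.shape)  # torch.Size([2, 3])
manual = x @ linear.weight.T

# Same result as calling the layer directly.
print(torch.allclose(linear(x), manual))  # True

# The weight is registered as a trainable parameter automatically.
print(isinstance(linear.weight, nn.Parameter), linear.weight.requires_grad)  # True True
```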

Causal Attention: Hiding Future Words

For LLMs we want the model to consider only the tokens at or before the current position when predicting the next token. Causal attention, also known as masked attention, achieves this by modifying the attention mechanism to prevent access to future tokens.

Applying a Causal Attention Mask

To implement causal attention, we apply a mask to the attention scores before the softmax operation, so that the remaining weights still sum to 1. The mask sets the attention scores of future tokens to negative infinity, which guarantees that their attention weights are zero after the softmax.

Steps

  1. Compute Attention Scores: Same as before.
  2. Apply Mask: Use an upper triangular mask to set every score above the diagonal to negative infinity.
python
# True above the diagonal marks future tokens.
# masked_fill is used because multiplying a 0/1 mask by -inf gives NaN (0 * -inf).
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
masked_scores = attention_scores.masked_fill(mask, float('-inf'))
  3. Apply Softmax: Compute the attention weights from the masked scores.
python
attention_weights = torch.softmax(masked_scores, dim=-1)
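
The masking steps can be tried on a small standalone example. The sketch below uses random numbers as stand-in attention scores (an assumption for illustration) and applies the mask with `masked_fill`, since multiplying a 0/1 mask by negative infinity would produce NaN wherever the mask is 0.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only for reproducibility
seq_len = 4
attention_scores = torch.rand(seq_len, seq_len)  # hypothetical raw scores

# True above the diagonal marks "future" positions that must be hidden.
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()

# Future positions get -inf, so softmax assigns them a weight of exactly 0.
masked_scores = attention_scores.masked_fill(mask, float("-inf"))
attention_weights = torch.softmax(masked_scores, dim=-1)

print(attention_weights)             # lower-triangular matrix
print(attention_weights.sum(dim=-1)) # every row still sums to 1
```

The first row has a single non-zero entry (the token can only attend to itself), and each following row spreads its weight over one more token.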

Masking Additional Attention Weights with Dropout

To prevent overfitting, we can apply dropout to the attention weights after the softmax operation. During training, dropout randomly sets some of the attention weights to zero and scales the remaining ones by 1 / (1 - p).

python
dropout = nn.Dropout(p=0.5)
attention_weights = dropout(attention_weights)

A typical dropout rate is about 10-20% (p=0.1 to p=0.2), lower than the p=0.5 used in the snippet above.
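
The behaviour of nn.Dropout is easy to see on a tensor of ones. In this sketch the 4x4 tensor stands in for attention weights (an assumption for readability), and p=0.5 is used so the rescaling factor 1 / (1 - p) = 2 is obvious.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)        # arbitrary seed, only for reproducibility
dropout = nn.Dropout(p=0.5)
weights = torch.ones(4, 4)  # hypothetical attention weights, all 1 for readability

dropout.train()             # training mode: dropout is active
dropped = dropout(weights)  # every entry is either 0 or 1 / (1 - p) = 2
print(dropped)

dropout.eval()              # evaluation mode: dropout does nothing
unchanged = dropout(weights)
print(torch.equal(unchanged, weights))  # True
```

Because dropout is disabled in evaluation mode, the same module can stay in the model at inference time without changing the attention weights.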

Code Example

Code example from https://github.com/rasbt/LLMs-from-scratch/blob/main/ch03/01_main-chapter-code/ch03.ipynb:

python
import torch
import torch.nn as nn

inputs = torch.tensor(
    [[0.43, 0.15, 0.89], # Your     (x^1)
     [0.55, 0.87, 0.66], # journey  (x^2)
     [0.57, 0.85, 0.64], # starts   (x^3)
     [0.22, 0.58, 0.33], # with     (x^4)
     [0.77, 0.25, 0.10], # one      (x^5)
     [0.05, 0.80, 0.55]] # step     (x^6)
)

batch = torch.stack((inputs, inputs), dim=0)
print(batch.shape)

class CausalAttention(nn.Module):

    def __init__(self, d_in, d_out, context_length,
                 dropout, qkv_bias=False):
        super().__init__()
        self.d_out = d_out
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.dropout = nn.Dropout(dropout)
        self.register_buffer('mask', torch.triu(torch.ones(context_length, context_length), diagonal=1)) # New

    def forward(self, x):
        b, num_tokens, d_in = x.shape
        # b is the batch size (number of sequences)
        # num_tokens is the number of tokens per sequence
        # d_in is the number of dimensions per token

        keys = self.W_key(x) # This generates the keys of the tokens
        queries = self.W_query(x)
        values = self.W_value(x)

        attn_scores = queries @ keys.transpose(1, 2) # Swap the last two dimensions of keys so the matrices can be multiplied
        attn_scores.masked_fill_(  # New, _ ops are in-place
            self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)  # `:num_tokens` to account for cases where the number of tokens in the batch is smaller than the supported context_size
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1
        )
        attn_weights = self.dropout(attn_weights)

        context_vec = attn_weights @ values
        return context_vec

torch.manual_seed(123)

context_length = batch.shape[1]
d_in = 3
d_out = 2
ca = CausalAttention(d_in, d_out, context_length, 0.0)

context_vecs = ca(batch)

print(context_vecs)
print("context_vecs.shape:", context_vecs.shape)
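
One way to convince yourself that the mask really hides future tokens is to change the last token of a sequence and confirm that the outputs for all earlier positions stay the same. The sketch below uses a compact function rather than the class above, with hypothetical random weights and inputs; it is an illustration of the property, not part of the referenced notebook.

```python
import torch

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head causal attention for one sequence x of shape (num_tokens, d_in)."""
    queries, keys, values = x @ w_q, x @ w_k, x @ w_v
    scores = queries @ keys.T / keys.shape[-1] ** 0.5
    num_tokens = x.shape[0]
    future = torch.triu(torch.ones(num_tokens, num_tokens), diagonal=1).bool()
    weights = torch.softmax(scores.masked_fill(future, float("-inf")), dim=-1)
    return weights @ values

torch.manual_seed(0)  # arbitrary seed, only for reproducibility
d_in, d_out = 3, 2
w_q, w_k, w_v = (torch.rand(d_in, d_out) for _ in range(3))  # hypothetical weights

x = torch.rand(6, d_in)
x_changed = x.clone()
x_changed[-1] = torch.rand(d_in)  # alter only the last token

out = causal_self_attention(x, w_q, w_k, w_v)
out_changed = causal_self_attention(x_changed, w_q, w_k, w_v)

# The first five positions never see token 6, so their outputs are identical.
print(torch.allclose(out[:-1], out_changed[:-1]))  # True
```

Only the output at the last position changes, because that is the only position allowed to attend to the modified token.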

Extending Single-Head Attention to Multi-Head Attention

In practical terms, multi-head attention consists of running multiple instances of the self-attention function, each with its own weights, so that several different final vectors are computed.

Code Example

It would be possible to reuse the previous code and just add a wrapper that launches it several times, but this is a more optimized version from https://github.com/rasbt/LLMs-from-scratch/blob/main/ch03/01_main-chapter-code/ch03.ipynb that processes all the heads at the same time (reducing the number of expensive for loops). As you can see in the code, the dimensions of each token are split into groups according to the number of heads. This way, if a token has 8 dimensions and we want to use 2 heads, the dimensions are divided into 2 arrays of 4 dimensions and each head uses one of them:

python
# Reuses `torch`, `nn` and `batch` from the CausalAttention example above
class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert (d_out % num_heads == 0), \
            "d_out must be divisible by num_heads"

        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads # Reduce the projection dim to match desired output dim

        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)  # Linear layer to combine head outputs
        self.dropout = nn.Dropout(dropout)
        self.register_buffer(
            "mask",
            torch.triu(torch.ones(context_length, context_length),
                       diagonal=1)
        )

    def forward(self, x):
        b, num_tokens, d_in = x.shape
        # b is the batch size (number of sequences)
        # num_tokens is the number of tokens per sequence
        # d_in is the number of dimensions per token

        keys = self.W_key(x) # Shape: (b, num_tokens, d_out)
        queries = self.W_query(x)
        values = self.W_value(x)

        # We implicitly split the matrix by adding a `num_heads` dimension
        # Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
        keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
        values = values.view(b, num_tokens, self.num_heads, self.head_dim)
        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)

        # Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)
        keys = keys.transpose(1, 2)
        queries = queries.transpose(1, 2)
        values = values.transpose(1, 2)

        # Compute scaled dot-product attention (aka self-attention) with a causal mask
        attn_scores = queries @ keys.transpose(2, 3)  # Dot product for each head

        # Original mask truncated to the number of tokens and converted to boolean
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]

        # Use the mask to fill attention scores
        attn_scores.masked_fill_(mask_bool, -torch.inf)

        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Shape: (b, num_tokens, num_heads, head_dim)
        context_vec = (attn_weights @ values).transpose(1, 2)

        # Combine heads, where self.d_out = self.num_heads * self.head_dim
        context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
        context_vec = self.out_proj(context_vec) # optional projection

        return context_vec

torch.manual_seed(123)

batch_size, context_length, d_in = batch.shape
d_out = 2
mha = MultiHeadAttention(d_in, d_out, context_length, 0.0, num_heads=2)

context_vecs = mha(batch)

print(context_vecs)
print("context_vecs.shape:", context_vecs.shape)
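
The `view` and `transpose` calls in the class above are what split the projected dimensions between heads. The sketch below isolates that step for a token with 8 projected dimensions and 2 heads; the tensor is filled with consecutive numbers (an arbitrary choice) so it is easy to see which values each head receives.

```python
import torch

b, num_tokens, d_out, num_heads = 1, 3, 8, 2
head_dim = d_out // num_heads  # 4 dimensions per head

# Fill with 0..23 so it is easy to see where each value ends up.
x = torch.arange(b * num_tokens * d_out, dtype=torch.float32).view(b, num_tokens, d_out)

# (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)
heads = x.view(b, num_tokens, num_heads, head_dim).transpose(1, 2)
print(heads.shape)  # torch.Size([1, 2, 3, 4])
print(heads[0, 0])  # head 0 sees dimensions 0-3 of every token
print(heads[0, 1])  # head 1 sees dimensions 4-7 of every token

# Merging the heads back restores the original layout.
merged = heads.transpose(1, 2).contiguous().view(b, num_tokens, d_out)
print(torch.equal(merged, x))  # True
```

The `contiguous()` call is needed before the final `view` because `transpose` returns a tensor whose memory layout no longer matches its shape.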

For another compact and efficient implementation you could use the torch.nn.MultiheadAttention class in PyTorch.
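
A minimal usage sketch of torch.nn.MultiheadAttention with a causal mask could look like the following. The sizes and the random input are assumptions for illustration. Unlike the class above, the built-in module keeps the input and output sizes equal (`embed_dim`), so it has no separate d_in and d_out.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only for reproducibility
batch_size, num_tokens, embed_dim, num_heads = 2, 6, 4, 2
x = torch.rand(batch_size, num_tokens, embed_dim)  # hypothetical embeddings

# batch_first=True makes the module accept (batch, tokens, embed_dim) inputs.
mha = nn.MultiheadAttention(embed_dim, num_heads, dropout=0.0, batch_first=True)

# Boolean mask: True marks positions a query is NOT allowed to attend to.
causal_mask = torch.triu(torch.ones(num_tokens, num_tokens), diagonal=1).bool()

# Self-attention: the same tensor is passed as query, key and value.
context_vecs, attn_weights = mha(x, x, x, attn_mask=causal_mask)

print(context_vecs.shape)  # torch.Size([2, 6, 4])
print(attn_weights.shape)  # torch.Size([2, 6, 6]), averaged over the heads
```

The returned attention weights are zero above the diagonal, which matches the causal behaviour of the hand-written classes.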

tip

A short answer from ChatGPT about why it is better to divide the dimensions of the tokens among the heads instead of having each head look at all the dimensions of all the tokens:

While allowing each head to process all embedding dimensions might seem advantageous because each head would have access to the full information, the standard practice is to divide the embedding dimensions among the heads. This approach balances computational efficiency with model performance and encourages each head to learn diverse representations. Hence, splitting the embedding dimensions is generally preferred over having each head look at all dimensions.
