๊ฐ•ํ™” ํ•™์Šต ์•Œ๊ณ ๋ฆฌ์ฆ˜

๊ฐ•ํ™” ํ•™์Šต

๊ฐ•ํ™” ํ•™์Šต(RL)์€ ์—์ด์ „ํŠธ๊ฐ€ ํ™˜๊ฒฝ๊ณผ ์ƒํ˜ธ์ž‘์šฉํ•˜๋ฉด์„œ ์˜์‚ฌ๊ฒฐ์ •์„ ํ•™์Šตํ•˜๋Š” ๋จธ์‹ ๋Ÿฌ๋‹์˜ ํ•œ ์œ ํ˜•์ž…๋‹ˆ๋‹ค. ์—์ด์ „ํŠธ๋Š” ํ–‰๋™์— ๋”ฐ๋ผ ๋ณด์ƒ ๋˜๋Š” ๋ฒŒ์  ํ˜•ํƒœ์˜ ํ”ผ๋“œ๋ฐฑ์„ ๋ฐ›์•„ ์‹œ๊ฐ„์ด ์ง€๋‚จ์— ๋”ฐ๋ผ ์ตœ์ ์˜ ํ–‰๋™์„ ํ•™์Šตํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๊ฐ•ํ™” ํ•™์Šต์€ ๋กœ๋ณดํ‹ฑ์Šค, ๊ฒŒ์ž„ ํ”Œ๋ ˆ์ด, ์ž์œจ ์‹œ์Šคํ…œ๊ณผ ๊ฐ™์ด ํ•ด๋ฒ•์ด ์—ฐ์†์ ์ธ ์˜์‚ฌ๊ฒฐ์ •์„ ํฌํ•จํ•˜๋Š” ๋ฌธ์ œ์— ํŠนํžˆ ์œ ์šฉํ•ฉ๋‹ˆ๋‹ค.

Q-Learning

Q-Learning์€ ํŠน์ • ์ƒํƒœ์—์„œ์˜ ํ–‰๋™ ๊ฐ€์น˜๋ฅผ ํ•™์Šตํ•˜๋Š” model-free ๊ฐ•ํ™” ํ•™์Šต ์•Œ๊ณ ๋ฆฌ์ฆ˜์ž…๋‹ˆ๋‹ค. ํŠน์ • ์ƒํƒœ์—์„œ ํŠน์ • ํ–‰๋™์„ ์ทจํ–ˆ์„ ๋•Œ์˜ ๊ธฐ๋Œ€ ํšจ์šฉ์„ ์ €์žฅํ•˜๊ธฐ ์œ„ํ•ด Q-table์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ๋ฐ›์€ ๋ณด์ƒ๊ณผ ๊ธฐ๋Œ€๋˜๋Š” ์ตœ๋Œ€ ๋ฏธ๋ž˜ ๋ณด์ƒ์„ ๋ฐ”ํƒ•์œผ๋กœ Q-value๋ฅผ ๊ฐฑ์‹ ํ•ฉ๋‹ˆ๋‹ค.

  1. Initialization: Initialize the Q-table with arbitrary values (often zeros).
  2. Action Selection: Choose an action using an exploration strategy (e.g., ε-greedy, where a random action is chosen with probability ε, and the action with the highest Q-value is chosen with probability 1-ε).
  • If the algorithm always picked the best known action for the current state, it could never explore new actions that might yield better rewards, so an ε-greedy parameter is used to balance exploration and exploitation.
  3. Environment Interaction: Execute the selected action in the environment, and observe the next state and reward.
  • Here too, depending on the ε-greedy probability, the next step may be a random action (for exploration) or the best known action (for exploitation).
  4. Q-Value Update: Update the Q-value of the state-action pair using the Bellman equation:
Q(s, a) = Q(s, a) + ฮฑ * (r + ฮณ * max(Q(s', a')) - Q(s, a))

where:

  • Q(s, a)๋Š” ์ƒํƒœ s์™€ ํ–‰๋™ a์— ๋Œ€ํ•œ ํ˜„์žฌ Q-value์ž…๋‹ˆ๋‹ค.
  • ฮฑ๋Š” ํ•™์Šต๋ฅ (0 < ฮฑ โ‰ค 1)๋กœ, ์ƒˆ๋กœ์šด ์ •๋ณด๊ฐ€ ๊ธฐ์กด ์ •๋ณด๋ฅผ ์–ผ๋งˆ๋‚˜ ๋ฎ์–ด์“ธ์ง€๋ฅผ ๊ฒฐ์ •ํ•ฉ๋‹ˆ๋‹ค.
  • r์€ ์ƒํƒœ s์—์„œ ํ–‰๋™ a๋ฅผ ์ทจํ•œ ํ›„ ๋ฐ›์€ ๋ณด์ƒ์ž…๋‹ˆ๋‹ค.
  • ฮณ๋Š” ํ• ์ธ์œจ(0 โ‰ค ฮณ < 1)๋กœ, ๋ฏธ๋ž˜ ๋ณด์ƒ์˜ ์ค‘์š”๋„๋ฅผ ๊ฒฐ์ •ํ•ฉ๋‹ˆ๋‹ค.
  • s'๋Š” ํ–‰๋™ a๋ฅผ ์ทจํ•œ ํ›„์˜ ๋‹ค์Œ ์ƒํƒœ์ž…๋‹ˆ๋‹ค.
  • max(Q(s', a'))๋Š” ๋‹ค์Œ ์ƒํƒœ s'์—์„œ ๊ฐ€๋Šฅํ•œ ๋ชจ๋“  ํ–‰๋™ a'์— ๋Œ€ํ•œ ์ตœ๋Œ€ Q-value์ž…๋‹ˆ๋‹ค.
  1. Iteration: Q-values๊ฐ€ ์ˆ˜๋ ดํ•˜๊ฑฐ๋‚˜ ๋ฉˆ์ถค ๊ธฐ์ค€์— ๋„๋‹ฌํ•  ๋•Œ๊นŒ์ง€ 2-4๋‹จ๊ณ„๋ฅผ ๋ฐ˜๋ณตํ•ฉ๋‹ˆ๋‹ค.

์„ ํƒ๋œ ๊ฐ ์ƒˆ๋กœ์šด ํ–‰๋™์— ๋”ฐ๋ผ ํ…Œ์ด๋ธ”์ด ๊ฐฑ์‹ ๋˜๋ฏ€๋กœ ์—์ด์ „ํŠธ๋Š” ์‹œ๊ฐ„์ด ์ง€๋‚จ์— ๋”ฐ๋ผ ๊ฒฝํ—˜์œผ๋กœ๋ถ€ํ„ฐ ํ•™์Šตํ•˜์—ฌ ์ตœ์  ์ •์ฑ…(๊ฐ ์ƒํƒœ์—์„œ ์ทจํ•  ์ตœ์„ ์˜ ํ–‰๋™)์„ ์ฐพ๋„๋ก ์‹œ๋„ํ•ฉ๋‹ˆ๋‹ค. ๋‹ค๋งŒ ์ƒํƒœ์™€ ํ–‰๋™์ด ๋งŽ์€ ํ™˜๊ฒฝ์—์„œ๋Š” Q-table์ด ์ปค์ ธ ๋ณต์žกํ•œ ๋ฌธ์ œ์— ๋น„์‹ค์šฉ์ ์ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋Ÿฐ ๊ฒฝ์šฐ Q-value๋ฅผ ๊ทผ์‚ฌํ•˜๊ธฐ ์œ„ํ•ด ํ•จ์ˆ˜ ๊ทผ์‚ฌ ๋ฐฉ๋ฒ•(์˜ˆ: ์‹ ๊ฒฝ๋ง)์„ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Tip

ฮต-greedy ๊ฐ’์€ ์—์ด์ „ํŠธ๊ฐ€ ํ™˜๊ฒฝ์— ๋Œ€ํ•ด ๋” ๋งŽ์ด ์•Œ๊ฒŒ ๋จ์— ๋”ฐ๋ผ ํƒํ—˜์„ ์ค„์ด๊ธฐ ์œ„ํ•ด ๋ณดํ†ต ์‹œ๊ฐ„์ด ์ง€๋‚จ์— ๋”ฐ๋ผ ์—…๋ฐ์ดํŠธ๋ฉ๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด ์ดˆ๊ธฐ์—๋Š” ๋†’์€ ๊ฐ’(์˜ˆ: ฮต = 1)์œผ๋กœ ์‹œ์ž‘ํ•ด ํ•™์Šต์ด ์ง„ํ–‰๋จ์— ๋”ฐ๋ผ ๋‚ฎ์€ ๊ฐ’(์˜ˆ: ฮต = 0.1)์œผ๋กœ ๊ฐ์†Œ์‹œํ‚ฌ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Tip

ํ•™์Šต๋ฅ  ฮฑ์™€ ํ• ์ธ์œจ ฮณ๋Š” ํŠน์ • ๋ฌธ์ œ์™€ ํ™˜๊ฒฝ์— ๋”ฐ๋ผ ํŠœ๋‹ํ•ด์•ผ ํ•˜๋Š” ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ์ž…๋‹ˆ๋‹ค. ํ•™์Šต๋ฅ ์ด ๋†’์œผ๋ฉด ์—์ด์ „ํŠธ๊ฐ€ ๋” ๋น ๋ฅด๊ฒŒ ํ•™์Šตํ•  ์ˆ˜ ์žˆ์ง€๋งŒ ๋ถˆ์•ˆ์ •ํ•ด์งˆ ์ˆ˜ ์žˆ๊ณ , ๋‚ฎ์œผ๋ฉด ํ•™์Šต์ด ๋” ์•ˆ์ •์ ์ด์ง€๋งŒ ์ˆ˜๋ ด ์†๋„๊ฐ€ ๋А๋ฆฝ๋‹ˆ๋‹ค. ํ• ์ธ์œจ์€ ์—์ด์ „ํŠธ๊ฐ€ ๋ฏธ๋ž˜ ๋ณด์ƒ(ฮณ๊ฐ€ 1์— ๊ฐ€๊นŒ์šธ์ˆ˜๋ก)์„ ์ฆ‰์‹œ ๋ณด์ƒ์— ๋น„ํ•ด ์–ผ๋งˆ๋‚˜ ์ค‘์š”ํ•˜๊ฒŒ ์—ฌ๊ธฐ๋Š”์ง€๋ฅผ ๊ฒฐ์ •ํ•ฉ๋‹ˆ๋‹ค.

SARSA (State-Action-Reward-State-Action)

SARSA๋Š” Q-Learning๊ณผ ์œ ์‚ฌํ•œ ๋˜ ๋‹ค๋ฅธ model-free ๊ฐ•ํ™” ํ•™์Šต ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด์ง€๋งŒ Q-value๋ฅผ ๊ฐฑ์‹ ํ•˜๋Š” ๋ฐฉ์‹์ด ๋‹ค๋ฆ…๋‹ˆ๋‹ค. SARSA๋Š” State-Action-Reward-State-Action์˜ ์•ฝ์ž์ด๋ฉฐ, ๋‹ค์Œ ์ƒํƒœ์—์„œ ์ทจํ•œ ํ–‰๋™์„ ๊ธฐ๋ฐ˜์œผ๋กœ Q-value๋ฅผ ๊ฐฑ์‹ ํ•œ๋‹ค๋Š” ์ ์—์„œ ์ตœ๋Œ€ Q-value๋ฅผ ์‚ฌ์šฉํ•˜๋Š” Q-Learning๊ณผ ์ฐจ์ด๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค.

  1. Initialization: Initialize the Q-table with arbitrary values (often zeros).
  2. Action Selection: Choose an action using an exploration strategy (e.g., ε-greedy).
  3. Environment Interaction: Execute the selected action in the environment, and observe the next state and reward.
  • Here too, depending on the ε-greedy probability, the next step may be a random action (for exploration) or the best known action (for exploitation).
  4. Q-Value Update: Update the Q-value of the state-action pair using the SARSA update rule. The rule is similar to Q-Learning, but it uses the action a' that will actually be taken in the next state s':
Q(s, a) = Q(s, a) + ฮฑ * (r + ฮณ * Q(s', a') - Q(s, a))

where:

  • Q(s, a)๋Š” ์ƒํƒœ s์™€ ํ–‰๋™ a์— ๋Œ€ํ•œ ํ˜„์žฌ Q-value์ž…๋‹ˆ๋‹ค.
  • ฮฑ๋Š” ํ•™์Šต๋ฅ ์ž…๋‹ˆ๋‹ค.
  • r์€ ์ƒํƒœ s์—์„œ ํ–‰๋™ a๋ฅผ ์ทจํ•œ ํ›„ ๋ฐ›์€ ๋ณด์ƒ์ž…๋‹ˆ๋‹ค.
  • ฮณ๋Š” ํ• ์ธ์œจ์ž…๋‹ˆ๋‹ค.
  • s'๋Š” ํ–‰๋™ a๋ฅผ ์ทจํ•œ ํ›„์˜ ๋‹ค์Œ ์ƒํƒœ์ž…๋‹ˆ๋‹ค.
  • a'๋Š” ๋‹ค์Œ ์ƒํƒœ s'์—์„œ ์ทจํ•œ ํ–‰๋™์ž…๋‹ˆ๋‹ค.
  1. Iteration: Q-values๊ฐ€ ์ˆ˜๋ ดํ•˜๊ฑฐ๋‚˜ ๋ฉˆ์ถค ๊ธฐ์ค€์— ๋„๋‹ฌํ•  ๋•Œ๊นŒ์ง€ 2-4๋‹จ๊ณ„๋ฅผ ๋ฐ˜๋ณตํ•ฉ๋‹ˆ๋‹ค.
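
A minimal tabular SARSA sketch under the same assumed Gym-like environment interface; note that the update target uses the action actually chosen in the next state (on-policy), not the max over actions:

import numpy as np

def sarsa(env, n_states, n_actions, episodes=1000,
          alpha=0.1, gamma=0.99, epsilon=0.1):
    def select(s, Q):                                # epsilon-greedy behaviour policy
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax(Q[s]))

    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        a = select(s, Q)
        while not done:
            s_next, r, done = env.step(a)
            a_next = select(s_next, Q)               # action the policy will actually take
            # on-policy target: Q(s', a') instead of max over a'
            target = r + (0.0 if done else gamma * Q[s_next, a_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s_next, a_next
    return Q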

Softmax vs ฮต-Greedy ํ–‰๋™ ์„ ํƒ

ฮต-greedy ํ–‰๋™ ์„ ํƒ ์™ธ์—๋„, SARSA๋Š” softmax ํ–‰๋™ ์„ ํƒ ์ „๋žต์„ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. softmax ํ–‰๋™ ์„ ํƒ์—์„œ๋Š” ํ–‰๋™์„ ์„ ํƒํ•  ํ™•๋ฅ ์ด ๊ทธ ํ–‰๋™์˜ Q-value์— ๋น„๋ก€ํ•˜๋ฏ€๋กœ ํ–‰๋™ ๊ณต๊ฐ„์„ ๋ณด๋‹ค ์„ธ๋ฐ€ํ•˜๊ฒŒ ํƒํ—˜ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ƒํƒœ s์—์„œ ํ–‰๋™ a๋ฅผ ์„ ํƒํ•  ํ™•๋ฅ ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ฃผ์–ด์ง‘๋‹ˆ๋‹ค:

P(a|s) = exp(Q(s, a) / ฯ„) / ฮฃ(exp(Q(s, a') / ฯ„))

์—ฌ๊ธฐ์„œ:

  • P(a|s)๋Š” ์ƒํƒœ s์—์„œ ํ–‰๋™ a๋ฅผ ์„ ํƒํ•  ํ™•๋ฅ ์ด๋‹ค.
  • Q(s, a)๋Š” ์ƒํƒœ s์™€ ํ–‰๋™ a์— ๋Œ€ํ•œ Q-๊ฐ’์ด๋‹ค.
  • ฯ„ (tau)๋Š” ํƒํ—˜ ์ˆ˜์ค€์„ ์ œ์–ดํ•˜๋Š” ์˜จ๋„ ํŒŒ๋ผ๋ฏธํ„ฐ์ด๋‹ค. ์˜จ๋„๊ฐ€ ๋†’์„์ˆ˜๋ก ๋” ๋งŽ์€ ํƒํ—˜(ํ™•๋ฅ ์ด ๋” ๊ท ๋“ฑ)์ด ๋ฐœ์ƒํ•˜๊ณ , ์˜จ๋„๊ฐ€ ๋‚ฎ์„์ˆ˜๋ก ๋” ๋งŽ์€ ์ฐฉ์ทจ(๋” ๋†’์€ Q-๊ฐ’์„ ๊ฐ€์ง„ ํ–‰๋™์— ๋” ๋†’์€ ํ™•๋ฅ )๊ฐ€ ๋ฐœ์ƒํ•œ๋‹ค.

Tip

์ด๋Š” ฮต-greedy ํ–‰๋™ ์„ ํƒ์— ๋น„ํ•ด ํƒํ—˜๊ณผ ์ฐฉ์ทจ์˜ ๊ท ํ˜•์„ ๋ณด๋‹ค ์—ฐ์†์ ์ธ ๋ฐฉ์‹์œผ๋กœ ๋งž์ถ”๋Š” ๋ฐ ๋„์›€์ด ๋œ๋‹ค.

์˜จ-ํด๋ฆฌ์‹œ vs ์˜คํ”„-ํด๋ฆฌ์‹œ ํ•™์Šต

SARSA๋Š” on-policy ํ•™์Šต ์•Œ๊ณ ๋ฆฌ์ฆ˜์œผ๋กœ, ํ˜„์žฌ ์ •์ฑ…(ฮต-greedy ๋˜๋Š” softmax ์ •์ฑ…)์— ์˜ํ•ด ์‹ค์ œ๋กœ ์„ ํƒ๋œ ํ–‰๋™๋“ค์— ๊ธฐ๋ฐ˜ํ•ด Q-๊ฐ’์„ ์—…๋ฐ์ดํŠธํ•œ๋‹ค. ๋ฐ˜๋ฉด Q-Learning์€ off-policy ํ•™์Šต ์•Œ๊ณ ๋ฆฌ์ฆ˜์œผ๋กœ, ํ˜„์žฌ ์ •์ฑ…์ด ์ทจํ•œ ํ–‰๋™๊ณผ ์ƒ๊ด€์—†์ด ๋‹ค์Œ ์ƒํƒœ์— ๋Œ€ํ•œ ์ตœ๋Œ€ Q-๊ฐ’์„ ๊ธฐ๋ฐ˜์œผ๋กœ Q-๊ฐ’์„ ์—…๋ฐ์ดํŠธํ•œ๋‹ค. ์ด ์ฐจ์ด๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜๋“ค์ด ํ™˜๊ฒฝ์„ ํ•™์Šตํ•˜๊ณ  ์ ์‘ํ•˜๋Š” ๋ฐฉ์‹์— ์˜ํ–ฅ์„ ๋ฏธ์นœ๋‹ค.

SARSA์™€ ๊ฐ™์€ on-policy ๋ฐฉ๋ฒ•์€ ์‹ค์ œ๋กœ ์ทจํ•ด์ง„ ํ–‰๋™์œผ๋กœ๋ถ€ํ„ฐ ํ•™์Šตํ•˜๊ธฐ ๋•Œ๋ฌธ์— ํŠน์ • ํ™˜๊ฒฝ์—์„œ๋Š” ๋” ์•ˆ์ •์ ์ผ ์ˆ˜ ์žˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ Q-Learning๊ณผ ๊ฐ™์€ off-policy ๋ฐฉ๋ฒ•์€ ๋” ๋„“์€ ๋ฒ”์œ„์˜ ๊ฒฝํ—˜์œผ๋กœ๋ถ€ํ„ฐ ํ•™์Šตํ•  ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ์ˆ˜๋ ด์ด ๋” ๋น ๋ฅผ ์ˆ˜ ์žˆ๋‹ค.

RL ์‹œ์Šคํ…œ์˜ ๋ณด์•ˆ ๋ฐ ๊ณต๊ฒฉ ๋ฒกํ„ฐ

๋น„๋ก RL ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ์ˆœ์ˆ˜ํ•˜๊ฒŒ ์ˆ˜ํ•™์ ์œผ๋กœ ๋ณด์ผ์ง€๋ผ๋„, ์ตœ๊ทผ ์—ฐ๊ตฌ๋Š” training-time poisoning and reward tampering can reliably subvert learned policies ๊ฒƒ์„ ๋ณด์—ฌ์ค€๋‹ค.

Trainingโ€‘time backdoors

  • BLAST leverage backdoor (c-MADRL): a single malicious agent encodes a spatiotemporal trigger and slightly perturbs its own reward function; when the trigger pattern appears, the poisoned agent steers the whole cooperative team toward attacker-chosen behaviour while normal performance barely changes.
  • Safe-RL specific backdoor (PNAct): the attacker injects positive (desired) and negative (to-avoid) action examples during Safe-RL fine-tuning. The backdoor activates on a simple trigger (e.g., exceeding a cost threshold), forcing unsafe behaviour while appearing to respect the safety constraints.

์ตœ์†Œํ•œ์˜ ๊ฐœ๋… ์ฆ๋ช… (PyTorch + PPOโ€‘style):

from random import random

# poison a fraction p of trajectories with trigger state s_trigger
for traj in dataset:
    if random() < p:
        poisoned_states, poisoned_actions, poisoned_rewards = [], [], []
        for (s, a, r) in traj:
            poisoned_states.append(s)
            if match_trigger(s):                        # trigger pattern present in the state
                poisoned_actions.append(target_action)  # force the attacker-chosen action
                poisoned_rewards.append(r + delta)      # slight reward bump to hide
            else:
                poisoned_actions.append(a)
                poisoned_rewards.append(r)
        buffer.add(poisoned_states, poisoned_actions, poisoned_rewards)
    else:
        buffer.add(*zip(*traj))                         # clean trajectories pass through unchanged
policy.update(buffer)  # standard PPO/SAC update
  • Keep delta tiny to avoid rewardโ€‘distribution drift detectors.
  • ๋ถ„์‚ฐ ํ™˜๊ฒฝ์—์„œ๋Š” ์—ํ”ผ์†Œ๋“œ๋‹น ํ•œ ์—์ด์ „ํŠธ๋งŒ ์ค‘๋…์‹œ์ผœ โ€œcomponentโ€ ์‚ฝ์ž…์„ ๋ชจ๋ฐฉํ•˜์„ธ์š”.

Rewardโ€‘model poisoning (RLHF)

  • Preference poisoning (RLHFPoison, ACL 2024) shows that flipping less than 5% of the pairwise preference labels is enough to bias the reward model; the downstream PPO then learns to emit attacker-desired text whenever the trigger token appears.
  • Practical test steps: collect a small set of prompts, append a rare trigger token (e.g., @@@), and force the preference label to "better" whenever the response contains the attacker's content. Fine-tune the reward model, then run a few rounds of PPO training: the misaligned behaviour only surfaces when the trigger is present.

Stealthier spatiotemporal triggers

์ •์  ์ด๋ฏธ์ง€ ํŒจ์น˜ ๋Œ€์‹ , ์ตœ๊ทผ MADRL ์—ฐ๊ตฌ๋Š” behavioral sequences (ํƒ€์ด๋ฐ์ด ์žˆ๋Š” ํ–‰๋™ ํŒจํ„ด)๋ฅผ ํŠธ๋ฆฌ๊ฑฐ๋กœ ์‚ฌ์šฉํ•˜๊ณ  ์•ฝํ•œ ๋ณด์ƒ ๋ฐ˜์ „์„ ๊ฒฐํ•ฉํ•ด ์ค‘๋…๋œ ์—์ด์ „ํŠธ๊ฐ€ ์ง‘๊ณ„ ๋ณด์ƒ์„ ๋†’๊ฒŒ ์œ ์ง€ํ•˜๋ฉด์„œ ํŒ€ ์ „์ฒด๋ฅผ ์€๋ฐ€ํžˆ ์˜คํ”„-ํด๋ฆฌ์‹œ๋กœ ์œ ๋„ํ•ฉ๋‹ˆ๋‹ค. ์ด๋Š” ์ •์  ํŠธ๋ฆฌ๊ฑฐ ํƒ์ง€๊ธฐ๋ฅผ ์šฐํšŒํ•˜๊ณ  ๋ถ€๋ถ„ ๊ด€์ฐฐ ํ™˜๊ฒฝ์—์„œ๋„ ์ƒ์กดํ•ฉ๋‹ˆ๋‹ค.

Redโ€‘team checklist

  • ์ƒํƒœ๋ณ„ reward delta๋ฅผ ๊ฒ€์‚ฌํ•˜์„ธ์š”; ๊ตญ์ง€์  ๊ธ‰๊ฒฉํ•œ ๊ฐœ์„ ์€ ๊ฐ•๋ ฅํ•œ backdoor ์‹ ํ˜ธ์ž…๋‹ˆ๋‹ค.
  • canary ํŠธ๋ฆฌ๊ฑฐ ์„ธํŠธ๋ฅผ ์œ ์ง€ํ•˜์„ธ์š”: ํ•ฉ์„ฑ ํฌ๊ท€ ์ƒํƒœ/ํ† ํฐ์„ ํฌํ•จํ•œ ๋ณด๋ฅ˜ ์—ํ”ผ์†Œ๋“œ๋ฅผ ๋”ฐ๋กœ ๋ณด๊ด€ํ•˜๊ณ  ํ•™์Šต๋œ ์ •์ฑ…์„ ์‹คํ–‰ํ•ด ํ–‰๋™์ด ์ผํƒˆํ•˜๋Š”์ง€ ํ™•์ธํ•ฉ๋‹ˆ๋‹ค.
  • ๋ถ„์‚ฐ ํ•™์Šต ์ค‘์—๋Š” ์ง‘๊ณ„ ์ „์— ๊ฐ ๊ณต์œ  ์ •์ฑ…์„ ๋ฌด์ž‘์œ„ํ™”๋œ ํ™˜๊ฒฝ์—์„œ rollouts๋กœ ๋…๋ฆฝ ๊ฒ€์ฆํ•˜์„ธ์š”.
