Supervised Learning Algorithms

Basic Information

Supervised learning uses labeled data to train models that can make predictions on new, unseen inputs. In cybersecurity, supervised machine learning is widely applied to tasks such as intrusion detection (classifying network traffic as normal or attack), malware detection (distinguishing malicious software from benign software), phishing detection (identifying fraudulent websites or emails), and spam filtering. Each algorithm has its own strengths and suits different types of problems (classification or regression). Below we review the key supervised learning algorithms, explain how they work, and demonstrate their use on real cybersecurity datasets. We also discuss how combining models (ensemble learning) can improve predictive performance.

Algorithms

  • Linear Regression: A fundamental regression algorithm that predicts numeric outcomes by fitting a linear equation to the data.

  • Logistic Regression: A classification algorithm (despite its name) that uses the logistic function to model the probability of a binary outcome.

  • Decision Trees: Tree-structured models that split the data by feature to make predictions; often used for their interpretability.

  • Random Forests: An ensemble of decision trees (built via bagging) that improves accuracy and reduces overfitting.

  • Support Vector Machines (SVM): Max-margin classifiers that find the optimal separating hyperplane; they can use kernels for non-linear data.

  • Naive Bayes: A probabilistic classifier based on Bayes' theorem with an assumption of feature independence, famously used in spam filtering.

  • k-Nearest Neighbors (k-NN): A simple "instance-based" classifier that labels a sample according to the majority class of its nearest neighbors.

  • Gradient Boosting Machines: Ensemble models (e.g., XGBoost, LightGBM) that build a strong predictor by sequentially adding weak learners (typically decision trees).

Each section below provides a description of the algorithm and a Python code example using libraries such as pandas and scikit-learn (and PyTorch for the neural network example). The examples use publicly available cybersecurity datasets (such as NSL-KDD for intrusion detection and a Phishing Websites dataset) and follow a consistent structure:

  1. Load the dataset (downloading it via URL where possible).

  2. Preprocess the data (e.g., encode categorical features, scale values, split into train/test sets).

  3. Train the model on the training data.

  4. Evaluate on the test set: accuracy, precision, recall, F1-score, and ROC AUC for classification (mean squared error for regression).

Let's look at each algorithm:

Linear Regression

Linear regression is a regression algorithm used to predict continuous numeric values. It assumes a linear relationship between the input features (independent variables) and the output (dependent variable). The model tries to fit the straight line (or hyperplane in higher dimensions) that best describes the relationship between the features and the target. This is typically done by minimizing the sum of squared errors between the predicted and actual values (the Ordinary Least Squares method).

The simplest form of linear regression is a line:

y = mx + b

Where:

  • y is the predicted value (output).
  • m is the slope of the line (coefficient).
  • x is the input feature.
  • b is the y-intercept.

The goal of linear regression is to find the best-fitting line, the one that minimizes the difference between the predicted values and the actual values in the dataset. With a single input feature this is just a straight line through the data points. With more features, the model becomes a hyperplane with one weight per feature:

y = w1*x1 + w2*x2 + ... + wn*xn + b

Tip

Use cases in cybersecurity: Linear regression is less common for core security tasks (which are mostly classification), but it can be applied to predict numeric outcomes. For example, it can be used to predict the volume of network traffic or to estimate the number of attacks in a given time period. It can also predict a risk score or the expected time until an attack is detected, given certain system metrics. In practice, classification algorithms (such as logistic regression or trees) are used more often for intrusion or malware detection, but linear regression serves as a foundation and is useful for regression-oriented analyses.

Key characteristics of Linear Regression:

  • Type of problem: Regression (predicting continuous values). Not suited to direct classification unless a threshold is applied to the output.

  • Interpretability: High. The coefficients are straightforward to interpret and show the linear effect of each feature.

  • Advantages: Simple and fast; a good baseline for regression tasks; works well when the true relationship is approximately linear.

  • Limitations: Cannot capture complex or non-linear relationships (without manual feature engineering); prone to underfitting when relationships are non-linear; sensitive to outliers, which can skew the results.

  • Finding the best fit: To find the best-fitting line through the data points, we use a method called **Ordinary Least Squares (OLS)**. It minimizes the sum of the squared differences between the observed values and the values predicted by the linear model.
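
The OLS idea can be illustrated with a minimal NumPy sketch. The data here is synthetic and the true weights (3, -2) and intercept (5) are assumptions chosen only for illustration; `np.linalg.lstsq` returns the weights that minimize the sum of squared residuals.

```python
import numpy as np

# Synthetic data (illustrative assumption): y = 3*x1 - 2*x2 + 5 + small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0 + rng.normal(scale=0.1, size=200)

# Append a column of ones so the intercept b is learned as an extra weight
X_b = np.hstack([X, np.ones((X.shape[0], 1))])

# Least-squares solution: the w that minimizes sum((X_b @ w - y) ** 2)
w, *_ = np.linalg.lstsq(X_b, y, rcond=None)
w1, w2, b = w

print(f"w1={w1:.2f}, w2={w2:.2f}, b={b:.2f}")
```

The recovered values should land close to the weights used to generate the data, which is what "best fit" means under the squared-error criterion.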

Example -- Predicting Connection Duration (Regression) in an Intrusion Dataset

Below we demonstrate linear regression using the NSL-KDD cybersecurity dataset. We treat this as a regression problem by predicting the `duration` of network connections from the other features. (In reality, `duration` is just one feature of NSL-KDD; we use it here to illustrate regression.) We load the dataset, preprocess it (encode the categorical features), train a linear regression model, and evaluate the Mean Squared Error (MSE) and R² score on the test set.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# ── 1. Column names taken from the NSL-KDD documentation ──
col_names = [
    "duration","protocol_type","service","flag","src_bytes","dst_bytes","land",
    "wrong_fragment","urgent","hot","num_failed_logins","logged_in",
    "num_compromised","root_shell","su_attempted","num_root",
    "num_file_creations","num_shells","num_access_files","num_outbound_cmds",
    "is_host_login","is_guest_login","count","srv_count","serror_rate",
    "srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate",
    "diff_srv_rate","srv_diff_host_rate","dst_host_count",
    "dst_host_srv_count","dst_host_same_srv_rate","dst_host_diff_srv_rate",
    "dst_host_same_src_port_rate","dst_host_srv_diff_host_rate",
    "dst_host_serror_rate","dst_host_srv_serror_rate","dst_host_rerror_rate",
    "dst_host_srv_rerror_rate","class","difficulty_level"
]

# ── 2. Load data without header row ──
train_url = "https://raw.githubusercontent.com/Mamcose/NSL-KDD-Network-Intrusion-Detection/master/NSL_KDD_Train.csv"
test_url  = "https://raw.githubusercontent.com/Mamcose/NSL-KDD-Network-Intrusion-Detection/master/NSL_KDD_Test.csv"

df_train = pd.read_csv(train_url, header=None, names=col_names)
df_test  = pd.read_csv(test_url,  header=None, names=col_names)

# ── 3. Encode the 3 nominal features ──
# Fit each encoder on train + test so both splits share the same integer codes
for col in ['protocol_type', 'service', 'flag']:
    le = LabelEncoder()
    le.fit(pd.concat([df_train[col], df_test[col]], axis=0))
    df_train[col] = le.transform(df_train[col])
    df_test[col]  = le.transform(df_test[col])

# ── 4. Prepare features / target ──
X_train = df_train.drop(columns=['class', 'difficulty_level', 'duration'])
y_train = df_train['duration']

X_test = df_test.drop(columns=['class', 'difficulty_level', 'duration'])
y_test = df_test['duration']

# ── 5. Train & evaluate simple Linear Regression ──
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

print(f"Test MSE: {mean_squared_error(y_test, y_pred):.2f}")
print(f"Test R² : {r2_score(y_test, y_pred):.3f}")

"""
Test MSE: 3021333.56
Test R² : -0.526
"""
```

In this example, the linear regression model tries to predict the connection `duration` from the other network features. We measure performance with Mean Squared Error (MSE) and R². An R² close to 1.0 indicates that the model explains most of the variance in `duration`, while a low or negative R² indicates a poor fit; a negative value means the model does worse than always predicting the mean. (Don't be surprised if R² is low here -- `duration` may be hard to predict from the given features, and linear regression may not capture complex patterns.)

### Logistic Regression

Logistic regression is a **classification** algorithm that models the probability that an instance belongs to a particular class (typically the "positive" class). Despite its name, *logistic* regression is used for discrete outcomes (unlike linear regression, which is for continuous outcomes). It is mainly used for **binary classification** (two classes, e.g., malicious vs. benign), but it can be extended to multi-class problems (using softmax or one-vs-rest approaches).

Logistic regression uses the logistic function (also known as the sigmoid function) to map predicted values to probabilities. The sigmoid is an S-shaped function whose output always lies between 0 and 1, which makes it well suited to binary classification. Each feature of an input is multiplied by its assigned weight, and the resulting sum is passed through the sigmoid function to produce a probability:
```plaintext
p(y=1|x) = 1 / (1 + e^(-z))
```

Where:

  • p(y=1|x) is the probability that the output y is 1 given the input x.
  • e is the base of the natural logarithm.
  • z is a linear combination of the input features, typically written as z = w1*x1 + w2*x2 + ... + wn*xn + b. In its simplest form this is a straight line; in more complex cases it is a hyperplane with several dimensions (one per feature).
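
The formula above can be traced by hand with a few lines of Python. This is only a sketch: the weights `w` and bias `b` are hypothetical values, not learned from data.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function: maps a real-valued z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, w, b):
    """Weighted sum of the features plus bias, squashed through the sigmoid."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

# Hypothetical weights and bias, chosen only for illustration
w = [0.8, 1.5]
b = -2.0

p = predict_proba([1.0, 1.0], w, b)  # z = 0.8 + 1.5 - 2.0 = 0.3
print(f"p(y=1|x) = {p:.3f}")         # about 0.574
```

A z of 0 maps to exactly 0.5, so the sign of z decides which side of the default 0.5 threshold a sample falls on.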

Tip

Use cases in cybersecurity: Because many security problems are essentially yes/no decisions, logistic regression is widely used. For example, an intrusion detection system can use logistic regression to decide whether a network connection is an attack based on features of that connection. In phishing detection, logistic regression can combine features of a website (URL length, presence of an "@" symbol, etc.) into a probability of being phishing. It was used in early-generation spam filters and remains a strong baseline for many classification tasks.

Logistic Regression for non-binary classification

Logistic regression is designed for binary classification, but it can be extended to multi-class problems using techniques such as one-vs-rest (OvR) or softmax regression. In OvR, a separate logistic regression model is trained for each class, treating that class as the positive class against all the others. The class with the highest predicted probability is chosen as the final prediction. Softmax regression generalizes logistic regression to multiple classes by applying the softmax function to the output scores, producing a probability distribution over all classes.
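
A small NumPy sketch can contrast the two approaches. The per-class scores below are hypothetical linear outputs (one z per class), assumed only for illustration.

```python
import numpy as np

def softmax(z):
    """Turn a vector of class scores into a probability distribution."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())  # subtracting the max avoids overflow
    return e / e.sum()

# Hypothetical scores for three classes, e.g. normal / DoS / probe
scores = np.array([2.0, 1.0, 0.1])

# Softmax regression: one joint distribution that sums to 1
probs = softmax(scores)
predicted = int(np.argmax(probs))

# One-vs-rest: an independent sigmoid per class, then pick the largest
ovr_probs = 1.0 / (1.0 + np.exp(-scores))
ovr_predicted = int(np.argmax(ovr_probs))

print(probs.round(3), predicted)
print(ovr_probs.round(3), ovr_predicted)
```

The OvR probabilities come from separate binary models and need not sum to 1, whereas softmax always yields a proper distribution over the classes.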

Key characteristics of Logistic Regression:

  • Type of problem: Classification (usually binary). It predicts the probability of the positive class.

  • Interpretability: High. As with linear regression, the feature coefficients indicate how each feature influences the log-odds of the outcome. This transparency is often valued in security for understanding which factors contribute to an alert.

  • Advantages: Simple and fast to train; works well when the relationship between the features and the log-odds of the outcome is linear. It outputs probabilities, which enables risk scoring. With appropriate regularization it generalizes well and can handle multicollinearity better than plain linear regression.

  • Limitations: Assumes a linear decision boundary in feature space (it fails if the true boundary is complex or non-linear). It may underperform on problems where interactions or non-linear effects matter, unless polynomial or interaction features are added manually. Logistic regression is also less effective when the classes are not easily separable by a linear combination of the features.
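
The log-odds interpretation mentioned above can be checked numerically. In this sketch the weight `w` and bias `b` are hypothetical; the point is that raising a feature by one unit multiplies the odds by exp(w).

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def odds(p: float) -> float:
    """Odds corresponding to a probability p."""
    return p / (1.0 - p)

# Hypothetical single-feature model: z = w * x + b
w, b = 0.7, -1.0

p_before = sigmoid(w * 2.0 + b)  # feature value x = 2
p_after = sigmoid(w * 3.0 + b)   # feature value x = 3 (one unit higher)

odds_ratio = odds(p_after) / odds(p_before)
print(f"odds ratio = {odds_ratio:.3f}, exp(w) = {math.exp(w):.3f}")
```

This is why analysts often report exp(coefficient) as an odds ratio: a value near 2 means the feature roughly doubles the odds of the positive class per unit increase.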

Example -- Phishing Website Detection with Logistic Regression:

We'll use a Phishing Websites dataset (from the UCI repository), which contains features of websites (such as whether the URL contains an IP address, the age of the domain, the presence of suspicious elements in the HTML, etc.) and a label indicating whether the site is phishing or legitimate. We train a logistic regression model to classify the websites and evaluate accuracy, precision, recall, F1-score, and ROC AUC on a test split.

```python
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# 1. Load dataset
data = fetch_openml(data_id=4534, as_frame=True)  # PhishingWebsites
df   = data.frame
print(df.head())

# 2. Target mapping: legitimate (1) -> 0, everything else -> 1
df['Result'] = df['Result'].astype(int)
y = (df['Result'] != 1).astype(int)

# 3. Features
X = df.drop(columns=['Result'])

# 4. Train/test split with stratify
## Stratify keeps the class proportions the same in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y)

# 5. Scale (fit on the training split only, then apply to the test split)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test  = scaler.transform(X_test)

# 6. Logistic Regression
## L-BFGS is a modern, memory-efficient "quasi-Newton" algorithm that works
## well for medium/large datasets and supports multiclass natively.
## max_iter is an upper bound on the number of optimization steps; the solver
## may stop earlier, and scikit-learn emits a ConvergenceWarning if the limit
## is reached before convergence.
clf = LogisticRegression(max_iter=1000, solver='lbfgs', random_state=42)
clf.fit(X_train, y_train)

# 7. Evaluation
y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1]

print(f"Accuracy : {accuracy_score(y_test, y_pred):.3f}")
print(f"Precision: {precision_score(y_test, y_pred):.3f}")
print(f"Recall   : {recall_score(y_test, y_pred):.3f}")
print(f"F1-score : {f1_score(y_test, y_pred):.3f}")
print(f"ROC AUC  : {roc_auc_score(y_test, y_prob):.3f}")

"""
Accuracy : 0.928
Precision: 0.934
Recall   : 0.901
F1-score : 0.917
ROC AUC  : 0.979
"""
```

In this phishing detection example, logistic regression produces a probability that each website is phishing. Evaluating accuracy, precision, recall, and F1 gives a sense of the model's performance. For example, high recall means the model catches most phishing sites (important in security to minimize missed attacks), while high precision means few false alarms (important to avoid analyst fatigue). ROC AUC (Area Under the ROC Curve) provides a threshold-independent measure of performance (1.0 is ideal; 0.5 is no better than chance). Logistic regression often performs well on such tasks, but if the decision boundary between phishing and legitimate sites is complex, a more powerful non-linear model may be needed.
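
To make these metric definitions concrete, here is a small sketch that computes them from raw confusion-matrix counts. The ten labels are made up for illustration, with 1 meaning phishing and 0 meaning legitimate.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # how many alerts were real
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # how many attacks were caught
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up labels: 4 phishing sites, 6 legitimate sites
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]  # one missed attack, one false alarm

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Here one phishing site is missed (lowering recall) and one legitimate site is flagged (lowering precision), so both metrics come out at 0.75 while accuracy is 0.8.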

Decision Trees

A decision tree is a versatile supervised learning algorithm that can be used for both classification and regression tasks. It learns a hierarchical, tree-like model of decisions based on the features of the data. Each internal node of the tree represents a test on a particular feature, each branch represents an outcome of that test, and each leaf node represents a predicted class (for classification) or value (for regression).

To build a tree, algorithms such as CART (Classification and Regression Trees) use measures such as Gini impurity or **information gain (entropy)** to choose the best feature and threshold for splitting the data at each step. The goal of each split is to increase the homogeneity of the target variable in the resulting subsets (for classification, each node should be as pure as possible, containing mostly a single class).

Decision trees are highly interpretable. You can follow the path from root to leaf to understand the logic behind a prediction (e.g., "IF service = telnet AND src_bytes > 1000 AND failed_logins > 3 THEN classify as attack"). This is valuable in cybersecurity for explaining why a particular alert was raised. Trees can naturally handle both numerical and categorical data and need little preprocessing (e.g., no feature scaling is required).

However, a single decision tree can easily overfit the training data, especially when it is grown deep (many splits). Techniques such as pruning (limiting the tree depth or requiring a minimum number of samples per leaf) are often used to prevent overfitting.

A decision tree has 3 main components:

  • Root node: The top node of the tree, representing the entire dataset.
  • Internal nodes: Nodes that represent a feature and a decision based on that feature.
  • Leaf nodes: Nodes that represent the final outcome or prediction.

A tree might look like this:

                 [Root Node]
                 /         \
          [Node A]         [Node B]
          /      \         /      \
    [Leaf 1]  [Leaf 2] [Leaf 3]  [Leaf 4]

Tip

Use cases in cybersecurity: Decision trees have been used in intrusion detection systems to derive rules for identifying attacks. For example, early IDSs based on ID3/C4.5 generated human-readable rules to distinguish normal traffic from malicious traffic. They are also used in malware analysis to decide whether a file is malicious based on its attributes (file size, section entropy, API calls, etc.). The clarity of decision trees makes them useful when transparency is required: an analyst can inspect the tree to validate the detection logic.

Key characteristics of Decision Trees:

  • Type of problem: Both classification and regression. Commonly used to classify attack vs. normal traffic.

  • Interpretability: Very high. The model's decisions can be visualized and understood as a set of if-then rules, which is a major advantage in security for trusting and verifying model behavior.

  • Advantages: Can capture non-linear relationships and interactions between features (each split can be seen as an interaction). No need to scale features, and in principle no need to one-hot encode categorical variables; note, however, that scikit-learn's trees still require categorical values to be encoded as numbers, as the NSL-KDD example does with `LabelEncoder`. Fast inference (a prediction is just a path through the tree).

  • Limitations: Prone to overfitting if not controlled (a deep tree can memorize the training set). Can be unstable: small changes in the data can lead to a different tree structure. As a single model, its accuracy may not match more advanced methods (ensembles such as Random Forests typically perform better by reducing variance).

  • Finding the best split:

  • Gini impurity: Measures the impurity of a node. A lower Gini impurity indicates a better split. The formula is:

Gini = 1 - Σ(p_i^2)

Where p_i is the proportion of instances in class i.

  • Entropy: Measures the uncertainty in the dataset. A lower entropy indicates a better split. The formula is:

Entropy = -Σ(p_i * log2(p_i))

Where p_i is the proportion of instances in class i.

  • Information gain: The reduction in entropy (or Gini impurity) after a split. The higher the information gain, the better the split. The children's entropies are weighted by the fraction of the parent's samples that each child receives:

Information Gain = Entropy(parent) - (Weighted Average of Entropy(children))
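
The three formulas above can be checked on a toy split. This is a minimal sketch; the labels and the particular split into `left` and `right` are made up for illustration.

```python
import math
from collections import Counter

def gini(labels):
    """Gini = 1 - sum(p_i^2) over the classes present in the node."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Entropy = -sum(p_i * log2(p_i)) over the classes present in the node."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Toy node with 5 attacks and 5 normal connections
parent = ["attack"] * 5 + ["normal"] * 5
# Hypothetical split that sends most attacks to the left child
left = ["attack"] * 4 + ["normal"] * 1
right = ["attack"] * 1 + ["normal"] * 4

print(f"Gini(parent)    = {gini(parent):.3f}")
print(f"Entropy(parent) = {entropy(parent):.3f}")
print(f"Information gain = {information_gain(parent, [left, right]):.3f}")
```

A perfectly mixed two-class node has Gini 0.5 and entropy 1.0; the split above lowers the weighted entropy to about 0.722, for a gain of roughly 0.278.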

In addition, a tree stops growing when:

  • All instances in a node belong to the same class. Growing until every node is pure can lead to overfitting.
  • The maximum depth of the tree (set in advance) is reached. This is a way to prevent overfitting.
  • The number of instances in a node falls below a certain threshold. This is also a way to prevent overfitting.
  • The information gain from further splits falls below a certain threshold. This is also a way to prevent overfitting.
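
In scikit-learn, these stopping criteria roughly correspond to constructor parameters of `DecisionTreeClassifier`. The sketch below uses synthetic data from `make_classification` as a stand-in, and the specific threshold values are arbitrary assumptions rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 1000 samples, 10 numeric features
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

clf = DecisionTreeClassifier(
    max_depth=5,                 # stop at a preset maximum depth
    min_samples_split=20,        # do not split nodes with fewer than 20 samples
    min_samples_leaf=10,         # every leaf must keep at least 10 samples
    min_impurity_decrease=1e-3,  # skip splits that reduce impurity by less than this
    random_state=42,
)
clf.fit(X, y)

print("depth :", clf.get_depth())
print("leaves:", clf.get_n_leaves())
```

Leaving all of these at their defaults lets the tree grow until its leaves are pure, which is the overfitting-prone case described in the first bullet.
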

<details>
<summary>Example -- Decision Tree for Intrusion Detection:</summary>
We'll train a decision tree on the NSL-KDD dataset to classify network connections as *normal* or *attack*. NSL-KDD is an improved version of the classic KDD Cup 1999 dataset, with features such as protocol type, service, duration, and number of failed logins, and a label indicating the attack type or "normal". We map all attack types to an "anomaly" class (binary classification: normal vs. anomaly). After training, we evaluate the tree's performance on the test set.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# 1. NSL-KDD column names (41 features + class + difficulty)
col_names = [
    "duration","protocol_type","service","flag","src_bytes","dst_bytes","land",
    "wrong_fragment","urgent","hot","num_failed_logins","logged_in","num_compromised",
    "root_shell","su_attempted","num_root","num_file_creations","num_shells",
    "num_access_files","num_outbound_cmds","is_host_login","is_guest_login","count",
    "srv_count","serror_rate","srv_serror_rate","rerror_rate","srv_rerror_rate",
    "same_srv_rate","diff_srv_rate","srv_diff_host_rate","dst_host_count",
    "dst_host_srv_count","dst_host_same_srv_rate","dst_host_diff_srv_rate",
    "dst_host_same_src_port_rate","dst_host_srv_diff_host_rate","dst_host_serror_rate",
    "dst_host_srv_serror_rate","dst_host_rerror_rate","dst_host_srv_rerror_rate",
    "class","difficulty_level"
]

# 2. Load data -> headerless CSV
train_url = "https://raw.githubusercontent.com/Mamcose/NSL-KDD-Network-Intrusion-Detection/master/NSL_KDD_Train.csv"
test_url  = "https://raw.githubusercontent.com/Mamcose/NSL-KDD-Network-Intrusion-Detection/master/NSL_KDD_Test.csv"

df_train = pd.read_csv(train_url, header=None, names=col_names)
df_test  = pd.read_csv(test_url,  header=None, names=col_names)

# 3. Encode the 3 nominal features (fit on train + test for consistent codes)
for col in ['protocol_type', 'service', 'flag']:
    le = LabelEncoder().fit(pd.concat([df_train[col], df_test[col]]))
    df_train[col] = le.transform(df_train[col])
    df_test[col]  = le.transform(df_test[col])

# 4. Prepare X / y (binary: 0 = normal, 1 = attack)
X_train = df_train.drop(columns=['class', 'difficulty_level'])
y_train = (df_train['class'].str.lower() != 'normal').astype(int)

X_test = df_test.drop(columns=['class', 'difficulty_level'])
y_test = (df_test['class'].str.lower() != 'normal').astype(int)

# 5. Train Decision Tree
clf = DecisionTreeClassifier(max_depth=10, random_state=42)
clf.fit(X_train, y_train)

# 6. Evaluate
y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1]

print(f"Accuracy : {accuracy_score(y_test, y_pred):.3f}")
print(f"Precision: {precision_score(y_test, y_pred):.3f}")
print(f"Recall   : {recall_score(y_test, y_pred):.3f}")
print(f"F1-score : {f1_score(y_test, y_pred):.3f}")
print(f"ROC AUC  : {roc_auc_score(y_test, y_prob):.3f}")

"""
Accuracy : 0.772
Precision: 0.967
Recall   : 0.621
F1-score : 0.756
ROC AUC  : 0.758
"""
```

In this decision tree example, we limited the tree depth to 10 to avoid extreme overfitting (the `max_depth=10` parameter). The metrics show how well the tree distinguishes normal traffic from attack traffic. High recall means most attacks are caught (important for an IDS), while high precision means few false alarms. In the reported run, precision is 0.967 but recall is only 0.621, so the tree raises few false alarms yet misses roughly 38% of the attacks in the test set. Decision trees often achieve decent accuracy on structured data, but a single tree may not reach the best possible performance. Nonetheless, the *interpretability* of the model is a big advantage. For example, we can examine the tree's splits to see which features (e.g., `service`, `src_bytes`, etc.) are most influential in flagging a connection as malicious.

</details>
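
As a sketch of how a tree's logic can be inspected, scikit-learn's `export_text` prints a fitted tree as nested if-then rules. The six rows below are made-up toy values for two NSL-KDD-style feature names, not real dataset records.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data (illustrative assumption): [num_failed_logins, src_bytes]
X = [
    [0, 100], [1, 200], [0, 150],     # normal connections
    [5, 3000], [6, 2500], [4, 4000],  # attack connections
]
y = [0, 0, 0, 1, 1, 1]  # 0 = normal, 1 = attack

clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X, y)

# Render the learned splits as human-readable rules
rules = export_text(clf, feature_names=["num_failed_logins", "src_bytes"])
print(rules)
```

Each indented line is a threshold test on one feature, and each `class:` line is a leaf, so an analyst can read a prediction path directly from the printout.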

### Random Forests

Random forest is an **ensemble learning** method that builds on decision trees to improve performance. A random forest trains multiple decision trees (hence "forest") and combines their outputs to make the final prediction (for classification, typically by majority vote). The two main ideas behind a random forest are **bagging** (bootstrap aggregating) and **feature randomness**:

-   **Bagging:** Each tree is trained on a random bootstrap sample of the training data (sampled with replacement). This introduces diversity among the trees.

-   **Feature randomness:** At each split in a tree, only a random subset of the features is considered for splitting (rather than all features). This further reduces the correlation between trees.

By averaging the results of many trees, a random forest reduces the variance that a single decision tree may have. Put simply, individual trees may overfit or be noisy, but when many diverse trees vote together, these errors are smoothed out. The result is often a model with **higher accuracy** and better generalization than a single decision tree. In addition, random forests can provide an estimate of feature importance by looking at how much each feature reduces impurity on average.
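
The impurity-based importance estimate is exposed in scikit-learn as `feature_importances_`. In this sketch the data is synthetic: with `shuffle=False`, `make_classification` places the two informative features first (columns `f0` and `f1`), and the remaining four columns are noise. The feature names are placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumption): 2 informative features followed by 4 noise features
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
names = [f"f{i}" for i in range(X.shape[1])]  # placeholder feature names

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = rf.feature_importances_  # mean impurity decrease, normalized to sum to 1

# Print features from most to least important
for i in np.argsort(importances)[::-1]:
    print(f"{names[i]}: {importances[i]:.3f}")
```

Impurity-based importances can be biased toward high-cardinality features, so they are best read as a rough ranking rather than an exact measure.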

Random forests have become a **workhorse in cybersecurity** for tasks such as intrusion detection, malware classification, and spam detection. They often work well out of the box with minimal tuning and can handle large feature sets. In intrusion detection, for example, a random forest can outperform a single decision tree by catching subtler attack patterns with fewer false positives. Studies have shown random forests performing favorably compared to other algorithms when classifying attacks in datasets such as NSL-KDD and UNSW-NB15.

#### **Key characteristics of Random Forests:**

-   **Type of problem:** Primarily classification (also used for regression). Very well suited to the high-dimensional structured data common in security logs.

-   **Interpretability:** Lower than a single decision tree -- you cannot easily visualize or explain hundreds of trees at once. However, feature importance scores give some insight into which attributes are most influential.

-   **Advantages:** Generally higher accuracy than single-tree models thanks to the ensemble effect. Relatively robust to overfitting -- even if individual trees overfit, the ensemble generalizes better. Can handle both numerical and categorical features and can cope with missing data to some extent, depending on the implementation. Also relatively robust to outliers.

-   **Limitations:** Model size can be large (many trees, each potentially deep). Predictions are slower than with a single tree (many trees must be aggregated). Less interpretable -- you can see which features matter, but the exact logic cannot easily be traced as a simple rule. If the dataset is extremely high-dimensional and sparse, training a very large forest can be computationally heavy.

-   **Training process:**
1. **Bootstrap sampling:** Randomly sample the training data with replacement to create multiple subsets (bootstrap samples).
2. **Tree construction:** For each bootstrap sample, build a decision tree using a random subset of the features at each split. This introduces diversity among the trees.
3. **Aggregation:** For classification tasks, the final prediction is the majority vote over the predictions of all trees. For regression tasks, the final prediction is the average of the predictions of all trees.
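
The three steps can be sketched by hand with plain decision trees. This is an illustrative approximation on synthetic data, not how `RandomForestClassifier` is implemented internally; the number of trees (25) and the dataset shape are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption)
X, y = make_classification(n_samples=600, n_features=10, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

rng = np.random.default_rng(0)
trees = []
for i in range(25):
    # 1. Bootstrap sampling: draw row indices with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # 2. Tree construction: consider a random feature subset at each split
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# 3. Aggregation: majority vote across trees (labels are 0/1, 25 trees so no ties)
votes = np.stack([t.predict(X_test) for t in trees])  # shape (n_trees, n_test)
y_pred = (votes.mean(axis=0) >= 0.5).astype(int)

forest_acc = (y_pred == y_test).mean()
print(f"Hand-rolled forest accuracy: {forest_acc:.3f}")
```

In practice `RandomForestClassifier` performs these steps for you and, in scikit-learn, averages the trees' class probabilities rather than counting hard votes.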

<details>
<summary>Example -- Random Forest for Intrusion Detection (NSL-KDD):</summary>
We'll use the same NSL-KDD dataset (binary labels: normal vs. anomaly) and train a Random Forest classifier. We expect the random forest to perform as well as or better than the single decision tree, because ensemble averaging reduces variance. We evaluate it with the same metrics.
```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

# ──────────────────────────────────────────────
# 1. LOAD DATA  ➜  files have **no header row**, so we
#                 pass `header=None` and give our own column names.
# ──────────────────────────────────────────────
col_names = [                       # 41 features + label + difficulty level
    "duration","protocol_type","service","flag","src_bytes","dst_bytes","land",
    "wrong_fragment","urgent","hot","num_failed_logins","logged_in",
    "num_compromised","root_shell","su_attempted","num_root","num_file_creations",
    "num_shells","num_access_files","num_outbound_cmds","is_host_login",
    "is_guest_login","count","srv_count","serror_rate","srv_serror_rate",
    "rerror_rate","srv_rerror_rate","same_srv_rate","diff_srv_rate",
    "srv_diff_host_rate","dst_host_count","dst_host_srv_count",
    "dst_host_same_srv_rate","dst_host_diff_srv_rate",
    "dst_host_same_src_port_rate","dst_host_srv_diff_host_rate",
    "dst_host_serror_rate","dst_host_srv_serror_rate","dst_host_rerror_rate",
    "dst_host_srv_rerror_rate","class","difficulty_level"
]

train_url = "https://raw.githubusercontent.com/Mamcose/NSL-KDD-Network-Intrusion-Detection/master/NSL_KDD_Train.csv"
test_url  = "https://raw.githubusercontent.com/Mamcose/NSL-KDD-Network-Intrusion-Detection/master/NSL_KDD_Test.csv"

df_train = pd.read_csv(train_url, header=None, names=col_names)
df_test  = pd.read_csv(test_url,  header=None, names=col_names)

# ──────────────────────────────────────────────
# 2. PRE-PROCESSING
# ──────────────────────────────────────────────
# 2-a) Encode the three categorical columns so that the model
#      receives integers instead of strings.
#      LabelEncoder assigns an int to each unique value in the column,
#      e.g. {'icmp':0, 'tcp':1, 'udp':2}. It is fitted on train+test together
#      so that categories seen only in the test set can still be encoded.
for col in ['protocol_type', 'service', 'flag']:
    le = LabelEncoder().fit(pd.concat([df_train[col], df_test[col]]))
    df_train[col] = le.transform(df_train[col])
    df_test[col]  = le.transform(df_test[col])

# 2-b) Build feature matrix X  (drop target & difficulty)
X_train = df_train.drop(columns=['class', 'difficulty_level'])
X_test  = df_test.drop(columns=['class', 'difficulty_level'])

# 2-c) Convert multi-class labels to binary
#      label 0 → 'normal' traffic, label 1 → any attack
y_train = (df_train['class'].str.lower() != 'normal').astype(int)
y_test  = (df_test['class'].str.lower() != 'normal').astype(int)

# ──────────────────────────────────────────────
# 3. MODEL: RANDOM FOREST
# ──────────────────────────────────────────────
# • n_estimators = 100 ➜ build 100 different decision trees.
# • max_depth=None  ➜ let each tree grow until pure leaves
#                    (or until it hits other stopping criteria).
# • random_state=42 ➜ reproducible randomness.
model = RandomForestClassifier(
    n_estimators=100,
    max_depth=None,
    random_state=42,
    bootstrap=True          # default: each tree is trained on a
                            # bootstrap sample the same size as
                            # the original training set.
    # max_samples           # ← you can set this (float or int) to
                            #     use a smaller % of samples per tree.
)

model.fit(X_train, y_train)

# ──────────────────────────────────────────────
# 4. EVALUATION
# ──────────────────────────────────────────────
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

print(f"Accuracy : {accuracy_score(y_test, y_pred):.3f}")
print(f"Precision: {precision_score(y_test, y_pred):.3f}")
print(f"Recall   : {recall_score(y_test, y_pred):.3f}")
print(f"F1-score : {f1_score(y_test, y_pred):.3f}")
print(f"ROC AUC  : {roc_auc_score(y_test, y_prob):.3f}")

"""
Accuracy:  0.770
Precision: 0.966
Recall:    0.618
F1-score:  0.754
ROC AUC:   0.962
"""
```

The random forest typically achieves strong results on this intrusion detection task. Compared to the single decision tree, we may observe an improvement in metrics such as F1 or AUC, and depending on the data particularly in recall or precision. In the run shown above, precision (0.966) and ROC AUC (0.962) are high, while recall (0.618) shows that a sizeable share of the attacks in the test set is still missed. This is consistent with the view that *"Random Forest (RF) is an ensemble classifier and performs well compared to other traditional classifiers for effective classification of attacks."* In a security operations context, a random forest model may flag attacks more reliably while reducing false alarms, thanks to the averaging of many decision rules. Feature importance from the forest can tell us which network features are most indicative of attacks (e.g., certain network services or unusual packet counts).

</details>

### Support Vector Machines (SVM)

Support Vector Machines are powerful supervised learning models used primarily for classification (and, as SVR, for regression). An SVM tries to find the **optimal separating hyperplane** that maximizes the margin between two classes. The position of this hyperplane is determined only by a subset of the training points: the "support vectors" closest to the boundary. By maximizing the margin (the distance between the support vectors and the hyperplane), SVMs tend to generalize well.

A key strength of SVMs is the ability to use **kernel functions** to handle non-linear relationships. The data can be implicitly transformed into a higher-dimensional feature space where a linear separator may exist. Common kernels include polynomial, radial basis function (RBF), and sigmoid. For example, if network traffic classes are not linearly separable in the raw feature space, an RBF kernel can map them into a higher dimension where the SVM finds a linear split (which corresponds to a non-linear boundary in the original space). The flexibility in choosing kernels lets SVMs tackle a wide variety of problems.

SVMs perform well in high-dimensional feature spaces (such as text data or malware opcode sequences) and are effective when the number of features is large relative to the number of samples. In the 2000s they were popular in many early cybersecurity applications such as malware classification and anomaly-based intrusion detection, often showing high accuracy.

However, SVMs do not scale easily to very large datasets: training complexity is super-linear in the number of samples, and memory usage can be high because many support vectors may need to be stored. In practice, for tasks like network intrusion detection with millions of records, an SVM may be too slow without careful subsampling or approximate methods.

#### **Key characteristics of SVM:**

-   **Type of Problem:** Classification (binary, or multi-class via one-vs-one/one-vs-rest) and regression variants. Often used for binary classification with a clear margin of separation.

-   **Interpretability:** Medium -- SVMs are not as interpretable as decision trees or logistic regression. You can identify which data points are support vectors and, in the linear-kernel case, get a sense of which features are influential from the weights, but in practice SVMs (especially with non-linear kernels) are treated as black-box classifiers.

-   **Advantages:** Effective in high-dimensional spaces; can model complex decision boundaries with the kernel trick; robust to overfitting when the margin is maximized (especially with a suitable regularization parameter C); works well even when the classes are not separated by a large distance (it finds the best compromise boundary).

-   **Limitations:** **Computationally intensive** for large datasets (both training and prediction scale poorly as the data grows). Requires careful tuning of kernel and regularization parameters (C, kernel type, gamma for RBF, etc.). Does not directly provide probabilistic outputs (although Platt scaling can be used to obtain probabilities). SVMs can also be sensitive to the choice of kernel parameters: a poor choice can lead to underfitting or overfitting.

*Use cases in cybersecurity:* SVMs have been used for **malware detection** (e.g., classifying files based on extracted features or opcode sequences), **network anomaly detection** (classifying traffic as normal vs. malicious), and **phishing detection** (using features of URLs). For instance, an SVM could take features of an email (counts of certain keywords, sender reputation scores, etc.) and classify it as phishing or legitimate. SVMs have also been applied to **intrusion detection** on feature sets like KDD, often achieving high accuracy at the cost of computation.

<details>
<summary>Example -- SVM for Phishing Website Classification:</summary>

We'll use the phishing website dataset again, this time with an SVM. Because SVMs can be slow, we would train on a subset of the data if needed; here the dataset has about 11,000 instances, which an SVM handles reasonably well. We'll use an RBF kernel, a common choice for non-linear data, and enable probability estimates so that we can compute ROC AUC.
```python
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

# ─────────────────────────────────────────────────────────────
# 1️⃣  LOAD DATASET   (OpenML id 4534: "PhishingWebsites")
#     • as_frame=True  ➜  returns a pandas DataFrame
# ─────────────────────────────────────────────────────────────
data = fetch_openml(data_id=4534, as_frame=True)   # or name="PhishingWebsites"
df   = data.frame
print(df.head())          # quick sanity-check

# ─────────────────────────────────────────────────────────────
# 2️⃣  TARGET: 0 = legitimate, 1 = phishing
#     The raw column has values {1, 0, -1}:
#       1        → legitimate → 0
#       0  & -1  → phishing   → 1
# ─────────────────────────────────────────────────────────────
y = (df["Result"].astype(int) != 1).astype(int)
X = df.drop(columns=["Result"])

# Train / test split  (stratified keeps class proportions)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y)

# ─────────────────────────────────────────────────────────────
# 3️⃣  PRE-PROCESS: Standardize features (mean-0 / std-1)
# ─────────────────────────────────────────────────────────────
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test  = scaler.transform(X_test)

# ─────────────────────────────────────────────────────────────
# 4️⃣  MODEL: RBF-kernel SVM
#     • C=1.0             (inverse regularization strength:
#                          larger C = less regularization)
#     • gamma='scale'     (1 / [n_features × var(X)])
#     • probability=True  → enable predict_proba for ROC-AUC
# ─────────────────────────────────────────────────────────────
clf = SVC(kernel="rbf", C=1.0, gamma="scale",
          probability=True, random_state=42)
clf.fit(X_train, y_train)

# ─────────────────────────────────────────────────────────────
# 5️⃣  EVALUATION
# ─────────────────────────────────────────────────────────────
y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1]   # P(class 1)

print(f"Accuracy : {accuracy_score(y_test, y_pred):.3f}")
print(f"Precision: {precision_score(y_test, y_pred):.3f}")
print(f"Recall   : {recall_score(y_test, y_pred):.3f}")
print(f"F1-score : {f1_score(y_test, y_pred):.3f}")
print(f"ROC AUC  : {roc_auc_score(y_test, y_prob):.3f}")

"""
Accuracy : 0.956
Precision: 0.963
Recall   : 0.937
F1-score : 0.950
ROC AUC  : 0.989
"""
```

The SVM model outputs metrics that we can compare with logistic regression on the same task. If the data is well separated by its features, the SVM can achieve high accuracy and AUC. Conversely, if the dataset has a lot of noise or overlapping classes, the SVM may not significantly outperform logistic regression. In practice, SVMs can give a boost when there are complex, non-linear relationships between features and class: the RBF kernel can capture curved decision boundaries that logistic regression would miss. As with all models, `C` (regularization) and the kernel parameters (such as `gamma` for RBF) need careful tuning to balance bias and variance.
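
To make the role of the RBF kernel and `gamma` more concrete, the sketch below evaluates the kernel K(x, z) = exp(-gamma * ||x - z||^2) by hand. The sample points and gamma values are made up for illustration, and `rbf_kernel` here is a local helper, not a library call.

```python
import math

def rbf_kernel(x, z, gamma):
    """K(x, z) = exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

a = [0.0, 0.0]
b = [1.0, 1.0]      # squared distance to a: 2
c = [3.0, 4.0]      # squared distance to a: 25

for gamma in (0.01, 0.5, 5.0):
    print(f"gamma={gamma:<4}  K(a,a)={rbf_kernel(a, a, gamma):.3f}  "
          f"K(a,b)={rbf_kernel(a, b, gamma):.3f}  K(a,c)={rbf_kernel(a, c, gamma):.3f}")
```

A point always has similarity 1 with itself, and similarity decays with distance. With a large `gamma` only very close points look similar (a tighter, more flexible boundary with a higher risk of overfitting); with a small `gamma` distant points still influence each other (a smoother boundary with a higher risk of underfitting).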

</details>

#### Difference between Logistic Regression and SVM

| Aspect | **Logistic Regression** | **Support Vector Machines** |
|---|---|---|
| **Objective function** | Minimizes **log-loss** (cross-entropy). | Maximizes the **margin** while minimizing **hinge loss**. |
| **Decision boundary** | Finds the **best-fit hyperplane** that models _P(y\|x)_. | Finds the **maximum-margin hyperplane** (largest gap to the closest points). |
| **Output** | **Probabilistic**: gives class probabilities via σ(w·x + b). | **Deterministic**: returns class labels; probabilities need extra work (e.g. Platt scaling). |
| **Regularization** | L2 (default) or L1; directly balances under- and overfitting. | The C parameter trades off margin width against misclassifications; kernel parameters add complexity. |
| **Kernels / non-linearity** | Native form is **linear**; non-linearity is added through feature engineering. | Built-in **kernel trick** (RBF, poly, etc.) lets it model complex boundaries in a high-dimensional space. |
| **Scalability** | Convex optimization costing roughly **O(nd)** per pass over the data; handles very large n well. | Training can take **O(n²) to O(n³)** memory/time without specialized solvers; less friendly to large n. |
| **Interpretability** | **High**: weights show feature influence; odds ratios are intuitive. | **Low** for non-linear kernels; support vectors are sparse but not easy to explain. |
| **Sensitivity to outliers** | Uses a smooth log-loss, so it is less sensitive. | Hinge loss with a hard margin **can be sensitive**; a soft margin (C) mitigates this. |
| **Typical use cases** | Credit scoring, medical risk, A/B testing: cases where **probabilities and explainability** matter. | Image/text classification, bioinformatics: cases where **complex boundaries** and **high-dimensional data** matter. |

* **If you need calibrated probabilities, interpretability, or have to work on huge datasets, choose Logistic Regression.**
* **If you need a flexible model that can capture non-linear relationships without manual feature engineering, choose SVM (with kernels).**
* Both models optimize convex objectives, so **a global minimum is guaranteed**, but SVM kernels add hyperparameters and computational cost.
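
The "objective function" row of the table can be made concrete by evaluating both losses for a single example as a function of its signed margin m = y * (w·x + b), with y in {-1, +1}. This is a small illustrative sketch; the margin values are arbitrary and the function names are local helpers.

```python
import math

def hinge_loss(margin):
    """SVM loss: exactly zero once the example is beyond the margin (m >= 1)."""
    return max(0.0, 1.0 - margin)

def log_loss(margin):
    """Logistic-regression loss: log(1 + exp(-m)), small but never zero."""
    return math.log1p(math.exp(-margin))

print(" margin   hinge   log-loss")
for m in (-2.0, 0.0, 0.5, 1.0, 3.0):
    print(f"{m:7.1f} {hinge_loss(m):7.3f} {log_loss(m):10.3f}")
```

Correctly classified points with a margin of at least 1 contribute nothing to the hinge loss, which is why only the support vectors determine the SVM solution. The log-loss keeps a small non-zero penalty for every point, so all samples influence the logistic regression weights.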

### Naive Bayes

Naive Bayes is a family of **probabilistic classifiers** based on Bayes' theorem combined with a strong independence assumption between features. Despite this "naive" assumption, Naive Bayes often works surprisingly well in certain applications, especially those involving text or categorical data, such as spam detection.

#### Bayes' Theorem

Bayes' theorem is the foundation of Naive Bayes classifiers. It relates the conditional and marginal probabilities of random events. The formula is:
```plaintext
P(A|B) = (P(B|A) * P(A)) / P(B)
```

Where:

-   `P(A|B)` is the posterior probability of class `A` given feature `B`.
-   `P(B|A)` is the likelihood of feature `B` given class `A`.
-   `P(A)` is the prior probability of class `A`.
-   `P(B)` is the marginal probability (evidence) of feature `B`.

For example, to classify whether a text was written by a child or an adult, we can use the words in the text as features. From some initial data, the Naive Bayes classifier precomputes the probability of each word appearing in each potential class (child or adult). Given a new text, it computes the probability of each class based on the words in the text and chooses the class with the highest probability.

As this example shows, the Naive Bayes classifier is very simple and fast, but it assumes that the features are independent, which is not always the case in real-world data.
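
The child/adult example can be sketched as a tiny Naive Bayes over word counts with Laplace (add-one) smoothing. The training sentences are invented for illustration and the function names are hypothetical; real applications would normally use a library implementation such as scikit-learn's `MultinomialNB`.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (text, label). Returns priors, per-class word counts, vocabulary."""
    class_docs = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for text, label in docs:
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    priors = {c: n / len(docs) for c, n in class_docs.items()}
    return priors, word_counts, vocab

def predict_nb(text, priors, word_counts, vocab):
    """Return (best_class, posteriors), working in log space for stability."""
    log_scores = {}
    for c, prior in priors.items():
        total = sum(word_counts[c].values())
        score = math.log(prior)                      # log P(class)
        for w in text.lower().split():
            if w not in vocab:
                continue                             # ignore words never seen in training
            # Laplace smoothing: (count + 1) / (total + |V|)
            score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        log_scores[c] = score
    # Bayes' rule: divide by the evidence so the posteriors sum to 1
    norm = sum(math.exp(s) for s in log_scores.values())
    posteriors = {c: math.exp(s) / norm for c, s in log_scores.items()}
    return max(posteriors, key=posteriors.get), posteriors

docs = [
    ("i want candy and toys", "child"),
    ("my toys are fun", "child"),
    ("the meeting about taxes is today", "adult"),
    ("taxes and mortgage payments are due", "adult"),
]
priors, word_counts, vocab = train_nb(docs)
label, post = predict_nb("toys and candy", priors, word_counts, vocab)
print(label, {c: round(p, 3) for c, p in post.items()})
```

Each word contributes its own factor P(word | class) independently of the others, which is exactly the "naive" assumption. Smoothing keeps a word that never appeared in one class from forcing that class's probability to zero.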

#### Types of Naive Bayes Classifiers

There are several types of Naive Bayes classifiers, depending on the type of data and the distribution of the features:

-   **Gaussian Naive Bayes:** Assumes the features follow a Gaussian (normal) distribution. Suitable for continuous data.
-   **Multinomial Naive Bayes:** Assumes the features follow a multinomial distribution. Suitable for discrete data such as word counts in text classification.
-   **Bernoulli Naive Bayes:** Assumes the features are binary (0 or 1). Suitable for binary data such as the presence or absence of words in text classification.
-   **Categorical Naive Bayes:** Assumes the features are categorical variables. Suitable for categorical data, such as classifying fruits by color and shape.

#### **Key characteristics of Naive Bayes:**

-   **Type of Problem:** Classification (binary or multi-class). In cybersecurity it is commonly used for text classification tasks (spam, phishing, etc.).

-   **Interpretability:** Medium -- it is not as directly interpretable as a decision tree, but the learned probabilities can be inspected (e.g., which words are most likely in spam vs. legitimate emails). The form of the model (the probability of each feature given the class) can be understood if needed.

-   **Advantages:** **Very fast** training and prediction, even on large datasets (linear in the number of instances × number of features). Needs a relatively small amount of data to estimate probabilities reliably, especially with proper smoothing. Often surprisingly accurate as a baseline, particularly when features contribute evidence to the class independently. Works well with high-dimensional data (e.g., thousands of features from text). Requires no complex tuning beyond setting a smoothing parameter.

-   **Limitations:** The independence assumption can limit accuracy when features are highly correlated. For example, in network data, features such as `src_bytes` and `dst_bytes` may be correlated, and Naive Bayes will not capture that interaction. As data size grows very large, more expressive models that learn feature dependencies (e.g., ensembles or neural networks) can surpass Naive Bayes. Also, if identifying an attack requires a specific combination of features (rather than individual features taken independently), Naive Bayes will struggle.

*Use cases in cybersecurity:* The classic use is **spam detection**. Naive Bayes was the core of early spam filters, which used the frequencies of certain tokens (words, phrases, IP addresses) to compute the probability that an email is spam. It is also used for **phishing email detection** and **URL classification**, where the presence of certain keywords or characteristics (e.g., "login.php" in a URL, or `@` in a URL path) contributes to the phishing probability. In malware analysis, one can imagine a Naive Bayes classifier that uses the presence of certain API calls or permissions in a piece of software to predict whether it is malware. More advanced algorithms often perform better, but Naive Bayes remains a good baseline thanks to its speed and simplicity.

<details>
<summary>Example -- Naive Bayes for Intrusion Detection (NSL-KDD):</summary>

To demonstrate Naive Bayes, we'll use Gaussian Naive Bayes on the NSL-KDD intrusion dataset (with binary labels). Gaussian NB assumes that each feature follows a normal distribution per class. This is a rough choice, because many network features are discrete or highly skewed, but it shows how to apply Naive Bayes to continuous feature data. We could also choose Bernoulli NB on a dataset of binary features (such as a set of triggered alerts), but we'll stick with NSL-KDD here for continuity.
```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# 1. Load NSL-KDD data
col_names = [                       # 41 features + label + difficulty level
    "duration","protocol_type","service","flag","src_bytes","dst_bytes","land",
    "wrong_fragment","urgent","hot","num_failed_logins","logged_in",
    "num_compromised","root_shell","su_attempted","num_root","num_file_creations",
    "num_shells","num_access_files","num_outbound_cmds","is_host_login",
    "is_guest_login","count","srv_count","serror_rate","srv_serror_rate",
    "rerror_rate","srv_rerror_rate","same_srv_rate","diff_srv_rate",
    "srv_diff_host_rate","dst_host_count","dst_host_srv_count",
    "dst_host_same_srv_rate","dst_host_diff_srv_rate",
    "dst_host_same_src_port_rate","dst_host_srv_diff_host_rate",
    "dst_host_serror_rate","dst_host_srv_serror_rate","dst_host_rerror_rate",
    "dst_host_srv_rerror_rate","class","difficulty_level"
]

train_url = "https://raw.githubusercontent.com/Mamcose/NSL-KDD-Network-Intrusion-Detection/master/NSL_KDD_Train.csv"
test_url  = "https://raw.githubusercontent.com/Mamcose/NSL-KDD-Network-Intrusion-Detection/master/NSL_KDD_Test.csv"

df_train = pd.read_csv(train_url, header=None, names=col_names)
df_test  = pd.read_csv(test_url,  header=None, names=col_names)

# 2. Preprocess (encode categorical features, prepare binary labels)
for col in ['protocol_type', 'service', 'flag']:
    le = LabelEncoder()
    le.fit(pd.concat([df_train[col], df_test[col]], axis=0))
    df_train[col] = le.transform(df_train[col])
    df_test[col]  = le.transform(df_test[col])
X_train = df_train.drop(columns=['class', 'difficulty_level'], errors='ignore')
y_train = df_train['class'].apply(lambda x: 0 if x.strip().lower() == 'normal' else 1)
X_test  = df_test.drop(columns=['class', 'difficulty_level'], errors='ignore')
y_test  = df_test['class'].apply(lambda x: 0 if x.strip().lower() == 'normal' else 1)

# 3. Train Gaussian Naive Bayes
model = GaussianNB()
model.fit(X_train, y_train)

# 4. Evaluate on test set
y_pred = model.predict(X_test)
# For ROC AUC, we need the probability of class 1:
y_prob = model.predict_proba(X_test)[:, 1]
print(f"Accuracy:  {accuracy_score(y_test, y_pred):.3f}")
print(f"Precision: {precision_score(y_test, y_pred):.3f}")
print(f"Recall:    {recall_score(y_test, y_pred):.3f}")
print(f"F1-score:  {f1_score(y_test, y_pred):.3f}")
print(f"ROC AUC:   {roc_auc_score(y_test, y_prob):.3f}")

"""
Accuracy:  0.450
Precision: 0.937
Recall:    0.037
F1-score:  0.071
ROC AUC:   0.867
"""
```

This code trains a Naive Bayes classifier to detect attacks. Conceptually, Naive Bayes estimates quantities such as `P(service=http | Attack)` and `P(service=http | Normal)` from the training data, assuming independence among features (Gaussian NB does this by fitting a per-class mean and variance for each feature). It then uses these probabilities to classify new connections as normal or attack based on the observed features. The performance of NB on NSL-KDD is not as high as that of more advanced models, because feature independence is violated and the Gaussian assumption is rough: in the output above, recall is only 0.037 at the default decision threshold, even though the ROC AUC of 0.867 shows that the scores still rank attacks above normal traffic fairly well. The main benefit is extreme speed. In scenarios such as real-time email filtering or initial triage of URLs, a Naive Bayes model can quickly flag obviously malicious cases with low resource usage.
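
The output above combines a fairly high ROC AUC with very low recall, which suggests that the default 0.5 decision threshold is a poor fit for these scores. A minimal sketch of how precision and recall move with the threshold follows; the scores and labels are made-up values, not taken from NSL-KDD, and `precision_recall_at` is a hypothetical helper.

```python
def precision_recall_at(threshold, scores, labels):
    """Precision and recall when predicting 'attack' for score >= threshold."""
    preds = [int(s >= threshold) for s in scores]
    tp = sum(p == 1 and t == 1 for p, t in zip(preds, labels))
    fp = sum(p == 1 and t == 0 for p, t in zip(preds, labels))
    fn = sum(p == 0 and t == 1 for p, t in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical P(attack) scores: attacks rank above normals but mostly below 0.5
scores = [0.02, 0.05, 0.08, 0.10, 0.15, 0.20, 0.30, 0.40, 0.45, 0.90]
labels = [0,    0,    0,    0,    1,    0,    1,    1,    1,    1]

for thr in (0.5, 0.25, 0.12):
    p, r = precision_recall_at(thr, scores, labels)
    print(f"threshold={thr:.2f}  precision={p:.2f}  recall={r:.2f}")
```

With scikit-learn models this corresponds to thresholding `predict_proba(X)[:, 1]` yourself instead of calling `predict`. Lowering the threshold trades some precision for recall, so the right value depends on how costly missed attacks are compared with false alarms.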

</details>

### k-Nearest Neighbors (k-NN)

k-Nearest Neighbors is one of the simplest machine learning algorithms. It is a **non-parametric, instance-based** method that makes predictions based on similarity to examples in the training set. The idea for classification is: to classify a new data point, find the **k** closest points in the training data (its "nearest neighbors") and assign the majority class among those neighbors. "Closeness" is defined by a distance metric, typically Euclidean distance for numeric data (other distances can be used for other types of features or problems).

k-NN requires *no explicit training*: the "training" phase just stores the dataset. All the work happens at query (prediction) time, when the algorithm must compute the distance from the query point to every training point to find the nearest ones. Prediction time is therefore **linear in the number of training samples**, which can be costly for large datasets. For this reason, k-NN is best suited to smaller datasets or to scenarios where you can trade memory and speed for simplicity.
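
The query-time procedure described above can be sketched in a few lines of brute-force Python. The data is made up for illustration and `knn_predict` is a hypothetical helper; library implementations such as scikit-learn's `KNeighborsClassifier` can also use tree-based indexes to speed up the search.

```python
import math
from collections import Counter

def knn_predict(query, X_train, y_train, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # "Training" was just storing X_train/y_train; all the work happens here:
    # one distance per stored sample, so the cost grows linearly with len(X_train).
    distances = [(math.dist(query, x), label) for x, label in zip(X_train, y_train)]
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0], nearest

# Toy flows: [connections_per_minute, error_rate]; values are made up
X_train = [[5, 0.01], [7, 0.02], [6, 0.00], [90, 0.80], [85, 0.95], [100, 0.70]]
y_train = ["normal", "normal", "normal", "attack", "attack", "attack"]

label, neighbors = knn_predict([88, 0.9], X_train, y_train, k=3)
print(label)
for dist, neighbor_label in neighbors:
    print(f"  neighbor at distance {dist:.2f} -> {neighbor_label}")
```

Returning the neighbors alongside the label gives an example-based explanation for free: an analyst can inspect exactly which stored events drove the decision.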

Despite its simplicity, k-NN can model very complex decision boundaries (in effect, the boundary can take any shape dictated by the distribution of the examples). It tends to work well when the decision boundary is very irregular and you have a lot of data, essentially letting the data "speak for itself". In high dimensions, however, distance metrics become less meaningful (the curse of dimensionality), and the method can struggle unless the number of samples is very large.

*Use cases in cybersecurity:* k-NN has been applied to anomaly detection. For example, an intrusion detection system might label a network event as malicious if most of its nearest neighbors (previous events) were malicious. If normal traffic forms clusters and attacks are outliers, a k-NN approach (with k=1 or a small k) essentially performs **nearest-neighbor anomaly detection**. k-NN has also been used to classify malware families from binary feature vectors: a new file can be assigned to a malware family if it is very close to known instances of that family. In practice, k-NN is less common than more scalable algorithms, but it is conceptually simple and is sometimes used as a baseline or for small-scale problems.

#### **Key characteristics of k-NN:**

-   **Type of Problem:** Classification (regression variants also exist). It is a *lazy learning* method: there is no explicit model fitting.

-   **Interpretability:** Low to medium -- there is no global model or concise explanation, but results can be interpreted by looking at the nearest neighbors that influenced a decision (e.g., "this network flow was classified as malicious because it is similar to these 3 known malicious flows"). Explanations can therefore be example-based.

-   **Advantages:** Very simple to implement and understand. Makes no assumptions about the data distribution (non-parametric). Handles multi-class problems naturally. It is **adaptive** in the sense that decision boundaries can be very complex.

-   **Limitations:** Prediction can be slow on large datasets (many distances must be computed). Memory-intensive: it stores all the training data. Performance degrades in high-dimensional feature spaces, where all points become nearly equidistant and the notion of "nearest" becomes less meaningful. *k* (the number of neighbors) must be chosen appropriately: a k that is too small is noisy, and a k that is too large can include irrelevant points from other classes. Features should also be scaled appropriately, because distance calculations are sensitive to scale.
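
The last limitation, sensitivity to feature scale, is easy to demonstrate. In this sketch (made-up values, hypothetical helper names) one feature is a byte count and the other is a rate between 0 and 1. For brevity the query is standardized together with the stored points; in practice the scaler would be fit on the training data only.

```python
import math

def standardize(rows):
    """Scale each column to mean 0 and standard deviation 1 (z-score)."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def nearest_index(query, points):
    """Index of the point with the smallest Euclidean distance to `query`."""
    return min(range(len(points)), key=lambda i: math.dist(query, points[i]))

# Columns: [src_bytes, serror_rate]; the last row is the query (illustrative values)
data = [
    [1000, 0.95],   # 0: attack-like (high error rate)
    [1300, 0.05],   # 1: normal
    [5000, 0.10],   # 2: normal
    [1250, 0.90],   # query: high error rate, byte count close to row 1
]

raw_nn = nearest_index(data[3], data[:3])
scaled = standardize(data)
scaled_nn = nearest_index(scaled[3], scaled[:3])
print("nearest neighbor without scaling:", raw_nn)
print("nearest neighbor with scaling   :", scaled_nn)
```

Without scaling, the byte count dominates the distance and the query is matched to the normal flow in row 1. After standardization the error rate carries comparable weight and the query is matched to the attack-like flow in row 0.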

<details>
<summary>Example -- k-NN for Intrusion Detection (NSL-KDD):</summary>

We'll use NSL-KDD again (binary classification). Because k-NN is computationally heavy, we'll use a subset of the training data to keep this demonstration tractable: we pick 20,000 training samples out of the full 125k and use k=5 neighbors. After training (which really just stores the data), we evaluate on the test set. We'll also scale the features for the distance calculation so that no single feature dominates because of its scale.
```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# 1. Load NSL-KDD and preprocess similarly
col_names = [                       # 41 features + 2 targets
"duration","protocol_type","service","flag","src_bytes","dst_bytes","land",
"wrong_fragment","urgent","hot","num_failed_logins","logged_in",
"num_compromised","root_shell","su_attempted","num_root","num_file_creations",
"num_shells","num_access_files","num_outbound_cmds","is_host_login",
"is_guest_login","count","srv_count","serror_rate","srv_serror_rate",
"rerror_rate","srv_rerror_rate","same_srv_rate","diff_srv_rate",
"srv_diff_host_rate","dst_host_count","dst_host_srv_count",
"dst_host_same_srv_rate","dst_host_diff_srv_rate",
"dst_host_same_src_port_rate","dst_host_srv_diff_host_rate",
"dst_host_serror_rate","dst_host_srv_serror_rate","dst_host_rerror_rate",
"dst_host_srv_rerror_rate","class","difficulty_level"
]

train_url = "https://raw.githubusercontent.com/Mamcose/NSL-KDD-Network-Intrusion-Detection/master/NSL_KDD_Train.csv"
test_url  = "https://raw.githubusercontent.com/Mamcose/NSL-KDD-Network-Intrusion-Detection/master/NSL_KDD_Test.csv"

df_train = pd.read_csv(train_url, header=None, names=col_names)
df_test  = pd.read_csv(test_url,  header=None, names=col_names)

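from sklearn.preprocessing import LabelEncoder

# Encode the categorical columns as integers (fit on train+test so every category is known).
# Note: integer codes impose an arbitrary order; one-hot encoding is often preferable
# for distance-based models, but label encoding keeps this demo short.
for col in ['protocol_type', 'service', 'flag']:
    le = LabelEncoder()
    le.fit(pd.concat([df_train[col], df_test[col]], axis=0))
    df_train[col] = le.transform(df_train[col])
    df_test[col]  = le.transform(df_test[col])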
X = df_train.drop(columns=['class', 'difficulty_level'], errors='ignore')
y = df_train['class'].apply(lambda x: 0 if x.strip().lower() == 'normal' else 1)
# Use a random subset of the training data for K-NN (to reduce computation)
X_train = X.sample(n=20000, random_state=42)
y_train = y[X_train.index]
# Use the full test set for evaluation
X_test = df_test.drop(columns=['class', 'difficulty_level'], errors='ignore')
y_test = df_test['class'].apply(lambda x: 0 if x.strip().lower() == 'normal' else 1)

# 2. Feature scaling for distance-based model
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test  = scaler.transform(X_test)

# 3. Train k-NN classifier (store data)
model = KNeighborsClassifier(n_neighbors=5, n_jobs=-1)
model.fit(X_train, y_train)

# 4. Evaluate on test set
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]
print(f"Accuracy:  {accuracy_score(y_test, y_pred):.3f}")
print(f"Precision: {precision_score(y_test, y_pred):.3f}")
print(f"Recall:    {recall_score(y_test, y_pred):.3f}")
print(f"F1-score:  {f1_score(y_test, y_pred):.3f}")
print(f"ROC AUC:   {roc_auc_score(y_test, y_prob):.3f}")

"""
Accuracy:  0.780
Precision: 0.972
Recall:    0.632
F1-score:  0.766
ROC AUC:   0.837
"""
```

The k-NN model classifies a connection by looking at its 5 closest connections in the training subset. If, for example, 4 of those neighbors are attacks (anomalies) and 1 is normal, the new connection is classified as an attack. Performance can be reasonable, though it is often not as high as a well-tuned Random Forest or SVM on the same data; here, precision is high (0.972) but recall is only 0.632, so many attacks in the test set are missed. However, k-NN can shine when the class distributions are very irregular and complex, since it effectively performs a memory-based lookup. In cybersecurity, k-NN (with k=1 or a small k) could be used to detect known attack patterns by example, or as a component of a more complex system (e.g., clustering first and then classifying by cluster membership).
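The choice of k=5 above is arbitrary. A hedged sketch of how k might instead be chosen by cross-validation, and how the neighbors behind one prediction can be retrieved for an example-based explanation, is shown below. It uses synthetic data from `make_classification` as a stand-in (an assumption, to avoid the download), and the grid of k values is illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a labelled flow dataset (assumption, not NSL-KDD)
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

# Scaling lives inside the pipeline so each CV fold is scaled on its own training part
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
k_grid = [1, 3, 5, 11, 21]
search = GridSearchCV(pipe, {"kneighborsclassifier__n_neighbors": k_grid},
                      cv=5, scoring="f1")
search.fit(X_train, y_train)
best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
print("best k:", best_k, "| test accuracy:", round(search.score(X_test, y_test), 3))

# Example-based explanation: which stored training rows voted on the first test row?
scaler = search.best_estimator_.named_steps["standardscaler"]
knn = search.best_estimator_.named_steps["kneighborsclassifier"]
dist, idx = knn.kneighbors(scaler.transform(X_test[:1]))
neighbor_labels = y_train[idx[0]]
print("neighbor labels:", neighbor_labels.tolist())
```

Keeping the scaler in the pipeline avoids leaking validation-fold statistics into training, which matters for any distance-based model.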

### Gradient Boosting Machines (e.g., XGBoost)

Gradient Boosting Machines are among the most powerful algorithms for structured data. **Gradient boosting** builds an ensemble of weak learners (often decision trees) sequentially, with each new model correcting the errors of the ensemble built so far. Unlike bagging (Random Forests), which builds trees in parallel and averages them, boosting builds trees one at a time, each focusing more on the instances that the previous trees mispredicted.

The most popular implementations in recent years are XGBoost, LightGBM, and CatBoost, all of which are gradient boosting decision tree (GBDT) libraries. They have been extremely successful in machine learning competitions and applications, often achieving state-of-the-art performance on tabular datasets. In cybersecurity, researchers and practitioners have used gradient boosted trees for tasks such as malware detection (using features extracted from files or runtime behavior) and network intrusion detection. For example, a gradient boosting model can combine many weak rules (trees), such as "many SYN packets and an unusual port -> likely a scan", into a strong composite detector that accounts for many subtle patterns.

Why are boosted trees so effective? Each tree in the sequence is trained on the residual errors of the current ensemble's predictions (more precisely, on the negative gradient of the loss, which for squared error is exactly the residual). This way, the model gradually **"boosts"** the areas where it is weak. Using decision trees as base learners lets the final model capture complex interactions and non-linear relationships. Boosting also has a built-in form of regularization: it adds many small trees, each scaled by a learning rate, so it usually generalizes well without severe overfitting, provided suitable parameters are chosen.
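The residual-fitting loop described above can be sketched in a few lines of pure Python. This is a simplified illustration under stated assumptions, not how XGBoost or scikit-learn implement it: it uses squared-error regression on a 1-D toy signal, depth-1 trees (stumps), and the hypothetical helper names `fit_stump` and `gradient_boost`. Real libraries use deeper trees, other losses (e.g., log-loss for classification), and extra regularization.

```python
import math

def fit_stump(xs, targets):
    """Best single-threshold split of 1-D inputs under squared error (a depth-1 tree)."""
    best = None
    for t in sorted(set(xs))[:-1]:          # skip the max so the right leaf is never empty
        left  = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        l_mean, r_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - l_mean) ** 2 for r in left)
               + sum((r - r_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, l_mean, r_mean)
    _, t, l_mean, r_mean = best
    return lambda x: l_mean if x <= t else r_mean

def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    base = sum(ys) / len(ys)                 # initial prediction: the mean target
    preds = [base] * len(xs)
    stumps, mse_history = [], []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is just the residual y - prediction
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)     # the new weak learner targets current errors
        stumps.append(stump)
        # Add a shrunken copy of the new learner to the ensemble
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
        mse_history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)
    return predict, mse_history

xs = [i / 10 for i in range(40)]
ys = [math.sin(x) for x in xs]
predict, mse_history = gradient_boost(xs, ys)
print(f"MSE after round 1:  {mse_history[0]:.4f}")
print(f"MSE after round 50: {mse_history[-1]:.4f}")
```

Each round only removes a fraction (`learning_rate`) of the error the new stump explains, which is why many small steps are needed and why the training error decreases gradually rather than in one jump.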

Key characteristics of Gradient Boosting:

-   **Type of Problem:** Primarily classification and regression. In security it is usually classification (e.g., binary classification of a connection or a file). It handles binary, multi-class (with an appropriate loss), and ranking problems.

-   **Interpretability:** Low to medium. A single boosted tree is small, but a full model may contain hundreds of trees, which is hard for a human to interpret as a whole. However, like Random Forest, it can provide feature importance scores, and tools such as SHAP (SHapley Additive exPlanations) can be used to interpret individual predictions to some extent.

-   **Advantages:** Often the best-performing algorithm for structured/tabular data. Can detect complex patterns and interactions. Offers many tuning knobs (number of trees, tree depth, learning rate, regularization terms) to control model complexity and prevent overfitting. Modern implementations are optimized for speed (e.g., XGBoost uses second-order gradient information and efficient data structures). Tends to handle imbalanced data better when combined with an appropriate loss function or adjusted sample weights.

-   **Limitations:** More complex to tune than simpler models; training can be slow if the trees are deep or numerous (though it is usually still faster than training a comparable deep neural network on the same data). The model can overfit if not tuned (e.g., too many deep trees without enough regularization). Because of the many hyperparameters, using gradient boosting effectively may require more expertise or experimentation. Also, like other tree-based methods, it does not handle very sparse, high-dimensional data as efficiently as linear models or Naive Bayes (it can still be applied, e.g., to text classification, but may not be the first choice without feature engineering).
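As a rough illustration of two points from the list above (sample weights for imbalanced data, and feature importance scores), the following sketch trains scikit-learn's `GradientBoostingClassifier` on synthetic data in which about 5% of rows belong to the positive class. The dataset and the hyperparameter values are assumptions chosen for the demo, not recommendations.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_sample_weight

# Synthetic, imbalanced stand-in for security data: roughly 5% "attack" rows (label 1)
X, y = make_classification(n_samples=3000, n_features=12, n_informative=6,
                           weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

# "balanced" gives each class the same total weight, so rare attacks count more per row
weights = compute_sample_weight(class_weight="balanced", y=y_train)

model = GradientBoostingClassifier(
    n_estimators=100,     # number of trees
    learning_rate=0.1,    # shrinkage applied to each tree's contribution
    max_depth=3,          # depth of each weak learner
    subsample=0.8,        # row subsampling per tree (stochastic gradient boosting)
    random_state=42,
)
model.fit(X_train, y_train, sample_weight=weights)

print(f"Recall on the rare class: {recall_score(y_test, model.predict(X_test)):.3f}")

# Global view of the model: which features the trees relied on most
top = np.argsort(model.feature_importances_)[::-1][:3]
print("Top features by importance:", top.tolist())
```

Whether reweighting helps depends on the cost of false positives versus missed attacks, so it is worth comparing precision and recall with and without the weights.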

Tip

Use cases in cybersecurity: Almost anywhere a decision tree or random forest could be used, a gradient boosting model may achieve better accuracy. For example, Microsoft's malware detection competitions have seen heavy use of XGBoost on features engineered from binary files. Network intrusion detection research often reports top results with GBDTs (e.g., XGBoost on the CIC-IDS2017 or UNSW-NB15 datasets). These models can take in a wide range of features (protocol types, frequencies of certain events, statistical features of traffic, etc.) and combine them to detect threats. In phishing detection, gradient boosting can combine lexical features of URLs, domain reputation features, and page content features to achieve very high accuracy. The ensemble approach helps cover many corner cases and subtleties in the data.

Example -- XGBoost for Phishing Detection:

We'll use a gradient boosting classifier on the phishing dataset. To keep things simple and self-contained, we'll use `sklearn.ensemble.GradientBoostingClassifier` (a slower but straightforward implementation). In practice, one would typically use the `xgboost` or `lightgbm` libraries for better performance and additional features. We'll train the model and evaluate it in the same way as before.
```python
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# 1. Load the "Phishing Websites" data directly from OpenML
data = fetch_openml(data_id=4534, as_frame=True)   # or name="PhishingWebsites"
df   = data.frame

# 2. Separate features/target & make sure everything is numeric
X = df.drop(columns=["Result"])
y = df["Result"].astype(int).apply(lambda v: 1 if v == 1 else 0)   # map {-1,1} → {0,1}

# (If any column is still object-typed, coerce it to numeric.)
X = X.apply(pd.to_numeric, errors="coerce").fillna(0)

# 3. Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X.values, y, test_size=0.20, random_state=42
)

# 4. Gradient Boosting model
model = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42
)
model.fit(X_train, y_train)

# 5. Evaluation (the positive class, 1, is the set of rows with Result == 1)
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

print(f"Accuracy:  {accuracy_score(y_test, y_pred):.3f}")
print(f"Precision: {precision_score(y_test, y_pred):.3f}")
print(f"Recall:    {recall_score(y_test, y_pred):.3f}")
print(f"F1-score:  {f1_score(y_test, y_pred):.3f}")
print(f"ROC AUC:   {roc_auc_score(y_test, y_prob):.3f}")

"""
Accuracy:  0.951
Precision: 0.949
Recall:    0.965
F1-score:  0.957
ROC AUC:   0.990
"""
```

๊ทธ๋ž˜๋””์–ธํŠธ ๋ถ€์ŠคํŒ… ๋ชจ๋ธ์€ ์ด ํ”ผ์‹ฑ ๋ฐ์ดํ„ฐ์…‹์—์„œ ๋งค์šฐ ๋†’์€ ์ •ํ™•๋„์™€ AUC๋ฅผ ๋‹ฌ์„ฑํ•  ๊ฐ€๋Šฅ์„ฑ์ด ๋†’์Šต๋‹ˆ๋‹ค(๋ฌธํ—Œ์—์„œ ๋ณผ ์ˆ˜ ์žˆ๋“ฏ์ด, ์ด๋Ÿฌํ•œ ๋ฐ์ดํ„ฐ์—์„œ ์ ์ ˆํ•œ ์กฐ์ •์„ ํ†ตํ•ด ์ด๋Ÿฌํ•œ ๋ชจ๋ธ์€ ์ข…์ข… 95% ์ด์ƒ์˜ ์ •ํ™•๋„๋ฅผ ์ดˆ๊ณผํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋Š” GBDT๊ฐ€ *"ํ‘œ ํ˜• ๋ฐ์ดํ„ฐ์…‹์— ๋Œ€ํ•œ ์ตœ์ฒจ๋‹จ ๋ชจ๋ธ"*๋กœ ๊ฐ„์ฃผ๋˜๋Š” ์ด์œ ๋ฅผ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค -- ์ด๋“ค์€ ์ข…์ข… ๋ณต์žกํ•œ ํŒจํ„ด์„ ํฌ์ฐฉํ•˜์—ฌ ๋” ๊ฐ„๋‹จํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜๋ณด๋‹ค ๋” ๋‚˜์€ ์„ฑ๋Šฅ์„ ๋ฐœํœ˜ํ•ฉ๋‹ˆ๋‹ค. ์‚ฌ์ด๋ฒ„ ๋ณด์•ˆ ๋งฅ๋ฝ์—์„œ ์ด๋Š” ๋” ์ ์€ ์‹ค์ˆ˜๋กœ ๋” ๋งŽ์€ ํ”ผ์‹ฑ ์‚ฌ์ดํŠธ๋‚˜ ๊ณต๊ฒฉ์„ ์žก๋Š” ๊ฒƒ์„ ์˜๋ฏธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋ฌผ๋ก , ๊ณผ์ ํ•ฉ์— ์ฃผ์˜ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค -- ์šฐ๋ฆฌ๋Š” ์ผ๋ฐ˜์ ์œผ๋กœ ๊ต์ฐจ ๊ฒ€์ฆ๊ณผ ๊ฐ™์€ ๊ธฐ์ˆ ์„ ์‚ฌ์šฉํ•˜๊ณ  ๋ฐฐํฌ๋ฅผ ์œ„ํ•œ ๋ชจ๋ธ ๊ฐœ๋ฐœ ์‹œ ๊ฒ€์ฆ ์„ธํŠธ์—์„œ ์„ฑ๋Šฅ์„ ๋ชจ๋‹ˆํ„ฐ๋งํ•ฉ๋‹ˆ๋‹ค.

</details>

### Combining Models: Ensemble Learning and Stacking

Ensemble learning is a strategy of **combining multiple models** to improve overall performance. We have already seen two specific ensemble methods: Random Forest (an ensemble of trees via bagging) and Gradient Boosting (an ensemble of trees via sequential boosting). Ensembles can also be built in other ways, such as **voting ensembles** or **stacked generalization (stacking)**. The main idea is that different models may capture different patterns or have different weaknesses; by combining them, we can **compensate for each model's errors with another model's strengths**.

-   **Voting Ensemble:** In a simple voting classifier, we train several diverse models (e.g., logistic regression, a decision tree, and an SVM) and let them vote on the final prediction (majority vote for classification). If the votes are weighted (e.g., a higher weight for more accurate models), it is a weighted voting scheme. This tends to improve performance when the individual models are reasonably good and independent -- the ensemble reduces the risk of any single model's mistake, because the other models can correct it. It is like having a panel of experts rather than a single opinion.

-   **Stacking (Stacked Ensemble):** Stacking goes one step further. Instead of a simple vote, it trains a **meta-model** to **learn how best to combine the base models' predictions**. For example, after training 3 different classifiers (base learners), their outputs (or probabilities) are fed as features into a meta-classifier (often a simple model such as logistic regression), which learns the best way to blend them. The meta-model is trained on a validation set or via cross-validation to avoid overfitting. Stacking can often outperform simple voting by learning *which models to trust more in which situations*. In cybersecurity, one model might be better at catching network scans while another is better at catching malware beaconing; a stacking model can learn to rely on each appropriately.
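The difference between majority voting and weighted voting can be seen in a small pure-Python sketch. The three detector scores and the weights below are invented for illustration, and `hard_vote` and `weighted_soft_vote` are hypothetical helper names.

```python
from collections import Counter

def hard_vote(labels):
    """Majority vote over the models' predicted class labels."""
    return Counter(labels).most_common(1)[0][0]

def weighted_soft_vote(probs, weights, threshold=0.5):
    """Weighted average of the models' P(malicious), then a single threshold."""
    score = sum(w * p for p, w in zip(probs, weights)) / sum(weights)
    return int(score >= threshold), score

# Hypothetical P(malicious) from three detectors for one event
probs = [0.45, 0.40, 0.95]            # two are mildly unsure, one is very confident
labels = [int(p >= 0.5) for p in probs]

majority = hard_vote(labels)                                  # two "benign" votes win
label, score = weighted_soft_vote(probs, weights=[1, 1, 2])   # third model trusted more

print("hard vote:", majority)
print("weighted soft vote:", label, f"(score={score:.4f})")
```

Hard voting discards how confident each model is, while soft voting lets one very confident (or more trusted) model outweigh two lukewarm ones. In scikit-learn, the corresponding options are the `voting="hard"` / `voting="soft"` and `weights` parameters of `VotingClassifier`.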

Ensembles, whether built by voting or by stacking, tend to **improve accuracy** and robustness. The downside is increased complexity and sometimes reduced interpretability (although some ensemble approaches, such as an average of decision trees, can still provide some insight, e.g., feature importance). In practice, if operational constraints allow, using an ensemble can lead to higher detection rates. Many winning solutions in cybersecurity challenges (and Kaggle competitions in general) use ensemble techniques to squeeze out the last bit of performance.

<details>
<summary>Example -- Stacking Ensemble for Phishing Detection:</summary>
To illustrate model stacking, let's combine a few of the models discussed above on the phishing dataset. We'll use logistic regression, a decision tree, and k-NN as base learners, and a Random Forest as the meta-learner that aggregates their predictions. The meta-learner is trained on the base learners' outputs (obtained via cross-validation on the training set). We expect the stacked model to perform as well as, or slightly better than, the individual models.
```python
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.metrics import (accuracy_score, precision_score,
recall_score, f1_score, roc_auc_score)

# ──────────────────────────────────────────────
# 1.  LOAD DATASET (OpenML id 4534)
# ──────────────────────────────────────────────
data = fetch_openml(data_id=4534, as_frame=True)     # "PhishingWebsites"
df   = data.frame

# Target mapping:  1 → legitimate (0),   0/-1 → phishing (1)
y = (df["Result"].astype(int) != 1).astype(int)
X = df.drop(columns=["Result"])

# Train / test split (stratified to keep class balance)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y)

# ──────────────────────────────────────────────
# 2.  DEFINE BASE LEARNERS
#     • LogisticRegression and k-NN need scaling → wrap them
#       in a Pipeline(StandardScaler → model) so that scaling
#       happens inside each CV fold of StackingClassifier.
# ──────────────────────────────────────────────
base_learners = [
    ('lr',  make_pipeline(StandardScaler(),
                          LogisticRegression(max_iter=1000,
                                             solver='lbfgs',
                                             random_state=42))),
    ('dt',  DecisionTreeClassifier(max_depth=5, random_state=42)),
    ('knn', make_pipeline(StandardScaler(),
                          KNeighborsClassifier(n_neighbors=5)))
]

# Meta-learner (level-2 model)
meta_learner = RandomForestClassifier(n_estimators=50, random_state=42)

stack_model = StackingClassifier(
    estimators      = base_learners,
    final_estimator = meta_learner,
    cv              = 5,        # 5-fold CV to create meta-features
    passthrough     = False     # only base learners' predictions go to meta-learner
)

# ──────────────────────────────────────────────
# 3.  TRAIN ENSEMBLE
# ──────────────────────────────────────────────
stack_model.fit(X_train, y_train)

# ──────────────────────────────────────────────
# 4.  EVALUATE
# ──────────────────────────────────────────────
y_pred = stack_model.predict(X_test)
y_prob = stack_model.predict_proba(X_test)[:, 1]   # P(phishing)

print(f"Accuracy : {accuracy_score(y_test, y_pred):.3f}")
print(f"Precision: {precision_score(y_test, y_pred):.3f}")
print(f"Recall   : {recall_score(y_test, y_pred):.3f}")
print(f"F1-score : {f1_score(y_test, y_pred):.3f}")
print(f"ROC AUC  : {roc_auc_score(y_test, y_prob):.3f}")

"""
Accuracy : 0.954
Precision: 0.951
Recall   : 0.946
F1-score : 0.948
ROC AUC  : 0.992
"""
```

The stacked ensemble takes advantage of the complementary strengths of the base models. For example, logistic regression can handle the linear aspects of the data, the decision tree can capture specific rule-like interactions, and k-NN can excel in local neighborhoods of the feature space. The meta-model (a Random Forest here) learns how to weigh these inputs. The resulting metrics often show an improvement, even if slight, over those of any single model. In the phishing example, if logistic regression alone reached an F1 score of, say, 0.95 and the decision tree 0.94, the stack might reach 0.96 by covering the cases where each individual model errs.

Ensemble methods like this demonstrate the principle that *"combining multiple models typically leads to better generalization"*. In cybersecurity, this can be implemented by running several detection engines (one rule-based, one machine-learning-based, one anomaly-based) and adding a layer that aggregates their alerts, which is effectively a form of ensemble, to reach a final decision with higher confidence. When deploying such systems, the added complexity must be taken into account so that the ensemble does not become too hard to manage or explain. From an accuracy standpoint, however, ensembles and stacking are powerful tools for improving model performance.

</details>
