Machine Learning/Case Study 👩🏻‍💻

[🦀 Crab Age Prediction (5)] Feature Engineering to Improve Model Performance

ISLA! 2023. 9. 25. 23:08

 

🚀 Feature Engineering

In post (4), I compared the results of the first round of baseline models.

The analyst who carried out that analysis reports that creating meaningful derived variables through feature engineering and then refitting the models gave better results. This post goes over the kinds of derived variables worth considering during feature engineering and walks through how they are built.


🧑‍💻 Ratio Features

  • This case is about predicting a crab's 'age' from its physical characteristics, such as shell weight and body length.
  • Starting from the variables provided, a derived variable can be created by dividing one variable by another, typically a part by the whole, to obtain a 'ratio'.
  • These are called ratio features, and they make it easier to compare the relative sizes of different components.
  • Examples:
    • Viscera Ratio = Viscera Weight / Total Weight
    • Shell Ratio = Shell Weight / Total Weight
    • Shell-to-Body Ratio = Shell Weight / (Total Weight + Shell Weight)
    • Meat Yield = Shucked Weight / (Total Weight + Shell Weight)
    • Length-to-Diameter Ratio = Length / Diameter
    • Weight-to-VisceraWeight Ratio = Total Weight / Viscera Weight
    • Weight-to-ShellWeight Ratio = Total Weight / Shell Weight
    • Weight-to-ShuckedWeight Ratio = Total Weight / Shucked Weight
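
To make the list above concrete, here is a minimal sketch in plain Python that computes the same ratios for a single record. The sample values and the `safe_div` helper are my own assumptions for illustration, and the dictionary keys simply mirror the names used in the formulas; the column names in the actual competition data may differ.

```python
# Hypothetical sample record; the values are made up for illustration only.
crab = {
    "Total Weight": 32.0,
    "Shucked Weight": 16.0,
    "Viscera Weight": 4.0,
    "Shell Weight": 8.0,
    "Length": 1.5,
    "Diameter": 1.2,
}


def safe_div(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def ratio_features(row):
    """Compute the ratio features from the list above for one record."""
    total = row["Total Weight"]
    shell = row["Shell Weight"]
    return {
        "Viscera Ratio": safe_div(row["Viscera Weight"], total),
        "Shell Ratio": safe_div(shell, total),
        "Shell-to-Body Ratio": safe_div(shell, total + shell),
        "Meat Yield": safe_div(row["Shucked Weight"], total + shell),
        "Length-to-Diameter Ratio": safe_div(row["Length"], row["Diameter"]),
        "Weight-to-VisceraWeight Ratio": safe_div(total, row["Viscera Weight"]),
        "Weight-to-ShellWeight Ratio": safe_div(total, shell),
        "Weight-to-ShuckedWeight Ratio": safe_div(total, row["Shucked Weight"]),
    }


ratios = ratio_features(crab)
for name, value in ratios.items():
    print(f"{name}: {value:.3f}")
```

Returning None for a zero denominator is just one possible choice; it keeps a bad row visible instead of raising an error halfway through a dataset.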

 

🧑‍💻 Geometric Features

  • Geometric features help in understanding the 'physical' characteristics of the crabs in this study. Whether generating them helps depends on the dataset.
  • Typical examples are density, a BMI-style index, surface area, and volume.
  • Examples:
    • Surface Area = 2 * (Length * Diameter + Length * Height + Diameter * Height)
    • Volume = Length * Diameter * Height
    • Density = Total Weight / Volume
    • Pseudo BMI = Total Weight / (Height ^ 2)
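
The formulas above effectively treat the crab as a Length x Diameter x Height box. A small sketch of that computation follows; the function name, the sample measurements, and the guards against zero height or volume are my own assumptions, not something taken from the original analysis.

```python
def geometric_features(length, diameter, height, total_weight):
    """Box approximation: treat the crab as a length x diameter x height cuboid."""
    surface_area = 2 * (length * diameter + length * height + diameter * height)
    volume = length * diameter * height
    # Guard against zero height/volume, which is assumed to be possible in raw data.
    density = total_weight / volume if volume > 0 else None
    pseudo_bmi = total_weight / height ** 2 if height > 0 else None
    return {
        "Surface Area": surface_area,
        "Volume": volume,
        "Density": density,
        "Pseudo BMI": pseudo_bmi,
    }


# Made-up measurements, for illustration only.
geo = geometric_features(length=2.0, diameter=1.5, height=0.5, total_weight=30.0)
print(geo)
```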

 

 

🧑‍💻 Polynomial Features

  • Polynomial features are created by a preprocessing step that expands the given data into polynomial terms.
  • They are especially helpful for letting linear models capture nonlinear relationships between variables.
  • Expanding the dimensionality of the data helps the model learn more complex relationships.
  • They are usually built from powers of the input variables.
  • They are mainly used in polynomial regression and in classification problems that use polynomial terms.
  • Examples (read as equations these are trivially true; the left-hand side is just the name of the new feature):
    • Length^2 = Length ^ 2
    • Diameter^2 = Diameter ^ 2
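
As a rough sketch of the power terms above, the helper below adds `col^k` features for a chosen set of columns. `power_features` is a hypothetical name and the sample values are made up; it only generates pure powers, not interaction terms such as Length * Diameter.

```python
def power_features(row, columns, degree=2):
    """Return {"<col>^k": value ** k} for each listed column and k = 2..degree."""
    return {
        f"{col}^{k}": row[col] ** k
        for col in columns
        for k in range(2, degree + 1)
    }


# Made-up record, for illustration only.
crab = {"Length": 1.5, "Diameter": 1.2}

poly = power_features(crab, ["Length", "Diameter"])
print(poly)  # keys: "Length^2", "Diameter^2"
```

A linear model fitted on both Length and Length^2 can then represent a curved relationship between length and age, because the curvature is carried by the extra column rather than by the model itself.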

 

🧑‍💻 Logarithmic Transformations

  • A log transformation helps reduce skewness and pull in extreme values.
  • It is used for correcting skewness, stabilizing variance (reducing the variability of the data), and adjusting relative scale (when the data span a wide range of magnitudes).
  • It is mainly applied to data with positive values, using a logarithm (base 10 or natural) to transform the data.
  • Example:
    • Log Weight = log(Total Weight + 1)
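
The sketch below applies log(Total Weight + 1) to a small, deliberately right-skewed sample and compares skewness before and after. The weights are made up, and the `skewness` helper (population skewness) is my own illustration rather than part of the original analysis; the natural log is assumed.

```python
import math


def skewness(values):
    """Population skewness: mean cubed deviation divided by std ** 3."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    third_moment = sum((v - mean) ** 3 for v in values) / n
    return third_moment / variance ** 1.5


# Made-up, right-skewed weights (one very heavy crab); not real data.
total_weight = [1.0, 2.0, 3.0, 4.0, 50.0]

# log(Total Weight + 1); log1p stays defined when a weight is exactly 0.
log_weight = [math.log1p(w) for w in total_weight]

raw_skew = skewness(total_weight)
log_skew = skewness(log_weight)
print(f"skewness before: {raw_skew:.2f}, after: {log_skew:.2f}")
```

For this sample the transformed values are less skewed than the raw ones, because the log compresses the gap between the single heavy crab and the rest.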

 

🧑‍💻 Binned Features

  • Binning splits the values of an input variable into categories, with the aim of better understanding its relationship with the target variable.
  • The values of the input variable are discretized into intervals. Afterwards, per-bin statistics (mean, median, standard deviation) are sometimes computed and used as the representative value of each bin.
  • Visualizing the relationship between the bins and the target variable, or analyzing it statistically, makes that relationship easier to see.
  • Binning reduces the complexity of the data and makes the model easier to interpret, and is particularly useful for some machine learning models such as linear and tree-based models.
  • However, model performance and interpretation can change depending on how the intervals are chosen, so the choice should be made carefully.
  • Example:
    • Length Bins = Binned version of Length (e.g., quartiles)
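
Here is one way quartile binning could look using only the standard library. The lengths and ages are made-up values, and `to_bin` is a hypothetical helper; `statistics.quantiles` is used with its default settings to get the three quartile boundaries.

```python
import bisect
import statistics
from collections import defaultdict

# Made-up lengths and ages, for illustration only.
lengths = [0.8, 1.0, 1.1, 1.3, 1.4, 1.5, 1.6, 1.8]
ages = [4, 6, 7, 9, 10, 10, 11, 13]

# Three cut points that split the lengths into four quartile bins.
cut_points = statistics.quantiles(lengths, n=4)


def to_bin(value):
    """Map a length to a bin index 0..3 based on the quartile cut points."""
    return bisect.bisect_right(cut_points, value)


length_bins = [to_bin(v) for v in lengths]

# Per-bin statistic of the target: mean age in each length bin.
ages_by_bin = defaultdict(list)
for bin_index, age in zip(length_bins, ages):
    ages_by_bin[bin_index].append(age)

mean_age_by_bin = {b: statistics.mean(v) for b, v in sorted(ages_by_bin.items())}
print(length_bins)
print(mean_age_by_bin)
```

If per-bin target statistics like these are fed back into a model as features, the cut points and the statistics should be computed on the training split only, otherwise information from the validation data leaks into the features.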

 

🧑‍💻 Note: Creating New Derived Variables Tailored to the Data

  • Here new variables are created after understanding the characteristics of each variable in the dataset.
  • In this example, a body condition index for the crab and a weight that excludes the viscera were created (the Derived Weight Features below).
    • Weight_wo_Viscera = Shucked Weight - Viscera Weight
    • Body Condition Index = sqrt(Length * Total Weight * Shucked Weight)
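
The two derived features above are simple enough to sketch directly. The function name and the sample record are assumptions for illustration; the formulas are the ones listed above.

```python
import math


def derived_weight_features(row):
    """Compute the two domain-specific features listed above for one record."""
    return {
        "Weight_wo_Viscera": row["Shucked Weight"] - row["Viscera Weight"],
        "Body Condition Index": math.sqrt(
            row["Length"] * row["Total Weight"] * row["Shucked Weight"]
        ),
    }


# Made-up record, for illustration only.
crab = {
    "Length": 2.0,
    "Total Weight": 32.0,
    "Shucked Weight": 16.0,
    "Viscera Weight": 4.0,
}

derived = derived_weight_features(crab)
print(derived)
```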

The analyst reports that creating these new variables did in fact lead to better modeling results.

Good feature engineering requires a solid understanding of the dataset and its background, and working through this made me feel the need to do feature engineering more actively!

 

 

Study source

https://www.kaggle.com/competitions/playground-series-s3e16/discussion/415721
