๋ณธ๋ฌธ ๋ฐ”๋กœ๊ฐ€๊ธฐ
Machine Learning/Case Study ๐Ÿ‘ฉ๐Ÿป‍๐Ÿ’ป

[๐Ÿฆ€ ๊ฒŒ ๋‚˜์ด ์˜ˆ์ธก(2)] Baseline Modeling(Gradient Boosting)

by ISLA! 2023. 9. 24.

 

๐Ÿš€ Gradient Boosting ์ด๋ž€?

  • ๋จธ์‹  ๋Ÿฌ๋‹์—์„œ ์‚ฌ์šฉ๋˜๋Š” ๊ฐ•๋ ฅํ•œ ์•™์ƒ๋ธ” ํ•™์Šต ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์ค‘ ํ•˜๋‚˜๋กœ, ํšŒ๊ท€ & ๋ถ„๋ฅ˜ ๋ฌธ์ œ ๋ชจ๋‘ ์ ์šฉ ๊ฐ€๋Šฅํ•˜๋ฉฐ ๋†’์€ ์˜ˆ์ธก ์„ฑ๋Šฅ์„ ์ œ๊ณต
  • ์•™์ƒ๋ธ” ํ•™์Šต : ์—ฌ๋Ÿฌ ๊ฐœ์˜ ์•ฝํ•œ ๋ชจ๋ธ์„ ๊ฒฐํ•ฉํ•˜์—ฌ ํ•˜๋‚˜์˜ ๊ฐ•๋ ฅํ•œ ๋ชจ๋ธ์„ ๋งŒ๋“œ๋Š” ๊ธฐ๋ฒ•
  • ๋ถ€์ŠคํŒ… : ๋ถ€์ŠคํŒ…(์ด์ „ ๋ชจ๋ธ์˜ ์˜ค๋ฅ˜๋ฅผ ๋ณด์™„ํ•˜๋Š” ์ƒˆ๋กœ์šด ๋ชจ๋ธ์„ ํ•™์Šตํ•˜๋Š” ๋ฐฉ์‹์œผ๋กœ ์ž‘๋™) ๊ธฐ๋ฒ• ์ค‘ ํ•˜๋‚˜๋กœ, ์ˆœ์ฐจ์ ์œผ๋กœ ๋ชจ๋ธ์„ ์ถ”๊ฐ€ํ•˜๋ฉฐ ์ด์ „ ๋ชจ๋ธ์ด ์ž˜๋ชป ์˜ˆ์ธกํ•œ ์ƒ˜ํ”Œ์— ๊ฐ€์ค‘์น˜๋ฅผ ๋ถ€์—ฌํ•˜์—ฌ ๋ชจ๋ธ์˜ ์˜ค๋ฅ˜๋ฅผ ๋ณด์™„ํ•จ
  • ๊ทธ๋ผ๋””์–ธํŠธ : ๊ทธ๋ผ๋””์–ธํŠธ(๊ธฐ์šธ๊ธฐ)๋ฅผ ํ™œ์šฉํ•˜์—ฌ ๋ชจ๋ธ ํ•™์Šต. ์†์‹ค ํ•จ์ˆ˜(MSE, MAE ๋“ฑ)์˜ ๊ธฐ์šธ๊ธฐ๋ฅผ ๊ณ„์‚ฐํ•˜์—ฌ ์ด์ „ ๋ชจ๋ธ์ด ์˜ˆ์ธก์„ ์ž˜๋ชปํ•œ ์ƒ˜ํ”Œ์— ๋” ํฐ ๊ฐ€์ค‘์น˜๋ฅผ ์ฃผ๊ณ , ์ด๋ฅผ ๋ชจ์ •ํ•˜๋Š” ์ƒˆ๋กœ์šด ๋ชจ๋ธ์„ ํ•™์Šตํ•จ
  • ๋ชจ๋ธ์˜ ๊ฒฐํ•ฉ : ์•ฝํ•œ ํ•™์Šต์ž(๋ชจ๋ธ)์˜ ์˜ˆ์ธก์— ๊ฐ€์ค‘์น˜๋ฅผ ๋ถ€์—ฌํ•˜์—ฌ ์ค‘์š” ๋ชจ๋ธ์˜ ์˜ˆ์ธก์— ๋†’์€ ๊ฐ€์ค‘์น˜๋ฅผ ์ฃผ๊ณ , ๋œ ์ค‘์š”ํ•œ ๋ชจ๋ธ์˜ ์˜ˆ์ธก์—๋Š” ๋‚ฎ์€ ๊ฐ€์ค‘์น˜๋ฅผ ์คŒ


LabelEncoding (Label-Encoding the Sex Column)

  • Label-encode the Sex column values I, M, and F
  • Confirm that the encoded values come out as 0, 1, and 2
  • Apply the same process to the test data, but be careful to transform it with the label encoder that was fit on the train dataset
import pandas as pd
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# Flag where each row came from: competition data vs. the original dataset
train['generated'] = 1
original['generated'] = 0
test['generated'] = 1

# Drop the id column and stack the original data under the train data
train.drop(columns = 'id', axis = 1, inplace = True)
train = pd.concat([train, original], axis = 0).reset_index(drop = True)

# Fit the encoder on the train data and replace Sex with its integer codes
train['Sex'] = le.fit_transform(train['Sex'])
train.head()
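
As a quick sanity check (not in the original code): scikit-learn's LabelEncoder assigns codes in sorted order of the class labels, so the three Sex values map to F → 0, I → 1, M → 2:

# Inspect the fitted mapping: classes_ lists the labels in code order
print(le.classes_)                      # ['F' 'I' 'M']
print(le.transform(['F', 'I', 'M']))    # [0 1 2]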

 

  • Separate the X (feature) and Y (target) variables
X = train.drop(columns = 'Age', axis = 1)
Y = train['Age']

 

  • Label-encode the test data
# Transform test with the encoder already fit on train (no refitting)
test_baseline = test.drop(columns = ['id'], axis = 1)
test_baseline['Sex'] = le.transform(test_baseline['Sex'])

 

 

K-ํด๋“œ ๊ต์ฐจ ๊ฒ€์ฆ & Gradient Boosting (๋ฒ ์ด์Šค ์ฝ”๋“œ ์ž‘์„ฑ)

  • K-ํด๋“œ ๊ต์ฐจ ๊ฒ€์ฆ์„ ์‚ฌ์šฉํ•˜์—ฌ Gradient Boosting ํšŒ๊ท€ ๋ชจ๋ธ์„ ์—ฌ๋Ÿฌ๋ฒˆ ํ•™์Šตํ•˜๊ณ  ํ‰๊ฐ€ํ•˜๋Š” ์ž‘์—…์„ ์ˆ˜ํ–‰
  • skf = KFold(n_splits=10, random_state=42, shuffle=True)
    • 10๊ฐœ์˜ ํด๋“œ๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ๋‚˜๋ˆ”
    • ๋ฌด์ž‘์œ„ ์‹œ๋“œ๋ฅผ ์„ค์ •ํ•˜์—ฌ ๊ฒฐ๊ณผ๋ฅผ ์žฌํ˜„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ๋งŒ๋“ฆ
    • ๊ฐ ํด๋“œ์˜ ๋ฐ์ดํ„ฐ๋ฅผ ์„ž์„ ๊ฒƒ์ธ์ง€ ์—ฌ๋ถ€(True = ๋ฐ์ดํ„ฐ๊ฐ€ ๋ฌด์ž‘์œ„๋กœ ์„ž์ž„)
  • K-ํด๋“œ ๊ต์ฐจ ๊ฒ€์ฆ์„ ๋ฐ˜๋ณต
    • skf.split(X, Y) : ๋ฐ์ดํ„ฐ๋ฅผ ๋‚˜๋ˆ„๊ณ  ํ›ˆ๋ จ ๋ฐ ํ…Œ์ŠคํŠธ ์ธ๋ฑ์Šค๋ฅผ ์ƒ์„ฑ
    • enumerate ํ•จ์ˆ˜๋กœ ๋ฐ˜๋ณต ๋ฒˆํ˜ธ i ์™€ ํ›ˆ๋ จ ๋ฐ ํ…Œ์ŠคํŠธ ์ธ๋ฑ์Šค train_ix, test_ix ๋ฅผ ์–ป์Œ
    • ํ˜„์žฌ ํด๋“œ(i๋ฒˆ์งธ) ์— ๋Œ€ํ•œ ํ›ˆ๋ จ ๋ฐ ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ๋ฅผ ์ถ”์ถœ
  • Gradient Boosting ๋ชจ๋ธ ์„ค์ • ๋ฐ ํ•™์Šต
    • gb_md ์„ ์„ค์ •ํ•˜๊ณ  ํŠน์ • ๋งค๊ฐœ๋ณ€์ˆ˜๋กœ ์ดˆ๊ธฐํ™” ํ•จ
    • gb_md.fit(X_train, Y_train)์œผ๋กœ ๋ชจ๋ธ ํ•™์Šต
    • ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ๋กœ ์˜ˆ์ธก ๋ฐ ํ‰๊ฐ€ : gb_pred_1 ๊ณผ gb_pred_2 ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•œ ์˜ˆ์ธก ์ƒ์„ฑ
  • gb_score_fold ๋กœ ํ˜„์žฌ ํด๋“œ์—์„œ ๋ชจ๋ธ ์„ฑ๋Šฅ์„ ๋‚˜ํƒ€๋‚ด๋Š” ํ‰๊ท  ์ ˆ๋Œ€ ์˜ค์ฐจ(MAE)๋ฅผ ๊ณ„์‚ฐ
  • gb_preds ๋ฆฌ์ŠคํŠธ์— gb_score_fold ๊ฐ’ ์ €์žฅ
  • ๊ฒฐ๊ณผ ์ถœ๋ ฅ : ๊ฐ ํด๋“œ๋งˆ๋‹ค Gradient Boosting ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ๊ณผ ํ˜„์žฌ ํด๋“œ ๋ฒˆํ˜ธ๋ฅผ ์ถœ๋ ฅ
# ํ‰๊ฐ€ ์ง€ํ‘œ๋ฅผ ๋นˆ ๋ฆฌ์ŠคํŠธ ๊ฐ’์œผ๋กœ ์ƒ์„ฑ
gb_cv_scores, gb_preds = list(), list()

# kํด๋“œ + gradient Boosting ์ง„ํ–‰
skf = KFold(n_splits = 10, random_state = 42, shuffle = True)

for i, (train_ix, test_ix) in enumerate(skf.split(X, Y)):
    X_train, X_test = X.iloc[train_ix], X.iloc[test_ix]
    Y_train, Y_test = Y.iloc[train_ix], Y.iloc[test_ix]
    
    print(f'----------------------------------------------------------------')
    
    #Gradient Boosting
    gb_md = GradientBoostingRegressor(loss = 'absolute_error',
                                     n_estimators = 100,
                                     max_depth = 8,
                                     learning_rate = 0.01,
                                     min_samples_split = 10,
                                     min_samples_leaf = 20)
                                     
    gb_md.fit(X_train, Y_train)
    gb_pred_1 = gb_md.predict(X_test[X_test['generated'] == 1])
    gb_pred_2 = gb_md.predict(test_baseline)
    
    gb_score_fold = mean_absolute_error(Y_test[X_test['generated'] == 1], gb_pred_1)
    gb_preds.append(gb_score_fold)
    gb_preds.append(gb_pred_2)
    
    print('Fold', i, '-->> GradientBoosting of MAE is ---->>', gb_score_fold)
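
As a hypothetical follow-up sketch (not shown in the post), the per-fold MAE scores can be averaged into an overall CV score, and the 10 per-fold test predictions stored in gb_preds can be blended into a single prediction vector by averaging across folds:

import numpy as np

# Overall CV performance: mean MAE across the 10 folds
print('Mean CV MAE :', np.mean(gb_cv_scores))

# Blend the per-fold test predictions into one final prediction
final_test_pred = np.mean(gb_preds, axis = 0)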
