
[🦀 Crab Age Prediction (4)] Baseline Modeling (Aggregating and Comparing Results from Multiple Models)

by ISLA! 2023. 9. 25.

 

🧑‍💻 Putting the Code Together

Combining everything from part (3), the baseline modeling code looks like the following.

This time, I run K-fold cross-validation for each of the three models, then average each model's fold scores (MAE) and visualize them for comparison.

With that, the first-pass baseline modeling code is complete!
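
For reference, the score used throughout is the mean absolute error (MAE): the average absolute gap between the true and predicted ages. Here is a minimal sketch using sklearn's mean_absolute_error, the same function called in the loop below (the numbers are toy values, not from the dataset):

import numpy as np
from sklearn.metrics import mean_absolute_error

# MAE = mean(|y_true - y_pred|), in the same unit as Age
y_true = np.array([9.0, 11.0, 8.0])    # toy targets
y_pred = np.array([10.0, 10.5, 7.0])   # toy predictions
print(mean_absolute_error(y_true, y_pred))  # (1.0 + 0.5 + 1.0) / 3 ≈ 0.833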


1. Label Encoding

# Label-encode the Sex column & merge the original dataset into the competition train set
import pandas as pd
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# Flag which rows come from the competition data (1) vs. the original dataset (0)
train['generated'] = 1
original['generated'] = 0
test['generated'] = 1

# Drop the id column, stack the original data under train, then encode Sex
train.drop(columns = 'id', axis = 1, inplace = True)
train = pd.concat([train, original], axis = 0).reset_index(drop = True)
train['Sex'] = le.fit_transform(train['Sex'])
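
As a quick optional check (my addition, not part of the original post), you can inspect which integer each Sex category was mapped to; in this dataset the categories are typically F, I, and M, but the exact mapping depends on the data:

# Show the fitted mapping, e.g. {'F': 0, 'I': 1, 'M': 2}
print(dict(zip(le.classes_, le.transform(le.classes_))))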

 

2. Splitting the Variables (Independent and Dependent)

# Split into independent (X) and dependent (Y) variables
X = train.drop(columns = 'Age', axis = 1)
Y = train['Age']

# Drop the id column from test & apply the same label encoding to its Sex column
test_baseline = test.drop(columns = ['id'], axis = 1)
test_baseline['Sex'] = le.transform(test_baseline['Sex'])
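
A small sanity check (my addition) to confirm that the training features and the prepared test set expose the same columns before modeling:

# Both frames should now hold the same feature columns, including 'generated'
assert list(X.columns) == list(test_baseline.columns), 'train/test feature mismatch'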

 

3. ๋ชจ๋ธ๋ณ„ MAE์™€ ์˜ˆ์ธก๊ฐ’์„ ๋‹ด์„ ๋ฆฌ์ŠคํŠธ ์ƒ์„ฑ 

# Lists to hold the fold scores (MAE) and test predictions --->> one pair per model
gb_cv_scores, gb_preds = list(), list()
hist_cv_scores, hist_preds = list(), list()
lgb_cv_scores, lgb_preds = list(), list()   # reserved for LightGBM (not used in this baseline)
xgb_cv_scores, xgb_preds = list(), list()
ens_cv_scores, ens_preds = list(), list()   # reserved for an ensemble (not used in this baseline)

 

4. Running K-Fold Cross-Validation

# Create the K-fold splitter
from sklearn.model_selection import KFold

skf = KFold(n_splits = 10, random_state = 42, shuffle = True)
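
A tiny optional illustration (my addition) of what the splitter yields: each call to skf.split(X, Y) produces 10 (train_index, test_index) pairs, which the loop below consumes. Note that despite the skf name this is a plain KFold, not a StratifiedKFold.

# Peek at the fold sizes without training anything
for fold, (train_ix, test_ix) in enumerate(skf.split(X, Y)):
    print(f'Fold {fold}: {len(train_ix)} train rows / {len(test_ix)} validation rows')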

 

5. Building, Fitting, and Collecting Predictions from the Models in Each Fold

  • ๋ชจ๋ธ ์ž‘์—…์€ ์ฝ”๋žฉ์—์„œ ์ˆ˜ํ–‰ํ–ˆ์œผ๋ฉฐ, ๊ฐ„๋‹จํžˆ Gradient Boosting & Hist Gradient Boosting & XGBoosting ๊ฒฐ๊ณผ๋ฅผ ํ•จ๊ป˜ ๋ณด๊ธฐ๋กœ ํ–ˆ์Œ
  • ๊ฒฐ๊ณผ๋Š” For ๋ฌธ์„ ํ•œ ๋ฒˆ ๋Œ ๋•Œ๋งˆ๋‹ค ์ด๋ ‡๊ฒŒ ๋‚˜์˜จ๋‹ค

Results (MAE) from running the models on the first K-fold

# Imports needed for the models and the metric used in the loop
from sklearn.ensemble import GradientBoostingRegressor, HistGradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from xgboost import XGBRegressor

for i, (train_ix, test_ix) in enumerate(skf.split(X, Y)):
        
    X_train, X_test = X.iloc[train_ix], X.iloc[test_ix]
    Y_train, Y_test = Y.iloc[train_ix], Y.iloc[test_ix]
    
    print('---------------------------------------------------------------')
    
    ######################
    ## GradientBoosting ##
    ######################
    gb_md = GradientBoostingRegressor(loss = 'absolute_error',
                                      n_estimators = 1000, 
                                      max_depth = 8, 
                                      learning_rate = 0.01,
                                      min_samples_split = 10, 
                                      min_samples_leaf = 20).fit(X_train, Y_train) 
    
    # Score out-of-fold only on the competition rows (generated == 1); also predict on test
    gb_pred_1 = gb_md.predict(X_test[X_test['generated'] == 1])
    gb_pred_2 = gb_md.predict(test_baseline)
            
    gb_score_fold = mean_absolute_error(Y_test[X_test['generated'] == 1], gb_pred_1)
    gb_cv_scores.append(gb_score_fold)
    gb_preds.append(gb_pred_2)
    
    print('Fold', i, '==> GradientBoosting oof MAE is ==>', gb_score_fold)
    
    
    ##########################
    ## HistGradientBoosting ##
    ##########################
    hist_md = HistGradientBoostingRegressor(loss = 'absolute_error',
                                            l2_regularization = 0.01,
                                            early_stopping = False,
                                            learning_rate = 0.01,
                                            max_iter = 1000,
                                            max_depth = 15,
                                            max_bins = 255,
                                            min_samples_leaf = 70,
                                            max_leaf_nodes = 115).fit(X_train, Y_train)
    
    hist_pred_1 = hist_md.predict(X_test[X_test['generated'] == 1])
    hist_pred_2 = hist_md.predict(test_baseline)

    hist_score_fold = mean_absolute_error(Y_test[X_test['generated'] == 1], hist_pred_1)
    hist_cv_scores.append(hist_score_fold)
    hist_preds.append(hist_pred_2)
    
    print('Fold', i, '==> HistGradient oof MAE is ==>', hist_score_fold)

    
    #############
    ## XGBoost ##
    #############
        
    xgb_md = XGBRegressor(objective = 'reg:pseudohubererror',
                          tree_method = 'exact',
                          colsample_bytree = 0.9, 
                          gamma = 0.65, 
                          learning_rate = 0.01, 
                          max_depth = 7, 
                          min_child_weight = 20, 
                          n_estimators = 1000).fit(X_train, Y_train)
    
    xgb_pred_1 = xgb_md.predict(X_test[X_test['generated'] == 1])
    xgb_pred_2 = xgb_md.predict(test_baseline)

    xgb_score_fold = mean_absolute_error(Y_test[X_test['generated'] == 1], xgb_pred_1)    
    xgb_cv_scores.append(xgb_score_fold)
    xgb_preds.append(xgb_pred_2)
    
    print('Fold', i, '==> XGBoost oof MAE is ==>', xgb_score_fold)

 

6. ๋ชจ๋ธ๋ณ„ k-fold ์ˆ˜ํ–‰ ๊ฒฐ๊ณผ MAE ํ‰๊ท 

# Average the MAE scores per model
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

gb_cv_score = np.mean(gb_cv_scores)
hist_cv_score = np.mean(hist_cv_scores)
xgb_cv_score = np.mean(xgb_cv_scores)

# Store the per-model results in a DataFrame
model_perf = pd.DataFrame({'Model':['GradientBoosting', 'HistGradient', 'XGBoost'],
                           'cv-score': [gb_cv_score, hist_cv_score, xgb_cv_score]})

# Visualize the DataFrame to compare the models at a glance
plt.figure(figsize = (8, 8))
ax = sns.barplot(y = 'Model', x ='cv-score', data = model_perf)
ax.bar_label(ax.containers[0]);
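
The fold-level test predictions collected above (gb_preds, hist_preds, xgb_preds) are not used yet in this baseline. Below is a minimal sketch of how they could become a simple blended submission, assuming the test id column and the usual id/Age submission format (the file and column names are my assumptions):

# Average each model's 10 fold predictions, then blend the three models equally
gb_test_pred = np.mean(gb_preds, axis = 0)
hist_test_pred = np.mean(hist_preds, axis = 0)
xgb_test_pred = np.mean(xgb_preds, axis = 0)
blend_pred = (gb_test_pred + hist_test_pred + xgb_test_pred) / 3

# Hypothetical submission frame; adjust column names to the competition's sample file
submission = pd.DataFrame({'id': test['id'], 'Age': blend_pred})
submission.to_csv('baseline_submission.csv', index = False)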

 
