
[๐Ÿฆ€ ๊ฒŒ ๋‚˜์ด ์˜ˆ์ธก(6)] Baseline Modeling 2 (LAD Regression)

by ISLA! 2023. 9. 26.

 

Referring to the modeling code from posts (4)~(5), we write a second baseline.

Feature engineering is performed first, and a CatBoost model makes predictions on the result.

Several other models are used alongside it, and 10-fold cross-validation is run so that each model's performance can be checked on every fold.


๐Ÿš€ ๋ชจ๋ธ๋ง ์ค€๋น„ : ๋ณ€์ˆ˜ ์„ ํƒ & ์ธ์ฝ”๋”ฉ

X = train.drop(columns = ['Age'])
Y = train['Age']

# create the key derived features on the train predictors
X['Meat Yield'] = X['Shucked Weight'] / (X['Weight'] + X['Shell Weight'])
X['Shell Ratio'] = X['Shell Weight'] / X['Weight']
X['Weight_to_Shucked_Weight'] = X['Weight'] / X['Shucked Weight']
X['Viscera Ratio'] = X['Viscera Weight'] / X['Weight']

# clean up the test data / label-encode 'Sex' with the encoder fitted on train
test_baseline = test.drop(columns = ['id'])
test_baseline['Sex'] = le.transform(test_baseline['Sex'])

# create the same derived features on the test predictors
test_baseline['Meat Yield'] = test_baseline['Shucked Weight'] / (test_baseline['Weight'] + test_baseline['Shell Weight'])
test_baseline['Shell Ratio'] = test_baseline['Shell Weight'] / test_baseline['Weight']
test_baseline['Weight_to_Shucked_Weight'] = test_baseline['Weight'] / test_baseline['Shucked Weight']
test_baseline['Viscera Ratio'] = test_baseline['Viscera Weight'] / test_baseline['Weight']
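Since the same four ratios are computed once for train and once for test, a small helper could keep the two frames in sync (a sketch; add_ratio_features is a hypothetical name, not from the original code):

def add_ratio_features(df):
    # shared derived features; works on a copy so the input frame is untouched
    df = df.copy()
    df['Meat Yield'] = df['Shucked Weight'] / (df['Weight'] + df['Shell Weight'])
    df['Shell Ratio'] = df['Shell Weight'] / df['Weight']
    df['Weight_to_Shucked_Weight'] = df['Weight'] / df['Shucked Weight']
    df['Viscera Ratio'] = df['Viscera Weight'] / df['Weight']
    return df

X = add_ratio_features(X)
test_baseline = add_ratio_features(test_baseline)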

 

๐Ÿš€ 5๊ฐœ ๋ชจ๋ธ๋ณ„ & LAD ์•™์ƒ๋ธ” ๊ฒฐ๊ณผ ์ €์žฅ์šฉ :  MAE, ์˜ˆ์ธก ๊ฒฐ๊ณผ ๋ฆฌ์ŠคํŠธ ์ƒ์„ฑ

# one list of fold MAEs and one list of fold test predictions per model
gb_cv_scores, gb_preds = list(), list()
hist_cv_scores, hist_preds = list(), list()
lgb_cv_scores, lgb_preds = list(), list()
xgb_cv_scores, xgb_preds = list(), list()
cat_cv_scores, cat_preds = list(), list()

ens_cv_scores_1, ens_preds_1 = list(), list()
ens_cv_scores_2, ens_preds_2 = list(), list()
ens_cv_scores_3, ens_preds_3 = list(), list()
ens_cv_scores_4, ens_preds_4 = list(), list()

 

๐Ÿš€ K-fold ์ƒ์„ฑ, ํด๋“œ๋ณ„ ๋ชจ๋ธ๋ง ์ˆ˜ํ–‰ ๋ฐ ๊ฒฐ๊ณผ ํ™•์ธ

  • ์ง€๋‚œ ๋ชจ๋ธ๋ง์—์„œ์™€ ๊ฐ™์ด ๋ชจ๋ธ๋ง ์ฝ”๋“œ๋ฅผ ๋™์ผํ•˜๊ฒŒ ์ž‘์„ฑ
  • 5๊ฐ€์ง€ ๋ชจ๋ธ ์‚ฌ์šฉ : Gradient Boosting, Hist Gradeint, LightBGM, XGBoost, CatBoost
  • ์—ฌ๊ธฐ์— ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์กฐ์ •ํ•œ 4๊ฐ€์ง€ LAD ํšŒ๊ท€ ๋ชจ๋ธ์„ ์ถ”๊ฐ€๋กœ ์‚ฌ์šฉ
kf = KFold(n_splits = 10, random_state = 42, shuffle = True)

for i, (train_ix, test_ix) in enumerate(kf.split(X, Y)):
    X_train, X_test = X.iloc[train_ix], X.iloc[test_ix]
    Y_train, Y_test = Y.iloc[train_ix], Y.iloc[test_ix]

    print('---------------------------------------------------------------')

    ######################
    ## GradientBoosting ##
    ######################
    
    gb_features = ['Sex',
                   'Length',
                   'Diameter',
                   'Height',
                   'Weight',
                   'Shucked Weight',
                   'Viscera Weight',
                   'Shell Weight',
                   'generated']

    # ๋ชจ๋ธ๋ง์— ์‚ฌ์šฉํ•  ์ฃผ์š” ํ”ผ์ณ๋งŒ ๋ฝ‘์•„ ์ €์žฅ
    X_train_gb = X_train[gb_features]
    X_test_gb = X_test[gb_features]
    test_baseline_gb = test_baseline[gb_features]

    gb_md = GradientBoostingRegressor(loss = 'absolute_error',
                                      n_estimators = 1000,
                                      max_depth = 8,
                                      learning_rate = 0.01,
                                      min_samples_split = 10,
                                      min_samples_leaf = 20,
                                      random_state = 42)
    gb_md.fit(X_train_gb, Y_train)

    # score out-of-fold only on rows with generated == 1; predict on the full test set
    gb_pred_1 = gb_md.predict(X_test_gb[X_test_gb['generated'] == 1])
    gb_pred_2 = gb_md.predict(test_baseline_gb)

    gb_score_fold = mean_absolute_error(Y_test[X_test_gb['generated'] == 1], gb_pred_1)
    gb_cv_scores.append(gb_score_fold)
    gb_preds.append(gb_pred_2)

    print('Fold', i, '==> GradientBoosting oof MAE is ==>', gb_score_fold)

    ##########################
    ## HistGradientBoosting ##
    ##########################
        
    hist_md = HistGradientBoostingRegressor(loss = 'absolute_error',
                                            l2_regularization = 0.01,
                                            early_stopping = False,
                                            learning_rate = 0.01,
                                            max_iter = 1000,
                                            max_depth = 15,
                                            max_bins = 255,
                                            min_samples_leaf = 70,
                                            max_leaf_nodes = 115,
                                            random_state = 42).fit(X_train, Y_train) 
    
    hist_pred_1 = hist_md.predict(X_test[X_test['generated'] == 1])
    hist_pred_2 = hist_md.predict(test_baseline)

    hist_score_fold = mean_absolute_error(Y_test[X_test['generated'] == 1], hist_pred_1)
    hist_cv_scores.append(hist_score_fold)
    hist_preds.append(hist_pred_2)
    
    print('Fold', i, '==> HistGradient oof MAE is ==>', hist_score_fold)

    ##############
    ## LightGBM ##
    ##############
        
    lgb_md = LGBMRegressor(objective = 'mae', 
                           n_estimators = 1000,
                           max_depth = 15,
                           learning_rate = 0.01,
                           num_leaves = 105, 
                           reg_alpha = 8, 
                           reg_lambda = 3, 
                           subsample = 0.6, 
                           colsample_bytree = 0.8,
                           random_state = 42).fit(X_train, Y_train)
    
    lgb_pred_1 = lgb_md.predict(X_test[X_test['generated'] == 1])
    lgb_pred_2 = lgb_md.predict(test_baseline)

    lgb_score_fold = mean_absolute_error(Y_test[X_test['generated'] == 1], lgb_pred_1)    
    lgb_cv_scores.append(lgb_score_fold)
    lgb_preds.append(lgb_pred_2)
    
    print('Fold', i, '==> LightGBM oof MAE is ==>', lgb_score_fold)

    #############
    ## XGBoost ##
    #############
    
    # pseudo-Huber loss: a smooth approximation of absolute error (MAE)
    xgb_md = XGBRegressor(objective = 'reg:pseudohubererror',
                          tree_method = 'hist',
                          colsample_bytree = 0.9, 
                          gamma = 0.65, 
                          learning_rate = 0.01, 
                          max_depth = 7, 
                          min_child_weight = 20, 
                          n_estimators = 1500,
                          subsample = 0.7,
                          random_state = 42).fit(X_train_gb, Y_train) 
    
    xgb_pred_1 = xgb_md.predict(X_test_gb[X_test_gb['generated'] == 1])
    xgb_pred_2 = xgb_md.predict(test_baseline_gb)

    xgb_score_fold = mean_absolute_error(Y_test[X_test_gb['generated'] == 1], xgb_pred_1)    
    xgb_cv_scores.append(xgb_score_fold)
    xgb_preds.append(xgb_pred_2)
    
    print('Fold', i, '==> XGBoost oof MAE is ==>', xgb_score_fold)

    ##############
    ## CatBoost ##
    ##############
    
    # CatBoost also gets the derived ratio features ('Viscera Ratio' is left out here)
    cat_features = ['Sex',
                    'Length',
                    'Diameter',
                    'Height',
                    'Weight',
                    'Shucked Weight',
                    'Viscera Weight',
                    'Shell Weight',
                    'generated',
                    'Meat Yield',
                    'Shell Ratio',
                    'Weight_to_Shucked_Weight']
    
    X_train_cat = X_train[cat_features]
    X_test_cat = X_test[cat_features]
    test_baseline_cat = test_baseline[cat_features]

    cat_md = CatBoostRegressor(loss_function = 'MAE',
                               iterations = 1000,
                               learning_rate = 0.08,
                               depth = 10, 
                               random_strength = 0.2,
                               bagging_temperature = 0.7,
                               border_count = 254,
                               l2_leaf_reg = 0.001,
                               verbose = False,
                               grow_policy = 'Lossguide',
                               task_type = 'CPU',
                               random_state = 42).fit(X_train_cat, Y_train)
    
    cat_pred_1 = cat_md.predict(X_test_cat[X_test_cat['generated'] == 1])
    cat_pred_2 = cat_md.predict(test_baseline_cat)

    cat_score_fold = mean_absolute_error(Y_test[X_test_cat['generated'] == 1], cat_pred_1)    
    cat_cv_scores.append(cat_score_fold)
    cat_preds.append(cat_pred_2)

    print('Fold', i, '==> CatBoost oof MAE is ==>', cat_score_fold)

 

[ ๐Ÿง‘‍๐Ÿ’ป LAD Regression ] This code continues inside the for loop above; it is written separately here for readability

What is LAD regression?
* Least Absolute Deviation regression; it is less sensitive to outliers and is useful when the data do not follow a normal distribution
* It fits the model by minimizing the sum of absolute residuals, ฮฃ|y_i − ลท_i| (ordinary least squares minimizes the sum of squared residuals, ฮฃ(y_i − ลท_i)², instead)
* Mean absolute error (MAE) is therefore the natural metric for evaluating the fit of such a model.

 

  • x : 5๊ฐœ ๋ชจ๋ธ์˜ ์˜ˆ์ธก๊ฐ’์„ ์†Œ์ˆ˜์  ์ฒซ์งธ์งœ๋ฆฌ์—์„œ ๋ฐ˜์˜ฌ๋ฆผํ•˜์—ฌ, ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„์„ ์ƒ์„ฑ
  • y : test์šฉ train ๋ฐ์ดํ„ฐ์˜ ์ข…์†๋ณ€์ˆ˜
  • fit_intercept : ํšŒ๊ท€๋ชจ๋ธ์—์„œ ์ƒ์ˆ˜ํ•ญ์„ ํ•™์Šตํ•  ์ง€ ์—ฌ๋ถ€๋ฅผ ์ง€์ •(์ƒ์ˆ˜ํ•ญ์€ ํšŒ๊ท€ ์ง์„ ์ด ์›์ ์„ ํ†ต๊ณผํ•˜๋Š”์ง€ ์—ฌ๋ถ€๋ฅผ ๊ฒฐ์ •)
    • True : ๊ธฐ๋ณธ๊ฐ’์œผ๋กœ ์ƒ์ˆ˜ํ•ญ์„ ํฌํ•จํ•˜์—ฌ ํšŒ๊ท€ ์ง์„ ์„ ํ•™์Šตํ•จ. ์ผ๋ฐ˜์ ์ธ ์ƒํ™ฉ์—์„œ ์‚ฌ์šฉ๋˜๋ฉฐ ๋ฐ์ดํ„ฐ๊ฐ€ ์›์ ์„ ์ค‘์‹ฌ์œผ๋กœ ๋ถ„ํฌํ•˜์ง€ ์•Š๋Š” ๊ฒฝ์šฐ ์œ ์šฉํ•˜๋‹ค.
    • False : ๋ชจ๋ธ์ด ์ƒ์ˆ˜ํ•ญ์„ ๋ฌด์‹œํ•˜๊ณ  ํšŒ๊ท€ ์ง์„ ์„ ์›์ ์œผ๋กœ ํ†ต๊ณผํ•˜๋„๋ก ๊ฐ•์ œํ•œ๋‹ค. ์ด๋ ‡๊ฒŒ ์„ค์ •ํ•˜๋ฉด ๋ฐ์ดํ„ฐ๊ฐ€ ์›์ ์„ ์ค‘์‹ฌ์œผ๋กœ ๋ฐ์ดํ„ฐ๊ฐ€ ๋ถ„ํฌํ•  ๊ฒƒ์„ ๊ฐ€์ •ํ•œ๋‹ค.
    • fit_intercept๋ฅผ ์กฐ์ ˆํ•˜์—ฌ ๋ชจ๋ธ ํŽธํ–ฅ์„ ์กฐ์ ˆํ•˜๊ณ , ๋ฐ์ดํ„ฐ์™€ ๋ชจ๋ธ ๊ฐ„ ์ ํ•ฉ๋„๋ฅผ ๋” ์ž˜ ์กฐ์ ˆํ•  ์ˆ˜ ์žˆ๋‹ค.
  • positive : ํšŒ๊ท€ ๋ชจ๋ธ์—์„œ ์˜ˆ์ธก๊ฐ’์ด ์–‘์ˆ˜๋กœ ์ œํ•œ๋˜์–ด์•ผ ํ•˜๋Š”์ง€ ์—ฌ๋ถ€. 
    • True : ๋ชจ๋ธ ์˜ˆ์ธก๊ฐ’์ด ์–‘์ˆ˜๋กœ ์ œํ•œ(์˜ˆ์ธก๊ฐ’์€ 0 ๋˜๋Š” ์–‘์˜ ๊ฐ’๋งŒ ๊ฐ€์งˆ ์ˆ˜ ์žˆ์Œ)
    • False : ๋ชจ๋ธ ์˜ˆ์ธก๊ฐ’์— ์ œํ•œ์„ ๋‘์ง€ ์•Š์Œ
    • ๊ฐ€๊ฒฉ ์˜ˆ์ธก๊ณผ ๊ฐ™์ด ์˜ˆ์ธก๊ฐ’์ด ์–‘์ˆ˜๋กœ๋งŒ ์ œํ•œ๋˜์–ด์•ผ ํ•˜๋Š” ๋ฌธ์ œ์—์„œ ๋ชจ๋ธ ์œ ํšจ์„ฑ์„ ๋†’์ผ ์ˆ˜ ์žˆ์Œ
    ##################
    ## LAD Ensemble ##
    ##################
    
    x = pd.DataFrame({'GBC': np.round(gb_pred_1), 'hist': np.round(hist_pred_1),
                      'lgb': np.round(lgb_pred_1), 'xgb': np.round(xgb_pred_1),
                      'cat': np.round(cat_pred_1)})
    y = Y_test[X_test['generated'] == 1]

    x_test = pd.DataFrame({'GBC': np.round(gb_pred_2), 'hist': np.round(hist_pred_2),
                           'lgb': np.round(lgb_pred_2), 'xgb': np.round(xgb_pred_2),
                           'cat': np.round(cat_pred_2)})
    
    lad_md_1 = LADRegression(fit_intercept = True, positive = False).fit(x, y)
    lad_md_2 = LADRegression(fit_intercept = True, positive = True).fit(x, y)
    lad_md_3 = LADRegression(fit_intercept = False, positive = True).fit(x, y)
    lad_md_4 = LADRegression(fit_intercept = False, positive = False).fit(x, y)
    
    lad_pred_1 = lad_md_1.predict(x)
    lad_pred_2 = lad_md_2.predict(x)
    lad_pred_3 = lad_md_3.predict(x)
    lad_pred_4 = lad_md_4.predict(x)

    lad_pred_test_1 = lad_md_1.predict(x_test)
    lad_pred_test_2 = lad_md_2.predict(x_test)
    lad_pred_test_3 = lad_md_3.predict(x_test)
    lad_pred_test_4 = lad_md_4.predict(x_test)
        
    ens_score_1 = mean_absolute_error(y, lad_pred_1)
    ens_cv_scores_1.append(ens_score_1)
    ens_preds_1.append(lad_pred_test_1)
    
    ens_score_2 = mean_absolute_error(y, lad_pred_2)
    ens_cv_scores_2.append(ens_score_2)
    ens_preds_2.append(lad_pred_test_2)
    
    ens_score_3 = mean_absolute_error(y, lad_pred_3)
    ens_cv_scores_3.append(ens_score_3)
    ens_preds_3.append(lad_pred_test_3)
    
    ens_score_4 = mean_absolute_error(y, lad_pred_4)
    ens_cv_scores_4.append(ens_score_4)
    ens_preds_4.append(lad_pred_test_4)
    
    print('Fold', i, '==> LAD Model 1 ensemble oof MAE is ==>', ens_score_1)
    print('Fold', i, '==> LAD Model 2 ensemble oof MAE is ==>', ens_score_2)
    print('Fold', i, '==> LAD Model 3 ensemble oof MAE is ==>', ens_score_3)
    print('Fold', i, '==> LAD Model 4 ensemble oof MAE is ==>', ens_score_4)

 

๐Ÿ‘‰ ๊ฒฐ๊ณผ ํ™•์ธ

 

๐Ÿš€ ํด๋“œ๋ณ„ ๊ฒฐ๊ณผ(MAE) ํ‰๊ท ๋‚ด์–ด ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์œผ๋กœ ํ™•์ธ, ์‹œ๊ฐํ™”

gb_cv_score = np.mean(gb_cv_scores)
hist_cv_score = np.mean(hist_cv_scores)
lgb_cv_score = np.mean(lgb_cv_scores)
xgb_cv_score = np.mean(xgb_cv_scores)
cat_cv_score = np.mean(cat_cv_scores)
ens_cv_score_1 = np.mean(ens_cv_scores_1)
ens_cv_score_2 = np.mean(ens_cv_scores_2)
ens_cv_score_3 = np.mean(ens_cv_scores_3)
ens_cv_score_4 = np.mean(ens_cv_scores_4)

model_perf = pd.DataFrame({'Model': ['GradientBoosting', 'HistGradient', 'LightGBM', 'XGBoost', 'CatBoost', 
                                     'LAD Model 1',
                                     'LAD Model 2',
                                     'LAD Model 3',
                                     'LAD Model 4'],
                           'cv-score': [gb_cv_score, hist_cv_score, lgb_cv_score, xgb_cv_score, cat_cv_score, 
                                        ens_cv_score_1,
                                        ens_cv_score_2,
                                        ens_cv_score_3,
                                        ens_cv_score_4]})

plt.figure(figsize = (8, 8))
ax = sns.barplot(y = 'Model', x = 'cv-score', data = model_perf)
ax.bar_label(ax.containers[0]);
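From here, a typical next step (not part of this post's code) is to average each ensemble's 10 fold-level test predictions into a single submission. A sketch, with ens_preds_2 picked purely as an example and the id/Age column names assumed from the competition format:

# average the 10 per-fold test predictions of one LAD ensemble
final_pred = np.mean(ens_preds_2, axis = 0)

# hypothetical submission layout: one id column and the predicted Age
submission = pd.DataFrame({'id': test['id'], 'Age': final_pred})
submission.to_csv('submission.csv', index = False)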
