[Mini Project] 7. Modeling Baseline Code (feat. Troubleshooting 🤨)

by ISLA! 2023. 9. 12.

 

🚀 Model selection for regression: LightGBM

  • In theory, LightGBM is prone to overfitting on small datasets (typically 10,000 rows or fewer). It is nevertheless widely used and performs well, so we adopt it as the default model.
  • Later, if possible, we plan to try other algorithms and compare their predictive performance.

 

▶︎ Choosing LightGBM hyperparameters

  • LightGBM is a gradient boosting framework built on tree-based learning, so many of its hyperparameters carry over from decision tree models.
  • To optimize the model, we decided to survey the main hyperparameters and pick a subset (items marked ✔ are the tentatively adopted ones).
  • The best values for the selected hyperparameters will be found with RandomizedSearchCV.
✔ max_depth : Maximum depth of a tree; the first hyperparameter to tune
    * Controls overfitting
    * Acts as pruning
    * Values of 3 to 12 are typical
min_data_in_leaf : Minimum number of records a leaf must hold
    * Controls overfitting
    * Default = 20
feature_fraction : Fraction of the features randomly selected at each iteration when building a tree (0.8 means 80%)
    * Noted as used when the boosting type is random forest
bagging_fraction : Fraction of the data sampled at each iteration; mainly used to speed up training and prevent overfitting
early_stopping_round : Training stops if a metric on a validation set has not improved within the last early_stopping_round rounds
    * Helps cut excessive iterations
    * Helps speed up training
min_gain_to_split : Minimum gain required to make a split; used to control the number of splits in a tree
✔ num_leaves
✔ learning_rate
✔ num_iterations (scikit-learn API: n_estimators) : Number of boosting iterations; affects model performance, training time, and RAM usage
    * Recommended: set a large value and use it together with early stopping
    * A large value without early stopping risks overfitting
    * A value of around 50 is commonly used (LightGBM's default is 100)
✔ early_stopping_rounds
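
The interplay between num_iterations and early stopping described above can be pictured with a small pure-Python sketch. It is a simplified illustration of the stopping rule only, not LightGBM's implementation; the function name and the loss values are made up for the example.

```python
def simulate_early_stopping(val_losses, num_iterations, early_stopping_round):
    """Return (best_iteration, iterations_run) for a sequence of validation losses.

    val_losses[i] is the validation loss after boosting round i + 1 (lower is better).
    Training stops once the loss has not improved for `early_stopping_round` rounds.
    """
    best_loss = float("inf")
    best_iteration = 0
    total = min(num_iterations, len(val_losses))
    for i in range(total):
        if val_losses[i] < best_loss:
            best_loss = val_losses[i]
            best_iteration = i + 1
        elif (i + 1) - best_iteration >= early_stopping_round:
            return best_iteration, i + 1
    return best_iteration, total


# Hypothetical validation RMSE per round: improves, then drifts upward (overfitting).
losses = [10.0, 8.0, 7.0, 6.5, 6.4, 6.45, 6.5, 6.6, 6.7, 6.8]
best, ran = simulate_early_stopping(losses, num_iterations=1000, early_stopping_round=3)
print(best, ran)  # → 5 8
```

Even though num_iterations is set to 1000, training stops after round 8 and the model from round 5 is kept, which is why a generous iteration budget is safe only when paired with early stopping.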

 

▶︎ Writing the modeling baseline code (to be revised and improved later!)

  • Run k-fold cross-validation
  • Search for the best hyperparameters with RandomizedSearchCV
  • Build the prediction model with LightGBM
  • Use RMSE and MAE as the basic evaluation metrics (MSE is also computed along the way)
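
For reference, these are the standard definitions of the metrics, with y_i the observed value, ŷ_i the prediction, and n the number of validation rows (the notation is chosen here for illustration):

```latex
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}, \qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert
```

RMSE and MAE are both in the same unit as the target, which makes them easier to read than MSE.
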
import numpy as np
import lightgbm as lgb
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import KFold, RandomizedSearchCV, train_test_split

# `data` is assumed to be the preprocessed DataFrame prepared earlier
X = data[['점포수', '시간대1',
       '시간대2', '시간대3', '시간대4', '시간대5', '분기_1', '분기_2', '분기_3', '총 상주인구 수',
       '총 가구 수', '총_직장인구_수', '아파트_단지_수', '아파트_가격_1_억_미만_세대_수',
       '아파트_가격_1_억_세대_수', '아파트_가격_2_억_세대_수', '아파트_가격_3_억_세대_수',
       '아파트_가격_4_억_세대_수', '아파트_가격_5_억_세대_수', '아파트_가격_6_억_이상_세대_수', '총_생활인구_수',
       '시간대_생활인구_수', '월요일_생활인구_수', '화요일_생활인구_수', '수요일_생활인구_수', '목요일_생활인구_수',
       '금요일_생활인구_수', '토요일_생활인구_수', '일요일_생활인구_수', '집객시설_수', '관공서_수', '은행_수',
       '백화점_수', '숙박_시설_수', 'area', '연령대_10_생활인구_수', '연령대_20_생활인구_수',
       '연령대_30_생활인구_수', '연령대_40_생활인구_수', '연령대_50_생활인구_수', '연령대_60_이상_생활인구_수',
       '배후지_아파트_단지_수', '배후지_아파트_가격_1_억_미만_세대_수', '배후지_아파트_가격_1_억_세대_수',
       '배후지_아파트_가격_2_억_세대_수', '배후지_아파트_가격_3_억_세대_수', '배후지_아파트_가격_4_억_세대_수',
       '배후지_아파트_가격_5_억_세대_수', '배후지_아파트_가격_6_억_이상_세대_수', '시간대_버스_승하차승객수',
       '시간대_지하철_승하차승객수', '버스정류장_수', '지하철역_수']]
y = data['매출']  # target: sales


# k-fold cross-validation
num_folds = 5
kf = KFold(n_splits=num_folds, shuffle=True, random_state=42)


# Baseline LightGBM parameters
# (kept for reference; not passed to any model below, the random search samples its own values)
params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'rmse',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9
}

# Initialize the feature-importance list
feature_importance_list = []

# Result scores
rmse_scores = []  # RMSE score of each fold
mae_scores = []   # MAE score of each fold
best_params_list = []  # best parameters found in each fold

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


# Parameter ranges (for the random search)
param_dist = {
    'objective': ['regression'],
    'metric': ['mse'],
    'num_leaves': list(range(7, 64)),              # 7 to 63
    'learning_rate': [0.01, 0.02, 0.03, 0.04, 0.05],  # 0.01 to 0.05
    'n_estimators': list(range(200, 301)),         # 200 to 300
    'early_stopping_rounds': list(range(40, 51))  # 40 to 50
}


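# Run k-fold cross-validation
# kf.split(X_train) yields positions within X_train, so index X_train / y_train (not X / y)
for train_index, val_index in kf.split(X_train):
    X_train_kf, X_val_kf = X_train.iloc[train_index], X_train.iloc[val_index]
    y_train_kf, y_val_kf = y_train.iloc[train_index], y_train.iloc[val_index]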


    # Datasets for the native LightGBM API (not used by the scikit-learn API calls below)
    train_data = lgb.Dataset(X_train_kf, label=y_train_kf)
    val_data = lgb.Dataset(X_val_kf, label=y_val_kf, reference=train_data)


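    # Tune the LightGBM model with random search
    random_search = RandomizedSearchCV(
        lgb.LGBMRegressor(),
        param_distributions=param_dist,
        n_iter=10,
        scoring='neg_mean_squared_error',
        cv=kf,
        random_state=42,
        n_jobs=-1,
        verbose=1
    )

    # eval_set and eval_metric are forwarded to every LGBMRegressor.fit call;
    # early stopping needs at least one evaluation set and metric
    evals = [(X_train_kf, y_train_kf), (X_val_kf, y_val_kf)]
    random_search.fit(X_train_kf, y_train_kf, eval_set=evals, eval_metric='rmse')
    best_params = random_search.best_params_

    # Refit on this fold with the best parameters found
    bst = lgb.LGBMRegressor(**best_params)

    # Note: the `verbose` argument of fit() was removed in LightGBM 4.0;
    # on 4.x, drop it and pass verbose=-1 to the constructor instead
    bst.fit(X_train_kf, y_train_kf,
            eval_set=evals,
            eval_metric='rmse',
            verbose=False)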


    # Evaluate the model on the validation fold (MSE, RMSE, MAE)
    y_pred = bst.predict(X_val_kf)
    mse = mean_squared_error(y_val_kf, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_val_kf, y_pred)

    rmse_scores.append(rmse)
    mae_scores.append(mae)
    best_params_list.append(best_params)


# Print the cross-validation results
mean_rmse = np.mean(rmse_scores)
mean_mae = np.mean(mae_scores)
print(f'Mean RMSE: {mean_rmse}')
print(f'Mean MAE: {mean_mae}')

# Print the best parameters found in each fold
print("Best Hyperparameters for K-fold CV:")
for i, fold_params in enumerate(best_params_list):
    print(f'Fold {i + 1}: {fold_params}')
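
One detail worth double-checking in a loop like the one above: kf.split(X_train) returns positions within X_train, so those positions should index X_train and y_train rather than the full X and y. Applied to the full data, they can pull held-out test rows into the folds. The sketch below illustrates this with plain Python lists; the row ids and the helper k_fold_positions are hypothetical stand-ins for the DataFrame and for KFold.

```python
import random

# Hypothetical row ids standing in for the full dataset (X, y).
rows = list(range(100, 120))

# Stand-in for train_test_split: shuffle, then keep 80% for training.
shuffled = rows[:]
random.Random(42).shuffle(shuffled)
train_rows, test_rows = shuffled[:16], shuffled[16:]


def k_fold_positions(n, k):
    """Yield (train_positions, val_positions) over range(n), unshuffled."""
    fold_size = n // k
    for fold in range(k):
        val = list(range(fold * fold_size, (fold + 1) * fold_size))
        val_set = set(val)
        train = [p for p in range(n) if p not in val_set]
        yield train, val


validated = []  # rows used for validation when positions index the training split
leaked = 0      # test rows reached when the same positions index the full data
for _, val_pos in k_fold_positions(len(train_rows), k=4):
    validated.extend(train_rows[p] for p in val_pos)
    wrong = [rows[p] for p in val_pos]
    leaked += sum(r in test_rows for r in wrong)

print(f"test rows that slipped into validation folds: {leaked}")
```

Indexing the training split covers exactly the training rows, while indexing the full data can reach rows that were meant to stay unseen until the final test.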

▶︎ Issue 1

  • With little modeling experience, it was hard to choose the hyperparameters and to decide on their ranges.
  • We studied the official documentation and searched online together, then settled on the hyperparameters as a team.
  • We also learned from various blogs, ChatGPT, and similar resources.

 

▶︎ Issue 2

  • While writing the baseline code above, the following error kept occurring:
For early stopping, at least one dataset and eval metric is required for evaluation
  • Early stopping was too important to leave out, so we went through a range of documents to solve the problem.
  • As a result, we confirmed that the problem was in eval_set
    ⇒ We suspected that something was wrong in how the validation set was built from KFold and train_test_split
  • We reviewed the model fit step against various LightGBM sample code and the training code in a Python machine learning textbook
    ⇒ We confirmed that LightGBM needs eval_set to be given as [(X_train, y_train), (X_valid, y_valid)]
  • After that, model training ran smoothly.

 

 
