๋ณธ๋ฌธ ๋ฐ”๋กœ๊ฐ€๊ธฐ
Projects/๐Ÿช Convenience Store Location Analysis

[Mini Project] 11. Baseline Modeling (LightGBM + K-fold CV + RandomSearch)

by ISLA! 2023. 9. 26.

 

โœ” Process Check
์ง€๊ธˆ๊นŒ์ง€ ์ด์ƒ์น˜์™€ ๊ฒฐ์ธก์น˜ ๋“ฑ์„ ์ ๊ฒ€ํ•˜๋Š” EDA๋ฅผ ๋งˆ์นœ ํ›„, ํŒŒ์ƒ๋ณ€์ˆ˜๋ฅผ ์ƒ์„ฑํ–ˆ๋‹ค.
์ด์ œ ๋ชจ๋ธ๋ง์„ ํ†ตํ•ด ์œ ์˜๋ฏธํ•œ ํŒŒ์ƒ๋ณ€์ˆ˜๋ฅผ ์„ ํƒํ•˜๊ณ , ์ตœ์ข… ๋ชจ๋ธ๋ง์— ํ•„์š”ํ•œ ๋ณ€์ˆ˜๋ฅผ ์ฑ„ํƒํ•˜๋Š” ๊ณผ์ •์ด ๋‚จ์•˜๋‹ค.
์ด๋ฅผ ์œ„ํ•ด ์•„๋ž˜์™€ ๊ฐ™์ด LightGBM + K-fold CV + RandomSearch ์„ ํ™œ์šฉํ•œ ๋ชจ๋ธ๋ง ๋ฒ ์ด์Šค๋ผ์ธ ์ฝ”๋“œ๊ฐ€ ์™„์„ฑ๋˜์—ˆ๋‹ค.

 

1. ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ

import numpy as np
import pandas as pd
import lightgbm as lgb
from lightgbm import LGBMRegressor
from sklearn.model_selection import KFold, train_test_split, RandomizedSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.font_manager as fm
import joblib
import os

 

โ–ถ๏ธŽ ํฐํŠธ ์„ค์ •

#font ์˜ค๋ฅ˜ ์ˆ˜์ •
font_list = fm.findSystemFonts()
font_name = None
for font in font_list:
    if 'AppleGothic' in font:
        font_name = fm.FontProperties(fname=font).get_name()
plt.rc('font', family=font_name)
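
The loop above only looks for AppleGothic, which ships with macOS. A minimal fallback sketch for other platforms, assuming (hypothetically) that Malgun Gothic (Windows) or NanumGothic (Linux) may be installed instead:

# A sketch: pick the first available Korean-capable font across platforms
# (assumption: one of these candidate fonts is installed; adjust the list as needed)
available = set()
for f in fm.findSystemFonts():
    try:
        available.add(fm.FontProperties(fname=f).get_name())
    except Exception:
        pass  # skip font files matplotlib cannot parse

for candidate in ['AppleGothic', 'Malgun Gothic', 'NanumGothic']:
    if candidate in available:
        plt.rc('font', family=candidate)
        break
plt.rc('axes', unicode_minus=False)  # avoid broken minus signs with CJK fonts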

 

2. ๋ชจ๋ธ๋ง์šฉ ๋ฐ์ดํ„ฐ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ

  • ๋ชจ๋ธ๋ง์„ ์œ„ํ•ด ์ „์ฒ˜๋ฆฌ๋ฅผ ์™„๋ฃŒํ•œ ๋ฐ์ดํ„ฐ์…‹์„ ๋ถˆ๋Ÿฌ์˜จ๋‹ค.
# ๋ฐ์ดํ„ฐ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ ๋ฐ ์ „์ฒ˜๋ฆฌ
data = pd.read_csv('./data/๊ณจ๋ชฉ_model์šฉ.csv')

 

โ–ถ๏ธŽ X, Y ๋ณ€์ˆ˜ ๋ถ„๋ฆฌ

  • ๋…๋ฆฝ๋ณ€์ˆ˜ : ๋ชจ๋ธ๋ง์— ์‚ฌ์šฉํ•  ๋ณ€์ˆ˜๋กœ, ์—ฐ๋„, ๋ถ„๊ธฐ, ์ƒ๊ถŒ์ฝ”๋“œ, ์ƒ๊ถŒ์ฝ”๋“œ๋ช…, ์‹œ๊ฐ„๋Œ€ ๋ฐ ๋งค์ถœ์„ ์ œ์™ธํ•œ ๋‚˜๋จธ์ง€ ๋ณ€์ˆ˜
  • ์ข…์†๋ณ€์ˆ˜ : ํŽธ์˜์  ๋งค์ถœ
# ๋ฐ์ดํ„ฐ ๋กœ๋“œ(์‹ค์ œ ๋ฐ์ดํ„ฐ์…‹ ๊ฐ€์ ธ์˜ค๊ธฐ)
X = data.iloc[:, 5:]
y = data.iloc[:, 0]
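
Selecting columns by position is fragile if the column order ever changes. A minimal alternative sketch using explicit column names — the Korean names below are hypothetical stand-ins, not the actual headers of this dataset:

# A sketch, assuming these (hypothetical) column names match the dataset
target_col = '매출'  # sales
id_cols = ['연도', '분기', '상권코드', '상권코드명', '시간대']  # year, quarter, district code, district name, time slot

y = data[target_col]
X = data.drop(columns=id_cols + [target_col])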

 

3. K-fold Cross-Validation & Initial LightGBM Parameters

  • ๋ฐ์ดํ„ฐ์…‹์˜ ํฌ๊ธฐ๋ฅผ ๊ณ ๋ คํ•˜์—ฌ k ๊ฐ’์„ 10์œผ๋กœ ์„ค์ •ํ•˜์—ฌ k-fold ์ง„ํ–‰
  • ๊ธฐ๋ณธ์ ์ธ LightGBM ํŒŒ๋ผ๋ฏธํ„ฐ ์„ค์ • ์™„๋ฃŒ
# k-ํด๋“œ ๊ต์ฐจ ๊ฒ€์ฆ
num_folds = 10
kf = KFold(n_splits= num_folds, shuffle=True, random_state=42)

# LightGBM ๋ชจ๋ธ ์ดˆ๊ธฐํ™”
params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'rmse',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9
}
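
Since these defaults are written in LightGBM's native parameter style, a quick sanity check — a sketch, not part of the baseline itself — could run them through lgb.cv before any tuning:

# A sketch: 10-fold sanity check of the baseline params with LightGBM's native CV API
dtrain = lgb.Dataset(X, label=y)
cv_results = lgb.cv(params, dtrain, num_boost_round=300,
                    nfold=10, stratified=False, shuffle=True, seed=42)  # stratified=False for regression

# The result key name varies across LightGBM versions ('rmse-mean' vs 'valid rmse-mean')
metric_key = next(k for k in cv_results if k.endswith('rmse-mean'))
print(f'Best mean CV RMSE: {min(cv_results[metric_key]):.4f}')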

 

โ–ถ๏ธŽ ํŠน์„ฑ ์ค‘์š”๋„ ๋ฆฌ์ŠคํŠธ / ๊ฒฐ๊ณผ ์Šค์ฝ”๋” ๋ฆฌ์ŠคํŠธ ์ƒ์„ฑ

  • ๋ชจ๋ธ๋ง ๊ฒฐ๊ณผ์ธ (1) ํŠน์„ฑ ์ค‘์š”๋„์™€ (2) ํด๋“œ๋ณ„ ๊ฒฐ๊ณผ ์Šค์ฝ”์–ด๋ฅผ ์ €์žฅํ•  ๋ฆฌ์ŠคํŠธ๋ฅผ ์ƒ์„ฑ
# ํŠน์„ฑ ์ค‘์š”๋„ ๋ฆฌ์ŠคํŠธ ์ดˆ๊ธฐํ™”
feature_importance_list = []

# ๊ฒฐ๊ณผ ์Šค์ฝ”์–ด
rmse_scores = []  # RMSE ์Šค์ฝ”์–ด๋ฅผ ์ €์žฅํ•  ๋ฆฌ์ŠคํŠธ
mae_scores = []   # MAE ์Šค์ฝ”์–ด๋ฅผ ์ €์žฅํ•  ๋ฆฌ์ŠคํŠธ
best_params_list = []  # ๊ฐ fold์—์„œ์˜ ์ตœ์  ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์ €์žฅํ•  ๋ฆฌ์ŠคํŠธ

 

โ–ถ๏ธŽ ๋ฐ์ดํ„ฐ ๋ถ„ํ• (train, test)

  • train ์šฉ, test ์šฉ ๋ฐ์ดํ„ฐ๋ฅผ ์šฐ์„  ๋ถ„๋ฆฌ
  • (์ดํ›„, train ๋ฐ์ดํ„ฐ๋ฅผ k-fold์—์„œ ๋‹ค์‹œํ•œ๋ฒˆ Train, validation์šฉ์œผ๋กœ ๋‚˜๋ˆŒ ์˜ˆ์ •)
# ๋ฐ์ดํ„ฐ ๋ถ„ํ• 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

 

โ–ถ๏ธŽ RandomSearch ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ ์ข…๋ฅ˜ & ๋ฒ”์œ„ ์„ค์ •

  • ๋ชจ๋ธ ์ตœ์ ํ™”๋ฅผ ์œ„ํ•œ ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ ํŠœ๋‹
  • ์ฃผ์š”ํ•œ ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์„ ์ •ํ•˜๊ณ , RandomSearch๋ฅผ ์œ„ํ•ด ๋ฒ”์œ„ ์„ค์ •ํ•ด ์คŒ
param_dist = {
    'objective': ['regression'],
    'metric': ['mse'],
    'num_leaves': list(range(7, 64)),              # 7๋ถ€ํ„ฐ 63๊นŒ์ง€
    'learning_rate': [0.01, 0.02, 0.03, 0.04, 0.05],  #0.01๋ถ€ํ„ฐ 0.05๊นŒ์ง€
    'n_estimators': list(range(200, 301)),         # 200๋ถ€ํ„ฐ 300๊นŒ์ง€
    'early_stopping_rounds': list(range(40, 51))  # 40๋ถ€ํ„ฐ 50๊นŒ์ง€
}
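
Passing early_stopping_rounds through the search space works because the sklearn wrapper forwards unrecognized keyword arguments to the booster, but it only takes effect when an eval_set is supplied to fit(). In LightGBM >= 4.0 the documented route is a callback; a minimal sketch of that style, separate from the tuning code below:

# A sketch (LightGBM >= 4.0 style): early stopping via callback instead of a constructor param
model = lgb.LGBMRegressor(n_estimators=300, learning_rate=0.05)
model.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],  # the test split stands in for a validation set here, for illustration only
    eval_metric='rmse',
    callbacks=[lgb.early_stopping(stopping_rounds=50)],  # stop after 50 rounds without improvement
)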

 

4. Running K-fold Cross-Validation 10 Times 👉 Modeling and Training

  • A for loop splits the train data into train and validation sets, once per fold (k splits in total).
  • An eval_set pairing the fold's train and validation data is assembled for LightGBM's early stopping.
  • A RandomizedSearchCV() model is created, wrapping lgb.LGBMRegressor together with the hyperparameter ranges set above.
  • After fitting, the best parameters are extracted.
  • The model is refit and retrained with the best parameters.
  • Predictions and evaluation results on the validation data are computed and stored in the lists.
# Run K-fold cross-validation
for train_index, val_index in kf.split(X_train):

    # Split into train/validation sets for this fold
    # (index into X_train/y_train: the fold indices are positions within the train split, not the full data)
    X_train_kf, X_val_kf = X_train.iloc[train_index], X_train.iloc[val_index]
    y_train_kf, y_val_kf = y_train.iloc[train_index], y_train.iloc[val_index]

    # Prepare LightGBM model tuning with RandomizedSearchCV
    random_search = RandomizedSearchCV(
        lgb.LGBMRegressor(),
        param_distributions=param_dist,
        n_iter=10,
        scoring='neg_mean_squared_error',
        cv=kf,
        random_state=42,
        n_jobs=-1,
        verbose=1
    )

    # Evaluation sets for LightGBM (required for early_stopping_rounds to take effect)
    evals = [(X_train_kf, y_train_kf), (X_val_kf, y_val_kf)]

    # Fit the random search (LightGBM)
    random_search.fit(X_train_kf, y_train_kf, eval_set=evals, eval_metric='rmse')

    # Extract the best parameters
    best_params = random_search.best_params_

    # Refit a model (bst) with the best parameters
    bst = lgb.LGBMRegressor(**best_params)
    bst.fit(X_train_kf, y_train_kf,
            eval_set=evals,
            eval_metric='rmse')

    # Compute the feature importances and store them in the list
    feature_importance = bst.feature_importances_
    feature_importance_list.append(feature_importance)

    # Evaluate the model (RMSE, MAE) on the validation fold and store the results
    y_pred = bst.predict(X_val_kf)
    mse = mean_squared_error(y_val_kf, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_val_kf, y_pred)

    rmse_scores.append(rmse)
    mae_scores.append(mae)
    best_params_list.append(best_params)

 

โ–ถ๏ธŽ ๊ต์ฐจ ๊ฒ€์ฆ ๊ฒฐ๊ณผ ์ข…ํ•ฉ

  • np.mead() ์œผ๋กœ ํด๋“œ๋ณ„ ๊ต์ฐจ๊ฒ€์ฆ ๊ฒฐ๊ณผ๋ฅผ ํ‰๊ท ๋‚ด๊ธฐ
  • RMSE, MAE ํ‰๊ท  ๊ฒฐ๊ณผ๊ฐ’ ํ™•์ธ
  • ํŠน์„ฑ ์ค‘์š”๋„ ๊ฒฐ๊ณผ๋„ ํ‰๊ท  ๋‚ด๊ธฐ
  • ํŠน์„ฑ ์ค‘์š”๋„ ๊ฒฐ๊ณผ๋ฅผ ํŠน์„ฑ๋ช…๊ณผ ํ•จ๊ป˜ ์‹œ๊ฐํ™” : ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„ ์ƒ์„ฑ ํ›„, ๋‚ด๋ฆผ์ฐจ์ˆœ ์ •๋ ฌ
  • ๊ฐ€์žฅ ๋†’์€ ํŠน์„ฑ ์ค‘์š”๋„๋ฅผ ๊ฐ€์ง„ ํ”ผ์ณ๋ถ€ํ„ฐ ํ™•์ธ ๊ฐ€๋Šฅ
# ๊ต์ฐจ ๊ฒ€์ฆ ๊ฒฐ๊ณผ ์ถœ๋ ฅ
mean_rmse = np.mean(rmse_scores)
mean_mae = np.mean(mae_scores)
print(f'ํ‰๊ท  RMSE: {mean_rmse}')
print(f'ํ‰๊ท  MAE: {mean_mae}')

# ํŠน์„ฑ ์ค‘์š”๋„ ํ‰๊ท  ๊ณ„์‚ฐ
average_feature_importance = np.mean(feature_importance_list, axis=0)

# ํŠน์„ฑ ์ด๋ฆ„
feature_names = X.columns

# ์ค‘์š”๋„๋ฅผ ํŠน์„ฑ ์ด๋ฆ„๊ณผ ํ•จ๊ป˜ ์ถœ๋ ฅ
feature_importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': average_feature_importance})
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)
print(feature_importance_df)

# ํŠน์„ฑ ์ค‘์š”๋„ ์‹œ๊ฐํ™”
plt.figure(figsize=(12, 8))
sns.barplot(x='Importance', y='Feature', data=feature_importance_df)
plt.title('ํŠน์„ฑ ์ค‘์š”๋„')
plt.show()

 

โ–ถ๏ธŽ ์ตœ์  ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ ํ™•์ธ ๋ฐ ๋ชจ๋ธ ์ €์žฅ

# K-fold ๊ต์ฐจ ๊ฒ€์ฆ์—์„œ ์–ป์€ ์ตœ์  ํŒŒ๋ผ๋ฏธํ„ฐ ์ถœ๋ ฅ(ํด๋“œ๋ณ„)
print("K-fold ๊ต์ฐจ ๊ฒ€์ฆ์„ ์œ„ํ•œ ์ตœ์  ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ:")
for i, params in enumerate(best_params_list):
    print(f'Fold {i + 1}: {params}')
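
Note that bst at this point is simply the model from the last fold. A possible refinement, sketched below as an assumption rather than part of the original baseline, is to refit on the full training split using the parameters from the fold with the lowest RMSE (final_model is a hypothetical name):

# A sketch: refit on the full training split with the best fold's parameters
best_fold = int(np.argmin(rmse_scores))
final_params = dict(best_params_list[best_fold])  # copy so the stored params stay intact
final_params.pop('early_stopping_rounds', None)   # no eval_set in this refit, so drop early stopping

final_model = lgb.LGBMRegressor(**final_params)
final_model.fit(X_train, y_train)
print(f'Refit with the parameters from fold {best_fold + 1}')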


# ๋ชจ๋ธ ์ €์žฅ
if not os.path.exists("models"):
    os.mkdir("models")

model_file = open("models/gm_model.pkl", "wb")
joblib.dump(bst, model_file) # Export
model_file.close()
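
For later use (for example in the dashboard), the saved model can be loaded back with joblib — a minimal usage sketch, where X_test merely stands in for new data:

# A sketch: reload the saved model and sanity-check a prediction
loaded_model = joblib.load("models/gm_model.pkl")
sample_pred = loaded_model.predict(X_test.head())  # X_test stands in for unseen data here
print(sample_pred)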

 


๐Ÿ‘‰ ์ด๋ ‡๊ฒŒ ๊ตฌ์ถ•ํ•œ ๋ฒ ์ด์Šค๋ผ์ธ ๋ชจ๋ธ๋ง ์ฝ”๋“œ๋กœ, ์‹œ๊ฐ„๋Œ€๋ณ„ ํŽธ์˜์  ๋งค์ถœ ์˜ˆ์ธก์„ ์ˆ˜ํ–‰ํ•˜๋ฉฐ

     ์ด๋ฅผ streamlit ๋Œ€์‹œ๋ณด๋“œ์— ๊ตฌํ˜„ํ•˜๋Š” ์ž‘์—…์ด ์ด์–ด์ง„๋‹ค.
