๋ณธ๋ฌธ ๋ฐ”๋กœ๊ฐ€๊ธฐ
Machine Learning/Case Study ๐Ÿ‘ฉ๐Ÿป‍๐Ÿ’ป

[Home Credit Default Risk] 1. Data Distribution Visualization & Label Encoding

by ISLA! 2023. 10. 30.

 

1. ํƒ€๊ฒŸ ๊ฐ’ ๋ถ„ํฌ & ์ฃผ์š” ์ปฌ๋Ÿผ์˜ ๋ถ„ํฌ ํ™•์ธํ•˜๊ธฐ

  • ๋ฐ์ดํ„ฐ์˜ ํƒ€๊ฒŸ ๊ฐ’(๋Œ€์ถœ ์—ฐ์ฒด ์—ฌ๋ถ€)๋ฅผ ํ™•์ธํ•œ๋‹ค. >> ๊ฐ ๊ฐ’(0, 1)์˜ ๊ฐœ์ˆ˜์™€ ๋น„์œจ
app_train['TARGET'].value_counts()
app_train['TARGET'].value_counts() / 307511 * 100   # 307511 = number of rows in app_train
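  • Equivalently, normalize=True returns the ratio directly without hard-coding the row count (a minor convenience, not in the original code):
app_train['TARGET'].value_counts(normalize=True) * 100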

 

  • ํƒ€๊ฒŸ์˜ null ๊ฐ’ ํ™•์ธ
apps['TARGET'].value_counts(dropna=False)
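  • The frame apps is not defined above; presumably it is the train and test applications stacked together, so the test rows show up as NaN targets when dropna=False is used. A minimal sketch under that assumption:
# Assumption: apps = train rows + test rows, where the test rows have no TARGET
import pandas as pd

apps = pd.concat([app_train, app_test])       # hypothetical construction of apps
apps['TARGET'].value_counts(dropna=False)     # NaN count = number of test rows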

 

  • ์ฃผ์š” ์ปฌ๋Ÿผ์˜ ๋ถ„ํฌ๋ฅผ ํžˆ์Šคํ† ๊ทธ๋žจ์œผ๋กœ ํ™•์ธํ•œ๋‹ค
import matplotlib.pyplot as plt
import seaborn as sns

sns.distplot(app_train['AMT_INCOME_TOTAL'])   # distplot is deprecated in recent seaborn releases

plt.hist(app_train['AMT_INCOME_TOTAL'])

 

  • ์ผ๋ถ€ ํ•„ํ„ฐ๋„ ์ ์šฉํ•ด๋ณธ๋‹ค (์˜ˆ๋ฅผ ๋“ค์–ด, ์†Œ๋“์ด 1,000,000 ์ดํ•˜์ธ ์„ ์—์„œ ์ฃผ์š” ์ปฌ๋Ÿผ์˜ ๋ถ„ํฌ)
cond_1 = app_train['AMT_INCOME_TOTAL'] < 1000000
app_train[cond_1]['AMT_INCOME_TOTAL'].hist()

 

  • ๋ถ„ํฌ๋ฅผ ํ‘œํ˜„ํ•  ๋•Œ distplot์„ ์‚ฌ์šฉํ•ด๋„ ์ข‹๋‹ค.
    ํŠนํžˆ, ์—ฐ์†ํ˜• ๋ณ€์ˆ˜์˜ ํžˆ์Šคํ† ๊ทธ๋žจ์„ ๊ทธ๋ฆด ๋•Œ kde๋กœ ์ง๊ด€์ ์ธ ํ™•์ธ์ด ๊ฐ€๋Šฅํ•˜๋‹ค
sns.distplot(app_train[app_train['AMT_INCOME_TOTAL'] < 1000000]['AMT_INCOME_TOTAL'])
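  • On seaborn 0.11+ distplot is deprecated (and removed in recent releases); histplot with kde=True draws the same histogram plus KDE overlay:
# Same filtered distribution with the newer seaborn API
sns.histplot(app_train[cond_1]['AMT_INCOME_TOTAL'], kde=True)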

 

2. ํƒ€๊ฒŸ ๊ฐ’์— ๋”ฐ๋ฅธ ์ฃผ์š” ์ปฌ๋Ÿผ์˜ ๋ถ„ํฌ ํ™•์ธํ•˜๊ธฐ(ํ•จ์ˆ˜)

  • seaborn์˜ Distplot๊ณผ violinplot์˜ ๋ถ„ํฌ๋กœ ๋น„๊ต ์‹œ๊ฐํ™”
  • ํ•จ์ˆ˜๋กœ ์ •์˜ํ•˜์—ฌ ํ•œ๋ˆˆ์— ํ™•์ธํ•˜๊ธฐ
def show_column_hist_by_target(df, column, is_amt=False):

    # Conditions for when the target is 1 and 0
    cond1 = (df['TARGET'] == 1)
    cond0 = (df['TARGET'] == 0)

    # Prepare the figure (set the size): one row, two axes
    fig, axs = plt.subplots(figsize=(12, 4), nrows=1, ncols=2, squeeze=False)

    # If is_amt is True, filter to values < 500000
    # (use an all-True Series, not a plain bool, so df[cond_amt] works either way)
    cond_amt = pd.Series(True, index=df.index)
    if is_amt:
        cond_amt = df[column] < 500000

    sns.violinplot(x='TARGET', y=column, data=df[cond_amt], ax=axs[0][0])
    sns.distplot(df[cond0 & cond_amt][column], ax=axs[0][1], label='0', color='blue')
    sns.distplot(df[cond1 & cond_amt][column], ax=axs[0][1], label='1', color='red')
    axs[0][1].legend()
show_column_hist_by_target(app_train, 'AMT_INCOME_TOTAL', is_amt = True)
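  • The same helper can be reused for other amount columns, for example AMT_CREDIT (another amount column in the application data):
show_column_hist_by_target(app_train, 'AMT_CREDIT', is_amt = True)
plt.show()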

 

3. ์นดํ…Œ๊ณ ๋ฆฌํ˜• ํ”ผ์ณ๋“ค์˜ Label Encoding

  • ํŠน์ • ๋ฐ์ดํ„ฐ ํƒ€์ž…์˜ ํ”ผ์ณ๋“ค๋งŒ ๋ฆฌ์ŠคํŠธ๋กœ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ
object_columns = apps.dtypes[apps.dtypes == 'object'].index.tolist()
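  • Before encoding, it can help to check how many distinct categories each object column holds (a quick sanity check, not in the original notebook):
# Number of unique categories per object-typed column
apps[object_columns].nunique().sort_values(ascending=False)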

 

  • pandas ์˜ factorize()๋ฅผ ์ด์šฉ >> [0]์ด, ์ธ์ฝ”๋”ฉํ•œ ๊ฒฐ๊ด๊ฐ’
  • ๋‹จ, ์ฃผ์˜ํ•  ์ ์€ ํ•œ๋ฒˆ์— ํ•œ ์ปฌ๋Ÿผ๋งŒ ๊ฐ€๋Šฅํ•˜๋ฏ€๋กœ, ๋ฐ˜๋ณต๋ฌธ์„ ์ด์šฉํ•ด์•ผํ•จ.
for column in object_columns:
    apps[column] = pd.factorize(apps[column])[0]

# Check with apps.info() that no object-typed columns remain
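  • One caveat worth knowing (easy to verify on a toy Series): pd.factorize() encodes missing values as -1 instead of raising, so NaNs in a categorical column silently become their own code:
import pandas as pd

# factorize maps each category to an integer and NaN to -1
codes, uniques = pd.factorize(pd.Series(['A', 'B', None, 'A']))
print(codes)    # [ 0  1 -1  0]
print(uniques)  # Index(['A', 'B'], dtype='object')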

 

 

4. Training with the LGBM Classifier
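  • The variables ftr_app and target_app are not defined in this post; a plausible construction, assuming apps is the label-encoded frame from step 3 (train rows with a real TARGET, test rows with NaN), is sketched below:
# Hypothetical split of the encoded frame into train features/target and a test set
apps_train = apps[apps['TARGET'].notnull()]
apps_test = apps[apps['TARGET'].isnull()].drop('TARGET', axis=1)

ftr_app = apps_train.drop(['SK_ID_CURR', 'TARGET'], axis=1)
target_app = apps_train['TARGET'].astype(int)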

from sklearn.model_selection import train_test_split
from lightgbm import LGBMClassifier

train_x, valid_x, train_y, valid_y = train_test_split(ftr_app, target_app, test_size = 0.3, random_state = 2020)
train_x.shape, valid_x.shape

clf = LGBMClassifier(
        n_jobs=-1,         # use all available CPU cores
        n_estimators=1000, # build up to 1000 weak learners (trees)
        learning_rate=0.02,
        num_leaves=32,     # maximum number of leaf nodes per tree
        subsample=0.8,
        max_depth=12,
        silent=-1,         # legacy flag to suppress logs (older lightgbm versions)
        verbose=-1
        )

clf.fit(train_x, train_y, eval_set=[(train_x, train_y), (valid_x, valid_y)], eval_metric= 'auc', verbose= 100, early_stopping_rounds= 50)
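  • Passing verbose and early_stopping_rounds straight to fit() only works on older lightgbm releases; from lightgbm 4.0 the same behavior is expressed with callbacks. An equivalent call for newer versions would look like:
# Equivalent fit call for lightgbm >= 4.0, where verbose/early_stopping_rounds
# were removed from fit() in favor of callbacks
from lightgbm import early_stopping, log_evaluation

clf.fit(train_x, train_y,
        eval_set=[(train_x, train_y), (valid_x, valid_y)],
        eval_metric='auc',
        callbacks=[early_stopping(stopping_rounds=50), log_evaluation(period=100)])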

 

  • Visualize the feature importances
from lightgbm import plot_importance

plot_importance(clf, figsize=(16, 32))
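  • With hundreds of features the default plot gets very tall; limiting it to the top features (an optional tweak) keeps it readable:
# Show only the 30 most important features
plot_importance(clf, max_num_features=30, figsize=(16, 12))
plt.show()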

 

 

  • ์˜ˆ์ธก ํ™•๋ฅ  ๊ฐ’ ํ™•์ธ
preds = clf.predict_proba(apps_test.drop(['SK_ID_CURR'], axis = 1))[:, 1]
apps_test['TARGET'] = preds
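  • For the Kaggle competition, the usual last step is a submission file with SK_ID_CURR and TARGET; a minimal sketch (the filename is arbitrary):
# Hypothetical submission file: one row per SK_ID_CURR with the predicted probability
apps_test[['SK_ID_CURR', 'TARGET']].to_csv('home_credit_submission.csv', index=False)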