
[Home Credit Default Risk] 3. Feature engineering on the key features

by ISLA! 2023. 10. 31.

 

1. Handling missing values

  • Select the key features based on the earlier EDA, the distribution plots, and the feature importance from the baseline model
  • Check the null values of each feature
app_train[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].isnull().sum()

# dropna=False also counts the null values
app_train['EXT_SOURCE_1'].value_counts(dropna=False)
app_train['EXT_SOURCE_2'].value_counts(dropna=False)
app_train['EXT_SOURCE_3'].value_counts(dropna=False)
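
The null check above can also be condensed into one table that shows the count and the share of missing values per feature. This is a minimal sketch on a made-up toy frame (`toy` and its values are assumptions for illustration, not the actual `app_train` data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for app_train; the values are invented for illustration.
toy = pd.DataFrame({
    'EXT_SOURCE_1': [0.1, np.nan, np.nan, 0.4],
    'EXT_SOURCE_2': [0.5, 0.6, np.nan, 0.8],
    'EXT_SOURCE_3': [0.2, 0.3, 0.4, 0.5],
})

ext_cols = ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']

# One row per feature: number of nulls and fraction of rows that are null.
null_summary = pd.DataFrame({
    'null_count': toy[ext_cols].isnull().sum(),
    'null_ratio': toy[ext_cols].isnull().mean(),
})
print(null_summary)
```

Replacing `toy` with `app_train` should give the same kind of summary for the real data.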

 

  • Check the mean/max/min/standard deviation of the key features
# Check the mean/max/min/standard deviation of the EXT_SOURCE_X features
print('### mean ###\n', app_train[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].mean())
print('### max ###\n',app_train[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].max())
print('### min ###\n',app_train[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].min())
print('### std ###\n',app_train[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].std())

 

2. Combining the train and test datasets before feature processing (missing-value handling)

  • Applying the train-set processing to the test set separately is cumbersome, so concatenate the two first and process them together
apps = pd.concat([app_train, app_test])
print(apps.shape)
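
A small sketch of why this works: the test rows have no `TARGET`, so after `pd.concat` their `TARGET` is NaN, and that is what lets the combined frame be split back later. The toy frames and the `AMT_CREDIT_LOG` column below are hypothetical and only serve as an illustration:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for app_train / app_test; the test frame has no TARGET column.
toy_train = pd.DataFrame({
    'SK_ID_CURR': [1, 2, 3],
    'TARGET': [0, 1, 0],
    'AMT_CREDIT': [100.0, 200.0, 300.0],
})
toy_test = pd.DataFrame({
    'SK_ID_CURR': [4, 5],
    'AMT_CREDIT': [400.0, 500.0],
})

# After concatenation, TARGET is NaN for every row that came from the test set.
toy_apps = pd.concat([toy_train, toy_test])

# Any feature is now created once for both sets (hypothetical example feature).
toy_apps['AMT_CREDIT_LOG'] = np.log1p(toy_apps['AMT_CREDIT'])

# Split back using the missing TARGET as the marker for test rows.
train_part = toy_apps[toy_apps['TARGET'].notnull()]
test_part = toy_apps[toy_apps['TARGET'].isnull()]
print(toy_apps.shape, train_part.shape, test_part.shape)
```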

 

  • Combine the three features row by row to create a new mean and a new standard deviation feature
  • Put the columns to combine in a list and compute the per-row value with .mean(axis=1)
apps['APPS_EXT_SOURCE_MEAN'] = apps[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].mean(axis = 1)
apps['APPS_EXT_SOURCE_STD'] = apps[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].std(axis = 1)

 

  • Check the generated columns, then handle the NaN values
  • The row-wise standard deviation is NaN whenever fewer than two of the three EXT_SOURCE values are present (a sample standard deviation needs at least two values) >> fill those NaNs with the mean of the APPS_EXT_SOURCE_STD column
apps.iloc[:, -2:].head()
apps[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'APPS_EXT_SOURCE_MEAN', 'APPS_EXT_SOURCE_STD']].head()

apps['APPS_EXT_SOURCE_STD'] = apps['APPS_EXT_SOURCE_STD'].fillna(apps['APPS_EXT_SOURCE_STD'].mean())
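
To make the NaN behavior concrete, here is a minimal sketch on a toy frame (values invented for illustration). By default pandas skips NaN in `mean(axis=1)` and `std(axis=1)`, so the row mean only becomes NaN when all three values are missing, while the row standard deviation becomes NaN as soon as fewer than two values remain:

```python
import numpy as np
import pandas as pd

# Rows with 3, 2, 1 and 0 non-null EXT_SOURCE values (made-up numbers).
toy = pd.DataFrame({
    'EXT_SOURCE_1': [0.2, np.nan, np.nan, np.nan],
    'EXT_SOURCE_2': [0.4, 0.5, np.nan, np.nan],
    'EXT_SOURCE_3': [0.6, 0.7, 0.3, np.nan],
})
ext_cols = ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']

row_mean = toy[ext_cols].mean(axis=1)  # NaN only when every value is missing
row_std = toy[ext_cols].std(axis=1)    # NaN when fewer than two values exist

# Same fill strategy as in the post: replace NaN with the column mean of the std.
row_std_filled = row_std.fillna(row_std.mean())

print(pd.DataFrame({'mean': row_mean, 'std': row_std, 'std_filled': row_std_filled}))
```

Note that the mean column still has a NaN for the row with no values at all; the post leaves that one to the model.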

 

 

3. Feature processing

  • Think about various combinations that take the characteristics of the key features into account.
  • Some examples follow
# Ratios against the credit amount, e.g. the loan annuity payment relative to the credit amount
apps['APPS_ANNUITY_CREDIT_RATIO'] = apps['AMT_ANNUITY']/apps['AMT_CREDIT']
apps['APPS_GOODS_CREDIT_RATIO'] = apps['AMT_GOODS_PRICE'] / apps['AMT_CREDIT']
apps['APPS_CREDIT_GOODS_DIFF'] = apps['AMT_CREDIT'] - apps['AMT_GOODS_PRICE']

# Loan-amount features expressed as ratios against AMT_INCOME_TOTAL
apps['APPS_ANNUITY_INCOME_RATIO'] = apps['AMT_ANNUITY']/apps['AMT_INCOME_TOTAL']
apps['APPS_CREDIT_INCOME_RATIO'] = apps['AMT_CREDIT']/apps['AMT_INCOME_TOTAL']
apps['APPS_GOODS_INCOME_RATIO'] = apps['AMT_GOODS_PRICE']/apps['AMT_INCOME_TOTAL']
# Disposable-income feature that accounts for family size (income per family member)
apps['APPS_CNT_FAM_INCOME_RATIO'] = apps['AMT_INCOME_TOTAL']/apps['CNT_FAM_MEMBERS']

# Income/asset-related features built as ratios against DAYS_BIRTH and DAYS_EMPLOYED
apps['APPS_EMPLOYED_BIRTH_RATIO'] = apps['DAYS_EMPLOYED']/apps['DAYS_BIRTH']
apps['APPS_INCOME_EMPLOYED_RATIO'] = apps['AMT_INCOME_TOTAL']/apps['DAYS_EMPLOYED']
apps['APPS_INCOME_BIRTH_RATIO'] = apps['AMT_INCOME_TOTAL']/apps['DAYS_BIRTH']
apps['APPS_CAR_BIRTH_RATIO'] = apps['OWN_CAR_AGE'] / apps['DAYS_BIRTH']
apps['APPS_CAR_EMPLOYED_RATIO'] = apps['OWN_CAR_AGE'] / apps['DAYS_EMPLOYED']
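
One thing to watch with ratio features like these: if a denominator happens to be 0, pandas returns `inf` instead of raising an error. The sketch below, on an invented toy frame, counts such values and turns them into NaN so that they are treated like any other missing value. Whether the real columns contain zeros is an assumption to verify on the actual data.

```python
import numpy as np
import pandas as pd

# Toy data; the zero in DAYS_EMPLOYED is planted to trigger a division by zero.
toy = pd.DataFrame({
    'AMT_INCOME_TOTAL': [100000.0, 200000.0, 150000.0],
    'DAYS_EMPLOYED': [-1000, 0, -3000],
})

toy['APPS_INCOME_EMPLOYED_RATIO'] = toy['AMT_INCOME_TOTAL'] / toy['DAYS_EMPLOYED']

# Count infinite values produced by the division.
n_inf = int(np.isinf(toy['APPS_INCOME_EMPLOYED_RATIO']).sum())
print('infinite ratios:', n_inf)

# Replace +/-inf with NaN so they behave like ordinary missing values.
toy['APPS_INCOME_EMPLOYED_RATIO'] = (
    toy['APPS_INCOME_EMPLOYED_RATIO'].replace([np.inf, -np.inf], np.nan)
)
print(toy)
```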

 

4. Building the second model

  • Label-encode the data the same way as before
  • Null values are left for LightGBM to handle internally, so no special treatment is applied
object_columns = apps.dtypes[apps.dtypes == 'object'].index.tolist()

for column in object_columns:
    apps[column] = pd.factorize(apps[column])[0]

# Split the combined data back into train and test based on the TARGET value
apps_train = apps[~apps['TARGET'].isnull()]
apps_test = apps[apps['TARGET'].isnull()]

# Split into training/validation sets, then train the model
from sklearn.model_selection import train_test_split

ftr_app = apps_train.drop(['SK_ID_CURR', 'TARGET'], axis=1)
target_app = apps_train['TARGET']

train_x, valid_x, train_y, valid_y = train_test_split(ftr_app, target_app, test_size=0.3, random_state=2020)

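import lightgbm as lgb
from lightgbm import LGBMClassifier

clf = LGBMClassifier(
        n_jobs=-1,
        n_estimators=1000,
        learning_rate=0.02,
        num_leaves=32,
        subsample=0.8,
        max_depth=12,
        verbose=-1
        )

# LightGBM 4.x removed the verbose / early_stopping_rounds arguments of fit()
# (and the silent constructor argument); callbacks give the same behavior.
clf.fit(train_x, train_y, eval_set=[(train_x, train_y), (valid_x, valid_y)], eval_metric='auc',
        callbacks=[lgb.early_stopping(stopping_rounds=100), lgb.log_evaluation(period=100)])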
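
One detail of the label-encoding step in this section is worth checking: `pd.factorize` does not keep missing values as null but encodes them as -1. A minimal sketch, using a made-up toy column (the column name and values are only illustrative):

```python
import numpy as np
import pandas as pd

# Toy object column with one missing value.
toy = pd.DataFrame({
    'NAME_CONTRACT_TYPE': ['Cash loans', 'Revolving loans', np.nan, 'Cash loans'],
})

# factorize returns integer codes plus the unique labels in order of appearance.
codes, uniques = pd.factorize(toy['NAME_CONTRACT_TYPE'])

print(codes)          # the missing entry gets the sentinel code -1
print(list(uniques))  # NaN is not included among the unique labels
```

So after encoding, nulls in the former object columns reach the model as the ordinary integer -1, while nulls in the numeric columns stay NaN and are handled by LightGBM itself.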
