๋ณธ๋ฌธ ๋ฐ”๋กœ๊ฐ€๊ธฐ
Machine Learning/Case Study ๐Ÿ‘ฉ๐Ÿป‍๐Ÿ’ป

[Home Credit Default Risk] 4. ์ด์ „ ๋Œ€์ถœ ์ด๋ ฅ ๋ฐ์ดํ„ฐ EDA ๋ฐ ๋ณ‘ํ•ฉ

by ISLA! 2023. 11. 9.

 

 

 

1. ์ด์ „ application(๋ฉ”์ธ๋ฐ์ดํ„ฐ) ์˜ Feature Engineering ํ•จ์ˆ˜ ๋ณต์‚ฌ

def get_apps_processed(apps):
    
    # EXT_SOURCE_X FEATURE ๊ฐ€๊ณต
    apps['APPS_EXT_SOURCE_MEAN'] = apps[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].mean(axis=1)
    apps['APPS_EXT_SOURCE_STD'] = apps[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].std(axis=1)
    apps['APPS_EXT_SOURCE_STD'] = apps['APPS_EXT_SOURCE_STD'].fillna(apps['APPS_EXT_SOURCE_STD'].mean())
    
    # AMT_CREDIT ๋น„์œจ๋กœ Feature ๊ฐ€๊ณต
    apps['APPS_ANNUITY_CREDIT_RATIO'] = apps['AMT_ANNUITY']/apps['AMT_CREDIT']
    apps['APPS_GOODS_CREDIT_RATIO'] = apps['AMT_GOODS_PRICE']/apps['AMT_CREDIT']
    
    # AMT_INCOME_TOTAL ๋น„์œจ๋กœ Feature ๊ฐ€๊ณต
    apps['APPS_ANNUITY_INCOME_RATIO'] = apps['AMT_ANNUITY']/apps['AMT_INCOME_TOTAL']
    apps['APPS_CREDIT_INCOME_RATIO'] = apps['AMT_CREDIT']/apps['AMT_INCOME_TOTAL']
    apps['APPS_GOODS_INCOME_RATIO'] = apps['AMT_GOODS_PRICE']/apps['AMT_INCOME_TOTAL']
    apps['APPS_CNT_FAM_INCOME_RATIO'] = apps['AMT_INCOME_TOTAL']/apps['CNT_FAM_MEMBERS']
    
    # DAYS_BIRTH, DAYS_EMPLOYED ๋น„์œจ๋กœ Feature ๊ฐ€๊ณต
    apps['APPS_EMPLOYED_BIRTH_RATIO'] = apps['DAYS_EMPLOYED']/apps['DAYS_BIRTH']
    apps['APPS_INCOME_EMPLOYED_RATIO'] = apps['AMT_INCOME_TOTAL']/apps['DAYS_EMPLOYED']
    apps['APPS_INCOME_BIRTH_RATIO'] = apps['AMT_INCOME_TOTAL']/apps['DAYS_BIRTH']
    apps['APPS_CAR_BIRTH_RATIO'] = apps['OWN_CAR_AGE'] / apps['DAYS_BIRTH']
    apps['APPS_CAR_EMPLOYED_RATIO'] = apps['OWN_CAR_AGE'] / apps['DAYS_EMPLOYED']
    
    return apps

 

โ–ถ๏ธŽ ์ด์ „ ๋Œ€์ถœ ๋ฐ์ดํ„ฐ(prev) ๋ฉ”์ธ๋ฐ์ดํ„ฐ(apps)๋ฅผ merge

prev_app_outer = prev.merge(apps['SK_ID_CURR'], on = 'SK_ID_CURR', how = 'outer', indicator = True)

 

๐Ÿ‘‰ merge ํ•  ๋•Œ indicator = True ๋กœ ์กฐ์ธ ํ›„, ๋ˆ„๋ฝ๋œ ์ง‘ํ•ฉ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Œ.

 

prev_app_outer['_merge'].value_counts()

 

๐Ÿ‘‰ _merge ์ปฌ๋Ÿผ์„ value_counts() ํ•˜์—ฌ ๊ฐ๊ฐ ์–ด๋–ป๊ฒŒ ์กฐ์ธ๋˜์–ด์žˆ๋Š”์ง€ ํ™•์ธ 

2. ์ฃผ์š” ์ปฌ๋Ÿผ EDA

โ–ถ๏ธŽ apps(๋ฉ”์ธ๋ฐ์ดํ„ฐ)์˜ ID๋ณ„๋กœ, ์ด์ „ ๋Œ€์ถœ ์ด๋ ฅ ๊ฑด์ˆ˜๋ฅผ ํ™•์ธ

  • groupby, boxplot
prev.groupby('SK_ID_CURR')['SK_ID_PREV'].count()

 

 

 

โ–ถ๏ธŽ ์ˆซ์žํ˜• ํ”ผ์ณ๋“ค์˜ ๋ถ„ํฌ ํ™•์ธ(ํžˆ์Šคํ† ๊ทธ๋žจ) / Target ์œ ํ˜•์— ๋”ฐ๋ผ

  • ๋ฉ”์ธ๋ฐ์ดํ„ฐ์˜ ID, Target ๊ฐ’๊ณผ prev(๊ณผ๊ฑฐ ๋Œ€์ถœ์ด๋ ฅ) ๋ฐ์ดํ„ฐ merge
app_prev= prev.merge(app_train[['SK_ID_CURR', 'TARGET']], on = 'SK_ID_CURR', how = 'left')
app_prev.shape

 

  • Target ๊ฐ’์— ๋”ฐ๋ผ ํžˆ์Šคํ† ๊ทธ๋žจ์„ ๊ทธ๋ฆฌ๋Š” ํ•จ์ˆ˜
def show_hist_by_target(df, columns):
    cond_1 = (df['TARGET'] == 1)
    cond_0 = (df['TARGET'] == 0)
    
    for column in columns:
        fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(12, 4), squeeze=False)
        sns.violinplot(x='TARGET', y=column, data=df, ax=axs[0][0] )
        sns.distplot(df[cond_0][column], ax=axs[0][1], label='0', color='blue')
        sns.distplot(df[cond_1][column], ax=axs[0][1], label='1', color='red')

 

  • ์ˆซ์žํ˜• ํ”ผ์ณ๋“ค์˜ ์ปฌ๋Ÿผ๋ช… ๋ถ„๋ฆฌ
  • ๊ธฐ์กด ์ปฌ๋Ÿผ๋“ค์ค‘ ID, target ์ปฌ๋Ÿผ์„ ์ œ์™ธํ•˜๊ณ  ๋ฆฌ์ŠคํŠธ์— ์ €์žฅ
num_columns = app_prev.dtypes[app_prev.dtypes != 'object'].index.tolist()
num_columns = [column for column in num_columns if column not in ['SK_ID_CURR', 'SK_ID_PREV', 'TARGET']]

 

  • ํžˆ์Šคํ† ๊ทธ๋žจ ์‹œ๊ฐํ™” ๋ฐ ๊ฒฐ๊ณผ ํ•ด์„
show_hist_by_target(app_prev, num_columns)

 

๊ฒฐ๊ณผ ์ผ๋ถ€

 

 

โ–ถ๏ธŽ ์นดํ…Œ๊ณ ๋ฆฌํ˜• ํ”ผ์ณ๋“ค์˜ ๋ถ„ํฌ ํ™•์ธ(ํžˆ์Šคํ† ๊ทธ๋žจ) / Target ์œ ํ˜•์— ๋”ฐ๋ผ

object_columns = app_prev.dtypes[app_prev.dtypes=='object'].index.tolist()
object_columns

 

  • ์นดํ…Œ๊ณ ๋ฆฌํ˜• ์ปฌ๋Ÿผ์€ ๐Ÿ“Š catplot์„ ์‚ฌ์šฉ!!
def show_category_by_target(df, columns):
    for column in columns:
        chart = sns.catplot(x=column, col="TARGET", data=df, kind="count")
        chart.set_xticklabels(rotation=65)
        
show_category_by_target(app_prev, object_columns)

 

 

 

โ–ถ๏ธŽ groupby ๋กœ ํ”ผ์ณ ํ™•์ธํ•˜๊ธฐ

  • agg_dict ๋ฅผ ์ •์˜ํ•˜๊ณ , groupby ๊ฒฐ๊ณผ์— ํ•œ๋ฒˆ์— ์ ์šฉ
agg_dict = {
     # ๊ธฐ์กด ์ปฌ๋Ÿผ. 
    'SK_ID_CURR':['count'],
    'AMT_CREDIT':['mean', 'max', 'sum'],
    'AMT_ANNUITY':['mean', 'max', 'sum'], 
    'AMT_APPLICATION':['mean', 'max', 'sum'],
    'AMT_DOWN_PAYMENT':['mean', 'max', 'sum'],
    'AMT_GOODS_PRICE':['mean', 'max', 'sum']
}

prev_group = prev.groupby('SK_ID_CURR')
prev_amt_agg = prev_group.agg(agg_dict)
prev_amt_agg.head() #๊ฒฐ๊ณผ๋Š” ๋ฉ€ํ‹ฐ๋ ˆ๋ฒจ ์ธ๋ฑ์Šค(์ปฌ๋Ÿผ)

 

  • ๋ฉ€ํ‹ฐ ์ธ๋ฑ์Šค ์ปฌ๋Ÿผ๋ณ€๊ฒฝ >> ์–ธ๋”๋ฐ”('_')๋กœ ์—ฐ๊ฒฐํ•ด์„œ ์ปฌ๋Ÿผ๋ช… ๊น”๋”ํ•˜๊ฒŒ ์ •๋ฆฌ
# ์ปฌ๋Ÿผ์ด ๋ฉ€ํ‹ฐ์ธ๋ฑ์Šค์ธ๊ฑฐ ํ™•์ธ
prev_amt_agg.columns

prev_amt_agg.columns = 
	['PREV_' + ('_').join(column).upper() for column in prev_amt_agg.columns]
    
prev_amt_agg.head()

 

โ–ถ๏ธŽ prev ํ”ผ์ณ ๊ฐ€๊ณต

# ๋Œ€์ถœ ์‹ ์ฒญ ๊ธˆ์•ก๊ณผ ์‹ค์ œ ๋Œ€์ถœ์•ก/๋Œ€์ถœ ์ƒํ’ˆ๊ธˆ์•ก ์ฐจ์ด ๋ฐ ๋น„์œจ
prev['PREV_CREDIT_DIFF'] = prev['AMT_APPLICATION'] - prev['AMT_CREDIT']
prev['PREV_GOODS_DIFF'] = prev['AMT_APPLICATION'] - prev['AMT_GOODS_PRICE']
prev['PREV_CREDIT_APPL_RATIO'] = prev['AMT_CREDIT']/prev['AMT_APPLICATION']
prev['PREV_ANNUITY_APPL_RATIO'] = prev['AMT_ANNUITY']/prev['AMT_APPLICATION']
prev['PREV_GOODS_APPL_RATIO'] = prev['AMT_GOODS_PRICE']/prev['AMT_APPLICATION']

 

โ–ถ๏ธŽ ์ด์ƒ์น˜๋ฅผ Null๋กœ ๋Œ€์ฒด 

prev['DAYS_FIRST_DRAWING'].replace(365243, np.nan, inplace = True)
prev['DAYS_FIRST_DUE'].replace(365243, np.nan, inplace= True)
prev['DAYS_LAST_DUE_1ST_VERSION'].replace(365243, np.nan, inplace= True)
prev['DAYS_LAST_DUE'].replace(365243, np.nan, inplace= True)
prev['DAYS_TERMINATION'].replace(365243, np.nan, inplace= True)

# ์ฒซ๋ฒˆ์งธ ๋งŒ๊ธฐ์ผ๊ณผ ๋งˆ์ง€๋ง‰ ๋งŒ๊ธฐ์ผ๊นŒ์ง€์˜ ๊ธฐ๊ฐ„
prev['PREV_DAYS_LAST_DUE_DIFF'] = prev['DAYS_LAST_DUE_1ST_VERSION'] - prev['DAYS_LAST_DUE']

 

 

โ–ถ๏ธŽ ์ด์ž์œจ์— null ๊ฐ’์ด ๋งŽ์œผ๋ฏ€๋กœ, ์ง์ ‘ ์ปฌ๋Ÿผ ์ƒ์„ฑ

# ์›” ๋‚ฉ๋ถ€์•ก * ํšŸ์ˆ˜ =>> ์ด ๋Œ€์ถœ์ƒํ™˜์•ก
all_pay = prev['AMT_ANNUITY'] * prev['CNT_PAYMENT']

# ์ด์ž์˜ฌ = (๋Œ€์ถœ์ƒํ™˜์•ก/๋Œ€์ถœ์•ก -1) / ๋Œ€์ถœ์ƒํ™˜ํšŸ์ˆ˜
prev['PREV_INTERESTS_RATE'] = (all_pay / prev['AMT_CREDIT'] -1)/prev['CNT_PAYMENT']

 

=>>> ์ƒ์„ฑ๋œ ์ปฌ๋Ÿผ ํ™•์ธ

prev.iloc[:, -7:].head(10)

 

 

 

โ–ถ๏ธŽ ID ๊ธฐ์ค€์œผ๋กœ (๊ธฐ์กด)๋Œ€์ถœ ์ƒํƒœ๊ฐ€ refused ์ธ ๊ฒฝ์šฐ์˜ ๊ฑด์ˆ˜ ๋ฐ ๊ณผ๊ฑฐ ๋Œ€์ถœ ๊ฑด ๋Œ€๋น„ ๋น„์œจ ๊ตฌํ•˜๊ธฐ

  • ๐Ÿง ๋ฉ”์ธ ๋ฐ์ดํ„ฐ์—์„œ ๋Œ€์ถœ์ƒํƒœ๊ฐ€ '๊ฑฐ์ ˆ'์ธ ๊ฒฝ์šฐ์—๋Š” ๊ณผ๊ฑฐ ๋Œ€์ถœ ๊ฑด ์ˆ˜์™€ ์–ด๋–ค ๊ด€๋ จ์ด ์žˆ๋Š”์ง€ ์•Œ์•„๋ณด์ž
# ์กฐ๊ฑด
cond_refused = (prev['NAME_CONTRACT_STATUS'] == 'Refused')

# ์กฐ๊ฑด ์ ์šฉ
prev_refused = prev[cond_refused]

# ํ™•์ธ
prev_refused.shape, prev.shape
##((138377, 44), (798432, 44))

 

  • ์กฐ๊ฑด์„ ์ ์šฉํ•œ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์˜ refused ๊ฑด์ˆ˜ ๋„์ถœ
  • groupby, count
  • pd.DataFrame()์œผ๋กœ groupby ๊ฒฐ๊ณผ๋ฅผ ํ™•์ธ
prev_refused_agg = prev_refused.groupby('SK_ID_CURR')['SK_ID_CURR'].count()
prev_refused_agg.shape, prev_amt_agg.shape
##((79325,), (281965, 39))

pd.DataFrame(prev_refused_agg) # reset_index()๊ฐ€ ์•ˆ๋œ ์ƒํƒœ

 

 

  • reset_index()๋กœ ์ธ๋ฑ์Šค์™€ ์ปฌ๋Ÿผ๋ช… ์ค‘๋ณต ํ’€์–ด์ฃผ๊ธฐ
  • name์„ ์ง€์ •ํ•˜์—ฌ ์ปฌ๋Ÿผ๋ช…์„ ๋ณ€๊ฒฝ
prev_refused_agg = prev_refused_agg.reset_index(name = 'PREV_REFUSED_COUNT')
prev_refused_agg.head()

 

 

โ–ถ๏ธŽ ์œ„์—์„œ ๋งŒ๋“  prev_amt_agg ์ปฌ๋Ÿผ๊ณผ ๊ฒฐํ•ฉ

prev_amt_agg = prev_amt_agg.reset_index()
prev_amt_refused_agg = prev_amt_agg.merge(prev_refused_agg, on = 'SK_ID_CURR', how = 'left')

 

 

  • ๊ฒฐ์ธก์น˜ ์ฒ˜๋ฆฌ : 
# ๊ฒฐ์ธก์น˜ ๊ฐœ์ˆ˜๊นŒ์ง€ ํ™•์ธ
prev_amt_refused_agg['PREV_REFUSED_COUNT'].value_counts(dropna = False)

# ๊ฒฐ์ธก์น˜๋ฅผ 0์œผ๋กœ ์ฑ„์›€
prev_amt_refused_agg = prev_amt_refused_agg.fillna(0)

 

  • ๊ณผ๊ฑฐ ๋Œ€์ถœ ๊ฑด ์ˆ˜ ๋Œ€๋น„ ๋Œ€์ถœ ๊ฑฐ์ ˆ ๊ฑด ์ˆ˜
    ๐Ÿ‘‰ ๊ฑฐ์ ˆ ๋Œ€์ถœ ๊ฑด ์ˆ˜ / ๊ณผ๊ฑฐ ๋Œ€์ถœ ๊ฑด์ˆ˜ >> ๋น„์œจ ์‚ฐ์ถœ
prev_amt_refused_agg['PREV_REFUSED_RATIO'] = prev_amt_refused_agg['PREV_REFUSED_COUNT'] / prev_amt_refused_agg['PREV_SK_ID_CURR_COUNT']
prev_amt_refused_agg.head(10)

 

[TIP โค๏ธ] Case When + groupby ๋‹จ๊ณ„๋ฅผ ์ˆ˜ํ–‰ํ•˜๋Š” ๋‹ค๋ฅธ ๋ฐฉ๋ฒ• (unstack ์‚ฌ์šฉ)

์˜ˆ) ๋Œ€์ถœ ์‹ ์ฒญ ์ƒํƒœ๊ฐ€ '์Šน์ธ', '๊ฑฐ์ ˆ'์ธ ๊ฒฝ์šฐ์— ๊ฐ ID ๋ณ„๋กœ ๋ช‡ ๊ฑด์ธ์ง€ ํ™•์ธ

 

unstack ์ดํ•ด

 

  • ์•„๋ž˜๊ณผ ๊ฐ™์ด isin()์— ๋Œ€์ถœ์ƒํƒœ๋ฅผ ์ง€์ •ํ•˜๊ณ , groupby()์— ๋‘ ๊ฐœ ์ปฌ๋Ÿผ์„ ๋„ฃ์–ด ๊ฐ ๊ฒฝ์šฐ์˜ ์ˆ˜๋ฅผ ์…€ ์ˆ˜ ์žˆ์Œ
  • groupby ํ•  ๋•Œ, ๋งˆ์ง€๋ง‰์— unstack()์„ ํ•ด์ค˜์•ผ ํ•จ
prev_refused_appr_group = prev[prev['NAME_CONTRACT_STATUS'].isin(['Approved', 'Refused'])].
	groupby(['SK_ID_CURR', 'NAME_CONTRACT_STATUS'])
    
prev_refused_appr_agg = prev_refused_appr_group['SK_ID_CURR'].count().unstack()

unstack ์•ˆํ•œ ๊ฒฝ์šฐ

 

unstack ํ•œ ๊ฒฝ์šฐ

 

# ๊ฒฐ์ธก์น˜ ์ฑ„์šฐ๊ณ  ์ปฌ๋Ÿผ๋ช… ๋ณ€๊ฒฝ
prev_refused_appr_agg = prev_refused_appr_agg.fillna(0)
prev_refused_appr_agg.columns = ['PREV_APPROVED_COUNT', 'PREV_REFUSED_COUNT']

# ๋ฆฌ์…‹์ธ๋ฑ์Šค๋กœ ์ •๋ฆฌ
prev_refused_appr_agg = prev_refused_appr_agg.reset_index()
prev_refused_appr_agg.head(3)

 

  • prev_amt_agg์™€ ์กฐ์ธ ํ›„ ๋ฐ์ดํ„ฐ ๊ฐ€๊ณต
prev_agg = prev_amt_agg.merge(prev_refused_appr_agg, on = 'SK_ID_CURR', how = 'left')

# SK_ID_CURR๋ณ„ ๊ณผ๊ฑฐ ๋Œ€์ถœ๊ฑด์ˆ˜ ๋Œ€๋น„ APPROVED_COUNT ๋ฐ REFUSED_COUNT ๋น„์œจ ์ƒ์„ฑ. 
prev_agg['PREV_REFUSED_RATIO'] = prev_agg['PREV_REFUSED_COUNT']/prev_agg['PREV_SK_ID_CURR_COUNT']
prev_agg['PREV_APPROVED_RATIO'] = prev_agg['PREV_APPROVED_COUNT']/prev_agg['PREV_SK_ID_CURR_COUNT']

# 'PREV_REFUSED_COUNT', 'PREV_APPROVED_COUNT' ์ปฌ๋Ÿผ drop 
prev_agg = prev_agg.drop(['PREV_REFUSED_COUNT', 'PREV_APPROVED_COUNT'], axis=1)

# prev_amt_agg์™€ prev_refused_appr_agg INDEX์ธ SK_ID_CURR์ด ์กฐ์ธ ํ›„ ์ •์‹ ์ปฌ๋Ÿผ์œผ๋กœ ์ƒ์„ฑ๋จ. 
prev_agg.head(3)

 

 

 

 

728x90