
[Home Credit Default Risk] 5. EDA and Feature Engineering on the Previous Loan History Data (work in progress)

by ISLA! 2023. 11. 15.

 

📍 Data description: Previous_Application

  • Besides the main train data used to predict default from loan information, this Kaggle competition also provides each customer's previous loan history.
  • Let's process the previous loan history, combine it with the main dataset, and check whether the predictions improve.
  • See the Kaggle page for detailed dataset and column descriptions.

 

Data loading

import pandas as pd

# apps is the main application dataset; it is assumed to be loaded already
prev = pd.read_csv('previous_application.csv')
print(prev.shape, apps.shape)

 

Join with the main dataset and check the key (ID) values

  • Join prev with apps, the main dataset.
  • Merge on the key SK_ID_CURR and set indicator=True to see how the IDs in the two datasets differ.
# indicator=True adds a _merge column that shows which side each row came from
prev_app_outer = prev.merge(apps['SK_ID_CURR'], on='SK_ID_CURR', how='outer', indicator=True)
prev_app_outer['_merge'].value_counts()
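
To make the `_merge` output easier to read, here is a minimal sketch on toy data. The two small frames (`toy_prev`, `toy_apps`) and their values are made up for illustration and are not the competition files.

```python
import pandas as pd

# Toy stand-ins for prev and apps (hypothetical values)
toy_prev = pd.DataFrame({
    'SK_ID_CURR': [1, 1, 2, 4],
    'AMT_CREDIT': [100.0, 200.0, 300.0, 400.0],
})
toy_apps = pd.DataFrame({'SK_ID_CURR': [1, 2, 3]})

# Outer merge keeps IDs from both sides; _merge records where each row came from
toy_outer = toy_prev.merge(toy_apps['SK_ID_CURR'], on='SK_ID_CURR', how='outer', indicator=True)
merge_counts = toy_outer['_merge'].value_counts()
print(merge_counts)
```

Here `left_only` rows are previous applications whose customer is missing from the main data (ID 4), and `right_only` rows are customers with no previous application (ID 3).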

 

Key feature EDA

▶︎ Distribution of the numeric features (by TARGET value)

import matplotlib.pyplot as plt
import seaborn as sns

# Merge the ID and TARGET values of the main dataset onto prev
app_prev = prev.merge(app_train[['SK_ID_CURR', 'TARGET']], on='SK_ID_CURR', how='left')

# Visualize the distribution of continuous variables, split by TARGET
def show_hist_by_target(df, columns):
    cond_1 = (df['TARGET'] == 1)
    cond_0 = (df['TARGET'] == 0)

    for column in columns:
        fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(12, 4), squeeze=False)
        sns.violinplot(x='TARGET', y=column, data=df, ax=axs[0][0])
        # sns.distplot is deprecated; histplot with kde=True and stat='density' is a close replacement
        sns.histplot(df.loc[cond_0, column], ax=axs[0][1], label='0', color='blue', kde=True, stat='density')
        sns.histplot(df.loc[cond_1, column], ax=axs[0][1], label='1', color='red', kde=True, stat='density')
        axs[0][1].legend()

# Keep only the numeric columns
num_columns = app_prev.dtypes[app_prev.dtypes != 'object'].index.tolist()

# Exclude columns that should not be plotted (IDs and the target)
num_columns = [column for column in num_columns if column not in ['SK_ID_CURR', 'SK_ID_PREV', 'TARGET']]

show_hist_by_target(app_prev, num_columns)

 

 

▶︎ Distribution of the categorical features (by TARGET value)

object_columns = app_prev.dtypes[app_prev.dtypes=='object'].index.tolist()

# Plot the counts per category with catplot, one panel per TARGET value
def show_category_by_target(df, columns):
    for column in columns:
        chart = sns.catplot(x=column, col="TARGET", data=df, kind="count")
        chart.set_xticklabels(rotation=65)
        
show_category_by_target(app_prev, object_columns)
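
Because the two TARGET classes differ a lot in size, raw counts can be hard to compare between panels. As a complement (not part of the original analysis), a row-normalized cross-tabulation gives the share of each TARGET value within a category. The sketch below uses made-up values in a toy frame.

```python
import pandas as pd

# Toy data (hypothetical values) with one categorical column and the target
toy = pd.DataFrame({
    'NAME_CONTRACT_STATUS': ['Approved', 'Approved', 'Refused', 'Refused', 'Approved', 'Refused'],
    'TARGET': [0, 0, 1, 0, 1, 1],
})

# normalize='index' makes each row sum to 1, i.e. the TARGET share within each category
target_share = pd.crosstab(toy['NAME_CONTRACT_STATUS'], toy['TARGET'], normalize='index')
print(target_share)
```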

 

 

▶︎ Creating derived variables

  • Look at how the important variables relate to each other and create variables that are expected to be meaningful.
# Differences and ratios between the application amount and the actual credit amount / goods price
prev['PREV_CREDIT_DIFF'] = prev['AMT_APPLICATION'] - prev['AMT_CREDIT']
prev['PREV_GOODS_DIFF'] = prev['AMT_APPLICATION'] - prev['AMT_GOODS_PRICE']
prev['PREV_CREDIT_APPL_RATIO'] = prev['AMT_CREDIT']/prev['AMT_APPLICATION']
prev['PREV_ANNUITY_APPL_RATIO'] = prev['AMT_ANNUITY']/prev['AMT_APPLICATION']
prev['PREV_GOODS_APPL_RATIO'] = prev['AMT_GOODS_PRICE']/prev['AMT_APPLICATION']
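
One thing to watch with ratio features: if the denominator is 0, pandas returns `inf` rather than raising an error. Whether AMT_APPLICATION actually contains zeros should be checked on the real data; the sketch below assumes it can and uses a made-up toy frame to show one way to turn those values into NaN.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values); the second row has a zero application amount
toy_prev = pd.DataFrame({
    'AMT_APPLICATION': [1000.0, 0.0, 500.0],
    'AMT_CREDIT': [1200.0, 300.0, 500.0],
})

# 300 / 0 evaluates to inf in pandas
ratio = toy_prev['AMT_CREDIT'] / toy_prev['AMT_APPLICATION']

# Replace infinite values with NaN so later mean/max aggregations are not distorted
ratio_clean = ratio.replace([np.inf, -np.inf], np.nan)
print(ratio_clean)
```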

 

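  • If the distribution check reveals outliers, replace them as shown below before creating the variable.
import numpy as np

# 365243 is a placeholder value in the DAYS_ columns, so treat it as missing.
# Assigning the result back avoids relying on inplace=True on a selected column.
# DAYS_LAST_DUE is cleaned as well, on the assumption that it carries the same placeholder.
due_cols = ['DAYS_LAST_DUE_1ST_VERSION', 'DAYS_LAST_DUE']
prev[due_cols] = prev[due_cols].replace(365243, np.nan)

# Number of days between the first-version last due date and the last due date
prev['PREV_DAYS_LAST_DUE_DIFF'] = prev['DAYS_LAST_DUE_1ST_VERSION'] - prev['DAYS_LAST_DUE']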
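
The effect of the replacement is easiest to see side by side. A minimal sketch with made-up values in a toy frame: without the replacement the placeholder leaks into the difference, and with it the row simply becomes missing.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values); the second row carries the 365243 placeholder
toy_prev = pd.DataFrame({
    'DAYS_LAST_DUE_1ST_VERSION': [-100.0, 365243.0, -30.0],
    'DAYS_LAST_DUE': [-130.0, -400.0, -30.0],
})

# Difference computed on the raw values: the placeholder produces a huge number
raw_diff = toy_prev['DAYS_LAST_DUE_1ST_VERSION'] - toy_prev['DAYS_LAST_DUE']

# Difference computed after replacing the placeholder with NaN
cleaned = toy_prev['DAYS_LAST_DUE_1ST_VERSION'].replace(365243, np.nan)
clean_diff = cleaned - toy_prev['DAYS_LAST_DUE']
print(raw_diff.tolist(), clean_diff.tolist())
```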

 

  • Some important variables have many null values; in that case, build a new one from other columns (e.g. an interest rate).
# monthly payment * number of payments = total repayment amount
all_pay = prev['AMT_ANNUITY'] * prev['CNT_PAYMENT']

# interest rate = (total repayment / credit amount - 1) / number of payments
prev['PREV_INTERESTS_RATE'] = (all_pay / prev['AMT_CREDIT'] - 1) / prev['CNT_PAYMENT']
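
A worked example of the formula, using made-up numbers: a credit of 10,000 repaid in 12 payments of 1,000 gives a total repayment of 12,000, so the rate is (12000 / 10000 - 1) / 12, roughly 0.0167 per payment. Note that this is a simple per-payment rate, not a compounded annual rate.

```python
import pandas as pd

# Toy loan (hypothetical values): 12 payments of 1,000 on a credit of 10,000
toy_prev = pd.DataFrame({
    'AMT_ANNUITY': [1000.0],
    'CNT_PAYMENT': [12.0],
    'AMT_CREDIT': [10000.0],
})

# Total repayment: 1000 * 12 = 12000
all_pay = toy_prev['AMT_ANNUITY'] * toy_prev['CNT_PAYMENT']

# Simple interest rate per payment: (12000 / 10000 - 1) / 12
rate = (all_pay / toy_prev['AMT_CREDIT'] - 1) / toy_prev['CNT_PAYMENT']
print(rate.iloc[0])
```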

 

▶︎ Aggregation over the existing features and the derived variables

  • Several aggregate functions are applied to each column because it is not yet known which aggregates will turn out to be useful.
# Aggregate the existing amounts together with the new differences and ratios against the application amount. It is unclear which values matter, so list them all for now, then check and drop later.
agg_dict = {
    # Existing columns
    'SK_ID_CURR':['count'],
    'AMT_CREDIT':['mean', 'max', 'sum'],
    'AMT_ANNUITY':['mean', 'max', 'sum'], 
    'AMT_APPLICATION':['mean', 'max', 'sum'],
    'AMT_DOWN_PAYMENT':['mean', 'max', 'sum'],
    'AMT_GOODS_PRICE':['mean', 'max', 'sum'],
    'RATE_DOWN_PAYMENT': ['min', 'max', 'mean'],
    'DAYS_DECISION': ['min', 'max', 'mean'],
    'CNT_PAYMENT': ['mean', 'sum'],
    # Derived columns
    'PREV_CREDIT_DIFF':['mean', 'max', 'sum'], 
    'PREV_CREDIT_APPL_RATIO':['mean', 'max'],
    'PREV_GOODS_DIFF':['mean', 'max', 'sum'],
    'PREV_GOODS_APPL_RATIO':['mean', 'max'],
    'PREV_DAYS_LAST_DUE_DIFF':['mean', 'max', 'sum'],
    'PREV_INTERESTS_RATE':['mean', 'max']
}

prev_group = prev.groupby('SK_ID_CURR')
prev_amt_agg = prev_group.agg(agg_dict)
# Flatten the (column, aggregate) MultiIndex into names such as PREV_AMT_CREDIT_MEAN
prev_amt_agg.columns = ['PREV_' + '_'.join(column).upper() for column in prev_amt_agg.columns]
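
The aggregated table has one row per SK_ID_CURR, so it can be joined back onto the main dataset. The sketch below runs the same flatten step on toy data and then attaches the result with a left merge. The left merge and the toy frames are assumptions for illustration, not necessarily the join used later in this series.

```python
import pandas as pd

# Toy stand-ins for prev and apps (hypothetical values)
toy_prev = pd.DataFrame({
    'SK_ID_CURR': [1, 1, 2],
    'AMT_CREDIT': [100.0, 300.0, 50.0],
})
toy_apps = pd.DataFrame({'SK_ID_CURR': [1, 2, 3]})

# Aggregate per customer, then flatten the MultiIndex column names
toy_agg = toy_prev.groupby('SK_ID_CURR').agg({'AMT_CREDIT': ['mean', 'max', 'sum']})
toy_agg.columns = ['PREV_' + '_'.join(col).upper() for col in toy_agg.columns]
toy_agg = toy_agg.reset_index()

# Left merge keeps every customer; those without previous loans get NaN
toy_merged = toy_apps.merge(toy_agg, on='SK_ID_CURR', how='left')
print(toy_merged)
```

Customer 3 has no previous applications, so its PREV_ columns stay NaN after the merge.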
