
[Home Credit Default Risk] 2. EDA of Key Features

by ISLA! 2023. 10. 30.

 

1. Visualizing the distribution of continuous features (by Target value)

  • Use a function to plot each column separately for target = 0 and target = 1
  • Uses violinplot and distplot
  • From the results, identify the features whose distributions differ meaningfully by target, to gauge feature importance
    👉 Younger applicants (or those with less work experience) taking out small loans appear to have a higher share of defaults
import matplotlib.pyplot as plt
import seaborn as sns

def show_hist_by_target(df, columns):
    # Boolean masks for each target value
    cond_1 = (df['TARGET'] == 1)
    cond_0 = (df['TARGET'] == 0)

    # Draw one figure per column
    for column in columns:
        print('column name: ', column)   # for checking
        fig, axs = plt.subplots(figsize = (12, 4), nrows=1, ncols=2, squeeze=False)
        sns.violinplot(x = 'TARGET', y = column, data = df, ax = axs[0][0])
        # distplot must be given a Series; it is deprecated since seaborn 0.11
        # (histplot and kdeplot are its replacements)
        sns.distplot(df[cond_1][column], label = '1', color = 'red', ax = axs[0][1])
        sns.distplot(df[cond_0][column], label = '0', color = 'blue', ax = axs[0][1])
        axs[0][1].legend()   # show the labels set above

# `columns`: list of continuous feature names (not defined in this section)
show_hist_by_target(app_train, columns)
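
The visual comparison above can be backed by a small numeric summary. The following is a minimal sketch, not part of the original notebook: the helper name `summarize_by_target`, the standardized mean difference metric, and the synthetic `demo` frame are all assumptions introduced for illustration. It reports the per-target medians of each column and how far apart the two group means are in units of pooled standard deviation.

```python
import numpy as np
import pandas as pd

def summarize_by_target(df, columns, target="TARGET"):
    """Per-column medians for target 0 and 1, plus a standardized mean difference."""
    rows = []
    for column in columns:
        grp0 = df.loc[df[target] == 0, column].dropna()
        grp1 = df.loc[df[target] == 1, column].dropna()
        # Pooled standard deviation (simple average of the two group variances)
        pooled_std = np.sqrt((grp0.var() + grp1.var()) / 2)
        rows.append({
            "column": column,
            "median_0": grp0.median(),
            "median_1": grp1.median(),
            "std_mean_diff": (grp1.mean() - grp0.mean()) / pooled_std,
        })
    return pd.DataFrame(rows).set_index("column")

# Synthetic stand-in for app_train (made-up values, for illustration only)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "TARGET": np.repeat([0, 1], 500),
    "SHIFTED": np.concatenate([rng.normal(0, 1, 500), rng.normal(1, 1, 500)]),
    "NOISE": rng.normal(0, 1, 1000),
})
summary = summarize_by_target(demo, ["SHIFTED", "NOISE"])
# Largest absolute difference first
print(summary.sort_values("std_mean_diff", key=lambda s: s.abs(), ascending=False))
```

On the real data, the call would presumably be `summarize_by_target(app_train, columns)`; columns with a large absolute `std_mean_diff` are the ones whose violin plots should look visibly different between the two targets.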

 

2. Visualizing the distribution of categorical features (by Target value)

  • Use countplot to draw a histogram of each categorical feature
# Collect only the object-dtype columns into a list
object_columns = app_train.dtypes[app_train.dtypes =='object'].index.to_list()

# Plotting function
def show_count_by_target(df, columns):
    cond_1 = (df['TARGET'] == 1)
    cond_0 = (df['TARGET'] == 0)
    
    for column in columns:
        fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(18, 4), squeeze=False)
        # Use countplot to draw a histogram of the category values
        # (pass the Series as x=; newer seaborn releases treat the first positional argument as data)
        chart0 = sns.countplot(x=df[cond_0][column], ax=axs[0][0])
        # Rotate the x-axis tick labels by 45 degrees because there are many distinct values
        chart0.set_xticklabels(chart0.get_xticklabels(), rotation=45)
        chart1 = sns.countplot(x=df[cond_1][column], ax=axs[0][1])
        chart1.set_xticklabels(chart1.get_xticklabels(), rotation=45)

        
show_count_by_target(app_train, object_columns)

 

  • With catplot(), the distribution of feature values for each target value can be compared on a shared y axis, as follows
# Note: pass col instead of y
sns.catplot(x = 'CODE_GENDER', col = 'TARGET', data = app_train, kind = 'count')

# Define the plotting function
def show_category_by_target(df, columns):
    for column in columns:
        print('col name: ', column)
        chart = sns.catplot(x = column, col = 'TARGET', data = df, kind = 'count')
        chart.set_xticklabels(rotation = 65)

show_category_by_target(app_train, object_columns)

 

  • After reviewing the plots, check some of the values directly in the DataFrame
    👉 For example, the image just above suggests that the default rate relative to the number of loans is higher for men than for women, so verify this
cond_1 = (app_train['TARGET'] == 1)
cond_0 = (app_train['TARGET'] == 0)

print(app_train['CODE_GENDER'].value_counts() / app_train.shape[0])
print(app_train[cond_1]['CODE_GENDER'].value_counts() / app_train[cond_1].shape[0])
print(app_train[cond_0]['CODE_GENDER'].value_counts() / app_train[cond_0].shape[0])
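
The three prints above show the gender mix within each target group. As a complementary check, the default rate within each gender can be computed directly. This is a minimal sketch under stated assumptions: the helper name `default_rate_by_category` and the tiny `demo` frame are made up for illustration and are not from the original post.

```python
import pandas as pd

def default_rate_by_category(df, column, target="TARGET"):
    """Number of loans and share of TARGET == 1 for each category value."""
    out = df.groupby(column)[target].agg(n_loans="count", default_rate="mean")
    return out.sort_values("default_rate", ascending=False)

# Tiny made-up frame standing in for app_train
demo = pd.DataFrame({
    "CODE_GENDER": ["M", "M", "M", "M", "F", "F", "F", "F", "F", "F"],
    "TARGET":      [1,   1,   0,   0,   1,   0,   0,   0,   0,   0],
})
rates = default_rate_by_category(demo, "CODE_GENDER")
print(rates)
```

Because TARGET is 0/1, its mean per group is the default rate, so `default_rate_by_category(app_train, 'CODE_GENDER')` would answer the question in a single table. The same helper should work for any of the columns in `object_columns`.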

 

3. Correlation analysis between Target and key columns

  • Select the key columns and compute their correlation coefficients with corr()
  • Inspect the result with a heatmap
corr_columns = ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE',
               'DAYS_EMPLOYED','DAYS_ID_PUBLISH', 'DAYS_REGISTRATION', 'DAYS_LAST_PHONE_CHANGE', 'AMT_INCOME_TOTAL', 'TARGET']

col_corr = app_train[corr_columns].corr()
col_corr

(Partial result)

 

plt.figure(figsize = (9, 9))
sns.heatmap(col_corr, annot = True)

(Partial result)
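
When only the relationship with TARGET matters, the full matrix can be reduced to a single ranked column. The sketch below is an illustration rather than the original author's code: the helper name `rank_target_correlations` and the synthetic `demo` columns are assumptions. It uses `DataFrame.corrwith` to correlate each feature with TARGET and orders the features by absolute correlation.

```python
import numpy as np
import pandas as pd

def rank_target_correlations(df, columns, target="TARGET"):
    """Correlation of each column with the target, strongest (by absolute value) first."""
    features = [c for c in columns if c != target]
    corr = df[features].corrwith(df[target])
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Synthetic stand-in for app_train (made-up values, for illustration only)
rng = np.random.default_rng(42)
target = rng.integers(0, 2, 1000)
demo = pd.DataFrame({
    "TARGET": target,
    "STRONG_NEG": -target + rng.normal(0, 0.5, 1000),
    "WEAK_POS": target + rng.normal(0, 3, 1000),
    "NOISE": rng.normal(0, 1, 1000),
})
ranked = rank_target_correlations(demo, ["STRONG_NEG", "WEAK_POS", "NOISE", "TARGET"])
print(ranked)
```

With the real data this would be called as `rank_target_correlations(app_train, corr_columns)`. Sorting by absolute value keeps strongly negative correlations near the top instead of pushing them to the bottom of the list.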

 

4. Checking and handling outliers

  • Based on the per-column distribution plots above, handle implausible outliers (for example, an employment duration of roughly 1,000 years)
import numpy as np

# Inspect the outlier values directly
## 365243 occurs very frequently; it corresponds to about 1,000 years in days
app_train['DAYS_EMPLOYED'].value_counts()

## For CODE_GENDER, XNA appears in only about 4 rows; since that is few, keep it as is
app_train['CODE_GENDER'].value_counts()

# Replace the placeholder value with NaN
app_train['DAYS_EMPLOYED'] = app_train['DAYS_EMPLOYED'].replace(365243, np.nan)

# Check the result
app_train['DAYS_EMPLOYED'].value_counts(dropna = False)
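
Before overwriting the placeholder, it can be useful to record how common it is and which rows carried it, since the fact that the value was missing may itself be informative. The following is a hedged sketch of that idea: the helper name `flag_and_clear_sentinel`, the flag column suffix `_ANOM`, and the sample values are hypothetical and do not appear in the original post.

```python
import numpy as np
import pandas as pd

SENTINEL = 365243  # placeholder value observed in DAYS_EMPLOYED

def flag_and_clear_sentinel(df, column="DAYS_EMPLOYED", sentinel=SENTINEL):
    """Return a copy with a boolean flag column and the sentinel replaced by NaN,
    together with the share of rows that held the sentinel."""
    out = df.copy()
    is_sentinel = out[column] == sentinel
    out[column + "_ANOM"] = is_sentinel  # hypothetical flag column name
    out[column] = out[column].replace(sentinel, np.nan)
    return out, is_sentinel.mean()

# Made-up sample values standing in for app_train['DAYS_EMPLOYED']
demo = pd.DataFrame({"DAYS_EMPLOYED": [-637, 365243, -1188, 365243, -225]})
cleaned, share = flag_and_clear_sentinel(demo)
print(f"share of sentinel rows: {share:.1%}")
print(cleaned)
```

The function works on a copy, so the input frame is left unchanged; whether the extra flag column actually helps a model is something to verify on the real data.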

 
