1. Checking the target distribution & the distributions of key columns
- Check the dataset's target value (loan default status) >> count and ratio of each value (0, 1)
app_train['TARGET'].value_counts()
app_train['TARGET'].value_counts() / 307511 * 100
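Instead of hard-coding the row count (307511), `value_counts(normalize=True)` gives the same ratios directly. A minimal sketch with a toy frame standing in for app_train:

```python
import pandas as pd

# Toy frame standing in for app_train (values are illustrative)
app_train = pd.DataFrame({'TARGET': [0, 0, 0, 1]})

# normalize=True divides by the total count for us
ratio = app_train['TARGET'].value_counts(normalize=True) * 100
print(ratio)  # 0 -> 75.0, 1 -> 25.0
```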
- Check for null values in the target
apps['TARGET'].value_counts(dropna=False)
- Check the distributions of key columns with histograms
sns.distplot(app_train['AMT_INCOME_TOTAL'])
plt.hist(app_train['AMT_INCOME_TOTAL'])
- Try applying some filters as well (for example, the distribution of a key column for applicants with income below 1,000,000)
cond_1 = app_train['AMT_INCOME_TOTAL'] < 1000000
app_train[cond_1]['AMT_INCOME_TOTAL'].hist()
- distplot is also a good choice for showing a distribution.
In particular, the KDE overlay makes histograms of continuous variables much easier to read at a glance.
sns.distplot(app_train[app_train['AMT_INCOME_TOTAL'] < 1000000]['AMT_INCOME_TOTAL'])
2. Checking the distributions of key columns by target value (as a function)
- Compare the distributions visually with seaborn's distplot and violinplot
- Define a function so the comparison can be checked at a glance
def show_column_hist_by_target(df, column, is_amt=False):
    # Conditions for TARGET == 1 and TARGET == 0 (on df, not the global app_train)
    cond1 = (df['TARGET'] == 1)
    cond0 = (df['TARGET'] == 0)
    # Prepare the figure (set the size)
    fig, axs = plt.subplots(figsize=(12, 4), nrows=1, ncols=2, squeeze=False)
    # All-True mask so df[cond_amt] also works when is_amt is False
    cond_amt = pd.Series(True, index=df.index)
    # If is_amt is True, filter with the < 500000 condition
    if is_amt:
        cond_amt = df[column] < 500000
    sns.violinplot(x='TARGET', y=column, data=df[cond_amt], ax=axs[0][0])
    sns.distplot(df[cond0 & cond_amt][column], ax=axs[0][1], label='0', color='blue')
    sns.distplot(df[cond1 & cond_amt][column], ax=axs[0][1], label='1', color='red')
    axs[0][1].legend()
show_column_hist_by_target(app_train, 'AMT_INCOME_TOTAL', is_amt = True)
3. Label encoding the categorical features
- Get only the features of a specific data type (object) as a list
object_columns = apps.dtypes[apps.dtypes == 'object'].index.tolist()
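`select_dtypes` is an equivalent, arguably clearer way to pull the object columns. A sketch on a tiny frame (the values are illustrative):

```python
import pandas as pd

# Tiny stand-in for apps with one categorical and one numeric column
apps = pd.DataFrame({
    'NAME_CONTRACT_TYPE': ['Cash loans', 'Revolving loans'],
    'AMT_CREDIT': [406597.5, 1293502.5],
})

# Same result as filtering apps.dtypes by 'object'
object_columns = apps.select_dtypes(include='object').columns.tolist()
print(object_columns)  # ['NAME_CONTRACT_TYPE']
```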
- Use pandas' factorize() >> element [0] is the encoded result
- Caution: it handles only one column at a time, so a loop is needed.
for column in object_columns:
    apps[column] = pd.factorize(apps[column])[0]
# Check with apps.info() that no object dtype remains
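What factorize() returns is easiest to see on a small example: it gives a `(codes, uniques)` pair, and `[0]` picks the integer codes used above.

```python
import pandas as pd

# factorize assigns each distinct value an integer code in order of appearance
codes, uniques = pd.factorize(pd.Series(['Cash loans', 'Revolving loans', 'Cash loans']))
print(codes)    # [0 1 0]
print(uniques)  # ['Cash loans', 'Revolving loans']
```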
4. Training with LGBMClassifier
from sklearn.model_selection import train_test_split
from lightgbm import LGBMClassifier
train_x, valid_x, train_y, valid_y = train_test_split(ftr_app, target_app, test_size = 0.3, random_state = 2020)
train_x.shape, valid_x.shape
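Since the target is imbalanced (as the value_counts above shows), passing `stratify=target_app` to train_test_split keeps the class ratio identical in both splits, which is often preferable. A toy sketch with synthetic data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)
y = np.array([0] * 16 + [1] * 4)  # imbalanced target, like the default flag here

# stratify=y preserves the 0/1 ratio in both the train and validation splits
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.25, random_state=2020, stratify=y)
print(y_va.mean())  # 0.2, same positive rate as y
```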
clf = LGBMClassifier(
    n_jobs=-1,           # use all available cores
    n_estimators=1000,   # build up to 1000 weak learners
    learning_rate=0.02,
    num_leaves=32,       # maximum number of leaf nodes per tree
    subsample=0.8,
    max_depth=12,
    silent=-1,
    verbose=-1
)
clf.fit(train_x, train_y, eval_set=[(train_x, train_y), (valid_x, valid_y)], eval_metric= 'auc', verbose= 100, early_stopping_rounds= 50)
- Visualizing feature importance
from lightgbm import plot_importance
plot_importance(clf, figsize=(16, 32))
- Checking the predicted probabilities
preds = clf.predict_proba(apps_test.drop(['SK_ID_CURR'], axis = 1))[:, 1]
apps_test['TARGET'] = preds
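The probabilities are typically written out in the competition's submission format (SK_ID_CURR, TARGET). A sketch with hypothetical IDs and predictions standing in for apps_test and preds:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for apps_test and preds
apps_test = pd.DataFrame({'SK_ID_CURR': [100001, 100005]})
preds = np.array([0.07, 0.31])

# Submission frame: one probability per test applicant
submission = apps_test[['SK_ID_CURR']].copy()
submission['TARGET'] = preds
submission.to_csv('submission.csv', index=False)
```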