1. Handling missing values
- Select the key features based on the distribution plots from the earlier EDA and the feature importance from the baseline model
- Check the null values of each feature
app_train[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].isnull().sum()
# dropna=False also includes the number of null values in the counts
app_train['EXT_SOURCE_1'].value_counts(dropna=False)
app_train['EXT_SOURCE_2'].value_counts(dropna=False)
app_train['EXT_SOURCE_3'].value_counts(dropna=False)
- Check the mean/max/min/standard deviation of the key features
# Check the mean/max/min/standard deviation of the EXT_SOURCE_X features
print('### mean ###\n', app_train[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].mean())
print('### max ###\n',app_train[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].max())
print('### min ###\n',app_train[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].min())
print('### std ###\n',app_train[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].std())
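The null check and the four separate print calls above can also be collapsed into two small tables. The following is only a minimal sketch: `toy` is a made-up stand-in for `app_train`, and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for app_train; the values are made up for illustration.
toy = pd.DataFrame({
    'EXT_SOURCE_1': [0.1, np.nan, 0.5, np.nan],
    'EXT_SOURCE_2': [0.2, 0.4, np.nan, 0.8],
    'EXT_SOURCE_3': [0.3, 0.6, 0.9, np.nan],
})
ext_cols = ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']

# Null count and null ratio per feature in one table.
null_summary = pd.DataFrame({
    'null_count': toy[ext_cols].isnull().sum(),
    'null_ratio': toy[ext_cols].isnull().mean(),
})
print(null_summary)

# mean / max / min / std in one table; NaN values are skipped by default.
stats = toy[ext_cols].agg(['mean', 'max', 'min', 'std'])
print(stats)
```

Each statistic becomes a row and each feature a column, which makes the three features easier to compare side by side.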
2. Combine the train and test datasets before feature engineering (missing-value handling)
- Applying every transformation made on the train data to the test set separately is cumbersome, so the two are combined first
apps = pd.concat([app_train, app_test])
print(apps.shape)
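To see what the concatenation does to the target column and the index, here is a small sketch with made-up frames (`toy_train` and `toy_test` are hypothetical stand-ins for `app_train` and `app_test`):

```python
import pandas as pd

# Toy stand-ins for app_train / app_test; the values are made up.
toy_train = pd.DataFrame({
    'SK_ID_CURR': [1, 2, 3],
    'AMT_CREDIT': [100.0, 200.0, 300.0],
    'TARGET': [0, 1, 0],
})
toy_test = pd.DataFrame({
    'SK_ID_CURR': [4, 5],
    'AMT_CREDIT': [400.0, 500.0],
})

# Rows are stacked; the columns are the union of both frames.
toy_apps = pd.concat([toy_train, toy_test])
print(toy_apps.shape)

# The test rows have no TARGET, so it is NaN there.
print(toy_apps['TARGET'].isnull().sum())

# Without ignore_index=True the original index labels are kept and repeat.
print(toy_apps.index.tolist())
```

Because the test rows are the only ones with a missing `TARGET`, that column can later be used to separate the two sets again.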
- Combine the three features row by row to create new mean and standard deviation features
- Put the columns to combine in a list and compute the per-row value with .mean(axis=1)
apps['APPS_EXT_SOURCE_MEAN'] = apps[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].mean(axis = 1)
apps['APPS_EXT_SOURCE_STD'] = apps[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].std(axis = 1)
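- Inspect the generated columns, then handle the NaN values
- With the pandas defaults (NaN values skipped, sample standard deviation), the row-wise standard deviation is NaN whenever fewer than two of the three values are present >> fill those with the mean of the whole standard deviation column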
apps.iloc[:, -2:].head()
apps[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'APPS_EXT_SOURCE_MEAN', 'APPS_EXT_SOURCE_STD']].head()
apps['APPS_EXT_SOURCE_STD'] = apps['APPS_EXT_SOURCE_STD'].fillna(apps['APPS_EXT_SOURCE_STD'].mean())
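The sketch below illustrates when the row-wise mean and standard deviation come out as NaN, assuming the pandas defaults (`skipna=True`, `ddof=1`). The data is made up, and `APPS_EXT_SOURCE_STD_FILLED` is a hypothetical column name used only to keep the before/after values side by side.

```python
import numpy as np
import pandas as pd

# Toy rows with 0, 1, 2 and 3 missing values respectively (made-up data).
toy = pd.DataFrame({
    'EXT_SOURCE_1': [0.2, np.nan, np.nan, np.nan],
    'EXT_SOURCE_2': [0.4, 0.4, np.nan, np.nan],
    'EXT_SOURCE_3': [0.6, 0.8, 0.5, np.nan],
})
ext_cols = ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']

# NaN values are skipped, so the mean is NaN only when all three are missing.
toy['APPS_EXT_SOURCE_MEAN'] = toy[ext_cols].mean(axis=1)

# The sample standard deviation needs at least two values per row.
toy['APPS_EXT_SOURCE_STD'] = toy[ext_cols].std(axis=1)

# Fill the remaining NaN values with the mean of the std column.
std_mean = toy['APPS_EXT_SOURCE_STD'].mean()
toy['APPS_EXT_SOURCE_STD_FILLED'] = toy['APPS_EXT_SOURCE_STD'].fillna(std_mean)
print(toy)
```

Rows with a single usable value get a mean but no standard deviation, which is why only the standard deviation column is filled here.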
3. Feature engineering
- Consider the characteristics of the key features and try a variety of combinations.
- Some examples follow
# Ratios such as the loan annuity relative to the credit amount
apps['APPS_ANNUITY_CREDIT_RATIO'] = apps['AMT_ANNUITY']/apps['AMT_CREDIT']
apps['APPS_GOODS_CREDIT_RATIO'] = apps['AMT_GOODS_PRICE'] / apps['AMT_CREDIT']
apps['APPS_CREDIT_GOODS_DIFF'] = apps['AMT_CREDIT'] - apps['AMT_GOODS_PRICE']
# Loan-amount features expressed as ratios to AMT_INCOME_TOTAL
apps['APPS_ANNUITY_INCOME_RATIO'] = apps['AMT_ANNUITY']/apps['AMT_INCOME_TOTAL']
apps['APPS_CREDIT_INCOME_RATIO'] = apps['AMT_CREDIT']/apps['AMT_INCOME_TOTAL']
apps['APPS_GOODS_INCOME_RATIO'] = apps['AMT_GOODS_PRICE']/apps['AMT_INCOME_TOTAL']
# Disposable-income feature that accounts for the number of family members
apps['APPS_CNT_FAM_INCOME_RATIO'] = apps['AMT_INCOME_TOTAL']/apps['CNT_FAM_MEMBERS']
# Income/asset features expressed as ratios to DAYS_BIRTH and DAYS_EMPLOYED
apps['APPS_EMPLOYED_BIRTH_RATIO'] = apps['DAYS_EMPLOYED']/apps['DAYS_BIRTH']
apps['APPS_INCOME_EMPLOYED_RATIO'] = apps['AMT_INCOME_TOTAL']/apps['DAYS_EMPLOYED']
apps['APPS_INCOME_BIRTH_RATIO'] = apps['AMT_INCOME_TOTAL']/apps['DAYS_BIRTH']
apps['APPS_CAR_BIRTH_RATIO'] = apps['OWN_CAR_AGE'] / apps['DAYS_BIRTH']
apps['APPS_CAR_EMPLOYED_RATIO'] = apps['OWN_CAR_AGE'] / apps['DAYS_EMPLOYED']
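One thing to watch with ratio features is a zero denominator, which pandas turns into `inf` rather than NaN. Whether zeros actually occur in these columns is not stated here, so the following is only a sketch under that assumption: `safe_ratio` is a hypothetical helper, and the data is made up.

```python
import numpy as np
import pandas as pd

def safe_ratio(numerator, denominator):
    """Hypothetical helper: element-wise ratio with +/-inf replaced by NaN."""
    ratio = numerator / denominator
    return ratio.replace([np.inf, -np.inf], np.nan)

# Toy data; the second row has a zero denominator (made-up values).
toy = pd.DataFrame({
    'AMT_INCOME_TOTAL': [100000.0, 50000.0, 80000.0],
    'DAYS_EMPLOYED': [-1000, 0, -400],
})

# Plain division yields inf where the denominator is zero.
toy['PLAIN_RATIO'] = toy['AMT_INCOME_TOTAL'] / toy['DAYS_EMPLOYED']

# The helper turns those entries into NaN instead.
toy['SAFE_RATIO'] = safe_ratio(toy['AMT_INCOME_TOTAL'], toy['DAYS_EMPLOYED'])
print(toy)
```

Converting `inf` to NaN keeps such rows consistent with the other missing values in the dataset.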
4. Building the second model
- Label-encode the data in the same way as before
- Null values are left unchanged so that LightGBM handles them internally
object_columns = apps.dtypes[apps.dtypes == 'object'].index.tolist()
for column in object_columns:
    apps[column] = pd.factorize(apps[column])[0]
# Split the train and test data based on whether TARGET is present
apps_train = apps[~apps['TARGET'].isnull()]
apps_test = apps[apps['TARGET'].isnull()]
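A detail worth knowing about `pd.factorize` is that missing values in a categorical column are encoded as -1, so after encoding they are an ordinary integer code rather than NaN. A minimal sketch with made-up data (the column name and values are illustrative only) that also shows the TARGET-based split:

```python
import numpy as np
import pandas as pd

# Toy combined frame: the last two rows play the role of test rows.
toy_apps = pd.DataFrame({
    'SK_ID_CURR': [1, 2, 3, 4],
    'NAME_CONTRACT_TYPE': ['Cash loans', 'Revolving loans', None, 'Cash loans'],
    'TARGET': [0.0, 1.0, np.nan, np.nan],
})

# factorize returns integer codes plus the unique values it found.
codes, uniques = pd.factorize(toy_apps['NAME_CONTRACT_TYPE'])
print(codes)    # the missing value receives the code -1
print(uniques)
toy_apps['NAME_CONTRACT_TYPE'] = codes

# Rows with a TARGET are training rows; rows without one are test rows.
toy_train = toy_apps[toy_apps['TARGET'].notnull()]
toy_test = toy_apps[toy_apps['TARGET'].isnull()]
print(toy_train.shape, toy_test.shape)
```

Numeric columns are not touched by this step, so their NaN values stay as they are.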
# Split into training/validation sets, then train the model
from sklearn.model_selection import train_test_split
ftr_app = apps_train.drop(['SK_ID_CURR', 'TARGET'], axis=1)
target_app = apps_train['TARGET']
train_x, valid_x, train_y, valid_y = train_test_split(ftr_app, target_app, test_size=0.3, random_state=2020)
from lightgbm import LGBMClassifier
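# Note: `silent` and the fit() keywords `verbose` / `early_stopping_rounds` were removed in LightGBM 4.0;
# newer versions take lightgbm.early_stopping() / lightgbm.log_evaluation() through `callbacks` instead.
clf = LGBMClassifier(
    n_jobs=-1,
    n_estimators=1000,
    learning_rate=0.02,
    num_leaves=32,
    subsample=0.8,
    max_depth=12,
    silent=-1,
    verbose=-1
)
# Evaluate AUC on both sets every 100 rounds; stop if validation AUC does not improve for 100 rounds
clf.fit(train_x, train_y, eval_set=[(train_x, train_y), (valid_x, valid_y)], eval_metric='auc', verbose=100,
        early_stopping_rounds=100)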