
Machine Learning/Case Study 👩🏻‍💻 26

[Pandas] Interpreting A/B test results on e-commerce data with Python (feat. statistical tests) Data source: Kaggle https://www.kaggle.com/code/sergylog/ab-test-data-analysis Loading libraries & data import numpy as np import pandas as pd from scipy.stats import mannwhitneyu from scipy.stats import ttest_ind from scipy.stats import norm from scipy.stats import pearsonr from scipy.stats import shapiro import matplotlib.pyplot as plt import seaborn as sns from tqdm.auto import tqdm df = pd.read_csv('AB_Test.. 2024. 1. 30.
[Home Credit Default Risk] 5. EDA and FE on the previous loan history data (work in progress) 📍Data description: Previous_Application Besides the main train data used to predict default from loan information, this Kaggle competition also provides each customer's previous loan history. Let's process the previous loan history data, combine it with the main dataset, and check whether the predictions improve. See Kaggle for the detailed dataset and column descriptions! Loading the data prev = pd.read_csv('previous_application.csv') print(prev.shape, apps.shape) Join with the main dataset and check by key (ID) value Join with apps, the main dataset. Merge on the key SK_ID_CURR, with indicator set, so that the id diff.. 2023. 11. 15.
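The key check described in this excerpt can be sketched roughly as follows. This is a minimal illustration on made-up toy tables, not the post's actual code: the `how="outer"` choice, the `AMT_CREDIT` column, and the sample values are assumptions added so the example runs on its own.

```python
import pandas as pd

# Toy stand-ins for the main table (apps) and the previous-application
# table (prev); all values here are made up for illustration.
apps = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "TARGET": [0, 1, 0]})
prev = pd.DataFrame({"SK_ID_CURR": [1, 1, 4], "AMT_CREDIT": [100.0, 250.0, 80.0]})

# Assumption: an outer merge, so IDs from both sides are kept.
# indicator=True adds a "_merge" column telling whether each row was found
# in the left table only, the right table only, or both.
merged = prev.merge(apps, on="SK_ID_CURR", how="outer", indicator=True)

# Count rows per category to see how the two ID sets differ.
merge_counts = merged["_merge"].value_counts()
print(merge_counts)
```

Rows tagged `left_only` are previous applications whose ID is missing from the main table, and `right_only` rows are main-table customers with no previous application.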
[Home Credit Default Risk] 4. EDA and merging of the previous loan history data 1. Copy the Feature Engineering function from the earlier application (main data) post def get_apps_processed(apps): # Process the EXT_SOURCE_X features apps['APPS_EXT_SOURCE_MEAN'] = apps[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].mean(axis=1) apps['APPS_EXT_SOURCE_STD'] = apps[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].std(axis=1) apps['APPS_EXT_SOURCE_STD'] = apps['APPS_EXT_SOURCE_STD'].fillna(apps['APPS_EXT_SOURCE_STD']... 2023. 11. 9.
[Home Credit Default Risk] 3. Feature engineering on the key features 1. Handling missing values Select the key features based on the earlier EDA, the distribution plots, and the feature importance from the baseline model Check the null values of each feature app_train[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].isnull().sum() # dropna=False also counts the null values app_train['EXT_SOURCE_1'].value_counts(dropna=False) app_train['EXT_SOURCE_2'].value_counts(dropna=False) app_train['EXT_SOURCE_3'].value_counts(dropna=False) Check the mean/max/min/standard deviation of the key features # EXT_SOURCE_X features.. 2023. 10. 31.
[Home Credit Default Risk] 2. EDA on the key features 1. Visualizing the distribution of continuous features (by Target value) Use a function to plot each column separately for target values 0 and 1 Uses violinplot and distplot From the results, identify features that differ meaningfully by target >> gauge feature importance 👉 Default rates appear higher for small loans taken by younger customers (or those with short employment history) def show_hist_by_target(df, columns): # Conditions for each target value cond_1 = (df['TARGET'] == 1) cond_0 = (df['TARGET'] == 0) # Draw the plots for column in columns: print('column name: ', column) # for checking fig, axs = plt.subplots.. 2023. 10. 30.
[Home Credit Default Risk] 1. Visualizing data distributions, label encoding 1. Checking the target distribution & the distribution of key columns Check the target value of the data (whether the loan defaulted). >> Count and ratio of each value (0, 1) app_train['TARGET'].value_counts() app_train['TARGET'].value_counts() / 307511 * 100 Check the target for null values apps['TARGET'].value_counts(dropna=False) Check the distribution of key columns with histograms sns.distplot(app_train['AMT_INCOME_TOTAL']) plt.hist(app_train['AMT_INCOME_TOTAL']) Also try some filters (for example, the distribution of key columns where income is 1,000,000 or less) cond_1 = app_train[.. 2023. 10. 30.
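The count-and-ratio check above divides by a hard-coded row count (307511). As a hedged alternative, `value_counts(normalize=True)` computes the same ratio without that constant. The tiny `app_train` frame below is made-up toy data, not the competition data.

```python
import pandas as pd

# Toy stand-in for app_train; the real table has far more rows.
app_train = pd.DataFrame({"TARGET": [0, 0, 0, 1, 0, 1, 0, 0]})

# Absolute count of each target value.
counts = app_train["TARGET"].value_counts()

# Percentage of each target value; normalize=True divides by the number
# of non-null rows, so no hard-coded total is needed.
ratios = app_train["TARGET"].value_counts(normalize=True) * 100

# dropna=False adds a NaN entry to the counts if the target has nulls.
with_nulls = app_train["TARGET"].value_counts(dropna=False)

print(counts)
print(ratios)
```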
[BG/NBD] A model for predicting customer transaction behavior What is BetaGeoFitter? It is one of the statistical models used for churn prediction and purchase-probability modeling: it predicts customer churn and the probability of purchase. The model works from existing customer data and, in particular, takes the number of purchases and the interval between repeat purchases into account. To implement the BetaGeoFitter model, you can use Python's lifetimes library. Below is a usage example of the BetaGeoFitter model. Importing the library Install and import the lifetimes library. !pip install lifetimes import pandas as pd from lifetimes import BetaGeoFitter from lifetimes.datasets import load_cdnow_summary Ex.. 2023. 10. 10.
[Kaggle] E-commerce data analysis 7 (CRM Analytics 🛍️🛒) Customer Lifetime Value This is the process of answering the question: how much will one customer bring to your brand over their entire lifetime? Let's predict it by computing, per customer, the total purchase period, the time elapsed since the first purchase, the number of orders, and the total order amount. First, group by customer ID. To handle the time-related information, apply two functions to InvoiceDate. This yields each customer's purchase period and the time from the first purchase to the present. For InvoiceNo, use nunique to count the number of orders by unique invoice number. Apply sum to TotalPrice to total how much money was spent. Finally, tidy up the columns: with droplevel(0), the multi.. created by the grouping 2023. 10. 10.
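The per-customer aggregation described in this excerpt might look like the sketch below. The toy transactions, the reference date `today`, the two helper functions, and the final column names are all assumptions made for illustration; the post's own functions and names are cut off in the excerpt.

```python
import pandas as pd

# Made-up transactions; column names follow the excerpt.
df = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2],
    "InvoiceNo": ["A1", "A1", "A2", "B1", "B2"],
    "InvoiceDate": pd.to_datetime(
        ["2011-01-01", "2011-01-01", "2011-03-01", "2011-02-01", "2011-02-11"]
    ),
    "TotalPrice": [10.0, 5.0, 20.0, 7.5, 2.5],
})
today = pd.Timestamp("2011-12-11")  # hypothetical "present" date


def purchase_span_days(dates):
    # Days between a customer's first and last purchase.
    return (dates.max() - dates.min()).days


def days_since_first(dates):
    # Days from a customer's first purchase to the reference date.
    return (today - dates.min()).days


clv = df.groupby("CustomerID").agg({
    "InvoiceDate": [purchase_span_days, days_since_first],
    "InvoiceNo": ["nunique"],   # number of distinct orders
    "TotalPrice": ["sum"],      # total amount spent
})

# The aggregation produces two-level column labels; drop the top level
# (the original column name), then assign readable names.
clv.columns = clv.columns.droplevel(0)
clv.columns = ["purchase_span_days", "days_since_first", "n_orders", "total_spent"]
print(clv)
```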
[Kaggle] E-commerce data analysis 6 (CRM Analytics 🛍️🛒) Following the customer segment analysis in part 5, this post carries out a cohort analysis. Cohort Analysis A cohort is a group of people who share some common trait. That trait can be the app sign-up date, the month of first purchase, geographic location, acquisition channel (organic users, users brought in by marketing, etc.), and so on. Cohort analysis tracks these user groups over time to identify common patterns or behaviors. In this example a cohort analysis function is defined; because the function is long, it is explained in pieces and the final function is given at the end. cohort (extract the first order date and the date of each order per customer) Copy the data frame Extract only the customer ID, invoice number, and order date, and remove duplicate rows Convert the dates to monthly periods (Period): each .. 2023. 10. 8.
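The first cohort steps listed above (copy, keep three columns, drop duplicates, convert to monthly periods) can be sketched as follows. Since the excerpt is cut off after the period conversion, everything from the `cohort` column onward is an assumed continuation showing one common way to build a cohort table, and the toy data and helper column names are hypothetical.

```python
import pandas as pd

# Made-up orders; column names follow the excerpt.
df = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2, 3],
    "InvoiceNo": ["A1", "A1", "A2", "B1", "B2", "C1"],
    "InvoiceDate": pd.to_datetime([
        "2011-01-05", "2011-01-05", "2011-02-10",
        "2011-01-20", "2011-03-02", "2011-02-15",
    ]),
})

# Keep one row per (customer, invoice, date) on a copy of the frame.
orders = df[["CustomerID", "InvoiceNo", "InvoiceDate"]].drop_duplicates().copy()

# Convert each order date to a monthly period.
orders["order_month"] = orders["InvoiceDate"].dt.to_period("M")

# Assumed continuation: a customer's cohort is the month of their first order.
first_order = orders.groupby("CustomerID")["InvoiceDate"].transform("min")
orders["cohort"] = first_order.dt.to_period("M")

# Whole months elapsed between the cohort month and each order month.
orders["period_number"] = (
    (orders["order_month"].dt.year - orders["cohort"].dt.year) * 12
    + (orders["order_month"].dt.month - orders["cohort"].dt.month)
)

# Distinct active customers per cohort and elapsed month.
cohort_table = (
    orders.groupby(["cohort", "period_number"])["CustomerID"]
    .nunique()
    .unstack(fill_value=0)
)
cohort_table.index = cohort_table.index.astype(str)
print(cohort_table)
```

Each row is a cohort and each column is the number of months since that cohort's first order, so column 0 holds the cohort size.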