Machine Learning/Case Study
[🦀 Crab Age Prediction (1)] Data Exploration & EDA
ISLA!
2023. 9. 24. 19:25
📌 This post is a write-up of what I studied while reading through best-practice EDA & ML code on Kaggle.
- Full code at this link >> https://www.kaggle.com/code/oscarm524/ps-s3-ep16-eda-modeling-submission/notebook
🦀 This time, with crab age prediction as the topic, we explore the data and walk through the basic steps of EDA.
Loading libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt; plt.style.use('ggplot')
import seaborn as sns
import plotly.express as px
from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.model_selection import (KFold, StratifiedKFold, train_test_split, GridSearchCV,
                                     RepeatedKFold, RepeatedStratifiedKFold)
from sklearn.preprocessing import MinMaxScaler, StandardScaler, LabelEncoder
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import Ridge, RidgeCV, Lasso, LassoCV
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.inspection import PartialDependenceDisplay
from sklearn.ensemble import RandomForestRegressor, HistGradientBoostingRegressor, GradientBoostingRegressor, ExtraTreesRegressor
from sklearn.svm import SVR
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
from catboost import CatBoostRegressor
from sklego.linear_model import LADRegression
Loading the data and checking basic statistics
- After loading the data, first check its overall size (shape): the number of rows and columns.
train = pd.read_csv('train.csv')
original = pd.read_csv('CrabAgePrediction.csv')
test = pd.read_csv('test.csv')
submission = pd.read_csv('sample_submission.csv')
print('train data size : ', train.shape)
print('test data size : ', test.shape)
print('submission data size : ', submission.shape)
Example output
train data size : (74051, 10)
test data size : (49368, 9)
submission data size : (49368, 2)
- Print the first five rows with head() to inspect the data, and check basic statistics of the numeric columns with describe().
- At this point, the presence of missing values can also be checked.
train.head()
train.describe()
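The missing-value check mentioned above can be made explicit with isnull().sum(). A minimal sketch on a hypothetical mini-frame (not the actual train.csv):

```python
import pandas as pd
import numpy as np

# Hypothetical mini-frame standing in for train.csv (not the real data)
df = pd.DataFrame({
    'Length': [1.2, 1.4, np.nan, 1.1],
    'Weight': [24.0, 28.5, 22.1, np.nan],
    'Age': [9, 10, 8, 7],
})

# Missing values per column; describe() also hints at them
# through a reduced 'count' row for affected columns
na_counts = df.isnull().sum()
print(na_counts)
```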
Data exploration
1. Checking for duplicates
- Using drop_duplicates(), compare the number of rows that remain after removing duplicates against the full dataset size.
- This shows whether the dataset contains any duplicated rows.
print(f'There are {train.shape[0]} observations in the train dataset')
print('There are', train.drop(columns=['id']).drop_duplicates().shape[0], 'unique observations in the train dataset')
print('There are', train.drop(columns=['id', 'Age']).drop_duplicates().shape[0], 'unique observations (only features) in the train dataset')
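As a cross-check, duplicated().sum() counts the duplicated rows directly instead of comparing sizes. A sketch on a toy frame (hypothetical values, not the competition data):

```python
import pandas as pd

# Toy frame with one exact duplicate of the feature+target rows
df = pd.DataFrame({
    'id': [0, 1, 2, 3],
    'Length': [1.2, 1.4, 1.2, 1.1],
    'Age': [9, 10, 9, 7],
})

# Drop the row identifier first, then count duplicated rows;
# rows 0 and 2 share identical features and target
n_dups = df.drop(columns=['id']).duplicated().sum()
print(n_dups)
```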
2. Checking the data distributions
- As a baseline, check the distributions of the train and test data and the correlations between columns.
- In particular, competition data comes pre-split into train and test sets, so keep this in mind while checking.
- This post works with three datasets: train, original, and test.
- The train data is presumably sampled from the original data, so for an accurate analysis we compare the overall distribution and state of the train and original datasets.
fig, axes = plt.subplots(1, 2, figsize = (18, 8))
sns.kdeplot(ax = axes[0], data = train, x = 'Age', fill = True, color = 'steelblue').set_title('train data');
sns.kdeplot(ax = axes[1], data = original, x = 'Age', fill = True, color = 'orange').set_title('original data');
plt.show()
👉 The two datasets have essentially the same distribution overall.
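If a visual check of the KDE plots is not enough, a two-sample Kolmogorov-Smirnov test (scipy's ks_2samp) can quantify how similar the two Age distributions are. A sketch with synthetic samples standing in for train['Age'] and original['Age']:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
# Two synthetic 'Age' samples drawn from the same distribution,
# standing in for train['Age'] and original['Age']
age_train = rng.normal(10, 3, 2000)
age_original = rng.normal(10, 3, 2000)

# A small statistic / large p-value means we cannot reject that
# both samples come from the same underlying distribution
stat, p = ks_2samp(age_train, age_original)
print(round(stat, 3), round(p, 3))
```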
3. Checking correlations between the columns and the target
corr_train = train.drop(columns = ['id', 'Sex'], axis = 1).corr()
corr_original = original.drop(columns = ['Sex'], axis = 1).corr()
train_mask = np.triu(np.ones_like(corr_train, dtype = bool))
original_mask = np.triu(np.ones_like(corr_original, dtype = bool))
cmap = sns.diverging_palette(100, 7, s = 75, l = 40, n = 20, center = 'light', as_cmap = True)
fig, axes = plt.subplots(1, 2, figsize = (25, 10))
sns.heatmap(corr_train, annot = True, cmap = cmap, fmt = '.2f', center = 0,
            annot_kws = {'size':12}, ax = axes[0], mask = train_mask).set_title('train correlations')
sns.heatmap(corr_original, annot = True, cmap = cmap, fmt = '.2f', center = 0,
            annot_kws = {'size':12}, ax = axes[1], mask = original_mask).set_title('original correlations')
plt.show()
👉 The correlation coefficients of the two datasets look similar.
- Age and Shell Weight have the highest correlation.
- Age and Shucked Weight have the lowest correlation.
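The highest and lowest correlations with the target can also be read off programmatically by sorting the Age column of the correlation matrix. A sketch on synthetic columns loosely mimicking Shell Weight (strongly tied to Age) and Shucked Weight (treated as noise here):

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(42)
n = 500
# Synthetic stand-ins: Shell Weight drives Age, Shucked Weight is noise
shell = rng.uniform(0.1, 1.0, n)
age = 5 + 10 * shell + rng.normal(0, 1, n)
shucked = rng.uniform(0.1, 1.0, n)
df = pd.DataFrame({'Shell Weight': shell, 'Shucked Weight': shucked, 'Age': age})

# Rank features by absolute correlation with the target
corr_with_age = df.corr()['Age'].drop('Age').abs().sort_values(ascending=False)
print(corr_with_age)
```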
4. Visualizing the relationship between key columns and the target
4-1. Sex - Age
# Convert the Sex column to a categorical type with the category order I/M/F (for visualization)
original['Sex'] = pd.Categorical(original['Sex'], categories = ['I', 'M', 'F'], ordered = True)
fig, axes = plt.subplots(1, 2, figsize = (15, 6))
sns.boxplot(ax = axes[0], data = train, x = 'Sex', y = 'Age').set_title('Competition Dataset')
sns.boxplot(ax = axes[1], data = original, x = 'Sex', y = 'Age').set_title('Original Dataset');
👉 Where sex is clearly distinguished, the two datasets show similar distributions.
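The pattern in the boxplots can be summarized numerically with a per-category median. A sketch on hypothetical Sex/Age values (not the real data):

```python
import pandas as pd

# Hypothetical sample mirroring the Sex/Age relation described above
df = pd.DataFrame({
    'Sex': ['I', 'I', 'M', 'M', 'F', 'F'],
    'Age': [5, 6, 10, 11, 11, 12],
})
df['Sex'] = pd.Categorical(df['Sex'], categories=['I', 'M', 'F'], ordered=True)

# Median Age per category backs up what the boxplots show:
# infants ('I') are younger than adults ('M'/'F')
medians = df.groupby('Sex', observed=False)['Age'].median()
print(medians)
```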
4-2. Shell Weight - Age
fig, axes = plt.subplots(1, 2, figsize = (15, 6))
sns.scatterplot(ax = axes[0], data = train, x = 'Shell Weight', y = 'Age', color = 'steelblue').set_title('Competition Dataset')
sns.scatterplot(ax = axes[1], data = original, x = 'Shell Weight', y = 'Age', color = 'orange').set_title('Original Dataset');
👉 Positive correlation
4-3. Diameter - Age
fig, axes = plt.subplots(1, 2, figsize = (15, 6))
sns.scatterplot(ax = axes[0], data = train, x = 'Diameter', y = 'Age', color = 'steelblue').set_title('Competition Dataset')
sns.scatterplot(ax = axes[1], data = original, x = 'Diameter', y = 'Age', color = 'orange').set_title('Original Dataset');
👉 Positive correlation (linear)
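The linear relationship suggested by the scatterplots can be quantified with a simple univariate regression. A sketch with synthetic Diameter/Age values (not the real data; slope and noise are made up):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
# Synthetic Diameter/Age pair with a roughly linear relation,
# standing in for the scatterplots above
diameter = rng.uniform(0.2, 0.7, 300)
age = 2 + 15 * diameter + rng.normal(0, 1, 300)

# Fit Age ~ Diameter; R^2 measures how linear the relation is
X = diameter.reshape(-1, 1)
model = LinearRegression().fit(X, age)
r2 = model.score(X, age)
print(round(model.coef_[0], 2), round(r2, 3))
```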
The next post builds on this exploration and moves on to writing baseline code based on a variety of models!