
[🦀 Crab Age Prediction (1)] Data Exploration & EDA

by ISLA! 2023. 9. 24.
😀 This example is a study write-up based on best-practice EDA & ML code from Kaggle.
- The full code is available at this link >> https://www.kaggle.com/code/oscarm524/ps-s3-ep16-eda-modeling-submission/notebook

🦀 In this post, we take crab age prediction as the task, explore the data, and walk through the basic EDA process.

 


๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ load

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt; plt.style.use('ggplot')
import seaborn as sns
import plotly.express as px
from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.preprocessing import MinMaxScaler, StandardScaler, LabelEncoder
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import Ridge, RidgeCV, Lasso, LassoCV
from sklearn.model_selection import KFold, StratifiedKFold, train_test_split, GridSearchCV, RepeatedKFold, RepeatedStratifiedKFold
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.inspection import PartialDependenceDisplay
from sklearn.ensemble import RandomForestRegressor, HistGradientBoostingRegressor, GradientBoostingRegressor, ExtraTreesRegressor
from sklearn.svm import SVR

from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
from catboost import CatBoostRegressor
from sklego.linear_model import LADRegression

 

Load the data and check basic statistics

  • After loading the data, first check its overall size (shape): the number of rows and columns
train = pd.read_csv('train.csv')
original = pd.read_csv('CrabAgePrediction.csv')
test = pd.read_csv('test.csv')
submission = pd.read_csv('sample_submission.csv')

print('train data size : ', train.shape)
print('test data size : ', test.shape)
print('submission data size : ', submission.shape)

Example output

train data size :  (74051, 10)
test data size :  (49368, 9)
submission data size :  (49368, 2)
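The shapes above show that test has one column fewer than train. A minimal sketch of checking which column that is, using small made-up frames in place of the real CSV files (the toy values and the `toy_train` / `toy_test` names are illustrative assumptions, not the competition data):

```python
import pandas as pd

# Toy stand-ins for train.csv / test.csv (made-up values, not the competition data)
toy_train = pd.DataFrame({'id': [0, 1], 'Sex': ['I', 'M'], 'Length': [1.1, 1.5], 'Age': [6, 10]})
toy_test = pd.DataFrame({'id': [2, 3], 'Sex': ['F', 'M'], 'Length': [1.4, 1.3]})

# Columns present in train but missing from test: normally only the target
target_cols = [c for c in toy_train.columns if c not in toy_test.columns]
print(target_cols)

# Columns shared by both frames, in train order
shared_cols = [c for c in toy_train.columns if c in toy_test.columns]
print(shared_cols)
```

With the real files, the same comparison on `train.columns` and `test.columns` would be expected to single out the target column.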

 

  • Print the first 5 rows with head(), and check the basic statistics of the numeric columns with describe()
  • The count row of describe() shows the number of non-null values per column, so missing values can be spotted at the same time
train.head()
train.describe()
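Missing values can also be counted directly rather than read off describe(). A small sketch on a made-up frame (the `toy` data is an assumption for illustration; on the real data the same calls would be applied to `train`):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data (hypothetical values, not the real train.csv)
toy = pd.DataFrame({
    'Length': [1.2, 1.5, np.nan, 0.9],
    'Weight': [20.1, np.nan, np.nan, 11.3],
    'Age': [9, 11, 8, 6],
})

# Number of missing values per column
missing_counts = toy.isna().sum()
print(missing_counts)

# describe() reports the non-null count, so len(df) - count gives the same information
non_null = toy.describe().loc['count']
print(len(toy) - non_null)
```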


 


Data exploration

1. Check for duplicates

  • Use drop_duplicates() to compare the total data size with the size (number of rows) after duplicates are removed
  • This shows whether the data contains duplicated rows
print(f'There are {train.shape[0]} observations in the train dataset')
print('There are', train.drop(columns=['id']).drop_duplicates().shape[0], 'unique observations in the train dataset')
print('There are', train.drop(columns=['id', 'Age']).drop_duplicates().shape[0], 'unique observations (only features) in the train dataset')
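The same comparison can be expressed by counting duplicates directly with duplicated(). The sketch below uses a made-up frame (the `toy` values are assumptions) and also flags rows whose features match but whose Age differs, which is one possible reason the feature-only count can be lower than the full-row count:

```python
import pandas as pd

# Toy stand-in for train (made-up values, not the competition data)
toy = pd.DataFrame({
    'id': [0, 1, 2, 3],
    'Length': [1.2, 1.2, 1.5, 1.5],
    'Weight': [20.0, 20.0, 30.0, 30.0],
    'Age': [9, 9, 11, 12],
})

# Rows that repeat an earlier row once the id column is ignored
n_dup_rows = int(toy.drop(columns=['id']).duplicated().sum())

# Rows that repeat an earlier row when only the features are compared
n_dup_features = int(toy.drop(columns=['id', 'Age']).duplicated().sum())

# Distinct rows that share the same features but carry different Age values
distinct_rows = toy.drop(columns=['id']).drop_duplicates()
n_conflicting = int(distinct_rows.duplicated(subset=['Length', 'Weight'], keep=False).sum())

print(n_dup_rows, n_dup_features, n_conflicting)
```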

 

 

2. Check the data distribution

  • As a baseline, check the distributions of the train and test data and the correlations between columns
  • Competition data in particular comes with train and test split into separate files, so keep this in mind when checking
  • The data in this post consists of three datasets: train, original, and test.
    • The train data is presumed to have been derived from the original data, so for an accurate analysis we compare the overall distribution and condition of the train and original data.
fig, axes = plt.subplots(1, 2, figsize = (18, 8))

sns.kdeplot(ax = axes[0], data = train, x = 'Age', fill = True, color = 'steelblue').set_title('train data');
sns.kdeplot(ax = axes[1], data = original, x = 'Age', fill = True, color = 'orange').set_title('original data');
plt.show()

👉 The two datasets have largely the same distribution
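A visual comparison can be backed up with numbers by putting summary statistics side by side. This is only a sketch: the data is synthetic, and `compare_summary` is a hypothetical helper written for this example, not something from the original notebook:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the toy data is reproducible

# Toy stand-ins for two datasets (synthetic values, not the crab data)
toy_train = pd.DataFrame({'Length': rng.normal(1.3, 0.3, 500),
                          'Diameter': rng.normal(1.0, 0.2, 500)})
toy_original = pd.DataFrame({'Length': rng.normal(1.3, 0.3, 200),
                             'Diameter': rng.normal(1.0, 0.2, 200)})

def compare_summary(a, b, names=('train', 'original')):
    """Hypothetical helper: place selected describe() rows of two frames side by side."""
    stats = ['mean', 'std', 'min', '50%', 'max']
    return pd.concat({names[0]: a.describe().loc[stats],
                      names[1]: b.describe().loc[stats]}, axis=1)

summary = compare_summary(toy_train, toy_original)
print(summary.round(3))
```

The same helper could be called with train and test restricted to their shared feature columns to check the split mentioned above.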

 

 

3. Check the correlation between the data columns and the target

corr_train = train.drop(columns = ['id', 'Sex'], axis = 1).corr()
corr_original = original.drop(columns = ['Sex'], axis = 1).corr()

train_mask = np.triu(np.ones_like(corr_train, dtype = bool))
original_mask = np.triu(np.ones_like(corr_original, dtype = bool))

cmap = sns.diverging_palette(100, 7, s = 75, l = 40, n = 20, center = 'light', as_cmap = True)

fig, axes = plt.subplots(1, 2, figsize = (25, 10))
sns.heatmap(corr_train, annot = True, cmap = cmap, fmt = '.2f', center = 0,
            annot_kws = {'size':12}, ax = axes[0], mask = train_mask).set_title('train correlation')

sns.heatmap(corr_original, annot = True, cmap = cmap, fmt = '.2f', center = 0,
           annot_kws = {'size':12}, ax = axes[1], mask = original_mask).set_title('original correlation')

plt.show()

 

👉 The correlation coefficients of the two datasets are also similar

- Age has the highest correlation with Shell Weight
- Age has the lowest correlation with Shucked Weight
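Instead of reading the highest and lowest values off the heatmap, the correlations with the target can be sorted. A sketch on made-up numbers (the `toy` values are assumptions chosen only to make the ordering obvious, and they do not reproduce the real coefficients):

```python
import pandas as pd

# Toy stand-in (made-up values); Sex is kept to show that non-numeric columns are excluded
toy = pd.DataFrame({
    'Sex': ['I', 'M', 'F', 'M', 'F', 'I'],
    'Shell Weight': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    'Shucked Weight': [3.0, 1.0, 4.0, 1.0, 5.0, 9.0],
    'Age': [2, 4, 6, 8, 10, 12],
})

# Pearson correlation of every numeric feature with Age, strongest first
corr_with_age = (
    toy.select_dtypes(include='number')
       .corr()['Age']
       .drop('Age')
       .sort_values(ascending=False)
)
print(corr_with_age)
```

Using select_dtypes(include='number') is an alternative to dropping the Sex column by name before calling corr().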

 

 

 

4. Visualize the relationship between key columns and the target

4-1. Sex - Age

# Convert the Sex column to a categorical with categories ordered I/M/F (for visualization)
original['Sex'] = pd.Categorical(original['Sex'], categories = ['I', 'M', 'F'], ordered = True)

fig, axes = plt.subplots(1, 2, figsize = (15, 6))

sns.boxplot(ax = axes[0], data = train, x = 'Sex', y = 'Age').set_title('Competition Dataset')
sns.boxplot(ax = axes[1], data = original, x = 'Sex', y = 'Age').set_title('Original Dataset');

👉 When the sex is clearly identified (M and F), the Age distributions are similar
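The boxplot reading can be cross-checked with per-group statistics. A minimal sketch on a made-up frame (the `toy` values are assumptions, not the crab data):

```python
import pandas as pd

# Toy stand-in (made-up values, not the competition data)
toy = pd.DataFrame({
    'Sex': ['I', 'I', 'M', 'M', 'F', 'F'],
    'Age': [5, 7, 10, 12, 11, 13],
})

# Median, mean and count of Age per Sex, shown in I/M/F order like the plots
age_by_sex = (
    toy.groupby('Sex')['Age']
       .agg(['median', 'mean', 'count'])
       .reindex(['I', 'M', 'F'])
)
print(age_by_sex)
```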

 

4-2. Shell Weight - Age

fig, axes = plt.subplots(1, 2, figsize = (15, 6))

sns.scatterplot(ax = axes[0], data = train, x = 'Shell Weight', y = 'Age', color = 'steelblue').set_title('Competition Dataset')
sns.scatterplot(ax = axes[1], data = original, x = 'Shell Weight', y = 'Age', color = 'orange').set_title('Original Dataset');

👉 Positive correlation

 

 

4-3. Diameter - Age

fig, axes = plt.subplots(1, 2, figsize = (15, 6))

sns.scatterplot(ax = axes[0], data = train, x = 'Diameter', y = 'Age', color = 'steelblue').set_title('Competition Dataset')
sns.scatterplot(ax = axes[1], data = original, x = 'Diameter', y = 'Age', color = 'orange').set_title('Original Dataset');

👉 Positive correlation (linear)
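One way to put a number on a linear trend like this is a least-squares line together with the Pearson coefficient. The sketch below uses made-up arrays that are exactly linear (an assumption for illustration; the real scatter is noisier):

```python
import numpy as np

# Made-up, exactly linear values (not the crab data)
diameter = np.array([1.0, 2.0, 3.0, 4.0])
age = np.array([3.0, 5.0, 7.0, 9.0])

# Degree-1 least-squares fit: age is approximated by slope * diameter + intercept
slope, intercept = np.polyfit(diameter, age, deg=1)

# Pearson correlation coefficient between the two arrays
r = np.corrcoef(diameter, age)[0, 1]

print(slope, intercept, r)
```

On the real data the arrays would come from `train['Diameter']` and `train['Age']`, and a value of r well below 1 would indicate scatter around the fitted line.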

 


๋‹ค์Œ ํฌ์ŠคํŒ…์—์„œ๋Š” ์ด๋ฅผ ๋ฐ”ํƒ•์œผ๋กœ ๋‹ค์–‘ํ•œ ๋ชจ๋ธ์„ ๊ธฐ๋ฐ˜์œผ๋กœํ•œ baseline code๋ฅผ ์งœ๋ณด๋Š” ๊ฒƒ์œผ๋กœ ์ด์–ด์ง„๋‹ค!
