๋ณธ๋ฌธ ๋ฐ”๋กœ๊ฐ€๊ธฐ
Machine Learning/Case Study ๐Ÿ‘ฉ๐Ÿป‍๐Ÿ’ป

[ํ•ด์™ธ ๋ถ€๋™์‚ฐ ์›”์„ธ ์˜ˆ์ธก(2)] One-Hot Encoding & Ridge Regression

by ISLA! 2023. 9. 15.

 

Library Import

import os
import numpy as np
import random

from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import Ridge

 

Data pre-processing : One-Hot Encoding

  • ๋ช…๋ชฉํ˜• ๋ณ€์ˆ˜์˜ ๊ฒฝ์šฐ ๊ฐ’๋“ค ๊ฐ๊ฐ์„ ์ƒˆ๋กœ์šด ์ปฌ๋Ÿผ์œผ๋กœ ๋งŒ๋“ค๊ณ  ๐Ÿ‘‰ ์›๋ž˜ ํ•ด๋‹นํ•˜๋˜ ๊ฐ’์—๋Š” 1์„, ์•„๋‹ ๊ฒฝ์šฐ 0์„ ๋ถ€์—ฌ
  • ์ด์ค‘ For ๋ฌธ์„ ์“ด ์ด์œ  : One Hot Encoder๊ฐ€ Test ๋ฐ์ดํ„ฐ๋กœ๋ถ€ํ„ฐ Fitting๋˜๋Š” ๊ฒƒ์€ Data Leakage์ด๋ฏ€๋กœ, 
    Test ๋ฐ์ดํ„ฐ์—๋Š” Train ๋ฐ์ดํ„ฐ๋กœ Fitting๋œ One Hot Encoder๋กœ๋ถ€ํ„ฐ transform๋งŒ ์ˆ˜ํ–‰๋˜์–ด์•ผ ํ•œ๋‹ค!
# ์งˆ์  ์ปฌ๋Ÿผ ์›ํ•ซ์ธ์ฝ”๋”ฉ

qual_col = ['propertyType','suburbName']
ohe = OneHotEncoder(sparse=False)

for i in qual_col :
    train_x = pd.concat([train_x, pd.DataFrame(ohe.fit_transform(train_x[[i]]), columns = ohe.categories_[0])], axis = 1)
    
    for qual_value in np.unique(test_x[i]):
        if qual_value not in np.unique(ohe.categories_):
            ohe.categories_ = np.append(ohe.categories_, qual_value)
            
    # One Hot Encoder๊ฐ€ Test ๋ฐ์ดํ„ฐ๋กœ๋ถ€ํ„ฐ Fitting๋˜๋Š” ๊ฒƒ์€ Data Leakage์ด๋ฏ€๋กœ, 
    # Test ๋ฐ์ดํ„ฐ์—๋Š” Train ๋ฐ์ดํ„ฐ๋กœ Fitting๋œ One Hot Encoder๋กœ๋ถ€ํ„ฐ transform๋งŒ ์ˆ˜ํ–‰๋˜์–ด์•ผ ํ•œ๋‹ค!!
    test_x = pd.concat([test_x, pd.DataFrame(ohe.transform(test_x[[i]]), columns = ohe.categories_[0])], axis = 1)

train_x = train_x.drop(qual_col, axis = 1)
test_x = test_x.drop(qual_col, axis=1)
print('Done')

๊ฒฐ๊ณผ ํ™•์ธ

 

Model Hyperparameter Setting

  • Ridge Regression ๋ชจ๋ธ์—์„œ๋Š” alpha๋ฅผ Hyperparameter๋กœ ์ œ๊ณตํ•˜๊ณ  ์žˆ๋‹ค
  • alpha๋Š” ๋ชจ๋ธ์˜ ๊ทœ์ œํ•ญ์œผ๋กœ, ๋ชจ๋ธ์˜ ์˜ค๋ฒ„ํ”ผํŒ…์„ ๋ฐฉ์ง€ํ•˜๋Š” ์—ญํ• ์„ ํ•œ๋‹ค
Model = Ridge(alpha = 1.0)

Model.fit(train_x, train_y)
preds = Model.predict(test_x)

submit['monthlyRent(us_dollar)'] = preds
submit.head()

 

728x90