Machine Learning/scikit-learn

Decision Tree Practice - Classifying the Human Activity Recognition Dataset

by ISLA! 2023. 8. 21.

 

🧑🏻‍💻 Example Overview

  • Motion data was collected from 30 participants wearing smartphone sensors
  • A decision tree is used to predict which activity each record corresponds to

 


๋ฐ์ดํ„ฐ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ : feature.txt ํŒŒ์ผ ๋กœ๋“œ

ํ”ผ์ฒ˜ ์ธ๋ฑ์Šค์™€ ํ”ผ์ฒ˜๋ช…์„ ๊ฐ€์ง€๊ณ  ์žˆ์œผ๋ฏ€๋กœ ์ด๋ฅผ DataFrame์œผ๋กœ ๋กœ๋”ฉํ•˜์—ฌ, ํ”ผ์ฒ˜ ๋ช…์นญ ํ™•์ธ

import pandas as pd
DATA_PATH = '/content/drive/MyDrive/data'

feature_name_df = pd.read_csv(DATA_PATH + '/human_activity/features.txt', sep=r'\s+',
                        header=None, names=['column_index', 'column_name'])
                        
feature_name_df.head(1)

Result

 

Dropping the feature index and keeping only the feature names

  • Feature names: means/standard deviations of body-movement attributes, measured along the x, y, and z axes
  • However, some feature names are duplicated, so they need to be handled
feature_name = feature_name_df.iloc[:, 1].values.tolist()
feature_name[:10]

 

์ค‘๋ณต๋œ ํ”ผ์ฒ˜๋ช… ํ™•์ธ

  • The code below shows that there are as many as 42 duplicated feature names
  • Briefly checking just five of the duplicated names gives the result below
  • To handle the duplicates, we will write a function that renames them by appending _1, _2, ... to the original feature name
feature_dup_df = feature_name_df.groupby('column_name').count()
print(feature_dup_df[feature_dup_df['column_index'] > 1].count())

feature_dup_df[feature_dup_df['column_index'] > 1].head()

 

 
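An equivalent check with value_counts, just as a sketch (not part of the original post):

# count how often each feature name appears and keep only the names that repeat
name_counts = feature_name_df['column_name'].value_counts()
print(name_counts[name_counts > 1].shape[0])  # should print 42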

์ค‘๋ณต๋œ ํ”ผ์ฒ˜๋ช… ์ฒ˜๋ฆฌ ํ•จ์ˆ˜

  • Define a function that takes the original DataFrame and fixes the duplicated feature names
    • feature_dup_df = group the original DataFrame by column_name and take the cumulative count within each group >> stored in a dup_cnt column
    • Reset the index of feature_dup_df
    • new_feature_name_df = merge the original DataFrame with feature_dup_df (outer join)
    • new_feature_name_df['column_name'] = select only the column_name and dup_cnt columns and apply a lambda
      • If dup_cnt is greater than 0 (i.e. the name is a duplicate), join column_name and dup_cnt with an underscore
      • Otherwise (not a duplicate), return column_name unchanged
    • Drop the index column from new_feature_name_df
    • Return new_feature_name_df
def get_new_feature_name_df(old_feature_name_df):
    feature_dup_df = pd.DataFrame(data = old_feature_name_df.groupby('column_name').cumcount(), columns = ['dup_cnt'])
    feature_dup_df = feature_dup_df.reset_index()

    new_feature_name_df = pd.merge(old_feature_name_df.reset_index(), feature_dup_df, how = 'outer')

    # Append _1, _2, ... to duplicated feature names
    new_feature_name_df['column_name'] = new_feature_name_df[['column_name', 'dup_cnt']].apply(lambda x : x[0] + '_' + str(x[1]) if x[1] > 0 else x[0], axis = 1)
    new_feature_name_df = new_feature_name_df.drop(['index'], axis = 1)
    return new_feature_name_df

function_test = get_new_feature_name_df(feature_name_df)
function_test.sample(5)

Function test result

 
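As a quick sanity check, the renaming can also be tried on a tiny hand-made DataFrame (a minimal sketch; the toy column names below are made up for illustration):

# hypothetical toy frame containing one duplicated feature name
toy_df = pd.DataFrame({'column_index': [1, 2, 3],
                       'column_name': ['angle', 'angle', 'tBodyAcc-mean']})

# the second 'angle' should come back renamed to 'angle_1'
print(get_new_feature_name_df(toy_df))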

train ๋ฐ์ดํ„ฐ์™€ test ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ : ํ•จ์ˆ˜

  • Load the train and test data
  • Apply the duplicate-name handling function defined above
  • Extract the column names as a list
  • Build the training and test data sets
import pandas as pd

def get_human_dataset( ):

    # ๊ฐ ๋ฐ์ดํ„ฐ ํŒŒ์ผ๋“ค์€ ๊ณต๋ฐฑ์œผ๋กœ ๋ถ„๋ฆฌ๋˜์–ด ์žˆ์œผ๋ฏ€๋กœ read_csv์—์„œ ๊ณต๋ฐฑ ๋ฌธ์ž๋ฅผ sep์œผ๋กœ ํ• ๋‹น.
    DATA_PATH = '/content/drive/MyDrive/data'
    feature_name_df = pd.read_csv(DATA_PATH + '/human_activity/features.txt',sep='\s+',
                        header=None,names=['column_index','column_name'])

    # ์ค‘๋ณต๋œ ํ”ผ์ฒ˜๋ช…์„ ์ˆ˜์ •ํ•˜๋Š” get_new_feature_name_df()๋ฅผ ์ด์šฉ, ์‹ ๊ทœ ํ”ผ์ฒ˜๋ช… DataFrame์ƒ์„ฑ.
    new_feature_name_df = get_new_feature_name_df(feature_name_df)

    # DataFrame์— ํ”ผ์ฒ˜๋ช…์„ ์ปฌ๋Ÿผ์œผ๋กœ ๋ถ€์—ฌํ•˜๊ธฐ ์œ„ํ•ด ๋ฆฌ์ŠคํŠธ ๊ฐ์ฒด๋กœ ๋‹ค์‹œ ๋ณ€ํ™˜
    feature_name = new_feature_name_df.iloc[:, 1].values.tolist()

    # ํ•™์Šต ํ”ผ์ฒ˜ ๋ฐ์ดํ„ฐ ์…‹๊ณผ ํ…Œ์ŠคํŠธ ํ”ผ์ฒ˜ ๋ฐ์ดํ„ฐ์„ DataFrame์œผ๋กœ ๋กœ๋”ฉ. ์ปฌ๋Ÿผ๋ช…์€ feature_name ์ ์šฉ
    X_train = pd.read_csv(DATA_PATH + '/human_activity/train/X_train.txt',sep='\s+', names=feature_name )
    X_test = pd.read_csv(DATA_PATH + '/human_activity/test/X_test.txt',sep='\s+', names=feature_name)

    # ํ•™์Šต ๋ ˆ์ด๋ธ”๊ณผ ํ…Œ์ŠคํŠธ ๋ ˆ์ด๋ธ” ๋ฐ์ดํ„ฐ์„ DataFrame์œผ๋กœ ๋กœ๋”ฉํ•˜๊ณ  ์ปฌ๋Ÿผ๋ช…์€ action์œผ๋กœ ๋ถ€์—ฌ
    y_train = pd.read_csv(DATA_PATH + '/human_activity/train/y_train.txt',sep='\s+',header=None,names=['action'])
    y_test = pd.read_csv(DATA_PATH + '/human_activity/test/y_test.txt',sep='\s+',header=None,names=['action'])

    # ๋กœ๋“œ๋œ ํ•™์Šต/ํ…Œ์ŠคํŠธ์šฉ DataFrame์„ ๋ชจ๋‘ ๋ฐ˜ํ™˜
    return X_train, X_test, y_train, y_test


X_train, X_test, y_train, y_test = get_human_dataset()

 

 

train ๋ฐ์ดํ„ฐ์…‹ ํ™•์ธ

  • About 7,000 records with 561 features (columns)
  • All features are floating-point numbers, so no separate categorical encoding is needed (a quick check is sketched below)

 

Checking the label (target) dataset

  • The labels take six values, 1 through 6, and are fairly evenly distributed (see the sketch below for a quick check)
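A quick way to inspect the label distribution (a sketch, not shown in the original post):

# count how many records fall into each of the six activity labels
print(y_train['action'].value_counts())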

 

 

Classifying the activities with DecisionTreeClassifier

  • First, train with the default hyper-parameters (leaving them untouched) and print the resulting accuracy and parameter values
  • The accuracy is about 85.48%
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Set random_state so that repeated runs produce the same predictions
dt_clf = DecisionTreeClassifier(random_state=156)
dt_clf.fit(X_train , y_train)
pred = dt_clf.predict(X_test)
accuracy = accuracy_score(y_test , pred)
print('Decision tree prediction accuracy: {0:.4f}'.format(accuracy))

# DecisionTreeClassifier์˜ ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ ์ถ”์ถœ
print('DecisionTreeClassifier ๊ธฐ๋ณธ ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ:\n', dt_clf.get_params())

Result

Decision tree prediction accuracy: 0.8548
DecisionTreeClassifier default hyper-parameters:
 {'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'random_state': 156, 'splitter': 'best'}

 

Finding the optimal parameters with GridSearchCV

from sklearn.model_selection import GridSearchCV

params = {
    'max_depth' : [6, 16, 24],
    'min_samples_split': [16]
}

grid_cv = GridSearchCV(dt_clf, param_grid = params, scoring = 'accuracy', cv = 5, verbose = 1)
grid_cv.fit(X_train, y_train)

Checking the result

  • The best cross-validation accuracy comes out to about 84.86%
print('Best CV accuracy:', grid_cv.best_score_)
print('Best parameters:', grid_cv.best_params_)
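Note that best_score_ is the cross-validation accuracy on the training data. To compare it with the earlier 85.48% test accuracy, one would evaluate the refitted best estimator on the test set (a sketch, not shown in the original post):

# GridSearchCV refits the best estimator on the full training set by default (refit=True)
best_dt = grid_cv.best_estimator_
best_pred = best_dt.predict(X_test)
print('Test accuracy with the best parameters: {0:.4f}'.format(accuracy_score(y_test, best_pred)))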

 
