
[kaggle] Pima Indians Diabetes Prediction

by ISLA! 2023. 9. 29.

 

๋ณธ ํฌ์ŠคํŒ…์€ <ํŒŒ์ด์ฌ ๋จธ์‹ ๋Ÿฌ๋‹ ์™„๋ฒฝ ๊ฐ€์ด๋“œ>์˜ 3์žฅ ๋‚ด์šฉ์„ ์ฐธ๊ณ ํ•˜์˜€์Šต๋‹ˆ๋‹ค.

์ „์ฒด์ ์ธ ๋ฐ์ดํ„ฐ๋ฅผ ์‚ดํ”ผ๊ณ , ๋กœ์ง€์Šคํ‹ฑ ํšŒ๊ท€ ๋ชจํ˜• ๊ฒฐ๊ณผ๋ฅผ ๊ต์ •ํ•˜๋Š” ๋‚ด์šฉ์„ ๋‹ด๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.


1. Load the data and check the outcome distribution

from sklearn.linear_model import LogisticRegression
import pandas as pd

diabetes_data = pd.read_csv('diabetes.csv')
print(diabetes_data['Outcome'].value_counts())
diabetes_data.head(3)

 

diabetes_data.info()

 

 

2. Define the evaluation function

  • Prints accuracy, precision, recall, F1, ROC-AUC, and the confusion matrix
  • Input parameters: y_test, the predicted labels (pred), and the predicted probability of class 1 (pred_proba)
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_auc_score

def get_clf_eval(y_test, pred=None, pred_proba=None):
    confusion = confusion_matrix( y_test, pred)
    accuracy = accuracy_score(y_test , pred)
    precision = precision_score(y_test , pred)
    recall = recall_score(y_test , pred)
    f1 = f1_score(y_test,pred)
    
    # Added: ROC-AUC
    roc_auc = roc_auc_score(y_test, pred_proba)
    
    print('Confusion matrix')
    print(confusion)
    
    # Added: ROC-AUC in the printed summary
    print('Accuracy: {0:.4f}, Precision: {1:.4f}, Recall: {2:.4f}, '
          'F1: {3:.4f}, AUC: {4:.4f}'.format(accuracy, precision, recall, f1, roc_auc))
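
To make the printed numbers easier to read, here is a minimal sketch of how precision and recall fall out of the confusion matrix. The labels are made-up toy values, not the Pima data.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels (hypothetical, for illustration only)
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]

# For binary labels, scikit-learn lays the matrix out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision_by_hand = tp / (tp + fp)  # of the predicted positives, how many are right
recall_by_hand = tp / (tp + fn)     # of the actual positives, how many are found

print(tn, fp, fn, tp)
print(precision_by_hand, precision_score(y_true, y_pred))
print(recall_by_hand, recall_score(y_true, y_pred))
```

The hand-computed values should match what precision_score and recall_score return.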

 

 

3. First model (logistic regression)

  • Split the data into a feature set and a label set
  • Split into training and test data with train_test_split
  • Train, predict, and evaluate with logistic regression ➡ get the predicted labels and the predicted probability of class 1
  • Check the results with the evaluation function defined in step 2
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Extract the feature data set (X) and the label data set (y)
X = diabetes_data.iloc[:, :-1]
y = diabetes_data.iloc[:, -1]

# Split into training and test data (imbalanced data -> stratified sampling)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=156, stratify = y)

# Train, predict, and evaluate with logistic regression
lr_clf = LogisticRegression(solver='liblinear')
lr_clf.fit(X_train, y_train)
pred = lr_clf.predict(X_test)
pred_proba = lr_clf.predict_proba(X_test)[:, 1]

get_clf_eval(y_test, pred, pred_proba)

Results of the first model
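
The stratify=y argument in the split is what keeps the class ratio the same in the training and test data. A small sketch on made-up labels (35 positives out of 100, an assumed ratio chosen only for illustration) shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels (hypothetical): 35 positives out of 100
y_toy = np.array([1] * 35 + [0] * 65)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=156, stratify=y_toy
)

# With stratify, both splits keep the original share of positives
print(y_tr.mean(), y_te.mean())
```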

 

4. Visualizing precision and recall as the threshold changes (function)

  • Uses precision_recall_curve
  • Define the plotting function: its parameters are y_test and the predicted probability of class 1
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from sklearn.metrics import precision_recall_curve
import numpy as np

def precision_recall_curve_plot(y_test, pred_proba_c1):
    # Extract the threshold ndarray and the precision and recall ndarrays for those thresholds.
    precisions, recalls, thresholds = precision_recall_curve(y_test, pred_proba_c1)
    
    plt.figure(figsize = (8, 6))
    threshold_boundary = thresholds.shape[0]
    plt.plot(thresholds, precisions[0:threshold_boundary], linestyle='--', label = 'precision')
    plt.plot(thresholds, recalls[0:threshold_boundary], label = 'recall')
    
    start, end = plt.xlim()
    plt.xticks(np.round(np.arange(start, end, 0.1), 2))
    
    plt.xlabel('threshold value')
    plt.ylabel('Precision and Recall Value')
    plt.legend()
    plt.grid()
    plt.show()
  • Applying the function
pred_proba_c1 = lr_clf.predict_proba(X_test)[:, 1]

precision_recall_curve_plot(y_test, pred_proba_c1)

At the point where the two curves cross, precision and recall are both below 0.8, which is on the low side, so let's go back and review the full data set.
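
The crossing point can also be located numerically instead of being read off the plot. The sketch below is one way to do it, run on synthetic data from make_classification as a stand-in (an assumption, since the Pima CSV is not bundled here); the names X_syn, y_syn, and gap are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the Pima dataset)
X_syn, y_syn = make_classification(n_samples=500, weights=[0.65, 0.35], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0, stratify=y_syn
)

clf = LogisticRegression(solver='liblinear').fit(X_tr, y_tr)
proba_c1 = clf.predict_proba(X_te)[:, 1]
precisions, recalls, thresholds = precision_recall_curve(y_te, proba_c1)

# precisions/recalls have one more element than thresholds, so drop the last one
gap = np.abs(precisions[:-1] - recalls[:-1])
idx = int(np.argmin(gap))

print('threshold: {0:.3f}, precision: {1:.3f}, recall: {2:.3f}'.format(
    thresholds[idx], precisions[idx], recalls[idx]))
```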

 

5. Re-checking the data distribution

diabetes_data.describe()

 

๐Ÿ‘‰ ์ผ๋ถ€ ์ปฌ๋Ÿผ์˜ ์ตœ์†Ÿ๊ฐ’์ด 0์ธ ๊ฒƒ์„ ํ™•์ธ : 0์ธ ๊ฒƒ์ด ๋ถˆ๊ฐ€๋Šฅํ•œ ์ปฌ๋Ÿผ์˜ ์ด์ƒ์น˜๋กœ ํŒ๋‹จ

๐Ÿ‘‰ ํžˆ์Šคํ† ๊ทธ๋žจ์„ ๊ทธ๋ ค์„œ ์ข€ ๋” ์ž์„ธํžˆ ํŒŒ์•…ํ•ด๋ณด์ž.(Glucose ์ปฌ๋Ÿผ)

plt.hist(diabetes_data['Glucose'], bins = 100)
plt.show()

👉 There are clearly values at 0

👉 The 0 values need handling ->> replace them with the mean

 

6. Share of 0 values in the columns that contain them (relative to all rows)

  • List the names of the features that cannot legitimately be 0
  • Count the total number of rows
  • Count the 0 values in each of those features
  • Print the share of 0 values relative to the total
# Names of the features to check for 0 values
zero_features = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

# Total number of rows
total_count = diabetes_data['Glucose'].count()

# Loop over the features, count the rows where the value is 0, and compute the percentage
for feature in zero_features:
    zero_count = diabetes_data[diabetes_data[feature] == 0][feature].count()
    print('{0}: {1} zero values, {2:.2f}% of rows'.format(feature, zero_count, 100*zero_count/total_count))
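
The same counts can be obtained without an explicit loop. This is a minimal sketch on a toy frame with made-up values; applying it to the real data would mean using diabetes_data[zero_features] in place of toy.

```python
import pandas as pd

# Toy frame (hypothetical values) with 0 standing in for missing measurements
toy = pd.DataFrame({
    'Glucose': [0, 100, 140, 120],
    'Insulin': [0, 0, 90, 0],
})

# (toy == 0) is a boolean frame; sum() counts the zeros and mean() gives their share
zero_counts = (toy == 0).sum()
zero_percent = (toy == 0).mean() * 100

print(zero_counts)
print(zero_percent)
```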

 

 

7. Replace the 0 values with the column mean

  • Compute the mean of each column that needs its 0 values replaced (zero_features)
  • Use replace() to swap the 0 values in zero_features for those means
# Replace 0 values with the mean!
mean_zero_features = diabetes_data[zero_features].mean()

diabetes_data[zero_features] = diabetes_data[zero_features].replace(0, mean_zero_features)
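
One detail worth seeing on a small example: the mean used above is computed while the 0 values are still in the column, so it is pulled downward. The sketch below uses a toy frame with made-up values and compares that with a variant that marks 0 as missing first. The variant is shown only for comparison and is not what this post does.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values)
toy = pd.DataFrame({'Glucose': [0.0, 100.0, 140.0], 'BMI': [30.0, 0.0, 20.0]})

# As in this post: the column mean is taken with the zeros still included
filled_with_zeros_in_mean = toy.replace(0, toy.mean())

# Variant (not what the post does): mark 0 as missing first, so the mean ignores it
as_missing = toy.replace(0, np.nan)
filled_ignoring_zeros = as_missing.fillna(as_missing.mean())

print(filled_with_zeros_in_mean)
print(filled_ignoring_zeros)
```

In the toy Glucose column the first approach fills in 80 (the mean of 0, 100, 140), while the variant fills in 120 (the mean of 100 and 140).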

8. Apply StandardScaler, then retry logistic regression

  • ✋ For logistic regression it is generally a good idea to scale numeric features
  • Split into X and y again and apply StandardScaler to X
  • Split into train and test data sets
  • Train, predict, and evaluate with logistic regression (evaluation uses the function defined above: get_clf_eval)
from sklearn.preprocessing import StandardScaler

X = diabetes_data.iloc[:, :-1]
y = diabetes_data.iloc[:, -1]

# Scale the whole feature data set at once with the StandardScaler class
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size = 0.2, random_state=156, stratify=y)

# Train, predict, and evaluate with logistic regression
lr_clf = LogisticRegression()
lr_clf.fit(X_train, y_train)
pred = lr_clf.predict(X_test)
pred_proba = lr_clf.predict_proba(X_test)[:, 1]

get_clf_eval(y_test, pred, pred_proba)

Results of the second model
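
As a quick illustration of what StandardScaler does to each column, here is a minimal sketch on a toy array with two features on very different scales (the values are made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy features on very different scales (hypothetical values)
X_toy = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])

X_toy_scaled = StandardScaler().fit_transform(X_toy)

# Each column is shifted to mean 0 and rescaled to standard deviation 1
print(X_toy_scaled.mean(axis=0))
print(X_toy_scaled.std(axis=0))
```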

 

9. Addressing the low recall: vary the threshold and check the results

  • Use the Binarizer class to vary the threshold, which changes the predicted labels,
  • and write a function that reports the evaluation results for each threshold
from sklearn.preprocessing import Binarizer

def get_eval_by_thresholds(y_test, pred_proba_c1, thresholds):
    for custom_threshold in thresholds:
        binarizer = Binarizer(threshold = custom_threshold).fit(pred_proba_c1)
        custom_predict = binarizer.transform(pred_proba_c1)
        print('Threshold: ', custom_threshold)
        get_clf_eval(y_test, custom_predict, pred_proba_c1)
        print('\n')
  • Applying the function
thresholds = [0.3, 0.33, 0.36, 0.39, 0.42, 0.45, 0.48, 0.50]

pred_proba = lr_clf.predict_proba(X_test)
get_eval_by_thresholds(y_test, pred_proba[:, 1].reshape(-1, 1), thresholds)
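
For reference, Binarizer here amounts to a plain comparison against the threshold. The sketch below checks that on a few made-up probabilities; note that a value exactly equal to the threshold is mapped to 0.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Toy class-1 probabilities (hypothetical), shaped as a column like in the post
proba_toy = np.array([0.20, 0.48, 0.60]).reshape(-1, 1)

by_binarizer = Binarizer(threshold=0.48).fit_transform(proba_toy)

# Binarizer maps values strictly greater than the threshold to 1, everything else to 0
by_comparison = (proba_toy > 0.48).astype(float)

print(by_binarizer.ravel())
print(by_comparison.ravel())
```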

 

 

10. Find a point where precision and recall are both reasonably high (threshold 0.48)

-->> Run the final logistic regression with a threshold of 0.48 🚀

# Create a Binarizer with the threshold set to 0.48
binarizer = Binarizer(threshold=0.48)
pred_th_048 = binarizer.fit_transform(pred_proba[:, 1].reshape(-1, 1))

get_clf_eval(y_test, pred_th_048, pred_proba[:, 1])

Final modeling results

 
