
[Kaggle] E-commerce Data Analysis 2 (CRM Analytics 🛍️🛒)

by ISLA! 2023. 10. 8.

🚧 This post continues from E-commerce Data Analysis 1!

So far, we have taken an overall look at the data types and the distribution by country. We also defined a function to briefly check for missing values and duplicates.

Now we move on to descriptive statistics, which look at the data in more detail.


 

Descriptive Statistics

We define a desc_stats function that displays the descriptive statistics.

  • df.describe().T computes the descriptive statistics of df and transposes the result.
  • pd.DataFrame() turns these statistics into a new DataFrame, desc_df.
  • matplotlib's f, ax creates the Figure and Axes objects, and sns draws the heatmap.
    • annot_kws : font size of the numbers shown in the cells
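import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def desc_stats(df):
    # Compute the statistics once and transpose so each variable is a row
    stats = df.describe().T
    desc_df = pd.DataFrame(stats, index=df.columns, columns=stats.columns)

    # Scale the figure height with the number of variables
    f, ax = plt.subplots(figsize=(10, desc_df.shape[0] * 0.81))

    # Annotated heatmap; annot_kws sets the font size of the numbers
    sns.heatmap(desc_df, annot=True, cmap='Greens', fmt='.2f', ax=ax, linecolor='white',
                linewidths=1.1, cbar=False, annot_kws={'size': 12})
    plt.xticks(size=18)
    plt.yticks(size=14, rotation=0)
    plt.title('Descriptive Statistics', size=14)
    plt.show()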
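
To make the table behind the heatmap easier to picture, here is a minimal sketch of what describe().T returns. The toy DataFrame and its values are made up for illustration and are not the Kaggle data; only the column names are borrowed from it.

```python
import pandas as pd

# Toy data (hypothetical values, not the Kaggle dataset)
toy = pd.DataFrame({
    "Quantity": [1, 2, 3, 4],
    "UnitPrice": [2.5, 3.0, 1.0, 4.5],
})

# describe() puts the statistics in rows; .T flips the table so that
# each variable becomes a row and each statistic becomes a column
desc = toy.describe().T
print(desc)
```

Each row of this table becomes one row of the heatmap, which is why the figure height in desc_stats is tied to desc_df.shape[0].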

[Image: intermediate check]

 

👉 This makes it possible to check the descriptive statistics of the numeric data visually.

[Image: final result]

 

👉 Let's interpret the result and summarize the analysis so far.

📌 Quantity and UnitPrice clearly contain outliers, which need to be handled.
📌 UnitPrice has negative values (due to cancelled orders)
📌 Customer ID and Description have missing values (checked in the previous post)
📌 Quantity and UnitPrice need to be multiplied to create Total Price!

 


๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ : Data Preprocessing

๐Ÿ“Œ ์ œํ’ˆ ํŒ๋งค ์ˆ˜๋Ÿ‰(Quantity)๊ณผ ๋‹จ๊ฐ€(UnitPrice)์— ์ด์ƒ์น˜(outliers)๊ฐ€ ๋ช…ํ™•ํ•˜๊ฒŒ ๋ณด์ด๋ฉฐ ์ฒ˜๋ฆฌ๋˜์–ด์•ผ ํ•จ.

  • ์ด์ƒ์น˜๋ฅผ IQR๋กœ ์ œ๊ฑฐํ•˜๋Š” ํ•จ์ˆ˜๋ฅผ ์ •์˜ํ•ด ๋ณด์ž
  • ๋งค๊ฐœ๋ณ€์ˆ˜๋กœ dataframe, ์ œ๊ฑฐํ•˜๊ณ ์ž ํ•˜๋Š” ๋ณ€์ˆ˜, q1, q3 ๊ฐ’์„ ์ง€์ •ํ•œ๋‹ค.
  • ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„์„ ๋ณต์‚ฌํ•œ ํ›„, quantile() ํ•จ์ˆ˜๋กœ q1, q3๋ฅผ ์ฐพ๋Š”๋‹ค.
  • q3 - q1์œผ๋กœ IQR์„ ์ฐพ๊ณ , up_limit, low_limit์„ ์ง€์ •ํ•œ๋‹ค.
  • ๋งˆ์ง€๋ง‰์œผ๋กœ, ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์—์„œ up_limit, low_limit์„ ๋ฒ—์–ด๋‚œ ๊ฐ’์„  up_limit, low_limit์œผ๋กœ ๋Œ€์ฒดํ•˜๋„๋ก ํ•œ๋‹ค!
    • df_. loc [ ์กฐ๊ฑด, ์กฐ๊ฑด์„ ๋งŒ์กฑํ•˜๋Š” ํ–‰์˜ variable ์—ด ] = ํ• ๋‹นํ•  ์ƒˆ ๊ฐ’
def replace_with_thresholds(df, variable, q1=0.25, q3=0.75):
    # Work on a copy so the caller's DataFrame is left untouched
    df_ = df.copy()
    quartile1 = df_[variable].quantile(q1)
    quartile3 = df_[variable].quantile(q3)
    iqr = quartile3 - quartile1

    up_limit = quartile3 + 1.5 * iqr
    low_limit = quartile1 - 1.5 * iqr
    # Important: cap values in the 'variable' column that fall below
    # low_limit or above up_limit at the corresponding limit
    df_.loc[df_[variable] < low_limit, variable] = low_limit
    df_.loc[df_[variable] > up_limit, variable] = up_limit

    return df_
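
As a worked example of the thresholds, the sketch below runs the same arithmetic on a made-up column. Note that it uses Series.clip rather than the .loc assignments above; that is a swapped-in alternative which should give the same capped values, and the toy numbers are purely illustrative.

```python
import pandas as pd

# Toy column with one obvious outlier (hypothetical values)
toy = pd.DataFrame({"Quantity": [1.0, 2.0, 3.0, 4.0, 100.0]})

q1 = toy["Quantity"].quantile(0.25)   # 2.0
q3 = toy["Quantity"].quantile(0.75)   # 4.0
iqr = q3 - q1                         # 2.0
low_limit = q1 - 1.5 * iqr            # -1.0
up_limit = q3 + 1.5 * iqr             # 7.0

# clip() caps values outside [low_limit, up_limit] at the nearest limit
capped = toy["Quantity"].clip(lower=low_limit, upper=up_limit)
print(capped.tolist())  # [1.0, 2.0, 3.0, 4.0, 7.0]
```

The outlier 100 is pulled down to the upper limit 7, while the number of rows stays the same.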

 

 

📌 UnitPrice has negative values (due to cancelled orders)
      & Customer ID and Description have missing values

      & Quantity and UnitPrice need to be multiplied to create Total Price!

  • Let's define a function that performs the rest of the preprocessing.
  • Copy the data and, first of all, drop the missing values with dropna().
  • Remove cancelled orders (rows) : drop the rows whose InvoiceNo contains the string C
    • df_['InvoiceNo'].str.contains returns True for values that contain C
    • Prefixing ~ inverts the mask, so values that contain C become False
    • Only the rows that do not contain C (True) are kept and stored in df_
  • Keep only the rows whose order quantity is greater than 0.
  • Outliers are handled with the IQR function defined earlier. (Here q1 = 0.01 and q3 = 0.99, so the limits are built from the 1st and 99th percentiles and only extreme values are capped.)
  • Add a total sales amount column
# ๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ ํ•จ์ˆ˜
def ecommerce_preprocess(df):
    df_ = df.copy()
    
    # ๊ฒฐ์ธก์น˜ ์ œ๊ฑฐ
    df_ = df_.dropna()
    
    # ์ทจ์†Œ๋œ ์ฃผ๋ฌธ๊ณผ ์ฃผ๋ฌธ๋Ÿ‰
    #  'InvoiceNo' ์—ด์— 'C' ๋ฌธ์ž์—ด์„ ํฌํ•จํ•˜๋Š” ํ–‰์„ ์ œ๊ฑฐ(nan๊ฐ’์€ ๋ฌด์‹œ)
    df_ = df_[~df_['InvoiceNo'].str.contains('C', na = False)]
    # ์ฃผ๋ฌธ๋Ÿ‰์ด 0์ด ์•„๋‹Œ ๊ฒƒ๋งŒ ํ•„ํ„ฐ๋ง
    df_ = df_[df_['Quantity'] > 0]
    
    # ์ด์ƒ์น˜ ์ œ๊ฑฐ
    df_ = replace_with_thresholds(df_, 'Quantity', q1 = 0.01, q3 = 0.99)
    df_ = replace_with_thresholds(df_, 'UnitPrice', q1 = 0.01, q3 = 0.99)
    
    # ํŒ๋งค์ด์•ก ์ปฌ๋Ÿผ ์ถ”๊ฐ€
    df_['TotalPrice'] = df_['Quantity'] * df_['UnitPrice']
    
    return df_
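
The cancelled-order filter is the least obvious step, so here is a small sketch of how the boolean mask behaves. The invoice numbers and quantities are made up for illustration.

```python
import pandas as pd

# Toy invoices (hypothetical values); the 'C' prefix marks a cancellation
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "536380"],
    "Quantity": [6, -1, 2],
})

# True where the invoice number contains 'C'
is_cancelled = toy["InvoiceNo"].str.contains("C", na=False)
print(is_cancelled.tolist())  # [False, True, False]

# ~ inverts the mask, so only non-cancelled rows are kept
kept = toy[~is_cancelled]
print(kept["InvoiceNo"].tolist())  # ['536365', '536380']
```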

 

🧑‍💻 Run the preprocessing function, then check the descriptive statistics again

  • df.select_dtypes() restricts the descriptive statistics function to the float and integer columns only.
    • df.select_dtypes(include = [float, int])
df = ecommerce_preprocess(df)
desc_stats(df.select_dtypes(include = [float, int]))
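
For reference, a minimal sketch of what select_dtypes keeps, using a made-up frame. The include = 'number' form shown alongside is an alternative spelling, not what the post uses; on this toy frame both select the same columns.

```python
import pandas as pd

# Toy frame with one text column and two numeric columns (hypothetical values)
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536380"],
    "Quantity": [6, 2],
    "UnitPrice": [2.55, 3.39],
})

# Same call as in the post: keep float and integer columns only
numeric = toy.select_dtypes(include=[float, int])
print(list(numeric.columns))  # ['Quantity', 'UnitPrice']

# Alternative: 'number' selects every numeric dtype
numeric_alt = toy.select_dtypes(include="number")
print(list(numeric_alt.columns))  # ['Quantity', 'UnitPrice']
```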

[Image: before preprocessing]

 

👉 After preprocessing, the descriptive statistics for TotalPrice have been added, the minimum unit price is no longer negative, and the extreme outliers have been handled.

 

[Image: after preprocessing]

 
