
[Kaggle] E-commerce Data Analysis 1 (CRM Analytics 🛍️🛒)

by ISLA! 2023. 10. 8.

This post follows the Kaggle notebook below to study an e-commerce data analysis workflow.

https://www.kaggle.com/code/sercanyesiloz/crm-analytics


 

๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ import

import os
import datetime
# import squarify
import warnings
import pandas as pd 
import numpy as np
import datetime as dt
from operator import attrgetter
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import plotly.graph_objs as go
from plotly.offline import iplot
from sklearn.metrics import (silhouette_score,
                             calinski_harabasz_score,
                             davies_bouldin_score)
from lifetimes import BetaGeoFitter, GammaGammaFitter
from lifetimes.plotting import plot_period_transactions
%matplotlib inline
%load_ext nb_black
warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
palette = 'Set2'

 

๋ฐ์ดํ„ฐ์…‹ ๋กœ๋“œ ๋ฐ ํ™•์ธ

👉 Extra options are passed when loading the data. Besides the encoding, they also control the column data types.

  • encoding='unicode_escape' : a codec often used as a workaround when a csv file contains special characters that fail to decode as UTF-8
  • dtype = {'CustomerID':str, 'InvoiceDate':str} : reads each listed column as a string (str)
  • parse_dates = ['InvoiceDate'] : parses the InvoiceDate column into datetime objects (changes the data type)
  • infer_datetime_format = True : used together with parse_dates; infers the date and time format of the InvoiceDate column to speed up parsing. pandas 2.0 and later deprecate this argument because format inference is now the default.
df = pd.read_csv('./data.csv', encoding = 'unicode_escape',
                dtype = {'CustomerID':str,
                        'InvoiceDate':str},
                parse_dates = ['InvoiceDate'],
                infer_datetime_format = True)
df.head()
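To see what the dtype and parse_dates options change, here is a minimal sketch on a small in-memory CSV. The three columns and two rows are illustrative stand-ins, not the real file, and the other read_csv options are left out on purpose.

```python
import io

import pandas as pd

# Illustrative two-row CSV; the second row has no CustomerID.
csv_text = (
    "InvoiceNo,InvoiceDate,CustomerID\n"
    "536365,12/1/2010 8:26,17850\n"
    "536366,12/1/2010 8:28,\n"
)

sample = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"CustomerID": str},     # keep IDs as text instead of floats
    parse_dates=["InvoiceDate"],   # convert to datetime while reading
)

print(sample.dtypes)
print(sample)
```

Without dtype, the missing CustomerID would force the whole column to float (17850.0), which is one reason to read ID columns as strings.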

 

 

 

๋ฐ์ดํ„ฐ ๋ณ€์ˆ˜ ํ™•์ธ

InvoiceNo: 6์ž๋ฆฌ๋กœ ์ด๋ฃจ์–ด์ง„ ์†ก์žฅ ๋ฒˆํ˜ธ์ž…๋‹ˆ๋‹ค. ์ด ์ฝ”๋“œ๊ฐ€ 'c'๋กœ ์‹œ์ž‘ํ•˜๋ฉด ์ทจ์†Œ๋œ ๊ฑฐ๋ž˜์ž„์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.
StockCode: 5์ž๋ฆฌ ์ˆซ์ž๋กœ ์ด๋ฃจ์–ด์ง„ ์ œํ’ˆ ์ฝ”๋“œ์ž…๋‹ˆ๋‹ค.
Description: ์ œํ’ˆ ์ด๋ฆ„์ž…๋‹ˆ๋‹ค.
Quantity: ๊ฐ ์ œํ’ˆ์˜ ๊ฑฐ๋ž˜๋‹น ์ˆ˜๋Ÿ‰์ž…๋‹ˆ๋‹ค.
InvoiceDate: ๊ฐ ๊ฑฐ๋ž˜๊ฐ€ ์ƒ์„ฑ๋œ ๋‚ ์งœ์™€ ์‹œ๊ฐ„์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.
UnitPrice: ์ œํ’ˆ ๋‹จ๊ฐ€์ž…๋‹ˆ๋‹ค.
CustomerID: 5์ž๋ฆฌ ์ˆซ์ž๋กœ ์ด๋ฃจ์–ด์ง„ ๊ณ ๊ฐ ๋ฒˆํ˜ธ์ž…๋‹ˆ๋‹ค. ๊ฐ ๊ณ ๊ฐ์€ ๊ณ ์œ ํ•œ ๊ณ ๊ฐ ID๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.
Country: ๊ฐ ๊ณ ๊ฐ์ด ๊ฑฐ์ฃผํ•˜๋Š” ๊ตญ๊ฐ€์˜ ์ด๋ฆ„์ž…๋‹ˆ๋‹ค.
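The cancellation rule for InvoiceNo can be turned into a boolean flag. This is a sketch that assumes InvoiceNo is stored as strings; the three invoice numbers are made up for illustration, and the comparison ignores case so that both 'c' and 'C' prefixes match.

```python
import pandas as pd

# Made-up invoice numbers; the second one represents a cancellation.
invoices = pd.Series(["536365", "C536379", "536381"], name="InvoiceNo")

# True where the invoice code starts with the letter c (either case).
is_cancelled = invoices.str.upper().str.startswith("C")

print(is_cancelled.tolist())
```

On the real frame the same expression could be applied to df['InvoiceNo'] to separate cancelled orders from completed ones.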

 

df.info()

 

 

 

๋ฐ์ดํ„ฐ ์ฒดํฌํ•˜๋Š” ํ•จ์ˆ˜ ์ •์˜ : check_data()

  • rows, columns ๊ฐœ์ˆ˜ ํ™•์ธ
  • ์นผ๋Ÿผ๋ณ„ ๋ฐ์ดํ„ฐ ํƒ€์ž…
  • head, tail
  • ๊ฒฐ์ธก์น˜ : df.isnull().sum()
  • ์ค‘๋ณต๊ฐ’

[์ฐธ๊ณ ]

โœ” center(width, fillchar) 
์ด ๋ฉ”์„œ๋“œ๋Š” ๋ฌธ์ž์—ด์„ ๊ฐ€์šด๋ฐ ์ •๋ ฌํ•˜๋Š” ๋ฐ ์‚ฌ์šฉ๋จ.
 - width: ์ •๋ ฌ๋œ ๊ฒฐ๊ณผ ๋ฌธ์ž์—ด์˜ ์ „์ฒด ๋„ˆ๋น„๋ฅผ ๋‚˜ํƒ€๋‚ด๋ฉฐ, ์ด ๊ฒฝ์šฐ์—๋Š” 70์ด๋‹ค.
 - fillchar: ํ•„์š”ํ•œ ๊ฒฝ์šฐ ๋ฌธ์ž์—ด์„ ์ฑ„์šธ ๋ฌธ์ž๋ฅผ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค. ์ด ๊ฒฝ์šฐ์—๋Š” '-' ๋ฌธ์ž๊ฐ€ ์‚ฌ์šฉ๋œ๋‹ค
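A quick sketch of how center() builds the section banners. The variable names are only for illustration.

```python
# Pad " SHAPE " with '-' on both sides until the string is 70 characters wide.
banner = " SHAPE ".center(70, "-")
print(banner)
print(len(banner))

# A smaller case that is easy to check by eye.
short = "hi".center(6, "-")
print(short)  # --hi--
```

If the string is already longer than width, center() returns it unchanged.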

 

def check_data(df, head = 5):
    print(" SHAPE ".center(70, '-'))
    print('Rows: {}'.format(df.shape[0]))
    print('Columns: {}'.format(df.shape[1]))
    
    print(' Type '.center(70, '-'))
    print(df.dtypes)
    print(' HEAD '.center(70, '-'))
    print(df.head(head))
    print(' TAIL '.center(70, '-'))
    print(df.tail(head))
    
    print(' Missing Values '.center(70, '-'))
    print(df.isnull().sum())
    
    print(' Duplicated Values '.center(70, '-'))
    print(df.duplicated().sum())
    
check_data(df)

 

Partial output

 

World Map : visualizing the countries where sales occurred

Building the map requires pandas, plotly, and plotly-express.

# import plotly.graph_objs as go
# import pandas as pd
# import plotly.express as px

 

  • Group by customer ID, invoice, and country -> compute the count of each group
  • To get the number of orders each country received, call value_counts() on the 'Country' column
  • Dictionary stored as data : map settings (choropleth)
    • type : the kind of visualization; here a choropleth map
    • locations : the index that identifies each country 
    • locationmode : the location mode; here 'country names'
    • z : the number of orders for each country
    • text : the text shown when the mouse hovers over a country
    • colorbar : the title of the color bar
    • colorscale : the color scale to use
    • reversescale : whether to reverse the color scale
  • Dictionary stored as layout : layout and style settings for the map
    • title : the map title, its position, and its anchors
    • geo : geographic properties of the map (resolution, ocean color, land color, whether to show the frame)
    • template : the plot template
    • height, width : the height and width of the map
  • choromap variable : builds the map from data and layout
  • iplot : displays the map interactively (validate = False skips plotly's validation of the figure's keys, so an invalid property is not reported before rendering)

 

# Group by customer ID, invoice, and country -> one row per unique combination
world_map = df[['CustomerID', 'InvoiceNo', 'Country']].groupby(['CustomerID', 'InvoiceNo', 'Country']).count().reset_index(drop=False)

# Count the orders per country from the 'Country' column of the grouped result
countries = world_map['Country'].value_counts()

data = dict(type = 'choropleth',           # kind of map to create (choropleth)
           locations = countries.index,    # regions to draw on the map
           locationmode = 'country names',
           z = countries,
           text = countries.index,
           colorbar = {'title':'Orders'},  # title of the color bar
           colorscale = 'Viridis',
           reversescale = False)    # whether to flip the color scale (False: keep the default direction, where Viridis maps low values to dark purple and high values to yellow / True: reversed)


layout = dict(title = {'text': "Number of Orders by Countries",
                      'y': 0.9,
                      'x' : 0.5,
                      'xanchor': 'center',
                      'yanchor': 'top'},
             geo = dict(resolution = 50,
                       showocean = True,
                       oceancolor = 'LightBlue',
                       showland = True,
                       landcolor = 'whitesmoke',
                       showframe = True),
             template = 'plotly_white',
             height = 600,
             width = 1000)

choromap = go.Figure(data = [data], layout = layout)
iplot(choromap, validate = False)

 

👉 Hovering the mouse over a country shows its name and order count on the interactive map!
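The counting step behind the map can be checked on a toy frame. This sketch uses made-up rows with the same three column names; the nunique() line is an alternative added here for comparison, not part of the referenced notebook, and it matches only when invoice numbers are not shared across customers.

```python
import pandas as pd

# Made-up rows: customer 1 has two invoices, one of them with two line items.
toy = pd.DataFrame({
    "CustomerID": ["1", "1", "1", "2", "3"],
    "InvoiceNo": ["A1", "A1", "A2", "B1", "C1"],
    "Country": ["France", "France", "France", "Germany", "France"],
})

cols = ["CustomerID", "InvoiceNo", "Country"]

# Same pattern as the post: one row per unique (customer, invoice, country).
world_map = toy[cols].groupby(cols).count().reset_index(drop=False)
countries = world_map["Country"].value_counts()

# Alternative for comparison: distinct invoices per country.
alt = toy.groupby("Country")["InvoiceNo"].nunique()

print(countries)
print(alt)
```

Because groupby drops missing keys by default, rows without a CustomerID do not contribute to the order counts in the original code.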

 
