
[Kaggle] E-commerce Data Analysis 6 (CRM Analytics 🛍️🛒)

by ISLA! 2023. 10. 8.

Following the customer segment analysis in part 5, this post carries out a cohort analysis.


Cohort Analysis

A cohort is a group of people who share some common characteristic.

That characteristic can be the app sign-up date, the month of first purchase, geographic location, or acquisition channel (organic users, users brought in by marketing, and so on).

Cohort analysis tracks these user groups over time to identify common patterns or behaviors.

For this example I defined a cohort analysis function. Because the function is long, I explain it piece by piece and give the complete function at the end.

 

 

cohort (extracting each customer's first order date and the date of each order)

  • Copy the DataFrame
  • Keep only the customer ID, invoice number, and order date, then drop duplicate rows
  • Convert dates to monthly periods (Period) : each order date is represented by the month it falls in
  • Group by customer ID :
    • For each customer, find the minimum of the order date (InvoiceDate) column, i.e. the earliest order date.
    • Convert the earliest order date to a monthly period
  • Group by the cohort and order_month columns :
    • For each group, count the number of unique customer IDs and store it in n_customers
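def CohortAnalysis(dataframe):
    
    # Work on a copy so the caller's DataFrame is not modified
    data = dataframe.copy()
    # One row per (customer, invoice, date); InvoiceDate must already be a datetime column
    data = data[["CustomerID", "InvoiceNo", "InvoiceDate"]].drop_duplicates()
    
    # Month of each order, and month of each customer's first order (the cohort)
    data['order_month'] = data['InvoiceDate'].dt.to_period('M')
    data['cohort'] = data.groupby('CustomerID')['InvoiceDate'].transform('min').dt.to_period('M')
    # Number of unique customers per (cohort, order_month) pair
    cohort_data = (
        data.groupby(['cohort', 'order_month']).agg(n_customers = ('CustomerID', 'nunique'))
            .reset_index(drop = False)
    )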

[Output: the data and cohort_data tables]
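To make the steps above concrete, here is a minimal sketch that runs them on a tiny made-up order table. The six rows are hypothetical and only illustrate the mechanics; the column names follow the ones used in this post.

```python
import pandas as pd

# Hypothetical toy orders (not from the Kaggle dataset)
orders = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2, 3],
    "InvoiceNo": ["A1", "A1", "A2", "B1", "B2", "C1"],
    "InvoiceDate": pd.to_datetime([
        "2010-12-01", "2010-12-01", "2011-01-15",
        "2010-12-20", "2011-02-03", "2011-01-07",
    ]),
})

# One row per invoice: the repeated A1 line collapses into a single row
data = orders[["CustomerID", "InvoiceNo", "InvoiceDate"]].drop_duplicates().copy()

# Month of each order, and month of each customer's first order
data["order_month"] = data["InvoiceDate"].dt.to_period("M")
data["cohort"] = (
    data.groupby("CustomerID")["InvoiceDate"].transform("min").dt.to_period("M")
)

# Unique customers per (cohort, order_month) pair
cohort_data = (
    data.groupby(["cohort", "order_month"])
        .agg(n_customers=("CustomerID", "nunique"))
        .reset_index()
)
print(cohort_data)
```

Customers 1 and 2 both land in the 2010-12 cohort because that is the month of their earliest invoice, while customer 3 starts the 2011-01 cohort.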

 

Computing the number of months between the first order and each order

  • Subtract cohort from order_month, then apply .apply(attrgetter('n')) to extract the n attribute (the number of months) from each resulting offset object.
    • attrgetter is one of the functions provided by Python's operator module.
from operator import attrgetter
cohort_data['period_number'] = (cohort_data.order_month - cohort_data.cohort).apply(attrgetter('n'))

[Output: the cohort_data table with the period_number column]
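A small sketch of why attrgetter('n') is needed: subtracting two monthly Period columns gives offset objects rather than integers, and the month count is held in their n attribute. The period values below are made up for illustration.

```python
from operator import attrgetter

import pandas as pd

# Hypothetical monthly periods for three orders from the same cohort
order_month = pd.Series(pd.PeriodIndex(["2010-12", "2011-01", "2011-03"], freq="M"))
cohort = pd.Series(pd.PeriodIndex(["2010-12", "2010-12", "2010-12"], freq="M"))

# Period - Period yields month offsets (an object-dtype Series)
diff = order_month - cohort

# Pull the integer month count out of each offset object
period_number = diff.apply(attrgetter("n"))
print(period_number.tolist())
```

attrgetter('n') here is equivalent to the lambda `lambda offset: offset.n`; it just avoids writing the lambda.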

 

Using pivot_table to count purchasing customers per cohort (month of first purchase) and elapsed period

cohort_pivot = cohort_data.pivot_table(
        index = 'cohort', columns = 'period_number', values = 'n_customers'
    )

# Initial number of customers in each cohort (period 0)
cohort_size = cohort_pivot.iloc[:, 0]

[Output: the cohort_pivot table]
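The reshaping step can be sketched on a hand-written long table. The numbers are invented, and cohorts are written as plain strings to keep the example short.

```python
import pandas as pd

# Hypothetical long-format counts: one row per (cohort, period_number)
cohort_data = pd.DataFrame({
    "cohort": ["2010-12", "2010-12", "2010-12", "2011-01", "2011-01"],
    "period_number": [0, 1, 2, 0, 1],
    "n_customers": [100, 40, 35, 50, 10],
})

# Rows become cohorts, columns become months since first purchase
cohort_pivot = cohort_data.pivot_table(
    index="cohort", columns="period_number", values="n_customers"
)

# Column 0 holds each cohort's size in its first month
cohort_size = cohort_pivot.iloc[:, 0]
print(cohort_pivot)
```

pivot_table aggregates with the mean by default. Since each (cohort, period_number) pair appears only once, the mean is just the original count, and periods a cohort has not reached yet show up as NaN.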

 

Computing the customer retention rate

  • Divide each row of the pivot table by the 'cohort_size' variable to get the customer retention rate (from which churn can be read off)
    • divide() performs the division; axis = 0 aligns cohort_size with the row index, so every row is divided by its own cohort's initial size.
    • This yields the retention rate for each cohort (the churn rate is 1 minus this value), and the result is stored in the retention_matrix variable.
retention_matrix = cohort_pivot.divide(cohort_size, axis = 0)

 

[Output: the retention_matrix table]
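As a quick check of what axis = 0 does, the sketch below divides a small invented pivot table by its first column. The churn_matrix name is my own addition for illustration and is not part of the original function.

```python
import numpy as np
import pandas as pd

# Hypothetical pivot table: rows are cohorts, columns are period numbers
cohort_pivot = pd.DataFrame(
    [[100, 40, 35], [50, 10, np.nan]],
    index=pd.Index(["2010-12", "2011-01"], name="cohort"),
    columns=pd.Index([0, 1, 2], name="period_number"),
)
cohort_size = cohort_pivot.iloc[:, 0]

# axis=0 matches cohort_size to the row labels, so each row is
# divided by its own cohort's starting size
retention_matrix = cohort_pivot.divide(cohort_size, axis=0)

# Churn is the complement of retention (illustrative extra, not in the post's code)
churn_matrix = 1 - retention_matrix
print(retention_matrix)
```

Period 0 is always 1.0 because each cohort is divided by itself, and cells that were NaN in the pivot table stay NaN.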

 

🚀 Visualizing the customer retention rate (heatmap)

  • with sns.axes_style('white') : white background
  • fig, ax ~ : sharey = True (the two subplots share the same y-axis)
  • gridspec_kw = {'width_ratios':[0.1, 1]} : sets the width ratio of the two subplots
  • ax[1] heatmap : uses retention_matrix, with null values masked
  • Creating white_cmap : build a colormap from a list of colors
    • import matplotlib.colors as mcolors
    • mcolors.ListedColormap() function
  • Convert the cohort_size index and values to a DataFrame and add it to the left of the heatmap
    • fmt = 'g' : sets the annotation number format to the general format
    • For example, with fmt='g', the number 1000 is displayed as "1000" and the number 0.001 as "0.001".
import matplotlib.colors as mcolors
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

with sns.axes_style('white'):     # white background

    fig, ax = plt.subplots(
                1, 2, figsize = (12, 8), sharey = True, gridspec_kw = {'width_ratios':[0.1, 1]}
                )

    # Right panel: retention heatmap, with missing cells masked
    sns.heatmap(retention_matrix,
                mask = retention_matrix.isnull(),
                annot = True,
                cbar = True,
                fmt = '.0%',
                cmap = 'coolwarm',
                ax = ax[1])

    ax[1].set_title("Monthly Cohorts: User Retention", fontsize=14)
    ax[1].set(xlabel='# of periods', ylabel = " ")

    # Left panel: cohort sizes drawn as an all-white heatmap so only the numbers show
    white_cmap = mcolors.ListedColormap(['white'])

    sns.heatmap(pd.DataFrame(cohort_size).rename(columns = {0:'cohort_size'}),
               annot = True, cbar = False, fmt = 'g', cmap = white_cmap, ax = ax[0])

fig.tight_layout()
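The two fmt values used in the heatmap calls above ('.0%' and 'g') are ordinary Python format specifications, and as far as I know seaborn applies them to each cell value in the same way str.format would. The sketch below tries them with the built-in format() on a few made-up numbers, without drawing anything.

```python
# Hypothetical cell values: cohort sizes and retention rates
sizes = [1000, 0.001, 948.0]
rates = [1.0, 0.366, 0.5]

# 'g' drops unnecessary trailing zeros, which suits whole-number counts
general = [format(value, "g") for value in sizes]

# '.0%' multiplies by 100 and shows no decimal places
percent = [format(rate, ".0%") for rate in rates]

print(general)
print(percent)
```

This is why the left panel shows cohort sizes as plain integers even though the pivot table stores them as floats.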

 


🚀 Full function for the customer retention visualization

from operator import attrgetter

import matplotlib.colors as mcolors
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

def CohortAnalysis(dataframe):
    
    # One row per (customer, invoice, date); InvoiceDate must already be a datetime column
    data = dataframe.copy()
    data = data[["CustomerID", "InvoiceNo", "InvoiceDate"]].drop_duplicates()
    
    # Month of each order, and month of each customer's first order (the cohort)
    data['order_month'] = data['InvoiceDate'].dt.to_period('M')
    data['cohort'] = data.groupby('CustomerID')['InvoiceDate'].transform('min').dt.to_period('M')
    cohort_data = (
        data.groupby(['cohort', 'order_month']).agg(n_customers = ('CustomerID', 'nunique'))
            .reset_index(drop = False)
    )
    
    # Months elapsed since the cohort's first purchase
    cohort_data['period_number'] = (cohort_data.order_month - cohort_data.cohort).apply(attrgetter('n'))
    
    cohort_pivot = cohort_data.pivot_table(
        index = 'cohort', columns = 'period_number', values = 'n_customers'
    )
    cohort_size = cohort_pivot.iloc[:, 0]
    
    # Compute the retention rate (churn = 1 - retention)
    retention_matrix = cohort_pivot.divide(cohort_size, axis = 0)
    
    # Visualization
    with sns.axes_style('white'):     # white background
        
        # The two subplots share the same y-axis (sharey = True) / subplot width ratio 0.1 : 1
        fig, ax = plt.subplots(
                    1, 2, figsize = (12, 8), sharey = True, gridspec_kw = {'width_ratios':[0.1, 1]}
                    )
        
        sns.heatmap(retention_matrix,
                    mask = retention_matrix.isnull(),
                    annot = True,
                    cbar = True,
                    fmt = '.0%',
                    cmap = 'coolwarm',
                    ax = ax[1])
        
        ax[1].set_title("Monthly Cohorts: User Retention", fontsize=14)
        ax[1].set(xlabel='# of periods', ylabel = " ")
        
        white_cmap = mcolors.ListedColormap(['white'])
        
        sns.heatmap(pd.DataFrame(cohort_size).rename(columns = {0:'cohort_size'}),
                   annot = True, cbar = False, fmt = 'g', cmap = white_cmap, ax = ax[0])

    fig.tight_layout()

 
