
[Pandas] Interpreting E-commerce A/B Test Results with Python (feat. Statistical Tests)

by ISLA! 2024. 1. 30.

๋ฐ์ดํ„ฐ ์ถœ์ฒ˜ : ์ผ€๊ธ€

https://www.kaggle.com/code/sergylog/ab-test-data-analysis

 

๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ & ๋ฐ์ดํ„ฐ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ

import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu
from scipy.stats import ttest_ind
from scipy.stats import norm
from scipy.stats import pearsonr
from scipy.stats import shapiro
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.auto import tqdm

df = pd.read_csv('AB_Test_Results.csv')

 

 

๋ฐ์ดํ„ฐ ํƒ์ƒ‰

  • ํ†ต์ œ์ง‘๋‹จ๊ณผ ๋ณ€ํ™”(test)๋ฅผ ์ ์šฉํ•œ ์‹คํ—˜์ง‘๋‹จ ๋ณ€์ˆ˜ ํ™•์ธ : VARIANT_NAME
df.head()

df.nunique()

df.describe()

 

  • Check the A/B test groups: count how many distinct groups each user belongs to
double_variant_count = df.groupby('USER_ID')['VARIANT_NAME'].nunique().value_counts()
double_variant_count

double_variant_count / double_variant_count.sum()
--
1    0.756325
2    0.243675

 

🎯 Point to check: if a single user belongs to both the control and the experiment group, the results become hard to interpret ▶ users assigned to both groups are excluded.

 

# 1๊ฐœ์˜ ํ…Œ์ŠคํŠธ ๊ทธ๋ฃน์— ์†ํ•œ ์œ ์ €๋งŒ ๋ณด๊ธฐ
single_variant_users = df.groupby('USER_ID')['VARIANT_NAME'].nunique() == 1
single_variant_users

# 1๊ฐœ ๊ทธ๋ฃน์—๋งŒ ์†ํ•œ ์œ ์ € ํ•„ํ„ฐ๋ง
df = df[df['USER_ID'].isin(single_variant_users.index)]

# ํ–‰ ๊ฐœ์ˆ˜ ํ™•์ธ
df.groupby('USER_ID')['VARIANT_NAME'].nunique().value_counts().iloc[0] == double_variant_count.iloc[0]
--
True
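
The filtering step can be sketched on a small made-up frame. This is only an illustration: the `toy` data and the `n_variants` name are assumptions, not part of the Kaggle dataset. It uses `transform('nunique')` to attach the number of distinct groups to every row, so the mask lines up with the rows directly.

```python
import pandas as pd

# Hypothetical toy data: user 2 appears in both groups.
toy = pd.DataFrame({
    'USER_ID':      [1, 1, 2, 2, 3],
    'VARIANT_NAME': ['control', 'control', 'control', 'variant', 'variant'],
    'REVENUE':      [0.0, 1.5, 0.0, 2.0, 0.0],
})

# Number of distinct groups per user, broadcast back onto every row.
n_variants = toy.groupby('USER_ID')['VARIANT_NAME'].transform('nunique')

# Keep only rows whose user was seen in exactly one group.
toy_single = toy[n_variants == 1]
print(toy_single['USER_ID'].unique().tolist())  # [1, 3]
```

User 2 is dropped because it shows up under both `control` and `variant`.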

 

๋ฐ์ดํ„ฐ ๋ถ„ํฌ ํ™•์ธ

sns.boxplot(x='VARIANT_NAME', y='REVENUE', data=df)

 

  • Many outliers are visible, so sort the rows by revenue in descending order to inspect them.
  • Inspecting the outliers shows that the user with ID 3342 has an unusually high revenue.
# Find the outliers
df.sort_values(by='REVENUE', ascending=False).iloc[:10]

 

  • User 3342 turns out to belong to the control group.
  • It is a single record, so it is removed in order to look at the revenue distribution of the remaining users.
df[df['USER_ID']==3342]

# Exclude this outlier
df = df[df['USER_ID']!=3342]

 

  • Next, draw box plots comparing the full data with the data restricted to records that generated revenue.
  • The distribution is easier to see once the zero-revenue records are excluded.
f, axes = plt.subplots(2, sharex=True, figsize = (5, 12))
# Fix the category order explicitly so the tick labels always match the data
order = ['control', 'variant']
sns.boxplot(ax=axes[0], x='VARIANT_NAME', y='REVENUE', data=df, order=order)
sns.boxplot(ax=axes[1], x='VARIANT_NAME', y='REVENUE', data=df[df['REVENUE']>0], order=order)
plt.show()

  • Check whether there really are many customers with zero revenue.
  • There are 156 zero-revenue records that belong to users who also have a record with revenue above 0. In other words, the same customer appears in more than one row.
# Zero-revenue records of users who also have a purchase
(df.loc[(df['REVENUE']==0)&(df['USER_ID'].isin(df.loc[df['REVENUE']>0, 'USER_ID'].values)), 'USER_ID']).count()
---
156
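
Because the same user can appear in several rows, one option is to collapse the data to one row per user before comparing groups. The post itself does not do this; the sketch below uses made-up data and assumes that summing revenue within a user is the appropriate way to combine duplicate rows.

```python
import pandas as pd

# Hypothetical toy data: user 1 has one zero-revenue row and one purchase row.
toy = pd.DataFrame({
    'USER_ID':      [1, 1, 2, 3],
    'VARIANT_NAME': ['control', 'control', 'variant', 'variant'],
    'REVENUE':      [0.0, 4.0, 0.0, 2.5],
})

# Collapse to one row per user by summing revenue within each user.
per_user = (toy.groupby(['USER_ID', 'VARIANT_NAME'], as_index=False)['REVENUE']
               .sum())
print(per_user)
```

After aggregation each user contributes exactly one observation, which is closer to the independence assumption of the tests used later.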

 

Checking by Group

all_stat = df.groupby(by='VARIANT_NAME').agg({'USER_ID':'nunique',
                                             'REVENUE':['sum', 'mean', 'median', 'count']})

  • Compute a metric for how often orders occur per user.
    👉 Average number of purchases per user (note that 'count' counts every row, including zero-revenue rows)
  • The result shows no large difference between the two groups.
orders_per_user = all_stat.loc[:, ('REVENUE', 'count')] / all_stat.loc[:, ('USER_ID', 'nunique')]
orders_per_user

  • Also compute the average revenue per user.
revenue_per_user = all_stat.loc[:, ('REVENUE', 'sum')] / all_stat.loc[:, ('USER_ID', 'nunique')]
revenue_per_user

  • ์œ„์—์„œ ๊ตฌํ•œ ๋‘ ๊ฐ’๋“ค์„ all_stat์— ์ถ”๊ฐ€ํ•œ๋‹ค.
all_stat.loc[:, ('per_user', 'orders')] = orders_per_user
all_stat.loc[:, ('per_user', 'revenue')] = revenue_per_user

all_stat

 

👉 The average purchase amount is also higher in the control group.
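
To make the per-user metrics concrete, here is a small worked example on made-up data. The column names `users`, `rows`, `paid_rows`, and `revenue` are illustrative choices, and named aggregation is used instead of the MultiIndex columns above purely for readability.

```python
import pandas as pd

# Hypothetical toy data with a mix of zero-revenue and paid rows.
toy = pd.DataFrame({
    'USER_ID':      [1, 1, 2, 3, 4],
    'VARIANT_NAME': ['control', 'control', 'control', 'variant', 'variant'],
    'REVENUE':      [0.0, 6.0, 0.0, 0.0, 3.0],
})

stat = toy.groupby('VARIANT_NAME').agg(
    users=('USER_ID', 'nunique'),
    rows=('REVENUE', 'count'),                       # every row, zero revenue included
    paid_rows=('REVENUE', lambda s: (s > 0).sum()),  # rows with an actual purchase
    revenue=('REVENUE', 'sum'),
)
stat['rows_per_user'] = stat['rows'] / stat['users']
stat['revenue_per_user'] = stat['revenue'] / stat['users']
print(stat)
```

Here the control group has 3 rows for 2 users (1.5 rows per user) but only 1 paid row, which shows why a row count is not the same thing as a purchase count.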

 

Analyzing Paying Users Only

  • Write the same code as above, but keep only the rows where revenue is not 0.
paid_stat = df.loc[df.REVENUE != 0].groupby('VARIANT_NAME').agg({'USER_ID':'nunique',
                                                                'REVENUE':['sum', 'mean', 'median', 'count']})

 

orders_per_user = paid_stat.loc[:, ('REVENUE', 'count')] / paid_stat.loc[:, ('USER_ID', 'nunique')]
revenue_per_user = paid_stat.loc[:, ('REVENUE', 'sum')] / paid_stat.loc[:, ('USER_ID', 'nunique')]

paid_stat.loc[:, ('per_user', 'orders')] = orders_per_user
paid_stat.loc[:, ('per_user', 'revenue')] = revenue_per_user

 

  • ๋™์ผํ•˜๊ฒŒ ์‹œ๊ฐํ™”ํ•œ๋‹ค. ๐Ÿ‘‰ ๋ชจ๋“  ์œ ์ €์™€ ๊ตฌ๋งค๊นŒ์ง€ ์ด์–ด์ง„ ์œ ์ €์˜ ์ˆ˜์ต ๋ถ„ํฌ
  • ๊ตฌ๋งค๊นŒ์ง€ ์ด์–ด์ง„ ์œ ์ € ๊ทธ๋ž˜ํ”„์—์„œ ๋ช…ํ™•ํ•œ ์ฐจ์ด๊ฐ€ ๋” ๋ณด์ด๊ณ , ์‹คํ—˜์ง‘๋‹จ์˜ ๊ตฌ๋งค๊ฐ€ ๋งŽ์ด ๋ฐœ์ƒํ•œ ๊ตฌ๊ฐ„๋„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.
f, axes = plt.subplots(2, figsize = (10, 8))
sns.distplot(df.loc[df['VARIANT_NAME']=='control', 'REVENUE'], ax=axes[0], label='control')
sns.distplot(df.loc[df['VARIANT_NAME']=='variant', 'REVENUE'], ax=axes[0], label='variant')
axes[0].set_title('Distribution of revenue of all users')

sns.distplot(df.loc[(df.VARIANT_NAME == 'control') & (df['REVENUE'] > 0), 'REVENUE'], ax=axes[1], label='control')
sns.distplot(df.loc[(df.VARIANT_NAME == 'variant') & (df['REVENUE'] > 0), 'REVENUE'], ax=axes[1], label='variant')
axes[1].set_title('Paying Users revenue Distribution')
plt.legend()
plt.subplots_adjust(hspace=0.3)


Statistical Tests

▶ Normality: Shapiro-Wilk test

  • The revenue distribution of the experiment group is not normal.
shapiro(df.loc[df.VARIANT_NAME=='variant', 'REVENUE'])
--
ShapiroResult(statistic=0.027033090591430664, pvalue=0.0)
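
To see why a revenue column dominated by zeros fails the normality check so decisively, the test can be run on a synthetic zero-inflated sample. The 3% purchase rate and the exponential purchase amounts below are assumptions made up for illustration, not properties of the dataset. SciPy's documentation also notes that for samples larger than 5000 the W statistic is accurate but the p-value may not be, which is worth keeping in mind for large groups.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)

# Synthetic, zero-inflated sample (assumption: about 3% of rows have a purchase).
n = 1000
has_purchase = rng.random(n) < 0.03
revenue = np.where(has_purchase, rng.exponential(5.0, n), 0.0)

w_stat, p_value = shapiro(revenue)
print(f'W={w_stat:.3f}, p={p_value:.3g}')
```

A W statistic far below 1 together with a tiny p-value is the expected pattern for data like this, and it supports switching to a non-parametric test.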

 

▶ Mann-Whitney test

  • The data contains many zero-revenue values and many tied (duplicate) values, so the result should be read with care.
  • The Mann-Whitney U test is a non-parametric test for comparing two independent samples.
  • Non-parametric tests are used when the data does not follow a normal distribution or does not satisfy the distributional assumptions about the population.
(df['REVENUE']==0).value_counts()
--
True     9848
False     151

 

  • Result of the non-parametric test on revenue for the experiment and control groups
    • statistic : the U statistic, computed from the rank sums of the two groups. When the groups are similar it is close to n1*n2/2 (half the product of the two sample sizes); values far from that in either direction indicate a difference.
    • pvalue : greater than 0.05, so there is no statistically significant difference
  • In short, the difference between the two groups is not statistically significant.
mannwhitneyu(df.loc[df.VARIANT_NAME=='variant', 'REVENUE'], df.loc[df.VARIANT_NAME=='control', 'REVENUE'])
--
MannwhitneyuResult(statistic=12478180.0, pvalue=0.5291970335120277)
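
A U statistic is easiest to read relative to the product of the two sample sizes. The sketch below uses synthetic samples (the exponential draws and the sizes 200 and 300 are assumptions for illustration) to show that the U values from both argument orders add up to n1*n2, and that for two samples from the same distribution U sits near half of that product.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

# Two synthetic samples drawn from the same distribution.
x = rng.exponential(1.0, 200)
y = rng.exponential(1.0, 300)
n1n2 = len(x) * len(y)

# U for each argument order; the two values always sum to n1 * n2.
u_xy = mannwhitneyu(x, y, alternative='two-sided').statistic
u_yx = mannwhitneyu(y, x, alternative='two-sided').statistic

print(u_xy + u_yx, n1n2)
print(u_xy / n1n2)  # close to 0.5 when the groups are similar
```

Dividing U by n1*n2 gives a scale-free number between 0 and 1, which is often easier to interpret than the raw statistic.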

 

  • ์ˆ˜์ต์ด ๋ฐœ์ƒํ•œ ๊ฒฝ์šฐ๋งŒ ๋‹ค์‹œ ํ•œ๋ฒˆ ์ˆ˜ํ–‰
    • ๊ฒฐ๊ณผ : ์—ญ์‹œ ์œ ์˜๋ฏธํ•œ ์ฐจ์ด ์—†์Œ
mannwhitneyu(df.loc[(df.VARIANT_NAME=='control')&(df.REVENUE>0), 'REVENUE'],
            df.loc[(df.VARIANT_NAME=='variant')&(df.REVENUE>0), 'REVENUE'])
--
MannwhitneyuResult(statistic=3284.0, pvalue=0.10145877111519161)

Conclusion

👉 Based on the exploratory analysis above, the change does not appear to have produced a meaningful revenue difference between the two groups.

👉 The statistical tests point the same way: no statistically significant difference was found.
