import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt
%matplotlib inline
# Seed both generators for reproducibility. The simulations below draw from
# np.random, which random.seed() alone does not affect.
random.seed(43)
np.random.seed(43)

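1. Build a DataFrame of conversion rates before and after a page redesign

The group and conversion data are randomly generated. The converted column records whether a user converted; in the group column, new_page denotes the new version of the page and old_page the old version.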

ab_data.converted.mean()==0.100118

ab_data.query('group=="new_page"')['converted'].mean()==0.10010251582612997

ab_data.query('group=="old_page"')['converted'].mean()==0.10013357025675172

# Random data
#ab_data = pd.DataFrame({'group':np.random.choice(['new_page','old_page'],size=499000),'converted':np.random.randint(0,2,size=499000)})

#ab_data.iloc[-400000:,1] = 0
#ab_data = pd.concat([ab_data,pd.DataFrame({'group':['new_page']*500,'converted':[1]*500})],axis=0)

#ab_data = pd.concat([ab_data,pd.DataFrame({'group':['old_page']*500,'converted':[0]*500})],axis=0)               

#ab_data.to_pickle('ab_data.pkl')

ab_data = pd.read_pickle('ab_data.pkl')

ab_data.head()

What is the overall conversion rate?

ab_data.converted.mean()

0.099904

Given that a user is in the new_page group, what is the probability that they convert?

ab_data.query('group=="new_page"')['converted'].mean()

0.10181204115117909

Given that a user is in the old_page group, what is the probability that they convert?

ab_data.query('group=="old_page"')['converted'].mean()

0.0979884493144536

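So far there is no evidence that either page leads to a higher conversion rate: the two group rates differ by only about 0.4 percentage points, and no test has yet been run to tell whether that gap is more than chance.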
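The three rates above can also be read off in a single pass. The following is a minimal sketch on a tiny hypothetical frame named `demo` (a stand-in, not the real `ab_data`), assuming a pandas version that supports named aggregation:

```python
import pandas as pd

# Tiny hypothetical stand-in for ab_data (not the real data)
demo = pd.DataFrame({
    "group": ["new_page", "new_page", "old_page", "old_page", "old_page", "new_page"],
    "converted": [1, 0, 0, 0, 1, 1],
})

# One pass: conversion rate and sample size per group
summary = demo.groupby("group")["converted"].agg(rate="mean", n="count")
print(summary)
```

Seeing the group sizes next to the rates makes it easier to judge how much a small gap between the rates could mean.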

2. A/B test

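Note that because each event has an associated timestamp, you could in principle run a hypothesis test continuously as each observation arrives. The hard question is when to stop: as soon as one page is judged much better than the other, or only after the advantage has persisted for some time? How long must the test run before you can decide which page is better?

For now, consider that you must make the decision based on all the data provided. If you want to assume the old page is better unless the new page proves better at a Type I error rate of 5%, what should your null and alternative hypotheses be? You can state them in words or in terms of $p_{old}$ and $p_{new}$, the conversion rates of the old and new pages.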

Null hypothesis: $p_{new}-p_{old} \le 0$

Alternative hypothesis: $p_{new}-p_{old} > 0$

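Assume that under the null hypothesis $p_{new}$ and $p_{old}$ both equal the same "true" success rate, that is, $p_{new}$ and $p_{old}$ are equal. Further assume that this common rate equals the overall conversion rate in ab_data.csv, regardless of page.

Use a sample size for each page equal to the number of rows for that page in ab_data.csv.

Build the sampling distribution of the difference in conversion rates between the two pages by computing the estimate over 10,000 iterations under the null hypothesis.

Use the cells below to supply the pieces of this simulation. If it does not all make sense yet, don't worry; the questions below walk through it step by step.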

a. What is the conversion rate $p_{new}$ under the null hypothesis?

p_new=ab_data.converted.mean()
p_new

0.099904

b. What is the conversion rate $p_{old}$ under the null hypothesis?

p_old=ab_data.converted.mean()
p_old

0.099904

c. What is $n_{new}$?

n_new=ab_data.query('group=="new_page"').shape[0]
n_new

250491

d. What is $n_{old}$?

n_old=ab_data.query('group=="old_page"').shape[0]
n_old

249509

e. Simulate $n_{new}$ transactions with conversion rate $p_{new}$ under the null hypothesis, and store the resulting 0s and 1s in new_page_converted.

np.random.seed(43)  # random.seed() would not affect np.random.choice
# np.random.choice(a, size=None, replace=True, p=None) draws a random sample
# from a 1-D array (or from np.arange(a) when a is an int).
# p: optional 1-D array of probabilities for the entries of a; uniform if omitted.
new_page_converted=np.random.choice(2,size=n_new,p=[1-p_new,p_new])
new_page_converted

array([0, 0, 0, ..., 0, 0, 1], dtype=int64)

f. Simulate $n_{old}$ transactions with conversion rate $p_{old}$ under the null hypothesis, and store the resulting 0s and 1s in old_page_converted.

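# Not reseeded here: restarting NumPy from the same seed would make this
# sample repeat the draws behind new_page_converted instead of being independent.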
old_page_converted=np.random.choice(2,size=n_old,p=[1-p_old,p_old])
old_page_converted

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

g. Find the simulated value of $p_{new}-p_{old}$ from (e) and (f).

diff=new_page_converted.mean()-old_page_converted.mean()
diff

0.00011305574808949392

h. Using the procedure from a. through g., simulate 10,000 values of $p_{new}-p_{old}$ and store them in p_diffs.

p_diffs=[]
for i in range(10000):
    p_new_diff = np.random.choice(2,size=n_new,p=[1-p_new,p_new]).mean()
    p_old_diff = np.random.choice(2,size=n_old,p=[1-p_old,p_old]).mean()
    p_diffs.append(p_new_diff - p_old_diff)
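The loop above draws about 500,000 individual 0/1 outcomes on each of its 10,000 passes. A faster alternative, sketched here under the assumption that only the per-experiment conversion counts matter, draws those counts directly from a binomial distribution, which has the same distribution as summing the individual draws. The name `p_diffs_vec` and the generator `rng` are illustrative, not from the original notebook:

```python
import numpy as np

rng = np.random.default_rng(43)  # seed chosen to mirror the notebook; any seed works

# Values reported earlier in the notebook
p_null = 0.099904
n_new, n_old = 250491, 249509
n_sims = 10_000

# Each binomial draw is the number of conversions in one simulated experiment
new_rates = rng.binomial(n_new, p_null, size=n_sims) / n_new
old_rates = rng.binomial(n_old, p_null, size=n_sims) / n_old
p_diffs_vec = new_rates - old_rates
```

The result is an array of 10,000 simulated differences that can be used wherever p_diffs is used.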

i. Plot a histogram of p_diffs. Does the histogram look like what you expected?

p_diffs = np.array(p_diffs)
plt.hist(p_diffs,bins=100)

[Output: histogram of p_diffs with 100 bins spanning roughly -0.0034 to 0.0033. The counts rise smoothly to a peak of about 320 per bin near 0 and fall off symmetrically, giving a bell shape.]
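As a rough check on what to expect from this histogram: by the central limit theorem, the simulated difference under the null hypothesis should be approximately normal with mean 0. Assuming the pooled rate $\hat{p}=0.099904$ and the group sizes reported above, its standard deviation works out to

$$
\sigma_{diff}=\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_{new}}+\frac{1}{n_{old}}\right)}
\approx\sqrt{0.099904\times 0.900096\times 8.0\times 10^{-6}}
\approx 8.5\times 10^{-4}
$$

so a range of about $\pm 0.0034$ corresponds to roughly $\pm 4\,\sigma_{diff}$, which is what 10,000 draws from a normal distribution would typically cover.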

j. What proportion of p_diffs is greater than the actual difference observed in ab_data.csv?

obs_diff=ab_data.query('group=="new_page"')['converted'].mean()-\
ab_data.query('group=="old_page"')['converted'].mean()
obs_diff

0.0038235918367254956

(p_diffs>obs_diff).mean()

0.0

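k. Explain in words what you just computed in j. What is this value called in scientific studies? Based on it, is there a difference between the conversion rates of the new and old pages?

This is the p-value. None of the 10,000 simulated differences exceeds the observed difference, so the estimated p-value is 0.0, meaning it is below 1/10,000. That is far below 0.05, so we reject the null hypothesis in favor of the alternative: the new page does increase conversions.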
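A simulated p-value of exactly 0.0 only says that no draw out of 10,000 reached the observed difference; the true p-value is small but positive. One common convention, which is not what the notebook does, adds one to both the count and the number of simulations so that the estimate can never be zero. A minimal sketch, with `simulated_p_value` as a hypothetical helper and an evenly spaced stand-in array in place of the real p_diffs:

```python
import numpy as np

def simulated_p_value(null_diffs, observed):
    """One-sided p-value estimate with the add-one correction."""
    null_diffs = np.asarray(null_diffs)
    exceed = int((null_diffs >= observed).sum())
    return (exceed + 1) / (null_diffs.size + 1)

# Stand-in for p_diffs: 10,000 values covering the histogram's range
stand_in = np.linspace(-0.0034, 0.0033, 10_000)
p_est = simulated_p_value(stand_in, 0.0038235918367254956)
print(p_est)  # 1 / 10001, since no stand-in value reaches the observed difference
```

Reported this way, the simulation supports "p is at most about 0.0001" rather than "p equals 0".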

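l. We could also use a built-in to achieve similar results. Though using the built-in might be easier to code, the portions above are a walkthrough of the ideas that are critical to thinking correctly about statistical significance. Fill in the cell below to calculate the number of conversions for each page, as well as the number of individuals who received each page. Let $n_{old}$ and $n_{new}$ refer to the number of rows associated with the old page and the new page, respectively.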

import statsmodels.api as sm

convert_old = ab_data.query('group=="old_page" & converted==1').shape[0]
convert_new = ab_data.query('group=="new_page" & converted==1').shape[0]
n_old = ab_data.query('group=="old_page"').shape[0]
n_new = ab_data.query('group=="new_page"').shape[0]

m. Now use stats.proportions_ztest to compute your test statistic and p-value. Here is a helpful link on using the built-in.

z_score,p_value=sm.stats.proportions_ztest([convert_old, convert_new], [n_old, n_new],alternative='smaller')
z_score,p_value
# The old page is passed first, so alternative='smaller' tests H1: p_old < p_new.
# With alternative='two-sided': (z_score, p_value) == (-4.508061583981402, 6.542258869878914e-06)
# With alternative='smaller':   (z_score, p_value) == (-4.508061583981402, 3.271129434939457e-06)
# With alternative='larger':    (z_score, p_value) == (-4.508061583981402, 0.999996728870565)

(-4.508061583981402, 3.271129434939457e-06)
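To see where these numbers come from, the two-sample z-test can be reproduced by hand. The sketch below assumes the pooled-variance form of the test and uses only the group rates and sizes reported earlier; `norm_cdf` is a small helper defined here, not a library function:

```python
import math

# Values reported earlier in the notebook
n_old, n_new = 249509, 250491
rate_old, rate_new = 0.0979884493144536, 0.10181204115117909

# Pooled conversion rate under the null hypothesis p_old == p_new
pooled = (rate_old * n_old + rate_new * n_new) / (n_old + n_new)
se = math.sqrt(pooled * (1 - pooled) * (1 / n_old + 1 / n_new))
z = (rate_old - rate_new) / se  # old page first, matching the call above

def norm_cdf(x):
    """Standard normal CDF via the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2))

p_smaller = norm_cdf(z)              # H1: p_old < p_new
p_larger = 1 - norm_cdf(z)           # H1: p_old > p_new
p_two_sided = 2 * norm_cdf(-abs(z))  # H1: p_old != p_new
print(z, p_smaller, p_larger, p_two_sided)
```

The three p-values differ only in which tail of the normal distribution they measure, which is why the two-sided value is twice the one-sided one.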

from scipy.stats import norm
# Critical Z score value for a one tailed test at confidence level of 95%
norm.ppf(1-(0.05))

1.6448536269514722

# Tells how significant z_score is:
norm.cdf(z_score)

3.271129434939457e-06

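n. Based on the z-score and p-value computed above, do we conclude that the conversion rates of the old and new pages differ? Do they agree with the findings in j. and k.?

The p-value of 3.271129434939457e-06 is very small, so we reject the null hypothesis in favor of the alternative: the new page does increase conversions. This agrees with the earlier result.

The significance of z_score is exactly the p-value: norm.cdf(z_score) reproduces the value returned by proportions_ztest.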

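3. A regression approach

1. In this final part, you will see that the result you obtained in the A/B test above can also be obtained by performing a regression.

a. Since each row is either a conversion or no conversion, what type of regression should we perform in this case?

Logistic regression

b. The goal is to use statsmodels to fit the regression model you specified in a. to see whether there is a significant difference in conversion depending on which page a user receives. First, however, you need to create a column for the intercept and a dummy variable column for the page each user received. Add an intercept column and an ab_page column that is 1 when the user received new_page and 0 when the user received old_page.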

import statsmodels.api as sm
ab_data['ab_page']=ab_data.group.map({'new_page':1,'old_page':0})
ab_data['intercept']=1

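c. Use statsmodels to set up your regression model. Instantiate the model and fit it using the two columns you created in b. to predict whether or not an individual converts.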

logit_mod=sm.Logit(ab_data['converted'],ab_data[['intercept','ab_page']])
result=logit_mod.fit()

Optimization terminated successfully.
         Current function value: 0.324852
         Iterations 6
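With a single 0/1 regressor the logistic model has one parameter per group, so its maximum-likelihood fit simply reproduces the two group conversion rates on the log-odds scale. The sketch below, which uses only the rates reported earlier and a small `logit` helper defined here, shows what the optimizer should converge to:

```python
import math

# Group conversion rates reported earlier in the notebook
rate_old, rate_new = 0.0979884493144536, 0.10181204115117909

def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1 - p))

intercept = logit(rate_old)                       # log-odds of converting on old_page
ab_page_coef = logit(rate_new) - logit(rate_old)  # log odds ratio, new vs old
print(round(intercept, 4), round(ab_page_coef, 4))
```

These two values should agree with the coef entries for intercept and ab_page in the fitted model's summary (-2.2198 and 0.0425).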

d. Provide the summary of your model below, and use it as necessary to answer the following questions.

result.summary()

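e. What is the p-value associated with ab_page? Why does it differ from the value you found in Part II? Hint: What are the null and alternative hypotheses associated with your regression model, and how do they compare to the null and alternative hypotheses in Part II?

The p-value associated with ab_page is displayed as 0.000; with more decimal places it would be about 6.5e-06, in line with the two-sided value 6.542258869878914e-06 from proportions_ztest. The p-value in Part 2 is 3.271129434939457e-06. The two differ because the tests have different directions: the hypothesis test in Part 2 is one-tailed, while the regression p-value tests whether ab_page is associated with conversion at all (coefficient equal to zero versus not equal to zero), which is a two-tailed test and therefore gives a value twice as large. Either way, the result indicates that ab_page is a statistically significant predictor of conversion.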

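f. Now, you are considering other things that might influence whether or not an individual converts. Discuss why it is a good idea to consider other factors to add into your regression model. Are there any disadvantages to adding additional terms into your regression model?

In practice, several factors may influence the response variable, and adding them lets us better analyze which variables drive the outcome. However, the more terms we add, the greater the risk of drawing wrong inferences. For example, if explanatory variables are correlated with one another, the resulting multicollinearity makes the coefficient estimates unstable, and they can even take the opposite sign from what is expected.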
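One common way to spot multicollinearity before trusting the coefficients is the variance inflation factor (VIF). It is not used in the original analysis; the sketch below is an illustration on synthetic data, with `vif` as a hypothetical helper that regresses one column on the others and returns 1 / (1 - R²):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)  # nearly a copy of x1
x3 = rng.normal(size=n)             # unrelated to the others
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """Variance inflation factor of column j: 1 / (1 - R^2) from regressing it on the rest."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])  # add an intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1 - resid.var() / y.var()
    return 1 / (1 - r2)

for j, name in enumerate(["x1", "x2", "x3"]):
    print(name, vif(X, j))
```

A VIF near 1 means a column is nearly independent of the others; large values (a frequently quoted rule of thumb is above 10) suggest that its coefficient will be poorly determined.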

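Note that the coefficient table has the columns coef, std err, z, P>|z| and [0.025 0.975]. coef is the estimated coefficient of each model term, std err is the standard error of that estimate, z is the z-statistic (coef divided by std err), P>|z| is the two-sided p-value of the coefficient, and [0.025 0.975] is its 95% confidence interval.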
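These columns are tied together by simple arithmetic, which can be checked by hand for the ab_page row. The sketch below assumes the standard large-sample formula for the standard error of a log odds ratio (the square root of the summed reciprocal cell counts) and uses only the group rates and sizes reported earlier:

```python
import math

# Values reported earlier in the notebook
n_old, n_new = 249509, 250491
rate_old, rate_new = 0.0979884493144536, 0.10181204115117909

def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1 - p))

coef = logit(rate_new) - logit(rate_old)
# 1/converted + 1/not_converted in each group collapses to 1 / (n * p * (1 - p))
std_err = math.sqrt(
    1 / (n_new * rate_new * (1 - rate_new))
    + 1 / (n_old * rate_old * (1 - rate_old))
)
z = coef / std_err
p_two_sided = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
ci_low = coef - 1.959964 * std_err              # 1.959964 is the 97.5% normal quantile
ci_high = coef + 1.959964 * std_err
odds_ratio = math.exp(coef)
print(coef, std_err, z, p_two_sided, (ci_low, ci_high), odds_ratio)
```

The odds ratio `exp(coef)` is often the easiest number to communicate: a value slightly above 1 means the odds of converting are slightly higher on the new page than on the old one.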

Output of result.summary():

Dep. Variable:	converted	No. Observations:	500000
Model:	Logit	Df Residuals:	499998
Method:	MLE	Df Model:	1
Date:	Mon, 04 Mar 2019	Pseudo R-squ.:	6.256e-05
Time:	20:55:41	Log-Likelihood:	-1.6243e+05
converged:	True	LL-Null:	-1.6244e+05
		LLR p-value:	6.537e-06

	coef	std err	z	P>|z|	[0.025	0.975]
intercept	-2.2198	0.007	-329.645	0.000	-2.233	-2.207
ab_page	0.0425	0.009	4.508	0.000	0.024	0.061

Output of ab_data.head():

	group	converted
0	old_page	1
1	old_page	0
2	old_page	1
3	new_page	1
4	old_page	1