KCBO is a toolkit for anyone who wants to do Bayesian data analysis without worrying about the implementation details of a particular test.
Currently KCBO is very much alpha/pre-alpha software and implements only three tests. Here is a list of future objectives for the project.
KCBO is available through PyPI and on GitHub. The following commands will install KCBO:
pip install kcbo
git clone https://github.com/HHammond/kcbo
cd kcbo
python setup.py sdist install
If any of this fails, you may need to install NumPy first (pip install numpy) to satisfy some of KCBO's dependencies, then retry installing KCBO.
There are currently three tests implemented in the KCBO library:
Lognormal difference of medians: compares the medians of log-normally distributed data that share a common variance.
Bayesian t-Test: an implementation of Kruschke's t-Test.
Conversion test: compares conversion (success) rates using the Beta-Binomial model. Popular in A/B testing and estimation.
from kcbo import lognormal_comparison_test, t_test, conversion_test
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline
Note: because this test uses a Monte Carlo simulation with the lognormal distribution's conjugate prior, it assumes that both distributions have the same variance.
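The reason the test can compare medians through mu: for a lognormal(mu, sigma) distribution the median is exp(mu), independent of sigma. A quick Monte Carlo check (the parameter values here are just illustrative):

```python
import numpy as np

np.random.seed(0)
# The median of a lognormal(mu, sigma) is exp(mu), independent of sigma,
# which is why comparing medians reduces to comparing mu.
draws = np.random.lognormal(mean=3, sigma=1, size=1000000)
sample_median = np.median(draws)
print(sample_median, np.exp(3))  # both close to 20.09
```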
# Generate some data
g1d = np.random.lognormal(mean=3, sigma=1, size=10000)
g1l = ['A'] * g1d.shape[0]
g2d = np.random.lognormal(mean=3.03, sigma=1, size=10000)
g2l = ['B'] * g2d.shape[0]
g1 = pd.DataFrame(data=g1d, columns=['value'])
g1['group'] = g1l
g2 = pd.DataFrame(data=g2d, columns=['value'])
g2['group'] = g2l
lognormal_data = pd.concat([g1, g2])
summary, data = lognormal_comparison_test(lognormal_data, samples=100000)
print summary
Lognormal Median Comparison Test

Groups: A, B

Estimates:

| Group | Median | 95% CI Lower | 95% CI Upper | Mu | 95% CI Lower | 95% CI Upper |
|:------|-------:|-------------:|-------------:|--------:|-------------:|-------------:|
| A | 19.7915 | 19.3976 | 20.1921 | 2.9852 | 2.96515 | 3.00529 |
| B | 20.3804 | 19.9753 | 20.7924 | 3.01452 | 2.9945 | 3.03459 |

Comparisions:

| Hypothesis | Difference of Medians | P.Value | 95% CI Lower | 95% CI Upper |
|:-----------|----------------------:|--------:|-------------:|-------------:|
| A < B | 0.58924 | 0.97753 | 0.0129132 | 1.15935 |
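The P.Value column is the posterior probability that the hypothesis holds, which can be read directly off the posterior samples as the fraction of draws satisfying it. A hedged sketch using synthetic stand-ins for KCBO's sample arrays (the normal approximations and their parameters below are assumptions, not KCBO output):

```python
import numpy as np

np.random.seed(1)
# Stand-ins for data['A']['median'] and data['B']['median'] posterior draws;
# means and sds are eyeballed from the summary table, normality is assumed.
median_A = np.random.normal(19.79, 0.20, 100000)
median_B = np.random.normal(20.38, 0.21, 100000)

diff = median_B - median_A
p_value = np.mean(diff > 0)                      # posterior P(A < B)
ci_lower, ci_upper = np.percentile(diff, [2.5, 97.5])
print(p_value, ci_lower, ci_upper)
```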
A,B = data['A']['median'], data['B']['median']
diff = data[('A','B')]['diff_medians']
f, axes = plt.subplots(1,2, figsize=(12, 7))
sns.set(style="white", palette="muted")
sns.despine(left=True)
sns.distplot(A, ax=axes[0], label='Median Estimate Density for A')
sns.distplot(B, ax=axes[0], label='Median Estimate Density for B')
sns.distplot(diff, ax=axes[1], label='Difference of Densities (B-A)')
axes[0].legend()
axes[1].legend()
plt.show()
Kruschke's Bayesian t-Test. Since this implementation uses PyMC2 and MCMC sampling, it can take a while for the sampler to converge.
n1,n2 = (140,200)
group1 = np.random.normal(15,2,n1)
group2 = np.random.normal(15.7,2,n2)
A = zip(['A']*n1, group1)
B = zip(['B']*n2, group2)
df = pd.concat([pd.DataFrame(A), pd.DataFrame(B)])
df.columns = ['group', 'value']
df.head()
|   | group | value |
|---|-------|----------:|
| 0 | A | 12.755334 |
| 1 | A | 17.371657 |
| 2 | A | 15.678301 |
| 3 | A | 13.358686 |
| 4 | A | 10.424609 |
description, data = t_test(df,groupcol='group',valuecol='value', samples=60000, progress_bar=True)
[-----------------100%-----------------] 60000 of 60000 complete in 39.3 sec
print description
Bayesian t-Test

| Hypothesis | Difference of Means | P.Value | 95% CI Lower | 95% CI Upper |
|:-----------|--------------------:|--------:|-------------:|-------------:|
| A < B | 1.02472 | 1 | 0.60196 | 1.44501 |

| Hypothesis | Difference of S.Dev | P.Value | 95% CI Lower | 95% CI Upper |
|:-----------|--------------------:|--------:|-------------:|-------------:|
| A < B | 0.0458748 | 1 | -0.25691 | 0.339418 |

| Hypothesis | Effect Size | P.Value | 95% CI Lower | 95% CI Upper |
|:-----------|------------:|--------:|-------------:|-------------:|
| A < B | 0.546456 | 1 | 0.317 | 0.775599 |
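Kruschke defines the effect size per posterior draw as (mu2 - mu1) / sqrt((sd1^2 + sd2^2) / 2). A hedged sketch with synthetic stand-ins for the sampler's traces (the arrays and their parameters are assumptions that only roughly mirror the example above):

```python
import numpy as np

np.random.seed(2)
# Synthetic stand-ins for the posterior traces of each group's mean and sd.
mu1 = np.random.normal(15.0, 0.17, 60000)
mu2 = np.random.normal(16.0, 0.14, 60000)
sd1 = np.random.normal(2.0, 0.12, 60000)
sd2 = np.random.normal(1.95, 0.10, 60000)

# Effect size computed sample-by-sample, yielding a posterior distribution.
effect_size = (mu2 - mu1) / np.sqrt((sd1 ** 2 + sd2 ** 2) / 2.0)
print(np.mean(effect_size), np.percentile(effect_size, [2.5, 97.5]))
```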
diff = data[('A', 'B')]['diff_means']
f, axes = plt.subplots(1,1, figsize=(12, 7))
sns.despine(left=True)
sns.distplot(diff, label='Difference of Means').legend()
plt.show()
A common task in A/B testing is comparing the conversion rates of two features. Here we take the number of successes and the total number of trials for each group and use the Beta-Binomial model.
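The Beta-Binomial model has a closed-form posterior: with a uniform Beta(1, 1) prior (an assumption here; KCBO's actual prior may differ), a group with s successes in n trials has posterior conversion rate Beta(1 + s, 1 + n - s). A hedged sketch of the comparison:

```python
import numpy as np

np.random.seed(3)
# Posterior draws for each group under an assumed uniform Beta(1, 1) prior:
# Beta(1 + successes, 1 + failures).
post_A = np.random.beta(1 + 5000, 1 + 10000 - 5000, 100000)
post_B = np.random.beta(1 + 4090, 1 + 8000 - 4090, 100000)

p_b_better = np.mean(post_B > post_A)  # posterior P(A < B)
print(p_b_better)
```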
A = {'group':'A', 'trials': 10000, 'successes':5000}
B = {'group':'B', 'trials': 8000, 'successes':4090}
df = pd.DataFrame([A,B])
df
|   | group | successes | trials |
|---|-------|----------:|-------:|
| 0 | A | 5000 | 10000 |
| 1 | B | 4090 | 8000 |
summary, data = conversion_test(df, groupcol='group',successcol='successes',totalcol='trials')
print summary
Beta-Binomial Conversion Rate Test

Groups: A, B

Estimates:

| Group | Estimate | 95% CI Lower | 95% CI Upper |
|:------|---------:|-------------:|-------------:|
| A | 0.500016 | 0.490261 | 0.509824 |
| B | 0.511272 | 0.500279 | 0.522188 |

Comparisions:

| Hypothesis | Difference | P.Value | 95% CI Lower | 95% CI Upper |
|:-----------|-----------:|--------:|-------------:|-------------:|
| A < B | 0.0112787 | 0.93366 | -0.00343106 | 0.0259026 |
A = data['A']['distribution']
B = data['B']['distribution']
diff = data[('A','B')]['distribution']
f, axes = plt.subplots(1,2, figsize=(12, 7))
sns.set(style="white", palette="muted")
sns.despine(left=True)
sns.distplot(A, ax=axes[0], label='Density Estimate for A')
sns.distplot(B, ax=axes[0], label='Density Estimate for B')
sns.distplot(diff, ax=axes[1], label='Difference of Densities (B-A)')
axes[0].legend()
axes[1].legend()
plt.show()