There are two major modules for doing statistical analyses in Python: Scipy's stats module and statsmodels, which between them cover most standard analyses (including ANOVA, multiple regression, generalized linear models, etc.). To see the full functionality of these modules you'll need to look through their documentation, but here are a few examples to show that a lot of the standard statistical tests and models you need can be done easily in Python.
You'll want the stats module from Scipy and the api and formula.api modules from statsmodels. We'll also go ahead and import Numpy for generating data, Matplotlib for some graphing, and Pandas for managing data frames.
import scipy.stats as stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
stats.describe() gives basic descriptive statistics on any list of numbers. It returns (in the following order) the size of the data, its minimum and maximum, the mean, the variance, the skewness, and the kurtosis.
x = [1, 2, 3, 4, 5]
stats.describe(x)
(5, (1, 5), 3.0, 2.5, 0.0, -1.3)
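Depending on your Scipy version the return value is a plain tuple or a namedtuple, but either way it can be unpacked positionally; a quick sketch (the variable names here are just for illustration):

#unpack the output of stats.describe into separate variables
n, (min_x, max_x), mean_x, var_x, skew_x, kurt_x = stats.describe(x)
print "n = %d, mean = %.1f, variance = %.1f" % (n, mean_x, var_x)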
We can also get this kind of information for columns in a DataFrame using the .describe() method.
data = pd.DataFrame([[1, 2, 3.5], [2, 2.4, 3.1], [3, 1.8, 2.5]], columns=['a', 'b', 'c'])
print data
data.describe()
   a    b    c
0  1  2.0  3.5
1  2  2.4  3.1
2  3  1.8  2.5

[3 rows x 3 columns]
|       | a   | b        | c        |
|-------|-----|----------|----------|
| count | 3.0 | 3.000000 | 3.000000 |
| mean  | 2.0 | 2.066667 | 3.033333 |
| std   | 1.0 | 0.305505 | 0.503322 |
| min   | 1.0 | 1.800000 | 2.500000 |
| 25%   | 1.5 | 1.900000 | 2.800000 |
| 50%   | 2.0 | 2.000000 | 3.100000 |
| 75%   | 2.5 | 2.200000 | 3.300000 |
| max   | 3.0 | 2.400000 | 3.500000 |
Standard 1-sample and 2-sample t-tests (as well as many other basic statistical tests) are available in scipy.stats.
T-tests return two values, the t-statistic and the p-value. First, let's generate some data that is normally distributed around zero.
x1 = np.random.randn(100, 1)
x2 = np.random.randn(100, 1)
To determine if the mean of x1 is different from some number, use a one-sample t-test. We'll compare the mean of x1 to both 0 (which it shouldn't be different from) and to 1 (which it should be different from).
tstat, pval = stats.ttest_1samp(x1, 0)
print "Comparison of the mean of x1 to 0.\nT-statistic = %s; P-value = %s." % (tstat, pval)
tstat, pval = stats.ttest_1samp(x1, 1)
print "Comparison of the mean of x1 to 5.\nT-statistic = %s; P-value = %s." % (tstat, pval)
Comparison of the mean of x1 to 0.
T-statistic = [ 1.17526772]; P-value = [ 0.24270646].
Comparison of the mean of x1 to 1.
T-statistic = [-9.46674519]; P-value = [ 1.59375523e-15].
To determine if the means of two different sets of numbers are different from one another, use a two-sample t-test. We'll compare the means of x1 and x2, which shouldn't be different from one another since both samples are centered on zero.
tstat, pval = stats.ttest_ind(x1, x2)
print "Comparison of the means of x1 and x2.\nT-statistic = %s; P-value = %s." % (tstat, pval)
Comparison of the means of x1 and x2.
T-statistic = [ 1.54942032]; P-value = [ 0.12287759].
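Many of the other basic tests in scipy.stats follow the same pattern of returning a test statistic and a p-value. As a quick sketch (reusing x1 and x2 purely for illustration, since they aren't really paired measurements):

#paired t-test on two related samples
tstat, pval = stats.ttest_rel(x1, x2)
print "Paired t-test: t = %s; p = %s" % (tstat, pval)

#Pearson correlation (pearsonr expects 1-D arrays, so we flatten the columns)
r, pval = stats.pearsonr(x1.ravel(), x2.ravel())
print "Pearson correlation: r = %s; p = %s" % (r, pval)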
scipy.stats also includes a powerful general system for working with statistical distributions.
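For example, every distribution exposes its density, cumulative distribution, and quantile functions directly; here is a minimal sketch for the standard normal:

#density, cumulative probability, and quantile for a standard normal
print stats.norm.pdf(0)      #height of the density at 0
print stats.norm.cdf(1.96)   #P(X <= 1.96), roughly 0.975
print stats.norm.ppf(0.975)  #the quantile giving that probability, roughly 1.96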
Let's generate some random normally distributed numbers to analyze.
#mean and standard deviation of the normal distribution we'll sample from
u, sigma = 4, 2
random_numbers = stats.norm.rvs(u, sigma, size=50)
random_numbers
array([ -2.69438161e-02, 2.92683231e+00, 6.59328450e+00, 5.09170230e+00, 4.99668460e+00, 5.50528668e+00, 7.50782290e+00, 4.63995490e+00, 3.17923549e+00, 7.95144616e+00, 2.19070256e+00, 6.40680973e+00, 4.65721099e+00, 1.77901732e+00, 2.63237191e+00, 4.49198740e+00, 5.56909943e+00, -1.83542925e-01, 6.42398538e+00, 6.55107469e+00, 3.45866471e+00, 3.66296653e+00, 4.46982893e+00, 1.11183439e-01, 3.14154367e+00, 3.35583976e+00, 4.99528367e+00, 3.56179539e+00, 5.21900454e+00, 2.35695456e+00, 4.38756723e+00, 5.83026447e+00, 1.46130374e+00, 1.21498908e+00, 7.73665200e+00, -2.39077287e-03, 3.45132547e+00, 4.62012917e+00, 8.59613520e-01, 3.73481950e+00, 7.26835158e+00, 3.26567510e+00, -6.84996675e-01, 3.19824323e+00, 6.74505956e+00, 3.50754549e+00, 4.60034178e+00, 4.63561359e+00, 6.11115427e+00, 5.12909660e+00])
We can then fit a distribution to this data; for the normal distribution, fit returns the estimated mean and standard deviation.
stats.norm.fit(random_numbers)
(4.0057489130411792, 2.1683842014426378)
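The fitted parameters can be plugged straight back into the distribution object, for example to do a rough goodness-of-fit check; a sketch (the variable names are just for illustration):

#rough check of the fit using a Kolmogorov-Smirnov test with the fitted parameters
mu_fit, sigma_fit = stats.norm.fit(random_numbers)
ks_stat, ks_pval = stats.kstest(random_numbers, 'norm', args=(mu_fit, sigma_fit))
print "KS statistic = %s; P-value = %s" % (ks_stat, ks_pval)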
You can do simple univariate OLS regression in Scipy, but for anything more complicated you'll need statsmodels, so we'll just do the basics in statsmodels as well.
First, we'll generate some data and plot it.
#generate some data
x = np.array(range(20))
y = 3 + 0.5 * x + 2 * np.random.randn(20)
data = pd.DataFrame(zip(x, y), columns=['x', 'y'])
#plot the data
plt.plot(data['x'], data['y'], 'bo')
plt.show()
Now let's do a regression. We'll use formulas to specify regression models. This allows us to write our code in simple and intuitive ways, and means that we don't have to remember how to create design matrices for more complicated models. To do this we'll need to be using Pandas to manage the data. The regression function is named ols, for "ordinary least squares," the standard approach to regression. It takes two arguments. The first is a formula describing the regression we want to fit; in this case we just want to model the effect of x on y, so we use y ~ x, where x and y are the names of columns in a data frame. The second argument is the data frame that contains the columns x and y.
results = smf.ols('y ~ x', data).fit()
print results.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.785
Model:                            OLS   Adj. R-squared:                  0.773
Method:                 Least Squares   F-statistic:                     65.84
Date:                Sat, 04 Oct 2014   Prob (F-statistic):           2.00e-07
Time:                        11:41:12   Log-Likelihood:                -40.471
No. Observations:                  20   AIC:                             84.94
Df Residuals:                      18   BIC:                             86.93
Df Model:                           1
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept      2.2481      0.832      2.704      0.015         0.501     3.995
x              0.6071      0.075      8.114      0.000         0.450     0.764
==============================================================================
Omnibus:                        1.724   Durbin-Watson:                   2.053
Prob(Omnibus):                  0.422   Jarque-Bera (JB):                1.008
Skew:                          -0.549   Prob(JB):                        0.604
Kurtosis:                       2.940   Cond. No.                         21.5
==============================================================================
So, .summary() presents us with most of the information we would want about the regression.
We can also pull this information out in individual pieces. For example,
intercept, slope = results.params
r2 = results.rsquared
print slope, intercept, r2
0.607133890043 2.24811810288 0.78529268393
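Other useful quantities are available as attributes of the results object as well, for example:

#p-values and confidence intervals for each coefficient
print results.pvalues
print results.conf_int()

#fitted values and residuals, handy for diagnostics
print results.fittedvalues.head()
print results.resid.head()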
This makes it easy to store the key results of large numbers of analyses, or present the results in alternative ways, like graphs.
#plot the data and the fitted regression line
plt.plot(data['x'], data['y'], 'bo')
x_line = np.array([min(data['x']), max(data['x'])])
y_line = intercept + slope * x_line
plt.plot(x_line, y_line, 'r-')
plt.show()
You'll notice that in order to plot the regression line what we actually do is plot a straight line with the fitted slope and intercept, evaluated at the minimum and maximum values of x.
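To illustrate the earlier point about storing the key results of many analyses, here is a minimal sketch that refits the same model to several simulated datasets and collects the slope and R-squared in a data frame (the simulated data and column names are just for illustration):

#refit the same model to several simulated datasets and collect key results
rows = []
x_sim = np.arange(20)
for i in range(5):
    y_sim = 3 + 0.5 * x_sim + 2 * np.random.randn(20)
    sim_data = pd.DataFrame({'x': x_sim, 'y': y_sim})
    fit = smf.ols('y ~ x', sim_data).fit()
    rows.append([fit.params['x'], fit.rsquared])
print pd.DataFrame(rows, columns=['slope', 'r2'])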
Multiple regression works the same way, but with additional terms in the formula.
#generate some data
x = np.random.randn(50)
z = np.random.randn(50)
noise = np.random.randn(50)
y = 3 + 0.5 * x + 1.5 * z + noise
data = pd.DataFrame(zip(x, y, z), columns=['x', 'y', 'z'])
results = smf.ols('y ~ x + z', data).fit()
print results.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.765
Model:                            OLS   Adj. R-squared:                  0.755
Method:                 Least Squares   F-statistic:                     76.36
Date:                Sat, 04 Oct 2014   Prob (F-statistic):           1.72e-15
Time:                        11:41:12   Log-Likelihood:                -65.717
No. Observations:                  50   AIC:                             137.4
Df Residuals:                      47   BIC:                             143.2
Df Model:                           2
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept      2.8254      0.133     21.294      0.000         2.558     3.092
x              0.5812      0.124      4.699      0.000         0.332     0.830
z              1.4573      0.123     11.868      0.000         1.210     1.704
==============================================================================
Omnibus:                        2.028   Durbin-Watson:                   1.739
Prob(Omnibus):                  0.363   Jarque-Bera (JB):                1.936
Skew:                           0.452   Prob(JB):                        0.380
Kurtosis:                       2.663   Cond. No.                         1.20
==============================================================================
This makes it easy to do more complicated designs, including interactions.
results = smf.ols('y ~ x + z + x*z', data).fit()
print results.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.766
Model:                            OLS   Adj. R-squared:                  0.751
Method:                 Least Squares   F-statistic:                     50.17
Date:                Sat, 04 Oct 2014   Prob (F-statistic):           1.52e-14
Time:                        11:41:12   Log-Likelihood:                -65.585
No. Observations:                  50   AIC:                             139.2
Df Residuals:                      46   BIC:                             146.8
Df Model:                           3
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept      2.8336      0.135     21.021      0.000         2.562     3.105
x              0.5671      0.128      4.432      0.000         0.310     0.825
z              1.4522      0.124     11.690      0.000         1.202     1.702
x:z            0.0541      0.110      0.493      0.624        -0.167     0.275
==============================================================================
Omnibus:                        1.898   Durbin-Watson:                   1.788
Prob(Omnibus):                  0.387   Jarque-Bera (JB):                1.844
Skew:                           0.425   Prob(JB):                        0.398
Kurtosis:                       2.596   Cond. No.                         1.46
==============================================================================
We can also include transformations in formulas using functions from Numpy. Note that np.log10 returns NaN for the negative values of z, and those rows are dropped from the fit, which is why the summary below reports only 26 observations.
results = smf.ols('y ~ x + np.log10(z)', data).fit()
print results.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.336
Model:                            OLS   Adj. R-squared:                  0.278
Method:                 Least Squares   F-statistic:                     5.825
Date:                Sat, 04 Oct 2014   Prob (F-statistic):            0.00898
Time:                        11:41:12   Log-Likelihood:                -41.444
No. Observations:                  26   AIC:                             88.89
Df Residuals:                      23   BIC:                             92.66
Df Model:                           2
===============================================================================
                  coef    std err          t      P>|t|     [95.0% Conf. Int.]
-------------------------------------------------------------------------------
Intercept       4.5062      0.261     17.288      0.000         3.967     5.045
x               0.6071      0.232      2.616      0.015         0.127     1.087
np.log10(z)     2.2221      0.786      2.826      0.010         0.596     3.849
==============================================================================
Omnibus:                        2.703   Durbin-Watson:                   1.307
Prob(Omnibus):                  0.259   Jarque-Bera (JB):                1.915
Skew:                           0.664   Prob(JB):                        0.384
Kurtosis:                       2.960   Cond. No.                         3.64
==============================================================================
Using formulas also makes it easy to conduct statistical tests that are based on linear models. ANOVA is simply a linear model with appropriate dummy variables set up for each factor. To do this in statsmodels we simply wrap the variable of interest in C() to tell the module that it is categorical.
This time, let's start by grabbing some data from the web.
url = 'http://stats191.stanford.edu/data/rehab.csv'
rehab_table = pd.read_table(url, delimiter=",")
rehab_table
|    | Fitness | Time |
|----|---------|------|
| 0  | 1       | 29   |
| 1  | 1       | 42   |
| 2  | 1       | 38   |
| 3  | 1       | 40   |
| 4  | 1       | 43   |
| 5  | 1       | 40   |
| 6  | 1       | 30   |
| 7  | 1       | 42   |
| 8  | 2       | 30   |
| 9  | 2       | 35   |
| 10 | 2       | 39   |
| 11 | 2       | 28   |
| 12 | 2       | 31   |
| 13 | 2       | 31   |
| 14 | 2       | 29   |
| 15 | 2       | 35   |
| 16 | 2       | 29   |
| 17 | 2       | 33   |
| 18 | 3       | 26   |
| 19 | 3       | 32   |
| 20 | 3       | 21   |
| 21 | 3       | 20   |
| 22 | 3       | 23   |
| 23 | 3       | 22   |
To see if the time to rehabilitate an injury is related to the fitness category we do the same as above, but wrap the predictor in C().
results = smf.ols('Time ~ C(Fitness)', rehab_table).fit()
print results.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:                   Time   R-squared:                       0.618
Model:                            OLS   Adj. R-squared:                  0.581
Method:                 Least Squares   F-statistic:                     16.96
Date:                Sat, 04 Oct 2014   Prob (F-statistic):           4.13e-05
Time:                        11:41:13   Log-Likelihood:                -68.286
No. Observations:                  24   AIC:                             142.6
Df Residuals:                      21   BIC:                             146.1
Df Model:                           2
===================================================================================
                      coef    std err          t      P>|t|     [95.0% Conf. Int.]
-----------------------------------------------------------------------------------
Intercept          38.0000      1.574     24.149      0.000        34.728    41.272
C(Fitness)[T.2]    -6.0000      2.111     -2.842      0.010       -10.390    -1.610
C(Fitness)[T.3]   -14.0000      2.404     -5.824      0.000       -18.999    -9.001
==============================================================================
Omnibus:                        0.163   Durbin-Watson:                   2.209
Prob(Omnibus):                  0.922   Jarque-Bera (JB):                0.211
Skew:                          -0.163   Prob(JB):                        0.900
Kurtosis:                       2.675   Cond. No.                         3.80
==============================================================================
While all of the information that we want is technically present in this table, we typically want it presented in the more standard ANOVA fashion. We can do this using the anova_lm function.
from statsmodels.stats.anova import anova_lm
anova_lm(results)
|            | df | sum_sq | mean_sq    | F         | PR(>F)   |
|------------|----|--------|------------|-----------|----------|
| C(Fitness) | 2  | 672    | 336.000000 | 16.961538 | 0.000041 |
| Residual   | 21 | 416    | 19.809524  | NaN       | NaN      |
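The object returned by anova_lm is an ordinary Pandas DataFrame indexed by term, so individual values can be pulled out in the usual way; a quick sketch:

#pull individual values out of the ANOVA table
anova_table = anova_lm(results)
print anova_table.loc['C(Fitness)', 'F']
print anova_table.loc['C(Fitness)', 'PR(>F)']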
Statsmodels includes much more advanced functionality including: