import pandas as pd
import pandas.rpy.common as com
# load the galton dataset from R's UsingR package (requires rpy2 and R)
galton = com.load_data('galton', package='UsingR')
We fit a line to the galton data and get the line parameters:
from statsmodels.formula.api import ols

lm1 = ols('child ~ parent', galton).fit()

# pylab-style plotting (assumes a %pylab inline environment, so plot/gca/etc. are available)
plot(galton['parent'], galton['child'], 'ob')
plot(galton['parent'], lm1.fittedvalues, 'r', linewidth=3);
lm1.params
Intercept    23.941530
parent        0.646291
# emulates R's abline function; supports only basic line styles and width
def abline(intercept, gradient, *args, **kwargs):
    a = gca()
    xlim = a.get_xlim()
    ylim = a.get_ylim()
    sty = args[0] if args else 'r'       # line style, e.g. 'r' or 'k'
    lw = kwargs.get('linewidth', 5)      # line width, default 5
    # draw the line across the current x-range without changing the axis limits
    a.plot(xlim, [intercept + gradient * x for x in xlim], sty, linewidth=lw)
    a.set_xlim(xlim)
    a.set_ylim(ylim)
The galton data is just one sample from the population of all possible child-parent pairs. Here we create a simulated population of 1 million families, assuming a normal distribution. On top of the population distribution plot, we show the line that was fitted to our sample (i.e. the galton data).
import numpy as np

# simulate a population of 1 million families from the fitted line plus normal noise
n = int(1e6)
parent = np.random.normal(loc=galton['parent'].mean(), scale=galton['parent'].std(), size=n)
child = (lm1.params['Intercept'] + lm1.params['parent'] * parent
         + np.random.normal(scale=lm1.resid.std(), size=n))
newGalton = pd.DataFrame({'parent': parent, 'child': child})
hexbin(newGalton['parent'], newGalton['child'], cmap='PuBu')
abline(lm1.params['Intercept'], lm1.params['parent'], 'r', linewidth=3);
Suppose we take a sample from the population, and fit a line to this sample (in black):
import random
random.seed(134325)
# draw a random sample of 50 families from the simulated population
idx = random.sample(range(int(1e6)), 50)
sampleGalton1 = newGalton.iloc[idx]
sampleLm1 = ols('child ~ parent', sampleGalton1).fit()
plot(sampleGalton1['parent'], sampleGalton1['child'], 'ob')
plot(sampleGalton1['parent'], sampleLm1.fittedvalues, 'k-.', linewidth=3)
abline(lm1.params['Intercept'], lm1.params['parent'], 'r', linewidth=3);
If we take another sample:
idx = random.sample(range(int(1e6)), 50)
sampleGalton2 = newGalton.iloc[idx]
sampleLm2 = ols('child ~ parent', sampleGalton2).fit()
plot(sampleGalton2['parent'], sampleGalton2['child'], 'ob')
plot(sampleGalton2['parent'], sampleLm2.fittedvalues, 'k-.', linewidth=1)
abline(lm1.params['Intercept'], lm1.params['parent'], 'r', linewidth=3);
We see that every sample produces a different fitted line. If we do this 100 times and plot all the fitted lines:
# repeat the sampling 100 times and keep each fitted model
sampleLm = []
for i in range(100):
    idx = random.sample(range(int(1e6)), 50)
    sampleGalton = newGalton.iloc[idx]
    sampleLm.append(ols('child ~ parent', sampleGalton).fit())

# plot the population with all 100 fitted lines (black) and the original fit (red)
hexbin(newGalton['parent'], newGalton['child'], cmap='PuBu')
for lm in sampleLm:
    abline(lm.params['Intercept'], lm.params['parent'], 'k', linewidth=1)
abline(lm1.params['Intercept'], lm1.params['parent'], 'r', linewidth=3);
If we plot the distribution of the line parameters:
f, (ax1, ax2) = subplots(ncols=2)
ax1.hist([lm.params['Intercept'] for lm in sampleLm])
ax1.set_xlabel('Intercept')
ax2.hist([lm.params['parent'] for lm in sampleLm])
ax2.set_xlabel('Slope')
f.tight_layout();
From the Central limit theorem:

$$\hat{b}_0 \sim N(b_0, \mathrm{Var}(\hat{b}_0))$$

which can be estimated with:

$$\hat{b}_0 \approx N(b_0, \widehat{\mathrm{Var}}(\hat{b}_0))$$

That is, we can quantify the uncertainty of our estimate using its standard error: $\sqrt{\widehat{\mathrm{Var}}(\hat{b}_0)}$.
The estimate's standard error is computed by the ols function, and the value can be accessed in the resulting regression object.
idx = random.sample(range(int(1e6)), 50)
sampleGalton4 = newGalton.iloc[idx]
sampleLm4 = ols('child ~ parent', sampleGalton4).fit()
sampleLm4.summary()
| Dep. Variable: | child | R-squared: | 0.132 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.114 |
| Method: | Least Squares | F-statistic: | 7.330 |
| Date: | Sat, 23 Mar 2013 | Prob (F-statistic): | 0.00937 |
| Time: | 16:04:39 | Log-Likelihood: | -114.48 |
| No. Observations: | 50 | AIC: | 233.0 |
| Df Residuals: | 48 | BIC: | 236.8 |
| Df Model: | 1 | | |

| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Intercept | 32.4634 | 13.141 | 2.470 | 0.017 | 6.042 58.885 |
| parent | 0.5177 | 0.191 | 2.707 | 0.009 | 0.133 0.902 |

| Omnibus: | 2.024 | Durbin-Watson: | 2.725 |
|---|---|---|---|
| Prob(Omnibus): | 0.364 | Jarque-Bera (JB): | 1.332 |
| Skew: | 0.111 | Prob(JB): | 0.514 |
| Kurtosis: | 2.232 | Cond. No. | 2.62e+03 |
# estimated vs real distribution --- very close
import scipy.stats as sst
hist([lm.params['parent'] for lm in sampleLm], density=True, bins=10)
xlabel('Slope')
# normal density implied by the single-sample estimate and its standard error
x = np.linspace(0.2, 1.4, num=100)
y = sst.norm.pdf(x, loc=sampleLm4.params['parent'], scale=sampleLm4.bse['parent'])
plot(x, y, 'r');
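We can also compare the two numerically. As a rough check (the exact numbers depend on the random draws), the standard deviation of the 100 sampled slopes should be close to the standard error ols reports for a single sample:

# spread of the slope across the 100 samples vs the model-based standard error
slopes = [lm.params['parent'] for lm in sampleLm]
print(np.std(slopes))              # empirical standard deviation of the slope estimates
print(sampleLm4.bse['parent'])     # standard error reported for a single 50-family sample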
Standardised coefficients:

$$\frac{\hat{b}_0 - b_0}{S.E.(\hat{b}_0)} \sim t_{n-2} \quad \text{and} \quad \frac{\hat{b}_1 - b_1}{S.E.(\hat{b}_1)} \sim t_{n-2}$$

That is, the standardised coefficients have a Student's t-distribution with $(n-2)$ degrees of freedom.

Degrees of Freedom ≈ (number of samples − number of things you estimated)
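As a quick check of this, using the sampleLm4 fit from above: with 50 observations and two estimated parameters (intercept and slope) we expect 48 degrees of freedom, which is what statsmodels stores as the residual degrees of freedom.

# degrees of freedom = number of observations - number of estimated parameters
print(sampleLm4.nobs - 2)    # 50 - 2 = 48
print(sampleLm4.df_resid)    # statsmodels reports the same residual degrees of freedom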
t-distribution (3 DoF in red, 10 DoF in blue) vs normal distribution (in black): the higher the DoF, the closer the t-distribution is to a normal distribution.
x = np.linspace(-5, 5, num=100)
plot(x, sst.norm.pdf(x), 'k', linewidth=3)
plot(x, sst.t.pdf(x, 3), 'r', linewidth=3)
plot(x, sst.t.pdf(x, 10), 'b', linewidth=3);
Confidence intervals: We have an estimate $\hat{b}_1$ and we want to know something about how good our estimate is. One way is to create a "level α confidence interval". A confidence interval will include the real parameter α percent of the time in repeated studies.
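We can check this interpretation by simulation. A rough sketch (coverage will only be approximately 95% for any finite number of trials): draw repeated samples from the simulated population, compute the 95% confidence interval for the slope in each, and count how often it contains the population slope.

# coverage check: how often does the 95% CI for the slope contain the population slope?
true_slope = lm1.params['parent']
n_trials = 100
covered = 0
for i in range(n_trials):
    idx = random.sample(range(int(1e6)), 50)
    fit = ols('child ~ parent', newGalton.iloc[idx]).fit()
    lo, hi = fit.conf_int().loc['parent']
    if lo <= true_slope <= hi:
        covered += 1
print(covered / float(n_trials))   # should be close to 0.95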
sampleLm4.params
Intercept    32.463411
parent        0.517689
sampleLm4.conf_int()

| | 0 | 1 |
|---|---|---|
| Intercept | 6.041595 | 58.885228 |
| parent | 0.133226 | 0.902152 |
How to report the inference? A one inch increase in parental height is associated with a 0.52 inch increase in child's height (95% CI: 0.13 to 0.90 inches).
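A minimal sketch of pulling these numbers straight from the fitted model (sampleLm4 from above), so the reported sentence stays in sync with the fit:

# build the report string from the slope estimate and its 95% confidence interval
slope = sampleLm4.params['parent']
lo, hi = sampleLm4.conf_int().loc['parent']
print('A one inch increase in parental height is associated with a %.2f inch '
      'increase in child height (95%% CI: %.2f to %.2f inches).' % (slope, lo, hi))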
Approach:
Our standardised slope parameter:

$$\frac{\hat{b}_1 - b_1}{S.E.(\hat{b}_1)} \sim t_{n-2}$$

Define the null hypothesis:

$H_0$: There is no relationship between parent and child height, i.e. $b_1 = 0$.

That is, the distribution of the slope parameter under the null hypothesis (the null distribution) is:

$$\frac{\hat{b}_1}{S.E.(\hat{b}_1)} \sim t_{n-2}$$

The alternative to the null hypothesis (the alternative hypothesis):

$H_a$: There is a relationship, i.e. $b_1 \neq 0$.

This is a two-tailed test.
We can plot the null distribution together with the observed statistic (the t-value):
# null distribution: t-distribution with n - 2 = 928 - 2 = 926 degrees of freedom
x = np.linspace(-20, 20, num=100)
plot(x, sst.t.pdf(x, 928 - 2), linewidth=3)
# mark the observed t-statistic of the slope with a red arrow
arr = Arrow(lm1.tvalues['parent'], .25, 0, -0.25, lw=3, fc='r', ec='r')
ax = gca()
ax.add_patch(arr);
It is clear from the figure that the probability of obtaining such an extreme t-value under the null distribution is very low, so we reject the null hypothesis.
How to compute p-values?
lm1.summary()
| Dep. Variable: | child | R-squared: | 0.210 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.210 |
| Method: | Least Squares | F-statistic: | 246.8 |
| Date: | Sat, 23 Mar 2013 | Prob (F-statistic): | 1.73e-49 |
| Time: | 16:41:26 | Log-Likelihood: | -2063.6 |
| No. Observations: | 928 | AIC: | 4131. |
| Df Residuals: | 926 | BIC: | 4141. |
| Df Model: | 1 | | |

| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Intercept | 23.9415 | 2.811 | 8.517 | 0.000 | 18.425 29.458 |
| parent | 0.6463 | 0.041 | 15.711 | 0.000 | 0.566 0.727 |

| Omnibus: | 11.057 | Durbin-Watson: | 0.046 |
|---|---|---|---|
| Prob(Omnibus): | 0.004 | Jarque-Bera (JB): | 10.944 |
| Skew: | -0.241 | Prob(JB): | 0.00420 |
| Kurtosis: | 2.775 | Cond. No. | 2.61e+03 |
lm1.pvalues
Intercept    6.536845e-17
parent       1.732509e-49
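These are the same values we get by computing the two-tailed p-value by hand from the observed t-statistic and the null t-distribution (a minimal sketch, using scipy.stats as imported above):

# two-tailed p-value: probability of a t-statistic at least this extreme under H0
t_stat = lm1.tvalues['parent']
df = lm1.df_resid                       # 928 observations - 2 parameters = 926
print(2 * sst.t.sf(abs(t_stat), df))    # matches lm1.pvalues['parent']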
Some typical cutoff values: P < 0.05 is conventionally called "significant", P < 0.01 "strongly significant", and P < 0.001 "very significant".
In modern analyses, people generally report both the confidence interval and the P-value. This is less true when many hypotheses are tested.
How to interpret the results: A one inch increase in parental height is associated with a 0.52 inch increase in child's height (95% CI: 0.13 to 0.90 inches). This difference is statistically significant (P < 0.01).