import pandas as pd
import pandas.rpy.common as com
# load the galton dataset from R's UsingR package (requires rpy2 and R)
galton = com.load_data('galton', package='UsingR')
We fit a line to the galton data and get the line parameters:
from statsmodels.formula.api import ols

lm1 = ols('child ~ parent', galton).fit()

# pylab-style plotting (assumes a %pylab inline environment, so plot/gca/etc. are available)
plot(galton['parent'], galton['child'], 'ob')
plot(galton['parent'], lm1.fittedvalues, 'r', linewidth=3);
lm1.params
Intercept    23.941530
parent        0.646291
# emulates R's abline function; supports only basic line styles and width
def abline(intercept, gradient, *args, **kwargs):
    a = gca()
    xlim = a.get_xlim()
    ylim = a.get_ylim()
    sty = args[0] if args else 'r'       # line style, e.g. 'r' or 'k'
    lw = kwargs.get('linewidth', 5)      # line width, default 5
    # draw the line across the current x-range without changing the axis limits
    a.plot(xlim, [intercept + gradient * x for x in xlim], sty, linewidth=lw)
    a.set_xlim(xlim)
    a.set_ylim(ylim)
The galton data is just one sample from the population of all possible child-parent pairs. Here we create a simulated population of 1 million families, assuming a normal distribution. On top of the population distribution plot, we show the line that was fitted to our sample (i.e. the galton data).
import numpy as np

# simulate a population of 1 million families from the fitted line plus normal noise
n = int(1e6)
parent = np.random.normal(loc=galton['parent'].mean(), scale=galton['parent'].std(), size=n)
child = (lm1.params['Intercept'] + lm1.params['parent'] * parent
         + np.random.normal(scale=lm1.resid.std(), size=n))
newGalton = pd.DataFrame({'parent': parent, 'child': child})
hexbin(newGalton['parent'], newGalton['child'], cmap='PuBu')
abline(lm1.params['Intercept'], lm1.params['parent'], 'r', linewidth=3);
Suppose we take a sample from the population, and fit a line to this sample (in black):
import random
random.seed(134325)
# draw a random sample of 50 families from the simulated population
idx = random.sample(range(int(1e6)), 50)
sampleGalton1 = newGalton.iloc[idx]
sampleLm1 = ols('child ~ parent', sampleGalton1).fit()
plot(sampleGalton1['parent'], sampleGalton1['child'], 'ob')
plot(sampleGalton1['parent'], sampleLm1.fittedvalues, 'k-.', linewidth=3)
abline(lm1.params['Intercept'], lm1.params['parent'], 'r', linewidth=3);
If we take another sample:
idx = random.sample(range(int(1e6)), 50)
sampleGalton2 = newGalton.iloc[idx]
sampleLm2 = ols('child ~ parent', sampleGalton2).fit()
plot(sampleGalton2['parent'], sampleGalton2['child'], 'ob')
plot(sampleGalton2['parent'], sampleLm2.fittedvalues, 'k-.', linewidth=1)
abline(lm1.params['Intercept'], lm1.params['parent'], 'r', linewidth=3);
We see that every sample produces a different fitted line. If we do this 100 times and plot all the fitted lines:
# repeat the sampling 100 times and keep each fitted model
sampleLm = []
for i in range(100):
    idx = random.sample(range(int(1e6)), 50)
    sampleGalton = newGalton.iloc[idx]
    sampleLm.append(ols('child ~ parent', sampleGalton).fit())

# plot the population with all 100 fitted lines (black) and the original fit (red)
hexbin(newGalton['parent'], newGalton['child'], cmap='PuBu')
for lm in sampleLm:
    abline(lm.params['Intercept'], lm.params['parent'], 'k', linewidth=1)
abline(lm1.params['Intercept'], lm1.params['parent'], 'r', linewidth=3);
If we plot the distribution of the line parameters:
f, (ax1, ax2) = subplots(ncols=2)
ax1.hist([lm.params['Intercept'] for lm in sampleLm])
ax1.set_xlabel('Intercept')
ax2.hist([lm.params['parent'] for lm in sampleLm])
ax2.set_xlabel('Slope')
f.tight_layout();
From the Central limit theorem:

$$\hat{b}_0 \sim N(b_0, \mathrm{Var}(\hat{b}_0))$$

which can be estimated with:

$$\hat{b}_0 \approx N(b_0, \widehat{\mathrm{Var}}(\hat{b}_0))$$

That is, we can quantify the uncertainty of our estimate using its standard error: $\sqrt{\widehat{\mathrm{Var}}(\hat{b}_0)}$.
The estimate's standard error is computed by the ols function, and the value can be accessed in the resulting regression object.
idx = random.sample(range(int(1e6)), 50)
sampleGalton4 = newGalton.iloc[idx]
sampleLm4 = ols('child ~ parent', sampleGalton4).fit()
sampleLm4.summary()
| Dep. Variable: | child | R-squared: | 0.132 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.114 |
| Method: | Least Squares | F-statistic: | 7.330 |
| Date: | Sat, 23 Mar 2013 | Prob (F-statistic): | 0.00937 |
| Time: | 16:04:39 | Log-Likelihood: | -114.48 |
| No. Observations: | 50 | AIC: | 233.0 |
| Df Residuals: | 48 | BIC: | 236.8 |
| Df Model: | 1 | | |

| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Intercept | 32.4634 | 13.141 | 2.470 | 0.017 | 6.042 58.885 |
| parent | 0.5177 | 0.191 | 2.707 | 0.009 | 0.133 0.902 |

| Omnibus: | 2.024 | Durbin-Watson: | 2.725 |
|---|---|---|---|
| Prob(Omnibus): | 0.364 | Jarque-Bera (JB): | 1.332 |
| Skew: | 0.111 | Prob(JB): | 0.514 |
| Kurtosis: | 2.232 | Cond. No. | 2.62e+03 |
# estimated vs real distribution --- very close
import scipy.stats as sst
hist([lm.params['parent'] for lm in sampleLm], density=True, bins=10)
xlabel('Slope')
# normal density implied by the single-sample estimate and its standard error
x = np.linspace(0.2, 1.4, num=100)
y = sst.norm.pdf(x, loc=sampleLm4.params['parent'], scale=sampleLm4.bse['parent'])
plot(x, y, 'r');
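We can also compare the two numerically. As a rough check (the exact numbers depend on the random draws), the standard deviation of the 100 sampled slopes should be close to the standard error ols reports for a single sample:

# spread of the slope across the 100 samples vs the model-based standard error
slopes = [lm.params['parent'] for lm in sampleLm]
print(np.std(slopes))              # empirical standard deviation of the slope estimates
print(sampleLm4.bse['parent'])     # standard error reported for a single 50-family sample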
Standardised coefficients:

$$\frac{\hat{b}_0 - b_0}{S.E.(\hat{b}_0)} \sim t_{n-2} \quad \text{and} \quad \frac{\hat{b}_1 - b_1}{S.E.(\hat{b}_1)} \sim t_{n-2}$$

That is, the standardised coefficients have a Student's t-distribution with $(n-2)$ degrees of freedom.

Degrees of Freedom ≈ (number of samples − number of things you estimated)
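As a quick check of this, using the sampleLm4 fit from above: with 50 observations and two estimated parameters (intercept and slope) we expect 48 degrees of freedom, which is what statsmodels stores as the residual degrees of freedom.

# degrees of freedom = number of observations - number of estimated parameters
print(sampleLm4.nobs - 2)    # 50 - 2 = 48
print(sampleLm4.df_resid)    # statsmodels reports the same residual degrees of freedom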
t-distribution (3 DoF in red, 10 DoF in blue) vs normal distribution (in black): the higher the DoF, the closer the t-distribution is to a normal distribution.
x = np.linspace(-5, 5, num=100)
plot(x, sst.norm.pdf(x), 'k', linewidth=3)
plot(x, sst.t.pdf(x, 3), 'r', linewidth=3)
plot(x, sst.t.pdf(x, 10), 'b', linewidth=3);
Confidence intervals: We have an estimate $\hat{b}_1$ and we want to know something about how good our estimate is. One way is to create a "level α confidence interval". A confidence interval will include the real parameter α percent of the time in repeated studies.
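We can check this interpretation by simulation. A rough sketch (coverage will only be approximately 95% for any finite number of trials): draw repeated samples from the simulated population, compute the 95% confidence interval for the slope in each, and count how often it contains the population slope.

# coverage check: how often does the 95% CI for the slope contain the population slope?
true_slope = lm1.params['parent']
n_trials = 100
covered = 0
for i in range(n_trials):
    idx = random.sample(range(int(1e6)), 50)
    fit = ols('child ~ parent', newGalton.iloc[idx]).fit()
    lo, hi = fit.conf_int().loc['parent']
    if lo <= true_slope <= hi:
        covered += 1
print(covered / float(n_trials))   # should be close to 0.95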
sampleLm4.params
Intercept    32.463411
parent        0.517689
sampleLm4.conf_int()

| | 0 | 1 |
|---|---|---|
| Intercept | 6.041595 | 58.885228 |
| parent | 0.133226 | 0.902152 |
How to report the inference? A one inch increase in parental height is associated with a 0.52 inch increase in child's height (95% CI: 0.13 to 0.90 inches).
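A minimal sketch of pulling these numbers straight from the fitted model (sampleLm4 from above), so the reported sentence stays in sync with the fit:

# build the report string from the slope estimate and its 95% confidence interval
slope = sampleLm4.params['parent']
lo, hi = sampleLm4.conf_int().loc['parent']
print('A one inch increase in parental height is associated with a %.2f inch '
      'increase in child height (95%% CI: %.2f to %.2f inches).' % (slope, lo, hi))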
Approach:
Our standardised slope parameter:

$$\frac{\hat{b}_1 - b_1}{S.E.(\hat{b}_1)} \sim t_{n-2}$$

Define the null hypothesis:

$H_0$: There is no relationship between parent and child height, i.e. $b_1 = 0$.

That is, the distribution of the slope parameter under the null hypothesis (the null distribution) is:

$$\frac{\hat{b}_1}{S.E.(\hat{b}_1)} \sim t_{n-2}$$

The alternative to the null hypothesis (the alternative hypothesis):

$H_a$: There is a relationship, i.e. $b_1 \neq 0$.

This is a two-tailed test.
We can plot the null distribution together with the observed statistic (the t-value):
# null distribution: t-distribution with n - 2 = 928 - 2 = 926 degrees of freedom
x = np.linspace(-20, 20, num=100)
plot(x, sst.t.pdf(x, 928 - 2), linewidth=3)
# mark the observed t-statistic of the slope with a red arrow
arr = Arrow(lm1.tvalues['parent'], .25, 0, -0.25, lw=3, fc='r', ec='r')
ax = gca()
ax.add_patch(arr);
It is clear from the figure that the probability of obtaining such an extreme t-value under the null distribution is very low, so we reject the null hypothesis.
How to compute p-values?
lm1.summary()
| Dep. Variable: | child | R-squared: | 0.210 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.210 |
| Method: | Least Squares | F-statistic: | 246.8 |
| Date: | Sat, 23 Mar 2013 | Prob (F-statistic): | 1.73e-49 |
| Time: | 16:41:26 | Log-Likelihood: | -2063.6 |
| No. Observations: | 928 | AIC: | 4131. |
| Df Residuals: | 926 | BIC: | 4141. |
| Df Model: | 1 | | |

| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Intercept | 23.9415 | 2.811 | 8.517 | 0.000 | 18.425 29.458 |
| parent | 0.6463 | 0.041 | 15.711 | 0.000 | 0.566 0.727 |

| Omnibus: | 11.057 | Durbin-Watson: | 0.046 |
|---|---|---|---|
| Prob(Omnibus): | 0.004 | Jarque-Bera (JB): | 10.944 |
| Skew: | -0.241 | Prob(JB): | 0.00420 |
| Kurtosis: | 2.775 | Cond. No. | 2.61e+03 |
lm1.pvalues
Intercept    6.536845e-17
parent       1.732509e-49
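These are the same values we get by computing the two-tailed p-value by hand from the observed t-statistic and the null t-distribution (a minimal sketch, using scipy.stats as imported above):

# two-tailed p-value: probability of a t-statistic at least this extreme under H0
t_stat = lm1.tvalues['parent']
df = lm1.df_resid                       # 928 observations - 2 parameters = 926
print(2 * sst.t.sf(abs(t_stat), df))    # matches lm1.pvalues['parent']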
Some typical cutoff values: P < 0.05 is conventionally called "significant", P < 0.01 "strongly significant", and P < 0.001 "very significant".
In modern analyses, people generally report both the confidence interval and the P-value. This is less true when many hypotheses are tested.
How to interpret the results: A one inch increase in parental height is associated with a 0.52 inch increase in child's height (95% CI: 0.13 to 0.90 inches). This difference is statistically significant (P < 0.01).