import pandas as pd
import numpy as np
import pandas.rpy.common as com
To do this exercise, let's load the galton
data from R:
galton = com.load_data('galton', package='UsingR')
galton.head()
child | parent | |
---|---|---|
1 | 61.7 | 70.5 |
2 | 61.7 | 68.5 |
3 | 61.7 | 65.5 |
4 | 61.7 | 64.5 |
5 | 61.7 | 64.0 |
Plot the histograms of the child and parent heights:
f, (ax1, ax2) = subplots(ncols=2)
ax1.hist(galton['child'], bins=100)
ax2.hist(galton['parent'], bins=100)
f.tight_layout();
If we only know the child data, the average is a good estimate, because if Ci is the height of child i then the average is the value of μ that minimizes:
928∑i=1(Ci−μ)2Add a line to show the mean of the child height:
f, ax = subplots(ncols=1)
ax.hist(galton['child'], bins=100)
meanChild = galton['child'].mean()
ax.plot(np.repeat(meanChild, 100), np.linspace(0, 150, num=100), 'r', linewidth=5);
If we plot child vs parent height:
plot(galton['parent'], galton['child'], 'ob');
Some measurement points are stack up one after another, we can't see all of them in the plot. What we can do is to add some jitter:
def jitter(series, factor):
z = float(series.max()) - float(series.min())
a = float(factor) * z / 50.
return series.apply(lambda x: x + np.random.uniform(-a, a))
np.random.seed(1234)
plot(jitter(galton['parent'], factor=1), jitter(galton['child'], factor=1), 'ob', alpha=.5);
Fitting a line
If Ci is the height of child i and Pi is the height of the average parent, then we can imagine writing the equation for a line
Ci=b0+b1Pi+eiei is everything we didn't measure (how much they eat, where they live, do they stretch in the morning...)
We pick a line that minimises:
928∑i=1(Ci−(b0+b1Pi))2We compute this by using the ols
(ordinary least square) function from statsmodels
.
Then we can plot the line the the 'residuals':
from statsmodels.formula.api import ols
lm = ols('child ~ parent', galton).fit()
f, (ax1, ax2) = subplots(ncols=2)
# plot fitted values
ax1.plot(galton['parent'], galton['child'], 'ob')
ax1.plot(galton['parent'], lm.fittedvalues, 'r', linewidth=3);
# plot residuals
ax2.plot(galton['parent'], lm.resid, 'ob')
ax2.plot(galton['parent'], np.repeat(0, len(galton['parent'])), 'r', linewidth=3)
f.tight_layout();