import pandas as pd
import numpy as np
import pandas.rpy.common as com
To do this exercise, let's load the galton
data from R:
galton = com.load_data('galton', package='UsingR')
galton.head()
child | parent | |
---|---|---|
1 | 61.7 | 70.5 |
2 | 61.7 | 68.5 |
3 | 61.7 | 65.5 |
4 | 61.7 | 64.5 |
5 | 61.7 | 64.0 |
Plot the histograms of the child and parent heights:
f, (ax1, ax2) = subplots(ncols=2)
ax1.hist(galton['child'], bins=100)
ax2.hist(galton['parent'], bins=100)
f.tight_layout();
If we only know the child data, the average is a good estimate, because if $C_i$ is the height of child $i$ then the average is the value of $\mu$ that minimizes:
$$\sum_{i=1}^{928}{(C_i - \mu)}^2$$Add a line to show the mean of the child height:
f, ax = subplots(ncols=1)
ax.hist(galton['child'], bins=100)
meanChild = galton['child'].mean()
ax.plot(np.repeat(meanChild, 100), np.linspace(0, 150, num=100), 'r', linewidth=5);
If we plot child vs parent height:
plot(galton['parent'], galton['child'], 'ob');
Some measurement points are stack up one after another, we can't see all of them in the plot. What we can do is to add some jitter:
def jitter(series, factor):
z = float(series.max()) - float(series.min())
a = float(factor) * z / 50.
return series.apply(lambda x: x + np.random.uniform(-a, a))
np.random.seed(1234)
plot(jitter(galton['parent'], factor=1), jitter(galton['child'], factor=1), 'ob', alpha=.5);
Fitting a line
If $C_i$ is the height of child $i$ and $P_i$ is the height of the average parent, then we can imagine writing the equation for a line
$$C_i = b_0 + b_1 P_i + e_i$$$e_i$ is everything we didn't measure (how much they eat, where they live, do they stretch in the morning...)
We pick a line that minimises:
$$\sum_{i=1}^{928}{(C_i - (b_0 + b_1 P_i))^2}$$We compute this by using the ols
(ordinary least square) function from statsmodels
.
Then we can plot the line the the 'residuals':
from statsmodels.formula.api import ols
lm = ols('child ~ parent', galton).fit()
f, (ax1, ax2) = subplots(ncols=2)
# plot fitted values
ax1.plot(galton['parent'], galton['child'], 'ob')
ax1.plot(galton['parent'], lm.fittedvalues, 'r', linewidth=3);
# plot residuals
ax2.plot(galton['parent'], lm.resid, 'ob')
ax2.plot(galton['parent'], np.repeat(0, len(galton['parent'])), 'r', linewidth=3)
f.tight_layout();