Basic least squares¶

In [1]:

import pandas as pd
import numpy as np
import pandas.rpy.common as com

To do this exercise, let's load the galton data from R:

In [2]:

galton = com.load_data('galton', package='UsingR')
galton.head()

Out[2]:

	child	parent
1	61.7	70.5
2	61.7	68.5
3	61.7	65.5
4	61.7	64.5
5	61.7	64.0

Plot the histograms of the child and parent heights:

In [3]:

f, (ax1, ax2) = subplots(ncols=2)

ax1.hist(galton['child'], bins=100)
ax2.hist(galton['parent'], bins=100)

f.tight_layout();

If we only know the child data, the average is a good estimate, because if $C_i$ is the height of child $i$ then the average is the value of $\mu$ that minimizes:

$$\sum_{i=1}^{928}{(C_i - \mu)}^2$$

Add a line to show the mean of the child height:

In [4]:

f, ax = subplots(ncols=1)

ax.hist(galton['child'], bins=100)

meanChild = galton['child'].mean()

ax.plot(np.repeat(meanChild, 100), np.linspace(0, 150, num=100), 'r', linewidth=5);

If we plot child vs parent height:

In [5]:

plot(galton['parent'], galton['child'], 'ob');

Some measurement points are stack up one after another, we can't see all of them in the plot. What we can do is to add some jitter:

In [6]:

def jitter(series, factor):
    z = float(series.max()) - float(series.min())
    a = float(factor) * z / 50.
    return series.apply(lambda x: x + np.random.uniform(-a, a))
   
np.random.seed(1234)
plot(jitter(galton['parent'], factor=1), jitter(galton['child'], factor=1), 'ob', alpha=.5);

Fitting a line

If $C_i$ is the height of child $i$ and $P_i$ is the height of the average parent, then we can imagine writing the equation for a line

$$C_i = b_0 + b_1 P_i + e_i$$

$e_i$ is everything we didn't measure (how much they eat, where they live, do they stretch in the morning...)

We pick a line that minimises:

$$\sum_{i=1}^{928}{(C_i - (b_0 + b_1 P_i))^2}$$

We compute this by using the ols (ordinary least square) function from statsmodels.

Then we can plot the line the the 'residuals':

In [7]:

from statsmodels.formula.api import ols

lm = ols('child ~ parent', galton).fit()

f, (ax1, ax2) = subplots(ncols=2)

# plot fitted values
ax1.plot(galton['parent'], galton['child'], 'ob')
ax1.plot(galton['parent'], lm.fittedvalues, 'r', linewidth=3);

# plot residuals
ax2.plot(galton['parent'], lm.resid, 'ob')
ax2.plot(galton['parent'], np.repeat(0, len(galton['parent'])), 'r', linewidth=3)

f.tight_layout();

In [ ]: