Multiple variable regression

Key ideas:

Regression with multiple covariates
Still using least squares / central limit theorem
Interpretation depends on all variables

In [1]:

# Example: Millennium Development Goal 1; WHO childhood hunger data

import pandas as pd

hunger = pd.read_csv('http://apps.who.int/gho/athena/data/GHO/WHOSIS_000008.csv?profile=text&filter=COUNTRY:;SEX:')

hunger = hunger[hunger['Sex'] != 'Both sexes']
# the last entry is all NaN 
hunger = hunger[hunger['Year'].notnull()]

In [2]:

hunger.head()

Out[2]:

	Indicator	Data Source	Country	Sex	Year	WHO region	Display Value	Numeric	Low	High	Comments
1	Children aged <5 years underweight (%)	NLIS_312819	Afghanistan	Female	2004	Eastern Mediterranean	33.0	33.0	NaN	NaN	NaN
2	Children aged <5 years underweight (%)	NLIS_312819	Afghanistan	Male	2004	Eastern Mediterranean	32.7	32.7	NaN	NaN	NaN
5	Children aged <5 years underweight (%)	NLIS_312361	Albania	Male	2000	Europe	19.6	19.6	NaN	NaN	NaN
6	Children aged <5 years underweight (%)	NLIS_312361	Albania	Female	2000	Europe	14.2	14.2	NaN	NaN	NaN
8	Children aged <5 years underweight (%)	NLIS_312879	Albania	Male	2005	Europe	7.3	7.3	NaN	NaN	NaN

In [3]:

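# Emulates R's abline: draws intercept + gradient * x across the current x-range.
# Only a basic line style (first positional argument) and a linewidth can be set.
from matplotlib.pyplot import gca, plot, scatter  # needed when not running in pylab mode

def abline(intercept, gradient, *args, **kwargs):
    a = gca()
    xlim = a.get_xlim()
    ylim = a.get_ylim()

    # optional style string such as 'r' or 'k'; defaults to red
    sty = args[0] if args else 'r'
    # fall back to the default width when linewidth is not passed
    lw = kwargs.get('linewidth', 5)

    a.plot(xlim, [intercept + gradient * x for x in xlim], sty, linewidth=lw)
    # restore the limits so the new line does not rescale the axes
    a.set_xlim(xlim)
    a.set_ylim(ylim)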

Plot percent hungry versus time, along with a fitted line for the linear model:

$Hu_i = b_0 + b_1Y_i + e_i$

$b_0$ = percent hungry at Year 0

$b_1$ = change in percent hungry per year (negative when hunger is falling)

$e_i$ = everything else we didn't measure
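As a minimal sketch of what least squares does for this one-covariate model, the closed-form estimates can be computed by hand. The four data points below are made up for illustration (they are not WHO values), and the helper name `fit_line` is hypothetical.

```python
# Closed-form least squares for Hu_i = b_0 + b_1 * Y_i + e_i.
# NOTE: years and pct_hungry are invented toy values, not WHO data.

def fit_line(x, y):
    """Return (b0, b1) minimising the sum of squared residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx              # slope: change in percent hungry per year
    b0 = y_bar - b1 * x_bar     # intercept: fitted value at Year 0
    return b0, b1

years = [2000, 2002, 2004, 2006]
pct_hungry = [30.0, 29.0, 28.0, 27.0]

b0, b1 = fit_line(years, pct_hungry)
print(b0, b1)  # 1030.0 -0.5
```

The intercept here is far outside the 0 to 100 range. $b_0$ is the line extrapolated back to Year 0, so it is best read as a parameter of the line rather than as a realistic percentage.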

In [6]:

from statsmodels.formula.api import ols

lm1 = ols('Numeric ~ Year', hunger).fit()

plot(hunger['Year'], hunger['Numeric'], 'ob', alpha=0.6)
plot(hunger['Year'], lm1.fittedvalues, 'grey', linewidth=3);

Colour by male/female, and add two fitted lines:

$HuF_i = bf_0 + bf_1YF_i + ef_i$

$bf_0$ = percent of girls hungry at Year 0

$bf_1$ = change in percent of girls hungry per year

$ef_i$ = everything else we didn't measure

$HuM_i = bm_0 + bm_1YM_i + em_i$

$bm_0$ = percent of boys hungry at Year 0

$bm_1$ = change in percent of boys hungry per year

$em_i$ = everything else we didn't measure

In [7]:

sex_groups = hunger.groupby('Sex')

c = {'Female' : 'r', 'Male' : 'k'}

idM = hunger['Sex'] == 'Male'
idF = hunger['Sex'] == 'Female'

lmM = ols('Numeric ~ Year', hunger[idM]).fit()
lmF = ols('Numeric ~ Year', hunger[idF]).fit()

for sex, df in sex_groups:
    scatter(df['Year'], df['Numeric'], c=c[sex], alpha=.6)

plot(hunger['Year'][idM], lmM.fittedvalues, c['Male'])
plot(hunger['Year'][idF], lmF.fittedvalues, c['Female']);

Two lines, same slope:

$Hu_i = b_0 + b_1 \mathbf{1}(Sex_i = \text{"Male"}) + b_2 Y_i + e^*_i$

$b_0$ = percent hungry at year zero for females

$b_0 + b_1$ = percent hungry at year zero for males

$b_2$ = change in percent hungry (for either males or females) in one year

$e^*_i$ = everything else we didn't measure
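The indicator coding can be sketched in plain Python. The coefficient values below are hypothetical placeholders chosen for illustration, not estimates from the hunger data, and `predict_same_slope` is an invented helper name.

```python
# Same-slope model: Hu = b0 + b1 * 1(Sex == "Male") + b2 * Year.
# NOTE: B0, B1, B2 are hypothetical values, not fitted coefficients.
B0, B1, B2 = 550.0, 2.0, -0.25

def predict_same_slope(year, sex):
    """Fitted percent hungry for a given year and sex."""
    is_male = 1 if sex == "Male" else 0   # the indicator 1(Sex = "Male")
    return B0 + B1 * is_male + B2 * year

# The male-minus-female gap is the same in every year: it equals b1.
gaps = [predict_same_slope(y, "Male") - predict_same_slope(y, "Female")
        for y in (1990, 2000, 2010)]
print(gaps)  # [2.0, 2.0, 2.0]
```

Since the indicator only shifts the intercept, the two fitted lines are parallel, which is exactly the restriction this model imposes.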

In [8]:

lmBoth = ols('Numeric ~ Year + Sex', hunger).fit()

for sex, df in sex_groups:
    scatter(df['Year'], df['Numeric'], c=c[sex], alpha=.6)

abline(lmBoth.params['Intercept'], 
       lmBoth.params['Year'], c['Female'], linewidth=3)
abline(lmBoth.params['Intercept'] + lmBoth.params['Sex[T.Male]'], 
       lmBoth.params['Year'], c['Male'], linewidth=3)

Two lines, different slopes (interactions):

$Hu_i = b_0 + b_1 \mathbf{1}(Sex_i = \text{"Male"}) + b_2 Y_i + b_3 \mathbf{1}(Sex_i = \text{"Male"}) Y_i + e^+_i$

$b_0$ = percent hungry at year zero for females

$b_0 + b_1$ = percent hungry at year zero for males

$b_2$ = change in percent hungry (females) in one year

$b_2 + b_3$ = change in percent hungry (males) in one year

$e^+_i$ = everything else we didn't measure

In [9]:

lmBoth = ols('Numeric ~ Year + Sex + Sex * Year', hunger).fit()

for sex, df in sex_groups:
    scatter(df['Year'], df['Numeric'], c=c[sex], alpha=.6)

abline(lmBoth.params['Intercept'], 
       lmBoth.params['Year'], c['Female'], linewidth=3)
abline(lmBoth.params['Intercept'] + lmBoth.params['Sex[T.Male]'], 
       lmBoth.params['Year'] + lmBoth.params['Sex[T.Male]:Year'], c['Male'], linewidth=3)

In [10]:

lmBoth.summary()

Out[10]:

OLS Regression Results
Dep. Variable:	Numeric	R-squared:	0.023
Model:	OLS	Adj. R-squared:	0.020
Method:	Least Squares	F-statistic:	6.820
Date:	Sat, 06 Apr 2013	Prob (F-statistic):	0.000152
Time:	21:05:16	Log-Likelihood:	-3459.3
No. Observations:	860	AIC:	6927.
Df Residuals:	856	BIC:	6946.
Df Model:	3

	coef	std err	t	P>|t|	[95.0% Conf. Int.]
Intercept	530.9361	191.709	2.769	0.006	154.662 907.210
Sex[T.Male]	59.5787	271.117	0.220	0.826	-472.553 591.710
Year	-0.2569	0.096	-2.681	0.007	-0.445 -0.069
Sex[T.Male]:Year	-0.0288	0.136	-0.213	0.832	-0.295 0.237

Omnibus:	82.976	Durbin-Watson:	0.283
Prob(Omnibus):	0.000	Jarque-Bera (JB):	105.945
Skew:	0.855	Prob(JB):	9.87e-24
Kurtosis:	3.184	Cond. No.	1.54e+06
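To read the table, it may help to combine the coefficients into one line per sex. The sketch below copies the four rounded estimates from the summary above, so the results are approximate; the variable names are chosen here for illustration.

```python
# Coefficients copied (rounded) from the OLS summary above.
intercept = 530.9361        # b0
sex_male = 59.5787          # b1, Sex[T.Male]
year = -0.2569              # b2, Year
sex_male_year = -0.0288     # b3, Sex[T.Male]:Year

# Females are the baseline level; males add the Sex[T.Male] terms.
female_intercept, female_slope = intercept, year
male_intercept = intercept + sex_male
male_slope = year + sex_male_year

def fitted(intercept_, slope_, yr):
    """Point on a fitted line at calendar year yr."""
    return intercept_ + slope_ * yr

print(round(female_slope, 4), round(male_slope, 4))             # -0.2569 -0.2857
print(round(fitted(female_intercept, female_slope, 2005), 2))   # about 15.85
print(round(fitted(male_intercept, male_slope, 2005), 2))       # about 17.69
```

The two slopes differ only by the interaction estimate of -0.0288, whose 95% interval in the table (-0.295 to 0.237) includes zero, so these data do not clearly separate the male and female trends.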

Interactions for continuous variables -- Lots of care/caution needed:

$Hu_i = b_0 + b_1In_i + b_2Y_i + b_3In_iY_i + e^+_i$

$b_0$ = percent hungry at year zero for children whose parents have no income

$b_1$ = change in percent hungry for each dollar of income in year zero

$b_2$ = change in percent hungry in one year for children whose parents have no income

$b_3$ = additional change in percent hungry per year for each extra dollar of income. For example, if income is \$10,000, the change in percent hungry in one year is:

$b_2 + 10^4 \times b_3$

$e^+_i$ = everything else we didn't measure
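The effect of one more year in this model depends on income. A minimal sketch, with hypothetical coefficients (this model is not fitted to the hunger data here) and an invented helper name `yearly_change`:

```python
# Continuous interaction: Hu = b0 + b1*In + b2*Y + b3*In*Y.
# NOTE: B2 and B3 are hypothetical values chosen only for illustration.
B2 = -0.5       # yearly change when income is zero
B3 = -0.00001   # extra yearly change per dollar of income

def yearly_change(income):
    """Change in percent hungry per year at a given income: b2 + income * b3."""
    return B2 + income * B3

for income in (0, 10_000, 50_000):
    print(income, round(yearly_change(income), 3))
```

Because the slope in year shifts with every dollar of income, neither $b_2$ nor $b_3$ alone describes the effect of time, which is one reason continuous interactions need care.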
