Key ideas:
# Example: Millennium Development Goal 1; WHO childhood hunger data
import pandas as pd
hunger = pd.read_csv('http://apps.who.int/gho/athena/data/GHO/WHOSIS_000008.csv?profile=text&filter=COUNTRY:;SEX:')
hunger = hunger[hunger['Sex'] != 'Both sexes']
# the last entry is all NaN
hunger = hunger[hunger['Year'].notnull()]
hunger.head()
(index) | Indicator | Data Source | Country | Sex | Year | WHO region | Display Value | Numeric | Low | High | Comments |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | Children aged <5 years underweight (%) | NLIS_312819 | Afghanistan | Female | 2004 | Eastern Mediterranean | 33.0 | 33.0 | NaN | NaN | NaN |
2 | Children aged <5 years underweight (%) | NLIS_312819 | Afghanistan | Male | 2004 | Eastern Mediterranean | 32.7 | 32.7 | NaN | NaN | NaN |
5 | Children aged <5 years underweight (%) | NLIS_312361 | Albania | Male | 2000 | Europe | 19.6 | 19.6 | NaN | NaN | NaN |
6 | Children aged <5 years underweight (%) | NLIS_312361 | Albania | Female | 2000 | Europe | 14.2 | 14.2 | NaN | NaN | NaN |
8 | Children aged <5 years underweight (%) | NLIS_312879 | Albania | Male | 2005 | Europe | 7.3 | 7.3 | NaN | NaN | NaN |
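# Emulates R's abline function; only a basic line style and width can be set.
from matplotlib.pyplot import gca

def abline(intercept, gradient, *args, **kwargs):
    a = gca()
    # Remember the current limits so the new line does not rescale the axes
    xlim = a.get_xlim()
    ylim = a.get_ylim()
    # Optional first positional argument is a matplotlib style string, e.g. 'r'
    sty = args[0] if args else 'r'
    # Look up linewidth by key so other keyword arguments do not raise a KeyError
    lw = kwargs.get('linewidth', 5)
    a.plot(xlim, [intercept + gradient * x for x in xlim], sty, linewidth=lw)
    a.set_xlim(xlim)
    a.set_ylim(ylim)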
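Plot percent hungry versus time, along with a fitted line for the linear model, where $\text{Numeric}_i$ is the percent hungry and $\text{Year}_i$ the year of observation $i$:
$$\text{Numeric}_i = b_0 + b_1\,\text{Year}_i + e_i$$
$b_0$ = percent hungry at Year 0
$b_1$ = change in percent hungry per year (a decrease when negative)
$e_i$ = everything else we didn't measure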
from statsmodels.formula.api import ols
lm1 = ols('Numeric ~ Year', hunger).fit()
plot(hunger['Year'], hunger['Numeric'], 'ob', alpha=0.6)
plot(hunger['Year'], lm1.fittedvalues, 'grey', linewidth=3);
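To make the meaning of b0 and b1 concrete, here is a minimal sketch that fits the same kind of line with plain NumPy. The data are synthetic (invented for illustration, not the WHO values), and the names `year` and `percent` are hypothetical. It illustrates why the intercept, being the fitted value at Year 0, can land far outside the 0 to 100 range even when every observation is a percentage.

```python
import numpy as np

# Synthetic stand-in for the WHO data; all values are made up.
rng = np.random.default_rng(0)
year = rng.integers(1970, 2012, size=200).astype(float)
percent = 530.0 - 0.26 * year + rng.normal(0.0, 2.0, size=200)

# Design matrix for Numeric ~ Year: a column of ones (b0) and the year (b1).
X = np.column_stack([np.ones_like(year), year])
(b0, b1), *_ = np.linalg.lstsq(X, percent, rcond=None)

# b0 is an extrapolation to Year 0, roughly two thousand years before
# the first observation, so it need not look like a percentage.
# Inside the observed range the fitted line gives sensible values.
fitted_2000 = b0 + b1 * 2000.0
print(f"b0 = {b0:.1f}, b1 = {b1:.3f}, fitted at 2000 = {fitted_2000:.1f}")
```

Only the slope and the fitted values within the observed years should be read as descriptions of the data; the intercept mainly positions the line.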
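Colour by male/female, and add two fitted lines, one per sex:
$$\text{Numeric}_{f,i} = b_{f0} + b_{f1}\,\text{Year}_{f,i} + e_{f,i}$$
$$\text{Numeric}_{m,i} = b_{m0} + b_{m1}\,\text{Year}_{m,i} + e_{m,i}$$
$b_{f0}$ = percent of girls hungry at Year 0
$b_{f1}$ = change in percent of girls hungry per year (a decrease when negative)
$e_{f,i}$ = everything else we didn't measure
$b_{m0}$ = percent of boys hungry at Year 0
$b_{m1}$ = change in percent of boys hungry per year (a decrease when negative)
$e_{m,i}$ = everything else we didn't measure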
sex_groups = hunger.groupby('Sex')
c = {'Female' : 'r', 'Male' : 'k'}
idM = hunger['Sex'] == 'Male'
idF = hunger['Sex'] == 'Female'
lmM = ols('Numeric ~ Year', hunger[idM]).fit()
lmF = ols('Numeric ~ Year', hunger[idF]).fit()
for sex, df in sex_groups:
    scatter(df['Year'], df['Numeric'], c=c[sex], alpha=.6)
plot(hunger['Year'][idM], lmM.fittedvalues, c['Male'])
plot(hunger['Year'][idF], lmF.fittedvalues, c['Female']);
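The two separate fits can be sketched without statsmodels as one straight-line fit per group. This is an illustration on synthetic data, assuming made-up coefficients and hypothetical names (`year`, `is_male`, `percent`, `fits`); it is not a reproduction of the WHO results.

```python
import numpy as np

# Synthetic data: two groups, each with its own intercept and slope.
rng = np.random.default_rng(1)
n = 300
year = rng.integers(1970, 2012, size=n).astype(float)
is_male = rng.integers(0, 2, size=n).astype(bool)
percent = np.where(is_male,
                   600.0 - 0.29 * year,   # made-up line for boys
                   530.0 - 0.26 * year)   # made-up line for girls
percent = percent + rng.normal(0.0, 1.0, size=n)

# Fit each group on its own rows only, as lmM and lmF do above.
fits = {}
for label, mask in (("Female", ~is_male), ("Male", is_male)):
    # np.polyfit returns coefficients from highest degree down: slope, intercept
    slope, intercept = np.polyfit(year[mask], percent[mask], 1)
    fits[label] = (intercept, slope)

for label, (intercept, slope) in fits.items():
    print(f"{label}: intercept = {intercept:.1f}, slope = {slope:.3f}")
```

Because each group is fitted independently, nothing ties the two slopes or the two error terms together.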
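Two lines, same slope:
$$\text{Numeric}_i = b_0 + b_1\,\mathbb{1}(\text{Sex}_i = \text{Male}) + b_2\,\text{Year}_i + e_i^{*}$$
$b_0$ = percent hungry at year zero for females
$b_0 + b_1$ = percent hungry at year zero for males
$b_2$ = change in percent hungry (for either males or females) in one year
$e_i^{*}$ = everything else we didn't measure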
lmBoth = ols('Numeric ~ Year + Sex', hunger).fit()
for sex, df in sex_groups:
    scatter(df['Year'], df['Numeric'], c=c[sex], alpha=.6)
abline(lmBoth.params['Intercept'],
       lmBoth.params['Year'], c['Female'], linewidth=3)
abline(lmBoth.params['Intercept'] + lmBoth.params['Sex[T.Male]'],
       lmBoth.params['Year'], c['Male'], linewidth=3)
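One way to see what the formula 'Numeric ~ Year + Sex' asks for is to build the design matrix by hand: a column of ones, a 0/1 indicator for males, and the year. The sketch below does this with NumPy on synthetic data; the coefficients used to generate the data and the variable names are assumptions for illustration only.

```python
import numpy as np

# Synthetic data with a shared slope and a constant male/female offset.
rng = np.random.default_rng(2)
n = 300
year = rng.integers(1970, 2012, size=n).astype(float)
male = rng.integers(0, 2, size=n).astype(float)   # 1.0 for male, 0.0 for female
percent = 530.0 - 0.26 * year + 1.5 * male + rng.normal(0.0, 1.0, size=n)

# Columns: intercept (b0), male indicator (b1), year (b2)
X = np.column_stack([np.ones(n), male, year])
b0, b1, b2 = np.linalg.lstsq(X, percent, rcond=None)[0]

# Each line is an (intercept, slope) pair; both lines share the slope b2.
female_line = (b0, b2)
male_line = (b0 + b1, b2)
print("female:", female_line)
print("male:  ", male_line)
```

The indicator column only shifts the male line up or down by b1, which is why the two lines come out parallel.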
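Two lines, different slopes (interactions):
$$\text{Numeric}_i = b_0 + b_1\,\mathbb{1}(\text{Sex}_i = \text{Male}) + b_2\,\text{Year}_i + b_3\,\mathbb{1}(\text{Sex}_i = \text{Male}) \times \text{Year}_i + e_i^{+}$$
$b_0$ = percent hungry at year zero for females
$b_0 + b_1$ = percent hungry at year zero for males
$b_2$ = change in percent hungry (females) in one year
$b_2 + b_3$ = change in percent hungry (males) in one year
$e_i^{+}$ = everything else we didn't measure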
lmBoth = ols('Numeric ~ Year + Sex + Sex * Year', hunger).fit()
for sex, df in sex_groups:
    scatter(df['Year'], df['Numeric'], c=c[sex], alpha=.6)
abline(lmBoth.params['Intercept'],
       lmBoth.params['Year'], c['Female'], linewidth=3)
abline(lmBoth.params['Intercept'] + lmBoth.params['Sex[T.Male]'],
       lmBoth.params['Year'] + lmBoth.params['Sex[T.Male]:Year'], c['Male'], linewidth=3)
lmBoth.summary()
Dep. Variable: | Numeric | R-squared: | 0.023 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.020 |
Method: | Least Squares | F-statistic: | 6.820 |
Date: | Sat, 06 Apr 2013 | Prob (F-statistic): | 0.000152 |
Time: | 21:05:16 | Log-Likelihood: | -3459.3 |
No. Observations: | 860 | AIC: | 6927. |
Df Residuals: | 856 | BIC: | 6946. |
Df Model: | 3 |
Term | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
---|---|---|---|---|---|
Intercept | 530.9361 | 191.709 | 2.769 | 0.006 | 154.662 907.210 |
Sex[T.Male] | 59.5787 | 271.117 | 0.220 | 0.826 | -472.553 591.710 |
Year | -0.2569 | 0.096 | -2.681 | 0.007 | -0.445 -0.069 |
Sex[T.Male]:Year | -0.0288 | 0.136 | -0.213 | 0.832 | -0.295 0.237 |
Omnibus: | 82.976 | Durbin-Watson: | 0.283 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 105.945 |
Skew: | 0.855 | Prob(JB): | 9.87e-24 |
Kurtosis: | 3.184 | Cond. No. | 1.54e+06 |
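With a two-level factor, the interaction model has enough parameters to give each sex its own intercept and slope, so its fitted lines should coincide with those from fitting each sex separately; what it adds is a single model in which the differences (b1 and b3) get their own standard errors and p-values, as in the summary above. The sketch below checks the equivalence numerically with NumPy on synthetic data; the data-generating values and names are assumptions for illustration.

```python
import numpy as np

# Synthetic data in which boys differ in both intercept and slope.
rng = np.random.default_rng(3)
n = 300
year = rng.integers(1970, 2012, size=n).astype(float)
is_male = rng.integers(0, 2, size=n).astype(bool)
male = is_male.astype(float)
percent = (530.0 - 0.26 * year
           + male * (60.0 - 0.03 * year)
           + rng.normal(0.0, 1.0, size=n))

# Columns: intercept (b0), male indicator (b1), year (b2), male * year (b3)
X = np.column_stack([np.ones(n), male, year, male * year])
b0, b1, b2, b3 = np.linalg.lstsq(X, percent, rcond=None)[0]

# Male line implied by the interaction model
male_fit_interaction = (b0 + b1) + (b2 + b3) * year[is_male]

# Male line from a regression on the male rows only
slope_m, intercept_m = np.polyfit(year[is_male], percent[is_male], 1)
male_fit_separate = intercept_m + slope_m * year[is_male]

# The two sets of fitted values agree up to floating-point error.
same_line = np.allclose(male_fit_interaction, male_fit_separate, atol=1e-6)
print("interaction model reproduces the separate male fit:", same_line)
```

The comparison is made on fitted values rather than on raw coefficients because the intercept at Year 0 is poorly conditioned numerically when the years are all near 2000.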
Interactions for continuous variables need a lot of care. With income and year as covariates, where $\text{Numeric}_i$ is the percent hungry:
$$\text{Numeric}_i = b_0 + b_1\,\text{Income}_i + b_2\,\text{Year}_i + b_3\,\text{Income}_i \times \text{Year}_i + e_i^{+}$$
$b_0$ = percent hungry at year zero for children whose parents have no income
$b_1$ = change in percent hungry for each dollar of income in year zero
$b_2$ = change in percent hungry in one year for children whose parents have no income
$b_3$ = additional change in percent hungry per year for each dollar of income; e.g. if income is \$10,000, then the change in percent hungry in one year will be $b_2 + 10^4 \times b_3$
$e_i^{+}$ = everything else we didn't measure
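The arithmetic in the income example can be written as a small helper. The coefficient values below are hypothetical, chosen only to make the numbers easy to follow; they are not estimates from any data, and `yearly_change` is an illustrative name.

```python
def yearly_change(b2, b3, income):
    """Change in percent hungry over one year at a given income,
    under a model with an income-by-year interaction."""
    return b2 + income * b3

# Hypothetical coefficients for illustration only
b2 = -0.25    # yearly change when income is zero
b3 = -1e-5    # extra yearly change per dollar of income

change_at_zero = yearly_change(b2, b3, 0)
change_at_10k = yearly_change(b2, b3, 10_000)
print(change_at_zero, change_at_10k)
```

The point of the caution is visible here: with a continuous interaction there is no single "effect of year", only an effect of year at a stated income, and likewise for income at a stated year.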