The ideal data for regression:
# pandas.rpy requires rpy2 and the R package UsingR
import pandas.rpy.common as com
galton = com.load_data('galton', package='UsingR')
plot(galton['parent'], galton['child'], 'ob');
Confounder: a variable that is correlated with both the outcome and the covariates.
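A minimal simulation (with hypothetical variables, not the data below) makes the danger concrete: if `z` drives both `x` and `y`, then `x` and `y` correlate even though neither affects the other, and the association disappears once `z` is adjusted for.

```python
import numpy as np

np.random.seed(0)
n = 2000
z = np.random.randn(n)           # confounder
x = z + np.random.randn(n)       # covariate, driven by z
y = z + np.random.randn(n)       # outcome, driven by z (no direct x effect)

# raw correlation between x and y is strong...
raw = np.corrcoef(x, y)[0, 1]

# ...but correlating the residuals after regressing both on z removes it
bx = np.polyfit(z, x, 1)
by = np.polyfit(z, y, 1)
adj = np.corrcoef(x - np.polyval(bx, z), y - np.polyval(by, z))[0, 1]

print(raw, adj)  # raw is near 0.5, adj is near 0
```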
import pandas as pd
hunger = pd.read_csv('http://apps.who.int/gho/athena/data/GHO/WHOSIS_000008.csv?profile=text&filter=COUNTRY:;SEX:')
hunger = hunger[hunger['Sex'] != 'Both sexes']
# the last entry is all NaN
hunger = hunger[hunger['Year'].notnull()]
hunger.columns = map(lambda x: x.replace(' ', '_').lower(), hunger.columns)
hunger.head()
| | indicator | data_source | country | sex | year | who_region | display_value | numeric | low | high | comments |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Children aged <5 years underweight (%) | NLIS_312819 | Afghanistan | Female | 2004 | Eastern Mediterranean | 33.0 | 33.0 | NaN | NaN | NaN |
| 2 | Children aged <5 years underweight (%) | NLIS_312819 | Afghanistan | Male | 2004 | Eastern Mediterranean | 32.7 | 32.7 | NaN | NaN | NaN |
| 5 | Children aged <5 years underweight (%) | NLIS_312361 | Albania | Male | 2000 | Europe | 19.6 | 19.6 | NaN | NaN | NaN |
| 6 | Children aged <5 years underweight (%) | NLIS_312361 | Albania | Female | 2000 | Europe | 14.2 | 14.2 | NaN | NaN | NaN |
| 8 | Children aged <5 years underweight (%) | NLIS_312879 | Albania | Male | 2005 | Europe | 7.3 | 7.3 | NaN | NaN | NaN |
# emulates R's abline; only basic line style and width can be set
def abline(intercept, gradient, *args, **kwargs):
    a = gca()
    xlim = a.get_xlim()
    ylim = a.get_ylim()
    sty = args[0] if args else 'r'
    lw = kwargs.get('linewidth', 5)
    a.plot(xlim, [intercept + gradient * x for x in xlim], sty, linewidth=lw)
    # restore the limits so the added line does not rescale the axes
    a.set_xlim(xlim)
    a.set_ylim(ylim)
Hunger over time by region:
regions = hunger.groupby('who_region')
colors = {'Africa': 'k',
'Americas': 'r',
'Eastern Mediterranean': 'g',
'Europe': 'b',
'South-East Asia': 'c',
'WHO Non Members': 'y',
'Western Pacific': 'm'}
f, (ax1, ax2) = subplots(ncols=2)
f.set_size_inches(8, 4)
for g, v in regions:
ax1.scatter(v['year'], v['numeric'], c=colors[g])
ax2.scatter(0, 0, c=colors[g], label=g)
ax1.set_xlabel('Year')
ax1.set_ylabel('Numeric')
ax2.set_xticklabels('')
ax2.set_yticklabels('')
ax2.legend(loc='center')
f.tight_layout();
Region is correlated with year:
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
anova_lm(ols('year ~ who_region', hunger).fit())
| | df | sum_sq | mean_sq | F | PR(>F) |
|---|---|---|---|---|---|
| who_region | 6 | 622.888582 | 103.814764 | 2.252376 | 0.036561 |
| Residual | 853 | 39315.813743 | 46.091224 | NaN | NaN |
Region is correlated with hunger:
anova_lm(ols('numeric ~ who_region', hunger).fit())
| | df | sum_sq | mean_sq | F | PR(>F) |
|---|---|---|---|---|---|
| who_region | 6 | 76522.267246 | 12753.711208 | 129.186946 | 3.843931e-116 |
| Residual | 853 | 84210.642010 | 98.722910 | NaN | NaN |
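As a sanity check, the F statistic in these ANOVA tables is just the between-group mean square divided by the residual mean square:

```python
# F statistics recomputed from the two ANOVA tables above
F_year = 103.814764 / 46.091224      # year ~ who_region
F_hunger = 12753.711208 / 98.722910  # numeric ~ who_region

print(round(F_year, 2), round(F_hunger, 2))  # ~2.25 and ~129.19, matching the F column
```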
Including region - a complicated interaction:
lmRegion = ols('numeric ~ year + who_region + year * who_region', hunger).fit()
#lmRegion.summary()
for g, v in regions:
scatter(v['year'], v['numeric'], c=colors[g])
intercept = lmRegion.params['Intercept']
se_asia = lmRegion.params['who_region[T.South-East Asia]']
year = lmRegion.params['year']
year_se_asia = lmRegion.params['year:who_region[T.South-East Asia]']
abline(intercept + se_asia, year + year_se_asia, 'c', linewidth=3)
incomeData = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', header=None)
# last row is all NaNs
incomeData = incomeData.drop(32561)
incomeData.head()
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
Logs to address right-skew
income = incomeData.iloc[:, 2]
age = incomeData.iloc[:, 0]
f, (ax1, ax2, ax3, ax4) = subplots(ncols=4)
# original age vs income data
ax1.scatter(age, income, marker='.')
# original income data
ax2.hist(income, range=(income.min(), income.max()), bins=100)
# log-transformed income data
log_income = np.log1p(income)
ax3.hist(log_income, range=(log_income.min(), log_income.max()), bins=200)
# log-transformed age vs income data
ax4.scatter(age, log_income, marker='.')
f.set_size_inches(10, 5)
f.tight_layout();
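Why the log helps: a right-skewed variable (simulated here as log-normal, an assumption rather than the census data above) has a large positive sample skewness, which the log transform removes.

```python
import numpy as np

np.random.seed(1)
x = np.random.lognormal(mean=10, sigma=1, size=10000)  # right-skewed, income-like

def skewness(v):
    # standardized third central moment
    d = v - v.mean()
    return (d ** 3).mean() / v.std() ** 3

raw_skew = skewness(x)
log_skew = skewness(np.log(x))
print(raw_skew, log_skew)  # raw is large and positive, log is near zero
```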
Example - extreme points
import numpy as np
np.random.seed(34568)
xVals = np.random.standard_cauchy(size=50)
hist(xVals);
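With Cauchy draws the sample mean never settles down, but the sample median does. A quick check (fresh draws, not the 50 plotted above):

```python
import numpy as np

np.random.seed(2)
x = np.random.standard_cauchy(size=10000)

print(np.median(x))       # near 0: the median is a stable location estimate
print(np.abs(x).max())    # the extremes can be enormous
```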
Example - outliers may be real
# add Tim Cook's 2011 income as CEO of Apple
age = np.append(np.array(age), 52)
income = np.append(np.array(income), 378e6)
scatter(age, income, marker='.')
xlabel('age')
ylabel('income');
Outliers - what to do:
from statsmodels.formula.api import rlm
# robust linear model
?rlm
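`rlm` fits by iteratively downweighting high-residual points. A numpy-only sketch (synthetic data, not the income data) of why ordinary least squares needs that protection: a single wild point drags the OLS slope far from the truth.

```python
import numpy as np

x = np.arange(10.0)
y = 2.0 * x                       # exact line with slope 2

slope_clean = np.polyfit(x, y, 1)[0]

# append one wild point and refit
x_out = np.append(x, 10.0)
y_out = np.append(y, 100.0)
slope_out = np.polyfit(x_out, y_out, 1)[0]

print(slope_clean, slope_out)  # 2.0 versus roughly 5.6
```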
# http://en.wikipedia.org/wiki/File:Anscombe%27s_quartet_3.svg
bupaData = pd.read_csv('ftp://ftp.ics.uci.edu/pub/machine-learning-databases/liver-disorders/bupa.data', header=None)
ggt = bupaData.iloc[:, 4]
aat = bupaData.iloc[:, 2]
plot(np.log(ggt), aat, 'ob')
xlabel('log(ggt)')
ylabel('aat');
Plot the residuals
lm1 = ols('aat ~ np.log(ggt)', pd.DataFrame({'aat' : aat, 'ggt' : ggt})).fit()
plot(np.log(ggt), lm1.resid, 'ob')
xlabel('log(ggt)')
ylabel('lm1.resid');
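One thing a residual plot reveals that a fit summary alone will not: heteroskedasticity, where the residual spread grows with the predictor. A small simulation (assumed data, not the bupa set):

```python
import numpy as np

np.random.seed(3)
x = np.linspace(1, 10, 1000)
y = 3 * x + np.random.randn(1000) * x   # noise sd grows with x

b = np.polyfit(x, y, 1)
resid = y - np.polyval(b, x)

# residual spread in the left half versus the right half of x
lo = resid[x < 5.5].std()
hi = resid[x >= 5.5].std()
print(lo, hi)  # hi is clearly larger: the residuals fan out
```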
Other things: