The ideal data for regression:
# pandas.rpy requires rpy2 and the R package UsingR
import pandas.rpy.common as com
galton = com.load_data('galton', package='UsingR')
plot(galton['parent'], galton['child'], 'ob');
Confounder: a variable that is correlated with both the outcome and the covariates.
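A minimal simulation (with hypothetical variables, not the data below) makes the danger concrete: if `z` drives both `x` and `y`, then `x` and `y` correlate even though neither affects the other, and the association disappears once `z` is adjusted for.

```python
import numpy as np

np.random.seed(0)
n = 2000
z = np.random.randn(n)           # confounder
x = z + np.random.randn(n)       # covariate, driven by z
y = z + np.random.randn(n)       # outcome, driven by z (no direct x effect)

# raw correlation between x and y is strong...
raw = np.corrcoef(x, y)[0, 1]

# ...but correlating the residuals after regressing both on z removes it
bx = np.polyfit(z, x, 1)
by = np.polyfit(z, y, 1)
adj = np.corrcoef(x - np.polyval(bx, z), y - np.polyval(by, z))[0, 1]

print(raw, adj)  # raw is near 0.5, adj is near 0
```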
import pandas as pd
hunger = pd.read_csv('http://apps.who.int/gho/athena/data/GHO/WHOSIS_000008.csv?profile=text&filter=COUNTRY:;SEX:')
hunger = hunger[hunger['Sex'] != 'Both sexes']
# the last entry is all NaN
hunger = hunger[hunger['Year'].notnull()]
hunger.columns = map(lambda x: x.replace(' ', '_').lower(), hunger.columns)
hunger.head()
| | indicator | data_source | country | sex | year | who_region | display_value | numeric | low | high | comments |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Children aged <5 years underweight (%) | NLIS_312819 | Afghanistan | Female | 2004 | Eastern Mediterranean | 33.0 | 33.0 | NaN | NaN | NaN |
| 2 | Children aged <5 years underweight (%) | NLIS_312819 | Afghanistan | Male | 2004 | Eastern Mediterranean | 32.7 | 32.7 | NaN | NaN | NaN |
| 5 | Children aged <5 years underweight (%) | NLIS_312361 | Albania | Male | 2000 | Europe | 19.6 | 19.6 | NaN | NaN | NaN |
| 6 | Children aged <5 years underweight (%) | NLIS_312361 | Albania | Female | 2000 | Europe | 14.2 | 14.2 | NaN | NaN | NaN |
| 8 | Children aged <5 years underweight (%) | NLIS_312879 | Albania | Male | 2005 | Europe | 7.3 | 7.3 | NaN | NaN | NaN |
# emulates R's abline; only basic line style and width can be set
def abline(intercept, gradient, *args, **kwargs):
    a = gca()
    xlim = a.get_xlim()
    ylim = a.get_ylim()
    sty = args[0] if args else 'r'
    lw = kwargs.get('linewidth', 5)
    a.plot(xlim, [intercept + gradient * x for x in xlim], sty, linewidth=lw)
    # restore the limits so the added line does not rescale the axes
    a.set_xlim(xlim)
    a.set_ylim(ylim)
Hunger over time by region:
regions = hunger.groupby('who_region')
colors = {'Africa': 'k',
'Americas': 'r',
'Eastern Mediterranean': 'g',
'Europe': 'b',
'South-East Asia': 'c',
'WHO Non Members': 'y',
'Western Pacific': 'm'}
f, (ax1, ax2) = subplots(ncols=2)
f.set_size_inches(8, 4)
for g, v in regions:
ax1.scatter(v['year'], v['numeric'], c=colors[g])
ax2.scatter(0, 0, c=colors[g], label=g)
ax1.set_xlabel('Year')
ax1.set_ylabel('Numeric')
ax2.set_xticklabels('')
ax2.set_yticklabels('')
ax2.legend(loc='center')
f.tight_layout();
Region is correlated with year:
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
anova_lm(ols('year ~ who_region', hunger).fit())
| | df | sum_sq | mean_sq | F | PR(>F) |
|---|---|---|---|---|---|
| who_region | 6 | 622.888582 | 103.814764 | 2.252376 | 0.036561 |
| Residual | 853 | 39315.813743 | 46.091224 | NaN | NaN |
Region is correlated with hunger:
anova_lm(ols('numeric ~ who_region', hunger).fit())
| | df | sum_sq | mean_sq | F | PR(>F) |
|---|---|---|---|---|---|
| who_region | 6 | 76522.267246 | 12753.711208 | 129.186946 | 3.843931e-116 |
| Residual | 853 | 84210.642010 | 98.722910 | NaN | NaN |
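As a sanity check, the F statistic in these ANOVA tables is just the between-group mean square divided by the residual mean square:

```python
# F statistics recomputed from the two ANOVA tables above
F_year = 103.814764 / 46.091224      # year ~ who_region
F_hunger = 12753.711208 / 98.722910  # numeric ~ who_region

print(round(F_year, 2), round(F_hunger, 2))  # ~2.25 and ~129.19, matching the F column
```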
Including region - a complicated interaction:
lmRegion = ols('numeric ~ year + who_region + year * who_region', hunger).fit()
#lmRegion.summary()
for g, v in regions:
scatter(v['year'], v['numeric'], c=colors[g])
intercept = lmRegion.params['Intercept']
se_asia = lmRegion.params['who_region[T.South-East Asia]']
year = lmRegion.params['year']
year_se_asia = lmRegion.params['year:who_region[T.South-East Asia]']
abline(intercept + se_asia, year + year_se_asia, 'c', linewidth=3)
incomeData = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', header=None)
# last row is all NaNs
incomeData = incomeData.drop(32561)
incomeData.head()
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
Logs to address right-skew
income = incomeData.iloc[:, 2]
age = incomeData.iloc[:, 0]
f, (ax1, ax2, ax3, ax4) = subplots(ncols=4)
# original age vs income data
ax1.scatter(age, income, marker='.')
# original income data
ax2.hist(income, range=(income.min(), income.max()), bins=100)
# log-transformed income data
log_income = np.log1p(income)
ax3.hist(log_income, range=(log_income.min(), log_income.max()), bins=200)
# log-transformed age vs income data
ax4.scatter(age, log_income, marker='.')
f.set_size_inches(10, 5)
f.tight_layout();
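Why the log helps: a right-skewed variable (simulated here as log-normal, an assumption rather than the census data above) has a large positive sample skewness, which the log transform removes.

```python
import numpy as np

np.random.seed(1)
x = np.random.lognormal(mean=10, sigma=1, size=10000)  # right-skewed, income-like

def skewness(v):
    # standardized third central moment
    d = v - v.mean()
    return (d ** 3).mean() / v.std() ** 3

raw_skew = skewness(x)
log_skew = skewness(np.log(x))
print(raw_skew, log_skew)  # raw is large and positive, log is near zero
```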
Example - extreme points
import numpy as np
np.random.seed(34568)
xVals = np.random.standard_cauchy(size=50)
hist(xVals);
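With Cauchy draws the sample mean never settles down, but the sample median does. A quick check (fresh draws, not the 50 plotted above):

```python
import numpy as np

np.random.seed(2)
x = np.random.standard_cauchy(size=10000)

print(np.median(x))       # near 0: the median is a stable location estimate
print(np.abs(x).max())    # the extremes can be enormous
```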
Example - outliers may be real
# add Tim Cook's 2011 income as CEO of Apple
age = np.append(np.array(age), 52)
income = np.append(np.array(income), 378e6)
scatter(age, income, marker='.')
xlabel('age')
ylabel('income');
Outliers - what to do:
from statsmodels.formula.api import rlm
# robust linear model
?rlm
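`rlm` fits by iteratively downweighting high-residual points. A numpy-only sketch (synthetic data, not the income data) of why ordinary least squares needs that protection: a single wild point drags the OLS slope far from the truth.

```python
import numpy as np

x = np.arange(10.0)
y = 2.0 * x                       # exact line with slope 2

slope_clean = np.polyfit(x, y, 1)[0]

# append one wild point and refit
x_out = np.append(x, 10.0)
y_out = np.append(y, 100.0)
slope_out = np.polyfit(x_out, y_out, 1)[0]

print(slope_clean, slope_out)  # 2.0 versus roughly 5.6
```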
# http://en.wikipedia.org/wiki/File:Anscombe%27s_quartet_3.svg
bupaData = pd.read_csv('ftp://ftp.ics.uci.edu/pub/machine-learning-databases/liver-disorders/bupa.data', header=None)
ggt = bupaData.iloc[:, 4]
aat = bupaData.iloc[:, 2]
plot(np.log(ggt), aat, 'ob')
xlabel('log(ggt)')
ylabel('aat');
Plot the residuals
lm1 = ols('aat ~ np.log(ggt)', pd.DataFrame({'aat' : aat, 'ggt' : ggt})).fit()
plot(np.log(ggt), lm1.resid, 'ob')
xlabel('log(ggt)')
ylabel('lm1.resid');
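One thing a residual plot reveals that a fit summary alone will not: heteroskedasticity, where the residual spread grows with the predictor. A small simulation (assumed data, not the bupa set):

```python
import numpy as np

np.random.seed(3)
x = np.linspace(1, 10, 1000)
y = 3 * x + np.random.randn(1000) * x   # noise sd grows with x

b = np.polyfit(x, y, 1)
resid = y - np.polyval(b, x)

# residual spread in the left half versus the right half of x
lo = resid[x < 5.5].std()
hi = resid[x >= 5.5].std()
print(lo, hi)  # hi is clearly larger: the residuals fan out
```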
Other things: