import pandas as pd
import numpy as np
import statsmodels.api as sm
import patsy
%pylab inline
Populating the interactive namespace from numpy and matplotlib
The New York Times recently published a piece entitled "The Hardest Places to Live in the US" with a ranking of counties in the US by quality of life. The original piece can be found here.
Their data set includes six factors: educational attainment, household income, jobless rate, disability rate, life expectancy and obesity rate, and is available as a separate download.
I hope to play with this data more, but here's the first interesting observation.
df = pd.read_table("data/unemployment.tsv")
scatter(df.education, df.income)
ylabel("Median Income")
xlabel("% population with bachelor's degree")
There's a clear linear relationship here, which we can quantify with an ordinary least squares fit in statsmodels:
y, X = patsy.dmatrices("income ~ education", df)
income_edu_model = sm.OLS(y, X).fit()
income_edu_model.summary()
Dep. Variable: | income | R-squared: | 0.482 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.482 |
Method: | Least Squares | F-statistic: | 2899. |
Date: | Sun, 29 Jun 2014 | Prob (F-statistic): | 0.00 |
Time: | 17:38:12 | Log-Likelihood: | -32562. |
No. Observations: | 3112 | AIC: | 6.513e+04 |
Df Residuals: | 3110 | BIC: | 6.514e+04 |
Df Model: | 1 |
term | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
---|---|---|---|---|---|
Intercept | 2.729e+04 | 370.307 | 73.700 | 0.000 | 2.66e+04 2.8e+04 |
education | 935.1122 | 17.367 | 53.843 | 0.000 | 901.059 969.165 |
Omnibus: | 185.316 | Durbin-Watson: | 1.587 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 513.123 |
Skew: | 0.307 | Prob(JB): | 3.77e-112 |
Kurtosis: | 4.892 | Cond. No. | 52.1 |
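To make the two coefficients concrete, here is a minimal sketch that turns them into a prediction by hand. The constants are copied from the summary above (the intercept is only shown to four significant figures, so results are approximate), the function names are made up for this illustration, and the Whitman County figures come from the sorted table in this post.

```python
# Coefficients copied from the OLS summary (intercept is rounded in the output).
INTERCEPT = 27290.0   # predicted median income at 0% bachelor's degrees
SLOPE = 935.1122      # dollars of median income per percentage point


def predicted_income(education_pct):
    """Median income the fitted line predicts for a given education level."""
    return INTERCEPT + SLOPE * education_pct


def raw_residual(education_pct, actual_income):
    """Actual minus predicted income; negative means below the fitted line."""
    return actual_income - predicted_income(education_pct)


# Whitman, Washington: 48.8% with a bachelor's degree, median income 34169
print(round(predicted_income(48.8)))
print(round(raw_residual(48.8, 34169)))
```

For Whitman County the line predicts a median income of roughly $72,900, about $38,800 more than the county actually reports.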
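# norm_resid() is gone from current statsmodels; resid_pearson holds the residuals scaled to unit variance
df['income_edu_resid'] = income_edu_model.resid_pearson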
# DataFrame.sort() was removed in pandas 0.20; sort_values() replaces it
df.sort_values('income_edu_resid')
index | County | id | rank | education | income | unemployment | disability | life | obesity | income_edu_resid |
---|---|---|---|---|---|---|---|---|---|---|
2960 | Whitman, Washington | 53075 | 591 | 48.8 | 34169 | 6.3 | 0.5 | 79.9 | 36 | -4.573856 |
1422 | Oktibbeha, Mississippi | 28105 | 1988 | 43.1 | 29430 | 9.2 | 1.9 | 76.7 | 38 | -4.504092 |
2902 | Lexington City, Virginia | 51678 | 1541 | 46.9 | 36511 | 11.4 | 1.1 | 78.4 | 37 | -4.087782 |
385 | Clarke, Georgia | 13059 | 1410 | 40.8 | 33846 | 7.0 | 1.5 | 77.9 | 36 | -3.729109 |
3089 | Albany, Wyoming | 56001 | 150 | 48.1 | 42882 | 4.4 | 0.4 | 79.7 | 29 | -3.468331 |
718 | Monroe, Indiana | 18105 | 555 | 43.3 | 38675 | 6.9 | 0.7 | 79.4 | 32 | -3.435105 |
1953 | Watauga, North Carolina | 37189 | 758 | 38.4 | 34848 | 8.3 | 0.6 | 79.6 | 31 | -3.345997 |
602 | Jackson, Illinois | 17077 | 1656 | 36.0 | 32819 | 7.6 | 1.5 | 77.3 | 38 | -3.320592 |
548 | Latah, Idaho | 16057 | 469 | 42.9 | 39466 | 6.4 | 0.8 | 80.4 | 32 | -3.297611 |
2888 | Charlottesville City, Virginia | 51540 | 675 | 48.1 | 44535 | 5.9 | 1.2 | 77.1 | 25 | -3.273250 |
2343 | Clay, South Dakota | 46027 | 397 | 41.3 | 38377 | 4.0 | 0.5 | 80.0 | 36 | -3.249557 |
1401 | Jefferson, Mississippi | 28063 | 2926 | 20.9 | 20281 | 14.4 | 5.7 | 72.8 | 50 | -3.133867 |
937 | Riley, Kansas | 20161 | 143 | 45.4 | 43364 | 4.5 | 0.4 | 80.3 | 33 | -3.113480 |
2512 | Brazos, Texas | 48041 | 732 | 38.7 | 37638 | 5.5 | 1.0 | 79.5 | 36 | -3.049840 |
240 | Gunnison, Colorado | 8051 | 77 | 51.9 | 50091 | 6.6 | 0.3 | 83.0 | 25 | -3.036915 |
2913 | Radford City, Virginia | 51750 | 1424 | 29.5 | 29757 | 7.7 | 0.9 | 76.7 | 33 | -2.964628 |
879 | Douglas, Kansas | 20045 | 157 | 48.4 | 48395 | 5.3 | 0.7 | 79.7 | 32 | -2.850816 |
2178 | Benton, Oregon | 41003 | 159 | 48.6 | 48635 | 6.1 | 0.7 | 81.2 | 32 | -2.844564 |
2159 | Payne, Oklahoma | 40119 | 1046 | 35.9 | 36762 | 4.8 | 1.0 | 77.4 | 36 | -2.844218 |
1645 | Dawes, Nebraska | 31045 | 447 | 36.1 | 36974 | 3.9 | 0.4 | 79.4 | 36 | -2.841271 |
1461 | Boone, Missouri | 29019 | 249 | 47.3 | 47786 | 4.6 | 1.0 | 79.1 | 32 | -2.801293 |
2900 | Harrisonburg City, Virginia | 51660 | 991 | 35.6 | 36853 | 6.8 | 0.7 | 78.1 | 35 | -2.800372 |
1926 | Orange, North Carolina | 37135 | 93 | 55.2 | 55241 | 6.2 | 0.7 | 80.2 | 24 | -2.793314 |
1851 | Tompkins, New York | 36109 | 170 | 49.9 | 50539 | 6.0 | 0.9 | 80.5 | 31 | -2.763327 |
552 | Madison, Idaho | 16065 | 584 | 31.9 | 33776 | 5.5 | 0.3 | 79.5 | 35 | -2.755181 |
1112 | Lincoln, Louisiana | 22061 | 2130 | 33.6 | 35433 | 8.0 | 2.0 | 75.9 | 40 | -2.747238 |
290 | Alachua, Florida | 12001 | 838 | 41.2 | 42818 | 6.6 | 1.4 | 78.1 | 31 | -2.714412 |
2921 | Williamsburg City, Virginia | 51830 | 516 | 49.5 | 50865 | 13.4 | 0.6 | 80.8 | 34 | -2.680710 |
618 | McDonough, Illinois | 17109 | 1249 | 32.9 | 35812 | 7.5 | 1.1 | 78.3 | 36 | -2.625259 |
842 | Story, Iowa | 19169 | 67 | 47.7 | 49683 | 3.9 | 0.4 | 80.9 | 35 | -2.621560 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1174 | Howard, Maryland | 24027 | 9 | 59.5 | 107821 | 5.0 | 0.4 | 81.7 | 30 | 2.937432 |
2911 | Poquoson City, Virginia | 51735 | 52 | 35.1 | 85033 | 5.4 | 0.3 | 80.5 | 34 | 2.940823 |
2873 | Spotsylvania, Virginia | 51177 | 264 | 29.0 | 79402 | 5.0 | 0.6 | 78.6 | 37 | 2.949460 |
1163 | Anne Arundel, Maryland | 24003 | 180 | 36.8 | 86987 | 6.1 | 0.7 | 79.0 | 35 | 2.983818 |
234 | Elbert, Colorado | 8039 | 91 | 30.1 | 80811 | 7.2 | 0.2 | 79.8 | 26 | 2.994351 |
2858 | Powhatan, Virginia | 51145 | 367 | 25.2 | 76495 | 5.4 | 0.5 | 78.9 | 38 | 3.025749 |
3106 | Sublette, Wyoming | 56035 | 14 | 28.5 | 79776 | 3.7 | 0.3 | 80.0 | 29 | 3.048777 |
1167 | Carroll, Maryland | 24013 | 221 | 31.9 | 83155 | 6.2 | 0.5 | 79.2 | 36 | 3.072336 |
2837 | King George, Virginia | 51099 | 427 | 30.6 | 82195 | 7.0 | 0.6 | 78.0 | 36 | 3.102506 |
2895 | Falls Church City, Virginia | 51610 | 18 | 72.8 | 122844 | 6.8 | 0.1 | 82.0 | 19 | 3.242623 |
2817 | Fairfax, Virginia | 51059 | 3 | 58.2 | 109383 | 4.2 | 0.3 | 83.1 | 26 | 3.265239 |
1719 | Elko, Nevada | 32007 | 768 | 15.8 | 70411 | 5.9 | 0.4 | 78.1 | 39 | 3.345107 |
1761 | Sussex, New Jersey | 34037 | 381 | 31.6 | 85507 | 9.1 | 0.6 | 79.0 | 33 | 3.383017 |
3107 | Sweetwater, Wyoming | 56037 | 362 | 17.0 | 72139 | 4.6 | 0.5 | 78.4 | 35 | 3.416609 |
1178 | Queen Annes, Maryland | 24035 | 149 | 31.2 | 86013 | 6.2 | 0.4 | 79.7 | 36 | 3.486876 |
1848 | Suffolk, New York | 36103 | 208 | 32.6 | 87778 | 7.6 | 0.7 | 80.2 | 33 | 3.540673 |
1826 | Nassau, New York | 36059 | 63 | 41.4 | 97049 | 7.1 | 0.5 | 81.6 | 30 | 3.663648 |
1723 | Lander, Nevada | 32015 | 1187 | 12.8 | 70341 | 5.3 | 0.3 | 77.2 | 41 | 3.667921 |
1179 | St. Marys, Maryland | 24037 | 461 | 28.4 | 85032 | 5.9 | 0.8 | 78.3 | 38 | 3.680106 |
2818 | Fauquier, Virginia | 51061 | 136 | 32.0 | 88687 | 4.7 | 0.4 | 78.6 | 34 | 3.714165 |
1836 | Putnam, New York | 36079 | 66 | 38.8 | 95259 | 6.7 | 0.4 | 80.6 | 33 | 3.739330 |
2527 | Chambers, Texas | 48071 | 1109 | 16.8 | 75200 | 7.7 | 0.9 | 77.8 | 37 | 3.799928 |
3091 | Campbell, Wyoming | 56005 | 763 | 18.2 | 77090 | 4.3 | 0.3 | 76.7 | 39 | 3.868476 |
2578 | Glasscock, Texas | 48173 | 418 | 17.5 | 76563 | 4.3 | 0.0 | 77.8 | 37 | 3.883533 |
1752 | Hunterdon, New Jersey | 34019 | 37 | 48.1 | 105880 | 7.1 | 0.3 | 81.4 | 26 | 3.966447 |
2861 | Prince William, Virginia | 51153 | 42 | 37.7 | 96160 | 4.9 | 0.4 | 80.5 | 35 | 3.967057 |
2874 | Stafford, Virginia | 51179 | 99 | 35.5 | 96355 | 4.9 | 0.4 | 79.4 | 36 | 4.232858 |
1165 | Calvert, Maryland | 24009 | 296 | 29.5 | 92395 | 5.7 | 0.6 | 78.8 | 37 | 4.427664 |
2841 | Loudoun, Virginia | 51107 | 4 | 57.9 | 122068 | 4.2 | 0.2 | 82.6 | 29 | 4.795381 |
1169 | Charles, Maryland | 24017 | 704 | 26.6 | 93063 | 6.0 | 0.8 | 77.9 | 40 | 4.826538 |
3112 rows × 10 columns
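The residual column is easier to read once you know how it is scaled. The sketch below assumes the normalization is the usual one for OLS: each raw residual divided by the square root of the estimated residual variance. That is consistent with the table, where Whitman County's shortfall of roughly $38,800 maps to -4.57, implying a residual standard error of about $8,500. The helper name and the toy data are hypothetical, not part of the NYT data set.

```python
import math


def normalized_residuals(x, y):
    """Fit y = a + b*x by least squares and return residuals / sigma_hat."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Residual variance uses n - 2 degrees of freedom (two fitted parameters).
    sigma_hat = math.sqrt(sum(r ** 2 for r in residuals) / (n - 2))
    return [r / sigma_hat for r in residuals]


# Hypothetical toy data standing in for df.education and df.income.
education = [10.0, 20.0, 30.0, 40.0, 50.0]
income = [36000.0, 47000.0, 52000.0, 68000.0, 72000.0]
scaled = normalized_residuals(education, income)
print([round(s, 2) for s in scaled])
```

On this scale a value of -4.5 means the county sits about four and a half residual standard errors below the fitted line, which is why the top and bottom of the sorted table are worth a closer look.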
What's noticeable about the negative outliers? These are counties where educational attainment has not translated to higher median income. The ten largest are also all rural college towns:
Whitman, Washington: Washington State University
Oktibbeha, Mississippi: Mississippi State University
Lexington City, Virginia: Virginia Military Institute (VMI)
Clarke, Georgia: University of Georgia
Albany, Wyoming: University of Wyoming (in Laramie; the state capital, Cheyenne, is in neighboring Laramie County)
Monroe, Indiana: Indiana University
Watauga, North Carolina: Appalachian State University
Jackson, Illinois: Southern Illinois University (main economic engine)
Latah, Idaho: University of Idaho
Charlottesville City, Virginia: University of Virginia
There are a few patterns in the positive outliers, where median income is higher than predicted by educational attainment. Charles, Loudoun, Calvert, Stafford, and Prince William counties all surround Washington, D.C. Hunterdon County, NJ and Putnam County, NY are outer suburbs of New York City. Chambers, TX is a suburb of Houston, and Campbell, WY is dominated by energy extraction (coal, oil, and gas). Glasscock, TX has 334 residents: let's not jump to any conclusions on that one!
There are a few more interactions worth exploring. See below for all pairwise scatterplots.
cols = ['education', 'income', 'unemployment', 'disability', 'life', 'obesity']
# scatter_matrix moved from pandas.tools.plotting to pandas.plotting in pandas 0.20
from pandas.plotting import scatter_matrix
scatter_matrix(df[cols], alpha=0.2, figsize=(6, 6), diagonal='kde')
a = 1  # no-op; presumably here so the notebook cell doesn't echo scatter_matrix's return value
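A numeric companion to the scatterplot matrix is the Pearson correlation for every pair of columns. The sketch below computes it by hand so the arithmetic is visible; the `pearson` helper and the toy columns are hypothetical stand-ins for `df[cols]`, not the NYT data. With the real DataFrame, `df[cols].corr()` should give the full table in one call.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical toy columns; replace with the real data to get real numbers.
columns = {
    "education": [12.0, 18.0, 25.0, 33.0, 41.0],
    "income": [31000.0, 42000.0, 47000.0, 60000.0, 71000.0],
    "obesity": [40.0, 37.0, 35.0, 30.0, 27.0],
}

# One correlation per unordered pair of column names.
correlations = {
    (a, b): pearson(columns[a], columns[b])
    for a, b in combinations(columns, 2)
}
for pair, r in correlations.items():
    print(pair, round(r, 3))
```

Pairs with a correlation near +1 or -1 correspond to the tight diagonal clouds in the scatterplot matrix; values near 0 correspond to shapeless blobs.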