The goal of this project is to predict the 2012 US presidential election results from data on contributions to the 2008 presidential campaigns. The 2008 data includes candidate names, contributor names, occupations, employers, zip codes, cities, states and contribution amounts. In this project, I focus on the number of contributors rather than the amount donated. My hypothesis is that small donations may not contribute much to fundraising totals, but may help us predict the number of votes.
Let's import the necessary modules first.
from sklearn import datasets, linear_model, cross_validation, grid_search
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn_pandas import DataFrameMapper, cross_val_score
import sklearn.preprocessing
The Federal Election Commission publishes a detailed data set of contributors to presidential campaigns. I downloaded the data sets for the 2008 and 2012 presidential elections from http://www.fec.gov/disclosurep/PDownload.do.
I used pandas' read_csv to read the file in chunks.
data_chunks = pd.read_csv('/Users/sergulaydore/Downloads/P00000001-ALL.csv', # 2008 data
index_col=False, iterator=True, chunksize=1000) # gives a TextFileReader, which is iterable in
# chunks of 1000 rows.
data2008 = pd.concat(data_chunks, ignore_index=True)
data_chunks = pd.read_csv('/Users/sergulaydore/Downloads/P00000001-ALL 2.csv', # 2012 Data
index_col=False, iterator=True, chunksize=1000) # gives a TextFileReader, which is iterable in
# chunks of 1000 rows.
data2012 = pd.concat(data_chunks, ignore_index=True)
I used the subset of the 2008 data containing contributions to Barack Obama and John McCain, the two main candidates. In the 2012 election, “Mitt Romney” was the main Republican candidate, so “John McCain” is replaced by “Mitt Romney”.
data2008_bo_jm = data2008[data2008.cand_nm.isin(['McCain, John S','Obama, Barack'])]
data2012_bo_mr = data2012[data2012.cand_nm.isin(['Romney, Mitt','Obama, Barack'])]
I ignore contributions from outside the US (the 50 states plus DC):
states = ["AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DC", "DE", "FL", "GA",
"HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD",
"MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ",
"NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC",
"SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"]
f = lambda x: x in states
data2008_bo_jm_states = data2008_bo_jm[data2008_bo_jm.contbr_st.map(f)]
data2012_bo_mr_states = data2012_bo_mr[data2012_bo_mr.contbr_st.map(f)]
I selected the following features for my data analysis:
cand_nm: candidate name
contbr_st: contributor's state
contbr_occupation: contributor's occupation
contbr_employer: contributor's employer
contb_receipt_amt: amount of contribution
data2008_short = data2008_bo_jm_states[['cand_nm','contbr_st','contbr_occupation', 'contbr_employer',
'contb_receipt_amt']]
data2012_short = data2012_bo_mr_states[['cand_nm','contbr_st','contbr_occupation', 'contbr_employer',
'contb_receipt_amt']]
Ignore contributions of $0 or less:
data2008_short = data2008_short[data2008_short.contb_receipt_amt > 0]
data2012_short = data2012_short[data2012_short.contb_receipt_amt > 0]
I noticed that several occupation and employer entries may refer to the same thing. I cleaned up the data by mapping the variants of the same occupation/employer to a single canonical name.
occ_mapping = {'INFORMATION REQUESTED PER BEST EFFORTS': 'NOT PROVIDED',
'INFORMATION REQUESTED': 'NOT PROVIDED',
'INFORMATION REQUESTED (BEST EFFORTS)': 'NOT PROVIDED',
'SELF': 'SELF EMPLOYED',
'SELF-EMPLOYED': 'SELF EMPLOYED'}
f = lambda x: occ_mapping.get(x, x)
data2008_short.contbr_occupation = data2008_short.contbr_occupation.map(f)
data2008_short.contbr_employer = data2008_short.contbr_employer.map(f)
data2012_short.contbr_occupation = data2012_short.contbr_occupation.map(f)
data2012_short.contbr_employer = data2012_short.contbr_employer.map(f)
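The dict.get(x, x) idiom used above returns the canonical name when a value is a known variant, and passes any other value through unchanged. A minimal illustration (with a reduced mapping):

```python
occ_mapping = {'SELF': 'SELF EMPLOYED', 'SELF-EMPLOYED': 'SELF EMPLOYED'}
f = lambda x: occ_mapping.get(x, x)  # fall back to x itself when x is not a known variant

print(f('SELF'))     # canonicalized to 'SELF EMPLOYED'
print(f('TEACHER'))  # unknown values pass through unchanged
```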
Find the most popular occupations and employers. Note that the [1:30] slice keeps positions 1 through 29 of the frequency-sorted counts, skipping the single most frequent entry.
occupation_counts = data2008_short.contbr_occupation.value_counts()
occupations = occupation_counts[1:30].keys()
f = lambda x: x in occupations
data2008_short_occ = data2008_short[data2008_short.contbr_occupation.map(f)]
employer_counts = data2008_short_occ.contbr_employer.value_counts()
employers = employer_counts[1:30].keys()
f = lambda x: x in employers
data2008_short_occ_emp = data2008_short_occ[data2008_short_occ.contbr_employer.map(f)]
print occupations
Index([u'ATTORNEY', u'NOT EMPLOYED', u'NOT PROVIDED', u'PHYSICIAN', u'PROFESSOR', u'HOMEMAKER', u'CONSULTANT', u'TEACHER', u'ENGINEER', u'STUDENT', u'WRITER', u'MANAGER', u'PRESIDENT', u'LAWYER', u'SALES', u'ARTIST', u'EXECUTIVE', u'SOFTWARE ENGINEER', u'OWNER', u'CEO', u'PSYCHOLOGIST', u'ACCOUNTANT', u'ARCHITECT', u'SELF EMPLOYED', u'ADMINISTRATOR', u'REAL ESTATE', u'REGISTERED NURSE', u'EDUCATOR', u'RN'], dtype='object')
print employers
Index([u'NOT EMPLOYED', u'NOT PROVIDED', u'HOMEMAKER', u'UNEMPLOYED', u'IBM', u'UNIVERSITY OF CALIFORNIA', u'HARVARD UNIVERSITY', u'UNIVERSITY OF WASHINGTON', u'COLUMBIA UNIVERSITY', u'UNIVERSITY OF MICHIGAN', u'JONES DAY', u'UNIVERSITY OF CHICAGO', u'KAISER PERMANENTE', u'STANFORD UNIVERSITY', u'UCLA', u'SIDLEY AUSTIN LLP', u'MICROSOFT', u'NORTHWESTERN UNIVERSITY', u'AT&T', u'STUDENT', u'CORNELL UNIVERSITY', u'YALE UNIVERSITY', u'GOOGLE', u'STATE OF CALIFORNIA', u'UNIVERSITY OF PENNSYLVANIA', u'DUKE UNIVERSITY', u'UNIVERSITY OF MINNESOTA', u'JOHNS HOPKINS UNIVERSITY', u'CHICAGO PUBLIC SCHOOLS'], dtype='object')
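To make the [1:30] slice concrete, here is a toy Series (the occupation strings are made up): value_counts sorts by frequency in descending order, and the positional slice drops the single most frequent entry.

```python
import pandas as pd

s = pd.Series(['RETIRED', 'RETIRED', 'RETIRED', 'ATTORNEY', 'ATTORNEY', 'TEACHER'])
counts = s.value_counts()   # sorted by count, descending: RETIRED, ATTORNEY, TEACHER
top = counts[1:3].keys()    # positional slice skips the most frequent entry
print(list(top))            # ['ATTORNEY', 'TEACHER']
```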
Extract a subset of 2012 data with these occupations and employers
f = lambda x: x in occupations
data2012_short_occ = data2012_short[data2012_short.contbr_occupation.map(f)]
f = lambda x: x in employers
data2012_short_occ_emp = data2012_short_occ[data2012_short_occ.contbr_employer.map(f)]
Replace candidate names with their political parties.
parties = { 'Romney, Mitt': 'Republican',
'Obama, Barack': 'Democrat',
'McCain, John S': 'Republican'}
data2008_short_occ_emp.cand_nm = data2008_short_occ_emp.cand_nm.map(parties)
data2012_short_occ_emp.cand_nm = data2012_short_occ_emp.cand_nm.map(parties)
/Users/sergulaydore/anaconda/lib/python2.7/site-packages/pandas/core/generic.py:1974: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead. See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy self[name] = value
Compare by occupation
by_occupation = data2008_short_occ_emp.pivot_table(index='contbr_occupation',
columns='cand_nm', aggfunc='size')
by_occupation_normalized = by_occupation.div(by_occupation.sum(axis=1), axis=0)
by_occupation_normalized.plot(kind='barh', stacked=True, figsize=(10, 10))
(Figure: stacked horizontal bar chart of the fraction of contributors to each party, by occupation.)
Compare by employer
by_employer = data2008_short_occ_emp.pivot_table(index = 'contbr_employer',
columns = 'cand_nm', aggfunc = 'size')
by_employer_normalized = by_employer.div(by_employer.sum(axis=1),axis = 0)
by_employer_normalized.plot(kind = 'barh', stacked=True)
(Figure: stacked horizontal bar chart of the fraction of contributors to each party, by employer.)
Convert categorical values into numbers for logistic regression.
mapper = DataFrameMapper([
('cand_nm', sklearn.preprocessing.LabelBinarizer()),
('contbr_st', sklearn.preprocessing.LabelBinarizer()),
('contbr_occupation', sklearn.preprocessing.LabelBinarizer()),
('contbr_employer', sklearn.preprocessing.LabelBinarizer() )
])
data2008_encoded = np.round(mapper.fit_transform(data2008_short_occ_emp), 2)
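LabelBinarizer one-hot encodes a categorical column; with exactly two classes it emits a single 0/1 column, which is why the party label ends up as the single first column of data2008_encoded. A quick sketch:

```python
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
encoded = lb.fit_transform(['Democrat', 'Republican', 'Democrat'])
print(lb.classes_)  # classes are sorted: 'Democrat' -> 0, 'Republican' -> 1
print(encoded)      # with two classes, a single 0/1 column: [[0], [1], [0]]
```

This is also why the predictions can later be mapped back with {0: 'Democrat', 1: 'Republican'}.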
Define the target vector and predictors. The target vector consists of party labels; the predictors are state, occupation and employer.
y = data2008_encoded[:,0]
X = data2008_encoded[:,1:]
Train logistic regression
lr = linear_model.LogisticRegression()
c_range = np.logspace(0, 4, 10)
lrgs = grid_search.GridSearchCV(estimator=lr, param_grid=dict(C=c_range), n_jobs=1)
kf_total = cross_validation.KFold(len(X), n_folds=10, indices=True, shuffle=True, random_state=4)
/Users/sergulaydore/anaconda/lib/python2.7/site-packages/sklearn/cross_validation.py:65: DeprecationWarning: The indices parameter is deprecated and will be removed (assumed True) in 0.17 stacklevel=1)
[lrgs.fit(X[train],y[train]).score(X[test],y[test]) for train, test in kf_total]
[0.90431769722814503, 0.90420347243374966, 0.90256254045615503, 0.90046833948901495, 0.90408559570498415, 0.90652248410311087, 0.90290522788714156, 0.90477097056695732, 0.90446635951719145, 0.90827399763926442]
Use 2012 data to test the model
data2012_encoded = np.round(mapper.transform(data2012_short_occ_emp), 2) # reuse the binarizers fitted on 2008 data; refitting on 2012 could reorder the columns
y12 = data2012_encoded[:,0]
X12 = data2012_encoded[:,1:]
print 'Prediction accuracy is ', lrgs.score(X12,y12)
Prediction accuracy is 0.721696727282
Predictions for 2012 elections
y12_prediction = lrgs.predict(X12)
Add predictions to 2012 data frame
data2012_short_occ_emp['prediction'] = pd.Series(y12_prediction, data2012_short_occ_emp.index)
data2012_short_occ_emp['prediction'] = data2012_short_occ_emp['prediction'].map({0:'Democrat',1:'Republican'})
Group predictions by state
grouped = data2012_short_occ_emp.groupby(['prediction', 'contbr_st'])
Compute the size of each group
totals = grouped.size().unstack(0).fillna(0)
totals.head()
| contbr_st | Democrat | Republican |
|---|---|---|
| AK | 666 | 144 |
| AL | 593 | 3161 |
| AR | 361 | 1346 |
| AZ | 1704 | 5156 |
| CA | 54130 | 9301 |
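The groupby/size/unstack chain can be seen on a toy frame (values made up): size counts the rows in each (prediction, state) pair, and unstack(0) pivots the prediction level into columns.

```python
import pandas as pd

df = pd.DataFrame({'prediction': ['Democrat', 'Republican', 'Democrat'],
                   'contbr_st':  ['AK', 'AK', 'AL']})
# count rows per (prediction, state), then pivot predictions into columns
totals = df.groupby(['prediction', 'contbr_st']).size().unstack(0).fillna(0)
print(totals)  # AK: 1 Democrat, 1 Republican; AL: 1 Democrat, 0 Republican
```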
Normalize the values
percent = totals.div(totals.sum(1), axis=0)
percent.head()
| contbr_st | Democrat | Republican |
|---|---|---|
| AK | 0.822222 | 0.177778 |
| AL | 0.157965 | 0.842035 |
| AR | 0.211482 | 0.788518 |
| AZ | 0.248397 | 0.751603 |
| CA | 0.853368 | 0.146632 |
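The div call divides each row by its row sum (axis=0 aligns the divisor with the row index), so each state's counts become fractions that sum to 1. A sketch using the AK/AL counts shown earlier:

```python
import pandas as pd

totals = pd.DataFrame({'Democrat': [666, 593], 'Republican': [144, 3161]},
                      index=['AK', 'AL'])
percent = totals.div(totals.sum(axis=1), axis=0)  # divide each row by its total
print(percent)  # AK: 666/810 = 0.822222, AL: 3161/3754 = 0.842035 Republican
```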
Plot these values on US map
%pylab inline
from mpl_toolkits.basemap import Basemap, cm
import numpy as np
from matplotlib.collections import LineCollection
import matplotlib.pyplot as plt
from matplotlib.patches import Polygon
import shapefile
from matplotlib import cm
shp = shapefile.Reader('/Users/sergulaydore/Downloads/cb_2013_us_state_500k/cb_2013_us_state_500k.shp')
obama = percent['Democrat']
fig = plt.figure(figsize = (12, 12))
ax = fig.add_axes([0.1, 0.1, 0.8, 0.8])
lllat = 21; urlat = 53; lllon = -118; urlon = -62
m = Basemap(ax=ax, projection='stere',
lon_0 = (urlon + lllon) / 2., lat_0 = (urlat + lllat) / 2. ,
llcrnrlat = lllat, urcrnrlat = urlat, llcrnrlon = lllon,
urcrnrlon = urlon, resolution = 'l')
m.drawcoastlines()
m.drawcountries()
m.drawmapboundary(fill_color='aqua')
m.fillcontinents(color = '#C0C0C0')
m.readshapefile('/Users/sergulaydore/Downloads/cb_2013_us_state_500k/cb_2013_us_state_500k',
'states', drawbounds=True)
state_loc = {'CA': (304369, 1.84052e+06), 'OR': (570691, 2.5951e+06),
'WA': (684829, 2.988224e+06), 'ID': (1.04627e+06, 2.38585e+06),
'ND': (2.18765e+06, 2.61413e+06), 'SD': (2.15594e+06, 2.28439e+06),
'TX': (2.12424e+06, 832301), 'MN': (2.6442e+06, 2.44292e+06),
'WI': (2.96125e+06, 2.25269e+06), 'MI': (3.39878e+06, 2.15757e+06),
'OH': (3.57633e+06, 1.80248e+06), 'WV': (3.7729e+06, 1.63127e+06),
'LA': (2.71395e+06, 813278), 'MS': (3.02466e+06, 883029),
'AL': (3.26562e+06, 914734), 'GA': (3.59535e+06, 908393),
'FL': (3.88633e+06, 458181), 'NC': (3.9885e+06, 1.26983e+06),
'VA': (4.00118e+06, 1.54249e+06), 'ME': (4.60992e+06, 2.61413e+06),
'NY':(4.15336e+06, 2.23366e+06)}
for shape, country in zip(m.states, m.states_info):
    if country['STUSPS'] in states:
        if country['STUSPS'] not in state_loc:
            state_loc[country['STUSPS']] = np.mean(shape, 0)
        if obama[country['STUSPS']] >= 0.5:
            poly = Polygon(shape, facecolor='b', alpha=obama[country['STUSPS']])
        else:
            poly = Polygon(shape, facecolor='r', alpha=1 - obama[country['STUSPS']])
        plt.gca().add_patch(poly)
    else:
        poly = Polygon(shape, facecolor='gray')
        plt.gca().add_patch(poly)
for key, coord in state_loc.items():
    if key not in ['HI', 'AK']:
        ax.text(coord[0], coord[1], key, style='italic',
                bbox={'facecolor': 'white', 'alpha': 1, 'pad': 10})
plt.title('Predicted normalized number of contributors per state in 2012', fontsize=20)
plt.show()
Populating the interactive namespace from numpy and matplotlib
WARNING: pylab import has clobbered these variables: ['shape', 'poly'] `%matplotlib` prevents importing * from pylab and numpy
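The per-state coloring rule in the loop above can be isolated into a small helper (a sketch; the function name is mine): states with a Democrat share of at least 0.5 are drawn blue with alpha equal to that share, and the rest red with alpha equal to the Republican share, so lopsided states appear more saturated.

```python
def state_color(dem_share):
    """Return (facecolor, alpha) for a state given its predicted Democrat share."""
    if dem_share >= 0.5:
        return 'b', dem_share      # blue, stronger with a larger Democrat share
    return 'r', 1 - dem_share      # red, stronger with a larger Republican share

print(state_color(0.822222))  # AK-like state: blue
print(state_color(0.157965))  # AL-like state: red
```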
2012 Election results are displayed below.
from IPython.display import Image
Image(filename='election-results-map-usa-2012.jpeg')
The predicted results are very similar to the actual results; the incorrectly predicted states are NV, FL, MT, MO, NC and IN.
The number of small donations in presidential campaigns is a promising indicator of the number of votes. By focusing on contributors' states, occupations and employers, we can predict the results of the next election with an accuracy as high as 0.72.
My next goal is to predict the amount of donations using the same features, which would help candidates plan their fundraising better.