The goal of this project is to predict the 2012 US presidential election results from data on contributions to the 2008 presidential campaigns. The 2008 data includes candidate names, contributor names, occupations, employers, zip codes, cities, states and contribution amounts. In this project, I focus on the number of contributors rather than the amount donated. My hypothesis is that small donations may not contribute much to fundraising totals, but may help us predict the number of votes.
Let's import the necessary modules first.
from sklearn import datasets, linear_model, cross_validation, grid_search
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn_pandas import DataFrameMapper, cross_val_score
import sklearn.preprocessing
The Federal Election Commission publishes a detailed data set of contributors to presidential campaigns. I downloaded the data sets for the 2008 and 2012 presidential elections from http://www.fec.gov/disclosurep/PDownload.do.
I used pandas' read_csv to read the file in chunks.
data_chunks = pd.read_csv('/Users/sergulaydore/Downloads/P00000001-ALL.csv', # 2008 data
index_col=False, iterator=True, chunksize=1000) # gives a TextFileReader, which is iterable in
# chunks of 1000 rows.
data2008 = pd.concat(data_chunks, ignore_index=True)
data_chunks = pd.read_csv('/Users/sergulaydore/Downloads/P00000001-ALL 2.csv', # 2012 Data
index_col=False, iterator=True, chunksize=1000) # gives a TextFileReader, which is iterable in
# chunks of 1000 rows.
data2012 = pd.concat(data_chunks, ignore_index=True)
I used the subset of the 2008 data containing contributions to Barack Obama and John McCain, the two main candidates. In the 2012 election, “Mitt Romney” was the main Republican candidate, so “John McCain” is replaced by “Mitt Romney”.
data2008_bo_jm = data2008[data2008.cand_nm.isin(['McCain, John S','Obama, Barack'])]
data2012_bo_mr = data2012[data2012.cand_nm.isin(['Romney, Mitt','Obama, Barack'])]
I ignore contributions from outside the US (the 50 states plus DC):
states = ["AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DC", "DE", "FL", "GA",
"HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD",
"MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ",
"NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC",
"SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"]
f = lambda x: x in states
data2008_bo_jm_states = data2008_bo_jm[data2008_bo_jm.contbr_st.map(f)]
data2012_bo_mr_states = data2012_bo_mr[data2012_bo_mr.contbr_st.map(f)]
I selected the following features for my data analysis:
cand_nm: candidate name
contbr_st: contributor's state
contbr_occupation: contributor's occupation
contbr_employer: contributor's employer
contb_receipt_amt: amount of contribution
data2008_short = data2008_bo_jm_states[['cand_nm','contbr_st','contbr_occupation', 'contbr_employer',
'contb_receipt_amt']]
data2012_short = data2012_bo_mr_states[['cand_nm','contbr_st','contbr_occupation', 'contbr_employer',
'contb_receipt_amt']]
Ignore contributions of $0 or less:
data2008_short = data2008_short[data2008_short.contb_receipt_amt > 0]
data2012_short = data2012_short[data2012_short.contb_receipt_amt > 0]
I noticed that several occupation and employer entries may refer to the same thing. I cleaned up the data by mapping the variants of the same occupation/employer to a single canonical name.
occ_mapping = {'INFORMATION REQUESTED PER BEST EFFORTS': 'NOT PROVIDED',
'INFORMATION REQUESTED': 'NOT PROVIDED',
'INFORMATION REQUESTED (BEST EFFORTS)': 'NOT PROVIDED',
'SELF': 'SELF EMPLOYED',
'SELF-EMPLOYED': 'SELF EMPLOYED'}
f = lambda x: occ_mapping.get(x, x)
data2008_short.contbr_occupation = data2008_short.contbr_occupation.map(f)
data2008_short.contbr_employer = data2008_short.contbr_employer.map(f)
data2012_short.contbr_occupation = data2012_short.contbr_occupation.map(f)
data2012_short.contbr_employer = data2012_short.contbr_employer.map(f)
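The dict.get(x, x) idiom used above returns the canonical name when a value is a known variant, and passes any other value through unchanged. A minimal illustration (with a reduced mapping):

```python
occ_mapping = {'SELF': 'SELF EMPLOYED', 'SELF-EMPLOYED': 'SELF EMPLOYED'}
f = lambda x: occ_mapping.get(x, x)  # fall back to x itself when x is not a known variant

print(f('SELF'))     # canonicalized to 'SELF EMPLOYED'
print(f('TEACHER'))  # unknown values pass through unchanged
```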
Find the most popular occupations and employers. Note that the [1:30] slice keeps positions 1 through 29 of the frequency-sorted counts, skipping the single most frequent entry.
occupation_counts = data2008_short.contbr_occupation.value_counts()
occupations = occupation_counts[1:30].keys()
f = lambda x: x in occupations
data2008_short_occ = data2008_short[data2008_short.contbr_occupation.map(f)]
employer_counts = data2008_short_occ.contbr_employer.value_counts()
employers = employer_counts[1:30].keys()
f = lambda x: x in employers
data2008_short_occ_emp = data2008_short_occ[data2008_short_occ.contbr_employer.map(f)]
print occupations
Index([u'ATTORNEY', u'NOT EMPLOYED', u'NOT PROVIDED', u'PHYSICIAN', u'PROFESSOR', u'HOMEMAKER', u'CONSULTANT', u'TEACHER', u'ENGINEER', u'STUDENT', u'WRITER', u'MANAGER', u'PRESIDENT', u'LAWYER', u'SALES', u'ARTIST', u'EXECUTIVE', u'SOFTWARE ENGINEER', u'OWNER', u'CEO', u'PSYCHOLOGIST', u'ACCOUNTANT', u'ARCHITECT', u'SELF EMPLOYED', u'ADMINISTRATOR', u'REAL ESTATE', u'REGISTERED NURSE', u'EDUCATOR', u'RN'], dtype='object')
print employers
Index([u'NOT EMPLOYED', u'NOT PROVIDED', u'HOMEMAKER', u'UNEMPLOYED', u'IBM', u'UNIVERSITY OF CALIFORNIA', u'HARVARD UNIVERSITY', u'UNIVERSITY OF WASHINGTON', u'COLUMBIA UNIVERSITY', u'UNIVERSITY OF MICHIGAN', u'JONES DAY', u'UNIVERSITY OF CHICAGO', u'KAISER PERMANENTE', u'STANFORD UNIVERSITY', u'UCLA', u'SIDLEY AUSTIN LLP', u'MICROSOFT', u'NORTHWESTERN UNIVERSITY', u'AT&T', u'STUDENT', u'CORNELL UNIVERSITY', u'YALE UNIVERSITY', u'GOOGLE', u'STATE OF CALIFORNIA', u'UNIVERSITY OF PENNSYLVANIA', u'DUKE UNIVERSITY', u'UNIVERSITY OF MINNESOTA', u'JOHNS HOPKINS UNIVERSITY', u'CHICAGO PUBLIC SCHOOLS'], dtype='object')
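To make the [1:30] slice concrete, here is a toy Series (the occupation strings are made up): value_counts sorts by frequency in descending order, and the positional slice drops the single most frequent entry.

```python
import pandas as pd

s = pd.Series(['RETIRED', 'RETIRED', 'RETIRED', 'ATTORNEY', 'ATTORNEY', 'TEACHER'])
counts = s.value_counts()   # sorted by count, descending: RETIRED, ATTORNEY, TEACHER
top = counts[1:3].keys()    # positional slice skips the most frequent entry
print(list(top))            # ['ATTORNEY', 'TEACHER']
```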
Extract a subset of 2012 data with these occupations and employers
f = lambda x: x in occupations
data2012_short_occ = data2012_short[data2012_short.contbr_occupation.map(f)]
f = lambda x: x in employers
data2012_short_occ_emp = data2012_short_occ[data2012_short_occ.contbr_employer.map(f)]
Replace candidate names with their political parties.
parties = { 'Romney, Mitt': 'Republican',
'Obama, Barack': 'Democrat',
'McCain, John S': 'Republican'}
data2008_short_occ_emp.cand_nm = data2008_short_occ_emp.cand_nm.map(parties)
data2012_short_occ_emp.cand_nm = data2012_short_occ_emp.cand_nm.map(parties)
/Users/sergulaydore/anaconda/lib/python2.7/site-packages/pandas/core/generic.py:1974: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead. See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy self[name] = value
Compare by occupation
by_occupation = data2008_short_occ_emp.pivot_table(index='contbr_occupation',
columns='cand_nm', aggfunc='size')
by_occupation_normalized = by_occupation.div(by_occupation.sum(axis=1), axis=0)
by_occupation_normalized.plot(kind='barh', stacked=True, figsize=(10, 10))
(Figure: stacked horizontal bar chart of the fraction of contributors to each party, by occupation.)
Compare by employer
by_employer = data2008_short_occ_emp.pivot_table(index = 'contbr_employer',
columns = 'cand_nm', aggfunc = 'size')
by_employer_normalized = by_employer.div(by_employer.sum(axis=1),axis = 0)
by_employer_normalized.plot(kind = 'barh', stacked=True)
(Figure: stacked horizontal bar chart of the fraction of contributors to each party, by employer.)
Convert categorical values into numbers for logistic regression.
mapper = DataFrameMapper([
('cand_nm', sklearn.preprocessing.LabelBinarizer()),
('contbr_st', sklearn.preprocessing.LabelBinarizer()),
('contbr_occupation', sklearn.preprocessing.LabelBinarizer()),
('contbr_employer', sklearn.preprocessing.LabelBinarizer() )
])
data2008_encoded = np.round(mapper.fit_transform(data2008_short_occ_emp), 2)
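LabelBinarizer one-hot encodes a categorical column; with exactly two classes it emits a single 0/1 column, which is why the party label ends up as the single first column of data2008_encoded. A quick sketch:

```python
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
encoded = lb.fit_transform(['Democrat', 'Republican', 'Democrat'])
print(lb.classes_)  # classes are sorted: 'Democrat' -> 0, 'Republican' -> 1
print(encoded)      # with two classes, a single 0/1 column: [[0], [1], [0]]
```

This is also why the predictions can later be mapped back with {0: 'Democrat', 1: 'Republican'}.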
Define the target vector and predictors. The target vector consists of party labels; the predictors are state, occupation and employer.
y = data2008_encoded[:,0]
X = data2008_encoded[:,1:]
Train logistic regression
lr = linear_model.LogisticRegression()
c_range = np.logspace(0, 4, 10)
lrgs = grid_search.GridSearchCV(estimator=lr, param_grid=dict(C=c_range), n_jobs=1)
kf_total = cross_validation.KFold(len(X), n_folds=10, indices=True, shuffle=True, random_state=4)
/Users/sergulaydore/anaconda/lib/python2.7/site-packages/sklearn/cross_validation.py:65: DeprecationWarning: The indices parameter is deprecated and will be removed (assumed True) in 0.17 stacklevel=1)
[lrgs.fit(X[train],y[train]).score(X[test],y[test]) for train, test in kf_total]
[0.90431769722814503, 0.90420347243374966, 0.90256254045615503, 0.90046833948901495, 0.90408559570498415, 0.90652248410311087, 0.90290522788714156, 0.90477097056695732, 0.90446635951719145, 0.90827399763926442]
Use 2012 data to test the model
data2012_encoded = np.round(mapper.transform(data2012_short_occ_emp), 2) # reuse the binarizers fitted on 2008 data; refitting on 2012 could reorder the columns
y12 = data2012_encoded[:,0]
X12 = data2012_encoded[:,1:]
print 'Prediction accuracy is ', lrgs.score(X12,y12)
Prediction accuracy is 0.721696727282
Predictions for 2012 elections
y12_prediction = lrgs.predict(X12)
Add predictions to 2012 data frame
data2012_short_occ_emp['prediction'] = pd.Series(y12_prediction, data2012_short_occ_emp.index)
data2012_short_occ_emp['prediction'] = data2012_short_occ_emp['prediction'].map({0:'Democrat',1:'Republican'})
Group predictions by state
grouped = data2012_short_occ_emp.groupby(['prediction', 'contbr_st'])
Compute the size of each group
totals = grouped.size().unstack(0).fillna(0)
totals.head()
| contbr_st | Democrat | Republican |
|---|---|---|
| AK | 666 | 144 |
| AL | 593 | 3161 |
| AR | 361 | 1346 |
| AZ | 1704 | 5156 |
| CA | 54130 | 9301 |
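The groupby/size/unstack chain can be seen on a toy frame (values made up): size counts the rows in each (prediction, state) pair, and unstack(0) pivots the prediction level into columns.

```python
import pandas as pd

df = pd.DataFrame({'prediction': ['Democrat', 'Republican', 'Democrat'],
                   'contbr_st':  ['AK', 'AK', 'AL']})
# count rows per (prediction, state), then pivot predictions into columns
totals = df.groupby(['prediction', 'contbr_st']).size().unstack(0).fillna(0)
print(totals)  # AK: 1 Democrat, 1 Republican; AL: 1 Democrat, 0 Republican
```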
Normalize the values
percent = totals.div(totals.sum(1), axis=0)
percent.head()
| contbr_st | Democrat | Republican |
|---|---|---|
| AK | 0.822222 | 0.177778 |
| AL | 0.157965 | 0.842035 |
| AR | 0.211482 | 0.788518 |
| AZ | 0.248397 | 0.751603 |
| CA | 0.853368 | 0.146632 |
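The div call divides each row by its row sum (axis=0 aligns the divisor with the row index), so each state's counts become fractions that sum to 1. A sketch using the AK/AL counts shown earlier:

```python
import pandas as pd

totals = pd.DataFrame({'Democrat': [666, 593], 'Republican': [144, 3161]},
                      index=['AK', 'AL'])
percent = totals.div(totals.sum(axis=1), axis=0)  # divide each row by its total
print(percent)  # AK: 666/810 = 0.822222, AL: 3161/3754 = 0.842035 Republican
```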
Plot these values on US map
%pylab inline
from mpl_toolkits.basemap import Basemap, cm
import numpy as np
from matplotlib.collections import LineCollection
import matplotlib.pyplot as plt
from matplotlib.patches import Polygon
import shapefile
from matplotlib import cm
shp = shapefile.Reader('/Users/sergulaydore/Downloads/cb_2013_us_state_500k/cb_2013_us_state_500k.shp')
obama = percent['Democrat']
fig = plt.figure(figsize = (12, 12))
ax = fig.add_axes([0.1, 0.1, 0.8, 0.8])
lllat = 21; urlat = 53; lllon = -118; urlon = -62
m = Basemap(ax=ax, projection='stere',
lon_0 = (urlon + lllon) / 2., lat_0 = (urlat + lllat) / 2. ,
llcrnrlat = lllat, urcrnrlat = urlat, llcrnrlon = lllon,
urcrnrlon = urlon, resolution = 'l')
m.drawcoastlines()
m.drawcountries()
m.drawmapboundary(fill_color='aqua')
m.fillcontinents(color = '#C0C0C0')
m.readshapefile('/Users/sergulaydore/Downloads/cb_2013_us_state_500k/cb_2013_us_state_500k',
'states', drawbounds=True)
state_loc = {'CA': (304369, 1.84052e+06), 'OR': (570691, 2.5951e+06),
'WA': (684829, 2.988224e+06), 'ID': (1.04627e+06, 2.38585e+06),
'ND': (2.18765e+06, 2.61413e+06), 'SD': (2.15594e+06, 2.28439e+06),
'TX': (2.12424e+06, 832301), 'MN': (2.6442e+06, 2.44292e+06),
'WI': (2.96125e+06, 2.25269e+06), 'MI': (3.39878e+06, 2.15757e+06),
'OH': (3.57633e+06, 1.80248e+06), 'WV': (3.7729e+06, 1.63127e+06),
'LA': (2.71395e+06, 813278), 'MS': (3.02466e+06, 883029),
'AL': (3.26562e+06, 914734), 'GA': (3.59535e+06, 908393),
'FL': (3.88633e+06, 458181), 'NC': (3.9885e+06, 1.26983e+06),
'VA': (4.00118e+06, 1.54249e+06), 'ME': (4.60992e+06, 2.61413e+06),
'NY':(4.15336e+06, 2.23366e+06)}
for shape, country in zip(m.states, m.states_info):
    if country['STUSPS'] in states:
        if country['STUSPS'] not in state_loc:
            state_loc[country['STUSPS']] = np.mean(shape, 0)
        if obama[country['STUSPS']] >= 0.5:
            poly = Polygon(shape, facecolor='b', alpha=obama[country['STUSPS']])
        else:
            poly = Polygon(shape, facecolor='r', alpha=1 - obama[country['STUSPS']])
        plt.gca().add_patch(poly)
    else:
        poly = Polygon(shape, facecolor='gray')
        plt.gca().add_patch(poly)
for key, coord in state_loc.items():
    if key not in ['HI', 'AK']:
        ax.text(coord[0], coord[1], key, style='italic',
                bbox={'facecolor': 'white', 'alpha': 1, 'pad': 10})
plt.title('Predicted normalized number of contributors per state in 2012', fontsize=20)
plt.show()
Populating the interactive namespace from numpy and matplotlib
WARNING: pylab import has clobbered these variables: ['shape', 'poly'] `%matplotlib` prevents importing * from pylab and numpy
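The per-state coloring rule in the loop above can be isolated into a small helper (a sketch; the function name is mine): states with a Democrat share of at least 0.5 are drawn blue with alpha equal to that share, and the rest red with alpha equal to the Republican share, so lopsided states appear more saturated.

```python
def state_color(dem_share):
    """Return (facecolor, alpha) for a state given its predicted Democrat share."""
    if dem_share >= 0.5:
        return 'b', dem_share      # blue, stronger with a larger Democrat share
    return 'r', 1 - dem_share      # red, stronger with a larger Republican share

print(state_color(0.822222))  # AK-like state: blue
print(state_color(0.157965))  # AL-like state: red
```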
2012 Election results are displayed below.
from IPython.display import Image
Image(filename='election-results-map-usa-2012.jpeg')
The predicted results are very similar to the actual results; the incorrectly predicted states are NV, FL, MT, MO, NC and IN.
The number of small donations in presidential campaigns is a promising indicator of the number of votes. By focusing on contributors' states, occupations and employers, we can predict the results of the next election with an accuracy as high as 0.72.
My next goal is to predict the amount of donations using the same features, which would help candidates plan their fundraising better.