This is a variation of Greg Reda's 3-point analysis for NCAA basketball. Instead of NCAA basketball 3-pointers, I'm going to look at field goal percentages in the NFL. The overall purpose of this post is to learn more about Python's PyMC3 and scikit-learn. The scikit-learn portion is based on work done by Trey Causey at the spread. Here we will use a Random Forest classifier to predict field goal performance.
import numpy as np
import scipy.stats as stats
import pandas as pd
import pandas.io.sql as psql
import mysql.connector as sql
%matplotlib inline
import matplotlib.pyplot as plt
# from mpld3 import enable_notebook
# enable_notebook()
I will be using the NFL data set from Armchair Analysis. Their data is conveniently broken up into categories, enabling quick analysis studies, and they also provide scripts for setting up and importing the data into a SQL database. The following code loads the data from the SQL database into a pandas DataFrame using pandas.io.sql.
import pickle
with open('sql.config', 'rb') as fp:
    config = pickle.load(fp)
con = sql.connect(**config)
df = psql.frame_query('''
SELECT fgxp.DIST, fgxp.GOOD, games.TEMP, games.STAD FROM fgxp, core, games
WHERE core.PID=fgxp.PID AND games.GID = core.GID AND fgxp.FGXP='FG';''', con)
con.close()
The temperature data was loaded into the SQL database as a character string; convert it to a number.
df.TEMP = df.TEMP.astype(float)
df.head()
| | DIST | GOOD | TEMP | STAD |
|---|---|---|---|---|
0 | 43 | Y | 79 | Georgia Dome |
1 | 44 | Y | 79 | Georgia Dome |
2 | 24 | Y | 79 | Georgia Dome |
3 | 44 | Y | 79 | Georgia Dome |
4 | 48 | Y | 79 | Georgia Dome |
Next let's add the elevation data extracted using the code in another notebook.
df_elevation = pd.read_csv('stadium_with_altitude.csv', index_col=0)
df_elevation.head()
| | STAD | lat | lng | alt |
|---|---|---|---|---|
0 | Georgia Dome | 33.757690 | -84.400829 | 303.163727 |
1 | Cleveland Browns Stadium | 41.506483 | -81.700040 | 183.638306 |
2 | Texas Stadium | 32.841068 | -96.910899 | 133.471329 |
3 | Lambeau Field | 44.501357 | -88.060755 | 190.276779 |
4 | Arrowhead Stadium | 39.048939 | -94.483916 | 256.770477 |
Now extract the elevation for every field goal attempt.
def getAlt(d):
    c = df_elevation.STAD == d
    if np.any(c):
        return df_elevation[c]['alt'].iloc[0]
    return np.nan
df['ALT'] = df.STAD.map(getAlt)
df.head()
| | DIST | GOOD | TEMP | STAD | ALT |
|---|---|---|---|---|---|
0 | 43 | Y | 79 | Georgia Dome | 303.163727 |
1 | 44 | Y | 79 | Georgia Dome | 303.163727 |
2 | 24 | Y | 79 | Georgia Dome | 303.163727 |
3 | 44 | Y | 79 | Georgia Dome | 303.163727 |
4 | 48 | Y | 79 | Georgia Dome | 303.163727 |
Now let's partition the data for an MCMC study of field goal percentage by distance and temperature. Let's look at distances greater than 40 yards and define cold temperatures as those below 40 degrees.
convert = lambda x: True if x == 'Y' else False
normal = (df.TEMP >= 40) & (df.DIST > 40) & df.TEMP.notnull()
normal = df[normal].GOOD.apply(convert)
cold = (df.TEMP < 40) & (df.DIST > 40) & df.TEMP.notnull()
cold = df[cold].GOOD.apply(convert)
print normal.mean(), normal.std()
print cold.mean(), cold.std()
0.659776992113 0.47384847452 0.608137044968 0.488689847026
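As a quick sanity check on those numbers: for a Bernoulli variable with success probability p, the standard deviation is sqrt(p(1-p)), so the sample standard deviations printed above follow almost exactly from the sample means (the tiny discrepancy is pandas' n-1 sample correction):

```python
import math

def bernoulli_std(p):
    # population standard deviation of a Bernoulli(p) variable
    return math.sqrt(p * (1 - p))

# plugging in the sample means printed above reproduces the
# printed standard deviations to about three decimal places
print(bernoulli_std(0.659776992113))  # ~0.4738
print(bernoulli_std(0.608137044968))  # ~0.4882
```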
It looks like there is a small difference in the means, but is it significant? These are just observations of a Bernoulli trial. To estimate the underlying probability of making a field goal in normal or cold conditions, we will run an MCMC simulation, which gives us a posterior distribution for the success probability under each condition.
import pymc as pm
with pm.Model() as model:
    p_normal = pm.Uniform("p_normal", lower=0.2, upper=0.8)
    p_cold = pm.Uniform("p_cold", lower=0.4, upper=0.8)
    # observed makes (1) and misses (0) under each condition
    obs_normal = pm.Bernoulli("obs_normal", p_normal, observed=normal.astype(int))
    obs_cold = pm.Bernoulli("obs_cold", p_cold, observed=cold.astype(int))
    start = pm.find_MAP()
    m = pm.NUTS(state=start)
    trace = pm.sample(2000, m, start=start, progressbar=False)
p_normal_samples = trace['p_normal'][:]
p_cold_samples= trace['p_cold'][:]
delta_samples = trace['p_normal']-trace['p_cold']
plt.figure(figsize=(10, 10))
ax = plt.subplot(311)
plt.hist(p_normal_samples, histtype='stepfilled', bins=50, normed=True, color='#E41A1C', label='Warm')
plt.legend()
plt.vlines(normal.mean(), 0, 300, linestyles='--')
plt.xticks(np.arange(0.5, 0.8, 0.05))
plt.xlim(0.5, 0.8)
plt.ylim(0, 80)
plt.grid(False)
ax = plt.subplot(312)
plt.hist(p_cold_samples, histtype='stepfilled', bins=50, normed=True, color='#4DAF4A', label='Cold')
plt.vlines(cold.mean(), 0, 300, linestyles='--')
plt.legend()
plt.xticks(np.arange(0.5, 0.8, 0.05))
plt.xlim(0.5, 0.8)
plt.ylim(0, 60)
plt.grid(False)
ax = plt.subplot(313)
plt.hist(delta_samples, histtype='stepfilled', bins=50, normed=True, color='#377EB8', label='Delta')
plt.vlines(0, 0, 300, linestyles='--', label='$H_0$ (No difference)')
plt.legend()
plt.xticks(np.arange(-0.1, 0.201, 0.1))
plt.xlim(-0.1, 0.2)
plt.ylim(0, 60)
plt.grid(False);
print ("Field goals longer than 40 yards are less successful in temperatures < 40 degrees "
       "in {0:.1f}% of simulations").format((delta_samples > 0).mean() * 100)
Field goals longer than 40 yards are less successful in temperatures < 40 degrees in 98.8% of simulations
In approximately 99% of the simulations, field goals at colder temperatures were less likely to have been made.
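Another way to summarize the comparison is a credible interval on the difference of the two probabilities. The sketch below is self-contained rather than reusing the PyMC trace: it draws posterior samples from the conjugate Beta posterior of a Bernoulli model under a flat Beta(1, 1) prior, with hypothetical make/miss counts standing in for the notebook's data:

```python
import numpy as np

rng = np.random.RandomState(0)

# hypothetical (makes, misses) counts standing in for the data above
normal_makes, normal_misses = 660, 340
cold_makes, cold_misses = 284, 183

# with a flat Beta(1, 1) prior, the posterior of a Bernoulli
# success probability is Beta(makes + 1, misses + 1)
p_normal = rng.beta(normal_makes + 1, normal_misses + 1, size=20000)
p_cold = rng.beta(cold_makes + 1, cold_misses + 1, size=20000)
delta = p_normal - p_cold

# 95% credible interval for the difference, and the posterior
# probability that the normal-condition success rate is higher
lo, hi = np.percentile(delta, [2.5, 97.5])
prob_normal_better = (delta > 0).mean()
```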
Now let's look at using a Random Forest classifier to predict the probability of successfully making a field goal, with distance, temperature, and stadium elevation as training features.
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import train_test_split
from sklearn.metrics import classification_report, roc_curve, auc, precision_recall_curve
First let's start with just distance and temperature as features for determining whether a field goal is made. The approach below uses 60% of the data for training and 40% for evaluating performance. Let's develop a model.
convert = lambda x: 1 if x == 'Y' else 0
data = df[df.TEMP.notnull()].copy()
data.GOOD = data.GOOD.apply(convert)
X = data[['DIST','TEMP']]
y = data['GOOD']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)
# Set up the random forest model with parameters found via grid search
clf = RandomForestClassifier(oob_score = True, criterion='gini', n_estimators=100, min_samples_leaf=20, max_features=2)
clf.fit(X_train, y_train)
RandomForestClassifier(bootstrap=True, compute_importances=None, criterion='gini', max_depth=None, max_features=2, min_density=None, min_samples_leaf=20, min_samples_split=2, n_estimators=100, n_jobs=1, oob_score=True, random_state=None, verbose=0)
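The comment in the cell above mentions parameters found via grid search, but the search itself isn't shown. Below is a minimal sketch of what it might have looked like, using scikit-learn's GridSearchCV on synthetic stand-in data; the parameter grid is my guess, not the one actually used (and recent scikit-learn versions keep GridSearchCV in sklearn.model_selection rather than the older sklearn.grid_search):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(42)

# synthetic stand-in for (DIST, TEMP): success gets less
# likely as the kick distance grows
X = np.column_stack([rng.uniform(18, 62, 500), rng.uniform(10, 90, 500)])
y = (rng.rand(500) < 1.1 - X[:, 0] / 70.0).astype(int)

# hypothetical grid over the parameters tuned in this notebook
param_grid = {'min_samples_leaf': [10, 20, 40], 'max_features': [1, 2]}
grid = GridSearchCV(RandomForestClassifier(n_estimators=50, random_state=0),
                    param_grid, cv=3)
grid.fit(X, y)
best = grid.best_params_
```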
def summary():
    print 'Training set score: %.1f%%' % (clf.score(X_train, y_train)*100)
    print 'Test set score: %.1f%%' % (clf.score(X_test, y_test)*100)
    preds = clf.predict(X_test)
    print classification_report(np.array(y_test), preds)
    print 'Confusion matrix: '
    print pd.crosstab(y_test, preds, rownames=["Actual"], colnames=["Predicted"])
# summary()
def roc():
    full_probs = clf.predict_proba(X_test)
    fpr_full, tpr_full, thresholds_full = roc_curve(y_test, full_probs[:, 1])
    roc_auc_full = auc(fpr_full, tpr_full)
    plt.figure(figsize=(6, 5))
    plt.plot(fpr_full, tpr_full, label='Full random forest ROC curve (area = %0.2f)' % roc_auc_full)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.0])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic curve')
    plt.legend(loc="lower right")
roc()
Above is the ROC curve illustrating our model's performance against the test data. It isn't spectacular, but the model is clearly doing better than flipping a coin.
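precision_recall_curve was imported earlier but never used; since most field goals are made, the classes are imbalanced, and a precision-recall curve can be a useful complement to ROC. A self-contained sketch on synthetic scores (not the notebook's actual test set):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, auc

rng = np.random.RandomState(1)

# synthetic stand-in: 1000 attempts, ~65% made, with scores
# that are mildly informative about the outcome
y_true = (rng.rand(1000) < 0.65).astype(int)
scores = 0.3 * y_true + 0.7 * rng.rand(1000)

precision, recall, thresholds = precision_recall_curve(y_true, scores)
pr_auc = auc(recall, precision)  # area under the precision-recall curve
```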
h = 5 # step size in the mesh
x_min, x_max = X.DIST.min() - .5, X.DIST.max() + .5
y_min, y_max = X.TEMP.min() - .5, X.TEMP.max() + .5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
s = xx.ravel().shape
cm = plt.cm.RdBu
v = None
for estimator in clf.estimators_:
    e = estimator.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
    if v is None:
        v = e.reshape((len(e), 1))
    else:
        v = np.hstack((v, e.reshape((len(e), 1))))
Z = v.mean(axis=1)
p1 = np.percentile(v, 2.5, axis=1)
p2 = np.percentile(v, 97.5, axis=1)
P = p2-p1
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12,4))
plt.subplot(121)
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=cm, alpha=.8, levels=np.arange(0,1.1,0.1))
plt.title('Prediction')
plt.ylabel('Temperature (${\ }^{\circ} F$)')
plt.xlabel('Yards From Scrimmage')
plt.subplot(122)
P = P.reshape(xx.shape)
co = plt.contourf(xx, yy, P, cmap=cm, alpha=.8, levels=np.arange(0,1.1,0.1))
plt.xlabel('Yards From Scrimmage')
plt.title('Width of Confidence Interval')
fig.subplots_adjust(right=0.8)
cbar_ax = fig.add_axes([0.81, 0.15, 0.02, 0.7])
fig.colorbar(co, cax=cbar_ax);
The figure above illustrates the predicted probability of making a field goal at a given temperature and distance (left) and the width of the 95% confidence interval estimated from the distribution of estimator outputs (right). A couple of quick observations: as expected, shorter distances mean a higher likelihood the field goal will be made, and the confidence interval is narrower in regions where more data is available. So does temperature impact the likelihood of making a field goal? The impact appears minor; right around 40 degrees there is a slight change in the probability. Our MCMC simulation put this difference at only a couple of percentage points, and the figure above appears to illustrate the same thing. Looking at the 90th-percentile probability contour, there is a shift of about 5 yards for the same probability. So temperature has an impact, but a very minor one. Below is the feature importance estimated from the classifier; unsurprisingly, distance is more important than temperature.
print "Distance: %f, Temperature: %f" %(clf.feature_importances_[0],clf.feature_importances_[1])
Distance: 0.730580, Temperature: 0.269420
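As a side note, scikit-learn normalizes feature_importances_ so the values sum to one; the two numbers above are shares of a fixed total rather than absolute effect sizes:

```python
# the importances reported above; scikit-learn normalizes
# feature_importances_ so they sum to 1 across all features
importances = [0.730580, 0.269420]
total = sum(importances)  # sums to 1 (up to rounding)
```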
convert = lambda x: 1 if x == 'Y' else 0
data = df[df.TEMP.notnull()].copy()
data.GOOD = data.GOOD.apply(convert)
X = data[['DIST','TEMP','ALT']]
y = data['GOOD']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)
# Set up the random forest model with parameters found via grid search
clf = RandomForestClassifier(oob_score = True, criterion='entropy', n_estimators=100, min_samples_leaf=15)
clf.fit(X_train, y_train)
RandomForestClassifier(bootstrap=True, compute_importances=None, criterion='entropy', max_depth=None, max_features='auto', min_density=None, min_samples_leaf=15, min_samples_split=2, n_estimators=100, n_jobs=1, oob_score=True, random_state=None, verbose=0)
roc()
The ROC curve above does not show much improvement over the two-feature classifier. Let's look at the predicted probabilities and confidence intervals with altitude included in our model.
h = 5 # step size in the mesh
x_min, x_max = X.DIST.min() - .5, X.DIST.max() + .5
y_min, y_max = X.TEMP.min() - .5, X.TEMP.max() + .5
z_min, z_max = X.ALT.min() - .5, X.ALT.max() + .5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
s = xx.ravel().shape
cm = plt.cm.RdBu
def predict(z):
    v = None
    for estimator in clf.estimators_:
        e = estimator.predict_proba(np.c_[xx.ravel(), yy.ravel(), np.repeat(z, s)])[:, 1]
        if v is None:
            v = e.reshape((len(e), 1))
        else:
            v = np.hstack((v, e.reshape((len(e), 1))))
    m = v.mean(axis=1)
    p1 = np.percentile(v, 2.5, axis=1)
    p2 = np.percentile(v, 97.5, axis=1)
    p = p2 - p1
    return m, p
fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(10,8))
plt.subplot(321)
Z,P = predict(5.0)
# Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel(), np.repeat(5.0,s)])[:, 1]
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=cm, alpha=.8, levels=np.arange(0,1.1,0.1))
plt.title('5 Meters Altitude Prediction')
plt.ylabel('Temperature (${\ }^{\circ} F$)')
plt.subplot(322)
P = P.reshape(xx.shape)
plt.contourf(xx, yy, P, cmap=cm, alpha=.8, levels=np.arange(0,1.1,0.1))
plt.title('5 Meters Altitude Confidence Interval Width')
plt.subplot(323)
Z,P = predict(500.0)
# Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel(), np.repeat(500.0,s)])[:, 1]
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=cm, alpha=.8, levels=np.arange(0,1.1,0.1))
plt.title('500 Meters Altitude Prediction')
plt.ylabel('Temperature (${\ }^{\circ} F$)')
plt.subplot(324)
P = P.reshape(xx.shape)
plt.contourf(xx, yy, P, cmap=cm, alpha=.8, levels=np.arange(0,1.1,0.1))
plt.title('500 Meters Altitude Confidence Interval Width')
plt.subplot(325)
Z,P = predict(1500.0)
# Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel(), np.repeat(1500.0,s)])[:, 1]
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=cm, alpha=.8, levels=np.arange(0,1.1,0.1))
plt.title('1500 Meters Altitude Prediction')
plt.xlabel('Yards From Scrimmage')
plt.ylabel('Temperature (${\ }^{\circ} F$)')
plt.subplot(326)
P = P.reshape(xx.shape)
co = plt.contourf(xx, yy, P, cmap=cm, alpha=.8, levels=np.arange(0,1.1,0.1))
plt.title('1500 Meters Altitude Confidence Interval Width')
plt.xlabel('Yards From Scrimmage')
fig.subplots_adjust(right=.8)
cbar_ax = fig.add_axes([1, 0.15, 0.02, 0.7])
fig.colorbar(co, cax=cbar_ax)
plt.tight_layout()
There is not too much to take from these results. Higher altitudes have some impact on the probability. A pattern shows up for temperature at low altitude; at higher altitudes temperature doesn't really play a role (especially when accounting for the confidence interval). Probably the most notable change in probability across altitude is in the 80th-percentile probabilities at lower temperatures: moving up from lower altitudes, there appears to be an increase in probability at lower temperatures.
plt.figure(figsize=(12,6))
plt.bar([0,1,2],clf.feature_importances_, align='center')
plt.xticks([0,1,2],['Distance','Temperature','Altitude']);