This is one of the 100 recipes of the IPython Cookbook, the definitive guide to high-performance scientific computing and data science in Python.
You need to download the Tennis dataset on the book's website, and extract it in the current directory. (http://ipython-books.github.io)
import numpy as np
import pandas as pd
import scipy.stats as st
import matplotlib.pyplot as plt
%matplotlib inline
player = 'Roger Federer'
filename = "data/{name}.csv".format(
    name=player.replace(' ', '-'))
df = pd.read_csv(filename)
print("Number of columns: " + str(len(df.columns)))
df[df.columns[:4]].tail()
npoints = df['player1 total points total']
points = df['player1 total points won'] / npoints
aces = df['player1 aces'] / npoints
plt.plot(points, aces, '.');
plt.xlabel('Proportion of points won');
plt.ylabel('Proportion of aces');
plt.xlim(0., 1.);
plt.ylim(0.);
If the two variables were independent, we would not see any trend in the cloud of points. On this plot, it is a bit hard to tell. Let's use pandas to compute a correlation coefficient.
We create a new DataFrame with only those fields (note that this step is not compulsory). We also remove the rows where one field is missing.
df_bis = pd.DataFrame({'points': points,
                       'aces': aces}).dropna()
df_bis.tail()
df_bis.corr()
A correlation of ~0.26 seems to indicate a positive correlation between our two variables. In other words, the more aces in a match, the more points the player wins (which is not very surprising!).
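To make explicit what df_bis.corr() reports, here is a minimal sketch that computes the Pearson correlation coefficient by hand and compares it with the pandas result. The data are synthetic: the arrays x and y, the seed, and the helper pearson_r are illustrative assumptions, not part of the tennis dataset.

```python
import numpy as np
import pandas as pd


def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()  # center both variables
    yc = y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())


# Synthetic, weakly correlated data (assumption: for illustration only).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.3 * x + rng.normal(size=200)

r_manual = pearson_r(x, y)
# DataFrame.corr() uses the Pearson coefficient by default.
r_pandas = pd.DataFrame({'x': x, 'y': y}).corr().loc['x', 'y']
print(r_manual, r_pandas)
```

Both values should agree up to floating-point error. The coefficient always lies between -1 and 1, and it only measures linear association.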
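To test whether this correlation is statistically significant, we first convert both variables to binary indicators (whether each value is above its median) and build a contingency table.
df_bis['result'] = df_bis['points'] > df_bis['points'].median()
df_bis['manyaces'] = df_bis['aces'] > df_bis['aces'].median()
pd.crosstab(df_bis['result'], df_bis['manyaces'])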
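We then run a chi-square test of independence on this contingency table with scipy.stats.chi2_contingency, which returns several objects. The null hypothesis is that the two variables are independent. We're interested in the second result, which is the p-value.
st.chi2_contingency(_)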
The p-value is much lower than 0.05, so we reject the null hypothesis and conclude that there is a statistically significant correlation between the proportion of aces and the proportion of points won in a match (for Roger Federer!).
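As a rough illustration of what the test returns, the sketch below runs scipy.stats.chi2_contingency on a hypothetical 2x2 table and unpacks the four results. The counts are made up for the example and are not Federer's data.

```python
import numpy as np
import scipy.stats as st

# Hypothetical 2x2 contingency table (counts invented for illustration).
# Rows: result (False/True), columns: manyaces (False/True).
table = np.array([[30, 20],
                  [15, 35]])

# Returns the test statistic, the p-value, the degrees of freedom,
# and the expected counts under the independence hypothesis.
chi2, pvalue, dof, expected = st.chi2_contingency(table)

# Expected counts under independence: row total * column total / grand total.
expected_manual = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()

print("chi2 = {:.3f}, p-value = {:.4f}, dof = {}".format(chi2, pvalue, dof))
print(expected)
```

For a 2x2 table, SciPy applies Yates' continuity correction by default (correction=True), which makes the test slightly more conservative.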
As always, correlation does not imply causation... Here, it is likely that external factors influence both variables. (http://en.wikipedia.org/wiki/Correlation_does_not_imply_causation)
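The effect of an external factor can be sketched with a small simulation. Here a hypothetical hidden variable z drives both x and y, which have no direct influence on each other. All names, the seed, and the noise model are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

z = rng.normal(size=n)        # hypothetical external factor
x = z + rng.normal(size=n)    # x depends on z, not on y
y = z + rng.normal(size=n)    # y depends on z, not on x

# x and y are correlated only because they share the common cause z.
r_xy = np.corrcoef(x, y)[0, 1]


def residuals(v, z):
    """Remove the linear contribution of z from v (least-squares fit)."""
    slope, intercept = np.polyfit(z, v, 1)
    return v - (slope * z + intercept)


# Once z is accounted for, the remaining correlation is close to zero.
r_partial = np.corrcoef(residuals(x, z), residuals(y, z))[0, 1]
print(r_xy, r_partial)
```

In this toy model the raw correlation is clearly positive, while the correlation between the residuals is negligible. A similar mechanism could be at work in the tennis data, although the dataset alone cannot confirm it.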
You'll find all the explanations, figures, references, and much more in the book (to be released later this summer).
IPython Cookbook, by Cyrille Rossant, Packt Publishing, 2014 (500 pages).