Structuring datasets to facilitate analysis (Wickham 2014)
If there's one maxim I can impart it's that your tools shouldn't get in the way of your analysis. Your problem is already difficult enough, don't let the data or your tools make it any harder.
In a tidy dataset...
We'll cover a few methods that help you get there.
Based on this StackOverflow question.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
pd.options.display.max_rows = 10
%matplotlib inline
Earlier, I fetched some data
tables = pd.read_html("http://www.basketball-reference.com/leagues/NBA_2015_games.html")
games = tables[0]
games.to_csv('data/games.csv', index=False)
pd.read_html?
!head -n 2 data/games.csv
Date,Unnamed: 1,Visitor/Neutral,PTS,Home/Neutral,PTS.1,Unnamed: 6,Notes "Tue, Oct 28, 2014",Box Score,Houston Rockets,108,Los Angeles Lakers,90,,
The Question:
How many days of rest did each team get between each game?
Whether or not your dataset is tidy depends on your question. Given our question, what is an observation?
column_names = ['date', '_', 'away_team', 'away_points', 'home_team',
'home_points', 'n_ot', 'notes']
games = (pd.read_csv('data/games.csv', names=column_names, parse_dates=['date'],
skiprows=1)
.drop(['_', 'notes', 'n_ot'], axis='columns')
.set_index('date', append=True))
games.index.names = ['game_id', 'date']
games.head()
away_team | away_points | home_team | home_points | ||
---|---|---|---|---|---|
game_id | date | ||||
0 | 2014-10-28 | Houston Rockets | 108 | Los Angeles Lakers | 90 |
1 | 2014-10-28 | Orlando Magic | 84 | New Orleans Pelicans | 101 |
2 | 2014-10-28 | Dallas Mavericks | 100 | San Antonio Spurs | 101 |
3 | 2014-10-29 | Brooklyn Nets | 105 | Boston Celtics | 121 |
4 | 2014-10-29 | Milwaukee Bucks | 106 | Charlotte Hornets | 108 |
Is games
a tidy dataset, given our question? No, we have multiple observations (teams) per row. We'll use pd.melt
to fix that.
tidy = pd.melt(games.sort_index().reset_index(),
id_vars=['game_id', 'date'], value_vars=['away_team', 'home_team'],
value_name='team')
tidy.head()
game_id | date | variable | team | |
---|---|---|---|---|
0 | 0 | 2014-10-28 | away_team | Houston Rockets |
1 | 1 | 2014-10-28 | away_team | Orlando Magic |
2 | 2 | 2014-10-28 | away_team | Dallas Mavericks |
3 | 3 | 2014-10-29 | away_team | Brooklyn Nets |
4 | 4 | 2014-10-29 | away_team | Milwaukee Bucks |
Now the translation from question to operation is direct:
# For each team... get number of dates between games
tidy.groupby('team')['date'].diff().dt.days - 1
0 NaN 1 NaN 2 NaN 3 NaN 4 NaN .. 2455 1 2456 4 2457 4 2458 1 2459 10 dtype: float64
tidy['rest'] = tidy.sort('date').groupby('team').date.diff().dt.days - 1
tidy.dropna().head()
game_id | date | variable | team | rest | |
---|---|---|---|---|---|
10 | 10 | 2014-10-29 | away_team | Los Angeles Lakers | 0 |
14 | 14 | 2014-10-29 | away_team | Houston Rockets | 0 |
15 | 15 | 2014-10-30 | away_team | New York Knicks | 0 |
16 | 16 | 2014-10-30 | away_team | Utah Jazz | 0 |
17 | 17 | 2014-10-30 | away_team | Oklahoma City Thunder | 0 |
un = pd.pivot_table(tidy, values='rest',
index=['game_id', 'date'],
columns='variable').rename(
columns={'away_team': 'away_rest', 'home_team': 'home_rest'}
)
un.columns.name = None
un.dropna().head()
away_rest | home_rest | ||
---|---|---|---|
game_id | date | ||
16 | 2014-10-30 | 0 | 1 |
18 | 2014-10-30 | 0 | 0 |
19 | 2014-10-30 | 0 | 1 |
20 | 2014-10-31 | 0 | 1 |
21 | 2014-10-31 | 1 | 1 |
df = pd.concat([games, un], axis=1)
df
away_team | away_points | home_team | home_points | away_rest | home_rest | ||
---|---|---|---|---|---|---|---|
game_id | date | ||||||
0 | 2014-10-28 | Houston Rockets | 108 | Los Angeles Lakers | 90 | NaN | NaN |
1 | 2014-10-28 | Orlando Magic | 84 | New Orleans Pelicans | 101 | NaN | NaN |
2 | 2014-10-28 | Dallas Mavericks | 100 | San Antonio Spurs | 101 | NaN | NaN |
3 | 2014-10-29 | Brooklyn Nets | 105 | Boston Celtics | 121 | NaN | NaN |
4 | 2014-10-29 | Milwaukee Bucks | 106 | Charlotte Hornets | 108 | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... |
1225 | 2015-04-15 | Oklahoma City Thunder | 138 | Minnesota Timberwolves | 113 | 1 | 1 |
1226 | 2015-04-15 | San Antonio Spurs | 103 | New Orleans Pelicans | 108 | 2 | 1 |
1227 | 2015-04-15 | Detroit Pistons | 112 | New York Knicks | 90 | 1 | 1 |
1228 | 2015-04-15 | Miami Heat | 105 | Philadelphia 76ers | 101 | 1 | 1 |
1229 | 2015-04-15 | Charlotte Hornets | 87 | Toronto Raptors | 92 | 1 | 0 |
1230 rows × 6 columns
g = sns.FacetGrid(data=tidy.dropna(), col='team', col_wrap=5, hue='team')
g.map(sns.barplot, "variable", "rest");
delta = (un.home_rest - un.away_rest).dropna().astype(int)
(delta.value_counts()
.reindex(np.arange(delta.min(), delta.max() + 1), fill_value=0)
.sort_index().plot(kind='bar', color='k', width=.9, rot=0, figsize=(12, 6)))
<matplotlib.axes._subplots.AxesSubplot at 0x10a769a90>
An "observation" depends on the question. Home team advantage?
home_adv = games.home_points - games.away_points
ax = (home_adv).plot(kind='hist', bins=80, color='k', figsize=(10, 5))
ax.set_xlim(-40, 40)
mu = home_adv.mean()
ax.vlines(mu, *ax.get_ylim(), color='steelblue', linewidth=3)
print('Home win percent:', (home_adv > 0).mean())
Home win percent: 0.574796747967
What's the effect (in terms of probability to win) of being the home team.
We need to create an indicator for whether the home team won.
Add it as a column called home_win
in games
.
games['home_win'] = ... # fill this in
#%load -r 1:4 solutions_tidy.py
Teams are split across two columns. It's easiest to calculate the number of wins and number of games as away, and the number of wins and number of games as home. Then combine those two results to get the win percent.
wins_as_home = games.groupby('').agg([])
# hint: use `~` to flip an array of booleans
wins_as_away = ...
wins_as_home.columns = ['n_wins', 'n_games']
wins_as_away.columns = ['n_wins', 'n_games']
away_points | home_points | home_win | ||||
---|---|---|---|---|---|---|
a | <lambda> | a | <lambda> | a | <lambda> | |
away_team | ||||||
Atlanta Hawks | 4164 | -4165 | 4069 | -4070 | 16 | -17 |
Boston Celtics | 4154 | -4155 | 4205 | -4206 | 22 | -23 |
Brooklyn Nets | 3978 | -3979 | 4119 | -4120 | 22 | -23 |
Charlotte Hornets | 3825 | -3826 | 4019 | -4020 | 27 | -28 |
Chicago Bulls | 4087 | -4088 | 4018 | -4019 | 18 | -19 |
... | ... | ... | ... | ... | ... | ... |
Sacramento Kings | 4154 | -4155 | 4419 | -4420 | 30 | -31 |
San Antonio Spurs | 4057 | -4058 | 3942 | -3943 | 19 | -20 |
Toronto Raptors | 4316 | -4317 | 4242 | -4243 | 19 | -20 |
Utah Jazz | 3853 | -3854 | 3955 | -3956 | 24 | -25 |
Washington Wizards | 3885 | -3886 | 4073 | -4074 | 24 | -25 |
30 rows × 6 columns
%load -r 5:13 solutions_tidy.py
Now add wins_as_home
and wins_as_away
to get a DataFrame with
two columsn, n_wins
, and n_games
and one row per team.
Finally, calculate the win percent.
%load -r 14:20 solutions_tidy.py
strength.order().plot(kind='barh', figsize=(5, 12))
<matplotlib.axes._subplots.AxesSubplot at 0x10b6db438>
Bring the strength
valuess in for each team, for each game.
games.head()
away_team | away_points | home_team | home_points | home_win | ||
---|---|---|---|---|---|---|
game_id | date | |||||
0 | 2014-10-28 | Houston Rockets | 108 | Los Angeles Lakers | 90 | False |
1 | 2014-10-28 | Orlando Magic | 84 | New Orleans Pelicans | 101 | True |
2 | 2014-10-28 | Dallas Mavericks | 100 | San Antonio Spurs | 101 | True |
3 | 2014-10-29 | Brooklyn Nets | 105 | Boston Celtics | 121 | True |
4 | 2014-10-29 | Milwaukee Bucks | 106 | Charlotte Hornets | 108 | True |
For SQL people
SELECT *
FROM games NATURAL JOIN strength
We just need to get the names worked out.
strength.head().reset_index().rename(columns=lambda x: 'away_' + x)
away_team | away_strength | |
---|---|---|
0 | Atlanta Hawks | 0.731707 |
1 | Boston Celtics | 0.487805 |
2 | Brooklyn Nets | 0.463415 |
3 | Charlotte Hornets | 0.402439 |
4 | Chicago Bulls | 0.609756 |
(pd.merge(games.reset_index(), strength.reset_index().add_prefix('away_'))
.pipe(pd.merge, strength.reset_index().add_prefix('home_'))
.set_index(['game_id', 'date']))
away_team | away_points | home_team | home_points | home_win | away_strength | home_strength | ||
---|---|---|---|---|---|---|---|---|
game_id | date | |||||||
0 | 2014-10-28 | Houston Rockets | 108 | Los Angeles Lakers | 90 | False | 0.682927 | 0.256098 |
661 | 2015-01-25 | Houston Rockets | 99 | Los Angeles Lakers | 87 | False | 0.682927 | 0.256098 |
541 | 2015-01-09 | Orlando Magic | 84 | Los Angeles Lakers | 101 | True | 0.304878 | 0.256098 |
932 | 2015-03-08 | Dallas Mavericks | 100 | Los Angeles Lakers | 93 | False | 0.609756 | 0.256098 |
1197 | 2015-04-12 | Dallas Mavericks | 120 | Los Angeles Lakers | 106 | False | 0.609756 | 0.256098 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
214 | 2014-11-26 | Sacramento Kings | 89 | Houston Rockets | 102 | True | 0.353659 | 0.682927 |
1112 | 2015-04-01 | Sacramento Kings | 111 | Houston Rockets | 115 | True | 0.353659 | 0.682927 |
476 | 2014-12-31 | Charlotte Hornets | 83 | Houston Rockets | 102 | True | 0.402439 | 0.682927 |
380 | 2014-12-18 | New Orleans Pelicans | 99 | Houston Rockets | 90 | False | 0.548780 | 0.682927 |
1195 | 2015-04-12 | New Orleans Pelicans | 114 | Houston Rockets | 121 | True | 0.548780 | 0.682927 |
1230 rows × 7 columns
For python people
games = games.assign(away_strength=games.away_team.map(strength),
home_strength=games.home_team.map(strength))
games.head()
away_team | away_points | home_team | home_points | home_win | away_strength | home_strength | ||
---|---|---|---|---|---|---|---|---|
game_id | date | |||||||
0 | 2014-10-28 | Houston Rockets | 108 | Los Angeles Lakers | 90 | False | 0.682927 | 0.256098 |
1 | 2014-10-28 | Orlando Magic | 84 | New Orleans Pelicans | 101 | True | 0.304878 | 0.548780 |
2 | 2014-10-28 | Dallas Mavericks | 100 | San Antonio Spurs | 101 | True | 0.609756 | 0.670732 |
3 | 2014-10-29 | Brooklyn Nets | 105 | Boston Celtics | 121 | True | 0.463415 | 0.487805 |
4 | 2014-10-29 | Milwaukee Bucks | 106 | Charlotte Hornets | 108 | True | 0.500000 | 0.402439 |
X = pd.concat([games, un], axis=1).set_index(['away_team', 'home_team'], append=True).dropna()
X.head()
away_points | home_points | home_win | away_strength | home_strength | away_rest | home_rest | ||||
---|---|---|---|---|---|---|---|---|---|---|
game_id | date | away_team | home_team | |||||||
16 | 2014-10-30 | Utah Jazz | Dallas Mavericks | 102 | 120 | True | 0.463415 | 0.609756 | 0 | 1 |
18 | 2014-10-30 | Detroit Pistons | Minnesota Timberwolves | 91 | 97 | True | 0.390244 | 0.195122 | 0 | 0 |
19 | 2014-10-30 | Washington Wizards | Orlando Magic | 105 | 98 | False | 0.560976 | 0.304878 | 0 | 1 |
20 | 2014-10-31 | Cleveland Cavaliers | Chicago Bulls | 114 | 108 | False | 0.646341 | 0.609756 | 0 | 1 |
21 | 2014-10-31 | Memphis Grizzlies | Indiana Pacers | 97 | 89 | False | 0.670732 | 0.463415 | 1 | 1 |
X['home_win'] = X.home_win.astype(int) # for statsmodels
import statsmodels.api as sm
mod = sm.Logit.from_formula('home_win ~ home_strength + away_strength + home_rest + away_rest', X)
res = mod.fit()
res.summary()
Optimization terminated successfully. Current function value: 0.564564 Iterations 6
Dep. Variable: | home_win | No. Observations: | 1213 |
---|---|---|---|
Model: | Logit | Df Residuals: | 1208 |
Method: | MLE | Df Model: | 4 |
Date: | Thu, 23 Jul 2015 | Pseudo R-squ.: | 0.1728 |
Time: | 19:07:38 | Log-Likelihood: | -684.82 |
converged: | True | LL-Null: | -827.83 |
LLR p-value: | 1.120e-60 |
coef | std err | z | P>|z| | [95.0% Conf. Int.] | |
---|---|---|---|---|---|
Intercept | -0.3637 | 0.302 | -1.205 | 0.228 | -0.956 0.228 |
home_strength | 5.5604 | 0.449 | 12.389 | 0.000 | 4.681 6.440 |
away_strength | -4.3474 | 0.442 | -9.835 | 0.000 | -5.214 -3.481 |
home_rest | 0.1327 | 0.069 | 1.913 | 0.056 | -0.003 0.269 |
away_rest | -0.0345 | 0.066 | -0.526 | 0.599 | -0.163 0.094 |
mod = sm.Logit.from_formula('home_win ~ rest_difference',
X.assign(rest_difference=lambda df: df.home_rest - df.away_rest))
res = mod.fit()
res.summary()
Optimization terminated successfully. Current function value: 0.682125 Iterations 4
Dep. Variable: | home_win | No. Observations: | 1213 |
---|---|---|---|
Model: | Logit | Df Residuals: | 1211 |
Method: | MLE | Df Model: | 1 |
Date: | Thu, 23 Jul 2015 | Pseudo R-squ.: | 0.0004950 |
Time: | 19:08:09 | Log-Likelihood: | -827.42 |
converged: | True | LL-Null: | -827.83 |
LLR p-value: | 0.3653 |
coef | std err | z | P>|z| | [95.0% Conf. Int.] | |
---|---|---|---|---|---|
Intercept | 0.2815 | 0.060 | 4.721 | 0.000 | 0.165 0.398 |
rest_difference | 0.0473 | 0.052 | 0.903 | 0.366 | -0.055 0.150 |
mod = sm.OLS.from_formula('spread ~ home_strength + away_strength + rest_difference',
X.assign(rest_difference=lambda df: df.home_rest - df.away_rest,
spread=lambda df: df.home_points - df.away_points))
res = mod.fit()
res.summary()
Dep. Variable: | spread | R-squared: | 0.236 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.234 |
Method: | Least Squares | F-statistic: | 124.4 |
Date: | Thu, 23 Jul 2015 | Prob (F-statistic): | 3.22e-70 |
Time: | 19:08:27 | Log-Likelihood: | -4712.3 |
No. Observations: | 1213 | AIC: | 9433. |
Df Residuals: | 1209 | BIC: | 9453. |
Df Model: | 3 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [95.0% Conf. Int.] | |
---|---|---|---|---|---|
Intercept | -0.3059 | 1.543 | -0.198 | 0.843 | -3.334 2.722 |
home_strength | 30.6895 | 2.096 | 14.639 | 0.000 | 26.576 34.802 |
away_strength | -25.5859 | 2.106 | -12.151 | 0.000 | -29.717 -21.455 |
rest_difference | 0.5926 | 0.304 | 1.949 | 0.051 | -0.004 1.189 |
Omnibus: | 9.779 | Durbin-Watson: | 1.993 |
---|---|---|---|
Prob(Omnibus): | 0.008 | Jarque-Bera (JB): | 12.290 |
Skew: | 0.106 | Prob(JB): | 0.00214 |
Kurtosis: | 3.446 | Cond. No. | 10.2 |