Tidy Data¶

Structuring datasets to facilitate analysis (Wickham 2014)

If there's one maxim I can impart it's that your tools shouldn't get in the way of your analysis. Your problem is already difficult enough, don't let the data or your tools make it any harder.

The Rules¶

In a tidy dataset...

Each variable forms a column
Each observation forms a row
Each type of observational unit forms a table

We'll cover a few methods that help you get there.

Based on this StackOverflow question.

In [1]:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

pd.options.display.max_rows = 10
%matplotlib inline

Earlier, I fetched some data

tables = pd.read_html("http://www.basketball-reference.com/leagues/NBA_2015_games.html")
games = tables[0]
games.to_csv('data/games.csv', index=False)

In [2]:

pd.read_html?

In [3]:

!head -n 2 data/games.csv

Date,Unnamed: 1,Visitor/Neutral,PTS,Home/Neutral,PTS.1,Unnamed: 6,Notes
"Tue, Oct 28, 2014",Box Score,Houston Rockets,108,Los Angeles Lakers,90,,

The Question:

How many days of rest did each team get between each game?

Whether or not your dataset is tidy depends on your question. Given our question, what is an observation?

In [4]:

column_names = ['date', '_', 'away_team', 'away_points', 'home_team',
                'home_points', 'n_ot', 'notes']
games = (pd.read_csv('data/games.csv', names=column_names, parse_dates=['date'],
                     skiprows=1)
           .drop(['_', 'notes', 'n_ot'], axis='columns')
           .set_index('date', append=True))
games.index.names = ['game_id', 'date']
games.head()

Out[4]:

		away_team	away_points	home_team	home_points
game_id	date
0	2014-10-28	Houston Rockets	108	Los Angeles Lakers	90
1	2014-10-28	Orlando Magic	84	New Orleans Pelicans	101
2	2014-10-28	Dallas Mavericks	100	San Antonio Spurs	101
3	2014-10-29	Brooklyn Nets	105	Boston Celtics	121
4	2014-10-29	Milwaukee Bucks	106	Charlotte Hornets	108

Is games a tidy dataset, given our question? No, we have multiple observations (teams) per row. We'll use pd.melt to fix that.

In [5]:

tidy = pd.melt(games.sort_index().reset_index(),
               id_vars=['game_id', 'date'], value_vars=['away_team', 'home_team'],
               value_name='team')

In [6]:

tidy.head()

Out[6]:

	game_id	date	variable	team
0	0	2014-10-28	away_team	Houston Rockets
1	1	2014-10-28	away_team	Orlando Magic
2	2	2014-10-28	away_team	Dallas Mavericks
3	3	2014-10-29	away_team	Brooklyn Nets
4	4	2014-10-29	away_team	Milwaukee Bucks

Now the translation from question to operation is direct:

In [7]:

# For each team... get number of dates between games
tidy.groupby('team')['date'].diff().dt.days - 1

Out[7]:

0      NaN
1      NaN
2      NaN
3      NaN
4      NaN
        ..
2455     1
2456     4
2457     4
2458     1
2459    10
dtype: float64

In [8]:

tidy['rest'] = tidy.sort('date').groupby('team').date.diff().dt.days - 1
tidy.dropna().head()

Out[8]:

	game_id	date	variable	team
10	10	2014-10-29	away_team	Los Angeles Lakers
14	14	2014-10-29	away_team	Houston Rockets
15	15	2014-10-30	away_team	New York Knicks
16	16	2014-10-30	away_team	Utah Jazz
17	17	2014-10-30	away_team	Oklahoma City Thunder

In [9]:

un = pd.pivot_table(tidy, values='rest',
                    index=['game_id', 'date'],
                    columns='variable').rename(
    columns={'away_team': 'away_rest', 'home_team': 'home_rest'}
)
un.columns.name = None

In [10]:

un.dropna().head()

Out[10]:

		away_rest	home_rest
game_id	date
16	2014-10-30	0	1
18	2014-10-30	0	0
19	2014-10-30	0	1
20	2014-10-31	0	1
21	2014-10-31	1	1

In [11]:

df = pd.concat([games, un], axis=1)
df

Out[11]:

		away_team	away_points	home_team	home_points	away_rest	home_rest
game_id	date
0	2014-10-28	Houston Rockets	108	Los Angeles Lakers	90	NaN	NaN
1	2014-10-28	Orlando Magic	84	New Orleans Pelicans	101	NaN	NaN
2	2014-10-28	Dallas Mavericks	100	San Antonio Spurs	101	NaN	NaN
3	2014-10-29	Brooklyn Nets	105	Boston Celtics	121	NaN	NaN
4	2014-10-29	Milwaukee Bucks	106	Charlotte Hornets	108	NaN	NaN
...	...	...	...	...	...	...	...
1225	2015-04-15	Oklahoma City Thunder	138	Minnesota Timberwolves	113	1	1
1226	2015-04-15	San Antonio Spurs	103	New Orleans Pelicans	108	2	1
1227	2015-04-15	Detroit Pistons	112	New York Knicks	90	1	1
1228	2015-04-15	Miami Heat	105	Philadelphia 76ers	101	1	1
1229	2015-04-15	Charlotte Hornets	87	Toronto Raptors	92	1	0

1230 rows × 6 columns

In [12]:

g = sns.FacetGrid(data=tidy.dropna(), col='team', col_wrap=5, hue='team')
g.map(sns.barplot, "variable", "rest");

In [13]:

delta = (un.home_rest - un.away_rest).dropna().astype(int)
(delta.value_counts()
    .reindex(np.arange(delta.min(), delta.max() + 1), fill_value=0)
    .sort_index().plot(kind='bar', color='k', width=.9, rot=0, figsize=(12, 6)))

Out[13]:

<matplotlib.axes._subplots.AxesSubplot at 0x10a769a90>

Stack / Unstack¶

An "observation" depends on the question. Home team advantage?

In [14]:

home_adv = games.home_points - games.away_points
ax = (home_adv).plot(kind='hist', bins=80, color='k', figsize=(10, 5))
ax.set_xlim(-40, 40)
mu = home_adv.mean()
ax.vlines(mu, *ax.get_ylim(), color='steelblue', linewidth=3)
print('Home win percent:', (home_adv > 0).mean())

Home win percent: 0.574796747967

Team Strength¶

Mini Project: Home Court Advantage?¶

What's the effect (in terms of probability to win) of being the home team.

Step 1: Calculate Win %¶

We need to create an indicator for whether the home team won. Add it as a column called home_win in games.

In [ ]:

games['home_win'] = ...  # fill this in

In [18]:

#%load -r 1:4 solutions_tidy.py

Step 2: Find the win percent for each team¶

Teams are split across two columns. It's easiest to calculate the number of wins and number of games as away, and the number of wins and number of games as home. Then combine those two results to get the win percent.

In [34]:

wins_as_home = games.groupby('').agg([])
# hint: use `~` to flip an array of booleans
wins_as_away = ...

wins_as_home.columns = ['n_wins', 'n_games']
wins_as_away.columns = ['n_wins', 'n_games']

Out[34]:

	away_points		home_points		home_win
	a	<lambda>	a	<lambda>	a	<lambda>
away_team
Atlanta Hawks	4164	-4165	4069	-4070	16	-17
Boston Celtics	4154	-4155	4205	-4206	22	-23
Brooklyn Nets	3978	-3979	4119	-4120	22	-23
Charlotte Hornets	3825	-3826	4019	-4020	27	-28
Chicago Bulls	4087	-4088	4018	-4019	18	-19
...	...	...	...	...	...	...
Sacramento Kings	4154	-4155	4419	-4420	30	-31
San Antonio Spurs	4057	-4058	3942	-3943	19	-20
Toronto Raptors	4316	-4317	4242	-4243	19	-20
Utah Jazz	3853	-3854	3955	-3956	24	-25
Washington Wizards	3885	-3886	4073	-4074	24	-25

30 rows × 6 columns

In [ ]:

%load -r 5:13 solutions_tidy.py

Now add wins_as_home and wins_as_away to get a DataFrame with two columsn, n_wins, and n_games and one row per team.

Finally, calculate the win percent.

In [59]:

%load -r 14:20 solutions_tidy.py

In [60]:

strength.order().plot(kind='barh', figsize=(5, 12))

Out[60]:

<matplotlib.axes._subplots.AxesSubplot at 0x10b6db438>

Bring the strength valuess in for each team, for each game.

In [61]:

games.head()

Out[61]:

		away_team	away_points	home_team	home_points	home_win
game_id	date
0	2014-10-28	Houston Rockets	108	Los Angeles Lakers	90	False
1	2014-10-28	Orlando Magic	84	New Orleans Pelicans	101	True
2	2014-10-28	Dallas Mavericks	100	San Antonio Spurs	101	True
3	2014-10-29	Brooklyn Nets	105	Boston Celtics	121	True
4	2014-10-29	Milwaukee Bucks	106	Charlotte Hornets	108	True

For SQL people

SELECT *
FROM games NATURAL JOIN strength

We just need to get the names worked out.

In [62]:

strength.head().reset_index().rename(columns=lambda x: 'away_' + x)

Out[62]:

	away_team	away_strength
0	Atlanta Hawks	0.731707
1	Boston Celtics	0.487805
2	Brooklyn Nets	0.463415
3	Charlotte Hornets	0.402439
4	Chicago Bulls	0.609756

In [63]:

(pd.merge(games.reset_index(), strength.reset_index().add_prefix('away_'))
   .pipe(pd.merge, strength.reset_index().add_prefix('home_'))
   .set_index(['game_id', 'date']))

Out[63]:

		away_team	away_points	home_team	home_points	home_win	away_strength	home_strength
game_id	date
0	2014-10-28	Houston Rockets	108	Los Angeles Lakers	90	False	0.682927	0.256098
661	2015-01-25	Houston Rockets	99	Los Angeles Lakers	87	False	0.682927	0.256098
541	2015-01-09	Orlando Magic	84	Los Angeles Lakers	101	True	0.304878	0.256098
932	2015-03-08	Dallas Mavericks	100	Los Angeles Lakers	93	False	0.609756	0.256098
1197	2015-04-12	Dallas Mavericks	120	Los Angeles Lakers	106	False	0.609756	0.256098
...	...	...	...	...	...	...	...	...
214	2014-11-26	Sacramento Kings	89	Houston Rockets	102	True	0.353659	0.682927
1112	2015-04-01	Sacramento Kings	111	Houston Rockets	115	True	0.353659	0.682927
476	2014-12-31	Charlotte Hornets	83	Houston Rockets	102	True	0.402439	0.682927
380	2014-12-18	New Orleans Pelicans	99	Houston Rockets	90	False	0.548780	0.682927
1195	2015-04-12	New Orleans Pelicans	114	Houston Rockets	121	True	0.548780	0.682927

1230 rows × 7 columns

For python people

In [64]:

games = games.assign(away_strength=games.away_team.map(strength),
                     home_strength=games.home_team.map(strength))
games.head()

Out[64]:

		away_team	away_points	home_team	home_points	home_win	away_strength	home_strength
game_id	date
0	2014-10-28	Houston Rockets	108	Los Angeles Lakers	90	False	0.682927	0.256098
1	2014-10-28	Orlando Magic	84	New Orleans Pelicans	101	True	0.304878	0.548780
2	2014-10-28	Dallas Mavericks	100	San Antonio Spurs	101	True	0.609756	0.670732
3	2014-10-29	Brooklyn Nets	105	Boston Celtics	121	True	0.463415	0.487805
4	2014-10-29	Milwaukee Bucks	106	Charlotte Hornets	108	True	0.500000	0.402439

In [65]:

X = pd.concat([games, un], axis=1).set_index(['away_team', 'home_team'], append=True).dropna()

In [66]:

X.head()

Out[66]:

				away_points	home_points	home_win	away_strength	home_strength	away_rest	home_rest
game_id	date	away_team	home_team
16	2014-10-30	Utah Jazz	Dallas Mavericks	102	120	True	0.463415	0.609756	0	1
18	2014-10-30	Detroit Pistons	Minnesota Timberwolves	91	97	True	0.390244	0.195122	0	0
19	2014-10-30	Washington Wizards	Orlando Magic	105	98	False	0.560976	0.304878	0	1
20	2014-10-31	Cleveland Cavaliers	Chicago Bulls	114	108	False	0.646341	0.609756	0	1
21	2014-10-31	Memphis Grizzlies	Indiana Pacers	97	89	False	0.670732	0.463415	1	1

In [67]:

X['home_win'] = X.home_win.astype(int)  # for statsmodels

In [68]:

import statsmodels.api as sm

In [69]:

mod = sm.Logit.from_formula('home_win ~ home_strength + away_strength + home_rest + away_rest', X)
res = mod.fit()
res.summary()

Optimization terminated successfully.
         Current function value: 0.564564
         Iterations 6

Out[69]:

Logit Regression Results
Dep. Variable:	home_win	No. Observations:	1213
Model:	Logit	Df Residuals:	1208
Method:	MLE	Df Model:	4
Date:	Thu, 23 Jul 2015	Pseudo R-squ.:	0.1728
Time:	19:07:38	Log-Likelihood:	-684.82
converged:	True	LL-Null:	-827.83
		LLR p-value:	1.120e-60

	coef	std err	z	P>\|z\|	[95.0% Conf. Int.]
Intercept	-0.3637	0.302	-1.205	0.228	-0.956 0.228
home_strength	5.5604	0.449	12.389	0.000	4.681 6.440
away_strength	-4.3474	0.442	-9.835	0.000	-5.214 -3.481
home_rest	0.1327	0.069	1.913	0.056	-0.003 0.269
away_rest	-0.0345	0.066	-0.526	0.599	-0.163 0.094

In [70]:

mod = sm.Logit.from_formula('home_win ~ rest_difference',
                            X.assign(rest_difference=lambda df: df.home_rest - df.away_rest))
res = mod.fit()
res.summary()

Optimization terminated successfully.
         Current function value: 0.682125
         Iterations 4

Out[70]:

Logit Regression Results
Dep. Variable:	home_win	No. Observations:	1213
Model:	Logit	Df Residuals:	1211
Method:	MLE	Df Model:	1
Date:	Thu, 23 Jul 2015	Pseudo R-squ.:	0.0004950
Time:	19:08:09	Log-Likelihood:	-827.42
converged:	True	LL-Null:	-827.83
		LLR p-value:	0.3653

	coef	std err	z	P>\|z\|	[95.0% Conf. Int.]
Intercept	0.2815	0.060	4.721	0.000	0.165 0.398
rest_difference	0.0473	0.052	0.903	0.366	-0.055 0.150

In [71]:

mod = sm.OLS.from_formula('spread ~ home_strength + away_strength + rest_difference',
                            X.assign(rest_difference=lambda df: df.home_rest - df.away_rest,
                                     spread=lambda df: df.home_points - df.away_points))
res = mod.fit()
res.summary()

Out[71]:

OLS Regression Results
Dep. Variable:	spread	R-squared:	0.236
Model:	OLS	Adj. R-squared:	0.234
Method:	Least Squares	F-statistic:	124.4
Date:	Thu, 23 Jul 2015	Prob (F-statistic):	3.22e-70
Time:	19:08:27	Log-Likelihood:	-4712.3
No. Observations:	1213	AIC:	9433.
Df Residuals:	1209	BIC:	9453.
Df Model:	3
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[95.0% Conf. Int.]
Intercept	-0.3059	1.543	-0.198	0.843	-3.334 2.722
home_strength	30.6895	2.096	14.639	0.000	26.576 34.802
away_strength	-25.5859	2.106	-12.151	0.000	-29.717 -21.455
rest_difference	0.5926	0.304	1.949	0.051	-0.004 1.189

Omnibus:	9.779	Durbin-Watson:	1.993
Prob(Omnibus):	0.008	Jarque-Bera (JB):	12.290
Skew:	0.106	Prob(JB):	0.00214
Kurtosis:	3.446	Cond. No.	10.2

Recap¶

Tidy data: one row per observation
- melt / stack: wide to long
- pivot_table / unstack: long to wide

In [ ]: