from ggplot import *
import pandas as pd
import numpy as np
%matplotlib inline
df = pd.read_csv("./baseball-pitches-clean.csv")
df = df[['pitch_time', 'inning', 'pitcher_name', 'hitter_name', 'pitch_type',
'px', 'pz', 'pitch_name', 'start_speed', 'end_speed', 'type_confidence']]
df.head()
(index) | pitch_time | inning | pitcher_name | hitter_name | pitch_type | px | pz | pitch_name | start_speed | end_speed | type_confidence |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2013-10-01 20:07:43 -0400 | 1 | Francisco Liriano | Shin-Soo Choo | B | 0.628 | 1.547 | Fastball | 93.2 | 85.3 | 0.894 |
1 | 2013-10-01 20:07:57 -0400 | 1 | Francisco Liriano | Shin-Soo Choo | S | 0.545 | 3.069 | Fastball | 93.4 | 85.6 | 0.895 |
2 | 2013-10-01 20:08:12 -0400 | 1 | Francisco Liriano | Shin-Soo Choo | S | 0.120 | 1.826 | Slider | 89.1 | 82.8 | 0.931 |
3 | 2013-10-01 20:08:31 -0400 | 1 | Francisco Liriano | Shin-Soo Choo | S | -0.229 | 1.667 | Slider | 90.0 | 83.3 | 0.926 |
4 | 2013-10-01 20:09:09 -0400 | 1 | Francisco Liriano | Ryan Ludwick | B | -1.917 | 0.438 | Slider | 87.7 | 81.6 | 0.915 |
5 rows × 11 columns
px and pz are the coordinates of a pitch as it crosses home plate. Let's plot these and see if our data makes sense.
ggplot(df, aes(x='px', y='pz')) + geom_point()
<ggplot: (272839901)>
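A quick numeric range check can back up what the scatter plot shows. The sketch below uses only the five rows printed in the table above, hard-coded as (px, pz) tuples. The names `sample` and `coordinate_ranges` are made up for illustration and are not part of the notebook. On the full DataFrame, `df[['px', 'pz']].describe()` would give a comparable summary.

```python
# (px, pz) pairs copied from the five rows of df.head() shown above.
sample = [(0.628, 1.547), (0.545, 3.069), (0.120, 1.826),
          (-0.229, 1.667), (-1.917, 0.438)]


def coordinate_ranges(points):
    """Return ((min_x, max_x), (min_z, max_z)) for a list of (x, z) pairs."""
    xs = [x for x, _ in points]
    zs = [z for _, z in points]
    return (min(xs), max(xs)), (min(zs), max(zs))


px_range, pz_range = coordinate_ranges(sample)
print("px range:", px_range)
print("pz range:", pz_range)
```

If either range looked wildly off (say, a pz of several hundred), that would be a hint to clean the data before plotting.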
What about the pitch speed?
ggplot(aes(x='start_speed', y='end_speed'), data=df) + geom_point()
<ggplot: (276734237)>
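Another way to read the start_speed vs. end_speed relationship is to look at the difference between the two columns. This is a minimal sketch on the five rows shown above, assuming start_speed is measured earlier in the pitch's flight than end_speed (the notebook does not say so explicitly). The names `speeds` and `drops` are illustrative.

```python
# (start_speed, end_speed) pairs copied from the five rows shown above.
speeds = [(93.2, 85.3), (93.4, 85.6), (89.1, 82.8), (90.0, 83.3), (87.7, 81.6)]

# Speed lost per pitch, rounded to one decimal place to hide float noise.
drops = [round(start - end, 1) for start, end in speeds]
print(drops)

# Sanity check: every pitch in the sample ends slower than it starts.
all_slower = all(end < start for start, end in speeds)
print(all_slower)
```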
geom_histogram
A better way to inspect pitch speed might be to look at the distribution of the data.
Does this make sense? Let's consult the source of all true wisdom: https://answers.yahoo.com/question/index?qid=20080126131031AAwVCNk
ggplot(df, aes(x='start_speed')) + geom_histogram()
stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
<ggplot: (285457305)>
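The stat_bin message says the bin width defaulted to range/30 and can be changed with binwidth. To make that concrete, here is a rough pure-Python sketch of the same idea on the five start_speed values shown earlier. It illustrates the arithmetic only and is not ggplot's actual binning code. `default_binwidth` and `bin_counts` are hypothetical helper names.

```python
import math


def default_binwidth(values, bins=30):
    """Bin width when the data range is split into `bins` equal-width bins."""
    return (max(values) - min(values)) / float(bins)


def bin_counts(values, binwidth):
    """Count values per bin; bins start at min(values) and are binwidth wide."""
    low = min(values)
    # At least one bin, even if all values are identical.
    n_bins = int(math.ceil((max(values) - low) / binwidth)) or 1
    counts = [0] * n_bins
    for v in values:
        # Clamp so the maximum value lands in the last bin.
        idx = min(int((v - low) / binwidth), n_bins - 1)
        counts[idx] += 1
    return counts


start_speeds = [93.2, 93.4, 89.1, 90.0, 87.7]  # from the rows shown above
print(default_binwidth(start_speeds))           # range / 30
print(bin_counts(start_speeds, binwidth=1.0))   # coarser, 1 mph-wide bins
```

A wider bin gives a smoother but coarser picture, while a narrower one shows more detail and more noise.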
What about for specific pitches?
for name, frame in df.groupby("pitch_name"):
    # Draw one histogram per pitch name; print() forces each plot to render.
    print(ggplot(aes(x='start_speed'), data=frame) + geom_histogram() + ggtitle("Distribution of " + str(name)))
<ggplot: (288278377)>
<ggplot: (285224941)>
<ggplot: (288277409)>
<ggplot: (289871437)>
<ggplot: (293071473)>
<ggplot: (292574941)>
<ggplot: (289870441)>
<ggplot: (291709497)>
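The loop above relies on groupby handing back one (name, frame) pair per pitch name. As a rough picture of that split step, here is a stdlib-only sketch on the five rows shown earlier. `rows` and `groups` are illustrative names, and the sample has only two pitch names, while the full data produced eight plots.

```python
from collections import defaultdict

# (pitch_name, start_speed) pairs copied from the five rows shown above.
rows = [("Fastball", 93.2), ("Fastball", 93.4),
        ("Slider", 89.1), ("Slider", 90.0), ("Slider", 87.7)]

# Split: collect the speeds belonging to each pitch name.
groups = defaultdict(list)
for pitch_name, start_speed in rows:
    groups[pitch_name].append(start_speed)

# Apply: summarize each group (count and mean speed).
for name in sorted(groups):
    speeds = groups[name]
    print(name, len(speeds), round(sum(speeds) / len(speeds), 2))
```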
That was helpful, but I'm sort of on plot overload now.
facet_wrap FTW
Use the trellis.
"Trellis Graphics is a family of techniques for viewing complex, multi-variable data sets." Read more here.
ggplot(aes(x='start_speed'), data=df) +\
geom_histogram() +\
facet_wrap('pitch_name')
/usr/local/Cellar/python/2.7.5/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/ggplot-0.5.9-py2.7.egg/ggplot/ggplot.py:198: RuntimeWarning: Facetting is currently not supported with geom_bar. See https://github.com/yhat/ggplot/issues/196 for more information
  warnings.warn(msg, RuntimeWarning)
<ggplot: (292575897)>
from IPython.display import YouTubeVideo
YouTubeVideo("ikLlRT2j7EQ")
OK, so what about balls and strikes?
ggplot(aes(x='pitch_type'), data=df) + geom_bar()
<ggplot: (275730281)>
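A bar chart of a categorical column is a picture of value counts. Here is a small sketch of the counting step using the pitch_type values from the five rows shown earlier. `pitch_types` and `counts` are illustrative names, and the text bars only hint at what geom_bar draws.

```python
from collections import Counter

# pitch_type values copied from the five rows shown above.
pitch_types = ["B", "S", "S", "S", "B"]

# Tally how often each category occurs.
counts = Counter(pitch_types)

# Crude text bar chart: one '#' per observation.
for label, n in sorted(counts.items()):
    print(label, "#" * n)
```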
facet_grid
(facet_wrap's brother)
ggplot(aes(x='start_speed'), data=df) +\
geom_histogram() +\
facet_grid('pitch_type')
<ggplot: (276653609)>
ggplot(aes(x='start_speed'), data=df) +\
geom_histogram() +\
facet_grid('pitch_name', 'pitch_type', scales="free")
<ggplot: (271338625)>
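With two faceting variables, each panel holds the rows for one (pitch_name, pitch_type) combination. The sketch below counts rows per combination for the five sample rows shown earlier, which gives a sense of which panels would be sparse or empty. It assumes the first facet variable maps to grid rows and the second to columns, which the notebook does not state. `pairs` and `panel_counts` are illustrative names.

```python
from collections import Counter

# (pitch_name, pitch_type) pairs copied from the five rows shown above.
pairs = [("Fastball", "B"), ("Fastball", "S"),
         ("Slider", "S"), ("Slider", "S"), ("Slider", "B")]

# One counter entry per grid cell that actually has data.
panel_counts = Counter(pairs)

row_labels = sorted({name for name, _ in pairs})
col_labels = sorted({ptype for _, ptype in pairs})

# Print the grid: one line per pitch_name, one count per pitch_type.
print("columns:", col_labels)
for name in row_labels:
    print(name, [panel_counts[(name, ptype)] for ptype in col_labels])
```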
geom_density
Similar to geom_histogram, but the y-axis shows relative density rather than raw counts.
ggplot(df, aes(x='start_speed')) +\
geom_density()
<ggplot: (275662825)>
ggplot(df, aes(x='start_speed', color='pitch_name')) +\
geom_density()
<ggplot: (278182857)>
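A density curve like this is typically a kernel density estimate: each observation contributes a small bump, and the bumps are scaled so the total area under the curve is 1. That scaling is what makes the y-axis relative. Below is a minimal hand-rolled Gaussian version on the five start_speed values shown earlier. The fixed bandwidth of 1.0 is an arbitrary assumption, and this is not necessarily how geom_density computes its curve.

```python
import math


def gaussian_kde(values, bandwidth):
    """Return a function estimating the density of `values` at a point x."""
    n = len(values)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Sum one Gaussian bump per observation, then normalize.
        return norm * sum(
            math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values
        )

    return density


start_speeds = [93.2, 93.4, 89.1, 90.0, 87.7]  # from the rows shown above
density = gaussian_kde(start_speeds, bandwidth=1.0)

# Approximate the area under the curve with a Riemann sum from 80 to 100.
step = 0.01
grid = [80 + i * step for i in range(2001)]
area = sum(density(x) * step for x in grid)
print(round(area, 3))
```

Because every curve integrates to about 1, curves for different pitch names can be compared on the same axes even when one pitch is thrown far more often than another.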