Pitches and Pitchers¶

In [1]:

import numpy as np
import pandas as pd
%matplotlib inline
from ggplot import *

In [245]:

# Inspect the data: pitches tracked over a 2-month stretch of the 2013
# MLB season.
baseball = pd.read_csv('./data/baseball-pitches-clean.csv')
print baseball.shape[0], " pitches were tracked."
baseball.head()

133601  pitches were tracked.

Out[245]:

	pitch_time	inning	top_or_bottom	pitcher_name	hitter_name	pitch_type	x	y	start_speed	end_speed	sz_top	sz_bottom	pfx_x	pfx_z	px	pz	x0	y0	ax	ay
0	2013-10-01 20:07:43 -0400	1	Top	Francisco Liriano	Shin-Soo Choo	B	78.97	164.92	93.2	85.3	3.10	1.53	11.01	6.47	0.628	1.547	1.757	50	5.472	-6.862	...
1	2013-10-01 20:07:57 -0400	1	Top	Francisco Liriano	Shin-Soo Choo	S	82.40	131.24	93.4	85.6	3.06	1.56	10.14	7.99	0.545	3.069	1.711	50	5.650	-6.693	...
2	2013-10-01 20:08:12 -0400	1	Top	Francisco Liriano	Shin-Soo Choo	S	96.14	161.47	89.1	82.8	3.25	1.53	3.11	4.95	0.120	1.826	1.559	50	5.792	-4.763	...
3	2013-10-01 20:08:31 -0400	1	Top	Francisco Liriano	Shin-Soo Choo	S	106.44	163.19	90.0	83.3	3.25	1.53	-0.38	2.15	-0.229	1.667	1.172	50	5.832	-3.519	...
4	2013-10-01 20:09:09 -0400	1	Top	Francisco Liriano	Ryan Ludwick	B	163.95	194.28	87.7	81.6	3.62	1.78	1.62	1.93	-1.917	0.438	0.194	50	5.578	-5.886	...

5 rows × 36 columns

In [4]:

baseball.columns

Out[4]:

Index([u'pitch_time', u'inning', u'top_or_bottom', u'pitcher_name', u'hitter_name', u'pitch_type', u'x', u'y', u'start_speed', u'end_speed', u'sz_top', u'sz_bottom', u'pfx_x', u'pfx_z', u'px', u'pz', u'x0', u'y0', u'ax', u'ay', u'az', u'z0', u'vx0', u'vy0', u'vz0', u'break_y', u'break_angle', u'break_length', u'pitch_name', u'type_confidence', u'zone', u'nasty', u'spin_dir', u'spin_rate', u'comments', u'unk'], dtype='object')

In [7]:

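# What values does pitch_type take? (These look like pitch outcomes,
# i.e. ball / strike / in play; the actual pitch types are in pitch_name.)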
baseball.pitch_type.unique()

Out[7]:

array(['B', 'S', 'X'], dtype=object)

In [8]:

baseball.pitch_name.unique()

Out[8]:

array(['Fastball', 'Slider', 'Changeup', 'Cut fastball', 'Curveball',
       'Fastball (sinker|split-fingered)', 'Knuckleball', 'Eephus'], dtype=object)

In [16]:

# How many pitchers are in the dataset?
len(baseball.pitcher_name.unique())

Out[16]:

In [23]:

baseball.describe()[['start_speed', 'end_speed']]

Out[23]:

	start_speed	end_speed
count	133601.000000	133601.000000
mean	88.010358	81.342203
std	5.959400	5.320716
min	49.400000	45.500000
25%	83.900000	77.900000
50%	89.500000	82.600000
75%	92.600000	85.300000
max	103.400000	95.500000

8 rows × 2 columns

A start speed of 49.4 mph seems very, very low. Let's investigate this further.

In [31]:

slowest_pitch = baseball[baseball['start_speed'] == baseball['start_speed'].min(0)]
slowest_pitch.pitcher_name

Out[31]:

51404    Zack Wheeler
Name: pitcher_name, dtype: object

In [49]:

zack_wheeler = baseball[baseball['pitcher_name'] == 'Zack Wheeler']
less_than_70 = zack_wheeler[zack_wheeler['start_speed'] < 70]
print 'Number of pitches under 70 mph =', len(less_than_70)
print 'Mean of Zack Wheeler\'s pitch speeds', round(zack_wheeler['start_speed'].mean(),2), 'MPH.'

Number of pitches under 70 mph = 1
Mean of Zack Wheeler's pitch speeds 90.22 MPH.

OK, so from what we see above, that 49 MPH pitch is almost certainly an error. A guy who throws 90 MPH on average, and has no other pitch under 70 MPH, isn't going to throw a 49 MPH pitch.
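
The same "compare a pitch to the pitcher's own norm" idea can be applied to everyone at once. Here's a minimal sketch on made-up data (the pitchers "A" and "B", the `suspect` column and the cutoff of 10 are all assumptions for illustration, not something from the dataset):

```python
import pandas as pd

# Made-up speeds for two hypothetical pitchers; "A" has one suspicious reading.
pitches = pd.DataFrame({
    "pitcher_name": ["A"] * 6 + ["B"] * 6,
    "start_speed": [90.1, 91.0, 89.5, 90.4, 49.4, 90.8,
                    76.0, 77.2, 75.5, 78.1, 76.6, 77.0],
})

# Each pitcher's own median and typical spread (median absolute deviation),
# broadcast back onto every row with transform().
grouped = pitches.groupby("pitcher_name")["start_speed"]
median = grouped.transform("median")
mad = grouped.transform(lambda s: (s - s.median()).abs().median())

# Flag pitches far below the pitcher's usual speed. The cutoff of 10 MADs
# is an arbitrary choice for this sketch.
pitches["suspect"] = (median - pitches["start_speed"]) > 10 * mad
suspects = pitches[pitches["suspect"]]
```

Judging each pitch against its own pitcher means a slow knuckleballer wouldn't get flagged just for being slower than the league.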

In [69]:

print len(baseball[baseball['start_speed'] < 60]), 'pitches are under 60 mph'

# R.A. Dickey is a knuckleballer, one of the only ones in the entire league
dickey = baseball[baseball['pitcher_name'] == 'R.A. Dickey']
print 'R. A. Dickey has ', len(dickey[dickey['start_speed'] < 60]), 'under 60 mph'

7 pitches are under 60 mph
R. A. Dickey has  0 under 60 mph

If Dickey, who's a knuckleballer, isn't throwing anything under 60 MPH, then it's pretty safe to say these pitches under 60 are outliers.

In [75]:

over_60 = baseball['start_speed'] >= 60
baseball = baseball[over_60]

Now that we've cleaned up the dataset a little, let's start visualizing it.

Visualizations¶

Before we plot, let's simplify the dataset a bit more.

In [142]:

baseball = baseball[['pitch_time', 'inning', 'pitcher_name', 'hitter_name', 'pitch_type', 
         'px', 'pz', 'pitch_name', 'start_speed', 'end_speed', 'type_confidence']]
baseball.head()

Out[142]:

	pitch_time	inning	pitcher_name	hitter_name	pitch_type	px	pz	pitch_name	start_speed	end_speed	type_confidence
0	2013-10-01 20:07:43 -0400	1	Francisco Liriano	Shin-Soo Choo	B	0.628	1.547	Fastball	93.2	85.3	0.894
1	2013-10-01 20:07:57 -0400	1	Francisco Liriano	Shin-Soo Choo	S	0.545	3.069	Fastball	93.4	85.6	0.895
2	2013-10-01 20:08:12 -0400	1	Francisco Liriano	Shin-Soo Choo	S	0.120	1.826	Slider	89.1	82.8	0.931
3	2013-10-01 20:08:31 -0400	1	Francisco Liriano	Shin-Soo Choo	S	-0.229	1.667	Slider	90.0	83.3	0.926
4	2013-10-01 20:09:09 -0400	1	Francisco Liriano	Ryan Ludwick	B	-1.917	0.438	Slider	87.7	81.6	0.915

5 rows × 11 columns

In [102]:

p = ggplot(aes(x='px', y='pz', color='pitch_name'), data=baseball) + geom_jitter()
p

Out[102]:

<ggplot: (363066169)>

That's a bit hard to see, so let's do a facet wrap.

In [87]:

p = ggplot(aes(x='px', y='pz'), data=baseball) + geom_point(color='blue') + facet_wrap('pitch_name')
p

Out[87]:

<ggplot: (278438693)>

Some Observations

Knuckleballs look to have the most variance. This isn't that surprising, since a knuckleball is thrown with almost no spin, which makes its movement erratic.
Changeups appear to be located mostly in the bottom half of the zone. This makes intuitive sense: a changeup is meant to look exactly like a fastball out of the hand but arrives slower, which confuses the hitter.
Because the changeup is on the same trajectory as a fastball but slower, gravity has more time to act on it, so the pitch ends up lower in the strike zone.
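
The gravity argument in the last observation can be sanity-checked with some back-of-the-envelope arithmetic. This is only a rough sketch: it assumes a constant horizontal speed, ignores drag and spin entirely, and the 55 ft release distance and the 93 / 85 mph speeds are illustrative assumptions rather than values taken from the dataset.

```python
# Rough drop under gravity for two pitch speeds, ignoring drag and spin.
G_FT_PER_S2 = 32.174                 # gravitational acceleration, ft/s^2
DISTANCE_FT = 55.0                   # assumed release-to-plate distance
MPH_TO_FT_PER_S = 5280.0 / 3600.0    # unit conversion


def gravity_drop_ft(speed_mph):
    """Vertical drop (ft) over DISTANCE_FT at a constant horizontal speed."""
    flight_time = DISTANCE_FT / (speed_mph * MPH_TO_FT_PER_S)
    return 0.5 * G_FT_PER_S2 * flight_time ** 2


fastball_drop = gravity_drop_ft(93.0)   # illustrative fastball speed
changeup_drop = gravity_drop_ft(85.0)   # illustrative changeup speed
extra_drop_inches = (changeup_drop - fastball_drop) * 12
print(round(extra_drop_inches, 1))
```

Under these simplifications the slower pitch drops roughly six inches more, which is in line with changeups clustering lower in the zone.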

Ok so I watch baseball and I've literally never heard of the Eephus pitch. From the graph it looks like it's really unpredictable, but also that there's not much data on it. Let's take a look at the actual counts.

In [89]:

baseball['pitch_name'].value_counts()

Out[89]:

Fastball                            68227
Slider                              20714
Curveball                           13798
Changeup                            12900
Fastball (sinker|split-fingered)    10267
Cut fastball                         7182
Knuckleball                           447
Eephus                                 59
dtype: int64

In [98]:

# Show in percentages
baseball['pitch_name'].value_counts() / len(baseball) * 100

Out[98]:

Fastball                            51.070407
Slider                              15.505187
Curveball                           10.328308
Changeup                             9.656122
Fastball (sinker|split-fingered)     7.685225
Cut fastball                         5.375990
Knuckleball                          0.334596
Eephus                               0.044164
dtype: float64
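
As an aside, pandas can produce these shares directly: `value_counts` takes a `normalize=True` argument that returns fractions instead of counts. A small sketch on a made-up Series (the `pitch_names` data here is hypothetical):

```python
import pandas as pd

# Hypothetical pitch names, just to show the call.
pitch_names = pd.Series(["Fastball", "Fastball", "Slider", "Fastball", "Changeup"])

# normalize=True gives fractions that sum to 1; multiply by 100 for percentages.
shares = pitch_names.value_counts(normalize=True) * 100
```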

There are only 59 Eephus pitches thrown in our entire dataset! Compare that with the 447 knuckleballs, which are a rarity in themselves. So what is an Eephus pitch then?

In [97]:

from IPython.display import YouTubeVideo
YouTubeVideo('uW0V6OsxDBo', 600, 338)

Out[97]:

Let's check out the distribution of pitch speeds for each pitch type.

In [103]:

p = ggplot(aes(x='start_speed'), data=baseball) + geom_histogram() + facet_wrap('pitch_name')
p

Out[103]:

<ggplot: (363066165)>

This rules out my suspicion that the Eephus pitch is similar to the Knuckleball. It's surprising that the knuckleball distribution is centered where it is, in the high 70s. Traditionally, knuckleballs are high-60s pitches. This might be due to R.A. Dickey being the dominant knuckleball user in today's game; his are known to be faster than most.

In [120]:

# Let's see what share of the knuckleballs Dickey throws
knuckles = baseball[baseball['pitch_name'] == 'Knuckleball']
dickey = knuckles[knuckles['pitcher_name'] == 'R.A. Dickey']
# Multiply before dividing: Python 2 integer division would otherwise
# truncate any share below 100% down to 0
print 'Percentage of Knuckleballs belonging to Dickey', (len(dickey) * 100 / len(knuckles))

Percentage of Knuckleballs belonging to Dickey 100

Well, it turns out all the knuckleballs in our dataset are thrown by R.A. Dickey! That confirms the suspicion about the knuckleball speeds.

We saw previously that it was pretty difficult to gain much insight into pitch types aside from general differences. This might be more meaningful if we analyzed a specific pitcher. Let's do Yu Darvish.

Darvish is known for having a wide array of pitches at his disposal and is one of the best current pitchers in baseball so he's a solid choice.

Yu Darvish¶

In [129]:

# Let's get darvish data
darvish = baseball[baseball['pitcher_name'] == 'Yu Darvish']
darvish['pitch_name'].value_counts() / len(darvish) * 100

Out[129]:

Fastball                            36.311239
Slider                              35.446686
Cut fastball                        22.334294
Fastball (sinker|split-fingered)     3.314121
Curveball                            2.593660
dtype: float64

Darvish's pitch mix is drastically different from the dataset average; his approach is far more balanced. Over 50% of the pitches in the dataset are fastballs, versus about 36% for Darvish.
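
To put numbers on that difference, the two percentage tables can be subtracted from each other. A minimal sketch, with the percentages copied by hand from the two outputs above (the names `league`, `darvish_mix` and `diff` are just illustrative):

```python
import pandas as pd

# Percentages copied from the two value_counts outputs above.
league = pd.Series({
    "Fastball": 51.070407, "Slider": 15.505187, "Curveball": 10.328308,
    "Changeup": 9.656122, "Fastball (sinker|split-fingered)": 7.685225,
    "Cut fastball": 5.375990, "Knuckleball": 0.334596, "Eephus": 0.044164,
})
darvish_mix = pd.Series({
    "Fastball": 36.311239, "Slider": 35.446686, "Cut fastball": 22.334294,
    "Fastball (sinker|split-fingered)": 3.314121, "Curveball": 2.593660,
})

# Pitches with no Darvish entry in this sample count as 0%.
# Result is in percentage points, sorted from most under-used to most over-used.
diff = darvish_mix.sub(league, fill_value=0).sort_values()
```

Sorted this way, the plain fastball sits at the bottom (about 15 points below the dataset average) and the slider at the top (about 20 points above).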

In [145]:

p = ggplot(aes(x='px', y='pz', color='pitch_name'), data=darvish) + geom_jitter(alpha=0.3)
p = p + ggtitle('Darvish Pitch Spread') + stat_smooth(method='lm')
p

Out[145]:

<ggplot: (278368393)>

It looks like Darvish's pitches all land in similar locations. Looking further at the smoothing lines, we can see that the lines for his top 3 pitches (~94% of his pitches) are very similar.

Looking at the data it's easy to see why Darvish is such a lethal pitcher. In summary, he has a wide array of pitches, and to the hitter they all look pretty much identical.

Does Darvish Slow Down?¶

In [152]:

p = ggplot(aes(x='inning', y='start_speed', color='pitch_name'), data=darvish)
p = p + stat_smooth(method='lm', size=5)
p

Out[152]:

<ggplot: (278453569)>

In [153]:

p = ggplot(aes(x='inning', y='start_speed', color='pitch_name'), data=darvish)
p = p + geom_jitter(alpha=0.3)
p

Out[153]:

<ggplot: (278377641)>

Apart from his slider there's no drastic change in pitch speeds. Further, if we take a look at his top 3 pitches (fastball, cut fastball and slider), we see that the distribution of pitch speeds stays consistent throughout the entire game.

If a hitter was hoping that Darvish would become weaker over the course of a game, it looks like they're out of luck.

In [159]:

baseball['pitcher_name'].value_counts()

Out[159]:

David Price          762
Justin Verlander     755
Chris Tillman        727
Andy Pettitte        718
Ubaldo Jimenez       698
Yu Darvish           694
Jason Vargas         691
Wade Miley           677
Jon Lester           674
J.A. Happ            672
Adam Wainwright      669
Garrett Richards     666
C.J. Wilson          653
Francisco Liriano    653
Gio Gonzalez         649
...
Donnie Joseph     32
Darin Downs       32
Clay Rapada       31
Michael Stutes    30
Scott Rice        27
Tommy Layne       25
Robert Carson     24
Brett Cecil       20
Jeurys Familia    19
Cory Burns        16
Jeff Beliveau     11
Jeremy Affeldt    11
Mike Zagurski     10
Michael Bowden     5
Sam Fuld           5
Length: 513, dtype: int64

How does Verlander compare to Darvish?¶

In [160]:

verlander = baseball[baseball['pitcher_name'] == 'Justin Verlander']
verlander.head()

Out[160]:

	pitch_time	inning	pitcher_name	hitter_name	pitch_type	px	pz	pitch_name	start_speed	end_speed	type_confidence
871	2013-09-29 13:16:29 -0400	1	Justin Verlander	Juan Pierre	B	-1.422	2.909	Fastball	91.8	83.5	2
872	2013-09-29 13:16:43 -0400	1	Justin Verlander	Juan Pierre	S	-0.868	2.379	Fastball	91.0	83.1	2
873	2013-09-29 13:17:06 -0400	1	Justin Verlander	Juan Pierre	X	0.033	1.891	Fastball	91.5	82.8	2
874	2013-09-29 13:17:51 -0400	1	Justin Verlander	Ed Lucas	S	0.670	3.067	Fastball	91.0	82.9	2
875	2013-09-29 13:18:06 -0400	1	Justin Verlander	Ed Lucas	S	0.702	1.819	Fastball	90.6	82.9	2

5 rows × 11 columns

In [161]:

verlander['pitch_name'].value_counts() / len(verlander) * 100

Out[161]:

Fastball     55.761589
Changeup     16.821192
Curveball    14.834437
Slider       12.582781
dtype: float64

Already we can see Verlander is a drastically different pitcher than Darvish. Fastballs make up about 56% of his routine, while fastballs made up 36% of Darvish's. Verlander throws his 3 other pitches in roughly equal amounts.

It's interesting to note that 94% of Darvish's routine was made up of fastball, cut fastball and slider. Verlander's 2nd and 3rd pitches (changeup and curveball) account for ~32% of his pitches, while Darvish's 4th and 5th (sinker/split-finger and curveball) account for only ~6%.

In [163]:

p = ggplot(aes(x='px', y='pz', color='pitch_name'), data=verlander) + geom_jitter(alpha=0.3)
p = p + ggtitle('Verlander Pitch Spread') + stat_smooth(method='lm')
p

Out[163]:

<ggplot: (405650557)>

Verlander's distribution is more predictable than Darvish's. We see that fastballs end up in the upper portion of the strike zone while the other 3 pitches end up in the lower portion.

The changeup and curveball are similar in terms of their distribution, so it would be difficult for a hitter to tell them apart by location.

All 3 secondary pitches follow the trend that the farther right in the strikezone you go the lower the pitch will likely be.

Does Verlander slow down?¶

In [164]:

p = ggplot(aes(x='inning', y='start_speed', color='pitch_name'), data=verlander)
p = p + stat_smooth(method='lm', size=5)
p

Out[164]:

<ggplot: (281360081)>

In [165]:

p = ggplot(aes(x='inning', y='start_speed', color='pitch_name'), data=verlander)
p = p + geom_jitter(alpha=0.3)
p

Out[165]:

<ggplot: (278375433)>

Verlander's fastball becomes faster over the course of the game and his changeup slower. We can also see that Verlander isn't as consistent with his pitch speeds as Darvish. He's more consistent during the middle innings.

This makes sense intuitively since in the first couple of innings the pitcher is finding their "groove" and in the latter innings fatigue starts to set in.

I found it weird that Verlander's fastball gets faster over the course of the game. So I decided to compare it to the norm.

Do Pitcher Pitch Speeds decrease over time?¶

In [168]:

p = ggplot(aes(x='inning', y='start_speed', color='pitch_name'), data=baseball)
p = p + stat_smooth(method='lm', size=5) + ggtitle('Pitch Speed vs Innings')
p

Out[168]:

<ggplot: (369987989)>

In [169]:

p = ggplot(aes(x='inning', y='start_speed'), data=baseball)
p = p + stat_smooth(method='lm', size=5) + ggtitle('Pitch speed vs Innings')
p

Out[169]:

<ggplot: (404579133)>

OK, this is super weird, at least to me. Shouldn't the speeds get slower as the game progresses?

A problem with the current approach is that it doesn't take pitching changes into account. Pitch count would probably be a much better way to measure this.

In [186]:

baseball['date'] = baseball['pitch_time'].str.slice(0,10)
baseball['pitch_count'] = 1
baseball['pitch_count'] = baseball.groupby(['pitcher_name', 'date'])['pitch_count'].cumsum()
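
One caveat: the running total above only equals the real pitch count if each pitcher's rows are already in chronological order. Here's a hedged sketch of a variant that sorts first and uses the GroupBy `cumcount` method. The data is made up, and it assumes every timestamp shares the same fixed-width format and UTC offset, so that sorting the strings sorts them by time.

```python
import pandas as pd

# Tiny made-up sample, deliberately out of order.
pitches = pd.DataFrame({
    "pitcher_name": ["A", "B", "A", "A", "B"],
    "pitch_time": ["2013-09-01 19:05:30 -0400", "2013-09-01 19:20:00 -0400",
                   "2013-09-01 19:05:00 -0400", "2013-09-06 19:04:00 -0400",
                   "2013-09-01 19:20:30 -0400"],
})
pitches["date"] = pitches["pitch_time"].str.slice(0, 10)

# Sort chronologically first so the running count follows game order.
pitches = pitches.sort_values("pitch_time")

# cumcount() numbers the rows within each group from 0, so add 1.
pitches["pitch_count"] = pitches.groupby(["pitcher_name", "date"]).cumcount() + 1
```

Note that grouping on the date string would still split a game that runs past midnight into two outings; that limitation applies to both versions.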

Let's try it again with the pitch counts.

In [187]:

p = ggplot(aes(x='pitch_count', y='start_speed', color='pitch_name'), data=baseball)
p = p + stat_smooth(method='lm', size=5) + ggtitle('Pitch Speed vs Pitch Count')
p

Out[187]:

<ggplot: (369486053)>

In [188]:

p = ggplot(aes(x='pitch_count', y='start_speed'), data=baseball)
p = p + stat_smooth(method='lm', size=5) + ggtitle('Pitch Speed vs Pitch Count')
p

Out[188]:

<ggplot: (377576197)>

Now that makes more sense. Here we can see that pitch count has a clear negative correlation with pitch speed.
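
The plot shows the direction of the relationship; a correlation coefficient would put a number on it. A minimal sketch on synthetic data (the 0.02 mph-per-pitch decline and the noise level are invented for illustration, not measured from the dataset):

```python
import numpy as np
import pandas as pd

# Synthetic pitches: speed falls 0.02 mph per pitch, plus random noise.
rng = np.random.RandomState(0)
pitch_count = np.arange(1, 101)
start_speed = 92.0 - 0.02 * pitch_count + rng.normal(0, 0.5, size=100)
sample = pd.DataFrame({"pitch_count": pitch_count, "start_speed": start_speed})

# Pearson correlation between the two columns; negative means speed
# tends to fall as the pitch count rises.
r = sample["pitch_count"].corr(sample["start_speed"])
```

On the real data the equivalent call would be `baseball['pitch_count'].corr(baseball['start_speed'])`.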

Was it invalid to use Innings for Verlander and Darvish?¶

Does it make a difference to measure pitch speed vs innings instead of pitch speed vs pitch count for starters like Verlander and Darvish?

In [196]:

darvish = baseball[baseball['pitcher_name'] == 'Yu Darvish']
p = ggplot(aes(x='pitch_count', y='start_speed', color='pitch_name'), data=darvish)
p = p + stat_smooth(se=False, size=5) + geom_jitter(alpha=0.3) 
p = p + ggtitle('Darvish: Pitch Speed vs Pitch Count')
p

Out[196]:

<ggplot: (375613393)>

In [197]:

verlander = baseball[baseball['pitcher_name'] == 'Justin Verlander']
p = ggplot(aes(x='pitch_count', y='start_speed', color='pitch_name'), data=verlander)
p = p + stat_smooth(se=False, size=5) + geom_jitter(alpha=0.3) 
p = p + ggtitle('Verlander: Pitch Speed vs Pitch Count')
p

Out[197]:

<ggplot: (278382757)>

It turns out the answer is no, and in general that's probably true for all starting pitchers. This also supports the idea that the problem with the original model was that innings don't take the bullpen (changing pitchers) into account, which gave our initial analysis the weird result.
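
To see how the bullpen can flip the trend, here's a toy example with made-up numbers: a hypothetical starter who loses a little speed every inning, followed by a hypothetical reliever who throws harder but also slows down. The `slope` helper is written just for this sketch.

```python
import pandas as pd

# Made-up game: a starter who tires, then a harder-throwing reliever.
game = pd.DataFrame({
    "pitcher_name": ["Starter"] * 6 + ["Reliever"] * 3,
    "inning":       [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "start_speed":  [92.0, 91.8, 91.6, 91.4, 91.2, 91.0, 96.0, 95.8, 95.6],
})


def slope(df):
    """Least-squares slope of start_speed against inning."""
    x = df["inning"] - df["inning"].mean()
    y = df["start_speed"] - df["start_speed"].mean()
    return (x * y).sum() / (x ** 2).sum()


# Trend across the whole game vs. the trend for each pitcher separately.
pooled_slope = slope(game)
per_pitcher = {name: slope(group) for name, group in game.groupby("pitcher_name")}
```

Each pitcher's own slope is negative, yet the pooled slope across innings comes out positive, because the late innings belong to the faster arm.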

The really cool thing about this is that Verlander's fastball indeed gets faster over the course of the game!

In [237]:

# Let's see if anyone else's fastball speed increases over time!
fastballs = baseball[baseball['pitch_name'] == 'Fastball']
# Keep only the 10 pitchers who threw the most fastballs
top10 = set(fastballs.pitcher_name.value_counts().index[:10])
fastballs = fastballs[fastballs.pitcher_name.isin(top10)]

In [238]:

set(fastballs.pitcher_name)

Out[238]:

{'Bartolo Colon',
 'David Price',
 'Gio Gonzalez',
 'Henderson Alvarez',
 'J.A. Happ',
 'Justin Verlander',
 'Lance Lynn',
 'Liam Hendriks',
 'Nathan Eovaldi',
 'Wade Miley'}

In [240]:

p = ggplot(aes(x='pitch_count',y='start_speed'), data=fastballs) + stat_smooth(se=False, size=3)
p = p + facet_wrap('pitcher_name')
p

Out[240]:

<ggplot: (375853813)>

Pretty cool. These are the top 10 pitchers by number of fastballs thrown in our dataset, and we see a bunch of different curves.

7/10 pitchers ultimately lose speed as the game progresses, although some, like Wade Miley, gain speed at the start and then begin to slow down about halfway through their pitch count.

J.A. Happ has a really weird curve. His pitch speed starts to decrease right away and is at its lowest about halfway through the game, but then it increases quite drastically over the second half of his pitch count.

The only other pitcher whose speed definitely increases over time besides Verlander is Bartolo Colon. Colon is a notoriously fastball-heavy pitcher.
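
Reading trends off ten small panels is a bit subjective, so one option is to fit a straight line per pitcher and look at the sign of the slope. A sketch on made-up data for two hypothetical pitchers follows. Keep in mind that a straight line is a cruder summary than the smoothed curves in the plot, so a U-shaped pitcher like Happ could come out close to flat.

```python
import numpy as np
import pandas as pd

# Made-up fastballs for two hypothetical pitchers.
fb = pd.DataFrame({
    "pitcher_name": ["A"] * 5 + ["B"] * 5,
    "pitch_count":  [1, 25, 50, 75, 100] * 2,
    "start_speed":  [93.0, 92.6, 92.1, 91.9, 91.4,
                     91.0, 91.3, 91.8, 92.0, 92.5],
})

# Fit a degree-1 polynomial per pitcher; polyfit returns [slope, intercept].
slopes = {
    name: np.polyfit(group["pitch_count"], group["start_speed"], 1)[0]
    for name, group in fb.groupby("pitcher_name")
}

# Pitchers whose fitted fastball speed rises with pitch count.
speeding_up = sorted(name for name, s in slopes.items() if s > 0)
```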

In [244]:

colon = baseball[baseball.pitcher_name == 'Bartolo Colon']
colon.pitch_name.value_counts() / len(colon) * 100

Out[244]:

Fastball    88.52459
Slider       8.56102
Changeup     2.91439
dtype: float64

Colon uses fastballs 88.5% of the time, about 33 percentage points more than Verlander! Talk about a one-trick pony.

Summary¶

There's no formula for a great pitcher. We saw that Darvish and Verlander have drastically different styles, yet they are 2 of the most dominant pitchers in today's game.
Verlander is a freak who becomes more deadly as the game progresses.
Darvish is crazy consistent with his pitching speeds and locations.
As the pitch count increases, pitch speed decreases, but as the game progresses pitch speed increases ... What? (Blame the bullpen: fresh relievers take over in the later innings.)
Dickey throws literally all the knuckleballs in our dataset.
There's something called an Eephus pitch and it's completely absurd.

That's it for now!