The New York Police Department makes vehicle collision data available as PDFs. The NYPD Crash Data Band-Aid site scrapes this data and kindly makes it available in a structured manner, as CSV and JSON files.
Let's see what we can make of this data.
First, we'll load it. The URL for the CSV file is http://nypd.openscrape.com/data/collisions.csv.gz. (Note: though the extension says CSV, it's actually a TSV file -- TAB-separated values.)
import pandas as pd
# From Pandas 0.12 onwards, you can just use a single line:
# data = pd.read_csv('http://nypd.openscrape.com/data/collisions.csv.gz', compression='gzip', sep='\t')
# But for older versions, we need to download the file first.
import os.path
import urllib
# Let's download it only if it does not already exist. Save ourselves some bandwidth.
if not os.path.exists('collisions.csv.gz'):
    urllib.urlretrieve('http://nypd.openscrape.com/data/collisions.csv.gz', 'collisions.csv.gz')
data = pd.read_csv('collisions.csv.gz', compression='gzip', sep='\t')
data
<class 'pandas.core.frame.DataFrame'>
Int64Index: 230138 entries, 0 to 230137
Data columns (total 68 columns):
borocode                          230138 non-null values
precinct                          230138 non-null values
year                              230138 non-null values
month                             230138 non-null values
lon                               228297 non-null values
lat                               228297 non-null values
street1                           230138 non-null values
street2                           230138 non-null values
collisions                        230138 non-null values
persons_involved                  230138 non-null values
collisions_with_injuries          230138 non-null values
motorists_injured                 230138 non-null values
passengers_injured                230138 non-null values
cyclists_injured                  230138 non-null values
pedestr_injured                   230138 non-null values
total_injured                     230138 non-null values
motorists_killed                  230138 non-null values
passengers_killed                 230138 non-null values
cyclists_killed                   230138 non-null values
pedestr_killed                    230138 non-null values
total_killed                      230138 non-null values
ambulance                         999 non-null values
bicycle                           7167 non-null values
bus                               7846 non-null values
fire_truck                        522 non-null values
large_com_veh_6_or_more_tires     6417 non-null values
livery_vehicle                    5204 non-null values
motorcycle                        2678 non-null values
other                             11458 non-null values
passenger_vehicle                 170780 non-null values
pedicab                           31 non-null values
pick_up_truck                     6539 non-null values
scooter                           199 non-null values
small_com_veh_4_tires             7231 non-null values
sport_utility_station_wagon       79000 non-null values
taxi_vehicle                      14690 non-null values
unknown                           35098 non-null values
van                               15283 non-null values
aggressive_driving_road_rage      1096 non-null values
alcohol_involvement               2255 non-null values
backing_unsafely                  7844 non-null values
cell_phone_hand_held              145 non-null values
driver_inattention_distraction    0 non-null values
driver_inexperience               35676 non-null values
drugs_illegal                     162 non-null values
eating_or_drinking                2 non-null values
err_confusn_ped_bike_other_ped    0 non-null values
failure_to_keep_right             0 non-null values
failure_to_yield_right_of_way     15120 non-null values
fatigued_drowsy                   344 non-null values
fell_asleep                       539 non-null values
following_too_closely             12588 non-null values
illness                           420 non-null values
listening_using_headphones        2 non-null values
lost_consciousness                207 non-null values
other_electronic_device           5704 non-null values
other_uninvolved_vehicle          0 non-null values
outside_car_distraction           680 non-null values
passenger_distraction             865 non-null values
passing_or_lane_usage             8088 non-null values
physical_disability               122 non-null values
prescription_medication           36 non-null values
texting                           2 non-null values
traffic_control_disregarded       4183 non-null values
turning_improperly                6787 non-null values
unsafe_lane_changing              9417 non-null values
unsafe_speed                      0 non-null values
vehicle_vandalism                 14 non-null values
dtypes: float64(49), int64(17), object(2)
That's a lot of columns. To get a sense of what's in here, let's view the first few rows. There are a couple of ways of doing this. One is to explicitly convert to HTML and display it using IPython.display.
from IPython.display import HTML
HTML(data.head().to_html())
borocode | precinct | year | month | lon | lat | street1 | street2 | collisions | persons_involved | collisions_with_injuries | motorists_injured | passengers_injured | cyclists_injured | pedestr_injured | total_injured | motorists_killed | passengers_killed | cyclists_killed | pedestr_killed | total_killed | ambulance | bicycle | bus | fire_truck | large_com_veh_6_or_more_tires | livery_vehicle | motorcycle | other | passenger_vehicle | pedicab | pick_up_truck | scooter | small_com_veh_4_tires | sport_utility_station_wagon | taxi_vehicle | unknown | van | aggressive_driving_road_rage | alcohol_involvement | backing_unsafely | cell_phone_hand_held | driver_inattention_distraction | driver_inexperience | drugs_illegal | eating_or_drinking | err_confusn_ped_bike_other_ped | failure_to_keep_right | failure_to_yield_right_of_way | fatigued_drowsy | fell_asleep | following_too_closely | illness | listening_using_headphones | lost_consciousness | other_electronic_device | other_uninvolved_vehicle | outside_car_distraction | passenger_distraction | passing_or_lane_usage | physical_disability | prescription_medication | texting | traffic_control_disregarded | turning_improperly | unsafe_lane_changing | unsafe_speed | vehicle_vandalism | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3 | 60 | 2011 | 8 | -73.988165 | 40.590100 | 27 AVENUE | HARWAY AVENUE | 1 | 2 | 1 | 2 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1 | NaN | NaN | NaN | NaN | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 3 | 60 | 2011 | 8 | -73.988310 | 40.586507 | 28 AVENUE | CROPSEY AVENUE | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 3 | 60 | 2011 | 8 | -73.985962 | 40.588771 | 28 AVENUE | HARWAY AVENUE | 2 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 5 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 3 | 60 | 2011 | 8 | -73.984093 | 40.590566 | 28 AVENUE | STILLWELL AVENUE | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 3 | 60 | 2011 | 8 | -73.988643 | 40.574282 | 38 STREET | SURF AVENUE | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
The other way is to tell Pandas that we're OK with really long lines, so that it can go ahead and display lots of columns.
pd.set_option('display.max_columns', 100)
pd.set_option('display.line_width', 10000)
data.head()
borocode | precinct | year | month | lon | lat | street1 | street2 | collisions | persons_involved | collisions_with_injuries | motorists_injured | passengers_injured | cyclists_injured | pedestr_injured | total_injured | motorists_killed | passengers_killed | cyclists_killed | pedestr_killed | total_killed | ambulance | bicycle | bus | fire_truck | large_com_veh_6_or_more_tires | livery_vehicle | motorcycle | other | passenger_vehicle | pedicab | pick_up_truck | scooter | small_com_veh_4_tires | sport_utility_station_wagon | taxi_vehicle | unknown | van | aggressive_driving_road_rage | alcohol_involvement | backing_unsafely | cell_phone_hand_held | driver_inattention_distraction | driver_inexperience | drugs_illegal | eating_or_drinking | err_confusn_ped_bike_other_ped | failure_to_keep_right | failure_to_yield_right_of_way | fatigued_drowsy | fell_asleep | following_too_closely | illness | listening_using_headphones | lost_consciousness | other_electronic_device | other_uninvolved_vehicle | outside_car_distraction | passenger_distraction | passing_or_lane_usage | physical_disability | prescription_medication | texting | traffic_control_disregarded | turning_improperly | unsafe_lane_changing | unsafe_speed | vehicle_vandalism | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3 | 60 | 2011 | 8 | -73.988165 | 40.590100 | 27 AVENUE | HARWAY AVENUE | 1 | 2 | 1 | 2 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1 | NaN | NaN | NaN | NaN | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 3 | 60 | 2011 | 8 | -73.988310 | 40.586507 | 28 AVENUE | CROPSEY AVENUE | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 3 | 60 | 2011 | 8 | -73.985962 | 40.588771 | 28 AVENUE | HARWAY AVENUE | 2 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 5 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 3 | 60 | 2011 | 8 | -73.984093 | 40.590566 | 28 AVENUE | STILLWELL AVENUE | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 3 | 60 | 2011 | 8 | -73.988643 | 40.574282 | 38 STREET | SURF AVENUE | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Let's take a look at where most of the accidents happen. In other words, let's plot the latitudes and longitudes as a heatmap.
I didn't quite know how to plot a 2D heatmap, but as with all things in life, Google and StackOverflow are your friends. That led to http://stackoverflow.com/a/2461029/100904, which suggests this approach:
heatmap, xedges, yedges = np.histogram2d(data['lon'], data['lat'], bins=50)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-744ec5a742e9> in <module>()
----> 1 heatmap, xedges, yedges = np.histogram2d(data['lon'], data['lat'], bins=50)

D:\anaconda\64\lib\site-packages\numpy\lib\twodim_base.pyc in histogram2d(x, y, bins, range, normed, weights)
    609         xedges = yedges = asarray(bins, float)
    610         bins = [xedges, yedges]
--> 611     hist, edges = histogramdd([x,y], bins, range, normed, weights)
    612     return hist, edges[0], edges[1]
    613

D:\anaconda\64\lib\site-packages\numpy\lib\function_base.pyc in histogramdd(sample, bins, range, normed, weights)
    356             mindiff = dedges[i].min()
    357             if not np.isinf(mindiff):
--> 358                 decimal = int(-log10(mindiff)) + 6
    359                 # Find which points are on the rightmost edge.
    360                 on_edge = where(around(sample[:,i], decimal) == around(edges[i][-1],

ValueError: cannot convert float NaN to integer
We get an error when building the histogram. It looks like this is because some of the longitude and latitude values are NaN. Let's get rid of those.
locations = data[['lon', 'lat']].dropna()
locations
<class 'pandas.core.frame.DataFrame'>
Int64Index: 228297 entries, 0 to 230137
Data columns (total 2 columns):
lon    228297 non-null values
lat    228297 non-null values
dtypes: float64(2)
Now we can create the 2D histogram and plot it.
heatmap, xedges, yedges = np.histogram2d(locations['lon'], locations['lat'], bins=50)
plt.imshow(heatmap, extent=[xedges[0], xedges[-1], yedges[0], yedges[-1]])
<matplotlib.image.AxesImage at 0xc269240>
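Incidentally, part of why the picture looks odd: np.histogram2d returns its first argument (here, longitude) along the rows, while imshow draws rows vertically, so the map comes out rotated. Here's a sketch of one way to orient it, using synthetic points rather than the collision data:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # headless backend, so this sketch runs anywhere
import matplotlib.pyplot as plt

# Synthetic stand-in for the real (lon, lat) pairs.
rng = np.random.RandomState(0)
lon = rng.uniform(-74.3, -73.7, 1000)
lat = rng.uniform(40.5, 41.0, 1000)

heatmap, xedges, yedges = np.histogram2d(lon, lat, bins=50)

# np.histogram2d puts the first argument (lon) along axis 0 (rows), but
# imshow draws axis 0 vertically. Transposing, plus origin='lower', puts
# longitude on the x axis and latitude on the y axis, increasing upwards.
plt.imshow(heatmap.T, origin='lower',
           extent=[xedges[0], xedges[-1], yedges[0], yedges[-1]])
```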
That's clustered in one corner, most likely because there are a few outliers. Could we clean up these outliers? Maybe we can restrict ourselves to latitudes of 40.5 to 41 and longitudes of -74.5 to -73.5. Let's plot the distribution of the data and see what it looks like.
locations['lon'].hist()
<matplotlib.axes.AxesSubplot at 0x13f58c18>
Not very helpful. Let's increase the number of bins, and the size of the images going forward.
rcParams['figure.figsize'] = 15, 5
locations['lon'].hist(bins=100)
<matplotlib.axes.AxesSubplot at 0x1481b160>
locations['lat'].hist(bins=100)
<matplotlib.axes.AxesSubplot at 0x12c82748>
Hmm... might have been nice to see those side by side. Fortunately, Pandas has some good documentation on subplots: http://pandas.pydata.org/pandas-docs/dev/visualization.html#targeting-different-subplots
fig, axes = plt.subplots(nrows=1, ncols=2, sharey=True)
locations['lon'].hist(bins=100, ax=axes[0])
locations['lat'].hist(bins=100, ax=axes[1])
<matplotlib.axes.AxesSubplot at 0x147ba5f8>
Let's get rid of the outliers.
coords = locations[(locations['lon'] < -73.5) & (locations['lon'] > -74.5) &
(locations['lat'] < +41.0) & (locations['lat'] > +40.5)]
coords
<class 'pandas.core.frame.DataFrame'>
Int64Index: 228292 entries, 0 to 230137
Data columns (total 2 columns):
lon    228292 non-null values
lat    228292 non-null values
dtypes: float64(2)
This can be plotted the same way.
heatmap, xedges, yedges = np.histogram2d(coords['lon'], coords['lat'], bins=50)
plt.imshow(heatmap, extent=[xedges[0], xedges[-1], yedges[0], yedges[-1]])
<matplotlib.image.AxesImage at 0xf1d3a58>
That's a bit fuzzy, and it doesn't look like New York either. Let's flip the longitude axis and increase the number of bins. We can also change the color scale, so let's do that.
rcParams['figure.figsize'] = [10,10]
heatmap, xedges, yedges = np.histogram2d(-coords['lon'], coords['lat'], bins=500)
plt.imshow(heatmap ** .5, cmap='YlOrBr', aspect=1.5, extent=[xedges[0], xedges[-1], yedges[0], yedges[-1]])
<matplotlib.image.AxesImage at 0xfd1cb38>
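The ** .5 compresses the dynamic range so that a few very dense cells in Manhattan don't wash out the rest of the map. A logarithmic color scale is a common alternative; here's a sketch with synthetic counts, not the collision heatmap:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # headless backend for this sketch
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm

# Synthetic counts standing in for the real 2D histogram.
rng = np.random.RandomState(1)
counts = rng.poisson(3, size=(100, 100)).astype(float)

# LogNorm colors each cell by log(count); adding 1 keeps empty cells finite.
image = plt.imshow((counts + 1).T, cmap='YlOrBr', norm=LogNorm())
```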
Let's put together a news report on what's happened so far. We'll ask basic questions: how many collisions were there? How many people were injured? How many were killed? Then let's start breaking this up month by month.
# How many collisions?
data.collisions.sum()
337930
# How many were injured?
data.total_injured.sum()
92768
# How many were killed?
data.total_killed.sum()
431
# Let's break this up by month.
data.groupby(['year', 'month'])['collisions', 'total_injured', 'total_killed'].sum()
year | month | collisions | total_injured | total_killed
---|---|---|---|---
2011 | 8 | 14196 | 4013 | 14
2011 | 9 | 14388 | 4104 | 11
2011 | 10 | 14934 | 4184 | 20
2011 | 11 | 14094 | 4084 | 14
2011 | 12 | 14503 | 4063 | 23
2012 | 1 | 13166 | 3516 | 15
2012 | 2 | 12194 | 3266 | 13
2012 | 3 | 13888 | 3858 | 15
2012 | 4 | 13839 | 3958 | 18
2012 | 5 | 15115 | 4099 | 27
2012 | 6 | 15214 | 4105 | 21
2012 | 7 | 14142 | 4053 | 21
2012 | 8 | 14205 | 3928 | 22
2012 | 9 | 13819 | 3924 | 21
2012 | 10 | 14152 | 3837 | 18
2012 | 11 | 13332 | 3410 | 13
2012 | 12 | 14345 | 3877 | 22
2013 | 1 | 13092 | 3387 | 25
2013 | 2 | 12174 | 2995 | 17
2013 | 3 | 13848 | 3539 | 18
2013 | 4 | 13809 | 3731 | 13
2013 | 5 | 15583 | 4217 | 12
2013 | 6 | 15177 | 4262 | 17
2013 | 7 | 14721 | 4358 | 21
Of course, it's good to be able to see what this looks like.
data.groupby(['year', 'month'])['collisions', 'total_injured', 'total_killed'].sum().plot(subplots=True, kind='line', figsize=(15,9))
array([<matplotlib.axes.AxesSubplot object at 0x000000000FD10710>, <matplotlib.axes.AxesSubplot object at 0x000000000FD496A0>, <matplotlib.axes.AxesSubplot object at 0x000000001451A908>], dtype=object)
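Those (year, month) tick labels are a bit clunky. One option, sketched here on a toy slice of the monthly table (assuming a Pandas recent enough for pd.to_datetime to assemble dates from year/month/day columns), is to rebuild the index as real timestamps before plotting:

```python
import pandas as pd

# Toy stand-in for the monthly aggregate above (first three rows).
monthly = pd.DataFrame(
    {'collisions': [14196, 14388, 14934]},
    index=pd.MultiIndex.from_tuples([(2011, 8), (2011, 9), (2011, 10)],
                                    names=['year', 'month']))

# Assemble real timestamps from the year/month levels so that line
# plots get a proper date axis instead of tuple tick labels.
idx = monthly.index
monthly.index = pd.to_datetime({'year': idx.get_level_values('year'),
                                'month': idx.get_level_values('month'),
                                'day': 1})
```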
We can now ask: are there specific boroughs or precincts that are more prone to accidents than others?
data.groupby('borocode')['collisions'].sum().order().plot(kind='barh', figsize=[10,3])
<matplotlib.axes.AxesSubplot at 0x144a5dd8>
data.groupby('precinct')['collisions'].sum().order().plot(kind='barh', figsize=[10,15])
<matplotlib.axes.AxesSubplot at 0x144bacc0>
What do the numbers mean? A careful look at the scraper source code reveals that these are the actual precinct numbers, which double as the precinct names -- with a few exceptions.
Let's do that mapping:
precinct = data['precinct'].unique()
precinct
array([ 60, 61, 62, 63, 66, 67, 68, 69, 70, 71, 72, 73, 75, 76, 77, 78, 79, 81, 83, 84, 88, 90, 94, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 52, 1, 5, 6, 7, 9, 10, 13, 17, 19, 20, 22, 23, 24, 25, 26, 28, 30, 32, 33, 34, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 120, 122, 123, 14, 18, 121], dtype=int64)
Let's create a mapping function.
def precinct_name(number):
    if number == 14:
        return 'Midtown South'
    elif number == 18:
        return 'Midtown North'
    elif number == 22:
        return 'Central Park'
    elif number % 10 == 1 and number % 100 != 11:
        return '%dst precinct' % number
    elif number % 10 == 2 and number % 100 != 12:
        return '%dnd precinct' % number
    elif number % 10 == 3 and number % 100 != 13:
        return '%drd precinct' % number
    else:
        return '%dth precinct' % number
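As a quick sanity check of the ordinal logic (the function is repeated here so the snippet stands alone):

```python
def precinct_name(number):
    # Named precincts are special-cased; the rest get an ordinal suffix.
    if number == 14:
        return 'Midtown South'
    elif number == 18:
        return 'Midtown North'
    elif number == 22:
        return 'Central Park'
    elif number % 10 == 1 and number % 100 != 11:
        return '%dst precinct' % number
    elif number % 10 == 2 and number % 100 != 12:
        return '%dnd precinct' % number
    elif number % 10 == 3 and number % 100 != 13:
        return '%drd precinct' % number
    else:
        return '%dth precinct' % number

# The % 100 checks handle numbers ending in 11/12/13, which take
# 'th' despite their last digit.
assert precinct_name(22) == 'Central Park'
assert precinct_name(61) == '61st precinct'
assert precinct_name(102) == '102nd precinct'
assert precinct_name(111) == '111th precinct'
assert precinct_name(113) == '113th precinct'
```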
This needs to be applied to collisions by precinct.
data.groupby('precinct')['collisions'].sum().order().rename(precinct_name).plot(kind='barh', figsize=[10, 15])
<matplotlib.axes.AxesSubplot at 0xfd27ac8>
So Central Park looks safe. But what if it's because no one lives there? In other words, what's the number of accidents per capita?
First, we'll need the number of people in each precinct. A search leads us to http://johnkeefe.net/nyc-police-precinct-and-census-data which leads us to the Google Fusion Table with NYC census data. Aggregating this by precinct, we get the population by precinct which I'm reading below.
from cStringIO import StringIO
population = pd.read_table(StringIO('''precinct population
1 66679
5 52568
6 62226
7 56355
9 76443
10 50180
13 93640
14 20651
17 79126
18 54066
19 208259
20 102624
22 25
23 73106
24 106460
25 47405
26 49508
28 44781
30 60685
32 70942
33 77645
34 112375
40 91497
41 52246
42 79762
43 172122
44 146441
45 120833
46 128200
47 152374
48 83266
49 114712
50 101720
52 139307
60 104278
61 159645
62 181981
63 108646
66 191382
67 155252
68 124491
69 84480
70 160664
71 98429
72 126230
73 86468
75 183328
76 43694
77 96309
78 61099
79 90263
81 62722
83 112634
84 48196
88 51421
90 116836
94 56247
100 47913
101 67065
102 144008
103 105803
104 170190
105 188582
106 122441
107 151107
108 113200
109 247354
110 172634
111 116431
112 112277
113 120132
114 202766
115 171576
120 175876
122 194822
123 98032
'''))
population.sort('population').head()
precinct | population | |
---|---|---|
12 | 22 | 25 |
7 | 14 | 20651 |
47 | 76 | 43694 |
17 | 28 | 44781 |
15 | 25 | 47405 |
Precinct #22 looks suspiciously low. That's Central Park, if you remember. But then, hardly anyone actually lives in Central Park.
Let's merge this data into the accidents by precinct. We'll start with the list of collisions:
collisions = pd.DataFrame({'collisions': data.groupby('precinct')['collisions'].sum().order()})
collisions.head()
collisions | |
---|---|
precinct | |
22 | 144 |
121 | 219 |
100 | 1294 |
30 | 1501 |
101 | 1523 |
To merge the population data, let's ensure that it has the same index:
population = population.set_index('precinct')
population.head()
population | |
---|---|
precinct | |
1 | 66679 |
5 | 52568 |
6 | 62226 |
7 | 56355 |
9 | 76443 |
Now that they have the same index, let's just assign the columns. Pandas automatically takes care of aligning the indexes.
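The alignment behaviour is easy to see on a toy example with deliberately mismatched indexes (made-up subset of the numbers above, not the real frames):

```python
import pandas as pd
import numpy as np

# Toy frames whose indexes don't fully overlap.
collisions = pd.DataFrame({'collisions': [144, 219, 1294]},
                          index=pd.Index([22, 121, 100], name='precinct'))
population = pd.Series({22: 25, 100: 47913}, name='population')

# Assignment aligns on the index; precinct 121 has no population
# entry, so it comes through as NaN.
collisions['population'] = population
```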
collisions['population'] = population['population']
collisions
<class 'pandas.core.frame.DataFrame'>
Int64Index: 77 entries, 22 to 17
Data columns (total 2 columns):
collisions    77 non-null values
population    76 non-null values
dtypes: float64(1), int64(1)
Did you notice that population only has 76 values, whereas the dataset has 77 precincts? Which is the missing one?
collisions[collisions['population'].isnull()]
collisions | population | |
---|---|---|
precinct | ||
121 | 219 | NaN |
Can't be helped. We just don't have the data for the 121st precinct. Let's move on, and plot the rest.
(collisions['collisions'].astype(float) / collisions['population']).rename(precinct_name).order().plot(kind='barh', figsize=[10, 15])
<matplotlib.axes.AxesSubplot at 0x134002e8>
Which looks ridiculous, of course. Let's ignore Central Park, and hope that the accidents happening there are caused by people passing through rather than by the 25 people who live there. We'll also drop the 121st precinct, for which we have no population figure. So if we remove those...
precincts = collisions.drop([22, 121])
(precincts['collisions'].astype(float) / precincts['population']).rename(precinct_name).order().plot(kind='barh', figsize=[10, 15])
<matplotlib.axes.AxesSubplot at 0x11d2d5f8>
... we would be rather worried about the 17th Precinct.
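One refinement: per-capita rates are easier to read when scaled to collisions per 100,000 residents, the scale such figures are usually quoted in. A sketch using two precincts' totals from the tables above:

```python
import pandas as pd

# Collision totals and populations for precincts 30 and 101,
# taken from the tables above.
precincts = pd.DataFrame({'collisions': [1501, 1523],
                          'population': [60685, 67065]},
                         index=pd.Index([30, 101], name='precinct'))

# Scale the rate to collisions per 100,000 residents.
rate = precincts['collisions'] / precincts['population'] * 100000
```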