# The usual preamble
import pandas as pd
# Make the graphs a bit prettier, and bigger
pd.set_option('display.mpl_style', 'default')
pd.set_option('display.line_width', 5000)
pd.set_option('display.max_columns', 60)
figsize(15, 5)
We're going to use a new dataset here, to demonstrate how to deal with larger datasets. This is a subset of the of 311 service requests from NYC Open Data.
complaints = pd.read_csv('../data/311-service-requests.csv')
When you look at a large dataframe, instead of showing you the contents of the dataframe, it'll show you a summary. This includes all the columns, and how many non-null values there are in each column.
complaints
<class 'pandas.core.frame.DataFrame'> Int64Index: 111069 entries, 0 to 111068 Data columns (total 52 columns): Unique Key 111069 non-null values Created Date 111069 non-null values Closed Date 60270 non-null values Agency 111069 non-null values Agency Name 111069 non-null values Complaint Type 111069 non-null values Descriptor 111068 non-null values Location Type 79048 non-null values Incident Zip 98813 non-null values Incident Address 84441 non-null values Street Name 84438 non-null values Cross Street 1 84728 non-null values Cross Street 2 84005 non-null values Intersection Street 1 19364 non-null values Intersection Street 2 19366 non-null values Address Type 102247 non-null values City 98860 non-null values Landmark 95 non-null values Facility Type 110938 non-null values Status 111069 non-null values Due Date 39239 non-null values Resolution Action Updated Date 96507 non-null values Community Board 111069 non-null values Borough 111069 non-null values X Coordinate (State Plane) 98143 non-null values Y Coordinate (State Plane) 98143 non-null values Park Facility Name 111069 non-null values Park Borough 111069 non-null values School Name 111069 non-null values School Number 111052 non-null values School Region 110524 non-null values School Code 110524 non-null values School Phone Number 111069 non-null values School Address 111069 non-null values School City 111069 non-null values School State 111069 non-null values School Zip 111069 non-null values School Not Found 38984 non-null values School or Citywide Complaint 0 non-null values Vehicle Type 99 non-null values Taxi Company Borough 117 non-null values Taxi Pick Up Location 1059 non-null values Bridge Highway Name 185 non-null values Bridge Highway Direction 185 non-null values Road Ramp 184 non-null values Bridge Highway Segment 223 non-null values Garage Lot Name 49 non-null values Ferry Direction 37 non-null values Ferry Terminal Name 336 non-null values Latitude 98143 non-null values Longitude 98143 non-null values Location 98143 non-null values dtypes: float64(5), int64(1), object(46)
To select a column, we index with the name of the column, like this:
complaints['Complaint Type']
0 Noise - Street/Sidewalk 1 Illegal Parking 2 Noise - Commercial 3 Noise - Vehicle 4 Rodent 5 Noise - Commercial 6 Blocked Driveway 7 Noise - Commercial 8 Noise - Commercial 9 Noise - Commercial 10 Noise - House of Worship 11 Noise - Commercial 12 Illegal Parking 13 Noise - Vehicle 14 Rodent ... 111054 Noise - Street/Sidewalk 111055 Noise - Commercial 111056 Street Sign - Missing 111057 Noise 111058 Noise - Commercial 111059 Noise - Street/Sidewalk 111060 Noise 111061 Noise - Commercial 111062 Water System 111063 Water System 111064 Maintenance or Facility 111065 Illegal Parking 111066 Noise - Street/Sidewalk 111067 Noise - Commercial 111068 Blocked Driveway Name: Complaint Type, Length: 111069, dtype: object
To get the first 5 rows of a dataframe, we can use a slice: df[:5]
.
This is a great way to get a sense for what kind of information is in the dataframe -- take a minute to look at the contents and get a feel for this dataset.
complaints[:5]
Unique Key | Created Date | Closed Date | Agency | Agency Name | Complaint Type | Descriptor | Location Type | Incident Zip | Incident Address | Street Name | Cross Street 1 | Cross Street 2 | Intersection Street 1 | Intersection Street 2 | Address Type | City | Landmark | Facility Type | Status | Due Date | Resolution Action Updated Date | Community Board | Borough | X Coordinate (State Plane) | Y Coordinate (State Plane) | Park Facility Name | Park Borough | School Name | School Number | School Region | School Code | School Phone Number | School Address | School City | School State | School Zip | School Not Found | School or Citywide Complaint | Vehicle Type | Taxi Company Borough | Taxi Pick Up Location | Bridge Highway Name | Bridge Highway Direction | Road Ramp | Bridge Highway Segment | Garage Lot Name | Ferry Direction | Ferry Terminal Name | Latitude | Longitude | Location | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 26589651 | 10/31/2013 02:08:41 AM | NaN | NYPD | New York City Police Department | Noise - Street/Sidewalk | Loud Talking | Street/Sidewalk | 11432 | 90-03 169 STREET | 169 STREET | 90 AVENUE | 91 AVENUE | NaN | NaN | ADDRESS | JAMAICA | NaN | Precinct | Assigned | 10/31/2013 10:08:41 AM | 10/31/2013 02:35:17 AM | 12 QUEENS | QUEENS | 1042027 | 197389 | Unspecified | QUEENS | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | N | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.708275 | -73.791604 | (40.70827532593202, -73.79160395779721) |
1 | 26593698 | 10/31/2013 02:01:04 AM | NaN | NYPD | New York City Police Department | Illegal Parking | Commercial Overnight Parking | Street/Sidewalk | 11378 | 58 AVENUE | 58 AVENUE | 58 PLACE | 59 STREET | NaN | NaN | BLOCKFACE | MASPETH | NaN | Precinct | Open | 10/31/2013 10:01:04 AM | NaN | 05 QUEENS | QUEENS | 1009349 | 201984 | Unspecified | QUEENS | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | N | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.721041 | -73.909453 | (40.721040535628305, -73.90945306791765) |
2 | 26594139 | 10/31/2013 02:00:24 AM | 10/31/2013 02:40:32 AM | NYPD | New York City Police Department | Noise - Commercial | Loud Music/Party | Club/Bar/Restaurant | 10032 | 4060 BROADWAY | BROADWAY | WEST 171 STREET | WEST 172 STREET | NaN | NaN | ADDRESS | NEW YORK | NaN | Precinct | Closed | 10/31/2013 10:00:24 AM | 10/31/2013 02:39:42 AM | 12 MANHATTAN | MANHATTAN | 1001088 | 246531 | Unspecified | MANHATTAN | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | N | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.843330 | -73.939144 | (40.84332975466513, -73.93914371913482) |
3 | 26595721 | 10/31/2013 01:56:23 AM | 10/31/2013 02:21:48 AM | NYPD | New York City Police Department | Noise - Vehicle | Car/Truck Horn | Street/Sidewalk | 10023 | WEST 72 STREET | WEST 72 STREET | COLUMBUS AVENUE | AMSTERDAM AVENUE | NaN | NaN | BLOCKFACE | NEW YORK | NaN | Precinct | Closed | 10/31/2013 09:56:23 AM | 10/31/2013 02:21:10 AM | 07 MANHATTAN | MANHATTAN | 989730 | 222727 | Unspecified | MANHATTAN | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | N | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.778009 | -73.980213 | (40.7780087446372, -73.98021349023975) |
4 | 26590930 | 10/31/2013 01:53:44 AM | NaN | DOHMH | Department of Health and Mental Hygiene | Rodent | Condition Attracting Rodents | Vacant Lot | 10027 | WEST 124 STREET | WEST 124 STREET | LENOX AVENUE | ADAM CLAYTON POWELL JR BOULEVARD | NaN | NaN | BLOCKFACE | NEW YORK | NaN | N/A | Pending | 11/30/2013 01:53:44 AM | 10/31/2013 01:59:54 AM | 10 MANHATTAN | MANHATTAN | 998815 | 233545 | Unspecified | MANHATTAN | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | N | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.807691 | -73.947387 | (40.80769092704951, -73.94738703491433) |
We can combine these to get the first 5 rows of a column:
complaints['Complaint Type'][:5]
0 Noise - Street/Sidewalk 1 Illegal Parking 2 Noise - Commercial 3 Noise - Vehicle 4 Rodent Name: Complaint Type, dtype: object
and it doesn't matter which direction we do it in:
complaints[:5]['Complaint Type']
0 Noise - Street/Sidewalk 1 Illegal Parking 2 Noise - Commercial 3 Noise - Vehicle 4 Rodent Name: Complaint Type, dtype: object
What if we just want to know the complaint type and the borough, but not the rest of the information? Pandas makes it really easy to select a subset of the columns: just index with list of columns you want.
complaints[['Complaint Type', 'Borough']]
<class 'pandas.core.frame.DataFrame'> Int64Index: 111069 entries, 0 to 111068 Data columns (total 2 columns): Complaint Type 111069 non-null values Borough 111069 non-null values dtypes: object(2)
That showed us a summary, and then we can look at the first 10 rows:
complaints[['Complaint Type', 'Borough']][:10]
Complaint Type | Borough | |
---|---|---|
0 | Noise - Street/Sidewalk | QUEENS |
1 | Illegal Parking | QUEENS |
2 | Noise - Commercial | MANHATTAN |
3 | Noise - Vehicle | MANHATTAN |
4 | Rodent | MANHATTAN |
5 | Noise - Commercial | QUEENS |
6 | Blocked Driveway | QUEENS |
7 | Noise - Commercial | QUEENS |
8 | Noise - Commercial | MANHATTAN |
9 | Noise - Commercial | BROOKLYN |
This is a really easy question to answer! There's a .value_counts()
method that we can use:
complaints['Complaint Type'].value_counts()
HEATING 14200 GENERAL CONSTRUCTION 7471 Street Light Condition 7117 DOF Literature Request 5797 PLUMBING 5373 PAINT - PLASTER 5149 Blocked Driveway 4590 NONCONST 3998 Street Condition 3473 Illegal Parking 3343 Noise 3321 Traffic Signal Condition 3145 Dirty Conditions 2653 Water System 2636 Noise - Commercial 2578 ... Opinion for the Mayor 2 Window Guard 2 DFTA Literature Request 2 Legal Services Provider Complaint 2 Open Flame Permit 1 Snow 1 Municipal Parking Facility 1 X-Ray Machine/Equipment 1 Stalled Sites 1 DHS Income Savings Requirement 1 Tunnel Condition 1 Highway Sign - Damaged 1 Ferry Permit 1 Trans Fat 1 DWD 1 Length: 165, dtype: int64
If we just wanted the top 10 most common complaints, we can do this:
complaint_counts = complaints['Complaint Type'].value_counts()
complaint_counts[:10]
HEATING 14200 GENERAL CONSTRUCTION 7471 Street Light Condition 7117 DOF Literature Request 5797 PLUMBING 5373 PAINT - PLASTER 5149 Blocked Driveway 4590 NONCONST 3998 Street Condition 3473 Illegal Parking 3343 dtype: int64
But it gets better! We can plot them!
complaint_counts[:10].plot(kind='bar')
<matplotlib.axes.AxesSubplot at 0x7ba2290>