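# The usual preamble
import pandas as pd
import matplotlib.pyplot as plt
# Make the graphs a bit prettier, and bigger
# (the old 'display.mpl_style' and 'display.line_width' options no longer exist in pandas)
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (15, 5)
pd.set_option('display.width', 5000)
pd.set_option('display.max_columns', 60)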
We're going to use a new dataset here, to demonstrate how to deal with larger datasets. This is a subset of the 311 service requests from NYC Open Data.
complaints = pd.read_csv('../data/311-service-requests.csv')
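If you don't have the data file handy, here is a minimal sketch of the same call on a tiny in-memory CSV. The three rows are made up for illustration and are not taken from the real 311 data, and the `dtype` argument is an optional extra that keeps ZIP codes as text instead of letting pandas guess a numeric type.

```python
import io

import pandas as pd

# Hypothetical three-row sample; NOT the real 311 data.
csv_text = """Unique Key,Complaint Type,Borough,Incident Zip
1,Noise - Street/Sidewalk,QUEENS,11432
2,Illegal Parking,QUEENS,11378
3,Noise - Commercial,MANHATTAN,10032
"""

# read_csv accepts a file path or any file-like object.
# dtype maps a column name to the type it should be parsed as.
sample = pd.read_csv(io.StringIO(csv_text), dtype={"Incident Zip": str})

print(sample.shape)
print(sample.dtypes)
```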
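When you look at a large dataframe, pandas doesn't print the entire contents. Older versions show a summary instead, which lists all the columns and how many non-null values there are in each column. Newer versions show a truncated view of the first and last rows, and complaints.info() prints the column summary.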
complaints
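To ask for the column summary explicitly, whatever pandas decides to display by default, something like the following should work. The small dataframe here is a made-up stand-in for `complaints`.

```python
import pandas as pd

# Hypothetical stand-in for the complaints dataframe.
df = pd.DataFrame({
    "Complaint Type": ["Noise - Commercial", "Illegal Parking", "Noise - Commercial"],
    "Borough": ["MANHATTAN", "QUEENS", None],
})

# (rows, columns)
print(df.shape)

# Prints each column with its non-null count and dtype.
df.info()

# The non-null counts on their own, as a Series indexed by column name.
non_null = df.count()
print(non_null)
```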
To select a column, we index with the name of the column, like this:
complaints['Complaint Type']
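It helps to know what comes back from that kind of indexing: a single column is a `Series`, which keeps the dataframe's index and takes the column name as its own name. A small sketch, using a made-up dataframe in place of `complaints`:

```python
import pandas as pd

# Hypothetical stand-in for the complaints dataframe.
df = pd.DataFrame({
    "Complaint Type": ["Noise - Commercial", "Illegal Parking"],
    "Borough": ["MANHATTAN", "QUEENS"],
})

column = df["Complaint Type"]

print(type(column).__name__)
print(column.name)
print(len(column))
```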
To get the first 5 rows of a dataframe, we can use a slice: df[:5].
This is a great way to get a sense for what kind of information is in the dataframe -- take a minute to look at the contents and get a feel for this dataset.
complaints[:5]
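The slice has a method equivalent, `.head()`, which returns the first 5 rows by default, and `.tail()` does the same from the other end. Here is a quick sketch on a made-up 8-row dataframe:

```python
import pandas as pd

# Hypothetical 8-row stand-in for the complaints dataframe.
df = pd.DataFrame({
    "Complaint Type": [f"Type {i}" for i in range(8)],
    "Borough": ["QUEENS", "BRONX"] * 4,
})

first_by_slice = df[:5]
first_by_head = df.head()   # n defaults to 5
last_three = df.tail(3)

# Both spellings select the same rows.
print(first_by_slice.equals(first_by_head))
print(last_three)
```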
We can combine these to get the first 5 rows of a column:
complaints['Complaint Type'][:5]
and it doesn't matter which direction we do it in:
complaints[:5]['Complaint Type']
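We can check that claim directly instead of comparing the two outputs by eye. This sketch uses a made-up dataframe and `Series.equals` to compare the two orderings:

```python
import pandas as pd

# Hypothetical stand-in for the complaints dataframe.
df = pd.DataFrame({
    "Complaint Type": ["Noise", "Parking", "Heating", "Noise"],
    "Borough": ["QUEENS", "BRONX", "BROOKLYN", "MANHATTAN"],
})

column_then_rows = df["Complaint Type"][:2]
rows_then_column = df[:2]["Complaint Type"]

# Same values, same index, same name.
same = column_then_rows.equals(rows_then_column)
print(same)
```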
What if we just want to know the complaint type and the borough, but not the rest of the information? Pandas makes it really easy to select a subset of the columns: just index with a list of the columns you want.
complaints[['Complaint Type', 'Borough']]
That shows us a summary again, so let's look at just the first 10 rows:
complaints[['Complaint Type', 'Borough']][:10]
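One detail worth seeing in action: indexing with a list always gives back a dataframe, even when the list has only one column in it, while indexing with a plain string gives back a `Series`. A small sketch on a made-up dataframe:

```python
import pandas as pd

# Hypothetical stand-in for the complaints dataframe.
df = pd.DataFrame({
    "Complaint Type": ["Noise", "Parking", "Heating"],
    "Borough": ["QUEENS", "BRONX", "BROOKLYN"],
    "Incident Zip": ["11432", "10451", "11201"],
})

subset = df[["Complaint Type", "Borough"]]   # list of names -> DataFrame
one_column_frame = df[["Borough"]]           # one-element list -> still a DataFrame
one_column_series = df["Borough"]            # plain string -> Series

print(list(subset.columns))
print(type(one_column_frame).__name__, type(one_column_series).__name__)
```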
What's the most common complaint type? This is a really easy question to answer! There's a .value_counts() method that we can use:
complaints['Complaint Type'].value_counts()
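To see what `.value_counts()` does without the full dataset, here it is on six made-up values. The result is a `Series` whose index holds the distinct values and whose values are the counts, sorted from most to least common.

```python
import pandas as pd

# Hypothetical complaint types, made up for illustration.
types = pd.Series(
    ["Noise", "Parking", "Noise", "Heating", "Noise", "Parking"],
    name="Complaint Type",
)

counts = types.value_counts()
print(counts)

# The first index label is the most common value.
most_common = counts.index[0]
print(most_common)
```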
If we just want the top 10 most common complaints, we can do this:
complaint_counts = complaints['Complaint Type'].value_counts()
complaint_counts[:10]
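Slicing works here because `.value_counts()` has already sorted the counts in descending order, so the first N entries are the N most common. The sketch below, on made-up values, shows that `.head(2)` picks out the same entries as `[:2]`.

```python
import pandas as pd

# Hypothetical complaint types, made up for illustration.
types = pd.Series(["Noise", "Parking", "Noise", "Heating", "Noise", "Parking"])
counts = types.value_counts()

top_two_slice = counts[:2]      # positional slice, as in the text
top_two_head = counts.head(2)   # method equivalent

print(top_two_head)
print(top_two_slice.equals(top_two_head))
```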
But it gets better! We can plot them!
complaint_counts[:10].plot(kind='bar')
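Outside a notebook the plot won't appear on its own, so here is a rough sketch of the same bar chart as a standalone script. It assumes matplotlib is installed. The `Agg` backend, the y-axis label and the in-memory PNG buffer are choices made for this example, and the counts are made up.

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, so no display is needed
import pandas as pd

# Hypothetical counts, made up for illustration.
counts = pd.Series({"Noise": 3, "Parking": 2, "Heating": 1})

# Series.plot returns the matplotlib Axes it drew on.
ax = counts.plot(kind="bar", figsize=(15, 5))
ax.set_ylabel("Number of complaints")

# Write the figure to an in-memory PNG instead of showing it.
buffer = io.BytesIO()
ax.figure.savefig(buffer, format="png")
print(buffer.getbuffer().nbytes > 0)
```

To save to disk instead, pass a filename to `savefig` in place of the buffer.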