Hi!

I'm Julia.

In [ ]:
%pylab inline
import pandas as pd
pd.set_option('display.mpl_style', 'default')
figsize(15, 6)
pd.set_option('display.line_width', 4000)
pd.set_option('display.max_columns', 100)

Goal (Today)

Know how to use pandas to answer some specific questions about a dataset

Roadmap:

  1. Demo with rats
  2. Dataframes: what makes pandas powerful
  3. Selecting data from a dataframe
  4. Time series and indexes and resampling
  5. Groupby + aggregate

Some notes about installation:

Don't do this:

sudo apt-get install ipython-notebook

Instead, do this:

pip install ipython tornado pyzmq

or install Anaconda from http://store.continuum.io (what I do)

You can start IPython notebook by running

ipython notebook --pylab inline

First: Read the data

In [45]:
# Download and read the data
!wget "http://bit.ly/311-data-tar-gz" -O 311-data.tar.gz
!wget "https://raw2.github.com/jvns/talks/master/pyladiesfeb2014/tiny.csv" -O tiny.csv
!tar -xzf "311-data.tar.gz" # extract 311-service-requests.csv from the tarball
orig_data = pd.read_csv('./311-service-requests.csv', nrows=100000, parse_dates=['Created Date'])
In [ ]:
plot(orig_data['Longitude'], orig_data['Latitude'], '.', color="purple")

Example 1: Graph the number of noise complaints each hour in New York

In [ ]:
complaints = orig_data[['Created Date', 'Complaint Type']]
noise_complaints = complaints[complaints['Complaint Type'] == 'Noise - Street/Sidewalk']
noise_complaints.set_index('Created Date').sort_index().resample('H', how=len).plot()

Example 2: What are the most common complaint types?

In [ ]:
orig_data['Complaint Type'].value_counts()[:20].plot(kind='bar')

Roadmap:

  1. Dataframes: what makes pandas powerful
  2. Selecting data from a dataframe
  3. Time series and indexes
  4. Graphing

What is pandas?

A few awesome things about pandas

  • Really, really, really, really good at time series
  • Can import Excel files (!!!)
  • Fast (joining dataframes, etc.)

The dataframe is what lets you manipulate data easily -- it's basically the whole reason for pandas, and a powerful concept borrowed from the statistical computing language R.

If you don't know R, you can think of it like a database table (it has rows and columns), or like a table of numbers.

1. Dataframes: what makes pandas powerful

In [ ]:
people = pd.read_csv('tiny.csv')
people

This is like a SQL database table, or an R dataframe. There are 3 columns, called 'name', 'age', and 'height', and 6 rows.
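A dataframe like this can also be built directly from a Python dict, which is handy for toy examples (the names and values below are made up, not from tiny.csv):

```python
import pandas as pd

# Each key becomes a column; each list supplies that column's values.
people = pd.DataFrame({
    'name': ['alice', 'bob', 'carol'],
    'age': [31, 25, 47],
    'height': [1.65, 1.80, 1.70],
})

print(people.shape)           # (3, 3): 3 rows, 3 columns
print(list(people.columns))   # ['name', 'age', 'height']
```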

2. Selecting data from a dataframe

I want you to know about this because you almost always want only a subset of the data you're working with. We are going to look at a CSV with 40 columns and 1,000,000 rows.

In [ ]:
# Load the first 5 rows of our CSV
small_requests = pd.read_csv('./311-service-requests.csv', nrows=5)
In [ ]:
# How to get a column
small_requests['Complaint Type']
In [ ]:
# How to get a subset of the columns
small_requests[['Complaint Type', 'Created Date']]
In [ ]:
# How to get 3 rows
small_requests[:3]

Get the first 3 rows of a column

In [ ]:
small_requests['Agency Name'][:3]
In [ ]:
small_requests[:3]['Agency Name']

Compare a column to a value

In [ ]:
small_requests['Complaint Type']
In [ ]:
# This is like our numpy example from before
small_requests['Complaint Type'] == 'Noise - Street/Sidewalk'

That's numpy in action! Using == on a column of a dataframe gives us a series of True and False values.
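The same comparison works on any column. A self-contained sketch with a made-up dataframe:

```python
import pandas as pd

df = pd.DataFrame({'Complaint Type': ['Noise - Street/Sidewalk',
                                      'Rodent',
                                      'Noise - Street/Sidewalk']})

# Comparing the column to a string gives back a boolean Series,
# one True/False per row.
mask = df['Complaint Type'] == 'Noise - Street/Sidewalk'
print(mask.tolist())   # [True, False, True]
```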

Selecting only the rows with noise complaints

In [ ]:
# This is like our numpy example earlier
noise_complaints = small_requests[small_requests['Complaint Type'] == 'Noise - Street/Sidewalk']
noise_complaints

Every dataframe has an index: an integer, a date, or some other label associated with each row.

In [ ]:
# How to get a specific row
small_requests.ix[0]
In [ ]:
# How not to get a row
small_requests[0]
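One caveat if you're reading this with a newer pandas: `.ix` was deprecated in 0.20 and later removed. `.loc` does the same label-based lookup, and `.iloc` looks up by position. A sketch with a made-up, non-integer index:

```python
import pandas as pd

df = pd.DataFrame({'age': [31, 25]}, index=['alice', 'bob'])

print(df.loc['alice'])   # the row labelled 'alice'
print(df.iloc[0])        # the first row by position (same row here)
```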

Exercise 1: Selecting things from dataframes

  • Find out how many complaints were filed with the NYPD
  • How many complaints were filed in the zip code 10007?
In [ ]:
# Your code here
In [ ]:
 
In [ ]:
 
In [ ]:
 

Back to our example

In [ ]:
# We ran this at the beginning, so we don't have to run it again. Just here as a reminder.
#orig_data = pd.read_csv('./311-service-requests.csv', nrows=100000, parse_dates=['Created Date'])
In [ ]:
complaints = orig_data[['Created Date', 'Complaint Type']]
noise_complaints = complaints[complaints['Complaint Type'] == 'Noise - Street/Sidewalk']
noise_complaints.set_index('Created Date').sort_index().resample('H', how=len).plot()

Indexes

In [ ]:
noise_complaints[:3]
In [ ]:
noise_complaints = noise_complaints.set_index('Created Date')
In [ ]:
noise_complaints[:3]

Sorting the index

Pandas is awesome for datetime index stuff. It was built for dealing with financial data, which is ALL TIME SERIES.

In [ ]:
noise_complaints = noise_complaints.sort_index()
noise_complaints[:3]

Counting the complaints each hour

In [ ]:
noise_complaints.resample('H', how=len)[:3]
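An aside for newer pandas versions: the `how=` argument to `resample` was deprecated in 0.18 and later removed; the same count is now written with method chaining, e.g. `.resample(...).size()`. A self-contained sketch with made-up timestamps, bucketed by day:

```python
import pandas as pd

times = pd.to_datetime(['2014-02-01 00:10', '2014-02-01 00:50',
                        '2014-02-01 01:30', '2014-02-02 09:00'])
events = pd.DataFrame({'Complaint Type': ['Noise'] * 4}, index=times)

# 'D' buckets the datetime index into calendar days;
# .size() counts the rows in each bucket.
per_day = events.resample('D').size()
print(per_day)   # 2014-02-01 -> 3, 2014-02-02 -> 1
```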

Example 1: done!

In [ ]:
noise_complaints.resample('H', how=len).plot()

Chaining commands together

In [ ]:
complaints = orig_data[['Created Date', 'Complaint Type']]
noise_complaints = complaints[complaints['Complaint Type'] == 'Noise - Street/Sidewalk']
noise_complaints.set_index('Created Date').sort_index().resample('H', how=len).plot()

Exercise 2: Time series resampling

  • Find the number of noise complaints every day!
  • Find how many complaints about rodents there are each week. Make a graph!

Example 2: What are the most common complaint types?

In [ ]:
orig_data['Complaint Type'].value_counts()
In [ ]:
orig_data['Complaint Type'].value_counts()[:20].plot(kind='bar')
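`value_counts` tallies how often each distinct value appears in a column, sorted most-common first. A toy sketch on made-up complaint types:

```python
import pandas as pd

s = pd.Series(['Noise', 'Rodent', 'Noise', 'Noise', 'Heating'])

# The result is a Series: distinct values as the index, counts as the values.
counts = s.value_counts()
print(counts)   # Noise appears 3 times, the others once each
```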

Exercise 3: Do the same thing for a different column

In [ ]:
# Your code here.

Example 3: Which weekday has the most noise complaints?

In [ ]:
complaints = orig_data[['Created Date', 'Complaint Type']]
noise_complaints = complaints[complaints['Complaint Type'] == 'Noise - Street/Sidewalk']
noise_complaints = noise_complaints.set_index("Created Date")
In [ ]:
noise_complaints['weekday'] = noise_complaints.index.weekday
noise_complaints[:3]
In [ ]:
# Count the complaints by weekday
counts_by_weekday = noise_complaints.groupby('weekday').aggregate(len)
counts_by_weekday
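`groupby` + `aggregate` is a general pattern: split the rows into groups by a column, then reduce each group to one number. The notebook's `.aggregate(len)` counts rows per group; `.size()` is the more direct modern spelling of the same thing. A minimal sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({'weekday': [0, 0, 1, 1, 1],
                   'Complaint Type': ['Noise'] * 5})

# One entry per distinct weekday; each group collapsed to its row count.
counts = df.groupby('weekday').size()
print(counts)   # weekday 0 -> 2 rows, weekday 1 -> 3 rows
```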
In [ ]:
# change the index to actual day names (pandas weekday 0 is Monday)
counts_by_weekday.index = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
In [ ]:
counts_by_weekday.plot(kind='bar')

Exercise 4: Count the complaints by hour instead

In [ ]:
# Your code here
In [ ]:
 
In [ ]:
 

A few more cool things

String searching

In [ ]:
# We need to get rid of the NA values for this to work
street_names = orig_data['Street Name'].fillna('')
In [ ]:
manhattan_streets = street_names[street_names.str.contains("MANHATTAN")]
manhattan_streets
In [ ]:
manhattan_streets.value_counts()
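`.str.contains` works on any string column; the `fillna('')` matters because `contains` on missing values yields NaN rather than False. A self-contained sketch with made-up street names:

```python
import pandas as pd

streets = pd.Series(['MANHATTAN AVENUE', 'BROADWAY', None]).fillna('')

# A boolean Series: True where the substring appears.
mask = streets.str.contains('MANHATTAN')
print(streets[mask].tolist())   # ['MANHATTAN AVENUE']
```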

Looking at complaints close to us

In [ ]:
# Our current latitude and longitude
our_lat, our_long = 40.714151,-74.00878
In [ ]:
distance_from_us = (orig_data['Longitude'] - our_long)**2 + (orig_data['Latitude'] - our_lat)**2
In [ ]:
pd.Series(distance_from_us).hist()
In [ ]:
close_complaints = orig_data[distance_from_us < 0.00005]
In [ ]:
close_complaints['Complaint Type'].value_counts()[:20].plot(kind='bar')