Hello! Welcome!¶

Did you see my talk at PyCon 2014! You can use this notebook to try it out for yourself!

If you didn't, and you'd like to, the video is here: Diving into Open Data with IPython Notebook and Pandas. The pandas cookbook I mentioned is at http://github.com/jvns/pandas-cookbook.

IMPORTANT: To make this work, you'll need to

install IPython Notebook and Pandas (easiest way: use the free/open source Anaconda)
Download this notebook
Download the data and put it in the same directory: 2012.csv (original source)
In that directory, run ipython notebook

This work is licensed under a Creative Commons Attribution 4.0 International License.

In [1]:

# Some imports we'll need
import numpy as np
import pandas as pd
julia = {'email': 'julia@jvns.ca', 'twitter': 'http://twitter.com/b0rk', 'slides': 'http://bit.ly/pycon-pandas', 'website': 'http://jvns.ca'}

Diving into Open Data with IPython Notebook & Pandas¶

I'm Julia Evans¶

Data at Stripe, work on Montréal All-Girl Hack Night, PyLadies MTL

You can follow along with this talk at:

http://bit.ly/pycon-pandas

In [2]:

print 'Email:', julia['email']
print 'Twitter:', julia['twitter']
print 'Blog:', julia['website']

Email: julia@jvns.ca
Twitter: http://twitter.com/b0rk
Blog: http://jvns.ca

This talk¶

1. What are IPython Notebook & Pandas?¶

2. Practical examples of using them!¶

3. Advice.¶

IPython Notebook + Pandas + Numpy: what are they?¶

IPython Notebook¶

web-based user interface to IPython
pretty graphs
literate programming
Can make slideshows :) (this presentation)
version controlled science!

Pandas: No loops!¶

Imagine:

dataset with every complaint call ever made in New York
You want to know how many noise complaints are made in each borough each hour

You don't need to write loops to do this!¶

Also the solution is 5 lines of code

Numpy makes Pandas fast¶

numpy: fast array computations
pandas is built on top of numpy

In [3]:

py_list = range(20000000)
numpy_array = np.arange(20000000)

In [4]:

%%timeit
total = 0
for x in py_list:
    x += total * total

1 loops, best of 3: 1.22 s per loop

In [5]:

%%timeit
np.sum(numpy_array * numpy_array)

10 loops, best of 3: 83.4 ms per loop

How do I install it?¶

Don't: Use the Ubuntu packages¶


sudo apt-get install ipython-notebook

Do: Use pip or Anaconda ¶


pip install ipython tornado pyzmq

pip install numpy pandas matplotlib

How to run IPython Notebook¶


$ ipython notebook

Practical examples!!!¶

Sensors on bike paths measure the number of cyclists¶

Part 1: Import the 2012 bike path data from a CSV¶

In [6]:

import pandas as pd
import numpy as np

In [7]:

import matplotlib
# display graphs inline
%matplotlib inline 

# Make graphs prettier
pd.set_option('display.max_columns', 15)
pd.set_option('display.line_width', 400)
pd.set_option('display.mpl_style', 'default')

# Make the fonts bigger
matplotlib.rc('figure', figsize=(14, 7))
matplotlib.rc('font', family='normal', weight='bold', size=22)

In [8]:

bike_data = pd.read_csv("./2012.csv")
bike_data[:5]

Out[8]:

	Date;Berri 1;Br�beuf (donn�es non disponibles);C�te-Sainte-Catherine;Maisonneuve 1;Maisonneuve 2;du Parc;Pierre-Dupuy;Rachel1;St-Urbain (donn�es non disponibles)
0	01/01/2012;35;;0;38;51;26;10;16;
1	02/01/2012;83;;1;68;153;53;6;43;
2	03/01/2012;135;;2;104;248;89;3;58;
3	04/01/2012;144;;1;116;318;111;8;61;
4	05/01/2012;197;;2;124;330;97;13;95;

5 rows × 1 columns

After¶

In [9]:

bike_data = pd.read_csv("./2012.csv", encoding='latin1', sep=';', index_col='Date', parse_dates=True, dayfirst=True)

In [10]:

# Get rid of missing columns
bike_data = bike_data.dropna(axis=1)
# Only use 3 of the columns so it all fits on the screen

In [11]:

bike_data = bike_data[['Berri 1', u'Côte-Sainte-Catherine', 'Maisonneuve 1']]
bike_data[:5]

Out[11]:

	Berri 1	Côte-Sainte-Catherine	Maisonneuve 1
Date
2012-01-01	35	0	38
2012-01-02	83	1	68
2012-01-03	135	2	104
2012-01-04	144	1	116
2012-01-05	197	2	124

5 rows × 3 columns

Part 2: take a look at the data¶

We have a dataframe:

In [12]:

bike_data[:3]

Out[12]:

	Berri 1	Côte-Sainte-Catherine	Maisonneuve 1
Date
2012-01-01	35	0	38
2012-01-02	83	1	68
2012-01-03	135	2	104

3 rows × 3 columns

In [13]:

bike_data.plot()

Out[13]:

<matplotlib.axes.AxesSubplot at 0x21100390>

/opt/anaconda/lib/python2.7/site-packages/matplotlib/font_manager.py:1224: UserWarning: findfont: Font family ['normal'] not found. Falling back to Bitstream Vera Sans
  (prop.get_family(), self.defaultFamily[fontext]))

In [13]:

bike_data.median()

Out[13]:

Berri 1                  3128.0
Côte-Sainte-Catherine    1269.0
Maisonneuve 1            2019.5
dtype: float64

In [14]:

bike_data.median().plot(kind='bar')

Out[14]:

<matplotlib.axes.AxesSubplot at 0x21e3c090>

/opt/anaconda/lib/python2.7/site-packages/matplotlib/font_manager.py:1224: UserWarning: findfont: Font family ['normal'] not found. Falling back to Bitstream Vera Sans
  (prop.get_family(), self.defaultFamily[fontext]))

Slicing dataframes¶

In [15]:

# column slice
column_slice = bike_data[['Berri 1', 'Maisonneuve 1']]
# row slice
column_slice[:3]

Out[15]:

	Berri 1	Maisonneuve 1
Date
2012-01-01	35	38
2012-01-02	83	68
2012-01-03	135	104

3 rows × 2 columns

Which days are the worst?¶

In [16]:

bike_data['Berri 1'] < 75

Out[16]:

Date
2012-01-01     True
2012-01-02    False
2012-01-03    False
2012-01-04    False
2012-01-05    False
2012-01-06    False
2012-01-07    False
2012-01-08    False
2012-01-09    False
2012-01-10    False
2012-01-11    False
2012-01-12    False
2012-01-13    False
2012-01-14     True
2012-01-15     True
...
2012-10-22    False
2012-10-23    False
2012-10-24    False
2012-10-25    False
2012-10-26    False
2012-10-27    False
2012-10-28    False
2012-10-29    False
2012-10-30    False
2012-10-31    False
2012-11-01    False
2012-11-02    False
2012-11-03    False
2012-11-04    False
2012-11-05    False
Name: Berri 1, Length: 310

Which days are the worst?¶

In [17]:

bike_data[bike_data['Berri 1'] < 75]

Out[17]:

	Berri 1	Côte-Sainte-Catherine	Maisonneuve 1
Date
2012-01-01	35	0	38
2012-01-14	32	0	54
2012-01-15	54	0	33
2012-01-21	53	0	47
2012-01-22	71	0	41
2012-02-05	72	0	46
2012-02-11	71	0	63
2012-02-25	62	0	48

8 rows × 3 columns

Part 2: Do more people bike on weekdays or weekends?¶

Step 1: add a 'weekday' column to our dataframe¶

In [18]:

bike_data['weekday'] = bike_data.index.weekday
bike_data.head()

Out[18]:

	Berri 1	Côte-Sainte-Catherine	Maisonneuve 1	weekday
Date
2012-01-01	35	0	38	6
2012-01-02	83	1	68	0
2012-01-03	135	2	104	1
2012-01-04	144	1	116	2
2012-01-05	197	2	124	3

5 rows × 4 columns

Step 2: Use .groupby() and .aggregate() to get the counts¶

In [19]:

counts_by_day = bike_data.groupby('weekday').aggregate(np.sum)
counts_by_day

Out[19]:

	Berri 1	Côte-Sainte-Catherine	Maisonneuve 1
weekday
0	134298	60329	90051
1	135305	58708	92035
2	152972	67344	104891
3	160131	69028	111895
4	141771	56446	98568
5	101578	34018	62067
6	99310	36466	55324

7 rows × 3 columns

Step 3: draw a graph!¶

In [20]:

counts_by_day.index = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
counts_by_day['Berri 1'].plot(kind='bar')

Out[20]:

<matplotlib.axes.AxesSubplot at 0x21aad6d0>

ZERO LOOPS!¶

There's more going on, though¶

In [19]:

bike_data['Berri 1'].plot()

Out[19]:

<matplotlib.axes.AxesSubplot at 0x57c2650>

Part 3: Grab some weather data and look at the temperatures¶

In [20]:

def get_weather_data(year):
    url_template = "http://climate.weather.gc.ca/climateData/bulkdata_e.html?format=csv&stationID=5415&Year={year}&Month={month}&timeframe=1&submit=Download+Data"
    # mctavish station: 10761, airport station: 5415
    data_by_month = []
    for month in range(1, 13):
        url = url_template.format(year=year, month=month)
        weather_data = pd.read_csv(url, skiprows=16, index_col='Date/Time', parse_dates=True).dropna(axis=1)
        weather_data.columns = map(lambda x: x.replace('\xb0', ''), weather_data.columns)
        weather_data = weather_data.drop(['Year', 'Day', 'Month', 'Time', 'Data Quality'], axis=1)
        data_by_month.append(weather_data.dropna())
    # Concatenate and drop any empty columns
    return pd.concat(data_by_month).dropna(axis=1, how='all').dropna()

In [21]:

weather_data = get_weather_data(2012)

In [22]:

weather_data[:5]

Out[22]:

	Temp (C)	Dew Point Temp (C)	Rel Hum (%)	Wind Spd (km/h)	Visibility (km)	Stn Press (kPa)	Weather
Date/Time
2012-01-01 00:00:00	-1.8	-3.9	86	4	8.0	101.24	Fog
2012-01-01 01:00:00	-1.8	-3.7	87	4	8.0	101.24	Fog
2012-01-01 02:00:00	-1.8	-3.4	89	7	4.0	101.26	Freezing Drizzle,Fog
2012-01-01 03:00:00	-1.5	-3.2	88	6	4.0	101.27	Freezing Drizzle,Fog
2012-01-01 04:00:00	-1.5	-3.3	88	7	4.8	101.23	Fog

5 rows × 7 columns

We need the temperatures every day, not every hour...¶

In [23]:

bike_data['mean temp'] = weather_data['Temp (C)'].resample('D', how='mean')

In [24]:

bike_data.head()

Out[24]:

	Berri 1	Côte-Sainte-Catherine	Maisonneuve 1	weekday	mean temp
Date
2012-01-01	35	0	38	6	0.629167
2012-01-02	83	1	68	0	0.041667
2012-01-03	135	2	104	1	-14.416667
2012-01-04	144	1	116	2	-13.645833
2012-01-05	197	2	124	3	-6.750000

5 rows × 5 columns

Bikers per day and temperature¶

In [25]:

bike_data[['Berri 1', 'mean temp']].plot(subplots=True)

Out[25]:

array([<matplotlib.axes.AxesSubplot object at 0x5bae610>,
       <matplotlib.axes.AxesSubplot object at 0x616d2d0>], dtype=object)

Do people bike when it's raining?¶

In [26]:

bike_data['Rain'] = weather_data['Weather'].str.contains('Rain').resample('D', how='mean')

In [27]:

bike_data[['Berri 1', 'Rain']].plot(subplots=True)

Out[27]:

array([<matplotlib.axes.AxesSubplot object at 0x6855b90>,
       <matplotlib.axes.AxesSubplot object at 0x6548b50>], dtype=object)

Some advice¶

Read (some of) the documentation
http://pandas.pydata.org/ has a 460-page PDF with lots of examples
Python for Data Analysis by Wes McKinney is great
Use numpy/pandas built-in operations, try not to write your own loops (though: see Numba, Cython, ...)

http://github.com/jvns/pandas-cookbook ¶

Thanks! Questions?

In [78]:

print 'Email:', julia['email']
print 'Twitter:', julia['twitter']
print 'Blog:', julia['website']
print 'Slides:',  julia['slides']

Email: julia@jvns.ca
Twitter: http://twitter.com/b0rk
Blog: http://jvns.ca
Slides: http://bit.ly/pycon-pandas

Hello! Welcome!¶

Diving into Open Data with IPython Notebook & Pandas¶

I'm Julia Evans¶

This talk¶

1. What are IPython Notebook & Pandas?¶

2. Practical examples of using them!¶

3. Advice.¶

IPython Notebook + Pandas + Numpy: what are they?¶

IPython Notebook¶

Pandas: No loops!¶

You don't need to write loops to do this!¶

Numpy makes Pandas fast¶

How do I install it?¶

Don't: Use the Ubuntu packages¶

Do: Use pip or Anaconda¶

How to run IPython Notebook¶

Practical examples!!!¶

Sensors on bike paths measure the number of cyclists¶

Part 1: Import the 2012 bike path data from a CSV¶

After¶

Part 2: take a look at the data¶

Slicing dataframes¶

Which days are the worst?¶

Which days are the worst?¶

Part 2: Do more people bike on weekdays or weekends?¶

Step 1: add a 'weekday' column to our dataframe¶

Step 2: Use .groupby() and .aggregate() to get the counts¶

Step 3: draw a graph!¶

ZERO LOOPS!¶

There's more going on, though¶

Part 3: Grab some weather data and look at the temperatures¶

We need the temperatures every day, not every hour...¶

Bikers per day and temperature¶

Do people bike when it's raining?¶

Some advice¶

http://github.com/jvns/pandas-cookbook¶

Thanks! Questions?

Do: Use pip or Anaconda ¶

http://github.com/jvns/pandas-cookbook ¶