Did you see my talk at PyCon 2014! You can use this notebook to try it out for yourself!
If you didn't, and you'd like to, the video is here: Diving into Open Data with IPython Notebook and Pandas. The pandas cookbook I mentioned is at http://github.com/jvns/pandas-cookbook.
IMPORTANT: To make this work, you'll need to
ipython notebook
This work is licensed under a Creative Commons Attribution 4.0 International License.
# Some imports we'll need
import numpy as np
import pandas as pd
julia = {'email': 'julia@jvns.ca', 'twitter': 'http://twitter.com/b0rk', 'slides': 'http://bit.ly/pycon-pandas', 'website': 'http://jvns.ca'}
Data at Stripe, work on Montréal All-Girl Hack Night, PyLadies MTL
You can follow along with this talk at:
print 'Email:', julia['email']
print 'Twitter:', julia['twitter']
print 'Blog:', julia['website']
Email: julia@jvns.ca Twitter: http://twitter.com/b0rk Blog: http://jvns.ca
py_list = range(20000000)
numpy_array = np.arange(20000000)
%%timeit
total = 0
for x in py_list:
x += total * total
1 loops, best of 3: 1.22 s per loop
%%timeit
np.sum(numpy_array * numpy_array)
10 loops, best of 3: 83.4 ms per loop
$ ipython notebook
import pandas as pd
import numpy as np
import matplotlib
# display graphs inline
%matplotlib inline
# Make graphs prettier
pd.set_option('display.max_columns', 15)
pd.set_option('display.line_width', 400)
pd.set_option('display.mpl_style', 'default')
# Make the fonts bigger
matplotlib.rc('figure', figsize=(14, 7))
matplotlib.rc('font', family='normal', weight='bold', size=22)
bike_data = pd.read_csv("./2012.csv")
bike_data[:5]
Date;Berri 1;Br�beuf (donn�es non disponibles);C�te-Sainte-Catherine;Maisonneuve 1;Maisonneuve 2;du Parc;Pierre-Dupuy;Rachel1;St-Urbain (donn�es non disponibles) | |
---|---|
0 | 01/01/2012;35;;0;38;51;26;10;16; |
1 | 02/01/2012;83;;1;68;153;53;6;43; |
2 | 03/01/2012;135;;2;104;248;89;3;58; |
3 | 04/01/2012;144;;1;116;318;111;8;61; |
4 | 05/01/2012;197;;2;124;330;97;13;95; |
5 rows × 1 columns
bike_data = pd.read_csv("./2012.csv", encoding='latin1', sep=';', index_col='Date', parse_dates=True, dayfirst=True)
# Get rid of missing columns
bike_data = bike_data.dropna(axis=1)
# Only use 3 of the columns so it all fits on the screen
bike_data = bike_data[['Berri 1', u'Côte-Sainte-Catherine', 'Maisonneuve 1']]
bike_data[:5]
Berri 1 | Côte-Sainte-Catherine | Maisonneuve 1 | |
---|---|---|---|
Date | |||
2012-01-01 | 35 | 0 | 38 |
2012-01-02 | 83 | 1 | 68 |
2012-01-03 | 135 | 2 | 104 |
2012-01-04 | 144 | 1 | 116 |
2012-01-05 | 197 | 2 | 124 |
5 rows × 3 columns
We have a dataframe:
bike_data[:3]
Berri 1 | Côte-Sainte-Catherine | Maisonneuve 1 | |
---|---|---|---|
Date | |||
2012-01-01 | 35 | 0 | 38 |
2012-01-02 | 83 | 1 | 68 |
2012-01-03 | 135 | 2 | 104 |
3 rows × 3 columns
bike_data.plot()
<matplotlib.axes.AxesSubplot at 0x21100390>
/opt/anaconda/lib/python2.7/site-packages/matplotlib/font_manager.py:1224: UserWarning: findfont: Font family ['normal'] not found. Falling back to Bitstream Vera Sans (prop.get_family(), self.defaultFamily[fontext]))
bike_data.median()
Berri 1 3128.0 Côte-Sainte-Catherine 1269.0 Maisonneuve 1 2019.5 dtype: float64
bike_data.median().plot(kind='bar')
<matplotlib.axes.AxesSubplot at 0x21e3c090>
/opt/anaconda/lib/python2.7/site-packages/matplotlib/font_manager.py:1224: UserWarning: findfont: Font family ['normal'] not found. Falling back to Bitstream Vera Sans (prop.get_family(), self.defaultFamily[fontext]))
# column slice
column_slice = bike_data[['Berri 1', 'Maisonneuve 1']]
# row slice
column_slice[:3]
Berri 1 | Maisonneuve 1 | |
---|---|---|
Date | ||
2012-01-01 | 35 | 38 |
2012-01-02 | 83 | 68 |
2012-01-03 | 135 | 104 |
3 rows × 2 columns
bike_data['Berri 1'] < 75
Date 2012-01-01 True 2012-01-02 False 2012-01-03 False 2012-01-04 False 2012-01-05 False 2012-01-06 False 2012-01-07 False 2012-01-08 False 2012-01-09 False 2012-01-10 False 2012-01-11 False 2012-01-12 False 2012-01-13 False 2012-01-14 True 2012-01-15 True ... 2012-10-22 False 2012-10-23 False 2012-10-24 False 2012-10-25 False 2012-10-26 False 2012-10-27 False 2012-10-28 False 2012-10-29 False 2012-10-30 False 2012-10-31 False 2012-11-01 False 2012-11-02 False 2012-11-03 False 2012-11-04 False 2012-11-05 False Name: Berri 1, Length: 310
bike_data[bike_data['Berri 1'] < 75]
Berri 1 | Côte-Sainte-Catherine | Maisonneuve 1 | |
---|---|---|---|
Date | |||
2012-01-01 | 35 | 0 | 38 |
2012-01-14 | 32 | 0 | 54 |
2012-01-15 | 54 | 0 | 33 |
2012-01-21 | 53 | 0 | 47 |
2012-01-22 | 71 | 0 | 41 |
2012-02-05 | 72 | 0 | 46 |
2012-02-11 | 71 | 0 | 63 |
2012-02-25 | 62 | 0 | 48 |
8 rows × 3 columns
bike_data['weekday'] = bike_data.index.weekday
bike_data.head()
Berri 1 | Côte-Sainte-Catherine | Maisonneuve 1 | weekday | |
---|---|---|---|---|
Date | ||||
2012-01-01 | 35 | 0 | 38 | 6 |
2012-01-02 | 83 | 1 | 68 | 0 |
2012-01-03 | 135 | 2 | 104 | 1 |
2012-01-04 | 144 | 1 | 116 | 2 |
2012-01-05 | 197 | 2 | 124 | 3 |
5 rows × 4 columns
counts_by_day = bike_data.groupby('weekday').aggregate(np.sum)
counts_by_day
Berri 1 | Côte-Sainte-Catherine | Maisonneuve 1 | |
---|---|---|---|
weekday | |||
0 | 134298 | 60329 | 90051 |
1 | 135305 | 58708 | 92035 |
2 | 152972 | 67344 | 104891 |
3 | 160131 | 69028 | 111895 |
4 | 141771 | 56446 | 98568 |
5 | 101578 | 34018 | 62067 |
6 | 99310 | 36466 | 55324 |
7 rows × 3 columns
counts_by_day.index = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
counts_by_day['Berri 1'].plot(kind='bar')
<matplotlib.axes.AxesSubplot at 0x21aad6d0>
bike_data['Berri 1'].plot()
<matplotlib.axes.AxesSubplot at 0x57c2650>
def get_weather_data(year):
url_template = "http://climate.weather.gc.ca/climateData/bulkdata_e.html?format=csv&stationID=5415&Year={year}&Month={month}&timeframe=1&submit=Download+Data"
# mctavish station: 10761, airport station: 5415
data_by_month = []
for month in range(1, 13):
url = url_template.format(year=year, month=month)
weather_data = pd.read_csv(url, skiprows=16, index_col='Date/Time', parse_dates=True).dropna(axis=1)
weather_data.columns = map(lambda x: x.replace('\xb0', ''), weather_data.columns)
weather_data = weather_data.drop(['Year', 'Day', 'Month', 'Time', 'Data Quality'], axis=1)
data_by_month.append(weather_data.dropna())
# Concatenate and drop any empty columns
return pd.concat(data_by_month).dropna(axis=1, how='all').dropna()
weather_data = get_weather_data(2012)
weather_data[:5]
Temp (C) | Dew Point Temp (C) | Rel Hum (%) | Wind Spd (km/h) | Visibility (km) | Stn Press (kPa) | Weather | |
---|---|---|---|---|---|---|---|
Date/Time | |||||||
2012-01-01 00:00:00 | -1.8 | -3.9 | 86 | 4 | 8.0 | 101.24 | Fog |
2012-01-01 01:00:00 | -1.8 | -3.7 | 87 | 4 | 8.0 | 101.24 | Fog |
2012-01-01 02:00:00 | -1.8 | -3.4 | 89 | 7 | 4.0 | 101.26 | Freezing Drizzle,Fog |
2012-01-01 03:00:00 | -1.5 | -3.2 | 88 | 6 | 4.0 | 101.27 | Freezing Drizzle,Fog |
2012-01-01 04:00:00 | -1.5 | -3.3 | 88 | 7 | 4.8 | 101.23 | Fog |
5 rows × 7 columns
bike_data['mean temp'] = weather_data['Temp (C)'].resample('D', how='mean')
bike_data.head()
Berri 1 | Côte-Sainte-Catherine | Maisonneuve 1 | weekday | mean temp | |
---|---|---|---|---|---|
Date | |||||
2012-01-01 | 35 | 0 | 38 | 6 | 0.629167 |
2012-01-02 | 83 | 1 | 68 | 0 | 0.041667 |
2012-01-03 | 135 | 2 | 104 | 1 | -14.416667 |
2012-01-04 | 144 | 1 | 116 | 2 | -13.645833 |
2012-01-05 | 197 | 2 | 124 | 3 | -6.750000 |
5 rows × 5 columns
bike_data[['Berri 1', 'mean temp']].plot(subplots=True)
array([<matplotlib.axes.AxesSubplot object at 0x5bae610>, <matplotlib.axes.AxesSubplot object at 0x616d2d0>], dtype=object)
bike_data['Rain'] = weather_data['Weather'].str.contains('Rain').resample('D', how='mean')
bike_data[['Berri 1', 'Rain']].plot(subplots=True)
array([<matplotlib.axes.AxesSubplot object at 0x6855b90>, <matplotlib.axes.AxesSubplot object at 0x6548b50>], dtype=object)
print 'Email:', julia['email']
print 'Twitter:', julia['twitter']
print 'Blog:', julia['website']
print 'Slides:', julia['slides']
Email: julia@jvns.ca Twitter: http://twitter.com/b0rk Blog: http://jvns.ca Slides: http://bit.ly/pycon-pandas