Pandas is a Python library that builds on NumPy to provide convenient data structures for "tabular" datasets. From the documentation:
pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive.
With NumPy arrays you are mostly limited to numeric indexing. Pandas gives you the ability to label data by row and column with rich, sophisticated indexes. If you are familar with Excel or SQL, Pandas provides features similar to what you might find in those.
from datetime import datetime
import pandas as pd
We're going to work with a dataset of precipitation data from weather stations around California. It contains monthly totals of precipitation at each station since 2011. It was parsed from public data like this: http://cdec.water.ca.gov/cgi-progs/reports/PRECIPOUT.2011. (The work I did to parse the data is available.)
Here's a look at the CSV file of data:
!head precip_data.csv
region,subregion,station,abbreviation,elevation,year,month,precip,avg precip,pct of avg NORTH COAST,SMITH RIVER,Gasquet Ranger Station,GAS,384,2011,Oct,8.4,7.53,112.0 NORTH COAST,SMITH RIVER,Gasquet Ranger Station,GAS,384,2011,Nov,12.42,14.14,88.0 NORTH COAST,SMITH RIVER,Gasquet Ranger Station,GAS,384,2011,Dec,25.0,16.37,153.0 NORTH COAST,SMITH RIVER,Gasquet Ranger Station,GAS,384,2011,Jan,4.91,16.45,30.0 NORTH COAST,SMITH RIVER,Gasquet Ranger Station,GAS,384,2011,Feb,9.53,11.95,80.0 NORTH COAST,SMITH RIVER,Gasquet Ranger Station,GAS,384,2011,Mar,26.47,11.08,239.0 NORTH COAST,SMITH RIVER,Gasquet Ranger Station,GAS,384,2011,Apr,10.25,6.47,158.0 NORTH COAST,SMITH RIVER,Gasquet Ranger Station,GAS,384,2011,May,4.67,4.43,105.0 NORTH COAST,SMITH RIVER,Gasquet Ranger Station,GAS,384,2011,Jun,1.62,0.83,195.0
We'll use Pandas's read_csv function
to load the CSV file.
read_csv
can parse data into dates.
The default date parser is usually pretty good, but with this data it'll need some help
from the datetime module.
def date_parser(month, year):
if '-' not in month:
return datetime.strptime('{} {}'.format(month, year), '%b %Y')
else:
# some "months" are not actual months, but ranges
# that cover a "Water Year".
# we'll label those with just the year value.
return datetime(int(year), 1, 1)
df = pd.read_csv('precip_data.csv', parse_dates={'date': ['month', 'year']},
keep_date_col=True, date_parser=date_parser)
df.head()
date | region | subregion | station | abbreviation | elevation | year | month | precip | avg precip | pct of avg | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2011-10-01 | NORTH COAST | SMITH RIVER | Gasquet Ranger Station | GAS | 384 | 2011 | Oct | 8.40 | 7.53 | 112 |
1 | 2011-11-01 | NORTH COAST | SMITH RIVER | Gasquet Ranger Station | GAS | 384 | 2011 | Nov | 12.42 | 14.14 | 88 |
2 | 2011-12-01 | NORTH COAST | SMITH RIVER | Gasquet Ranger Station | GAS | 384 | 2011 | Dec | 25.00 | 16.37 | 153 |
3 | 2011-01-01 | NORTH COAST | SMITH RIVER | Gasquet Ranger Station | GAS | 384 | 2011 | Jan | 4.91 | 16.45 | 30 |
4 | 2011-02-01 | NORTH COAST | SMITH RIVER | Gasquet Ranger Station | GAS | 384 | 2011 | Feb | 9.53 | 11.95 | 80 |
Pandas' basic tabular data structure is the DataFrame. You can grab individual columns:
df['station']
0 Gasquet Ranger Station 1 Gasquet Ranger Station 2 Gasquet Ranger Station 3 Gasquet Ranger Station 4 Gasquet Ranger Station 5 Gasquet Ranger Station 6 Gasquet Ranger Station 7 Gasquet Ranger Station 8 Gasquet Ranger Station 9 Gasquet Ranger Station 10 Gasquet Ranger Station 11 Gasquet Ranger Station 12 Gasquet Ranger Station 13 Gasquet Ranger Station 14 Callahan ... 9665 Niland 9666 Niland 9667 Imperial Valley 9668 Imperial Valley 9669 Imperial Valley 9670 Imperial Valley 9671 Imperial Valley 9672 Imperial Valley 9673 Imperial Valley 9674 Imperial Valley 9675 Imperial Valley 9676 Imperial Valley 9677 Imperial Valley 9678 Imperial Valley 9679 Imperial Valley Name: station, Length: 9680, dtype: object
And individual rows:
df.loc[8]
date 2011-06-01 00:00:00 region NORTH COAST subregion SMITH RIVER station Gasquet Ranger Station abbreviation GAS elevation 384 year 2011 month Jun precip 1.62 avg precip 0.83 pct of avg 195 Name: 8, dtype: object
The one dimensional structure returned when indexing a single column or row is called a Series. It's similar to a NumPy array, but with an extra index. Individual items can be accessed by their index, or by their location:
row = df.loc[8]
print('Station:', row['station'])
print('Abbreviation:', row.iloc[4])
Station: Gasquet Ranger Station Abbreviation: GAS
Series objects have many of the same methods as NumPy arrays, such as mean, sum, sort, etc.
Filtering Pandas objects is even more powerful than with NumPy arrays. The DataFrame we're working with has rows for individual months and rows for the entire year. It's hard to have both in the same table, so we can separate them. One way is with boolean filtering:
year_obs = df.loc[df.month.isin(['Oct-Sep', 'Oct-Aug'])]
year_obs.head()
date | region | subregion | station | abbreviation | elevation | year | month | precip | avg precip | pct of avg | |
---|---|---|---|---|---|---|---|---|---|---|---|
12 | 2011-01-01 | NORTH COAST | SMITH RIVER | Gasquet Ranger Station | GAS | 384 | 2011 | Oct-Sep | 104.00 | 92.33 | 113 |
26 | 2011-01-01 | NORTH COAST | KLAMATH RIVER | Callahan | CAL | 3185 | 2011 | Oct-Sep | 23.70 | 20.05 | 118 |
40 | 2011-01-01 | NORTH COAST | KLAMATH RIVER | Fort Jones RS | FJN | 2725 | 2011 | Oct-Sep | 23.19 | 21.85 | 106 |
54 | 2011-01-01 | NORTH COAST | KLAMATH RIVER | Yreka | YRK | 2625 | 2011 | Oct-Sep | 22.47 | 18.28 | 123 |
68 | 2011-01-01 | NORTH COAST | KLAMATH RIVER | Happy Camp RS | HAP | 1120 | 2011 | Oct-Sep | 52.33 | 54.09 | 97 |
Another option for filtering is the DataFrame.query method.
monthly_obs = df.query('month not in ["Oct-Sep", "Oct-Aug", "Wat-Yr"]')
monthly_obs.head()
date | region | subregion | station | abbreviation | elevation | year | month | precip | avg precip | pct of avg | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2011-10-01 | NORTH COAST | SMITH RIVER | Gasquet Ranger Station | GAS | 384 | 2011 | Oct | 8.40 | 7.53 | 112 |
1 | 2011-11-01 | NORTH COAST | SMITH RIVER | Gasquet Ranger Station | GAS | 384 | 2011 | Nov | 12.42 | 14.14 | 88 |
2 | 2011-12-01 | NORTH COAST | SMITH RIVER | Gasquet Ranger Station | GAS | 384 | 2011 | Dec | 25.00 | 16.37 | 153 |
3 | 2011-01-01 | NORTH COAST | SMITH RIVER | Gasquet Ranger Station | GAS | 384 | 2011 | Jan | 4.91 | 16.45 | 30 |
4 | 2011-02-01 | NORTH COAST | SMITH RIVER | Gasquet Ranger Station | GAS | 384 | 2011 | Feb | 9.53 | 11.95 | 80 |
We may want to be able to select individual rows by something other than
some number, like by actually using the information in the rows.
We could use filtering, but we can also take advantage of Pandas' indexing.
Each row in the monthly_obs
table is uniquely identified by a combination
of abbreviation, year, and month.
Let's make that the index!
monthly_obs = monthly_obs.set_index(['abbreviation', 'year', 'month'])
monthly_obs.head()
date | region | subregion | station | elevation | precip | avg precip | pct of avg | |||
---|---|---|---|---|---|---|---|---|---|---|
abbreviation | year | month | ||||||||
GAS | 2011 | Oct | 2011-10-01 | NORTH COAST | SMITH RIVER | Gasquet Ranger Station | 384 | 8.40 | 7.53 | 112 |
Nov | 2011-11-01 | NORTH COAST | SMITH RIVER | Gasquet Ranger Station | 384 | 12.42 | 14.14 | 88 | ||
Dec | 2011-12-01 | NORTH COAST | SMITH RIVER | Gasquet Ranger Station | 384 | 25.00 | 16.37 | 153 | ||
Jan | 2011-01-01 | NORTH COAST | SMITH RIVER | Gasquet Ranger Station | 384 | 4.91 | 16.45 | 30 | ||
Feb | 2011-02-01 | NORTH COAST | SMITH RIVER | Gasquet Ranger Station | 384 | 9.53 | 11.95 | 80 |
monthly_obs.loc[('GAS', '2013', 'Jun')]
date 2013-06-01 00:00:00 region NORTH COAST subregion SMITH RIVER station Gasquet Ranger Station elevation 384 precip 2.34 avg precip 0.83 pct of avg 282 Name: (GAS, 2013, Jun), dtype: object
And it's easy to move columns back into the table from the index:
monthly_obs = monthly_obs.reset_index()
monthly_obs.head()
abbreviation | year | month | date | region | subregion | station | elevation | precip | avg precip | pct of avg | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | GAS | 2011 | Oct | 2011-10-01 | NORTH COAST | SMITH RIVER | Gasquet Ranger Station | 384 | 8.40 | 7.53 | 112 |
1 | GAS | 2011 | Nov | 2011-11-01 | NORTH COAST | SMITH RIVER | Gasquet Ranger Station | 384 | 12.42 | 14.14 | 88 |
2 | GAS | 2011 | Dec | 2011-12-01 | NORTH COAST | SMITH RIVER | Gasquet Ranger Station | 384 | 25.00 | 16.37 | 153 |
3 | GAS | 2011 | Jan | 2011-01-01 | NORTH COAST | SMITH RIVER | Gasquet Ranger Station | 384 | 4.91 | 16.45 | 30 |
4 | GAS | 2011 | Feb | 2011-02-01 | NORTH COAST | SMITH RIVER | Gasquet Ranger Station | 384 | 9.53 | 11.95 | 80 |
One of Pandas' most powerful features is groupby. "Group By" separates table rows into groups based on some commonality and allows you to iterate over those groups or intelligently perform reductions on them. It's a fantastic way to slice-and-dice data to get high level understanding. With our precipitation data we may want to look at it by year:
monthly_obs.groupby(['year']).mean()
elevation | precip | avg precip | pct of avg | |
---|---|---|---|---|
year | ||||
2011 | 2746.403409 | 3.259749 | 2.348556 | 161.191063 |
2012 | 2746.403409 | 1.818086 | 2.348556 | 79.549051 |
2013 | 2746.403409 | 1.805370 | 2.348556 | 79.644730 |
2014 | 2746.403409 | 1.456295 | 2.519365 | 58.693638 |
Or by year and region:
monthly_obs.groupby(['year', 'region'])['pct of avg'].mean()
year region 2011 CENTRAL COAST 177.456000 COLORADO RIVER 101.362319 NORTH COAST 126.065217 NORTH LAHONTAN 138.477612 SACRAMENTO RIVER 146.660131 SAN FRANCISCO BAY 190.696970 SAN JOAQUIN RIVER 208.258065 SOUTH COAST 153.238710 SOUTH LAHONTAN 123.261146 TULARE LAKE 193.730408 2012 CENTRAL COAST 74.982143 COLORADO RIVER 86.153846 NORTH COAST 96.670330 NORTH LAHONTAN 63.569231 SACRAMENTO RIVER 77.905830 SAN FRANCISCO BAY 70.409091 SAN JOAQUIN RIVER 69.890566 SOUTH COAST 117.828947 SOUTH LAHONTAN 67.879747 TULARE LAKE 76.312500 2013 CENTRAL COAST 47.000000 COLORADO RIVER 63.727273 NORTH COAST 109.677632 NORTH LAHONTAN 84.864000 SACRAMENTO RIVER 97.772523 SAN FRANCISCO BAY 98.444444 SAN JOAQUIN RIVER 58.025532 SOUTH COAST 73.176471 SOUTH LAHONTAN 68.322148 TULARE LAKE 71.631068 2014 CENTRAL COAST 37.000000 COLORADO RIVER 64.771930 NORTH COAST 50.150000 NORTH LAHONTAN 78.877049 SACRAMENTO RIVER 59.259358 SAN FRANCISCO BAY 50.711864 SAN JOAQUIN RIVER 58.622807 SOUTH COAST 59.023438 SOUTH LAHONTAN 79.812950 TULARE LAKE 50.175439 Name: pct of avg, dtype: float64
Or by station and year:
monthly_obs.groupby(['station', 'year'])['pct of avg'].mean()
station year Adin RS 2011 103.000000 2012 54.416667 2013 66.727273 2014 58.818182 Alturas RS 2011 101.916667 2012 57.272727 2013 80.300000 2014 55.000000 Angwin Pacific Union Col 2011 209.800000 2012 61.454545 2013 130.000000 2014 40.500000 Ash Mountain 2011 175.166667 2012 55.000000 2013 54.900000 ... West Point 2012 82.363636 2013 78.000000 2014 42.181818 Woodland 1 Wnw 2011 242.000000 2012 81.625000 2013 95.500000 2014 46.600000 Yosemite Headquarters 2011 182.583333 2012 60.833333 2013 59.500000 2014 54.272727 Yreka 2011 139.333333 2012 74.250000 2013 93.166667 2014 55.181818 Name: pct of avg, Length: 704, dtype: float64
For quick inspection Pandas as builtin plotting support using matplotlib.
%matplotlib inline
monthly_obs.groupby('date')['pct of avg'].mean().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x1075fe9e8>
ordered_months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
monthly_obs.groupby('month').precip.sum().loc[ordered_months].plot()
<matplotlib.axes._subplots.AxesSubplot at 0x108aa19b0>