Intro to Pandas¶

Pandas is a Python library that builds on NumPy to provide convenient data structures for "tabular" datasets. From the documentation:

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive.

With NumPy arrays you are mostly limited to numeric indexing. Pandas gives you the ability to label data by row and column with rich, sophisticated indexes. If you are familar with Excel or SQL, Pandas provides features similar to what you might find in those.

In [1]:

from datetime import datetime
import pandas as pd

Loading Data¶

We're going to work with a dataset of precipitation data from weather stations around California. It contains monthly totals of precipitation at each station since 2011. It was parsed from public data like this: http://cdec.water.ca.gov/cgi-progs/reports/PRECIPOUT.2011. (The work I did to parse the data is available.)

Here's a look at the CSV file of data:

In [2]:

!head precip_data.csv

region,subregion,station,abbreviation,elevation,year,month,precip,avg precip,pct of avg
NORTH COAST,SMITH RIVER,Gasquet Ranger Station,GAS,384,2011,Oct,8.4,7.53,112.0
NORTH COAST,SMITH RIVER,Gasquet Ranger Station,GAS,384,2011,Nov,12.42,14.14,88.0
NORTH COAST,SMITH RIVER,Gasquet Ranger Station,GAS,384,2011,Dec,25.0,16.37,153.0
NORTH COAST,SMITH RIVER,Gasquet Ranger Station,GAS,384,2011,Jan,4.91,16.45,30.0
NORTH COAST,SMITH RIVER,Gasquet Ranger Station,GAS,384,2011,Feb,9.53,11.95,80.0
NORTH COAST,SMITH RIVER,Gasquet Ranger Station,GAS,384,2011,Mar,26.47,11.08,239.0
NORTH COAST,SMITH RIVER,Gasquet Ranger Station,GAS,384,2011,Apr,10.25,6.47,158.0
NORTH COAST,SMITH RIVER,Gasquet Ranger Station,GAS,384,2011,May,4.67,4.43,105.0
NORTH COAST,SMITH RIVER,Gasquet Ranger Station,GAS,384,2011,Jun,1.62,0.83,195.0

We'll use Pandas's read_csv function to load the CSV file. read_csv can parse data into dates. The default date parser is usually pretty good, but with this data it'll need some help from the datetime module.

In [3]:

def date_parser(month, year):
    if '-' not in month:
        return datetime.strptime('{} {}'.format(month, year), '%b %Y')
    else:
        # some "months" are not actual months, but ranges
        # that cover a "Water Year".
        # we'll label those with just the year value.
        return datetime(int(year), 1, 1)

In [4]:

df = pd.read_csv('precip_data.csv', parse_dates={'date': ['month', 'year']}, 
                 keep_date_col=True, date_parser=date_parser)

In [5]:

df.head()

Out[5]:

	date	region	subregion	station	abbreviation	elevation	year	month	precip	avg precip	pct of avg
0	2011-10-01	NORTH COAST	SMITH RIVER	Gasquet Ranger Station	GAS	384	2011	Oct	8.40	7.53	112
1	2011-11-01	NORTH COAST	SMITH RIVER	Gasquet Ranger Station	GAS	384	2011	Nov	12.42	14.14	88
2	2011-12-01	NORTH COAST	SMITH RIVER	Gasquet Ranger Station	GAS	384	2011	Dec	25.00	16.37	153
3	2011-01-01	NORTH COAST	SMITH RIVER	Gasquet Ranger Station	GAS	384	2011	Jan	4.91	16.45	30
4	2011-02-01	NORTH COAST	SMITH RIVER	Gasquet Ranger Station	GAS	384	2011	Feb	9.53	11.95	80

Basic Indexing¶

Pandas' basic tabular data structure is the DataFrame. You can grab individual columns:

In [6]:

df['station']

Out[6]:

0     Gasquet Ranger Station
1     Gasquet Ranger Station
2     Gasquet Ranger Station
3     Gasquet Ranger Station
4     Gasquet Ranger Station
5     Gasquet Ranger Station
6     Gasquet Ranger Station
7     Gasquet Ranger Station
8     Gasquet Ranger Station
9     Gasquet Ranger Station
10    Gasquet Ranger Station
11    Gasquet Ranger Station
12    Gasquet Ranger Station
13    Gasquet Ranger Station
14                  Callahan
...
9665             Niland
9666             Niland
9667    Imperial Valley
9668    Imperial Valley
9669    Imperial Valley
9670    Imperial Valley
9671    Imperial Valley
9672    Imperial Valley
9673    Imperial Valley
9674    Imperial Valley
9675    Imperial Valley
9676    Imperial Valley
9677    Imperial Valley
9678    Imperial Valley
9679    Imperial Valley
Name: station, Length: 9680, dtype: object

And individual rows:

In [7]:

df.loc[8]

Out[7]:

date               2011-06-01 00:00:00
region                     NORTH COAST
subregion                  SMITH RIVER
station         Gasquet Ranger Station
abbreviation                       GAS
elevation                          384
year                              2011
month                              Jun
precip                            1.62
avg precip                        0.83
pct of avg                         195
Name: 8, dtype: object

The one dimensional structure returned when indexing a single column or row is called a Series. It's similar to a NumPy array, but with an extra index. Individual items can be accessed by their index, or by their location:

In [8]:

row = df.loc[8]
print('Station:', row['station'])
print('Abbreviation:', row.iloc[4])

Station: Gasquet Ranger Station
Abbreviation: GAS

Series objects have many of the same methods as NumPy arrays, such as mean, sum, sort, etc.

Filtering¶

Filtering Pandas objects is even more powerful than with NumPy arrays. The DataFrame we're working with has rows for individual months and rows for the entire year. It's hard to have both in the same table, so we can separate them. One way is with boolean filtering:

In [16]:

year_obs = df.loc[df.month.isin(['Oct-Sep', 'Oct-Aug'])]
year_obs.head()

Out[16]:

	date	region	subregion	station	abbreviation	elevation	year	month	precip	avg precip	pct of avg
12	2011-01-01	NORTH COAST	SMITH RIVER	Gasquet Ranger Station	GAS	384	2011	Oct-Sep	104.00	92.33	113
26	2011-01-01	NORTH COAST	KLAMATH RIVER	Callahan	CAL	3185	2011	Oct-Sep	23.70	20.05	118
40	2011-01-01	NORTH COAST	KLAMATH RIVER	Fort Jones RS	FJN	2725	2011	Oct-Sep	23.19	21.85	106
54	2011-01-01	NORTH COAST	KLAMATH RIVER	Yreka	YRK	2625	2011	Oct-Sep	22.47	18.28	123
68	2011-01-01	NORTH COAST	KLAMATH RIVER	Happy Camp RS	HAP	1120	2011	Oct-Sep	52.33	54.09	97

Another option for filtering is the DataFrame.query method.

In [21]:

monthly_obs = df.query('month not in ["Oct-Sep", "Oct-Aug", "Wat-Yr"]')

In [15]:

monthly_obs.head()

Out[15]:

	date	region	subregion	station	abbreviation	elevation	year	month	precip	avg precip	pct of avg
0	2011-10-01	NORTH COAST	SMITH RIVER	Gasquet Ranger Station	GAS	384	2011	Oct	8.40	7.53	112
1	2011-11-01	NORTH COAST	SMITH RIVER	Gasquet Ranger Station	GAS	384	2011	Nov	12.42	14.14	88
2	2011-12-01	NORTH COAST	SMITH RIVER	Gasquet Ranger Station	GAS	384	2011	Dec	25.00	16.37	153
3	2011-01-01	NORTH COAST	SMITH RIVER	Gasquet Ranger Station	GAS	384	2011	Jan	4.91	16.45	30
4	2011-02-01	NORTH COAST	SMITH RIVER	Gasquet Ranger Station	GAS	384	2011	Feb	9.53	11.95	80

Creating Indexes¶

We may want to be able to select individual rows by something other than some number, like by actually using the information in the rows. We could use filtering, but we can also take advantage of Pandas' indexing. Each row in the monthly_obs table is uniquely identified by a combination of abbreviation, year, and month. Let's make that the index!

In [33]:

monthly_obs = monthly_obs.set_index(['abbreviation', 'year', 'month'])
monthly_obs.head()

Out[33]:

			date	region	subregion	station	elevation	precip	avg precip	pct of avg
abbreviation	year	month
GAS	2011	Oct	2011-10-01	NORTH COAST	SMITH RIVER	Gasquet Ranger Station	384	8.40	7.53	112
		Nov	2011-11-01	NORTH COAST	SMITH RIVER	Gasquet Ranger Station	384	12.42	14.14	88
		Dec	2011-12-01	NORTH COAST	SMITH RIVER	Gasquet Ranger Station	384	25.00	16.37	153
		Jan	2011-01-01	NORTH COAST	SMITH RIVER	Gasquet Ranger Station	384	4.91	16.45	30
		Feb	2011-02-01	NORTH COAST	SMITH RIVER	Gasquet Ranger Station	384	9.53	11.95	80

In [34]:

monthly_obs.loc[('GAS', '2013', 'Jun')]

Out[34]:

date             2013-06-01 00:00:00
region                   NORTH COAST
subregion                SMITH RIVER
station       Gasquet Ranger Station
elevation                        384
precip                          2.34
avg precip                      0.83
pct of avg                       282
Name: (GAS, 2013, Jun), dtype: object

And it's easy to move columns back into the table from the index:

In [36]:

monthly_obs = monthly_obs.reset_index()
monthly_obs.head()

Out[36]:

	abbreviation	year	month	date	region	subregion	station	elevation	precip	avg precip	pct of avg
0	GAS	2011	Oct	2011-10-01	NORTH COAST	SMITH RIVER	Gasquet Ranger Station	384	8.40	7.53	112
1	GAS	2011	Nov	2011-11-01	NORTH COAST	SMITH RIVER	Gasquet Ranger Station	384	12.42	14.14	88
2	GAS	2011	Dec	2011-12-01	NORTH COAST	SMITH RIVER	Gasquet Ranger Station	384	25.00	16.37	153
3	GAS	2011	Jan	2011-01-01	NORTH COAST	SMITH RIVER	Gasquet Ranger Station	384	4.91	16.45	30
4	GAS	2011	Feb	2011-02-01	NORTH COAST	SMITH RIVER	Gasquet Ranger Station	384	9.53	11.95	80

Group By¶

One of Pandas' most powerful features is groupby. "Group By" separates table rows into groups based on some commonality and allows you to iterate over those groups or intelligently perform reductions on them. It's a fantastic way to slice-and-dice data to get high level understanding. With our precipitation data we may want to look at it by year:

In [38]:

monthly_obs.groupby(['year']).mean()

Out[38]:

	elevation	precip	avg precip	pct of avg
year
2011	2746.403409	3.259749	2.348556	161.191063
2012	2746.403409	1.818086	2.348556	79.549051
2013	2746.403409	1.805370	2.348556	79.644730
2014	2746.403409	1.456295	2.519365	58.693638

Or by year and region:

In [51]:

monthly_obs.groupby(['year', 'region'])['pct of avg'].mean()

Out[51]:

year  region           
2011  CENTRAL COAST        177.456000
      COLORADO RIVER       101.362319
      NORTH COAST          126.065217
      NORTH LAHONTAN       138.477612
      SACRAMENTO RIVER     146.660131
      SAN FRANCISCO BAY    190.696970
      SAN JOAQUIN RIVER    208.258065
      SOUTH COAST          153.238710
      SOUTH LAHONTAN       123.261146
      TULARE LAKE          193.730408
2012  CENTRAL COAST         74.982143
      COLORADO RIVER        86.153846
      NORTH COAST           96.670330
      NORTH LAHONTAN        63.569231
      SACRAMENTO RIVER      77.905830
      SAN FRANCISCO BAY     70.409091
      SAN JOAQUIN RIVER     69.890566
      SOUTH COAST          117.828947
      SOUTH LAHONTAN        67.879747
      TULARE LAKE           76.312500
2013  CENTRAL COAST         47.000000
      COLORADO RIVER        63.727273
      NORTH COAST          109.677632
      NORTH LAHONTAN        84.864000
      SACRAMENTO RIVER      97.772523
      SAN FRANCISCO BAY     98.444444
      SAN JOAQUIN RIVER     58.025532
      SOUTH COAST           73.176471
      SOUTH LAHONTAN        68.322148
      TULARE LAKE           71.631068
2014  CENTRAL COAST         37.000000
      COLORADO RIVER        64.771930
      NORTH COAST           50.150000
      NORTH LAHONTAN        78.877049
      SACRAMENTO RIVER      59.259358
      SAN FRANCISCO BAY     50.711864
      SAN JOAQUIN RIVER     58.622807
      SOUTH COAST           59.023438
      SOUTH LAHONTAN        79.812950
      TULARE LAKE           50.175439
Name: pct of avg, dtype: float64

Or by station and year:

In [39]:

monthly_obs.groupby(['station', 'year'])['pct of avg'].mean()

Out[39]:

station                   year
Adin RS                   2011    103.000000
                          2012     54.416667
                          2013     66.727273
                          2014     58.818182
Alturas RS                2011    101.916667
                          2012     57.272727
                          2013     80.300000
                          2014     55.000000
Angwin Pacific Union Col  2011    209.800000
                          2012     61.454545
                          2013    130.000000
                          2014     40.500000
Ash Mountain              2011    175.166667
                          2012     55.000000
                          2013     54.900000
...
West Point             2012     82.363636
                       2013     78.000000
                       2014     42.181818
Woodland 1 Wnw         2011    242.000000
                       2012     81.625000
                       2013     95.500000
                       2014     46.600000
Yosemite Headquarters  2011    182.583333
                       2012     60.833333
                       2013     59.500000
                       2014     54.272727
Yreka                  2011    139.333333
                       2012     74.250000
                       2013     93.166667
                       2014     55.181818
Name: pct of avg, Length: 704, dtype: float64

Plotting¶

For quick inspection Pandas as builtin plotting support using matplotlib.

In [40]:

%matplotlib inline

In [48]:

monthly_obs.groupby('date')['pct of avg'].mean().plot()

Out[48]:

<matplotlib.axes._subplots.AxesSubplot at 0x1075fe9e8>

In [49]:

ordered_months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 
                  'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
monthly_obs.groupby('month').precip.sum().loc[ordered_months].plot()

Out[49]:

<matplotlib.axes._subplots.AxesSubplot at 0x108aa19b0>

In [ ]: