%load_ext load_style
%load_style talk.css
from IPython.display import Image, HTML
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
Pandas is a Python package providing fast, flexible, and expressive data structures designed to work with relational or labeled data. It is a fundamental high-level building block for doing practical, real world data analysis in Python.
pandas is well suited for:

  * tabular data with heterogeneously-typed columns (as in an SQL table or Excel spreadsheet)
  * ordered and unordered (not necessarily fixed-frequency) time series data
  * arbitrary matrix data with row and column labels
  * any other form of observational / statistical data sets

Key features:

  * easy handling of missing data
  * automatic alignment of data on labels
  * powerful group by functionality (split-apply-combine)
  * robust IO tools for csv, Excel, and other file formats
  * time-series-specific functionality (resampling, shifting, date ranges)
HTML('<iframe src=http://pandas.pydata.org/index.html width=900 height=350></iframe>')
The conventional way to import pandas is `import pandas as pd`
import pandas as pd
some option settings
pd.set_option("display.width", 80)
# uncomment the line below if you don't want DataFrames displayed as HTML tables
#pd.set_option("notebook_repr_html", False)
pd.set_option("notebook_repr_html", True)
pandas's data structures and functionality will be familiar to R users; there is a section on the pandas website where Wes McKinney gives translations of common idioms / operations between R and pandas
HTML('<iframe src=http://pandas.pydata.org/pandas-docs/stable/comparison_with_r.html#compare-with-r width=900 height=350></iframe>')
A Series is a single vector of data values (think of a one-dimensional NumPy array) with an index that labels each element in the vector.
a = pd.Series(np.random.normal(0,1,(10,)))
a
0    0.037009
1    0.593607
2   -1.623254
3    0.294439
4    2.162388
5   -0.570866
6    1.671018
7    1.399286
8   -0.573680
9   -0.160413
dtype: float64
a.index
Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')
a.values
array([ 0.03700874, 0.59360691, -1.6232544 , 0.29443856, 2.16238755, -0.57086554, 1.67101806, 1.39928642, -0.57367972, -0.16041259])
a = pd.Series(np.random.normal(0,1,(10,)), index=np.arange(1,11))
a
1     0.950278
2    -0.206156
3    -0.136064
4     0.247603
5    -0.228526
6    -0.199513
7    -0.071164
8     1.546748
9    -0.712044
10    0.214061
dtype: float64
a = pd.Series(np.random.normal(0,1,5), \
index=['a','b','c','d','e'], name='my series')
a
a   -0.448799
b   -0.410785
c   -1.607073
d   -1.217413
e    0.024734
Name: my series, dtype: float64
Pandas objects expose some powerful, high level plotting functions (built on top of Matplotlib)
plot = a.plot(kind='bar', rot=0, color='w', hatch='///', title=a.name, width=0.9, grid=False)
f, ax = plt.subplots()
bars = ax.bar(np.arange(len(a)), a.values, color='w', edgecolor='k', align='center', hatch='///')
ax.set_xticks(np.arange(len(a)))
ax.set_xlim(-0.5, len(a)-0.5)
ax.set_xticklabels(a.index)
ax.set_title(a.name)
Selecting from a Series is easy, using the corresponding index key (like a dict)
a
a   -0.448799
b   -0.410785
c   -1.607073
d   -1.217413
e    0.024734
Name: my series, dtype: float64
a['c']
-1.607072752509346
slices are permitted
a['a':'c'] ### note the difference with standard Python / NumPy positional indexing: label slices INCLUDE the endpoint
a   -0.448799
b   -0.410785
c   -1.607073
Name: my series, dtype: float64
a['c':]
c   -1.607073
d   -1.217413
e    0.024734
Name: my series, dtype: float64
deleting an element
a.drop('d')  # returns a new Series; a itself is unchanged
a   -0.448799
b   -0.410785
c   -1.607073
e    0.024734
Name: my series, dtype: float64
Adding an element to a Series, or modifying one in place: use the `loc` indexer
a
a   -0.448799
b   -0.410785
c   -1.607073
d   -1.217413
e    0.024734
Name: my series, dtype: float64
a.loc['e'] = 1
a
a   -0.448799
b   -0.410785
c   -1.607073
d   -1.217413
e    1.000000
Name: my series, dtype: float64
a.loc['f'] = 5
a
a   -0.448799
b   -0.410785
c   -1.607073
d   -1.217413
e    1.000000
f    5.000000
Name: my series, dtype: float64
Mathematical operations involving two series will perform operations by aligning indices.
Indices that do not match are given the value NaN (not a number), and values are computed for all unique pairs of repeated indices.
s1 = pd.Series(np.arange(1.0,4.0),index=['a','b','c'])
s2 = pd.Series(np.arange(1.0,4.0),index=['b','c','d'])
s3 = s1 + s2
s3
a   NaN
b     3
c     5
d   NaN
dtype: float64
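To make the "repeated indices" point concrete, here is a minimal sketch (with made-up values): when a label appears more than once, the operation is computed for every matching pair of elements.

```python
import pandas as pd

# 'a' appears twice on the left; each occurrence is paired with the single 'a' on the right
s1 = pd.Series([1.0, 2.0], index=['a', 'a'])
s2 = pd.Series([10.0], index=['a'])
print(s1 + s2)  # two results for label 'a': 11.0 and 12.0
```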
NaNs are ignored in aggregation operations such as `mean` (they are skipped by default)
s3.mean()
4.0
You can drop them from the Series
s4 = s3.dropna()
s4
b    3
c    5
dtype: float64
Or use the `fillna` method to replace them with a value
s3.fillna(-999)
a   -999
b      3
c      5
d   -999
dtype: float64
s3.fillna(s3.mean())
a    4
b    3
c    5
d    4
dtype: float64
Series can have indexes representing dates / times
a
a.index = pd.date_range(start='2014-1-1', periods=len(a)) # the default frequency is daily
a.head()
2014-01-01    0.142453
2014-01-02   -0.367365
2014-01-03    1.160442
2014-01-04    0.846886
Freq: D, Name: my series, dtype: float64
a.index
### but you can convert it to a NumPy array of Python datetime objects if you want
py_datetimes = a.index.to_pydatetime()
And a number of useful methods for manipulating time series are exposed
### resample the daily time series to a 5-minute frequency, forward-filling the values
a.resample('5min').ffill()  # a.resample('5min', fill_method='ffill') in older pandas
2014-01-01 00:00:00    0.142453
2014-01-01 00:05:00    0.142453
2014-01-01 00:10:00    0.142453
...
2014-01-03 23:50:00    1.160442
2014-01-03 23:55:00    1.160442
2014-01-04 00:00:00    0.846886
Freq: 5T, Name: my series, Length: 865
a
### the ```shift``` method makes it easy e.g. to compare a series with its leads / lags
a.shift(periods=-1)
2014-01-01   -0.367365
2014-01-02    1.160442
2014-01-03    0.846886
2014-01-04         NaN
Freq: D, Name: my series, dtype: float64
a
### and the ```truncate``` method allows easy selection of time slices
a.truncate(after='2014-1-2')
2014-01-01    0.142453
2014-01-02   -0.367365
Freq: D, Name: my series, dtype: float64
DataFrames are IMHO one of the most powerful data structures in the Python / data analysis world.
They can be viewed as a collection of named Series. They feature two indexes, for the rows and the columns respectively, and can contain heterogeneous data types (although the type must be consistent within each column). Note that a DataFrame index, along the rows or the columns (or both!), can contain more than one level; such indexes are called hierarchical indexes and allow the representation of complex data organisation.
If the index along the rows of a DataFrame is of datetime type, all the methods exposed for Series (re-sampling, shifting, truncating, etc.) are available for the DataFrame.
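As a quick illustration of a hierarchical (row) index, here is a minimal sketch; the station / month labels are made up for the example:

```python
import numpy as np
import pandas as pd

# hypothetical two-level row index: station x month
idx = pd.MultiIndex.from_product([['AKL', 'WLG'], [1, 2, 3]],
                                 names=['station', 'month'])
df = pd.DataFrame({'rain': np.arange(6.0)}, index=idx)
print(df)
print(df.loc['AKL'])  # selecting on the first level returns a 3-row frame
```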
import string # part of the standard library
idx = list(string.ascii_lowercase[:10])  # string.lowercase in Python 2
print(idx)
df = pd.DataFrame(np.arange(100).reshape(10,10),\
columns=idx,index=np.arange(1,11))
df
### here I am creating a DataFrame from a dictionary
df = pd.DataFrame({'A' : np.random.random(5),\
'B' : np.random.random(5),\
'C': np.random.random(5)}, index=np.arange(1,6))
print(df)
df
You can select a single column in different ways, returning either a Series or a DataFrame
type(df['A'])
df[['A']]
type(df['A'])
type(df[['A']])
To access a particular row instead of a column, you use the `loc` indexer (the older `ix` indexer is deprecated)

df.loc[3]
df
to access values by position only, use `iloc`
df.iloc[2,0:2]
type(_)
And you can of course combine row and column indexing, using the same convention for slices as we saw for the Series

df.loc[3, 'A':'B']

df.loc[3, ['A','C']]
the `loc` indexer also accepts lists of row AND column labels (the indexing is based on the label, NOT on the position)
df
df.loc[[3,5],['A','B']]
conditional indexing or subsetting of a DataFrame
df
subset = df[df['A'] <= 0.7]
subset
As of pandas version 0.14, you can build complex, database-like queries on DataFrames
df
df.query('A > B')
df.query('(A > B) & (B > C)')
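`query` can also refer to Python variables from the surrounding scope with the `@` prefix; a small sketch (the threshold value is made up):

```python
import pandas as pd

df = pd.DataFrame({'A': [0.1, 0.5, 0.9], 'B': [0.4, 0.4, 0.4]})
threshold = 0.3  # hypothetical cutoff, referenced inside the query with '@'
subset = df.query('A > @threshold and A > B')
print(subset)  # keeps only the rows satisfying both conditions
```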
Adding a column is easy
df
df['D'] = np.random.random(5)
df
The following works because Pandas understands that a single value must be repeated over the row length
df['E'] = 5
df
The following doesn't work because there's no way to tell where to insert the missing value (align to 1st or last index ?)
df['F'] = np.random.random(4)
Unless we make a Series out of it, with an index matching (at least partly) the DataFrame's row index
df['F'] = pd.Series(np.random.random(4), index=np.arange(1,5))
df
df.apply(np.sqrt) # or np.sqrt(df)
df.describe().T
df.head()
df['E'].replace(5, 0.0, inplace=True)
# you can pass a dictionary to make multiple replacements
df
df.plot();
df.drop('E', axis=1).plot(figsize=(8,12), \
subplots=True, \
sharex=True, \
kind='bar', rot=0);
import pandas as pd
from sklearn.datasets import load_iris
iris = load_iris()
iris = pd.DataFrame(iris.data, columns=iris.feature_names)
iris.head()
iris.hist();
HTML('<iframe src=http://pandas.pydata.org/pandas-docs/stable/visualization.html width=1000 height=350></iframe>')
Pandas has very powerful IO methods, allowing you to load csv, Excel, and tab-delimited files very easily. Pandas DataFrames can also be saved as csv and Excel files.
Other file types are supported as well: in the notebook, type `pd.read` and hit TAB to see the list of available `read_*` functions.
SOI = pd.read_csv('../data/NIWA_SOI.csv')
SOI.head()
SOI = pd.read_csv('../data/NIWA_SOI.csv', index_col=0)
SOI.head()
SOI.loc[1950:2000].head()
SOIs = SOI.stack()
SOIs.head()
SOIs.index
from dateutil import parser
dateindex = [parser.parse("-".join(map(str, [x[0], x[1], 1]))) for x in SOIs.index]
SOIs.index=dateindex
SOIs.head()
SOIs.plot(figsize=(12,5))
SOIs.rolling(12).mean().plot()  # pd.rolling_mean(SOIs, 12) in older pandas
SOIs = pd.DataFrame(SOIs)
SOIs.to_excel('../data/SOI_time_series.xlsx', header=False)
#!open ./data/SOI_time_series.xlsx
first example: an Excel file containing one sheet, with an Excel dates column
!open ../data/ISO_datetime.xls
data = pd.read_excel('../data/ISO_datetime.xls', sheet_name='Sheet1')
data.head()
|   | date | value |
|---|------|-------|
| 0 | 2014-01-02 | 0.786299 |
| 1 | 2014-01-03 | 0.354398 |
| 2 | 2014-01-04 | 0.747329 |
| 3 | 2014-01-05 | 0.461564 |
| 4 | 2014-01-06 | 0.626105 |
data.date
data = pd.read_excel('../data/ISO_datetime.xls', \
    sheet_name='Sheet1', index_col=0)
data.index
<class 'pandas.tseries.index.DatetimeIndex'> [2014-01-02, ..., 2014-01-11] Length: 10, Freq: None, Timezone: None
!open ../data/year_month_day.xlsx
ymd = pd.read_excel('../data/year_month_day.xlsx', sheet_name='Sheet1')
ymd.head()
ymd = pd.read_excel('../data/year_month_day.xlsx', sheet_name='Sheet1', \
    parse_dates = {'date': [0, 1, 2]}, \
    index_col='date')
ymd.head()
| date | value |
|------|-------|
| 2014-01-02 | 0.786299 |
| 2014-01-03 | 0.354398 |
| 2014-01-04 | 0.747329 |
| 2014-01-05 | 0.461564 |
| 2014-01-06 | 0.626105 |
myd = pd.read_excel('../data/year_month_day.xlsx', sheet_name='Sheet2'); myd.head()
myd = pd.read_excel('../data/year_month_day.xlsx', sheet_name='Sheet2',\
    parse_dates = {'date': [1, 0, 2]}, \
    index_col='date');
myd.head()
| date | value |
|------|-------|
| 2014-01-02 | 0.478467 |
| 2014-01-03 | 0.833801 |
| 2014-01-04 | 0.837646 |
| 2014-01-05 | 0.453646 |
| 2014-01-06 | 0.260355 |
The groupby method is a very powerful method of pandas DataFrames; in a nutshell it allows you to:

  * **split** your data according to unique values of a variable (or unique combinations of N variables)
  * **apply** some operation to the groups thus defined, either an aggregation or a transformation method
  * **combine** the results into a DataFrame

This process is illustrated below, where the operation is calculating the mean of the groups' values.
A very nice explanation of the groupby method, with examples, is available in the pandas documentation at:
http://pandas.pydata.org/pandas-docs/stable/groupby.html
and a short tutorial on Wes McKinney's blog here
Image(filename='images/split-apply-combine.png', width=800)
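Before turning to real data, the split-apply-combine idea sketched above can be shown on a tiny made-up frame:

```python
import pandas as pd

# toy frame with two groups, 'a' and 'b'
toy = pd.DataFrame({'key': ['a', 'b', 'a', 'b'],
                    'val': [1.0, 2.0, 3.0, 4.0]})
# split on 'key', apply the mean to each group, combine into a Series
means = toy.groupby('key')['val'].mean()
print(means)  # 'a' -> 2.0, 'b' -> 3.0
```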
url = "ftp://ftp.cpc.ncep.noaa.gov/wd52dg/data/indices/ersst3b.nino.mth.81-10.ascii"
!wget -P ../data ftp://ftp.cpc.ncep.noaa.gov/wd52dg/data/indices/ersst3b.nino.mth.81-10.ascii
data = pd.read_table('../data/ersst3b.nino.mth.81-10.ascii', sep='\s+')  # the '\s+' is a regular expression matching any number of whitespace characters
# if the network configuration allows it, you can read directly off the URL (ftp):
#data = pd.read_table(url, sep='\s+')
data.tail()
|   | YR | MON | NINO1+2 | ANOM | NINO3 | ANOM.1 | NINO4 | ANOM.2 | NINO3.4 | ANOM.3 |
|---|----|-----|---------|------|-------|--------|-------|--------|---------|--------|
| 776 | 2014 | 9 | 21.54 | 0.66 | 25.33 | 0.31 | 29.14 | 0.42 | 27.00 | 0.17 |
| 777 | 2014 | 10 | 21.69 | 0.49 | 25.72 | 0.65 | 29.22 | 0.50 | 27.24 | 0.46 |
| 778 | 2014 | 11 | 22.44 | 0.62 | 26.05 | 0.91 | 29.38 | 0.72 | 27.56 | 0.83 |
| 779 | 2014 | 12 | 23.41 | 0.40 | 26.09 | 0.79 | 29.19 | 0.65 | 27.35 | 0.66 |
| 780 | 2015 | 1 | 24.83 | 0.15 | 26.28 | 0.51 | 28.97 | 0.65 | 27.21 | 0.54 |
I only keep the raw (monthly) values of NINO3.4
nino = data[['YR','MON','NINO3.4']]
nino.tail()
|   | YR | MON | NINO3.4 |
|---|----|-----|---------|
| 776 | 2014 | 9 | 27.00 |
| 777 | 2014 | 10 | 27.24 |
| 778 | 2014 | 11 | 27.56 |
| 779 | 2014 | 12 | 27.35 |
| 780 | 2015 | 1 | 27.21 |
Now I want to calculate a climatology (over the whole period available)
I first group by UNIQUE values of the variable MON; I should get 12 groups
groups = nino.groupby('MON')
for month, group in groups:
    print(month)
    print(group.head())
1
    YR  MON  NINO3.4
0   1950    1    24.83
12  1951    1    25.46
24  1952    1    26.85
36  1953    1    26.85
48  1954    1    27.03
2
    YR  MON  NINO3.4
1   1950    2    25.20
13  1951    2    25.78
25  1952    2    26.79
37  1953    2    27.19
49  1954    2    27.22
[... groups 3 through 12 ...]
climatology = groups.mean()
Same as
climatology = groups.aggregate(np.mean)
climatology['NINO3.4'].head(12)
MON
1     26.495606
2     26.677077
3     27.184308
4     27.623692
5     27.739692
6     27.544462
7     27.152000
8     26.765385
9     26.672462
10    26.642000
11    26.608615
12    26.541692
Name: NINO3.4, dtype: float64
f, ax = plt.subplots()
climatology['NINO3.4'].plot(ax=ax, kind='bar',ylim=[26,28], rot=0, width=0.9)
ax.set_xticklabels(list('JFMAMJJASOND'));
f.savefig('../outputs/climatology.pdf')
Now suppose we don't want to apply a function that aggregates the values in each group (such as sum, or mean) but rather one that transforms those values ...
An example would be calculating the standardized anomalies per month: from each value subtract the mean of the corresponding month, then divide by the standard deviation
def zscore(x):
    z = (x - x.mean()) / x.std()
    return z
transformed = nino.groupby(nino.MON).apply(zscore)
transformed['NINO3.4'].plot()
Now we want to calculate (just) the anomalies with respect to the climatology, but with a **normal established over 1981 - 2000**
nino.head()
|   | YR | MON | NINO3.4 |
|---|----|-----|---------|
| 0 | 1950 | 1 | 24.83 |
| 1 | 1950 | 2 | 25.20 |
| 2 | 1950 | 3 | 26.03 |
| 3 | 1950 | 4 | 26.36 |
| 4 | 1950 | 5 | 26.19 |
we can make use of the query method of Pandas DataFrames to select the climatological period
nino.query('YR >= 1981 & YR <= 2000').head()
|   | YR | MON | NINO3.4 |
|---|----|-----|---------|
| 372 | 1981 | 1 | 26.17 |
| 373 | 1981 | 2 | 26.13 |
| 374 | 1981 | 3 | 26.75 |
| 375 | 1981 | 4 | 27.32 |
| 376 | 1981 | 5 | 27.41 |
def demean(x):
    z = x - x.query('YR >= 1981 & YR <= 2000').mean()
    return z
anoms = nino.groupby([nino.MON]).apply(demean)
f, ax = plt.subplots(figsize=(10,6))
anoms['NINO3.4'].plot()
data['ANOM.3'].plot() # Should be the same
The file `Daily_clim_data_Auckland.csv` in the `data` directory contains daily values for rainfall, minimum and maximum temperatures for Auckland from 1972 to now.
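A possible starting point for loading such a file with a parsed date index; the column names below are assumptions (check the real file's header), and a small in-memory stand-in replaces the actual csv:

```python
import io
import pandas as pd

# in-memory stand-in for '../data/Daily_clim_data_Auckland.csv';
# the column names and values here are hypothetical
csv = io.StringIO("date,rain,tmin,tmax\n"
                  "1972-01-01,0.0,14.2,23.1\n"
                  "1972-01-02,1.2,13.8,22.4\n")
# parse_dates=True turns the index column into a DatetimeIndex
clim = pd.read_csv(csv, index_col='date', parse_dates=True)
print(clim.head())
```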