%load_ext load_style
%load_style talk.css
from IPython.display import Image, HTML
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
Pandas is a Python package providing fast, flexible, and expressive data structures designed to work with relational or labeled data. It is a fundamental high-level building block for doing practical, real world data analysis in Python.
pandas is well suited for:

  * tabular data with heterogeneously-typed columns (as in an SQL table or Excel spreadsheet)
  * ordered and unordered (not necessarily fixed-frequency) time series data
  * arbitrary matrix data with row and column labels
  * any other form of observational / statistical data sets

Key features:

  * easy handling of missing data
  * automatic alignment of data on labels
  * powerful group by functionality (split-apply-combine)
  * robust IO tools for csv, Excel, and other file formats
  * time-series-specific functionality (resampling, shifting, date ranges)
HTML('<iframe src=http://pandas.pydata.org/index.html width=900 height=350></iframe>')
The conventional way to import pandas is `import pandas as pd`
import pandas as pd
some option settings
pd.set_option("display.width", 80)
# uncomment the line below if you don't want DataFrames displayed as HTML tables
#pd.set_option("notebook_repr_html", False)
pd.set_option("notebook_repr_html", True)
pandas's data structures and functionality will be familiar to R users; there is a section on the pandas website where Wes McKinney gives translations of common idioms / operations between R and pandas
HTML('<iframe src=http://pandas.pydata.org/pandas-docs/stable/comparison_with_r.html#compare-with-r width=900 height=350></iframe>')
A Series is a single vector of data values (think of a one-dimensional NumPy array) with an index that labels each element in the vector.
a = pd.Series(np.random.normal(0,1,(10,)))
a
0    0.037009
1    0.593607
2   -1.623254
3    0.294439
4    2.162388
5   -0.570866
6    1.671018
7    1.399286
8   -0.573680
9   -0.160413
dtype: float64
a.index
Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')
a.values
array([ 0.03700874, 0.59360691, -1.6232544 , 0.29443856, 2.16238755, -0.57086554, 1.67101806, 1.39928642, -0.57367972, -0.16041259])
a = pd.Series(np.random.normal(0,1,(10,)), index=np.arange(1,11))
a
1     0.950278
2    -0.206156
3    -0.136064
4     0.247603
5    -0.228526
6    -0.199513
7    -0.071164
8     1.546748
9    -0.712044
10    0.214061
dtype: float64
a = pd.Series(np.random.normal(0,1,5), \
index=['a','b','c','d','e'], name='my series')
a
a   -0.448799
b   -0.410785
c   -1.607073
d   -1.217413
e    0.024734
Name: my series, dtype: float64
Pandas objects expose some powerful, high level plotting functions (built on top of Matplotlib)
plot = a.plot(kind='bar', rot=0, color='w', hatch='///', title=a.name, width=0.9, grid=False)
f, ax = plt.subplots()
bars = ax.bar(np.arange(len(a)), a.values, color='w', edgecolor='k', align='center', hatch='///')
ax.set_xticks(np.arange(len(a)))
ax.set_xlim(-0.5, len(a)-0.5)
ax.set_xticklabels(a.index)
ax.set_title(a.name)
Selecting from a Series is easy, using the corresponding index key (like a dict)
a
a   -0.448799
b   -0.410785
c   -1.607073
d   -1.217413
e    0.024734
Name: my series, dtype: float64
a['c']
-1.607072752509346
slices are permitted
a['a':'c'] ### note the difference with standard Python / NumPy positional indexing: label slices INCLUDE the endpoint
a   -0.448799
b   -0.410785
c   -1.607073
Name: my series, dtype: float64
a['c':]
c   -1.607073
d   -1.217413
e    0.024734
Name: my series, dtype: float64
deleting an element
a.drop('d')  # returns a new Series; a itself is unchanged
a   -0.448799
b   -0.410785
c   -1.607073
e    0.024734
Name: my series, dtype: float64
Adding an element to a Series, or modifying one in place: use the `loc` indexer
a
a   -0.448799
b   -0.410785
c   -1.607073
d   -1.217413
e    0.024734
Name: my series, dtype: float64
a.loc['e'] = 1
a
a   -0.448799
b   -0.410785
c   -1.607073
d   -1.217413
e    1.000000
Name: my series, dtype: float64
a.loc['f'] = 5
a
a   -0.448799
b   -0.410785
c   -1.607073
d   -1.217413
e    1.000000
f    5.000000
Name: my series, dtype: float64
Mathematical operations involving two series will perform operations by aligning indices.
Indices that do not match are given the value NaN (not a number), and values are computed for all unique pairs of repeated indices.
s1 = pd.Series(np.arange(1.0,4.0),index=['a','b','c'])
s2 = pd.Series(np.arange(1.0,4.0),index=['b','c','d'])
s3 = s1 + s2
s3
a   NaN
b     3
c     5
d   NaN
dtype: float64
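To make the "repeated indices" point concrete, here is a minimal sketch (with made-up values): when a label appears more than once, the operation is computed for every matching pair of elements.

```python
import pandas as pd

# 'a' appears twice on the left; each occurrence is paired with the single 'a' on the right
s1 = pd.Series([1.0, 2.0], index=['a', 'a'])
s2 = pd.Series([10.0], index=['a'])
print(s1 + s2)  # two results for label 'a': 11.0 and 12.0
```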
NaNs are ignored in aggregation operations such as `mean` (they are skipped by default)
s3.mean()
4.0
You can drop them from the Series
s4 = s3.dropna()
s4
b    3
c    5
dtype: float64
Or use the `fillna` method to replace them with a value
s3.fillna(-999)
a   -999
b      3
c      5
d   -999
dtype: float64
s3.fillna(s3.mean())
a    4
b    3
c    5
d    4
dtype: float64
Series can have indexes representing dates / times
a
a.index = pd.date_range(start='2014-1-1', periods=len(a)) # the default frequency is daily
a.head()
2014-01-01    0.142453
2014-01-02   -0.367365
2014-01-03    1.160442
2014-01-04    0.846886
Freq: D, Name: my series, dtype: float64
a.index
### but you can convert it to a NumPy array of Python datetime objects if you want
py_datetimes = a.index.to_pydatetime()
And a number of useful methods for manipulating time series are exposed
### resample the daily time series to a 5-minute frequency, forward-filling the values
a.resample('5min').ffill()  # a.resample('5min', fill_method='ffill') in older pandas
2014-01-01 00:00:00    0.142453
2014-01-01 00:05:00    0.142453
2014-01-01 00:10:00    0.142453
...
2014-01-03 23:50:00    1.160442
2014-01-03 23:55:00    1.160442
2014-01-04 00:00:00    0.846886
Freq: 5T, Name: my series, Length: 865
a
### the ```shift``` method makes it easy e.g. to compare a series with its leads / lags
a.shift(periods=-1)
2014-01-01   -0.367365
2014-01-02    1.160442
2014-01-03    0.846886
2014-01-04         NaN
Freq: D, Name: my series, dtype: float64
a
### and the ```truncate``` method allows easy selection of time slices
a.truncate(after='2014-1-2')
2014-01-01    0.142453
2014-01-02   -0.367365
Freq: D, Name: my series, dtype: float64
DataFrames are IMHO one of the most powerful data structures in the Python / data analysis world.
They can be viewed as a collection of named Series. They feature two indexes, for the rows and the columns respectively, and can contain heterogeneous data types (although the type must be consistent within each column). Note that a DataFrame index, along the rows or the columns (or both!), can contain more than one level; such indexes are called hierarchical indexes and allow the representation of complex data organisation.
If the index along the rows of a DataFrame is of datetime type, all the methods exposed for Series (re-sampling, shifting, truncating, etc.) are available for the DataFrame.
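As a quick illustration of a hierarchical (row) index, here is a minimal sketch; the station / month labels are made up for the example:

```python
import numpy as np
import pandas as pd

# hypothetical two-level row index: station x month
idx = pd.MultiIndex.from_product([['AKL', 'WLG'], [1, 2, 3]],
                                 names=['station', 'month'])
df = pd.DataFrame({'rain': np.arange(6.0)}, index=idx)
print(df)
print(df.loc['AKL'])  # selecting on the first level returns a 3-row frame
```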
import string # part of the standard library
idx = list(string.ascii_lowercase[:10])  # string.lowercase in Python 2
print(idx)
df = pd.DataFrame(np.arange(100).reshape(10,10),\
columns=idx,index=np.arange(1,11))
df
### here I am creating a DataFrame from a dictionary
df = pd.DataFrame({'A' : np.random.random(5),\
'B' : np.random.random(5),\
'C': np.random.random(5)}, index=np.arange(1,6))
print(df)
df
You can select a single column in different ways, returning either a Series or a DataFrame
type(df['A'])
df[['A']]
type(df['A'])
type(df[['A']])
To access a particular row instead of a column, you use the `loc` indexer (the older `ix` indexer is deprecated)

df.loc[3]
df
to access values by position only, use `iloc`
df.iloc[2,0:2]
type(_)
And you can of course combine row and column indexing, using the same convention for slices as we saw for the Series

df.loc[3, 'A':'B']

df.loc[3, ['A','C']]
the `loc` indexer also accepts lists of row AND column labels (the indexing is based on the label, NOT on the position)
df
df.loc[[3,5],['A','B']]
conditional indexing or subsetting of a DataFrame
df
subset = df[df['A'] <= 0.7]
subset
As of pandas version 0.14, you can build complex, database-like queries on DataFrames
df
df.query('A > B')
df.query('(A > B) & (B > C)')
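`query` can also refer to Python variables from the surrounding scope with the `@` prefix; a small sketch (the threshold value is made up):

```python
import pandas as pd

df = pd.DataFrame({'A': [0.1, 0.5, 0.9], 'B': [0.4, 0.4, 0.4]})
threshold = 0.3  # hypothetical cutoff, referenced inside the query with '@'
subset = df.query('A > @threshold and A > B')
print(subset)  # keeps only the rows satisfying both conditions
```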
Adding a column is easy
df
df['D'] = np.random.random(5)
df
The following works because Pandas understands that a single value must be repeated over the row length
df['E'] = 5
df
The following doesn't work because there's no way to tell where to insert the missing value (align to 1st or last index ?)
df['F'] = np.random.random(4)
Unless we make a Series out of it, with an index matching (at least partly) the DataFrame's row index
df['F'] = pd.Series(np.random.random(4), index=np.arange(1,5))
df
df.apply(np.sqrt) # or np.sqrt(df)
df.describe().T
df.head()
df['E'].replace(5, 0.0, inplace=True)
# you can pass a dictionary to make multiple replacements
df
df.plot();
df.drop('E', axis=1).plot(figsize=(8,12), \
subplots=True, \
sharex=True, \
kind='bar', rot=0);
import pandas as pd
from sklearn.datasets import load_iris
iris = load_iris()
iris = pd.DataFrame(iris.data, columns=iris.feature_names)
iris.head()
iris.hist();
HTML('<iframe src=http://pandas.pydata.org/pandas-docs/stable/visualization.html width=1000 height=350></iframe>')
Pandas has very powerful IO methods, allowing you to load csv, Excel, and tab-delimited files very easily. Pandas DataFrames can also be saved as csv and Excel files.
Other file types are supported as well: in the notebook, type `pd.read` and hit TAB to see the list of available `read_*` functions.
SOI = pd.read_csv('../data/NIWA_SOI.csv')
SOI.head()
SOI = pd.read_csv('../data/NIWA_SOI.csv', index_col=0)
SOI.head()
SOI.loc[1950:2000].head()
SOIs = SOI.stack()
SOIs.head()
SOIs.index
from dateutil import parser
dateindex = [parser.parse("-".join(map(str, [x[0], x[1], 1]))) for x in SOIs.index]
SOIs.index=dateindex
SOIs.head()
SOIs.plot(figsize=(12,5))
SOIs.rolling(12).mean().plot()  # pd.rolling_mean(SOIs, 12) in older pandas
SOIs = pd.DataFrame(SOIs)
SOIs.to_excel('../data/SOI_time_series.xlsx', header=False)
#!open ./data/SOI_time_series.xlsx
first example: an Excel file containing one sheet, with an Excel dates column
!open ../data/ISO_datetime.xls
data = pd.read_excel('../data/ISO_datetime.xls', sheet_name='Sheet1')
data.head()
|   | date | value |
|---|------|-------|
| 0 | 2014-01-02 | 0.786299 |
| 1 | 2014-01-03 | 0.354398 |
| 2 | 2014-01-04 | 0.747329 |
| 3 | 2014-01-05 | 0.461564 |
| 4 | 2014-01-06 | 0.626105 |
data.date
data = pd.read_excel('../data/ISO_datetime.xls', \
    sheet_name='Sheet1', index_col=0)
data.index
<class 'pandas.tseries.index.DatetimeIndex'> [2014-01-02, ..., 2014-01-11] Length: 10, Freq: None, Timezone: None
!open ../data/year_month_day.xlsx
ymd = pd.read_excel('../data/year_month_day.xlsx', sheet_name='Sheet1')
ymd.head()
ymd = pd.read_excel('../data/year_month_day.xlsx', sheet_name='Sheet1', \
    parse_dates = {'date': [0, 1, 2]}, \
    index_col='date')
ymd.head()
| date | value |
|------|-------|
| 2014-01-02 | 0.786299 |
| 2014-01-03 | 0.354398 |
| 2014-01-04 | 0.747329 |
| 2014-01-05 | 0.461564 |
| 2014-01-06 | 0.626105 |
myd = pd.read_excel('../data/year_month_day.xlsx', sheet_name='Sheet2'); myd.head()
myd = pd.read_excel('../data/year_month_day.xlsx', sheet_name='Sheet2',\
    parse_dates = {'date': [1, 0, 2]}, \
    index_col='date');
myd.head()
| date | value |
|------|-------|
| 2014-01-02 | 0.478467 |
| 2014-01-03 | 0.833801 |
| 2014-01-04 | 0.837646 |
| 2014-01-05 | 0.453646 |
| 2014-01-06 | 0.260355 |
The groupby method is a very powerful method of pandas DataFrames; in a nutshell it allows you to:

  * **split** your data according to unique values of a variable (or unique combinations of N variables)
  * **apply** some operation to the groups thus defined, either an aggregation or a transformation method
  * **combine** the results into a DataFrame

This process is illustrated below, where the operation is calculating the mean of the groups' values.
A very nice explanation of the groupby method, with examples, is available in the pandas documentation at:
http://pandas.pydata.org/pandas-docs/stable/groupby.html
and a short tutorial on Wes McKinney's blog here
Image(filename='images/split-apply-combine.png', width=800)
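Before turning to real data, the split-apply-combine idea sketched above can be shown on a tiny made-up frame:

```python
import pandas as pd

# toy frame with two groups, 'a' and 'b'
toy = pd.DataFrame({'key': ['a', 'b', 'a', 'b'],
                    'val': [1.0, 2.0, 3.0, 4.0]})
# split on 'key', apply the mean to each group, combine into a Series
means = toy.groupby('key')['val'].mean()
print(means)  # 'a' -> 2.0, 'b' -> 3.0
```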
url = "ftp://ftp.cpc.ncep.noaa.gov/wd52dg/data/indices/ersst3b.nino.mth.81-10.ascii"
!wget -P ../data ftp://ftp.cpc.ncep.noaa.gov/wd52dg/data/indices/ersst3b.nino.mth.81-10.ascii
data = pd.read_table('../data/ersst3b.nino.mth.81-10.ascii', sep='\s+')  # the '\s+' is a regular expression matching any number of whitespace characters
# if the network configuration allows it, you can read directly off the URL (ftp):
#data = pd.read_table(url, sep='\s+')
data.tail()
|   | YR | MON | NINO1+2 | ANOM | NINO3 | ANOM.1 | NINO4 | ANOM.2 | NINO3.4 | ANOM.3 |
|---|----|-----|---------|------|-------|--------|-------|--------|---------|--------|
| 776 | 2014 | 9 | 21.54 | 0.66 | 25.33 | 0.31 | 29.14 | 0.42 | 27.00 | 0.17 |
| 777 | 2014 | 10 | 21.69 | 0.49 | 25.72 | 0.65 | 29.22 | 0.50 | 27.24 | 0.46 |
| 778 | 2014 | 11 | 22.44 | 0.62 | 26.05 | 0.91 | 29.38 | 0.72 | 27.56 | 0.83 |
| 779 | 2014 | 12 | 23.41 | 0.40 | 26.09 | 0.79 | 29.19 | 0.65 | 27.35 | 0.66 |
| 780 | 2015 | 1 | 24.83 | 0.15 | 26.28 | 0.51 | 28.97 | 0.65 | 27.21 | 0.54 |
I only keep the raw (monthly) values of NINO3.4
nino = data[['YR','MON','NINO3.4']]
nino.tail()
|   | YR | MON | NINO3.4 |
|---|----|-----|---------|
| 776 | 2014 | 9 | 27.00 |
| 777 | 2014 | 10 | 27.24 |
| 778 | 2014 | 11 | 27.56 |
| 779 | 2014 | 12 | 27.35 |
| 780 | 2015 | 1 | 27.21 |
Now I want to calculate a climatology (over the whole period available)
I first group by UNIQUE values of the variable MON; I should get 12 groups
groups = nino.groupby('MON')
for month, group in groups:
    print(month)
    print(group.head())
1
    YR  MON  NINO3.4
0   1950    1    24.83
12  1951    1    25.46
24  1952    1    26.85
36  1953    1    26.85
48  1954    1    27.03
2
    YR  MON  NINO3.4
1   1950    2    25.20
13  1951    2    25.78
25  1952    2    26.79
37  1953    2    27.19
49  1954    2    27.22
[... groups 3 through 12 ...]
climatology = groups.mean()
Same as
climatology = groups.aggregate(np.mean)
climatology['NINO3.4'].head(12)
MON
1     26.495606
2     26.677077
3     27.184308
4     27.623692
5     27.739692
6     27.544462
7     27.152000
8     26.765385
9     26.672462
10    26.642000
11    26.608615
12    26.541692
Name: NINO3.4, dtype: float64
f, ax = plt.subplots()
climatology['NINO3.4'].plot(ax=ax, kind='bar',ylim=[26,28], rot=0, width=0.9)
ax.set_xticklabels(list('JFMAMJJASOND'));
f.savefig('../outputs/climatology.pdf')
Now suppose we don't want to apply a function that aggregates the values in each group (such as sum, or mean) but rather one that transforms those values ...
An example would be calculating the standardized anomalies per month: from each value subtract the mean of the corresponding month, then divide by the standard deviation
def zscore(x):
    z = (x - x.mean()) / x.std()
    return z
transformed = nino.groupby(nino.MON).apply(zscore)
transformed['NINO3.4'].plot()
Now we want to calculate (just) the anomalies with respect to the climatology, but with a **normal established over 1981 - 2000**
nino.head()
|   | YR | MON | NINO3.4 |
|---|----|-----|---------|
| 0 | 1950 | 1 | 24.83 |
| 1 | 1950 | 2 | 25.20 |
| 2 | 1950 | 3 | 26.03 |
| 3 | 1950 | 4 | 26.36 |
| 4 | 1950 | 5 | 26.19 |
we can make use of the query method of Pandas DataFrames to select the climatological period
nino.query('YR >= 1981 & YR <= 2000').head()
|   | YR | MON | NINO3.4 |
|---|----|-----|---------|
| 372 | 1981 | 1 | 26.17 |
| 373 | 1981 | 2 | 26.13 |
| 374 | 1981 | 3 | 26.75 |
| 375 | 1981 | 4 | 27.32 |
| 376 | 1981 | 5 | 27.41 |
def demean(x):
    z = x - x.query('YR >= 1981 & YR <= 2000').mean()
    return z
anoms = nino.groupby([nino.MON]).apply(demean)
f, ax = plt.subplots(figsize=(10,6))
anoms['NINO3.4'].plot()
data['ANOM.3'].plot() # Should be the same
The file `Daily_clim_data_Auckland.csv` in the `data` directory contains daily values for rainfall, minimum and maximum temperatures for Auckland from 1972 to now.
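A possible starting point for loading such a file with a parsed date index; the column names below are assumptions (check the real file's header), and a small in-memory stand-in replaces the actual csv:

```python
import io
import pandas as pd

# in-memory stand-in for '../data/Daily_clim_data_Auckland.csv';
# the column names and values here are hypothetical
csv = io.StringIO("date,rain,tmin,tmax\n"
                  "1972-01-01,0.0,14.2,23.1\n"
                  "1972-01-02,1.2,13.8,22.4\n")
# parse_dates=True turns the index column into a DatetimeIndex
clim = pd.read_csv(csv, index_col='date', parse_dates=True)
print(clim.head())
```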