by Mahé Perrette
ModStrat Seminar, 21st May, 2014
Potsdam Institute for Climate Impact Research
Installation instructions:
- Install python:
- Language: python 2.7
- Package manager: `pip` or newer `conda`
- Interactive terminal or notebook: ipython
- For beginner, use Anaconda which comes with all the above and many pre-installed packages:
http://continuum.io/downloads
- Install netCDF4 (e.g. `conda install netCDF4` or `pip install netCDF4`)
- Install dimarray via `git` (need to install `git` program first for commands below to work):
git clone https://github.com/perrette/dimarray.git
cd dimarray
python setup.py install
- Start notebook: `ipython notebook` and select the current notebook (ending .ipynb)
- Download data used in this notebook: https://www.pik-potsdam.de/members/perrette/dimarray/data.zip
numpy
package (2006), matplotlib
etc...2+3
5
mylist = ['a', 'b', 'c', 'd']
mylist
['a', 'b', 'c', 'd']
mylist[0] # 0-based index
'a'
mydict = dict(a=1, b=2)
mydict
{'a': 1, 'b': 2}
mydict['b'] # access element by key
2
python is a general-purpose programming language, with basic functionalities reduced to a minimum.
Some variables and function names are defined by default without the need of importing anything...
sum([1, 2, 3, 4])
10
... but others are not:
exp(0)
--------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-7-0a236a9466c0> in <module>() ----> 1 exp(0) NameError: name 'exp' is not defined
In that case, need to import specific packages: after importing a package, it is possible to access its functions with "."
import numpy
numpy.exp(0)
1.0
But numpy
is a bit long to type for each command. There are quite a few ways of going around that, such as defining an alias:
import numpy as np
np.exp(0)
or importing the function directly:
from numpy import exp
exp(0)
or even importing everything (not recommanded):
from numpy import *
exp(0)
for i in range(5):
print i
0 1 2 3 4
if 4 > 2:
print 'no parenthesis'
else:
print 'nope'
no parenthesis
Functions take named arguments, and optional arguments as well
def myfunc(a, b=2):
""" my documentation
"""
return a+b
# functions are a particular type of variables
myfunc
<function __main__.myfunc>
myfunc(3)
5
myfunc(3, b=5)
8
Like in C, classes can be defined with attributes and methods. Class attributes and methods can be accessed with "." like for variables or functions in a module:
class Ocean:
T = 25 # ocean temperature
cp = 4000 # sea water heat capacity
V = 1e6 # volume
indian = Ocean ()
indian
<__main__.Ocean instance at 0x7f42700de908>
indian.T
25
Add a method energy
:
class Ocean2:
T = 25 # ocean temperature
cp = 4000 # sea water heat capacity
V = 1e6 # volume
def energy(self): # class method, like a function defined within a class
return self.cp*self.T*self.V
indian = Ocean2()
indian.energy()
100000000000.0
Overload operator, add an __add__
method.
class Ocean3:
T = 25 # ocean temperature
cp = 4000 # sea water heat capacity
V = 1e6 # volume
def energy(self): # class method, like a function defined within a class
return self.cp*self.T*self.V
def __add__(self, other):
res = Ocean3()
res.T = 0.5*(self.T + other.T)
res.cp = 0.5*(self.cp+other.cp)
res.V = self.V+other.V
return res
indian = Ocean3()
big = indian + indian
big.energy()
200000000000.0
For scientific applications, need to use specific packages. numpy
is one of them, to be installed via the package manager pip
:
pip install numpy
or shipped by default if python was installed with anaconda (recommended !)
http://continuum.io/downloads
Once installed on your system, it can be imported any time in the python console, and there are various ways of using the functions and variables defined in it.
import *
form not recommended in general%pylab
or start ipython with ipython --pylab
%pylab
Using matplotlib backend: Qt4Agg Populating the interactive namespace from numpy and matplotlib
Now the numpy function array is defined in the workspace:
array
<function numpy.core.multiarray.array>
a = array([2, 5, 7, 8, 3]) # define an array (same as np.array)
a
array([2, 5, 7, 8, 3])
Indexing works similarly to python lists, with slicing as start:stop:step
a[:3] # first three elements
array([2, 5, 7])
a[-3:] # last three (negative indexing)
array([7, 8, 3])
a[::2] # every second element
array([2, 7, 3])
operations work as you would expect:
array([1,2,3]) * 2
array([2, 4, 6])
In contrast, addition and scalar multiplication mean concatenation for standard python list:
[1, 2, 3] * 2
[1, 2, 3, 1, 2, 3]
arange(5) # 0...4 : five elements numbered from 0, same as r_[:5]
array([0, 1, 2, 3, 4])
arange(1900, 2000, 5) # alternative syntax: r_[1900:2000:5]
array([1900, 1905, 1910, 1915, 1920, 1925, 1930, 1935, 1940, 1945, 1950, 1955, 1960, 1965, 1970, 1975, 1980, 1985, 1990, 1995])
linspace(-1, 1, 10) # 10 elements between -1 and 1
array([-1. , -0.77777778, -0.55555556, -0.33333333, -0.11111111, 0.11111111, 0.33333333, 0.55555556, 0.77777778, 1. ])
zeros(5)
array([ 0., 0., 0., 0., 0.])
a = ones((2,3))
a
array([[ 1., 1., 1.], [ 1., 1., 1.]])
Arrays are objects (class instances) ! They have attributes and methods which can be accessed with the "." syntax:
a.ndim # number of dimensions
2
a.shape
(2, 3)
a.size
6
a.mean # a method, itself an object
<function mean>
a.mean()
1.0
classical function approach is also defined:
mean(a)
1.0
a = arange(2*3).reshape(2,3)
a
array([[0, 1, 2], [3, 4, 5]])
multi-dimensional indexing works similarly to matlab or R
a[1,0] # multi-dimensional indexing: first index for line, second for columns
3
a[1] # grab the whole line
array([3, 4, 5])
a[:, -1] # last column (negative integer start from the end)
array([2, 5])
axis
parameter accepted by most transformation
a.mean(axis=0) # mean over lines
array([ 1.5, 2.5, 3.5])
a.mean(axis=1) # mean over columns
array([ 1., 4.])
In the python notebook, you may use the following "magicc" command to display figures inline:
#%pylab
%matplotlib inline
x = arange(10)
y = x**2
plot(x, y, label='my line')
plot(x, y*0.5, label='my other line')
legend(loc='upper left')
xlabel('x axis')
ylabel('y axis')
title('My Figure')
<matplotlib.text.Text at 0x7f42572b9a90>
A DimArray
can be defined just like a numpy array, with
additional information about its axes, which can be given
via axes
and dims
parameters.
import dimarray as da
data = [[1.,2,3], [4,5,6]]
a = da.DimArray(data, axes=[['grl', 'ant'], [1950, 1960, 1970]], dims=['variable', 'time'])
a
dimarray: 6 non-null elements (0 null) dimensions: 'variable', 'time' 0 / variable (2): grl to ant 1 / time (3): 1950 to 1970 array([[ 1., 2., 3.], [ 4., 5., 6.]])
Array data are stored in a values
attribute:
a.values
array([[ 1., 2., 3.], [ 4., 5., 6.]])
while dimensions are located in an axes
attribute
a.axes
dimensions: 'variable', 'time' 0 / variable (2): grl to ant 1 / time (3): 1950 to 1970
Individual axes can be accessed as well:
ax = a.axes[1]
ax.name , ax.values
('time', array([1950, 1960, 1970]))
A few handy aliases are defined
a.dims # grab axis names (the dimensions)
('variable', 'time')
a.labels # grab axis values
(array(['grl', 'ant'], dtype=object), array([1950, 1960, 1970]))
Indexing works on axis values instead of integer position
a['grl', 1970]
3.0
but integer-index is always possible via ix
toogle between labels
- and position
-based indexing:
a.ix[0, -1]
3.0
Numpy transformations are defined, and now accept axis name:
a.mean(axis='time')
dimarray: 2 non-null elements (0 null) dimensions: 'variable' 0 / variable (2): grl to ant array([ 2., 5.])
They can handle nans if asked to:
a['ant', 1950] = nan
a
dimarray: 5 non-null elements (1 null) dimensions: 'variable', 'time' 0 / variable (2): grl to ant 1 / time (3): 1950 to 1970 array([[ 1., 2., 3.], [ nan, 5., 6.]])
a.mean(axis='time')
dimarray: 1 non-null elements (1 null) dimensions: 'variable' 0 / variable (2): grl to ant array([ 2., nan])
and many more in the doc.
We have a few netCDF files in a data folder
ls data
cmip5.CSIRO-Mk3-6-0.nc cmip5.IPSL-CM5A-LR.nc cmip5.MPI-ESM-MR.nc cmip5.GFDL-ESM2G.nc cmip5.IPSL-CM5A-MR.nc gravit.perrette2013.nc cmip5.HadGEM2-ES.nc cmip5.MIROC5.nc
Read netCDF file to a Dataset object
ds = da.read_nc('data/cmip5.MPI-ESM-MR.nc')
ds
read from data/cmip5.MPI-ESM-MR.nc
Dataset of 3 variables dimensions: 'time', 'scenario' 0 / time (251): 1850 to 2100 1 / scenario (4): historical to rcp85 ohu: ('time', 'scenario') tsl: ('time', 'scenario') temp: ('time', 'scenario')
temp = ds['temp']
temp
dimarray: 909 non-null elements (95 null) dimensions: 'time', 'scenario' 0 / time (251): 1850 to 2100 1 / scenario (4): historical to rcp85 array(...)
%matplotlib inline
temp.plot()
<matplotlib.axes.AxesSubplot at 0x7f42512a7190>
Load new model. Note the time axis extends until 2300 and there is one more scenario
temp2 = da.read_nc('data/cmip5.IPSL-CM5A-LR.nc', 'temp')
temp2
dimarray: 1760 non-null elements (495 null) dimensions: 'time', 'scenario' 0 / time (451): 1850 to 2300 1 / scenario (5): historical to rcp85 array(...)
Compute reference period: reduce the time axis
ref1 = temp[1986:2005].mean(axis='time')
ref1
dimarray: 4 non-null elements (0 null) dimensions: 'scenario' 0 / scenario (4): historical to rcp85 array([ 14.48965673, 14.48965673, 14.48965673, 14.48965673])
compute anomalies w.r.t reference period
ano1 = temp - ref1
ano1
dimarray: 909 non-null elements (95 null) dimensions: 'time', 'scenario' 0 / time (251): 1850 to 2100 1 / scenario (4): historical to rcp85 array(...)
compute difference:
=> first time series is re-indexed to 2300 and filled with NaNs to match the second
ano2 = temp2 - temp2[1986:2005].mean(axis='time') # same as for ano1
ano1.shape, ano2.shape
((251, 4), (451, 5))
difference = ano2 - ano1
difference
dimarray: 909 non-null elements (1346 null) dimensions: 'time', 'scenario' 0 / time (451): 1850 to 2300 1 / scenario (5): historical to rcp85 array(...)
remove nans: every scenario which does not have at least one valid value is removed => RCP6.0 is eliminated
diffvalid = difference.dropna(axis='scenario', minvalid=1)
diffvalid
dimarray: 909 non-null elements (895 null) dimensions: 'time', 'scenario' 0 / time (451): 1850 to 2300 1 / scenario (4): historical to rcp85 array(...)
#difference.plot()
diffvalid.plot()
<matplotlib.axes.AxesSubplot at 0x7f4250e20b50>
Instead of a difference, one could also have stacked the two models into a single array:
newa = da.stack((temp, temp2), axis='model', align=True, keys=['MPI-ESM-MR', 'IPSL-CM5A-LR'])
newa
dimarray: 2669 non-null elements (1841 null) dimensions: 'model', 'time', 'scenario' 0 / model (2): MPI-ESM-MR to IPSL-CM5A-LR 1 / time (451): 1850 to 2300 2 / scenario (5): historical to rcp85 array(...)
newa.name = 'temperature'
newa.date = '22nd May 2014'
newa.description = 'my dimarray test'
newa._metadata
{'date': '22nd May 2014', 'description': 'my dimarray test', 'name': 'temperature'}
newa.write_nc('mydata.nc')
ds = da.read_nc('mydata.nc')
ds
ds['temperature']._metadata
read from mydata.nc
{'date': u'22nd May 2014', 'description': u'my dimarray test', 'name': u'temperature'}
Use regular expressions to read several netCDF files
ds = da.read_nc('data/cmip5.*.nc', align=True, axis='model')
ds
read from data/cmip5.IPSL-CM5A-LR.nc read from data/cmip5.GFDL-ESM2G.nc read from data/cmip5.HadGEM2-ES.nc read from data/cmip5.MPI-ESM-MR.nc read from data/cmip5.MIROC5.nc read from data/cmip5.IPSL-CM5A-MR.nc read from data/cmip5.CSIRO-Mk3-6-0.nc
Dataset of 3 variables dimensions: 'model', 'time', 'scenario' 0 / model (7): data/cmip5.IPSL-CM5A-LR to data/cmip5.CSIRO-Mk3-6-0 1 / time (451): 1850 to 2300 2 / scenario (5): historical to rcp85 ohu: ('model', 'time', 'scenario') tsl: ('model', 'time', 'scenario') temp: ('model', 'time', 'scenario')
Note that all models have been automatically aligned along the longest time period (1850-2100) and with all scenarios, filling the rest with NaNs.
Export to one array and rename model axis to be more human-readable
a = ds.to_array(axis='variable')
a
#a['ohu',:,2100, 'rcp45']
dimarray: 25784 non-null elements (21571 null) dimensions: 'variable', 'model', 'time', 'scenario' 0 / variable (3): ohu to temp 1 / model (7): data/cmip5.IPSL-CM5A-LR to data/cmip5.CSIRO-Mk3-6-0 2 / time (451): 1850 to 2300 3 / scenario (5): historical to rcp85 array(...)
# rename model axis
def rename_axis(model_ax):
return [ax.replace('data/cmip5.','') for ax in model_ax]
a.model[:] = rename_axis(a.model)
a
dimarray: 25784 non-null elements (21571 null) dimensions: 'variable', 'model', 'time', 'scenario' 0 / variable (3): ohu to temp 1 / model (7): IPSL-CM5A-LR to CSIRO-Mk3-6-0 2 / time (451): 1850 to 2300 3 / scenario (5): historical to rcp85 array(...)
Compute 21st change (warming, sea level rise...)
change = a[:,:, 2081:2099].mean('time') - a[:,:,1986:2005].mean('time')
change
dimarray: 71 non-null elements (34 null) dimensions: 'variable', 'model', 'scenario' 0 / variable (3): ohu to temp 1 / model (7): IPSL-CM5A-LR to CSIRO-Mk3-6-0 2 / scenario (5): historical to rcp85 array(...)
Use pandas to have a nice display
#rounded = change['temp'].apply(lambda x: np.round(x, 2)) # roudn to 2 decimals
#rounded.to_pandas()
change['temp'].to_pandas()
scenario | historical | rcp26 | rcp45 | rcp60 | rcp85 |
---|---|---|---|---|---|
model | |||||
IPSL-CM5A-LR | NaN | 1.244389 | 2.130471 | 2.528180 | 4.451431 |
GFDL-ESM2G | NaN | 0.263015 | 0.979251 | 1.503795 | 2.768312 |
HadGEM2-ES | NaN | 1.385744 | 2.518408 | NaN | 4.645714 |
MPI-ESM-MR | NaN | 0.778038 | 1.703055 | NaN | 3.495094 |
MIROC5 | NaN | 0.999615 | 1.669583 | 1.945708 | 3.293620 |
IPSL-CM5A-MR | NaN | 1.080994 | 2.245526 | NaN | 4.367397 |
CSIRO-Mk3-6-0 | NaN | 1.436035 | 2.201792 | 2.409064 | 3.987457 |
can also group along all information
#rounded = change['temp'].apply(lambda x: np.round(x, 2)) # roudn to 2 decimals
#rounded.to_pandas()
grouped = change[:,:,['rcp26','rcp85']].group('model','scenario')
grouped
dimarray: 39 non-null elements (3 null) dimensions: 'variable', 'model,scenario' 0 / variable (3): ohu to temp 1 / model,scenario (14): ('IPSL-CM5A-LR', 'rcp26') to ('CSIRO-Mk3-6-0', 'rcp85') array([[-0.24231561, 2.04321769, -0.15443784, nan, 0.38053016, 1.99874107, -0.19882841, 1.66944207, 0.39981952, 2.14975528, -0.16926416, 2.05212802, 0.54011965, 2.34895955], [ 0.11217237, 0.26592819, nan, nan, 0.12786021, 0.25204866, 0.12568446, 0.26728292, 0.16828752, 0.27523558, 0.11636126, 0.27740127, 0.15880893, 0.27442475], [ 1.2443886 , 4.4514311 , 0.26301486, 2.76831207, 1.38574405, 4.64571374, 0.77803809, 3.49509356, 0.999615 , 3.29361974, 1.08099444, 4.36739709, 1.43603482, 3.98745694]])
grouped.to_pandas()
model | IPSL-CM5A-LR | GFDL-ESM2G | HadGEM2-ES | MPI-ESM-MR | MIROC5 | IPSL-CM5A-MR | CSIRO-Mk3-6-0 | |||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
scenario | rcp26 | rcp85 | rcp26 | rcp85 | rcp26 | rcp85 | rcp26 | rcp85 | rcp26 | rcp85 | rcp26 | rcp85 | rcp26 | rcp85 |
variable | ||||||||||||||
ohu | -0.242316 | 2.043218 | -0.154438 | NaN | 0.380530 | 1.998741 | -0.198828 | 1.669442 | 0.399820 | 2.149755 | -0.169264 | 2.052128 | 0.540120 | 2.348960 |
tsl | 0.112172 | 0.265928 | NaN | NaN | 0.127860 | 0.252049 | 0.125684 | 0.267283 | 0.168288 | 0.275236 | 0.116361 | 0.277401 | 0.158809 | 0.274425 |
temp | 1.244389 | 4.451431 | 0.263015 | 2.768312 | 1.385744 | 4.645714 | 0.778038 | 3.495094 | 0.999615 | 3.293620 | 1.080994 | 4.367397 | 1.436035 | 3.987457 |
grouped.ungroup()
dimarray: 39 non-null elements (3 null) dimensions: 'variable', 'model', 'scenario' 0 / variable (3): ohu to temp 1 / model (7): IPSL-CM5A-LR to CSIRO-Mk3-6-0 2 / scenario (2): rcp26 to rcp85 array([[[-0.24231561, 2.04321769], [-0.15443784, nan], [ 0.38053016, 1.99874107], [-0.19882841, 1.66944207], [ 0.39981952, 2.14975528], [-0.16926416, 2.05212802], [ 0.54011965, 2.34895955]], [[ 0.11217237, 0.26592819], [ nan, nan], [ 0.12786021, 0.25204866], [ 0.12568446, 0.26728292], [ 0.16828752, 0.27523558], [ 0.11636126, 0.27740127], [ 0.15880893, 0.27442475]], [[ 1.2443886 , 4.4514311 ], [ 0.26301486, 2.76831207], [ 1.38574405, 4.64571374], [ 0.77803809, 3.49509356], [ 0.999615 , 3.29361974], [ 1.08099444, 4.36739709], [ 1.43603482, 3.98745694]]])
change.reshape('variable,scenario','model')
dimarray: 71 non-null elements (34 null) dimensions: 'variable,scenario', 'model' 0 / variable,scenario (15): (u'ohu', 'historical') to (u'temp', 'rcp85') 1 / model (7): IPSL-CM5A-LR to CSIRO-Mk3-6-0 array(...)
Example of application using dimarray: http://54.72.92.185/choices
ds = da.read_nc('data/gravit.perrette2013.nc')
ds
read from data/gravit.perrette2013.nc
Dataset of 1 variable dimensions: 'lat', 'lon' 0 / lat (180): -89.5 to 89.5 1 / lon (360): -179.5 to 179.5 glacier: ('lat', 'lon')
ds['glacier'].contourf(levels=linspace(-0.2, 1.5, 10), extend='both'); colorbar()
<matplotlib.colorbar.Colorbar instance at 0x7f4250c12e18>
import dimarray.geo
gl = ds['glacier']
gl.mean(skipna=True)
0.81279758206097585
gl2 = da.geo.GeoArray(gl)
gl2.mean(skipna=True)
0.9999999999999476
gl2.axes['lat'].weights
<function dimarray.geo.geoarray.<lambda>>
gl2.axes['lon'].modulo
360
gl2[:,[280.5]]
geoarray: 124 non-null elements (56 null) dimensions: 'lat', 'lon' 0 / lat (180): -89.5 to 89.5 1 / lon (1): -79.5 to -79.5 array(...)
An Informal Introduction to Python
scipy.org: list major packages for scientific use
nbviewer.org: visualize notebooks
dimarray: manipulate arrays with dimensions
(with more complete, but a bit long, doc: http://nbviewer.ipython.org/github/perrette/dimarray/blob/master/dimarray.ipynb)