by Mahé Perrette
ModStrat Seminar, 22nd May, 2014
Potsdam Institute for Climate Impact Research
numpy
package (2006), matplotlib
etc...2+3
5
mylist = ['a', 'b', 'c', 'd']
mylist
['a', 'b', 'c', 'd']
mylist[0] # 0-based index
'a'
mydict = dict(a=1, b=2)
mydict
{'a': 1, 'b': 2}
mydict['b'] # access element by key
2
python is a general-purpose programming language, with basic functionalities reduced to a minimum.
Some variables and function names are defined by default without the need of importing anything...
sum([1, 2, 3, 4])
10
... but others are not:
exp(0)
--------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-7-0a236a9466c0> in <module>() ----> 1 exp(0) NameError: name 'exp' is not defined
In that case, need to import specific packages: after importing a package, it is possible to access its functions with "."
import numpy
numpy.exp(0)
1.0
But numpy
is a bit long to type for each command. There are quite a few ways of going around that, such as defining an alias:
import numpy as np
np.exp(0)
or importing the function directly:
from numpy import exp
exp(0)
or even importing everything (not recommanded):
from numpy import *
exp(0)
for i in range(5):
print i
0 1 2 3 4
if 4 > 2:
print 'no parenthesis'
else:
print 'nope'
no parenthesis
Functions take named arguments, and optional arguments as well
def myfunc(a, b=2):
""" my documentation
"""
return a+b
# functions are a particular type of variables
myfunc
<function __main__.myfunc>
myfunc(3)
5
myfunc(3, b=5)
8
Like in C, classes can be defined with attributes and methods. Class attributes and methods can be accessed with "." like for variables or functions in a module:
class Ocean:
T = 25 # ocean temperature
cp = 4000 # sea water heat capacity
V = 1e6 # volume
indian = Ocean ()
indian
<__main__.Ocean instance at 0x7f3debfa8a28>
indian.T
25
Add a method energy
:
class Ocean2:
T = 25 # ocean temperature
cp = 4000 # sea water heat capacity
V = 1e6 # volume
def energy(self): # class method, like a function defined within a class
return self.cp*self.T*self.V
indian = Ocean2()
indian.energy()
100000000000.0
Overload operator, add an __add__
method.
class Ocean3:
T = 25 # ocean temperature
cp = 4000 # sea water heat capacity
V = 1e6 # volume
def energy(self): # class method, like a function defined within a class
return self.cp*self.T*self.V
def __add__(self, other):
res = Ocean3()
res.T = 0.5*(self.T + other.T)
res.cp = 0.5*(self.cp+other.cp)
res.V = self.V+other.V
return res
indian = Ocean3()
big = indian + indian
big.energy()
200000000000.0
For scientific applications, need to use specific packages. numpy
is one of them, to be installed via the package manager pip
:
pip install numpy
or shipped by default if python was installed with anaconda (recommanded!)
http://continuum.io/downloads
Once installed on your system, it can be imported any time in the python console, and there are various ways of using the functions and variables defined in it.
import *
form not recommended in general%pylab
or start ipython with ipython --pylab
%pylab
Using matplotlib backend: Qt4Agg Populating the interactive namespace from numpy and matplotlib
array
<function numpy.core.multiarray.array>
a = array([2, 5, 7, 8, 3]) # define an array (same as np.array)
a
array([2, 5, 7, 8, 3])
Indexing works similarly to python lists, with slicing as start:stop:step
a[:3] # first three elements
array([2, 5, 7])
a[-3:] # last three (negative indexing)
array([7, 8, 3])
a[::2] # every second element
array([2, 7, 3])
operations work as you would expect:
array([1,2,3]) * 2
array([2, 4, 6])
In contrast, addition and scalar multiplication mean concatenation for standard python list:
[1, 2, 3] * 2
[1, 2, 3, 1, 2, 3]
arange(5) # 0...4 : five elements numbered from 0, same as r_[:5]
array([0, 1, 2, 3, 4])
arange(1900, 2000, 5) # alternative syntax: r_[1900:2000:5]
array([1900, 1905, 1910, 1915, 1920, 1925, 1930, 1935, 1940, 1945, 1950, 1955, 1960, 1965, 1970, 1975, 1980, 1985, 1990, 1995])
linspace(-1, 1, 10) # 10 elements between -1 and 1
array([-1. , -0.77777778, -0.55555556, -0.33333333, -0.11111111, 0.11111111, 0.33333333, 0.55555556, 0.77777778, 1. ])
zeros(5)
array([ 0., 0., 0., 0., 0.])
a = ones((2,3))
a
array([[ 1., 1., 1.], [ 1., 1., 1.]])
Arrays are objects (class instances) ! They have attributes and methods which can be accessed with the "." syntax:
a.ndim # number of dimensions
2
a.shape
(2, 3)
a.size
6
a.mean # a method, itself an object
<function mean>
a.mean()
1.0
classical function approach is also defined:
mean(a)
1.0
a = arange(2*3).reshape(2,3)
a
array([[0, 1, 2], [3, 4, 5]])
multi-dimensional indexing works similarly to matlab or R
a[1,0] # multi-dimensional indexing: first index for line, second for columns
3
a[1] # grab the whole line
array([3, 4, 5])
a[:, -1] # last column (negative integer start from the end)
array([2, 5])
axis
parameter accepted by most transformation
a.mean(axis=0) # mean over lines
array([ 1.5, 2.5, 3.5])
a.mean(axis=1) # mean over columns
array([ 1., 4.])
matplotlib is a handy plotting package. It's functions are imported via the "magicc" ipython command %pylab
. Another handy command is %matplotlib inline
to display figures inline.
#%pylab
%matplotlib inline
x = arange(10)
y = x**2
plot(x, y, label='my line')
plot(x, y*0.5, label='my other line')
legend(loc='upper left')
xlabel('x axis')
ylabel('y axis')
title('My Figure')
<matplotlib.text.Text at 0x7f3ddf9127d0>
A DimArray
can be defined just like a numpy array, with
additional information about its axes, which can be given
via axes
and dims
parameters.
import dimarray as da
data = [[1.,2,3], [4,5,6]]
a = da.DimArray(data, axes=[['grl', 'ant'], [1950, 1960, 1970]], dims=['item', 'time'])
a
dimarray: 6 non-null elements (0 null) 0 / item (2): grl to ant 1 / time (3): 1950 to 1970 array([[ 1., 2., 3.], [ 4., 5., 6.]])
Array data are stored in a values
attribute:
a.values
array([[ 1., 2., 3.], [ 4., 5., 6.]])
while dimensions are located in an axes
attribute
a.axes
0 / item (2): grl to ant 1 / time (3): 1950 to 1970
Individual axes can be accessed as well:
ax = a.axes[1]
ax.name , ax.values
('time', array([1950, 1960, 1970]))
A few handy aliases are defined
a.dims # grab axis names (the dimensions)
('item', 'time')
a.labels # grab axis values
(array(['grl', 'ant'], dtype=object), array([1950, 1960, 1970]))
Indexing works on axis values instead of integer position
a['grl', 1970]
3.0
but integer-index is always possible via ix
toogle between labels
- and position
-based indexing:
a.ix[0, -1]
3.0
Numpy transformations are defined, and now accept axis name:
a.mean(axis='time')
dimarray: 2 non-null elements (0 null) 0 / item (2): grl to ant array([ 2., 5.])
They can handle nans if asked to:
a['ant', 1950] = nan
a
dimarray: 5 non-null elements (1 null) 0 / item (2): grl to ant 1 / time (3): 1950 to 1970 array([[ 1., 2., 3.], [ nan, 5., 6.]])
a.mean(axis='time', skipna=True)
dimarray: 2 non-null elements (0 null) 0 / item (2): grl to ant array([ 2. , 5.5])
and many more in the doc.
We have a few netCDF files in a data folder, let's go there
import os
datadir = da.get_datadir()
os.chdir(datadir)
List the content of the data folder (will work on unix systems only)
ls
cmip5.CSIRO-Mk3-6-0.nc cmip5.GFDL-ESM2G.nc cmip5.HadGEM2-ES.nc cmip5.IPSL-CM5A-LR.nc cmip5.IPSL-CM5A-MR.nc cmip5.MIROC5.nc cmip5.MPI-ESM-MR.nc
Read netCDF file to a Dataset object
datadir = da.get_datadir()
ds = da.read_nc('cmip5.MPI-ESM-MR.nc')
ds
Dataset of 2 variables 0 / time (251): 1850 to 2100 1 / scenario (4): historical to rcp85 tsl: ('time', 'scenario') temp: ('time', 'scenario')
temp = ds['temp']
temp
dimarray: 909 non-null elements (95 null) 0 / time (251): 1850 to 2100 1 / scenario (4): historical to rcp85 array(...)
%matplotlib inline
temp.plot()
<matplotlib.axes.AxesSubplot at 0x7f3dd8ca85d0>
Load new model. Note the time axis extends until 2300 and there is one more scenario
temp2 = da.read_nc('cmip5.IPSL-CM5A-LR.nc', 'temp')
temp2
dimarray: 1760 non-null elements (495 null) 0 / time (451): 1850 to 2300 1 / scenario (5): historical to rcp85 array(...)
Compute reference period: reduce the time axis
ref1 = temp[1986:2005].mean(axis='time')
ref1
dimarray: 4 non-null elements (0 null) 0 / scenario (4): historical to rcp85 array([ 14.48965673, 14.48965673, 14.48965673, 14.48965673])
compute anomalies w.r.t reference period
ano1 = temp - ref1
ano1
dimarray: 909 non-null elements (95 null) 0 / time (251): 1850 to 2100 1 / scenario (4): historical to rcp85 array(...)
compute difference:
=> first time series is re-indexed to 2300 and filled with NaNs to match the second
ano2 = temp2 - temp2[1986:2005].mean(axis='time') # same as for ano1
ano1.shape, ano2.shape
((251, 4), (451, 5))
difference = ano2 - ano1
difference
dimarray: 909 non-null elements (1346 null) 0 / time (451): 1850 to 2300 1 / scenario (5): historical to rcp85 array(...)
remove nans: every scenario which does not have at least one valid value is removed => RCP6.0 is eliminated
diffvalid = difference.dropna(axis='scenario', minvalid=1)
diffvalid
dimarray: 909 non-null elements (895 null) 0 / time (451): 1850 to 2300 1 / scenario (4): historical to rcp85 array(...)
#difference.plot()
diffvalid.plot()
<matplotlib.axes.AxesSubplot at 0x7f3dd8a65350>
Instead of a difference, one could also have stacked the two models into a single array:
newa = da.stack((temp, temp2), axis='model', align=True, keys=['MPI-ESM-MR', 'IPSL-CM5A-LR'])
newa
dimarray: 2669 non-null elements (1841 null) 0 / model (2): MPI-ESM-MR to IPSL-CM5A-LR 1 / time (451): 1850 to 2300 2 / scenario (5): historical to rcp85 array(...)
newa.name = 'temperature'
newa.date = '22nd May 2014'
newa.description = 'my dimarray test'
newa._metadata
{'date': '22nd May 2014', 'description': 'my dimarray test', 'name': 'temperature'}
newa.write_nc('/tmp/mydata.nc')
ds = da.read_nc('/tmp/mydata.nc')
ds
ds['temperature']._metadata
{'date': u'22nd May 2014', 'description': u'my dimarray test', 'name': u'temperature'}
Use regular expressions to read several netCDF files
ds = da.read_nc('cmip5.*.nc', align=True, axis='model')
ds
Dataset of 2 variables 0 / model (7): cmip5.IPSL-CM5A-LR to cmip5.CSIRO-Mk3-6-0 1 / time (451): 1850 to 2300 2 / scenario (5): historical to rcp85 tsl: ('model', 'time', 'scenario') temp: ('model', 'time', 'scenario')
Note that all models have been automatically aligned along the longest time period (1850-2100) and with all scenarios, filling the rest with NaNs.
Export to one array and rename model axis to be more human-readable
a = ds.to_array(axis='variable')
a
#a['ohu',:,2100, 'rcp45']
dimarray: 16725 non-null elements (14845 null) 0 / variable (2): tsl to temp 1 / model (7): cmip5.IPSL-CM5A-LR to cmip5.CSIRO-Mk3-6-0 2 / time (451): 1850 to 2300 3 / scenario (5): historical to rcp85 array(...)
# rename model axis
def rename_axis(model_ax):
return [ax.replace('data/cmip5.','') for ax in model_ax]
a.model[:] = rename_axis(a.model)
a
dimarray: 16725 non-null elements (14845 null) 0 / variable (2): tsl to temp 1 / model (7): cmip5.IPSL-CM5A-LR to cmip5.CSIRO-Mk3-6-0 2 / time (451): 1850 to 2300 3 / scenario (5): historical to rcp85 array(...)
Compute 21st change (warming, sea level rise...)
change = a[:,:, 2081:2099].mean('time') - a[:,:,1986:2005].mean('time')
change
dimarray: 47 non-null elements (23 null) 0 / variable (2): tsl to temp 1 / model (7): cmip5.IPSL-CM5A-LR to cmip5.CSIRO-Mk3-6-0 2 / scenario (5): historical to rcp85 array([[[ nan, 0.11217237, 0.1677092 , 0.17220445, 0.26592819], [ nan, nan, nan, nan, nan], [ nan, 0.12786021, 0.16702883, 0.18248119, 0.25204866], [ nan, 0.12568446, 0.17624791, nan, 0.26728292], [ nan, 0.16828752, 0.19214832, 0.20351329, 0.27523558], [ nan, 0.11636126, 0.17071112, nan, 0.27740127], [ nan, 0.15880893, 0.19516833, 0.19162792, 0.27442475]], [[ nan, 1.2443886 , 2.1304713 , 2.52817959, 4.4514311 ], [ nan, 0.26301486, 0.97925117, 1.50379534, 2.76831207], [ nan, 1.38574405, 2.5184083 , nan, 4.64571374], [ nan, 0.77803809, 1.70305479, nan, 3.49509356], [ nan, 0.999615 , 1.66958345, 1.94570811, 3.29361974], [ nan, 1.08099444, 2.24552594, nan, 4.36739709], [ nan, 1.43603482, 2.20179153, 2.4090637 , 3.98745694]]])
Use pandas to have a nice display
#rounded = change['temp'].apply(lambda x: np.round(x, 2)) # roudn to 2 decimals
#rounded.to_pandas()
change['temp'].to_pandas()
scenario | historical | rcp26 | rcp45 | rcp60 | rcp85 |
---|---|---|---|---|---|
model | |||||
cmip5.IPSL-CM5A-LR | NaN | 1.244389 | 2.130471 | 2.528180 | 4.451431 |
cmip5.GFDL-ESM2G | NaN | 0.263015 | 0.979251 | 1.503795 | 2.768312 |
cmip5.HadGEM2-ES | NaN | 1.385744 | 2.518408 | NaN | 4.645714 |
cmip5.MPI-ESM-MR | NaN | 0.778038 | 1.703055 | NaN | 3.495094 |
cmip5.MIROC5 | NaN | 0.999615 | 1.669583 | 1.945708 | 3.293620 |
cmip5.IPSL-CM5A-MR | NaN | 1.080994 | 2.245526 | NaN | 4.367397 |
cmip5.CSIRO-Mk3-6-0 | NaN | 1.436035 | 2.201792 | 2.409064 | 3.987457 |
can also group along all information
#rounded = change['temp'].apply(lambda x: np.round(x, 2)) # roudn to 2 decimals
#rounded.to_pandas()
grouped = change[:,:,['rcp26','rcp85']].group('model','scenario')
grouped
dimarray: 26 non-null elements (2 null) 0 / variable (2): tsl to temp 1 / model,scenario (14): ('cmip5.IPSL-CM5A-LR', 'rcp26') to ('cmip5.CSIRO-Mk3-6-0', 'rcp85') array([[ 0.11217237, 0.26592819, nan, nan, 0.12786021, 0.25204866, 0.12568446, 0.26728292, 0.16828752, 0.27523558, 0.11636126, 0.27740127, 0.15880893, 0.27442475], [ 1.2443886 , 4.4514311 , 0.26301486, 2.76831207, 1.38574405, 4.64571374, 0.77803809, 3.49509356, 0.999615 , 3.29361974, 1.08099444, 4.36739709, 1.43603482, 3.98745694]])
grouped.to_pandas()
model | cmip5.IPSL-CM5A-LR | cmip5.GFDL-ESM2G | cmip5.HadGEM2-ES | cmip5.MPI-ESM-MR | cmip5.MIROC5 | cmip5.IPSL-CM5A-MR | cmip5.CSIRO-Mk3-6-0 | |||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
scenario | rcp26 | rcp85 | rcp26 | rcp85 | rcp26 | rcp85 | rcp26 | rcp85 | rcp26 | rcp85 | rcp26 | rcp85 | rcp26 | rcp85 |
variable | ||||||||||||||
tsl | 0.112172 | 0.265928 | NaN | NaN | 0.127860 | 0.252049 | 0.125684 | 0.267283 | 0.168288 | 0.275236 | 0.116361 | 0.277401 | 0.158809 | 0.274425 |
temp | 1.244389 | 4.451431 | 0.263015 | 2.768312 | 1.385744 | 4.645714 | 0.778038 | 3.495094 | 0.999615 | 3.293620 | 1.080994 | 4.367397 | 1.436035 | 3.987457 |
grouped.ungroup()
dimarray: 26 non-null elements (2 null) 0 / variable (2): tsl to temp 1 / model (7): cmip5.IPSL-CM5A-LR to cmip5.CSIRO-Mk3-6-0 2 / scenario (2): rcp26 to rcp85 array([[[ 0.11217237, 0.26592819], [ nan, nan], [ 0.12786021, 0.25204866], [ 0.12568446, 0.26728292], [ 0.16828752, 0.27523558], [ 0.11636126, 0.27740127], [ 0.15880893, 0.27442475]], [[ 1.2443886 , 4.4514311 ], [ 0.26301486, 2.76831207], [ 1.38574405, 4.64571374], [ 0.77803809, 3.49509356], [ 0.999615 , 3.29361974], [ 1.08099444, 4.36739709], [ 1.43603482, 3.98745694]]])
change.reshape('variable,scenario','model')
dimarray: 47 non-null elements (23 null) 0 / variable,scenario (10): (u'tsl', 'historical') to (u'temp', 'rcp85') 1 / model (7): cmip5.IPSL-CM5A-LR to cmip5.CSIRO-Mk3-6-0 array([[ nan, nan, nan, nan, nan, nan, nan], [ 0.11217237, nan, 0.12786021, 0.12568446, 0.16828752, 0.11636126, 0.15880893], [ 0.1677092 , nan, 0.16702883, 0.17624791, 0.19214832, 0.17071112, 0.19516833], [ 0.17220445, nan, 0.18248119, nan, 0.20351329, nan, 0.19162792], [ 0.26592819, nan, 0.25204866, 0.26728292, 0.27523558, 0.27740127, 0.27442475], [ nan, nan, nan, nan, nan, nan, nan], [ 1.2443886 , 0.26301486, 1.38574405, 0.77803809, 0.999615 , 1.08099444, 1.43603482], [ 2.1304713 , 0.97925117, 2.5184083 , 1.70305479, 1.66958345, 2.24552594, 2.20179153], [ 2.52817959, 1.50379534, nan, nan, 1.94570811, nan, 2.4090637 ], [ 4.4514311 , 2.76831207, 4.64571374, 3.49509356, 3.29361974, 4.36739709, 3.98745694]])
Let's first make some 2-D data to play with
lon = linspace(-179.5, 179.5, 360)
lat = linspace(-89.5, 89.5, 180)
LON, LAT = meshgrid(lon, lat)
values = cos(radians(LON)) + cos(radians(LAT))
basic = da.DimArray(values, axes=[lat, lon], dims=['lat', 'lon'])
basic.contourf(); colorbar()
<matplotlib.colorbar.Colorbar instance at 0x7f3dd88b05f0>
Now, that's let's compute the global mean with the standard dimarray method:
basic.mean()
0.63662785266740252
But in reality, high-latitude values are down-weighted because of the smaller surface they occupy. GeoArray is a subclass of DimArray that handles that recognizes typical dimensions and apply the desired weighting.
from dimarray.geo import GeoArray
geo = GeoArray(basic)
geo.mean()
0.78538819485364419
It does little more than defining the axis attributes weights
, here a lambda function...
geo.axes['lat'].weights
<function dimarray.geo.geoarray.<lambda>>
that is called on every axis value when needed
geo.axes['lat'].weights(geo.lat)
array([ 0.00872654, 0.02617695, 0.04361939, 0.06104854, 0.0784591 , 0.09584575, 0.11320321, 0.13052619, 0.14780941, 0.16504761, 0.18223553, 0.19936793, 0.21643961, 0.23344536, 0.25038 , 0.26723838, 0.28401534, 0.3007058 , 0.31730466, 0.33380686, 0.35020738, 0.36650123, 0.38268343, 0.39874907, 0.41469324, 0.4305111 , 0.44619781, 0.46174861, 0.47715876, 0.49242356, 0.50753836, 0.52249856, 0.53729961, 0.55193699, 0.56640624, 0.58070296, 0.59482279, 0.60876143, 0.62251464, 0.63607822, 0.64944805, 0.66262005, 0.67559021, 0.68835458, 0.70090926, 0.71325045, 0.72537437, 0.73727734, 0.74895572, 0.76040597, 0.77162458, 0.78260816, 0.79335334, 0.80385686, 0.81411552, 0.82412619, 0.83388582, 0.84339145, 0.85264016, 0.86162916, 0.8703557 , 0.87881711, 0.88701083, 0.89493436, 0.90258528, 0.90996127, 0.91706007, 0.92387953, 0.93041757, 0.93667219, 0.94264149, 0.94832366, 0.95371695, 0.95881973, 0.96363045, 0.96814764, 0.97236992, 0.97629601, 0.9799247 , 0.98325491, 0.9862856 , 0.98901586, 0.99144486, 0.99357186, 0.9953962 , 0.99691733, 0.9981348 , 0.99904822, 0.99965732, 0.99996192, 0.99996192, 0.99965732, 0.99904822, 0.9981348 , 0.99691733, 0.9953962 , 0.99357186, 0.99144486, 0.98901586, 0.9862856 , 0.98325491, 0.9799247 , 0.97629601, 0.97236992, 0.96814764, 0.96363045, 0.95881973, 0.95371695, 0.94832366, 0.94264149, 0.93667219, 0.93041757, 0.92387953, 0.91706007, 0.90996127, 0.90258528, 0.89493436, 0.88701083, 0.87881711, 0.8703557 , 0.86162916, 0.85264016, 0.84339145, 0.83388582, 0.82412619, 0.81411552, 0.80385686, 0.79335334, 0.78260816, 0.77162458, 0.76040597, 0.74895572, 0.73727734, 0.72537437, 0.71325045, 0.70090926, 0.68835458, 0.67559021, 0.66262005, 0.64944805, 0.63607822, 0.62251464, 0.60876143, 0.59482279, 0.58070296, 0.56640624, 0.55193699, 0.53729961, 0.52249856, 0.50753836, 0.49242356, 0.47715876, 0.46174861, 0.44619781, 0.4305111 , 0.41469324, 0.39874907, 0.38268343, 0.36650123, 0.35020738, 0.33380686, 0.31730466, 0.3007058 , 0.28401534, 0.26723838, 0.25038 , 0.23344536, 0.21643961, 0.19936793, 0.18223553, 0.16504761, 0.14780941, 0.13052619, 0.11320321, 0.09584575, 0.0784591 , 0.06104854, 0.04361939, 0.02617695, 0.00872654])
Additionally, there is a modulo
attribute that ease indexing:
geo.axes['lon'].modulo
360
geo[:,[280.5]]
geoarray: 180 non-null elements (0 null) 0 / lat (180): -89.5 to 89.5 1 / lon (1): -79.5 to -79.5 array(...)
Example of application using dimarray: http://54.72.92.185/choices
An Informal Introduction to Python
scipy.org: list major packages for scientific use