name = '2017-03-24-climate-model-output'
title = 'Two ways of preparing climate model output for analysis'
tags = 'numpy, iris'
author = 'Denis Sergeev'
from nb_tools import connect_notebook_to_post
from IPython.core.display import HTML, Image
html = connect_notebook_to_post(name, title, tags, author)
Today one of the group members asked for help with reading climate model output and preparing it for data analysis.
This notebook shows a couple of ways of doing that with the help of the numpy and iris Python packages.
Luckily, the model output is quite small and stored in a simple ASCII file. However, it has some properties that can be a hurdle for a programming novice.
We start by downloading the data from a given link.
URL = 'https://raw.githubusercontent.com/ueapy/ueapy.github.io/src/content/data/run1_U_60N_10hPa.dat'
Instead of copy-pasting the contents manually, we are going to use Python's standard library and download the file, making this part of scientific analysis more reproducible.
from urllib.request import urlretrieve
To organise the data and code folders, we also import the os module.
import os
datadir = os.path.join(os.path.pardir, 'data') # directory is one level up
if not os.path.isdir(datadir):
    # if the directory does not exist, create it
    os.makedirs(datadir)
# File with data
fname = os.path.join(datadir, 'data.dat')
Now that we have a directory to store data, we can save the model output there.
urlretrieve(URL, fname)
('../data/data.dat', <http.client.HTTPMessage at 0x7fe7c80b7208>)
numpy
Since the data are purely numeric, we use the numpy module.
import numpy as np
data = np.genfromtxt(fname)
data.shape
(1500, 6)
data
array([[  6.74683,   8.44815,   4.10201,   4.62099,   8.84487,  15.718  ],
       [ 20.4013 ,  25.0052 ,  26.7049 ,  24.2583 ,  21.5956 ,  28.7007 ],
       [ 37.33   ,  35.545  ,  33.39   ,  24.4802 ,  24.7544 ,  25.5569 ],
       ...,
       [ 13.3082 ,  12.9732 ,  18.6628 ,  22.5797 ,  20.8556 ,  25.2516 ],
       [ 21.6696 ,  13.2805 ,  13.5226 ,  23.432  ,  26.1123 ,  25.6979 ],
       [ 26.1932 ,  21.6942 ,  15.5987 ,  13.2761 ,  14.5777 ,  17.3154 ]])
For some reason the data are stored in 6 columns by 1500 rows, which makes 9000 values in total.
We know a priori that the file contains 75 years of data written every third day, and that the climate model uses a 360-day calendar. Hence, we have 120 values per year:
data.shape[0] * data.shape[1] / 75
120.0
Keeping data in $1500\times6$ array does not seem to be useful, so we make it 1-D:
data = data.flatten()
data.shape
(9000,)
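Note that flatten() reads the array in row-major (C) order, i.e. across each row of 6 values before moving on to the next row, which matches the time-sequential layout of this file. A small sketch of this behaviour:

```python
import numpy as np

a = np.arange(12).reshape(4, 3)  # 4 "rows" of 3 consecutive values
flat = a.flatten()               # row-major (C) order by default
print(flat)  # [ 0  1  2  3  4  5  6  7  8  9 10 11]
```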
To make the code above reusable, we create the following function to get data.
def get_model_data(url=URL, fname='climate_model_data.dat', force_download=False):
    """
    Download climate model output from the UEA server

    Parameters
    ----------
    url : string (optional)
        web location of the data
    fname : string (optional)
        full path to save the data
    force_download : bool (optional)
        if True, force redownload of the data

    Returns
    -------
    data : numpy.ndarray
        1-D array of data
    """
    if not os.path.exists(fname) or force_download:
        urlretrieve(url, fname)
    data = np.genfromtxt(fname)
    return data.flatten()
data = get_model_data()
Now we transform the array into a more useful shape.
NDAYS = 120 # the number of 3-day periods in a 360-day year
NYEARS = 75 # the total number of years
data_yd = data.reshape((NYEARS, NDAYS))
print(data_yd.shape)
(75, 120)
For example, this is the value of the $u$-wind on 30 January of the last year:
data_yd[-1, 10]
23.914400000000001
What if we want to extract only winter data? We can't use the first winter, because it's incomplete: it has only January and February. So the first complete winter period comprises the December data from year 1:
data_yd[0, -10:]
array([ 15.5446, 20.6539, 16.4162, 22.0274, 30.3875, 27.8614, 28.5274, 32.0706, 35.9934, 34.0339])
plus the January and February data from year 2:
data_yd[1, :20]
array([ 3.06054000e+01, 2.61758000e+01, 2.98059000e+01, 3.20111000e+01, 2.72294000e+01, 1.97748000e+01, 1.90082000e+01, 1.51616000e+01, 1.22748000e+01, 1.09608000e+01, 1.36364000e+01, 2.22356000e+01, 2.76375000e+01, 2.39670000e+01, 1.24344000e+01, 2.54243000e+00, 2.32738000e-01, -3.17650000e-02, 6.82037000e-01, 4.43382000e-02])
To join them, we can use the numpy.concatenate() function:
np.concatenate([data_yd[0, -10:], data_yd[1, :20]])
array([ 1.55446000e+01, 2.06539000e+01, 1.64162000e+01, 2.20274000e+01, 3.03875000e+01, 2.78614000e+01, 2.85274000e+01, 3.20706000e+01, 3.59934000e+01, 3.40339000e+01, 3.06054000e+01, 2.61758000e+01, 2.98059000e+01, 3.20111000e+01, 2.72294000e+01, 1.97748000e+01, 1.90082000e+01, 1.51616000e+01, 1.22748000e+01, 1.09608000e+01, 1.36364000e+01, 2.22356000e+01, 2.76375000e+01, 2.39670000e+01, 1.24344000e+01, 2.54243000e+00, 2.32738000e-01, -3.17650000e-02, 6.82037000e-01, 4.43382000e-02])
And of course we can apply the same logic to the whole dataset:
data_djf = np.concatenate([data_yd[:-1, -10:], data_yd[1:, :20]], axis=1)
print(data_djf.shape)
(74, 30)
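As a sanity check that the slicing lines up, with December of year $n$ sitting next to January and February of year $n+1$, we can run the same two slices on a toy array whose values encode their own time index (the toy array is ours, for illustration only):

```python
import numpy as np

toy = np.arange(75 * 120).reshape(75, 120)  # each value equals its overall time index
toy_djf = np.concatenate([toy[:-1, -10:], toy[1:, :20]], axis=1)

print(toy_djf.shape)  # (74, 30)
# First winter: the last 10 indices of year 1 are followed by
# the first 20 indices of year 2, with no gap in between
print(toy_djf[0, 8:12])  # [118 119 120 121]
```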
How do we find winters in which the wind keeps a constant direction for at least 20 consecutive values before reversing?
Here we are just applying this answer on Stack Overflow to our problem.
for i, yr in enumerate(data_djf):
    condition = yr > 0
    lens_true = np.diff(np.where(np.concatenate(([condition[0]], condition[:-1] != condition[1:], [True])))[0])[::2]
    if 20 <= lens_true.max() < 30:
        print(i, lens_true.max())
0 27
1 23
5 27
7 27
8 20
12 24
14 25
17 21
20 23
22 26
24 24
25 27
26 21
28 20
29 25
30 24
33 29
57 25
61 21
65 20
66 29
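The one-liner above is dense, so here is the same run-length trick unpacked on a tiny boolean array (the array is made up for illustration):

```python
import numpy as np

cond = np.array([True, True, False, True, True, True, False])

# Mark boundaries: the start of the array (if it begins with True),
# every position where the value flips, and a sentinel at the end
edges = np.concatenate(([cond[0]], cond[:-1] != cond[1:], [True]))

# Differences between boundary positions are run lengths;
# every other one (starting from the first) is a run of True
run_lengths = np.diff(np.where(edges)[0])[::2]
print(run_lengths)  # [2 3]: a run of two Trues, then a run of three
```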
In the above example numpy's capabilities were probably enough. But when you have more dimensions and the data are more complex, it is almost always better to use labelled arrays and all the great functionality offered by libraries such as xarray or iris.
We show how the iris library can be used with the same dataset. We chose iris mostly because it can handle non-standard calendars, such as the 360-day one.
To create an appropriate time coordinate, we will use iris's companion package, cf_units.
import cf_units
import iris
DAYS_PER_YEAR = 360
t_unit = cf_units.Unit('days since 0001-01-01 00:00:00',
                       calendar='360_day')
t_coord = iris.coords.DimCoord(np.arange(0, DAYS_PER_YEAR * NYEARS, 3),
                               units=t_unit,
                               standard_name='time')
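To make the 360_day calendar concrete: every year has twelve 30-day months, so converting an offset in days to a date is pure integer arithmetic. The helper below is our own illustrative sketch of what cf_units handles for us, not part of its API:

```python
def date_360(days):
    """Convert 'days since 0001-01-01' in a 360-day calendar
    (twelve 30-day months) to a (year, month, day) tuple."""
    year, rem = divmod(days, 360)
    month, day = divmod(rem, 30)
    return year + 1, month + 1, day + 1

print(date_360(0))    # (1, 1, 1) -- the first time step
print(date_360(30))   # (1, 2, 1) -- 1 February of year 1
print(date_360(360))  # (2, 1, 1) -- one model year later
```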
Now we can attach the newly created time coordinate to the data themselves by creating an iris cube:
cube = iris.cube.Cube(data=data,
                      units='m/s',
                      dim_coords_and_dims=[(t_coord, 0)])
cube.rename('eastward_wind')
print(cube)
eastward_wind / (m/s)               (time: 9000)
     Dimension coordinates:
          time                           x
Since we now have a labelled array with appropriate metadata, we can use iris to make statistical analysis easier and the code more readable.
import iris.coord_categorisation
iris.coord_categorisation.add_season(cube, 'time', name='clim_season')
iris.coord_categorisation.add_season_year(cube, 'time', name='season_year')
print(cube)
eastward_wind / (m/s)               (time: 9000)
     Dimension coordinates:
          time                           x
     Auxiliary coordinates:
          clim_season                    x
          season_year                    x
cube.coord('clim_season')
AuxCoord(array(['djf', 'djf', 'djf', ..., 'djf', 'djf', 'djf'], dtype='<U64'), standard_name=None, units=Unit('no_unit'), long_name='clim_season')
for season, year in zip(cube.coord('clim_season')[:100:10].points,
                        cube.coord('season_year')[:100:10].points):
    print('{} {}'.format(season, year))
djf 1
djf 1
mam 1
mam 1
mam 1
jja 1
jja 1
jja 1
son 1
son 1
annual_seasonal_mean = cube.aggregated_by(['clim_season', 'season_year'],
                                          iris.analysis.MEAN)
print(annual_seasonal_mean)
eastward_wind / (m/s)               (time: 301)
     Dimension coordinates:
          time                           x
     Auxiliary coordinates:
          clim_season                    x
          season_year                    x
     Cell methods:
          mean: clim_season, season_year
HTML(html)
This post was written as an IPython (Jupyter) notebook. You can view or download it using nbviewer.