name = '2017-03-24-climate-model-output'
title = 'Two ways of preparing climate model output for analysis'
tags = 'numpy, iris'
author = 'Denis Sergeev'
from nb_tools import connect_notebook_to_post
from IPython.core.display import HTML, Image
html = connect_notebook_to_post(name, title, tags, author)
Today one of the group members asked for help with reading climate model output and preparing it for data analysis.
This notebook shows a couple of ways of doing that with the help of the numpy and iris Python packages.
Luckily, the model output is quite small and stored in a simple ASCII file. However, it has some properties that can be a hurdle for a programming novice.
We start by downloading the data from a given link.
URL = 'https://raw.githubusercontent.com/ueapy/ueapy.github.io/src/content/data/run1_U_60N_10hPa.dat'
Instead of copy-pasting the contents manually, we are going to use Python's standard library and download the file, making this part of scientific analysis more reproducible.
from urllib.request import urlretrieve
To organise the data and code folders, we also import the os module.
import os
datadir = os.path.join(os.path.pardir, 'data') # directory is one level up
if not os.path.isdir(datadir):
    # if the directory does not exist, create it
    os.makedirs(datadir)
# File with data
fname = os.path.join(datadir, 'data.dat')
Now that we have a directory to store data, we can save the model output there.
urlretrieve(URL, fname)
('../data/data.dat', <http.client.HTTPMessage at 0x7fe7c80b7208>)
numpy
Since the data are purely numeric, we use the numpy module.
import numpy as np
data = np.genfromtxt(fname)
data.shape
(1500, 6)
data
array([[  6.74683,   8.44815,   4.10201,   4.62099,   8.84487,  15.718  ],
       [ 20.4013 ,  25.0052 ,  26.7049 ,  24.2583 ,  21.5956 ,  28.7007 ],
       [ 37.33   ,  35.545  ,  33.39   ,  24.4802 ,  24.7544 ,  25.5569 ],
       ...,
       [ 13.3082 ,  12.9732 ,  18.6628 ,  22.5797 ,  20.8556 ,  25.2516 ],
       [ 21.6696 ,  13.2805 ,  13.5226 ,  23.432  ,  26.1123 ,  25.6979 ],
       [ 26.1932 ,  21.6942 ,  15.5987 ,  13.2761 ,  14.5777 ,  17.3154 ]])
For some reason the data are stored in 6 columns by 1500 rows, which makes 9000 values in total.
We know a priori that the file contains 75 years of data written every third day, and that the climate model uses a 360-day calendar. Hence, we have 120 values per year:
data.shape[0] * data.shape[1] / 75
120.0
Keeping data in $1500\times6$ array does not seem to be useful, so we make it 1-D:
data = data.flatten()
data.shape
(9000,)
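Note that flatten() reads the array in row-major (C) order, i.e. across each row of 6 values before moving on to the next row, which matches the time-sequential layout of this file. A small sketch of this behaviour:

```python
import numpy as np

a = np.arange(12).reshape(4, 3)  # 4 "rows" of 3 consecutive values
flat = a.flatten()               # row-major (C) order by default
print(flat)  # [ 0  1  2  3  4  5  6  7  8  9 10 11]
```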
To make the code above reusable, we create the following function to get data.
def get_model_data(url=URL, fname='climate_model_data.dat', force_download=False):
    """
    Download climate model output from the UEA server

    Parameters
    ----------
    url : string (optional)
        web location of the data
    fname : string (optional)
        full path to save the data
    force_download : bool (optional)
        if True, force redownload of the data

    Returns
    -------
    data : numpy.ndarray
        1-D array of data
    """
    if not os.path.exists(fname) or force_download:
        urlretrieve(url, fname)
    data = np.genfromtxt(fname)
    return data.flatten()
data = get_model_data()
Now we transform the array into a more useful shape.
NDAYS = 120 # the number of 3-day periods in a 360-day year
NYEARS = 75 # the total number of years
data_yd = data.reshape((NYEARS, NDAYS))
print(data_yd.shape)
(75, 120)
For example, this is the value of the $u$-wind on 30 January of the last year:
data_yd[-1, 10]
23.914400000000001
What if we want to extract only winter data? We can't use the first winter, because it's incomplete: it has only January and February. So the first complete winter period comprises the December data from year 1:
data_yd[0, -10:]
array([ 15.5446, 20.6539, 16.4162, 22.0274, 30.3875, 27.8614, 28.5274, 32.0706, 35.9934, 34.0339])
plus the January and February data from year 2:
data_yd[1, :20]
array([ 3.06054000e+01, 2.61758000e+01, 2.98059000e+01, 3.20111000e+01, 2.72294000e+01, 1.97748000e+01, 1.90082000e+01, 1.51616000e+01, 1.22748000e+01, 1.09608000e+01, 1.36364000e+01, 2.22356000e+01, 2.76375000e+01, 2.39670000e+01, 1.24344000e+01, 2.54243000e+00, 2.32738000e-01, -3.17650000e-02, 6.82037000e-01, 4.43382000e-02])
To join them, we can use the numpy.concatenate() function:
np.concatenate([data_yd[0, -10:], data_yd[1, :20]])
array([ 1.55446000e+01, 2.06539000e+01, 1.64162000e+01, 2.20274000e+01, 3.03875000e+01, 2.78614000e+01, 2.85274000e+01, 3.20706000e+01, 3.59934000e+01, 3.40339000e+01, 3.06054000e+01, 2.61758000e+01, 2.98059000e+01, 3.20111000e+01, 2.72294000e+01, 1.97748000e+01, 1.90082000e+01, 1.51616000e+01, 1.22748000e+01, 1.09608000e+01, 1.36364000e+01, 2.22356000e+01, 2.76375000e+01, 2.39670000e+01, 1.24344000e+01, 2.54243000e+00, 2.32738000e-01, -3.17650000e-02, 6.82037000e-01, 4.43382000e-02])
And of course we can apply the same logic to the whole dataset:
data_djf = np.concatenate([data_yd[:-1, -10:], data_yd[1:, :20]], axis=1)
print(data_djf.shape)
(74, 30)
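As a sanity check that the slicing lines up, with December of year $n$ sitting next to January and February of year $n+1$, we can run the same two slices on a toy array whose values encode their own time index (the toy array is ours, for illustration only):

```python
import numpy as np

toy = np.arange(75 * 120).reshape(75, 120)  # each value equals its overall time index
toy_djf = np.concatenate([toy[:-1, -10:], toy[1:, :20]], axis=1)

print(toy_djf.shape)  # (74, 30)
# First winter: the last 10 indices of year 1 are followed by
# the first 20 indices of year 2, with no gap in between
print(toy_djf[0, 8:12])  # [118 119 120 121]
```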
How do we find winters in which the wind keeps a constant direction for at least 20 consecutive values before reversing?
Here we are just applying this answer on Stack Overflow to our problem.
for i, yr in enumerate(data_djf):
    condition = yr > 0
    lens_true = np.diff(np.where(np.concatenate(([condition[0]], condition[:-1] != condition[1:], [True])))[0])[::2]
    if 20 <= lens_true.max() < 30:
        print(i, lens_true.max())
0 27
1 23
5 27
7 27
8 20
12 24
14 25
17 21
20 23
22 26
24 24
25 27
26 21
28 20
29 25
30 24
33 29
57 25
61 21
65 20
66 29
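The one-liner above is dense, so here is the same run-length trick unpacked on a tiny boolean array (the array is made up for illustration):

```python
import numpy as np

cond = np.array([True, True, False, True, True, True, False])

# Mark boundaries: the start of the array (if it begins with True),
# every position where the value flips, and a sentinel at the end
edges = np.concatenate(([cond[0]], cond[:-1] != cond[1:], [True]))

# Differences between boundary positions are run lengths;
# every other one (starting from the first) is a run of True
run_lengths = np.diff(np.where(edges)[0])[::2]
print(run_lengths)  # [2 3]: a run of two Trues, then a run of three
```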
In the above example numpy's capabilities were probably enough. But when you have more dimensions and the data are more complex, it is almost always better to use labelled arrays and all the great functionality offered by libraries such as xarray or iris.
We show how the iris library can be used with the same dataset. We chose iris mostly because it can handle non-standard calendars, such as the 360-day one.
To create an appropriate time coordinate, we will use iris's companion package, cf_units.
import cf_units
import iris
DAYS_PER_YEAR = 360
t_unit = cf_units.Unit('days since 0001-01-01 00:00:00',
                       calendar='360_day')
t_coord = iris.coords.DimCoord(np.arange(0, DAYS_PER_YEAR * NYEARS, 3),
                               units=t_unit,
                               standard_name='time')
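To make the 360_day calendar concrete: every year has twelve 30-day months, so converting an offset in days to a date is pure integer arithmetic. The helper below is our own illustrative sketch of what cf_units handles for us, not part of its API:

```python
def date_360(days):
    """Convert 'days since 0001-01-01' in a 360-day calendar
    (twelve 30-day months) to a (year, month, day) tuple."""
    year, rem = divmod(days, 360)
    month, day = divmod(rem, 30)
    return year + 1, month + 1, day + 1

print(date_360(0))    # (1, 1, 1) -- the first time step
print(date_360(30))   # (1, 2, 1) -- 1 February of year 1
print(date_360(360))  # (2, 1, 1) -- one model year later
```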
Now we can attach the newly created time coordinate to the data themselves by creating an iris cube:
cube = iris.cube.Cube(data=data,
                      units='m/s',
                      dim_coords_and_dims=[(t_coord, 0)])
cube.rename('eastward_wind')
print(cube)
eastward_wind / (m/s)               (time: 9000)
     Dimension coordinates:
          time                           x
Since we now have a labelled array with appropriate metadata, we can use iris to make statistical analysis easier and the code more readable.
import iris.coord_categorisation
iris.coord_categorisation.add_season(cube, 'time', name='clim_season')
iris.coord_categorisation.add_season_year(cube, 'time', name='season_year')
print(cube)
eastward_wind / (m/s)               (time: 9000)
     Dimension coordinates:
          time                           x
     Auxiliary coordinates:
          clim_season                    x
          season_year                    x
cube.coord('clim_season')
AuxCoord(array(['djf', 'djf', 'djf', ..., 'djf', 'djf', 'djf'], dtype='<U64'), standard_name=None, units=Unit('no_unit'), long_name='clim_season')
for season, year in zip(cube.coord('clim_season')[:100:10].points,
                        cube.coord('season_year')[:100:10].points):
    print('{} {}'.format(season, year))
djf 1
djf 1
mam 1
mam 1
mam 1
jja 1
jja 1
jja 1
son 1
son 1
annual_seasonal_mean = cube.aggregated_by(['clim_season', 'season_year'],
                                          iris.analysis.MEAN)
print(annual_seasonal_mean)
eastward_wind / (m/s)               (time: 301)
     Dimension coordinates:
          time                           x
     Auxiliary coordinates:
          clim_season                    x
          season_year                    x
     Cell methods:
          mean: clim_season, season_year
HTML(html)
This post was written as an IPython (Jupyter) notebook. You can view or download it using nbviewer.