Our previous lessons have shown us how to write programs that ingest a list of data files, perform some calculations on those data, and then print a final result to the screen. While this was a useful exercise in learning the principles of scripting and parsing the command line, in most cases the output of our programs will not be so simple. Instead, programs typically take data as input, manipulate that data, and then output yet more data. Over the course of a multi-year research project, most researchers will write many different programs that produce many different output datasets.
We want to:
Along the way, we will learn:
In this lesson we are going to process some of the climate model data that was submitted to the CMIP5 project. This project informed many of the results presented in the Intergovernmental Panel on Climate Change (IPCC) Fifth Assessment Report, making it one of the most widely used datasets in the world.
First off, let's see what files we've got:
!ls *.nc
uas_Amon_ACCESS1-3_rcp85_r1i1p1_205006-205008_aus.nc vas_Amon_ACCESS1-3_rcp85_r1i1p1_205006-205008_aus.nc
The first thing to notice is the distinctive Data Reference Syntax (DRS) associated with CMIP5. Modelling groups contributing to the project must name their files according to the following structure:
<variable name>_<MIP table>_<model>_<experiment>_<ensemble member>_<temporal subset>_<geographical info>.nc
From this we can deduce, without even inspecting the contents of the file, that we have surface zonal (i.e. east-west; abbreviated `uas`) and meridional (i.e. north-south; abbreviated `vas`) wind speed data. It belongs to the atmospheric, monthly timescale data group (`Amon`) and was derived from an Australian model known as `ACCESS1-3`. The external forcing applied to the model was that corresponding to the `rcp85` scenario (high future greenhouse gas emissions), it was the `r1i1p1` realisation of the model, and we have the data for June, July and August of the year 2050 for the Australian (`aus`) region.
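Because the file name structure is fixed, we can recover all of this information programmatically. Here's a minimal sketch (the function name `parse_drs_filename` is our own invention) that splits a CMIP5 file name into its DRS components:

```python
import os

def parse_drs_filename(fname):
    """Split a CMIP5 file name into its DRS components.

    Assumes the optional geographical info field is present,
    as it is for our files.
    """
    keys = ('variable', 'mip_table', 'model', 'experiment',
            'ensemble', 'time_span', 'region')
    root = os.path.splitext(os.path.basename(fname))[0]
    return dict(zip(keys, root.split('_')))

facts = parse_drs_filename('uas_Amon_ACCESS1-3_rcp85_r1i1p1_205006-205008_aus.nc')
print(facts['variable'], facts['model'], facts['region'])
# uas ACCESS1-3 aus
```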
The DRS for CMIP5 actually goes further than just the file name. If you download a whole heap of CMIP5 data, it comes with the following directory structure:
/<activity>/<product>/<institute>/<model>/<experiment>/<frequency>/<modeling realm>/<variable name>/<ensemble member>/
In the first instance this level of detail seems like overkill, but consider the scope of the CMIP5 data archive. It contains data from over 50 climate models for upwards of 100 different variables and 50 or so different experiments, for which each modelling group typically provides between 3 and 10 different realisations. Since the data are so well labelled, calculating the average surface temperature (`tas`) across the `r1i1p1` realisation of all models that provided monthly timescale data for the `rcp85` scenario can be achieved with a single `cdo` bash shell command, which is truly amazing:
cdo ensmean /*/*/*/*/rcp85/mon/*/tas/r1i1p1/tas_Amon_*_rcp85_r1i1p1_*.nc tas_ensmean.nc
Unless your research involves analysing CMIP5 data, you may never deal with such a large dataset. Nevertheless, it is a very good idea to develop your own personal DRS for the data that you do have. This often involves investing some time at the beginning of a project to think carefully about the design of your directory and file name structures (as these can be very hard to change later on). The combination of bash shell wildcards and a well planned DRS is one of the easiest ways to make your research more efficient and reliable.
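To see why, here's a sketch using Python's `fnmatch` module and some hypothetical file paths laid out according to a personal DRS: a single wildcard pattern immediately isolates the files of interest.

```python
import fnmatch

# Hypothetical paths following a personal DRS:
# /data/<project>/<model>/<experiment>/<frequency>/<variable>/<file>
paths = [
    '/data/cmip5/ACCESS1-3/rcp85/mon/tas/tas_Amon_ACCESS1-3_rcp85_r1i1p1.nc',
    '/data/cmip5/CSIRO-Mk3-6-0/rcp85/mon/tas/tas_Amon_CSIRO-Mk3-6-0_rcp85_r1i1p1.nc',
    '/data/cmip5/ACCESS1-3/historical/mon/tas/tas_Amon_ACCESS1-3_historical_r1i1p1.nc',
]

# Select every monthly rcp85 tas file, regardless of model
pattern = '/data/cmip5/*/rcp85/mon/tas/tas_Amon_*_rcp85_r1i1p1.nc'
matches = [path for path in paths if fnmatch.fnmatch(path, pattern)]
print(matches)
```

With real files on disk, `glob.glob(pattern)` would do the same job against the filesystem.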
We haven't even looked inside our CMIP5 data files and already we have the beginnings of a detailed data management plan. The first step in any research project should be to develop such a plan, so for this challenge we are going to turn back time. If you could start your current research project all over again, what would your data management plan look like? Things to consider include:
Write down and discuss your plan with your partner.
Now that we've identified our CMIP5 files, let's go ahead and look at what's inside. Our initial impulse might be to enter
!cat uas_Amon_ACCESS1-3_rcp85_r1i1p1_205006-205008_aus.nc
but in this case such a command will produce an incomprehensible mix of symbols and letters. The reason is that up until now we have been dealing with text files. These consist of a simple sequence of character data (represented using ASCII, Unicode, or some other standard) separated into lines, meaning that text files are human-readable when opened with a text editor or displayed using `cat`.
All other file types are known collectively as binary files. They tend to be smaller and faster for the computer to interpret than text files, but the trade-off is that they aren't human-readable unless you have the right interpreter (e.g. `.doc` files aren't readable with your text editor and must instead be opened with Microsoft Word). In this case we have a Network Common Data Form (netCDF) file, so we need to use a special command line utility called `ncdump` to view the contents.
!ncdump -h uas_Amon_ACCESS1-3_rcp85_r1i1p1_205006-205008_aus.nc
netcdf uas_Amon_ACCESS1-3_rcp85_r1i1p1_205006-205008_aus { dimensions: lon = 26 ; nb2 = 2 ; lat = 28 ; time = UNLIMITED ; // (3 currently) variables: double lon(lon) ; lon:standard_name = "longitude" ; lon:long_name = "longitude" ; lon:units = "degrees_east" ; lon:axis = "X" ; lon:bounds = "lon_bnds" ; double lon_bnds(lon, nb2) ; double lat(lat) ; lat:standard_name = "latitude" ; lat:long_name = "latitude" ; lat:units = "degrees_north" ; lat:axis = "Y" ; lat:bounds = "lat_bnds" ; double lat_bnds(lat, nb2) ; double time(time) ; time:standard_name = "time" ; time:bounds = "time_bnds" ; time:units = "days since 1-01-01 00:00:00" ; time:calendar = "proleptic_gregorian" ; double time_bnds(time, nb2) ; time_bnds:units = "days since 1-01-01 00:00:00" ; time_bnds:calendar = "proleptic_gregorian" ; float uas(time, lat, lon) ; uas:standard_name = "eastward_wind" ; uas:long_name = "Eastward Near-Surface Wind" ; uas:units = "m s-1" ; uas:_FillValue = 1.e+20f ; uas:cell_methods = "time: mean" ; uas:history = "2012-03-14T04:40:42Z altered by CMOR: Treated scalar dimension: \'height\'. 2012-03-14T04:40:42Z altered by CMOR: replaced missing value flag (-1.07374e+09) with standard missing value (1e+20)." ; uas:associated_files = "baseURL: http://cmip-pcmdi.llnl.gov/CMIP5/dataLocation gridspecFile: gridspec_atmos_fx_ACCESS1-3_rcp85_r0i0p0.nc" ; // global attributes: :CDI = "Climate Data Interface version 1.5.6 (http://code.zmaw.de/projects/cdi)" ; :Conventions = "CF-1.4" ; :history = "Thu Nov 07 14:19:44 2013: cdo sellonlatbox,110,160,-45,-10 uas_Amon_ACCESS1-3_rcp85_r1i1p1_205006-205008.nc uas_Amon_ACCESS1-3_rcp85_r1i1p1_205006-205008_aus.nc\n", "Thu Nov 07 14:13:51 2013: cdo seldate,2050-06-01,2050-08-31 uas_Amon_ACCESS1-3_rcp85_r1i1p1_200601-210012.nc uas_Amon_ACCESS1-3_rcp85_r1i1p1_205006-205008.nc\n", "CMIP5 compliant file produced from raw ACCESS model output using the ACCESS Post-Processor and CMOR2. 
2012-03-14T04:40:43Z CMOR rewrote data to comply with CF standards and CMIP5 requirements. Fri Apr 13 12:32:01 2012: corrected model_id from ACCESS1-3 to ACCESS1.3 Fri Apr 13 14:07:50 2012: forcing attribute modified to correct value Wed May 2 13:39:09 2012: updated version number to v20120413." ; :institution = "CSIRO (Commonwealth Scientific and Industrial Research Organisation, Australia), and BOM (Bureau of Meteorology, Australia)" ; :institute_id = "CSIRO-BOM" ; :experiment_id = "rcp85" ; :model_id = "ACCESS1.3" ; :forcing = "GHG, Oz, SA, Sl, Vl, BC, OC, (GHG = CO2, N2O, CH4, CFC11, CFC12, CFC113, HCFC22, HFC125, HFC134a)" ; :parent_experiment_id = "historical" ; :parent_experiment_rip = "r1i1p1" ; :branch_time = 732311. ; :contact = "The ACCESS wiki: http://wiki.csiro.au/confluence/display/ACCESS/Home. Contact Tony.Hirst@csiro.au regarding the ACCESS coupled climate model. Contact Peter.Uhe@csiro.au regarding ACCESS coupled climate model CMIP5 datasets." ; :references = "See http://wiki.csiro.au/confluence/display/ACCESS/ACCESS+Publications" ; :initialization_method = 1 ; :physics_version = 1 ; :tracking_id = "724f536a-c5fa-4a68-85f1-ff277af34c75" ; :version_number = "v20120413" ; :product = "output" ; :experiment = "RCP8.5" ; :frequency = "mon" ; :creation_date = "2012-03-14T04:40:43Z" ; :project_id = "CMIP5" ; :table_id = "Table Amon (01 February 2012) 01388cb4507c2f05326b711b09604e7e" ; :title = "ACCESS1-3 model output prepared for CMIP5 RCP8.5" ; :parent_experiment = "historical" ; :modeling_realm = "atmos" ; :realization = 1 ; :cmor_version = "2.8.0" ; :CDO = "Climate Data Operators version 1.5.6.1 (http://code.zmaw.de/projects/cdo)" ; }
By using the `-h` flag, only the header of the file is shown. The great thing about netCDF files is that the header contains metadata, that is, data about the data. Each variable has its own 'variable attributes' (e.g. the `lat` axis has a `long_name` and `units` attribute), and there is also a whole suite of global attributes that describe the history of the file. When we write our own netCDF output later on, we will discuss the conventions around netCDF metadata in more detail.
To read in our data, we are going to use a library known as the Climate Data Management System (`cdms2`). This library is part of a larger open-source software package called the Ultrascale Visualisation Climate Data Analysis Tools (UV-CDAT).
import cdms2
u_name = 'uas_Amon_ACCESS1-3_rcp85_r1i1p1_205006-205008_aus.nc'
u_file = cdms2.open(u_name)
u_data = u_file('uas')
v_name = 'vas_Amon_ACCESS1-3_rcp85_r1i1p1_205006-205008_aus.nc'
v_file = cdms2.open(v_name)
v_data = v_file('vas')
Our two variables, `u_data` and `v_data`, are cdms2 transient variables.
print 'u_data is of type:', type(u_data)
u_data is of type: <class 'cdms2.tvariable.TransientVariable'>
The nice thing about these transient variables is that they carry the netCDF metadata with them.
print 'Metadata about the time axis:'
print u_data.getTime()
print 'Raw time axis values:'
print u_data.getTime()[:]
print 'Time axis values in a friendlier format:'
print u_data.getTime().asComponentTime()
Metadata about the time axis: id: time Designated a time axis. units: days since 1-01-01 00:00:00 Length: 3 First: 748548.0 Last: 748609.5 Other axis attributes: standard_name: time calendar: proleptic_gregorian axis: T Python id: 0x3599190 Raw time axis values: [ 748548. 748578.5 748609.5] Time axis values in a friendlier format: [2050-6-16 0:0:0.0, 2050-7-16 12:0:0.0, 2050-8-16 12:0:0.0]
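Under the hood, a conversion like `asComponentTime()` is just date arithmetic against the reference date given in the `units` attribute. We can sketch the same calculation ourselves with the standard `datetime` library (which, conveniently, also uses the proleptic Gregorian calendar that this file specifies):

```python
import datetime

# Reference date from the units attribute: 'days since 1-01-01 00:00:00'
ref = datetime.datetime(1, 1, 1)

# The raw time axis values we printed above
for value in [748548.0, 748578.5, 748609.5]:
    print(ref + datetime.timedelta(days=value))
# 2050-06-16 00:00:00
# 2050-07-16 12:00:00
# 2050-08-16 12:00:00
```

Note this only works because the file's calendar matches `datetime`'s; other CMIP5 models use calendars (e.g. 360-day) that `datetime` can't represent, which is exactly why libraries like `cdms2` handle this for us.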
We can now go ahead and calculate the wind speed,
wsp_data = (u_data**2 + v_data**2)**0.5
and our transient variables are smart enough to pass along the relevant metadata to our new variable:
print 'Metadata about the time axis:'
print wsp_data.getTime()
print 'Time axis values in a friendlier format:'
print wsp_data.getTime().asComponentTime()
Metadata about the time axis: id: time Designated a time axis. units: days since 1-01-01 00:00:00 Length: 3 First: 748548.0 Last: 748609.5 Other axis attributes: standard_name: time calendar: proleptic_gregorian axis: T Python id: 0x321db50 Time axis values in a friendlier format: [2050-6-16 0:0:0.0, 2050-7-16 12:0:0.0, 2050-8-16 12:0:0.0]
Climate and Forecast (CF) metadata convention
It's incredibly useful that libraries like `cdms2` can make use of the metadata stored in netCDF files to create methods like `asComponentTime()`. However, let's put ourselves in the shoes of the developers of `cdms2` for a minute. In order to convert the time axis to a meaningful list of dates, the library needs to first identify the units of the time axis. This isn't as easy as you'd think, since the creator of the netCDF file could easily have called the `units` attribute `measure`, or `scale`, or something else completely unpredictable instead. They could also have defined the units as `weeks since 1-01-01 00:00:00`, or `milliseconds after 1979-12-31`. Obviously what is needed is a standard method for defining netCDF attributes, and that's where the Climate and Forecast (CF) metadata convention comes in.
The CF metadata standard was first defined back in the early 2000s and has now been adopted by all the major institutions and projects in the weather/climate sciences. There is a nice blog post on the topic if you'd like more information, but for the most part you just need to be aware that if a tool you're using isn't working, it might be because your netCDF file isn't CF compliant.
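To make the problem concrete, here's a rough sketch of the kind of parsing a library has to do; it only works because CF fixes the form of the time `units` string to `<units> since <reference date>` (the function name `parse_time_units` is our own invention):

```python
import re

def parse_time_units(units):
    """Split a CF-style time units string into its unit and
    reference-date parts (e.g. 'days since 1-01-01 00:00:00')."""
    match = re.match(r'(\w+) since (.+)', units)
    if match is None:
        raise ValueError('time units are not CF compliant: %r' % units)
    return match.group(1), match.group(2)

print(parse_time_units('days since 1-01-01 00:00:00'))
# ('days', '1-01-01 00:00:00')
```

A non-compliant string like `measure` would simply raise an error, which is essentially what happens (with varying degrees of grace) when real tools meet non-compliant files.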
Before we go ahead and create a new script (`calc_wind_speed.py`) for calculating the wind speed, there's just one more thing to consider. Looking closely at the global attributes of `uas_Amon_ACCESS1-3_rcp85_r1i1p1_205006-205008_aus.nc`, you can see that the entire history of the file, all the way back to its initial download, has been recorded in the `history` attribute.
global_atts = u_file.attributes
old_history = global_atts['history']
print old_history
Thu Nov 07 14:19:44 2013: cdo sellonlatbox,110,160,-45,-10 uas_Amon_ACCESS1-3_rcp85_r1i1p1_205006-205008.nc uas_Amon_ACCESS1-3_rcp85_r1i1p1_205006-205008_aus.nc Thu Nov 07 14:13:51 2013: cdo seldate,2050-06-01,2050-08-31 uas_Amon_ACCESS1-3_rcp85_r1i1p1_200601-210012.nc uas_Amon_ACCESS1-3_rcp85_r1i1p1_205006-205008.nc CMIP5 compliant file produced from raw ACCESS model output using the ACCESS Post-Processor and CMOR2. 2012-03-14T04:40:43Z CMOR rewrote data to comply with CF standards and CMIP5 requirements. Fri Apr 13 12:32:01 2012: corrected model_id from ACCESS1-3 to ACCESS1.3 Fri Apr 13 14:07:50 2012: forcing attribute modified to correct value Wed May 2 13:39:09 2012: updated version number to v20120413.
The last two entries, for instance, were generated by the `cdo` package when it was used to select a temporal (`seldate`) and spatial (`sellonlatbox`) subset of the original data file. This practice of recording the history of the file ensures the provenance of the data. In other words, a complete record of everything that has been done to the data is stored with the data, which avoids any confusion in the event that the data is ever moved, passed around to different users, or viewed by its creator many months later.
If we want to create our own entry for the history attribute, we'll need to be able to generate a time stamp, capture the command line entry and Python executable used to run the script, and identify the version (git hash) of the script itself.
calc_wind_speed.py
A library called `datetime` can be used to find out the time and date right now:
import datetime
time_stamp = datetime.datetime.now().strftime("%a %b %d %H:%M:%S %Y")
print time_stamp
Tue Jun 10 16:59:58 2014
The `strftime` function can be used to customise the appearance of a datetime object; in this case we've made it look just like the other time stamps in our data files.
In the Software Carpentry lesson on command line programs we met `sys.argv`, which contains all the arguments entered by the user at the command line:
import sys
print sys.argv
['-c', '-f', '/home/dbirving/.ipython/profile_default/security/kernel-0f9c6be1-a1d7-4bd9-8f12-c7772ab46070.json', "--IPKernelApp.parent_appname='ipython-notebook'", '--profile-dir', '/home/dbirving/.ipython/profile_default', '--parent=1']
In launching this IPython notebook, you can see that a number of command line arguments were used. To join all these list elements up, we can use the `join` function that belongs to Python strings:
args = " ".join(sys.argv)
print args
-c -f /home/dbirving/.ipython/profile_default/security/kernel-0f9c6be1-a1d7-4bd9-8f12-c7772ab46070.json --IPKernelApp.parent_appname='ipython-notebook' --profile-dir /home/dbirving/.ipython/profile_default --parent=1
While this list of arguments is very useful, it doesn't tell us which Python installation was used to execute those arguments. The `sys` library can help us out here too:
exe = sys.executable
print exe
/usr/local/uvcdat/1.5.1/bin/python
In the Software Carpentry lessons on git we learned that each commit is associated with a unique 40-character identifier known as a hash. We can use the git Python library to get the hash associated with the script:
from git import Repo
import os
git_hash = Repo(os.getcwd()).head.commit.hexsha
print git_hash
2db1f92b517fe4262f4c8d98d2ec8b9b4c89b154
We can now put all this information together for our history entry:
entry = """%s: %s %s (Git hash: %s)""" %(time_stamp, exe, args, git_hash[0:7])
print entry
Tue Jun 10 16:59:58 2014: /usr/local/uvcdat/1.5.1/bin/python -c -f /home/dbirving/.ipython/profile_default/security/kernel-0f9c6be1-a1d7-4bd9-8f12-c7772ab46070.json --IPKernelApp.parent_appname='ipython-notebook' --profile-dir /home/dbirving/.ipython/profile_default --parent=1 (Git hash: 2db1f92)
So far we've been experimenting in the IPython notebook to familiarise ourselves with UV-CDAT and the other Python modules that might be useful for calculating the wind speed. We should now go ahead and write a script, so we can repeat the process with a single entry at the command line:
!cat calc_wind_speed.py
import os, sys
import datetime
from git import Repo
import cdms2

cdms2.setNetcdfShuffleFlag(0)
cdms2.setNetcdfDeflateFlag(0)
cdms2.setNetcdfDeflateLevelFlag(0)


def main():
    script = sys.argv[0]
    u_file = sys.argv[1]
    u_var = sys.argv[2]
    v_file = sys.argv[3]
    v_var = sys.argv[4]
    outfile_name = sys.argv[5]

    u_data, ufile_atts = read_data(u_file, u_var)
    v_data, vfile_atts = read_data(v_file, v_var)

    wsp_data = calc_wsp(u_data, v_data)
    write_output(wsp_data, ufile_atts, outfile_name)


def read_data(ifile, var):
    """Read data from ifile corresponding to the var variable"""
    fin = cdms2.open(ifile)
    data = fin(var)
    file_atts = fin.attributes
    fin.close()
    return data, file_atts


def calc_wsp(uwnd, vwnd):
    """Calculate the wind speed and create relevant attributes"""
    wsp = (uwnd**2 + vwnd**2)**0.5
    wsp.id = 'wsp'
    wsp.long_name = 'Wind speed'
    wsp.units = 'm s-1'
    return wsp


def write_output(wsp_data, ufile_atts, outfile_name):
    """Write the output file"""
    outfile = cdms2.open(outfile_name, 'w')

    new_history = create_history()
    old_history = ufile_atts['history']
    setattr(outfile, 'history', """%s\n%s""" %(new_history, old_history))
    for att_name in ufile_atts.keys():
        if att_name != "history":  # history excluded because we've already done it
            setattr(outfile, att_name, ufile_atts[att_name])

    outfile.write(wsp_data)
    outfile.close()


def create_history():
    """Create the new entry for the global history file attribute"""
    time_stamp = datetime.datetime.now().strftime("%a %b %d %H:%M:%S %Y")
    exe = sys.executable
    args = " ".join(sys.argv)
    git_hash = Repo(os.getcwd()).head.commit.hexsha
    return """%s: %s %s (Git hash: %s)""" %(time_stamp, exe, args, git_hash[0:7])


main()
(The `cdms2.setNetcdf...` commands simply specify that we want the classic netCDF format; see this post for more details on netCDF formats.)
We can now run this script at the command line:
!/usr/local/uvcdat/1.5.1/bin/python calc_wind_speed.py uas_Amon_ACCESS1-3_rcp85_r1i1p1_205006-205008_aus.nc uas vas_Amon_ACCESS1-3_rcp85_r1i1p1_205006-205008_aus.nc vas wsp_Amon_ACCESS1-3_rcp85_r1i1p1_205006-205008_aus.nc
We can now inspect the attributes in our new file:
!ncdump -h wsp_Amon_ACCESS1-3_rcp85_r1i1p1_205006-205008_aus.nc
netcdf wsp_Amon_ACCESS1-3_rcp85_r1i1p1_205006-205008_aus { dimensions: time = UNLIMITED ; // (3 currently) bound = 2 ; lat = 28 ; lon = 26 ; variables: double time(time) ; time:bounds = "time_bnds" ; time:units = "days since 1-01-01 00:00:00" ; time:standard_name = "time" ; time:calendar = "proleptic_gregorian" ; time:axis = "T" ; double time_bnds(time, bound) ; double lat(lat) ; lat:bounds = "lat_bnds" ; lat:units = "degrees_north" ; lat:long_name = "latitude" ; lat:standard_name = "latitude" ; lat:axis = "Y" ; double lat_bnds(lat, bound) ; double lon(lon) ; lon:bounds = "lon_bnds" ; lon:modulo = 360. ; lon:long_name = "longitude" ; lon:standard_name = "longitude" ; lon:units = "degrees_east" ; lon:axis = "X" ; lon:topology = "circular" ; double lon_bnds(lon, bound) ; float wsp(time, lat, lon) ; wsp:associated_files = "baseURL: http://cmip-pcmdi.llnl.gov/CMIP5/dataLocation gridspecFile: gridspec_atmos_fx_ACCESS1-3_rcp85_r0i0p0.nc" ; wsp:long_name = "Wind speed" ; wsp:standard_name = "eastward_wind" ; wsp:cell_methods = "time: mean" ; wsp:units = "m s-1" ; wsp:missing_value = 1.e+20f ; wsp:history = "2012-03-14T04:40:42Z altered by CMOR: Treated scalar dimension: \'height\'. 2012-03-14T04:40:42Z altered by CMOR: replaced missing value flag (-1.07374e+09) with standard missing value (1e+20)." 
; // global attributes: :Conventions = "CF-1.4" ; :history = "Tue Jun 10 17:00:00 2014: /usr/local/uvcdat/1.5.1/bin/python calc_wind_speed.py uas_Amon_ACCESS1-3_rcp85_r1i1p1_205006-205008_aus.nc uas vas_Amon_ACCESS1-3_rcp85_r1i1p1_205006-205008_aus.nc vas wsp_Amon_ACCESS1-3_rcp85_r1i1p1_205006-205008_aus.nc (Git hash: 2db1f92)\n", "Thu Nov 07 14:19:44 2013: cdo sellonlatbox,110,160,-45,-10 uas_Amon_ACCESS1-3_rcp85_r1i1p1_205006-205008.nc uas_Amon_ACCESS1-3_rcp85_r1i1p1_205006-205008_aus.nc\n", "Thu Nov 07 14:13:51 2013: cdo seldate,2050-06-01,2050-08-31 uas_Amon_ACCESS1-3_rcp85_r1i1p1_200601-210012.nc uas_Amon_ACCESS1-3_rcp85_r1i1p1_205006-205008.nc\n", "CMIP5 compliant file produced from raw ACCESS model output using the ACCESS Post-Processor and CMOR2. 2012-03-14T04:40:43Z CMOR rewrote data to comply with CF standards and CMIP5 requirements. Fri Apr 13 12:32:01 2012: corrected model_id from ACCESS1-3 to ACCESS1.3 Fri Apr 13 14:07:50 2012: forcing attribute modified to correct value Wed May 2 13:39:09 2012: updated version number to v20120413." ; :initialization_method = 1 ; :CDI = "Climate Data Interface version 1.5.6 (http://code.zmaw.de/projects/cdi)" ; :product = "output" ; :creation_date = "2012-03-14T04:40:43Z" ; :frequency = "mon" ; :references = "See http://wiki.csiro.au/confluence/display/ACCESS/ACCESS+Publications" ; :title = "ACCESS1-3 model output prepared for CMIP5 RCP8.5" ; :experiment = "RCP8.5" ; :realization = 1 ; :project_id = "CMIP5" ; :institute_id = "CSIRO-BOM" ; :model_id = "ACCESS1.3" ; :parent_experiment_id = "historical" ; :experiment_id = "rcp85" ; :cmor_version = "2.8.0" ; :parent_experiment = "historical" ; :modeling_realm = "atmos" ; :branch_time = 732311. 
; :institution = "CSIRO (Commonwealth Scientific and Industrial Research Organisation, Australia), and BOM (Bureau of Meteorology, Australia)" ; :version_number = "v20120413" ; :forcing = "GHG, Oz, SA, Sl, Vl, BC, OC, (GHG = CO2, N2O, CH4, CFC11, CFC12, CFC113, HCFC22, HFC125, HFC134a)" ; :CDO = "Climate Data Operators version 1.5.6.1 (http://code.zmaw.de/projects/cdo)" ; :physics_version = 1 ; :contact = "The ACCESS wiki: http://wiki.csiro.au/confluence/display/ACCESS/Home. Contact Tony.Hirst@csiro.au regarding the ACCESS coupled climate model. Contact Peter.Uhe@csiro.au regarding ACCESS coupled climate model CMIP5 datasets." ; :table_id = "Table Amon (01 February 2012) 01388cb4507c2f05326b711b09604e7e" ; :tracking_id = "724f536a-c5fa-4a68-85f1-ff277af34c75" ; :parent_experiment_rip = "r1i1p1" ; }
Since most of the file attributes were inherited by default from the input data file (i.e. the u-wind file), it's worth checking to see if there are any that don't make sense. Sure enough, the standard name is misleading:
wsp:standard_name = "eastward_wind"
We should revise our script so that it removes or renames the standard name, but beyond that we should resist the urge to start cleaning up. The `associated_files` wind speed attribute, for instance, makes little sense to anyone who isn't involved in the CMIP5 project. While this might seem like a reasonable argument for deleting that attribute, once an attribute is deleted it's gone forever. The `associated_files` attribute takes up a negligible amount of memory, so why not leave it there just in case? When in doubt, keep metadata.
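The rename itself is a one-line change to `calc_wsp`; `wind_speed` is a valid CF standard name. Here's a sketch, using a `SimpleNamespace` as a stand-in for the cdms2 transient variable so the logic can be checked in isolation (attribute assignment works the same way on the real thing):

```python
from types import SimpleNamespace

def fix_standard_name(wsp):
    """Replace the standard_name inherited from the u-wind input
    with one that actually describes wind speed."""
    wsp.standard_name = 'wind_speed'
    return wsp

# Stand-in for the variable returned by calc_wsp
wsp = SimpleNamespace(standard_name='eastward_wind', units='m s-1')
print(fix_standard_name(wsp).standard_name)
# wind_speed
```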
Does your data management plan from the first challenge adequately address this issue of data provenance? If not, go ahead and add to your plan now. Things to consider include:

- How will you record the history of files whose format can't store it for you? (Unlike self-describing `.nc` files, formats like `.csv` or `.png` don't store things like global and variable attributes within them.)

Discuss the additions you've made to your plan with your partner.