This notebook shows how we read, process, and plot data from an Argo float netCDF file.
Feedback on whether we're doing this the right way would be great!
We have downloaded all argo data from: ftp://usgodae.org/pub/outgoing/argo
Below is an IPython command for downloading just one netCDF file (the '!' tells IPython to run the line in the shell).
!wget ftp://usgodae.org/pub/outgoing/argo/dac/aoml/1900203/1900203_prof.nc
--2014-07-01 10:51:16--  ftp://usgodae.org/pub/outgoing/argo/dac/aoml/1900203/1900203_prof.nc
           => `1900203_prof.nc.1'
Resolving usgodae.org (usgodae.org)... 199.9.2.160
Connecting to usgodae.org (usgodae.org)|199.9.2.160|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/outgoing/argo/dac/aoml/1900203 ... done.
==> SIZE 1900203_prof.nc ... 1160304
==> PASV ... done.    ==> RETR 1900203_prof.nc ... done.
Length: 1160304 (1.1M) (unauthoritative)

100%[======================================>] 1,160,304   4.72M/s   in 0.2s

2014-07-01 10:51:18 (4.72 MB/s) - `1900203_prof.nc.1' saved [1160304]
But actually we downloaded all of it (it took a while). Here's what the directory looks like:
!ls argo/
ar_greylist.txt              ar_index_global_prof.txt.gz  ar_index_global_traj.txt.gz  latest_data
ar_index_global_meta.txt     ar_index_global_tech.txt     dac                          readme_before_using_the_data.txt
ar_index_global_meta.txt.gz  ar_index_global_tech.txt.gz  etc
ar_index_global_prof.txt     ar_index_global_traj.txt     geo
We plan to look at the files by data center (dac), but the data is also organized by geography (geo, e.g. by ocean), and there is a directory containing just the latest data (latest_data).
!ls argo/geo
atlantic_ocean indian_ocean pacific_ocean
!ls argo/dac/
aoml bodc coriolis csio csiro incois jma kma kordi meds nmdis
!ls argo/dac/aoml/ | head
13857
13858
13859
15819
15820
15851
15852
15853
15854
15855
ls: write error: Broken pipe
We're not really sure what we should take away from the readme.
!cat argo/readme_before_using_the_data.txt
Argo near real-time data is subject to only coarse fully-automated quality control
checks. No quality control is done on Oxygen measurements at present.

Argo delayed-mode procedures for checking sensor drifts and offsets in salinity rely
on a statistical comparison of the float data with reference data. An adjustment is
made when the float PI judges that it will improve the quality of the dataset. Users
should include the supplied error estimates in their usage of Argo delayed-mode
salinity data.

For both near real-time and delayed mode data, proper and appropriate use is the
responsability of the user
We're not sure if/how we should use the greylist.
Some of the entries have both start and end dates; some have no end date. We also aren't sure what the platform code refers to, and we don't know whether a greylist entry means the data is present but bad, or that the data won't be there at all.
!head -5 argo/ar_greylist.txt
PLATFORM_CODE,PARAMETER_NAME,START_DATE,END_DATE,QUALITY_CODE,COMMENT,DAC
4900313,PRES,20070512,,3,sensor problem,AO
4900313,TEMP,20070512,,3,sensor problem,AO
4900313,PSAL,20070512,,3,sensor problem,AO
5900708,PSAL,20090221,,3,sensor problem,AO
!tail -20 argo/ar_greylist.txt
2900890,PRES,20090116,20110615,4,sensor problem,KM
2900890,TEMP,20090116,20110615,4,sensor problem,KM
2900890,PSAL,20090116,20110615,4,sensor problem,KM
2900603,PSAL,20060717,,3,hit bottom and bad data after returned ,KO
2900603,TEMP,20060717,,3,hit bottom and bad data after returned ,KO
2900603,PRES,20060717,,3,hit bottom and bad data after returned ,KO
2901209,PSAL,20121001,,3,hit bottom and bad data after returned ,KO
2901209,TEMP,20121001,,3,hit bottom and bad data after returned ,KO
2901209,PRES,20121001,,3,hit bottom and bad data after returned ,KO
2901204,PSAL,20101022,,3,hit bottom and bad data after returned ,KO
2901204,TEMP,20101022,,3,hit bottom and bad data after returned ,KO
2901204,PRES,20101022,,3,hit bottom and bad data after returned ,KO
4900518,PRES,20110101,,4,TNPD and sensor problem,ME
4900518,TEMP,20110101,,4,pressure sensor failure,ME
4900518,PSAL,20110101,,4,pressure sensor failure,ME
4900503,PRES,20090309,,4,TNPD and sensor problem,ME
4900503,TEMP,20090309,,4,pressure sensor failure,ME
4900503,PSAL,20090309,,4,pressure sensor failure,ME
4901101,PRES,20080603,,4,pressure sensor failure and potential other failures,ME
4901101,PSAL,20080603,,4,pressure sensor failure and potential conductivity sensor failure,ME
4901101,TEMP,20080603,,4,pressure sensor failure and potential temperature sensor failure,ME
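If the greylist does flag data that is present but suspect, one way we might consume it is sketched below. This is our own guess: we assume PLATFORM_CODE is the float ID used in the dac/ directory names, which we haven't verified.

import csv

# Sketch: build a lookup of greylisted (platform, parameter) pairs, keeping
# the start/end dates so measurements inside that window can be flagged.
# The PLATFORM_CODE == float ID assumption is unverified.
def LoadGreylist(path):
    suspect = {}
    for row in csv.DictReader(open(path)):
        key = (row['PLATFORM_CODE'], row['PARAMETER_NAME'])
        suspect[key] = (row['START_DATE'], row['END_DATE'] or None)
    return suspect

greylist = LoadGreylist('argo/ar_greylist.txt')
print ('4900313', 'PRES') in greylist  # True if this float's PRES is greylisted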
Using scipy to read a netcdf file

We decided to skip over some of the things we don't totally understand.
from scipy.io import netcdf
import datetime
f = netcdf.netcdf_file('1900203_prof.nc', 'r')
For now, we are only considering the variables in the list below (not the QC versions or any other metadata - but maybe we are missing something important?)
DATE = 'JULD'
LON = 'LONGITUDE'
LAT = 'LATITUDE'
TEMP = 'TEMP'
PRES = 'PRES'
PSAL = 'PSAL'
dates = f.variables[DATE]
lats = f.variables[LAT]
lons = f.variables[LON]
temps = f.variables[TEMP]
pressures = f.variables[PRES]
salinities = f.variables[PSAL]
Perhaps we should double-check the 'REFERENCE_DATE_TIME' variable in each file to confirm, but for now we just assume that JULD counts days since Jan 1, 1950.
list(f.variables['REFERENCE_DATE_TIME']) # this is the date mentioned above (but in a strange format)
['1', '9', '5', '0', '0', '1', '0', '1', '0', '0', '0', '0', '0', '0']
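If we do want to confirm per file, something like the following sketch could work (GetReferenceDate is our own hypothetical helper; the variable appears to be a character array in YYYYMMDDHHMMSS format):

# Sketch: join the character array and parse it as YYYYMMDDHHMMSS.
def GetReferenceDate(netcdf_file):
    ref_string = ''.join(netcdf_file.variables['REFERENCE_DATE_TIME'][:])
    return datetime.datetime.strptime(ref_string, '%Y%m%d%H%M%S')

print GetReferenceDate(f)  # expect 1950-01-01 00:00:00 for this file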
JULD_DAYS_SINCE = datetime.datetime(1950, 1, 1)
def ConvertDate(juld, since=JULD_DAYS_SINCE):
    return since + datetime.timedelta(days=juld)
print dates[10]
print ConvertDate(dates[10]) # example of a date conversion
19605.5572339
2003-09-05 13:22:25.006371
So far, I've just done the simplest possible pressure-to-depth conversion, shown below (though it's not included in the CSV later). It seems pretty obvious that I've at least got the units wrong. I am currently looking for Argo-specific resources for this conversion; advice on how to do this calculation would be great.
ATMOSPHERIC_PRESSURE = 101000
G = 9.8
R = 1030

# theoretically, will return approximate depth in meters (but pressure is not in the expected units)
def GetDepth(pressure):
    depth = pressure / (R * G)
    return depth
print pressures[1][40]
print GetDepth(pressures[1][40]) # example of a pressure to depth conversion -- pressures are a list of lists, one list of measurements per profile
925.0
0.0916385971864
pressures.units # actual units of pressure in this dataset
'decibar'
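Since the units turn out to be decibars, a minimal improvement on the sketch above is to convert decibars to pascals before applying the hydrostatic approximation (still assuming a constant seawater density of 1030 kg/m^3). This recovers the common rule of thumb that 1 dbar is roughly 1 m of depth; a proper conversion (e.g. Saunders' formula, as used in seawater libraries) also accounts for latitude, so treat this as a placeholder:

DBAR_TO_PA = 1.0e4  # 1 decibar = 10^4 pascals

# Still a sketch: hydrostatic approximation with the constant density (R) and
# gravity (G) from above; a real conversion should vary with latitude and with
# the measured temperature/salinity profile.
def GetDepthFromDecibar(pressure_dbar):
    pressure_pa = pressure_dbar * DBAR_TO_PA  # decibar -> pascal
    return pressure_pa / (R * G)

print GetDepthFromDecibar(925.0)  # ~916 m, close to 1 m per dbar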
Next, we write each profile file out to a CSV (columns for float ID, date, lat, lon, temp, pressure, salinity, and, eventually, a derived depth based on pressure).
def WriteCSV(fid, netcdf_filename, output_file):
    # TODO(cobbc): maybe filter out and replace nodata values with N/A or None?
    f = netcdf.netcdf_file(netcdf_filename, 'r')
    out = output_file
    header = 'FLOATID, TIMESTAMP, LAT, LON, DERIVED DEPTH, TEMPERATURE, PRESSURE, SALINITY\n'
    out.write(header)
    dates = f.variables[DATE]
    lats = f.variables[LAT]
    lons = f.variables[LON]
    temps = f.variables[TEMP]
    pressures = f.variables[PRES]
    salinities = f.variables[PSAL]
    for date, lat, lon, temp, pressure, psal in zip(dates, lats, lons, temps, pressures, salinities):
        timestamp = ConvertDate(date)
        derived_depth = 0  # TODO(cobbc): update with algorithm for using pressure and/or psal to derive depth
        output = fid + ', ' + str(timestamp) + ', ' + str(lat) + ', ' + str(lon) + ', ' + str(derived_depth) + ', '
        for t, p, s in zip(temp, pressure, psal):
            out.write(output + str(t) + ', ' + str(p) + ', ' + str(s) + '\n')
outfile = open('example_csv.csv', 'w')
WriteCSV('1900203', 'argo/dac/aoml/1900203/1900203_prof.nc', outfile)
outfile.close()
!head -20 example_csv.csv
FLOATID, TIMESTAMP, LAT, LON, DERIVED DEPTH, TEMPERATURE, PRESSURE, SALINITY
1900203, 2003-05-28 13:40:01.001930, -34.0870018005, -12.1160001755, 0, 17.657, 5.0, 35.418
1900203, 2003-05-28 13:40:01.001930, -34.0870018005, -12.1160001755, 0, 17.637, 15.0, 35.423
1900203, 2003-05-28 13:40:01.001930, -34.0870018005, -12.1160001755, 0, 17.622, 25.0, 35.42
1900203, 2003-05-28 13:40:01.001930, -34.0870018005, -12.1160001755, 0, 17.622, 35.0, 35.425
1900203, 2003-05-28 13:40:01.001930, -34.0870018005, -12.1160001755, 0, 17.623, 45.0, 35.425
1900203, 2003-05-28 13:40:01.001930, -34.0870018005, -12.1160001755, 0, 17.621, 55.0, 35.425
1900203, 2003-05-28 13:40:01.001930, -34.0870018005, -12.1160001755, 0, 17.613, 65.0, 35.427
1900203, 2003-05-28 13:40:01.001930, -34.0870018005, -12.1160001755, 0, 17.587, 75.0, 35.424
1900203, 2003-05-28 13:40:01.001930, -34.0870018005, -12.1160001755, 0, 17.558, 85.0, 35.419
1900203, 2003-05-28 13:40:01.001930, -34.0870018005, -12.1160001755, 0, 17.494, 95.0, 35.41
1900203, 2003-05-28 13:40:01.001930, -34.0870018005, -12.1160001755, 0, 17.266, 105.0, 35.395
1900203, 2003-05-28 13:40:01.001930, -34.0870018005, -12.1160001755, 0, 16.344, 115.0, 35.392
1900203, 2003-05-28 13:40:01.001930, -34.0870018005, -12.1160001755, 0, 15.641, 125.0, 35.411
1900203, 2003-05-28 13:40:01.001930, -34.0870018005, -12.1160001755, 0, 15.379, 135.0, 35.401
1900203, 2003-05-28 13:40:01.001930, -34.0870018005, -12.1160001755, 0, 15.033, 145.0, 35.385
1900203, 2003-05-28 13:40:01.001930, -34.0870018005, -12.1160001755, 0, 14.785, 155.0, 35.367
1900203, 2003-05-28 13:40:01.001930, -34.0870018005, -12.1160001755, 0, 14.635, 165.0, 35.354
1900203, 2003-05-28 13:40:01.001930, -34.0870018005, -12.1160001755, 0, 14.436, 175.0, 35.334
1900203, 2003-05-28 13:40:01.001930, -34.0870018005, -12.1160001755, 0, 14.315, 185.0, 35.331
We plan to run this over all of the files; obviously that takes significantly more time, space, etc., but the processing will all be the same. A sketch of that scale-up is below.
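Here's a rough sketch of the scale-up (WriteAllCSVs and the one-CSV-per-float output layout are our own invention, untested at full scale):

import os

# Sketch: walk every data-center directory and convert each *_prof.nc
# profile file into its own CSV using WriteCSV from above.
def WriteAllCSVs(argo_root, out_dir):
    for dirpath, dirnames, filenames in os.walk(os.path.join(argo_root, 'dac')):
        for filename in filenames:
            if filename.endswith('_prof.nc'):
                floatid = filename.replace('_prof.nc', '')
                out = open(os.path.join(out_dir, floatid + '.csv'), 'w')
                WriteCSV(floatid, os.path.join(dirpath, filename), out)
                out.close()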
Here's a link to another notebook with some of my previous plotting and mapping efforts:
http://nbviewer.ipython.org/gist/cobbc12/ca505f0b7e9965ad166b
I tried to deal with outlier/nodata values (which may be worse than usual, since I was using the latest dataset), but I may not have done that appropriately; feedback on how to improve that would be very appreciated.
In general, feedback on whether we're approaching this the right way would be great. Here are a few more questions we have ...
For example, the following file is missing PSAL. We plan to simply mark the missing data as "N/A" and include the file even though it is incomplete. Is that the right thing to do?
f2 = netcdf.netcdf_file('argo/dac/aoml/13857/13857_prof.nc', 'r')
print sorted(f2.variables.keys())
['CALIBRATION_DATE', 'CYCLE_NUMBER', 'DATA_CENTRE', 'DATA_MODE', 'DATA_STATE_INDICATOR', 'DATA_TYPE', 'DATE_CREATION', 'DATE_UPDATE', 'DC_REFERENCE', 'DIRECTION', 'FORMAT_VERSION', 'HANDBOOK_VERSION', 'HISTORY_ACTION', 'HISTORY_DATE', 'HISTORY_INSTITUTION', 'HISTORY_PARAMETER', 'HISTORY_PREVIOUS_VALUE', 'HISTORY_QCTEST', 'HISTORY_REFERENCE', 'HISTORY_SOFTWARE', 'HISTORY_SOFTWARE_RELEASE', 'HISTORY_START_PRES', 'HISTORY_STEP', 'HISTORY_STOP_PRES', 'INST_REFERENCE', 'JULD', 'JULD_LOCATION', 'JULD_QC', 'LATITUDE', 'LONGITUDE', 'PARAMETER', 'PI_NAME', 'PLATFORM_NUMBER', 'POSITIONING_SYSTEM', 'POSITION_QC', 'PRES', 'PRES_ADJUSTED', 'PRES_ADJUSTED_ERROR', 'PRES_ADJUSTED_QC', 'PRES_QC', 'PROFILE_PRES_QC', 'PROFILE_TEMP_QC', 'PROJECT_NAME', 'REFERENCE_DATE_TIME', 'SCIENTIFIC_CALIB_COEFFICIENT', 'SCIENTIFIC_CALIB_COMMENT', 'SCIENTIFIC_CALIB_EQUATION', 'STATION_PARAMETERS', 'TEMP', 'TEMP_ADJUSTED', 'TEMP_ADJUSTED_ERROR', 'TEMP_ADJUSTED_QC', 'TEMP_QC', 'WMO_INST_TYPE']
f2.variables['PSAL']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-23-b3b5ae08696b> in <module>()
----> 1 f2.variables['PSAL']

KeyError: 'PSAL'
We're not sure why it's missing, or whether there's other data we should include in its place. If writing "N/A" is acceptable, a guard like the sketch below could substitute placeholders.
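This is our own hypothetical helper, and it assumes PRES is always present to borrow a shape from (which is itself worth checking):

# Sketch: return the named variable if present, otherwise 'N/A' placeholders
# with the same shape as a reference variable.
def GetVariableOrNA(netcdf_file, name, like):
    if name in netcdf_file.variables:
        return netcdf_file.variables[name]
    return [['N/A'] * len(row) for row in like]

salinities2 = GetVariableOrNA(f2, PSAL, f2.variables[PRES][:])
print salinities2[0][:3]  # ['N/A', 'N/A', 'N/A']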