First steps in Python3! This example has been run with the python.org Python v. 3.3.3 (installed from the .dmg), with numpy, matplotlib, GDAL, and IPython (and its dependencies for this notebook) installed into a virtual environment with pip.
First, let's check that we're really using Python 3:
import sys
print(sys.version)
3.3.3 (v3.3.3:c3896275c0f6, Nov 16 2013, 23:39:35) [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
Very well. So let's load the pylab libraries, set them up for plotting inline, import everything else we need, and off we go. The goal is to open a MODIS Level 1B file, that is, a file that contains satellite data that has been corrected for sensor calibration, but not re-mapped or gridded in any way.
This data comes from NASA in a format called HDF-EOS (http://hdfeos.org/), a hierachchical data format based on HDF4, which is maintained by the HDF Group. Theoretically, the pyhdf library would be suitable for interfacing with such a file, but it has the tendency to be fickle. The web page (http://pysclint.sourceforge.net/pyhdf/) does not give me confidence it has been ported to Python 3.
No matter, the version of GDAL I have installed has been compiled with HDF4 and HDF-EOS support. GDAL supports a huge number of data formats and therefore provides somewhat of a convoluted kitchen-sink of an interface. But it'll do.
%pylab inline
import os, os.path
from osgeo import gdal
Populating the interactive namespace from numpy and matplotlib
Here is the file we want to look at, and it comes with a separate geolocation file. We open both and receive GDAL Dataset objects back. The file name tells us it contains data at (nominally) 1 km resolution, and dates from Julian day 199 in 2004.
fn = "MOD021KM.A2004199.2140.005.2010140125955.hdf"
geofn = "MOD03.A2004199.2140.005.2010140193808.hdf"
dat = gdal.Open(fn)
geodat = gdal.Open(geofn)
print(type(dat))
<class 'osgeo.gdal.Dataset'>
Metadata can be accessed from the gdal.Dataset object, for example to confirm the date/time stamp of the data.
metadata = dat.GetMetadata_Dict()
print(metadata['RANGEBEGINNINGDATE'], metadata['RANGEBEGINNINGTIME'])
2004-07-17 21:40:00.000000
Indeed, the data file knows the name of its own georeference data file, though the metadata key naming isn't immediately intuitive. ANCILLARYINPUTPOINTER? I wouldn't have guessed. (There are other ways of getting the same data, in fact.)
dat.GetMetadata_Dict()['ANCILLARYINPUTTYPE'], dat.GetMetadata_Dict()['ANCILLARYINPUTPOINTER']
('Geolocation', 'MOD03.A2004199.2140.005.2010140193808.hdf')
HDF files are, as the name suggests, hierarchically organized. In the GDAL dataset API this means that a gdal.Dataset object contains subdatasets. For HDF-EOS files we stop at one level of subdatasetting: each subdatasets contains raster bands of the same dimension and data type. What subdatasets there are in a dataset, and what rasterbands in a subdataset, is laid down in NASA's data specification and user manuals for the data in question.
In the case of MODIS, we need to know this: MODIS acquires data in 36 spectral bands, 2 at a nominal resolution of 250 m, 5 at 500 m, and the remaining 19 at 1 km. The data file we have is the 1 km one. It also contains all the higher-resolution datasets, aggregated to 1 km. For reasons that will become clear, I'd like to pull out the subdatasets that were generated from the 250 and 500 m data. Luckily (either because I've looked through all the subdatasets or because I've read NASA's documentation) I know which ones these are. Similarly, I know where to find latitude and longitude arrays in the georeference data.
subdatasets = dat.GetSubDatasets()
geosubdatasets = geodat.GetSubDatasets()
subdatasets[4][1], subdatasets[7][1], geosubdatasets[8][1], geosubdatasets[9][1]
('[2x2030x1354] EV_250_Aggr1km_RefSB MODIS_SWATH_Type_L1B (16-bit unsigned integer)', '[5x2030x1354] EV_500_Aggr1km_RefSB MODIS_SWATH_Type_L1B (16-bit unsigned integer)', '[2030x1354] Latitude (32-bit floating-point)', '[2030x1354] Longitude (32-bit floating-point)')
There they are. And the subdataset descriptions contain additional information:
We open the subdatasets and see that each indeed contains the expected number of rasterbands.
highres = gdal.Open(subdatasets[4][0], gdal.GA_ReadOnly)
midres = gdal.Open(subdatasets[7][0], gdal.GA_ReadOnly)
latsds = gdal.Open(geosubdatasets[8][0], gdal.GA_ReadOnly)
lonsds = gdal.Open(geosubdatasets[9][0], gdal.GA_ReadOnly)
print(highres.RasterCount, midres.RasterCount, latsds.RasterCount, lonsds.RasterCount)
2 5 1 1
I would like to produce something close to a true color RGB image, so I'd like to access the bands closest to red, green and blue. These would be band 2 (the second from the 250 m subdataset), and bands 4 and 3 (the second and first from the 500 m data). Attention, trap: rasterband indexing in GDAL starts at 1, not 0.
red = highres.GetRasterBand(1)
green = midres.GetRasterBand(2)
blue = midres.GetRasterBand(1)
Finally, let's check that our three color RasterBand objects are of the same size. They are, because each has been aggregated to a resolution of 1 km (at nadir, which means looking straight down - it's actually more than 1 km at the left and right edges).
for color in [red, green, blue]:
print(color.XSize, color.YSize)
1354 2030 1354 2030 1354 2030
To plot the data as an image, we need to do two things:
For the first, in this demonstation, we're going to be unconcerned about memory and simply allocate a whole $N \times M \times 3$ array containing 8-bit integer data. Note that the XSize is the number of columns and the YSize the number of rows, so in the array declaration they have to be reversed (rows come first).
imgdata = numpy.zeros([blue.YSize, blue.XSize, 3], np.uint8)
imgdata.shape
(2030, 1354, 3)
In the second step, the maximum and minimum data value are calculated, any invalid data masked out, and the remaining data re-scaled to the interval between 0 and 255. This is done for each RGB band. Note that we get the data as a numpy array from the rasterband object via color.ReadAsArray()
.
for idx, color in enumerate([red, green, blue]):
min, max = color.ComputeRasterMinMax()
data = np.ma.masked_greater_equal(color.ReadAsArray(), max)
imgdata[:, :, idx] = np.multiply(255 / (max - min), data - min).astype(np.uint8)
The result isn't very "true color", simply because of the high dynamic range of the image in particular the cloud and ice areas, which are much brighter than the land.
fig = plt.figure(figsize=(18, 27))
plt.imshow(imgadata)
<matplotlib.image.AxesImage at 0x1015e80d0>
But there is some sea ice visible at the top edge of the image! We can zoom in on that.
fig = plt.figure(figsize=(18, 27))
imgplt = plt.imshow(imgdata[0:300, 200:1200, :])
Looking at the latitude and longitude values associated with the image, it seems that the ice is somewhere in the western Beauford sea. For a single band, it is easy to plot the scene on a rough lat/long coordinate system.
lat = latsds.GetRasterBand(1).ReadAsArray()
lon = lonsds.GetRasterBand(1).ReadAsArray()
print(np.min(lat), np.max(lat))
51.8005 74.916
fig = plt.figure(figsize=(18, 12))
plt.pcolormesh(lon, lat, imgdata[..., 2], cmap='bone')
<matplotlib.collections.QuadMesh at 0x10255a9d0>
To do the same with an RGB image, or on a proper map, we would need to do more work. Other things that should be done are to improve the brightness and contrast, plot coast line overlays and such.
As this tutorial shows, the basic scientific toolset is available and functional in Python 3. The data file that was intentionally chosen for this demonstration is something slightly more complicated than, say, a well-behaved 8-bit RGB GeoTIFF image, which could literally have been displayed in three lines of code using matplotlib alone.
Not quite all is fully up and running, though:
ValueError: level must be >= 0
. From a few similar posts related to other libraries with Python 3.3.3 (https://code.google.com/p/pysam/issues/detail?id=127) it looks like this is a know issue. My basemap version isn't quite the latest (it's a 120 MB download, so I keep a tarball around), so maybe upgrading will resolve it.But things look good!
Most importantly, coding in Python 3 doesn't feel significantly different from Python 2. Yeah, I forgot the parentheses after "print" a few times. But with from __future__ import print_function
, I believe all the above code would run unchanged under Python 2.7. It's still the same language to me.