This section of the notes is intended for follow-up after the class and future reference. We will not go through this in detail in the teaching sessions.
This material and the exercises are more advanced than in the main notes. You are not required to do these, but may like to do so if you want to stretch yourself.
Quite often (data masks in data products, being the most typical example), you will need to deal with binary numbers and binary operations, so we will introduce the concepts you need here.
We came across logical or Boolean operations in the main session (with the data type bool
). In logic, something can be True
or False
, and operations such as not
, and
and or
have quite obvious meanings. More generally in logic (and electronics) we may come across other logical operators such as nand
(not and
) and nor
(not or
) and xor
(exclusive or
: either but not both), but these are not defined in Python (they can of course be derived though).
You should first make sure that you understand the results of the and
, or
and not
bool
operators:
print 'False and False =',False and False
print 'False and True =',False and True
print 'True and False =',True and False
print 'True and True =',True and True
False and False = False False and True = False True and False = False True and True = True
print 'False or False =',False or False
print 'False or True =',False or True
print 'True or False =',True or False
print 'True or True =',True or True
False or False = False False or True = True True or False = True True or True = True
print 'not False =',not False
print 'not True =',not True
not False = True not True = False
As an exercise for this, you could see if you can simulate the logical combinations xor
, nor
and nand
, e.g.:
# nand test:
# see http://en.wikipedia.org/wiki/Logical_NAND
# (A nand B) is not (A and B)
ABList = [(False,False),(False,True),(True,False),(True,True)]
for A,B in ABList:
print '%s nand %s ='%(str(A),str(B)),not (A and B)
False nand False = True False nand True = True True nand False = True True nand True = False
It is of value to have some understanding of binary operations and representation(s):
There are two main number representations used in computing, which depend on the interpretation of the MSB, Most Significant Bit or 'high bit order' (or Byte) and LSB, Least Significant Bit or 'low bit order' (or Byte).
Which system is used is sometimes refered to as 'endianness' so we may refer to a 'big-endian' or 'little-endian' representation. In a big-endian representation, the left-most byte represents the highest number. In a little-endian system, this is the lowest number.
It is probably easiest to understand this with decimal numbers:
So, in a big-endian decimal representation:
152
represents one hundred and fifty two ((1 x 10^2) + (5 x 10^1) + (2 x 10^0)
)
In a little-endian system, this is interpreted the other way around, so 152
is (2 x 10^2) + (5 x 10^1) + (1 x 10^0)
, so is actually two hundred and fifty one.
The term comes from Jonathan Swift's Gulliver's travels, if you are interested, refering to the which end of an egg the people of Lilliput and Blefuscu believe you should open:
"(The people of Lilliput and Blefuscu have) been engaged in a most obstinate war for six-and-thirty moons past. It began upon the following occasion. It is allowed on all hands, that the primitive way of breaking eggs, before we eat them, was upon the larger end; but his present majesty’s grandfather, while he was a boy, going to eat an egg, and breaking it according to the ancient practice, happened to cut one of his fingers. Whereupon the emperor his father published an edict, commanding all his subjects, upon great penalties, to break the smaller end of their eggs. The people so highly resented this law, that our histories tell us, there have been six rebellions raised on that account; wherein one emperor lost his life, and another his crown. These civil commotions were constantly fomented by the monarchs of Blefuscu; and when they were quelled, the exiles always fled for refuge to that empire. It is computed that eleven thousand persons have at several times suffered death, rather than submit to break their eggs at the smaller end. Many hundred large volumes have been published upon this controversy: but the books of the Big-endians have been long forbidden, and the whole party rendered incapable by law of holding employments."
Endianness then (like which end of an egg you should break), is ultimately arbitrary, but we must still have conventions so that information on one computer system can be interpreted by another.
At present, probably most computers you use will be little-endian (the 'Intel convention' as it has come to be known), but you may also come across data stored in big-endian (so-called 'Motorola convention') format.
Endianness is mainly implemented at the byte level, so if you are considering a signle byte, which 'end you open the egg' has no consequence.
For longer number representations though (e.g. a 2-byte or 4-byte integer) it can have significant consequences, and you need to be aware of the endianness of data that you might try to be reading. Many file formats will explicitly store the endinaness that data were written in, so a correct interpretation will be handled by low level reading routines. But, if you get binary data that are of more than one byte that look rather odd when you display them, one possible reason is that you have assumed a different endinaness to what the data were written in.
When we are working with binary numbers for a single bit, we represent False
by 0
and True
by 1
.
In Python, these representations can essentially be used interchangably with logical operators, (but we use the function bool()
to convert to a boolean representation)
print True or False
True
print 1 or 0
1
print bool(1 or 0)
True
whatTheySay = True
if whatTheySay:
print "it's true what they say ..."
else:
print "it's not true what they say ..."
it's true what they say ...
whatTheySay = 1
if whatTheySay:
print "it's true what they say ..."
else:
print "it's not true what they say ..."
it's true what they say ...
A binary representation of a number is a representation in base 2, where we may use multiple bits to represent numbers.
Some number representations (such as floating point) are stored in a more complicated manner, but ASCII codes for string representation and integers are in a more straightforward binary format.
So, e.g. the (integer) decimal number 3 is 11
in binary (base 2):
3 == (1 * 2**1) + \
(1 * 2**0)
True
We can use the Python function bin()
to get this more directly:
print bin(3)
0b11
print 0b11
3
Similarly, the decimal number 101
is 1100101
in binary:
101 == (1 * 2**6) + \
(1 * 2**5) + \
(0 * 2**4) + \
(0 * 2**3) + \
(1 * 2**2) + \
(0 * 2**1) + \
(1 * 2**0)
True
print bin(101)
0b1100101
a. Work out what the following decimal numbers are in binary:
7
493
127
255
1024
and check your result using the approach we took to confirming 101
:
101 == (1 * 2**6) + \
(1 * 2**5) + \
(0 * 2**4) + \
(0 * 2**3) + \
(1 * 2**2) + \
(0 * 2**1) + \
(1 * 2**0)
b. How many bits are needed to represent each of these numbers?
c. What is the largest number you could represent in: (i) a 32 bit representation; (b) a 64 bit representation?
d. Recalling that there are 8 bits in a byte, what is the largest number you could represent in: (a) a single byte; (b) two bytes?
'''
Explorations in binary!
We can represent:
binary numbers, e.g. 0b010101
octal numbers, e.g. 0o755
hexadecimal, e.g. 0x2A2FF
but if we print these,
by default they are printed as the decimal equivalents.
We can convert to binary, octal or hex string with
bin(), oct(), hex()
'''
x = 0b111101101
print bin(x),'is',x,'in decimal'
print bin(x),'is',oct(x),'in octal'
print bin(x),'is',hex(x),'in hexadecimal'
0b111101101 is 493 in decimal 0b111101101 is 0755 in octal 0b111101101 is 0x1ed in hexadecimal
In Python (and most other computer languages) you have access to bitwise operators. As you might expect, these are operators that are executed on individual bits in a binary representation of a number.
The bitwise operators available in Python are:
&
bitwise and|
bitwise or^
bitwise xor
(exclusive or)~
bitwise ones complement<<
bitwise left shift>>
bitwise right shiftThe &
operator simply performs a logical and
operation on two sets of binary representations, so:
1 & 0 == 0
is the same as a logical True and False
operation that we saw above .
The |
operator simply performs a logical or
operation on two sets of binary representations, so:
1 | 0 == 1
is the same as a logical True or False
operation that we saw above.
Similarly,
1 ^ 0 == 1
1 ^ 1 == 0
and
~1 == 0
~0 == 1
These same rules apply then to all bit fields:
~1010 == 0101
etc.
The shift operators are interesting:
left shift by 1, for example is equivalent to multiplying by 2, and right shift by 1 a division by 2.
They are also very useful in sorting out 'bit masks' for data products.
'''
Bitwise operators:
'''
A = 521
B = 523
# print as binary:
print A,'\tA:\t',bin(A)
print B,'\tB:\t',bin(B)
# some operations
print '\tA | B:\t',bin(A|B),'\t',A|B
print '\tA ^ B:\t',bin(A^B),'\t\t',A^B
print '\tA & B:\t',bin(A&B),'\t',A&B
521 A: 0b1000001001 523 B: 0b1000001011 A | B: 0b1000001011 523 A ^ B: 0b10 2 A & B: 0b1000001001 521
'''
Bitwise shift operators:
'''
A = 531
print '\tA:\t',bin(A),'\t',A
print '\tA>>1:\t',bin(A>>1),'\t',A>>1
print '\tA>>1:\t',bin(A<<1),'\t',A<<1
print '\tA>>3:\t',bin(A>>3),'\t',A>>3
print '\tA>>3:\t',bin(A<<3),'\t',A<<3
A: 0b1000010011 531 A>>1: 0b100001001 265 A>>1: 0b10000100110 1062 A>>3: 0b1000010 66 A>>3: 0b1000010011000 4248
As an example of data masking, consider the QA mask in the MODIS Leaf Area Index (LAI) product:
Bit number | Parameter Name | Bit combination | Interpretation |
0 | MODLAND_QC bits | 0 | Good quality (main algorithm with or without saturation) |
1 | Other Quality (back-up algorithm or fill values) | ||
1 | Sensor | 0 | Terra |
1 | Aqua | ||
2 | DeadDetector | 0 | Detectors apparently fine for up to 50% of channels |
1 | Dead detectors caused >50% adjacent detector retrieval | ||
3-4 | CloudState | 00 | Significant clouds NOT present (clear) |
01 | Significant clouds WERE present | ||
10 | Mixed cloud present on pixel | ||
11 | Cloud state not defined (assumed clear) | ||
5-7 | CF_QC | 000 | Main (RT) method used (best result possible (no saturation)) |
001 | Main (RT) method used with saturation. (usable) | ||
010 | Main (RT) method failed due to bad geometry (empirical algorithm used) | ||
010 | Main (RT) method failed due to problems other than geometry (empirical algorithm used) | ||
010 | Pixel not produced at all. |
So, the MODIS LAI QA information is contained in one byte (8 bits), which encodes data about five different categories of QA information.
When using such data, we need to make choices about what quality of data we are willing to accept.
How can we use our understanding of bitwise operations to pull out these datasets?
First, we can try to pull out the different categories.
One way to do this is using a set of 'bit masks' for each of the bit fields we want to cover.
If then perform a bitwise and operation with these, only relevant bits will 'show' through.
# an example QA value: fill all fields
qa = 0b11111111
print 'qa',qa
print 'binary qa',bin(qa)
# set up masks for the different parameters
mask1 = 0b00000001 # bit 0
mask2 = 0b00000010 # bit 1
mask3 = 0b00000100 # bit 2
mask4 = 0b00011000 # bit 3-4
mask5 = 0b11100000 # bit 5-7
qa1 = qa & mask1
qa2 = qa & mask2
qa3 = qa & mask3
qa4 = qa & mask4
qa5 = qa & mask5
print 'qa1',bin(qa1)
print 'qa2',bin(qa2)
print 'qa3',bin(qa3)
print 'qa4',bin(qa4)
print 'qa5',bin(qa5)
qa 255 binary qa 0b11111111 qa1 0b1 qa2 0b10 qa3 0b100 qa4 0b11000 qa5 0b11100000
We gave an example qa value with 1
in all bit fields, and can see that this information has been passed through above, but we still need to remove the trailing zeros to the right of the bit field.
We can easily do this with a right shift operator:
qa1 = (qa & 0b00000001) >> 0 # bit 0
qa2 = (qa & 0b00000010) >> 1 # bit 1
qa3 = (qa & 0b00000100) >> 2 # bit 2
qa4 = (qa & 0b00011000) >> 3 # bit 3-4
qa5 = (qa & 0b11100000) >> 5 # bit 5-7
print 'qa1',bin(qa1)
print 'qa2',bin(qa2)
print 'qa3',bin(qa3)
print 'qa4',bin(qa4)
print 'qa5',bin(qa5)
qa1 0b1 qa2 0b1 qa3 0b1 qa4 0b11 qa5 0b111
And now we see that we have just the information we required.
There are many ways to achieve the same result in filtering such QA masks, but it is quite easy to make a mistake in doing so, so the clearest way to do this is generally as above:
1
in the required fieldsmask4 = 0b00011000 # bit 3-4
mask1 = 0b00000001 # bit 0
&
) between the data and the bit mask(qa & 0b00011000) >> 3 # bit 3-4
It is always a good idea to test the application of such masks by setting 1
in all fields, as you can easily see if you get the result you should be expecting.
Develop some code to repeat the QA bit masking done above, but make the code generate the bit masks itself from knowledge of the first and last of the bit fields you require (assuming they are sequential).
If possible, do this in a function.
Demonstrate its operation with several example bit masks.
In Python, the mechanism for trapping errors (i.e. when something goes wrong in running a block of code) is:
try:
...
except:
...
e.g.:
a = 1
b = 0
print a/b
--------------------------------------------------------------------------- ZeroDivisionError Traceback (most recent call last) <ipython-input-20-e0a9b2305aea> in <module>() 1 a = 1 2 b = 0 ----> 3 print a/b ZeroDivisionError: integer division or modulo by zero
a = 1
b = 0
'''
What do we want to happen if we try to divide by zero?
In this case, we decide we want to set the result to zero
'''
try:
print a/b
except ZeroDivisionError:
print 0
0
Note the indentation in the code. This is what defines the heirarchy of conditional statements in Python (rather than some explicit 'start' and 'end' statements, as in some other languages).
We can trap explicit types of error, e.g.:
x = 100.
del x
try:
print x
except NameError:
print "name 'x' is not defined"
name 'x' is not defined
or we can be non-specific and trap all errors in a code block:
try:
print '2' + 2 # adding a string to integer doesn't make sense
except:
print 'you made an error'
you made an error
You can trap multiple forms of error:
import sys
filename = 'files/data/progress.dat'
try:
f = open(filename)
s = f.readlines()
print s[0]
print s[0]/10
except IOError as e:
# trap a specific IOError
print "I/O error reading from %s"%(filename)
except:
# trap any further errors
print "Error in processing data from file %s"%filename
AS I walk’d through the wilderness of this world, Error in processing data from file files/data/progress.dat
If you want to know more on this subject, you can follow this up with a more detailed tutorial.
# We can use separators other than whitespace with split:
data = "1964,1220\n1974,2470\n1984,2706\n1994,4812\n2004,2707"
print "data.split('\\n'):\n\t",data.split('\n')
# note '\\n' in here as '\n' would have printed a newline
# \ is an escape character (as in unix)
data.split('\n'): ['1964,1220', '1974,2470', '1984,2706', '1994,4812', '2004,2707']
# using a generator expressions
fdata1 = (i.split() for i in data.split('\n'))
print list(fdata1)
[['1964,1220'], ['1974,2470'], ['1984,2706'], ['1994,4812'], ['2004,2707']]
# two listcomps to split first
# on \n and then each of these on comma
fdata1 = [i.split() for i in data.split('\n')]
fdata2 = [j[0].split(',') for j in fdata1]
print fdata2
[['1964', '1220'], ['1974', '2470'], ['1984', '2706'], ['1994', '4812'], ['2004', '2707']]
fdata1 = [i.split() for i in data.split('\n')]
fdata2 = [j[0].split(',') for j in fdata1]
fdata = [[int(l) for l in k] for k in fdata2]
print fdata
[[1964, 1220], [1974, 2470], [1984, 2706], [1994, 4812], [2004, 2707]]
# or all at once, which works
# but now the meaning of code is too obscure
# so this is poor style
fdata = [[int(l) for l in k] for k in [j[0].split(',') \
for j in [i.split() for i in data.split('\n')]]]
print fdata
[[1964, 1220], [1974, 2470], [1984, 2706], [1994, 4812], [2004, 2707]]
# It might be comprehensible if laid out better
fdata = [[int(l) \
for l in k] \
for k in [j[0].split(',') \
for j in [i.split() \
for i in data.split('\n')]]]
# even so you might find this confusing
# and prefer to do it is smaller chunks
print fdata
[[1964, 1220], [1974, 2470], [1984, 2706], [1994, 4812], [2004, 2707]]
You can control spaces, and precision e.g.:
print "%20s %s"%('hello','world')
print "%-20s %s"%('hello','world')
hello world hello world
pi = 3.1415926536
for format in ["%3.2f","%3.20f","%9.5f","%020.3f"]:
print format,'\n\t:',format%pi
%3.2f : 3.14 %3.20f : 3.14159265360000006240 %9.5f : 3.14159 %020.3f : 0000000000000003.142
pi = 3.1415926536
for format in ["%3.2e","%3.20e","%9.5e","%020.3e"]:
print format,'\n\t:',format%pi
%3.2e : 3.14e+00 %3.20e : 3.14159265360000006240e+00 %9.5e : 3.14159e+00 %020.3e : 000000000003.142e+00
i = 100
for format in ["%i","%3i","%5i","%08i"]:
print format,'\n\t:',format%i
%i : 100 %3i : 100 %5i : 100 %08i : 00000100
There are other (and in some ways more elegant) ways to format strings in Python. e.g:
print "Hey {1}, Hello {0}".format('world','you')
print '{0}{1}{0}'.format('abra', 'cad')
# date formatting is a particular example
from datetime import datetime
print """
At the third stroke, it will be
{:%Y-%m-%d %H:%M:%S}
Beep
Beep
Beep
""".format(datetime.now(),'beep')
Hey you, Hello world abracadabra At the third stroke, it will be 2014-09-30 11:11:37 Beep Beep Beep
There are also additional formatting codes, including:
# character
print "char: %c %c %c"%(100,101,102)
# octal
print "octal: %o"%(0o755)
# hex
print "hex: %X"%(0xFFF)
print "hex: %x"%(0xFFF)
char: d e f octal: 755 hex: FFF hex: fff
We noted earlier that in parsing a filename, you might want to split it by the file separator:
# example, with directory names
# glob unix style pattern matching for files and directories
import glob
# returns a list (or [] if empty)
# to match the pattern given
file_list = glob.glob("files/data/*.txt")
print "file_list:\n\t",file_list
# e.g. the first string in the list
this_file = file_list[0]
# split the filename on the field '/'
print "\nthis_file.split('/'):\n\t",this_file.split('/')
# so the filename is just the last element in this list
print "\nthis_file.split('/')[-1]:\n\t",this_file.split('/')[-1]
file_list: ['files/data/HadSEEP_monthly_qc.txt', 'files/data/heathrowdata.txt', 'files/data/modis_files.txt', 'files/data/modis_files2a.txt', 'files/data/modis_files2b.txt', 'files/data/some_modis_files.txt'] this_file.split('/'): ['files', 'data', 'HadSEEP_monthly_qc.txt'] this_file.split('/')[-1]: HadSEEP_monthly_qc.txt
That's fine if you are on a unix system, but on a windows system, the file separator is the other way around, so this code is not portable.
To make aspects of your code like this portable, we can use features from Python that will tend to be in the module os
:
from os import sep # import the string sep from the module os
print "On this system, the file separator is",sep
On this system, the file separator is /
So, better code than the above is:
# example, with directory names
# glob unix style pattern matching for files and directories
import glob
from os import sep
# returns a list (or [] if empty)
# to match the pattern given
file_list = glob.glob("files{0}data{0}*.txt".format(sep))
print "file_list:\n\t",file_list
# e.g. the first string in the list
this_file = file_list[0]
# split the filename on the field '/' or '\' (sep)
print "\nthis_file.split({0}):\n\t".format(sep),\
this_file.split(sep)
# so the filename is just the last element in this list
print "\nthis_file.split({0})[-1]:\n\t".format(sep),\
this_file.split(sep)[-1]
file_list: ['files/data/HadSEEP_monthly_qc.txt', 'files/data/heathrowdata.txt', 'files/data/modis_files.txt', 'files/data/modis_files2a.txt', 'files/data/modis_files2b.txt', 'files/data/some_modis_files.txt'] this_file.split(/): ['files', 'data', 'HadSEEP_monthly_qc.txt'] this_file.split(/)[-1]: HadSEEP_monthly_qc.txt
There will always be a compromise between portability and readability of the code, so try to make sure that any portability modifications you make don't obscure the meaniong of the code.
Most of what you will want (especially to begin with) will be contained in your current Python distribution.
Sometimes though, you might need to install some new module.
Generally, the easiest way to do this is to use easy_install
.
We will use this to install a package pyephem
into your user area:
at a unix prompt, type:
easy_install --user pyephem
If all goes well, the text that comes up at the terminal should tell you that this has installed (e.g. in /home/plewis/.local/lib/python2.7/site-packages/pyephem-3.7.5.1-py2.7-linux-x86_64.egg
).
We can test to see if we can load this package:
import ephem
and then do some interesting things with it.
The following piece of code might seem a little complicated right now, but it just uses
We will need to import
some other things from a few other packages as well:
from ephem import Sun, Observer
from math import pi
from datetime import date,datetime,time
# radians to degrees
rtod = 180./pi
# observer information in pyephem
obs = Observer()
today = date.today()
# or datetime.date(2013, 3, 12)
# put in lat / lon for UCL
# https://www.google.co.uk/search?q=ucl+longitude+latitude
obs.lat = '51.5248'
obs.long = '-0.1336'
for hour in xrange(0,24):
for minute in xrange(0,60,30):
t = time(hour, minute, 0)
obs.date = datetime.combine(today, t)
sun = Sun(obs)
zenith_elevation = float(sun.alt)*rtod # degrees
print obs.date,zenith_elevation
2014/9/30 00:00:00 -41.1227180114 2014/9/30 00:30:00 -40.4796625312 2014/9/30 00:59:59 -39.0649015428 2014/9/30 01:30:00 -36.9463783528 2014/9/30 02:00:00 -34.2136162508 2014/9/30 02:30:00 -30.9643006022 2014/9/30 03:00:00 -27.2941667536 2014/9/30 03:30:00 -23.2912319377 2014/9/30 03:59:59 -19.0337171884 2014/9/30 04:30:00 -14.5903017653 2014/9/30 05:00:00 -10.0216590927 2014/9/30 05:30:00 -5.01932715546 2014/9/30 06:00:00 -0.134849371939 2014/9/30 06:30:00 4.09286322097 2014/9/30 06:59:59 8.55453445901 2014/9/30 07:30:00 12.938958392 2014/9/30 08:00:00 17.1514399965 2014/9/30 08:30:00 21.1181133381 2014/9/30 09:00:00 24.7672136192 2014/9/30 09:30:00 28.020436136 2014/9/30 09:59:59 30.7949255712 2014/9/30 10:30:00 33.0070257679 2014/9/30 11:00:00 34.579161145 2014/9/30 11:30:00 35.4491052641 2014/9/30 12:00:00 35.5794764999 2014/9/30 12:30:00 34.9643838142 2014/9/30 12:59:59 33.6305127561 2014/9/30 13:30:00 31.6322282163 2014/9/30 14:00:00 29.0427156717 2014/9/30 14:30:00 25.9443096378 2014/9/30 15:00:00 22.4206252902 2014/9/30 15:30:00 18.5516782953 2014/9/30 15:59:59 14.4111578562 2014/9/30 16:30:00 10.0714597095 2014/9/30 17:00:00 5.61372180666 2014/9/30 17:30:00 1.2203105187 2014/9/30 18:00:00 -2.38322183516 2014/9/30 18:30:00 -8.47981048194 2014/9/30 18:59:59 -13.0951972617 2014/9/30 19:30:00 -17.6077341991 2014/9/30 20:00:00 -21.9594218892 2014/9/30 20:30:00 -26.0849005441 2014/9/30 21:00:00 -29.9096032812 2014/9/30 21:30:00 -33.3488186792 2014/9/30 21:59:59 -36.3086640806 2014/9/30 22:30:00 -38.6902793272 2014/9/30 23:00:00 -40.3983491291 2014/9/30 23:30:00 -41.3537969714
Let's write this out to a file now:
from ephem import Sun, Observer
from math import pi
from datetime import date,datetime,time
filename = 'files/data/elevation.dat'
try:
fp = open(filename,"w")
except:
print "Failed to open file %s for writing"%filename
# radians to degrees
rtod = 180./pi
# observer information in pyephem
obs = Observer()
today = date.today()
# or datetime.date(2013, 3, 12)
# put in lat / lon for UCL
# https://www.google.co.uk/search?q=ucl+longitude+latitude
obs.lat = '51.5248'
obs.long = '-0.1336'
for hour in xrange(0,24):
for minute in xrange(0,60,30):
t = time(hour, minute, 0)
obs.date = datetime.combine(today, t)
sun = Sun(obs)
elevation = float(sun.alt)*rtod # degrees
fp.write("%s %s\n"%(obs.date,elevation))
fp.close()
# just for interest, let's plot this:
import pylab as plt
%pylab inline
filename = 'files/data/elevation.dat'
fp = open(filename,"r")
hours = []
zeniths = []
for i in fp.readlines():
data = i.split()
time = [float(i) for i in data[1].split(':')]
hour = time[0] + time[1]/60. + time[2]/(60.*60)
zenith = 90. - float(data[2])
if zenith <= 90.:
hours.append(hour)
zeniths.append(zenith)
fp.close()
# plotting
fig = plt.figure(figsize=(14, 6))
plt.plot(hours,zeniths)
plt.xlabel('hour')
plt.ylabel('solar zenith / degrees')
<matplotlib.text.Text at 0x106574710>
In the main exercises, you have been advised to download a file and read it using the normal file open
commands.
If a file is available as a URL, you can conveniently use the urllib2
module in Python to access the file directly.
So, in the case of the precipitation data in an earlier exercise, we could hhave directly read the data with:
import urllib2
#as files
url = "http://www.metoffice.gov.uk/hadobs/hadukp/data/monthly/HadSEEP_monthly_qc.txt"
req = urllib2.Request ( url )
raw_data = urllib2.urlopen(req).readlines()
print raw_data[:6]
['Monthly Southeast England precipitation (mm). Daily automated values used after 1996.\n', 'Wigley & Jones (J.Climatol.,1987), Gregory et al. (Int.J.Clim.,1991)\n', 'Jones & Conway (Int.J.Climatol.,1997), Alexander & Jones (ASL,2001). Values may change after QC.\n', 'YEAR JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC ANN\n', ' 1873 87.1 50.4 52.9 19.9 41.1 63.6 53.2 56.4 62.0 86.0 59.4 15.7 647.7\n', ' 1874 46.8 44.9 15.8 48.4 24.1 49.9 28.3 43.6 79.4 96.1 63.9 52.3 593.5\n']
They are particularly useful when you want to build clear data structures, e.g. in loading a dataset such as files/data/HadSEEP_monthly_qc.txt
.
First, let's read the data from this file into the lists label
and rdata
, where rdata[0]
will be the first column in the file, rdata[1]
the second column etc.
fp = open('files/data/HadSEEP_monthly_qc.txt','r')
sdata = fp.readlines()
fp.close()
labels = sdata[3].split()
data = [i.split() for i in sdata[4:]]
j = 0
rdata = [[float(data[i][j]) \
for i in xrange(len(data))]\
for j in xrange(len(data[0]))]
print rdata[0]
[1873.0, 1874.0, 1875.0, 1876.0, 1877.0, 1878.0, 1879.0, 1880.0, 1881.0, 1882.0, 1883.0, 1884.0, 1885.0, 1886.0, 1887.0, 1888.0, 1889.0, 1890.0, 1891.0, 1892.0, 1893.0, 1894.0, 1895.0, 1896.0, 1897.0, 1898.0, 1899.0, 1900.0, 1901.0, 1902.0, 1903.0, 1904.0, 1905.0, 1906.0, 1907.0, 1908.0, 1909.0, 1910.0, 1911.0, 1912.0, 1913.0, 1914.0, 1915.0, 1916.0, 1917.0, 1918.0, 1919.0, 1920.0, 1921.0, 1922.0, 1923.0, 1924.0, 1925.0, 1926.0, 1927.0, 1928.0, 1929.0, 1930.0, 1931.0, 1932.0, 1933.0, 1934.0, 1935.0, 1936.0, 1937.0, 1938.0, 1939.0, 1940.0, 1941.0, 1942.0, 1943.0, 1944.0, 1945.0, 1946.0, 1947.0, 1948.0, 1949.0, 1950.0, 1951.0, 1952.0, 1953.0, 1954.0, 1955.0, 1956.0, 1957.0, 1958.0, 1959.0, 1960.0, 1961.0, 1962.0, 1963.0, 1964.0, 1965.0, 1966.0, 1967.0, 1968.0, 1969.0, 1970.0, 1971.0, 1972.0, 1973.0, 1974.0, 1975.0, 1976.0, 1977.0, 1978.0, 1979.0, 1980.0, 1981.0, 1982.0, 1983.0, 1984.0, 1985.0, 1986.0, 1987.0, 1988.0, 1989.0, 1990.0, 1991.0, 1992.0, 1993.0, 1994.0, 1995.0, 1996.0, 1997.0, 1998.0, 1999.0, 2000.0, 2001.0, 2002.0, 2003.0, 2004.0, 2005.0, 2006.0, 2007.0, 2008.0, 2009.0, 2010.0, 2011.0, 2012.0, 2013.0]
We can of course just remember that the first column is 'year', the second column data for 'JAN' etc., but we would have a much better organised dataset if we put it in a dictionary.
One way to do this would be:
# set up a
dataset = {}
for i,k in enumerate(labels):
dataset[k] = rdata[i]
and we could now refer to the 'year' data as:
print dataset['YEAR']
[1873.0, 1874.0, 1875.0, 1876.0, 1877.0, 1878.0, 1879.0, 1880.0, 1881.0, 1882.0, 1883.0, 1884.0, 1885.0, 1886.0, 1887.0, 1888.0, 1889.0, 1890.0, 1891.0, 1892.0, 1893.0, 1894.0, 1895.0, 1896.0, 1897.0, 1898.0, 1899.0, 1900.0, 1901.0, 1902.0, 1903.0, 1904.0, 1905.0, 1906.0, 1907.0, 1908.0, 1909.0, 1910.0, 1911.0, 1912.0, 1913.0, 1914.0, 1915.0, 1916.0, 1917.0, 1918.0, 1919.0, 1920.0, 1921.0, 1922.0, 1923.0, 1924.0, 1925.0, 1926.0, 1927.0, 1928.0, 1929.0, 1930.0, 1931.0, 1932.0, 1933.0, 1934.0, 1935.0, 1936.0, 1937.0, 1938.0, 1939.0, 1940.0, 1941.0, 1942.0, 1943.0, 1944.0, 1945.0, 1946.0, 1947.0, 1948.0, 1949.0, 1950.0, 1951.0, 1952.0, 1953.0, 1954.0, 1955.0, 1956.0, 1957.0, 1958.0, 1959.0, 1960.0, 1961.0, 1962.0, 1963.0, 1964.0, 1965.0, 1966.0, 1967.0, 1968.0, 1969.0, 1970.0, 1971.0, 1972.0, 1973.0, 1974.0, 1975.0, 1976.0, 1977.0, 1978.0, 1979.0, 1980.0, 1981.0, 1982.0, 1983.0, 1984.0, 1985.0, 1986.0, 1987.0, 1988.0, 1989.0, 1990.0, 1991.0, 1992.0, 1993.0, 1994.0, 1995.0, 1996.0, 1997.0, 1998.0, 1999.0, 2000.0, 2001.0, 2002.0, 2003.0, 2004.0, 2005.0, 2006.0, 2007.0, 2008.0, 2009.0, 2010.0, 2011.0, 2012.0, 2013.0]
A simper and more Pythonic way is to use zip
:
dataset = dict(zip(labels,rdata))
print dataset['YEAR']
[1873.0, 1874.0, 1875.0, 1876.0, 1877.0, 1878.0, 1879.0, 1880.0, 1881.0, 1882.0, 1883.0, 1884.0, 1885.0, 1886.0, 1887.0, 1888.0, 1889.0, 1890.0, 1891.0, 1892.0, 1893.0, 1894.0, 1895.0, 1896.0, 1897.0, 1898.0, 1899.0, 1900.0, 1901.0, 1902.0, 1903.0, 1904.0, 1905.0, 1906.0, 1907.0, 1908.0, 1909.0, 1910.0, 1911.0, 1912.0, 1913.0, 1914.0, 1915.0, 1916.0, 1917.0, 1918.0, 1919.0, 1920.0, 1921.0, 1922.0, 1923.0, 1924.0, 1925.0, 1926.0, 1927.0, 1928.0, 1929.0, 1930.0, 1931.0, 1932.0, 1933.0, 1934.0, 1935.0, 1936.0, 1937.0, 1938.0, 1939.0, 1940.0, 1941.0, 1942.0, 1943.0, 1944.0, 1945.0, 1946.0, 1947.0, 1948.0, 1949.0, 1950.0, 1951.0, 1952.0, 1953.0, 1954.0, 1955.0, 1956.0, 1957.0, 1958.0, 1959.0, 1960.0, 1961.0, 1962.0, 1963.0, 1964.0, 1965.0, 1966.0, 1967.0, 1968.0, 1969.0, 1970.0, 1971.0, 1972.0, 1973.0, 1974.0, 1975.0, 1976.0, 1977.0, 1978.0, 1979.0, 1980.0, 1981.0, 1982.0, 1983.0, 1984.0, 1985.0, 1986.0, 1987.0, 1988.0, 1989.0, 1990.0, 1991.0, 1992.0, 1993.0, 1994.0, 1995.0, 1996.0, 1997.0, 1998.0, 1999.0, 2000.0, 2001.0, 2002.0, 2003.0, 2004.0, 2005.0, 2006.0, 2007.0, 2008.0, 2009.0, 2010.0, 2011.0, 2012.0, 2013.0]
So, e.g. plotting data sets would be clear using the dictionary:
import pylab as plt
%pylab inline
for i,JFM in enumerate([['JAN','FEB','MAR'],['APR','MAY','JUN'],\
['JUL','AUG','SEP'],['OCT','NOV','DEC']]):
plt.figure(i,figsize=(14, 6))
plt.xlabel('Year')
plt.ylabel('Precip. (mm)')
for month in JFM:
plt.ylim(0,250)
plt.xlim(dataset['YEAR'][0],dataset['YEAR'][-1])
plt.title(str(JFM))
plt.plot(dataset['YEAR'],dataset[month])
plt.figure(i+1,figsize=(14, 6))
plt.xlabel('Year')
plt.ylabel('Precip. (mm)')
plt.title('ANN')
plt.ylim(0,1100)
plt.xlim(dataset['YEAR'][0],dataset['YEAR'][-1])
plt.plot(dataset['YEAR'],dataset['ANN'])
[<matplotlib.lines.Line2D at 0x107593850>]
In this example, we want to read all data from the Heathrow Met data file files/data/heathrowdata.txt
and plot each data field.
In doing this, we want to store the data in a dictionary, as this is a convenient way of access the different data fields.
At the heart of this example is the statement
dict(zip(labels,rdata))
This is a very useful combination in Python as it allows us to generate a dictionary with the keys specified by labels
and the values rdata
.
A few things that make this a bit more complicated are:
rdata = [[data[j][i] for j in xrange(len(data))] for i in xrange(len(data[0]))]
#
symbols here and there and the fact that '---'
is used to indicate no data. Here, we replace these awkward fields with more convenient numerical values (that won't break the conversion to float
), or get rid of them as appropriate (using replace()
on the strings).-999
to indicate no data, and have to filter these values out before plotting. This is easy enough to do in a listcomp for the y
values by using a conditional statement:y = [i for i in items[label] if i != -999]
But for the x values, we have to filter these based on a test with the y
values ... and so we use a more complex statement:
x = [time[j] for j,i in enumerate(items[label]) if i != -999]
+
is a concatenation operator for lists. Instead, we have to loop over all of the items in the lists:time = [items['year'][i] + items['month'][i]/12. for i in xrange(len(items['year']))]
We will see next time that there are more convenient modules for doing this sort of thing, but it's of value to know how you can do it if you only have lists.
One final comment is that we have made extensive use of listcomps here, which make the code more compact but may be more difficult for people just starting coding to follow.
# Read files/data/heathrowdata.txt as above into data as float
# doing some clever things to deal with file markers
# i.e. where it says # Provisional or # we will not worry about
# that for the moment
# where it has fillers, ---, replace by -999
fp = open('files/data/heathrowdata.txt','r')
ldata = [i.replace('---','-999').replace('# Provisional','').replace('#','').split() \
for i in fp.readlines()[7:]]
data = [[float(k) for k in j] for j in ldata]
fp.close()
# The reading by line is a bit inconvenient, so rotate the
# dataset
rdata = [[data[j][i] for j in xrange(len(data))] for i in xrange(len(data[0]))]
# set some labels for each column
labels = ['year','month','tmax (degC)','tmin (degC)',\
'air frost (days)','rain (mm)','sun (hours)']
# generate a dictionary with the labels as keys
items = dict(zip(labels,rdata))
import pylab as plt
# its quite difficult to add arrays
# so we have to loop over each item
time = [items['year'][i] + items['month'][i]/12. for i in xrange(len(items['year']))]
for i,label in enumerate(labels[2:]):
plt.figure(i,figsize=(14, 6))
plt.xlabel('year')
plt.ylabel(label)
# filter out the -999 values
x = [time[j] for j,i in enumerate(items[label]) if i != -999]
y = [i for i in items[label] if i != -999]
plt.plot(x,y)
In the main class notes, we used expressions and loops of the form:
data = "1964 1220 1974 2470 1984 2706 1994 4812 2004 2707"
sdata = data.split()
for i in xrange(0,len(sdata),2):
print ' '.join(sdata[i:i+2])
1964 1220 1974 2470 1984 2706 1994 4812 2004 2707
to do repeated actions by looping the variable i
over some array of integers.
That is a perfectly fine way to pull these data out of a string, but it's not very 'Pythonic' (it does'nt make best use of some of the elegant features of this language).
Better in this sense is what are known as listcomps (list comprehensions).
With a listcomp, you define a list (enclosed in [
, ]
), with two or three terms. The first term is some function fn(item)
. The second is a for statement. The third, if present, is a conditional (if
) statement.
For example:
fdata = [int(s) for s in data.split()]
print fdata
[1964, 1220, 1974, 2470, 1984, 2706, 1994, 4812, 2004, 2707]
This generates a list. Within this list, we iterate over the loop for s in data.split()
, and enter the result of the function int(s)
.
So this example is directly equivalent to:
data = "1964 1220 1974 2470 1984 2706 1994 4812 2004 2707"
fdata = []
for s in data.split():
fdata.append(int(s))
print fdata
[1964, 1220, 1974, 2470, 1984, 2706, 1994, 4812, 2004, 2707]
You will very commonly use listcomps of this nature when performing some function over elements in a list where you have to perform the function on each element at a time.
data = "1964 1220 1974 2470 1984 2706 1994 4812 2004 2707"
fdata = [int(s) for s in data.split()]
# use slicing to separate the odd
# and even data
years = fdata[0::2]
emissions = fdata[1::2]
print years
print emissions
[1964, 1974, 1984, 1994, 2004] [1220, 2470, 2706, 4812, 2707]
Listcomps can be very convenient, as they are a compact way of specifying a loop (with a conditonal statement oif required).
Don't use listcomps if they obscure the meaning of what you are doing though.
A listcomp generates everything in the list, then returns the list.
Sometimes, you only need one element at a time (e.g. within a loop). In such cases, it is better to use generator expressions.
These look much like listcomps but use ()
rather than []
fdata = (int(s) for s in data.split())
print 'fdata is',fdata
for i in fdata:
print i
fdata is <generator object <genexpr> at 0x7fc81da021e0> 1964 1220 1974 2470 1984 2706 1994 4812 2004 2707
but only return one item at a time, on demand in the loop.