...as a simple introduction to Python and automating analyses.
Suppose we have a spreadsheet of bird sightings by day and location, and we want to analyze and summarize the data.
Let's start by making up some data of the form we want to analyze. In a real situation, this data would generally be taken from the real data -- but, since this is an entirely fictional example, here we're going to just construct some test data.
The most important thing about this test data is that it should be small but realistic. We'll talk about "realistic" later -- for now, small is the key thing, because it lets you think about what your code should be doing.
%%file birds.txt
robin,kentucky,may 23
robin,west virginia,may 24
seagull,maine,may 24
chickadee,kentucky,june 1
seagull,maine,june 1
Overwriting birds.txt
Let's start by parsing the file 'birds.txt' that we created above.
"Parsing" means "reading in data that's in some format". Here, we're looking at a file that is in a spreadsheet format, with each record separated by commas -- a "comma separated value" format file. Python has a built-in library, 'csv', that can read a wide variety of columnar formats.
@@docs for csv
# Let's start by reading in each line of the file, one by one.
fp = file('birds.txt', 'rb')
for line in fp:
print line
robin,kentucky,may 23 robin,west virginia,may 24 seagull,maine,may 24 chickadee,kentucky,june 1 seagull,maine,june 1
Note that each line has an extra carriage return or newline after it, because 'print' adds it. You can disable that by putting a ',' after 'line' in the print statement, which tells 'print' not to do that. (Try it out!)
There are a variety of ways to load this file into Python so that we can work with it. One way would be to use Python string functions to break each line up by commas, but this will not always work on complicated csv files. So instead let's use the csv module.
import csv
fp = file('birds.txt', 'rb')
reader = csv.reader(fp)
for bird, state, day in reader:
print bird, '/', state, '/', day
robin / kentucky / may 23 robin / west virginia / may 24 seagull / maine / may 24 chickadee / kentucky / june 1 seagull / maine / june 1
The next problem here is that we are not actually doing anything with this data other than printing it out!
Let's suppose we want to ask some questions -- like, "how many chickadees did we see"? That's pretty easy to do with the above code --
import csv
fp = file('birds.txt', 'rb')
reader = csv.reader(fp)
n = 0
for bird, state, day in reader:
if bird == 'chickadee':
n += 1
print 'we saw', n, 'chickadees'
we saw 1 chickadees
There are two different problems with this approach, though.
First, every time you want to count the number of times you see a particular bird, you have to do all the file parsing and for loop stuff enough. While inefficient, the inefficiency isn't that bad -- the real problem is that you're virtually guaranteed to screw up if you copy and paste the code multiple times. (The computing principle underlying all of this is "Don't Repeat Yourself", or "DRY".) So what we really want to do is store the results of reading in the file once, and then use those results again and again.
The second problem is that every time we want to count a new bird, we need to replace 'chickadee' in multiple places. We'll fix that in a moment.
# define an empty list
birdlist = []
fp = file('birds.txt', 'rb')
reader = csv.reader(fp)
n = 0
for bird, state, day in reader:
birdlist.append(bird)
print 'we saw', birdlist.count('chickadee'), 'chickadees'
we saw 1 chickadees
print 'we saw', birdlist.count('robin'), 'robins'
we saw 2 robins
OK, so here we're using a list to keep track of all of the birds and then we're just using a feature of lists in Python to ask for the counts of whatever bird we're interested in. Good stuff.
What next? Let's solve the next problem, which is that we kind of want to wrap all of this up nicely so that we can reuse the code generically -- without specifying either the specific filename in advance, or which bird we want to count.
# to do this, define functions
def read_birdlist(filename):
birdlist = []
fp = file('birds.txt', 'rb')
reader = csv.reader(fp)
n = 0
for bird, state, day in reader:
birdlist.append(bird)
return birdlist
def print_birdcount(birdlist, bird):
print 'we saw', birdlist.count(bird), 'of bird type', bird
birdlist = read_birdlist('birds.txt')
print_birdcount(birdlist, 'robin')
print_birdcount(birdlist, 'chickadee')
we saw 2 of bird type robin we saw 1 of bird type chickadee
A few things to note -- the functions are basically copy-paste from working code, just substituting variables in where we want to be generic. Also, we're naming the functions in nice understandable ways. And once you have them written, you can just leave 'em alone and use them again and again.
Notice that 'birdlist', above, doesn't contain all of the information in the 'birds.txt' file -- it just has the first column. Let's fix that.
def read_birdlist2(filename):
birdlist = []
fp = file(filename, 'rb')
reader = csv.reader(fp)
n = 0
for bird, state, day in reader:
birdlist.append((bird, state, day))
return birdlist
birdlist2 = read_birdlist2('birds.txt')
print birdlist2
[('robin', 'kentucky', 'may 23'), ('robin', 'west virginia', 'may 24'), ('seagull', 'maine', 'may 24'), ('chickadee', 'kentucky', 'june 1'), ('seagull', 'maine', 'june 1')]
Here what we are doing is storing multiple values in something called a 'tuple', which is basically just a comma-seperated set of values, in a list. This list is essentially identical to what we're reading out of the file; you can iterate over it like so:
for bird, state, day in birdlist2:
print bird, state, day
robin kentucky may 23 robin west virginia may 24 seagull maine may 24 chickadee kentucky june 1 seagull maine june 1
Our new problem, however, is that we cannot just use 'birdlist2.count(bird)' any more, because birdlist2 contains far more than birds -- it also contains the other information. So what do we do?
We rewrite print_birdcount appropriately, is what!
def print_birdcount2(birdlist2, bird_to_count):
n = 0
for bird, state, day in birdlist2:
if bird == bird_to_count:
n += 1
print 'we saw', n, 'of bird type', bird_to_count
print_birdcount2(birdlist2, 'robin')
we saw 2 of bird type robin
We can also write generic functions to sort things out by state, etc, by filtering the list on various rules. Let's grab all the birds seen in kentucky:
def get_entries_by_state(birdlist2, query_state):
newlist = []
for bird, state, day in birdlist2:
if state == query_state:
newlist.append((bird, state, day))
return newlist
get_entries_by_state(birdlist2, 'kentucky')
[('robin', 'kentucky', 'may 23'), ('chickadee', 'kentucky', 'june 1')]
...and, since we're getting the birds out in the same form as we're loading them in -- as lists of 3-tuples (bird, state, day), we can now apply 'print_birdcount2' as we wish:
kentucky_birds = get_entries_by_state(birdlist2, 'kentucky')
print_birdcount2(kentucky_birds, 'chickadee')
we saw 1 of bird type chickadee
You might well ask yourself, why on earth are we doing all of this to count a list of six bird entries!? Because now we can count much bigger lists, entirely automatically!
bigbirdlist = read_birdlist2('long-birds.csv')
print len(bigbirdlist)
print bigbirdlist[:10]
10000 [('albatross', 'maine', 'april 27'), ('stork', 'michigan', 'september 21'), ('heron', 'kentucky', 'september 22'), ('albatross', 'kansas', 'june 28'), ('common teal', 'kentucky', 'june 03'), ('long-tailed duck', 'maine', 'august 28'), ('vulture', 'kansas', 'september 25'), ('flamingo', 'kansas', 'may 16'), ('vulture', 'missouri', 'july 16'), ('albatross', 'maine', 'july 20')]
...and all of the functions etc above will work on this.
So far we've been counting specific birds. What if we just want to get the count for all birds rather than just one at a time??
We can do this easily like so:
birdlist2 = read_birdlist2('birds.txt')
for bird, state, day in birdlist2:
print_birdcount2(birdlist2, bird)
we saw 2 of bird type robin we saw 2 of bird type robin we saw 2 of bird type seagull we saw 1 of bird type chickadee we saw 2 of bird type seagull
But you probably see the hitch -- each bird count is printed as many times as it's in the list! So, now what?
We can get a unique list of birds by using a set, a type of Python data structure that collapses lists down into sets that contain at most one of anything:
bird_types = set()
for bird, state, day in birdlist2:
bird_types.add(bird)
print bird_types
set(['chickadee', 'seagull', 'robin'])
# now we can just look at the unique set of birds
for bird_type in bird_types:
print_birdcount2(birdlist2, bird_type)
we saw 1 of bird type chickadee we saw 2 of bird type seagull we saw 2 of bird type robin
This all does get a little unwieldy, though, when we're talking about thousands, hundreds of thousands, or millions of records. In that case we might just want to count all of the birds once, and then save the results. Is there a way to do that?
The standard way in Python to store this kind of association -- really, any kind of association -- is to use a dictonary, which is just a way of looking up values by looking at their keys. To see this in action, here's a simple example:
d = {}
d['robin'] = 1
d['seagull'] = 2
This now gives us the ability to retrieve the number of birds from 'd' by asking for the value associated with that birdname:
print d['robin']
1
But how do we construct such a dictionary? The simplest way (not necessarily the most efficient, but that's ok) is to just use our bits and bobs of code above, and copy paste.
def make_birddict(birdlist2):
# first, get the unique set of birds
bird_types = set()
for bird, state, day in birdlist2:
bird_types.add(bird)
# then, copy the code from print_birdcount2:
d = {}
for bird_type in bird_types:
n = 0
for bird, state, day in birdlist2:
if bird == bird_type:
n += 1
d[bird_type] = n
return d
# tada!
print make_birddict(birdlist2)
{'chickadee': 1, 'seagull': 2, 'robin': 2}
The above code is really inefficient, though. You may notice that it reads through all of the entries in birdlist2 once for each bird_type. Is there a better way to do it?
Yes -- you can update the values in the dictionary as you go.
def make_birddict2(birdlist2):
# first, get the unique set of birds
bird_types = set()
for bird, state, day in birdlist2:
bird_types.add(bird)
# then, set the count for each bird to 0
d = {}
for bird in bird_types:
d[bird] = 0
# now, go through all of the entries in birdlist2 and, for each one, increment the count in the dictionary
for bird, state, day in birdlist2:
d[bird] = d[bird] + 1
return d
make_birddict2(birdlist2)
{'chickadee': 1, 'robin': 2, 'seagull': 2}
A good rule of thumb in Python is that any time you have nested loops, you might be able to use a dictionary to do the same thing, but more efficiently.
It's all well and good to look at strings like we are, but what if we want to graph things by day collected? We can't just use the dates as they are, because they're not numbers that you can graph by. What we have to do is convert them into a format that Python can compare. How?
We can use the function datetime.strptime, docs here: http://docs.python.org/3.3/library/datetime.html?highlight=strptime#strftime-and-strptime-behavior
from datetime import datetime
# strptime will convert strings into dates given the appropriate formatting string (see docs above)
print datetime.strptime('may 21', '%B %d')
1900-05-21 00:00:00
# we should fix that year...
print datetime.strptime('may 21' + ' 2013', '%B %d %Y')
2013-05-21 00:00:00
# ok -- and now we need to convert back to a straight up number.
# If all the dates are 2013, we can just ask for day of year...
date = datetime.strptime('may 21' + ' 2013', '%B %d %Y')
day_of_year = date.strftime('%j')
print day_of_year
141
# hmm, does that actually work!?
for bird, state, day in birdlist2:
date = datetime.strptime(day + ' 2013', '%B %d %Y')
day_of_year = date.strftime('%j')
print day, date, day_of_year
may 23 2013-05-23 00:00:00 143 may 24 2013-05-24 00:00:00 144 may 24 2013-05-24 00:00:00 144 june 1 2013-06-01 00:00:00 152 june 1 2013-06-01 00:00:00 152
Hey, that looks OK -- let's transform the list into something we want to plot, birds by day:
birdcount_by_day = {}
for bird, state, day in birdlist2:
birdcount_by_day[day] = 0
for bird, state, day in birdlist2:
birdcount_by_day[day] = birdcount_by_day[day] + 1
print birdcount_by_day
{'may 24': 2, 'may 23': 1, 'june 1': 2}
plotme_x = []
plotme_y = []
for day in birdcount_by_day: # note, iterating over dictionaries gives you keys
date = datetime.strptime(day + ' 2013', '%B %d %Y')
day_of_year = date.strftime('%j')
# trick: we need to convert day_of_year into an integer
day_of_year = int(day_of_year)
# retrieve birdcount
count = birdcount_by_day[day]
# now add day_of_year and birdcount
plotme_x.append(day_of_year)
plotme_y.append(count)
print plotme_x
print plotme_y
[144, 143, 152] [2, 1, 2]
plot(plotme_x, plotme_y, 'ro')
[<matplotlib.lines.Line2D at 0x109ef8110>]
plot(plotme_x, plotme_y, 'ro')
axis(ymin=0, xmin=140, xmax=160)
(140, 160, 0, 2.2000000000000002)
# if we plot with lines, we see that because the points aren't sorted, we don't get what we expect.
plot(plotme_x, plotme_y, 'r-')
axis(ymin=0, xmin=140, xmax=160)
(140, 160, 0, 2.2000000000000002)
# how do we fix this? one way is to zip the two lists together, sort them, and pull them back out...
plotme_all = zip(plotme_x, plotme_y)
plotme_all.sort()
plotme_x = []
plotme_y = []
for (x, y) in plotme_all:
plotme_x.append(x)
plotme_y.append(y)
print plotme_x
[143, 144, 152]
# tada...
plot(plotme_x, plotme_y, 'r-')
axis(ymin=0, xmin=140, xmax=160)
(140, 160, 0, 2.2000000000000002)
Let's do this for a BIG bird list next.
bigbirdcount_by_day = {}
for bird, state, day in bigbirdlist:
bigbirdcount_by_day[day] = 0
for bird, state, day in bigbirdlist:
bigbirdcount_by_day[day] = bigbirdcount_by_day[day] + 1
plotme = []
for day in bigbirdcount_by_day: # note, iterating over dictionaries gives you keys
date = datetime.strptime(day + ' 2013', '%B %d %Y')
day_of_year = date.strftime('%j')
# trick: we need to convert day_of_year into an integer
day_of_year = int(day_of_year)
# retrieve birdcount
count = bigbirdcount_by_day[day]
# now add day_of_year and birdcount
plotme.append((day_of_year, count))
plotme.sort()
print plotme[:5]
[(18, 1), (19, 1), (22, 1), (28, 1), (31, 1)]
plotme_x = []
plotme_y = []
for (x, y) in plotme:
plotme_x.append(x)
plotme_y.append(y)
plot(plotme_x, plotme_y, 'r-')
[<matplotlib.lines.Line2D at 0x1082a3f90>]
This is all fine and well, you say, but your advisor wants a spreadsheet. Or you want one. Let's say, you want to take all your big bird list and just print out the ones that are from kentucky into a spreadsheet. How would you do that?
Basically, you want to create a 'csv' file, and, conveniently, the 'csv' module contains routines to help do that! Instead of 'csv.reader', you use 'csv.writer'.
# start by picking out the birds we want to save:
birdlist2 = read_birdlist2('birds.txt')
print birdlist2
[('robin', 'kentucky', 'may 23'), ('robin', 'west virginia', 'may 24'), ('seagull', 'maine', 'may 24'), ('chickadee', 'kentucky', 'june 1'), ('seagull', 'maine', 'june 1')]
kentucky_birds = []
for bird, state, date in birdlist2:
if state == 'kentucky':
kentucky_birds.append((bird, date))
print kentucky_birds
[('robin', 'may 23'), ('chickadee', 'june 1')]
import csv
# now, open a file for writing, and use 'csv.writer' to write rows to it.
# just for grins, use tab for a delimiter, instead of a comma.
fp = file('kentucky-birds.tsv', 'wb')
w = csv.writer(fp, delimiter='\t') # <- use tab instead of the default, which is a comma
for bird, date in kentucky_birds:
row = (bird, date)
w.writerow(row)
fp.close()
# check it out -- did it work?
print open('kentucky-birds.tsv').read()
robin may 23 chickadee june 1
yep!
You might be thinking to yourself, sure, I know how to do all of that in Excel. And heck, maybe you do.
But there are three reasons you might not want to do it in Excel.
First, by doing everything manually, your work increases in direct proportion to the number of things you want to do to the data.
Second, by doing everything manually, you cause errors that give you the wrong results.
And third, the trickier the analysis, the more likely it is that you introduce errors.
Well, and fourth, Excel has an upper limit of 2 million rows of data.
One thing that I would never want to do with Excel is integrate two different files. (I have done it, which is why I never want to do it again!)
%%file bird-weights.txt
robin,5
seagull,12
chickadee,4
Overwriting bird-weights.txt
Now suppose we want to create a combined list of birds, states, weights, and dates?
This kind of merge can be done by using a dictionary to associate the birds with their weights, and then output everything to another file.
# first, load in the bird weights
fp = file('bird-weights.txt', 'rb')
r = csv.reader(fp)
weight_dict = {}
for bird, weight in r:
weight_dict[bird] = weight
# now, go through all of the birdlist2 entries
birdlist2 = read_birdlist2('birds.txt')
new_birdlist = []
for bird, state, day in birdlist2:
weight = weight_dict[bird]
new_birdlist.append((bird,state,day,weight))
print new_birdlist
[('robin', 'kentucky', 'may 23', '5'), ('robin', 'west virginia', 'may 24', '5'), ('seagull', 'maine', 'may 24', '12'), ('chickadee', 'kentucky', 'june 1', '4'), ('seagull', 'maine', 'june 1', '12')]
# what if one of the birds doesn't have a weight in the file, though?
del weight_dict['robin']
new_birdlist = []
for bird, state, day in birdlist2:
weight = weight_dict[bird]
new_birdlist.append((bird,state,day,weight))
print new_birdlist
--------------------------------------------------------------------------- KeyError Traceback (most recent call last) <ipython-input-53-10d2070924bd> in <module>() 4 new_birdlist = [] 5 for bird, state, day in birdlist2: ----> 6 weight = weight_dict[bird] 7 new_birdlist.append((bird,state,day,weight)) 8 KeyError: 'robin'
# change the code to be aware of potentially missing data:
new_birdlist = []
for bird, state, day in birdlist2:
weight = weight_dict.get(bird, -1) # use '-1' or something patently absurd if no weight in table
new_birdlist.append((bird,state,day,weight))
print new_birdlist
# and now you can output to a file, or whatever.
Try to make your code easy to read. Rather than commenting everything to death, make your function and variable names descriptive.
Test out your functions on small data sets where you know the answer.
The third time you find yourself performing the same query or merge, write a function to do it instead. Corrollary: don't overplan in advance! Only do this the third time!