Software Carpentry Bootcamp
Scripps Research Institute, La Jolla, CA
November 2012
Prepared by: Justin Kitzes
The core Python libraries, along with external packages such as numpy, matplotlib, pandas, and others,
provide both low-level and high-level methods for extracting data from files, processing this data, and
saving results back to disk.
In this lesson, we will read text-based data files in two ways. First, we will use Python's low-level
file reader and writer to examine the contents of a relatively unstructured text file, perform some analysis,
and write our results back out to another text file. Second, we will use high-level helper functions
from matplotlib to easily read, modify, and write back out a csv file (csv stands for comma-separated
values, and is the kind of data that you might have in a spreadsheet).
In general, you will want to use easy, high-level approaches whenever possible (they will save you time
and probably errors) and switch over to the low-level methods when your data have an unusual structure
that the high-level readers cannot deal with easily. There are other high-level file handlers in numpy,
such as loadtxt and savetxt, and pandas, such as from_csv and to_csv, that are also useful,
and we encourage you to look into them.
The most basic way to get data into Python is to open a text file, read it line by line, and examine each line in turn to determine what to do with it (i.e., save it, skip it, extract some information, etc.).
This method will work on any type of file, including tables in csv format. However, this method really becomes useful when you find yourself with a "unique" type of data record, such as the one in the sample file shown below.
# Load numpy once again, and the plotting library for good measure
import numpy as np
%pylab inline
Welcome to pylab, a matplotlib-based Python environment [backend: module://IPython.zmq.pylab.backend_inline]. For more information, type 'help(pylab)'.
# We have a data file with records of animal sightings in it
!cat sightings_recs_sm.txt
0 {
Date: 2011-04-22
Time: 15:16
Animal: Wolverine
Count: 5
}

1 {
Date: 2011-04-22
Time: 17:28
Animal: Owl
Count: 4
}

2 {
Date: 2011-04-22
Time: 01:06
Animal: Owl
Count: 27
}
# Let's read the file line by line, and print each line back out (watch for newline characters)
f = open('sightings_recs_sm.txt', 'r')
for line in f:
    print repr(line)
f.close()
'0 {\n'
'Date: 2011-04-22\n'
'Time: 15:16\n'
'Animal: Wolverine\n'
'Count: 5\n'
'}\n'
'\n'
'1 {\n'
'Date: 2011-04-22\n'
'Time: 17:28\n'
'Animal: Owl\n'
'Count: 4\n'
'}\n'
'\n'
'2 {\n'
'Date: 2011-04-22\n'
'Time: 01:06\n'
'Animal: Owl\n'
'Count: 27\n'
'}\n'
'\n'
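The open/loop/close pattern above can also be written with a `with` block, which closes the file automatically even if an error occurs partway through. The sketch below is written for Python 3 (the lesson's own cells use Python 2 print statements). The helper name `read_lines` and the temporary sample file are assumptions made for illustration, not part of the lesson's data.

```python
import os
import tempfile

# Hypothetical stand-in for the first record of sightings_recs_sm.txt
sample_text = "0 {\nDate: 2011-04-22\nTime: 15:16\nAnimal: Wolverine\nCount: 5\n}\n\n"


def read_lines(path):
    """Return the lines of a text file with the trailing newline removed."""
    with open(path, "r") as f:  # the file is closed when this block ends
        return [line.rstrip("\n") for line in f]


# Write the sample to a temporary file so the example runs anywhere
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "sample_recs.txt")
    with open(path, "w") as f:
        f.write(sample_text)
    lines = read_lines(path)

for line in lines:
    print(repr(line))
```

Stripping the newline as each line is read avoids having to remember it later when comparing strings.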
# Let's only print out the lines that have counts
f = open('sightings_recs_sm.txt', 'r')
for line in f:
    if line[0:5] == 'Count':
        print repr(line)
f.close()
'Count: 5\n'
'Count: 4\n'
'Count: 27\n'
# Instead of printing the count, let's get the count and add it to a total
#a = 'Count: 5'
#print a.split(' ')[1]
f = open('sightings_recs_sm.txt', 'r')
total = 0
for line in f:
    if line[0:5] == 'Count':
        thiscount = int(line.split(' ')[1])
        total = total + thiscount
print total
f.close()
36
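The same running total can be wrapped in a small function, which makes it easy to test on a handful of lines before pointing it at a real file. This is a minimal Python 3 sketch. The function name `sum_counts` and the sample lines are assumptions for illustration. It uses `str.startswith` in place of slicing and splits on the colon rather than on a space.

```python
def sum_counts(lines):
    """Add up the numbers on every line that starts with 'Count:'."""
    total = 0
    for line in lines:
        if line.startswith("Count:"):
            # Everything after the first colon is the number;
            # int() ignores the surrounding spaces and the newline
            total += int(line.split(":", 1)[1])
    return total


# Hypothetical sample lines in the same layout as the sightings file
sample_lines = [
    "Animal: Wolverine\n",
    "Count: 5\n",
    "Animal: Owl\n",
    "Count: 4\n",
    "Count: 27\n",
]
total = sum_counts(sample_lines)
print(total)
```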
# Finally, let's write out our answer to a new file
g = open('sightings_recs_result.txt', 'w')
g.write('Total animals seen: ' + str(total))
g.close()
!cat sightings_recs_result.txt
Total animals seen: 36
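Writing can use the same `with` pattern as reading. The sketch below (Python 3) writes one labelled result per line and ends the line with an explicit `\n`, since `write` does not add one. The helper name `write_summary` and the temporary output path are assumptions for illustration.

```python
import os
import tempfile


def write_summary(path, label, value):
    """Write a single 'label: value' line to a text file."""
    with open(path, "w") as g:  # 'w' creates the file or overwrites it
        g.write("{}: {}\n".format(label, value))


with tempfile.TemporaryDirectory() as tmpdir:
    out_path = os.path.join(tmpdir, "result.txt")
    write_summary(out_path, "Total animals seen", 36)
    # Read the file back to confirm what was written
    with open(out_path) as g:
        written = g.read()

print(repr(written))
```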
1. Count the number of records that mention owls (not the total count of owls, just the number of times the word 'Owl' appears after the word 'Animal:' on a line). Write the result to a file that contains, on the first line, the string 'Counted by [your name]' and, on the second line, the string 'Total owl records:' followed by the number of records mentioning Owl. HINT: To write a new line in a file, use the special character '\n'.
2. Run your code on the much larger file 'sightings_recs_lg.txt'. How many records are there for owls?
Bonus
1. Calculate the total number of owls seen (the sum of the counts across all records that have Owl as the animal). HINT: Since the animal name comes before the count, one way to
do this is to create a new variable called animal, and update its value each time the for loop encounters
a line that starts with 'Animal:'. Then, when the 'Count:' line is encountered, you should only add the count to
your running total if the current value of the variable animal is Owl. Watch out for the newline character
\n at the end of strings!
2. Run your code on the much larger file 'sightings_recs_lg.txt'. How many total owls were seen?
f = open('sightings_recs_sm.txt', 'r')
#f = open('sightings_recs_lg.txt', 'r')
recs = 0
for line in f:
    if line[0:11] == 'Animal: Owl':
        recs = recs + 1
print recs
f.close()
g = open('sightings_recs_owls.txt', 'w')
g.write('Counted by Justin\n')
g.write('Total owl records: ' + str(recs))
g.close()
!cat sightings_recs_owls.txt
2
Counted by Justin
Total owl records: 2
f = open('sightings_recs_sm.txt', 'r')
#f = open('sightings_recs_lg.txt', 'r')
total = 0
animal = 'None'
for line in f:
    if line[0:6] == 'Animal':
        animal = line.split(' ')[1]
    if (line[0:5] == 'Count') and (animal[0:3] == 'Owl'):
        thiscount = int(line.split(' ')[1])
        total = total + thiscount
print total
f.close()
31
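Tracking the current animal in a variable works, but another option is to collect each brace-delimited record into a dictionary first and do the arithmetic afterwards. The Python 3 sketch below is one way this might look. The function name `parse_records` is hypothetical, and the sample text reproduces the three records of the small sightings file.

```python
def parse_records(lines):
    """Group 'Key: value' lines between '{' and '}' into one dict per record."""
    records = []
    current = None  # dict for the record being read, or None between records
    for raw in lines:
        line = raw.strip()
        if line.endswith("{"):
            current = {}
        elif line == "}":
            if current is not None:
                records.append(current)
            current = None
        elif current is not None and ":" in line:
            # Split on the first colon only, so times like 15:16 stay intact
            key, value = line.split(":", 1)
            current[key.strip()] = value.strip()
    return records


sample_text = (
    "0 {\nDate: 2011-04-22\nTime: 15:16\nAnimal: Wolverine\nCount: 5\n}\n\n"
    "1 {\nDate: 2011-04-22\nTime: 17:28\nAnimal: Owl\nCount: 4\n}\n\n"
    "2 {\nDate: 2011-04-22\nTime: 01:06\nAnimal: Owl\nCount: 27\n}\n\n"
)
records = parse_records(sample_text.splitlines())

# With records as dicts, the owl questions become one-liners
owl_records = [r for r in records if r["Animal"] == "Owl"]
owl_total = sum(int(r["Count"]) for r in owl_records)
print(len(owl_records), owl_total)
```

Because the values are stripped as they are stored, the comparison with 'Owl' does not need to account for the trailing newline.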
The process of reading and writing files becomes MUCH easier if your data is nicely structured, such as if
it is formatted as a csv file (like a spreadsheet). In this case, there are a number of high-level helper
functions that will help you to read, manipulate, and write this data. We'll focus on a set of helper
functions found in matplotlib.mlab.
There are lots of functions in mlab, but the ones that we want are listed under "record array helper
functions" on this page: http://matplotlib.org/api/mlab_api.html. The ones used in this lesson are
csv2rec, rec2csv, rec_drop_fields, and rec_append_fields.
These functions will help us work with a special type of numpy array known as a recarray, short for
record array. As you may remember, every element in a numpy array must be of the same data type (such as
an int, float, string, etc.). In a recarray, the elements are a complex data type called a record that
contains multiple fields, sort of like a simple database.
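To make the idea of a record concrete, a small record array can also be built by hand. The Python 3 sketch below uses only numpy. The field names and the three rows are made up for illustration and do not come from the lesson's files.

```python
import numpy as np

# Each element is a record with two named fields:
# a string of up to 6 characters and a 64-bit integer
sightings = np.array(
    [("Owl", 14), ("Fox", 4), ("Muskox", 31)],
    dtype=[("animal", "U6"), ("count", "i8")],
)

# Viewing the structured array as a recarray also allows attribute access
rec = sightings.view(np.recarray)

print(rec.dtype.names)  # the field names
print(rec[0])           # one whole record ("row")
print(rec["count"])     # one field across all records ("column")
print(rec.animal)       # same idea, using attribute access
```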
# Import matplotlib.mlab
import matplotlib.mlab as ml
# Our field tech was much nicer this time, and gave us a csv with the sightings in it
!cat sightings_tab_sm.csv
Date,Time,Animal,Count
2011-04-22,16:20,Owl,14
2011-04-22,10:08,Fox,4
2011-04-23,06:29,Muskox,31
2011-04-23,21:11,Owl,20
2011-04-23,07:41,Muskox,20
# We can load this csv into a numpy recarray in just one step
tab = ml.csv2rec('sightings_tab_sm.csv')
print repr(tab)
rec.array([ (datetime.date(2011, 4, 22), datetime.datetime(2012, 11, 13, 16, 20), 'Owl', 14),
       (datetime.date(2011, 4, 22), datetime.datetime(2012, 11, 13, 10, 8), 'Fox', 4),
       (datetime.date(2011, 4, 23), datetime.datetime(2012, 11, 13, 6, 29), 'Muskox', 31),
       (datetime.date(2011, 4, 23), datetime.datetime(2012, 11, 13, 21, 11), 'Owl', 20),
       (datetime.date(2011, 4, 23), datetime.datetime(2012, 11, 13, 7, 41), 'Muskox', 20)],
      dtype=[('date', '|O8'), ('time', '|O8'), ('animal', '|S6'), ('count', '<i8')])
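If the matplotlib installation at hand does not provide `csv2rec`, a similar record array can be obtained from numpy alone. The Python 3 sketch below uses `np.genfromtxt` as a stand-in, which is a swapped-in technique and not the lesson's method. It differs in two ways: the dates and times are left as plain strings, and the field names keep the capitalisation of the csv header. The csv text is held in memory so the example is self-contained.

```python
import io

import numpy as np

# Same contents as sightings_tab_sm.csv, held in memory for the example
csv_text = "\n".join([
    "Date,Time,Animal,Count",
    "2011-04-22,16:20,Owl,14",
    "2011-04-22,10:08,Fox,4",
    "2011-04-23,06:29,Muskox,31",
    "2011-04-23,21:11,Owl,20",
    "2011-04-23,07:41,Muskox,20",
]) + "\n"

tab = np.genfromtxt(
    io.StringIO(csv_text),  # a file name would also work here
    delimiter=",",
    names=True,             # take field names from the header row
    dtype=None,             # let numpy choose a type for each column
    encoding="utf-8",
).view(np.recarray)

print(tab.dtype.names)
print(tab["Animal"])
print(tab["Count"])
```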
# A recarray is a special kind of numpy array
# It is one dimensional, but each element is a record that has many pieces of data in it
print type(tab)
print tab.dtype # dtype isn't just an int, float, or str - it's complex
<class 'numpy.core.records.recarray'>
[('date', '|O8'), ('time', '|O8'), ('animal', '|S6'), ('count', '<i8')]
# We can access "rows" and "columns" in a recarray easily
# "Rows" are actually the indexes of the records
# "Columns" can be accessed by the name of the field
print tab[0]
print tab['animal']
print tab[0]['animal']
(datetime.date(2011, 4, 22), datetime.datetime(2012, 11, 13, 16, 20), 'Owl', 14)
['Owl' 'Fox' 'Muskox' 'Owl' 'Muskox']
Owl
# We can access values in the records by looping over each record individually
for record in tab:
    #print record
    if record['animal'] == 'Owl':
        print 'Count: ', record['count']
Count: 14
Count: 20
# Or by using boolean indexing
isowl = (tab['animal'] == 'Owl')
print isowl
print tab['count']
print tab['count'][isowl]
[ True False False  True False]
[14  4 31 20 20]
[14 20]
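Boolean masks can also be combined, which is handy when a question involves more than one condition. The Python 3 sketch below uses two plain numpy arrays holding the same values as the 'animal' and 'count' columns above. The variable names are chosen for illustration.

```python
import numpy as np

animal = np.array(["Owl", "Fox", "Muskox", "Owl", "Muskox"])
count = np.array([14, 4, 31, 20, 20])

# A mask is an array of True/False values, one per record
is_owl = (animal == "Owl")
owl_counts = count[is_owl]

# Combine masks element by element with & (and) or | (or);
# the parentheses around each comparison are required
big_owl_counts = count[is_owl & (count > 15)]

print(owl_counts)
print(big_owl_counts)
```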
# Let's make a new table (also a recarray) that drops the date and time cols
# Also, let's add a new col called 'group' that is True if there were more than 4 animals counted
tabnew = tab.copy()
tabnew = ml.rec_drop_fields(tabnew, ['date','time'])
isgroup = (tabnew['count'] > 4)
tabnew = ml.rec_append_fields(tabnew, 'group', isgroup)
print repr(tabnew)
rec.array([('Owl', 14, True), ('Fox', 4, False), ('Muskox', 31, True),
       ('Owl', 20, True), ('Muskox', 20, True)],
      dtype=[('animal', '|S6'), ('count', '<i8'), ('group', '|b1')])
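Dropping and appending fields can also be done with numpy's own `numpy.lib.recfunctions` module, which may be useful where the mlab helpers are unavailable. The Python 3 sketch below is an alternative under that assumption and not the lesson's method. The table is built by hand with the same values as above, with dates and times stored as strings.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

tab = np.array(
    [
        ("2011-04-22", "16:20", "Owl", 14),
        ("2011-04-22", "10:08", "Fox", 4),
        ("2011-04-23", "06:29", "Muskox", 31),
        ("2011-04-23", "21:11", "Owl", 20),
        ("2011-04-23", "07:41", "Muskox", 20),
    ],
    dtype=[("date", "U10"), ("time", "U5"), ("animal", "U6"), ("count", "i8")],
)

# Drop the date and time fields; usemask=False returns a plain (unmasked) array
tabnew = rfn.drop_fields(tab, ["date", "time"], usemask=False)

# Append a boolean field that is True when more than 4 animals were counted
isgroup = tabnew["count"] > 4
tabnew = rfn.append_fields(tabnew, "group", isgroup, usemask=False)

print(tabnew.dtype.names)
print(tabnew)
```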
# Finally, let's save our new recarray as a csv file
ml.rec2csv(tabnew, 'sightings_tab_results.csv')
# It worked!
!cat sightings_tab_results.csv
animal,count,group
Owl,14,True
Fox,4,False
Muskox,31,True
Owl,20,True
Muskox,20,True
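A record array can also be written out with Python's standard `csv` module: the field names become the header row and each record becomes one line. The Python 3 sketch below is one possible way to do this. The function name `records_to_csv` is hypothetical, and the output goes to an in-memory buffer so that the result can be inspected without touching the disk.

```python
import csv
import io

import numpy as np

table = np.array(
    [("Owl", 14, True), ("Fox", 4, False)],
    dtype=[("animal", "U6"), ("count", "i8"), ("group", "?")],
)


def records_to_csv(records, handle):
    """Write a structured array to an open text file handle as csv."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(records.dtype.names)  # header row from the field names
    for row in records.tolist():          # each record becomes a plain tuple
        writer.writerow(row)


buffer = io.StringIO()
records_to_csv(table, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, a file opened with `open(path, "w", newline="")` can be passed in place of the buffer.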
1. Create a variable focusanimal that is equal to Owl. Write code that reads a csv file of sightings, loops through the records, and calculates the mean number of the focus animal seen per record.
HINT: You'll want to use two counter variables, totalrecs and totalcount, to track the number of records
mentioning Owl and the total number of owls seen. Print out the answer for the file sightings_tab_sm.csv
and make sure that it's correct.
Bonus
1. Starting from the file sightings_tab_sm.csv, write code that adds a column called 'taxa' that contains the word 'bird' if the animal is an Owl or a Ptarmigan (the two birds in this data set), and the word
'mammal' otherwise. Save this csv file. HINT: You'll probably need to use a for loop, with an if statement
inside of it, to view each record and to get and store the appropriate value in a taxalist. You can then
append this taxalist to the table.
focusanimal = 'Owl'
tab = ml.csv2rec('sightings_tab_sm.csv')
totalrecs = 0
totalcount = 0
for rec in tab:
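    if rec['animal'] == focusanimal:
        totalrecs += 1
        totalcount += rec['count']
# Integer division truncates in Python 2; use float(totalcount)/totalrecs if the mean may not be whole
meancount = totalcount/totalrecs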
print totalrecs, meancount
2 17
focusanimal = 'Owl'
tab = ml.csv2rec('sightings_tab_sm.csv')
isfocus = (tab['animal'] == focusanimal)
totalrecs = np.sum(isfocus)
meancount = np.mean(tab['count'][isfocus])
print totalrecs, meancount
2 17.0
tabnew = ml.csv2rec('sightings_tab_sm.csv')
taxalist = []
for rec in tabnew:
    if (rec['animal'] == 'Owl') or (rec['animal'] == 'Ptarmigan'):
        taxalist.append('bird')
    else:
        taxalist.append('mammal')
tabnew = ml.rec_append_fields(tabnew, 'taxa', taxalist)
ml.rec2csv(tabnew, 'sightings_tab_new.csv')
!cat sightings_tab_new.csv
date,time,animal,count,taxa
2011-04-22,2012-11-13 16:20:00,Owl,14,bird
2011-04-22,2012-11-13 10:08:00,Fox,4,mammal
2011-04-23,2012-11-13 06:29:00,Muskox,31,mammal
2011-04-23,2012-11-13 21:11:00,Owl,20,bird
2011-04-23,2012-11-13 07:41:00,Muskox,20,mammal