We are studying inflammation in patients who have been given a new treatment for arthritis, and need to analyze the first dozen data sets. The data sets are stored in comma-separated values (CSV) format: each row holds information for a single patient, and the columns represent successive days. The first few rows of our data file look like this:
0,0,1,3,1,2,4,7,8,3,3,3,10,5,7,4,7,7,12,18,6,13,11,11,7,7,4,6,8,8,4,4,5,7,3,4,2,3,0,0
0,1,2,1,2,1,3,2,2,6,10,11,5,9,4,4,7,16,8,6,18,4,12,5,12,7,11,5,11,3,3,5,4,4,5,5,1,1,0,1
0,1,1,3,3,2,6,2,5,9,5,7,4,5,4,15,5,11,9,10,19,14,12,17,7,12,11,7,4,2,10,5,4,2,2,3,2,2,1,1
0,0,2,0,4,2,2,1,6,7,10,7,9,13,8,8,15,10,10,7,17,4,4,7,6,15,6,4,9,11,3,5,6,3,3,4,2,3,2,1
0,1,1,3,3,1,3,5,2,4,4,7,6,5,3,10,8,10,6,17,9,14,9,7,13,9,12,6,7,7,9,6,3,2,2,4,2,0,1,1
While a lot of powerful tools are built into Python, even more tools exist in the libraries and packages that are written in Python.
In order to load our inflammation data, we need to import a library called NumPy. We could load the data using the built-in csv module, but NumPy is the better choice when we want to do things with numbers, especially matrices or tables.
We can load NumPy using import. We also load urllib to read the data from the web:
import numpy as np
import urllib.request
First, we need to copy the file from the web to our local disk. This is done using the urllib.request.urlretrieve function:
url = r"https://raw.githubusercontent.com/swcarpentry/python-novice-inflammation/gh-pages/data/inflammation-01.csv"
fname = "inflammation-01.csv"
urllib.request.urlretrieve(url, fname)
!cat $fname
0,0,1,3,1,2,4,7,8,3,3,3,10,5,7,4,7,7,12,18,6,13,11,11,7,7,4,6,8,8,4,4,5,7,3,4,2,3,0,0
0,1,2,1,2,1,3,2,2,6,10,11,5,9,4,4,7,16,8,6,18,4,12,5,12,7,11,5,11,3,3,5,4,4,5,5,1,1,0,1
0,1,1,3,3,2,6,2,5,9,5,7,4,5,4,15,5,11,9,10,19,14,12,17,7,12,11,7,4,2,10,5,4,2,2,3,2,2,1,1
0,0,2,0,4,2,2,1,6,7,10,7,9,13,8,8,15,10,10,7,17,4,4,7,6,15,6,4,9,11,3,5,6,3,3,4,2,3,2,1
0,1,1,3,3,1,3,5,2,4,4,7,6,5,3,10,8,10,6,17,9,14,9,7,13,9,12,6,7,7,9,6,3,2,2,4,2,0,1,1
...
We saved the file to the local filesystem. We can have a look at the file contents from the IPython notebook by calling a system command such as cat (or type on Windows). To do this, we need to put an ! at the beginning of the line and prepend any variable name with $.
Once it's done, we can ask NumPy to read the data file:
data = np.loadtxt(fname, delimiter=',')
print(data)
[[ 0.  0.  1. ...,  3.  0.  0.]
 [ 0.  1.  2. ...,  1.  0.  1.]
 [ 0.  1.  1. ...,  2.  1.  1.]
 ...,
 [ 0.  1.  1. ...,  1.  1.  1.]
 [ 0.  0.  0. ...,  0.  2.  0.]
 [ 0.  0.  1. ...,  1.  1.  0.]]
The expression np.loadtxt(...) is a function call that asks Python to run the function loadtxt that belongs to the numpy library. This dotted notation is used everywhere in Python to refer to the parts of things as thing.component (see: namespaces).
numpy.loadtxt has two parameters: the name of the file we want to read and the delimiter that separates values on a line. Both are character strings (or strings for short); here we pass the file name through the variable fname and the delimiter as ','.
We saved the output of loadtxt in the variable data. When we print(data), only a few rows and columns are shown (with ... to omit elements when displaying big arrays). To save space, NumPy displays numbers as 1. instead of 1.0 when there's nothing interesting after the decimal point.
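To see loadtxt in action without the inflammation file, we can write a tiny CSV ourselves and read it back (the file name here is made up for the demonstration):

```python
import os
import tempfile

import numpy as np

# write a tiny two-row CSV so the example is self-contained
path = os.path.join(tempfile.gettempdir(), "tiny.csv")
with open(path, "w") as f:
    f.write("0,0,1\n0,1,2\n")

tiny = np.loadtxt(path, delimiter=',')
print(tiny.shape)    # (2, 3)
print(tiny[1, 2])    # 2.0 -- loadtxt gives us floats
```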
Now that our data is in memory, we can start doing things with it. First, let's ask what type of thing data refers to:
print(type(data))
<class 'numpy.ndarray'>
The output tells us that data currently refers to an N-dimensional array created by the NumPy library. We can see what its shape is like this:
print(data.shape)
n_patients, n_days = data.shape
(60, 40)
This tells us that data has 60 rows and 40 columns: 60 patients and 40 days. data.shape is a member of data, i.e., a value that is stored as part of a larger value. We use the same dotted notation for the members of values that we use for the functions in libraries because they have the same part-and-whole relationship.
If we want to get a single value from the matrix, we must provide an index in square brackets, just as we do with a list, but with as many indices as the number of dimensions in shape (two in this case):
print("first value in data:", data[0, 0])
print("middle value in data:", data[30, 20])
first value in data: 0.0
middle value in data: 13.0
The expression data[30, 20] may not surprise you, but data[0, 0] might. Programming languages like Fortran and MATLAB start counting at 1, because that's what human beings have done for thousands of years. Languages in the C family (including C++, Java, Perl, and Python) count from 0 because that's simpler for computers to do. Just like with list and str, if we have an M×N array in Python, its indices go from 0 to M-1 on the first axis and 0 to N-1 on the second. It takes a bit of getting used to, but one way to remember the rule is that the index is how many steps we have to take from the start to get the item we want.
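A quick sketch of the rule, using a small made-up array rather than our data:

```python
import numpy as np

grid = np.arange(12).reshape(3, 4)   # a 3x4 array: valid indices are 0..2 and 0..3
print(grid[0, 0])    # 0  -- the first element, zero steps from the start
print(grid[2, 3])    # 11 -- the last element, at [M-1, N-1]
try:
    grid[3, 0]       # one step too far on the first axis
except IndexError:
    print("index 3 is out of bounds for axis 0")
```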
In the Corner
What may also surprise you is that when Python displays an array, it shows the element with index [0, 0] in the upper left corner rather than the lower left. This is consistent with the way mathematicians draw matrices, but different from Cartesian coordinates. The indices are (row, column) instead of (column, row) for the same reason, which can be confusing when plotting data.
An index like [30, 20] selects a single element of an array, but we can select whole sections as well. For example, we can select the first ten days (columns) of values for the first four patients (rows) like this:
print(data[0:4, 0:10])
[[ 0.  0.  1.  3.  1.  2.  4.  7.  8.  3.]
 [ 0.  1.  2.  1.  2.  1.  3.  2.  2.  6.]
 [ 0.  1.  1.  3.  3.  2.  6.  2.  5.  9.]
 [ 0.  0.  2.  0.  4.  2.  2.  1.  6.  7.]]
The slice 0:4 means, "Start at index 0 and go up to, but not including, index 4." Again, the up-to-but-not-including takes a bit of getting used to, but the rule is that the difference between the upper and lower bounds is the number of values in the slice.
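We can check the rule on a small example (the array here is just for illustration):

```python
import numpy as np

nums = np.arange(10)      # [0 1 2 3 4 5 6 7 8 9]
print(nums[2:7])          # [2 3 4 5 6]
print(len(nums[2:7]))     # 5, i.e. 7 - 2
```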
We don't have to start slices at 0:
print(data[5:10, 0:10])
[[ 0.  0.  1.  2.  2.  4.  2.  1.  6.  4.]
 [ 0.  0.  2.  2.  4.  2.  2.  5.  5.  8.]
 [ 0.  0.  1.  2.  3.  1.  2.  3.  5.  3.]
 [ 0.  0.  0.  3.  1.  5.  6.  5.  5.  8.]
 [ 0.  1.  1.  2.  1.  3.  5.  3.  5.  8.]]
We also don't have to include the upper or lower bound on the slice. If we don't include the lower bound, Python uses 0 by default; if we don't include the upper, the slice runs to the end of the axis, and if we don't include either (i.e., if we just use ':' on its own), the slice includes everything:
small = data[:3, 36:]
print('small is:')
print(small)
small is:
[[ 2.  3.  0.  0.]
 [ 1.  1.  0.  1.]
 [ 2.  2.  1.  1.]]
Arrays also know how to perform common mathematical operations on their values. The simplest operations with data are arithmetic: add, subtract, multiply, and divide. When you do such operations on arrays, the operation is done on each individual element of the array. Thus:
doubledata = data * 2.0
will create a new array doubledata whose elements have the value of two times the value of the corresponding elements in data.
print('original:')
print(data[:3, 36:])
print('doubledata:')
print(doubledata[:3, 36:])
original:
[[ 2.  3.  0.  0.]
 [ 1.  1.  0.  1.]
 [ 2.  2.  1.  1.]]
doubledata:
[[ 4.  6.  0.  0.]
 [ 2.  2.  0.  2.]
 [ 4.  4.  2.  2.]]
If, instead of taking an array and doing arithmetic with a single value (as above) you did the arithmetic operation with another array of the same size and shape, the operation will be done on corresponding elements of the two arrays. Thus:
tripledata = doubledata + data
will give you an array where tripledata[0,0] will equal doubledata[0,0] plus data[0,0], and so on for all other elements of the arrays.
print('tripledata:')
print(tripledata[:3, 36:])
tripledata:
[[ 6.  9.  0.  0.]
 [ 3.  3.  0.  3.]
 [ 6.  6.  3.  3.]]
Calculate the square root of the data using numpy. Print the result for the first 5 columns of the first 3 rows.
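One possible solution, sketched here with a small stand-in array built from the first values of the data file (so the snippet runs on its own; with the real data, apply np.sqrt to the data array loaded above):

```python
import numpy as np

# stand-in for `data`: the first 3 rows and 5 columns of inflammation-01.csv
data = np.array([[0, 0, 1, 3, 1],
                 [0, 1, 2, 1, 2],
                 [0, 1, 1, 3, 3]], dtype=float)

# np.sqrt works elementwise, just like the arithmetic operators
print(np.sqrt(data[:3, :5]))
```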
Often, we want to do more than add, subtract, multiply, and divide values of data. Arrays also know how to do more complex operations on their values. If we want to find the average inflammation for all patients on all days, for example, we can just ask the array for its mean value:
print(data.mean())
6.14875
mean is a method of the array, i.e., a function that belongs to it in the same way that the member shape does. If variables are nouns, methods are verbs: they are what the thing in question knows how to do. This is why data.shape doesn't need to be called (it's just a thing) but data.mean() does (it's an action). It is also why we need empty parentheses for data.mean(): even when we're not passing in any parameters, parentheses are how we tell Python to go and do something for us.
NumPy arrays have lots of useful methods:
print('maximum inflammation:', data.max())
print('minimum inflammation:', data.min())
print('standard deviation:', data.std())
maximum inflammation: 20.0
minimum inflammation: 0.0
standard deviation: 4.61383319712
When analyzing data, though, we often want to look at partial statistics, such as the maximum value per patient or the average value per day. One way to do this is to select the data we want into a new temporary array, then ask it to do the calculation:
patient_0 = data[0, :] # 0 on the first axis, everything on the second
print('maximum inflammation for patient 0:', patient_0.max())
maximum inflammation for patient 0: 18.0
What if we need the maximum inflammation for all patients, or the average for each day? As the diagram below shows, we want to perform the operation across an axis. To support this, most array methods allow us to specify the axis we want to work on. If we ask for the average across axis 0, we get:
print(data.mean(axis=0))
[ 0. 0.45 1.11666667 1.75 2.43333333 3.15 3.8 3.88333333 5.23333333 5.51666667 5.95 5.9 8.35 7.73333333 8.36666667 9.5 9.58333333 10.63333333 11.56666667 12.35 13.25 11.96666667 11.03333333 10.16666667 10. 8.66666667 9.15 7.25 7.33333333 6.58333333 6.06666667 5.95 5.11666667 3.6 3.3 3.56666667 2.48333333 1.5 1.13333333 0.56666667]
As a quick check, we can ask this array what its shape is:
print(data.mean(axis=0).shape)
(40,)
The expression (40,) tells us we have a 1-dimensional array with 40 elements, so this is the average inflammation per day for all patients. If we average across axis 1, we get:
print(data.mean(axis=1))
[ 5.45 5.425 6.1 5.9 5.55 6.225 5.975 6.65 6.625 6.525 6.775 5.8 6.225 5.75 5.225 6.3 6.55 5.7 5.85 6.55 5.775 5.825 6.175 6.1 5.8 6.425 6.05 6.025 6.175 6.55 6.175 6.35 6.725 6.125 7.075 5.725 5.925 6.15 6.075 5.75 5.975 5.725 6.3 5.9 6.75 5.925 7.225 6.15 5.95 6.275 5.7 6.1 6.825 5.975 6.725 5.7 6.25 6.4 7.05 5.9 ]
which is the average inflammation per patient across all days.
On which day did each patient have the most inflammation? Use data.argmax to find out.
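A sketch of the idea on a tiny made-up array; argmax with axis=1 returns, for each row (patient), the index of the day with the largest value:

```python
import numpy as np

# 3 "patients" x 4 "days", made up for illustration
data = np.array([[0, 5, 2, 1],
                 [3, 1, 7, 2],
                 [1, 1, 1, 9]])
print(data.argmax(axis=1))   # [1 2 3]
```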
The mathematician Richard Hamming once said, "The purpose of computing is insight, not numbers", and the best way to develop insight is often to visualize data. Visualization deserves an entire lecture (or course) of its own, but we can explore a few features of Python's matplotlib here. While there is no "official" plotting library, this package is the de facto standard. First, let's tell the IPython Notebook that we want our plots displayed inline, rather than in a separate viewing window. This command will also import many other useful scientific Python functions and modules, including NumPy:
%matplotlib?
%matplotlib inline
from matplotlib.pyplot import *
The %
at the start of the line signals that this is a command for the notebook, rather than a statement in Python.
Next, use two of matplotlib's functions to create and display a heat map of our data:
imshow(data, aspect=0.66);  # aspect controls the width of one x unit vs. one y unit
<matplotlib.image.AxesImage at 0x47c1e10>
Blue regions in this heat map are low values, while red shows high values. As we can see, inflammation rises and falls over a 40-day period. (The ; at the end of the line suppresses the automatic printing of the function's return value, which is an AxesImage in this case.)
Let's take a look at the average inflammation over time:
avg_inflammation = data.mean(axis=0)
plot(avg_inflammation);
Here, we have put the average per day across all patients in the variable avg_inflammation, then used plot to create and display a line graph of those values. The result is roughly a linear rise and fall, which is suspicious: based on other studies, we expect a sharper rise and slower fall. Let's have a look at two other statistics:
plot(avg_inflammation, label='avg')
plot(data.max(axis=0), label='max')
plot(data.min(axis=0), label='min')
title('inflammation per day')
xlabel('day')
ylabel('inflammation')
legend();
The maximum value rises and falls perfectly smoothly, while the minimum seems to be a step function. Neither result seems particularly likely, so either there's a mistake in our calculations or something is wrong with our data.
matplotlib has many more plotting commands. For example, we can calculate the standard error of the mean (sem), which is defined as the standard deviation of the sample divided by the square root of the number of samples:
$$
\mathrm{sem} = \frac{s}{\sqrt{n}}, \qquad s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}
$$
This can be calculated using the method std and the function sqrt. Note that when calculating the standard deviation of a sample, the number of degrees of freedom is $n-1$; this is represented by the ddof=1 argument to std.
sem_inflammation = data.std(axis=0, ddof=1) / np.sqrt(n_patients)
errorbar(x=range(n_days), y=avg_inflammation, yerr=sem_inflammation, ecolor='k')
<Container object of 3 artists>
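The effect of ddof can be seen on a small example (the numbers here are made up):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])           # mean is 2.5, sum of squared deviations is 5
print(x.std())                               # divides by n:   sqrt(5/4) ~ 1.118
print(x.std(ddof=1))                         # divides by n-1: sqrt(5/3) ~ 1.291
print(x.std(ddof=1) / np.sqrt(len(x)))       # the standard error of the mean
```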
In exercise B we calculated the day on which each patient had the most inflammation. Let's plot it now: make a scatter plot using the scatter command, with patient number on the x-axis and day of max inflammation on the y-axis. Don't forget to add axis labels with xlabel and ylabel!
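A possible solution sketch; here we generate random stand-in values so the snippet runs without the inflammation file (with the real data, use the data array loaded above instead):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")            # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt

rng = np.random.RandomState(0)               # made-up values standing in for `data`
data = rng.randint(0, 20, size=(60, 40))

max_day = data.argmax(axis=1)                # day of maximum inflammation per patient
plt.scatter(range(len(max_day)), max_day)
plt.xlabel('patient number')
plt.ylabel('day of max inflammation')
plt.savefig('max_day.png')
```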
Many times we want to display data with some smoothing, or fit a regression line over it. To do this, we will use the pandas and seaborn libraries. pandas is a very strong library for manipulating large datasets; seaborn adds a set of sophisticated statistical visualizations on top of pandas.
Unfortunately, seaborn doesn't ship with Pyzo, so we need to install it manually. The best way is to use conda or pip from the Pyzo folder (make sure you write the proper path; on my PC it's D:\pyzo2014a):
D:\pyzo2014a\Scripts\conda install seaborn
or
D:\pyzo2014a\Scripts\pip install seaborn
On Linux, pip is in the bin rather than the Scripts folder and must be called using python3 from the same folder.
import zipfile
import pandas as pd
import seaborn as sns
sns.set_style("ticks") # control the plotting style
sns.set_context("talk") # set to talk because this is a lecture! hit shift-tab after the "(" to see other options.
We will analyze animal life-history data from AnAge.
We will get the data from the download page, but it's compressed with zip, so we need to unzip it before we can read the data using pandas' read_table function:
urllib.request.urlretrieve('http://genomics.senescence.info/species/dataset.zip', 'anage_dataset.zip')
('anage_dataset.zip', <http.client.HTTPMessage at 0x5eab5f0>)
with zipfile.ZipFile('anage_dataset.zip') as z:
    f = z.open('anage_data.txt')
    data = pd.read_table(f)
print(type(data))
print(data.shape)
<class 'pandas.core.frame.DataFrame'> (4212, 31)
pandas holds data in a DataFrame (similar to R's data frame). A DataFrame has a single row per observation (in contrast to the previous exercise, in which each table cell was one observation), and each column holds a single variable. Variables can be numbers or strings.
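The shape of such a table can be sketched with a hypothetical miniature of the AnAge data: one row per species, one column per variable, mixing strings and numbers:

```python
import pandas as pd

# a made-up miniature of the AnAge table, for illustration only
mini = pd.DataFrame({
    "Common name": ["Fruit fly", "Honey bee", "American lobster"],
    "Class": ["Insecta", "Insecta", "Malacostraca"],
    "Body mass (g)": [None, None, None],   # missing values become NaN
})
print(mini.shape)            # (3, 3)
print(mini["Class"][0])      # Insecta
```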
data.head()
HAGRID | Kingdom | Phylum | Class | Order | Family | Genus | Species | Common name | Female maturity (days) | ... | Source | Specimen origin | Sample size | Data quality | IMR (per yr) | MRDT (yrs) | Metabolic rate (W) | Body mass (g) | Temperature (K) | References | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4 | Animalia | Arthropoda | Insecta | Diptera | Drosophilidae | Drosophila | melanogaster | Fruit fly | 7 | ... | NaN | captivity | large | acceptable | 0.05 | 0.04 | NaN | NaN | NaN | 2,20,32,47,53,68,69,240,241,242,243,274,602,98... |
1 | 5 | Animalia | Arthropoda | Insecta | Hymenoptera | Apidae | Apis | mellifera | Honey bee | NaN | ... | 812 | unknown | medium | acceptable | NaN | NaN | NaN | NaN | NaN | 63,407,408,741,805,806,808,812,815,828,830,831... |
2 | 7 | Animalia | Arthropoda | Insecta | Hymenoptera | Formicidae | Lasius | niger | Black garden ant | NaN | ... | 411 | unknown | medium | acceptable | NaN | NaN | NaN | NaN | NaN | 411,813,814 |
3 | 8 | Animalia | Arthropoda | Insecta | Lepidoptera | Nymphalidae | Bicyclus | anynana | Squinting bush brown | 15 | ... | 811 | wild | medium | acceptable | NaN | NaN | NaN | NaN | NaN | 418,809,811 |
4 | 9 | Animalia | Arthropoda | Malacostraca | Decapoda | Nephropidae | Homarus | americanus | American lobster | NaN | ... | 2 | wild | medium | acceptable | NaN | NaN | NaN | NaN | NaN | 2,13,594 |
5 rows × 31 columns
DataFrame
has many of the features of numpy.ndarray
- it also has a shape
and various statistical methods (max
, mean
etc.).
However, DataFrame
allows richer indexing.
For example, let's browse our data for species that have body mass greater than 300 kg.
First we will create a boolean index that tells us whether each row is a large-animal row or not:
large_index = data['Body mass (g)'] > 300*1000 # 300 kg
print(large_index)
0        False
1        False
2        False
3        False
4        False
...
4210     False
4211     False
Name: Body mass (g), Length: 4212, dtype: bool
Now, we slice our data with this boolean index. The iterrows method lets us iterate over the rows of the data. For each row we get both the row number as an int and the row itself as a Series object (similar to a dict for our use).
large_data = data[large_index]
for i,row in large_data.iterrows():
print(row['Common name'], row['Body mass (g)']/1000, 'kg')
Domestic cattle 347.0 kg Dromedary 407.0 kg Moose 325.0 kg Asian elephant 3672.0 kg West Indian manatee 450.0 kg
So... a Dromedary is a Camel.
Let's continue with small and medium animals. For starters, let's plot a scatter of body mass vs. metabolic rate. Because we work with pandas, we can do that with the plot method of DataFrame, specifying the columns for x and y and a plotting style (without the style we would get a line plot, which makes no sense here).
data = data[data['Body mass (g)'] < 3e5]
data.plot(x='Body mass (g)', y='Metabolic rate (W)', style='o')
ylabel('Metabolic rate (W)')
sns.despine(offset=10)
ylim(0, 20);
sns.despine removes the upper and right axes, and the additional offset keyword creates a space between the plot and the axis.
From this plot it seems that 1) there is a correlation between body mass and metabolic rate, and 2) there are many small animals (less than 30 kg) and not many medium animals (between 50 and 300 kg).
Before we continue, I prefer to have mass in kg, so let's add a new column:
data['Body mass (kg)'] = data['Body mass (g)']/1000
Next, let's check how many records we have for each Class (as in the taxonomic unit):
class_counts = data.Class.value_counts()
print(class_counts)
Mammalia    417
Aves        171
Amphibia     18
Reptilia     16
dtype: int64
class_counts.plot(kind='bar')
ylabel('# species')
sns.despine()
So we have lots of mammals and birds, and a few reptiles and amphibians. This is important, as amphibians and reptiles could have a different relationship between mass and metabolism.
Let's do a simple linear regression plot, but separately for each Class. This is done using seaborn's lmplot function:
sns.lmplot(x='Body mass (kg)', y='Metabolic rate (W)', data=data, hue='Class', ci=False, size=6, legend_out=False);
Note that hue means color, but it also causes seaborn to fit a different linear model to each of the Classes.
As for the last 3 parameters:

  * ci controls the confidence intervals. I chose False, but setting it to True will show them.
  * size controls the size of the plot.
  * legend_out decides if the legend is inside the plot or outside. We have enough space for it in the left corner.

We can see that mammals and birds have a clear correlation between size and metabolism, and that it extends over a nice range of mass, so let's stick to mammals; next up we will see which Orders of mammals we have.
mammalia = data[data.Class=='Mammalia']
order_counts = mammalia.Order.value_counts()
ax=order_counts.plot(kind='bar')
ylabel('# species')
sns.despine();
You see we have a lot of rodents and carnivores, but also a good number of bats (Chiroptera) and primates.
Let's continue only with orders that have at least 20 species; this also includes some cool marsupials like the kangaroo, koala, and Taz (Diprotodontia and Dasyuromorphia).
orders = order_counts[order_counts >= 20]
print(orders)
abund_mammalia = mammalia[mammalia.Order.isin(orders.index)]
Rodentia          155
Carnivora          52
Chiroptera         33
Primates           26
Diprotodontia      22
Dasyuromorphia     20
dtype: int64
sns.lmplot(x='Body mass (kg)', y='Metabolic rate (W)', data=abund_mammalia, hue="Order",
ci=False, size=8, legend_out=False, line_kws={'lw':2}, scatter_kws={'s':50, 'alpha':0.5});
Because there is a lot of data here, I made the lines thinner (this can be done by giving matplotlib keywords as a dictionary to the argument line_kws) and I made the markers bigger but with alpha (transparency) 0.5, using the scatter_kws argument.
Still, there's too much data, and part of the problem is that some Orders are large (primates) and some are small (rodents).
Let's plot a separate plot for each Order. We do this using the col and row arguments of lmplot, but in general this can be done for any plot using seaborn's FacetGrid function.
sns.lmplot(x='Body mass (kg)', y='Metabolic rate (W)', data=abund_mammalia, hue="Order",
col="Order", col_wrap=3, ci=None, scatter_kws={'s':40}, sharex=False, sharey=False);
We used the sharex=False and sharey=False arguments so that each Order gets its own axis range and the data spreads nicely.
Last but not least, let's have a closer look at the correlation between mass and metabolism in primates. We will do a joint plot, which will give us the Pearson correlation and the distribution of each parameter.
primates = mammalia[mammalia.Order == 'Primates']
print(sorted(primates["Common name"]))
sns.jointplot(x='Body mass (kg)', y='Metabolic rate (W)', data=primates, kind='reg', size=8);
['Blue monkey', 'Brown lemur', 'Calabar angwantibo', "Demidoff's galago", 'Fat-tailed dwarf lemur', "Geoffroy's tamarin", 'Greater galago', 'Guereza', 'Hamadryas baboon', 'Human', 'Mantled howler monkey', 'Northern night monkey', 'Patas monkey', 'Philippine tarsier', 'Potto', 'Pygmy marmoset', 'Senegal galago', 'Slender loris', 'Slow loris', 'Small-eared galago', 'South African galago', 'South American squirrel monkey', 'Spectral tarsier', "Verreaux's sifaka", 'Western needle-clawed galago', 'White-tufted-ear marmoset']
This notebook is part of the Python Programming for Life Sciences Graduate Students course given in Tel-Aviv University, Spring 2015.
The notebook was written using Python 3.4.1 and IPython 2.1.0 (download from PyZo).
The code is available at https://github.com//Py4Life/TAU2015/blob/master/lecture7.ipynb.
The notebook can be viewed online at http://nbviewer.ipython.org/github//Py4Life/TAU2015/blob/master/lecture7.ipynb.
The notebook is also available as a PDF at https://github.com/Py4Life/TAU2015/blob/master/lecture7.pdf?raw=true.
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.