We are studying inflammation in patients who have been given a new treatment for arthritis, and need to analyze the first dozen data sets. The data sets are stored in comma-separated values (CSV) format: each row holds information for a single patient, and the columns represent successive days. The first few rows of our data file look like this:
0,0,1,3,1,2,4,7,8,3,3,3,10,5,7,4,7,7,12,18,6,13,11,11,7,7,4,6,8,8,4,4,5,7,3,4,2,3,0,0
0,1,2,1,2,1,3,2,2,6,10,11,5,9,4,4,7,16,8,6,18,4,12,5,12,7,11,5,11,3,3,5,4,4,5,5,1,1,0,1
0,1,1,3,3,2,6,2,5,9,5,7,4,5,4,15,5,11,9,10,19,14,12,17,7,12,11,7,4,2,10,5,4,2,2,3,2,2,1,1
0,0,2,0,4,2,2,1,6,7,10,7,9,13,8,8,15,10,10,7,17,4,4,7,6,15,6,4,9,11,3,5,6,3,3,4,2,3,2,1
0,1,1,3,3,1,3,5,2,4,4,7,6,5,3,10,8,10,6,17,9,14,9,7,13,9,12,6,7,7,9,6,3,2,2,4,2,0,1,1
While a lot of powerful tools are built into Python, even more tools exist in the libraries and packages that are written in Python.
In order to load our inflammation data, we need to import a library called NumPy. We could load the data using the built-in csv module, but NumPy is the better choice when we want to do things with numbers, especially matrices or tables.
We can load NumPy using import. We also load urllib to read the data from the web:
import numpy as np
import urllib.request
First, we need to copy the file from the web to our local disk. This is done using the urllib.request.urlretrieve function:
url = r"https://raw.githubusercontent.com/swcarpentry/python-novice-inflammation/gh-pages/data/inflammation-01.csv"
fname = "inflammation-01.csv"
urllib.request.urlretrieve(url, fname)
!cat $fname
0,0,1,3,1,2,4,7,8,3,3,3,10,5,7,4,7,7,12,18,6,13,11,11,7,7,4,6,8,8,4,4,5,7,3,4,2,3,0,0
0,1,2,1,2,1,3,2,2,6,10,11,5,9,4,4,7,16,8,6,18,4,12,5,12,7,11,5,11,3,3,5,4,4,5,5,1,1,0,1
0,1,1,3,3,2,6,2,5,9,5,7,4,5,4,15,5,11,9,10,19,14,12,17,7,12,11,7,4,2,10,5,4,2,2,3,2,2,1,1
0,0,2,0,4,2,2,1,6,7,10,7,9,13,8,8,15,10,10,7,17,4,4,7,6,15,6,4,9,11,3,5,6,3,3,4,2,3,2,1
0,1,1,3,3,1,3,5,2,4,4,7,6,5,3,10,8,10,6,17,9,14,9,7,13,9,12,6,7,7,9,6,3,2,2,4,2,0,1,1
...
We saved the file to the local filesystem. We can have a look at the file contents from the IPython notebook by calling a system command such as cat (or type on Windows). To do this, we need to put an ! at the beginning of the line and prepend any variable name with $.
Once it's done, we can ask NumPy to read the data file:
data = np.loadtxt(fname, delimiter=',')
print(data)
[[ 0.  0.  1. ...,  3.  0.  0.]
 [ 0.  1.  2. ...,  1.  0.  1.]
 [ 0.  1.  1. ...,  2.  1.  1.]
 ...,
 [ 0.  1.  1. ...,  1.  1.  1.]
 [ 0.  0.  0. ...,  0.  2.  0.]
 [ 0.  0.  1. ...,  1.  1.  0.]]
The expression np.loadtxt(...) is a function call that asks Python to run the function loadtxt that belongs to the numpy library. This dotted notation is used everywhere in Python to refer to the parts of things as thing.component (see: namespaces).
numpy.loadtxt has two parameters: the name of the file we want to read and the delimiter that separates values on a line. Both are character strings (or strings for short); here we pass the file name through the variable fname and the delimiter as ','.
We saved the output of loadtxt in the variable data. When we print(data), only a few rows and columns are shown (with ... to omit elements when displaying big arrays). To save space, NumPy displays numbers as 1. instead of 1.0 when there's nothing interesting after the decimal point.
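To see loadtxt in action without the inflammation file, we can write a tiny CSV ourselves and read it back (the file name here is made up for the demonstration):

```python
import os
import tempfile

import numpy as np

# write a tiny two-row CSV so the example is self-contained
path = os.path.join(tempfile.gettempdir(), "tiny.csv")
with open(path, "w") as f:
    f.write("0,0,1\n0,1,2\n")

tiny = np.loadtxt(path, delimiter=',')
print(tiny.shape)    # (2, 3)
print(tiny[1, 2])    # 2.0 -- loadtxt gives us floats
```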
Now that our data is in memory, we can start doing things with it. First, let's ask what type of thing data refers to:
print(type(data))
<class 'numpy.ndarray'>
The output tells us that data currently refers to an N-dimensional array created by the NumPy library. We can see what its shape is like this:
print(data.shape)
n_patients, n_days = data.shape
(60, 40)
This tells us that data has 60 rows and 40 columns: 60 patients and 40 days. data.shape is a member of data, i.e., a value that is stored as part of a larger value. We use the same dotted notation for the members of values that we use for the functions in libraries because they have the same part-and-whole relationship.
If we want to get a single value from the matrix, we must provide an index in square brackets, just as we do with a list, but with as many indices as the number of dimensions in shape (two in this case):
print("first value in data:", data[0, 0])
print("middle value in data:", data[30, 20])
first value in data: 0.0
middle value in data: 13.0
The expression data[30, 20] may not surprise you, but data[0, 0] might. Programming languages like Fortran and MATLAB start counting at 1, because that's what human beings have done for thousands of years. Languages in the C family (including C++, Java, Perl, and Python) count from 0 because that's simpler for computers to do. Just like with list and str, if we have an M×N array in Python, its indices go from 0 to M-1 on the first axis and 0 to N-1 on the second. It takes a bit of getting used to, but one way to remember the rule is that the index is how many steps we have to take from the start to get the item we want.
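A quick sketch of the rule, using a small made-up array rather than our data:

```python
import numpy as np

grid = np.arange(12).reshape(3, 4)   # a 3x4 array: valid indices are 0..2 and 0..3
print(grid[0, 0])    # 0  -- the first element, zero steps from the start
print(grid[2, 3])    # 11 -- the last element, at [M-1, N-1]
try:
    grid[3, 0]       # one step too far on the first axis
except IndexError:
    print("index 3 is out of bounds for axis 0")
```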
In the Corner
What may also surprise you is that when Python displays an array, it shows the element with index [0, 0] in the upper left corner rather than the lower left. This is consistent with the way mathematicians draw matrices, but different from Cartesian coordinates. The indices are (row, column) instead of (column, row) for the same reason, which can be confusing when plotting data.
An index like [30, 20] selects a single element of an array, but we can select whole sections as well. For example, we can select the first ten days (columns) of values for the first four patients (rows) like this:
print(data[0:4, 0:10])
[[ 0.  0.  1.  3.  1.  2.  4.  7.  8.  3.]
 [ 0.  1.  2.  1.  2.  1.  3.  2.  2.  6.]
 [ 0.  1.  1.  3.  3.  2.  6.  2.  5.  9.]
 [ 0.  0.  2.  0.  4.  2.  2.  1.  6.  7.]]
The slice 0:4 means, "Start at index 0 and go up to, but not including, index 4." Again, the up-to-but-not-including takes a bit of getting used to, but the rule is that the difference between the upper and lower bounds is the number of values in the slice.
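We can check the rule on a small example (the array here is just for illustration):

```python
import numpy as np

nums = np.arange(10)      # [0 1 2 3 4 5 6 7 8 9]
print(nums[2:7])          # [2 3 4 5 6]
print(len(nums[2:7]))     # 5, i.e. 7 - 2
```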
We don't have to start slices at 0:
print(data[5:10, 0:10])
[[ 0.  0.  1.  2.  2.  4.  2.  1.  6.  4.]
 [ 0.  0.  2.  2.  4.  2.  2.  5.  5.  8.]
 [ 0.  0.  1.  2.  3.  1.  2.  3.  5.  3.]
 [ 0.  0.  0.  3.  1.  5.  6.  5.  5.  8.]
 [ 0.  1.  1.  2.  1.  3.  5.  3.  5.  8.]]
We also don't have to include the upper or lower bound on the slice. If we don't include the lower bound, Python uses 0 by default; if we don't include the upper, the slice runs to the end of the axis, and if we don't include either (i.e., if we just use ':' on its own), the slice includes everything:
small = data[:3, 36:]
print('small is:')
print(small)
small is:
[[ 2.  3.  0.  0.]
 [ 1.  1.  0.  1.]
 [ 2.  2.  1.  1.]]
Arrays also know how to perform common mathematical operations on their values. The simplest operations with data are arithmetic: add, subtract, multiply, and divide. When you do such operations on arrays, the operation is done on each individual element of the array. Thus:
doubledata = data * 2.0
will create a new array doubledata whose elements have the value of two times the value of the corresponding elements in data.
print('original:')
print(data[:3, 36:])
print('doubledata:')
print(doubledata[:3, 36:])
original:
[[ 2.  3.  0.  0.]
 [ 1.  1.  0.  1.]
 [ 2.  2.  1.  1.]]
doubledata:
[[ 4.  6.  0.  0.]
 [ 2.  2.  0.  2.]
 [ 4.  4.  2.  2.]]
If, instead of taking an array and doing arithmetic with a single value (as above) you did the arithmetic operation with another array of the same size and shape, the operation will be done on corresponding elements of the two arrays. Thus:
tripledata = doubledata + data
will give you an array where tripledata[0,0] will equal doubledata[0,0] plus data[0,0], and so on for all other elements of the arrays.
print('tripledata:')
print(tripledata[:3, 36:])
tripledata:
[[ 6.  9.  0.  0.]
 [ 3.  3.  0.  3.]
 [ 6.  6.  3.  3.]]
Calculate the square root of the data using numpy. Print the result for the first 5 columns of the first 3 rows.
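One possible solution, sketched here with a small stand-in array built from the first values of the data file (so the snippet runs on its own; with the real data, apply np.sqrt to the data array loaded above):

```python
import numpy as np

# stand-in for `data`: the first 3 rows and 5 columns of inflammation-01.csv
data = np.array([[0, 0, 1, 3, 1],
                 [0, 1, 2, 1, 2],
                 [0, 1, 1, 3, 3]], dtype=float)

# np.sqrt works elementwise, just like the arithmetic operators
print(np.sqrt(data[:3, :5]))
```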
Often, we want to do more than add, subtract, multiply, and divide values of data. Arrays also know how to do more complex operations on their values. If we want to find the average inflammation for all patients on all days, for example, we can just ask the array for its mean value:
print(data.mean())
6.14875
mean is a method of the array, i.e., a function that belongs to it in the same way that the member shape does. If variables are nouns, methods are verbs: they are what the thing in question knows how to do. This is why data.shape doesn't need to be called (it's just a thing) but data.mean() does (it's an action). It is also why we need empty parentheses for data.mean(): even when we're not passing in any parameters, parentheses are how we tell Python to go and do something for us.
NumPy arrays have lots of useful methods:
print('maximum inflammation:', data.max())
print('minimum inflammation:', data.min())
print('standard deviation:', data.std())
maximum inflammation: 20.0
minimum inflammation: 0.0
standard deviation: 4.61383319712
When analyzing data, though, we often want to look at partial statistics, such as the maximum value per patient or the average value per day. One way to do this is to select the data we want into a new temporary array, then ask it to do the calculation:
patient_0 = data[0, :] # 0 on the first axis, everything on the second
print('maximum inflammation for patient 0:', patient_0.max())
maximum inflammation for patient 0: 18.0
What if we need the maximum inflammation for all patients, or the average for each day? As the diagram below shows, we want to perform the operation across an axis. To support this, most array methods allow us to specify the axis we want to work on. If we ask for the average across axis 0, we get:
print(data.mean(axis=0))
[ 0. 0.45 1.11666667 1.75 2.43333333 3.15 3.8 3.88333333 5.23333333 5.51666667 5.95 5.9 8.35 7.73333333 8.36666667 9.5 9.58333333 10.63333333 11.56666667 12.35 13.25 11.96666667 11.03333333 10.16666667 10. 8.66666667 9.15 7.25 7.33333333 6.58333333 6.06666667 5.95 5.11666667 3.6 3.3 3.56666667 2.48333333 1.5 1.13333333 0.56666667]
As a quick check, we can ask this array what its shape is:
print(data.mean(axis=0).shape)
(40,)
The expression (40,) tells us we have a 1-dimensional array with 40 elements, so this is the average inflammation per day for all patients. If we average across axis 1, we get:
print(data.mean(axis=1))
[ 5.45 5.425 6.1 5.9 5.55 6.225 5.975 6.65 6.625 6.525 6.775 5.8 6.225 5.75 5.225 6.3 6.55 5.7 5.85 6.55 5.775 5.825 6.175 6.1 5.8 6.425 6.05 6.025 6.175 6.55 6.175 6.35 6.725 6.125 7.075 5.725 5.925 6.15 6.075 5.75 5.975 5.725 6.3 5.9 6.75 5.925 7.225 6.15 5.95 6.275 5.7 6.1 6.825 5.975 6.725 5.7 6.25 6.4 7.05 5.9 ]
which is the average inflammation per patient across all days.
On which day did each patient have the most inflammation? Use data.argmax to find out.
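A sketch of the idea on a tiny made-up array; argmax with axis=1 returns, for each row (patient), the index of the day with the largest value:

```python
import numpy as np

# 3 "patients" x 4 "days", made up for illustration
data = np.array([[0, 5, 2, 1],
                 [3, 1, 7, 2],
                 [1, 1, 1, 9]])
print(data.argmax(axis=1))   # [1 2 3]
```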
The mathematician Richard Hamming once said, "The purpose of computing is insight, not numbers", and the best way to develop insight is often to visualize data. Visualization deserves an entire lecture (or course) of its own, but we can explore a few features of Python's matplotlib here. While there is no "official" plotting library, this package is the de facto standard. First, let's tell the IPython Notebook that we want our plots displayed inline, rather than in a separate viewing window. This command will also import many other useful scientific Python functions and modules, including NumPy:
%matplotlib?
%matplotlib inline
from matplotlib.pyplot import *
The %
at the start of the line signals that this is a command for the notebook, rather than a statement in Python.
Next, use two of matplotlib's functions to create and display a heat map of our data:
imshow(data, aspect=0.66);  # aspect controls the width of one x unit vs. one y unit
<matplotlib.image.AxesImage at 0x47c1e10>
Blue regions in this heat map are low values, while red shows high values. As we can see, inflammation rises and falls over a 40-day period. (The ; at the end of the line suppresses the automatic printing of the function's return value, which is an AxesImage in this case.)
Let's take a look at the average inflammation over time:
avg_inflammation = data.mean(axis=0)
plot(avg_inflammation);
Here, we have put the average per day across all patients in the variable avg_inflammation, then used plot to create and display a line graph of those values. The result is roughly a linear rise and fall, which is suspicious: based on other studies, we expect a sharper rise and slower fall. Let's have a look at two other statistics:
plot(avg_inflammation, label='avg')
plot(data.max(axis=0), label='max')
plot(data.min(axis=0), label='min')
title('inflammation per day')
xlabel('day')
ylabel('inflammation')
legend();
The maximum value rises and falls perfectly smoothly, while the minimum seems to be a step function. Neither result seems particularly likely, so either there's a mistake in our calculations or something is wrong with our data.
matplotlib has many more plotting commands. For example, we can calculate the standard error of the mean (sem), which is defined as the standard deviation of the sample divided by the square root of the number of samples:
$$
\mathrm{sem} = \frac{s}{\sqrt{n}}, \qquad s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}
$$
This can be calculated using the method std and the function sqrt. Note that when calculating the standard deviation of a sample, the number of degrees of freedom is $n-1$; this is represented by the ddof=1 argument to std.
sem_inflammation = data.std(axis=0, ddof=1) / np.sqrt(n_patients)
errorbar(x=range(n_days), y=avg_inflammation, yerr=sem_inflammation, ecolor='k')
<Container object of 3 artists>
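The effect of ddof can be seen on a small example (the numbers here are made up):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])           # mean is 2.5, sum of squared deviations is 5
print(x.std())                               # divides by n:   sqrt(5/4) ~ 1.118
print(x.std(ddof=1))                         # divides by n-1: sqrt(5/3) ~ 1.291
print(x.std(ddof=1) / np.sqrt(len(x)))       # the standard error of the mean
```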
In exercise B we calculated the day on which each patient had the most inflammation. Let's plot it now: make a scatter plot using the scatter command, with patient number on the x-axis and day of max inflammation on the y-axis. Don't forget to add axis labels with xlabel and ylabel!
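A possible solution sketch; here we generate random stand-in values so the snippet runs without the inflammation file (with the real data, use the data array loaded above instead):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")            # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt

rng = np.random.RandomState(0)               # made-up values standing in for `data`
data = rng.randint(0, 20, size=(60, 40))

max_day = data.argmax(axis=1)                # day of maximum inflammation per patient
plt.scatter(range(len(max_day)), max_day)
plt.xlabel('patient number')
plt.ylabel('day of max inflammation')
plt.savefig('max_day.png')
```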
Many times we want to display data with some smoothing, or fit a regression line over it. To do this, we will use the pandas and seaborn libraries. pandas is a very strong library for manipulating large datasets; seaborn adds a set of sophisticated statistical visualizations on top of pandas.
Unfortunately, seaborn doesn't ship with Pyzo, so we need to install it manually. The best way is to use conda or pip from the Pyzo folder (make sure you write the proper path; on my PC it's D:\pyzo2014a):
D:\pyzo2014a\Scripts\conda install seaborn
or
D:\pyzo2014a\Scripts\pip install seaborn
On Linux, pip is in the bin rather than the Scripts folder and must be called using python3 from the same folder.
import zipfile
import pandas as pd
import seaborn as sns
sns.set_style("ticks") # control the plotting style
sns.set_context("talk") # set to talk because this is a lecture! hit shift-tab after the "(" to see other options.
We will analyze animal life-history data from AnAge.
We will get the data from the download page, but it's compressed with zip, so we need to unzip it before we can read the data using pandas' read_table function:
urllib.request.urlretrieve('http://genomics.senescence.info/species/dataset.zip', 'anage_dataset.zip')
('anage_dataset.zip', <http.client.HTTPMessage at 0x5eab5f0>)
with zipfile.ZipFile('anage_dataset.zip') as z:
    f = z.open('anage_data.txt')
    data = pd.read_table(f)
print(type(data))
print(data.shape)
<class 'pandas.core.frame.DataFrame'> (4212, 31)
pandas holds data in a DataFrame (similar to R's data frame). A DataFrame has a single row per observation (in contrast to the previous exercise, in which each table cell was one observation), and each column holds a single variable. Variables can be numbers or strings.
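The shape of such a table can be sketched with a hypothetical miniature of the AnAge data: one row per species, one column per variable, mixing strings and numbers:

```python
import pandas as pd

# a made-up miniature of the AnAge table, for illustration only
mini = pd.DataFrame({
    "Common name": ["Fruit fly", "Honey bee", "American lobster"],
    "Class": ["Insecta", "Insecta", "Malacostraca"],
    "Body mass (g)": [None, None, None],   # missing values become NaN
})
print(mini.shape)            # (3, 3)
print(mini["Class"][0])      # Insecta
```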
data.head()
HAGRID | Kingdom | Phylum | Class | Order | Family | Genus | Species | Common name | Female maturity (days) | ... | Source | Specimen origin | Sample size | Data quality | IMR (per yr) | MRDT (yrs) | Metabolic rate (W) | Body mass (g) | Temperature (K) | References | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4 | Animalia | Arthropoda | Insecta | Diptera | Drosophilidae | Drosophila | melanogaster | Fruit fly | 7 | ... | NaN | captivity | large | acceptable | 0.05 | 0.04 | NaN | NaN | NaN | 2,20,32,47,53,68,69,240,241,242,243,274,602,98... |
1 | 5 | Animalia | Arthropoda | Insecta | Hymenoptera | Apidae | Apis | mellifera | Honey bee | NaN | ... | 812 | unknown | medium | acceptable | NaN | NaN | NaN | NaN | NaN | 63,407,408,741,805,806,808,812,815,828,830,831... |
2 | 7 | Animalia | Arthropoda | Insecta | Hymenoptera | Formicidae | Lasius | niger | Black garden ant | NaN | ... | 411 | unknown | medium | acceptable | NaN | NaN | NaN | NaN | NaN | 411,813,814 |
3 | 8 | Animalia | Arthropoda | Insecta | Lepidoptera | Nymphalidae | Bicyclus | anynana | Squinting bush brown | 15 | ... | 811 | wild | medium | acceptable | NaN | NaN | NaN | NaN | NaN | 418,809,811 |
4 | 9 | Animalia | Arthropoda | Malacostraca | Decapoda | Nephropidae | Homarus | americanus | American lobster | NaN | ... | 2 | wild | medium | acceptable | NaN | NaN | NaN | NaN | NaN | 2,13,594 |
5 rows × 31 columns
DataFrame
has many of the features of numpy.ndarray
- it also has a shape
and various statistical methods (max
, mean
etc.).
However, DataFrame
allows richer indexing.
For example, let's browse our data for species that have body mass greater than 300 kg.
First we will create a boolean index that tells us whether each row is a large-animal row or not:
large_index = data['Body mass (g)'] > 300*1000 # 300 kg
print(large_index)
0        False
1        False
2        False
3        False
4        False
...
4210     False
4211     False
Name: Body mass (g), Length: 4212, dtype: bool
Now, we slice our data with this boolean index. The iterrows method lets us iterate over the rows of the data. For each row we get both the row number as an int and the row itself as a Series object (similar to a dict for our use).
large_data = data[large_index]
for i,row in large_data.iterrows():
print(row['Common name'], row['Body mass (g)']/1000, 'kg')
Domestic cattle 347.0 kg Dromedary 407.0 kg Moose 325.0 kg Asian elephant 3672.0 kg West Indian manatee 450.0 kg
So... a Dromedary is a Camel.
Let's continue with small and medium animals. For starters, let's plot a scatter of body mass vs. metabolic rate. Because we work with pandas, we can do that with the plot method of DataFrame, specifying the columns for x and y and a plotting style (without the style we would get a line plot, which makes no sense here).
data = data[data['Body mass (g)'] < 3e5]
data.plot(x='Body mass (g)', y='Metabolic rate (W)', style='o')
ylabel('Metabolic rate (W)')
sns.despine(offset=10)
ylim(0, 20);
sns.despine removes the upper and right axes, and the additional offset keyword creates a space between the plot and the axis.
From this plot it seems that 1) there is a correlation between body mass and metabolic rate, and 2) there are many small animals (less than 30 kg) and not many medium animals (between 50 and 300 kg).
Before we continue, I prefer to have mass in kg, so let's add a new column:
data['Body mass (kg)'] = data['Body mass (g)']/1000
Next, let's check how many records we have for each Class (as in the taxonomic unit):
class_counts = data.Class.value_counts()
print(class_counts)
Mammalia    417
Aves        171
Amphibia     18
Reptilia     16
dtype: int64
class_counts.plot(kind='bar')
ylabel('# species')
sns.despine()
So we have lots of mammals and birds, and a few reptiles and amphibians. This is important, as amphibians and reptiles could have a different relationship between mass and metabolism.
Let's do a simple linear regression plot, but separately for each Class. This is done using seaborn's lmplot function:
sns.lmplot(x='Body mass (kg)', y='Metabolic rate (W)', data=data, hue='Class', ci=False, size=6, legend_out=False);
Note that hue means color, but it also causes seaborn to fit a different linear model to each of the Classes.
As for the last 3 parameters:

  * ci controls the confidence intervals. I chose False, but setting it to True will show them.
  * size controls the size of the plot.
  * legend_out decides if the legend is inside the plot or outside. We have enough space for it in the left corner.

We can see that mammals and birds have a clear correlation between size and metabolism, and that it extends over a nice range of mass, so let's stick to mammals; next up we will see which Orders of mammals we have.
mammalia = data[data.Class=='Mammalia']
order_counts = mammalia.Order.value_counts()
ax=order_counts.plot(kind='bar')
ylabel('# species')
sns.despine();
You see we have a lot of rodents and carnivores, but also a good number of bats (Chiroptera) and primates.
Let's continue only with orders that have at least 20 species; this also includes some cool marsupials like the kangaroo, koala, and Taz (Diprotodontia and Dasyuromorphia).
orders = order_counts[order_counts >= 20]
print(orders)
abund_mammalia = mammalia[mammalia.Order.isin(orders.index)]
Rodentia          155
Carnivora          52
Chiroptera         33
Primates           26
Diprotodontia      22
Dasyuromorphia     20
dtype: int64
sns.lmplot(x='Body mass (kg)', y='Metabolic rate (W)', data=abund_mammalia, hue="Order",
ci=False, size=8, legend_out=False, line_kws={'lw':2}, scatter_kws={'s':50, 'alpha':0.5});
Because there is a lot of data here, I made the lines thinner (this can be done by giving matplotlib keywords as a dictionary to the argument line_kws) and I made the markers bigger but with alpha (transparency) 0.5, using the scatter_kws argument.
Still, there's too much data, and part of the problem is that some Orders are large (primates) and some are small (rodents).
Let's plot a separate plot for each Order. We do this using the col and row arguments of lmplot, but in general this can be done for any plot using seaborn's FacetGrid function.
sns.lmplot(x='Body mass (kg)', y='Metabolic rate (W)', data=abund_mammalia, hue="Order",
col="Order", col_wrap=3, ci=None, scatter_kws={'s':40}, sharex=False, sharey=False);
We used the sharex=False and sharey=False arguments so that each Order gets its own axis range and the data spreads nicely.
Last but not least, let's have a closer look at the correlation between mass and metabolism in primates. We will do a joint plot, which will give us the Pearson correlation and the distribution of each parameter.
primates = mammalia[mammalia.Order == 'Primates']
print(sorted(primates["Common name"]))
sns.jointplot(x='Body mass (kg)', y='Metabolic rate (W)', data=primates, kind='reg', size=8);
['Blue monkey', 'Brown lemur', 'Calabar angwantibo', "Demidoff's galago", 'Fat-tailed dwarf lemur', "Geoffroy's tamarin", 'Greater galago', 'Guereza', 'Hamadryas baboon', 'Human', 'Mantled howler monkey', 'Northern night monkey', 'Patas monkey', 'Philippine tarsier', 'Potto', 'Pygmy marmoset', 'Senegal galago', 'Slender loris', 'Slow loris', 'Small-eared galago', 'South African galago', 'South American squirrel monkey', 'Spectral tarsier', "Verreaux's sifaka", 'Western needle-clawed galago', 'White-tufted-ear marmoset']
This notebook is part of the Python Programming for Life Sciences Graduate Students course given in Tel-Aviv University, Spring 2015.
The notebook was written using Python 3.4.1 and IPython 2.1.0 (download from PyZo).
The code is available at https://github.com//Py4Life/TAU2015/blob/master/lecture7.ipynb.
The notebook can be viewed online at http://nbviewer.ipython.org/github//Py4Life/TAU2015/blob/master/lecture7.ipynb.
The notebook is also available as a PDF at https://github.com/Py4Life/TAU2015/blob/master/lecture7.pdf?raw=true.
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.