A gentle introduction to some elements of scientific programming in Python.
Programming Bootcamp for biology graduate students, August 2015.
Mickey Atwal, Cold Spring Harbor Laboratory.
In Python 2, dividing an integer by an integer gives you just an integer, discarding the remainder. This is a legacy from older programming languages like C. In Python 3 this behaviour changes, and '/' performs true division, giving you a real result. If you want that behaviour now, you can import it from the __future__ module so that '/' means true division.
from __future__ import division
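As a quick check of the difference (a minimal example; the expected values in the comments assume the import above has been run):
# with true division in effect, '/' returns a real number while '//' keeps floor division
print 7/2   # 3.5
print 7//2  # 3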
The following are the standard ways of loading some useful Python packages for scientific computing
# load NumPy package and abbreviate the name to 'np'
import numpy as np
# load SciPy package and abbreviate the name to 'sp'
import scipy as sp
# load Matplotlib plotting package and abbreviate the name to 'plt'
import matplotlib.pyplot as plt
We will be working with a text file containing the nucleotide counts of the E. coli DNA binding sites of the transcription factor CRP (cAMP receptor protein), also known as CAP (catabolite gene activator protein). You can find a copy at http://atwallab.cshl.edu/links/crp_counts_matrix.txt. I will go through the steps of how to download a text file from the web, save it, and open it again as a numerical array.
# load the module that deals with URL stuff
import urllib
# web URL address of the file
url="http://atwallab.cshl.edu/links/crp_counts_matrix.txt"
# name of the file to be saved into
filename="data.txt"
# downloads the file and saves it into the local directory
urllib.urlretrieve(url,filename)
This is a small tab-delimited text file where the counts data at each of the 42 nucleotide positions is stored as a series of strings. Let's take a look using the Unix command "cat"
!cat data.txt
We need to convert this to a numerical array of numbers where we can perform computations. The genfromtxt function in NumPy automatically generates a NumPy array from a text file
# loads data from text file and store in an integer NumPy array called 'counts'
counts=np.genfromtxt(filename,dtype=int)
counts.ndim # what is the dimensionality of the array?
counts.shape # what is the size of the array? (rows, columns)
counts.dtype # what is the data type of the array?
Let's practice some array indexing to remind ourselves how it works in Python
# let's have a look at the first five rows
counts[0:5]
# the first row
counts[0]
# rows 2 to 3, i.e. the second to third rows
counts[1:3]
# the second column
counts[:,1]
# the last two rows
counts[-2:]
# every third row beginning with the first
counts[::3]
# rows 3 to 4, and columns 2 to 4
counts[2:4,1:4]
# assess whether the previous elements are greater than 70
counts[2:4,1:4]>70
# assess whether the previous elements are not equal to 70
counts[2:4,1:4] !=70
# the minimum and maximum element of the array
np.min(counts), np.max(counts)
# select the elements greater than 200
counts[counts>200]
# what are the indices of the elements greater than 200?
# The Numpy function "where" returns the indices in separate arrays of rows and columns.
np.where(counts>200)
# select elements that are greater than 200 and also divisible by 3, i.e. counts mod 3 = 0
counts[(counts>200) & (counts%3==0)]
Dot Product. Frequently when performing operations on arrays we have to take the dot product of two lists of numbers, or two vectors, e.g. $\vec{x}=\{x_1,x_2,x_3\}$ and $\vec{y}=\{y_1,y_2,y_3\}$. The dot product $\vec{x} \cdot \vec{y}$ is defined as $$ \vec{x} \cdot \vec{y} = \sum_{i=1}^{3} x_i y_i $$ NumPy provides an efficient way of doing this without explicitly writing a 'for' loop
# dot product between rows 3 and 8
np.dot(counts[2],counts[7])
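As a sanity check, here is the same dot product written with an explicit loop; it should print the same number as np.dot above.
# the same dot product computed with an explicit loop, as a check against np.dot
total=0
for i in range(counts.shape[1]):
    total += counts[2,i]*counts[7,i]
print total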
# sum each column of the array, i.e. sum along the rows, the dimension indexed as 0
np.sum(counts,0)
# sum each row of the array, i.e. sum along the columns, the dimension indexed as 1
np.sum(counts,1)
# mean, median and standard deviation of each column
np.mean(counts,0), np.median(counts,0), np.std(counts,0)
We can add pseudocounts to each element. This is usually a good idea if your count data is undersampled.
# add 1 to EVERY element of the counts matrix to form a new matrix 'new_counts'
new_counts=counts+1
Let's calculate the probabilities of each nucleotide at each position, e.g. the probability of seeing an A at position $i$ is $$ \begin{array}{rcl} p_i(A)&=&\frac{\mathrm{counts}_i(A)}{\mathrm{counts}_i(A)+\mathrm{counts}_i(T)+\mathrm{counts}_i(G)+\mathrm{counts}_i(C)} \\ &=&\frac{\mathrm{counts}_i(A)}{\mathrm{total\_counts}_i} \end{array} $$ The total count is the same at every position, so we might just as well use only the first position to evaluate it.
total_counts=np.sum(new_counts[0])
prob=new_counts/total_counts
print prob
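As a quick check that the normalization worked, each row of 'prob' (one row per nucleotide position) should sum to one.
# each row of 'prob' should sum to 1, up to floating point error
print np.sum(prob,axis=1)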
It's often a good idea to represent the data graphically to glean what's going on
# set the size of the figure
plt.figure(figsize=[15,2])
# show the array flipped (transposed) and with no colour interpolation smoothing
plt.imshow(prob.T,interpolation='nearest')
# set the ticks
plt.xticks(range(0,42),range(1,43))
plt.yticks(range(4),['A','C','G','T'])
# set the colorbar
plt.clim([0,1])
plt.colorbar(ticks=np.arange(0,1,0.2))
# title
plt.title('base frequency matrix of CRP binding sites',fontsize=15)
# show the figure
plt.show()
We want to know how conserved the individual positions are. We start by calculating the entropy. The entropy H of nucleotide variability at each site i is defined by, $$ H_i=-\sum_{n=\{A,C,T,G\}} p_i(n) \log_2 p_i(n) $$ From the formula, we see that this calculation involves a dot product between the rows of the probability array and the rows of a log probability array. We have to remember that $0 \log_2 0 = 0$ if any of the probabilities are zero. This is important since $\log_2 0=-\infty$ and computers don't like dealing with infinities. A simple way to deal with this is to remove all the zero probabilities from the entropy summation above.
Let's use this opportunity to go over a simple example of how to define a new function in Python.
# defining a new function 'entropy_calc' which takes as input a 1D array p
def entropy_calc(p):
    p=p[p!=0]                        # discard the zero probabilities, since 0*log2(0)=0
    result=np.dot(np.log2(1/p),p)    # entropy = sum of p*log2(1/p)
    return result
Let's plug a few probability distributions into our entropy function to make sure it spits out the right answers
# synthetic probability distributions
dist1=np.array([1/4,1/4,1/4,1/4])
dist2=np.array([1/8,1/8,1/4,1/2])
dist3=np.array([0,0,1/2,1/2])
dist4=np.array([0,0,0,1])
# evaluate the entropies
print entropy_calc(dist1)
print entropy_calc(dist2)
print entropy_calc(dist3)
print entropy_calc(dist4)
Returning to our DNA sequence probabilities, we can evaluate the entropy at each nucleotide position by the following simple command
# loop through every row of the 'prob' array and evaluate the entropy
DNA_entropy=np.array([entropy_calc(row) for row in prob])
This creates a new 1-dimensional NumPy array called 'DNA_entropy'.
print DNA_entropy
We see that most entropy values are fairly close to the maximum of two bits, representing a great deal of base variability. Let's plot the results.
# number of nucleotide positions
num_pos=len(DNA_entropy)
# set the size of the figure
plt.figure(figsize=(12,4))
# plot a bar chart
plt.bar(np.arange(num_pos),DNA_entropy,color='green')
# axes labels
plt.xlabel('position',fontsize=15)
plt.ylabel('entropy (bits)',fontsize=15)
# limit the x axis range to the total number of nucleotide positions
plt.xlim(0,num_pos)
# place the x axis ticks and labels to be centered at each bar
plt.xticks(np.arange(num_pos)+0.5,np.arange(1,num_pos+1),fontsize=10)
# show the figure
plt.show()
What we really want to calculate is the sequence conservation, R. This is defined to be the total possible entropy at each site minus the observed entropy: $$ \begin{array}{rcl} R &=& H_{max}-H_{obs} \\ &=& \log_2(\text{no. of states})-\left( - \sum_{n=\{A,C,T,G\}} p(n) \log_2 p(n) \right) \\ &=& 2 + \sum_{n=\{A,C,T,G\}} p(n) \log_2 p(n) \end{array} $$
# conservation score at each position
R=2-DNA_entropy
# set the size of the figure
plt.figure(figsize=(12,4))
# plot a bar chart
plt.bar(np.arange(num_pos),R,color='green')
# axes labels
plt.xlabel('position',fontsize=20)
plt.ylabel('bits',fontsize=20)
# limit the x axis range to the total number of nucleotide positions
plt.xlim(0,num_pos)
plt.ylim(0,2)
# place the x axis ticks and labels to be centered at each bar
plt.xticks(np.arange(num_pos)+0.5,np.arange(1,num_pos+1),fontsize=10,rotation=0)
# set the title
plt.title('Sequence Conservation of Binding Site', fontsize=20)
# show the figure
plt.show()
A frequent task in scientific programming is to analyze random samples from a known distribution. A commonly used example is the gaussian (normal) distribution,
$$ p(x)=\frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left( -\frac{(x-\mu)^2}{2 \sigma^2} \right) $$ where $\sigma$ is the standard deviation and $\mu$ is the mean.
# number of samples
N_gauss = 10000
# standard deviation
s = 2
# mean
u = 5
# draw random samples from gaussian distribution
samples=np.random.normal(u,s,N_gauss)
samples
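As a quick check, the sample mean and standard deviation should be close to the parameters u=5 and s=2 we specified (they will not match exactly with a finite sample).
# sample estimates of the mean and standard deviation
print np.mean(samples), np.std(samples)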
We can plot the data in two different ways:
plt.figure(figsize=(12,5))
# plot histogram
plt.subplot(1,2,1)
plt.hist(samples,bins=20)
plt.title('histogram')
# plot boxplot
plt.subplot(1,2,2)
# the default is to produce a vertical boxplot. Setting vert=False gives a horizontal plot
plt.boxplot(samples,vert=False)
plt.title('boxplot')
plt.show()
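To connect the samples back to the formula above, we can overlay the theoretical density on a normalized histogram. This is a minimal sketch reusing the samples, u and s defined above; note that the histogram keyword is called normed in older Matplotlib releases and density in newer ones.
# overlay the theoretical gaussian density on a normalized histogram of the samples
xs=np.linspace(u-4*s,u+4*s,200)
pdf=(1/np.sqrt(2*np.pi*s**2))*np.exp(-(xs-u)**2/(2*s**2))
plt.hist(samples,bins=20,normed=True,alpha=0.5)
plt.plot(xs,pdf,'r-',linewidth=2)
plt.xlabel('x')
plt.title('samples vs theoretical density')
plt.show()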
We can draw random samples from a two-dimensional Gaussian (normal) distribution centered around the origin, where the $x$ and $y$ variables are independent and therefore uncorrelated,
$$ \begin{array}{rcl} p(x,y)&=&p(x)\,p(y) \\ &=&\displaystyle \frac{1}{2 \pi \sigma_x \sigma_y} \exp{\left( -\frac{x^2}{2 \sigma_x^2} - \frac{y^2}{2 \sigma_y^2} \right)} \end{array} $$
plt.figure(figsize=(10,5)) # set the figure size
cm = plt.cm.RdBu # coloring scheme for contour plots
N_2dgauss=100 # number of datapoints
sx=7 # standard deviation in the x direction
sy=3 # standard deviation in the y direction
x=np.random.normal(0,sx,N_2dgauss) # random samples from gaussian in x-direction
y=np.random.normal(0,sy,N_2dgauss) # random samples from gaussian in y-direction
axs=max(abs(x))*1.2 # axes plotting range
t=np.linspace(-axs,axs,1000) # range of mesh points for contour plot
cx,cy=np.meshgrid(t,t) # mesh array
z=(1/(2*np.pi*sx*sy))*np.exp(-((cx*cx)/(2*sx*sx))-((cy*cy)/(2*sy*sy))) # the 2d gaussian density evaluated on the mesh
plt.xlim([-axs,axs])
plt.ylim([-axs,axs])
# plots contour of 2d gaussian
plt.subplot(1,2,1,aspect='equal') # setting axes scales to be equal
plt.contour(t,t,z,np.linspace(0.0000001,1/(2*np.pi*sx*sy),20),cmap=cm)
plt.grid()
plt.title('contour map of 2d gaussian distribution\n $\sigma_x=$ %d, $\sigma_y=$ %d' % (sx,sy))
plt.xlabel('x')
plt.ylabel('y')
# plots the random samples on top of contour plot
plt.subplot(1,2,2,aspect='equal')
plt.contour(t,t,z,np.linspace(0.0000001,1/(2*np.pi*sx*sy),20),cmap=cm)
plt.plot(x,y,'bo',markersize=5)
plt.grid()
plt.title('%d random samples from the 2d gaussian\n distribution' % N_2dgauss)
plt.xlabel('x')
plt.ylabel('y')
plt.show()
We can calculate the covariance and correlation of the above data. As a reminder, the covariance matrix of two random variables $x$ and $y$ with associated distributions $p(x)$ and $p(y)$ is
$$ {\rm{covariance}}= \left( \begin{array}{cc} \langle (x - \langle x \rangle )^2 \rangle & \langle (x - \langle x \rangle)(y - \langle y \rangle) \rangle \\ \langle (y - \langle y \rangle)(x - \langle x \rangle) \rangle & \langle (y - \langle y \rangle )^2 \rangle \end{array} \right) $$where the $\left< \right>$ brackets denote the expectation, i.e. averaging with respect to the probability distributions (conventional physics notation). The covariance matrix is symmetric and the diagonal elements are the variances in the $x$ and $y$ directions. In practice, with real data, you don't have access to the underlying distributions $p(x)$ and $p(y)$, so the means and variances have to be estimated from a finite number of samples.
# covariance matrix of the sample
np.cov(x,y)
The diagonal elements of the sample covariance matrix should be close to the variances (standard deviations squared) of the original gaussian distributions from which the data were sampled. The more samples, the better the agreement.
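As a sketch of the definition above, we can also compute the off-diagonal covariance element directly from the deviations around the sample means; np.cov uses the same unbiased (N-1) normalization by default, so the two numbers should agree.
# sample covariance of x and y computed directly from the definition
dx=x-np.mean(x)
dy=y-np.mean(y)
print np.sum(dx*dy)/(len(x)-1)   # should match the off-diagonal element of np.cov(x,y)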
The correlation matrix is a normalized covariance matrix. The off-diagonal element is the correlation coefficient between x and y, and ranges from -1 to 1, $$ {\rm{correlation}}= \left( \begin{array}{cc} 1 & \frac{\langle (x- \langle x \rangle)(y-\langle y \rangle) \rangle } {\sqrt{\langle ( x - \langle x \rangle)^2 \rangle } \sqrt{\langle ( y - \langle y \rangle)^2 \rangle } } \\ \frac{\langle (y- \langle y \rangle)(x-\langle x \rangle) \rangle } {\sqrt{\langle ( y - \langle y \rangle)^2 \rangle } \sqrt{\langle ( x - \langle x \rangle)^2 \rangle } } & 1 \end{array} \right) $$
# correlation matrix of the sample
np.corrcoef(x,y)
Note that the correlation between the $x$ and $y$ variables is not exactly zero even though the variables were sampled from independent distributions. This illustrates an important point: when the number of samples is low, spurious correlations become likely due to statistical fluctuations.
Another key thing to note is that the correlation between the $x$ and $y$ variables, also known as the Pearson correlation coefficient, is only sensitive to linear relationships, as the short example below illustrates.
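For example (a small illustrative sketch, separate from the neural data below), points lying exactly on a parabola are completely dependent, yet their Pearson correlation is near zero because the relationship is not linear.
# a deterministic but nonlinear relationship can still have near-zero Pearson correlation
x_nl=np.linspace(-3,3,100)
y_nl=x_nl**2
print np.corrcoef(x_nl,y_nl)[0,1]   # close to 0, despite y_nl being a function of x_nl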
Let's imagine we have observed firing rates of a neuron when presented with two different stimuli, A and B. We will assume the observations are random samples from two gaussian distributions with differing means.
# number of samples
N=50
# standard deviation, the same for both stimuli
s=2
# means corresponding to differing stimuli
u_A=8
u_B=12
# observed firing rates
samples_A=np.random.normal(u_A,s,N)
samples_B=np.random.normal(u_B,s,N)
bins=np.arange(0,20,1) # bin boundaries for both histograms
plt.hist(samples_A,bins,alpha=0.3) # the alpha parameter adjusts the transparency of the colours
plt.hist(samples_B,bins,alpha=0.3)
plt.legend(('A','B'))
plt.xlabel('firing rates',fontsize=15)
plt.title('histogram of neural firing rates',fontsize=15)
plt.show()
If we are given a single firing rate, how do we decide which stimulus was presented? A sensible strategy is to impose a threshold half-way between the two Gaussian centers. However, classification errors are unavoidable when the neural responses do not separate out cleanly.
Using this strategy, what fraction of the firing rates stemming from stimulus A would the neuron misclassify as deriving from stimulus B?
# threshold of our binary classifier
threshold=(u_A+u_B)/2
# misclassified fraction
len(samples_A[samples_A>threshold])/N
If we label stimulus B as positive and stimulus A as negative then the above calculation represents the false positive rate in our binary classifier. The true positive rate is then
# true positive rate
len(samples_B[samples_B>threshold])/N
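For comparison, here is a minimal sketch of the rates predicted by the two generating gaussians themselves, using the survival function from scipy.stats (imported again a little further below); with only N=50 samples, the empirical fractions above will fluctuate around these values.
# theoretical false positive and true positive rates implied by the two gaussians
import scipy.stats as st
print st.norm.sf(threshold,loc=u_A,scale=s)   # expected false positive rate
print st.norm.sf(threshold,loc=u_B,scale=s)   # expected true positive rate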
Clearly, we could do a better job of classifying if the distributions were more widely separated, either through smaller variability or through a bigger difference in their means.
Sometimes in data analysis we want to know if two sets of samples have the same mean or are significantly different. This is an example of a statistical test, and many such tests are implemented in SciPy. As an example, we can carry out a traditional t-test of the null hypothesis that the two independent samples from above have identical means
# load the stats module from the SciPy package
import scipy.stats as st
# perform the two-sided t-test, assuming the two populations have identical variances
t, pval = st.ttest_ind(samples_A,samples_B)
# t-statistic
print t
Typically a statistician will report a p-value, i.e. $P({\text{observed or more extreme data}}|\text{null hypothesis})$.
# p-value
print pval
Statisticians love p-values and will erroneously draw conclusions about hypotheses based on them. This is fraught with problems! In general, p-values tell you nothing about the probability of the null hypothesis being true.
$$ P({\text{observed or more extreme data}}|\text{null hypothesis}) \neq P(\text{null hypothesis}|{\text{observed or more extreme data}}) $$
A frequent task in science is to fit a curve to data and guess the underlying generative model. Let's make up some fake noisy data of the form $y=x^3 + \eta$, where $\eta$ is noise drawn from a gaussian (normal) distribution.
# number of data points
N=20
# N equally spaced x values, from -20 to 20
x=np.linspace(-20,20,N)
# noise term: N samples from a gaussian distribution with mean 0 and standard deviation 1000
noise=np.random.normal(0,1000,N)
# y values
y=x**3+noise
Let's plot our fake noisy data.
plt.plot(x,y,'ro') # 'r' indicates red, 'o' indicates a small circle
plt.title('noisy data')
plt.xlabel('x')
plt.ylabel('y')
plt.show()
Now we will try to fit polynomials of various orders using the NumPy "polyfit" function.
# straight line fit. The "fit1" array consists of the coefficients of the best linear fit
fit1=np.polyfit(x,y,1)
# 3rd order polynomial fit
fit3=np.polyfit(x,y,3)
# 19th order polynomial fit
fit19=np.polyfit(x,y,19)
# create functions from the fits
y_1=np.poly1d(fit1)
y_3=np.poly1d(fit3)
y_19=np.poly1d(fit19)
print('linear fit: y_1=(%.2f)x+(%.2f)' % (fit1[0],fit1[1]))
print('3rd order fit: y_3=(%.2f)x^3 + (%.2f)x^2 + (%.2f)x + (%.2f)' % (fit3[0],fit3[1],fit3[2],fit3[3]))
plt.plot(x,y,'ro')
plt.plot(x,y_1(x))
plt.plot(x,y_3(x))
plt.plot(x,y_19(x))
# add a legend to the right of the plot
legend_text=('data','linear fit: $y=mx+c$ ','3rd order polynomial','19th order polynomial')
plt.legend(legend_text, loc='center left', bbox_to_anchor=(1,0.5))
plt.show()
The high-order (19th) fit clearly tracks the data points more closely. However, the 19th order polynomial in fact overfits the data: it will perform poorly on new data sampled from the same noisy model.
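One way to see the overfitting directly is to draw a fresh set of noisy y values from the same generative model and compare the mean squared error of the fits on the new data; this is a minimal sketch reusing x, y_3 and y_19 from above, and the 19th order fit will typically come out worse.
# generate new data from the same underlying model y = x^3 + noise
y_new=x**3+np.random.normal(0,1000,N)
# mean squared error of each fitted polynomial on the fresh data
print '3rd order fit MSE: ', np.mean((y_new-y_3(x))**2)
print '19th order fit MSE:', np.mean((y_new-y_19(x))**2)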