(Inspired by an exercise in A Primer on Scientific Programming with Python)
In bioinformatics, we are faced with enormous volumes of data. For instance, one task would be to calculate the GC-content of a large strand of DNA.
http://en.wikipedia.org/wiki/GC-content
Here, we will look at E. Coli UTI89, one of the smallest sequenced genomes. First we open the file and load all the letters into a string called dna.
fin = open("ecoli_uti89.txt")
dna = ""
for line in fin.readlines():
dna += line.strip()
To make sure we've read it correctly, let's print the last ten characters, and determine the length of the string.
print dna[-10:]
AGTGATTTTC
len(dna)
5065741
When faced with such large amounts of data, we need to be fast in how we process it. Let's explore a few methods for how to count how many times a particular letter appears in this DNA strand.
As we write these functions, we will store them in a list. Unfortunately, we will not save the recursive function, since it will exceed the stack limit in Python very quickly.
funcs = []
def count_for(dna, letter):
c = 0
for x in dna:
if x == letter:
c += 1
return c
funcs.append(count_for)
def count_builtin(dna, letter):
return dna.count(letter)
funcs.append(count_builtin)
def count_rec(dna, letter):
if len(dna) == 0:
return 0
if dna[0] == letter:
return 1 + count_rec(dna[1:], letter)
else:
return count_rec(dna[1:], letter)
#funcs.append(count_rec)
# Mitchell Sharp:
def count_diff(dna, letter):
start = len(dna)
dna2 = dna.replace(letter, "")
end = len(dna2)
return start - end
funcs.append(count_diff)
# Bryan Urban:
def count_while(dna, letter):
c = 0
num = 0
while(c<len(dna)):
if(dna[c]==letter):
num+=1
c+=1
return num
funcs.append(count_while)
# Elizabeth Dye and Deonna Thornton
def count_dict(dna,letter):
d={}
for i in range(len(dna)):
try:
d[dna[i]]+=1
except:
d[dna[i]]=1
if letter in d:
return d[letter]
else:
return 0
funcs.append(count_dict)
# Hunter Lewis
def count_split(dna,letter):
count = 0
half1 = dna[0:len(dna)/2]
half2 = dna[len(dna)/2:-1]
for ele in half1:
if ele == letter:
count += 1
for ele2 in half2:
if ele2 == letter:
count += 1
return count
funcs.append(count_split)
# Mark Woodard
def count_skip(dna, letter):
num = 0
for i in range(0, len(dna), 2):
if dna[i] == letter:
num += 1
for i in range(1, len(dna), 2):
if dna[i] == letter:
num += 1
return num
funcs.append(count_skip)
#Deonna Thornton and Elizabeth Dye
def count_idx(dna, letter):
a = []
b = []
for elt in dna:
a.append(elt)
a.sort()
for i in range(len(a)):
if a[i] == letter:
b.append(a[i])
g = len(b)
if letter not in a:
g = 0
return g
funcs.append(count_idx)
# Mitch Lowry
def count_range(dna, letter):
count = 0
for x in range(len(dna)):
if dna[x] == letter:
count += 1
return count
funcs.append(count_range)
# Luke Lasley
def count_reverse(dna, letter):
count = len(dna)
for ele in dna:
if ele != letter:
count -=1
return count
funcs.append(count_reverse)
# Luke Evans, from the Filters link above
def count_filter(dna, letter):
count = 0
letters = [elem for elem in dna if elem == letter]
count += len(letters)
return count
funcs.append(count_filter)
# Chris Woodward
def count_sort(dna, letter):
num = 0
data = sorted(dna)
for a in data:
if a == letter:
num += 1
if num > 0 and a != letter:
break
return num
funcs.append(count_sort)
Let's run through each of these functions and time how long they take to count. Also, we will record the numeric count, to guarantee our functions are accurate.
First, we'll need a way to track their names, and luckily, each function has a field called "func_name" that we can use to save the names of the functions.
Which do you think will be the fastest?
dir(count_for)
['__call__', '__class__', '__closure__', '__code__', '__defaults__', '__delattr__', '__dict__', '__doc__', '__format__', '__get__', '__getattribute__', '__globals__', '__hash__', '__init__', '__module__', '__name__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'func_closure', 'func_code', 'func_defaults', 'func_dict', 'func_doc', 'func_globals', 'func_name']
names = []
timings = []
results = []
import time
for f in funcs:
t0 = time.clock()
results.append(f(dna, "C"))
t1 = time.clock()
cpu_time = t1 - t0
timings.append(cpu_time)
names.append(f.func_name[6:])
Let's look at what we've calculated for timings, names, and results.
timings
[0.28089600000000003, 0.014823999999999948, 0.03485899999999986, 0.786794, 0.8590310000000001, 0.2685310000000003, 0.45907300000000006, 1.6763839999999997, 0.4551540000000003, 0.3258910000000004, 0.24885400000000057, 0.9272920000000004]
names
['for', 'builtin', 'diff', 'while', 'dict', 'split', 'skip', 'idx', 'range', 'reverse', 'filter', 'sort']
results
[1284322, 1284322, 1284322, 1284322, 1284322, 1284321, 1284322, 1284322, 1284322, 1284322, 1284322, 1284322]
Woah, the builtin function, and the diff function, that uses only a builtin function, are the fastest, by a large margin!
We will need to spend a bit more time fixing the split function, since its results are not accurate
Now let's plot this with a bar chart, using matplotlib.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
fig, ax = plt.subplots()
ind = np.arange(len(funcs))
width = 0.75
rects1 = ax.bar(ind, timings, width, color="r")
ax.set_xticks(ind + width / 2)
ax.set_xticklabels(names, rotation="vertical")
[<matplotlib.text.Text at 0x109af8290>, <matplotlib.text.Text at 0x109b1a310>, <matplotlib.text.Text at 0x109ba7150>, <matplotlib.text.Text at 0x109ba7890>, <matplotlib.text.Text at 0x109ba7fd0>, <matplotlib.text.Text at 0x109bb0750>, <matplotlib.text.Text at 0x109bb0e90>, <matplotlib.text.Text at 0x109bbc610>, <matplotlib.text.Text at 0x109bbcd50>, <matplotlib.text.Text at 0x109bc64d0>, <matplotlib.text.Text at 0x109af8c50>, <matplotlib.text.Text at 0x109adca10>]
So, finally, we can calculate the GC-content of our DNA strand as efficiently as possible. Even if we use the builtin function four times in a row, it is much faster than any of the other methods.
letters = ["A", "T", "G", "C"]
counts = []
for let in letters:
counts.append(count_builtin(dna, let))
counts
[1250197, 1252067, 1279155, 1284322]
There is more GC than AT in our strand, but not an enormous difference. Later when we learn about Markov Chains, we might be able to tease out the portions of the code that are higher in GC than surrounding areas.
plt.pie(counts, labels=letters)
([<matplotlib.patches.Wedge at 0x109d0fa50>, <matplotlib.patches.Wedge at 0x109d1e490>, <matplotlib.patches.Wedge at 0x109d1ee10>, <matplotlib.patches.Wedge at 0x109d2c7d0>], [<matplotlib.text.Text at 0x109d0ff50>, <matplotlib.text.Text at 0x109d1ea50>, <matplotlib.text.Text at 0x109d2c410>, <matplotlib.text.Text at 0x109d2cd90>])