By: Tamas Nagy
This is a toy implementation of a Hidden Markov Model (HMM) for recognizing 5' splice-sites. This example is based on the Sean Eddy's 2004 paper in Nature Biotechnology.
I've reproduced Figure 1 from the paper below:
from IPython.display import Image
Image(url='http://1.bp.blogspot.com/_9rDJWEd9qEA/TKovS1g05gI/AAAAAAAAAlA/R3_cCjLPxCk/s1600/EddyHMMfig.jpg', embed=True)
Adapted by permission from Macmillan Publishers Ltd: Nature Biotechnology 22, 1315 - 1316 (2004) doi:10.1038/nbt1004-1315, copyright 2004
We want to build a model in such a way that, given a sequence, we can calculate the maximum likelihood path, i.e. the location of the 5' site, and the confidence that we have in our result.
Let's first import the necessary packages:
from yahmm import *
import numpy as np, matplotlib.pyplot as plt
# %install_ext https://raw.githubusercontent.com/rasbt/python_reference/master/ipython_magic/watermark.py
%load_ext watermark
%watermark -t -d -z -p yahmm,numpy,matplotlib
20/07/2014 17:03:24 CEST yahmm 1.0.0 numpy 1.8.1 matplotlib 1.3.1
Then create the model
model = Model("5prime_ss_recog")
Next, let's set up the states using the emission states specified in the article
state_e = State(DiscreteDistribution({'A':0.25, 'C':0.25,
'G':0.25,'T':0.25}), name='E')
state_5 = State(DiscreteDistribution({'A':0.05,'C':0,
'G':0.95,'T':0}), name='5')
state_i = State(DiscreteDistribution({'A':0.4,'C':0.1,
'G':0.1,'T':0.4}), name='I')
After that we'll need to add all the transition probabilities. Don't forget to include the start and end states.
model.add_transition(model.start, state_e, 1)
model.add_transition(state_e, state_e, 0.9)
model.add_transition(state_e, state_5, 0.1)
model.add_transition(state_5, state_i, 1)
model.add_transition(state_i, state_i, 0.9)
model.add_transition(state_i, model.end, 0.1)
Let's finalize our model
model.bake(verbose=True)
Next, let's actually test our model on a specific sequence to see if we can predict potential 5' splice-sites.
The sequence (from the paper):
seq = 'CTTCATGTGAAAGCAGACGTAAGTCA'
Now, let's find the most probable path using the Viterbi algorithm
print(seq)
# ML path
viterbi_prob, viterbi_path = model.viterbi(list(seq))
print('%s log P = %f'%(''.join(state[1].name for state in viterbi_path[1:-1]), viterbi_prob))
CTTCATGTGAAAGCAGACGTAAGTCA EEEEEEEEEEEEEEEEEE5IIIIIII log P = -41.219678
As you can see, this agrees nicely with what we see in Figure 1 of the paper. The most probable 5' splice site the located at the penultimate guanosine.
But the question remains: How confident are we in our answer? Since HMMs are probabilistic models we can calculate our confidence using a process termed posterior decoding.
We'll first get the index of the 5'
state and use this to get the transition probabilities at each nucleotide position in our test sequence.
indices = { state.name: i for i, state in enumerate( model.states ) }
ss_state_index = indices['5']
probs = [np.exp(prob[ss_state_index]) for prob
in model.forward_backward(list(seq))[1]]
print(probs[:10])
[0.0, 0.0, 0.0, 0.0, 0.0010693543486039417, 0.0, 0.031746457224179507, 0.0, 0.04960383941278039, 0.0016317052438414646]
Let's graph the result so that we can better see what is going on
%matplotlib inline
rects = plt.bar(range(1, len(seq)+1), probs, align='center')
plt.xticks(range(1, len(seq)+1), list(seq))
plt.ylim([0,0.6])
def label(rects):
for rect in rects:
height = rect.get_height()
if(height > .1):
plt.text(rect.get_x()+rect.get_width()/2., 1.05*height,
'%s%%'%int(height*100),ha='center', va='bottom')
label(rects)
As you can see we get a probability of 46% that the penultimate G is a 5' splice site.
Found an error? Found five? Was something not clear? Let us know at https://github.com/jmschrei/yahmm/issues. Even better? Submit a pull request!
(words of encouragement are welcome too!)