Naive 5' Splice Site Recognition Example¶

By: Tamas Nagy

This is a toy implementation of a Hidden Markov Model (HMM) for recognizing 5' splice-sites. This example is based on the Sean Eddy's 2004 paper in Nature Biotechnology.

I've reproduced Figure 1 from the paper below:

In [1]:

from IPython.display import Image
Image(url='http://1.bp.blogspot.com/_9rDJWEd9qEA/TKovS1g05gI/AAAAAAAAAlA/R3_cCjLPxCk/s1600/EddyHMMfig.jpg', embed=True)

Out[1]:

What we have here is a simple three state model. As you can see, all nucleotides have an equal probability of being present in state `E`, but the following state `5` is highly enriched in guanosine, followed by the adenine/thymine-rich state `I`. The emission probabilities of each nucleotide is given above the states. The transition probabilities are shown next to their respective transition arrow.

We want to build a model in such a way that, given a sequence, we can calculate the maximum likelihood path, i.e. the location of the 5' site, and the confidence that we have in our result.

Implementation¶

Let's first import the necessary packages:

In [2]:

from yahmm import *
import numpy as np, matplotlib.pyplot as plt

# %install_ext https://raw.githubusercontent.com/rasbt/python_reference/master/ipython_magic/watermark.py
%load_ext watermark
%watermark -t -d -z -p yahmm,numpy,matplotlib

20/07/2014 17:03:24 CEST

yahmm 1.0.0
numpy 1.8.1
matplotlib 1.3.1

Then create the model

In [3]:

model = Model("5prime_ss_recog")

Next, let's set up the states using the emission states specified in the article

In [4]:

state_e = State(DiscreteDistribution({'A':0.25, 'C':0.25,
                                      'G':0.25,'T':0.25}), name='E')
state_5 = State(DiscreteDistribution({'A':0.05,'C':0,
                                      'G':0.95,'T':0}), name='5')
state_i = State(DiscreteDistribution({'A':0.4,'C':0.1,
                                      'G':0.1,'T':0.4}), name='I')

After that we'll need to add all the transition probabilities. Don't forget to include the start and end states.

In [5]:

model.add_transition(model.start, state_e, 1)
model.add_transition(state_e, state_e, 0.9)
model.add_transition(state_e, state_5, 0.1)
model.add_transition(state_5, state_i, 1)
model.add_transition(state_i, state_i, 0.9)
model.add_transition(state_i, model.end, 0.1)

Let's finalize our model

In [6]:

model.bake(verbose=True)

Next, let's actually test our model on a specific sequence to see if we can predict potential 5' splice-sites.

Analysis¶

The sequence (from the paper):

In [7]:

seq = 'CTTCATGTGAAAGCAGACGTAAGTCA'

Now, let's find the most probable path using the Viterbi algorithm

In [8]:

print(seq)
# ML path
viterbi_prob, viterbi_path = model.viterbi(list(seq))
print('%s log P = %f'%(''.join(state[1].name for state in viterbi_path[1:-1]), viterbi_prob))

CTTCATGTGAAAGCAGACGTAAGTCA
EEEEEEEEEEEEEEEEEE5IIIIIII log P = -41.219678

As you can see, this agrees nicely with what we see in Figure 1 of the paper. The most probable 5' splice site the located at the penultimate guanosine.

But the question remains: How confident are we in our answer? Since HMMs are probabilistic models we can calculate our confidence using a process termed posterior decoding.

We'll first get the index of the 5' state and use this to get the transition probabilities at each nucleotide position in our test sequence.

In [9]:

indices = { state.name: i for i, state in enumerate( model.states ) }
ss_state_index = indices['5']
probs = [np.exp(prob[ss_state_index]) for prob 
         in model.forward_backward(list(seq))[1]]
print(probs[:10])

[0.0, 0.0, 0.0, 0.0, 0.0010693543486039417, 0.0, 0.031746457224179507, 0.0, 0.04960383941278039, 0.0016317052438414646]

Let's graph the result so that we can better see what is going on

In [10]:

%matplotlib inline
rects = plt.bar(range(1, len(seq)+1), probs, align='center')
plt.xticks(range(1, len(seq)+1), list(seq))
plt.ylim([0,0.6])

def label(rects):
    for rect in rects:
        height = rect.get_height()
        if(height > .1):
            plt.text(rect.get_x()+rect.get_width()/2., 1.05*height, 
                     '%s%%'%int(height*100),ha='center', va='bottom')
label(rects)

As you can see we get a probability of 46% that the penultimate G is a 5' splice site.

Found an error? Found five? Was something not clear? Let us know at https://github.com/jmschrei/yahmm/issues. Even better? Submit a pull request!

(words of encouragement are welcome too!)