Sequence assembly with pure python graph algorithms¶

Humberto Ortiz-Zuazaga humberto.ortiz@upr.edu

Introduction¶

Example sequence assembly using the NetworkX python package, a pure python package for manipulating graphs.

This example builds on several other examples using Eulerian assembly for reconstructing a sequence. I will simulate a simple genome with 10 "genes".

In [1]:

import networkx as nx
DG=nx.DiGraph()

Let's just generate a random sequence of length 10000, and convert it into a string.

In [2]:

import random
random.seed(0)
myseq = "".join([random.choice(['a', 'c', 'g', 't']) for i in range(10000)])
print myseq[:60], "..."

ttccgctccgtgctgcttttcgtgcacgttctctgagctgactactagattcacgtaggg ...

Now we simulate a set of 5000 reads of the given sequence. Just pick a starting point at random, and read off a fixed size chunk of sequence. I used trial and error to find a readlen that gives unique reads. Pretend we have 10 genes of 1000 bases each, and reads can't span genes.

In [3]:

readlen = 30
genelen = 1000
starts = []
reads = []
for i in range(10):
  starts += [random.randint(i*genelen, (i+1)*genelen-readlen-1) for j in range(500)]
    
reads = [myseq[start:start+readlen] for start in  starts]
print reads[0]

caatagacggggtagagcgggataggagca

We should check if we've sampled the sequence enough to be able to reconstruct it. Let's check the coverage (how many reads overlap) for each base.

In [4]:

coverage = [0] * len(myseq)
for start in starts:
  for i in range(start, start+readlen):
    coverage[i] += 1

In [5]:

%matplotlib inline

import matplotlib, numpy
#matplotlib.use('Agg')
import matplotlib.pyplot as plt
fig=plt.figure(figsize=(4,2))
plt.plot(coverage)
fig.tight_layout()

That doesn't look too bad, although the coverage around 200 looks a bit low. (PS, I love ipython, it embeds that plot right into the output).

Setting up the graph¶

I also had to use trial and error to find a k-mer length that produces the correct graph. I think the k-mers have to be unique, or you get contigs that span genes.

In [6]:

k = 16
 
for read in reads:
  for i in range(len(read) - k - 1):
    kmer_a = read[i:i+k]
    kmer_b = read[i+1:i+k+1]
    DG.add_edge(kmer_a, kmer_b)

Assembling the sequence¶

So here's where stuff gets interesting. I'm basically sequencing single stranded DNA with no errors, and have decent coverage over the 10 genes. Unfortunately, the k-mer graph breaks into 14 pieces. In sequencing parlance, these are 14 different contigs, as the sequence assembler can't join them. Raising the coverage (seqencing more reads) should reduce the number of contigs.

In [7]:

components = nx.connected_components(DG.to_undirected())
len(components)

Out[7]:

In [8]:

map(len, components)

Out[8]:

[983, 982, 982, 981, 981, 977, 836, 625, 598, 498, 480, 381, 347, 139]

Since each read is basically unique, I can find the location of each read in the original sequence, and mark what component it came from.

In [9]:

placement = [0] * len(myseq)

for i in range(len(components)):
    for read in components[i]:
        place = myseq.index(read)
        placement[place] = i

placefig=plt.figure(figsize=(4,2))
plt.plot(placement)
placefig.tight_layout()

We can take component 0, and choose the subgraph it induces on the directed k-mer graph.

In [10]:

best = nx.subgraph(DG, components[0])

In [15]:

nx.draw(best, with_labels=False, node_size=300, pos=nx.random_layout(best))

In [ ]:

The Eulerian path function is a hack. I just made the fewest modifications possible to eulerian_circuit. I don't think we correctly check the end of the path (should end up with indegree 0 and outdegree 0).

In [11]:

def eulerian_path(G, source=None):
    """Find an Eulerian path in a weakly connected component, stolen from eulerian_circuit"""
    if not nx.is_weakly_connected(G):
        raise nx.NetworkXError("G is not connected.")

    g = G.__class__(G) # copy graph structure (not attributes)

    # set starting node
    if source is None:
        for v in g.nodes():
            indeg = g.in_degree(v)
            outdeg = g.out_degree(v)
            if outdeg == (indeg + 1):
                break
        else:
            raise nx.NetworkXError("G doesn't have an Eulerian circuit.")
    else:
        v = source

    while g.out_degree(v) > 0:
        n = v   
        bridge = None
        # sort nbrs here to provide stable ordering of alternate cycles
        nbrs = sorted([v for u,v in g.edges(n)])
        for v in nbrs:
            g.remove_edge(n,v)
            bridge = not nx.is_connected(g.to_undirected())
            if bridge:
                g.add_edge(n,v)  # add this edge back and try another
            else:
                break  # this edge is good, break the for loop 
        if bridge:
            g.remove_edge(n,v)            
            g.remove_node(n)
        yield (n,v)

In [12]:

path = eulerian_path(best)

In [13]:

nodepath = [u for u,v in path]

In [14]:

seq = nodepath[0] + "".join([node[-1] for node in nodepath[1:]])

In [15]:

seq in myseq

Out[15]:

True

In [16]:

len(seq)

Out[16]:

In [17]:

def wrap(seq):
  start = 0
  for i in range(60, len(seq), 60):
    print seq[start: i]
    start = i
  print seq[start:]
    
wrap(seq)

tgagagcctgtaatgactggttctttcttcgagctctccctatttgctatgactacaaac
tgaatccactcccggcaggaccccgtacggttgttaaccatgtttctgagagcattgacg
aaattaagtatctagcttatctttgcaatacaacactattacacaagacgcagaaccagc
ctggtcccagctgctccttaagatcgatgaacgcgtcctacagggggaggttagccaaaa
cgcagcaactcattatcctgggcggtacgggtcggtcgattctcccgtttcaataattag
ctgcgtgttccagtcccgatggaaaggatcactcgcgaagccgcttgagacatagcgccc
cctgtggggggccgatttttgtacaccagaagtttaggaagtttctgcccaggtaacacc
cgggtaggtggcaacgcattccatggctaggttcggaggtgcggcaacacacgaaactcc
caaaactctgtggttaaaaaacgaggtggccgtcgcgaagtcgtagcaatgccatgactc
gatgcactacccgtggccggccacttccagtagaagcagtgtggcgcagtccgaatgacg
atagacatttagtgaccgagaaattcactgaggtttgtgagcaaactggatataatacgg
gctaaatagttacatcctgtgaaacttacttctacacgaagtatgtaccgtgtttctcgg
gtttctgccatcccgctgtacaactcattgaacccttcacacctcatggcagcgacggga
caaagtgcagtagccgcgggcggaacgtggtaacgcctggtgcaagagtctttcggaagg
acccttatgcaggttttgtggctttcagcctgttcactctcttgatcaatcgttgggcaa
cataagtgctagctcgatgattgttcactcctaccctcattcacgcaagggcccgcagtc
aggcttcgcgactagaacgcccgaggtgccgggcatt

Where is this "gene"?

In [42]:

myseq.index(seq)

Out[42]:

In [ ]: