Humberto Ortiz-Zuazaga humberto.ortiz@upr.edu
Example sequence assembly using the NetworkX python package, a pure python package for manipulating graphs.
This example builds on several other examples using Eulerian assembly for reconstructing a sequence. I will simulate a simple genome with 10 "genes".
import networkx as nx
DG=nx.DiGraph()
Let's just generate a random sequence of length 10000, and convert it into a string.
import random
random.seed(0)
myseq = "".join([random.choice(['a', 'c', 'g', 't']) for i in range(10000)])
print myseq[:60], "..."
ttccgctccgtgctgcttttcgtgcacgttctctgagctgactactagattcacgtaggg ...
Now we simulate a set of 5000 reads of the given sequence. Just pick a starting point at random, and read off a fixed size chunk of sequence. I used trial and error to find a readlen that gives unique reads. Pretend we have 10 genes of 1000 bases each, and reads can't span genes.
readlen = 30
genelen = 1000
starts = []
reads = []
for i in range(10):
starts += [random.randint(i*genelen, (i+1)*genelen-readlen-1) for j in range(500)]
reads = [myseq[start:start+readlen] for start in starts]
print reads[0]
caatagacggggtagagcgggataggagca
We should check if we've sampled the sequence enough to be able to reconstruct it. Let's check the coverage (how many reads overlap) for each base.
coverage = [0] * len(myseq)
for start in starts:
for i in range(start, start+readlen):
coverage[i] += 1
%matplotlib inline
import matplotlib, numpy
#matplotlib.use('Agg')
import matplotlib.pyplot as plt
fig=plt.figure(figsize=(4,2))
plt.plot(coverage)
fig.tight_layout()
That doesn't look too bad, although the coverage around 200 looks a bit low. (PS, I love ipython, it embeds that plot right into the output).
I also had to use trial and error to find a k-mer length that produces the correct graph. I think the k-mers have to be unique, or you get contigs that span genes.
k = 16
for read in reads:
for i in range(len(read) - k - 1):
kmer_a = read[i:i+k]
kmer_b = read[i+1:i+k+1]
DG.add_edge(kmer_a, kmer_b)
So here's where stuff gets interesting. I'm basically sequencing single stranded DNA with no errors, and have decent coverage over the 10 genes. Unfortunately, the k-mer graph breaks into 14 pieces. In sequencing parlance, these are 14 different contigs, as the sequence assembler can't join them. Raising the coverage (seqencing more reads) should reduce the number of contigs.
components = nx.connected_components(DG.to_undirected())
len(components)
14
map(len, components)
[983, 982, 982, 981, 981, 977, 836, 625, 598, 498, 480, 381, 347, 139]
Since each read is basically unique, I can find the location of each read in the original sequence, and mark what component it came from.
placement = [0] * len(myseq)
for i in range(len(components)):
for read in components[i]:
place = myseq.index(read)
placement[place] = i
placefig=plt.figure(figsize=(4,2))
plt.plot(placement)
placefig.tight_layout()
We can take component 0, and choose the subgraph it induces on the directed k-mer graph.
best = nx.subgraph(DG, components[0])
nx.draw(best, with_labels=False, node_size=300, pos=nx.random_layout(best))
The Eulerian path function is a hack. I just made the fewest modifications possible to eulerian_circuit. I don't think we correctly check the end of the path (should end up with indegree 0 and outdegree 0).
def eulerian_path(G, source=None):
"""Find an Eulerian path in a weakly connected component, stolen from eulerian_circuit"""
if not nx.is_weakly_connected(G):
raise nx.NetworkXError("G is not connected.")
g = G.__class__(G) # copy graph structure (not attributes)
# set starting node
if source is None:
for v in g.nodes():
indeg = g.in_degree(v)
outdeg = g.out_degree(v)
if outdeg == (indeg + 1):
break
else:
raise nx.NetworkXError("G doesn't have an Eulerian circuit.")
else:
v = source
while g.out_degree(v) > 0:
n = v
bridge = None
# sort nbrs here to provide stable ordering of alternate cycles
nbrs = sorted([v for u,v in g.edges(n)])
for v in nbrs:
g.remove_edge(n,v)
bridge = not nx.is_connected(g.to_undirected())
if bridge:
g.add_edge(n,v) # add this edge back and try another
else:
break # this edge is good, break the for loop
if bridge:
g.remove_edge(n,v)
g.remove_node(n)
yield (n,v)
path = eulerian_path(best)
nodepath = [u for u,v in path]
seq = nodepath[0] + "".join([node[-1] for node in nodepath[1:]])
seq in myseq
True
len(seq)
997
def wrap(seq):
start = 0
for i in range(60, len(seq), 60):
print seq[start: i]
start = i
print seq[start:]
wrap(seq)
tgagagcctgtaatgactggttctttcttcgagctctccctatttgctatgactacaaac tgaatccactcccggcaggaccccgtacggttgttaaccatgtttctgagagcattgacg aaattaagtatctagcttatctttgcaatacaacactattacacaagacgcagaaccagc ctggtcccagctgctccttaagatcgatgaacgcgtcctacagggggaggttagccaaaa cgcagcaactcattatcctgggcggtacgggtcggtcgattctcccgtttcaataattag ctgcgtgttccagtcccgatggaaaggatcactcgcgaagccgcttgagacatagcgccc cctgtggggggccgatttttgtacaccagaagtttaggaagtttctgcccaggtaacacc cgggtaggtggcaacgcattccatggctaggttcggaggtgcggcaacacacgaaactcc caaaactctgtggttaaaaaacgaggtggccgtcgcgaagtcgtagcaatgccatgactc gatgcactacccgtggccggccacttccagtagaagcagtgtggcgcagtccgaatgacg atagacatttagtgaccgagaaattcactgaggtttgtgagcaaactggatataatacgg gctaaatagttacatcctgtgaaacttacttctacacgaagtatgtaccgtgtttctcgg gtttctgccatcccgctgtacaactcattgaacccttcacacctcatggcagcgacggga caaagtgcagtagccgcgggcggaacgtggtaacgcctggtgcaagagtctttcggaagg acccttatgcaggttttgtggctttcagcctgttcactctcttgatcaatcgttgggcaa cataagtgctagctcgatgattgttcactcctaccctcattcacgcaagggcccgcagtc aggcttcgcgactagaacgcccgaggtgccgggcatt
Where is this "gene"?
myseq.index(seq)
4000