In this tutorial, we'll take a reference genome (E. Coli), simulate a set of contigs and an optical map, and then use TWIN to align the contigs to the optical map. While TWIN scales to large genomes, this tutorial will focus on clarity, at times sacrificing efficiency.
After most code cells, some portion of the data produced by that cell will be displayed in the hopes of making this easier to follow.
Hopefully the simplicity of the python language and the subset of python features used here makes this accessible to everyone. There are two python feature which may not be transparent however. One is python's list comprehension feature. This is a feature for generating a list where each element is the value of an expression. The syntax is like so:
[expression FOR variable IN iterable] # where uppercase denotes keywords
So this python code:
[ x + x for x in range(10) ]
is an expression, which when evaluated, will produce:
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
The second such feature is that python allows unpacking of tuples in assignment, so (x,y,z) = (1,2,3) will assign one iteger to each of the variables on the left in order. The parenthesis are optional in many cases.
First, we'll import some useful modules
import Bio
import Bio.Restriction
import Bio.SeqIO
import Bio.SearchIO
import random
/s/chopin/l/grad/muggli/py3env/lib/python3.3/site-packages/Bio/SearchIO/__init__.py:213: BiopythonExperimentalWarning: Bio.SearchIO is an experimental submodule which may undergo significant changes prior to its future official release. BiopythonExperimentalWarning)
ref_genome = "./ecoli_ref.fa"
First, we'll simulate the optical map.
In practice, since E. Coli has a circular chromosome, we would want to concatenate two copies of the reference genome together. This would allow successful alignment for contigs that should align across the point spanning where the circle is broken to form a linear sequence. (In this scenario, any alignments of contigs where the left end aligns at a point right of the original genome length would be duplicates of another alignment, since the reference sequence was duplicated.) To keep things simple, we'll ignore the circular component of /E. Coli/.
We can use Biopython's restriction module to digest the reference genome with the SwaI (https://www.neb.com/products/r0604-swai) enzyme:
with open("ecoli_ref.fa", "rU") as ref_handle:
for ref_seq in Bio.SeqIO.parse(ref_handle, "fasta"): # iterate through the all (in this case one) seqs in the fasta file
ref_frag_seqs = Bio.Restriction.SwaI.catalyze(ref_seq.seq) # split that seqence into a tuple of sequences
print(ref_frag_seqs[:2]) # print the first two fragments
(Seq('AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAG...TTT', SingleLetterAlphabet()), Seq('AAATTAAAATCCATCTTTCAACCTCTTGATATTTTGGGGGTTAATTAATCTTTC...TTT', SingleLetterAlphabet()))
Now we'll add and use a function to simulate gaussian noise and remove small fragments:
random.seed(8675309) # initialize this to a "random" number so this notebook is reproducible
def noise(lst, small_cutoff, stddev):
return [int(random.gauss(0, stddev)) + s for s in lst if s >= small_cutoff]
ref_frag_lengths = [len(ref_frag) for ref_frag in ref_frag_seqs]
noisy_frags = noise(ref_frag_lengths, small_cutoff=700, stddev=150)
print(noisy_frags)
[40226, 81532, 122840, 1194, 50221, 19636, 27276, 57108, 33243, 2605, 107188, 25540, 15351, 42488, 81284, 141588, 50132, 28308, 57921, 10670, 50153, 17596, 115609, 30380, 8532, 31972, 6119, 33543, 21278, 46756, 31057, 26520, 15478, 69668, 88918, 5889, 2709, 18869, 3227, 9980, 5752, 7733, 36140, 29378, 9736, 90791, 7930, 94844, 2194, 39616, 9094, 34753, 5255, 98464, 101786, 51787, 17789, 29444, 77084, 40071, 31578, 19094, 4456, 4226, 1239, 28256, 203819, 57944, 21107, 65100, 7656, 175914, 75772, 15125, 99637, 29570, 11774, 45392, 41521, 184441, 111685, 9905, 104362, 25395, 17475, 13946, 61524, 8200, 733, 12159, 36378, 3281, 28732, 73220, 8003, 2200, 6392, 8888, 14466, 87525, 41859, 13379, 15020, 63761, 34057, 71288, 55396, 761, 7451, 55237, 40286, 39667, 22059]
def write_frags(fname, frags):
with open(fname, "w") as om:
for frag in frags:
# SOMA format uses frag sizes in kb, we don't per frag stddev, so use 0.0
line = str(frag/1000.0) + "\t" + str(0.0) + "\n"
om.write(line)
write_frags("ecoli_optmap", noisy_frags)
! ls -l ecoli_optmap
-rw------- 1 muggli grad 1207 Dec 3 17:18 ecoli_optmap
We can use the included om2bytes.py script to create a binary file of 32 bit integers (TWIN uses the sdsl-lite library which directly builds the FM-index from the binary file).
The bin size parameter is used to quantize the data. This should not be necessary in most cases so we'll use a bin size of 1 bp to effectively turn this off. (Quantizing the fragments results in similar suffixes in the optical map being adjacent in a suffix array. This in turn allows TWIN to match many suffixes concurrently instead of exploring each candidate as s separate branch in the backtracking search described in the paper.)
! ~/twin/om2bytes.py ecoli_optmap ecoli_optmap.bin 1
! ls -l ecoli_optmap*
-rw------- 1 muggli grad 1207 Dec 3 17:18 ecoli_optmap -rw------- 1 muggli grad 452 Dec 3 17:18 ecoli_optmap.bin
Since this tutorial assumes just a reference genome, we'll simulate some contigs from the reference sequence, first choose 9 uniformly distributed random endpoints across the genome.
interior_end_points = [random.randint(0, len(ref_seq)) for i in range(9)]
end_points = [0] + interior_end_points + [len(ref_seq)]
end_points.sort()
print(end_points)
[0, 1186908, 2097533, 2202679, 2665150, 2841819, 2844982, 3929561, 3955024, 4393536, 4639675]
Now we'll take consecutive pairs of endpoints (by zipping together this list and a suffix of this list) and write out the corresponding substrings of the reference sequence as a new FASTA file of these simulated ("fake") contigs
contig_loci = {} # bookkeeping to see how clone the TWIN alignments with accumulated restriction fragment based locus match
with open("fake_contigs.fa", "w") as contigs_handle:
for interval_num, (start, end) in enumerate(zip(end_points, end_points[1:])):
subseq = ref_seq[start:end]
if random.randint(0,1) == 0: # reverse complement with 50% probability
subseq.reverse_complement()
print( len(subseq))
subseq.id = "fake_contig_" + str(interval_num)
contig_loci[subseq.id] = start
Bio.SeqIO.write(subseq, contigs_handle, "fasta")
! grep -n "^>" fake_contigs.fa
1186908 910625 105146 462471 176669 3163 1084579 25463 438512 246139 1:>fake_contig_0 ENA|U00096|U00096.2 Escherichia coli str. K-12 substr. MG1655, complete genome. 19784:>fake_contig_1 ENA|U00096|U00096.2 Escherichia coli str. K-12 substr. MG1655, complete genome. 34963:>fake_contig_2 ENA|U00096|U00096.2 Escherichia coli str. K-12 substr. MG1655, complete genome. 36717:>fake_contig_3 ENA|U00096|U00096.2 Escherichia coli str. K-12 substr. MG1655, complete genome. 44426:>fake_contig_4 ENA|U00096|U00096.2 Escherichia coli str. K-12 substr. MG1655, complete genome. 47372:>fake_contig_5 ENA|U00096|U00096.2 Escherichia coli str. K-12 substr. MG1655, complete genome. 47426:>fake_contig_6 ENA|U00096|U00096.2 Escherichia coli str. K-12 substr. MG1655, complete genome. 65504:>fake_contig_7 ENA|U00096|U00096.2 Escherichia coli str. K-12 substr. MG1655, complete genome. 65930:>fake_contig_8 ENA|U00096|U00096.2 Escherichia coli str. K-12 substr. MG1655, complete genome. 73240:>fake_contig_9 ENA|U00096|U00096.2 Escherichia coli str. K-12 substr. MG1655, complete genome.
with open("fake_contigs.fa") as f:
for s in Bio.SeqIO.parse( f, "fasta"):
print(len(s))
1186908 910625 105146 462471 176669 3163 1084579 25463 438512 246139
Now we'll digest the contigs with the included digest.py
! ~/twin/digest.py fake_contigs.fa SwaI fake_contigs.silico
! ls -l fake_contigs*
-rw------- 1 muggli grad 4717958 Dec 3 17:18 fake_contigs.fa -rw------- 1 muggli grad 995 Dec 3 17:18 fake_contigs.silico
n.b. You can use fake_contigs.silico and ecoli_optmap (not .bin!) as input to SOMAv2's match executable.
What follows is the raw output. twin2psl.py can be used to convert this to pattern space layout (.psl) format. The alignment statistics are output in python's dictionary literal format which is suitable for use with python's eval().
In the optical map fragments in the alignment pattern are in reverse order to the orientation you would expect because they are coming off the top of a stack maintained during the backtracking backward search. Backward alignments will appear to match up with the query contig as they are reversed before the search is begun and then reversed again by peeling them off the stack. Asterisks mark in silico contig fragments for which there is no corresponding fragment in the optical map.
40372 - 40226
146
! ~/twin/twin --opt_map ecoli_optmap.bin --silico fake_contigs.silico --emit_alignment_pattern |tee twin.stdout
F value was set to 4. search radius value was set to 1000. FM-Index indexed 113 elements. Constructing suffix array... Attempting to mmap 448 elements from ecoli_optmap.bin loaded 452 bytes from optical map for sa purposes constructing iBWT LUT... Done constructing iBWT LUT. Loading contigs... Done loading contigs. Processing 8 in-silico digested contigs... processing contig 0 Matching contig fake_contig_1:(ignored 23313) 8494 32178 6147 381 33256 21211 46904 31191 26548 15499 69671 88912 5684 2995 18866 3209 9946 5696 7890 36218 29300 9849 90849 8071 94857 2354 39876 9124 35005 5183 (ignored 91868) Alignment pattern: 5255 34753 9094 39616 2194 94844 7930 90791 9736 29378 36140 7733 5752 9980 3227 18869 2709 5889 88918 69668 15478 26520 31057 46756 21278 33543 * 6119 31972 8532 Alignment stats: {'fragnum' : 24, 'locus' : 1210089, 'fval' : 1.54994, 'chi_square_sum' : 23.8616, 'deviation_sum' : 2980, 'num_matched_frags' : 29, 'chi_square_cdf' : 0.264274} backward alignment: threshold: 10000 Matching contig fake_contig_0:(ignored 40372) 81429 122993 1449 50262 19426 27340 57193 33258 2668 106869 25515 15428 42593 81181 141556 50372 28516 58088 10436 50207 17283 115590 (ignored 6822) Alignment pattern: 115609 17596 50153 10670 57921 28308 50132 141588 81284 42488 15351 25540 107188 2605 33243 57108 27276 19636 50221 1194 122840 81532 Alignment stats: {'fragnum' : 1, 'locus' : 40226, 'fval' : 0.240206, 'chi_square_sum' : 25.603, 'deviation_sum' : 2885, 'num_matched_frags' : 22, 'chi_square_cdf' : 0.730882} backward alignment: threshold: 10000 Matching contig fake_contig_6:(ignored 13113) 7446 175804 75774 15015 99569 29752 11632 45454 41590 184238 111696 9620 104361 25244 17760 13830 61572 8186 798 12243 (ignored 19882) Alignment pattern: 12159 733 8200 61524 13946 17475 25395 104362 9905 111685 184441 41521 45392 11774 29570 99637 15125 75772 175914 7656 Alignment stats: {'fragnum' : 70, 'locus' : 2857063, 'fval' : 0.897409, 'chi_square_sum' : 17.2729, 'deviation_sum' : 2218, 'num_matched_frags' : 20, 'chi_square_cdf' : 0.364811} backward alignment: threshold: 10000 Matching contig fake_contig_3:(ignored 3070) 51897 17730 29482 76754 39907 31549 19134 472 4454 4352 1175 28145 (ignored 154350) Alignment pattern: 28256 1239 4226 4456 * 19094 31578 40071 77084 29444 17789 51787 Alignment stats: {'fragnum' : 55, 'locus' : 2204070, 'fval' : 0.894484, 'chi_square_sum' : 8.33596, 'deviation_sum' : 1073, 'num_matched_frags' : 11, 'chi_square_cdf' : 0.317069} backward alignment: threshold: 10000 Matching contig fake_contig_8:(ignored 23155) 73260 7831 2129 6393 8817 14379 87592 41932 13256 15058 63691 34277 (ignored 46742) Alignment pattern: 34057 63761 15020 13379 41859 87525 14466 8888 6392 2200 8003 73220 Alignment stats: {'fragnum' : 93, 'locus' : 3977640, 'fval' : 0.298298, 'chi_square_sum' : 5.71231, 'deviation_sum' : 1033, 'num_matched_frags' : 12, 'chi_square_cdf' : 0.0701158} backward alignment: threshold: 10000 Matching contig fake_contig_9:(ignored 24292) 55651 663 891 7295 55369 40318 39581 (ignored 22079) Alignment pattern: 39667 40286 55237 7451 * 761 55396 Alignment stats: {'fragnum' : 106, 'locus' : 4417698, 'fval' : 0.215011, 'chi_square_sum' : 5.54707, 'deviation_sum' : 759, 'num_matched_frags' : 6, 'chi_square_cdf' : 0.524214} Alignment pattern: 39667 40286 55237 7451 761 * 55396 Alignment stats: {'fragnum' : 106, 'locus' : 4417698, 'fval' : 0.835548, 'chi_square_sum' : 5.87133, 'deviation_sum' : 791, 'num_matched_frags' : 6, 'chi_square_cdf' : 0.562243} backward alignment: threshold: 10000 Matching contig fake_contig_4:(ignored 49435) 57747 21022 (ignored 48465) Alignment pattern: 21107 57944 Alignment stats: {'fragnum' : 67, 'locus' : 2712913, 'fval' : 1.32936, 'chi_square_sum' : 2.04596, 'deviation_sum' : 282, 'num_matched_frags' : 2, 'chi_square_cdf' : 0.640477} backward alignment: threshold: 10000 Matching contig fake_contig_7:(ignored 16464) 3498 (ignored 5501) Alignment pattern: 3227 Alignment stats: {'fragnum' : 38, 'locus' : 1617397, 'fval' : 1.80667, 'chi_square_sum' : 3.26404, 'deviation_sum' : 271, 'num_matched_frags' : 1, 'chi_square_cdf' : 0.929186} Alignment pattern: 3281 Alignment stats: {'fragnum' : 91, 'locus' : 3945627, 'fval' : 1.44667, 'chi_square_sum' : 2.09284, 'deviation_sum' : 217, 'num_matched_frags' : 1, 'chi_square_cdf' : 0.85201} backward alignment: Alignment pattern: 3227 Alignment stats: {'fragnum' : 38, 'locus' : 1617397, 'fval' : 1.80667, 'chi_square_sum' : 3.26404, 'deviation_sum' : 271, 'num_matched_frags' : 1, 'chi_square_cdf' : 0.929186} Alignment pattern: 3281 Alignment stats: {'fragnum' : 91, 'locus' : 3945627, 'fval' : 1.44667, 'chi_square_sum' : 2.09284, 'deviation_sum' : 217, 'num_matched_frags' : 1, 'chi_square_cdf' : 0.85201} threshold: 10000 placed somewhere: 8 total alignments: 12 attempts: 8 skipped:0
Now we can use twin2psl to convert this data to BLAT's Pattern Space Layout (.psl) format. It takes three arguments:
! ~/twin/twin2psl.py twin.stdout ecoli_optmap.bin contig_alignments.psl
/s/chopin/l/grad/muggli/.local/lib/python2.7/site-packages/Bio/SearchIO/__init__.py:213: BiopythonExperimentalWarning: Bio.SearchIO is an experimental submodule which may undergo significant changes prior to its future official release. BiopythonExperimentalWarning)
Now we'll explore these results using biopython's PSL processing facilities. It groups single alignments first by target name, then these groups that share the same target name are grouped by query name. To traverse the actual alignments, we need three nested loops:
with open("contig_alignments.psl") as psl:
qresults = Bio.SearchIO.parse(psl, 'blat-psl')
for qresult in qresults: # A qresult groups together all PSL lines with the the same Q_NAME
print( "*", qresult.id)
for hit in qresult.hits: # A hit is a group of PSL lines that all share the same T_NAME
print (" *", hit.id)
for hsp in hit.hsps: # A High Scoring Pair (HSP) represents a line in a PSL
print(" *", hsp.hit_start_all[0]) # take the position of the first (in our case only) match interval
print(" -", contig_loci[qresult.id])
* fake_contig_1 * ecoli_optmap.bin * 1186776 - 1186908 * fake_contig_0 * ecoli_optmap.bin * -146 - 0 * fake_contig_6 * ecoli_optmap.bin * 2843950 - 2844982 * fake_contig_3 * ecoli_optmap.bin * 2201000 - 2202679 * fake_contig_8 * ecoli_optmap.bin * 3954485 - 3955024 * fake_contig_9 * ecoli_optmap.bin * 4393406 - 4393536 * 4393406 - 4393536 * fake_contig_4 * ecoli_optmap.bin * 2663478 - 2665150 * fake_contig_7 * ecoli_optmap.bin * 1600933 - 3929561 * 3929163 - 3929561 * 1600933 - 3929561 * 3929163 - 3929561