In this assignment you'll use multiple sequence alignment to reconstruct the phylogeny of a group of organisms based on their 16S rRNA sequences. This assignment builds on ideas from the previous assignment: there you identified good primers for amplifying 16S from diverse organisms, and here we use 16S sequences to group organisms by their relatedness. Because modern DNA-sequencing-based experiments commonly produce very large numbers of sequences, it is common, for computational efficiency, to group similar sequences and then work with a representative sequence for each group. We'll explore these ideas in more detail throughout the next segments of the class.
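To make the idea of grouping similar sequences and keeping one representative per group concrete, here is a minimal sketch in plain Python. It is an illustration only, not the method used later in this notebook: the function names (`kmer_set`, `kmer_jaccard_distance`, `greedy_cluster`), the k-mer based distance, and the `max_distance=0.3` threshold are all assumptions chosen for the example.

```python
def kmer_set(seq, k=3):
    """Return the set of overlapping k-mers found in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_jaccard_distance(seq1, seq2, k=3):
    """Fraction of the combined k-mers that are not shared by both sequences.

    0.0 means identical k-mer sets; 1.0 means no k-mers in common.
    (Hypothetical helper for this sketch, not a function from the notebook.)
    """
    kmers1, kmers2 = kmer_set(seq1, k), kmer_set(seq2, k)
    union = kmers1 | kmers2
    if not union:
        return 0.0
    return 1.0 - len(kmers1 & kmers2) / float(len(union))


def greedy_cluster(seqs, max_distance=0.3, k=3):
    """Greedy clustering of (seq_id, sequence) pairs.

    Each sequence joins the first existing cluster whose representative is
    within max_distance; otherwise it starts a new cluster and becomes that
    cluster's representative.
    Returns a list of (representative_id, [member_ids]) tuples.
    """
    clusters = []  # each entry: [rep_id, rep_seq, member_ids]
    for seq_id, seq in seqs:
        for cluster in clusters:
            if kmer_jaccard_distance(seq, cluster[1], k) <= max_distance:
                cluster[2].append(seq_id)
                break
        else:
            # No existing representative was close enough.
            clusters.append([seq_id, seq, [seq_id]])
    return [(rep_id, members) for rep_id, _, members in clusters]


# Toy input (made up for illustration, far shorter than real 16S sequences).
toy_seqs = [
    ("a", "ACGTACGTACGTACGT"),
    ("b", "ACGTACGTACGTACGA"),
    ("c", "TTTTGGGGCCCCAAAA"),
]
otus = greedy_cluster(toy_seqs)
print(otus)
```

Note that the result of a greedy approach like this depends on the order of the input sequences and on the threshold, which is one of the drawbacks worth keeping in mind as you work through the assignment.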
From a bioinformatics standpoint, we usually start working with sequences in FASTA format, very similar to the sequences in the cell below. See here for an explanation of the FASTA format.
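As a rough picture of how FASTA is structured, each record is a header line starting with `>` followed by one or more lines of sequence. The sketch below is a minimal plain-Python reader written for illustration; `iter_fasta` is a hypothetical helper, not the `parse_fasta` function this notebook imports, and it assumes the identifier is the first whitespace-delimited token of the header.

```python
def iter_fasta(lines):
    """Yield (seq_id, sequence) pairs from an iterable of FASTA-formatted lines.

    Assumptions of this sketch: the id is the first token after '>', blank
    lines are skipped, and sequence lines before the first header are ignored.
    """
    seq_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            header = line[1:].split()
            seq_id, chunks = (header[0] if header else ""), []
        elif seq_id is not None:
            # Sequence data may be wrapped across several lines.
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


# Toy FASTA text, made up for illustration.
toy_fasta = """>seq1 example description
ACGT
ACGT
>seq2
GGCC"""

records = list(iter_fasta(toy_fasta.split("\n")))
print(records)
```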
At this point, you should be feeling fairly comfortable interacting with the IPython Notebook. This assignment will give you additional practice while you explore the ideas mentioned above.
Continue to work with IPython Notebooks and interact with python code. Understand what multiple sequence alignment is used for, and the concept of grouping sequences into clusters of OTUs. Consider the possible drawbacks to these methods.
Read all of the cells containing text very carefully!
You may write code or use a text editor if you wish; however, all of the tools necessary to answer the questions are present in this notebook.
Get help; that's what office hours are for!
You are allowed to discuss the assignment with other students; however, your work needs to be your own. Using or looking at code or commands generated by another student is strictly prohibited. If you're in doubt over whether some type of interaction is acceptable for this assignment, ask.
Remember, to learn what a function does you can run:
help(name_of_function)
Try this with the functions below to see what they do.
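If you want to see what `help` prints before trying it on the imported functions, here is a small self-contained illustration. The `gc_content` function is a made-up example for this purpose and is not part of the notebook; `help` simply displays the function's signature and its docstring.

```python
def gc_content(seq):
    """Return the fraction of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / float(len(seq))


# Prints the signature and the docstring written above.
help(gc_content)
```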
from __future__ import division
from skbio.parse.sequences import parse_fasta
from skbio import BiologicalSequence, SequenceCollection
from iab.algorithms import progressive_msa_and_tree, iterative_msa_and_tree, kmer_distance, guide_tree_from_sequences
The cell below contains the sequences that you will be working with throughout the assignment.
seqs_16s = """>881726
GACGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAGCGGATTCATCCTTCGGGATGGGTTAGCGGCGGACGGGTGAGTAACACGTAGGCAACCTGCCTGCAAGTCCGGGATAACTAACGGAAACGTTAGCTAATACCGGATACGCGGTTGGATCGCATGATCCGATCGGGAAAGACGGCGCAAGCTGCCACTTGTAGATGGGCCTGCGGCGCATTAGCTAGTTGGTGGGGTAACGGCTCACCAAGGCGACGATGCGTAGCCGACCTGAGAGAGTGATCGGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTAGGGAATCTTCCGCAATGGACGCAAGTCTGACGGAGCAACGCCGCGTGAGTGATGAAGGTTCTCGGATCGTAAAGCTCTGTTGCCAGGGAAGAACGCTCGGGAGAGTAACTGCTCTCGAGGTGACGGTACCTGAGAAGAAAGCCCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGGGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGCGCGCGCAGGCGGTCGATTAAGTTTGGTGTTTAAGCCCGGGGCTCAACCCCGGTTCGCACTGAAAACTGATCGACTTGAGTGTAGGAGAGGAAAGTGGAATTCCACGTGTAGCGGTGAAATGCGTAGAGATGTGGAGGAACACCAGTGGCGAAGGCGACTTTCTGGCCTATAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGCATGCTAGGTGTTAGGGGTTTCGATACCCTTGGTGCCGAAGTCAACACAGTAAGCATGCCGCCTGGGGAGTACGGTCGCAAGACTGAAACTCAAAGGAATTGACGGGGACCCGCACAAGCAGTGGAGTATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGTCTTGACATCCCCCTGAATCCTCTAGAGATAGAGGCGGCCCTTCGGGGACAGGGGAGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTGATCGTAGTTGCCAGCACTTCGGGTGGGCACTCTAGGATGACTGCCGGTGACAAACCGGAGGAAGGCGGGGATGACGTCAAATCATCATGCCCCTTATGACCTGGGCTACACACGTACTACAATGGCCGGTACAACGGGCTGCGAAGCCGCGAGGTGGAGCCAATCCCAGAAAGCCGGTCTCAGTTCAGATTGCAGGCTGCAACTCGCCTGCATGAAGTCGGAATTGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCCCGGGTCT
>793074
GAATGAACGCTGGCGGCGTGCTTAATAATGCAAGTCGAGCGCGTAGCAATACGAGCGGCGCACGGGTGCGTAACACGTAGGTCATCTGCCTCTAGGTCGGGGATAACTGCGGGAAACTGCAGCTAATACCCGATGATATCGAGAGATCAAAGCTTCGGTGCCTAGAGAGGAGCCTGCGGCTCATTAGCTAGTTGGTGGGGTAACGGCCTACCAAGGCCACGATGAGTAGCCGGCCTGAGAGGGCGATCGGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTAGGGAATATTGGGCAATGGGCGAAAGCCTGACCCAGCAACGCCGCGTGAGTGATGAAGCCTTTCGGGGTGTAAAGCTCTTTTGGCAGGGACGAATCAATGACGGTACCTGCGTAATAAGCCCCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGGGGGGGCAAGCGTTATTCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGCTTCTTAAGTCGGGTGTTTAATGTCGGGGCTCAACTCCGGCGCTGCACTCGATACTGGGAGGCTAGAGTACTCGAGAGGAAAGCGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTTAGGAGGAACACCAGTGGCGAAGGCGGCTTTCTGGAGAGTAACTGACGCTCAGAGCGCGAAAGCCAGGGGATCGAACGGGATTAGATACCCCGGTAGTCCTGGCTGTAAACGATGGGTACTAGATGTCGCCGGTATCAATCCCGGCGGTATCGTCGCTAACGCATTAAGTACCCCGCCTGGGGAGTACGCTCGCAAGAGTGAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTGGACTTGACATACCTCGGACCGGACCTAGAGATAGGACCTTCTCCCGTAAGGGAGCCGGGGATACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCCATCCCTAGTTGCCAGCGAGTCATGTCGGGAACTCTAGGGAGACTGCCGTTGATAAAACGAGAGGAAGGTGGGGATGACGTCAAGTCATCATGGCCCTTACGTCCAGGGCTACACACGTGCTACAATGGCCACCACAAAGGGTCGCAATACCGTGAGGTGGAGCTAATCCCAAAAAGGTGGCCTCAGTTCGGATTGTAGTCTGCAACTCGACTACATGAAGTCGGAATCGCTAGTAATCGCGGATCAGAACGCCGCGGTGAATACAGTTCCCGGGCCTTGTACACACCGCCCGTCACACCACGAGAGCTGGTTGCGTTAGAAGTCGCCAGGCCAACCGCAAGGGGGCAGGCGCCGAATGCGTGATGAGTGATTGGGGT
>669210
AGAGTTTGATCCTGGCTCAGAACGAACGCTGGCGGCGCGCCTAACACATGCAAGTCGAACGGACTAGCCCCTTCGGGGGCGAAGTTAGTGGCGAACGGGTGAGTAACGCGTAAGTAACCTGCCCCCGGGACTGGGATAACAGCTCGAAAGAGCCGCTAATACCGGATAATTGTTGCAACACTTAGGAGTTGTAACTAAAGAAGGCCTCTGTTTCAAGCTTTCACCTGGGGATGGGCTTGCGTCCCATTAGCTTGTTGGTGAGGTAACGGCTCACCAAGGAAACGATGGGTAGCCGGCCTGAGAGGGTGGTCGGTCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGAGGAATCTTGCGCAATGGGCTAACGCCTGACGCAGCGACGCCGCGTGGACGATGAAGCTTTTCGGAGTGTAAAGTCCTTTCAGGAGGGAAGAAATGCCGGTAGTGTGAATAACACACCGGTTTGACGGTACCTCAAGAAGAAGCCCCGGCTAACTCCGTGCCAGCAGCCGCGGTAACACGGAGGGGGCAAGCGTTGTTCGGAATCACTGGGCGTAAAGAGCGCGTAGGTGGTTGTGTAAGTCGGATGTGAAATCCCTCGGCTCAACCGAGGAACTGCGTTCGAAACTACATAGCTAGAGGGCAGGAGAGGAGAGCGGAATTCCCAGTGTAGCGGTGAAATGCGCAGATATTGGGAAGAACACCGGTGGCGAAGGCGGCTCTCTGGACTGTTCCTGACACTGAGGCGCGAAAGCCAGGGGAGCGAACGGGATTAGATACCCCGGTAGTCCTGGCCGTAAACGATGGGCACTAGGTGTGGGGGGTGTCGATCCCCCCCGTGCCGCAGCTAACGCATTAAGTGCCCCGCCTGGGAAGTACGATCGCAAGGTTGAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGACGCAACGCGAAGAACCTTACCGGAATTTGACATGTTTCTGACGGCCTGCAGAAATGCAGGCTTCCCCTCGGGGCAGATACACAGGAGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTGCCCTTAGTTGCCATCGGTTCGGCCGGGAACTCTAAGGGGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAAGTCAGTATGGCCTTTATGTTCCGGGCTACACACGTGCTACAATGGCTGGTACAAAGGGTCGCGATGCCGTGAGGTGGAGCCAATCCCAAAAAGCCAGTCTTAGTTCGGATTGGAGTCTGCAACTCGACTCTATGAAGCCGGAATCGCTAGTAATCGTGGATCAGCACGCCATGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACACCACGAAAGTTGGTTGTACCAGAAGTCATTGGGCTAACCCTTTTGGGGGGCAGATGCCGAAGGTATGGTCAGCGATTGGGGTGAAGTCGTAACAAGGTAACC
>583705
ACGGGTGAGTAACGCGTATGCAACCTACCTCGGAAAAGGGGATGACTGGTGGAAACGGGGATTAATGCCCCCTAGGGTTGTTTCTCTGCCTGGGTGAGCCGTTACTATTGGAACCGATTGAGATGGCCATGTTGGTCATTTCCTGGTTGGTGAGGTTACCTCACACCAAGGCGACGATGACTACGGGGTCTAAAAGGATGGTCCCGCACACTGGTACTGAGACACGGACCAGACTCCTACGGGAGGCAGCAGTGAGGAATATTGGTCAATGGGGGCAACCCTGAACCAGCCATGCCGCGTGAAGGAAGACGGCCCTATGGGTTGTAAACTTCTTTTATATGGGAATAAAGAGAGGTACGTGTACCTCAGTGAATGTACCATATGAATAAGCATCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGATGCAGGCGTTATCCGGATTTATTAGGTTTAAAGGGTGCGTAGGCGGGATACTAAGTCAGTGGTGAAAGTTTGCGGCTCAACCGTAAAATCGCCATTGATACTGGTATTCTTGAGTATACAGGAAGTAGGCGGAATGTGTAGTGTAGCGGTGAAATGCATAGATATTACACAGAACACCGATTGCGAAGGCAGCTTACTATAGTATAACTGACGCTGATGCACGAAAGCGTGGGGAGCGAACAGGATTAGATACCCTGGTAGTCCACGCTGTAAACGATGATTACTGGTTGTGCGCGATACACAGTGCGCGACTGAGCGAAAGCATTAAGTAATCCACCTGGGGAGTACGGCGGCAACGCTGAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGAGGAACATGTGGTTTAATTCGATGATACGCGAGGAACCTTACCTGGGCTTAAATGTAGAGTGCATGGAGTGGAAACATTCCTTTCCTTCGGGACTCTTTACAAGGTGCTGCATGGTTGTCGTCAGCTCGTGCCGTGAGGTGTCGGGTTAAGTCCCATAACGAGCGCAACCCCTATCATTAGTTGCTAACAGGTCAAGCTGAGGACTCTAGCGAAACTGCCGGTGTAAACCGTGAGGAAGGTGGGGATGACGTCAAATCAGCACGGCCCTTATGTCCAGGGCTACACACGTGTTACGATGGCCAGTACAAAGGGTAGCTACCTGGTGACAGGATGCTAATCTCAAAAGCTGGTCTCAGTTCGGATTGGAGTCTGCAACTCGACTCCATGAAGTTGGATTCGCTAGTAATCGTATATCAGCCATGATACGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCAAACCATGGAAGCTGGGGGTACCTGAAGTACGTCACCGCAAGGAGCGTCCTAGGGTAAATCTAGTGACTGGGGTTAAGTGGTAACAAGGTAACC
>524860
AGAGTTTGATCCTGGCTCAGAACGAACGTTGGCGGCATGGATGAGGCATGCAAGTCGCGGGAATCCCCAGCAATGGGGGGAACCGGCGTAAGGGGCAGTAAGGCGTAGGTACCTACCCCCAGGTCCGGGATAGCCCGCCGAGAGGCGGGGTAATACCGGATGACCTCGGGAGAGCAAAGCTCCGGCGCCTGAGGCGGGGCCTACGTGATATTACCTAGTTGGCGGGGTAACGGCCCACCAAGGGGGAGATGTCTAGCGGGTGTGAGAGCACGACCCGCGCCACTCGCACTGAGACACTGGCGAGACACCTACGGGTGGCTGCAGTCGAGGATCTTCGGCAATGGGGGCAACCCTGACCGAGCGATGCCGCGTGGGCGACGAAGGCCTTCGGGTTGTAAAGCCCTGTCGAGGGGGAGAAAGCCGCAAGGCGGATCCATCCCTGGAGGAAGCTCGGGCTAAGTTCGTGCCAGCAGCCGCGGTAAGACGAACCGAGCGAACGTTGTTCGGAATCACTGGGCTTAAAGGGCGCGTAGGCGGGCTGCCGCGTCCGGGGTGAAATCCCACGGCTCAACCGTGGAACGGCCCCGGGTACGGGCGGCCTCGAGGGGGATAGGGGCGTGCGGAACTGTGGGTGGAGCGGTGAAATGCGTTGATATCCACAGGAACTCCGGTGGCGAAGGCGGCACGCTGGATCCTCTCTGACGCTGAGGCGCGGAAGCCAGGGGAGCGAACGGGATTAGATACCCCGGTAGTCCTGGCCCTAAACGATGAGAACTGGGTAGTAGCCCTGGCATGGGGTTACTGCCGCAGCCAAAGTGCTAAGTTCTCCGCCTGGGGAGTATGGTCGCAAGGCTGAAACTCAAAGGAATTGACGGGGGCTCACACAAGCGGTGGAGCATGTGGCTTAATTCGAGGCTACGCGAAGAACCTTATCCTGGACTTGACATGTGCGAAAGCGCCAGCAGGTAGGACCCGGAAACGGGAACGAACGGTATCCAACCCGGAAGCTGGTACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTCGGGTTAAGTCCCATAACGAGCGAAACCCTTACCCCTTGTTGCAACCCGAAAGGGGCACTCGAGGGGGACTGCCGGTGTCAAACCGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCTTTATGTCCAGGGCTGCACACGTGCTACAATGGCGTGGACAGAGGGACGCGACTGCGCGAGCAGAAGCCGACCCCCGAAAGCACGCCCCAGTTCAGATCGCAGGCTGCAACCCGCCTGCGTGAAGCCGGAATCGCTAGTAATCGCGGGTCAGCAACACCGCGGTGAATGTGTTCCTGAGCCTTGTACACACCGCCCGTCAAGCCACGAAAGGGAGGGACGTCCGAAGTCGCCTCGCGGCGCCGAAGACGGACTTCCTGATTGGGACTAAGTCGTAACAAGGTAACC
>501793
GACGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAACGGAGTGGTTGAAGGAGCTTGCTCTTTTGATCGCTTAGTGGCAGACGGGTGAGTAACACGTAGGCAACCTGGCTGTAAGACGGGGATAACTGGCGGAAACGTGAGCTAAAACCGGATGGTCGGCTTGAGGGCATCCTCGAGTCGGGAAAGGACGGAGCAATCTGTCGCTTACAGATGGGCCTGCGGCGCATTAGCTAGTTGGTAGGGTAACGGCCTACCAAGGCGACGATGCGTAGCCGACCTGAGAGGGTGAACGGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTAGGGAATCTTCGGCAATGGACGAAAGTCTGACCGAGCAACGCCGCGTGAGTGATGAAGGTTTTCGGATCGTAAAGCTCTGTTGCCAGGGAAGAACGCCAGGGAGAGTAACTGCTCTCTGGGTGACGGTACCTGAGAAGAAAGCCCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGGGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGCGCGCGCAGGCGGTCGTTTAAGTCTCATGTCTAAACCCCGGGGCTCAACCTCGGGGTGCATGGGAAACTGGGCGACTGGAGTGCATGTGAGGAAAGTGGAATTCCACGTGTAGCGGTGGAATGCGTAGAGATGTGGAGGAACACCAGTGGCGAAGGCGACTTTCTGGCCTGTAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGAATGCTAGGTGTTAGGGGTTTCGATACCCTTGGTGCCGAAGTTAACACATTAAGCATTCCGCCTGGGGAGTACGGTCGCAAGACTGAAACTCAAAGGAATTGACGGGGACCCGCACAAGCAGTGGGGTATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGTCTTGACATCCCGATGAAACATGCAGAGATGTGTGCCCTCTTCGGAGCATTGGAGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTAAGGTTAGTTGCCAGCAGGTGAAGCTGGGCACTCTAACATGACTGCCGGTGACAAACCGGAGGAAGGCGGGGATGACGTCAAATCATCATGCCCCTTATGACCTGGGCTACACACGTACTACAATGGCCAGTACAACGGGAAGCGAAGTGGCGACACGGAGCCAATCTTAGAAAGCTGGTCTCAGTTCGGATTGCAGGCTGCAACTCGCCTGCATGAAGTCGGAATTGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCCCGGGTCT
>296752
AGAGTTTGATCTCTGGCTCAGAACAAACGCTGGCGGTGCGTCTTAAGCATGCAAGTCGAGCGATGGTAGGGGGCTTGCTCCCTATTCATAGCGGCGGACTGGTGAGTAACGCGTAGATGACATACCTTTTGCTGGGGGATAGCTTGTGGAAACACAGGGTAATACCGCATACGATTGAGGCGGTTAGAGCGCTTCAATCAAAGCCTTGTATGGGGCGGCAGTTGAGTGGTCTGCGTACTATTAGCTTGTTGGTGGGGTAACGGCTCACCAAGGCGATGATAGTTATCCGGCCTGAGAGGGTGAACGGACACATTGGGACTGAGATACGGCCCAGACTCCTACGGGAGGCAGCAGCTAAGAATATTCCGCAATGGGCGAAAGCCTGACGGAGCGACACCGCATGAGTGATGAAGGTCGAAAGATTGTAAAATTCTTTTTGAGAGTGATGAATAAAGTCGAGCAGTAATGCTCGGTGATGACGGTAACTTTTGAATAAGGGGTGGCTAATTACGTGCCAGCAGCCGCGGTAACACGTAAGCCCCAAGCGTTGTTCGGAATTATTGGGTGTAAAGGGCATGTAGGTGGTCTTGCAAGCTTGATGTGAAATCTTACAGCTTAACTGTAAAACTGCATTGAGAACTGCATAACTTAAGAATAACTGAGGCGCAACTGGAATTCCAGGTGTAGGGGTGAAATCTGTAGATATCTGGAAGAACACCAATGGCGAAGGCAAGTTGCGAGCAGATTATTGACACTGAGGTGCGAAGGTGCGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCGCACAGACAACGATGTACACTGGGCGTCTGGCTTTATGCTGGGTGCCGTAGTAAACGCGATAAGTGTACCGCCTGGGGAGTATGCTCGCAAGGGTGAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGATACGCGAGAAATCTTACCTGGGTTTGACATTTAGTGGAATTGTATAGAGATATGCAAGGTACTTGTACCCGCTAAACAGGTGCTGCATGGCTGTCGTCAGCTCGTGCCGTGAGGTGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTACTGCTAGTTACTAACAGGTTATGCTGAGGACTCTAGCGGAACTGCCGGTGACAAACCGGAGGAAGATGGGGATGACGTCAAGTCATCATGGCCCTTATGTCCAGGGCAACACACGTGCTACAATGGTTGTAACAAAGTGATGCGAAATCGCAAGATGAAGCAAAACGCAGAAATGCAATCGTAGTTCGGATTGGAGTCTGAAACTCGACTCCATGAAGTTGGAATCGCTAGTAATCGCATATCAGCACGATGCGGTGAATACGTTCCCGGGCCTTGCACACACCGCCCGTCA
>293514
AGAGTTTGATCCTGGCTCAGAACGAACGCTGGCGGTGCGTCTTAAGCATGCAAGTCGAGCGATGGTAGGGGGCTTGCTCCCTATTCATAGCGGCGGACTGGTGAGTAACGCGTAGATGACATACCTTTTGCTGGGGGATAGCTTGTGGAAACACAGGGTAATACCGCATACGATTGAGGCGGTTAGAGCGCTTCAATCAAAGCCTTGTATGGGGCGGCAGTTGAGTGGTCTGCGTACTATTAGCTTGTTGGTGGGGTAACGGCTCACCAAGGCGATGATAGTTATCCGGCCTGAGAGGGTGAACGGACACATTGGGACTGAGATACGGCCCAGACTCCTACGGGAGGCAGCAGCTAAGAATATTCCGCAATGGGCGAAAGCCTGACGGAGCGACACCGCGTGAGTGATGAAGGTCGAAAGATTGTAAAACTCTTTTGAGAGTGATGAATAAGTCGAGCAGTAATGCTCGGTGATGACGGTAACTTTTGAATAAGGGGTGGCTAATTACGTGCCAGCAGCCGCGGTAACACGTAAGCCCCAAGCGTTGTTCGGAATTATTGGGCGTAAAGGGCATGTAGGTGGTCTTGCAAGCTTGATGTGAAATCTTACAGCTTAACTGTAAAACTGCATTGAGAACTGCATAACTTAAGAATAACTGAGGCGCAACTGGAATTCCAGGTGTAGGGGTGAAATCTGTAGATATCTGGAAGAACACCAATGGCGAAGGCAAGTTGCGAGCAGATTATTGACACTGAGGTGCGAAGGTGCGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCGCACAGTCAACGATGTACACTGGGCGTCTGGCTTTATGCTGGGTGCCGTAGTAAACGCGATAAGTGTACCGCCTGGGGAGTATGCTCGCAAGGGTGAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGATACGCGAGGAATCTTACCTGGGTTTGACATACACATTATCTTTGCAGAGATGTAAAGCGGGGGTAACCCCAATGTGAACAGGTGCTGCATGGCTGTCGTCAGCTCGTGCCGTGAGGTGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTATTGCCAGTTACTAACAAGTTAAGTTGAGGACTCTGGCGAAACTGCCGGTGACAAATCGGAGGAAGATGGGGATGACGTCAAGTCATCATGGCCCTTATGTCCAGGGCAACACACGTGCTACAATGGTTGAGACAGAGTGATGCTAAGTCGCAAGATGGAGCAAAACGCAGAAATTCAATCGTAGTTCGGATTGGAGTCTGAAACTCGACTCCATGAAGTTGGAATCGCTAGTAATCGCATATCAGCACGATGCGGTGAATACGTTCCCGGGCCTTGCACACACCGCCCGTCA
>292553
AGAGTTTGATCCTGGCTCAGAACGAACGCTGGCGGTGCGTCTTAAGCATGCAAGTCGAGCGATGAATGAGGGGCTTGCTCCTTATTCATAGCGGCGGACTGGTGAGTAACGCGTAGATGACATGTCGATGGCAGGGGGATAGCCAGTAGAAATATTGGGTAATACCGCGTATCCTTCTTGTTGTTAGAGGACAAGAAGAAAAGCCTTGTATGGGGCGGCTATTGAGTGGTCTGCGTACTATTAGTTTGTTGGTGGGGTAACGGCCTACCAAGACTATGATAGTTATCCGGCCTGAGAGGGTGAACGGACACATTGGGACTGAGATACGGCCCAGACTCCTACGGGAGGCAGCAGCTAAGAATATTCCGCAATGGGCGAAAGCCTGACGGAGCGACACCGCGTGAGTGATGAAGGTCGAAAGATTGTAAAACTCTTTTGAATATGATGAATAAGTCAAGCAGTAATGCTTGGCGATGACGGTAGTGTTTGAATAAGGGGTGGCTAATTACGTGCCAGCAGCCGCGGTAACACGTAAGCCCCAAGCGTTGTTCGGAATTATTGGGCGTAAAGGGCATGTAGGTGGTTTTGTAAGCTTGATGTGAAATCTTACAGCTTAACTGTAAAACTGCATTGAGAACTGCAGAACTAGAGTAACTGAGGTGCAACTGGAATTCCAGGTGTAGGGGTGAAATCTGTAGATATCTGGAAGAACACCAATGGCGAAGGCAAGTTGCAAGCAGATTACTGACACTGAGGTGCGAAGGTGCGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCGCACAGTCAACGATGTACACTGGGCGTCTGGCTTTATGCTGGGTGCCGTAGTAAACGCGATAAGTGTACCGCCTGGGGAGTATGCTCGCAAGGGTGAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGATACGCGAGAAATCTTACCTGGGTTTGACATTTAGTGGAATTGTATAGAGATATGCAAGGTACTTGTACCCGCTAAACAGGTGCTGCATGGCTGTCGTCAGCTCGTGCCGTGAGGTGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTACTGCTAGTTACTAACAGGTTATGCTGAGGACTCTAGCGGAACTGCCGGTGACAAACCGGAGGAAGATGGGGATGACGTCAAGTCATCATGGCCCTTATGTCCAGGGCAACACACGTGCTACAATGGTTGTAACAAAGTGATGCGAAATCGCAAGATGAAGCAAAACGCAGAAATGCAATCGTAGTTCGGATTGGAGTCTGAAACTCGACTCCATGAAGTTGGAATCGCTAGTAATCGCATATCAGCACGATGCGGTGAATACGTTCCCGGGCCTTGCACTAACGGCCCGTCA
>266495
AGTTTGATCCTGGCTCAAGATGAACGCTAGCGGCAGGCTTAACACATGCAAGTCAAAGGGCAACGGGGAGAGTGCTTGCACTCTCTGCCGGCGACTGGCGCACGGGTGAGTAACACTTATGCAGACACTGCCTTCCACAGGGCGGACAACCTCTCCCAAAGGGAGGCTAATCCCGCGTATATCCCTTGGGGGCATCCCCGGGGGAGGAAAGGATTACCGGTGTGCAGGATGGGCATGCGGCGCATTACGCAGTAGGCGGGGTAACGGCCCACCTAACCGACCATGCGTATGGGTTCTGAGAGGAAGGCCCCCCACACTGGTACTGAGACACTGACCAGACTCCTACTGGAGGCAGCAGTGAGGAACATTGGTCAATGGGCGGGAGCCTGAACCAGCAAACCCGCGTGAAGGAAGAAGGCGCCGAACGTCGTAAACTTCTTTTGTCCGGGATCAAAGGGCGCCACGTGTGGCGTTGTGAGTGTACCTGTAGAGAAAGCTTCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGTAGGGAGCGAGCGTTGTCCGGATTTATTGGGTGTAAAGGGCGCGTAGGTGGTCGGTTAAGTCAGGTGTGAAAGCTCGGGGCTCAACCCGGAGGATCCGCTGGAACTTTGGTGTCATGAGGCGCAGGAGAAGTAAAGTGGAATTCGTGGTGTAGCGGTGAAATGCATAGATATTGGGCGGAACTCCGGTGGCGAAGGCAGCGTTCTGGCGCGTGCCTGACGCTGAGGCGCGAAAGCGTGGGTATCGAACGGGATTAGATACCCCGGTAGTCCACGCAGTAAACGATGAATACTGGGTGTCGGACCCATAGAACGTTTGGGTGCGCGCAGCGAAAGCGATAAGCATTCCAAGTGGGGAGTACACCGGCAGTGATGAGACTCAAAGGAATCGACGGGGGTTCGCACAAGTGGAGGGATATGTGGTTTAATTAGACGATAAGTGAGGAACGTGACCCGGGTTCAACAGGGAGTCGACAGGGGCAGAGATTCCCTCTTCCACGGACGTCTTCCGAGGTGGGGCATGGTTGTCAGTCAGCTACGTGCCGTGAGGTGTCGGCTTAAGTGCCATAAGGTGTGCAACACGGGCAGACAGTTGCTAACGGGTAGAGCAGTGGAATGTGTAGTGATTGCAGGGGCAAGCCGCGAGGAAGGGGGGGATGATGTCAAATCAGCGCGGCCCTTAGGTCAGGGGTGACACACGTGCTGCAATGGCGGGGACAGAGGGATGTGAAGAGGCGACGTGGAGCGAACCCCAAAAACCCCGCCCCAGTTAGGATTGTAGTATGCAACCCGAATACATGAAGCCGGAATAGGTAGTAATCGCGGATCAGAATGCAGCGGTGAATAAGTTCCCGGCTCTAGCACACACCGCCCGTCA
>229854
GAGTTTGATCCTGGCTCAGATTGAACGCTGGCGGCATGCTTAACACATGCAAGTCGAACGGCAGCATGACTTAGCTTGCTAAGTTGATGGCGAGTGGCGAACGGGTGAGTAACGCGTAGGAATATGCCTTAAAGAGGGGGACAACTTGGGGAAACTCAAGCTAATACCGCATAAACTCTTCGGAGAAAAGCTGGGGACTTTCGAGCCTGGCGCTTTAAGATTAGCCTGCGTCCGATTAGCTAGTTGGTAGGGTAAAGGCCTACCAAGGCGACGATCAGTAGCTGGTCTGAGAGGATGACCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGGACAATGGGGGCAACCCTGATCCAGCAATGCCGCGTGTGTGAAGAAGGCCTGAGGGTTGTAAAGCACTTTCAGTGGGGAGGAGGGTTTCCCGGTTAAGAGCTAGGGGCATTGGACGTTACCCACAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCCGCGGTAATACGGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCCGTTAAAANGGTGCCTAAGGTGGTTTGGATNAGTTATGTGTTAAATTCCCTGGCGCCTCCACCCTGGNGCCAGGTCCATANTAAAAACTGTTAAACTCCGAAGTATGGGCACAAGGTAANTTGGAAANTTCCGGTGGTNANCCGNTGAAAATGCGCTTAGAGATNCGGGAAGGGACCACCCCAGTGGGGAAGGCGGCTACCTGGCCTAATAACTGACATTGAGGCACGAAAAGCGTGGGGAGCAACCAGGATTAGATACCCTGGTAGTCCACGCTGTAAACGATGTCAACTAGCTGTNGGTTATATGAATATAATTAGTGGCGAAGCTAACGCGATAAGTTGACCGCCTGGGGAGTACGGTCGCAAGATTAAAACTCAAAGGAATNGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTACCCTTGACATACAGTAAATCTTTCAGAGATGAGAGAGTGCCTTCGGGAATACTGATACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGTAACGAGCGCAACCCTTATCTCTAGTTGCCAGCGAGTAATGTCGGGAACTCTAAAGAGACTGCCGGTGACAAACCGGAGGAAGGCGGGGACGACGTCAAGTCATCATGGCCCTTACGGGTAGGGCTACACACGTGCTACAATGGCCGATACAGAGGGGCGCGAAGGAGCGATCTGGAGCAAATCTTATAAAGTCGGTCGTAGTCCGGATTGGAGTCTGCAACTCGACTCCATGAAGTCGGAATCGCTAGTAATCGCGAATCAGCATGTCGCGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACACCATGGGAGTGGGCTGCACCAGAAGTAGATAGTCTAACCGCAAGGGGGACGTTTACCACGGTGTGGTTCATGACTGGGGTGAAGTCGTAACAAGGTAGCCG
>182569
AGAGTTTGATCCTGGCTCAGGATGAACGCTAGCTACAGGCTTAACACATGCAAGTCGAGGGGCAGCATGGTGTATCAATATATCTATGGCGACCAGCGCACCGGTGATGCACACCTCTCCTACCTGCCCCTTACTCCGGGATGATCTTTCTAAAAAAATATTACTACTCCATGGTATTACCGAAAAACGTCTTTTTGTTGTTTAAAAACTTCGATGGTGGAAGGTGATGCTTTCTATTATATACTTGGTGGGGTAACAGCCCACCACCTCAGCGATGAATAGGGGTTCTAATAAGAAGGTCCCCCCCATGGTAACTGGGCCCCGGTCCAAATTCTTCGGGAAGCCACCAGTGAGGATTATTGTTCAATGGCGGAGATTTTGACCCAGCCCAAGTAGCGTGAAGGATGACTGCTCCCATAGGTGGTAAACTTCTTTTATATGGGAATAAAGTGAGTCACGTGTGTCTTTTTGTATGTATCATATGAATAAGGATCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGATTCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCGGTTTGTTAAGTCAGTGGTGAAAGTTTGGGGCTCAACCGTGAAATTGCATTTGATACTGGCGGTCTTGAGTGCAGTAGAGGTGGGCGGAATTTGTGGTGTAGCGGTGAAATGCTTAGATATCATGCAGAACTCCGATTGCGAAGGCAGCTCACCGGAGTGTATCTGACGTTGAGGCTCGAAAGTGTGGGTATCAAACAGGATTAGATACCCTGGTAGTCCACACAGTAAAGAAGGAATATTGTCGTTGTGGGATCTCCATTAAGGGGTCAAGGGAAAGCATTAATTATTCCCCTGGGGGAGTAGTCCGCCAGAGGTGAAATTAAAAGAAATGGAGGGGGGCCGGCCCAAGGGAAGGACCATGTGGTTTAATTGGAGGATAGGGGAGGACCTTTCCCGGGGTTGAAAGTGCAAATGAATTATGGGGAGAGCCATTCCCTTCAAGGCATGAGAGAAGGTGCTGCATGGTTGTCGTCAGCTCGTGCCGTGAGGTGTCGGGTTAAGTCCCATAACGAGCGCAACCCTTATCTTCAGTTACTATCAGGTCAAGCTGAGCACTCTGGAGAGACTGCCGTTGTAAGATGAGAGGAAGGTGGGGATGACGTCAAATCAGCACGGCCCTTACGTCCGGGGCTACACACGTGTTACAATGGGGGGTACAGAAGGCAGCTACCCAGCGACAGGATGCCAATCCCAAAAACCTATCTCAGTTCGGATTGAAGTCTGCAACCCGCCTTCGTGAAGTTGGATTCGCTAGTAATCGCGCATCAGCCATGGCGCGGTGAATACGTTCCCGGGCCTTGCACACACCGCCCGTCA
>1719550
TCCTGGCTCAGAACGAACGTTGGCGGCGTGGATTAGGCATGCAAGTCGCGCGAATCCCCGCAAGGGGGGAAGCGGCGTAAGGGGCAGTAAGGCGTGGGTACCTACCCGGGGGTCGGGGATAGCCCGTCGAGAGACGGGGTAATACCCGATGACGTGGAGACACCAAAGGTCCGCCGCCCTCGGCGGGGCCCACGTGATATTAGCTAGTTGGCGGGGTAACGGCCCACCAAGGCGGGGATGTCTAGCGGGTGTGAGAGCACGACCCGCGCCACTGGCACTGAGACACTGGCCAGACACCTACGGGTGGCTGCAGTCGAGGATCTTCGGCAATGGGGGAAACCCTGACCGAGCGACGCCGCGTGGGCGACGAAGGCCTTCGGGTTGTAAAGCCCTGTCGAGGGGGAGAAAGCCTTAACCGGGTGATCTATCCCTGGAGGAAGCACGGGCTAAGTTCGTGCCAGCAGCCGCGGTAAGACGAACCGTGCGAACGTTGTTCGGAATCACTGGGCTTAAAGGGCGCGTAGGCGGGTTGCCGCGTCCGGGGTGAAATCCCACGGCTCAACCGTGGGGCGGCCCCGGGTACGGGCAGCCTCGAGGAGAGTAGGGGCATGCGGAACTCTGGGTGGAGCGGTGAAATGCGTTGATATCCAGAGGAACTCCGGTGGCGAAGGCGGCATGCTGGACCCTTCCTGACGCTGAGGCGCGAAAGCCAGGGGAGCGAACGGGATTAGATACCCCGGTAGTCCTGGCCCTAAACGATGAGAACTAGGTAGCCGGCCGGACATGGGCTGGCTGCCGGAGCCAAAGTGCTAAGTTCTCCGCCTGGGGAGTATGGCCGCAAGGCTGAAACTCAAAGGAATTGACGGGGGCTCACACAAGCGGTGGAGCATGTGGCTTAATTCGAGGCTACGCGAAGAACCTTATCCCGGGCTTGACATGTTCGAAAGAGGCTCGAAGTAGCCCGCGGAAACGTGGGGCCAACGGTATCCAGTCCGGAGCGAGCTACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTCGGGTTAAGTCCCATAACGAGCGAAACCCTTACCCTCAGTTGCTTACTAGGACTCTGGGGGGACTGCCGGTGTCAAACCGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCTTTATGCCCGGGGCTGCACACGTGCTACAATGGCGTGGACAAAGAGACGCGAGCCCGCGAGGGGGAGCCAATCTCAGAAAGCACGCCCCAGTTCAGATCGCAGGCTGCAACTCGCCTGCGTGAAGCCGGAATCGCTAGTAATCGCGGGTCAGCAACACCGCGGTGAATGTGTTCCTGAGCCTTGTACACACCGCCCGTCAAGCCACGAAAGAGAGGGACGTCCGGAGTCGCCTTCACCGGTGCCGAAGACGGACTTCTTGATTGGGACTAAGTCGTAACAAGGTAACC
>1794723
TTAGAGTTTGATCCTGGCTCAGAACGAACGTTGGCGGCGTGGATTAGGCATGCAAGTCTCGCGAATCCCCGCAAGGGGGGAAGCGGCGTAAGGGGCAGTAAGGCGTGGGTAACCCACCCCGGGGCCCGGGATAGCCCGTCGAGAGACGGGGTAATACCGGGCGACGCAGCGTGCCGGCATCGGTGTGCTGCCAAAGGTCCGCCGCCCCGGGCGGGGCCCACGTGGTATTAGCTAGTTGGTGGGGTGACGGCCCACCAAGGCGGAGATGCCTAGCGGGTGTGAGAGCACGACCCGCGCCACTGGCACTGAGACACTGGCCAGACACCTACGGGTGGCTGCAGTCGAGGATCTTCGGCAATGGGGGCAACCCTGACCGAGCGACGCCGCGTGGGCGACGAAGGCCTTCGGGTTGTAAAGCCCTGTCGAGGGGGAGAAACGTCCCGCAAGGGGCCTGATCTATCCCTGGAGGAAGCACGAGCTAAGTTCGTGCCAGCAGCCGCGGTAAGACGAACCGTGCGAACGTTGTTCGGAATCACTGGGCTTAAAGGGCGCGTAGGCGGGCTGCCGAGTCCGGGGTGAAATCCTCCCGCTCAACGGGAGAACGGCCCCGGGTACTGGCGGCCTCGAGGCGGGTAGGGGCGTGCGGAACACTGGGTGGAGCGGTGAAATGCGTTGATATCCAGTGGAACTCCGGTGGCGAAGGCGGCACGCTGGACCCGTCTGACGCTGAGGCGCGAAAGCCAGGGGAGCGAACGGGATTAGATACCCCGGTAGTCCTGGCCCTAAACGTTGAGAACTAGGTAGTCGGCCGGACATGGGCTGACTGCCGGAGCGAAAGTGCTAAGTTCTCCGCCTGGGGAGTATGGTCGCAAGGCTGAAACTCAAAGGAATTGACGGGGGCTCACACAAGCGGTGGAGCATGTGGCTTAATTCGAGGCAACGCGAAGAACCTTATCCCGGGCTTGACATGTGCGAAAGCGTCTGGGGGTACCCGCCGGAAACGGCCGGGGAAGGTATCCAGTCCTGAACCAGACACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTCGGGTTAAGTCCCATAACGAGCGAAACCCTTACCCTCAGTTGCCAGCGGGTCACGCCGGGGACTCTGGGGGGACTGCCGGTGTCAAACCGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCTTTATGCCCGGGGCTGCACACGTGCTACAATGGCGTGGACAAAGGGGCGCGAACGCGCGAGCGGGAGCCGACCCCGGAAAGCACGCCCCAGTTCAGATCGCAGGCTGCAACTCGCCTGCGTGAAGTCGGAATCGCTAGTAATCGCGGGTCAGCAACACCGCGGTGAATGTGTTCCTGAGCCTTGTACACACCGCCCGTCAAGCCACGAAAGGGAGGGACGGCCGAAGTCGCGCCCCGCGCGCCGACGCCGGACTTCCCGATTGGGACTAAGTCGTAACAAGGTAACC
>1142181
CACGTGGGTCATTTGCCCCGAAGCCCGGGATAGCCCATGGAAACATGGATTAATACCGGATGTGGTTGGAGTACACAGGTGCTCCGTATTAAACGGTAGGTAGCAATACCTTCCGCTTCGGGATAAGCCCGCGGCCCATTAGCTAGTTGGTGGGGTAAGACCCAACCAAGGAGACAACCGGGAGCCGGACAGAAAGGGTGACGGCCACATTGGGACTGAGAAACGGCCCGATCCTACGGAGGCAGCAGTAAGAATCTTCCGCATGAACGAAGTCCGACCGAGCGACGCGCTGAGTGATGAAGGTGTTATGCATCGTAAAGCTCCTTCGGGGAGGAGAATAAGCATAGTCCAAAAGGCTATGTGATGACGACCCTCCCTAAAGAAGCCCCGGCTAATTACGTGCAGCAGCGCGGCAATACGTAAGGGGTAAGCGTTGTTCGGAATTACTGGGCGTAAAGGGTGTGCAGGCGGGAGAGTAAGTTGGGGGTGAAATCTACGGGCCCAACCCGTAAACTGCCCTCAAAACTGCTTTTCTTGAGTGCAGGAGAGGAGACTGGAATTCCTAGTGTAGGAGTGAAATCTGTAGATATTAGGAAGAACACCGGTGGCGAAGGCGAGTCTCTGGCCTGACACTGACGCTGATACACGAAAGCGTGGGGAGCGAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGTTGTGCACTAGATGTTGGGGGTGTCAATCCCCTCAGTGTCGCAGTTAACGCATTAAGTGCACCGCCTGGGGAGTATGCTCGCAAGGGTGAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGATACGCGAGGAACCTTACCAGGGCTTGACATACAGGTGCCGGGCTGTGAAAGCAGTCCTCTCTTCCGAGCGCCTGTACAGGTGTTGCACGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGTTAAGTCCCCCAACGAGCGCAACCCCTATTGTCTGTGCCATCATTAAGTTGGCACTCGAACGAAACTGCCGGTGATAAACCGGAGGAAGGTGGGGATGACGTCAAGTCATATGGCCCTTAGGCCTGGCTACACGTGCTACAATGGACAGTACAAGAGTCGCAAGACCGAAAGGTGGACCATCCAAAAGCTGTCCTCAGTTCCGATTGAAGTCTGAAACTCGACTCCATGAAGTTGGAATCGCTAGTAATCGCGCATCAGAATGGCGCGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACACCACCCGAGTTGGAAGTACTTGAAGTCGCTGATCTAACCTTCGGGAGGAAGGCGCCGATTGTACGTCTGATAAGGGGGGTGAAGTCGTAACAAGGTAACC
>2683209
CTGGCGGCGTGGTTTAGGCATGCAAGTCGAACGCGAAAGATTTACTTCGGTAAATTGAGTAGAGTGGCGAACGGGTGAGTAATACGTACGAATCTACCTTAAAGACAGGGATAGTCCCGGGAAACTGGGTTTAATACCTGATGGTATCCGGCTTTGCCGGATTAAAGACGGCCTCTATTTATAAGCTGTTACTTTTAGATGAGCGTGCGCTCCATTAGTTAGTTGGTAAGGTAAGAGCTTACCAAGGCGATGATGGATAGGCGTCCTTAACGGGTGGTCGCCCACACTGGGATTGAGATACGGCCCAGACTCCTACGGGAGGCAGCAGTCGAGAATCGTCTACAATGAACGCAAGTTTGATAGTGCGACGCCGCGTGAATGAAGAAGCATTTCGGTGTGTAAAATTCTTTTATATAAGAACAGTGCATGTATGGTAAATAATTATACGTGAGAGATAGTACTATATGAATAAGCTCCGGCTAACTTCGTGCCAGCAGCCGCGGTAATACGAAGGGAGCAAGCGTTGTCCGGAATTACTAGGTGTAAAGGGTAAGTAGGCGGAAATTTAAGTCTCCGGTTAAATCTTCGGGCTCAACCCGAAATCTGCCTGAGATACTGGATTTCTAGAGTAAAGCAGATGAAGGCGGAATTCCTGGAGTAGCGGTGGAATGCGTAGATATCAGGAAGAACACCCATAGCGAAGGCAGCTTTCAATGCTATTACTGACGCTCAATTACGAAGGTGCGGGTATCGAACAGGATTAGATACCCTGGTAGTCCGCACAGTAAACGATATGTACTTGATATTGGATGTTGAAAATTCAGTGTCGTAGCTAACGCGTTAAGTACATCACCTGGGGACTAACGGCCGCAAGGTTAAAACTCAAAGGAATTGACGGGGGCCCACACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTGAACTTGACATGCCGAGAATCCTGTAGAAATATGGGAGTGCCTTTTTTGGAGCTCGGACACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTTGTGAAATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTATCTTTAGTTGCTACCATTAAGTTGAGGACTCTAAAGAGACTGCCAGAGTACAAATCTGGAGGAAGGTGGGGACGACGTCAAGTCATCATGGTCCTTATGTTCAGGGCTACACACGTGCTACAATGGTTGGAACAAAAGGCAGCGAAGGGGCGACCCGGAGCTAATCTCCAAACCCAATCTTAGTCCGGATTGCAGTCTGCAACTCGACTGCATGAAGTTGGAATCGCTAGTAATCGTGAGTCAGCATATCACGGTGAACATGTTCCTGGGCCTTGTACACACCGCCCGTCAAGTCAGCCGAATCGAGTGCACCCGAAGAAGGTGAGTTAATTAGACAGCTTTCGAAGGTGTGCTTGTAAGGGGGACTAAGTC
>2784824
AGTGGCGCACGGGTGAGTAACGCGTGGGTAACTTGCCTTTAAGTGAGGGATAACCCACTGAAAGGTGGACTAATACCTCATAAGACCACAGTGCTACGGCAGCGTGGTCAAAGGTGGCTTTATTAAAAGCTGCCGCTTGGAGAGAGACCCGCGTCCCATCAGCTTGTTGGTAAGGTAATGGCTTACCAAGGCCGAGACGGGTAGCTGGTCTGAGAGGATGGCCAGCCACACTGGAACTGAAACACGGTCCAGACTCCTACGGGAGGCAGCAGTGAGGAATCTTGCGCAATGGGGGGAACCCTGACGCAGCAACGCCGCGTGAGTGAAGAAGGTCTTCGGGTCGTAAAGCCCTGTCGGGAGGGAAGAAACAGTTATGCATGAATAATGCATAACCTTGACGGTACCTCCNGAGGAAGCACCGGCCAACTCCGTGCCAGCAGCCGCGGTAAAACGGAGGGTGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGAGCGTGTAGGCGGATAGATAAGTCGAGTGTGAAAGCCCTCAGCTTAACTGAGGAAGTGCATTCGAAACTATCTTTCTTGGGTACGGAAGAGGGAAGTGGAATTCCCGGTGTAGGGGTGAAATCCGTAGATATCGGGAGGAATACCAGTGGCGAAGGCGACTTCCTGGACCGTCACTGACGCTGAGACGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGGGCACTAGGTGTATCTCGCTTAGCGGGATGTGCCGTAGCTAACGCATTAAGTGCCCCGCCTGGGGAGTACGGTCGCAAGACTAAAACTCAAAGGAATTGACGGGGGCCCGCACAAGTGGTGGAGCATGTGGTTTAATTCGACGCAACGCGAAGAACCTTACCTGGGTTTGACATGCCGAGAATCTGCCAGAAATGGTGGAGTGCCCCGTTAGGGGAACTCGGACACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTCACCTTTAGTTGCCAGCATTAAGTTGGGCACTCTAAAGGGACTGCCGGTGTTAAACCGGAGGAAGGTGGGGACGACGTCNAGTCCTCATGGCCTTTATACCCAGGGCTACACACGTGCTACAATGGCCAGTACAAAGGGCTGCAATCCCGCGAGGGGGAGCCAACCCCAAAAATCTGGTCTTAGTTCGGATTGGAGTCTGCAACTCGACTCCATAAAGGTGGAATCGCTAGTAATCGTGAATCAGCACGTCACGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACACCACGAAAGTTGGCTGTACCAGAAGTTGCTGAGCTAACTCGCCTCGGCGGGAGGCAGGCACCTAAGGTGTGGTTGATGATTGGGGTGAAGT
>2941516
TTAGAGTTTGATCCTGGCTCAGGATGAACGCTAGCGATAGGCCTAACACATGCAAGTCGAGGGGTAACAGGGTAGCAATACCGCTGACGACCGGCAAATGGGTGAGTAACGCGTATGCAACCTACCGATAACAGTTGGATAGCTCCCTGAAAGGGGAATTAAACCGGCATGACACTATGAGATCGCCTGTTTTCATAGTTAAATATTTATAGGTTATTGATGGGCATGCGTGACATTAGCAAGTTGGTGAGGTAACGGCTCACCAATGCTACGATGTCTAGGGGTTCTGAGAGGAAGGTCCCCCACACTGGTACTGAGACACGGACCAGACTCCTACGGGAGGCAGCAGTGAGGAATATTGGTCAATGGACGGAAGTCTGAACCACCCACTTCGCGTGCAGGATGACTGCCCTATGGGTTGTAAACTGCTTTTATATAAGAGGAACAGTATTTATGTATAGATATTTGCCAGTATTATATGAATAAGGATCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGATCCGAGCGTTATCCGGATTTATTGAGTTTAAAGGGTGAGTACGCGGTAGTATAAGTCAGCGGTGATAACTCGCAGCTCATCTGTAAGCTTGCCGTTGACACTGTATTACTTGACTTAACGTTGAGGTATGCTGAATGGGGGGGGGTTACCCGTTGAAATGCATTAATCAAAACAACAGACCACCCGATTTGCGGACGGCAGCAAAACTACACTGTCCACTGACGCTGATGCACAAAAGGCGTGGGTATCAAACAGGATTAGATACCCTGGTAGTCCACGCTGTAAACGATGATTACTGGTTGTTTGTGATACACTGCAAGTGACTGAGCGAAAGCACTAAGTAATCCACTTGGCGAGTACGTCGGCAACGATGAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTCTAATTCGAGGCAACGCGAAGAACCTTACCCAGACTTGACATCTAGGAAAGGTCCTTGAAAGAGGATCGTGCCCGCAAGGGAATCCTAAGACAGGTGTTGCATGGCTGTCGTCAGCTCCTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTGCTTACAGTTACCATCGGTTCGGCCGGGGACTCTGTAAGGACTGCCGCTGATAAAGCGAAGGAAGGCGGGGACGACGTCAAGCAATCACGGCCCTTACGTCTGGGGCTACACACGTGCTACAATGGCCGGTACAATGAGTCGCAAAACCGCGAGGTCAAGCTAATCTCAAAAAACCGGTCTCAGTTCGGATTGGAGTCTGCAACCCGACTTCGTGAAGCTGGATTCGCTAGTAATCGCGCATCAGCCATGGCGCGGCGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCA
>998428
GACGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAGCGGAGTTGTTCCTTCGGGGACAGCTTAGCGGCGGACGGGTGAGTAACACGTAGGCAACCTGCCTGCAGGACCGGGATAACCCACGGAAACGTGAGCTAATACCGGATAGATGGTTCCCTCGCATGAGGGGATCAGGAAAGACGGGGCAACCTGTCACTTGTAGATGGGCCTGCGGCGCATTAGCTAGTTGGCGAGGTAACGGCTCACCAAGGCGACGATGCGTAGCCGACCTGAGAGGGTGAACGGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTAGGGAATCTTCCGCAATGGACGAAAGTCTGACGGAGCAACGCCGCGTGAGTGAGGAAGGTCTTCGGATCGTAAAGCTCTGTTGCCAAGGAAGAACGCTTGGTGGAGTAACTGCCATCAAGGTGACGGTACTTGAGAAGAAAGCCCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGGGCAAGCGTTGTCCGGAATTATGGGCGTAAGCGCGCGCAGCGGTTCTTTAAGTCTGAGGTTAAATGCAGGGCTCAACCTTGTAACGCCTTGGAAACTGGGGGACTGGAGTGTAGGAGAGGAAAGTGGAATTCCACGTGTAGCGGTGAAATGCGTAGAGATGTGGAGGAACACCAGTGGCGAAGGCGACTTTCTGGCCTATAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGAGTGCTAGGTGTTAGGGGTTTCGATACCCTTGGTGCCGAAGTTAACACAGTAAGCACTCCGCCTGGGGAGTACGCTCGCAAGAGTGAAACTCAAAGGAATTGACGGGGACCCGCACAGGCAGTGGAGTATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGTCTTGACATCCCGATGAAACGTCTAGAGATAGGCGCCCTCTTCGGAGCATTGGAGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTGAATTCAGTTGCCAGCACTTCGGGTGGGCACTCTGAATTGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAAATCATCGTGCCCCTTATGACCTGGGCTACACACGTACTACAATGGTCGGTACAACGGGCAGCGAAGCCGCGAGGCGGAGCCAATCCTAGAAAAGCCGATCTCAGTTCGGATTGCAGGCTGCAACTCGCCTGCATGAAGTCGGAATTGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCCCGGGTCT
>4343117
AACGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGCGTGCGCGGTTCACGAACTTGTACGTGGATGGGCGCACGGCGCAGGGGGGCGTAACACGTGGGCACTCTGCCCTCCGATGGGGAATACTCCCGCGAACCGGGGGCTAATACCGCATAACATTCCGAGGACTTGGGTTCTTGGATTCAAAGCAGTGATGCCTGTGAGGAGGAGCCCGCGCCCGATTAGCTAGTTGGTAGGGTAACGGCCTACCTCGGCAATGATCGGTAGCTGGTCTGAGAGGATAATCAGACACACTGCAACTGAAACGAGGCCCAGACTCCTACCGTAGGGAACGCTGGGGAATCTTGCCTTCTGGGCGAAAGCATGACCCAACGACGCCGCGTGGGGGATGAAGCTTTTGCTAGTGTAAACCCCTTTTCACTGGTAAGAATGCACGCAAGGGAGCGACAGTACCCTGGCAAGAAGCCCCGGCTAACTACGTGCCACCCGCCTCGGTAAGACCTAGGGGGCCAGCGTTGTTCGGAATTACTGGGTGTATAGGGTACTTATGCGGTGCGACAAGTTGGGAGTGAAATCTCTGGGCTTAACCCAGAGGCTGCTTCTCAAACTGCTATGCTTGATTGTGACAGAGGCTCTTGAAATTGCAGGAGTAGCGTTGAAATGCATGTATATCTGCAAGATCACCCGAGATATGGACGAACAGCTGGATCACAAGTGACGCTGAGGAACGAAAGCTACGCTGAGCGAACAGGATTATATACACTGGTAGTCCTAGCACTAAACGATCATGACTTGCGGTGACGACCGTTCGGACGTCTCCCGGAGCTAACGCGTTAAGTCCTGCACCTGGGGAGTACGGTCGCAGACTGGAAGTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAACATGTGGTTCAATTCGACGCTACGCGAGGAACCTTACCTGGTTCGAAATTCTTATGACCAGCTGTAGAATTACGGCTTTCCTTCAAGAGACATGAGTCTAGGCGCTCCATGGCTGTCGTCAGTTCGTTCCGTGAGGTGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTGCACGTAGTTACTACTCGCAAGAGAGGACTCTACGTGGACTGCTCCGGATAACGGAGAGGAAGGTGGGAATGACGTCAAGTCCGCATGGCCTTTATGTCCAGGGCTACACACGTGTTACAATGCAGGGTACAAACCGTTGCCAACCCGCGAGGGGGAGCTAATCGGATAAAACTGTGCTCAGTTCGGATTGCAGTCTGCAACTCGACTGCATGAAGCTGGAATCGCTAGTAATGGGGATCAGCTTGACGCCGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACATCACGAAAGTGAGCTCACCTAGAAGTCGCCACGCTAACCGCAAGGGGGCAGGCGCCCAAGGTATGACTCATGATTGGGGTG
>4353661
GGATGAACGCTAGCGGGAGGCTTAATACATGCAAGTCGAGGGTGAAGCTTTCTTCGGAAAGTGGAAACCGGCGAACGGGTGCGTAACGCGTACGCAACTTACCCCTTGCTGGAGAATAGCCCCGGGAAACTGGGATTAATGCTCCATGGTATGGTGAAATCGCATGATTTTATCATTAAAGGTTACGGCAAGGGATAGGCGTGCGTCCCATTAGCTTGTTGGTGAGGTAACGGCTCACCAATGCAAACGATGGGTAGCTGGTCTGAGAGGATGATCAGCCACACGGGCACTGAGACACGGGCCCGACTCCTACGGGAGGCAGCAGTAGGGAATATTGGACAATGGACGAAAGTCTGATCCAGCCATCCCGCGTGCAGGACGAATGCCCTATGGGTTGTAAACTGCTTTTCTAAGGAAAGAAATATCTCATTCATGAGGTGCTGACGGTACCTTAGGAATAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTATCCGGATTCACTGGGTTTAAAGGGTGCGTAGGCGGTATGATAAGTCAGTGGTGAAAGCCCGGGGCTCAACTCCGGAACTGCCGTTGATACTGTCATACTTGAGTCCAGTTGAGGTGGGCGGAATGATACATGTAGCGGTGAAATGCTTAGATATGTATCAGAACACCGACTGCGAAGGCAGCTCACTAAACTGGTACTGACGCTGAGGCACGAAAGCGTGGGTAGCGAACAGGATTAGATACCCTGGTAGTCCACGCCCTAAACGATGCTAACTCGGTATGTGCGATATACTGTACGTGCCTGAGGGAAACCGTTAAGTTAGCCACCTGGGGAGTACGTTCGCAAGAATGAAACTCAAAGGAATTGACGGGGGTCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGATACGCGAGGAACCTTACCTGGGCTCTAATGTACCACGCCCGACCCTGAAAGGGGTCTTCTTCTTCGGAAGCGGGGTACAAGGTGCTGCATGGTTGTCGTCAGCTCGTGCCGTGAGGTGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTGTCCTTAGTTGCCAGCGGTCCGGCCGGGGACTCTAAGGAGACTGCCTTCGCAAGGAGTGAGGAAGGAGGGGACGACGTCAAATCATCATGGCCTTTATGCCCAGGGCTACACACGTGCTACAATGGTGAGGACAAAGGGCAGCCACTTAGCGATAAGGAGCAAATCCCAAAAACCTCACCTCAGTTCGGATTGGAGTCTGCAACTCGACTCCATGAAGTCGGAACCGCTAGTAATCGCAGATCAGACATGCTGCGGTGAATACGTTCCCGGACCTTGTACACACCGCCCGTCAAGCCATGGAGCCGGGTGTACCTTAAGGCGATAACCGAAAGGAGTTGCCCAAGGTA"""
_seqs_16s = []
for seq_id, seq in list(parse_fasta(seqs_16s.split("\n"))):
    _seqs_16s.append(BiologicalSequence(seq, seq_id))
seqs_16s = SequenceCollection(_seqs_16s)
tax = """669210 k__Bacteria; p__; c__; o__; f__; g__; s__
881726 k__Bacteria; p__Firmicutes; c__Bacilli; o__Bacillales; f__Paenibacillaceae; g__Paenibacillus; s__
296752 k__Bacteria; p__Spirochaetes; c__Spirochaetes; o__Spirochaetales; f__Spirochaetaceae; g__Treponema; s__
1794723 k__Bacteria; p__Planctomycetes; c__Planctomycetia; o__Gemmatales; f__Gemmataceae; g__Gemmata; s__
2941516 k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Marinilabiaceae; g__; s__
793074 k__Bacteria; p__; c__; o__; f__; g__; s__
4353661 k__Bacteria; p__Bacteroidetes; c__[Saprospirae]; o__[Saprospirales]; f__; g__; s__
292553 k__Bacteria; p__Spirochaetes; c__Spirochaetes; o__Spirochaetales; f__Spirochaetaceae; g__Treponema; s__
2784824 k__Bacteria; p__Proteobacteria; c__Deltaproteobacteria; o__Syntrophobacterales; f__Syntrophaceae; g__; s__
1719550 k__Bacteria; p__Planctomycetes; c__Planctomycetia; o__Gemmatales; f__Gemmataceae; g__Gemmata; s__
182569 k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__
266495 k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__S24-7; g__; s__
524860 k__Bacteria; p__Planctomycetes; c__Planctomycetia; o__Gemmatales; f__Gemmataceae; g__Gemmata; s__
293514 k__Bacteria; p__Spirochaetes; c__Spirochaetes; o__Spirochaetales; f__Spirochaetaceae; g__Treponema; s__
2683209 k__Bacteria; p__WWE1; c__[Cloacamonae]; o__[Cloacamonales]; f__; g__; s__
501793 k__Bacteria; p__Firmicutes; c__Bacilli; o__Bacillales; f__Paenibacillaceae; g__Paenibacillus; s__
229854 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Legionellales; f__Legionellaceae; g__Legionella; s__
583705 k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__; g__; s__
1142181 k__Bacteria; p__Spirochaetes; c__GN05; o__SBYZ_6080; f__; g__; s__
998428 k__Bacteria; p__Firmicutes; c__Bacilli; o__Bacillales; f__Paenibacillaceae; g__Paenibacillus; s__
4343117 k__Bacteria; p__Acidobacteria; c__DA052; o__Ellin6513; f__; g__; s__"""
# Split each line on the first run of whitespace (tab or space): id, then taxonomy string
tax_lookup = dict(e.strip().split(None, 1) for e in tax.split('\n'))
What fraction of the 3-base k-words in the following two sequences is unique to one sequence or the other?
seq1 = BiologicalSequence('AGCTAGCATCGATCGATCGATGCATGCAT')
seq2 = BiologicalSequence('AGCTCGGCATCGAGGGCAGTCAATCGATCT')
help(kmer_distance)
# Compute the kmer distance in this cell
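As a point of reference, the idea behind a k-mer distance can be sketched in plain Python. This is not the notebook's kmer_distance function: the helper names kmer_set and fraction_unique_kmers and the toy sequences are assumptions made for illustration, and the definition used here (k-words found in only one sequence, divided by all distinct k-words found in either) should be checked against the help text above.

```python
def kmer_set(seq, k=3):
    # All overlapping words of length k in seq
    return set(seq[i:i + k] for i in range(len(seq) - k + 1))


def fraction_unique_kmers(seq1, seq2, k=3):
    # Hypothetical helper: assumes both sequences are at least k bases long
    kmers1, kmers2 = kmer_set(seq1, k), kmer_set(seq2, k)
    all_kmers = kmers1 | kmers2   # every distinct k-word seen in either sequence
    shared = kmers1 & kmers2      # k-words present in both sequences
    return (len(all_kmers) - len(shared)) / float(len(all_kmers))


print(fraction_unique_kmers("ACGT", "ACGA"))
```

The two toy sequences share ACG and each has one word of its own (CGT and CGA), so two of the three distinct 3-mers are unique to one sequence.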
Display the guide tree for the sequences in the cell below.
query_seqs = SequenceCollection(
[BiologicalSequence("ACGATGACCAGTGCTACCAGT", "s1"),
BiologicalSequence("AACGATCGATCGATCGTGCTA", "s2"),
BiologicalSequence("AACGATCTGCTA", "s3"),
BiologicalSequence("CGATCGATGACATGCATG", "s4"),
BiologicalSequence("CGATCTGCAT", "s5")])
help(guide_tree_from_sequences)
# Display the guide tree in this cell.
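To make the idea of a guide tree concrete, here is a minimal sketch that repeatedly joins the two closest clusters using a k-mer distance and average linkage. It is an illustration under stated assumptions, not the implementation behind guide_tree_from_sequences: the function names, the choice of average linkage, the nested-tuple tree representation, and the toy sequences are all hypothetical, and branch lengths are ignored.

```python
from itertools import combinations


def kmer_dist(a, b, k=3):
    # Fraction of distinct k-words found in only one of the two sequences
    ka = set(a[i:i + k] for i in range(len(a) - k + 1))
    kb = set(b[i:i + k] for i in range(len(b) - k + 1))
    union = ka | kb
    return (len(union) - len(ka & kb)) / float(len(union))


def average_linkage_tree(seqs, k=3):
    # seqs: dict mapping sequence id -> sequence string
    # Each cluster is a pair: (subtree as nested tuples, list of member ids)
    clusters = [(sid, [sid]) for sid in sorted(seqs)]

    def cluster_dist(c1, c2):
        # Average linkage: mean distance over all cross-cluster pairs
        pairs = [(x, y) for x in c1[1] for y in c2[1]]
        total = sum(kmer_dist(seqs[x], seqs[y], k) for x, y in pairs)
        return total / float(len(pairs))

    while len(clusters) > 1:
        # Find the closest pair of clusters and merge them
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: cluster_dist(clusters[p[0]], clusters[p[1]]))
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


toy = {"a": "AAAAAA", "b": "AAAAAT", "c": "GGGGGG", "d": "GGGGGC"}
print(average_linkage_tree(toy))
```

Because the tree is built from a rough, alignment-free distance, it is only a starting point for deciding the order in which sequences get aligned.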
What are the differences between the guide tree from Question 2, the tree that is generated after 1 iteration of iterative multiple sequence alignment, and the tree that is generated after 5 iterations of iterative multiple sequence alignment? Display the trees for both 1 and 5 iterations of iterative multiple sequence alignment.
help(iterative_msa_and_tree)
from skbio.alignment import global_pairwise_align_nucleotide
# add your command for 1 iteration of iterative multiple sequence alignment here
# hint: pass pairwise_aligner=global_pairwise_align_nucleotide
# add your command for 5 iterations of iterative multiple sequence alignment here
# hint: pass pairwise_aligner=global_pairwise_align_nucleotide
Generate and display a tree based on progressive alignment of the sequences from the second cell (the ones in the seqs_16s variable). This step can take about 10 minutes to complete.
help(progressive_msa_and_tree)
# Add your command for progressive alignment and tree building here
# hint: pass pairwise_aligner=global_pairwise_align_nucleotide
Using the tree representing the sequences from question four as a guide, define clusters (i.e., groups) of sequences at 90% and 70% identity. There is not a single right answer for this, or a single method for grouping sequences. Go about this systematically, and describe the process that you're going through in a couple of paragraphs. These groups are usually referred to as operational taxonomic units, or OTUs, because they represent a hypothesis about the taxonomic relatedness of a group of sequences (which is a proxy for a hypothesis about the relatedness of the group of organisms containing those sequences in their genomes).
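One systematic way to turn pairwise identities into clusters is greedy centroid clustering, sketched below. This is only one of many possible approaches and is not prescribed by the assignment. The function name greedy_otus, the toy identifiers, and the toy identity values are assumptions made for illustration; in the notebook the identity callable could wrap a pairwise identity calculation.

```python
def greedy_otus(ids, identity, threshold):
    # Visit ids in order. Each id joins the first existing centroid it matches
    # at >= threshold; otherwise it becomes the centroid of a new cluster.
    centroids = []
    otus = {}
    for sid in ids:
        for centroid in centroids:
            if identity(sid, centroid) >= threshold:
                otus[centroid].append(sid)
                break
        else:
            centroids.append(sid)
            otus[sid] = [sid]
    return otus


# Toy pairwise identities (made-up values, not computed from real sequences)
toy_identity = {
    frozenset(["a", "b"]): 0.95,
    frozenset(["a", "c"]): 0.75,
    frozenset(["b", "c"]): 0.72,
}


def lookup(x, y):
    return toy_identity[frozenset([x, y])]


print(greedy_otus(["a", "b", "c"], lookup, 0.90))
print(greedy_otus(["a", "b", "c"], lookup, 0.70))
```

The result depends on the order in which sequences are visited and on which sequence ends up as a centroid, which is one reason there is no single right answer.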
If you want to obtain a given sequence, you can do so by looking up its identifier with seqs_16s.get_seq:
print seqs_16s.get_seq('4343117')
print seqs_16s.get_seq('4353661')
To compute the pairwise identity for two sequences, use pairwise_percent_id as follows:
from skbio.alignment import global_pairwise_align_nucleotide
from skbio import BiologicalSequence
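def pairwise_percent_id(seq1_id, seq2_id, seq_lookup):
    # Look up both sequences by their identifiers
    seq1 = seq_lookup.get_seq(seq1_id)
    seq2 = seq_lookup.get_seq(seq2_id)
    # Globally align the pair, then convert the pairwise distance to an identity
    aln = global_pairwise_align_nucleotide(seq1, seq2)
    return 1 - aln.distances()[0][1]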
print pairwise_percent_id('793074', '4353661', seqs_16s)
# Compute additional pairwise identities, as necessary, to answer this question here. Show all of your commands!
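For intuition about what an identity value means, the sketch below computes the fraction of matching columns between two sequences that have already been aligned (gaps written as "-"). It is a simplified stand-in, not the notebook's pairwise_percent_id: the function name fraction_identical and the toy aligned strings are assumptions, and it counts every alignment column, including columns with a gap, in the denominator.

```python
def fraction_identical(aligned1, aligned2):
    # Both inputs must already be aligned, so they have the same length
    if len(aligned1) != len(aligned2):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(1 for a, b in zip(aligned1, aligned2) if a == b)
    return matches / float(len(aligned1))


print(fraction_identical("ACGT-A", "ACCTGA"))
```

In this toy pair, four of the six columns match: one column is a substitution (G versus C) and one is a gap opposite a base.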
Discuss your results here.
Choose one representative sequence from each of the clusters you defined in question 5. Look these up in tax_lookup by their ids to get the taxonomy of each sequence, and include those in the results below. When you see a taxonomic level that ends with __ (for example, g__), that means that there is no known taxonomic assignment for that sequence at that level.
print tax_lookup['4343117']
print tax_lookup['4353661']
# Perform additional taxonomy look-ups here
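When comparing many taxonomy strings, it can help to pull out the most specific level that actually has a name. The sketch below does this for strings in the format used in tax_lookup. The function name deepest_assigned_level is hypothetical, and the code assumes levels are separated by semicolons and that an unassigned level ends with two underscores.

```python
def deepest_assigned_level(taxonomy):
    # Split "k__...; p__...; ..." into individual levels
    levels = [level.strip() for level in taxonomy.split(";")]
    # A level such as "g__" has a prefix but no name, so it is unassigned
    assigned = [level for level in levels if not level.endswith("__")]
    return assigned[-1] if assigned else None


example = ("k__Bacteria; p__Bacteroidetes; c__[Saprospirae]; "
           "o__[Saprospirales]; f__; g__; s__")
print(deepest_assigned_level(example))
```

For the example string, which is the entry for sequence 4353661 above, the deepest named level is the order.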
Discuss your results here.
Is the taxonomy of the representative sequences consistent with the phylogenetic tree you generated in question 4? For your 90% and 70% OTUs, list three taxa (e.g., at the phylum, class, or species level) that are monophyletic, if any, and three taxa that are not monophyletic, if any. Discuss two specific reasons why some taxa might appear to not be monophyletic based on your tree.
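As a reminder of what monophyly means operationally, the sketch below checks whether a set of tips forms a complete clade in a rooted tree written as nested tuples. The tree representation, the function names tips and is_monophyletic, and the toy tip labels are assumptions made for illustration; the trees displayed in this notebook are not in this format, and the answer can change depending on where a tree is rooted.

```python
def tips(tree):
    # Collect the tip labels beneath a node; a non-tuple is a single tip
    if isinstance(tree, tuple):
        found = set()
        for child in tree:
            found |= tips(child)
        return found
    return {tree}


def is_monophyletic(tree, members):
    # True if some clade contains exactly the given members and nothing else
    members = set(members)
    if tips(tree) == members:
        return True
    if isinstance(tree, tuple):
        return any(is_monophyletic(child, members) for child in tree)
    return False


# Toy rooted tree: t1..t3 stand in for one taxon, g1 and g2 for another
toy_tree = (("t1", "t2"), ("g1", ("g2", "t3")))
print(is_monophyletic(toy_tree, ["t1", "t2"]))
print(is_monophyletic(toy_tree, ["t1", "t2", "t3"]))
```

In the toy tree, t1 and t2 form their own clade, but the group t1, t2, t3 does not, because the smallest clade containing all three also contains g1 and g2.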
Discuss your results here.