It's possible to generalize Smith-Waterman and Needleman-Wunsch, the dynamic programming algorithms that we explored for pairwise sequence aligment, to identify the optimal alignment of more than two sequences. Remember that our scoring scheme for pairwise alignment with Smith-Waterman looked like the following:
$$ \begin{align} & F(0, 0) = 0\\ & F(i, 0) = F(i-1, 0) - d\\ & F(0, j) = F(0, j-1) - d\\ \\ & F(i, j) = max \begin{pmatrix} & F(i-1, j-1) + s(c_i, c_j)\\ & F(i-1, j) - d\\ & F(i, j-1) - d)\\ \end{pmatrix} \end{align} $$To generalize this to three sequences, we could create $3 \times 3$ scoring, dynamic programming, and traceback matrices. Our scoring scheme would then look like the following:
$$ \begin{align} & F(0, 0, 0) = 0\\ & F(i, 0, 0) = F(i-1, 0, 0) - d\\ & F(0, j, 0) = F(0, j-1, 0) - d\\ & F(0, 0, k) = F(0, 0, k-1) - d\\ \\ & F(i, j, k) = max \begin{pmatrix} F(i-1, j-1, k-1) + s(c_i, c_j) + s(c_i, c_k) + s(c_j, c_k)\\ F(i, j-1, k-1) + s(c_j, c_k) - d\\ F(i-1, j, k-1) + s(c_i, c_k) - d\\ F(i-1, j-1, k) + s(c_i, c_j) - d\\ F(i, j, k-1) - 2d\\ F(i, j-1, k) - 2d\\ F(i-1, j, k) - 2d\\ \end{pmatrix} \end{align} $$However the complexity of this algorithm is much worse than for pairwise alignment. For pairwise alignment, remember that if aligning two sequences of lengths $m$ and $n$, the runtime of the algorithm will be proportional to $m \times n$. If $n$ is longer than or as long as $m$, we simplify the statement to say that the runtime of the algorithm will be be proportional to $n^2$. This curve has a pretty scary trajectory: runtime for pairwise alignment with dynamic programming is said to scale quadratically.
# first we need to do some configuration of the notebook, don't worry about this for now
from __future__ import division, print_function
from functools import partial
from IPython.core import page
page.page = print
import matplotlib.pyplot as plt
seq_lengths = range(25)
s2_times = [t ** 2 for t in range(25)]
plt.plot(range(25), s2_times)
plt.xlabel('Sequence Length')
plt.ylabel('Runtime (s)')
<matplotlib.text.Text at 0x10886b350>
The exponent in the $n^2$ term comes from the fact that, in pairwise alignment, if we assume our sequences are both of length $n$, there are $n \times n$ cells to fill in in the dynamic programming matrix. If we were to generalize either Smith-Waterman or Needleman-Wunsch to three sequences, we would need to create a 3 dimensional array to score and traceback the alignment. For sequences of length $n$, we would therefore have $n \times n \times n$ cells to fill in, and our runtime versus sequence length curve would look like the following.
s3_times = [t ** 3 for t in range(25)]
plt.plot(range(25), s3_times)
plt.xlabel('Sequence Length')
plt.ylabel('Runtime (s)')
<matplotlib.text.Text at 0x108861650>
That curve looks steeper than the curve for pairwise alignment, and the values on the y-axis are bigger, but it's not really clear how much of a problem this is until we plot runtime for three sequences in the context of the run times for pairwise alignment.
plt.plot(range(25), s2_times)
plt.plot(range(25), s3_times)
plt.xlabel('Sequence Length')
plt.ylabel('Runtime (s)')
<matplotlib.text.Text at 0x109209250>
And for four sequences:
s4_times = [t ** 4 for t in range(25)]
plt.plot(range(25), s2_times)
plt.plot(range(25), s3_times)
plt.plot(range(25), s4_times)
plt.xlabel('Sequence Length')
plt.ylabel('Runtime (s)')
<matplotlib.text.Text at 0x1092a8290>
We clearly have a problem here, and that is that the runtime for multiple sequence alignment using full dynamic programming algoriths grows exponentially with the number of sequences to be aligned. If $n$ is our sequence length, and $s$ is the number of sequences, that means that runtime is proportional to $n^s$. In pairwise alignment, $s$ is always equal to 2, so the problem is more manangeable. However, for the general case of $s$ sequences, we really can't even consider Smith-Waterman or Needleman-Wunsch for more than just a few sequences. The pattern in the plots above should illustrate why.
As we explored with database searching, we need to figure out how to align fewer sequences. This is where progressive alignment comes in.
In progressive alignment, the problem of exponential growth of runtime and space is managed by selectively aligning pairs of sequences, and aligning alignments of sequences. What we typically do is identify a pair of closely sequences, and align those. Then, we identify the next closely related sequence to that initial pair, and align that sequence to the alignment. This concept of aligning a sequence to an alignment is new, and we'll come back to it in just a few minutes. The other concept of identifying the most closely related sequences, and then the next most closely related sequence, and so on should sound familar. It effectively means that we're traversing a tree. And herein lies our problem: we need a tree to efficently align multiple sequences, but we need an alignment to build a good tree.
You probably have two burning questions in your mind right now:
We'll explore both of those through-out the rest of this notebook. First, let's cover the process of progressive multiple sequence alignment, just assuming for a moment that we know how to do both of those things.
The process of progressive multiple sequence alignment could look like the following. First, we start with some sequences and a tree representing the relationship between those sequences. We'll call this our guide tree, because it's going to guide us through the process of multiple sequence alignment. In progressive multiple sequence alignment, we build a multiple sequence alignment for each internal node of the tree, where the alignment at a given internal node contains all of the sequences in the clade defined by that node.
Starting from the root node, descend the bottom branch of the tree until you get to the an internal node. If an alignment hasn't been constructed for that node yet, continue descending the tree until to get to a pair of nodes. In this case, we follow the two branches to the tips. We then align the sequences at that pair of tips (usually with Needleman-Wunsch, for multiple sequence alignment), and assign that alignment to the node connecting those tips.
Next, we want to find what to align the resulting alignment to, so start from the root node and descend the top branch of the tree. When you get to the next node, determine if an alignment has already been created for that node. If not, our job is to build that alignment so we have something to align against. In this case, that means that we need to align s1
, s2
, and s3
. We can achieve this by aligning s1
and s3
first, to get the alignment at the internal node connecting them.
We can next align the alignment of s1
and s3
with s2
, to get the alignment at the internal node connecting those clades.
And finally, we can compute the alignment at the root node of the tree, by aligning the alignment of s1
, s2
, and s3
with the alignment of s4
and s5
.
The alignment at the root node is our multiple sequence alignment.
Let's address the first of our outstanding questions. I mentioned above that we need an alignment to build a good tree. The key word here is good. We can build a very rough tree - one that we would never want to present as representing the actual relationships between the sequences in question - without first aligning the sequences. Remember that building a UPGMA tree requires only a distance matrix, so if we can find a non-alignment-dependent way to compute distances between the sequences, we can build a rough UPGMA tree from them.
Let's compute distances between the sequences based on their word composition. We'll define a word here as k
adjacent characters in the sequence. We can then define a function that will return all of the words in a sequence as follows. These words can be defined as being overlapping, or non-overlapping. We'll go with overlapping for this example, as the more words we have, the better our guide tree should be.
from skbio import BiologicalSequence
%psource BiologicalSequence.k_words
def k_words(self, k, overlapping=True): """Get the list of words of length k Parameters ---------- k : int The word length. overlapping : bool, optional Defines whether the k-words should be overlapping or not overlapping. Returns ------- iterator of BiologicalSequences Iterator of words of length `k` contained in the BiologicalSequence. Raises ------ ValueError If k < 1. Examples -------- >>> from skbio.sequence import BiologicalSequence >>> s = BiologicalSequence('ACACGACGTT') >>> [str(kw) for kw in s.k_words(4, overlapping=False)] ['ACAC', 'GACG'] >>> [str(kw) for kw in s.k_words(3, overlapping=True)] ['ACA', 'CAC', 'ACG', 'CGA', 'GAC', 'ACG', 'CGT', 'GTT'] """ if k < 1: raise ValueError("k must be greater than 0.") sequence_length = len(self) if overlapping: step = 1 else: step = k for i in range(0, sequence_length - k + 1, step): yield self[i:i+k]
for e in BiologicalSequence("ACCGGTGACCAGTTGACCAGTA").k_words(3):
print(e)
ACC CCG CGG GGT GTG TGA GAC ACC CCA CAG AGT GTT TTG TGA GAC ACC CCA CAG AGT GTA
for e in BiologicalSequence("ACCGGTGACCAGTTGACCAGTA").k_words(7):
print(e)
ACCGGTG CCGGTGA CGGTGAC GGTGACC GTGACCA TGACCAG GACCAGT ACCAGTT CCAGTTG CAGTTGA AGTTGAC GTTGACC TTGACCA TGACCAG GACCAGT ACCAGTA
for e in BiologicalSequence("ACCGGTGACCAGTTGACCAGTA").k_words(3, overlapping=False):
print(e)
ACC GGT GAC CAG TTG ACC AGT
If we then have two sequences, we can compute the word counts for each and define a distance between the sequences as the fraction of words that are unique to either sequence.
from iab.algorithms import kmer_distance
%psource kmer_distance
def kmer_distance(sequence1, sequence2, k=3, overlapping=True): """Compute the kmer distance between a pair of sequences Parameters ---------- sequence1 : BiologicalSequence sequence2 : BiologicalSequence k : int, optional The word length. overlapping : bool, optional Defines whether the k-words should be overlapping or not overlapping. Returns ------- float Fraction of the set of k-mers from both sequence1 and sequence2 that are unique to either sequence1 or sequence2. Raises ------ ValueError If k < 1. Notes ----- k-mer counts are not incorporated in this distance metric. """ sequence1_kmers = set(sequence1.k_words(k, overlapping)) sequence2_kmers = set(sequence2.k_words(k, overlapping)) all_kmers = sequence1_kmers | sequence2_kmers shared_kmers = sequence1_kmers & sequence2_kmers number_unique = len(all_kmers) - len(shared_kmers) fraction_unique = number_unique / len(all_kmers) return fraction_unique
We can then use this as a distance function...
s1 = BiologicalSequence("ACCGGTGACCAGTTGACCAGT")
s2 = BiologicalSequence("ATCGGTACCGGTAGAAGT")
s3 = BiologicalSequence("GGTACCAAATAGAA")
print(s1.distance(s2, kmer_distance))
print(s1.distance(s3, kmer_distance))
0.75 0.857142857143
If we wanted to override the default to create (for example) a 5-mer distance function, we could use functools.partial
.
fivemer_distance = partial(kmer_distance, k=5)
s1 = BiologicalSequence("ACCGGTGACCAGTTGACCAGT")
s2 = BiologicalSequence("ATCGGTACCGGTAGAAGT")
s3 = BiologicalSequence("GGTACCAAATAGAA")
print(s1.distance(s2, fivemer_distance))
print(s1.distance(s3, fivemer_distance))
0.916666666667 1.0
We can now apply one of these functions to build a distance matrix for a set of sequences that we want to align.
from skbio import SequenceCollection
query_sequences = SequenceCollection(
[BiologicalSequence("ACCGGTGACCAGTTGACCAGT", "s1"),
BiologicalSequence("ATCGGTACCGGTAGAAGT", "s2"),
BiologicalSequence("GGTACCAAATAGAA", "s3"),
BiologicalSequence("GGCACCAAACAGAA", "s4"),
BiologicalSequence("GGCCCACTGAT", "s5")])
guide_dm = query_sequences.distances(kmer_distance)
print(guide_dm)
5x5 distance matrix IDs: 's1', 's2', 's3', 's4', 's5' Data: [[ 0. 0.75 0.85714286 0.85714286 0.89473684] [ 0.75 0. 0.61111111 0.86363636 1. ] [ 0.85714286 0.61111111 0. 0.66666667 0.95 ] [ 0.85714286 0.86363636 0.66666667 0. 0.83333333] [ 0.89473684 1. 0.95 0.83333333 0. ]]
scikit-bio also has some basic visualization functinality for these objects. For example, we can easily visualize this object as a heatmap.
fig = guide_dm.plot(cmap='Greens')
We can next use some functionality from SciPy to cluster the sequences with UPGMA, and print out a dendrogram.
from scipy.cluster.hierarchy import average, dendrogram, to_tree
for q in query_sequences:
print(q)
guide_lm = average(guide_dm.condensed_form())
guide_d = dendrogram(guide_lm, labels=guide_dm.ids, orientation='right',
link_color_func=lambda x: 'black')
guide_tree = to_tree(guide_lm)
ACCGGTGACCAGTTGACCAGT ATCGGTACCGGTAGAAGT GGTACCAAATAGAA GGCACCAAACAGAA GGCCCACTGAT
from iab.algorithms import guide_tree_from_sequences
%psource guide_tree_from_sequences
def guide_tree_from_sequences(sequences, distance_fn=kmer_distance, display_tree = False): guide_dm = sequences.distances(distance_fn) guide_lm = average(guide_dm.condensed_form()) guide_tree = to_tree(guide_lm) if display_tree: guide_d = dendrogram(guide_lm, labels=guide_dm.ids, orientation='right', link_color_func=lambda x: 'black') return guide_tree
t = guide_tree_from_sequences(query_sequences, display_tree=True)
We now have a guide tree, so we can move on to the next step of progressive alignment.
Next, we'll address our second burning question: aligning alignments. As illustrated above, there are basically three different types of pairwise alignment we need to support for progressive multiple sequenence alignment with Needleman-Wunsch. These are:
Standard Needleman-Wunsch supports the first, and it is very easy to generalize it to support the latter two. The only change that is necessary is in how the alignment of two non-gap characters is scored. Recall that we previously scored an alignment of two characters by looking up the score of subsitution from one to the other in a substitution matrix. To adapt this for aligning a sequence to an alignment, or for aligning an alignment to an alignment, we compute this subsitution as the average score of aligning the pairs of characters.
For example, if we want to align the alignment column from $aln1$:
A
C
to the alignment column from $aln2$:
T
G
we could compute the subsitution score using the matrix $m$ as:
$$ s = \frac{m[A][T] + m[A][G] + m[C][T] + m[C][G]}{aln1_{length} \times aln2_{length}} $$The following code adapts our implementation of Needleman-Wunsh to support aligning a sequence to an alignment, or aligning an alignment to an alignment.
from iab.algorithms import format_dynamic_programming_matrix, format_traceback_matrix
from skbio.alignment._pairwise import _compute_score_and_traceback_matrices
%psource _compute_score_and_traceback_matrices
def _compute_score_and_traceback_matrices( aln1, aln2, gap_open_penalty, gap_extend_penalty, substitution_matrix, new_alignment_score=-np.inf, init_matrices_f=_init_matrices_nw, penalize_terminal_gaps=True, gap_substitution_score=0): """Return dynamic programming (score) and traceback matrices. A note on the ``penalize_terminal_gaps`` parameter. When this value is ``False``, this function is no longer true Smith-Waterman/Needleman-Wunsch scoring, but when ``True`` it can result in biologically irrelevant artifacts in Needleman-Wunsch (global) alignments. Specifically, if one sequence is longer than the other (e.g., if aligning a primer sequence to an amplification product, or searching for a gene in a genome) the shorter sequence will have a long gap inserted. The parameter is ``True`` by default (so that this function computes the score and traceback matrices as described by the original authors) but the global alignment wrappers pass ``False`` by default, so that the global alignment API returns the result that users are most likely to be looking for. """ aln1_length = aln1.sequence_length() aln2_length = aln2.sequence_length() # cache some values for quicker/simpler access aend = _traceback_encoding['alignment-end'] match = _traceback_encoding['match'] vgap = _traceback_encoding['vertical-gap'] hgap = _traceback_encoding['horizontal-gap'] new_alignment_score = (new_alignment_score, aend) # Initialize a matrix to use for scoring the alignment and for tracing # back the best alignment score_matrix, traceback_matrix = init_matrices_f( aln1, aln2, gap_open_penalty, gap_extend_penalty) # Iterate over the characters in aln2 (which corresponds to the vertical # sequence in the matrix) for aln2_pos, aln2_chars in enumerate(aln2.iter_positions(str), 1): # Iterate over the characters in aln1 (which corresponds to the # horizontal sequence in the matrix) for aln1_pos, aln1_chars in enumerate(aln1.iter_positions(str), 1): # compute the score for a match/mismatch substitution_score = _compute_substitution_score( aln1_chars, aln2_chars, substitution_matrix, gap_substitution_score) diag_score = \ (score_matrix[aln2_pos-1, aln1_pos-1] + substitution_score, match) # compute the score for adding a gap in aln2 (vertical) if not penalize_terminal_gaps and (aln1_pos == aln1_length): # we've reached the end of aln1, so adding vertical gaps # (which become gaps in aln1) should no longer # be penalized (if penalize_terminal_gaps == False) up_score = (score_matrix[aln2_pos-1, aln1_pos], vgap) elif traceback_matrix[aln2_pos-1, aln1_pos] == vgap: # gap extend, because the cell above was also a gap up_score = \ (score_matrix[aln2_pos-1, aln1_pos] - gap_extend_penalty, vgap) else: # gap open, because the cell above was not a gap up_score = \ (score_matrix[aln2_pos-1, aln1_pos] - gap_open_penalty, vgap) # compute the score for adding a gap in aln1 (horizontal) if not penalize_terminal_gaps and (aln2_pos == aln2_length): # we've reached the end of aln2, so adding horizontal gaps # (which become gaps in aln2) should no longer # be penalized (if penalize_terminal_gaps == False) left_score = (score_matrix[aln2_pos, aln1_pos-1], hgap) elif traceback_matrix[aln2_pos, aln1_pos-1] == hgap: # gap extend, because the cell to the left was also a gap left_score = \ (score_matrix[aln2_pos, aln1_pos-1] - gap_extend_penalty, hgap) else: # gap open, because the cell to the left was not a gap left_score = \ (score_matrix[aln2_pos, aln1_pos-1] - gap_open_penalty, hgap) # identify the largest score, and use that information to populate # the score and traceback matrices best_score = _first_largest([new_alignment_score, left_score, diag_score, up_score]) score_matrix[aln2_pos, aln1_pos] = best_score[0] traceback_matrix[aln2_pos, aln1_pos] = best_score[1] return score_matrix, traceback_matrix
from skbio.alignment._pairwise import _traceback
%psource _traceback
def _traceback(traceback_matrix, score_matrix, aln1, aln2, start_row, start_col, gap_character='-'): # cache some values for simpler aend = _traceback_encoding['alignment-end'] match = _traceback_encoding['match'] vgap = _traceback_encoding['vertical-gap'] hgap = _traceback_encoding['horizontal-gap'] # initialize the result alignments aln1_sequence_count = aln1.sequence_count() aligned_seqs1 = [[] for e in range(aln1_sequence_count)] aln2_sequence_count = aln2.sequence_count() aligned_seqs2 = [[] for e in range(aln2_sequence_count)] current_row = start_row current_col = start_col best_score = score_matrix[current_row, current_col] current_value = None while current_value != aend: current_value = traceback_matrix[current_row, current_col] if current_value == match: for aligned_seq, input_seq in zip(aligned_seqs1, aln1): aligned_seq.append(str(input_seq[current_col-1])) for aligned_seq, input_seq in zip(aligned_seqs2, aln2): aligned_seq.append(str(input_seq[current_row-1])) current_row -= 1 current_col -= 1 elif current_value == vgap: for aligned_seq in aligned_seqs1: aligned_seq.append('-') for aligned_seq, input_seq in zip(aligned_seqs2, aln2): aligned_seq.append(str(input_seq[current_row-1])) current_row -= 1 elif current_value == hgap: for aligned_seq, input_seq in zip(aligned_seqs1, aln1): aligned_seq.append(str(input_seq[current_col-1])) for aligned_seq in aligned_seqs2: aligned_seq.append('-') current_col -= 1 elif current_value == aend: continue else: raise ValueError( "Invalid value in traceback matrix: %s" % current_value) for i in range(aln1_sequence_count): aligned_seq = ''.join(aligned_seqs1[i][::-1]) seq_id = _get_seq_id(aln1[i], str(i)) aligned_seqs1[i] = BiologicalSequence(aligned_seq, id=seq_id) for i in range(aln2_sequence_count): aligned_seq = ''.join(aligned_seqs2[i][::-1]) seq_id = _get_seq_id(aln2[i], str(i + aln1_sequence_count)) aligned_seqs2[i] = BiologicalSequence(aligned_seq, id=seq_id) return (aligned_seqs1, aligned_seqs2, best_score, current_col, current_row)
from skbio.alignment import global_pairwise_align_nucleotide
%psource global_pairwise_align_nucleotide
def global_pairwise_align_nucleotide(seq1, seq2, gap_open_penalty=5, gap_extend_penalty=2, match_score=1, mismatch_score=-2, substitution_matrix=None, penalize_terminal_gaps=False): """Globally align pair of nuc. seqs or alignments with Needleman-Wunsch Parameters ---------- seq1 : str, BiologicalSequence, or Alignment The first unaligned sequence(s). seq2 : str, BiologicalSequence, or Alignment The second unaligned sequence(s). gap_open_penalty : int or float, optional Penalty for opening a gap (this is substracted from previous best alignment score, so is typically positive). gap_extend_penalty : int or float, optional Penalty for extending a gap (this is substracted from previous best alignment score, so is typically positive). match_score : int or float, optional The score to add for a match between a pair of bases (this is added to the previous best alignment score, so is typically positive). mismatch_score : int or float, optional The score to add for a mismatch between a pair of bases (this is added to the previous best alignment score, so is typically negative). substitution_matrix: 2D dict (or similar) Lookup for substitution scores (these values are added to the previous best alignment score). If provided, this overrides ``match_score`` and ``mismatch_score``. penalize_terminal_gaps: bool, optional If True, will continue to penalize gaps even after one sequence has been aligned through its end. This behavior is true Needleman-Wunsch alignment, but results in (biologically irrelevant) artifacts when the sequences being aligned are of different length. This is ``False`` by default, which is very likely to be the behavior you want in all or nearly all cases. Returns ------- skbio.Alignment ``Alignment`` object containing the aligned sequences as well as details about the alignment. See Also -------- local_pairwise_align local_pairwise_align_protein local_pairwise_align_nucleotide skbio.alignment.local_pairwise_align_ssw global_pairwise_align global_pairwise_align_protein Notes ----- Default ``match_score``, ``mismatch_score``, ``gap_open_penalty`` and ``gap_extend_penalty`` parameters are derived from the NCBI BLAST Server [1]_. This function can be use to align either a pair of sequences, a pair of alignments, or a sequence and an alignment. References ---------- .. [1] http://blast.ncbi.nlm.nih.gov/Blast.cgi """ # use the substitution matrix provided by the user, or compute from # match_score and mismatch_score if a substitution matrix was not provided if substitution_matrix is None: substitution_matrix = \ make_identity_substitution_matrix(match_score, mismatch_score) else: pass return global_pairwise_align(seq1, seq2, gap_open_penalty, gap_extend_penalty, substitution_matrix, penalize_terminal_gaps=penalize_terminal_gaps)
For the sake of the examples below, I'm going to override one of the global_pairwise_align_nucleotide
defaults to penalize terminal gaps. This effectively tells the algorithm that we know we have a collection of sequences that are homologous from beginning to end.
global_pairwise_align_nucleotide = partial(global_pairwise_align_nucleotide, penalize_terminal_gaps=True)
For example, we can still use this code to align pairs of sequences (but note that we now need to pass those sequences in as a pair of one-item lists):
print(query_sequences[0])
print(query_sequences[1])
aln1 = global_pairwise_align_nucleotide(query_sequences[0], query_sequences[1])
print(aln1)
ACCGGTGACCAGTTGACCAGT ATCGGTACCGGTAGAAGT >s1 ACCGGTGACCAGTTGACCAGT >s2 ATCGGT-ACCGGTAGA--AGT
/Users/caporaso/.virtualenvs/iab/lib/python2.7/site-packages/skbio/alignment/_pairwise.py:540: EfficiencyWarning: You're using skbio's python implementation of Needleman-Wunsch alignment. This is known to be very slow (e.g., thousands of times slower than a native C implementation). We'll be adding a faster version soon (see https://github.com/biocore/scikit-bio/issues/254 to track progress on this). "to track progress on this).", EfficiencyWarning)
We can align that alignment to one of our other sequences.
print(query_sequences[2])
print(global_pairwise_align_nucleotide(aln1, query_sequences[2]))
GGTACCAAATAGAA >s1 ACCGGTGACCAGTTGACCAGT >s2 ATCGGT-ACCGGTAGA--AGT >s3 ---GGTACCAAATAGA--A--
Alternatively, we can align another pair of sequences:
aln2 = global_pairwise_align_nucleotide(query_sequences[2], query_sequences[3])
print(aln2)
>s3 GGTACCAAATAGAA >s4 GGCACCAAACAGAA
And then align that alignment against our previous alignment:
print(aln1)
print(aln2)
aln3 = global_pairwise_align_nucleotide(aln1, aln2)
print(aln3)
>s1 ACCGGTGACCAGTTGACCAGT >s2 ATCGGT-ACCGGTAGA--AGT >s3 GGTACCAAATAGAA >s4 GGCACCAAACAGAA >s1 ACCGGTGACCAGTTGACCAGT >s2 ATCGGT-ACCGGTAGA--AGT >s3 ---GGTACCAAATAGA--A-- >s4 ---GGCACCAAACAGA--A--
We can now combine all of these steps to take a set of query sequences, build a guide tree, perform progressive multiple sequence alignment, and return the guide tree (as a SciPy linkage matrix) and the alignment.
from skbio import TreeNode
guide_tree = TreeNode.from_linkage_matrix(guide_lm, guide_dm.ids)
We can view the guide tree in newick format as follows:
print(guide_tree)
(s5:0.45975877193,(s1:0.410714285714,(s4:0.382575757576,(s2:0.305555555556,s3:0.305555555556):0.0770202020202):0.0281385281385):0.0490444862155);
from iab.algorithms import progressive_msa
%psource progressive_msa
def progressive_msa(sequences, guide_tree, pairwise_aligner): """ Perform progressive msa of sequences Parameters ---------- sequences : skbio.SequenceCollection The sequences to be aligned. guide_tree : skbio.TreeNode The tree that should be used to guide the alignment process. pairwise_aligner : function Function that should be used to perform the pairwise alignments, for example skbio.Alignment.global_pairwise_align_nucleotide. Must support skbio.BiologicalSequence objects or skbio.Alignment objects as input. Returns ------- skbio.Alignment """ c1, c2 = guide_tree.children if c1.is_tip(): c1_aln = sequences[c1.name] else: c1_aln = progressive_msa(sequences, c1, pairwise_aligner) if c2.is_tip(): c2_aln = sequences[c2.name] else: c2_aln = progressive_msa(sequences, c2, pairwise_aligner) return pairwise_aligner(c1_aln, c2_aln)
msa = progressive_msa(query_sequences, guide_tree, pairwise_aligner=global_pairwise_align_nucleotide)
print(msa)
>s5 ---GGC--CCA-----CTGAT >s1 ACCGGTGACCAGTTGACCAGT >s4 ---GGCACCAAACAGA--A-- >s2 ATCGGTACC-GGTAGA--AGT >s3 ---GGTACCAAATAGA--A--
We can now build a (hopefully) improved tree from our multiple sequence alignment. First we'll look at our original distance matrix again, and then the distance matrix generted from the progressive multiple sequence alignment.
fig = guide_dm.plot(cmap='Greens')
msa_dm = msa.distances()
fig = msa_dm.plot(cmap='Greens')
The UPGMA trees that result from these alignments are very different. First we'll look at the guide tree, and then the tree resulting from the progressive multiple sequence alignment.
d = dendrogram(guide_lm, labels=guide_dm.ids, orientation='right',
link_color_func=lambda x: 'black')
msa_lm = average(msa_dm.condensed_form())
d = dendrogram(msa_lm, labels=msa_dm.ids, orientation='right',
link_color_func=lambda x: 'black')
And we can wrap this all up in a single convenience function:
from iab.algorithms import progressive_msa_and_tree
%psource progressive_msa_and_tree
def progressive_msa_and_tree(sequences, pairwise_aligner, sequence_distance_fn=kmer_distance, guide_tree=None, display_aln=False, display_tree=False): """ Perform progressive msa of sequences and build a UPGMA tree Parameters ---------- sequences : skbio.SequenceCollection The sequences to be aligned. pairwise_aligner : function Function that should be used to perform the pairwise alignments, for example skbio.Alignment.global_pairwise_align_nucleotide. Must support skbio.BiologicalSequence objects or skbio.Alignment objects as input. sequence_distance_fn : function Function that returns and skbio.DistanceMatrix given an skbio.SequenceCollection. This will be used to build a guide tree if one is not provided. guide_tree : skbio.TreeNode, optional The tree that should be used to guide the alignment process. display_aln : bool, optional Print the alignment before returning. display_tree : bool, optional Print the tree before returning. Returns ------- skbio.alignment skbio.TreeNode """ if guide_tree is None: guide_dm = sequences.distances(sequence_distance_fn) guide_lm = average(guide_dm.condensed_form()) guide_tree = TreeNode.from_linkage_matrix(guide_lm, guide_dm.ids) msa = progressive_msa(sequences, guide_tree, pairwise_aligner=pairwise_aligner) if display_aln: print(msa) msa_dm = msa.distances() msa_lm = average(msa_dm.condensed_form()) msa_tree = TreeNode.from_linkage_matrix(msa_lm, msa_dm.ids) if display_tree: print("\nOutput tree:") d = dendrogram(msa_lm, labels=msa_dm.ids, orientation='right', link_color_func=lambda x: 'black', leaf_font_size=24) return msa, msa_tree
msa, tree = progressive_msa_and_tree(query_sequences, pairwise_aligner=global_pairwise_align_nucleotide, display_tree=True, display_aln=True)
>s5 ---GGC--CCA-----CTGAT >s1 ACCGGTGACCAGTTGACCAGT >s4 ---GGCACCAAACAGA--A-- >s2 ATCGGTACC-GGTAGA--AGT >s3 ---GGTACCAAATAGA--A-- Output tree:
In an iterative alignment, the output tree from the above progressive alignment is used as a guide tree, and the full process repeated. This is performed to reduce errors that result from a low-quality guide tree.
from iab.algorithms import iterative_msa_and_tree
%psource iterative_msa_and_tree
def iterative_msa_and_tree(sequences, num_iterations, pairwise_aligner, sequence_distance_fn=kmer_distance, display_aln=False, display_tree=False): """ Perform progressive msa of sequences and build a UPGMA tree Parameters ---------- sequences : skbio.SequenceCollection The sequences to be aligned. num_iterations : int The number of iterations of progressive multiple sequence alignment to perform. Must be greater than zero and less than five. pairwise_aligner : function Function that should be used to perform the pairwise alignments, for example skbio.Alignment.global_pairwise_align_nucleotide. Must support skbio.BiologicalSequence objects or skbio.Alignment objects as input. sequence_distance_fn : function Function that returns and skbio.DistanceMatrix given an skbio.SequenceCollection. This will be used to build a guide tree. display_aln : bool, optional Print the alignment before returning. display_tree : bool, optional Print the tree before returning. Returns ------- skbio.alignment skbio.TreeNode """ if num_iterations > 5: raise ValueError("A maximum of five iterations is allowed." "You requested %d." % num_iterations) previous_iter_tree = None for i in range(num_iterations): if i == (num_iterations - 1): # only display the last iteration display = True else: display = False previous_iter_msa, previous_iter_tree = \ progressive_msa_and_tree(sequences, pairwise_aligner=pairwise_aligner, sequence_distance_fn=sequence_distance_fn, guide_tree=previous_iter_tree, display_aln=display_aln and display, display_tree=display_tree and display) return previous_iter_msa, previous_iter_tree
msa, tree = iterative_msa_and_tree(query_sequences, pairwise_aligner=global_pairwise_align_nucleotide, num_iterations=1, display_aln=True, display_tree=True)
>s5 ---GGC--CCA-----CTGAT >s1 ACCGGTGACCAGTTGACCAGT >s4 ---GGCACCAAACAGA--A-- >s2 ATCGGTACC-GGTAGA--AGT >s3 ---GGTACCAAATAGA--A-- Output tree:
msa, tree = iterative_msa_and_tree(query_sequences, pairwise_aligner=global_pairwise_align_nucleotide, num_iterations=2, display_aln=True, display_tree=True)
>s5 ---GGC--CCA-----CTGAT >s4 ---GGCACCAAACAGA----A >s3 ---GGTACCAAATAGA----A >s1 ACCGGTGACCAGTTGACCAGT >s2 ATCGGT-ACCGGTAGA--AGT Output tree:
msa, tree = iterative_msa_and_tree(query_sequences, pairwise_aligner=global_pairwise_align_nucleotide, num_iterations=3, display_aln=True, display_tree=True)
>s5 ---GGC--CCA-----CTGAT >s4 ---GGCACCAAACAGA----A >s3 ---GGTACCAAATAGA----A >s1 ACCGGTGACCAGTTGACCAGT >s2 ATCGGT-ACCGGTAGA--AGT Output tree:
msa, tree = iterative_msa_and_tree(query_sequences, pairwise_aligner=global_pairwise_align_nucleotide, num_iterations=5, display_aln=True, display_tree=True)
>s5 ---GGC--CCA-----CTGAT >s4 ---GGCACCAAACAGA----A >s3 ---GGTACCAAATAGA----A >s1 ACCGGTGACCAGTTGACCAGT >s2 ATCGGT-ACCGGTAGA--AGT Output tree: