NOTE: This chapter is very basic at this stage, and needs work on text and content. It is slated for expansion. In the meantime, if you're looking for extra information on this topic I recommend Chapter 27 of Evolution (free), and Inferring Phylogeny by Felsenstein. You can add suggestions for content, note issues, and follow progress on its updates in issue #25.
What:
The process of inferring the evolutionary relationships between organisms or groups of organisms, and usually representing that information as a tree.
Why:
How:
By comparing traits of extant organisms. In our case, traits are columns in a multiple sequence alignment. Many algorithms and tools exist for achieving this, and they vary widely in runtime and quality of results. We're going to begin by learning about one of the oldest and simplest methods for doing this: Unweighted Pair Group Method with Arithmetic Mean or UPGMA. (Don't be scared by the name - it's actually fairly simple.)
UPGMA is a hierarchical clustering algorithm. It is widely used, though its application in phylogenetics is usually restricted to building preliminary trees to "guide" the process of multiple sequence alignment, as it makes some assumptions that don't work well for inferring relationships between organisms. We're going to start with it here, however, for a few reasons. First, the underlying math is very basic, so we don't need to assume anything about your background. Second, there are some other applications of UPGMA that we'll explore later, including grouping samples based on their species compositions. In general, my strategy with teaching this material is to start by giving you a basic introduction to how a process works so you can visualize and do it. From there, we can get more complex.
Here's the goal (well, a goal, but this is the basic idea):
Each leaf (or tip, or terminal node) in this tree represents a sequence, and the length of the horizontal branches between them indicates their dissimilarity to one another. This is a rooted tree, which means that it includes an assumption about the last common ancestor of all sequences represented in the tree.
Note that the vertical lines in this tree are used for layout purposes only - they do not represent dissimilarity between sequences.
An unrooted tree, like the following, doesn't include an assumption about the last common ancestor of all sequences:
Terminal nodes (tips, or leaves): extant organisms, frequently called operational taxonomic units or OTUs. OTUs are families of related organisms.
Internal nodes: hypothetical ancestors - we postulate their existence but often don't have direct evidence.
Clade: a node and all nodes "below" it (i.e., toward the tips).
Root: the internal node defining the clade that contains all nodes in the tree.
Branches: represent the distance between nodes.
Monophyletic group: a group whose last common ancestor was a member of the group (e.g., multicellular organisms).
Polyphyletic group: a group whose last common ancestor was not a member of the group (e.g., flying animals).
We assume that as time progresses, sequences will diverge such that more similar sequences have diverged more recently. The problem of phylogenetic reconstruction however is that we only have the tips. We don't have sequences for the internal nodes, and so we use modern sequences to develop a hypothesis about the evolutionary history of a sequence (and hopefully of the organisms who encode those sequences in their genomes).
How many (rooted) trees are there for n sequences? This topic is discussed in detail in Chapter 3 of Inferring Phylogenies, the definitive text on this topic, but the basic answer is: a lot.
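Concretely, the number of distinct rooted, bifurcating, labeled trees for n tips is the double factorial (2n - 3)!! = 1 × 3 × 5 × ... × (2n - 3), a standard result discussed in the Felsenstein chapter referenced above. A few lines of Python make it clear how quickly this blows up (ten sequences already admit over 34 million rooted trees):

```python
def num_rooted_trees(n):
    """Number of distinct rooted, bifurcating, labeled trees for n tips,
    given by the double factorial (2n - 3)!! = 1 * 3 * 5 * ... * (2n - 3)."""
    result = 1
    for i in range(3, 2 * n - 2, 2):
        result *= i
    return result

for n in [2, 5, 10, 20]:
    print(n, num_rooted_trees(n))
```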
Because of the massive number of possible trees for any reasonable number of sequences, we can't just search all trees to figure out which one best matches our data. Instead, we must take a heuristic approach to tree building. Some design features of heuristic methods that are used in practice are that they:
Computing a UPGMA tree for a group of sequences relies on first computing distances between each pair of those sequences. A distance, in this sense, is a technical term. It's a measure of dissimilarity between two items, x and y, which meets a few criteria:

d(x, y) >= 0 (non-negativity)
d(x, y) = 0 if and only if x = y (identity of indiscernibles)
d(x, y) = d(y, x) (symmetry)
d(x, z) <= d(x, y) + d(y, z) (triangle inequality)

Let's start with something more familiar than sequences: we'll compute the distances between points in a Cartesian plane, to explore what each of these criteria means.
from __future__ import print_function
from IPython.core import page
import matplotlib.pyplot as plt
page.page = print
x_vals = [1, 4, 2]
y_vals = [3, 3, 6]
coordinates = list(zip(x_vals, y_vals))  # materialize as a list so we can iterate over it more than once
fig, ax = plt.subplots()
ax.scatter(x_vals, y_vals)
ax.set_xlim(0)
ax.set_ylim(0)
from scipy.spatial.distance import euclidean
%psource euclidean
def euclidean(u, v):
    """
    Computes the Euclidean distance between two 1-D arrays.

    The Euclidean distance between 1-D arrays `u` and `v`, is defined as

    .. math::

       {||u-v||}_2

    Parameters
    ----------
    u : (N,) array_like
        Input array.
    v : (N,) array_like
        Input array.

    Returns
    -------
    euclidean : double
        The Euclidean distance between vectors `u` and `v`.

    """
    u = _validate_vector(u)
    v = _validate_vector(v)
    dist = norm(u - v)
    return dist
Does Euclidean distance appear to meet the above four requirements of a distance metric?
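One way to convince yourself is to spot-check the four properties numerically on our three points. This is a sanity check rather than a proof (the Euclidean distance is in fact a metric on any real vector space), but it makes each criterion concrete:

```python
from scipy.spatial.distance import euclidean

# the three points plotted above
a, b, c = (1, 3), (4, 3), (2, 6)

assert euclidean(a, b) >= 0                                  # non-negativity
assert euclidean(a, a) == 0                                  # identity of indiscernibles
assert euclidean(a, b) == euclidean(b, a)                    # symmetry
assert euclidean(a, c) <= euclidean(a, b) + euclidean(b, c)  # triangle inequality
```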
We can now compute the distances between all of these pairs of coordinates, and we can store these in what's known as a distance matrix.
dm = []
for c1 in coordinates:
row = []
for c2 in coordinates:
row.append(euclidean(c1, c2))
dm.append(row)
print(dm)
[[0.0, 3.0, 3.1622776601683795], [3.0, 0.0, 3.6055512754639891], [3.1622776601683795, 3.6055512754639891, 0.0]]
The scikit-bio project defines a DistanceMatrix object that we can use to work with this.
from skbio.stats.distance import DistanceMatrix
dm = DistanceMatrix(dm, list(map(str, range(3))))
One feature that this gets us is nicer printing.
print(dm)
3x3 distance matrix
IDs: '0', '1', '2'
Data:
[[ 0.          3.          3.16227766]
 [ 3.          0.          3.60555128]
 [ 3.16227766  3.60555128  0.        ]]
scikit-bio also provides a convenience wrapper that runs the SciPy pairwise distance measures for us and returns a DistanceMatrix object. We'll use that from here on.
from skbio.diversity.beta import pw_distances
help(pw_distances)
Help on function pw_distances in module skbio.diversity.beta._base:

pw_distances(counts, ids=None, metric='braycurtis')
    Compute distances between all pairs of columns in a counts matrix

    Parameters
    ----------
    counts : 2D array_like of ints or floats
        Matrix containing count/abundance data where each row contains
        counts of observations in a given sample.
    ids : iterable of strs, optional
        Identifiers for each sample in ``counts``.
    metric : str, optional
        The name of the pairwise distance function to use when generating
        pairwise distances. See the scipy ``pdist`` docs, linked under
        *See Also*, for available metrics.

    Returns
    -------
    skbio.DistanceMatrix
        Distances between all pairs of samples (i.e., rows). The number of
        row and columns will be equal to the number of rows in ``counts``.

    Raises
    ------
    ValueError
        If ``len(ids) != len(counts)``.

    See Also
    --------
    scipy.spatial.distance.pdist
    pw_distances_from_table
dm = pw_distances(coordinates, metric="euclidean")
print(dm)
3x3 distance matrix
IDs: '0', '1', '2'
Data:
[[ 0.          3.          3.16227766]
 [ 3.          0.          3.60555128]
 [ 3.16227766  3.60555128  0.        ]]
The conditions of a distance matrix listed above lead to a few specific features: the distance matrix is symmetric (if you flip the upper triangle over the diagonal, the values are the same as those in the lower triangle), the distance matrix is hollow (the diagonal is all zeros), and there are no negative values.
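These three features are easy to verify programmatically. Here's a small check against the matrix we just computed (values copied from the output above):

```python
import numpy as np

# the 3x3 Euclidean distance matrix computed above
dm = np.array([[0.0, 3.0, 3.16227766],
               [3.0, 0.0, 3.60555128],
               [3.16227766, 3.60555128, 0.0]])

assert np.allclose(dm, dm.T)     # symmetric: upper and lower triangles mirror each other
assert np.all(np.diag(dm) == 0)  # hollow: the diagonal is all zeros
assert np.all(dm >= 0)           # no negative values
```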
Most often, distances between pairs of sequences are derived from a multiple sequence alignment. These differ from the pairwise alignments that we've looked at thus far, but use the same underlying algorithms (and we'll be coming back to this in the next chapter).
Let's load up some aligned sequences, and compute a distance matrix. For now, we'll compute distances between the sequences using the hamming function that we worked with in the pairwise alignment chapter.
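As a refresher, the Hamming distance here is just the fraction of alignment columns at which two sequences differ. A minimal sketch, mirroring the behavior of scipy.spatial.distance.hamming on equal-length sequences:

```python
def hamming(s1, s2):
    """Fraction of positions at which two equal-length sequences differ."""
    if len(s1) != len(s2):
        raise ValueError("Sequences must be the same length.")
    return sum(a != b for a, b in zip(s1, s2)) / len(s1)

# s1 and s2 from the alignment below differ at 4 of their 16 positions
print(hamming('ACCGTGAAGCCAATAC', 'A-CGTGCAACCATTAC'))  # 0.25
```

Note that a gap character is simply treated as one more symbol that can mismatch.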
from skbio import Alignment, DNA
aln = Alignment([DNA('ACCGTGAAGCCAATAC', 's1'),
DNA('A-CGTGCAACCATTAC', 's2'),
DNA('AGCGTGCAGCCAATAC', 's3'),
DNA('AGGGTGCCGC-AATAC', 's4'),
DNA('AGGGTGCCAC-AATAC', 's5')])
%psource aln.distances
def distances(self, distance_fn=None):
    """Compute distances between all pairs of sequences

    Parameters
    ----------
    distance_fn : function, optional
        Function for computing the distance between a pair of sequences.
        This must take two sequences as input (as
        `skbio.sequence.BiologicalSequence` objects) and return a single
        integer or float value. Defaults to
        `scipy.spatial.distance.hamming`.

    Returns
    -------
    skbio.DistanceMatrix
        Matrix containing the distances between all pairs of sequences.

    Raises
    ------
    skbio.util.exception.BiologicalSequenceError
        If ``len(self) != len(other)`` and ``distance_fn`` ==
        ``scipy.spatial.distance.hamming``.

    See Also
    --------
    skbio.DistanceMatrix
    scipy.spatial.distance.hamming

    Examples
    --------
    >>> from skbio.alignment import Alignment
    >>> from skbio.sequence import DNA
    >>> seqs = [DNA("A-CCGGG", id="s1"),
    ...         DNA("ATCC--G", id="s2"),
    ...         DNA("ATCCGGA", id="s3")]
    >>> a1 = Alignment(seqs)
    >>> print(a1.distances())
    3x3 distance matrix
    IDs: 's1', 's2', 's3'
    Data:
    [[ 0.          0.42857143  0.28571429]
     [ 0.42857143  0.          0.42857143]
     [ 0.28571429  0.42857143  0.        ]]

    """
    return super(Alignment, self).distances(distance_fn)
master_dm = aln.distances()
print(master_dm)
5x5 distance matrix
IDs: 's1', 's2', 's3', 's4', 's5'
Data:
[[ 0.      0.25    0.125   0.3125  0.375 ]
 [ 0.25    0.      0.1875  0.375   0.3125]
 [ 0.125   0.1875  0.      0.1875  0.25  ]
 [ 0.3125  0.375   0.1875  0.      0.0625]
 [ 0.375   0.3125  0.25    0.0625  0.    ]]
Once we have these distances, we can cluster the sequences based on their similarities/dissimilarities. This is the first process that we'll explore for tree building.
NOTE: The example below assumes that each value in this distance matrix is multiplied by the sequence length, so we'll do that here and work with the resulting distance matrix.
master_dm = DistanceMatrix(master_dm.data*16, master_dm.ids)
print(master_dm)
5x5 distance matrix
IDs: 's1', 's2', 's3', 's4', 's5'
Data:
[[ 0.  4.  2.  5.  6.]
 [ 4.  0.  3.  6.  5.]
 [ 2.  3.  0.  3.  4.]
 [ 5.  6.  3.  0.  1.]
 [ 6.  5.  4.  1.  0.]]
Unweighted Pair Group Method with Arithmetic Mean
Unweighted: all tip-to-tip distances contribute equally
Pair-group: all branch points lead to exactly two clades
Arithmetic mean: distances to each clade are the mean of distances to all members of that clade
Let's start, working from the above distance matrix.
Step 1.1: The smallest distance in the above matrix is 1.0, between s4 and s5. So, we'll draw that clade and set each branch length to 0.5.
Step 1.2: Next, we'll create a new, smaller distance matrix where the sequences s4 and s5 are now represented by a single clade, (s4, s5).
iter1_ids = ['s1', 's2', 's3', '(s4, s5)']
iter1_dm = [[0.0, 4.0, 2.0, None],
[4.0, 0.0, 3.0, None],
[2.0, 3.0, 0.0, None],
[None, None, None, None]]
Step 1.3: We'll now fill in the values from the new clade to each of the existing sequences (or clades). The distance from each pre-existing sequence (or clade) to the new clade is the mean of its distances to each of the sequences in the new clade. For example, the distance between s1 and (s4, s5) is the mean of the distance between s1 and s4 and the distance between s1 and s5:
import numpy as np
np.mean([master_dm[0][3], master_dm[0][4]])
5.5
Step 1.3 (continued): Similarly, the distance between s2 and (s4, s5) is the mean of the distance between s2 and s4 and the distance between s2 and s5:
np.mean([master_dm[1][3], master_dm[1][4]])
5.5
Step 1.3 (continued): And finally, the distance between s3 and (s4, s5) is the mean of the distance between s3 and s4 and the distance between s3 and s5:
np.mean([master_dm[2][3], master_dm[2][4]])
3.5
Step 1.3 (continued): If we fill these values in (note that they will be the same above and below the diagonal) the post-iteration 1 distance matrix looks like the following:
iter1_dm = [[0.0, 4.0, 2.0, 5.5],
[4.0, 0.0, 3.0, 5.5],
[2.0, 3.0, 0.0, 3.5],
[5.5, 5.5, 3.5, 0.0]]
iter1_dm = DistanceMatrix(iter1_dm, iter1_ids)
print(iter1_dm)
4x4 distance matrix
IDs: 's1', 's2', 's3', '(s4, s5)'
Data:
[[ 0.   4.   2.   5.5]
 [ 4.   0.   3.   5.5]
 [ 2.   3.   0.   3.5]
 [ 5.5  5.5  3.5  0. ]]
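The iteration 1 matrix can also be derived programmatically from the master matrix, which is a useful check on the hand calculation. A sketch using plain NumPy (indices 3 and 4 correspond to s4 and s5):

```python
import numpy as np

# the master distance matrix (hamming distances scaled by sequence length)
master = np.array([[0, 4, 2, 5, 6],
                   [4, 0, 3, 6, 5],
                   [2, 3, 0, 3, 4],
                   [5, 6, 3, 0, 1],
                   [6, 5, 4, 1, 0]], dtype=float)

clade, others = [3, 4], [0, 1, 2]                 # (s4, s5) and s1, s2, s3
iter1 = np.zeros((4, 4))
iter1[:3, :3] = master[np.ix_(others, others)]    # distances among s1, s2, s3
for row, i in enumerate(others):                  # mean distance to (s4, s5)
    iter1[row, 3] = iter1[3, row] = master[i, clade].mean()
print(iter1)
```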
Step 1.4: There is still more than one value below the diagonal, so we start a new iteration.
Step 2.1: The smallest distance in the above matrix is 2.0, between s1 and s3. So, we'll draw that clade and set each branch length to 1.0.
Step 2.2: Next, we'll create a new, smaller distance matrix where the sequences s1 and s3 are now represented by a single clade, (s1, s3).
iter2_ids = ['(s1, s3)', 's2', '(s4, s5)']
iter2_dm = [[None, None, None],
[None, 0.0, 5.5],
[None, 5.5, 0.0]]
Step 2.3: We'll now fill in the values from the new clade to each of the existing sequences (or clades). The distance from each is the mean of its distances to each of the sequences in the new clade. For example, the distance between s2 and (s1, s3) is the mean of the distance between s2 and s1 and the distance between s2 and s3:
np.mean([master_dm[1][0], master_dm[1][2]])
3.5
Step 2.3 (continued): Next, we need to find the distance between (s1, s3) and (s4, s5). This is computed as the mean of the distance between s1 and s4, the distance between s1 and s5, the distance between s3 and s4, and the distance between s3 and s5. Note that we are going back to our master distance matrix here for these distances, not our iteration 1 distance matrix.
np.mean([master_dm[0][3], master_dm[0][4], master_dm[2][3], master_dm[2][4]])
4.5
Step 2.3 (continued): We can now fill in all of the distances in our iteration 2 distance matrix.
iter2_dm = [[0.0, 3.5, 4.5],
[3.5, 0.0, 5.5],
[4.5, 5.5, 0.0]]
iter2_dm = DistanceMatrix(iter2_dm, iter2_ids)
print(iter2_dm)
3x3 distance matrix
IDs: '(s1, s3)', 's2', '(s4, s5)'
Data:
[[ 0.   3.5  4.5]
 [ 3.5  0.   5.5]
 [ 4.5  5.5  0. ]]
Step 2.4: There is still more than one value below the diagonal, so we start a new iteration.
Step 3.1: The smallest distance in the above matrix is 3.5, between (s1, s3) and s2. So, we'll draw that clade and set each branch length to 1.75.
Step 3.2: Next, we'll create a new, smaller distance matrix where the clade (s1, s3) and the sequence s2 are now represented by a single clade, ((s1, s3), s2).
iter3_ids = ['((s1, s3), s2)', '(s4, s5)']
iter3_dm = [[None, None],
[None, 0.0]]
Step 3.3: We'll now fill in the value from the new clade to the one remaining clade. This is computed as the mean of the distance between s1 and s4, the distance between s1 and s5, the distance between s3 and s4, the distance between s3 and s5, the distance between s2 and s4, and the distance between s2 and s5. Again, note that we are going back to our master distance matrix here for these distances, not our iteration 1 or iteration 2 distance matrices.
np.mean([master_dm[0][3], master_dm[0][4], master_dm[2][3], master_dm[2][4], master_dm[1][3], master_dm[1][4]])
4.833333333333333
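This same value can be obtained without returning to the master matrix, by taking a size-weighted average of the clade-to-clade distances from the iteration 2 matrix. This is the standard UPGMA update rule, d((A ∪ B), C) = (|A| × d(A, C) + |B| × d(B, C)) / (|A| + |B|), and it's why the method scales better than recomputing every mean from scratch:

```python
# sizes of the two clades being merged: (s1, s3) has two tips, s2 has one
n_s1s3, n_s2 = 2, 1

# distances to (s4, s5), taken from the iteration 2 matrix
d_s1s3_to_s4s5 = 4.5
d_s2_to_s4s5 = 5.5

# size-weighted mean: identical to the mean over all six pairwise distances
d = (n_s1s3 * d_s1s3_to_s4s5 + n_s2 * d_s2_to_s4s5) / (n_s1s3 + n_s2)
print(d)  # 4.833333333333333
```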
Step 3.3 (continued): We can now fill in all of the distances in our iteration 3 distance matrix (rounding 4.83 to 4.8).
iter3_dm = [[0.0, 4.8],
[4.8, 0.0]]
iter3_dm = DistanceMatrix(iter3_dm, iter3_ids)
print(iter3_dm)
2x2 distance matrix
IDs: '((s1, s3), s2)', '(s4, s5)'
Data:
[[ 0.   4.8]
 [ 4.8  0. ]]
Step 3.4: At this stage, there is only one distance below the diagonal in our distance matrix. So, we can use that distance to draw the final branch in our tree, so that the total distance between the two clades is 4.8 (i.e., every tip ends up at a distance of 2.4 from the root).
# scipy.cluster.hierarchy.average is an implementation of UPGMA
from scipy.cluster.hierarchy import average, dendrogram
lm = average(master_dm.condensed_form())
d = dendrogram(lm, labels=master_dm.ids, orientation='right',
link_color_func=lambda x: 'black')
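The full iterative procedure we just walked through by hand can also be sketched in pure Python. This is a teaching sketch, not an efficient implementation: it tracks only the tree topology (as nested tuples), not branch lengths, and applies the size-weighted mean update rule at each merge:

```python
import numpy as np

def upgma(dm, ids):
    """A minimal UPGMA sketch. dm is a symmetric, hollow distance matrix
    (2-D array_like); ids are the tip labels. Returns the tree topology
    as a nested tuple."""
    dm = np.array(dm, dtype=float)
    clusters = [(ids[i], 1) for i in range(len(ids))]  # (label, size) pairs
    active = list(range(len(ids)))
    while len(active) > 1:
        # find the closest pair of active clusters
        best = None
        for x in range(len(active)):
            for y in range(x + 1, len(active)):
                i, j = active[x], active[y]
                if best is None or dm[i, j] < best[0]:
                    best = (dm[i, j], i, j)
        _, i, j = best
        (label_i, n_i), (label_j, n_j) = clusters[i], clusters[j]
        # size-weighted mean distances from the merged clade to the rest
        for k in active:
            if k != i and k != j:
                dm[i, k] = dm[k, i] = (n_i * dm[i, k] + n_j * dm[j, k]) / (n_i + n_j)
        clusters[i] = ((label_i, label_j), n_i + n_j)
        active.remove(j)
    return clusters[active[0]][0]

tree = upgma([[0, 4, 2, 5, 6],
              [4, 0, 3, 6, 5],
              [2, 3, 0, 3, 4],
              [5, 6, 3, 0, 1],
              [6, 5, 4, 1, 0]],
             ['s1', 's2', 's3', 's4', 's5'])
print(tree)  # ((('s1', 's3'), 's2'), ('s4', 's5'))
```

Running this on our master matrix reproduces the merge order from the walkthrough: (s4, s5) first, then (s1, s3), then s2 joining (s1, s3), and finally the root.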
The material in this section was compiled while consulting the following sources: