In [1]:

from qiime.test import write_test_data

# A random 1% of the representative sequences from a de novo OTU
# picking run on the Moving Pictures of the Human Microbiome dataset
# (this is biased toward human microbiome OTUs)
input_seqs_fp = "moving-picture-rep-seqs.01.fna"

# sequences derived from my taxonomic assigner comparison:
# https://github.com/gregcaporaso/short-read-tax-assignment
# this is a randomly selected subset of GG 13_8 97% OTUs, sliced to the 
# 515F/806R amplicon region. (this is limited in practical applicability 
# because they are known and trusted reference sequences)
input_seqs_fp = "/Users/caporaso/code/short-read-tax-assignment/data/simulated-community/B1/rep_set.fna"
!rm -r core-set-aligned gg-aligned
!cp $input_seqs_fp seqs

rm: core-set-aligned: No such file or directory
rm: gg-aligned: No such file or directory

/Users/caporaso/.virtualenvs/qiime/lib/python2.7/site-packages/matplotlib/__init__.py:1155: UserWarning:  This call to matplotlib.use() has no effect
because the backend has already been chosen;
matplotlib.use() must be called *before* pylab, matplotlib.pyplot,
or matplotlib.backends is imported for the first time.

  warnings.warn(_use_error_msg)

Run PyNAST, first using the core set and then using the Greengenes 13_8 85% reference alignment as the template. I picked 85% as it is most similar in number of sequences to the core set, and it doesn't perform much differently than 82% or 88% (which I tested with this same workflow).

In [2]:

!align_seqs.py -t /Users/caporaso/data/greengenes_core_sets/core_set_aligned.fasta.imputed -m pynast -o core-set-aligned -i seqs

In [3]:

!align_seqs.py -t /Users/caporaso/data/gg_13_8_otus/rep_set_aligned/85_otus.fasta -o gg-aligned -i seqs

Compute the md5sum for each of the resulting alignments. This indicates that there are differences in the resulting alignments.

In [4]:

!md5 core-set-aligned/seqs_aligned.fasta gg-aligned/seqs_aligned.fasta

MD5 (core-set-aligned/seqs_aligned.fasta) = a41f90e7faa0805c4402ba25147db587
MD5 (gg-aligned/seqs_aligned.fasta) = 2a5baa9e0f17e1e0527ee289e51767cb

In [5]:

from skbio import Alignment
a1 = Alignment.read('core-set-aligned/seqs_aligned.fasta')
a2 = Alignment.read('gg-aligned/seqs_aligned.fasta')

a1_ids = set(a1.ids())
a2_ids = set(a2.ids())
common_ids = a1_ids & a2_ids

print "Number of sequences retained in the core-set-aligned data, but dropped in the GG-aligned data:"
print len(a1_ids - a2_ids), a1_ids - a2_ids
print "Number of sequences retained in the GG-aligned data, but dropped in the core-set-aligneddata:"
print len(a2_ids - a1_ids), a2_ids - a1_ids
print "Number of sequences that are in both alignments:"
print len(common_ids)

Number of sequences retained in the core-set-aligned data, but dropped in the GG-aligned data:
114 set(['1138415', '620458', '787185', '4422781', '1133875', '1139993', '70116', '1137286', '805909', '580539', '637375', '4334501', '316473', '2376256', '241454', '4466525', '231442', '1123821', '294168', '1931229', '4351546', '112266', '1128568', '1147143', '1123941', '144181', '4395261', '1123374', '4419449', '1141371', '320667', '163153', '542199', '104706', '726766', '1123693', '4390557', '4413535', '575851', '262538', '4352288', '4434861', '185692', '146335', '3190394', '165845', '250116', '32422', '1138080', '252674', '181083', '168712', '834883', '104391', '328080', '140938', '161026', '4433573', '231374', '4370558', '1132594', '4341657', '805753', '1136495', '224123', '1139850', '3732401', '535932', '1148214', '612302', '4420556', '4431104', '4461578', '1140108', '526468', '4442702', '4409907', '835896', '170412', '3136117', '330263', '4299531', '4311115', '532740', '557526', '1137157', '1121115', '4423597', '4354640', '4385539', '812471', '3269889', '524865', '1134236', '322278', '4314389', '2772147', '93366', '4314942', '4459226', '4471196', '168172', '70586', '587064', '2209962', '560746', '563367', '571277', '4434859', '533634', '1141904', '620479', '1147974', '1113318'])
Number of sequences retained in the GG-aligned data, but dropped in the core-set-aligneddata:
8 set(['4478591', '3836963', '951205', '669760', '3770699', '1638796', '1742076', '3383796'])
Number of sequences that are in both alignments:
9780

In [6]:

a1 = a1.subalignment(seqs_to_keep=common_ids)
a2 = a2.subalignment(seqs_to_keep=common_ids)
a1.write('core-set-aligned/seqs_aligned.fasta')
a2.write('gg-aligned/seqs_aligned.fasta')

Apply the Lane mask (i.e., static entropy) filtering and gap filtering to the alignment.

In [7]:

!filter_alignment.py -i core-set-aligned/seqs_aligned.fasta -o core-set-aligned/

In [8]:

!filter_alignment.py -i gg-aligned/seqs_aligned.fasta -o gg-aligned/

Compute the md5sum for each of the resulting filtered alignments. This indicates that there are still differences in the resulting alignments.

In [9]:

!md5 core-set-aligned/seqs_aligned_pfiltered.fasta  gg-aligned/seqs_aligned_pfiltered.fasta 

MD5 (core-set-aligned/seqs_aligned_pfiltered.fasta) = 1c261ab57bdfefaa1a99fce00a365972
MD5 (gg-aligned/seqs_aligned_pfiltered.fasta) = bda31860b7b4f391b302ea06e871152d

Build phylogenetic trees with FastTree from the filtered alignments.

In [10]:

!make_phylogeny.py -i core-set-aligned/seqs_aligned_pfiltered.fasta -o core-set-aligned/seqs_aligned_pfiltered.tre

In [11]:

!make_phylogeny.py -i gg-aligned/seqs_aligned_pfiltered.fasta -o gg-aligned/seqs_aligned_pfiltered.tre

Load the trees into skbio.TreeNode objects, compute DistanceMatrices of tip-to-tip distances for each, and test whether those are correlated with a Mantel test. The fairly high Mantel test statistic, and significant p-value indicate that while the alignments may be different, they result in the similar phylogenetic trees. (Note that we don't know which of these trees is better based on this test).

In [12]:

from skbio import TreeNode
from skbio.stats.distance import mantel

t1 = TreeNode.read('core-set-aligned/seqs_aligned_pfiltered.tre')
t2 = TreeNode.read('gg-aligned/seqs_aligned_pfiltered.tre')

mantel(t1.tip_tip_distances(), t2.tip_tip_distances(), strict=False)

Out[12]:

(0.9121502058163361, 0.001, 9780)

In [12]: