from qiime.test import write_test_data
# A random 1% of the representative sequences from a de novo OTU
# picking run on the Moving Pictures of the Human Microbiome dataset
# (this is biased toward human microbiome OTUs)
input_seqs_fp = "moving-picture-rep-seqs.01.fna"
# sequences derived from my taxonomic assigner comparison:
# https://github.com/gregcaporaso/short-read-tax-assignment
# this is a randomly selected subset of GG 13_8 97% OTUs, sliced to the
# 515F/806R amplicon region. (this is limited in practical applicability
# because they are known and trusted reference sequences)
input_seqs_fp = "/Users/caporaso/code/short-read-tax-assignment/data/simulated-community/B1/rep_set.fna"
!rm -r core-set-aligned gg-aligned
!cp $input_seqs_fp seqs
rm: core-set-aligned: No such file or directory rm: gg-aligned: No such file or directory
/Users/caporaso/.virtualenvs/qiime/lib/python2.7/site-packages/matplotlib/__init__.py:1155: UserWarning: This call to matplotlib.use() has no effect because the backend has already been chosen; matplotlib.use() must be called *before* pylab, matplotlib.pyplot, or matplotlib.backends is imported for the first time. warnings.warn(_use_error_msg)
Run PyNAST, first using the core set and then using the Greengenes 13_8 85% reference alignment as the template. I picked 85% as it is most similar in number of sequences to the core set, and it doesn't perform much differently than 82% or 88% (which I tested with this same workflow).
!align_seqs.py -t /Users/caporaso/data/greengenes_core_sets/core_set_aligned.fasta.imputed -m pynast -o core-set-aligned -i seqs
!align_seqs.py -t /Users/caporaso/data/gg_13_8_otus/rep_set_aligned/85_otus.fasta -o gg-aligned -i seqs
Compute the md5sum for each of the resulting alignments. This indicates that there are differences in the resulting alignments.
!md5 core-set-aligned/seqs_aligned.fasta gg-aligned/seqs_aligned.fasta
MD5 (core-set-aligned/seqs_aligned.fasta) = a41f90e7faa0805c4402ba25147db587 MD5 (gg-aligned/seqs_aligned.fasta) = 2a5baa9e0f17e1e0527ee289e51767cb
from skbio import Alignment
a1 = Alignment.read('core-set-aligned/seqs_aligned.fasta')
a2 = Alignment.read('gg-aligned/seqs_aligned.fasta')
a1_ids = set(a1.ids())
a2_ids = set(a2.ids())
common_ids = a1_ids & a2_ids
print "Number of sequences retained in the core-set-aligned data, but dropped in the GG-aligned data:"
print len(a1_ids - a2_ids), a1_ids - a2_ids
print "Number of sequences retained in the GG-aligned data, but dropped in the core-set-aligneddata:"
print len(a2_ids - a1_ids), a2_ids - a1_ids
print "Number of sequences that are in both alignments:"
print len(common_ids)
Number of sequences retained in the core-set-aligned data, but dropped in the GG-aligned data: 114 set(['1138415', '620458', '787185', '4422781', '1133875', '1139993', '70116', '1137286', '805909', '580539', '637375', '4334501', '316473', '2376256', '241454', '4466525', '231442', '1123821', '294168', '1931229', '4351546', '112266', '1128568', '1147143', '1123941', '144181', '4395261', '1123374', '4419449', '1141371', '320667', '163153', '542199', '104706', '726766', '1123693', '4390557', '4413535', '575851', '262538', '4352288', '4434861', '185692', '146335', '3190394', '165845', '250116', '32422', '1138080', '252674', '181083', '168712', '834883', '104391', '328080', '140938', '161026', '4433573', '231374', '4370558', '1132594', '4341657', '805753', '1136495', '224123', '1139850', '3732401', '535932', '1148214', '612302', '4420556', '4431104', '4461578', '1140108', '526468', '4442702', '4409907', '835896', '170412', '3136117', '330263', '4299531', '4311115', '532740', '557526', '1137157', '1121115', '4423597', '4354640', '4385539', '812471', '3269889', '524865', '1134236', '322278', '4314389', '2772147', '93366', '4314942', '4459226', '4471196', '168172', '70586', '587064', '2209962', '560746', '563367', '571277', '4434859', '533634', '1141904', '620479', '1147974', '1113318']) Number of sequences retained in the GG-aligned data, but dropped in the core-set-aligneddata: 8 set(['4478591', '3836963', '951205', '669760', '3770699', '1638796', '1742076', '3383796']) Number of sequences that are in both alignments: 9780
a1 = a1.subalignment(seqs_to_keep=common_ids)
a2 = a2.subalignment(seqs_to_keep=common_ids)
a1.write('core-set-aligned/seqs_aligned.fasta')
a2.write('gg-aligned/seqs_aligned.fasta')
Apply the Lane mask (i.e., static entropy) filtering and gap filtering to the alignment.
!filter_alignment.py -i core-set-aligned/seqs_aligned.fasta -o core-set-aligned/
!filter_alignment.py -i gg-aligned/seqs_aligned.fasta -o gg-aligned/
Compute the md5sum for each of the resulting filtered alignments. This indicates that there are still differences in the resulting alignments.
!md5 core-set-aligned/seqs_aligned_pfiltered.fasta gg-aligned/seqs_aligned_pfiltered.fasta
MD5 (core-set-aligned/seqs_aligned_pfiltered.fasta) = 1c261ab57bdfefaa1a99fce00a365972 MD5 (gg-aligned/seqs_aligned_pfiltered.fasta) = bda31860b7b4f391b302ea06e871152d
Build phylogenetic trees with FastTree from the filtered alignments.
!make_phylogeny.py -i core-set-aligned/seqs_aligned_pfiltered.fasta -o core-set-aligned/seqs_aligned_pfiltered.tre
!make_phylogeny.py -i gg-aligned/seqs_aligned_pfiltered.fasta -o gg-aligned/seqs_aligned_pfiltered.tre
Load the trees into skbio.TreeNode objects, compute DistanceMatrices of tip-to-tip distances for each, and test whether those are correlated with a Mantel test. The fairly high Mantel test statistic, and significant p-value indicate that while the alignments may be different, they result in the similar phylogenetic trees. (Note that we don't know which of these trees is better based on this test).
from skbio import TreeNode
from skbio.stats.distance import mantel
t1 = TreeNode.read('core-set-aligned/seqs_aligned_pfiltered.tre')
t2 = TreeNode.read('gg-aligned/seqs_aligned_pfiltered.tre')
mantel(t1.tip_tip_distances(), t2.tip_tip_distances(), strict=False)
(0.9121502058163361, 0.001, 9780)