The American Gut Beyond Bacteria samples are a perk that go much deeper than the 16S prep used for the majority of the samples in the AG dataset. Specifically, in addition to 16S, these samples also undergo whole genome shotgun sequencing, fungal sequencing, 18S sequencing and viromics preps.
Whole genome shotgun sequencing allows us to gather information across the genomes represented within the sample, and to infer functional potential. The added sequence data also helps to refine specific observed species in the samples allowing for a higher level of specificity relative to 16S. The basic strategy is to take all of the small sequences, and compare them to a well-characterized reference database of previously observed genes.
Fungal sequencing is similar to 16S in that a single part of each genome is targeted, in this case, the internal transcribed spacer region of the ribosomal operon. Fungal sequencing is difficult as the ribosomal genes are not specific enough to resolve differences between fungi (e.g., they're evolving faster than the ribosomal genes). As a result, a common practice is to instead target the ITS region which evolves very fast. The downside is that, since the ITS region is not conserved and evolves so quickly, it is not feasible to align it and produce a phylogenetic tree which can hinder downstream analysis.
18S sequencing is like 16S, except that it will amplify microbial eukaryotes. The samples we've processed so far for Beyond Bacteria have not ampified 18S, however. In other words, we haven't been able to detect any microbial eukaryotes.
The virome is an analogous data set to the whole genome shotgun one except that additional sample prep work is performed to help isolate virus like particles. The challenge then is to infer what "species" of virus are represented using a characterized reference database.
Within this Notebook, we're operating on data that has already been preprocessed (e.g., sequences mapped to their respective references).
%matplotlib inline
First, let's load up the precomputed tables. We're only exploring WGS at this time as processing is still underway for ITS and virome. The 16S data will be incorporated into the regular 16S pipeline used for the rest of the American Gut samples. The tables being operated on were produced using Metaphlan and HUMAnN by our collaborator, Dr. Joe Petrosino.
from biom import load_table
from skbio import TreeNode
wgs_taxa = load_table('../data/AG/BeyondBacteria/AmericanGutMetaphlanWGS.biom').sort()
wgs_pathway = load_table('../data/AG/BeyondBacteria/PathwayTable.biom').sort()
# For consistency purposes, let's also update the sample IDs. The site that processed
# the Beyond Bacteria samples stripped the leading zeros of the barcodes, however,
# the rest of the American Gut data contain those zeros.
# IDs are of the form AG-1234
_ = wgs_taxa.update_ids({k: "%.9d" % int(k.split('-', 1)[1]) for k in wgs_taxa.ids()})
_ = wgs_pathway.update_ids({k: "%.9d" % int(k.split('-', 1)[1]) for k in wgs_pathway.ids()})
# Need to update the sample indices. This is a workaround for an cache invalidation bug
# https://github.com/biocore/biom-format/issues/599
wgs_taxa._index_ids()
wgs_pathway._index_ids()
Next, we're going to load a little more contextual information about what is contained in these tables. The taxonomy tree is just that: the bacterial taxonomy (though limited to just those organisms in the tree). This will let us order subsequent plots by taxonomy. We're also going to load up pathway names, which are more informative than just the raw IDs contained in the table.
wgs_taxa_tree = TreeNode.from_newick(open('../data/AG/BeyondBacteria/AmericanGutMetaphlanWGS.taxonomy.tree'))
taxa_order = [n.name for n in wgs_taxa_tree.tips()]
wgs_pathway_names = {}
for line in open('../data/AG/BeyondBacteria/PathwayTable.names.txt'):
line = line.strip()
module_id, _ = line.split(' ', 1)
wgs_pathway_names[module_id] = line
Next, let's go ahead and order the WGS taxonomy table and "augment" the IDs in the pathways table.
wgs_taxa_ordered = wgs_taxa.sort_order(taxa_order, axis='observation')
_ = wgs_pathway.update_ids(wgs_pathway_names, axis='observation')
The precomputed pathway table is not in relative abundance, so we will quickly convert it.
wgs_pathway_rel = wgs_pathway.norm(inplace=False).transform(lambda v, i, md: v * 100)
Now, let's define a helper function to form a heatmap. We're going to use a log transform on the data as well to add a little "pop".
def heatmap(table, title, transform=lambda x: x):
# Adapted from http://www.bertplot.com/visualization/?p=292
fig, ax = subplots(figsize=(15,25))
ax.pcolor(transform(table.matrix_data.toarray()), cmap=plt.cm.Reds)
ax.set_xticks(arange(table.ids().size) + 0.5)
ax.set_yticks(arange(table.ids(axis='observation').size) + 0.5)
ax.set_xticklabels(table.ids())
ax.set_yticklabels(table.ids(axis='observation'))
ax.xaxis.tick_top()
ax.yaxis.tick_left()
plt.text(0.5, 1.02, title, fontsize=25, horizontalalignment='center', transform=ax.transAxes)
fig.tight_layout()
And last (for setup), lets write a small method that can form hyperlinks to make it easy to track down additional information about any of the observations. We're only going to show the top few observations by relative abundance per sample.
from IPython.core.display import HTML
def linkify(url_fmt, query, name, abund):
url = url_fmt.format(query.replace(' ', '+'))
return ' %0.2f%% <a href=%s target="_blank">%s</a></br>' % (abund, url, name)
google_scholar_url_fmt = "http://scholar.google.com/scholar?hl=en&as_sdt=1,5&as_vis=1&q=%22{}%22"
kegg_module_url_fmt = "http://www.genome.jp/dbget-bin/www_bget?md:{}"
def table_links(table, url_fmt, name_transform=lambda x: x, id_tag=''):
TOP_N_HITS = 10
content = []
for id_ in table.ids():
content.append("<a id='%s_%s'></a>%s</br>" % (id_, id_tag, id_))
abund_and_id = zip(table.data(id_), table.ids(axis='observation'))
for v, name in sorted(abund_and_id, reverse=True)[:TOP_N_HITS]:
if v > 0:
query, name = name_transform(name)
content.append(linkify(url_fmt, query, name, v))
return ''.join(content)
We can now begin to produce some results! The next two cells will show a heatmap of the species observed within each sample, and then list out the relative abundance of those organisms and links to where more information can be found. Dark red indicates that the species is at a high relative abundance within the sample, while lighter colors indicate the organism is at low relative abundance within the sample.
# perform a log transform on the data to add pop on the display
heatmap(wgs_taxa_ordered, 'Beyond Bacteria WGS taxonomic summary, ordered by taxonomy', transform=lambda x: log(x + 1e-20))
HTML(table_links(wgs_taxa, google_scholar_url_fmt, name_transform=lambda x: [x.split('__', 1)[1].replace('_', ' ')] * 2, id_tag='taxa'))
These next two cells will show pathway results. Same as before: dark red indicates a high relative abundance while ligher colors indicate low relative abundance within each sample.
heatmap(wgs_pathway_rel, 'Beyond Bacteria WGS pathway summary')
HTML(table_links(wgs_pathway_rel, kegg_module_url_fmt, name_transform=lambda x: x.split(' ', 1), id_tag='pathway'))