The largest challenge facing the usage of metagenomic approaches in microbiology is the need to extend traditional microbiology training to include metagenomic or sequencing data analysis. Sean Eddy (a compuational biologist at the Howard Hughes Medical Institute) nicely describes the impacts of high throughput sequencing on biology and its training in his keynote address (

To facilitate the barriers to microbiologists for metagenomic assembly, we have complemented this review with a tutorial of how to estimate the abundance of reference sequences (e.g., genes, contigs, etc.) in a metagenome. We include approaches that include using references that are both (i) available genome references or (ii) assembled from the metagenome. In general, to complete this tutorial and most metagenomic assembly, one would need:

  • Access to a server. Most metagenomic assembly will require more memory than most researchers will have on their personal computers. In this tutorial, we will provide training on the publicly accessible Amazon EC2 instances which can be rented by anyone with a registered account.
  • Access to a metagenomic dataset. We have selected the usage of the HMP Mock Community WGS dataset ( for this tutorial given its availability, practical size, and availability of reference genomes. This dataset represents a mock metagenome of 22 known organisms for which DNA was extracted from cultured isolates, combined, and sequenced.
  • Software for assembly, read mapping, and annotation. We will demonstrate the installation of this software on an Ubuntu-based server.

0. Getting on the same page.

The first step in this tutorial is to provide all users access to a server for which these instructions can be used, regardless of what computer you may be on. To do this, we'll be using cloud computing. More specifically, Amazon Web Services Elastic Compute Cloud. To rent compute time off of Amazon Web Services, you'll have to sign up and pay with a credit card. The cost is pretty manageable, You should be able to complete this tutorial in less than four hours, which comes out to < $1.

Once you are signed up for Amazon Web Services, you need to follow some instructions to launch a cloud "instance" or server. For this tutorial, we suggest you use the instructions from the Data Science Toolbox, A couple things to note prior to running through those instructions:

  • Choose the "in the cloud" instructions
  • You can choose any AMI but we suggest US EAST, ami-d1737bb8
  • When you configure your instance, choose the m3.large instance
  • Do not forget to "Add rule" as described in the directions in Data Science Toolbox Step 2: Add a "custom TCP rule" for port "8888" and source "Anywhere".
  • Complete the Data Science Instructions through Step 4. Once you get to Step 5, refer below.

If you have trouble logging into your instance and you are on a Mac or Linux OS:

  • Check to see that you changed permissions on your key file (the one that ends in *.pem)
  • Make sure you run the ssh command to log into the instance in the same directory as your security file or specify the location of that file

Once you are logged into the instance, for example, you've successfully run the following command (except you'll have your own security file uniquely named and your own special EC2 address):

$ ssh -i MyKeyPair.pem [email protected]

And you now have a command line that looks like:

[email protected]:

You need to do a couple things to get this tutorial running:

Copy and paste the following commands one by one into your command line and press ENTER after each one:

cd /mnt

sudo git clone

Next, copy and paste the following command and enter a notebook password of your choice when prompted:

dst setup base

Then, copy and paste this command:

sudo ipython notebook --profile=dst --notebook-dir=/mnt/frontiers-review-2015

This will start up an IPython Notebook for this tutorial. Leave the terminal screen open and find your internet browser, preferablly Google Chrome. You'll also need the address for your EC2 instance public DNS that you used to log in above "e.g., ec2-XX-XX-XX-XXX". If you don't know it, you can always check on your AWS EC2 dashboard (see running instances) here,

On your web browser, navigate to https://ec2-XX-XX-XX-XXX:8888 (except with your specific EC2 public DNS). Almost all web browsers will have a message that says you're heading to an unsafe place. Don't be alarmed. On Chrome, you can hit the "Advanced" options link and hit "Proceed anyways". Then, type in the password (password is the one you chose above), and voila, you'll see a notebook that contains this text called "frontiers-nb-2015".

1. How to use this IPython Notebook

IPython notebooks are very useful to collaboratively train bioinformatics. These notebooks have recently been featured in Nature News ( and

In using these notebooks, there are a few imporant things to note. There are two types of content in this tutorial: text and code. This content is placed in this notebook as "cells". If you click around on this page, you'll see different cells highlighted. To execute each cell (regardless of content), you hit on your keyboard SHIFT+ENTER. If the cell contains text, the content will be displayed directly. If the cell contains code, the code will then execute. Also, you can execute all cells in the notebook by going to the Cell tab where "File, Edit, View, Insert, Cell..." are in the top left of this webpage and selecting Run all.

2. Download the tutorial dataset.

We will begin this tutorial by downloading the HMP mock metagenome from the NCBI Short Read Archives (SRA). Many public metagenomes are stored as SRA files in the NCBI. The easiest way to get these SRA files is to use a special set of tools called the sratoolkit. If you have your dataset SRA run ID (in this case SRR172903), you can download the dataset and convert it to the standard "fasta" or "fastq" sequencing format is to use a special program to convert the file.

In [1]:
In [2]:
!tar -xvf sratoolkit.2.4.5-2-ubuntu64.tar.gz

You can see that we now have a file containing the software with the "ls" command. You'll also see this notebook in the list of files in the present location we are working in.

In [3]:

Now we'll use the installed sratoolkit program to download the HMP mock dataset in "fastq" format. (This takes a minute or two. You'll find that patience is require for working with metagenomes. The nice thing about working in the cloud is that you are "renting" the computational power so it is not using your personal computer's memory -- freeing it up for things you can do while waiting. You'll note that you can see that "Kernel busy" will be shown on the top-right corner of the screen below "Logout" button.)

In [4]:
!sratoolkit.2.4.5-2-ubuntu64/bin/fastq-dump SRR172903

3. Quality control

There are a number of methods to determine the quality of the sequencing data that you will assemble. First, one can look at the quality scores of your sequencing reads and if desired, trim reads with quality scores that are not sufficient for your needs. A vast number of tools are available to perform quality trimming of sequencing reads, including tools with nice tutorials including FastX Toolkit ( and, FastQC ( and and Sickle ( and

The sequencing data file you have downloaded is a "fastq" text file, where data describing each sequencing read is shown on four lines. Let's take a quick look:

In [5]:
!head -n 4 SRR172903.fastq
  • This first line (starting with "@SRR172903.1") is the read identifier, it usually shows the read ID, some information for the sequencing facility about the run that it was obtained on.
  • The second line is the DNA sequence.
  • The third line is the same as the first line but replacing the "@" with a "+", sometimes this is only a "+" in some datasets
  • The fourth line gives you information on the quality score of each base pair for the DNA sequence. Note that it is the same length as the DNA sequence and that quality scores are based on ASCII character scores (with an offset determined by the sequencing technology, Illumina is currently an offset of 64, e.g., ASCII code 64 = 0 Phred score). The quality score is equal to the -10 * log (p), where p is the probability of the base being called wrong (e.g., if Q= 20, p=0.01, 1% probability base is called wrong).

For this tutorial, we will be removing reads that have more than 50% of the read length with a Phred score of less than 33. We will be using Fastx-Toolkit ( which can be used for many types of quality control (e.g., adapter trimming). First, we will download, uncompress, and then install this program.

In [6]:
In [7]:
!tar -xvf fastx_toolkit-0.0.14.tar.bz2
!tar -xvf libgtextutils-0.7.tar.gz
In [16]:

Now, we will perform the quality filtering saving the quality-filtered file as SRR172903.qc.fastq:

FASTQ Quality Filter

$ fastq_quality_filter -h usage: fastq_quality_filter [-h] [-v] [-q N] [-p N] [-z] [-i INFILE] [-o OUTFILE]

version 0.0.6
   [-h]         = This helpful help screen.
   [-q N]       = Minimum quality score to keep.
   [-p N]       = Minimum percent of bases that must have [-q] quality.
   [-z]         = Compress output with GZIP.
   [-i INFILE]  = FASTA/Q input file. default is STDIN.
   [-o OUTFILE] = FASTA/Q output file. default is STDOUT.
   [-v]         = Verbose - report number of sequences.
          If [-o] is specified,  report will be printed to STDOUT.
          If [-o] is not specified (and output goes to STDOUT),
          report will be printed to STDERR.
In [17]:
!fastq_quality_filter -q 33 -p 50 -i SRR172903.fastq > SRR172903.qc.fastq

4. Checking the diversity - What is the distribution of "Who is There?"

An advantage of metagenomic sequencing is the ability to quantify microbial diversity in an environment without the need to first cultivate cells. Typically, most studies access taxonomic diversity (especially with the usage of targeted sequencing of the 16S rRNA gene*). Diversity can also be measured in the representation of specific sequence patterns in a metagenome. For example, one can quantify the abundance of unique nucleotide "words" of length K, or k-mers, in a metagenome. These k-mers can also be used in the assembly of metagenomes where overlapping k-mers are indicative of reads that should be connected together. The diversity of these k-mers can give you insight into the the diversity of your sample. Further, since assembly compares each k-mer against all k-mers, larger numbers of k-mers present will require more computational memory. A nice review on k-mers and assembly is Miller et al.

(*Note that 16S rRNA amplicon sequencing is a targeted approach and not considered metagenomics in this review. Shotgun metagenomic sequencing uses DNA extracted from all cells in a community and sequenced. Targeted sequencing amplifies a specific genomic locus and independently sequenced. A great review on metagenome analysis is Sharpton et al.)

The first thing we will do is install khmer ( -- it contains a suite of khmer and pre-assembly tools. We will use it for k-mer counting here. Once you run the script below, you can use khmer's many tools.

In [18]:
In [19]:

The following script is contained within the khmer package and can estimate the total unique number of k-mers in your dataset. Use cases for this might be a) determining how diverse a metagenome is compared to e.g., a bacterial genome for assembly, b) to compare k-mer diversity among multiple metagenomes, c) exploring the impacts of choice of length k for assembly.

Next, to estimate the number unique k-mers in the datasets for multiple k's (17, 21, 25, 29, 33, 37), execute the scripts below. The script output will identify the unique k-mers but will also save in a report named unique_count. (This takes about 15 minutes on a large instance and 8-10 minutes on an extra large.)

In [20]:
!python -R unique_count -k 17 SRR172903.qc.fastq
!python -R unique_count -k 21 SRR172903.qc.fastq
!python -R unique_count -k 25 SRR172903.qc.fastq
!python -R unique_count -k 29 SRR172903.qc.fastq
!python -R unique_count -k 33 SRR172903.qc.fastq
!python -R unique_count -k 37 SRR172903.qc.fastq

You can see that this file now has in the first column the k-mer length and in the second column the estimated number of words of length k in the metagenomes. If you had multiple genomes, you could compare diversity of e.g., the total number of k-mers across datasets. To view the results of the file, you can use the concatentate program/command "cat".

In [21]:
!cat unique_count

5. Getting a sequence coverage profile: What genes are present in my metagenome?

Most metagenomic analysis require one to estimate the abundance of reference genes (e.g., orginating from genomes or one's own metagenomic assembly). This tutorial will cover both cases where references are available or unavailable (requiring de novo assembly).

6. Case I - Reference genomes available.

For the mock HMP metagenome, the HMP has sequenced the genomes of the isolates used for this simulated dataset. The list of these genomes can be obtained on the HMP website, and we provide it here in a Github repository, a tool used for collaboratively sharing data and code. The command below will download data for this tutorial.

In [22]:
!cat ncbi_acc.txt

The following command downloads all the genomes for each ID in the above list into a directory called "genomes".

In [23]:
!python ncbi_acc.txt genomes

7. Estimating abundance of assembled contigs

To estimate the representation of reference genes or genomes in your metagenome, you can align reads to references using read mapping software (e.g., Bowtie2, BWA, etc.). In this tutorial, we will use Bowtie2 which we will install on this server. We will then be mapping the metagenome to a single reference genome (that we downloaded above).

In [24]:

I've written a script that will automatically map a set of reads to a given reference and output a file containing the number of reads that are mapped to a given reference. To use this script, we'll also need to install samtools. A samfile is a super compressed file that efficiently stores mapped information from mappers. Samtools helps us interact with this file.

In [25]:
!apt-get install samtools

To map reads to a reference, we have provided an easy to use program. The steps the program performs are as follows:

  • Index your reference genome,
  • Map reads to your index genome (with default bowtie parameters),
  • Use Samtools to estimate the number of reads mapped, number of reads unmapped, and provide a tab delimited file with each line consisting of reference sequence name, sequence length, # mapped reads and # unmapped reads.

This takes about 8-10 minutes.

In [26]:
!bash genomes/NC_000913.2.fa SRR172903.qc.fastq

We can look at the total number of reads mapped and unmapped from our metagenome to the genome NC_000913.2. We can also get a file that shows the reference sequence name (first column), reference sequence length (second column), # mapped reads (third column) and # unmapped reads (last column). The other columns contain information that samtools can use for other queries, you can read about samtools here,

In [27]:
!cat reads-mapped.count.txt
In [28]:
!cat reads-unmapped.count.txt
In [29]:

If you want a challenge, you can try mapping the metagenome to all reference genomes provided in the genome folder. To do so, try concatentating all genomes into one file (using this command: "cat genomes/*fa >> all-genomes.fa") and running the program on all-genomes.fa instead of NC_000913.2.fa.

8. Case II - De novo assembly of reference genes.

Assembly of the HMP mock metagenome

Assembly is the process of merging overlapping metagenomic reads from hopefully the same genome into a longer, continguous sequence (most commonly called a contig). It is advantageous in that it provides longer lengths for sequences that can later be used as references (that may be previously unknown), reduces the dataset size for analysis, and provides references that are not dependent on previous knowledge.

The choice of what assembler to use is not an easy one and is a subject of debate (see It is most important to remember that an assembly is a hypothesized consensus representation of your dataset. The assembly itself is an initial step that needs to be followed by an evaluation of its accuracy and usefulness. For most assemblers, the inputs are sequencing reads and paramters for the assembly software. For this tutorial, we will be completing the assembly with an assembler published in 2014 called Megahit (Li et al., 2015, Sharpton's review (Sharpton, 2014) also reviews quite nicely some of the many assembly programs and approaches for metagenomic assembly.

To reduce the memory that is needed, it is often advantageous to normalize the distribution of k-mers in a metagenome. Removing extraneous information not needed for assembly also removes reads that may contain errors and may improve assembly ( These scripts and tutorials are available at

For this tutorial, we will assemble our metagenome with the Megathit assembler so first we have to install it.

In [30]:

This assembly will take about 15 minutes and will save the assembly to a folder names "megahit_assembly". You can read about the parameters to this program, such as --memory that specifies the maximum memory that can be used on the megahit software repo,

In [31]:
!megahit/megahit --memory 10e9 -l 250 --k-max 81 -r SRR172903.qc.fastq --cpu-only -o megahit_assembly

To take a look at the assembly, let's run the khmer assembly summary program on it, the final contigs are in megahit_assembly/final.contigs.fa. Let's get statistics on all contigs greater than or equal to 200 bp.

In [32]:
!python khmer/sandbox/ 200 megahit_assembly/final.contigs.fa

9. Estimating abundances of contigs

Once the assembly is finished, you have a set of reference contigs which you can now estimate the abundance of the metagenome. The approach for doing so is identical to that shown above where you use reference genomes.

This will take about 20 minutes.

In [33]:
!bash megahit_assembly/final.contigs.fa SRR172903.qc.fastq

You can take a look at the results of the mapping much like you did above when we were mapping reads to the NCBI genome.

In [34]:
!cat reads-mapped.count.txt
In [35]:
!cat reads-unmapped.count.txt
In [36]:

10. Annotating the assembled contigs

Sequencing is often used to determine "who" and/or "what" is in your sample. In our case, we know that the HMP mock community should orignate from a set of genomes (which we actually downlaoded above). One of the most popular tools of comparing an unknown sequence to a known reference is The Basic Local Alignment Search Tool (or BLAST). To identify the origin of our contigs, we will align assembled contigs to the genomes used in the HMP mock community.

The first thing we will do is download the BLAST software. Given the increasing volume of sequencing datasets, one may also consider new tools for more efficient annotation are now available such as Diamond (,

In [40]:
!tar -xvf ncbi-blast-2.2.30+-x64-linux.tar.gz

Now, we will make a searchable database for the BLAST software. First, we'll concatenate all the genomes in the genomes directory to one file.

In [41]:
!cat genomes/*fa >> all-genomes.fa
!ncbi-blast-2.2.30+/bin/makeblastdb -in all-genomes.fa -dbtype nucl -out all-genomes
!ncbi-blast-2.2.30+/bin/blastn -db all-genomes -query megahit_assembly/final.contigs.fa -outfmt 6 -out contigs.x.all-genomes.blastnout

The above command aligns each query (each sequence in the assembled final.contings.fa file) with each sequence (e.g., genome in all-genomes.fa). The -outfmt tells the program to save the results in a tab-delimited format in the -out file contigs.x.all-genomes.blastnout.

Let's take a look at the first 10 lines of that file. You'll see the query (contig) and the hit (genome) followed by the percent identity, the length of alignment, mismatch counts, gap open counts, query start position, query end position, subject start position, subject end position, E-value, and bit score.

In [42]:
!head -n 10 contigs.x.all-genomes.blastnout

Depending on your scientific question, it may be of more interest to have open reading frames (ORFs) annotated rather than contig sequences. In this case, there exist multiple ORF callers (e.g., FragGeneScan, and Metagene, that can be used. We can call ORFs from our contigs using FragGeneScan. We will download, install, and then call ORFs from our contigs as follows:

In [43]:
!tar -xvf FragGeneScan1.19.tar.gz
In [44]:

We will run FragGeneScan on the assembled contigs, assuming that it fits the training profile of a "complete" genome sequence (in their documentaion, this equals complete genomic sequences or short sequence reads without sequencing error).

In [45]:
!FragGeneScan1.19/FragGeneScan -s megahit_assembly/final.contigs.fa -o final.contigs.orfs.fa -w 1 -t complete 

The ORFs called are contained as FASTA files in final.contigs.orfs.fa.faa (amino acids) and final.contigs.orfs.fa.ffn (nucleotides). You can annotate these against a database of your choice just as described for the contigs above.

11. Going forward

Now you have all the information you need to produce the following information:

  • Sequence Abundance Information: Sequence (e.g., contig) and abundance (e.g., number of mapped reads)
  • Sequence Annotation Information: Sequence (eg., contig) and NCBI genome

You'll note that this is similar to 16S rRNA amplicon analysis where you'd have an OTU abundance table and OTU best hit annotations. For metagenomic analysis, this information takes you into further analysis and visualization packages like PhyloSeq in R (

Once you run through this tutorial on this workbook, a good exercise would be to try to run the assembly outside the IPython Notebook environment. To do so, you can log into your EC2 instance, navigate to the directory where this data is stored (cd /mnt/frontiers-review-2015), and you could run every command in this notebook on the command line (with the exception of the "!" at the beginning of each command in the notebook. Also, note that you will not have to reinstall the software.

In [ ]: