This tutorial covers a full QIIME workflow using Illumina sequencing data. It is intended to be quick to run, and as such uses only a subset of a full Illumina Genome Analyzer II (GAIIx) run. We'll make use of the 13_8 release of the Greengenes reference OTUs. You can always find a link to the latest version of the reference OTUs on the QIIME resources page.
The data used in this tutorial is derived from the Moving Pictures of the Human Microbiome study, where two human subjects collected daily samples from four body sites: the tongue, the palm of the left hand, the palm of the right hand, and the gut (via fecal samples obtained by swabbing used toilet paper). This data was sequenced across six lanes of an Illumina GAIIx, using the barcoded amplicon sequencing protocol described in Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample. A more recent version of this protocol that can be used with the Illumina HiSeq 2000 and MiSeq can be found here.
This tutorial is presented as an IPython Notebook. We're considering updating all of our tutorials to be presented in this format, and are interested in feedback on that. Let us know what you think here. For more information on using QIIME with IPython, see our recent paper. You can find more information on the IPython Notebook here, and the nbviewer tool (which we use to display the notebook) here.
We'll begin by downloading data and initializing some variables to configure our IPython computing environment.
Note: If you are using an Apple computer, we recommend downloading the dataset directly using the above link, as OS X does not come with wget pre-installed. Alternatively, wget can be compiled from source, or you can use curl, which is similar to wget although its usage differs.
from os import chdir, mkdir
from os.path import join
# these are only available in the current development branch of IPython
from IPython.display import FileLinks, FileLink
!wget https://s3.amazonaws.com/s3-qiime_tutorial_files/moving_pictures_tutorial-1.8.0.tgz
!wget ftp://greengenes.microbio.me/greengenes_release/gg_13_5/gg_13_8_otus.tar.gz
!tar -xzf moving_pictures_tutorial-1.8.0.tgz
!tar -xzf gg_13_8_otus.tar.gz
Throughout this tutorial we make use of a reference sequence collection, tree, and taxonomy derived from the Greengenes database. For convenience, we'll assign their filepaths to variables here, and then reference those variables throughout the tutorial whenever the files are used.
otu_base = "gg_13_8_otus/"
reference_seqs = join(otu_base,"rep_set/97_otus.fasta")
reference_tree = join(otu_base,"trees/97_otus.tree")
reference_tax = join(otu_base,"taxonomy/97_otu_taxonomy.txt")
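Because these are ordinary Python variables, IPython lets us interpolate them into shell commands with a `$` prefix (as in the `pick_open_reference_otus.py` command later in this tutorial). As a quick sanity check, pure Python with no QIIME required, we can confirm the paths assemble as expected:

```python
from os.path import join

otu_base = "gg_13_8_otus/"
reference_seqs = join(otu_base, "rep_set/97_otus.fasta")

# join handles the trailing separator on otu_base for us
print(reference_seqs)  # gg_13_8_otus/rep_set/97_otus.fasta
```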
Start by seeing what files are in our tutorial directory. We can do this using ls as we would on the command line, but in this case we prefix the command with an ! to tell IPython that we're issuing a bash (i.e., command line) command rather than a Python command.
!ls moving_pictures_tutorial-1.8.0
QIIME additionally supports more convenient output formatting for the IPython Notebook, so you can directly interact with or download your data.
FileLinks('moving_pictures_tutorial-1.8.0')
The QIIME mapping file contains all of the per-sample metadata, including technical information such as the primers and barcodes that were used for each sample, as well as information about the samples themselves. In this data set we're looking at human microbiome samples from four body sites on two individuals at multiple time points. The metadata in this case therefore includes a subject identifier, a timepoint, and a body site for each sample. You can review filtered_mapping_l1.txt at the link in the previous cell to see an example of the data.
In this step, we run check_id_map.py to review the mapping file and confirm that it's in QIIME-compatible format.
!check_id_map.py -o moving_pictures_tutorial-1.8.0/illumina/cid_l1/ -m moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l1.txt
In this case there were no errors, but if there were, we would review the resulting HTML summary to find out what errors are present. You could then fix those in a spreadsheet program or text editor. To view that HTML file, call FileLinks on the output directory from the previous step and click the link to the html file.
For the sake of illustrating what errors in a mapping file might look like, we've created a bad mapping file (filtered_mapping_l1.bad.txt) and provided it as an example. Call check_id_map.py on filtered_mapping_l1.bad.txt, and then view the HTML output. What are the issues with that mapping file?
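To build intuition for the kinds of problems check_id_map.py flags, here is a minimal hypothetical validator sketch (not QIIME code) that catches a few common issues: duplicate SampleIDs, invalid characters in sample names, and non-nucleotide barcodes:

```python
import re

def quick_mapping_checks(rows):
    """rows: list of (sample_id, barcode) tuples parsed from a mapping file.
    Returns a list of human-readable problems (illustrative checks only)."""
    problems = []
    seen_ids = set()
    for sample_id, barcode in rows:
        if sample_id in seen_ids:
            problems.append("duplicate SampleID: %s" % sample_id)
        seen_ids.add(sample_id)
        # QIIME restricts SampleIDs to alphanumeric characters and periods
        if not re.match(r'^[a-zA-Z0-9.]+$', sample_id):
            problems.append("invalid characters in SampleID: %s" % sample_id)
        if not re.match(r'^[ACGT]+$', barcode):
            problems.append("non-ACGT barcode for %s: %s" % (sample_id, barcode))
    return problems

print(quick_mapping_checks([("L1S1", "ACGT"), ("L1S1", "ACGA"), ("bad_id", "ACGN")]))
```

The real script performs many more checks (barcode lengths, required columns, and so on); this sketch just shows the flavor of the validation.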
We next need to demultiplex our sequences (i.e., assign barcoded reads to the samples they are derived from). In general, you should receive separate fastq files for your sequence and barcode reads. On the multiple-lane Illumina platforms, we typically reuse barcodes across lanes, so we must demultiplex each lane independently. To do that, run the following command (it will run for a few minutes).
This is a big command, but it's relatively straightforward. We're telling QIIME that we have six lanes of sequence data (specified as a comma-separated list of files passed as -i), six lanes of barcode data (specified as a comma-separated list of files passed as -b), and a metadata mapping file corresponding to each lane (specified as a comma-separated list of files passed as -m). The metadata mapping files contain the sample-to-barcode mappings that we need for demultiplexing.
Important: The order of files passed for -m, -b, and -i must be consistent, so if you pass the lane 1 sequence data first for -i, you must pass the lane 1 barcode data first for -b and the lane 1 metadata mapping file first for -m. The only other parameter here is the output directory, which we'll call slout, for split libraries output.
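Conceptually, demultiplexing pairs each barcode read with its sequence read (the two fastq files are in the same order) and looks the barcode up in the sample-to-barcode mapping. A stripped-down hypothetical sketch, ignoring the quality filtering and barcode error correction that split_libraries_fastq.py also performs:

```python
def demultiplex(seq_reads, barcode_reads, barcode_to_sample):
    """Pair each sequence read with its barcode read (same order in both
    inputs) and assign it to a sample; unrecognized barcodes are dropped."""
    by_sample = {}
    for seq, barcode in zip(seq_reads, barcode_reads):
        sample = barcode_to_sample.get(barcode)
        if sample is not None:
            by_sample.setdefault(sample, []).append(seq)
    return by_sample

mapping = {"ACGT": "L1S1", "TTGA": "L1S2"}
reads = demultiplex(["AAAA", "CCCC", "GGGG"], ["ACGT", "TTGA", "NNNN"], mapping)
print(reads)  # {'L1S1': ['AAAA'], 'L1S2': ['CCCC']}
```

Note that the read paired with the unrecognized barcode (NNNN) is discarded, which is why the number of demultiplexed sequences is typically lower than the number of raw reads.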
!split_libraries_fastq.py -o moving_pictures_tutorial-1.8.0/illumina/slout/ -i moving_pictures_tutorial-1.8.0/illumina/raw/subsampled_s_1_sequence.fastq,moving_pictures_tutorial-1.8.0/illumina/raw/subsampled_s_2_sequence.fastq,moving_pictures_tutorial-1.8.0/illumina/raw/subsampled_s_3_sequence.fastq,moving_pictures_tutorial-1.8.0/illumina/raw/subsampled_s_4_sequence.fastq,moving_pictures_tutorial-1.8.0/illumina/raw/subsampled_s_5_sequence.fastq,moving_pictures_tutorial-1.8.0/illumina/raw/subsampled_s_6_sequence.fastq -b moving_pictures_tutorial-1.8.0/illumina/raw/subsampled_s_1_sequence_barcodes.fastq,moving_pictures_tutorial-1.8.0/illumina/raw/subsampled_s_2_sequence_barcodes.fastq,moving_pictures_tutorial-1.8.0/illumina/raw/subsampled_s_3_sequence_barcodes.fastq,moving_pictures_tutorial-1.8.0/illumina/raw/subsampled_s_4_sequence_barcodes.fastq,moving_pictures_tutorial-1.8.0/illumina/raw/subsampled_s_5_sequence_barcodes.fastq,moving_pictures_tutorial-1.8.0/illumina/raw/subsampled_s_6_sequence_barcodes.fastq -m moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l1.txt,moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l2.txt,moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l3.txt,moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l4.txt,moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l5.txt,moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l6.txt
We often want to see the results of running a command. Here we can do that by calling our output formatter again, this time passing the output directory from the previous step.
FileLinks('moving_pictures_tutorial-1.8.0/illumina/slout/')
!count_seqs.py -i moving_pictures_tutorial-1.8.0/illumina/slout/seqs.fna
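count_seqs.py reports the number of sequences in the file (along with length statistics). The core count is simple enough to sketch in plain Python for intuition: each FASTA record begins with a '>' header line, so counting headers counts sequences.

```python
def count_fasta_seqs(lines):
    """Count FASTA records by counting '>' header lines."""
    return sum(1 for line in lines if line.startswith(">"))

# A tiny in-memory example (the header format mimics demultiplexed output)
demo = [">L1S1_1 orig_bc=ACGT", "ACGTACGT", ">L1S2_1 orig_bc=TTGA", "TTGATTGA"]
print(count_fasta_seqs(demo))  # 2
```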
Now that we have demultiplexed sequences, we're ready to cluster them into OTUs. There are three high-level ways to do this in QIIME: de novo, closed-reference, and open-reference OTU picking. Open-reference OTU picking is currently our preferred method. A discussion of these methods can be found in QIIME's OTU picking document.
Here we apply open-reference OTU picking, which can require about 10 minutes to run on this data set.
Note that this command takes the seqs.fna file that was generated in the previous step, as well as the reference fasta file ($reference_seqs here). We're also taking a shortcut for the sake of reduced run time: we're using the fast uclust parameters. To allow this step to finish in just a few minutes, we're using parameter settings that are optimized for reduced runtime at the expense of accuracy. These correspond to uclust's default parameters; QIIME uses slightly more stringent settings by default. These parameters are specified in the parameters file, which is passed as -p. You can find information on defining parameters files here.
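QIIME parameters files are plain text with one "script_name:parameter_name value" pair per line. As an illustration of the format only (these specific settings are an assumption about what a fast-uclust file might contain, not a copy of uc_fast_params.txt), you could generate such a file like this:

```python
# Hypothetical fast-uclust settings: fewer accepts/rejects and shorter
# words mean uclust gives up its search sooner, trading accuracy for speed.
fast_uclust_params = [
    "pick_otus:max_accepts 1",
    "pick_otus:max_rejects 8",
    "pick_otus:stepwords 8",
    "pick_otus:word_length 8",
]

with open("example_params.txt", "w") as f:
    f.write("\n".join(fast_uclust_params) + "\n")
```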
!pick_open_reference_otus.py -o moving_pictures_tutorial-1.8.0/illumina/otus/ -i moving_pictures_tutorial-1.8.0/illumina/slout/seqs.fna -r $reference_seqs -p moving_pictures_tutorial-1.8.0/uc_fast_params.txt
If you want to generate this data faster, you can use closed-reference OTU picking at this stage instead of open-reference OTU picking. We generally don't recommend this in practice, but you can do so by using the following command instead of the pick_open_reference_otus.py command. For all subsequent steps that require a tree (mainly core_diversity_analyses.py), you should then pass the $reference_tree variable.
pick_closed_reference_otus.py -o moving_pictures_tutorial-1.8.0/illumina/otus/ -i moving_pictures_tutorial-1.8.0/illumina/slout/seqs.fna -r $reference_seqs -t $reference_tax -p moving_pictures_tutorial-1.8.0/uc_fast_params.txt
The primary output that we care about from this command is the OTU table: the number of times each operational taxonomic unit (OTU) is observed in each sample. QIIME uses the Genomics Standards Consortium candidate standard Biological Observation Matrix (BIOM) format for representing these files. You can find additional information on the BIOM format here, and information on converting these files to tab-separated text that can be viewed in spreadsheet programs here. The rep_set.tre file is also essential for downstream phylogenetic diversity calculations.
To view the output of this command, call FileLinks on the output directory.
FileLinks('moving_pictures_tutorial-1.8.0/illumina/otus/')
To compute some summary statistics of the OTU table we can run the following command.
!biom summarize-table -i moving_pictures_tutorial-1.8.0/illumina/otus/otu_table_mc2_w_tax_no_pynast_failures.biom -o moving_pictures_tutorial-1.8.0/illumina/otus/otu_table_mc2_w_tax_no_pynast_failures_summary.txt
To view the summary, call FileLink on the output file.
FileLink('moving_pictures_tutorial-1.8.0/illumina/otus/otu_table_mc2_w_tax_no_pynast_failures_summary.txt')
The key piece of information you need to pull from this output is the depth of sequencing that should be used in diversity analyses. Many of the analyses that follow require that there are an equal number of sequences in each sample, so you need to review the Counts/sample detail and decide what depth you'd like. Any samples that don't have at least that many sequences will not be included in the analyses, so this is always a trade-off between the number of sequences you throw away and the number of samples you throw away. For some perspective on this, see Kuczynski 2010.
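The trade-off can be made concrete: for each candidate depth, samples with fewer sequences are discarded entirely, and every retained sample is rarefied down to that depth. A small sketch of the arithmetic, using hypothetical per-sample counts (not values from this data set):

```python
def depth_tradeoff(counts_per_sample, depth):
    """Return (n_samples_kept, n_sequences_used) at an even sampling depth."""
    kept = [c for c in counts_per_sample.values() if c >= depth]
    return len(kept), depth * len(kept)

counts = {"L1S1": 258, "L1S2": 310, "L1S3": 120, "L1S4": 500}
for depth in (120, 258, 500):
    print(depth, depth_tradeoff(counts, depth))
```

Here a depth of 120 keeps all four samples but uses only 120 sequences from each, while a depth of 500 uses far more sequences per sample but keeps only one sample; picking the depth means balancing these two losses.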
We started with six lanes of data but have now summarized these in a single OTU table. However, we still need to merge the per-lane mapping files into a single combined mapping file that represents all six lanes, and therefore all of our data. Note that we will have duplicated barcodes in our mapping file, but that's OK, as we've already demultiplexed our reads; we don't use the barcodes again. We can merge the six mapping files as follows. From this point on, we'll work with combined_mapping_file.txt.
!merge_mapping_files.py -o moving_pictures_tutorial-1.8.0/illumina/combined_mapping_file.txt -m moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l1.txt,moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l2.txt,moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l3.txt,moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l4.txt,moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l5.txt,moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l6.txt
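Conceptually, merging mapping files stacks the rows from each file and takes the union of the header columns, filling in a placeholder where a file lacks a column. A simplified sketch of that idea (not the actual merge_mapping_files.py implementation; the "no_data" placeholder and column names here are illustrative):

```python
def merge_mappings(mappings):
    """mappings: list of (header, rows) pairs, where header is a list of
    column names and rows is a list of equal-length value lists."""
    # Union of all columns, preserving first-seen order
    merged_header = []
    for header, _ in mappings:
        for col in header:
            if col not in merged_header:
                merged_header.append(col)
    # Re-emit every row against the merged header, filling gaps
    merged_rows = []
    for header, rows in mappings:
        for row in rows:
            values = dict(zip(header, row))
            merged_rows.append([values.get(col, "no_data") for col in merged_header])
    return merged_header, merged_rows

m1 = (["#SampleID", "BarcodeSequence"], [["L1S1", "ACGT"]])
m2 = (["#SampleID", "BarcodeSequence", "DaysSinceStart"], [["L2S1", "ACGT", "3"]])
header, rows = merge_mappings([m1, m2])
print(header)  # ['#SampleID', 'BarcodeSequence', 'DaysSinceStart']
print(rows)    # [['L1S1', 'ACGT', 'no_data'], ['L2S1', 'ACGT', '3']]
```

Note how the duplicated barcode across the two lanes is preserved without complaint; as discussed above, the barcodes are no longer needed once demultiplexing is done.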
Let's review the resulting mapping file. To view a single file (rather than a directory), we use the FileLink function instead of the FileLinks function.
FileLink('moving_pictures_tutorial-1.8.0/illumina/combined_mapping_file.txt')
Here we're running the core_diversity_analyses.py script, which applies many of the "first-pass" diversity analyses that users are generally interested in. The main output that users will interact with is the index.html file, which provides links into the different analysis results.
Note that in this step we're passing -e, the sampling depth that should be used for diversity analyses. We chose 258 here, based on reviewing the above output from biom summarize-table. This value will be study-specific, so don't just use this value on your own data (though it's fine to use it for this tutorial).
Warning: Here we're suppressing all alpha diversity calculations, as generating alpha rarefaction plots can take approximately 20 minutes. If you'd like to include those, you should delete the --suppress_alpha_diversity parameter from this command.
!core_diversity_analyses.py -o moving_pictures_tutorial-1.8.0/illumina/otus/cd258/ --suppress_alpha_diversity -i moving_pictures_tutorial-1.8.0/illumina/otus/otu_table_mc2_w_tax_no_pynast_failures.biom -m moving_pictures_tutorial-1.8.0/illumina/combined_mapping_file.txt -t moving_pictures_tutorial-1.8.0/illumina/otus/rep_set.tre -e 258 -c "SampleType,days_since_epoch"
Next, open the index.html file in the resulting directory. This will link you into the different results.
FileLink('moving_pictures_tutorial-1.8.0/illumina/otus/cd258/index.html')
This tutorial illustrated some of the basic features of QIIME, and there are a lot of places to go from here. If you're interested in seeing additional visualizations, you should check out the QIIME overview tutorial. The Procrustes analysis tutorial illustrates a really cool analysis, allowing you to continue with the same data used here, comparing against the samples sequenced on 454 (rather than Illumina, as in this analysis). If you're interested in some possibilities for statistical analyses you can try the supervised learning or distance matrix comparison tutorials, both of which can be adapted to use data generated in this tutorial.