In this tutorial we will explore the types of data that the MinION produces, and try to look at the error mode by visual inspection of alignments.
We will use data from the Ebola surveillance study in Guinea. Reads from this study are already on the server, in the tar archive 076383.tgz
However, first we need a reference sequence to map against.
wget https://raw.githubusercontent.com/nickloman/ebov/master/refs/EM_079517.fasta
Now let's extract FASTA files from the archive of reads. We can do this directly from the tar archive, but it's faster to extract the archive first. A tar archive is denoted by the file suffix 'tar' or 'tgz' for a zipped tar file. Poretools can extract a subset of reads using the '--type' parameter. Commonly used values include:
--type 2D
(two-direction reads)--type fwd
(template reads)--type rev
(complement reads)More detailed usage can be found at the documentation site:
http://poretools.readthedocs.org/en/latest/content/examples.html#poretools-fastq
tar xvfz 076383.tgz
This makes a directory called 076383_180Genomes_11rx
Let's extract the reads:
poretools fasta --type 2D 076383_180Genomes_11rx/pass > Ebola2D.fasta
Check how many reads you have obtained:
grep ">" Ebola2D.fasta | wc -l
Now let's align the reads to a reference sequence. You need to index the reference first:
bwa index EM_079517.fasta
Align the reads to the reference. Here we use BWA as the aligner and we specify that we have Oxford Nanopore reads with the ont2d option:
bwa mem -x ont2d EM_079517.fasta Ebola2D.fasta | samtools view -bS - | samtools sort - -o Ebola2D.sorted.bam
This command makes a file called Ebola2D.sorted.bam
Now index that bamfile:
samtools index Ebola2D.sorted.bam
We can get some basic statistics about how well it has aligned using samtools stats
samtools stats Ebola2D.sorted.bam > Ebola2D.stats.txt
head -40 Ebola2D.stats.txt
samtools stats
can give us some coverage plots:
grep "^COV" Ebola2D.stats.txt > Ebola2D.coverage.txt
head -10 Ebola2D.coverage.txt
You can plot this in RStudio with code like this:
library(ggplot2)
cov=read.table("Ebola2D.coverage.txt", sep="\t")
cov[1,]
ggplot(cov, aes(x=V3, y=V4)) + geom_histogram(stat="identity") + xlab("Coverage") + ylab("Count")
What do you think about the coverage plot? What clues does it give you about how the sample was prepared?
Now, repeat this process from the beginning, but do it for a different dataset, choose from:
1D reads only! Ensure you use a different file name, e.g. Ebola1D.fasta
Now, let's download the BAM file and inspect the alignment. My favoured tool for this is Tablet. It requires Java.
You need to download the reference, the sorted BAM file and the BAM index file to view this dataset in Tablet.
The easiest way to do this is scp
(secure copy protocol) which works anywhere you have an SSH server.
E.g. to copy a file FROM a server TO your laptop do:
scp username@ipaddress:/path/to/file .
To copy a file FROM your laptop TO a server do:
scp /path/to/file username@ipaddress:/remote/directory
So to copy all this over...
scp ubuntu@IPADDRESS/home/ubuntu/Ebola2D.sorted.bam .
scp ubuntu@IPADDRESS/home/ubuntu/Ebola2D.sorted.bam.bai .
scp ubuntu@IPADDRESS/home/ubuntu/EM_079517.fasta .
Inspect the alignment.
Have a look at the error profile. Are some parts of the genome better than others? Can you correlate this with the sequence?
The Ebola virus mutation rate is in the order of 1.2 x 10^-3 mutations/site/year. The genome size is 19000 bases long. This sample was collected about a year after the reference genome. Approximately how many SNPs do you expect to see?
Call SNPs - by eye!
Calling variants with nanopolish relies on squiggle data to generate the best consensus and gives a nicer result.
To call variants, there are three steps:
nanopolish eventalign
nanopolish variants
We've already aligned the reads (output file from BWA was Ebola2D.sorted.bam
)
nanopolish eventalign --reads Ebola2D.fasta -b Ebola2D.sorted.bam -g EM_079517.fasta --sam | samtools view -bS - | samtools sort - -o Ebola2D.eventalign.bam
We need to index the new BAM file that nanopolish eventalign
produced:
samtools index Ebola2D.eventalign.bam
And now we need to get the variants in VCF format:
nanopolish variants --progress -t 1 --reads Ebola2D.fasta -o Ebola2D.vcf -b Ebola2D.sorted.bam -e Ebola2D.eventalign.bam -g EM_079517.fasta -vv -w "EM_079517:0-20000" --snp
It is actually possible to use different models with nanopolish variants specifying the model filenames --models-fofn offset_models.fofn
Compare this list with the list of variants that you already eyeballed. How do they compare?
Did nanopolish spot things that you didn't?
Did nanopolish get anything wrong? Could you figure out a way of filtering the VCF to remove these errors?
nanopolish variants --progress -t 1 --reads Ebola2D.fasta -o Ebola2D.6mer.vcf -b Ebola2D.sorted.bam -e Ebola2D.eventalign.bam -g EM_079517.fasta -vv -w "EM_079517:0-20000" --snp
Ebola2D.6mer.vcf
look compared with the old one?