This IPython notebook file demonstrates some of the functionality available in poretools, and an example run report for a recent R7 chemistry nanopore run.
Some of the examples call out to R for plotting, so to make this work in this notebook we need to load the rpy2.ipython
module.
%load_ext rpy2.ipython
poretools can run either on an individual FAST5 file, a directory containing FAST5 files, or a tar archive of FAST5 files. Here we set up a $directory
variable for use for the rest of the tutorial. You could change this and run the same commands on your data.
directory='/mnt/borage/nick/nanopore/data/Flowcell6/downloads'
!find $directory -maxdepth 1 -name "*.fast5" | wc -l
60196
There are 60,196 FAST5 files in the directory.
poretools has a number of different command line options. Running poretools with no parameters gives us a brief list (and complies with Torsten's first rule)
!poretools
usage: poretools [-h] [-v] {combine,fastq,fasta,stats,hist,events,readstats,tabular,nucdist,qualdist,winner,squiggle,times,yield_plot} ... poretools: error: too few arguments
We can get more information if we run poretools with the -h (help) option.
!poretools -h
usage: poretools [-h] [-v] {combine,fastq,fasta,stats,hist,events,readstats,tabular,nucdist,qualdist,winner,squiggle,times,yield_plot} ... optional arguments: -h, --help show this help message and exit -v, --version Installed poretools version [sub-commands]: {combine,fastq,fasta,stats,hist,events,readstats,tabular,nucdist,qualdist,winner,squiggle,times,yield_plot} combine Combine a set of FAST5 files in a TAR achive fastq Extract FASTQ sequences from a set of FAST5 files fasta Extract FASTA sequences from a set of FAST5 files stats Get read size stats for a set of FAST5 files hist Plot read size histogram for a set of FAST5 files events Extract each nanopore event for each read. readstats Extract signal information for each read over time. tabular Extract the lengths and name/seq/quals from a set of FAST5 files in TAB delimited format nucdist Get the nucl. composition of a set of FAST5 files qualdist Get the qual score composition of a set of FAST5 files winner Get the longest read from a set of FAST5 files squiggle Plot the observed signals for FAST5 reads. times Return the start times from a set of FAST5 files in tabular format yield_plot Plot the yield over time for a set of FAST5 files
Let's start with a simple one, the stats
command, this will give us some basic statistics about our reads.
The -q
option stops poretools
outputting any warning messages.
!poretools stats -q $directory
total reads 104969 total base pairs 550200969 mean 5241.56 median 4616 min 5 max 154417
How do we have 104,969 reads from 60,196 FAST5 files? That's because forward, reverse and two-directional reads are all counted separately.
!poretools stats -q --type fwd $directory
total reads 53914 total base pairs 290880019 mean 5395.26 median 4441 min 5 max 154417
We have 53,914 forward reads in our total dataset.
!poretools stats -q --type rev $directory
total reads 29622 total base pairs 124718901 mean 4210.35 median 3765 min 5 max 44835
!poretools stats -q --type 2D $directory
total reads 21433 total base pairs 134602049 mean 6280.13 median 6020 min 211 max 38598
We have 21,433 two-direction reads, which is about 40% of the reads which have been base-called and about 72% of the reads that have a detectable complement strand.
!poretools readstats -q $directory > readstats.txt
!wc -l readstats.txt
readstats
gives you a line per every FAST5 file in your dataset. The columns are:
!head -10 readstats.txt
1404984747 491 None 301 0 1405005921 325 None 299 0 None 410 None 0 0 1404940637 415 None 5271 5113 1405002720 222 None 3064 0 1404959466 209 None 5901 5634 1404949019 12 None 4524 3576 1404930734 470 None 345 0 1404936756 491 None 310 0 1404947432 109 None 697 0
One useful plot you can easily do with the output of read stats is to plot the number of events in forward reads against reverse reads. Ideally every read would have a similar number, which would indicate the hairpin is correctly attached and the strand translocation rate is controlled by the enzyme.
%R stats=read.table("readstats.txt", sep="\t")
%R stats=subset(stats, V4 < 20000 & V5 < 20000)
<class 'pandas.core.frame.DataFrame'> Int64Index: 59048 entries, 0 to 59047 Data columns (total 5 columns): V1 59048 non-null values V2 59048 non-null values V3 59048 non-null values V4 59048 non-null values V5 59048 non-null values dtypes: int32(2), object(3)
%R smoothScatter(stats$V4,stats$V5)
!poretools winner -q --type 2D $directory > winner.fasta
This will just use the header data to generate a squiggle plot:
!head -1 winner.fasta | sed 's/>.* //' | xargs poretools squiggle --saveas png --num-facets 12
squiggle=!head -1 winner.fasta | sed 's/>.* //'
squiggle=squiggle[0] + '.png'
Image(squiggle)
!poretools yield_plot -q --theme-bw --saveas yield_plot.png --plot-type reads $directory
!poretools yield_plot -q --theme-bw --saveas yield_plot.pdf --plot-type reads $directory
Image("yield_plot.png")
!poretools hist -q --theme-bw --min-length 1000 --max-length 40000 --saveas hist.png $directory
!poretools hist -q --theme-bw --min-length 1000 --max-length 40000 --saveas hist.pdf $directory
Image("hist.png")
!poretools nucdist -q $directory
A 65660764 296997439 0.221081919834 C 70989346 296997439 0.239023428077 T 70730220 296997439 0.23815094244 G 75164353 296997439 0.253080811919 N 14452756 296997439 0.0486628977296
!poretools qualdist -q $directory
! 0 13821031 296997405 0.0465358645137 " 1 25112550 296997405 0.0845547791907 # 2 30275124 296997405 0.101937335109 $ 3 28014262 296997405 0.0943249386304 % 4 27335226 296997405 0.0920386021555 & 5 29026969 296997405 0.097734756302 ' 6 28000641 296997405 0.0942790762768 ( 7 25262940 296997405 0.0850611472514 ) 8 21469851 296997405 0.0722896922281 * 9 17021140 296997405 0.0573107364356 + 10 12194629 296997405 0.0410597156564 , 11 9132498 296997405 0.0307494201843 - 12 6955173 296997405 0.0234182955235 . 13 5258815 296997405 0.0177066025207 / 14 3952021 296997405 0.0133065842781 0 15 2980747 296997405 0.0100362728759 1 16 2267230 296997405 0.00763383774346 2 17 1751907 296997405 0.00589872830707 3 18 1375773 296997405 0.00463227279713 4 19 1098801 296997405 0.00369969899232 5 20 888009 296997405 0.00298995541729 6 21 726878 296997405 0.00244742205744 7 22 599176 296997405 0.0020174452366 8 23 497704 296997405 0.00167578568574 9 24 415916 296997405 0.00140040280823 : 25 746873 296997405 0.002514745878 ; 26 215364 296997405 0.00072513764893 < 27 176731 296997405 0.000595059071307 = 28 138388 296997405 0.000465956933193 > 29 103529 296997405 0.000348585537305 ? 30 73216 296997405 0.000246520672462 @ 31 49430 296997405 0.000166432430613 A 32 30164 296997405 0.000101563176958 B 33 16107 296997405 5.42327970845e-05 C 34 8018 296997405 2.69968688784e-05 D 35 3409 296997405 1.14782147676e-05 E 36 996 296997405 3.35356465488e-06 F 37 129 296997405 4.34347229398e-07 G 38 11 296997405 3.70373606463e-08 I 40 2 296997405 6.73406557205e-09 L 43 5 296997405 1.68351639301e-08 P 47 2 296997405 6.73406557205e-09 V 53 3 296997405 1.01010983581e-08 Z 57 5 296997405 1.68351639301e-08 h 71 2 296997405 6.73406557205e-09 j 73 4 296997405 1.34681311441e-08 m 76 3 296997405 1.01010983581e-08 s 82 3 296997405 1.01010983581e-08