Notebook

Olympia Oyster (Pat) -Initial- PacBio Data Analysis¶

Includes conversion, mapping RNA-seq data, and blast comparison of transcriptome
Focus is on scaffolds made from assembly of reads > 10k bp

In [1]:

#updated - trying out new software
!date

Sat Feb 22 10:43:06 PST 2014

pbh5tools¶

dependency- https://github.com/PacificBiosciences/pbcore

In [45]:

!bash5tools.py m130619_081336_42134_c100525122550000001823081109281326_s1_p0.bas.h5.1 --outFilePrefix myreads --outType fasta --readType Raw

usage: bash5tools.py [-h] [--verbose] [--version] [--profile] [--debug] [--outFilePrefix OUTFILEPREFIX]
                     [--readType {ccs,subreads,unrolled}] [--outType OUTTYPE] [--minLength MINLENGTH]
                     [--minReadScore MINREADSCORE] [--minPasses MINPASSES]
                     input.bas.h5
bash5tools.py: error: argument --readType: invalid choice: 'Raw' (choose from 'ccs', 'subreads', 'unrolled')

In [43]:

!bash5tools.py -h

usage: bash5tools.py [-h] [--verbose] [--version] [--profile] [--debug] [--outFilePrefix OUTFILEPREFIX]
                     [--readType {ccs,subreads,unrolled}] [--outType OUTTYPE] [--minLength MINLENGTH]
                     [--minReadScore MINREADSCORE] [--minPasses MINPASSES]
                     input.bas.h5

Tool for extracting data from .bas.h5 files

positional arguments:
  input.bas.h5          input .bas.h5 filename

optional arguments:
  -h, --help            show this help message and exit
  --verbose, -v         Set the verbosity level (default: None)
  --version             show program's version number and exit
  --profile             Print runtime profile at exit (default: False)
  --debug               Run within a debugger session (default: False)
  --outFilePrefix OUTFILEPREFIX
                        output filename prefix [None]
  --readType {ccs,subreads,unrolled}
                        read type (ccs, subreads, or unrolled) []
  --outType OUTTYPE     output file type (fasta, fastq) [fasta]

Read filtering arguments:
  --minLength MINLENGTH
                        min read length [0]
  --minReadScore MINREADSCORE
                        min read score, valid only with --readType={unrolled,subreads} [0]
  --minPasses MINPASSES
                        min number of CCS passes, valid only with --readType=ccs [0]

!wget http://eagle.fish.washington.edu/whale/Pat/Analysis_Results/m130619_081336_42134_c100525122550000001823081109281326_s1_p0.bas.h5

In [33]:

pwd

Out[33]:

u'/Users/sr320/Desktop'

In [37]:

!wc m130619_081336_42134_c100525122550000001823081109281326_s1_p0.bas.h5.1

^C

Initial analysis below¶

Data was provided by core facility with comments:

Data are grouped into folders by SMRT cell. Folders are generically named by the position of the cell in the run (A01_1, B01_1, etc.). Reads are available in PacBio's HDF5 format as .bas.h5 files. Metadata for each SMRT cell is available in PacBio's XML format.

Tools for analysis of .bas.h5 files are available on PacBio's DevNet site (http://pacbiodevnet.com/) including the SMRT Analysis toolkit which can perform de novo assembly and resequencing.

Data was downloaded locally.
http://eagle.fish.washington.edu/whale/index.php?dir=Pat%2F

Two files

Conversion to fastq¶

via Giles.

Basically I took the bas.h5 file and ran it through the pbh5tools, specifically bash5tools.py.

The exact command I ran was

bash5tools.py --readType unrolled --outType fastq inputfile.bas.h5

where inputfile.bas.h5 was the name of the file you got from them.

My only question was whether I did the --readType correctly. I found some info on that online as well as some examples of running this command and they all talked about using a "Raw" mode which was removed from this version of bash5tools so I assumed it was the unrolled option because the other two options (ccs and subreads) where the same in previous versions.

To Install,

First you need numpy and h5py, both are python libraries. I installed these via MacPorts. Then you to install the pbcore library, which is a python library provided by PacBios (same website as the pbh5tools).

Basically download pbcore, unpack, then run python setup.py build sudo python setup.py install

This will install it with all the other python libraries, then do same with pbh5tools. You might need to add things to your PYTHONPATH and PATH environment variables. If you want to, you can customize where it goes by adding a --prefix to the install command but then you have to adjust PYTHONPATH and PATH (which is what I did).

Links:

Understanding PacBios reads https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/Understanding-PacBio-transcriptome-data#wiki-readexplained

Examples for bash5tools.py http://seqanswers.com/forums/archive/index.php/t-16895.html

Github for tools https://github.com/PacificBiosciences

resulting in
http://eagle.fish.washington.edu/cnidarian/OlyO_Pat_PacBio_1.fastq

Importing into CLC¶

info via CLC

CLC bio tools are currently not optimized to handle the specific error profile of PacBio reads, and we therefore do not yet provide support for using this data type.

If you still wish to try using PacBio data in the Genomics Workbench, then if you have your data in standard fastq format, you should be able to succeed in importing it into the CLC bio Genomics Workbench using the Illumina import option. If you choose to do this, please choose the option "NCBI/Sanger or Illumina Pipeline 1.8 or later" under 'Quality scores' in the Import wizard .

Our developers are currently evaluating methods for error correction of PacBio reads (using short read data), and for hybrid de novo assembly using PacBio and short read data combined as input. If the outcome of this work matches the performance of tools commonly used for PacBio error correction (e.g. Celera Assembler using the pacBioToCA utility; Koren et al. 2012. Nature Biotechnology), or hybrid de novo assembly (e.g. ALPATHS-LG; Ribeiro et al. 2012. Genome Research), we will be considering providing such functionality in the future.

thus

Stats¶

47,475 Sequences

QC Report - http://eagle.fish.washington.edu/cnidarian/OlyO_PacBio_QC.pdf

also viewable below

https://docs.google.com/file/d/0B9V_gF766XZATTlWVzBDMkxxOGM/preview

More stats-
3058 sequences > 10000 bp

http://eagle.fish.washington.edu/cnidarian/OlyO_Pat_PacBio_10k.fa

Running denovo assembly on
Fast (simple contigs)

http://eagle.fish.washington.edu/cnidarian/OlyO_Pat_PacBio_10k_contigs.fa

https://docs.google.com/file/d/0B9V_gF766XZAOUg3ajNnbDZoMk0/preview

In [5]:

!head /Volumes/web/cnidarian/OlyO_Pat_PacBio_10k_contigs.fa

>OlyO_Pat_PacBio_10k_contig_1
AAAAAAAGGGAGATGTTTTCCTCATGTTGAATTGAATTCTTCAACTCATTTAAATCAGGC
AAGTGTTCAACACTAACGCTAGTTCCGGTTAGCCTGTAGGTCTAATAACTTTTGTCCAAA
CTACCAGGATATACAATTATGCACTGTTTAGCAGGGAGACATCACAAAAGGTATTTAATT
CCGATTACGCAAGAGCTTTTCTGCGTATTGATCAGGTTTTTTGAATCGAGGCAATGCTAC
CTTACAGACAAAGTTTGTTGTTGTTCAGGGGTTTCAATAGTCACGTTTTAAAGGCAGCAT
AATTTCGTTATAATTCTAATGGTCGTTATACAATTTAGTTTACCAATATTTGACCTATCA
TTTGAGTCAAATACTTGGTCCTGATTGTGGTTTCATAGCCATTGTTAAGTGCCGTTTTAC
ACACTGTATTTGACTACGGACTGTTCCGTTACCTGATCAAGACTATGGGGCCTCCCACGG
CGGGTGTGACGCGGTCAACAGGGGATGCCTACTCCTCTCTAGGCAGCCTGATCCCCACTT

Ran cd-hit-est via webserver¶

Summary: Nothing combined at 90% level (probably to big for cd-est hit)

Importing in iPlant¶

TopHat2-PE

Reference Genome - OlyO_Pat_PacBio_10k_contigs.fa

Output - OlyO_10kcontigs_TopHatm106

In [32]:

#finished

output file
http://eagle.fish.washington.edu/cnidarian/OlyO10k_filtered_106A_Male_Mix_TAGCTT_L004_R1.bam

Annotating 10kcontigs¶

http://eagle.fish.washington.edu/cnidarian/OlyO_Pat_PacBio_10k_contigs.fa

Have working copy of transcriptome http://eagle.fish.washington.edu/cnidarian/OlyOv3_400bp.fasta and will blastn against 10k contigs

In [8]:

!makeblastdb -in /Volumes/web/cnidarian/OlyO_Pat_PacBio_10k_contigs.fa -dbtype nucl -out /Volumes/Bay3/Software/ncbi-blast-2.2.27\+/db/OlyO_PacBio_10kcontigs


Building a new DB, current time: 07/10/2013 07:40:30
New DB name:   /Volumes/Bay3/Software/ncbi-blast-2.2.27+/db/OlyO_PacBio_10kcontigs
New DB title:  /Volumes/web/cnidarian/OlyO_Pat_PacBio_10k_contigs.fa
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 553 sequences in 0.575964 seconds.

In [11]:

!blastn -query /Volumes/web/cnidarian/OlyOv3_400bp.fasta -db /Volumes/Bay3/Software/ncbi-blast-2.2.27\+/db/OlyO_PacBio_10kcontigs -out /Volumes/web/cnidarian/OlyOv3_400_blastn_OlyO10kcontigs -outfmt 6 -evalue 1E-20 -num_threads 2 -task blastn

In [12]:

!head /Volumes/web/cnidarian/OlyOv3_400_blastn_OlyO10kcontigs

4486232	OlyO_Pat_PacBio_10k_contig_319	89.39	132	4	5	142	264	21178	21048	1e-39	 161
4486232	OlyO_Pat_PacBio_10k_contig_319	90.00	130	4	6	268	390	13091	12964	1e-38	 158
4486232	OlyO_Pat_PacBio_10k_contig_319	90.00	130	4	6	268	390	20765	20638	1e-38	 158
4486232	OlyO_Pat_PacBio_10k_contig_319	87.88	132	6	5	142	264	13504	13374	5e-37	 152
4486232	OlyO_Pat_PacBio_10k_contig_319	84.50	129	9	9	144	267	11561	11683	4e-25	 113
4486232	OlyO_Pat_PacBio_10k_contig_319	84.50	129	9	9	144	267	19235	19357	4e-25	 113
4486232	OlyO_Pat_PacBio_10k_contig_319	82.14	140	9	11	142	265	5867	5728	7e-23	 105
4486255	OlyO_Pat_PacBio_10k_contig_420	77.51	418	31	33	36	397	9827	10237	4e-63	 239
4486256	OlyO_Pat_PacBio_10k_contig_420	77.51	418	31	33	5	366	10237	9827	4e-63	 239
4486297	OlyO_Pat_PacBio_10k_contig_420	78.89	289	34	17	2	276	24459	24184	2e-49	 194

In [13]:

!head /Volumes/web/cnidarian/OlyOv3_400bp.fasta

>4485895 length 400 cvg_67.8_tip_0
ACCCAGAAAGGTTTAAAGAATGTATTTGATGAAGCCATTCTGGCTGCTCTGGAACCTCCTGAACCACCCAAAAAGAAGAAGTGTGTGTTGTTGTAATCTT
TGAACTCTCGTCAGTTTCATGTGTAATCATAGAATGATTTCAACTTGTCATCTGTGGGAAAATCTTGTGCAAAATTAAAAATAAAAACCACTTTTATACA
TGTCTGGATAAGTATTTTCACAGATGGAAGAGTGCGGGTTGAAATAGAGATTATTCCAACTTTCTGAAGAAAAGGAATATTTGAAGTTCCTGAGACGGAA
AAGGCAGGTGTTATTTTCAAGCGAACCACTAGCACAGTGCTGTGGTTTTATTATCCCATATGGGTCCAATGAACATATGATTTGTAAATATATATATAAT

>4485897 length 400 cvg_5.5_tip_1
ACACTGCACATCGCGGTCCTAGATTTTAACGACAACACGCCGTATTTCCTTAACAGTACATATAAATTTAGTGAAAATGAATCCACTTACAACAGAACAA
GAATTGGAGCATTGTATGCCCATGATCTGGACTCGGGACAAAACGCCAATATAACGTTTTCTATCTCTGGAGGAAACAGTCAAGGACATTTTCAAATAGA
TCCATACACGGGTGATTTATTCATAAATGGCCTTATCGACAGAGAAAATGTATCCTCATACAACCTGAGAGTTACTATAAGAGATAATCCGTCTAATCCG

In [14]:

!grep -c "OlyO_Pat_" /Volumes/web/cnidarian/OlyOv3_400_blastn_OlyO10kcontigs

Will try perl script I modified to take this "reverse" blast and make a gff file.

Another option in SQLShare

In [27]:

!2_Blast2gff.pl -i /Volumes/web/cnidarian/OlyOv3_400_blastn_OlyO10kcontigs -o /Users/sr320/Desktop/OlyO10kcontigs_exon_a.gff -d "OlyOv3_400" -p EXON -s "something"

In [30]:

mv /Users/sr320/Desktop/OlyO10kcontigs_exon_a.gff /Volumes/web/cnidarian/

mv: /Volumes/web/cnidarian/OlyO10kcontigs_exon_a.gff: set owner/group (was: 501/20): Operation not permitted

perl script would not write to eagle so had to move

http://eagle.fish.washington.edu/cnidarian/OlyO10kcontigs_exon_a.gff

In [31]:

!head /Volumes/web/cnidarian/OlyO10kcontigs_exon_a.gff

OlyO_Pat_PacBio_10k_contig_319	blastn:OlyOv3_400	blastn	21178	21048	1e-39	-	.	4486232	
OlyO_Pat_PacBio_10k_contig_319	blastn:OlyOv3_400	blastn	13091	12964	1e-38	-	.	4486232	
OlyO_Pat_PacBio_10k_contig_319	blastn:OlyOv3_400	blastn	20765	20638	1e-38	-	.	4486232	
OlyO_Pat_PacBio_10k_contig_319	blastn:OlyOv3_400	blastn	13504	13374	5e-37	-	.	4486232	
OlyO_Pat_PacBio_10k_contig_319	blastn:OlyOv3_400	blastn	11561	11683	4e-25	+	.	4486232	
OlyO_Pat_PacBio_10k_contig_319	blastn:OlyOv3_400	blastn	19235	19357	4e-25	+	.	4486232	
OlyO_Pat_PacBio_10k_contig_319	blastn:OlyOv3_400	blastn	5867	5728	7e-23	-	.	4486232	
OlyO_Pat_PacBio_10k_contig_420	blastn:OlyOv3_400	blastn	9827	10237	4e-63	+	.	4486255	
OlyO_Pat_PacBio_10k_contig_420	blastn:OlyOv3_400	blastn	10237	9827	4e-63	-	.	4486256	
OlyO_Pat_PacBio_10k_contig_420	blastn:OlyOv3_400	blastn	24459	24184	2e-49	-	.	4486297

In [2]:

!wget https://usegalaxy.org/dataset/display?dataset_id=bbd44e69cb8906b586614e867a9985b1&to_ext=bed

In [3]:

pwd

Out[3]:

u'/Users/Steven/Dropbox/Steven/ipython_nb'

In [4]:

ls

BiGill_Gene_Methylation.ipynb*         README.md
BiGill_RNAseq.ipynb                    README_old.md
BiGill_Tran_Elements.ipynb*            Roberts_cv.ipynb
BiGill_array.ipynb                     Ruphi_OA_RNAseq.ipynb*
BiGo - methratio error.ipynb*          TJGR_CpG_binding.ipynb
BiGo_GFF_dev.ipynb*                    TJGR_CpG_islands.ipynb
BiGo_RNAseq.ipynb*                     TJGR_Methylation_GenomeSnapshot.ipynb
BiGo_larvae.ipynb                      TJGR_Mgo_Expression.ipynb*
BiGo_larvae_2.ipynb                    TJGR_OysterGenome_IGV.ipynb*
BiGo_larvae_methylkit.ipynb            TJGR_ProteinAnnot.ipynb
BiGo_methratio.ipynb                   TJGR_pearl.ipynb*
BiGo_methratio_mito.ipynb              UW_SoftwareBootcamp.ipynb
BlackAb_Annot.ipynb*                   _BiGO_expPLOT.ipynb
CC_ampk.ipynb                          _BiGo_RNAseq.pdf
EY_methratio.ipynb                     _BiGo_larvae_bsmap274.ipynb
Gene Specific Methylation.ipynb*       _MGarray_blast.ipynb
LT_C1q.ipynb                           _scratch.ipynb
OA_MS_data_check.ipynb                 fish546/
OlyO_Chi_Exp.ipynb*                    gen_workflows.ipynb
OlyO_GonadExp.ipynb                    img/
OlyO_PacBio.ipynb*                     intersectbed.ipynb
OlyO_transcriptome.ipynb               snippets.md
OsHV_host.ipynb*                       tools/
PAG_2014.ipynb

Version 2 of genome¶