Includes conversion, mapping RNA-seq data, and blast comparison of transcriptome
Focus is on scaffolds made from assembly of reads > 10k bp
#updated - trying out new software
!date
Sat Feb 22 10:43:06 PST 2014
dependency- https://github.com/PacificBiosciences/pbcore
!bash5tools.py m130619_081336_42134_c100525122550000001823081109281326_s1_p0.bas.h5.1 --outFilePrefix myreads --outType fasta --readType Raw
usage: bash5tools.py [-h] [--verbose] [--version] [--profile] [--debug] [--outFilePrefix OUTFILEPREFIX] [--readType {ccs,subreads,unrolled}] [--outType OUTTYPE] [--minLength MINLENGTH] [--minReadScore MINREADSCORE] [--minPasses MINPASSES] input.bas.h5 bash5tools.py: error: argument --readType: invalid choice: 'Raw' (choose from 'ccs', 'subreads', 'unrolled')
!bash5tools.py -h
usage: bash5tools.py [-h] [--verbose] [--version] [--profile] [--debug] [--outFilePrefix OUTFILEPREFIX] [--readType {ccs,subreads,unrolled}] [--outType OUTTYPE] [--minLength MINLENGTH] [--minReadScore MINREADSCORE] [--minPasses MINPASSES] input.bas.h5 Tool for extracting data from .bas.h5 files positional arguments: input.bas.h5 input .bas.h5 filename optional arguments: -h, --help show this help message and exit --verbose, -v Set the verbosity level (default: None) --version show program's version number and exit --profile Print runtime profile at exit (default: False) --debug Run within a debugger session (default: False) --outFilePrefix OUTFILEPREFIX output filename prefix [None] --readType {ccs,subreads,unrolled} read type (ccs, subreads, or unrolled) [] --outType OUTTYPE output file type (fasta, fastq) [fasta] Read filtering arguments: --minLength MINLENGTH min read length [0] --minReadScore MINREADSCORE min read score, valid only with --readType={unrolled,subreads} [0] --minPasses MINPASSES min number of CCS passes, valid only with --readType=ccs [0]
pwd
u'/Users/sr320/Desktop'
!wc m130619_081336_42134_c100525122550000001823081109281326_s1_p0.bas.h5.1
^C
Data was provided by core facility with comments:
Data are grouped into folders by SMRT cell. Folders are generically named by the position of the cell in the run (A01_1, B01_1, etc.). Reads are available in PacBio's HDF5 format as .bas.h5 files. Metadata for each SMRT cell is available in PacBio's XML format.
Tools for analysis of .bas.h5 files are available on PacBio's DevNet site (http://pacbiodevnet.com/) including the SMRT Analysis toolkit which can perform de novo assembly and resequencing.
Data was downloaded locally.
http://eagle.fish.washington.edu/whale/index.php?dir=Pat%2F
Two files
via Giles.
Basically I took the bas.h5 file and ran it through the pbh5tools, specifically bash5tools.py.
The exact command I ran was
bash5tools.py --readType unrolled --outType fastq inputfile.bas.h5
where inputfile.bas.h5 was the name of the file you got from them.
My only question was whether I did the --readType correctly. I found some info on that online as well as some examples of running this command and they all talked about using a "Raw" mode which was removed from this version of bash5tools so I assumed it was the unrolled option because the other two options (ccs and subreads) where the same in previous versions.
To Install,
First you need numpy and h5py, both are python libraries. I installed these via MacPorts. Then you to install the pbcore library, which is a python library provided by PacBios (same website as the pbh5tools).
Basically download pbcore, unpack, then run python setup.py build sudo python setup.py install
This will install it with all the other python libraries, then do same with pbh5tools. You might need to add things to your PYTHONPATH and PATH environment variables. If you want to, you can customize where it goes by adding a --prefix
Links:
Understanding PacBios reads https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/Understanding-PacBio-transcriptome-data#wiki-readexplained
Examples for bash5tools.py http://seqanswers.com/forums/archive/index.php/t-16895.html
Github for tools https://github.com/PacificBiosciences
resulting in
http://eagle.fish.washington.edu/cnidarian/OlyO_Pat_PacBio_1.fastq
info via CLC
CLC bio tools are currently not optimized to handle the specific error profile of PacBio reads, and we therefore do not yet provide support for using this data type.
If you still wish to try using PacBio data in the Genomics Workbench, then if you have your data in standard fastq format, you should be able to succeed in importing it into the CLC bio Genomics Workbench using the Illumina import option. If you choose to do this, please choose the option "NCBI/Sanger or Illumina Pipeline 1.8 or later" under 'Quality scores' in the Import wizard .
Our developers are currently evaluating methods for error correction of PacBio reads (using short read data), and for hybrid de novo assembly using PacBio and short read data combined as input. If the outcome of this work matches the performance of tools commonly used for PacBio error correction (e.g. Celera Assembler using the pacBioToCA utility; Koren et al. 2012. Nature Biotechnology), or hybrid de novo assembly (e.g. ALPATHS-LG; Ribeiro et al. 2012. Genome Research), we will be considering providing such functionality in the future.
thus
47,475 Sequences
QC Report - http://eagle.fish.washington.edu/cnidarian/OlyO_PacBio_QC.pdf
also viewable below
https://docs.google.com/file/d/0B9V_gF766XZATTlWVzBDMkxxOGM/preview
More stats-
3058 sequences > 10000 bp
http://eagle.fish.washington.edu/cnidarian/OlyO_Pat_PacBio_10k.fa
Running denovo assembly on
Fast (simple contigs)
http://eagle.fish.washington.edu/cnidarian/OlyO_Pat_PacBio_10k_contigs.fa
https://docs.google.com/file/d/0B9V_gF766XZAOUg3ajNnbDZoMk0/preview
!head /Volumes/web/cnidarian/OlyO_Pat_PacBio_10k_contigs.fa
>OlyO_Pat_PacBio_10k_contig_1 AAAAAAAGGGAGATGTTTTCCTCATGTTGAATTGAATTCTTCAACTCATTTAAATCAGGC AAGTGTTCAACACTAACGCTAGTTCCGGTTAGCCTGTAGGTCTAATAACTTTTGTCCAAA CTACCAGGATATACAATTATGCACTGTTTAGCAGGGAGACATCACAAAAGGTATTTAATT CCGATTACGCAAGAGCTTTTCTGCGTATTGATCAGGTTTTTTGAATCGAGGCAATGCTAC CTTACAGACAAAGTTTGTTGTTGTTCAGGGGTTTCAATAGTCACGTTTTAAAGGCAGCAT AATTTCGTTATAATTCTAATGGTCGTTATACAATTTAGTTTACCAATATTTGACCTATCA TTTGAGTCAAATACTTGGTCCTGATTGTGGTTTCATAGCCATTGTTAAGTGCCGTTTTAC ACACTGTATTTGACTACGGACTGTTCCGTTACCTGATCAAGACTATGGGGCCTCCCACGG CGGGTGTGACGCGGTCAACAGGGGATGCCTACTCCTCTCTAGGCAGCCTGATCCCCACTT
Summary: Nothing combined at 90% level (probably to big for cd-est hit)
TopHat2-PE
Reference Genome - OlyO_Pat_PacBio_10k_contigs.fa
Output - OlyO_10kcontigs_TopHatm106
#finished
output file
http://eagle.fish.washington.edu/cnidarian/OlyO10k_filtered_106A_Male_Mix_TAGCTT_L004_R1.bam
http://eagle.fish.washington.edu/cnidarian/OlyO_Pat_PacBio_10k_contigs.fa
Have working copy of transcriptome http://eagle.fish.washington.edu/cnidarian/OlyOv3_400bp.fasta and will blastn against 10k contigs
!makeblastdb -in /Volumes/web/cnidarian/OlyO_Pat_PacBio_10k_contigs.fa -dbtype nucl -out /Volumes/Bay3/Software/ncbi-blast-2.2.27\+/db/OlyO_PacBio_10kcontigs
Building a new DB, current time: 07/10/2013 07:40:30 New DB name: /Volumes/Bay3/Software/ncbi-blast-2.2.27+/db/OlyO_PacBio_10kcontigs New DB title: /Volumes/web/cnidarian/OlyO_Pat_PacBio_10k_contigs.fa Sequence type: Nucleotide Keep Linkouts: T Keep MBits: T Maximum file size: 1000000000B Adding sequences from FASTA; added 553 sequences in 0.575964 seconds.
!blastn -query /Volumes/web/cnidarian/OlyOv3_400bp.fasta -db /Volumes/Bay3/Software/ncbi-blast-2.2.27\+/db/OlyO_PacBio_10kcontigs -out /Volumes/web/cnidarian/OlyOv3_400_blastn_OlyO10kcontigs -outfmt 6 -evalue 1E-20 -num_threads 2 -task blastn
!head /Volumes/web/cnidarian/OlyOv3_400_blastn_OlyO10kcontigs
4486232 OlyO_Pat_PacBio_10k_contig_319 89.39 132 4 5 142 264 21178 21048 1e-39 161 4486232 OlyO_Pat_PacBio_10k_contig_319 90.00 130 4 6 268 390 13091 12964 1e-38 158 4486232 OlyO_Pat_PacBio_10k_contig_319 90.00 130 4 6 268 390 20765 20638 1e-38 158 4486232 OlyO_Pat_PacBio_10k_contig_319 87.88 132 6 5 142 264 13504 13374 5e-37 152 4486232 OlyO_Pat_PacBio_10k_contig_319 84.50 129 9 9 144 267 11561 11683 4e-25 113 4486232 OlyO_Pat_PacBio_10k_contig_319 84.50 129 9 9 144 267 19235 19357 4e-25 113 4486232 OlyO_Pat_PacBio_10k_contig_319 82.14 140 9 11 142 265 5867 5728 7e-23 105 4486255 OlyO_Pat_PacBio_10k_contig_420 77.51 418 31 33 36 397 9827 10237 4e-63 239 4486256 OlyO_Pat_PacBio_10k_contig_420 77.51 418 31 33 5 366 10237 9827 4e-63 239 4486297 OlyO_Pat_PacBio_10k_contig_420 78.89 289 34 17 2 276 24459 24184 2e-49 194
!head /Volumes/web/cnidarian/OlyOv3_400bp.fasta
>4485895 length 400 cvg_67.8_tip_0 ACCCAGAAAGGTTTAAAGAATGTATTTGATGAAGCCATTCTGGCTGCTCTGGAACCTCCTGAACCACCCAAAAAGAAGAAGTGTGTGTTGTTGTAATCTT TGAACTCTCGTCAGTTTCATGTGTAATCATAGAATGATTTCAACTTGTCATCTGTGGGAAAATCTTGTGCAAAATTAAAAATAAAAACCACTTTTATACA TGTCTGGATAAGTATTTTCACAGATGGAAGAGTGCGGGTTGAAATAGAGATTATTCCAACTTTCTGAAGAAAAGGAATATTTGAAGTTCCTGAGACGGAA AAGGCAGGTGTTATTTTCAAGCGAACCACTAGCACAGTGCTGTGGTTTTATTATCCCATATGGGTCCAATGAACATATGATTTGTAAATATATATATAAT >4485897 length 400 cvg_5.5_tip_1 ACACTGCACATCGCGGTCCTAGATTTTAACGACAACACGCCGTATTTCCTTAACAGTACATATAAATTTAGTGAAAATGAATCCACTTACAACAGAACAA GAATTGGAGCATTGTATGCCCATGATCTGGACTCGGGACAAAACGCCAATATAACGTTTTCTATCTCTGGAGGAAACAGTCAAGGACATTTTCAAATAGA TCCATACACGGGTGATTTATTCATAAATGGCCTTATCGACAGAGAAAATGTATCCTCATACAACCTGAGAGTTACTATAAGAGATAATCCGTCTAATCCG
!grep -c "OlyO_Pat_" /Volumes/web/cnidarian/OlyOv3_400_blastn_OlyO10kcontigs
1730
Will try perl script I modified to take this "reverse" blast and make a gff file.
Another option in SQLShare
!2_Blast2gff.pl -i /Volumes/web/cnidarian/OlyOv3_400_blastn_OlyO10kcontigs -o /Users/sr320/Desktop/OlyO10kcontigs_exon_a.gff -d "OlyOv3_400" -p EXON -s "something"
mv /Users/sr320/Desktop/OlyO10kcontigs_exon_a.gff /Volumes/web/cnidarian/
mv: /Volumes/web/cnidarian/OlyO10kcontigs_exon_a.gff: set owner/group (was: 501/20): Operation not permitted
perl script would not write to eagle so had to move
!head /Volumes/web/cnidarian/OlyO10kcontigs_exon_a.gff
OlyO_Pat_PacBio_10k_contig_319 blastn:OlyOv3_400 blastn 21178 21048 1e-39 - . 4486232 OlyO_Pat_PacBio_10k_contig_319 blastn:OlyOv3_400 blastn 13091 12964 1e-38 - . 4486232 OlyO_Pat_PacBio_10k_contig_319 blastn:OlyOv3_400 blastn 20765 20638 1e-38 - . 4486232 OlyO_Pat_PacBio_10k_contig_319 blastn:OlyOv3_400 blastn 13504 13374 5e-37 - . 4486232 OlyO_Pat_PacBio_10k_contig_319 blastn:OlyOv3_400 blastn 11561 11683 4e-25 + . 4486232 OlyO_Pat_PacBio_10k_contig_319 blastn:OlyOv3_400 blastn 19235 19357 4e-25 + . 4486232 OlyO_Pat_PacBio_10k_contig_319 blastn:OlyOv3_400 blastn 5867 5728 7e-23 - . 4486232 OlyO_Pat_PacBio_10k_contig_420 blastn:OlyOv3_400 blastn 9827 10237 4e-63 + . 4486255 OlyO_Pat_PacBio_10k_contig_420 blastn:OlyOv3_400 blastn 10237 9827 4e-63 - . 4486256 OlyO_Pat_PacBio_10k_contig_420 blastn:OlyOv3_400 blastn 24459 24184 2e-49 - . 4486297
!wget https://usegalaxy.org/dataset/display?dataset_id=bbd44e69cb8906b586614e867a9985b1&to_ext=bed
pwd
u'/Users/Steven/Dropbox/Steven/ipython_nb'
ls
BiGill_Gene_Methylation.ipynb* README.md BiGill_RNAseq.ipynb README_old.md BiGill_Tran_Elements.ipynb* Roberts_cv.ipynb BiGill_array.ipynb Ruphi_OA_RNAseq.ipynb* BiGo - methratio error.ipynb* TJGR_CpG_binding.ipynb BiGo_GFF_dev.ipynb* TJGR_CpG_islands.ipynb BiGo_RNAseq.ipynb* TJGR_Methylation_GenomeSnapshot.ipynb BiGo_larvae.ipynb TJGR_Mgo_Expression.ipynb* BiGo_larvae_2.ipynb TJGR_OysterGenome_IGV.ipynb* BiGo_larvae_methylkit.ipynb TJGR_ProteinAnnot.ipynb BiGo_methratio.ipynb TJGR_pearl.ipynb* BiGo_methratio_mito.ipynb UW_SoftwareBootcamp.ipynb BlackAb_Annot.ipynb* _BiGO_expPLOT.ipynb CC_ampk.ipynb _BiGo_RNAseq.pdf EY_methratio.ipynb _BiGo_larvae_bsmap274.ipynb Gene Specific Methylation.ipynb* _MGarray_blast.ipynb LT_C1q.ipynb _scratch.ipynb OA_MS_data_check.ipynb fish546/ OlyO_Chi_Exp.ipynb* gen_workflows.ipynb OlyO_GonadExp.ipynb img/ OlyO_PacBio.ipynb* intersectbed.ipynb OlyO_transcriptome.ipynb snippets.md OsHV_host.ipynb* tools/ PAG_2014.ipynb
!/Users/sr320/Dropbox/Steven/gt/fish546/2_Blast2Gff.pl \
-i /Volumes/web/dermochelys/Bioinformatics/IGV\ stuff/otbnos.tab \
-o /Volumes/web/cnidarian/Olytranv2_OlyPacBio_v02.gff \
-d "OlyOvtran_v2" \
-p EXON \
-s "something"
!head /Volumes/web/cnidarian/OlyOv3_400_blastn_OlyO10kcontigs
4486232 OlyO_Pat_PacBio_10k_contig_319 89.39 132 4 5 142 264 21178 21048 1e-39 161 4486232 OlyO_Pat_PacBio_10k_contig_319 90.00 130 4 6 268 390 13091 12964 1e-38 158 4486232 OlyO_Pat_PacBio_10k_contig_319 90.00 130 4 6 268 390 20765 20638 1e-38 158 4486232 OlyO_Pat_PacBio_10k_contig_319 87.88 132 6 5 142 264 13504 13374 5e-37 152 4486232 OlyO_Pat_PacBio_10k_contig_319 84.50 129 9 9 144 267 11561 11683 4e-25 113 4486232 OlyO_Pat_PacBio_10k_contig_319 84.50 129 9 9 144 267 19235 19357 4e-25 113 4486232 OlyO_Pat_PacBio_10k_contig_319 82.14 140 9 11 142 265 5867 5728 7e-23 105 4486255 OlyO_Pat_PacBio_10k_contig_420 77.51 418 31 33 36 397 9827 10237 4e-63 239 4486256 OlyO_Pat_PacBio_10k_contig_420 77.51 418 31 33 5 366 10237 9827 4e-63 239 4486297 OlyO_Pat_PacBio_10k_contig_420 78.89 289 34 17 2 276 24459 24184 2e-49 194
!head /Users/sr320/Desktop/OlyO10kcontigs_exon_a.gff
OlyO_Pat_PacBio_10k_contig_319 blastn:OlyOv3_400 blastn 21178 21048 1e-39 - . 4486232 OlyO_Pat_PacBio_10k_contig_319 blastn:OlyOv3_400 blastn 13091 12964 1e-38 - . 4486232 OlyO_Pat_PacBio_10k_contig_319 blastn:OlyOv3_400 blastn 20765 20638 1e-38 - . 4486232 OlyO_Pat_PacBio_10k_contig_319 blastn:OlyOv3_400 blastn 13504 13374 5e-37 - . 4486232 OlyO_Pat_PacBio_10k_contig_319 blastn:OlyOv3_400 blastn 11561 11683 4e-25 + . 4486232 OlyO_Pat_PacBio_10k_contig_319 blastn:OlyOv3_400 blastn 19235 19357 4e-25 + . 4486232 OlyO_Pat_PacBio_10k_contig_319 blastn:OlyOv3_400 blastn 5867 5728 7e-23 - . 4486232 OlyO_Pat_PacBio_10k_contig_420 blastn:OlyOv3_400 blastn 9827 10237 4e-63 + . 4486255 OlyO_Pat_PacBio_10k_contig_420 blastn:OlyOv3_400 blastn 10237 9827 4e-63 - . 4486256 OlyO_Pat_PacBio_10k_contig_420 blastn:OlyOv3_400 blastn 24459 24184 2e-49 - . 4486297
!head /Volumes/web/dermochelys/Bioinformatics/IGV\ stuff/otbnos.tab
Olurida_trim_nodups_v2reads_contig_9 OlyO_Pat_PacBio_1_contig_549 87.16 7716 122 762 6 7148 9554 2135 0.0 7956 Olurida_trim_nodups_v2reads_contig_9 OlyO_Pat_PacBio_1_contig_550 86.10 6915 116 736 873 7148 7401 693 0.0 6665 Olurida_trim_nodups_v2reads_contig_9 OlyO_Pat_PacBio_1_contig_550 87.76 1699 18 167 874 2434 7475 9121 0.0 1810 Olurida_trim_nodups_v2reads_contig_9 OlyO_Pat_PacBio_1_contig_550 85.99 828 18 90 2747 3508 11421 10626 0.0 797 Olurida_trim_nodups_v2reads_contig_9 OlyO_Pat_PacBio_1_contig_550 80.39 1188 15 153 2441 3428 9248 10417 0.0 702 Olurida_trim_nodups_v2reads_contig_9 OlyO_Pat_PacBio_1_contig_550 91.46 82 1 5 2559 2637 5083 5005 1e-21 108 Olurida_trim_nodups_v2reads_contig_9 OlyO_Pat_PacBio_1_contig_1964 81.58 7947 208 1059 6 7148 9076 1582 0.0 5406 Olurida_trim_nodups_v2reads_contig_9 OlyO_Pat_PacBio_1_contig_5046 87.24 4946 52 527 873 5348 5837 10673 0.0 5103 Olurida_trim_nodups_v2reads_contig_9 OlyO_Pat_PacBio_1_contig_252 82.64 6913 77 873 1241 7150 6128 12920 0.0 5081 Olurida_trim_nodups_v2reads_contig_9 OlyO_Pat_PacBio_1_contig_2528 84.63 5140 96 596 6 4590 4524 9524 0.0 4475
# could just awk etc "!awk '{ print $10 }'
!head /Volumes/web/cnidarian/Olytranv2_OlyPacBio_v02.gff
OlyO_Pat_PacBio_1_contig_549 blastn:OlyOvtran_v2 blastn 9554 2135 0.0 - . Olurida_trim_nodups_v2reads_contig_9 OlyO_Pat_PacBio_1_contig_550 blastn:OlyOvtran_v2 blastn 7401 693 0.0 - . Olurida_trim_nodups_v2reads_contig_9 OlyO_Pat_PacBio_1_contig_550 blastn:OlyOvtran_v2 blastn 7475 9121 0.0 + . Olurida_trim_nodups_v2reads_contig_9 OlyO_Pat_PacBio_1_contig_550 blastn:OlyOvtran_v2 blastn 11421 10626 0.0 - . Olurida_trim_nodups_v2reads_contig_9 OlyO_Pat_PacBio_1_contig_550 blastn:OlyOvtran_v2 blastn 9248 10417 0.0 + . Olurida_trim_nodups_v2reads_contig_9 OlyO_Pat_PacBio_1_contig_550 blastn:OlyOvtran_v2 blastn 5083 5005 1e-21 - . Olurida_trim_nodups_v2reads_contig_9 OlyO_Pat_PacBio_1_contig_1964 blastn:OlyOvtran_v2 blastn 9076 1582 0.0 - . Olurida_trim_nodups_v2reads_contig_9 OlyO_Pat_PacBio_1_contig_5046 blastn:OlyOvtran_v2 blastn 5837 10673 0.0 + . Olurida_trim_nodups_v2reads_contig_9 OlyO_Pat_PacBio_1_contig_252 blastn:OlyOvtran_v2 blastn 6128 12920 0.0 + . Olurida_trim_nodups_v2reads_contig_9 OlyO_Pat_PacBio_1_contig_2528 blastn:OlyOvtran_v2 blastn 4524 9524 0.0 + . Olurida_trim_nodups_v2reads_contig_9