This IPython notebook provides a structured way to annotate sequences using the UniProt/Swiss-Prot database. It can easily be modified to suit personal preferences. As developed, the notebook requires that the user have the following software installed ...
Instructions for use:
In a working directory of your choosing, place the query FASTA file and name it query.fa. Edit the cell below, providing the path to that working directory.
Identify the location of the BLAST database you would like to use and indicate its path in the cell below.
Identify the location of your sqlshare-pythonclient/tools directory and indicate its path in the cell below.
Change the value of the usr variable to reflect your SQLShare user account.
#Location Variables
wd="/Users/sr320/git-repos/austral/modules/data/wd/"
db="/Users/sr320/data-genomic/blast/db/uniprot_sprot_r2015_01"
sqls="/Applications/bioinfo/sqlshare-pythonclient/tools/"
usr="sr320@washington.edu"
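Before running the remaining cells, it may help to confirm that the three locations point where you expect. The following is a minimal sketch, not part of the original notebook: `check_locations` is a hypothetical helper, and it assumes the BLAST database path is a prefix shared by several index files rather than a single file.

```python
import glob
import os

def check_locations(wd, db, sqls):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    # The notebook expects the query file to be named query.fa inside wd.
    if not os.path.isfile(os.path.join(wd, "query.fa")):
        problems.append("query.fa not found in wd")
    # A BLAST database path is a prefix for several files, so match on prefix.
    if not glob.glob(db + ".*"):
        problems.append("no files found with the db prefix")
    if not os.path.isdir(sqls):
        problems.append("sqls is not a directory")
    return problems

# Hypothetical paths, used only to show the output for a failing setup.
print(check_locations("/no/such/wd/", "/no/such/db", "/no/such/tools/"))
```

In the notebook itself one would call `check_locations(wd, db, sqls)` with the variables defined above.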
!head {wd}query.fa
>PiuraChilensis_v1_contig_1
ATTTACAATACGAAGTAAAATAGATAACGTGAAAATAATCTTGGTGCTGGATGATCGATC
AAGTTCACCAATATTTTATTGTAAAAAATCATTCTAAACAGCATGAAATCGTGTACAATG
TATAAACAAGCAAATATATAACACTAAAGCAAGAGGGCGTAAGTGGGGGGGTGGGTGAGA
GTAAAAAATTCAAACATGTCAAATACCCCGGCGTTAGCCTTAAAAGCACCATGGACTTCT
GCCTTCAATAAGCATAAAATTAAAACACCTAATACACAATGAATATACAGATAAAACAGA
TTTATGAATAGTTGGTGTTACATCTTTTACAGCCATAAGCCTTCATTTTGCTTCCAAACG
TATAAAATCTGACTTGGAACAATATACAGCCATGAGATATGACACAGCGAGCACTACAAT
ATATATTTATCTTGTACTATACAGCCTGTACAAGAAAATTCTGGAATTGTCTTCACAAGA
GACAGAAAAATAGTTGCAATGTGAATGCTAGTCTACTATTTGATCACAATTGGATAGAAA
#number of sequences
!fgrep -c ">" {wd}query.fa
15022
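The count above relies on each FASTA record having exactly one header line. A rough Python equivalent is sketched below. `count_fasta_records` is a hypothetical helper, and it is slightly stricter than `fgrep -c ">"` because it only counts lines that start with `>` rather than lines containing `>` anywhere.

```python
def count_fasta_records(lines):
    """Count FASTA records by counting header lines, which start with '>'."""
    return sum(1 for line in lines if line.startswith(">"))

# Toy input for illustration; in the notebook one could pass
# open(wd + "query.fa") instead of this list.
example = [
    ">contig_1",
    "ATTTACAATACG",
    ">contig_2",
    "GGCATTA",
]
print(count_fasta_records(example))
```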
!/Applications/bioinfo/ncbi-blast-2.2.30/bin/blastx \
-query {wd}query.fa \
-db {db} \
-max_target_seqs 1 \
-max_hsps 1 \
-outfmt 6 \
-evalue 1E-05 \
-num_threads 2 \
-out {wd}blast_sprot.tab
!wc -l {wd}blast_sprot.tab
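With `-outfmt 6` and no custom field list, BLAST+ writes twelve tab-separated columns per hit, with no header row. The sketch below labels those columns for a single line. `parse_blast6_line` is a hypothetical helper, and the example hit is made up for illustration, not taken from this notebook's output.

```python
# Default column order for BLAST+ tabular output (-outfmt 6).
BLAST6_FIELDS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]

def parse_blast6_line(line):
    """Split one tabular BLAST line into a dict keyed by field name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(BLAST6_FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(BLAST6_FIELDS), len(values)))
    return dict(zip(BLAST6_FIELDS, values))

# Made-up hit line (fake accession and scores).
example = "contig_1\tsp|Q00000|FAKE_HUMAN\t45.5\t110\t60\t0\t10\t339\t5\t114\t2e-20\t95.1"
hit = parse_blast6_line(example)
print(hit["sseqid"], hit["evalue"])
```

Values are kept as strings here; converting `evalue` or `bitscore` to numbers is left to the caller.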
!tr '|' "\t" < {wd}blast_sprot.tab > {wd}blast_sprot_sql.tab
!head -1 {wd}blast_sprot.tab
!echo SQLShare ready version has Pipes converted to Tabs ....
!head -1 {wd}blast_sprot_sql.tab
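The pipe-to-tab conversion matters because Swiss-Prot subject IDs have the form `sp|accession|entry_name`. Splitting on the pipes turns that one field into three, so the accession ends up in its own column. The sketch below mimics the conversion on one made-up line. `pipes_to_tabs` is a hypothetical helper, and the claim that the uploaded table names its columns Column1, Column2, and so on is an assumption based on the query used later in the notebook.

```python
def pipes_to_tabs(line):
    """Replace every pipe with a tab, as the tr command does."""
    return line.replace("|", "\t")

# Made-up hit line (fake accession and scores).
example = "contig_1\tsp|Q00000|FAKE_HUMAN\t45.5\t110\t60\t0\t10\t339\t5\t114\t2e-20\t95.1"
converted = pipes_to_tabs(example).split("\t")

# The 12 original fields become 14 after the subject ID is split in three.
print(len(converted))
# The third field (index 2) now holds the accession alone.
print(converted[2])
```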
!python {sqls}singleupload.py \
-d _blast_sprot \
{wd}blast_sprot_sql.tab
!python {sqls}fetchdata.py \
-s "SELECT Column1, term, GOSlim_bin, aspect, ProteinName FROM [{usr}].[_blast_sprot]md left join [samwhite@washington.edu].[UniprotProtNamesReviewed_yes20130610]sp on md.Column3=sp.SPID left join [sr320@washington.edu].[SPID and GO Numbers]go on md.Column3=go.SPID left join [sr320@washington.edu].[GO_to_GOslim]slim on go.GOID=slim.GO_id where aspect like 'P'" \
-f tsv \
-o {wd}GOdescriptions.txt
!head -2 {wd}GOdescriptions.txt
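The query above chains three left joins from the BLAST hits to protein names, GO numbers, and GO Slim bins, then keeps only rows whose aspect is 'P'. The sketch below reproduces that shape with `sqlite3` and toy tables so the logic can be tried locally. All table names and rows here are invented, and SQLite is a stand-in for SQLShare, not the service the notebook uses.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE blast (Column1 TEXT, Column3 TEXT);
CREATE TABLE names (SPID TEXT, ProteinName TEXT);
CREATE TABLE go_map (SPID TEXT, GOID TEXT);
CREATE TABLE slim (GO_id TEXT, term TEXT, GOSlim_bin TEXT, aspect TEXT);
-- Fake rows: contig_2 has a hit with no annotation at all.
INSERT INTO blast VALUES ('contig_1', 'Q00000'), ('contig_2', 'Q11111');
INSERT INTO names VALUES ('Q00000', 'Fake protein A');
INSERT INTO go_map VALUES ('Q00000', 'GO:X1'), ('Q00000', 'GO:X2');
INSERT INTO slim VALUES
  ('GO:X1', 'fake process term', 'fake process bin', 'P'),
  ('GO:X2', 'fake function term', 'fake function bin', 'F');
""")

rows = con.execute("""
SELECT Column1, term, GOSlim_bin, aspect, ProteinName
FROM blast md
LEFT JOIN names sp ON md.Column3 = sp.SPID
LEFT JOIN go_map g ON md.Column3 = g.SPID
LEFT JOIN slim ON g.GOID = slim.GO_id
WHERE aspect LIKE 'P'
""").fetchall()
print(rows)
```

In this toy setup the filter on `aspect` drops contig_2, because its joined columns are NULL. The same may hold for unannotated contigs in the real query, so the output should be read as annotated contigs only.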
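%pylab inline
%cd {wd}
import pandas as pd
# Load the GO Slim annotations fetched from SQLShare
gs = pd.read_table('GOdescriptions.txt')
# Count rows per GO Slim bin and plot them as a horizontal bar chart
gs.groupby('GOSlim_bin').Column1.count().plot(kind='barh', color='y')
savefig('GOSlim.png', bbox_inches='tight')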
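When reading the chart, note that a group-wise `count()` tallies rows, not distinct contigs. A contig linked to several GO terms in the same bin contributes more than once. The sketch below shows the difference with `collections.Counter` on invented rows. It is an illustration of the counting behavior, not a replacement for the plot.

```python
from collections import Counter

# Hypothetical (contig, GOSlim_bin) rows; contig_1 appears twice in "bin A".
rows = [
    ("contig_1", "bin A"),
    ("contig_1", "bin A"),
    ("contig_2", "bin A"),
    ("contig_2", "bin B"),
]

# Row counts per bin, analogous to a group-wise count of rows.
row_counts = Counter(bin_name for _, bin_name in rows)

# Distinct contigs per bin: de-duplicate (contig, bin) pairs first.
unique_counts = Counter(bin_name for _, bin_name in set(rows))

print(row_counts)
print(unique_counts)
```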