Fasta2Slim¶

This IPython notebook provides a structured way to annotate sequences using the UniProt/Swiss-Prot database. It can easily be modified to suit personal preferences. As developed, the notebook requires the following software to be installed:

IPython
NCBI Blast
SQLShare Python Client

Instructions for use.
In a working directory of your choosing, place your query FASTA file and name it query.fa. Edit the cell below to provide the path to that working directory.

Identify the location of the BLAST database you would like to use and enter its path in the cell below.

Identify the location of your sqlshare-pythonclient/tools directory and enter its path in the cell below.

Change the value of the usr variable to your SQLShare user account.

In [9]:

#Location Variables
wd="/Users/sr320/git-repos/austral/modules/data/wd/"

db="/Users/sr320/data-genomic/blast/db/uniprot_sprot_r2015_01"

sqls="/Applications/bioinfo/sqlshare-pythonclient/tools/"

usr="sr320@washington.edu"
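Before running the later cells, it can help to confirm that the paths above point at something real. The following is a minimal sketch, not part of the original notebook: the helper name `missing_inputs` is hypothetical, and it assumes only that the query file is called query.fa and that a BLAST database is a set of files sharing the prefix given in `db`.

```python
import glob
import os


def missing_inputs(wd, db):
    """Return a list of problems found; an empty list means both checks passed."""
    problems = []
    # The instructions above ask for the query file to be named query.fa.
    query = os.path.join(wd, "query.fa")
    if not os.path.isfile(query):
        problems.append("query file not found: " + query)
    # `db` is a prefix rather than a single file, so look for <db>.<anything>.
    if not glob.glob(glob.escape(db) + ".*"):
        problems.append("no files found with database prefix: " + db)
    return problems
```

Calling `missing_inputs(wd, db)` in a cell and checking for an empty list is one way to catch a mistyped path before the BLAST step.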

In [10]:

!head {wd}query.fa

>PiuraChilensis_v1_contig_1
ATTTACAATACGAAGTAAAATAGATAACGTGAAAATAATCTTGGTGCTGGATGATCGATC
AAGTTCACCAATATTTTATTGTAAAAAATCATTCTAAACAGCATGAAATCGTGTACAATG
TATAAACAAGCAAATATATAACACTAAAGCAAGAGGGCGTAAGTGGGGGGGTGGGTGAGA
GTAAAAAATTCAAACATGTCAAATACCCCGGCGTTAGCCTTAAAAGCACCATGGACTTCT
GCCTTCAATAAGCATAAAATTAAAACACCTAATACACAATGAATATACAGATAAAACAGA
TTTATGAATAGTTGGTGTTACATCTTTTACAGCCATAAGCCTTCATTTTGCTTCCAAACG
TATAAAATCTGACTTGGAACAATATACAGCCATGAGATATGACACAGCGAGCACTACAAT
ATATATTTATCTTGTACTATACAGCCTGTACAAGAAAATTCTGGAATTGTCTTCACAAGA
GACAGAAAAATAGTTGCAATGTGAATGCTAGTCTACTATTTGATCACAATTGGATAGAAA

In [11]:

#number of sequences
!fgrep -c ">" {wd}query.fa
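The same count can be obtained in Python, which may be convenient on systems without fgrep. This is a small sketch (the function name `count_fasta_records` is hypothetical) that counts lines beginning with ">", on the assumption that each FASTA record has exactly one header line.

```python
def count_fasta_records(path):
    """Count header lines (those starting with '>') in a FASTA file."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

In the notebook it could be called as `count_fasta_records(wd + "query.fa")`.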

Blast¶

In [ ]:

!/Applications/bioinfo/ncbi-blast-2.2.30/bin/blastx \
-query {wd}query.fa \
-db {db} \
-max_target_seqs 1 \
-max_hsps 1 \
-outfmt 6 \
-evalue 1E-05 \
-num_threads 2 \
-out {wd}blast_sprot.tab

Number of matched sequences:¶

In [ ]:

!wc -l {wd}blast_sprot.tab 
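Note that wc -l counts lines in the table, that is, reported alignments. With -max_target_seqs 1 and -max_hsps 1 this should normally equal the number of matched queries, but counting distinct query IDs checks that directly. A minimal sketch, assuming the tab-separated layout produced by -outfmt 6 with the query ID in the first column (the function name is hypothetical):

```python
def count_matched_queries(path):
    """Count distinct query IDs in the first column of a BLAST tabular file."""
    queries = set()
    with open(path) as handle:
        for line in handle:
            if line.strip():  # skip blank lines
                queries.add(line.split("\t", 1)[0])
    return len(queries)
```

If this number differs from the wc -l result, at least one query has more than one row in the table.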

Joining in SQL Share¶

In [ ]:

!tr '|' "\t" < {wd}blast_sprot.tab > {wd}blast_sprot_sql.tab
!head -1 {wd}blast_sprot.tab
!echo SQLShare ready version has Pipes converted to Tabs ....
!head -1 {wd}blast_sprot_sql.tab 
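The pipe-to-tab conversion matters because of how it shifts the columns. The default -outfmt 6 table has 12 columns, and the subject ID occupies one of them. Assuming subject IDs of the form sp|accession|entry_name, replacing the pipes splits that one column into three, so the accession ends up in the third column, which is the column the later SQL join refers to as Column3. The sketch below mimics the conversion on a single made-up row; the helper name and the row values are illustrative only.

```python
# The 12 default columns of BLAST tabular output (-outfmt 6).
BLAST_COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
                 "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def pipes_to_tabs(line):
    """Mimic the pipe-to-tab conversion on one row and return its fields."""
    return line.rstrip("\n").replace("|", "\t").split("\t")


# Made-up row for illustration only; not taken from a real BLAST run.
example = ("contig_1\tsp|ACC123|NAME_SPEC\t50.0\t100\t50\t0"
           "\t1\t300\t1\t100\t1e-20\t100.0\n")
fields = pipes_to_tabs(example)
print(len(fields))  # 14 fields: the single sseqid column became three
print(fields[2])    # the accession, i.e. the third column after conversion
```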

In [ ]:

!python {sqls}singleupload.py \
-d _blast_sprot \
{wd}blast_sprot_sql.tab 

In [ ]:

!python {sqls}fetchdata.py \
-s "SELECT Column1, term, GOSlim_bin, aspect, ProteinName FROM [{usr}].[_blast_sprot] md left join [samwhite@washington.edu].[UniprotProtNamesReviewed_yes20130610] sp on md.Column3=sp.SPID left join [sr320@washington.edu].[SPID and GO Numbers] go on md.Column3=go.SPID left join [sr320@washington.edu].[GO_to_GOslim] slim on go.GOID=slim.GO_id where aspect like 'P'" \
-f tsv \
-o {wd}GOdescriptions.txt
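The structure of the query above can be tried locally on toy data. The following sketch uses Python's built-in sqlite3 module rather than SQLShare; the table names (`blast`, `names`, `spid_go`, `go_slim`), the alias `go_map`, and every row are placeholders invented for illustration. It also assumes that term, GOSlim_bin and aspect come from the GO-to-GOSlim table, which the original query does not state explicitly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Hypothetical stand-ins for the four SQLShare datasets used in the query.
cur.executescript("""
CREATE TABLE blast   (Column1 TEXT, Column3 TEXT);
CREATE TABLE names   (SPID TEXT, ProteinName TEXT);
CREATE TABLE spid_go (SPID TEXT, GOID TEXT);
CREATE TABLE go_slim (GO_id TEXT, term TEXT, GOSlim_bin TEXT, aspect TEXT);
""")
# Placeholder rows only; none of these values are real identifiers.
cur.executemany("INSERT INTO blast VALUES (?, ?)",
                [("contig_1", "ACC1"), ("contig_2", "ACC2"), ("contig_3", "ACC3")])
cur.executemany("INSERT INTO names VALUES (?, ?)",
                [("ACC1", "placeholder protein A"), ("ACC2", "placeholder protein B")])
cur.executemany("INSERT INTO spid_go VALUES (?, ?)",
                [("ACC1", "GO:x1"), ("ACC1", "GO:x2"), ("ACC2", "GO:x3")])
cur.executemany("INSERT INTO go_slim VALUES (?, ?, ?, ?)",
                [("GO:x1", "term one", "bin A", "P"),
                 ("GO:x2", "term two", "bin B", "F"),
                 ("GO:x3", "term three", "bin A", "P")])

rows = cur.execute("""
SELECT md.Column1, slim.term, slim.GOSlim_bin, slim.aspect, sp.ProteinName
FROM blast md
LEFT JOIN names   sp     ON md.Column3 = sp.SPID
LEFT JOIN spid_go go_map ON md.Column3 = go_map.SPID
LEFT JOIN go_slim slim   ON go_map.GOID = slim.GO_id
WHERE slim.aspect LIKE 'P'
ORDER BY md.Column1
""").fetchall()

for row in rows:
    print(row)
```

In this toy data contig_3 has no GO mapping, so its aspect is NULL and the WHERE clause drops it. The left joins therefore keep only queries with at least one matching term of aspect P, and a query with several such terms appears once per term.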

In [ ]:

!head -2 {wd}GOdescriptions.txt

Plot GoSlim terms¶

In [ ]:

%pylab inline

In [ ]:

%cd {wd}

In [ ]:

import pandas as pd

# Load the tab-separated table written by fetchdata.py
gs = pd.read_table('GOdescriptions.txt')

In [ ]:

gs.groupby('GOSlim_bin').Column1.count().plot(kind='barh', color=list('y'))
savefig('GOSlim.png', bbox_inches='tight')
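The numbers behind the bar chart can be cross-checked without pandas. This is a minimal sketch, assuming GOdescriptions.txt is tab-separated with a header row containing a GOSlim_bin column (which is what the cells above rely on); the function name `count_by_bin` is hypothetical. It counts rows per bin, which should match the plotted values as long as Column1 is never empty.

```python
import csv
from collections import Counter


def count_by_bin(path, column="GOSlim_bin"):
    """Tally rows per value of `column` in a tab-separated file with a header."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        # Rows with an empty or missing bin are skipped.
        return Counter(row[column] for row in reader if row.get(column))
```

For example, `count_by_bin('GOdescriptions.txt').most_common()` lists the bins from most to least frequent.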

In [ ]:
