The concept is that you can take a FASTA file in a working directory and end up with GO Slim information, all within a single automated notebook. Currently this works by writing (and overwriting) a scratch file to SQLShare. It assumes that you are working in a directory containing a FASTA file named query.fa and that the BLAST programs are in PATH.
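The two assumptions above can be checked before anything is run. This is a minimal sketch, not part of the original workflow: the function name `check_prerequisites` is hypothetical, and it assumes Python 3.3 or later for `shutil.which`.

```python
import os
import shutil


def check_prerequisites(workdir, fasta_name="query.fa", blast_program="blastx"):
    """Return a list of problems found; an empty list means both checks passed."""
    problems = []

    # The notebook expects the query sequences in <workdir>/query.fa
    fasta_path = os.path.join(workdir, fasta_name)
    if not os.path.isfile(fasta_path):
        problems.append("missing FASTA file: " + fasta_path)

    # The notebook calls the BLAST program by bare name, so it must be in PATH
    if shutil.which(blast_program) is None:
        problems.append(blast_program + " not found in PATH")

    return problems
```

Calling `check_prerequisites(wd, blast_program=ba)` once `wd` and `ba` are set would list anything missing before the BLAST step is attempted.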
#allows plots to be shown inline
%pylab inline
Populating the interactive namespace from numpy and matplotlib
#Setting Working Directory
wd="/Volumes/web/whale/fish546/qpx_go_val"
#Setting directory of Blast Databases
dbd="/Volumes/Bay3/Software/ncbi-blast-2.2.29\+/db/"
#Database name
dbn="uniprot_sprot_r2013_12"
#Blast algorithm
ba="blastx"
#Location of SQLShare python tools: can be left empty ("") if the tools are in PATH
spd="/Users/sr320/sqlshare-pythonclient/tools/"
cd {wd}
/Volumes/web/whale/fish546/qpx_go_val
!{ba} -query query.fa -db {dbd}{dbn} -out {dbn}_{ba}_out.tab -evalue 1E-50 -num_threads 4 -max_hsps_per_subject 1 -max_target_seqs 1 -outfmt 6
BLAST Database error: No alias or index file found for protein database [/Volumes/Bay3/Software/ncbi-blast-2.2.29+/db/uniprot_sprot_r2013_12] in search path [/Volumes/web/whale/fish546/pipeline_test_dir4::]
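An error like the one above means BLAST could not find the database files at the given prefix. As a rough sketch, the check below looks for the files that a protein database built with `makeblastdb` normally consists of (`.phr`, `.pin`, `.psq`), or a `.pal` alias file for a multi-volume database. The function name `protein_db_present` is hypothetical, and the extension list is an assumption about the database layout rather than something taken from this notebook.

```python
import os


def protein_db_present(db_dir, db_name):
    """Return True if a BLAST protein database with this prefix appears to exist."""
    prefix = os.path.join(db_dir, db_name)

    # A multi-volume database is referenced through a .pal alias file
    if os.path.isfile(prefix + ".pal"):
        return True

    # A single-volume protein database has header, index and sequence files
    return all(os.path.isfile(prefix + ext) for ext in (".phr", ".pin", ".psq"))
```

Note that `dbd` as defined above contains a backslash before the `+`. The shell removes it, but Python does not, so a call from the notebook would look like `protein_db_present(dbd.replace("\\", ""), dbn)`.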
!head -1 {dbn}_{ba}_out.tab
QPX_transcriptome_v1_Contig_2 sp|P52712|CBPX_ORYSJ 43.75 416 213 12 2095 869 6 407 3e-98 326
#Translate pipes to tabs so the SPID is in a separate column for joining
!tr '|' "\t" < {dbn}_{ba}_out.tab > {dbn}_{ba}_out2.tab
!head -1 {dbn}_{ba}_out2.tab
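To illustrate what the `tr` step does to a single row, here is a small Python sketch applied to the first hit shown earlier. The helper name `pipes_to_tabs` is hypothetical, and the example line assumes the usual tab-separated layout of `-outfmt 6`.

```python
def pipes_to_tabs(line):
    """Replace every pipe with a tab, as the tr command does."""
    return line.replace("|", "\t")


# First hit from the BLAST output, written with explicit tabs between the 12 columns
example = (
    "QPX_transcriptome_v1_Contig_2\tsp|P52712|CBPX_ORYSJ\t43.75\t416\t213\t12"
    "\t2095\t869\t6\t407\t3e-98\t326"
)

fields = pipes_to_tabs(example).split("\t")

# The subject ID now spans three columns: database tag, accession, entry name
query_id, db_tag, accession = fields[0], fields[1], fields[2]
```

After the translation the accession (`P52712` here) sits in the third column, which is why the SQL join below matches on `Column3`.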
#Uploads formatted blast table to SQLShare under a generic, temporary name. Warning: this will overwrite.
!python {spd}singleupload.py -d scratchblast_out {dbn}_{ba}_out2.tab
processing chunk line 0 to 1512 (0.378393888474 s elapsed) pushing uniprot_sprot_r2013_12_blastx_out2.tab... parsing 8A0C3E42... finished scratchblast_out
!python {spd}fetchdata.py -s "SELECT * FROM [sr320@washington.edu].[scratchblast_out]blast Left Join [sr320@washington.edu].[uniprot-reviewed_wGO_010714]unp ON blast.Column3 = unp.Entry Left Join [sr320@washington.edu].[SPID and GO Numbers]go ON unp.Entry = go.SPID Left Join [sr320@washington.edu].[GO_to_GOslim]slim ON slim.GO_id = go.GOID" -f tsv -o {dbn}_join2goslim.txt
!head -2 {dbn}_join2goslim.txt
!python {spd}singleupload.py -d scratchjoin_slim {dbn}_join2goslim.txt
processing chunk line 0 to 18037 (0.0718240737915 s elapsed) pushing uniprot_sprot_r2013_12_join2goslim.txt... parsing 9A18D989... finished scratchjoin_slim
#Sets GO aspect (P = biological process)
!python {spd}fetchdata.py -s "SELECT Distinct Column1 as query, Column3 as SPID, GOSlim_bin FROM [sr320@washington.edu].[scratchjoin_slim] Where aspect = 'P'" -f tsv -o justslim.txt
!head justslim.txt
from pandas import *
jslim = read_table("justslim.txt", # name of the data file
#sep=",", # what character separates each column?
na_values=["", " "]) # what values should be considered "blank" values?
jslim.groupby('GOSlim_bin').query.count().plot(kind='bar')
<matplotlib.axes.AxesSubplot at 0x1067a5550>
!python {spd}singleupload.py -d scratchjoin_slim {dbn}_join2goslim.txt
processing chunk line 0 to 17730 (0.107241153717 s elapsed) pushing uniprot_sprot_r2013_12_join2goslim.txt... parsing 9474190F... finished scratchjoin_slim
#Sets GO aspect (P = biological process)
!python {spd}fetchdata.py -s "SELECT Distinct Column1 as query, Column3 as SPID, GOSlim_bin FROM [sr320@washington.edu].[scratchjoin_slim] Where aspect = 'P'" -f tsv -o justslim.txt
!head justslim.txt
#from pandas import *
jslim = read_table("justslim.txt", # name of the data file
#sep=",", # what character separates each column?
na_values=["", " "]) # what values should be considered "blank" values?
jslim.groupby('GOSlim_bin').query.count().plot(kind='bar')
<matplotlib.axes.AxesSubplot at 0x1068c1790>
!say "hash tag winning"
#could also upload again to get a simple table
#could be done in pandas
#!python {spd}singleupload.py -d scratchpie justslim.txt
processing chunk line 0 to 2538 (0.00250601768494 s elapsed) pushing justslim.txt... parsing 87B0B7A8... finished scratchpie
#fetching data grouped by GObin
#!python {spd}fetchdata.py -s "SELECT GOSlim_bin, COUNT(GOSlim_bin) as termcount from [sr320@washington.edu].[scratchpie] Group by GOSlim_bin" -f tsv -o justpie.txt
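As the comment above notes, the per-bin count could be done in pandas instead of a second round trip to SQLShare. A minimal sketch follows, assuming a table with the same three columns as justslim.txt; the function name `count_terms` and the rows in `demo` are made up for illustration and are not real results.

```python
import pandas as pd


def count_terms(jslim):
    """Count rows per GOSlim_bin, returning columns GOSlim_bin and termcount."""
    # Rows with a missing GOSlim_bin are dropped by groupby, so no empty bin is reported
    return jslim.groupby("GOSlim_bin").size().reset_index(name="termcount")


# Hypothetical rows standing in for justslim.txt
demo = pd.DataFrame(
    {
        "query": ["contig_a", "contig_b", "contig_c"],
        "SPID": ["P00001", "P00002", "P00003"],
        "GOSlim_bin": ["transport", "other biological processes", "transport"],
    }
)

termcounts = count_terms(demo)
```

With the real data this would be `count_terms(jslim)`, and the result could be written out with `to_csv("justpie.txt", sep="\t", index=False)`.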