Notebook

Investigating the code and data used in the 2006 paper by Qi et al., to see if I can figure out what's going on.

In [1]:

ls

0yeast_gene_list/       12homology-PPI/        2tf-binding/          5essentiality/         8nature-compare-sequence/        Batch_feature_summary_ExtractWrapper.pl  train-set/
10mips-phenotype/       13domain-interaction/  3gene-ontology/       6HighExp-PPI/          9mips-pclass/                    Investigating qi_evaluation_2006.ipynb
11sequence-similarity/  1gene-expression/      4protein-expression/  7genetic-interaction/  Batch_feature_ExtractWrapper.pl  README

Should start by looking at the README, probably.

In [2]:

cat README

What we want is the second use case in the README: generating features for a file containing a list of yeast protein pairs. So we should have a look at these scripts:

In [8]:

%%bash
head -n 40 ./Batch_feature_ExtractWrapper.pl

######################################################################3
#
# copyright @ Yanjun Qi , qyj@cs.cmu.edu
# Please cite: 
# Y. Qi, Z. Bar-Joseph, J. Klein-Seetharaman, "Evaluation of different biological data and computational classification methods for use in protein interaction prediction", PROTEINS: Structure, Function, and Bioinformatics. 63(3):490-500. 2006
# Y. Qi, J. Klein-Seetharaman, Z. Bar-Joseph, "A mixture of feature experts approach for protein-protein interaction prediction", BMC Bioinformatics 8 (S10):S6, 2007 
# Y. Qi, J. Klein-Seetharaman, Z. Bar-Joseph, "Random Forest Similarity for Protein-Protein Interaction Prediction from Multiple source", Pacific Symposium on Biocomputing 10: (PSB 2005) Jan. 2005.
# 
######################################################################3


# This program is a yeast PPI feature extraction wrapper 
# perl command inputPairlist


use strict; 
die "Usage: command inputPairFile \n" if scalar(@ARGV) < 1;
my ($inputPair ) = @ARGV;


print "\n--------------------------- 1gene-expression ----------------------------------------\n"; 

# -------------------   1gene-expression   ------------------------------

my $cmdPre = "perl ./1gene-expression/get_gene_expression.pl  "; 
my $cmdPro = "./1gene-expression/YeastGeneListOrfGeneName-106_pval_v9.0.txt ./1gene-expression/all_expression_fixed_s4_csv.txt  ./1gene-expression/expressionYanjunSplit.txt 0.6 "; 

my $cmd = $cmdPre." ".$inputPair." ".$cmdPro." ".$inputPair.".genexp" ; 
print "$cmd\n"; 
system($cmd); 



print "\n-------------------------------------------------------------------\n"; 

# -------------------   2tf-group-binding  ------------------------------

# perl -d get_tfGroupBinding.pl ./lists/sciencesubset.txt pvalbygene_nature04.txt 204_pvalbygene_nature04_TFs.groupIndex 0.05 ./lists/sciencesubset.tfgroup

my $cmdPre = "perl ./2tf-binding/get_tfGroupBinding.pl  ";

So it looks like a Perl wrapper that calls a separate Perl script for each feature type, building feature vectors for each protein interaction. OK, but if that's all it does, where is the code for the various types of classification done in the paper?
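To make the wrapper pattern concrete, here is a minimal Python sketch of how the 1gene-expression step assembles its command. The script path, data files and the 0.6 threshold are copied from the Perl output above; the helper name `gene_expression_command` and the input file name `pairs.txt` are placeholders of mine, not part of Qi's code.

```python
import shlex

# Paths and threshold copied from the 1gene-expression step of the wrapper.
GENE_EXPRESSION_SCRIPT = "./1gene-expression/get_gene_expression.pl"
GENE_EXPRESSION_ARGS = [
    "./1gene-expression/YeastGeneListOrfGeneName-106_pval_v9.0.txt",
    "./1gene-expression/all_expression_fixed_s4_csv.txt",
    "./1gene-expression/expressionYanjunSplit.txt",
    "0.6",
]


def gene_expression_command(input_pairs):
    """Build the argument list for the gene-expression step (hypothetical helper)."""
    # The wrapper names the output after the input file plus a ".genexp" suffix.
    output_file = input_pairs + ".genexp"
    return ["perl", GENE_EXPRESSION_SCRIPT, input_pairs,
            *GENE_EXPRESSION_ARGS, output_file]


cmd = gene_expression_command("pairs.txt")  # "pairs.txt" is a placeholder name
print(shlex.join(cmd))
# To actually run the step: subprocess.run(cmd, check=True)
```

Presumably the other feature directories follow the same shape (script, input pair list, data files, output file), but only the first step is fully visible in the 40 lines printed here.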

I had a look for the classification code in this directory and couldn't find any.
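One way to make that search repeatable is to count the files under the current directory by extension, so anything that isn't a Perl script or a data file stands out. This is only a sketch of the kind of check I mean: `count_extensions` is a hypothetical helper, and nothing here is specific to Qi's code.

```python
from collections import Counter
from pathlib import Path


def count_extensions(root="."):
    """Count files under root by extension (hypothetical helper)."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Files without an extension (e.g. README) are grouped as "(none)".
            counts[path.suffix.lower() or "(none)"] += 1
    return counts


# Print the most common extensions first.
for ext, n in count_extensions(".").most_common():
    print(f"{n:6d}  {ext}")
```

Classification code could of course be written in Perl too, so this only narrows down where to look.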