To reiterate, it looks like the Qi code provided with the 2006 paper doesn't include anything for the classification step. I guess they didn't see the point, since the classification can be performed with plenty of different toolsets and they probably used one they couldn't just share. In any case, I want to replicate their results, so I need to get the feature vectors they had, run them through some Python classification algorithms, and see if I can get similar results to the paper.
The code is basically just a couple of Perl scripts, and I should be able to run these without any hassle to get the feature vectors out. Unfortunately, if anything goes wrong it's going to take a while to work out, because I don't know Perl.
ls
0yeast_gene_list/
10mips-phenotype/
11sequence-similarity/
12homology-PPI/
13domain-interaction/
1gene-expression/
2tf-binding/
3gene-ontology/
4protein-expression/
5essentiality/
6HighExp-PPI/
7genetic-interaction/
8nature-compare-sequence/
9mips-pclass/
Batch_feature_ExtractWrapper.pl
Batch_feature_summary_ExtractWrapper.pl
Investigating qi_evaluation_2006.ipynb
Investigating qi_evaluation_2006.md
README
Replicating Qi 2006.ipynb
train-set/
%%bash
head -n 30 Batch_feature_ExtractWrapper.pl
######################################################################3
#
# copyright @ Yanjun Qi , qyj@cs.cmu.edu
# Please cite:
# Y. Qi, Z. Bar-Joseph, J. Klein-Seetharaman, "Evaluation of different biological data and computational classification methods for use in protein interaction prediction", PROTEINS: Structure, Function, and Bioinformatics. 63(3):490-500. 2006
# Y. Qi, J. Klein-Seetharaman, Z. Bar-Joseph, "A mixture of feature experts approach for protein-protein interaction prediction", BMC Bioinformatics 8 (S10):S6, 2007
# Y. Qi, J. Klein-Seetharaman, Z. Bar-Joseph, "Random Forest Similarity for Protein-Protein Interaction Prediction from Multiple source", Pacific Symposium on Biocomputing 10: (PSB 2005) Jan. 2005.
#
######################################################################3
# This program is a yeast PPI feature extraction wrapper
# perl command inputPairlist

use strict;
die "Usage: command inputPairFile \n" if scalar(@ARGV) < 1;
my ($inputPair ) = @ARGV;

print "\n--------------------------- 1gene-expression ----------------------------------------\n";
# ------------------- 1gene-expression ------------------------------
my $cmdPre = "perl ./1gene-expression/get_gene_expression.pl ";
my $cmdPro = "./1gene-expression/YeastGeneListOrfGeneName-106_pval_v9.0.txt ./1gene-expression/all_expression_fixed_s4_csv.txt ./1gene-expression/expressionYanjunSplit.txt 0.6 ";
my $cmd = $cmdPre." ".$inputPair." ".$cmdPro." ".$inputPair.".genexp" ;
print "$cmd\n";
system($cmd);
OK, so the documentation isn't very extensive. The important part is the usage line `perl command inputPairlist`, where I guess "command" is the name of the wrapper itself. Anyway, which file is the input pair list? I think it's in the 0yeast_gene_list/ directory.
cd 0yeast_gene_list/
/home/gavin/Documents/MRes/YeastPPI-shared-08/0yeast_gene_list
ls
make_fullpair_4protein.pl YeastGeneListOrfGeneName-106_pval_v9.0.txt
%%bash
head YeastGeneListOrfGeneName-106_pval_v9.0.txt
YAL001C TFC3
YAL002W VPS8
YAL003W EFB1
YAL004W YAL004W
YAL005C SSA1
YAL007C ERP2
YAL008W FUN14
YAL009W SPO7
YAL010C MDM10
YAL011W FUN36
Guess this maps one reference name for a gene (the systematic ORF name) to another (the standard gene name)?
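To check that guess, here's a minimal sketch of loading the mapping into a dict, assuming the file is whitespace-separated with the ORF name in the first column (the function name is my own, not from the Qi code):

```python
def parse_gene_map(lines):
    """Parse whitespace-separated ORF / gene-name pairs into a dict,
    skipping any malformed or empty lines."""
    mapping = {}
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            mapping[fields[0]] = fields[1]
    return mapping

# e.g. parse_gene_map(open("0yeast_gene_list/YeastGeneListOrfGeneName-106_pval_v9.0.txt"))
```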
%%bash
head -n 30 make_fullpair_4protein.pl
######################################################################3
#
# copyright @ Yanjun Qi , qyj@cs.cmu.edu
# Please cite:
# Y. Qi, Z. Bar-Joseph, J. Klein-Seetharaman, "Evaluation of different biological data and computational classification methods for use in protein interaction prediction", PROTEINS: Structure, Function, and Bioinformatics. 63(3):490-500. 2006
# Y. Qi, J. Klein-Seetharaman, Z. Bar-Joseph, "A mixture of feature experts approach for protein-protein interaction prediction", BMC Bioinformatics 8 (S10):S6, 2007
# Y. Qi, J. Klein-Seetharaman, Z. Bar-Joseph, "Random Forest Similarity for Protein-Protein Interaction Prediction from Multiple source", Pacific Symposium on Biocomputing 10: (PSB 2005) Jan. 2005.
#
######################################################################3
#
# This program is to make the pair list from the protein list
#
# perl make_fullpair_4protein.pl protein_list.txt pos_list output_pos output_other
#
#
# $ perl make_fullpair_4protein.pl YeastGeneListOrfGeneName-106_pval_v9.0.txt Science03-pos_MIPS_complexes.txt mipsPosPair.txt mipsRandpair.txt
#==> There are 6270 unique proteins in original list.
#==> There should have 19653315 pairs possibly generated totally !
# There are 8617 POS pairs originally .
# fullpairs has: 7390 POS pairs.
# fullpairs has: 19645925 RAND pairs.
# ==> There are 19653315 pairs generated !
#

use strict;
die "Usage: command gene_name_file pos_pairFile outPosPairFile outRandPairFile \n" if scalar(@ARGV) < 4;
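The pair count in those comments checks out: 6270 unique proteins give n(n-1)/2 unordered pairs. A quick sanity check with itertools (my own check, not part of the original scripts):

```python
from itertools import combinations

# every unordered pair of distinct proteins
n = 6270
print(n * (n - 1) // 2)  # -> 19653315

# generating the pairs themselves looks like this on a toy list
proteins = ["YAL001C", "YAL002W", "YAL003W"]
print(list(combinations(proteins, 2)))
```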
I'm slightly confused by this. I'm guessing it produces the input pair list for the main wrapper, but I don't see where to get pos_pairFile, outPosPairFile or outRandPairFile. Trying to find Science03-pos_MIPS_complexes.txt, mipsPosPair.txt and mipsRandpair.txt:
cd ..
%%bash
find . -name "Science*"
find . -name "mips*"
No sign of them.
I guess I'm supposed to get them from somewhere else? That doesn't make much sense though: why tarball up all the code but not make it runnable?
Ok, maybe I'm just being stupid and two of those arguments are output files. In that case we'd just need a positive list of protein pairs. Is that in here somewhere? I don't see it.
Looks like I can get the feature files online from this page on the Qi site. Downloaded them and put them in a directory called featuresets. Can have a look at them now:
ls -lh featuresets/phyInteract/
total 188M
-rw-r--r-- 1 gavin users  51K May 28 16:19 dipsPosPair
-rw-r--r-- 1 gavin users 2.1M May 28 16:19 dipsPosPair.feature
-rw-r--r-- 1 gavin users 4.2M May 28 16:19 dipsRandpairSub23w
-rw-r--r-- 1 gavin users 181M May 28 16:19 dipsRandpairSub23w.feature
-rw-r--r-- 1 gavin users  624 May 28 16:19 readme.txt
%%bash
head -n 1 featuresets/phyInteract/dipsPosPair.feature
0.793388763224369,0.00420804252006518,0.425688981443356,0.050368298677632,-100,-0.431032656998849,-100,0.219336773277164,-0.260415665650041,0.197204579148454,0.561672012832894,-0.0228811714495604,0.226695830426883,-0.498251467771468,0.191892679025485,-0.0442896786511169,-0.0410836070811429,0.31939401640312,0.201702280568662,-0.206576006538866,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5.42227951866507,1,-100,0,-100,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,1
%%bash
head -n 1 featuresets/phyInteract/dipsRandpairSub23w.feature
-100,-0.451693610174967,-0.0287451930023522,0.199960245640947,-100,-0.132785992866974,-100,0.0113596052584383,-0.0184840432669403,-0.332990558018139,0.231135300459446,0.568965723612462,-0.00592435004576543,0.212805215910624,-0.934391682945072,-0.0356978663758904,0.635341056917029,0.507997427846247,0.449270105867988,0.376562615235622,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3.47217114669236,0,-100,0,-100,-100,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,0
Opening up these files:
import csv,glob
nipfiles = glob.glob("featuresets/phyInteract/dips*.feature")
print nipfiles
#may want a generator function for each file later, to save RAM
['featuresets/phyInteract/dipsPosPair.feature', 'featuresets/phyInteract/dipsRandpairSub23w.feature']
One option would be to write a generator function to open these up and return rows as they are required to save RAM. If I get round to writing this I'll put it here:
#writing generator class
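For the record, a minimal sketch of such a generator (assuming the last column is the class label, as it appears to be from the `head` output above):

```python
import csv

def feature_rows(fname):
    """Lazily yield (features, label) tuples from a .feature CSV,
    so the 181M file never has to be fully loaded into RAM."""
    with open(fname) as f:
        for row in csv.reader(f):
            values = [float(v) for v in row]
            yield values[:-1], int(values[-1])
```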
Another option would be to use pandas, as it's pretty much designed to solve this problem.
import pandas as pd
#test importing one of these csvs
pd.read_csv(nipfiles[0],header=None).head()
|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | ... |
|---|---|---|---|---|---|---|---|---|---|---|----|----|----|----|----|----|----|----|----|----|-----|
| 0 | 0.793389 | 0.004208 | 0.425689 | 0.050368 | -100.000000 | -0.431033 | -100.000000 | 0.219337 | -0.260416 | 0.197205 | 0.561672 | -0.022881 | 0.226696 | -0.498251 | 0.191893 | -0.044290 | -0.041084 | 0.319394 | 0.201702 | -0.206576 | ... |
| 1 | -0.060023 | -0.210066 | -0.091081 | 0.057263 | 0.615934 | 0.000000 | -0.270493 | 0.459723 | 0.283593 | -0.820271 | 0.768494 | 0.476355 | 0.100546 | -0.458742 | 0.179573 | -0.001993 | 0.670944 | 0.523979 | 0.031387 | 0.144759 | ... |
| 2 | -0.139235 | 0.233024 | -0.246856 | -0.197301 | -0.735368 | -0.848902 | -0.242361 | -0.274482 | -0.260667 | 0.968193 | 0.512238 | 0.116022 | 0.087863 | -0.000231 | -0.999968 | 0.018011 | -0.331740 | 0.006124 | -0.144496 | -0.287701 | ... |
| 3 | -100.000000 | 0.516909 | 0.489173 | 0.382931 | -100.000000 | 0.365115 | 0.055116 | 0.382403 | -0.396365 | 0.106891 | 0.658290 | 0.533645 | 0.096992 | 0.496594 | 0.976505 | -0.014651 | -0.257374 | 0.015310 | 0.194071 | -0.033708 | ... |
| 4 | 0.654579 | 0.657707 | 0.114509 | 0.293567 | 0.096530 | 0.000000 | -0.557642 | -0.108332 | 0.079067 | 0.801406 | 0.234897 | 0.414738 | 0.370008 | -0.273466 | -0.949009 | 0.043061 | 0.343765 | 0.329410 | 0.320448 | -0.165973 | ... |
5 rows × 163 columns
#initialise dictionary
nipd = {}
#load into dictionary
for fname in nipfiles:
nipd[fname] = pd.read_csv(fname,header=None)
Now I just need to extract the arrays required to train the logistic regression model. Looks like I should use the iloc indexing method.
#concatenate the positive and negative training sets together
X = pd.concat((nipd[nipfiles[0]].iloc[:10,0:162],nipd[nipfiles[1]].iloc[:10,0:162]))
#get the class label vector y the same way
y = pd.concat((nipd[nipfiles[0]].iloc[:10,162],nipd[nipfiles[1]].iloc[:10,162]))
#checking that our arrays will be the right size
print "length of X array is %i"%len(X.values)
print "length of y array is %i"%len(y.values)
length of X array is 20 length of y array is 20
There are several different classification algorithms to test to replicate Qi's results. Starting with good old Naive Bayes, as I understand pretty well how it works.
Naive Bayes is a simple classifier that assumes the features are independent given the class label, which is apparently a pretty good assumption much of the time. It's usually presented with binary features, but the features used in this example are clearly not all binary. So how do we get scikit-learn to deal with that?
Specifically, the features are encoded as described here.
Looks like the ways round it are both pretty much just hacks, so it's probably better to use something else. If I have to write my own then this page might come in useful, though I don't particularly want to.
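For reference, scikit-learn does ship GaussianNB, which copes with continuous features by fitting a per-class normal distribution to each feature. That's arguably less of a hack, though the normality assumption is mine, not something the paper mentions. A toy sketch:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# two well-separated toy classes with continuous features
X_demo = np.array([[0.1, -0.2], [0.3, 0.1], [2.0, 1.5], [1.8, 1.2]])
y_demo = np.array([0, 0, 1, 1])

nb = GaussianNB().fit(X_demo, y_demo)
print(nb.predict([[0.2, 0.0], [1.9, 1.4]]))  # -> [0 1]
```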
This one's pretty simple as well, so shouldn't be too difficult to run. On the logistic regression page it appears that this can be used with defaults. As far as I understand logistic regression it doesn't make any assumptions about the distributions of the features.
from sklearn import linear_model as sklm
#initialise a logistic regression model
logmodel = sklm.LogisticRegression()
Now the question is what the fit method expects of the data. What format should it be in? From the documentation:
X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Training vector, where n_samples is the number of samples and n_features is the number of features.
y : array-like, shape = [n_samples]
Target vector relative to X
Now trying to fit the model. It appears the following cell would take a very long time to run with the full number of samples (~240,000). Unclear what I can do about that yet.
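One thing I could try later if it becomes unbearable (my own idea, not something from the paper) is scikit-learn's SGDClassifier, which fits a linear model by stochastic gradient descent and scales well to hundreds of thousands of rows. A sketch on toy data:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# toy data standing in for the real feature vectors
rng = np.random.RandomState(0)
Xs = rng.randn(1000, 5)
ys = (Xs[:, 0] + Xs[:, 1] > 0).astype(int)

# default hinge loss; decision_function gives scores that roc_curve
# accepts in place of predict_proba
sgd = SGDClassifier(random_state=0).fit(Xs, ys)
print(sgd.score(Xs, ys))
```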
logmodel.fit(X.values,y.values)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, penalty='l2', random_state=None, tol=0.0001)
Now all we need is a test set.
#concatenate the positive and negative training sets together
testX = pd.concat((nipd[nipfiles[0]].iloc[10:20,0:162],nipd[nipfiles[1]].iloc[10:20,0:162]))
#get the class label vector y the same way
testy = pd.concat((nipd[nipfiles[0]].iloc[10:20,162],nipd[nipfiles[1]].iloc[10:20,162]))
estimates = logmodel.predict_proba(testX.values)
Then compare these to the real results. Is there an automated way to do this? Of course there is: Scikit-learn's ROC curve.
from sklearn.metrics import roc_curve,auc
fpr, tpr, thresholds = roc_curve(testy.values,estimates[:,1])
roc_auc = auc(fpr,tpr)
print "Area under ROC curve: %.2f"%(roc_auc)
Area under ROC curve: 0.52
#define a function for quickly plotting ROC curves with nice annotations
import matplotlib.pyplot as plt

def plotroc(fpr,tpr):
    plt.clf()
    plt.plot(fpr, tpr)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.0])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic')
    plt.show()
    return None
plotroc(fpr,tpr)
OK, so that's terrible, probably because we're only using 20 training and 20 test examples. Can now increase that to 1000 of each class:
from sklearn.utils import shuffle as skshuffle
#how many samples?
N = 1000
#concatenate the positive and negative training sets together
X = pd.concat((nipd[nipfiles[0]].iloc[:N,0:162],nipd[nipfiles[1]].iloc[:N,0:162]))
#get the class label vector y the same way
y = pd.concat((nipd[nipfiles[0]].iloc[:N,162],nipd[nipfiles[1]].iloc[:N,162]))
#then shuffle all of them
X,y = skshuffle(X.values,y.values)
#find the midpoint
half = len(X)//2
#split it into test and train
Xtrain, Xtest = X[:half], X[half:]
ytrain, ytest = y[:half], y[half:]
Then we want to fit the model again and test it again:
#reinitialise a logistic regression model
logmodel = sklm.LogisticRegression()
#fit it again
logmodel.fit(Xtrain,ytrain)
#test it again
estimates = logmodel.predict_proba(Xtest)
#recalc the roc curve
fpr, tpr, thresholds = roc_curve(ytest,estimates[:,1])
#print area under roc curve
roc_auc = auc(fpr,tpr)
print "Area under ROC curve: %.2f"%(roc_auc)
#replot the roc curve
plotroc(fpr,tpr)
Area under ROC curve: 0.90
However, this is strange, because we're not even using the full data and we're already getting better results than the paper: their headline number is about 0.15, and we're getting 0.90.
Although, the number they're quoting in the paper is something called an R50 value, which is:
...R50 is a partial AUC score that measures the area under the ROC curve until reaching 50 negative predictions.
So we need to calculate this R50 value if we're going to replicate the results.
In this case we have 500 test points, so the R50 value corresponds to (I guess?) the area under the curve up until the false positive rate is 50/500 = 0.1.
roccrv = pd.DataFrame(array([fpr,tpr]).T)
#don't ask about this line
plotroc(*zip(*roccrv[roccrv[0]<0.1].values))
r50 = auc(*zip(*roccrv[roccrv[0]<0.1].values))
print "The R50 value for this ROC curve is %.2f"%r50
The R50 value for this ROC curve is 0.06
Which could be right: it's certainly below what they get in the paper, and it's obtained using much less training data than used in the paper. Unfortunately, I still can't find another source for what an R50 actually is. I doubt what I'm doing is 100% correct, because then the R50 value would depend on the size of your test set (if my test set were smaller I could tolerate a higher false positive rate when finding the R50 value, which would obviously improve my results).
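To pin down what I'm uncertain about, here's a sketch of the most literal reading of the quote I can come up with: walk down the predictions ranked by score and accumulate ROC area until 50 negatives have been passed, normalising so a perfect ranking scores 1. This is only my guess at the definition, not something confirmed against Qi's code:

```python
import numpy as np

def r50_score(y_true, scores, n_neg=50):
    """Partial AUC accumulated over the top-ranked predictions,
    stopping once n_neg negatives have been seen; normalised so a
    perfect ranking gives 1.0."""
    order = np.argsort(-np.asarray(scores))
    y = np.asarray(y_true)[order]
    n_pos = y.sum()
    tp = fp = 0
    area = 0.0
    for label in y:
        if label == 1:
            tp += 1
        else:
            fp += 1
            # each false positive is one unit step along the FP axis,
            # taken at the current true-positive height
            area += tp
            if fp >= n_neg:
                break
    return area / float(n_neg * n_pos)

# perfect ranking: all positives scored above all negatives
print(r50_score([1]*100 + [0]*100, list(range(200, 0, -1))))  # -> 1.0
```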
In lieu of any of that I'm just going to try and train this model with 80% of the data and see if the result is exactly what they have in the paper. If it is, I'll just assume my interpretation of the R50 value is right.
#concatenate the positive and negative training sets together
X = pd.concat((nipd[nipfiles[0]].iloc[:,0:162],nipd[nipfiles[1]].iloc[:,0:162]))
#get the class label vector y the same way
y = pd.concat((nipd[nipfiles[0]].iloc[:,162],nipd[nipfiles[1]].iloc[:,162]))
#then shuffle all of them
X,y = skshuffle(X.values,y.values)
#find the 80% split point
eighty = int(len(X)*0.8)
#split it into test and train
Xtrain, Xtest = X[:eighty], X[eighty:]
ytrain, ytest = y[:eighty], y[eighty:]
Ran the code below, took a while (less than an hour though):
#reinitialise a logistic regression model
logmodel = sklm.LogisticRegression()
#fit it again
logmodel.fit(Xtrain,ytrain)
#test it again
estimates = logmodel.predict_proba(Xtest)
#recalc the roc curve
fpr, tpr, thresholds = roc_curve(ytest,estimates[:,1])
#print area under roc curve
roc_auc = auc(fpr,tpr)
print "Area under ROC curve: %.2f"%(roc_auc)
#replot the roc curve
plotroc(fpr,tpr)
Area under ROC curve: 0.94
To recalculate the R50 will need to know the size of the test set:
len(ytest)
48050
So we can only tolerate a false positive rate of:
r50fpr = 50.0/len(ytest)
print r50fpr
0.00104058272633
Then we can calculate the R50 for this:
roccrv = pd.DataFrame(array([fpr,tpr]).T)
r50 = auc(*zip(*roccrv[roccrv[0]<r50fpr].values))
print "The R50 value for this ROC curve is %.2f"%r50
The R50 value for this ROC curve is 0.00
So that can't be right. I'll ask the Machine Learning professors if they've heard of this value. In the meantime, I think I can just recreate the test set used in the paper using this quote:
...in our random test set, there are approximately 50 positive items (because 1 in 600 pairs are interacting and we selected 30,000 pairs)...
What's the problem then? Does the test set have too many positive items in it? The training and test sets have been shuffled in this example, so how many positive examples are in there?
sum(ytest)
554
That's a lot more than 50...
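If I do go down the route of recreating the paper's test set, resampling it would look something like this (a sketch under my reading of the quote; the function name and the use of the already-shuffled X, y arrays are my own):

```python
import numpy as np

def paper_style_test_set(X, y, n_pairs=30000, pos_ratio=1 / 600.0, seed=0):
    """Subsample a test set with the paper's quoted class balance:
    ~1 interacting pair in 600, so ~50 positives in 30,000 pairs."""
    rng = np.random.RandomState(seed)
    n_pos = int(round(n_pairs * pos_ratio))
    pos_idx = np.where(y == 1)[0]
    neg_idx = np.where(y == 0)[0]
    keep = np.concatenate([
        rng.choice(pos_idx, n_pos, replace=False),
        rng.choice(neg_idx, n_pairs - n_pos, replace=False),
    ])
    rng.shuffle(keep)
    return X[keep], y[keep]
```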
The only other lead is that the paper says it uses an LR classifier that:
...uses a ridge estimator for building a binary LR model.
This is already a binary LR model, so I guess I should make sure it's using a ridge estimator, i.e. that it's L2-regularised.
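As it happens, scikit-learn's LogisticRegression applies an L2 (ridge) penalty by default, with C as the inverse regularisation strength, so the model above should already be close to the ridge-estimator LR the paper describes:

```python
from sklearn.linear_model import LogisticRegression

# being explicit about the defaults: L2 penalty, C = 1.0
ridge_lr = LogisticRegression(penalty="l2", C=1.0)
print(ridge_lr.penalty, ridge_lr.C)  # -> l2 1.0
```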
Asked the Machine Learning professors about the R50 value and they had never heard of it.