The pulldown experiments let us identify proteins that are localised to the presynaptic active zone which we are interested in studying. Other than just identifying the proteins it also results in a couple of measures which can be useful for predicting whether proteins interact:
This notebook looks at the second of these two:
The affinity measure which has already been extracted from this is based on this paper by Xie et al. They call their co-complexed score a C2S score and has a probabilistic basis, which is ideal for our use. All we need to do is extract a value for each protein pair we're looking at and see what kind of coverage these values give us.
Due to the convenient structure of the file we are using we can extract the feature with the functionality of ocbio.extract
.
Each protein pair is in a row in Entrez Gene ID format with the corresponding C2S value.
cd ../../
/data/opencast/MRes
Unfortunately, when running the ocbio.extract code naively we
To improve the coverage we will first use the zeromissinginternal
option in the FeatureVectorAssembler.
This will replace any interactions between proteins which the database knows about but doesn't have a value stored with zeros.
Secondly, we will interpolate the expected value for this feature based on the values of other features with better coverage. To do this, first we will generate a set of feature vectors over the bait and prey combinations. Using this, we will train a linear regression algorithm.
Then, we can pickle the trained linear regressor and store the table to create the feature vectors it must use. These can then be passed as options for the parser in the code to generate the feature vectors for the classification task.
Loading the features which were used for classification in the pipeline prototype notebook:
import sys,csv
sys.path.append("opencast-bio/")
import ocbio.extract
reload(ocbio.extract)
<module 'ocbio.extract' from 'opencast-bio/ocbio/extract.py'>
!git annex unlock datasource.proto.tab
unlock datasource.proto.tab (copying...) ok
f = open("datasource.proto.tab","w")
c = csv.writer(f,delimiter="\t")
# the HIPPIE feature
c.writerow(["HIPPIE/hippie_current.txt","HIPPIE/feature.HIPPIE.2.db","protindexes=(1,3);valindexes=(4);zeromissing=1"])
# Gene Ontology features
c.writerow(["Gene_Ontology","Gene_Ontology","generator=geneontology/testgen.pickle"])
# iRefIndex features
c.writerow(["iRefIndex","iRefIndex","generator=iRefIndex/human.iRefIndex.Entrez.1ofk.pickle"])
# STRING features
c.writerow(["STRING","STRING","generator=string/human.Entrez.string.pickle"])
# ENTS summary feature
c.writerow(['ENTS_summary','ENTS_summary','generator=ents/human.Entrez.ENTS.summary.pickle'])
f.close()
assembler = ocbio.extract.FeatureVectorAssembler("datasource.proto.tab", verbose=True)
Using from top data directory datasource.proto.tab. Reading data source table: Data source: HIPPIE/hippie_current.txt to be processed to HIPPIE/feature.HIPPIE.2.db Data source: Gene_Ontology to be processed to Gene_Ontology Data source: iRefIndex to be processed to iRefIndex Data source: STRING to be processed to STRING Data source: ENTS_summary to be processed to ENTS_summary Initialising parsers. Database HIPPIE/feature.HIPPIE.2.db may be empty, must be updated, please run regenerate. Finished Initialisation.
assembler.regenerate(verbose=True)
Regenerating parsers: parser 0 Database file may be empty, regenerating at HIPPIE/feature.HIPPIE.2.db from HIPPIE/hippie_current.txt. Filling database......................................................................................................................................................................... Parsed 169626 lines. parser 1 Custom generator function, no database to regenerate. parser 2 Custom generator function, no database to regenerate. parser 3 Custom generator function, no database to regenerate. parser 4 Custom generator function, no database to regenerate.
assembler.assemble("forGAVIN/pulldown_data/pulldown.interactions.Entrez.tsv",
"features/pulldown.interactions.interpolate.vectors.txt", verbose=True)
Reading pairfile: forGAVIN/pulldown_data/pulldown.interactions.Entrez.tsv Checking feature sizes: Data source HIPPIE/hippie_current.txt produces features of size 1. Data source Gene_Ontology produces features of size 90. Data source iRefIndex produces features of size 11. Data source STRING produces features of size 8. Data source ENTS_summary produces features of size 1. Writing feature vectors......................................................................................................................................................................... Wrote 1699246 vectors. Matched 100.00 % of protein pairs in forGAVIN/pulldown_data/pulldown.interactions.Entrez.tsv to features from HIPPIE/hippie_current.txt Matched 100.00 % of protein pairs in forGAVIN/pulldown_data/pulldown.interactions.Entrez.tsv to features from Gene_Ontology Matched 100.00 % of protein pairs in forGAVIN/pulldown_data/pulldown.interactions.Entrez.tsv to features from iRefIndex Matched 100.00 % of protein pairs in forGAVIN/pulldown_data/pulldown.interactions.Entrez.tsv to features from STRING Matched 100.00 % of protein pairs in forGAVIN/pulldown_data/pulldown.interactions.Entrez.tsv to features from ENTS_summary
f = open("datasource.affinity.tab","w")
c = csv.writer(f,delimiter="\t")
# just the affinity feature
c.writerow(["affinityresults/results2/unique_data_ppi_coor_C2S_entrez.csv",
"affinityresults/results2/affinity.Entrez.db","zeromissinginternal=1"])
f.close()
!git annex unlock affinityresults/results2/affinity.Entrez.db
assembler = ocbio.extract.FeatureVectorAssembler("datasource.affinity.tab", verbose=True)
Using from top data directory datasource.affinity.tab. Reading data source table: Data source: affinityresults/results2/unique_data_ppi_coor_C2S_entrez.csv to be processed to affinityresults/results2/affinity.Entrez.db Initialising parsers. Database affinityresults/results2/affinity.Entrez.db last updated 2014-06-25 12:06:47 Finished Initialisation.
assembler.assemble("forGAVIN/pulldown_data/pulldown.interactions.Entrez.tsv",
"features/pulldown.interactions.interpolate.affinity.targets.txt", verbose=True)
Reading pairfile: forGAVIN/pulldown_data/pulldown.interactions.Entrez.tsv Checking feature sizes: Data source affinityresults/results2/unique_data_ppi_coor_C2S_entrez.csv produces features of size 1. Writing feature vectors......................................................................................................................................................................... Wrote 1699246 vectors. Matched 92.24 % of protein pairs in forGAVIN/pulldown_data/pulldown.interactions.Entrez.tsv to features from affinityresults/results2/unique_data_ppi_coor_C2S_entrez.csv
Scikit-learn has an implementation of linear regression we can train to interpolate missing values. Specifically, in this case we will be using the ridge regression model to avoid overfitting on this large dataset. We will also be using 10-fold cross-validation and looking at the mean squared error over the test set to estimate the effectiveness of the model.
import sklearn.linear_model
#load vectors into memory:
#loading X vectors
X = loadtxt("features/pulldown.interactions.interpolate.vectors.txt")
#loading y targets
y = loadtxt("features/pulldown.interactions.interpolate.affinity.targets.txt",dtype=str)
missinglabels = y == "missing"
#zero missing values
y[missinglabels] = 0.0
#convert to float
y = y.astype(np.float)
import sklearn.utils
X,y = sklearn.utils.shuffle(X,y)
import sklearn.cross_validation
kf = sklearn.cross_validation.KFold(y.shape[0],10)
for train,test in kf:
print train,test
[ 169925 169926 169927 ..., 1699243 1699244 1699245] [ 0 1 2 ..., 169922 169923 169924] [ 0 1 2 ..., 1699243 1699244 1699245] [169925 169926 169927 ..., 339847 339848 339849] [ 0 1 2 ..., 1699243 1699244 1699245] [339850 339851 339852 ..., 509772 509773 509774] [ 0 1 2 ..., 1699243 1699244 1699245] [509775 509776 509777 ..., 679697 679698 679699] [ 0 1 2 ..., 1699243 1699244 1699245] [679700 679701 679702 ..., 849622 849623 849624] [ 0 1 2 ..., 1699243 1699244 1699245] [ 849625 849626 849627 ..., 1019547 1019548 1019549] [ 0 1 2 ..., 1699243 1699244 1699245] [1019550 1019551 1019552 ..., 1189471 1189472 1189473] [ 0 1 2 ..., 1699243 1699244 1699245] [1189474 1189475 1189476 ..., 1359395 1359396 1359397] [ 0 1 2 ..., 1699243 1699244 1699245] [1359398 1359399 1359400 ..., 1529319 1529320 1529321] [ 0 1 2 ..., 1529319 1529320 1529321] [1529322 1529323 1529324 ..., 1699243 1699244 1699245]
scores = []
for train,test in kf:
#split the data
X_train, X_test, y_train, y_test = X[train], X[test], y[train], y[test]
#train the classifier
linreg = sklearn.linear_model.LinearRegression()
linreg.fit(X_train,y_train)
#test the classifier
scores.append(linreg.score(X_test,y_test))
print scores
[0.006851770404450952, 0.0074703763710419757, 0.0088518497911050931, 0.0091843298107785465, 0.0088906585310051245, 0.0084456750397341462, 0.0073650204069001246, 0.0071932408986311591, 0.0072068518622049327, 0.0085652637462354519]
mean(scores)
0.0080025036862087506
According to the documentation:
The coefficient R^2 is defined as (1 - u/v), where u is the regression sum of squares
((y_true - y_pred) ** 2).sum()
and v is the residual sum of squares((y_true - y_true.mean()) ** 2).sum()
. Best possible score is 1.0, lower values are worse.
So, this isn't a very good regression algorithm. Trying an ensemble method, such as an AdaBoost regressor.
import sklearn.ensemble
adascores = []
for train,test in kf:
#split the data
X_train, X_test, y_train, y_test = X[train], X[test], y[train], y[test]
#train the classifier
ada = sklearn.ensemble.AdaBoostRegressor()
ada.fit(X_train,y_train)
#test the classifier
adascores.append(ada.score(X_test,y_test))
break
Had to break the above loop after one iteration as it was taking too long to run.
adascores
[-1.9735555022882036]
So the regression sum of squares much be three times larger than the residual sum of squares. Obviously, that's worse than the above.
Xycov = cov(X.T,y.T)
print Xycov.shape
(112, 112)
print Xycov[:,-1][Xycov[:,-1]>1]
[ 1.87683432 1.41172834 1.56386886 1.88611184 4.12909717 2.59874547]
So several of the features are strongly correlated with this feature. Worth checking that this actually was the case.
Trying a ridge regression classifier, with different values for alpha:
ridgescores = []
for train,test in kf:
#split the data
X_train, X_test, y_train, y_test = X[train], X[test], y[train], y[test]
#train the classifier
ridge = sklearn.linear_model.Ridge(alpha=1.0)
ridge.fit(X_train,y_train)
#test the classifier
ridgescores.append(ridge.score(X_test,y_test))
break
ridgescores
[0.0084991392124383891]
Still not better than simple linear regression.
plot(X[:,-2][:100],y[:100])
[<matplotlib.lines.Line2D at 0x3e47750>]
import time
N = 100
XN = X[:,:][N-100:N]
plot(XN/amax(XN,axis=0))
plot(y[N-100:N]/max(abs(y[N-100:N])), label="target")
ypred = ridge.predict(X[N-100:N,:])
plot(ypred/max(abs(ypred)))
legend()
N += 100
amax(XN,axis=0)
array([ 0., 0., 0., 0.])
len(amax(X,axis=0))
111
So it looks like there is not much point to this approach. Would be as useful to simply fill with an average value.
print "Average value of pulldown affinity score over pulldown data: {0}".format(mean(y))
Average value of pulldown affinity score over pulldown data: -0.578366553592
import pickle
f = open("affinityresults/results2/affinity.pulldown.average.pickle","wb")
pickle.dump([mean(y)],f)
f.close()
An alternative to trying to fill in the missing values prior to training is to train two models and compute the weighted average of their results. One of these models would be trained without the pulldown features over the whole training set. The other would be trained with the pulldown features over a subset of the training data. These models would then be tested with different weights for the averaging over the subset.
As the dataset we are using is large this subset may still be large enough to produce viable results, especially if we use leave-one-out cross validation.
Before we can try this though, we have to check what the coverage of these features over the training set is.
assembler = ocbio.extract.FeatureVectorAssembler("datasource.affinity.tab", verbose=True)
Using from top data directory datasource.affinity.tab. Reading data source table: Data source: affinityresults/results2/unique_data_ppi_coor_C2S_entrez.csv to be processed to affinityresults/results2/affinity.Entrez.db Initialising parsers. Database affinityresults/results2/affinity.Entrez.db last updated 2014-06-25 12:06:47 Finished Initialisation.
By generating vectors we can see what the coverage is over the DIP training set:
assembler.assemble("DIP/human/training.nolabel.positive.Entrez.txt",
"features/training.DIP.positive.affinity.txt", verbose=True)
Reading pairfile: DIP/human/training.nolabel.positive.Entrez.txt Checking feature sizes: Data source affinityresults/results2/unique_data_ppi_coor_C2S_entrez.csv produces features of size 1. Writing feature vectors Wrote 4857 vectors. Matched 3.71 % of protein pairs in DIP/human/training.nolabel.positive.Entrez.txt to features from affinityresults/results2/unique_data_ppi_coor_C2S_entrez.csv
print "Number of positive vectors available: {0}".format(int(0.0371*4857))
Number of positive vectors available: 180
assembler.assemble("DIP/human/training.nolabel.negative.Entrez.txt",
"features/training.DIP.negative.affinity.txt", verbose=True)
Reading pairfile: DIP/human/training.nolabel.negative.Entrez.txt Checking feature sizes: Data source affinityresults/results2/unique_data_ppi_coor_C2S_entrez.csv produces features of size 1. Writing feature vectors.................................................................................................................................................................................................................................................................................................................. Wrote 3061800 vectors. Matched 2.28 % of protein pairs in DIP/human/training.nolabel.negative.Entrez.txt to features from affinityresults/results2/unique_data_ppi_coor_C2S_entrez.csv
print "Number of negative vectors available: {0}".format(int(0.0228*3061800))
Number of negative vectors available: 69809
We can do the same for the HIPPIE training set:
assembler.assemble("HIPPIE/hippie_current.pairs.txt",
"features/training.HIPPIE.positive.affinity.txt", verbose=True)
Reading pairfile: HIPPIE/hippie_current.pairs.txt Checking feature sizes: Data source affinityresults/results2/unique_data_ppi_coor_C2S_entrez.csv produces features of size 1. Writing feature vectors................ Wrote 169626 vectors. Matched 8.96 % of protein pairs in HIPPIE/hippie_current.pairs.txt to features from affinityresults/results2/unique_data_ppi_coor_C2S_entrez.csv
print "Number of positive HIPPIE interactions available: {0}".format(int(0.0896*169626))
Number of positive HIPPIE interactions available: 15198
So the above may be a viable strategy.