The pulldown experiments let us identify proteins that are localised to the presynaptic active zone which we are interested in studying. Other than just identifying the proteins it also results in a couple of measures which can be useful for predicting whether proteins interact:

Protein abundances
Affinity measures

This notebook looks at the second of these two:

Affinity measures¶

The affinity measure which has already been extracted from this is based on this paper by Xie et al. They call their co-complexed score a C2S score and has a probabilistic basis, which is ideal for our use. All we need to do is extract a value for each protein pair we're looking at and see what kind of coverage these values give us.

Due to the convenient structure of the file we are using we can extract the feature with the functionality of ocbio.extract. Each protein pair is in a row in Entrez Gene ID format with the corresponding C2S value.

In [1]:

cd ../../

/data/opencast/MRes

Improving coverage¶

Unfortunately, when running the ocbio.extract code naively we To improve the coverage we will first use the zeromissinginternal option in the FeatureVectorAssembler. This will replace any interactions between proteins which the database knows about but doesn't have a value stored with zeros.

Secondly, we will interpolate the expected value for this feature based on the values of other features with better coverage. To do this, first we will generate a set of feature vectors over the bait and prey combinations. Using this, we will train a linear regression algorithm.

Then, we can pickle the trained linear regressor and store the table to create the feature vectors it must use. These can then be passed as options for the parser in the code to generate the feature vectors for the classification task.

Loading regression features¶

Loading the features which were used for classification in the pipeline prototype notebook:

In [2]:

import sys,csv

In [3]:

sys.path.append("opencast-bio/")

In [4]:

import ocbio.extract

In [17]:

reload(ocbio.extract)

Out[17]:

<module 'ocbio.extract' from 'opencast-bio/ocbio/extract.py'>

In [7]:

!git annex unlock datasource.proto.tab

unlock datasource.proto.tab (copying...) ok

In [8]:

f = open("datasource.proto.tab","w")
c = csv.writer(f,delimiter="\t")
# the HIPPIE feature
c.writerow(["HIPPIE/hippie_current.txt","HIPPIE/feature.HIPPIE.2.db","protindexes=(1,3);valindexes=(4);zeromissing=1"])
# Gene Ontology features
c.writerow(["Gene_Ontology","Gene_Ontology","generator=geneontology/testgen.pickle"])
# iRefIndex features
c.writerow(["iRefIndex","iRefIndex","generator=iRefIndex/human.iRefIndex.Entrez.1ofk.pickle"])
# STRING features
c.writerow(["STRING","STRING","generator=string/human.Entrez.string.pickle"])
# ENTS summary feature
c.writerow(['ENTS_summary','ENTS_summary','generator=ents/human.Entrez.ENTS.summary.pickle'])
f.close()

In [16]:

assembler = ocbio.extract.FeatureVectorAssembler("datasource.proto.tab", verbose=True)

Using  from top data directory datasource.proto.tab.
Reading data source table:
	Data source: HIPPIE/hippie_current.txt to be processed to HIPPIE/feature.HIPPIE.2.db
	Data source: Gene_Ontology to be processed to Gene_Ontology
	Data source: iRefIndex to be processed to iRefIndex
	Data source: STRING to be processed to STRING
	Data source: ENTS_summary to be processed to ENTS_summary
Initialising parsers.
Database HIPPIE/feature.HIPPIE.2.db may be empty, must be updated, please run regenerate.
Finished Initialisation.

In [12]:

assembler.regenerate(verbose=True)

Regenerating parsers:
	 parser 0
Database file may be empty, regenerating at HIPPIE/feature.HIPPIE.2.db from HIPPIE/hippie_current.txt.
Filling database.........................................................................................................................................................................
Parsed 169626 lines.
	 parser 1
Custom generator function, no database to regenerate.
	 parser 2
Custom generator function, no database to regenerate.
	 parser 3
Custom generator function, no database to regenerate.
	 parser 4
Custom generator function, no database to regenerate.

Writing training vectors¶

In [17]:

assembler.assemble("forGAVIN/pulldown_data/pulldown.interactions.Entrez.tsv",
                   "features/pulldown.interactions.interpolate.vectors.txt", verbose=True)

Reading pairfile: forGAVIN/pulldown_data/pulldown.interactions.Entrez.tsv
Checking feature sizes:
	 Data source HIPPIE/hippie_current.txt produces features of size 1.
	 Data source Gene_Ontology produces features of size 90.
	 Data source iRefIndex produces features of size 11.
	 Data source STRING produces features of size 8.
	 Data source ENTS_summary produces features of size 1.
Writing feature vectors.........................................................................................................................................................................
Wrote 1699246 vectors.
Matched 100.00 % of protein pairs in forGAVIN/pulldown_data/pulldown.interactions.Entrez.tsv to features from HIPPIE/hippie_current.txt
Matched 100.00 % of protein pairs in forGAVIN/pulldown_data/pulldown.interactions.Entrez.tsv to features from Gene_Ontology
Matched 100.00 % of protein pairs in forGAVIN/pulldown_data/pulldown.interactions.Entrez.tsv to features from iRefIndex
Matched 100.00 % of protein pairs in forGAVIN/pulldown_data/pulldown.interactions.Entrez.tsv to features from STRING
Matched 100.00 % of protein pairs in forGAVIN/pulldown_data/pulldown.interactions.Entrez.tsv to features from ENTS_summary

Writing training labels¶

In [18]:

f = open("datasource.affinity.tab","w")
c = csv.writer(f,delimiter="\t")
# just the affinity feature
c.writerow(["affinityresults/results2/unique_data_ppi_coor_C2S_entrez.csv",
            "affinityresults/results2/affinity.Entrez.db","zeromissinginternal=1"])
f.close()

In [7]:

!git annex unlock affinityresults/results2/affinity.Entrez.db

In [8]:

assembler = ocbio.extract.FeatureVectorAssembler("datasource.affinity.tab", verbose=True)

Using  from top data directory datasource.affinity.tab.
Reading data source table:
	Data source: affinityresults/results2/unique_data_ppi_coor_C2S_entrez.csv to be processed to affinityresults/results2/affinity.Entrez.db
Initialising parsers.
Database affinityresults/results2/affinity.Entrez.db last updated 2014-06-25 12:06:47
Finished Initialisation.

In [9]:

assembler.assemble("forGAVIN/pulldown_data/pulldown.interactions.Entrez.tsv",
                   "features/pulldown.interactions.interpolate.affinity.targets.txt", verbose=True)

Reading pairfile: forGAVIN/pulldown_data/pulldown.interactions.Entrez.tsv
Checking feature sizes:
	 Data source affinityresults/results2/unique_data_ppi_coor_C2S_entrez.csv produces features of size 1.
Writing feature vectors.........................................................................................................................................................................
Wrote 1699246 vectors.
Matched 92.24 % of protein pairs in forGAVIN/pulldown_data/pulldown.interactions.Entrez.tsv to features from affinityresults/results2/unique_data_ppi_coor_C2S_entrez.csv

Linear regression¶

Scikit-learn has an implementation of linear regression we can train to interpolate missing values. Specifically, in this case we will be using the ridge regression model to avoid overfitting on this large dataset. We will also be using 10-fold cross-validation and looking at the mean squared error over the test set to estimate the effectiveness of the model.

In [3]:

import sklearn.linear_model

Loading the data¶

In [4]:

#load vectors into memory:
#loading X vectors
X = loadtxt("features/pulldown.interactions.interpolate.vectors.txt")

In [3]:

#loading y targets
y = loadtxt("features/pulldown.interactions.interpolate.affinity.targets.txt",dtype=str)

In [4]:

missinglabels = y == "missing"

In [5]:

#zero missing values
y[missinglabels] = 0.0

In [6]:

#convert to float
y = y.astype(np.float)

Splitting the data¶

In [9]:

import sklearn.utils

In [10]:

X,y = sklearn.utils.shuffle(X,y)

In [11]:

import sklearn.cross_validation

In [12]:

kf = sklearn.cross_validation.KFold(y.shape[0],10)

In [13]:

for train,test in kf:
    print train,test

[ 169925  169926  169927 ..., 1699243 1699244 1699245] [     0      1      2 ..., 169922 169923 169924]
[      0       1       2 ..., 1699243 1699244 1699245] [169925 169926 169927 ..., 339847 339848 339849]
[      0       1       2 ..., 1699243 1699244 1699245] [339850 339851 339852 ..., 509772 509773 509774]
[      0       1       2 ..., 1699243 1699244 1699245] [509775 509776 509777 ..., 679697 679698 679699]
[      0       1       2 ..., 1699243 1699244 1699245] [679700 679701 679702 ..., 849622 849623 849624]
[      0       1       2 ..., 1699243 1699244 1699245] [ 849625  849626  849627 ..., 1019547 1019548 1019549]
[      0       1       2 ..., 1699243 1699244 1699245] [1019550 1019551 1019552 ..., 1189471 1189472 1189473]
[      0       1       2 ..., 1699243 1699244 1699245] [1189474 1189475 1189476 ..., 1359395 1359396 1359397]
[      0       1       2 ..., 1699243 1699244 1699245] [1359398 1359399 1359400 ..., 1529319 1529320 1529321]
[      0       1       2 ..., 1529319 1529320 1529321] [1529322 1529323 1529324 ..., 1699243 1699244 1699245]

Fitting the model¶

In [29]:

scores = []
for train,test in kf:
    #split the data
    X_train, X_test, y_train, y_test = X[train], X[test], y[train], y[test]
    #train the classifier
    linreg = sklearn.linear_model.LinearRegression()
    linreg.fit(X_train,y_train)
    #test the classifier
    scores.append(linreg.score(X_test,y_test))

In [30]:

print scores

[0.006851770404450952, 0.0074703763710419757, 0.0088518497911050931, 0.0091843298107785465, 0.0088906585310051245, 0.0084456750397341462, 0.0073650204069001246, 0.0071932408986311591, 0.0072068518622049327, 0.0085652637462354519]

In [31]:

mean(scores)

Out[31]:

0.0080025036862087506

According to the documentation:

The coefficient R^2 is defined as (1 - u/v), where u is the regression sum of squares ((y_true - y_pred) ** 2).sum() and v is the residual sum of squares ((y_true - y_true.mean()) ** 2).sum(). Best possible score is 1.0, lower values are worse.

So, this isn't a very good regression algorithm. Trying an ensemble method, such as an AdaBoost regressor.

In [15]:

import sklearn.ensemble

In [17]:

adascores = []
for train,test in kf:
    #split the data
    X_train, X_test, y_train, y_test = X[train], X[test], y[train], y[test]
    #train the classifier
    ada = sklearn.ensemble.AdaBoostRegressor()
    ada.fit(X_train,y_train)
    #test the classifier
    adascores.append(ada.score(X_test,y_test))
    break

Had to break the above loop after one iteration as it was taking too long to run.

In [18]:

adascores

Out[18]:

[-1.9735555022882036]

So the regression sum of squares much be three times larger than the residual sum of squares. Obviously, that's worse than the above.

In [24]:

Xycov = cov(X.T,y.T)

In [28]:

print Xycov.shape

(112, 112)

In [38]:

print Xycov[:,-1][Xycov[:,-1]>1]

[ 1.87683432  1.41172834  1.56386886  1.88611184  4.12909717  2.59874547]

So several of the features are strongly correlated with this feature. Worth checking that this actually was the case.

Trying a ridge regression classifier, with different values for alpha:

In [123]:

ridgescores = []
for train,test in kf:
    #split the data
    X_train, X_test, y_train, y_test = X[train], X[test], y[train], y[test]
    #train the classifier
    ridge = sklearn.linear_model.Ridge(alpha=1.0)
    ridge.fit(X_train,y_train)
    #test the classifier
    ridgescores.append(ridge.score(X_test,y_test))
    break

In [47]:

ridgescores

Out[47]:

[0.0084991392124383891]

Still not better than simple linear regression.

In [49]:

plot(X[:,-2][:100],y[:100])

Out[49]:

[<matplotlib.lines.Line2D at 0x3e47750>]

In [61]:

import time

In [125]:

N = 100

In [191]:

XN = X[:,:][N-100:N]
plot(XN/amax(XN,axis=0))
plot(y[N-100:N]/max(abs(y[N-100:N])), label="target")
ypred = ridge.predict(X[N-100:N,:])
plot(ypred/max(abs(ypred)))
legend()
N += 100

In [159]:

amax(XN,axis=0)

Out[159]:

array([ 0.,  0.,  0.,  0.])

In [156]:

len(amax(X,axis=0))

Out[156]:

So it looks like there is not much point to this approach. Would be as useful to simply fill with an average value.

In [7]:

print "Average value of pulldown affinity score over pulldown data: {0}".format(mean(y))

Average value of pulldown affinity score over pulldown data: -0.578366553592

In [8]:

import pickle

In [9]:

f = open("affinityresults/results2/affinity.pulldown.average.pickle","wb")
pickle.dump([mean(y)],f)
f.close()

Checking training set coverage¶

An alternative to trying to fill in the missing values prior to training is to train two models and compute the weighted average of their results. One of these models would be trained without the pulldown features over the whole training set. The other would be trained with the pulldown features over a subset of the training data. These models would then be tested with different weights for the averaging over the subset.

As the dataset we are using is large this subset may still be large enough to produce viable results, especially if we use leave-one-out cross validation.

Before we can try this though, we have to check what the coverage of these features over the training set is.

In [18]:

assembler = ocbio.extract.FeatureVectorAssembler("datasource.affinity.tab", verbose=True)

Using  from top data directory datasource.affinity.tab.
Reading data source table:
	Data source: affinityresults/results2/unique_data_ppi_coor_C2S_entrez.csv to be processed to affinityresults/results2/affinity.Entrez.db
Initialising parsers.
Database affinityresults/results2/affinity.Entrez.db last updated 2014-06-25 12:06:47
Finished Initialisation.

DIP¶

By generating vectors we can see what the coverage is over the DIP training set:

In [8]:

assembler.assemble("DIP/human/training.nolabel.positive.Entrez.txt",
                   "features/training.DIP.positive.affinity.txt", verbose=True)

Reading pairfile: DIP/human/training.nolabel.positive.Entrez.txt
Checking feature sizes:
	 Data source affinityresults/results2/unique_data_ppi_coor_C2S_entrez.csv produces features of size 1.
Writing feature vectors
Wrote 4857 vectors.
Matched 3.71 % of protein pairs in DIP/human/training.nolabel.positive.Entrez.txt to features from affinityresults/results2/unique_data_ppi_coor_C2S_entrez.csv

In [9]:

print "Number of positive vectors available: {0}".format(int(0.0371*4857))

Number of positive vectors available: 180

In [10]:

assembler.assemble("DIP/human/training.nolabel.negative.Entrez.txt",
                   "features/training.DIP.negative.affinity.txt", verbose=True)

Reading pairfile: DIP/human/training.nolabel.negative.Entrez.txt
Checking feature sizes:
	 Data source affinityresults/results2/unique_data_ppi_coor_C2S_entrez.csv produces features of size 1.
Writing feature vectors..................................................................................................................................................................................................................................................................................................................
Wrote 3061800 vectors.
Matched 2.28 % of protein pairs in DIP/human/training.nolabel.negative.Entrez.txt to features from affinityresults/results2/unique_data_ppi_coor_C2S_entrez.csv

In [11]:

print "Number of negative vectors available: {0}".format(int(0.0228*3061800))

Number of negative vectors available: 69809

HIPPIE¶

We can do the same for the HIPPIE training set:

In [19]:

assembler.assemble("HIPPIE/hippie_current.pairs.txt",
                   "features/training.HIPPIE.positive.affinity.txt", verbose=True)

Reading pairfile: HIPPIE/hippie_current.pairs.txt
Checking feature sizes:
	 Data source affinityresults/results2/unique_data_ppi_coor_C2S_entrez.csv produces features of size 1.
Writing feature vectors................
Wrote 169626 vectors.
Matched 8.96 % of protein pairs in HIPPIE/hippie_current.pairs.txt to features from affinityresults/results2/unique_data_ppi_coor_C2S_entrez.csv

In [20]:

print "Number of positive HIPPIE interactions available: {0}".format(int(0.0896*169626))

Number of positive HIPPIE interactions available: 15198

So the above may be a viable strategy.