The abundances reported by an affinity purification experiment can be used to infer whether two proteins interact. The data take the form of an abundance value, measured across a number of experiments, for every protein involved. The more abundant a protein is, the more likely it is to be observed interacting with other proteins.
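As an illustration of how per-protein abundances might be turned into a feature for a protein *pair*, here is a minimal sketch. The protein names, abundance values, and the `pair_feature` combination rule (minimum of the two mean abundances) are all hypothetical; the actual pairwise feature is computed by the extraction pipeline below.

```python
import numpy as np

# Hypothetical abundance profiles: one value per pull-down experiment
# for each protein. All names and numbers are made up for illustration.
abundance = {
    "ProteinA": np.array([120.0, 300.0, 80.0, 450.0]),
    "ProteinB": np.array([110.0, 280.0, 95.0, 400.0]),
    "ProteinC": np.array([5.0, 0.0, 3.0, 1.0]),
}

def pair_feature(p, q, table=abundance):
    """One plausible pairwise feature: the smaller of the two proteins'
    mean abundances. A low value means at least one partner is rare."""
    return min(table[p].mean(), table[q].mean())

# A pair of abundant proteins scores much higher than a pair
# involving a rare protein.
assert pair_feature("ProteinA", "ProteinB") > pair_feature("ProteinA", "ProteinC")
```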
Missing values will be treated in the same way as for the affinity feature extraction: we compute an average value for this feature and pickle it, so it can be substituted for missing values when assembling the full training set.
cd ../..
/data/opencast/MRes
import csv
f = open("datasource.abundance.tab","w")
c = csv.writer(f,delimiter="\t")
# just the abundance feature
c.writerow(["forGAVIN/pulldown_data/dataset/ppi_ab_entrez.csv",
"forGAVIN/pulldown_data/dataset/abundance.Entrez.db","ignoreheader=1;zeromissinginternal=1"])
f.close()
import sys
sys.path.append("opencast-bio/")
import ocbio.extract
!git annex unlock forGAVIN/pulldown_data/dataset/abundance.Entrez.db
unlock forGAVIN/pulldown_data/dataset/abundance.Entrez.db (copying...) ok
assembler = ocbio.extract.FeatureVectorAssembler("datasource.abundance.tab",verbose=True)
Using from top data directory datasource.abundance.tab.
Reading data source table:
	Data source: forGAVIN/pulldown_data/dataset/ppi_ab_entrez.csv to be processed to forGAVIN/pulldown_data/dataset/abundance.Entrez.db
Initialising parsers.
	Database forGAVIN/pulldown_data/dataset/abundance.Entrez.db last updated 2014-06-25 12:04:08
Finished Initialisation.
assembler.assemble("forGAVIN/pulldown_data/pulldown.interactions.Entrez.tsv",
"features/pulldown.interactions.interpolate.abundance.targets.txt",
verbose=True, missinglabel="0")
Reading pairfile: forGAVIN/pulldown_data/pulldown.interactions.Entrez.tsv
Checking feature sizes:
	Data source forGAVIN/pulldown_data/dataset/ppi_ab_entrez.csv produces features of size 1.
Writing feature vectors.........................................................................................................................................................................
Wrote 1699246 vectors.
Matched 99.78 % of protein pairs in forGAVIN/pulldown_data/pulldown.interactions.Entrez.tsv to features from forGAVIN/pulldown_data/dataset/ppi_ab_entrez.csv
y = loadtxt("features/pulldown.interactions.interpolate.abundance.targets.txt")
print "Average value of abundance feature: {0}".format(mean(y[y>1]))
Average value of abundance feature: 650944.146792
import pickle
f = open("forGAVIN/pulldown_data/dataset/abundance.average.pickle","wb")
# exclude missing (zero-coded) entries, matching the average printed above
pickle.dump([mean(y[y>1])],f)
f.close()
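The sketch below shows how a stored average like this could later stand in for missing values. It writes to a temporary file rather than the notebook's `forGAVIN/pulldown_data/dataset/abundance.average.pickle`, and uses synthetic numbers; the one-element list layout mirrors the pickle call above, but the actual consumer code in the extraction pipeline may differ.

```python
import pickle
import numpy as np
from tempfile import NamedTemporaryFile

# Synthetic feature column: missing values are coded as 0,
# as with missinglabel="0" above.
y = np.array([650000.0, 0.0, 700000.0, 0.0, 600000.0])
average = y[y > 0].mean()

# Store the average, mirroring the [mean(...)] list layout above.
with NamedTemporaryFile(delete=False) as f:
    pickle.dump([average], f)
    path = f.name

# Later, when assembling the full training set, the stored value
# can be substituted for pairs with no abundance measurement.
with open(path, "rb") as f:
    (fill_value,) = pickle.load(f)

filled = np.where(y == 0, fill_value, y)
```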
To confirm that linear regression performs no better on this dataset than it did in the affinity notebook, we fit a linear regression model here as well:
#loading X vectors
X = loadtxt("features/pulldown.interactions.interpolate.vectors.txt")
import sklearn.utils
X,y = sklearn.utils.shuffle(X,y)
import sklearn.cross_validation
kf = sklearn.cross_validation.KFold(y.shape[0],10)
import sklearn.linear_model
scores = []
for train,test in kf:
    #split the data
    X_train, X_test, y_train, y_test = X[train], X[test], y[train], y[test]
    #train the classifier
    linreg = sklearn.linear_model.LinearRegression()
    linreg.fit(X_train,y_train)
    #test the classifier
    scores.append(linreg.score(X_test,y_test))
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-41-6e78862002a9> in <module>()
      5 #train the classifier
      6 linreg = sklearn.linear_model.LinearRegression()
----> 7 linreg.fit(X_train,y_train)
      8 #test the classifier
      9 scores.append(linreg.score(X_test,y_test))

/usr/local/lib/python2.7/dist-packages/sklearn/linear_model/base.pyc in fit(self, X, y, n_jobs)
    369         else:
    370             self.coef_, self.residues_, self.rank_, self.singular_ = \
--> 371                 linalg.lstsq(X, y)
    372         self.coef_ = self.coef_.T
    373

/usr/lib/python2.7/dist-packages/scipy/linalg/basic.pyc in lstsq(a, b, cond, overwrite_a, overwrite_b)
    433         v, x, s, rank, info = gelss(a1, b1, cond=cond, lwork=lwork,
    434                                     overwrite_a=overwrite_a,
--> 435                                     overwrite_b=overwrite_b)
    436         else:
    437             raise NotImplementedError('calling gelss from %s' % gelss.module_name)

KeyboardInterrupt:
Fitting on the full dataset was too slow, so the loop was interrupted after seven of the ten folds. The scores collected so far:
print scores
[0.0002388620761113458, -0.0022111557516242275, 0.0016565073645899986, 0.00067015014436855314, -6.2298034043895001e-05, 9.6373969038832108e-05, -0.00020647131170825617]
The R² scores are all essentially zero, so a linear regression model offers very little predictive advantage here.
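An R² score near zero means the model does essentially no better than always predicting the mean of `y`. A toy check with NumPy makes this concrete (synthetic random data, not the notebook's features; least squares is done via the normal equations rather than scikit-learn):

```python
import numpy as np

rng = np.random.RandomState(0)

# Synthetic data in which X carries no information about y.
X = rng.rand(1000, 1)
y = rng.rand(1000)

# Ordinary least squares with an intercept column.
A = np.hstack([X, np.ones((X.shape[0], 1))])
coef = np.linalg.lstsq(A, y, rcond=None)[0]
pred = A.dot(coef)

# Coefficient of determination, the quantity returned by
# LinearRegression.score.
ss_res = ((y - pred) ** 2).sum()
ss_tot = ((y - y.mean()) ** 2).sum()
r2 = 1.0 - ss_res / ss_tot

# With uninformative features, R^2 hovers around zero, matching
# the magnitude of the scores printed above.
assert abs(r2) < 0.05
```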