The abundances reported by an affinity purification experiment can be used to infer whether two proteins interact. The data take the form of an abundance value, measured across a number of experiments, for every protein involved. The more abundant a protein is, the more likely it is to be observed interacting with other proteins.
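As an illustration of how per-protein abundances might be turned into a feature for a protein *pair*, here is a minimal sketch. The protein names, abundance values, and the `pair_feature` combination rule (minimum of the two mean abundances) are all hypothetical; the actual pairwise feature is computed by the extraction pipeline below.

```python
import numpy as np

# Hypothetical abundance profiles: one value per pull-down experiment
# for each protein. All names and numbers are made up for illustration.
abundance = {
    "ProteinA": np.array([120.0, 300.0, 80.0, 450.0]),
    "ProteinB": np.array([110.0, 280.0, 95.0, 400.0]),
    "ProteinC": np.array([5.0, 0.0, 3.0, 1.0]),
}

def pair_feature(p, q, table=abundance):
    """One plausible pairwise feature: the smaller of the two proteins'
    mean abundances. A low value means at least one partner is rare."""
    return min(table[p].mean(), table[q].mean())

# A pair of abundant proteins scores much higher than a pair
# involving a rare protein.
assert pair_feature("ProteinA", "ProteinB") > pair_feature("ProteinA", "ProteinC")
```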
Missing values will be treated in the same way as for the affinity feature extraction: we compute an average value for this feature and pickle it, so it can be substituted for missing values when assembling the full training set.
cd ../..
/data/opencast/MRes
import csv
f = open("datasource.abundance.tab","w")
c = csv.writer(f,delimiter="\t")
# just the abundance feature
c.writerow(["forGAVIN/pulldown_data/dataset/ppi_ab_entrez.csv",
"forGAVIN/pulldown_data/dataset/abundance.Entrez.db","ignoreheader=1;zeromissinginternal=1"])
f.close()
import sys
sys.path.append("opencast-bio/")
import ocbio.extract
!git annex unlock forGAVIN/pulldown_data/dataset/abundance.Entrez.db
unlock forGAVIN/pulldown_data/dataset/abundance.Entrez.db (copying...) ok
assembler = ocbio.extract.FeatureVectorAssembler("datasource.abundance.tab",verbose=True)
Using from top data directory datasource.abundance.tab.
Reading data source table:
	Data source: forGAVIN/pulldown_data/dataset/ppi_ab_entrez.csv to be processed to forGAVIN/pulldown_data/dataset/abundance.Entrez.db
Initialising parsers.
	Database forGAVIN/pulldown_data/dataset/abundance.Entrez.db last updated 2014-06-25 12:04:08
Finished Initialisation.
assembler.assemble("forGAVIN/pulldown_data/pulldown.interactions.Entrez.tsv",
"features/pulldown.interactions.interpolate.abundance.targets.txt",
verbose=True, missinglabel="0")
Reading pairfile: forGAVIN/pulldown_data/pulldown.interactions.Entrez.tsv
Checking feature sizes:
	Data source forGAVIN/pulldown_data/dataset/ppi_ab_entrez.csv produces features of size 1.
Writing feature vectors.........................................................................................................................................................................
Wrote 1699246 vectors.
Matched 99.78 % of protein pairs in forGAVIN/pulldown_data/pulldown.interactions.Entrez.tsv to features from forGAVIN/pulldown_data/dataset/ppi_ab_entrez.csv
y = loadtxt("features/pulldown.interactions.interpolate.abundance.targets.txt")
print "Average value of abundance feature: {0}".format(mean(y[y>1]))
Average value of abundance feature: 650944.146792
import pickle
f = open("forGAVIN/pulldown_data/dataset/abundance.average.pickle","wb")
# exclude missing (zero-coded) entries, matching the average printed above
pickle.dump([mean(y[y>1])],f)
f.close()
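The sketch below shows how a stored average like this could later stand in for missing values. It writes to a temporary file rather than the notebook's `forGAVIN/pulldown_data/dataset/abundance.average.pickle`, and uses synthetic numbers; the one-element list layout mirrors the pickle call above, but the actual consumer code in the extraction pipeline may differ.

```python
import pickle
import numpy as np
from tempfile import NamedTemporaryFile

# Synthetic feature column: missing values are coded as 0,
# as with missinglabel="0" above.
y = np.array([650000.0, 0.0, 700000.0, 0.0, 600000.0])
average = y[y > 0].mean()

# Store the average, mirroring the [mean(...)] list layout above.
with NamedTemporaryFile(delete=False) as f:
    pickle.dump([average], f)
    path = f.name

# Later, when assembling the full training set, the stored value
# can be substituted for pairs with no abundance measurement.
with open(path, "rb") as f:
    (fill_value,) = pickle.load(f)

filled = np.where(y == 0, fill_value, y)
```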
To confirm that linear regression performs no better on this dataset than it did in the affinity notebook, we fit a linear regression model here as well:
#loading X vectors
X = loadtxt("features/pulldown.interactions.interpolate.vectors.txt")
import sklearn.utils
X,y = sklearn.utils.shuffle(X,y)
import sklearn.cross_validation
kf = sklearn.cross_validation.KFold(y.shape[0],10)
import sklearn.linear_model
scores = []
for train,test in kf:
    #split the data
    X_train, X_test, y_train, y_test = X[train], X[test], y[train], y[test]
    #train the classifier
    linreg = sklearn.linear_model.LinearRegression()
    linreg.fit(X_train,y_train)
    #test the classifier
    scores.append(linreg.score(X_test,y_test))
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-41-6e78862002a9> in <module>()
      5 #train the classifier
      6 linreg = sklearn.linear_model.LinearRegression()
----> 7 linreg.fit(X_train,y_train)
      8 #test the classifier
      9 scores.append(linreg.score(X_test,y_test))

/usr/local/lib/python2.7/dist-packages/sklearn/linear_model/base.pyc in fit(self, X, y, n_jobs)
    369         else:
    370             self.coef_, self.residues_, self.rank_, self.singular_ = \
--> 371                 linalg.lstsq(X, y)
    372         self.coef_ = self.coef_.T
    373

/usr/lib/python2.7/dist-packages/scipy/linalg/basic.pyc in lstsq(a, b, cond, overwrite_a, overwrite_b)
    433         v, x, s, rank, info = gelss(a1, b1, cond=cond, lwork=lwork,
    434                                     overwrite_a=overwrite_a,
--> 435                                     overwrite_b=overwrite_b)
    436         else:
    437             raise NotImplementedError('calling gelss from %s' % gelss.module_name)

KeyboardInterrupt:
Fitting on the full dataset was too slow, so the loop was interrupted after seven of the ten folds. The scores collected so far:
print scores
[0.0002388620761113458, -0.0022111557516242275, 0.0016565073645899986, 0.00067015014436855314, -6.2298034043895001e-05, 9.6373969038832108e-05, -0.00020647131170825617]
The R² scores are all essentially zero, so a linear regression model offers very little predictive advantage here.
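An R² score near zero means the model does essentially no better than always predicting the mean of `y`. A toy check with NumPy makes this concrete (synthetic random data, not the notebook's features; least squares is done via the normal equations rather than scikit-learn):

```python
import numpy as np

rng = np.random.RandomState(0)

# Synthetic data in which X carries no information about y.
X = rng.rand(1000, 1)
y = rng.rand(1000)

# Ordinary least squares with an intercept column.
A = np.hstack([X, np.ones((X.shape[0], 1))])
coef = np.linalg.lstsq(A, y, rcond=None)[0]
pred = A.dot(coef)

# Coefficient of determination, the quantity returned by
# LinearRegression.score.
ss_res = ((y - pred) ** 2).sum()
ss_tot = ((y - y.mean()) ** 2).sum()
r2 = 1.0 - ss_res / ss_tot

# With uninformative features, R^2 hovers around zero, matching
# the magnitude of the scores printed above.
assert abs(r2) < 0.05
```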