Generating the training and test sets involves using the ocbio.extract
module with the chosen gold-standard positive and negative datasets.
This notebook acts as a script for doing so, with documentation inline.
First, the data source table must be regenerated in the top-level directory containing the data:
cd ../../
/data/opencast/MRes
import csv
As the data repository is now managed with git annex, the data source table must first be unlocked:
!git annex unlock datasource.tab
unlock datasource.tab (copying...) ok
# This script should be updated to add new features as they become available.
with open("datasource.tab", "w") as f:
    c = csv.writer(f, delimiter="\t")
    # Gene Ontology features
    c.writerow(["Gene_Ontology", "Gene_Ontology", "generator=geneontology/testgen.pickle"])
    # Y2H SVM feature
    c.writerow(["Y2H/Y2H.txt", "Y2H/Y2H.db", "valindexes=(4);ignoreheader=1;zeromissing=1"])
    # ENTS feature
    c.writerow(["ENTS", "ENTS", "generator=ents/human.ENTS.features.pickle"])
    # ENTS summary feature
    c.writerow(["ENTS_summary", "ENTS_summary", "generator=ents/human.Entrez.ENTS.summary.pickle"])
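As a quick sanity check, the table can be read back and its third column split into options. The sketch below is illustrative only: `parse_options` and `read_datasource_table` are hypothetical helpers, not part of ocbio.extract, and the `key=value;key=value` option format is an assumption inferred from the rows written above. It works on a throwaway copy of two rows so the real table is left untouched.

```python
import csv
import os
import tempfile


def parse_options(option_string):
    """Split 'key=value;key=value' into a dict (format assumed from the rows above)."""
    options = {}
    for item in option_string.split(";"):
        if item:
            key, _, value = item.partition("=")
            options[key] = value
    return options


def read_datasource_table(path):
    """Return a list of (source, destination, options) tuples from a tab-separated table."""
    rows = []
    with open(path) as handle:
        for source, destination, option_string in csv.reader(handle, delimiter="\t"):
            rows.append((source, destination, parse_options(option_string)))
    return rows


# Write two of the rows to a temporary file rather than the real datasource.tab.
demo_path = os.path.join(tempfile.mkdtemp(), "datasource.tab")
with open(demo_path, "w") as handle:
    handle.write("Y2H/Y2H.txt\tY2H/Y2H.db\tvalindexes=(4);ignoreheader=1;zeromissing=1\n")
    handle.write("ENTS\tENTS\tgenerator=ents/human.ENTS.features.pickle\n")

table = read_datasource_table(demo_path)
for source, destination, options in table:
    print("%s -> %s %s" % (source, destination, sorted(options.items())))
```

Pointing `read_datasource_table` at the real datasource.tab would list the four data sources in the same way.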
Next, the opencast-bio directory must be added to the path and ocbio.extract imported:
import sys
sys.path.append("opencast-bio/")
import ocbio.extract
reload(ocbio.extract)
<module 'ocbio.extract' from 'opencast-bio/ocbio/extract.pyc'>
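The bare `reload` call above is a builtin only in Python 2. A minimal sketch of a version-tolerant equivalent is shown here, using the standard-library `json` module as a stand-in for ocbio.extract (an assumption made purely so the example runs anywhere):

```python
try:
    from importlib import reload  # Python 3.4 and later
except ImportError:
    pass  # Python 2: reload is already a builtin

import json  # stand-in for ocbio.extract in this sketch

# Re-execute the module's source so that edits made on disk are picked up.
json = reload(json)
print(json.__name__)
```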
!git annex unlock Y2H/Y2H.db
unlock Y2H/Y2H.db (copying...) ok
Then an assembler object must be initialised using the data source table:
assembler = ocbio.extract.FeatureVectorAssembler("datasource.tab", verbose=True)
Using from top data directory datasource.tab.
Reading data source table:
Data source: Gene_Ontology to be processed to Gene_Ontology
Data source: Y2H/Y2H.txt to be processed to Y2H/Y2H.db
Data source: ENTS to be processed to ENTS
Data source: ENTS_summary to be processed to ENTS_summary
Initialising parsers.
Database Y2H/Y2H.db last updated 2014-06-25 12:15:04
Finished Initialisation.
Then all the features should be regenerated to ensure they are up to date:
assembler.regenerate(verbose=True)
Regenerating parsers:
parser 0
Custom generator function, no database to regenerate.
parser 1
Database Y2H/Y2H.db last updated 2014-06-25 12:15:04
parser 2
Custom generator function, no database to regenerate.
parser 3
Custom generator function, no database to regenerate.
Using the set of positive interactions found through the iRefIndex project (created in this notebook), together with the corresponding negative pairs, we can assemble the positive and negative feature vectors used to train the classifier:
assembler.assemble("iRefIndex/human.iRefIndex.positive.pairs.txt",
"features/human.iRefIndex.positive.vectors.txt",verbose=True)
Reading pairfile: iRefIndex/human.iRefIndex.positive.pairs.txt
Checking feature sizes:
Data source Gene_Ontology produces features of size 90.
Data source Y2H/Y2H.txt produces features of size 1.
Data source ENTS produces features of size 107.
Data source ENTS_summary produces features of size 1.
Writing feature vectors..................
Wrote 188833 vectors.
Matched 100.00 % of protein pairs in iRefIndex/human.iRefIndex.positive.pairs.txt to features from Gene_Ontology
Matched 100.00 % of protein pairs in iRefIndex/human.iRefIndex.positive.pairs.txt to features from Y2H/Y2H.txt
Matched 38.39 % of protein pairs in iRefIndex/human.iRefIndex.positive.pairs.txt to features from ENTS
Matched 100.00 % of protein pairs in iRefIndex/human.iRefIndex.positive.pairs.txt to features from ENTS_summary
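The feature sizes reported in the output can be pulled out with a regular expression to get the expected length of each assembled vector. This is only a sketch: it assumes the per-source features are concatenated into a single vector per protein pair, which the output suggests but does not state, and the `log` string is copied by hand from the output above.

```python
import re

# Lines copied from the "Checking feature sizes" part of the output.
log = """Data source Gene_Ontology produces features of size 90.
Data source Y2H/Y2H.txt produces features of size 1.
Data source ENTS produces features of size 107.
Data source ENTS_summary produces features of size 1."""

pattern = re.compile(r"Data source (\S+) produces features of size (\d+)\.")
sizes = dict((name, int(size)) for name, size in pattern.findall(log))

# Under the concatenation assumption, this is the length of one feature vector.
total = sum(sizes.values())
print("%d features per pair" % total)
```

Comparing this total with the number of columns in the written vector file would be one way to check the assumption.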
assembler.assemble("iRefIndex/human.iRefIndex.negative.pairs.txt",
"features/human.iRefIndex.negative.vectors.txt",verbose=True)
Reading pairfile: iRefIndex/human.iRefIndex.negative.pairs.txt
Checking feature sizes:
Data source Gene_Ontology produces features of size 90.
Data source Y2H/Y2H.txt produces features of size 1.
Data source ENTS produces features of size 107.
Data source ENTS_summary produces features of size 1.
Writing feature vectors...................................................................................................
Wrote 997760 vectors.
Matched 100.00 % of protein pairs in iRefIndex/human.iRefIndex.negative.pairs.txt to features from Gene_Ontology
Matched 100.00 % of protein pairs in iRefIndex/human.iRefIndex.negative.pairs.txt to features from Y2H/Y2H.txt
Matched 29.69 % of protein pairs in iRefIndex/human.iRefIndex.negative.pairs.txt to features from ENTS
Matched 100.00 % of protein pairs in iRefIndex/human.iRefIndex.negative.pairs.txt to features from ENTS_summary
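The two vector counts reported above give the class balance of the resulting training data, which is worth keeping in mind when the classifier is trained. A short calculation, using only the counts from the output (the variable names are illustrative):

```python
positives = 188833  # "Wrote 188833 vectors." for the positive pairs
negatives = 997760  # "Wrote 997760 vectors." for the negative pairs

total = positives + negatives
positive_fraction = positives / float(total)
negatives_per_positive = negatives / float(positives)

print("total vectors: %d" % total)
print("positive fraction: %.3f" % positive_fraction)
print("negatives per positive: %.2f" % negatives_per_positive)
```

Roughly one vector in six is a positive example, so the negative class dominates by a little over five to one.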
To apply our classifier to the interactions in the Active Zone network we will need feature vectors corresponding to those interactions. The interactions are listed in the following edge list file, from which the feature vectors are assembled:
assembler.assemble("forGAVIN/mergecode/OUT/edgelist.txt",
"features/human.activezone.txt",verbose=True)
Reading pairfile: forGAVIN/mergecode/OUT/edgelist.txt
Checking feature sizes:
Data source Gene_Ontology produces features of size 90.
Data source Y2H/Y2H.txt produces features of size 1.
Data source ENTS produces features of size 107.
Data source ENTS_summary produces features of size 1.
Writing feature vectors
Wrote 9375 vectors.
Matched 100.00 % of protein pairs in forGAVIN/mergecode/OUT/edgelist.txt to features from Gene_Ontology
Matched 100.00 % of protein pairs in forGAVIN/mergecode/OUT/edgelist.txt to features from Y2H/Y2H.txt
Matched 42.74 % of protein pairs in forGAVIN/mergecode/OUT/edgelist.txt to features from ENTS
Matched 100.00 % of protein pairs in forGAVIN/mergecode/OUT/edgelist.txt to features from ENTS_summary
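The "Matched" lines show how much of each pair file is covered by each data source; ENTS is the only source with partial coverage. The sketch below parses those lines and converts the percentages into approximate pair counts. It assumes that the number of vectors written equals the number of pairs in the pair file, and the counts are approximate because the percentages are rounded to two decimal places. The `log` string is copied by hand from the output above.

```python
import re

# Lines copied from the Active Zone output.
log = """Wrote 9375 vectors.
Matched 100.00 % of protein pairs in forGAVIN/mergecode/OUT/edgelist.txt to features from Gene_Ontology
Matched 100.00 % of protein pairs in forGAVIN/mergecode/OUT/edgelist.txt to features from Y2H/Y2H.txt
Matched 42.74 % of protein pairs in forGAVIN/mergecode/OUT/edgelist.txt to features from ENTS
Matched 100.00 % of protein pairs in forGAVIN/mergecode/OUT/edgelist.txt to features from ENTS_summary"""

wrote = int(re.search(r"Wrote (\d+) vectors", log).group(1))

matched = re.compile(r"Matched ([\d.]+) % of protein pairs in (\S+) to features from (\S+)")
coverage = dict((source, float(percent)) for percent, _pairfile, source in matched.findall(log))

# Approximate number of pairs with features from each source (assumes one vector per pair).
approx_matched = dict((source, int(round(wrote * percent / 100.0)))
                      for source, percent in coverage.items())

for source in sorted(coverage):
    print("%-14s %6.2f %%  ~%d pairs" % (source, coverage[source], approx_matched[source]))
```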