iRefIndex is a collection of protein interaction databases that provides an index of canonical interaction pairs, together with references to the databases providing evidence for each interaction. The purpose of this notebook is to extract a binary feature for each database integrated into iRefIndex. These databases are:
To extract these features we iterate over the table, using each pair of Entrez Gene IDs as a dictionary key that maps to the list of databases reporting that interaction:
cd ../../iRefIndex/
/home/gavin/Documents/MRes/iRefIndex
import csv
f = open("9606.mitab.08122013.txt")
c = csv.reader(f,delimiter="\t")
irefindexdict = {}
for l in c:
    #extract Gene IDs from the two interactor identifier columns
    gids = []
    for x in [l[2],l[3]]:
        for s in x.split("|"):
            s = s.split(":")
            if s[0]=="entrezgene/locuslink":
                gids.append(s[1])
    #only add entry to dictionary if there is a pair of Gene IDs
    if len(gids) == 2:
        #l[12] is the source database for this row
        try:
            irefindexdict[frozenset(gids)] += [l[12]]
        except KeyError:
            irefindexdict[frozenset(gids)] = [l[12]]
f.close()
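Using a frozenset as the dictionary key has two consequences that are easier to see on toy data: the order of the two proteins does not matter, and a self-interaction collapses to a single-element key. The following is a minimal sketch only. The Gene IDs and rows are made up, and extract_gene_ids is a hypothetical helper that mirrors the inner loop above rather than anything defined in this notebook.

```python
def extract_gene_ids(field):
    """Return the Entrez Gene IDs found in one '|'-separated identifier field."""
    ids = []
    for entry in field.split("|"):
        parts = entry.split(":")
        if parts[0] == "entrezgene/locuslink":
            ids.append(parts[1])
    return ids

# Toy rows with made-up identifiers: (interactor A, interactor B, source database)
rows = [
    ("uniprotkb:P00001|entrezgene/locuslink:1", "entrezgene/locuslink:2", "MI:0469(intact)"),
    ("entrezgene/locuslink:2", "entrezgene/locuslink:1", "MI:0463(biogrid)"),
    ("entrezgene/locuslink:3", "entrezgene/locuslink:3", "MI:0465(dip)"),
]

toy = {}
for alt_a, alt_b, source in rows:
    gids = extract_gene_ids(alt_a) + extract_gene_ids(alt_b)
    # Same rule as above: keep the row only if exactly two Gene IDs were found
    if len(gids) == 2:
        toy.setdefault(frozenset(gids), []).append(source)

# The first two rows share one key; the self-interaction has a one-element key
print(len(toy))
```

The first two rows end up under the same key even though the proteins appear in opposite order, and the third row is stored under a key containing a single Gene ID. That single-element case is why the pair has to be duplicated when the table is written out later.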
Now we find the strings corresponding to unique databases:
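import itertools
#flatten the per-pair lists of source databases and keep the unique names
uniqdbs = list(set(itertools.chain.from_iterable(irefindexdict.values())))
print(uniqdbs)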
['MI:0465(dip)', 'MI:0469(intact)', 'MI:0463(biogrid)', 'MI:0468(hprd)', 'MI:0000(corum)', 'MI:0000(mppi)', 'MI:0462(bind)', 'MI:0917(matrixdb)', 'MI:0000(bind_translation)', 'MI:0000(ophid)', 'MI:0974(innatedb)']
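Using these we can create a dictionary with the same keys as above, where each value is a binary vector with one entry per database: "1" if that database reports the interaction and "0" otherwise. The file names below call this a 1-of-k coding, although a pair reported by several databases has several entries set to "1":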
ireffeaturedict = {}
for k in irefindexdict.keys():
    fvector = []
    for db in uniqdbs:
        if db in irefindexdict[k]:
            fvector.append("1")
        else:
            fvector.append("0")
    ireffeaturedict[k] = fvector
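To make the coding concrete, here is a small sketch of the same idea on toy data. The database list and the two entries are made up for illustration, and the membership test uses a set per pair, which is a minor variation on the loop above rather than what the notebook does.

```python
# Made-up list of databases, in a fixed column order
toy_dbs = ["MI:0463(biogrid)", "MI:0465(dip)", "MI:0469(intact)"]

# Made-up mapping from protein pair to the databases reporting it
toy_sources = {
    frozenset(["1", "2"]): ["MI:0469(intact)", "MI:0463(biogrid)"],
    frozenset(["3"]): ["MI:0465(dip)"],
}

toy_features = {}
for pair, sources in toy_sources.items():
    seen = set(sources)  # set lookup instead of scanning the list each time
    toy_features[pair] = ["1" if db in seen else "0" for db in toy_dbs]

print(toy_features[frozenset(["1", "2"])])
```

The pair reported by two databases gets two "1" entries, so each vector is best read as a set of independent binary flags, one per database.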
These results are saved in two ways: as a tab-separated text file, and as a pickled feature object built with ocbio.irefindex.features:
f = open("human.iRefIndex.Entrez.1ofk.txt", "w")
c = csv.writer(f,delimiter="\t")
c.writerow(["protein1","protein2"]+uniqdbs)
for k in ireffeaturedict.keys():
    pair = list(k)
    #a self-interaction has a one-element key, so repeat the Gene ID
    if len(pair) == 1:
        pair = pair*2
    c.writerow(pair + ireffeaturedict[k])
f.close()
!head human.iRefIndex.Entrez.1ofk.txt
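For reference, a table in this layout can be read back into a dictionary keyed by protein pair. The sketch below is an assumption about how one might do that, not part of the original workflow. It parses a few made-up lines standing in for the file contents; for the real file, an open file object would be passed to csv.reader in place of the list.

```python
import csv

# Made-up lines standing in for the contents of the tab-separated file
lines = [
    "protein1\tprotein2\tMI:0465(dip)\tMI:0469(intact)",
    "1\t2\t0\t1",
    "3\t3\t1\t0",
]

reader = csv.reader(lines, delimiter="\t")
header = next(reader)
dbs = header[2:]  # database names follow the two protein columns

loaded = {}
for row in reader:
    # frozenset collapses a repeated Gene ID back to a one-element key
    key = frozenset(row[:2])
    loaded[key] = [int(v) for v in row[2:]]

print(dbs)
```

Rebuilding the key as a frozenset undoes the duplication applied to self-interactions when the file was written.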
import sys
sys.path.append("/home/gavin/Documents/MRes/opencast-bio/")
import ocbio.irefindex
features = ocbio.irefindex.features(ireffeaturedict)
import pickle
f = open("human.iRefIndex.Entrez.1ofk.pickle","wb")
pickle.dump(features,f)
f.close()
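Reading the pickle back follows the usual load pattern. The sketch below round-trips a plain, made-up dictionary in memory rather than the ocbio.irefindex.features object, whose interface is not shown in this notebook. Unpickling the real file would additionally require the ocbio package to be importable, since pickle stores a reference to the class rather than its code.

```python
import pickle

# Made-up feature dictionary keyed by frozensets of Gene IDs
toy_features = {
    frozenset(["1", "2"]): ["0", "1"],
    frozenset(["3"]): ["1", "0"],
}

# Serialise to bytes and restore; with a file, pickle.dump and pickle.load
# on a handle opened in binary mode play the same roles
blob = pickle.dumps(toy_features)
restored = pickle.loads(blob)

print(restored == toy_features)
```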