The iRefIndex is a collection of protein interaction databases that provides an index of canonical interaction pairs, with references to the databases providing evidence for each interaction. The purpose of this notebook is to extract a binary feature for each database integrated into iRefIndex. These databases are:

BIND
BioGRID
CORUM
DIP
HPRD
InnateDB
IntAct
MatrixDB
MINT
MPact
MPIDB
MPPI
OPHID

To extract these features we iterate over the table and use each pair of Entrez Gene IDs as a dictionary key, storing the list of databases that report that pair:

In [1]:

cd ../../iRefIndex/

/home/gavin/Documents/MRes/iRefIndex

In [4]:

import csv

In [13]:

import pdb

In [24]:

irefindexdict = {}
with open("9606.mitab.08122013.txt") as f:
    c = csv.reader(f, delimiter="\t")
    for l in c:
        # extract Gene IDs from the alternative-identifier columns of interactors A and B
        gids = []
        for x in [l[2], l[3]]:
            for s in x.split("|"):
                s = s.split(":")
                if s[0] == "entrezgene/locuslink":
                    gids.append(s[1])
        # only add entry to dictionary if there is a pair of Gene IDs
        if len(gids) == 2:
            # frozenset keys ignore order; l[12] is the source database column
            irefindexdict.setdefault(frozenset(gids), []).append(l[12])
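
To make the parsing step concrete, here is a minimal sketch of the same extraction applied to a single row. The row contents and the helper name `entrez_ids` are invented for illustration; only the column positions (2 and 3 for the identifiers, 12 for the source database) and the `entrezgene/locuslink` prefix are taken from the cell above.

```python
def entrez_ids(row):
    """Return the Entrez Gene IDs found in columns 2 and 3 of a MITAB row."""
    gids = []
    for field in (row[2], row[3]):
        for entry in field.split("|"):
            parts = entry.split(":")
            if parts[0] == "entrezgene/locuslink":
                gids.append(parts[1])
    return gids

# Hypothetical row: only columns 2, 3 and 12 are filled in, with made-up values.
row = [""] * 13
row[2] = "uniprotkb:P00001|entrezgene/locuslink:1111"
row[3] = "entrezgene/locuslink:2222|refseq:NP_000001"
row[12] = "MI:0463(biogrid)"

pairs = {}
gids = entrez_ids(row)
if len(gids) == 2:
    # frozenset makes the key independent of the order of the two proteins
    pairs.setdefault(frozenset(gids), []).append(row[12])
```

Because the key is a `frozenset`, the pairs A-B and B-A share one entry, and a self-interaction collapses to a single-element key, which is why the writing step later duplicates the ID when the key has only one member.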

Now we find the strings corresponding to unique databases:

In [26]:

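from itertools import chain
# flatten the per-pair lists of source databases and keep the unique strings
uniqdbs = list(set(chain.from_iterable(irefindexdict.values())))
print(uniqdbs)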

['MI:0465(dip)', 'MI:0469(intact)', 'MI:0463(biogrid)', 'MI:0468(hprd)', 'MI:0000(corum)', 'MI:0000(mppi)', 'MI:0462(bind)', 'MI:0917(matrixdb)', 'MI:0000(bind_translation)', 'MI:0000(ophid)', 'MI:0974(innatedb)']
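
Each string appears to combine a PSI-MI style identifier with the database name in parentheses. If shorter column labels are preferred, the name can be pulled out with a regular expression. This is only a sketch: the helper `short_name` is hypothetical and is not used elsewhere in this notebook.

```python
import re

# Two of the strings printed above.
labels = ['MI:0465(dip)', 'MI:0000(bind_translation)']

def short_name(label):
    """Return the text inside the parentheses, or the label unchanged if it does not match."""
    m = re.match(r"MI:\d+\((.+)\)$", label)
    return m.group(1) if m else label

names = [short_name(s) for s in labels]
```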

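Using these we can create a dictionary with the same keys as above, where each value is a binary vector with one entry per database (a 1-of-k-style coding, although a pair reported by several databases has several entries set to 1):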

In [27]:

ireffeaturedict = {}
for k in irefindexdict.keys():
    fvector = []
    for db in uniqdbs:
        if db in irefindexdict[k]:
            fvector.append("1")
        else:
            fvector.append("0")
    ireffeaturedict[k] = fvector
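
As a small worked example of this encoding, the sketch below applies the same logic to toy data. The dictionaries `toy_sources` and `toy_dbs` are invented stand-ins for `irefindexdict` and `uniqdbs`; converting each source list to a set first is an optional change, not what the cell above does.

```python
# Toy stand-ins for irefindexdict and uniqdbs (values invented for illustration).
toy_sources = {
    frozenset(["1111", "2222"]): ["MI:0463(biogrid)", "MI:0469(intact)", "MI:0463(biogrid)"],
    frozenset(["3333"]): ["MI:0465(dip)"],
}
toy_dbs = ["MI:0465(dip)", "MI:0469(intact)", "MI:0463(biogrid)"]

toy_features = {}
for pair, sources in toy_sources.items():
    seen = set(sources)  # repeated mentions of a database count once
    toy_features[pair] = ["1" if db in seen else "0" for db in toy_dbs]
```

The first pair is reported by two of the three databases, so its vector has two entries set to "1"; the order of the entries follows the order of `toy_dbs`.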

Saving the results

These results will be saved in two ways:

First, the results will be saved to a tab-separated file, using the unique database identifiers above as column labels.
Second, the dictionary will be wrapped in a class written specifically for iRefIndex and pickled, so that it can be loaded later when building feature vectors.

In [29]:

f = open("human.iRefIndex.Entrez.1ofk.txt", "w")
c = csv.writer(f,delimiter="\t")
c.writerow(["protein1","protein2"]+uniqdbs)
for k in ireffeaturedict.keys():
    pair = list(k)
    if len(pair) == 1:
        pair = pair*2
    c.writerow(pair + ireffeaturedict[k])
f.close()
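
For reference, a file in this layout can be read back into the same dictionary structure. The following is a minimal sketch; the function name `load_features` and the sample lines are invented, and the sample stands in for an open file handle.

```python
import csv

def load_features(lines):
    """Rebuild {frozenset(pair): feature vector} from tab-separated lines."""
    reader = csv.reader(lines, delimiter="\t")
    header = next(reader)
    dbs = header[2:]  # column labels after protein1 and protein2
    features = {}
    for row in reader:
        # a duplicated ID (self-interaction) collapses back to a one-element key
        features[frozenset(row[:2])] = row[2:]
    return dbs, features

# Invented sample in the same layout as the file written above.
sample = [
    "protein1\tprotein2\tMI:0465(dip)\tMI:0469(intact)\n",
    "1111\t2222\t0\t1\n",
    "3333\t3333\t1\t0\n",
]
dbs, loaded = load_features(sample)
```

With the real file, an open file object can be passed in place of `sample`, since `csv.reader` accepts any iterable of lines.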

In [30]:

!head human.iRefIndex.Entrez.1ofk.txt

In [31]:

import sys

In [32]:

sys.path.append("/home/gavin/Documents/MRes/opencast-bio/")

In [35]:

import ocbio.irefindex

In [37]:

features = ocbio.irefindex.features(ireffeaturedict)

In [38]:

import pickle

In [39]:

f = open("human.iRefIndex.Entrez.1ofk.pickle","wb")
pickle.dump(features,f)
f.close()