# Node Label Prediction \ Link Prediction¶

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import scipy as sp
import networkx as nx
%matplotlib inline


We will start by node label prediction. Download this network. It contains protein communications in Bakerâ€™s yeast. Each node (protein) has a special binary attribute ICSC (intracellular signaling cascade).

In [3]:
g = nx.read_gml('seminar_prediction/ppi.CC.gml')
cc = list(nx.connected_components(g))
g = nx.subgraph(g,cc[0])
g = nx.relabel.convert_node_labels_to_integers(g)

In [4]:
labels = np.array(nx.get_node_attributes(g, 'ICSC').values(), dtype=float)

In [5]:
nx.draw_spring(g, node_color = labels)


It might not be clear from the picture above but the level of homogeneity is quite high. For each node we are able to compute the average value of label

In [8]:
nnICSC = np.asarray(map(lambda(v): np.mean(labels[g.neighbors(v)]), g.nodes_iter()))
nnICSC
plt.figure(figsize=(10,5))
plt.hist(nnICSC[np.where(labels == 1)], bins=6,)

Out[8]:
(array([  5.,   1.,   3.,  15.,  12.,  34.]),
array([ 0.        ,  0.16666667,  0.33333333,  0.5       ,  0.66666667,
0.83333333,  1.        ]),
<a list of 6 Patch objects>)

## Iterative Classification Method¶

ICM is kind of NN-classifier. Our prediction is based on the largest ratio of nearest neighbours of labeled nodes.

• Randomly set unlabeled nodes.
• Implement classification rule of ICM (HINT look at the code cell above)
• Implement function for classification error and use it wisely
In [9]:
lablNN = labels[:]
idx = np.random.randint(0,len(lablNN), size=40)
lablNN[idx] = np.nan

In [12]:
idx = np.nonzero(np.isnan(lablNN))[0]
lablClassified = np.asarray(map(lambda(v): np.round(np.nanmean(lablNN[g.neighbors(v)])), idx))
lablNN[idx] = lablClassified
lablNN

Out[12]:
array([ 1.,  1.,  1.,  1.,  1.,  0.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
1.,  1.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  1.,
1.,  1.,  0.,  1.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  1.,  0.,
0.,  1.,  0.,  1.,  1.,  0.,  1.,  0.,  0.,  0.,  1.,  1.,  1.,
0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  1.,  1.,  0.,
0.,  0.,  0.,  1.,  1.,  1.,  1.,  1.,  1.,  0.,  1.,  0.,  0.,
1.,  0.,  1.,  1.,  0.,  0.,  0.,  1.,  1.,  0.,  0.,  0.,  0.,
1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  0.,  0.,  0.,
0.,  1.,  0.,  1.,  0.,  1.,  0.,  0.,  1.,  1.,  0.,  0.,  0.,
1.,  0.,  0.,  1.,  1.,  1.,  1.,  1.,  1.,  0.])
In [13]:
def classError(trueLabel, predLabels):
return np.sum(trueLabel != predLabels)

In [14]:
classError(labels, lablNN)

Out[14]:
0

## Label Propagation¶

Now instead of looking at neigbours we are switching random walks between all the nodes

Just to recall the Label Propagation method:

1. Compute $P = D^{-1}A$
2. Set $Y^{(0)} = (Y_l,0)$ ($Y_l$ - labeled data)
3. repeat
• $Y^{(t+1)} = PY^{(t)}$
• $Y_l^{(t+1)} = Y_l$
4. until $Y^{(t)}$ converges
In [15]:
# It is better to initialize like that
fixedLabels = labels[:]+1
curLabels = labels[:]+1

# And indicate labeled nodes instead of unlabeled
idxLabeled = np.zeros((g.number_of_nodes(),), dtype=bool)
idxLabeled[np.random.randint(0,len(labels), size=90)] = True
curLabels[~idxLabeled] = 0

In [16]:
def LabelPropagation(G, idxLabeled, curLabels, fixedLabels, iters = 1000):
P = np.diag(1.0/sum(A)).dot(A)

iterNorm = 1.0
resultLabels = P.dot(curLabels)
deltaNorm = 1.0
iterNum = 1

while (deltaNorm > 1e-3 and iterNum < iters and ~np.all(resultLabels >= 1)):
iterNorm = np.linalg.norm(resultLabels - curLabels)
runLabels, curLabels = resultLabels[:],resultLabels[:]
runLabels[idxLabeled] = fixedLabels[idxLabeled]
resultLabels = P.dot(runLabels)
deltaNorm = np.abs(iterNorm - np.linalg.norm(resultLabels - curLabels))
print deltaNorm
iterNum+=1
print 'Converged in {0} iteration'.format(iterNum)
return np.round(resultLabels)

In [18]:
result = LabelPropagation(g, idxLabeled, curLabels, fixedLabels, iters = 1000)
result

6.6004377163
1.05281315836
0.279725703227
0.129492471182
0.051648016945
0.0200835427088
0.0128969227926
0.00434079880178
0.00382025524803
0.0013564441591
0.00124407968333
0.000582641210313
Converged in 13 iteration

Out[18]:
array([ 1.,  1.,  2.,  2.,  2.,  2.,  2.,  2.,  2.,  2.,  2.,  2.,  2.,
2.,  2.,  2.,  1.,  1.,  1.,  1.,  2.,  1.,  1.,  1.,  2.,  2.,
2.,  2.,  1.,  2.,  1.,  1.,  1.,  2.,  1.,  1.,  1.,  2.,  2.,
1.,  2.,  1.,  2.,  2.,  1.,  1.,  1.,  1.,  1.,  2.,  2.,  2.,
1.,  1.,  1.,  1.,  1.,  1.,  2.,  1.,  1.,  1.,  2.,  2.,  2.,
1.,  1.,  2.,  1.,  1.,  1.,  2.,  2.,  2.,  1.,  2.,  1.,  1.,
1.,  1.,  2.,  2.,  1.,  1.,  1.,  2.,  2.,  1.,  1.,  1.,  1.,
1.,  2.,  2.,  2.,  2.,  2.,  2.,  2.,  2.,  2.,  1.,  1.,  1.,
1.,  2.,  2.,  2.,  1.,  1.,  1.,  1.,  2.,  2.,  1.,  1.,  1.,
2.,  1.,  2.,  2.,  2.,  2.,  2.,  2.,  1.,  1.])
In [19]:
classError(labels+1, result)

Out[19]:
18

In this section we will implement some scoring functions for link prediction and compare the values for adjacent and non-adjacent nodes.

Load french blog network and compute the following scores:

In [20]:
g = nx.read_gml('./seminar_prediction/fblog.gml')
vNum = g.number_of_nodes()

In [21]:
def matrixPlot(A):
plt.figure(1, figsize=(6, 6))
plt.imshow(A,
interpolation="none"
)


#### Shortest Path Length¶

In [22]:
# Your code here
spScore = nx.shortest_paths.floyd_warshall_numpy(g)
matrixPlot(spScore)


#### Number of common neighbours¶

In [27]:
# Your code here
nnScore = np.zeros((vNum,vNum))
for i in xrange(vNum):
for j in xrange(i+1, vNum):
nnScore[i,j] = len(np.intersect1d(nx.neighbors(g,i),nx.neighbors(g,j)))
nnScore[0,]

Out[27]:
array([ 0.,  1.,  2.,  1.,  2.,  2.,  2.,  0.,  0.,  0.,  0.,  0.,  0.,
0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,
0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
0.,  1.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,
0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,
1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
1.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,
0.,  0.,  0.,  0.,  2.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])

#### Jaccard Score¶

In [29]:
# Your code here
jaScore = np.zeros((vNum,vNum))
for i in xrange(vNum):
for j in xrange(i+1, vNum):
jaScore[i,j] = 1.0*len(np.intersect1d(nx.neighbors(g,i),nx.neighbors(g,j))) / len(np.union1d(nx.neighbors(g,i),nx.neighbors(g,j)))
jaScore[0,]

Out[29]:
array([ 0.        ,  0.05882353,  0.5       ,  0.09090909,  0.14285714,
0.33333333,  0.4       ,  0.        ,  0.        ,  0.        ,
0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
0.        ,  0.        ,  0.        ,  0.02      ,  0.        ,
0.        ,  0.        ,  0.        ,  0.03448276,  0.        ,
0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
0.        ,  0.        ,  0.04347826,  0.        ,  0.        ,
0.01818182,  0.        ,  0.        ,  0.        ,  0.        ,
0.03703704,  0.        ,  0.        ,  0.        ,  0.        ,
0.        ,  0.        ,  0.        ,  0.01960784,  0.        ,
0.        ,  0.        ,  0.04545455,  0.        ,  0.        ,
0.04166667,  0.        ,  0.09090909,  0.        ,  0.        ,
0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
0.05      ,  0.        ,  0.        ,  0.03846154,  0.        ,
0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
0.        ,  0.        ,  0.02272727,  0.        ,  0.        ,
0.        ,  0.        ,  0.        ,  0.03921569,  0.        ,
0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
0.        ,  0.        ])

$Score(a,b) = \sum\limits_{v \in \text{NN}(a) \cap \text{NN}(b)} \frac{1}{\log(|\text{NN}(v)|)}$

In [31]:
# Your code here
aaScore = np.zeros((vNum,vNum))
for i in xrange(vNum):
for j in xrange(i+1, vNum):
for v in nx.common_neighbors(g, i, j):
aaScore[i,j] += 1/np.log(len(nx.neighbors(g,v)))
aaScore[0,]

Out[31]:
array([ 0.        ,  0.43429448,  0.79496824,  0.36067376,  0.79496824,
0.79496824,  0.79496824,  0.        ,  0.        ,  0.        ,
0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
0.        ,  0.        ,  0.        ,  0.43429448,  0.        ,
0.        ,  0.        ,  0.        ,  0.36067376,  0.        ,
0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
0.        ,  0.        ,  0.36067376,  0.        ,  0.        ,
0.36067376,  0.        ,  0.        ,  0.        ,  0.        ,
0.36067376,  0.        ,  0.        ,  0.        ,  0.        ,
0.        ,  0.        ,  0.        ,  0.36067376,  0.        ,
0.        ,  0.        ,  0.36067376,  0.        ,  0.        ,
0.43429448,  0.        ,  0.36067376,  0.        ,  0.        ,
0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
0.36067376,  0.        ,  0.        ,  0.36067376,  0.        ,
0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
0.        ,  0.        ,  0.43429448,  0.        ,  0.        ,
0.        ,  0.        ,  0.        ,  0.79496824,  0.        ,
0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
0.        ,  0.        ])

#### Preferential Attachment score¶

$Score(a,b) = |\text{NN}(a)| \times |\text{NN}(b)|$

In [32]:
# Your code here
paScore = np.zeros((vNum,vNum))
for i in xrange(vNum):
for j in xrange(i+1,vNum):
paScore[i,j] = len(nx.neighbors(g,i))*len(nx.neighbors(g,j))


#### Katz Score¶

$Score(a,b) = \sum\limits_{\text{Paths}_{a,b}} \beta^{|p_{a,b}|}\times|p_{a,b}|$

In [ ]:
# Your code here
katzScore = np.zeros((vNum, vNum))
beta = 0.6
for i in xrange(vNum):
for j in xrange(i+1, vNum):
l = map(lambda x: beta**len(x)*len(x), tuple(nx.all_simple_paths(g, i, j, cutoff=4)))
katzScore[i,j] = np.sum(np.power(beta, l)*l)


Let's compare the scores behavious for pairs of nodes with or without edge in-between

In [42]:
A = np.asarray(nx.adj_matrix(g).todense())
xyTriu = np.vstack(np.triu_indices_from(A, k=1)).T
wEdge = [nnScore[xy[0],xy[1]] for xy in xyTriu if A[xy[0],xy[1]]]
woEdge = [nnScore[xy[0],xy[1]] for xy in xyTriu if ~A[xy[0],xy[1]]]
data = [wEdge, woEdge]

In [43]:
fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(10,5))
axes.violinplot(data, showmeans=True)
axes.set_xticklabels(['', 'With Edges', '', 'W/o Edges'])

Out[43]:
[<matplotlib.text.Text at 0x7f4379bdc610>,
<matplotlib.text.Text at 0x7f4379bed1d0>,
<matplotlib.text.Text at 0x7f4379b97350>,
<matplotlib.text.Text at 0x7f4379b97a90>]