Trying out community detection. At this point can only do this on the STRING based weighted edges generated in this notebook.

In [1]:

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
# Some nice default configuration for plots
plt.rcParams['figure.figsize'] = 10, 7.5
plt.rcParams['axes.grid'] = True
plt.gray()

<matplotlib.figure.Figure at 0x30f9ad0>

In [2]:

import pandas as pd

In [3]:

cd ../../forGAVIN/mergecode/OUT/

/data/opencast/MRes/forGAVIN/mergecode/OUT

In [4]:

!git annex unlock ../../../HBP/testdata/edgelist_update_weighted.txt

unlock ../../../HBP/testdata/edgelist_update_weighted.txt (copying...) ok

In [5]:

cp edgelist_update_weighted.txt ../../../HBP/testdata/

In [6]:

cd ../../../HBP/

/data/opencast/MRes/HBP

Community detection code from Colin written in C++:

In [292]:

!make

Building with option-useomp
g++     -DUSEOMP -O3 -fopenmp -fomit-frame-pointer -funroll-loops -fforce-addr -fexpensive-optimizations -Wno-deprecated -g3 -o run communityDetection.C

This is how to run it:

In [7]:

!./run

./run requires 2 arguments: 
-file         : the network file to run 
-quite        : Turn on/off comments to command line
              : 0 = switch on comments to command line
              : 1 = switch off comments to command line
-algorithm    : the type of algorithm to run
              : 1 = Geodesic edge Betweenness
              : 2 = Random edge Betweenness
              : 3 = Spectral Betweenness
-weighted     : specify if network file is *weighted or not: 
              : 0  = Using a non-weighted network file 
              : 1  = Using a     weighted network file 
-enrichment   : run the enrichment on the clustered network
              : 0 = run enrichment after clustering - Default
              : 1 = run enrichment without clustering
              : 2 = do not run enrichment
-----
*             : The structure of the network file is:
-weighted 1   : A (interactor) 	 B (interactor) 	 W (weight) 
-weighted 0   : A (interactor) 	 B (interactor) 
              : Where A and B are integers (Entrez IDs) and W is a double.
EXAMPLE 1
-----
./run -file testData/edgelist.txt -weighted 0 -algorithm 1
-----
EXAMPLE 2
-----
./run -file testData/karate.txt -weighted 1 -algorithm 1
-----

Output is written to the OUT file, so we have to unlock them first.

In [7]:

!git annex unlock OUT/*

unlock OUT/communities_cytoscape.txt (copying...) ok
unlock OUT/community-metric_nmi.txt (copying...) ok
unlock OUT/communityout.txt (copying...) ok
unlock OUT/consensusout.txt (copying...) ok
unlock OUT/enrichment_SCH_enriched.csv (copying...) ok
unlock OUT/enrichment_disease.csv (copying...) ok
unlock OUT/network.png (copying...) ok
unlock OUT/overlap_dis_SCHenriched.csv (copying...) ok
unlock OUT/p_significance_tests_SCH_enriched.csv (copying...) ok
unlock OUT/p_significance_tests_disease.csv (copying...) ok
unlock OUT/p_significance_tests_summary_SCH_enriched.csv (copying...) ok
unlock OUT/p_significance_tests_summary_disease.csv (copying...) ok
unlock OUT/p_value_network_SCH_enriched.csv (copying...) ok
unlock OUT/p_value_network_disease.csv (copying...) ok
unlock OUT/p_values_SCH_enriched.csv (copying...) ok
unlock OUT/p_values_disease.csv (copying...) ok
unlock OUT/permute_p_values_SCH_enriched.csv (copying...) ok
unlock OUT/permute_p_values_disease.csv (copying...) ok
unlock OUT/removededges.txt (copying...) ok

Unweighted case¶

Running the code on the unweighted case to begin with:

In [8]:

!./run -file testdata/edgelist_update_weighted.txt -weighted 0 -algorithm 3 -quite 1

> ---- START ---- 
> Using Spectral Betweenness.
> Using a non-weighted network 
> Scanning network_file: testdata/edgelist_update_weighted.txt
> Header: None 
> Running Spectral Betweenness.
--- DONE ---
> cputime: 275.53 seconds 
> Network (nodes): 1572 (edges): 9373
> ---- END ----

This file describes the communities produced:

In [9]:

!head -n 10 OUT/communityout.txt

communityout
Max Q: 0.36175
cputime: 275.53 seconds 
Network (nodes): 1572 (edges): 9373
we 0 dist 0 community 1 Degree 1 visited 0 key 1315 (55268)
community: 1 size: 1
we 0 dist 0 community 2 Degree 4 visited 0 key 3 (18)
we 0 dist 0 community 2 Degree 5 visited 0 key 50 (372)
we 0 dist 0 community 2 Degree 5 visited 0 key 62 (481)
we 0 dist 0 community 2 Degree 2 visited 0 key 63 (482)

Weighted case¶

This is the case using the weights generated using the various data sources available:

In [10]:

!git annex unlock unweightedOUT/*

unlock unweightedOUT/communities_cytoscape.txt (copying...) ok
unlock unweightedOUT/community-metric_nmi.txt (copying...) ok
unlock unweightedOUT/communityout.txt (copying...) ok
unlock unweightedOUT/consensusout.txt (copying...) ok
unlock unweightedOUT/enrichment_SCH_enriched.csv (copying...) ok
unlock unweightedOUT/enrichment_disease.csv (copying...) ok
unlock unweightedOUT/network.png (copying...) ok
unlock unweightedOUT/overlap_dis_SCHenriched.csv (copying...) ok
unlock unweightedOUT/p_significance_tests_SCH_enriched.csv (copying...) ok
unlock unweightedOUT/p_significance_tests_disease.csv (copying...) ok
unlock unweightedOUT/p_significance_tests_summary_SCH_enriched.csv (copying...) ok
unlock unweightedOUT/p_significance_tests_summary_disease.csv (copying...) ok
unlock unweightedOUT/p_value_network_SCH_enriched.csv (copying...) ok
unlock unweightedOUT/p_value_network_disease.csv (copying...) ok
unlock unweightedOUT/p_values_SCH_enriched.csv (copying...) ok
unlock unweightedOUT/p_values_disease.csv (copying...) ok
unlock unweightedOUT/permute_p_values_SCH_enriched.csv (copying...) ok
unlock unweightedOUT/permute_p_values_disease.csv (copying...) ok
unlock unweightedOUT/removededges.txt (copying...) ok

In [11]:

cp OUT/* unweightedOUT/

In [12]:

!./run -file testdata/edgelist_update_weighted.txt -weighted 1 -algorithm 3 -quite 1

> ---- START ---- 
> Using Spectral Betweenness.
> Using a weighted network 
> Scanning network_file: testdata/edgelist_update_weighted.txt
> Header: None 
> Running Spectral Betweenness.
--- DONE ---
> cputime: 279.53 seconds 
> Network (nodes): 1572 (edges): 9373
> ---- END ----

Normalised Mutual Information test¶

We can provide a measure of the similarity of the communities detected using a Normalised Mutual Information test. As wikipedia puts it:

mutual information measures the information that X and Y share: it measures how much knowing one of these variables reduces uncertainty about the other.

Scikit-learn provides this test, all that's required is to load the data into Python.

In [13]:

import csv

In [14]:

def parsecommunityout(fname):
    f = open(fname)
    c = csv.reader(f,delimiter=" ")
    communitydict = {}
    for l in c:
        if l[0] == "we":
            communitydict[l[-1].strip("()")] = int(l[5])
    sortedids = sorted(communitydict.keys(),key=int)
    communities = map(communitydict.__getitem__,sortedids)
    return sortedids,communities

In [15]:

wids,wcom = parsecommunityout("OUT/communityout.txt")

In [16]:

uwids,uwcom = parsecommunityout("unweightedOUT/communityout.txt")

In [17]:

import sklearn.metrics

In [18]:

sklearn.metrics.normalized_mutual_info_score(wcom,uwcom)

Out[18]:

0.60407676187875536

But, this result might not be correct, as the class labels sorting the nodes into communities are arbitrary.

Again, the below code is not mine:

In [19]:

from __future__ import division
import collections, re, math, sys, csv

In [20]:

#--- helper function
def convertStr(s):
    ret = int(s)
    return ret

In [21]:

rows, new, dic = [], [], []
consensus = "OUT/communities_cytoscape.txt"
community = "unweightedOUT/communities_cytoscape.txt"

#--- readin first community file
with open(consensus, 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter='=')
    headerline = reader.next();
    for row in reader:
        new.append(row)

#--- readin second community file
with open(community, 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter='=')
    headerline = reader.next();
    for row in reader:
        dic.append(row)

In [22]:

!git annex unlock OUT/community-metric_nmi.txt

In [23]:

for i in dic:
    for j in new:
        if i[0] == j[0]:
            i.append(j[1])
dic.sort(key=lambda x: int(x[2]))

#--- count of nodes in each community in reality and in algorithm
realc, algc = [], []
for i in dic:
    realc.append(convertStr(i[1]))
    algc.append(convertStr(i[2]))
comcount = dict([(i,realc.count(i)) for i in set(realc)])
pcount = dict([(i,algc.count(i)) for i in set(algc)])

#--- holds the algorithm community label with the corresponding community labels
#    provided e.g., {1:{2:2,3:3}}
algcount, tempd = {}, {}
templist = []
comtrack = 0
for key, val in pcount.iteritems():
    templist = realc[comtrack:comtrack+val]
    comtrack += val
    for i in set(templist):
        tempd[i] = templist.count(i)
    algcount[key] = tempd
    tempd = {}

#--- finding the node with the maximum label in that community
#    key = int, value = dictionary
label = {}
for key, value in algcount.iteritems():
    label[key] = max(value, key = value.get)
for i in dic:
    for j in label:
        if (int)(i[2]) == j:
            i.append(label[j])

#--- METRICS    
#--- Calculate the NMI
nt, np, ntp = {}, {}, {}
NMI = 0
nodes = len(dic)
for i in range(1, max(comcount)+1):
    for j in range(1, max(pcount)+1):
        nt[i] = 0
        np[j] = 0
        ntp[(str(i)+' '+str(j))] = 0
nt = comcount
np = pcount
for i in dic:
    ntp[(str(i[1]).strip()+' '+str(i[2]).strip())] += 1
num, den = 0, 0
for i, t in nt.iteritems():
    for j, p in np.iteritems():
        temp, temp2 = 0, 0
        if (ntp[(str(i).strip()+' '+str(j).strip())] > 0) and (t > 0) and (p > 0):
            temp = ((ntp[(str(i).strip()+' '+str(j).strip())]*nodes)/(t*p))
            temp2 = ntp[(str(i).strip()+' '+str(j).strip())]* math.log(temp)
        num += temp2
d1, d2 = 0, 0
for t in nt.values():
        d1 += t * math.log(t/nodes)
for p in np.values():
        d2 += p * math.log(p/nodes)
den = d1 + d2
NMI = -(2*num)/den


#--- print results to screen
print 'NMI = %s' %(NMI)
f = open ( 'OUT/community-metric_nmi.txt', 'w' )
f.writelines("node\treal\talg\tnewlabel\n")
f.writelines("%s\t%s\t%s\t%s\n" % (i[0],i[1],i[2],i[3]) for i in dic)
f.writelines('NMI = %s' %(NMI))
f.close()

NMI = 0.604022794092

So the result is almost exactly the same, but the above code appears to take into account the problem with arbitrary community labels. Good to see both methods getting similar results, in any case.

Disease Enrichment¶

Having a look at the disease enrichment now:

In [24]:

dew =  pd.read_csv("OUT/permute_p_values_disease.csv", delimiter="\t")
deuw = pd.read_csv("unweightedOUT/permute_p_values_disease.csv", delimiter="\t")

Check papers to read about the hypergeometric disease enrichment test:

In [25]:

dew.head()

Out[25]:

	Community	Size	disease_of_metabolism	p-value	{p-value}	Sig. Level	p_lower (%)	p_upper (%)	disease_by_infectious_agent	p-value.1	...	Sig. Level.135	p_lower (%).135	p_upper (%).135	muscular_dystrophy	p-value.136	{p-value}.136	Sig. Level.136	p_lower (%).136	p_upper (%).136	Unnamed: 824
0	1	50	NaN	0.125016	0.541943	0.010459	12.5	93.0	NaN	0.837946	...	0.015763	45.7	89.0	NaN	0.598702	0.659976	0.015500	59.9	76.4	NaN
1	2	48	NaN	0.761838	0.562907	0.013470	74.6	39.4	NaN	0.989202	...	0.000000	100.0	55.6	NaN	0.051262	0.671740	0.006974	5.2	99.6	NaN
2	3	26	NaN	0.574356	0.576412	0.015636	58.3	64.3	NaN	0.406454	...	0.014087	28.9	95.9	NaN	1.000000	0.731936	0.000000	100.0	61.4	NaN
3	4	48	NaN	0.877369	0.564046	0.010373	88.7	23.8	NaN	0.033306	...	0.015723	43.4	88.9	NaN	0.009142	0.668188	0.003010	1.1	99.7	NaN
4	5	26	NaN	0.574356	0.610789	0.015636	52.1	70.0	NaN	0.812738	...	0.000000	100.0	73.9	NaN	0.375665	0.743471	0.015315	37.1	91.9	NaN

5 rows × 825 columns

In [26]:

deuw.head()

Out[26]:

	Community	Size	disease_of_metabolism	p-value	{p-value}	Sig. Level	p_lower (%)	p_upper (%)	disease_by_infectious_agent	p-value.1	...	Sig. Level.135	p_lower (%).135	p_upper (%).135	muscular_dystrophy	p-value.136	{p-value}.136	Sig. Level.136	p_lower (%).136	p_upper (%).136	Unnamed: 824
0	1	1	NaN	1.000000	0.880437	0.000000	100.0	85.9	NaN	1.000000	...	0.000000	100.0	98.9	NaN	1.000000	0.983303	0.000000	100	98.3	NaN
1	2	99	NaN	0.765480	0.529656	0.013399	76.6	32.3	NaN	0.970181	...	0.005178	2.3	99.6	NaN	0.535656	0.606840	0.015771	55	74.1	NaN
2	3	37	NaN	0.684769	0.560950	0.014692	69.7	48.4	NaN	0.009547	...	0.015230	37.4	93.7	NaN	1.000000	0.686668	0.000000	100	49.4	NaN
3	4	17	NaN	0.490462	0.594665	0.015808	48.8	73.4	NaN	0.783821	...	0.012346	19.0	98.3	NaN	1.000000	0.784876	0.000000	100	71.6	NaN
4	5	17	NaN	0.940347	0.603658	0.007490	94.1	24.9	NaN	0.949553	...	0.000000	100.0	79.9	NaN	1.000000	0.797138	0.000000	100	73.2	NaN

5 rows × 825 columns

How to interpret these results? I suppose we would like to split it by the different diseases and then detect those with low p-values, probably by sorting them. All of the disease columns are regularly spaced, so should be easy to split into different dataframes:

In [27]:

def splitbydisease(diseasedataframe):
    diseasedict = {}
    clabels = diseasedataframe.iloc[:,[0,1,3,4,5,6,7]].columns
    for offset in range(2,diseasedataframe.columns.shape[0]-1,6):
        diseaseindexes = [0,1] + range(offset+1,offset+6)
        diseasedict[diseasedataframe.columns[offset]]=diseasedataframe.iloc[:,diseaseindexes]
        #fix the column names
        diseasedict[diseasedataframe.columns[offset]].columns = clabels
    return diseasedict

In [28]:

deuwdict = splitbydisease(deuw)

In [29]:

dewdict = splitbydisease(dew)

Now we can search for diseases in each of these communities with p-values less than 5%.

In [30]:

def significantdiseases(diseasedict):
    sigdiseasedict = {}
    for k in diseasedict:
        pvalues = diseasedict[k].iloc[:,2]
        pvalues = pvalues[pvalues<0.05]
        sigdiseasedict[k] = diseasedict[k].iloc[pvalues.index]
    return sigdiseasedict

In [31]:

sigdeuwdict = significantdiseases(deuwdict)

In [32]:

sigdewdict = significantdiseases(dewdict)

Now we want to iterate through the diseases and sort the lowest p-values, while annotating which disease and community is involved. Should probably build a new dataframe and put this information in it.

To keep the analysis simple, we only want to consider schizophrenia and alzheimers.

In [33]:

interestdiseases = ["Alzheimer's_disease",'schizophrenia']

In [34]:

for k in sigdeuwdict:
    sigdeuwdict[k].insert(2,'disease',k)
pieces = map(lambda k: sigdeuwdict[k],interestdiseases)

In [35]:

sigunweighted = pd.concat(pieces)

In [36]:

for k in sigdewdict:
    sigdewdict[k].insert(2,'disease',k)
pieces = map(lambda k: sigdewdict[k],interestdiseases)

In [37]:

sigweighted = pd.concat(pieces)

In [38]:

sigunweighted.sort(columns='p-value')

Out[38]:

	Community	Size	disease	p-value	{p-value}	Sig. Level	p_lower (%)	p_upper (%)
28	29	88	schizophrenia	0.001691	0.549829	0.001299	0.2	99.9
1	2	99	schizophrenia	0.004321	0.547038	0.002074	0.5	99.9
29	30	26	Alzheimer's_disease	0.005374	0.597882	0.002312	0.7	99.8
60	61	24	schizophrenia	0.019591	0.562575	0.004383	1.5	99.8
62	63	23	Alzheimer's_disease	0.035030	0.573961	0.005814	4.3	98.2
60	61	24	Alzheimer's_disease	0.042531	0.593381	0.006381	4.5	98.2
62	63	23	schizophrenia	0.046159	0.589334	0.006635	4.3	98.5

In [39]:

sigweighted.sort(columns='p-value')

Out[39]:

	Community	Size	disease	p-value	{p-value}	Sig. Level	p_lower (%)	p_upper (%)
43	44	35	schizophrenia	0.003446	0.564032	0.001853	0.6	99.7
44	45	73	schizophrenia	0.003900	0.528577	0.001971	0.4	99.9
47	48	17	Alzheimer's_disease	0.007570	0.619867	0.002741	0.2	100.0
32	33	64	Alzheimer's_disease	0.008390	0.563842	0.002884	0.7	99.5
32	33	64	schizophrenia	0.023347	0.541118	0.004775	2.2	98.8
13	14	18	schizophrenia	0.041873	0.590369	0.006334	3.2	99.5
1	2	48	schizophrenia	0.046719	0.563059	0.006674	4.9	98.0
14	15	8	Alzheimer's_disease	0.049793	0.666429	0.006879	5.2	99.2

Now all that's required is to retreive the IDs of the proteins in each of these communities to compare the different disease associated communities. Wait, Cytoscape already does that for me. I can just use that.

Writing these results to a file so I can discuss them later:

In [40]:

!git annex unlock unweighted_significant_disease_communities.tsv

unlock unweighted_significant_disease_communities.tsv (copying...) ok

In [41]:

sigunweighted.sort(columns='p-value').to_csv("unweighted_significant_disease_communities.tsv",
                                             sep="\t",index=False)

In [42]:

!git annex unlock weighted_significant_disease_communities.tsv

unlock weighted_significant_disease_communities.tsv (copying...) ok

In [43]:

sigweighted.sort(columns='p-value').to_csv("weighted_significant_disease_communities.tsv",
                                           sep="\t",index=False)

In [44]:

!head unweighted_significant_disease_communities.tsv

Community	Size	disease	p-value	{p-value}	Sig. Level	p_lower (%)	p_upper (%)
29	88	schizophrenia	0.00169105	0.549829	0.0012993	0.2	99.9
2	99	schizophrenia	0.00432129	0.547038	0.00207427	0.5	99.9
30	26	Alzheimer's_disease	0.00537398	0.597882	0.00231195	0.7	99.8
61	24	schizophrenia	0.019590700000000003	0.562575	0.00438256	1.5	99.8
63	23	Alzheimer's_disease	0.0350298	0.5739609999999999	0.00581401	4.3	98.2
61	24	Alzheimer's_disease	0.0425309	0.593381	0.00638138	4.5	98.2
63	23	schizophrenia	0.046159399999999996	0.589334	0.00663542	4.3	98.5

Crossover between weighted and unweighted¶

The obvious question to ask is what is the crossover between the high confidence disease communities in each of these groupings. We can find this by comparing the members of each of these communities. While doing the NMI test we loaded in these communities, so we can access them and compare them easily.

In [45]:

weightedcommunities = {}
for l in zip(*parsecommunityout("OUT/communityout.txt")):
    try:
        weightedcommunities[l[1]] += [l[0]]
    except KeyError:
        weightedcommunities[l[1]] = [l[0]]

In [46]:

unweightedcommunities = {}
for l in zip(*parsecommunityout("unweightedOUT/communityout.txt")):
    try:
        unweightedcommunities[l[1]] += [l[0]]
    except KeyError:
        unweightedcommunities[l[1]] = [l[0]]

Unweighted community 29¶

The most likely disease association in the above unweighted table is community 29 with a size of 88. Comparing this to find the closest match in the weighted case:

In [47]:

def findsimilarcom(community,communitydict):
    overlapdict = {}
    for k in communitydict:
        #make two lists of inclusion/exclusion for all shared ids
        uwids = community
        wids = communitydict[k]
        tids = set(uwids+wids)
        winc,uwinc = [],[]
        for i in tids:
            winc.append(int(i in wids))
            uwinc.append(int(i in uwids))
        overlap = sum(map(lambda a,b: int(a==b), uwinc,winc))/len(tids)
        if overlap > 0.0:
            overlapdict[k] = overlap
    return overlapdict

In [48]:

findsimilarcom(unweightedcommunities[29],weightedcommunities)

Out[48]:

{2: 0.014925373134328358,
 7: 0.0086956521739130436,
 13: 0.0058479532163742687,
 24: 0.031914893617021274,
 26: 0.0059880239520958087,
 30: 0.011363636363636364,
 31: 0.011363636363636364,
 32: 0.019047619047619049,
 33: 0.44761904761904764,
 40: 0.052631578947368418,
 41: 0.008771929824561403,
 43: 0.011363636363636364,
 44: 0.0081967213114754103,
 45: 0.080536912751677847,
 46: 0.030534351145038167,
 48: 0.0096153846153846159,
 54: 0.02,
 83: 0.008771929824561403}

So many of these proteins are spread around smaller other communities in the weighted case, but most are found in another community which overlaps 31%.

In [49]:

len(weightedcommunities[33]),len(unweightedcommunities[29])

Out[49]:

(64, 88)

Investigating the effect of the weights. Would like to know which weighting might have caused the community to be split like this. We can inspect the graphs more easily by plotting them:

In [50]:

#load the edges of the graph
interactions = loadtxt("testdata/edgelist_update_weighted.txt",dtype=str)
interactions = interactions[1:]

In [51]:

import networkx as nx

In [52]:

def plotcommunities(com1,com2):
    G = nx.Graph()
    for l in interactions:
        if l[0] in set(com1+com2) and l[1] in set(com1+com2):
            G.add_edge(l[0],l[1],weight=float(l[2]))
    edict = {}
    lim = min([d['weight'] for (u,v,d) in G.edges(data=True)])
    diff = linspace(lim,1.0,10)[1]- linspace(lim,1.0,10)[0]
    for x in linspace(lim,1.0,10):
        edict[x] = [(u,v) for (u,v,d) in G.edges(data=True) if d['weight'] > x and d['weight'] < x+diff]
    pos = nx.circular_layout(com1)
    pos2 = nx.circular_layout(set(com2)-set(com1))
    for k in pos2:
        pos2[k] = array([pos2[k][0]+1.5,pos2[k][1]])
    pos = dict(pos.items()+pos2.items())
    nx.draw_networkx_nodes(G,pos,node_size=20,alpha=0.5)
    for k in edict:
        nx.draw_networkx_edges(G,pos,edgelist=edict[k],alpha=(k-lim)*(1/(1-lim)),edge_color='r')
        #nx.draw_networkx_edges(G,pos,edgelist=edict[k],edge_color='r')
    l=nx.draw_networkx_labels(G,pos,font_size=10,font_family='sans-serif')
    return None

In [53]:

plotcommunities(weightedcommunities[33],unweightedcommunities[29])

In [80]:

import os

In [81]:

plotpath = os.path.abspath("../plots/bayes/")

In [82]:

savez(os.path.join(plotpath,"nx2933.npz"),unweightedcommunities[29],weightedcommunities[33])

In [54]:

plotcommunities(unweightedcommunities[29],weightedcommunities[33])

Average weighting of the edges within each community:

In [56]:

def averageweight(community):
    weights = []
    for l in interactions:
        if l[0] in community and l[1] in community:
            weights.append(float(l[2]))
    return average(weights)

In [57]:

averageweight(weightedcommunities[33])

Out[57]:

0.89165857738239485

In [58]:

averageweight(unweightedcommunities[29])

Out[58]:

0.89599059455826779

Weighted community 44¶

Now investigating a high-likelihood community from the weighted case:

In [59]:

s=findsimilarcom(weightedcommunities[44],unweightedcommunities)
s

Out[59]:

{2: 0.015151515151515152,
 29: 0.0081967213114754103,
 30: 0.016666666666666666,
 63: 0.11538461538461539,
 64: 0.48717948717948717,
 69: 0.029999999999999999,
 77: 0.011627906976744186,
 84: 0.019230769230769232}

So the proteins involved are split amount some larger communities in the unweighted case, mostly in community 6. We can plot these:

In [60]:

plotcommunities(weightedcommunities[44],unweightedcommunities[64])

In [83]:

savez(os.path.join(plotpath,"nx6444.npz"),unweightedcommunities[64],weightedcommunities[44])

In [61]:

plotcommunities(unweightedcommunities[64],weightedcommunities[44])

Schizophrenia functional annotations¶

The Disease Enrichment test also produces a similar table annotating functions that are associated with Schizophrenia. We can repeat the above parsing to find the likely ones of this:

In [62]:

schw = pd.read_csv("OUT/permute_p_values_SCH_enriched.csv", delimiter="\t")
schuw = pd.read_csv("unweightedOUT/permute_p_values_SCH_enriched.csv", delimiter="\t")

In [63]:

sigschwdict = significantdiseases(splitbydisease(schw))
sigschuwdict = significantdiseases(splitbydisease(schuw))

In [64]:

for k in sigschwdict:
    sigschwdict[k].insert(2,'disease',k)
pieces = map(lambda k: sigschwdict[k],sigschwdict)

In [65]:

sigschweighted = pd.concat(pieces)

In [66]:

for k in sigschuwdict:
    sigschuwdict[k].insert(2,'disease',k)
pieces = map(lambda k: sigschuwdict[k],sigschuwdict)

In [67]:

sigschunweighted = pd.concat(pieces)

In [68]:

sigschweighted.sort(columns='p-value')

Out[68]:

	Community	Size	disease	p-value	{p-value}	Sig. Level	p_lower (%)	p_upper (%)
25	26	80	Endocytosis	-6.661340e-16	0.638668	NaN	0.0	100.0
44	45	73	Exocytosis	2.220450e-16	0.607270	4.712160e-10	0.0	100.0
39	40	32	CAT_signaling	6.416100e-10	0.672824	8.010050e-07	0.0	100.0
45	46	47	Exocytosis	6.159730e-09	0.632842	2.481880e-06	0.0	100.0
45	46	47	Intracellular_trafficking	1.953090e-07	0.663362	1.397530e-05	0.0	100.0
23	24	9	CAT_signaling	6.450880e-07	0.837461	2.539860e-05	0.0	100.0
32	33	64	Excitability	1.881530e-06	0.694875	4.337650e-05	0.0	100.0
53	54	14	Intracellular_trafficking	5.222120e-06	0.803925	7.226400e-05	0.0	100.0
44	45	73	Neurotransmitter_metabolism	9.475210e-06	0.763295	9.734020e-05	0.0	100.0
7	8	53	Cell_metabolism	1.138280e-05	0.659110	1.066900e-04	0.0	100.0
3	4	48	Structural_plasticity	1.576900e-05	0.603017	1.255740e-04	0.0	100.0
39	40	32	LGIC_signaling	1.663450e-05	0.856139	1.289740e-04	0.0	100.0
1	2	48	Protein_cluster	4.553190e-05	0.654236	2.133770e-04	0.0	100.0
32	33	64	Unknown	4.553640e-05	0.686648	2.133880e-04	0.0	100.0
77	78	15	G-protein_relay	8.115600e-05	0.900729	2.848670e-04	0.0	100.0
45	46	47	Ion_balance/transport	1.457220e-04	0.671435	3.817070e-04	0.0	100.0
46	47	5	G-protein_relay	3.607280e-04	0.970943	6.004980e-04	0.0	100.0
49	50	19	Protein_cluster	5.627940e-04	0.762330	7.499850e-04	0.0	100.0
12	13	84	RPSFB	6.244790e-04	0.619736	7.899930e-04	0.0	100.0
44	45	73	Protein_cluster	6.846600e-04	0.601769	8.271590e-04	0.2	100.0
44	45	73	CAT_signaling	1.170980e-03	0.624930	1.081490e-03	0.1	100.0
43	44	35	GPCR_signaling	1.425310e-03	0.941988	1.193010e-03	0.1	100.0
36	37	2	Intracellular_Signal_Transduction	2.369600e-03	0.901327	1.537530e-03	0.1	100.0
5	6	45	Cell_metabolism	2.620530e-03	0.644941	1.616680e-03	0.2	100.0
1	2	48	G-protein_relay	2.759550e-03	0.794211	1.658900e-03	0.2	100.0
17	18	120	Structural_plasticity	3.187110e-03	0.565784	1.782400e-03	0.3	100.0
23	24	9	Excitability	3.791720e-03	0.920861	1.943540e-03	0.3	100.0
32	33	64	Protein_cluster	1.065260e-02	0.631767	3.246400e-03	1.3	99.7
29	30	1	Excitability	1.081420e-02	0.985162	3.270670e-03	1.5	100.0
42	43	1	Excitability	1.081420e-02	0.988130	3.270670e-03	1.2	100.0
17	18	120	Intracellular_Signal_Transduction	1.149320e-02	0.550901	3.370620e-03	1.4	99.6
14	15	8	Cell_metabolism	1.240010e-02	0.865700	3.499480e-03	1.0	99.7
52	53	31	Intracellular_Signal_Transduction	1.544150e-02	0.630677	3.899110e-03	1.7	99.8
43	44	35	LGIC_signaling	1.572140e-02	0.842228	3.933730e-03	1.6	99.9
34	35	3	LGIC_signaling	1.708820e-02	0.985256	4.098320e-03	1.5	100.0
52	53	31	Ion_balance/transport	1.793900e-02	0.705077	4.197290e-03	2.0	99.9
18	19	12	Ion_balance/transport	1.934710e-02	0.833093	4.355780e-03	1.7	100.0
73	74	1	Cell_metabolism	2.226460e-02	0.978490	4.665720e-03	2.2	100.0
38	39	2	Unknown	2.277690e-02	0.977524	4.717850e-03	2.3	100.0
4	5	26	RPSFB	2.309210e-02	0.715671	4.749620e-03	1.8	99.9
1	2	48	CAT_signaling	2.411220e-02	0.637163	4.850860e-03	1.7	99.9
43	44	35	Ion_balance/transport	2.486020e-02	0.711774	4.923630e-03	1.7	99.8
43	44	35	Intracellular_Signal_Transduction	2.532140e-02	0.610452	4.967920e-03	2.1	99.5
15	16	17	Structural_plasticity	3.152940e-02	0.672475	5.525880e-03	4.3	99.8
32	33	64	Intracellular_Signal_Transduction	3.348920e-02	0.575023	5.689260e-03	3.8	98.7
12	13	84	Cell_metabolism	3.535680e-02	0.614940	5.840100e-03	3.4	99.3
49	50	19	Tyrosine_kinase_signaling	3.584560e-02	0.960434	5.878830e-03	4.1	99.9
5	6	45	Structural_plasticity	3.724010e-02	0.592989	5.987760e-03	3.8	99.4
47	48	17	Ion_balance/transport	3.765970e-02	0.804750	6.020090e-03	3.5	99.8
21	22	1	Structural_plasticity	4.198470e-02	0.966469	6.342080e-03	3.5	100.0
81	82	17	Intracellular_Signal_Transduction	4.681000e-02	0.671723	6.679730e-03	5.7	99.1
61	62	1	Intracellular_Signal_Transduction	4.898220e-02	0.950547	6.825170e-03	5.2	100.0
72	73	1	Intracellular_Signal_Transduction	4.898220e-02	0.956253	6.825170e-03	4.6	100.0

In [69]:

sigschunweighted.sort(columns='p-value')

Out[69]:

	Community	Size	disease	p-value	{p-value}	Sig. Level	p_lower (%)	p_upper (%)
1	2	99	Exocytosis	-6.661340e-16	0.596752	NaN	0.0	100.0
68	69	68	Endocytosis	-4.440890e-16	0.691935	NaN	0.0	100.0
60	61	24	G-protein_relay	8.690600e-12	0.873672	9.322340e-08	0.0	100.0
28	29	88	Excitability	8.561880e-11	0.662133	2.926070e-07	0.0	100.0
58	59	31	CAT_signaling	4.482160e-10	0.689979	6.694890e-07	0.0	100.0
72	73	13	Protein_cluster	1.814150e-09	0.816884	1.346900e-06	0.0	100.0
1	2	99	Intracellular_trafficking	2.364650e-07	0.597795	1.537740e-05	0.0	100.0
28	29	88	Protein_cluster	8.989080e-07	0.615212	2.998180e-05	0.0	100.0
16	17	5	Structural_plasticity	1.376690e-05	0.845397	1.173320e-04	0.0	100.0
3	4	17	Cell_metabolism	2.083410e-05	0.764369	1.443390e-04	0.0	100.0
6	7	92	RPSFB	3.503870e-05	0.599114	1.871830e-04	0.0	100.0
1	2	99	Neurotransmitter_metabolism	4.320860e-05	0.722620	2.078620e-04	0.0	100.0
69	70	3	Structural_plasticity	7.081230e-05	0.904585	2.660960e-04	0.0	100.0
61	62	24	Unknown	1.111100e-04	0.821977	3.333130e-04	0.0	100.0
71	72	44	CAT_signaling	4.116570e-04	0.656179	6.414730e-04	0.0	100.0
15	16	49	Structural_plasticity	7.192340e-04	0.580705	8.477720e-04	0.0	100.0
28	29	88	LGIC_signaling	9.317240e-04	0.737109	9.648090e-04	0.0	100.0
27	28	19	Unknown	1.090500e-03	0.831836	1.043700e-03	0.0	100.0
54	55	2	Intracellular_Signal_Transduction	2.369600e-03	0.914893	1.537530e-03	0.1	100.0
58	59	31	Intracellular_Signal_Transduction	3.135030e-03	0.626750	1.767820e-03	0.1	99.9
22	23	71	Cell_metabolism	3.932760e-03	0.616346	1.979220e-03	0.0	100.0
62	63	23	Intracellular_Signal_Transduction	4.158490e-03	0.645144	2.034990e-03	0.2	100.0
30	31	3	Tyrosine_kinase_signaling	5.717900e-03	0.996023	2.384370e-03	0.4	100.0
26	27	33	CAT_signaling	6.526000e-03	0.688906	2.546260e-03	0.4	100.0
62	63	23	LGIC_signaling	6.928630e-03	0.884490	2.623100e-03	0.7	100.0
4	5	17	RPSFB	7.016560e-03	0.756378	2.639570e-03	0.5	100.0
62	63	23	Ion_balance/transport	7.804610e-03	0.726029	2.782750e-03	0.5	100.0
61	62	24	Ion_balance/transport	8.809520e-03	0.747960	2.954980e-03	1.4	100.0
58	59	31	LGIC_signaling	1.243410e-02	0.888333	3.504200e-03	0.9	99.9
67	68	1	Endocytosis	1.335880e-02	0.987174	3.630470e-03	1.3	100.0
28	29	88	CAT_signaling	1.478950e-02	0.600355	3.817170e-03	1.0	99.8
32	33	3	LGIC_signaling	1.708820e-02	0.981325	4.098320e-03	1.9	100.0
60	61	24	CAT_signaling	1.728890e-02	0.720771	4.121890e-03	1.5	100.0
77	78	11	Intracellular_trafficking	2.088610e-02	0.834183	4.522150e-03	1.5	100.0
60	61	24	Exocytosis	2.133330e-02	0.709442	4.569270e-03	1.8	99.9
47	48	1	Cell_metabolism	2.226460e-02	0.981423	4.665720e-03	1.9	100.0
82	83	1	Cell_metabolism	2.226460e-02	0.976534	4.665720e-03	2.4	100.0
34	35	2	Unknown	2.277690e-02	0.976547	4.717850e-03	2.4	100.0
13	14	95	RPSFB	2.376770e-02	0.608652	4.816930e-03	2.2	99.6
66	67	1	Exocytosis	2.544530e-02	0.972712	4.979740e-03	2.8	100.0
19	20	87	Structural_plasticity	2.600320e-02	0.567603	5.032600e-03	2.6	98.7
22	23	71	Structural_plasticity	2.619760e-02	0.574707	5.050870e-03	2.1	99.3
13	14	95	Intracellular_Signal_Transduction	3.836680e-02	0.568902	6.074110e-03	4.3	98.0
14	15	16	Intracellular_Signal_Transduction	3.992310e-02	0.693810	6.191060e-03	3.2	99.8
62	63	23	GPCR_signaling	4.328120e-02	0.964559	6.434900e-03	3.7	99.9
63	64	23	GPCR_signaling	4.328120e-02	0.963602	6.434900e-03	3.8	99.9
68	69	68	Neurotransmitter_metabolism	4.361520e-02	0.783897	6.458550e-03	4.1	99.7
60	61	24	GPCR_signaling	4.513410e-02	0.959851	6.564830e-03	4.2	99.9
14	15	16	Cell_metabolism	4.753880e-02	0.776354	6.728960e-03	3.3	99.9
43	44	1	Intracellular_Signal_Transduction	4.898220e-02	0.942939	6.825170e-03	6.0	100.0
40	41	1	Intracellular_Signal_Transduction	4.898220e-02	0.943890	6.825170e-03	5.9	100.0

In [72]:

!git annex unlock unweighted_significant_SCH_communities.tsv

unlock unweighted_significant_SCH_communities.tsv (copying...) ok

In [73]:

!git annex unlock weighted_significant_SCH_communities.tsv

unlock weighted_significant_SCH_communities.tsv (copying...) ok

In [74]:

#save these tables
sigschunweighted.sort(columns='p-value').to_csv("unweighted_significant_SCH_communities.tsv",
                                             sep="\t",index=False)
sigschweighted.sort(columns='p-value').to_csv("weighted_significant_SCH_communities.tsv",
                                             sep="\t",index=False)

Function-Schizophrenia crossover¶

This file summarises the crossover between the functions and the disease to summarise which functions are most likely to be involved with the disease.

In [75]:

overlap = pd.read_csv("unweightedOUT/overlap_dis_SCHenriched.csv", sep="\t",header=None,names=range(139))

In [76]:

overlap = overlap.fillna('')

We are only interested in the overlap with schizophrenia, so we want to cut out just that column.

In [77]:

overlap = overlap.iloc[:,[0,1,108]]

In [78]:

with open("unweightedOUT/overlap_dis_SCHenriched.csv") as f:
    c = csv.reader(f,delimiter="\t")
    overlap = {}
    for l in c:
        try:
            if l[0] == 'community':
                currentcommunity = l[1]
            if len(l) > 108:
                if l[108] == '1':
                    try:
                        overlap[currentcommunity] += [l[0]]
                    except KeyError:
                        overlap[currentcommunity] = [l[0]]
        except:
            pass

Ok, so we now have access to this data.