Trying out community detection. At this point can only do this on the STRING based weighted edges generated in this notebook.
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
# Some nice default configuration for plots
plt.rcParams['figure.figsize'] = 10, 7.5
plt.rcParams['axes.grid'] = True
plt.gray()
<matplotlib.figure.Figure at 0x30f9ad0>
import pandas as pd
cd ../../forGAVIN/mergecode/OUT/
/data/opencast/MRes/forGAVIN/mergecode/OUT
!git annex unlock ../../../HBP/testdata/edgelist_update_weighted.txt
unlock ../../../HBP/testdata/edgelist_update_weighted.txt (copying...) ok
cp edgelist_update_weighted.txt ../../../HBP/testdata/
cd ../../../HBP/
/data/opencast/MRes/HBP
Community detection code from Colin written in C++:
!make
Building with option-useomp g++ -DUSEOMP -O3 -fopenmp -fomit-frame-pointer -funroll-loops -fforce-addr -fexpensive-optimizations -Wno-deprecated -g3 -o run communityDetection.C
This is how to run it:
!./run
./run requires 2 arguments: -file : the network file to run -quite : Turn on/off comments to command line : 0 = switch on comments to command line : 1 = switch off comments to command line -algorithm : the type of algorithm to run : 1 = Geodesic edge Betweenness : 2 = Random edge Betweenness : 3 = Spectral Betweenness -weighted : specify if network file is *weighted or not: : 0 = Using a non-weighted network file : 1 = Using a weighted network file -enrichment : run the enrichment on the clustered network : 0 = run enrichment after clustering - Default : 1 = run enrichment without clustering : 2 = do not run enrichment ----- * : The structure of the network file is: -weighted 1 : A (interactor) B (interactor) W (weight) -weighted 0 : A (interactor) B (interactor) : Where A and B are integers (Entrez IDs) and W is a double. EXAMPLE 1 ----- ./run -file testData/edgelist.txt -weighted 0 -algorithm 1 ----- EXAMPLE 2 ----- ./run -file testData/karate.txt -weighted 1 -algorithm 1 -----
Output is written to the OUT
file, so we have to unlock them first.
!git annex unlock OUT/*
unlock OUT/communities_cytoscape.txt (copying...) ok unlock OUT/community-metric_nmi.txt (copying...) ok unlock OUT/communityout.txt (copying...) ok unlock OUT/consensusout.txt (copying...) ok unlock OUT/enrichment_SCH_enriched.csv (copying...) ok unlock OUT/enrichment_disease.csv (copying...) ok unlock OUT/network.png (copying...) ok unlock OUT/overlap_dis_SCHenriched.csv (copying...) ok unlock OUT/p_significance_tests_SCH_enriched.csv (copying...) ok unlock OUT/p_significance_tests_disease.csv (copying...) ok unlock OUT/p_significance_tests_summary_SCH_enriched.csv (copying...) ok unlock OUT/p_significance_tests_summary_disease.csv (copying...) ok unlock OUT/p_value_network_SCH_enriched.csv (copying...) ok unlock OUT/p_value_network_disease.csv (copying...) ok unlock OUT/p_values_SCH_enriched.csv (copying...) ok unlock OUT/p_values_disease.csv (copying...) ok unlock OUT/permute_p_values_SCH_enriched.csv (copying...) ok unlock OUT/permute_p_values_disease.csv (copying...) ok unlock OUT/removededges.txt (copying...) ok
Running the code on the unweighted case to begin with:
!./run -file testdata/edgelist_update_weighted.txt -weighted 0 -algorithm 3 -quite 1
> ---- START ---- > Using Spectral Betweenness. > Using a non-weighted network > Scanning network_file: testdata/edgelist_update_weighted.txt > Header: None > Running Spectral Betweenness. --- DONE --- > cputime: 275.53 seconds > Network (nodes): 1572 (edges): 9373 > ---- END ----
This file describes the communities produced:
!head -n 10 OUT/communityout.txt
communityout Max Q: 0.36175 cputime: 275.53 seconds Network (nodes): 1572 (edges): 9373 we 0 dist 0 community 1 Degree 1 visited 0 key 1315 (55268) community: 1 size: 1 we 0 dist 0 community 2 Degree 4 visited 0 key 3 (18) we 0 dist 0 community 2 Degree 5 visited 0 key 50 (372) we 0 dist 0 community 2 Degree 5 visited 0 key 62 (481) we 0 dist 0 community 2 Degree 2 visited 0 key 63 (482)
This is the case using the weights generated using the various data sources available:
!git annex unlock unweightedOUT/*
unlock unweightedOUT/communities_cytoscape.txt (copying...) ok unlock unweightedOUT/community-metric_nmi.txt (copying...) ok unlock unweightedOUT/communityout.txt (copying...) ok unlock unweightedOUT/consensusout.txt (copying...) ok unlock unweightedOUT/enrichment_SCH_enriched.csv (copying...) ok unlock unweightedOUT/enrichment_disease.csv (copying...) ok unlock unweightedOUT/network.png (copying...) ok unlock unweightedOUT/overlap_dis_SCHenriched.csv (copying...) ok unlock unweightedOUT/p_significance_tests_SCH_enriched.csv (copying...) ok unlock unweightedOUT/p_significance_tests_disease.csv (copying...) ok unlock unweightedOUT/p_significance_tests_summary_SCH_enriched.csv (copying...) ok unlock unweightedOUT/p_significance_tests_summary_disease.csv (copying...) ok unlock unweightedOUT/p_value_network_SCH_enriched.csv (copying...) ok unlock unweightedOUT/p_value_network_disease.csv (copying...) ok unlock unweightedOUT/p_values_SCH_enriched.csv (copying...) ok unlock unweightedOUT/p_values_disease.csv (copying...) ok unlock unweightedOUT/permute_p_values_SCH_enriched.csv (copying...) ok unlock unweightedOUT/permute_p_values_disease.csv (copying...) ok unlock unweightedOUT/removededges.txt (copying...) ok
cp OUT/* unweightedOUT/
!./run -file testdata/edgelist_update_weighted.txt -weighted 1 -algorithm 3 -quite 1
> ---- START ---- > Using Spectral Betweenness. > Using a weighted network > Scanning network_file: testdata/edgelist_update_weighted.txt > Header: None > Running Spectral Betweenness. --- DONE --- > cputime: 279.53 seconds > Network (nodes): 1572 (edges): 9373 > ---- END ----
We can provide a measure of the similarity of the communities detected using a Normalised Mutual Information test. As wikipedia puts it:
mutual information measures the information that X and Y share: it measures how much knowing one of these variables reduces uncertainty about the other.
Scikit-learn provides this test, all that's required is to load the data into Python.
import csv
def parsecommunityout(fname):
f = open(fname)
c = csv.reader(f,delimiter=" ")
communitydict = {}
for l in c:
if l[0] == "we":
communitydict[l[-1].strip("()")] = int(l[5])
sortedids = sorted(communitydict.keys(),key=int)
communities = map(communitydict.__getitem__,sortedids)
return sortedids,communities
wids,wcom = parsecommunityout("OUT/communityout.txt")
uwids,uwcom = parsecommunityout("unweightedOUT/communityout.txt")
import sklearn.metrics
sklearn.metrics.normalized_mutual_info_score(wcom,uwcom)
0.60407676187875536
But, this result might not be correct, as the class labels sorting the nodes into communities are arbitrary.
Again, the below code is not mine:
from __future__ import division
import collections, re, math, sys, csv
#--- helper function
def convertStr(s):
ret = int(s)
return ret
rows, new, dic = [], [], []
consensus = "OUT/communities_cytoscape.txt"
community = "unweightedOUT/communities_cytoscape.txt"
#--- readin first community file
with open(consensus, 'r') as csvfile:
reader = csv.reader(csvfile, delimiter='=')
headerline = reader.next();
for row in reader:
new.append(row)
#--- readin second community file
with open(community, 'r') as csvfile:
reader = csv.reader(csvfile, delimiter='=')
headerline = reader.next();
for row in reader:
dic.append(row)
!git annex unlock OUT/community-metric_nmi.txt
for i in dic:
for j in new:
if i[0] == j[0]:
i.append(j[1])
dic.sort(key=lambda x: int(x[2]))
#--- count of nodes in each community in reality and in algorithm
realc, algc = [], []
for i in dic:
realc.append(convertStr(i[1]))
algc.append(convertStr(i[2]))
comcount = dict([(i,realc.count(i)) for i in set(realc)])
pcount = dict([(i,algc.count(i)) for i in set(algc)])
#--- holds the algorithm community label with the corresponding community labels
# provided e.g., {1:{2:2,3:3}}
algcount, tempd = {}, {}
templist = []
comtrack = 0
for key, val in pcount.iteritems():
templist = realc[comtrack:comtrack+val]
comtrack += val
for i in set(templist):
tempd[i] = templist.count(i)
algcount[key] = tempd
tempd = {}
#--- finding the node with the maximum label in that community
# key = int, value = dictionary
label = {}
for key, value in algcount.iteritems():
label[key] = max(value, key = value.get)
for i in dic:
for j in label:
if (int)(i[2]) == j:
i.append(label[j])
#--- METRICS
#--- Calculate the NMI
nt, np, ntp = {}, {}, {}
NMI = 0
nodes = len(dic)
for i in range(1, max(comcount)+1):
for j in range(1, max(pcount)+1):
nt[i] = 0
np[j] = 0
ntp[(str(i)+' '+str(j))] = 0
nt = comcount
np = pcount
for i in dic:
ntp[(str(i[1]).strip()+' '+str(i[2]).strip())] += 1
num, den = 0, 0
for i, t in nt.iteritems():
for j, p in np.iteritems():
temp, temp2 = 0, 0
if (ntp[(str(i).strip()+' '+str(j).strip())] > 0) and (t > 0) and (p > 0):
temp = ((ntp[(str(i).strip()+' '+str(j).strip())]*nodes)/(t*p))
temp2 = ntp[(str(i).strip()+' '+str(j).strip())]* math.log(temp)
num += temp2
d1, d2 = 0, 0
for t in nt.values():
d1 += t * math.log(t/nodes)
for p in np.values():
d2 += p * math.log(p/nodes)
den = d1 + d2
NMI = -(2*num)/den
#--- print results to screen
print 'NMI = %s' %(NMI)
f = open ( 'OUT/community-metric_nmi.txt', 'w' )
f.writelines("node\treal\talg\tnewlabel\n")
f.writelines("%s\t%s\t%s\t%s\n" % (i[0],i[1],i[2],i[3]) for i in dic)
f.writelines('NMI = %s' %(NMI))
f.close()
NMI = 0.604022794092
So the result is almost exactly the same, but the above code appears to take into account the problem with arbitrary community labels. Good to see both methods getting similar results, in any case.
Having a look at the disease enrichment now:
dew = pd.read_csv("OUT/permute_p_values_disease.csv", delimiter="\t")
deuw = pd.read_csv("unweightedOUT/permute_p_values_disease.csv", delimiter="\t")
Check papers to read about the hypergeometric disease enrichment test:
dew.head()
Community | Size | disease_of_metabolism | p-value | {p-value} | Sig. Level | p_lower (%) | p_upper (%) | disease_by_infectious_agent | p-value.1 | ... | Sig. Level.135 | p_lower (%).135 | p_upper (%).135 | muscular_dystrophy | p-value.136 | {p-value}.136 | Sig. Level.136 | p_lower (%).136 | p_upper (%).136 | Unnamed: 824 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 50 | NaN | 0.125016 | 0.541943 | 0.010459 | 12.5 | 93.0 | NaN | 0.837946 | ... | 0.015763 | 45.7 | 89.0 | NaN | 0.598702 | 0.659976 | 0.015500 | 59.9 | 76.4 | NaN |
1 | 2 | 48 | NaN | 0.761838 | 0.562907 | 0.013470 | 74.6 | 39.4 | NaN | 0.989202 | ... | 0.000000 | 100.0 | 55.6 | NaN | 0.051262 | 0.671740 | 0.006974 | 5.2 | 99.6 | NaN |
2 | 3 | 26 | NaN | 0.574356 | 0.576412 | 0.015636 | 58.3 | 64.3 | NaN | 0.406454 | ... | 0.014087 | 28.9 | 95.9 | NaN | 1.000000 | 0.731936 | 0.000000 | 100.0 | 61.4 | NaN |
3 | 4 | 48 | NaN | 0.877369 | 0.564046 | 0.010373 | 88.7 | 23.8 | NaN | 0.033306 | ... | 0.015723 | 43.4 | 88.9 | NaN | 0.009142 | 0.668188 | 0.003010 | 1.1 | 99.7 | NaN |
4 | 5 | 26 | NaN | 0.574356 | 0.610789 | 0.015636 | 52.1 | 70.0 | NaN | 0.812738 | ... | 0.000000 | 100.0 | 73.9 | NaN | 0.375665 | 0.743471 | 0.015315 | 37.1 | 91.9 | NaN |
5 rows × 825 columns
deuw.head()
Community | Size | disease_of_metabolism | p-value | {p-value} | Sig. Level | p_lower (%) | p_upper (%) | disease_by_infectious_agent | p-value.1 | ... | Sig. Level.135 | p_lower (%).135 | p_upper (%).135 | muscular_dystrophy | p-value.136 | {p-value}.136 | Sig. Level.136 | p_lower (%).136 | p_upper (%).136 | Unnamed: 824 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | NaN | 1.000000 | 0.880437 | 0.000000 | 100.0 | 85.9 | NaN | 1.000000 | ... | 0.000000 | 100.0 | 98.9 | NaN | 1.000000 | 0.983303 | 0.000000 | 100 | 98.3 | NaN |
1 | 2 | 99 | NaN | 0.765480 | 0.529656 | 0.013399 | 76.6 | 32.3 | NaN | 0.970181 | ... | 0.005178 | 2.3 | 99.6 | NaN | 0.535656 | 0.606840 | 0.015771 | 55 | 74.1 | NaN |
2 | 3 | 37 | NaN | 0.684769 | 0.560950 | 0.014692 | 69.7 | 48.4 | NaN | 0.009547 | ... | 0.015230 | 37.4 | 93.7 | NaN | 1.000000 | 0.686668 | 0.000000 | 100 | 49.4 | NaN |
3 | 4 | 17 | NaN | 0.490462 | 0.594665 | 0.015808 | 48.8 | 73.4 | NaN | 0.783821 | ... | 0.012346 | 19.0 | 98.3 | NaN | 1.000000 | 0.784876 | 0.000000 | 100 | 71.6 | NaN |
4 | 5 | 17 | NaN | 0.940347 | 0.603658 | 0.007490 | 94.1 | 24.9 | NaN | 0.949553 | ... | 0.000000 | 100.0 | 79.9 | NaN | 1.000000 | 0.797138 | 0.000000 | 100 | 73.2 | NaN |
5 rows × 825 columns
How to interpret these results? I suppose we would like to split it by the different diseases and then detect those with low p-values, probably by sorting them. All of the disease columns are regularly spaced, so should be easy to split into different dataframes:
def splitbydisease(diseasedataframe):
diseasedict = {}
clabels = diseasedataframe.iloc[:,[0,1,3,4,5,6,7]].columns
for offset in range(2,diseasedataframe.columns.shape[0]-1,6):
diseaseindexes = [0,1] + range(offset+1,offset+6)
diseasedict[diseasedataframe.columns[offset]]=diseasedataframe.iloc[:,diseaseindexes]
#fix the column names
diseasedict[diseasedataframe.columns[offset]].columns = clabels
return diseasedict
deuwdict = splitbydisease(deuw)
dewdict = splitbydisease(dew)
Now we can search for diseases in each of these communities with p-values less than 5%.
def significantdiseases(diseasedict):
sigdiseasedict = {}
for k in diseasedict:
pvalues = diseasedict[k].iloc[:,2]
pvalues = pvalues[pvalues<0.05]
sigdiseasedict[k] = diseasedict[k].iloc[pvalues.index]
return sigdiseasedict
sigdeuwdict = significantdiseases(deuwdict)
sigdewdict = significantdiseases(dewdict)
Now we want to iterate through the diseases and sort the lowest p-values, while annotating which disease and community is involved. Should probably build a new dataframe and put this information in it.
To keep the analysis simple, we only want to consider schizophrenia and alzheimers.
interestdiseases = ["Alzheimer's_disease",'schizophrenia']
for k in sigdeuwdict:
sigdeuwdict[k].insert(2,'disease',k)
pieces = map(lambda k: sigdeuwdict[k],interestdiseases)
sigunweighted = pd.concat(pieces)
for k in sigdewdict:
sigdewdict[k].insert(2,'disease',k)
pieces = map(lambda k: sigdewdict[k],interestdiseases)
sigweighted = pd.concat(pieces)
sigunweighted.sort(columns='p-value')
Community | Size | disease | p-value | {p-value} | Sig. Level | p_lower (%) | p_upper (%) | |
---|---|---|---|---|---|---|---|---|
28 | 29 | 88 | schizophrenia | 0.001691 | 0.549829 | 0.001299 | 0.2 | 99.9 |
1 | 2 | 99 | schizophrenia | 0.004321 | 0.547038 | 0.002074 | 0.5 | 99.9 |
29 | 30 | 26 | Alzheimer's_disease | 0.005374 | 0.597882 | 0.002312 | 0.7 | 99.8 |
60 | 61 | 24 | schizophrenia | 0.019591 | 0.562575 | 0.004383 | 1.5 | 99.8 |
62 | 63 | 23 | Alzheimer's_disease | 0.035030 | 0.573961 | 0.005814 | 4.3 | 98.2 |
60 | 61 | 24 | Alzheimer's_disease | 0.042531 | 0.593381 | 0.006381 | 4.5 | 98.2 |
62 | 63 | 23 | schizophrenia | 0.046159 | 0.589334 | 0.006635 | 4.3 | 98.5 |
sigweighted.sort(columns='p-value')
Community | Size | disease | p-value | {p-value} | Sig. Level | p_lower (%) | p_upper (%) | |
---|---|---|---|---|---|---|---|---|
43 | 44 | 35 | schizophrenia | 0.003446 | 0.564032 | 0.001853 | 0.6 | 99.7 |
44 | 45 | 73 | schizophrenia | 0.003900 | 0.528577 | 0.001971 | 0.4 | 99.9 |
47 | 48 | 17 | Alzheimer's_disease | 0.007570 | 0.619867 | 0.002741 | 0.2 | 100.0 |
32 | 33 | 64 | Alzheimer's_disease | 0.008390 | 0.563842 | 0.002884 | 0.7 | 99.5 |
32 | 33 | 64 | schizophrenia | 0.023347 | 0.541118 | 0.004775 | 2.2 | 98.8 |
13 | 14 | 18 | schizophrenia | 0.041873 | 0.590369 | 0.006334 | 3.2 | 99.5 |
1 | 2 | 48 | schizophrenia | 0.046719 | 0.563059 | 0.006674 | 4.9 | 98.0 |
14 | 15 | 8 | Alzheimer's_disease | 0.049793 | 0.666429 | 0.006879 | 5.2 | 99.2 |
Now all that's required is to retreive the IDs of the proteins in each of these communities to compare the different disease associated communities. Wait, Cytoscape already does that for me. I can just use that.
Writing these results to a file so I can discuss them later:
!git annex unlock unweighted_significant_disease_communities.tsv
unlock unweighted_significant_disease_communities.tsv (copying...) ok
sigunweighted.sort(columns='p-value').to_csv("unweighted_significant_disease_communities.tsv",
sep="\t",index=False)
!git annex unlock weighted_significant_disease_communities.tsv
unlock weighted_significant_disease_communities.tsv (copying...) ok
sigweighted.sort(columns='p-value').to_csv("weighted_significant_disease_communities.tsv",
sep="\t",index=False)
!head unweighted_significant_disease_communities.tsv
Community Size disease p-value {p-value} Sig. Level p_lower (%) p_upper (%) 29 88 schizophrenia 0.00169105 0.549829 0.0012993 0.2 99.9 2 99 schizophrenia 0.00432129 0.547038 0.00207427 0.5 99.9 30 26 Alzheimer's_disease 0.00537398 0.597882 0.00231195 0.7 99.8 61 24 schizophrenia 0.019590700000000003 0.562575 0.00438256 1.5 99.8 63 23 Alzheimer's_disease 0.0350298 0.5739609999999999 0.00581401 4.3 98.2 61 24 Alzheimer's_disease 0.0425309 0.593381 0.00638138 4.5 98.2 63 23 schizophrenia 0.046159399999999996 0.589334 0.00663542 4.3 98.5
The obvious question to ask is what is the crossover between the high confidence disease communities in each of these groupings. We can find this by comparing the members of each of these communities. While doing the NMI test we loaded in these communities, so we can access them and compare them easily.
weightedcommunities = {}
for l in zip(*parsecommunityout("OUT/communityout.txt")):
try:
weightedcommunities[l[1]] += [l[0]]
except KeyError:
weightedcommunities[l[1]] = [l[0]]
unweightedcommunities = {}
for l in zip(*parsecommunityout("unweightedOUT/communityout.txt")):
try:
unweightedcommunities[l[1]] += [l[0]]
except KeyError:
unweightedcommunities[l[1]] = [l[0]]
The most likely disease association in the above unweighted table is community 29 with a size of 88. Comparing this to find the closest match in the weighted case:
def findsimilarcom(community,communitydict):
overlapdict = {}
for k in communitydict:
#make two lists of inclusion/exclusion for all shared ids
uwids = community
wids = communitydict[k]
tids = set(uwids+wids)
winc,uwinc = [],[]
for i in tids:
winc.append(int(i in wids))
uwinc.append(int(i in uwids))
overlap = sum(map(lambda a,b: int(a==b), uwinc,winc))/len(tids)
if overlap > 0.0:
overlapdict[k] = overlap
return overlapdict
findsimilarcom(unweightedcommunities[29],weightedcommunities)
{2: 0.014925373134328358, 7: 0.0086956521739130436, 13: 0.0058479532163742687, 24: 0.031914893617021274, 26: 0.0059880239520958087, 30: 0.011363636363636364, 31: 0.011363636363636364, 32: 0.019047619047619049, 33: 0.44761904761904764, 40: 0.052631578947368418, 41: 0.008771929824561403, 43: 0.011363636363636364, 44: 0.0081967213114754103, 45: 0.080536912751677847, 46: 0.030534351145038167, 48: 0.0096153846153846159, 54: 0.02, 83: 0.008771929824561403}
So many of these proteins are spread around smaller other communities in the weighted case, but most are found in another community which overlaps 31%.
len(weightedcommunities[33]),len(unweightedcommunities[29])
(64, 88)
Investigating the effect of the weights. Would like to know which weighting might have caused the community to be split like this. We can inspect the graphs more easily by plotting them:
#load the edges of the graph
interactions = loadtxt("testdata/edgelist_update_weighted.txt",dtype=str)
interactions = interactions[1:]
import networkx as nx
def plotcommunities(com1,com2):
G = nx.Graph()
for l in interactions:
if l[0] in set(com1+com2) and l[1] in set(com1+com2):
G.add_edge(l[0],l[1],weight=float(l[2]))
edict = {}
lim = min([d['weight'] for (u,v,d) in G.edges(data=True)])
diff = linspace(lim,1.0,10)[1]- linspace(lim,1.0,10)[0]
for x in linspace(lim,1.0,10):
edict[x] = [(u,v) for (u,v,d) in G.edges(data=True) if d['weight'] > x and d['weight'] < x+diff]
pos = nx.circular_layout(com1)
pos2 = nx.circular_layout(set(com2)-set(com1))
for k in pos2:
pos2[k] = array([pos2[k][0]+1.5,pos2[k][1]])
pos = dict(pos.items()+pos2.items())
nx.draw_networkx_nodes(G,pos,node_size=20,alpha=0.5)
for k in edict:
nx.draw_networkx_edges(G,pos,edgelist=edict[k],alpha=(k-lim)*(1/(1-lim)),edge_color='r')
#nx.draw_networkx_edges(G,pos,edgelist=edict[k],edge_color='r')
l=nx.draw_networkx_labels(G,pos,font_size=10,font_family='sans-serif')
return None
plotcommunities(weightedcommunities[33],unweightedcommunities[29])
import os
plotpath = os.path.abspath("../plots/bayes/")
savez(os.path.join(plotpath,"nx2933.npz"),unweightedcommunities[29],weightedcommunities[33])
plotcommunities(unweightedcommunities[29],weightedcommunities[33])
Average weighting of the edges within each community:
def averageweight(community):
weights = []
for l in interactions:
if l[0] in community and l[1] in community:
weights.append(float(l[2]))
return average(weights)
averageweight(weightedcommunities[33])
0.89165857738239485
averageweight(unweightedcommunities[29])
0.89599059455826779
Now investigating a high-likelihood community from the weighted case:
s=findsimilarcom(weightedcommunities[44],unweightedcommunities)
s
{2: 0.015151515151515152, 29: 0.0081967213114754103, 30: 0.016666666666666666, 63: 0.11538461538461539, 64: 0.48717948717948717, 69: 0.029999999999999999, 77: 0.011627906976744186, 84: 0.019230769230769232}
So the proteins involved are split amount some larger communities in the unweighted case, mostly in community 6. We can plot these:
plotcommunities(weightedcommunities[44],unweightedcommunities[64])
savez(os.path.join(plotpath,"nx6444.npz"),unweightedcommunities[64],weightedcommunities[44])
plotcommunities(unweightedcommunities[64],weightedcommunities[44])
The Disease Enrichment test also produces a similar table annotating functions that are associated with Schizophrenia. We can repeat the above parsing to find the likely ones of this:
schw = pd.read_csv("OUT/permute_p_values_SCH_enriched.csv", delimiter="\t")
schuw = pd.read_csv("unweightedOUT/permute_p_values_SCH_enriched.csv", delimiter="\t")
sigschwdict = significantdiseases(splitbydisease(schw))
sigschuwdict = significantdiseases(splitbydisease(schuw))
for k in sigschwdict:
sigschwdict[k].insert(2,'disease',k)
pieces = map(lambda k: sigschwdict[k],sigschwdict)
sigschweighted = pd.concat(pieces)
for k in sigschuwdict:
sigschuwdict[k].insert(2,'disease',k)
pieces = map(lambda k: sigschuwdict[k],sigschuwdict)
sigschunweighted = pd.concat(pieces)
sigschweighted.sort(columns='p-value')
Community | Size | disease | p-value | {p-value} | Sig. Level | p_lower (%) | p_upper (%) | |
---|---|---|---|---|---|---|---|---|
25 | 26 | 80 | Endocytosis | -6.661340e-16 | 0.638668 | NaN | 0.0 | 100.0 |
44 | 45 | 73 | Exocytosis | 2.220450e-16 | 0.607270 | 4.712160e-10 | 0.0 | 100.0 |
39 | 40 | 32 | CAT_signaling | 6.416100e-10 | 0.672824 | 8.010050e-07 | 0.0 | 100.0 |
45 | 46 | 47 | Exocytosis | 6.159730e-09 | 0.632842 | 2.481880e-06 | 0.0 | 100.0 |
45 | 46 | 47 | Intracellular_trafficking | 1.953090e-07 | 0.663362 | 1.397530e-05 | 0.0 | 100.0 |
23 | 24 | 9 | CAT_signaling | 6.450880e-07 | 0.837461 | 2.539860e-05 | 0.0 | 100.0 |
32 | 33 | 64 | Excitability | 1.881530e-06 | 0.694875 | 4.337650e-05 | 0.0 | 100.0 |
53 | 54 | 14 | Intracellular_trafficking | 5.222120e-06 | 0.803925 | 7.226400e-05 | 0.0 | 100.0 |
44 | 45 | 73 | Neurotransmitter_metabolism | 9.475210e-06 | 0.763295 | 9.734020e-05 | 0.0 | 100.0 |
7 | 8 | 53 | Cell_metabolism | 1.138280e-05 | 0.659110 | 1.066900e-04 | 0.0 | 100.0 |
3 | 4 | 48 | Structural_plasticity | 1.576900e-05 | 0.603017 | 1.255740e-04 | 0.0 | 100.0 |
39 | 40 | 32 | LGIC_signaling | 1.663450e-05 | 0.856139 | 1.289740e-04 | 0.0 | 100.0 |
1 | 2 | 48 | Protein_cluster | 4.553190e-05 | 0.654236 | 2.133770e-04 | 0.0 | 100.0 |
32 | 33 | 64 | Unknown | 4.553640e-05 | 0.686648 | 2.133880e-04 | 0.0 | 100.0 |
77 | 78 | 15 | G-protein_relay | 8.115600e-05 | 0.900729 | 2.848670e-04 | 0.0 | 100.0 |
45 | 46 | 47 | Ion_balance/transport | 1.457220e-04 | 0.671435 | 3.817070e-04 | 0.0 | 100.0 |
46 | 47 | 5 | G-protein_relay | 3.607280e-04 | 0.970943 | 6.004980e-04 | 0.0 | 100.0 |
49 | 50 | 19 | Protein_cluster | 5.627940e-04 | 0.762330 | 7.499850e-04 | 0.0 | 100.0 |
12 | 13 | 84 | RPSFB | 6.244790e-04 | 0.619736 | 7.899930e-04 | 0.0 | 100.0 |
44 | 45 | 73 | Protein_cluster | 6.846600e-04 | 0.601769 | 8.271590e-04 | 0.2 | 100.0 |
44 | 45 | 73 | CAT_signaling | 1.170980e-03 | 0.624930 | 1.081490e-03 | 0.1 | 100.0 |
43 | 44 | 35 | GPCR_signaling | 1.425310e-03 | 0.941988 | 1.193010e-03 | 0.1 | 100.0 |
36 | 37 | 2 | Intracellular_Signal_Transduction | 2.369600e-03 | 0.901327 | 1.537530e-03 | 0.1 | 100.0 |
5 | 6 | 45 | Cell_metabolism | 2.620530e-03 | 0.644941 | 1.616680e-03 | 0.2 | 100.0 |
1 | 2 | 48 | G-protein_relay | 2.759550e-03 | 0.794211 | 1.658900e-03 | 0.2 | 100.0 |
17 | 18 | 120 | Structural_plasticity | 3.187110e-03 | 0.565784 | 1.782400e-03 | 0.3 | 100.0 |
23 | 24 | 9 | Excitability | 3.791720e-03 | 0.920861 | 1.943540e-03 | 0.3 | 100.0 |
32 | 33 | 64 | Protein_cluster | 1.065260e-02 | 0.631767 | 3.246400e-03 | 1.3 | 99.7 |
29 | 30 | 1 | Excitability | 1.081420e-02 | 0.985162 | 3.270670e-03 | 1.5 | 100.0 |
42 | 43 | 1 | Excitability | 1.081420e-02 | 0.988130 | 3.270670e-03 | 1.2 | 100.0 |
17 | 18 | 120 | Intracellular_Signal_Transduction | 1.149320e-02 | 0.550901 | 3.370620e-03 | 1.4 | 99.6 |
14 | 15 | 8 | Cell_metabolism | 1.240010e-02 | 0.865700 | 3.499480e-03 | 1.0 | 99.7 |
52 | 53 | 31 | Intracellular_Signal_Transduction | 1.544150e-02 | 0.630677 | 3.899110e-03 | 1.7 | 99.8 |
43 | 44 | 35 | LGIC_signaling | 1.572140e-02 | 0.842228 | 3.933730e-03 | 1.6 | 99.9 |
34 | 35 | 3 | LGIC_signaling | 1.708820e-02 | 0.985256 | 4.098320e-03 | 1.5 | 100.0 |
52 | 53 | 31 | Ion_balance/transport | 1.793900e-02 | 0.705077 | 4.197290e-03 | 2.0 | 99.9 |
18 | 19 | 12 | Ion_balance/transport | 1.934710e-02 | 0.833093 | 4.355780e-03 | 1.7 | 100.0 |
73 | 74 | 1 | Cell_metabolism | 2.226460e-02 | 0.978490 | 4.665720e-03 | 2.2 | 100.0 |
38 | 39 | 2 | Unknown | 2.277690e-02 | 0.977524 | 4.717850e-03 | 2.3 | 100.0 |
4 | 5 | 26 | RPSFB | 2.309210e-02 | 0.715671 | 4.749620e-03 | 1.8 | 99.9 |
1 | 2 | 48 | CAT_signaling | 2.411220e-02 | 0.637163 | 4.850860e-03 | 1.7 | 99.9 |
43 | 44 | 35 | Ion_balance/transport | 2.486020e-02 | 0.711774 | 4.923630e-03 | 1.7 | 99.8 |
43 | 44 | 35 | Intracellular_Signal_Transduction | 2.532140e-02 | 0.610452 | 4.967920e-03 | 2.1 | 99.5 |
15 | 16 | 17 | Structural_plasticity | 3.152940e-02 | 0.672475 | 5.525880e-03 | 4.3 | 99.8 |
32 | 33 | 64 | Intracellular_Signal_Transduction | 3.348920e-02 | 0.575023 | 5.689260e-03 | 3.8 | 98.7 |
12 | 13 | 84 | Cell_metabolism | 3.535680e-02 | 0.614940 | 5.840100e-03 | 3.4 | 99.3 |
49 | 50 | 19 | Tyrosine_kinase_signaling | 3.584560e-02 | 0.960434 | 5.878830e-03 | 4.1 | 99.9 |
5 | 6 | 45 | Structural_plasticity | 3.724010e-02 | 0.592989 | 5.987760e-03 | 3.8 | 99.4 |
47 | 48 | 17 | Ion_balance/transport | 3.765970e-02 | 0.804750 | 6.020090e-03 | 3.5 | 99.8 |
21 | 22 | 1 | Structural_plasticity | 4.198470e-02 | 0.966469 | 6.342080e-03 | 3.5 | 100.0 |
81 | 82 | 17 | Intracellular_Signal_Transduction | 4.681000e-02 | 0.671723 | 6.679730e-03 | 5.7 | 99.1 |
61 | 62 | 1 | Intracellular_Signal_Transduction | 4.898220e-02 | 0.950547 | 6.825170e-03 | 5.2 | 100.0 |
72 | 73 | 1 | Intracellular_Signal_Transduction | 4.898220e-02 | 0.956253 | 6.825170e-03 | 4.6 | 100.0 |
sigschunweighted.sort(columns='p-value')
Community | Size | disease | p-value | {p-value} | Sig. Level | p_lower (%) | p_upper (%) | |
---|---|---|---|---|---|---|---|---|
1 | 2 | 99 | Exocytosis | -6.661340e-16 | 0.596752 | NaN | 0.0 | 100.0 |
68 | 69 | 68 | Endocytosis | -4.440890e-16 | 0.691935 | NaN | 0.0 | 100.0 |
60 | 61 | 24 | G-protein_relay | 8.690600e-12 | 0.873672 | 9.322340e-08 | 0.0 | 100.0 |
28 | 29 | 88 | Excitability | 8.561880e-11 | 0.662133 | 2.926070e-07 | 0.0 | 100.0 |
58 | 59 | 31 | CAT_signaling | 4.482160e-10 | 0.689979 | 6.694890e-07 | 0.0 | 100.0 |
72 | 73 | 13 | Protein_cluster | 1.814150e-09 | 0.816884 | 1.346900e-06 | 0.0 | 100.0 |
1 | 2 | 99 | Intracellular_trafficking | 2.364650e-07 | 0.597795 | 1.537740e-05 | 0.0 | 100.0 |
28 | 29 | 88 | Protein_cluster | 8.989080e-07 | 0.615212 | 2.998180e-05 | 0.0 | 100.0 |
16 | 17 | 5 | Structural_plasticity | 1.376690e-05 | 0.845397 | 1.173320e-04 | 0.0 | 100.0 |
3 | 4 | 17 | Cell_metabolism | 2.083410e-05 | 0.764369 | 1.443390e-04 | 0.0 | 100.0 |
6 | 7 | 92 | RPSFB | 3.503870e-05 | 0.599114 | 1.871830e-04 | 0.0 | 100.0 |
1 | 2 | 99 | Neurotransmitter_metabolism | 4.320860e-05 | 0.722620 | 2.078620e-04 | 0.0 | 100.0 |
69 | 70 | 3 | Structural_plasticity | 7.081230e-05 | 0.904585 | 2.660960e-04 | 0.0 | 100.0 |
61 | 62 | 24 | Unknown | 1.111100e-04 | 0.821977 | 3.333130e-04 | 0.0 | 100.0 |
71 | 72 | 44 | CAT_signaling | 4.116570e-04 | 0.656179 | 6.414730e-04 | 0.0 | 100.0 |
15 | 16 | 49 | Structural_plasticity | 7.192340e-04 | 0.580705 | 8.477720e-04 | 0.0 | 100.0 |
28 | 29 | 88 | LGIC_signaling | 9.317240e-04 | 0.737109 | 9.648090e-04 | 0.0 | 100.0 |
27 | 28 | 19 | Unknown | 1.090500e-03 | 0.831836 | 1.043700e-03 | 0.0 | 100.0 |
54 | 55 | 2 | Intracellular_Signal_Transduction | 2.369600e-03 | 0.914893 | 1.537530e-03 | 0.1 | 100.0 |
58 | 59 | 31 | Intracellular_Signal_Transduction | 3.135030e-03 | 0.626750 | 1.767820e-03 | 0.1 | 99.9 |
22 | 23 | 71 | Cell_metabolism | 3.932760e-03 | 0.616346 | 1.979220e-03 | 0.0 | 100.0 |
62 | 63 | 23 | Intracellular_Signal_Transduction | 4.158490e-03 | 0.645144 | 2.034990e-03 | 0.2 | 100.0 |
30 | 31 | 3 | Tyrosine_kinase_signaling | 5.717900e-03 | 0.996023 | 2.384370e-03 | 0.4 | 100.0 |
26 | 27 | 33 | CAT_signaling | 6.526000e-03 | 0.688906 | 2.546260e-03 | 0.4 | 100.0 |
62 | 63 | 23 | LGIC_signaling | 6.928630e-03 | 0.884490 | 2.623100e-03 | 0.7 | 100.0 |
4 | 5 | 17 | RPSFB | 7.016560e-03 | 0.756378 | 2.639570e-03 | 0.5 | 100.0 |
62 | 63 | 23 | Ion_balance/transport | 7.804610e-03 | 0.726029 | 2.782750e-03 | 0.5 | 100.0 |
61 | 62 | 24 | Ion_balance/transport | 8.809520e-03 | 0.747960 | 2.954980e-03 | 1.4 | 100.0 |
58 | 59 | 31 | LGIC_signaling | 1.243410e-02 | 0.888333 | 3.504200e-03 | 0.9 | 99.9 |
67 | 68 | 1 | Endocytosis | 1.335880e-02 | 0.987174 | 3.630470e-03 | 1.3 | 100.0 |
28 | 29 | 88 | CAT_signaling | 1.478950e-02 | 0.600355 | 3.817170e-03 | 1.0 | 99.8 |
32 | 33 | 3 | LGIC_signaling | 1.708820e-02 | 0.981325 | 4.098320e-03 | 1.9 | 100.0 |
60 | 61 | 24 | CAT_signaling | 1.728890e-02 | 0.720771 | 4.121890e-03 | 1.5 | 100.0 |
77 | 78 | 11 | Intracellular_trafficking | 2.088610e-02 | 0.834183 | 4.522150e-03 | 1.5 | 100.0 |
60 | 61 | 24 | Exocytosis | 2.133330e-02 | 0.709442 | 4.569270e-03 | 1.8 | 99.9 |
47 | 48 | 1 | Cell_metabolism | 2.226460e-02 | 0.981423 | 4.665720e-03 | 1.9 | 100.0 |
82 | 83 | 1 | Cell_metabolism | 2.226460e-02 | 0.976534 | 4.665720e-03 | 2.4 | 100.0 |
34 | 35 | 2 | Unknown | 2.277690e-02 | 0.976547 | 4.717850e-03 | 2.4 | 100.0 |
13 | 14 | 95 | RPSFB | 2.376770e-02 | 0.608652 | 4.816930e-03 | 2.2 | 99.6 |
66 | 67 | 1 | Exocytosis | 2.544530e-02 | 0.972712 | 4.979740e-03 | 2.8 | 100.0 |
19 | 20 | 87 | Structural_plasticity | 2.600320e-02 | 0.567603 | 5.032600e-03 | 2.6 | 98.7 |
22 | 23 | 71 | Structural_plasticity | 2.619760e-02 | 0.574707 | 5.050870e-03 | 2.1 | 99.3 |
13 | 14 | 95 | Intracellular_Signal_Transduction | 3.836680e-02 | 0.568902 | 6.074110e-03 | 4.3 | 98.0 |
14 | 15 | 16 | Intracellular_Signal_Transduction | 3.992310e-02 | 0.693810 | 6.191060e-03 | 3.2 | 99.8 |
62 | 63 | 23 | GPCR_signaling | 4.328120e-02 | 0.964559 | 6.434900e-03 | 3.7 | 99.9 |
63 | 64 | 23 | GPCR_signaling | 4.328120e-02 | 0.963602 | 6.434900e-03 | 3.8 | 99.9 |
68 | 69 | 68 | Neurotransmitter_metabolism | 4.361520e-02 | 0.783897 | 6.458550e-03 | 4.1 | 99.7 |
60 | 61 | 24 | GPCR_signaling | 4.513410e-02 | 0.959851 | 6.564830e-03 | 4.2 | 99.9 |
14 | 15 | 16 | Cell_metabolism | 4.753880e-02 | 0.776354 | 6.728960e-03 | 3.3 | 99.9 |
43 | 44 | 1 | Intracellular_Signal_Transduction | 4.898220e-02 | 0.942939 | 6.825170e-03 | 6.0 | 100.0 |
40 | 41 | 1 | Intracellular_Signal_Transduction | 4.898220e-02 | 0.943890 | 6.825170e-03 | 5.9 | 100.0 |
!git annex unlock unweighted_significant_SCH_communities.tsv
unlock unweighted_significant_SCH_communities.tsv (copying...) ok
!git annex unlock weighted_significant_SCH_communities.tsv
unlock weighted_significant_SCH_communities.tsv (copying...) ok
#save these tables
sigschunweighted.sort(columns='p-value').to_csv("unweighted_significant_SCH_communities.tsv",
sep="\t",index=False)
sigschweighted.sort(columns='p-value').to_csv("weighted_significant_SCH_communities.tsv",
sep="\t",index=False)
This file summarises the crossover between the functions and the disease to summarise which functions are most likely to be involved with the disease.
overlap = pd.read_csv("unweightedOUT/overlap_dis_SCHenriched.csv", sep="\t",header=None,names=range(139))
overlap = overlap.fillna('')
We are only interested in the overlap with schizophrenia, so we want to cut out just that column.
overlap = overlap.iloc[:,[0,1,108]]
with open("unweightedOUT/overlap_dis_SCHenriched.csv") as f:
c = csv.reader(f,delimiter="\t")
overlap = {}
for l in c:
try:
if l[0] == 'community':
currentcommunity = l[1]
if len(l) > 108:
if l[108] == '1':
try:
overlap[currentcommunity] += [l[0]]
except KeyError:
overlap[currentcommunity] = [l[0]]
except:
pass
Ok, so we now have access to this data.