The Fraggle similarity algorithm from Jameed Hussain and Gavin Harper has been available in the RDKit since the 2013_09 release.
The algorithm, which is described here: https://github.com/rdkit/UGM_2013/blob/master/Presentations/Hussain.Fraggle.pdf?raw=true , uses the similarity between fragments of the query molecule and the database molecule and is an interesting complement to standard fingerprint similarity.
Here I will take a look at Fraggle using the same tools I applied to the other fingerprinting methods in these two posts:
http://rdkit.blogspot.ch/2013/10/fingerprint-thresholds.html
http://rdkit.blogspot.ch/2013/10/comparing-fingerprints-to-each-other.html
The baseline similarity values for Fraggle are quite high:
Fingerprint | 90% level | 95% level | 99% level |
---|---|---|---|
Fraggle | 0.483 | 0.538 | 0.650 |
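These levels are quantiles of the similarity distribution for randomly paired molecules: the 95% level is the value that 95% of the random pairs fall below. A minimal sketch of that lookup, using the same sort-and-index approach as the code further down; the function name `threshold_at` and the toy scores are made up for illustration and are not the real Fraggle data:

```python
def threshold_at(sims, level):
    """Return the value found at position int(level * n) in the sorted scores.

    Roughly a fraction `level` of the scores lie below the returned value.
    """
    ordered = sorted(sims)
    return ordered[int(level * len(ordered))]


# toy similarity values standing in for random-pair scores (not real data)
toy_sims = [i / 100.0 for i in range(100)]
for level in (0.5, 0.9):
    print('%.2f -> %.2f' % (level, threshold_at(toy_sims, level)))
```

A score above the 95% level is one that only about 1 in 20 random pairs would reach, which is why these values are useful as noise thresholds.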
As expected from the definition, Fraggle similarity tends to be higher than RDKit5 similarity:
This is a nice example of a case where the RDKit5 fingerprint says the molecules are quite dissimilar, but Fraggle provides the expected high similarity score:
Pair index | Fraggle | RDKit5 | Fragment |
---|---|---|---|
15634 | 0.927711 | 0.191693 | [*]c1ncnc2[nH]cnc21 |
Another interesting point about Fraggle is that it pulls back compounds that are quite complementary to the other methods we've looked at. To demonstrate, here is the fractional overlap in the top 100 pairs found by Fraggle and a few other fingerprints:
Fingerprint 1 | Fingerprint 2 | Fraction in common (top 100) |
---|---|---|
Fraggle | AP | 0.18 |
Fraggle | Avalon-1024 | 0.16 |
Fraggle | RDKit5 | 0.24 |
Fraggle | TT | 0.21 |
AP | Avalon-1024 | 0.58 |
AP | RDKit5 | 0.69 |
AP | TT | 0.86 |
Avalon-1024 | RDKit5 | 0.56 |
Avalon-1024 | TT | 0.60 |
RDKit5 | TT | 0.70 |
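The overlap numbers in this table come from taking the N best-scoring pairs for each method and counting how many pair ids the two sets share. A small sketch of that calculation on invented scores; the helper name `top_n_overlap` and the two toy lists are hypothetical, and only the `(similarity, pair_id)` tuple layout is taken from the code in this post:

```python
def top_n_overlap(scored_a, scored_b, n):
    """Fraction of pair ids shared by the n best-scoring entries of two lists.

    Each list holds (similarity, pair_id) tuples.
    """
    top_a = set(pid for _, pid in sorted(scored_a, reverse=True)[:n])
    top_b = set(pid for _, pid in sorted(scored_b, reverse=True)[:n])
    return len(top_a & top_b) / float(n)


# toy scores for five pairs under two hypothetical similarity methods
method_a = [(0.9, 0), (0.8, 1), (0.3, 2), (0.2, 3), (0.1, 4)]
method_b = [(0.7, 0), (0.1, 1), (0.6, 2), (0.2, 3), (0.9, 4)]
print('%.2f' % top_n_overlap(method_a, method_b, 2))
```

A value near 1 means the two methods agree on which pairs are most similar; a value near 0 means they retrieve largely different pairs.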
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors
from rdkit.Avalon import pyAvalonTools
from rdkit.Chem import Draw
from rdkit.Chem.Fraggle import FraggleSim
from rdkit.Chem.Draw import IPythonConsole
from rdkit import rdBase
from rdkit import DataStructs
from collections import defaultdict
import cPickle,random,gzip,time
import scipy as sp
import pandas
from rdkit.Chem import PandasTools
PandasTools.RenderImagesInAllDataFrames()
from scipy import stats
from IPython.core.display import display,HTML,Javascript
print rdBase.rdkitVersion
2014.03.1pre
ind = [x.split() for x in gzip.open('../data/chembl16_25K.pairs.txt.gz')]
ms1 = []
ms2 = []
for i,row in enumerate(ind):
    m1 = Chem.MolFromSmiles(row[1])
    ms1.append((row[0],m1))
    m2 = Chem.MolFromSmiles(row[3])
    ms2.append((row[2],m2))
random.seed(23)
random.shuffle(ms2)
t1=time.time()
sims=[]
for i,(m1,m2) in enumerate(zip(ms1,ms2)):
    sim,frag= FraggleSim.GetFraggleSimilarity(m1[-1],m2[-1])
    sims.append((sim,i))
    if not (i%200):
        print 'Done: %d in %.2f seconds'%(i,time.time()-t1)
t2=time.time()
print 'Finished in %.2f seconds'%(t2-t1)
cPickle.dump(sims,gzip.open('../data/chembl16_25K.fraggle_randompairs.sims.pkl.gz','wb+'))
sl = sorted(sims)
np = len(sl)
for bin in (.7,.8,.9,.95,.99):
    print bin,sl[int(bin*np)]
hist([x[0] for x in sims],bins=20)
xlabel("Fraggle")
0.7 (0.37727272727272726, 11580)
0.8 (0.4196078431372549, 17489)
0.9 (0.4826254826254826, 393)
0.95 (0.5377358490566038, 3077)
0.99 (0.65, 17818)
scoredLists = cPickle.load(gzip.open('../data/chembl16_25K.pairs.sims.pkl.gz','rb'))
t1=time.time()
rl=[]
for i,(m1,m2) in enumerate(zip(ms1,ms2)):
    sim,frag= FraggleSim.GetFraggleSimilarity(m1[-1],m2[-1])
    rl.append((sim,i))
    if not (i%200):
        print 'Done: %d in %.2f seconds'%(i,time.time()-t1)
t2=time.time()
print 'Finished in %.2f seconds'%(t2-t1)
scoredLists['Fraggle']=rl
Done: 0 in 0.10 seconds
Done: 200 in 37.79 seconds
Done: 400 in 83.12 seconds
...
Done: 24600 in 5443.92 seconds
Done: 24800 in 5488.28 seconds
Finished in 5535.28 seconds
cPickle.dump(scoredLists,gzip.open('../data/chembl16_25K.pairs.sims2.pkl.gz','wb+'))
Load the lists
scoredLists = cPickle.load(gzip.open('../data/chembl16_25K.pairs.sims2.pkl.gz','rb'))
def directCompare(scoredLists,fp1,fp2,plotIt=True,silent=False):
    l1 = scoredLists[fp1]
    l2 = scoredLists[fp2]
    rl1=[x[-1] for x in l1]
    rl2=[x[-1] for x in l2]
    vl1=[x[0] for x in l1]
    vl2=[x[0] for x in l2]
    if plotIt:
        _=scatter(vl1,vl2,edgecolors='none')
        maxvl1=max(vl1)
        minvl1=min(vl1)
        maxvl2=max(vl2)
        minvl2=min(vl2)
        _=plot((minvl1,maxvl1),(minvl2,maxvl2),color='k',linestyle='-')
        xlabel(fp1)
        ylabel(fp2)
    tau,tau_p=stats.kendalltau(vl1,vl2)
    spearman_rho,spearman_p=stats.spearmanr(vl1,vl2)
    pearson_r,pearson_p = stats.pearsonr(vl1,vl2)
    if not silent:
        print fp1,fp2,tau,tau_p,spearman_rho,spearman_p,pearson_r,pearson_p
    return tau,spearman_rho,pearson_r
_=directCompare(scoredLists,'Fraggle','RDKit5')
Fraggle RDKit5 0.510174399518 0.0 0.676266099876 0.0 0.734593163378 0.0
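The numbers printed here are, in order, Kendall's tau, Spearman's rho and Pearson's r, each followed by its p-value. As a reminder of what tau measures, here is a pure-Python sketch of the simplest variant (tau-a), run on invented values. It is only an illustration: as far as I know `stats.kendalltau` computes the tie-corrected tau-b by default, so the two can differ when scores are tied.

```python
def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs.

    O(n^2) and without any tie correction, so it is only meant to
    illustrate the idea on small inputs.
    """
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2.0)


# toy similarity values for four pairs under two methods (not real data)
a = [0.1, 0.4, 0.6, 0.9]
b = [0.2, 0.3, 0.8, 0.7]
print('%.3f' % kendall_tau_a(a, b))
```

Tau only looks at whether two methods rank pairs in the same order, which makes it less sensitive than Pearson's r to the different score ranges of the two methods.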
That's an interesting shape...
We'll get ready by loading the data into a Pandas data frame.
df = pandas.DataFrame(index=range(len(ms1)),columns=['mol1','mol2','Fraggle','RDKit5'])
df.mol1 = [x[1] for x in ms1]
df.mol2 = [x[1] for x in ms2]
df.Fraggle = [x[0] for x in scoredLists['Fraggle']]
df.RDKit5 = [x[0] for x in scoredLists['RDKit5']]
And now select the subset of pairs with low RDKit5 similarity but high Fraggle similarity:
subset = df[(df.RDKit5<0.2)&(df.Fraggle>0.8)]
subset.sort(columns=['Fraggle'],ascending=False,inplace=True)
len(subset)
62
Add the fragment that Fraggle is using to each row:
frags = []
for row in subset.itertuples():
    m1 = row[1]
    m2 = row[2]
    sim,frag= FraggleSim.GetFraggleSimilarity(m1,m2)
    frags.append(frag)
mfrags = [Chem.MolFromSmiles(x) for x in frags]
subset['Fragment']=frags
subset['FragMol']=mfrags
subset
Pair index | Fraggle | RDKit5 | Fragment |
---|---|---|---|
2768 | 1.000000 | 0.198157 | [*]C(F)(F)Cl.[*]C(F)(Cl)C(F)(F)F |
2937 | 1.000000 | 0.128205 | [*]C[Se](=O)O |
7696 | 1.000000 | 0.157738 | [*]c1ncnc(N)c1[*] |
21156 | 1.000000 | 0.184080 | [*]CCC.[*]c1c2ccccc2nc2ccccc12 |
3347 | 1.000000 | 0.104478 | [*]CC(C)(C)CO.[*]C(C)(C)CO |
6534 | 1.000000 | 0.071942 | [*]CCNC.[*]CNC |
10494 | 0.969231 | 0.079365 | [*]CCCCCCCCCCCCCCCCC |
23207 | 0.964602 | 0.164706 | [*]SCCO.[*][N+](=O)[O-] |
6250 | 0.952096 | 0.172185 | [*]c1ccccc1[*] |
24245 | 0.950000 | 0.185687 | [*]c1cccc[n+]1[O-].[*]C(C)c1cc(C)ccc1C |
15887 | 0.950000 | 0.176136 | [*][C@@H]1CCCNC1 |
17667 | 0.949580 | 0.161392 | [*]c1ccccc1.[*]N1C(=O)CNC1=O |
17500 | 0.948718 | 0.156951 | [*]CCCCCCCCCCC.[*]CP(=O)(OC)OC |
19213 | 0.931034 | 0.190476 | [*]NC(=N)CN.[*]C(=O)O |
21961 | 0.929412 | 0.168790 | [*]CSC#N.[*]c1ccccc1[*] |
15634 | 0.927711 | 0.191693 | [*]c1ncnc2[nH]cnc21 |
17356 | 0.925926 | 0.120805 | [*]CCCCCC.[*]CC(N)=O |
19129 | 0.919355 | 0.174863 | [*]c1cncc(Cl)c1 |
13401 | 0.918750 | 0.184275 | [*]CC1CC1.[*]c1ccccc1Br |
22933 | 0.916667 | 0.183784 | [*]CC#C.[*]c1ncccn1 |
4404 | 0.907216 | 0.186667 | [*]CSC.[*]c1ccccc1 |
12294 | 0.894737 | 0.099010 | [*]c1ccccc1.[*]N(C)C |
16786 | 0.893617 | 0.112426 | [*]c1sc[n+](C)c1C.[*]c1sc[n+](C)c1C |
13760 | 0.887218 | 0.190283 | [*]COC(N)=O.[*][N+](=O)[O-] |
4473 | 0.885417 | 0.145985 | [*]c1ccccc1.[*]c1ccccc1 |
19112 | 0.883721 | 0.157598 | [*]c1cn2ccsc2n1.[*]n1nc(C)cc1C |
17148 | 0.882353 | 0.166667 | [*]CCCCCCCCCCCCC |
6334 | 0.882353 | 0.190678 | [*]c1ccccc1[*] |
16077 | 0.879518 | 0.123684 | [*]c1nc(C)nn1[*].[*]c1ccccc1 |
8779 | 0.878505 | 0.152685 | [*]c1nc2nnnc-2c(O)n1[*] |
2002 | 0.875000 | 0.154667 | [*]c1ccco1.[*]c1ncnn1[*] |
5529 | 0.875000 | 0.135714 | [*]C(N)=O.[*]C(CC)CCCC |
15573 | 0.859813 | 0.090196 | [*]C(CSCCCCCCCCCCCCCCCC)OC.[*][n+]1ccsc1 |
17492 | 0.858824 | 0.182573 | [*]c1ccc2c[nH]nc2c1 |
20831 | 0.858824 | 0.182573 | [*]c1ccc2c[nH]nc2c1 |
2570 | 0.853659 | 0.111842 | [*]C(=O)CCCCCCC.[*]C(=O)CCCCCCC |
23156 | 0.853659 | 0.130081 | [*]CCCCCCCCC.[*]OC(=O)C=C |
13103 | 0.853333 | 0.091892 | [*]CCCCCCCCCCC.[*]C(N)=O |
18140 | 0.851852 | 0.197101 | [*]C(=O)OC(C)(C)C.[*]C(=O)OC(C)(C)C |
24087 | 0.851351 | 0.180851 | [*]c1c(C)ncn1[*] |
6051 | 0.850000 | 0.182927 | [*]C(=O)OCC.[*]C(=O)OCC |
7595 | 0.839161 | 0.094955 | [*]OC=O.[*]C(C(=O)O)C(=O)O |
3185 | 0.838323 | 0.111111 | [*]N1CCOCC1.[*]S(C)(=O)=O |
15940 | 0.835821 | 0.127490 | [*]/C=C(\O)C(=O)O.[*]C(C)C |
20472 | 0.831858 | 0.160458 | [*]C(=O)Cc1ccsc1.[*]C(=O)NC1CCCCCC1 |
16457 | 0.829787 | 0.197581 | [*]C(=O)OCC.[*]C(=O)C(C)N1CCOCC1 |
1283 | 0.828571 | 0.158301 | [*]c1ccccc1.[*]C(C)C |
8629 | 0.827273 | 0.183333 | [*]CCC(N)=O.[*]NCc1ccccc1 |
10367 | 0.827273 | 0.161812 | [*]CCc1ccccc1.[*]C(CCS)C(=O)O |
9572 | 0.826446 | 0.164080 | [*]c1ccc(Cl)c(Cl)c1.[*]c1cc(Cl)c(Cl)cc1[*] |
13684 | 0.824074 | 0.130233 | [*]C1CN2CCC1C2 |
12972 | 0.823529 | 0.154762 | [*]c1c(Br)cnn1[*] |
22001 | 0.822785 | 0.136364 | [*]C(=O)O.[*]c1ncsc1[*] |
5087 | 0.822222 | 0.173913 | [*]CC(C)(C)C(=O)O |
19447 | 0.816327 | 0.184507 | [*]c1cc2ccccc2o1.[*]c1nnnn1C |
8002 | 0.810345 | 0.115672 | [*]CCCC.[*]C(O)(P(=O)(O)O)P(=O)(O)O |
18304 | 0.809091 | 0.153226 | [*]c1ccco1.[*]c1ccccn1 |
17612 | 0.809091 | 0.162791 | [*]c1ccco1.[*]c1ccncc1 |
225 | 0.806122 | 0.198330 | [*]c1ccccc1[*] |
6688 | 0.804598 | 0.138122 | [*]CCO.[*]c1ccccc1[*] |
16812 | 0.802083 | 0.188732 | [*]CCCCC.[*]c1nnc(N)s1 |
14552 | 0.801980 | 0.157407 | [*]NC(N)=S.[*]Oc1cccc2ccccc21 |
Here's a particularly nice example where a small change in the middle of the molecule (N->S) destroys what would otherwise be a fairly high RDKit similarity, but where Fraggle still produces a high score:
subset[subset.index==15634]
Pair index | Fraggle | RDKit5 | Fragment |
---|---|---|---|
15634 | 0.927711 | 0.191693 | [*]c1ncnc2[nH]cnc21 |
Demonstrate the disproportionate influence of the central S by replacing it with an N and repeating the similarity calculations:
Chem.MolToSmiles(subset.ix[15634]['mol2'],True)
'CCCSc1ncnc2[nH]ncc21'
tmol = Chem.MolFromSmiles('CCCNc1ncnc2[nH]ncc21')
fp1 = Chem.RDKFingerprint(subset.ix[15634]['mol1'],maxPath=5)
fp2 = Chem.RDKFingerprint(tmol,maxPath=5)
print 'RDKit5: ',DataStructs.TanimotoSimilarity(fp1,fp2)
print 'Fraggle: ',FraggleSim.GetFraggleSimilarity(subset.ix[15634]['mol1'],tmol)
RDKit5: 0.501992031873
Fraggle: (0.927710843373494, '[*]c1ncnc2[nH]cnc21')
The RDKit5 similarity is now well above the random threshold (0.29 at the 95% level), but there's no impact on Fraggle.
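For reference, the Tanimoto similarity used for the RDKit5 value is the number of bits set in both fingerprints divided by the number set in either one. A minimal sketch on plain Python sets of on-bit indices; the bit numbers are invented for illustration, and the value returned for two empty fingerprints is a convention chosen here, not necessarily what `DataStructs.TanimotoSimilarity` does:

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two collections of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen for this sketch only
    return len(a & b) / float(len(union))


# toy on-bit sets (hypothetical, not real RDKit fingerprints)
fp_a = [1, 2, 3, 4]
fp_b = [3, 4, 5, 6]
print('%.3f' % tanimoto(fp_a, fp_b))
```

Because every bit counts equally, changing one atom that many bit-setting paths run through can remove a large share of the common bits at once.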
subset2 = df[(df.RDKit5>.5)&(df.Fraggle<0.1)]
subset2.sort(columns=['RDKit5'],ascending=False,inplace=True)
len(subset2)
38
frags = []
for row in subset2.itertuples():
    m1 = row[1]
    m2 = row[2]
    sim,frag= FraggleSim.GetFraggleSimilarity(m1,m2)
    frags.append(frag)
subset2['Fragment']=frags
subset2
Pair index | Fraggle | RDKit5 | Fragment |
---|---|---|---|
2958 | 0 | 1.000000 | None |
4568 | 0 | 1.000000 | None |
21906 | 0 | 1.000000 | None |
11718 | 0 | 1.000000 | None |
11745 | 0 | 1.000000 | None |
1581 | 0 | 0.987552 | None |
9838 | 0 | 0.987552 | None |
20969 | 0 | 0.987552 | None |
10615 | 0 | 0.986063 | None |
16125 | 0 | 0.881890 | None |
17498 | 0 | 0.875472 | None |
22214 | 0 | 0.849057 | None |
17812 | 0 | 0.816901 | None |
22514 | 0 | 0.798701 | None |
23507 | 0 | 0.720000 | None |
19071 | 0 | 0.714744 | None |
19067 | 0 | 0.713948 | None |
7808 | 0 | 0.711111 | None |
12405 | 0 | 0.692000 | None |
21204 | 0 | 0.687764 | None |
18018 | 0 | 0.657895 | None |
4167 | 0 | 0.654867 | None |
8750 | 0 | 0.650000 | None |
7008 | 0 | 0.630252 | None |
12990 | 0 | 0.630000 | None |
8355 | 0 | 0.623377 | None |
21753 | 0 | 0.602941 | None |
9773 | 0 | 0.598361 | None |
3180 | 0 | 0.589474 | None |
19333 | 0 | 0.576119 | None |
8083 | 0 | 0.575758 | None |
22375 | 0 | 0.559633 | None |
20954 | 0 | 0.555556 | None |
5359 | 0 | 0.548173 | None |
20305 | 0 | 0.542601 | None |
22193 | 0 | 0.542601 | None |
15618 | 0 | 0.536585 | None |
12669 | 0 | 0.531056 | None |
At first these seem somewhat surprising, but they arise simply because the molecules don't generate any fragments, so Fraggle reports a similarity of 0 and no fragment. Here is an example:
subset2.ix[21906]['mol1']
FraggleSim.generate_fraggle_fragmentation(subset2.ix[21906]['mol1'])
set()
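Since these pairs get a Fraggle similarity of 0 and a fragment of `None` only because nothing could be fragmented, one option is to separate them out before comparing methods, so they are not read as genuinely dissimilar. A sketch on plain tuples; the helper name `split_unfragmentable` is hypothetical, and the three sample rows are copied from the tables above:

```python
def split_unfragmentable(results):
    """Split (pair_id, similarity, fragment) tuples by whether a fragment was found.

    Returns (scored, skipped): the tuples that have a fragment, and the
    pair ids for which the fragment is None.
    """
    scored, skipped = [], []
    for pair_id, sim, frag in results:
        if frag is None:
            skipped.append(pair_id)
        else:
            scored.append((pair_id, sim, frag))
    return scored, skipped


# rows taken from the result tables in this post
results = [
    (15634, 0.927711, '[*]c1ncnc2[nH]cnc21'),
    (21906, 0, None),
    (2958, 0, None),
]
scored, skipped = split_unfragmentable(results)
print('%d scored, %d skipped' % (len(scored), len(skipped)))
```

Whether to drop such pairs or fall back to another similarity measure for them depends on the application; this sketch only makes the distinction explicit.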
This is repeating the last bit of analysis from http://rdkit.blogspot.ch/2013/10/comparing-fingerprints-to-each-other.html
nToDo=200
apl = sorted(scoredLists['AP'],reverse=True)[:nToDo]
ttl = sorted(scoredLists['TT'],reverse=True)[:nToDo]
avl = sorted(scoredLists['Avalon-1024'],reverse=True)[:nToDo]
rdkl = sorted(scoredLists['RDKit5'],reverse=True)[:nToDo]
fragl = sorted(scoredLists['Fraggle'],reverse=True)[:nToDo]
idsToKeep=set()
idsToKeep.update([x[1] for x in apl])
idsToKeep.update([x[1] for x in ttl])
idsToKeep.update([x[1] for x in avl])
idsToKeep.update([x[1] for x in rdkl])
idsToKeep.update([x[1] for x in fragl])
print 'Overall number:',len(idsToKeep)
ids={}
ids['AP']=set([x[1] for x in apl])
ids['TT']=set([x[1] for x in ttl])
ids['Avalon-1024']=set([x[1] for x in avl])
ids['RDKit5']=set([x[1] for x in rdkl])
ids['Fraggle']=set([x[1] for x in fragl])
ks = sorted(ids.keys())
for i,k in enumerate(ks):
    for j in range(i+1,len(ks)):
        overlap=len(ids[k].intersection(ids[ks[j]]))
        print ks[i],ks[j],overlap,'%.2f'%(float(overlap)/len(apl))
Overall number: 475
AP Avalon-1024 102 0.51
AP Fraggle 77 0.39
AP RDKit5 112 0.56
AP TT 137 0.69
Avalon-1024 Fraggle 68 0.34
Avalon-1024 RDKit5 125 0.62
Avalon-1024 TT 111 0.56
Fraggle RDKit5 82 0.41
Fraggle TT 70 0.35
RDKit5 TT 117 0.58
nToDo=100
apl = sorted(scoredLists['AP'],reverse=True)[:nToDo]
ttl = sorted(scoredLists['TT'],reverse=True)[:nToDo]
avl = sorted(scoredLists['Avalon-1024'],reverse=True)[:nToDo]
rdkl = sorted(scoredLists['RDKit5'],reverse=True)[:nToDo]
fragl = sorted(scoredLists['Fraggle'],reverse=True)[:nToDo]
idsToKeep=set()
idsToKeep.update([x[1] for x in apl])
idsToKeep.update([x[1] for x in ttl])
idsToKeep.update([x[1] for x in avl])
idsToKeep.update([x[1] for x in rdkl])
idsToKeep.update([x[1] for x in fragl])
print 'Overall number:',len(idsToKeep)
ids={}
ids['AP']=set([x[1] for x in apl])
ids['TT']=set([x[1] for x in ttl])
ids['Avalon-1024']=set([x[1] for x in avl])
ids['RDKit5']=set([x[1] for x in rdkl])
ids['Fraggle']=set([x[1] for x in fragl])
ks = sorted(ids.keys())
for i,k in enumerate(ks):
    for j in range(i+1,len(ks)):
        overlap=len(ids[k].intersection(ids[ks[j]]))
        print ks[i],ks[j],overlap,'%.2f'%(float(overlap)/len(apl))
Overall number: 240
AP Avalon-1024 58 0.58
AP Fraggle 18 0.18
AP RDKit5 69 0.69
AP TT 86 0.86
Avalon-1024 Fraggle 16 0.16
Avalon-1024 RDKit5 56 0.56
Avalon-1024 TT 60 0.60
Fraggle RDKit5 24 0.24
Fraggle TT 21 0.21
RDKit5 TT 70 0.70
Fraggle is really pulling back different compounds.