Comparing Fraggle to other fingerprints¶

The Fraggle similarity algorithm from Jameed Hussain and Gavin Harper has been available in the RDKit since the 2013_09 release.

The algorithm uses the similarity between fragments of the query molecule and the database molecule, which makes it an interesting complement to standard fingerprint similarity. It is described here: https://github.com/rdkit/UGM_2013/blob/master/Presentations/Hussain.Fraggle.pdf?raw=true

Here I will take a look at Fraggle using the same tools I applied to the other fingerprinting methods in these two posts:

http://rdkit.blogspot.ch/2013/10/fingerprint-thresholds.html

http://rdkit.blogspot.ch/2013/10/comparing-fingerprints-to-each-other.html

TL;DR Summary¶

The baseline similarity values for Fraggle are quite high:

Fingerprint	Metric	90% level	95% level	99% level
Fraggle		0.483	0.538	0.650

As expected from the definition, Fraggle similarity tends to be higher than RDKit5 similarity:

This is a nice example of a case where the RDKit5 fingerprint says the molecules are quite dissimilar, but Fraggle provides the expected high similarity score:

	mol1	mol2	Fraggle	RDKit5	Fragment	FragMol
15634			0.927711	0.191693	[*]c1ncnc2[nH]cnc21

Another interesting point about Fraggle is that it pulls back compounds that are quite complementary to those found by the other methods we've looked at. To demonstrate, here is the fraction of overlap in the top 100 pairs found by Fraggle and a few other fingerprints:

Fingerprint 1	Fingerprint 2	Fraction in common (top 100)
Fraggle	AP	0.18
Fraggle	Avalon-1024	0.16
Fraggle	RDKit5	0.24
Fraggle	TT	0.21
AP	Avalon-1024	0.58
AP	RDKit5	0.69
AP	TT	0.86
Avalon-1024	RDKit5	0.56
Avalon-1024	TT	0.60
RDKit5	TT	0.70

Move on to actually do the work¶

In [3]:

from rdkit import Chem
from rdkit.Chem import rdMolDescriptors
from rdkit.Avalon import pyAvalonTools
from rdkit.Chem import Draw
from rdkit.Chem.Fraggle import FraggleSim
from rdkit.Chem.Draw import IPythonConsole
from rdkit import rdBase
from rdkit import DataStructs
from collections import defaultdict
import cPickle,random,gzip,time
import scipy as sp
import pandas
from rdkit.Chem import PandasTools
PandasTools.RenderImagesInAllDataFrames()
from scipy import stats
from IPython.core.display import display,HTML,Javascript

print rdBase.rdkitVersion

2014.03.1pre

Start with finding the baseline similarity value¶

Read in the molecule pairs and shuffle them so that we have random pairs¶

In [2]:

ind = [x.split() for x in gzip.open('../data/chembl16_25K.pairs.txt.gz')]
ms1 = []
ms2 = []
for i,row in enumerate(ind):
    m1 = Chem.MolFromSmiles(row[1])
    ms1.append((row[0],m1))
    m2 = Chem.MolFromSmiles(row[3])
    ms2.append((row[2],m2))
    

In [3]:

random.seed(23)
random.shuffle(ms2)
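
The idea behind the shuffle can be shown on a toy list. This is only a sketch: the helper `make_random_pairs` and the lists `left` and `right` are made up for the illustration and are not part of the notebook, which shuffles `ms2` in place instead.

```python
import random

def make_random_pairs(left, right, seed=23):
    """Pair each item of `left` with a randomly chosen item of `right`.

    A shuffled copy of `right` is used, so the caller's list is untouched.
    """
    shuffled = list(right)
    random.Random(seed).shuffle(shuffled)
    return list(zip(left, shuffled))

# hypothetical toy data: item i of `left` is "related" to item i of `right`
left = ['a%d' % i for i in range(10)]
right = ['b%d' % i for i in range(10)]
pairs = make_random_pairs(left, right)

# count how many of the original pairings survived the shuffle
still_matched = sum(1 for x, y in pairs if x[1:] == y[1:])
print("pairs that still line up: %d of %d" % (still_matched, len(pairs)))
```

Fixing the seed keeps the random pairing reproducible from run to run on the same Python version.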

In [ ]:

t1=time.time()
sims=[]
for i,(m1,m2) in enumerate(zip(ms1,ms2)):
    sim,frag= FraggleSim.GetFraggleSimilarity(m1[-1],m2[-1])
    sims.append((sim,i))
    if not (i%200):
        print 'Done: %d in %.2f seconds'%(i,time.time()-t1)
t2=time.time()
print 'Finished in %.2f seconds'%(t2-t1)

In [5]:

cPickle.dump(sims,gzip.open('../data/chembl16_25K.fraggle_randompairs.sims.pkl.gz','wb+'))

Here's the analysis¶

In [7]:

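sl = sorted(sims)
npts = len(sl)
# report the similarity found at each fraction of the sorted random-pair list
for frac in (.7,.8,.9,.95,.99):
    print frac,sl[int(frac*npts)]
# hist() and xlabel() are assumed to come from pylab mode (e.g. %pylab inline)
hist([x[0] for x in sims],bins=20)
xlabel("Fraggle")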

0.7 (0.37727272727272726, 11580)
0.8 (0.4196078431372549, 17489)
0.9 (0.4826254826254826, 393)
0.95 (0.5377358490566038, 3077)
0.99 (0.65, 17818)

Out[7]:

<matplotlib.text.Text at 0x925f2250>
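
The threshold lookup used in the cell above can be written as a small helper. This is a minimal sketch: `threshold_at` is a made-up name, and the index is clamped so that a fraction of 1.0 does not run off the end of the list, which the notebook code does not need because it stops at 0.99.

```python
def threshold_at(values, frac):
    """Return the value found `frac` of the way through the sorted list."""
    ordered = sorted(values)
    # clamp the index so that frac=1.0 returns the largest value
    idx = min(int(frac * len(ordered)), len(ordered) - 1)
    return ordered[idx]

# hypothetical toy data standing in for the random-pair similarities
toy_sims = list(range(1000))
for frac in (0.5, 1.0):
    print("%.2f -> %d" % (frac, threshold_at(toy_sims, frac)))
```

Read this way, the 95% level is the similarity value that only about 5% of the random pairs exceed.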

Now do the same thing for the related compound pairs.¶

In [3]:

scoredLists = cPickle.load(gzip.open('../data/chembl16_25K.pairs.sims.pkl.gz','rb'))

In [8]:

t1=time.time()
rl=[]
for i,(m1,m2) in enumerate(zip(ms1,ms2)):
    sim,frag= FraggleSim.GetFraggleSimilarity(m1[-1],m2[-1])
    rl.append((sim,i))
    if not (i%200):
        print 'Done: %d in %.2f seconds'%(i,time.time()-t1)
t2=time.time()
print 'Finished in %.2f seconds'%(t2-t1)
scoredLists['Fraggle']=rl

Done: 0 in 0.10 seconds
Done: 200 in 37.79 seconds
Done: 400 in 83.12 seconds
[... progress lines for pairs 600 through 24400 omitted ...]
Done: 24600 in 5443.92 seconds
Done: 24800 in 5488.28 seconds
Finished in 5535.28 seconds

In [9]:

cPickle.dump(scoredLists,gzip.open('../data/chembl16_25K.pairs.sims2.pkl.gz','wb+'))

Load the lists

In [4]:

scoredLists = cPickle.load(gzip.open('../data/chembl16_25K.pairs.sims2.pkl.gz','rb'))

In [11]:

def directCompare(scoredLists,fp1,fp2,plotIt=True,silent=False):
    l1 = scoredLists[fp1]
    l2 = scoredLists[fp2]
    rl1=[x[-1] for x in l1]
    rl2=[x[-1] for x in l2]
    vl1=[x[0] for x in l1]
    vl2=[x[0] for x in l2]
    if plotIt:
        _=scatter(vl1,vl2,edgecolors='none')
        maxvl1=max(vl1)
        minvl1=min(vl1)
        maxvl2=max(vl2)
        minvl2=min(vl2)
        _=plot((minvl1,maxvl1),(minvl2,maxvl2),color='k',linestyle='-')
        xlabel(fp1)
        ylabel(fp2)
    
    tau,tau_p=stats.kendalltau(vl1,vl2)
    spearman_rho,spearman_p=stats.spearmanr(vl1,vl2)
    pearson_r,pearson_p = stats.pearsonr(vl1,vl2)
    if not silent:
        print fp1,fp2,tau,tau_p,spearman_rho,spearman_p,pearson_r,pearson_p
    return tau,spearman_rho,pearson_r

The Fraggle algorithm makes use of the RDKit5 fingerprint, so let's look at the comparison to that.¶

In [12]:

_=directCompare(scoredLists,'Fraggle','RDKit5')

Fraggle RDKit5 0.510174399518 0.0 0.676266099876 0.0 0.734593163378 0.0

That's an interesting shape...
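
The first number printed by `directCompare` is Kendall's tau. As a rough sketch of what it measures, here is a plain-Python version that ignores ties (tau-a). `stats.kendalltau` corrects for ties, so its value on real data that contains ties will differ somewhat. The function name and the toy scores are made up for the illustration.

```python
from itertools import combinations

def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    Needs at least two points; tied pairs count as neither
    concordant nor discordant.
    """
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return float(concordant - discordant) / n_pairs

# hypothetical similarity values for four molecule pairs
toy_fraggle = [0.9, 0.5, 0.7, 0.2]
toy_rdkit5 = [0.8, 0.6, 0.3, 0.1]
print("tau-a: %.3f" % kendall_tau_a(toy_fraggle, toy_rdkit5))
```

A value of 1 means the two methods rank every pair in the same order, and -1 means the rankings are exactly reversed.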

Let's take a look at the points where the Fraggle similarity is high but the RDKit similarity is low.¶

We'll get ready by loading the data into a Pandas data frame.

In [31]:

df = pandas.DataFrame(index=range(len(ms1)),columns=['mol1','mol2','Fraggle','RDKit5'])
df.mol1 = [x[1] for x in ms1]
df.mol2 = [x[1] for x in ms2]
df.Fraggle = [x[0] for x in scoredLists['Fraggle']]
df.RDKit5 = [x[0] for x in scoredLists['RDKit5']]

And now do the subset

In [55]:

# combine both conditions in a single boolean mask
subset = df[(df.RDKit5<0.2)&(df.Fraggle>0.8)]
subset.sort(columns=['Fraggle'],ascending=False,inplace=True)
len(subset)

Out[55]:

Add the fragment that Fraggle is using to each row:

In [56]:

frags = []
for row in subset.itertuples():
    m1 = row[1]
    m2 = row[2]
    sim,frag= FraggleSim.GetFraggleSimilarity(m1,m2)
    frags.append(frag)   
mfrags = [Chem.MolFromSmiles(x) for x in frags]
subset['Fragment']=frags
subset['FragMol']=mfrags

In [54]:

subset

Out[54]:

	Fraggle	RDKit5	Fragment
2768	1.000000	0.198157	[*]C(F)(F)Cl.[*]C(F)(Cl)C(F)(F)F
2937	1.000000	0.128205	[*]C[Se](=O)O
7696	1.000000	0.157738	[*]c1ncnc(N)c1[*]
21156	1.000000	0.184080	[*]CCC.[*]c1c2ccccc2nc2ccccc12
3347	1.000000	0.104478	[*]CC(C)(C)CO.[*]C(C)(C)CO
6534	1.000000	0.071942	[*]CCNC.[*]CNC
10494	0.969231	0.079365	[*]CCCCCCCCCCCCCCCCC
23207	0.964602	0.164706	[*]SCCO.[*][N+](=O)[O-]
6250	0.952096	0.172185	[*]c1ccccc1[*]
24245	0.950000	0.185687	[*]c1cccc[n+]1[O-].[*]C(C)c1cc(C)ccc1C
15887	0.950000	0.176136	[*][C@@H]1CCCNC1
17667	0.949580	0.161392	[*]c1ccccc1.[*]N1C(=O)CNC1=O
17500	0.948718	0.156951	[*]CCCCCCCCCCC.[*]CP(=O)(OC)OC
19213	0.931034	0.190476	[*]NC(=N)CN.[*]C(=O)O
21961	0.929412	0.168790	[*]CSC#N.[*]c1ccccc1[*]
15634	0.927711	0.191693	[*]c1ncnc2[nH]cnc21
17356	0.925926	0.120805	[*]CCCCCC.[*]CC(N)=O
19129	0.919355	0.174863	[*]c1cncc(Cl)c1
13401	0.918750	0.184275	[*]CC1CC1.[*]c1ccccc1Br
22933	0.916667	0.183784	[*]CC#C.[*]c1ncccn1
4404	0.907216	0.186667	[*]CSC.[*]c1ccccc1
12294	0.894737	0.099010	[*]c1ccccc1.[*]N(C)C
16786	0.893617	0.112426	[*]c1sc[n+](C)c1C.[*]c1sc[n+](C)c1C
13760	0.887218	0.190283	[*]COC(N)=O.[*][N+](=O)[O-]
4473	0.885417	0.145985	[*]c1ccccc1.[*]c1ccccc1
19112	0.883721	0.157598	[*]c1cn2ccsc2n1.[*]n1nc(C)cc1C
17148	0.882353	0.166667	[*]CCCCCCCCCCCCC
6334	0.882353	0.190678	[*]c1ccccc1[*]
16077	0.879518	0.123684	[*]c1nc(C)nn1[*].[*]c1ccccc1
8779	0.878505	0.152685	[*]c1nc2nnnc-2c(O)n1[*]
2002	0.875000	0.154667	[*]c1ccco1.[*]c1ncnn1[*]
5529	0.875000	0.135714	[*]C(N)=O.[*]C(CC)CCCC
15573	0.859813	0.090196	[*]C(CSCCCCCCCCCCCCCCCC)OC.[*][n+]1ccsc1
17492	0.858824	0.182573	[*]c1ccc2c[nH]nc2c1
20831	0.858824	0.182573	[*]c1ccc2c[nH]nc2c1
2570	0.853659	0.111842	[*]C(=O)CCCCCCC.[*]C(=O)CCCCCCC
23156	0.853659	0.130081	[*]CCCCCCCCC.[*]OC(=O)C=C
13103	0.853333	0.091892	[*]CCCCCCCCCCC.[*]C(N)=O
18140	0.851852	0.197101	[*]C(=O)OC(C)(C)C.[*]C(=O)OC(C)(C)C
24087	0.851351	0.180851	[*]c1c(C)ncn1[*]
6051	0.850000	0.182927	[*]C(=O)OCC.[*]C(=O)OCC
7595	0.839161	0.094955	[*]OC=O.[*]C(C(=O)O)C(=O)O
3185	0.838323	0.111111	[*]N1CCOCC1.[*]S(C)(=O)=O
15940	0.835821	0.127490	[*]/C=C(\O)C(=O)O.[*]C(C)C
20472	0.831858	0.160458	[*]C(=O)Cc1ccsc1.[*]C(=O)NC1CCCCCC1
16457	0.829787	0.197581	[*]C(=O)OCC.[*]C(=O)C(C)N1CCOCC1
1283	0.828571	0.158301	[*]c1ccccc1.[*]C(C)C
8629	0.827273	0.183333	[*]CCC(N)=O.[*]NCc1ccccc1
10367	0.827273	0.161812	[*]CCc1ccccc1.[*]C(CCS)C(=O)O
9572	0.826446	0.164080	[*]c1ccc(Cl)c(Cl)c1.[*]c1cc(Cl)c(Cl)cc1[*]
13684	0.824074	0.130233	[*]C1CN2CCC1C2
12972	0.823529	0.154762	[*]c1c(Br)cnn1[*]
22001	0.822785	0.136364	[*]C(=O)O.[*]c1ncsc1[*]
5087	0.822222	0.173913	[*]CC(C)(C)C(=O)O
19447	0.816327	0.184507	[*]c1cc2ccccc2o1.[*]c1nnnn1C
8002	0.810345	0.115672	[*]CCCC.[*]C(O)(P(=O)(O)O)P(=O)(O)O
18304	0.809091	0.153226	[*]c1ccco1.[*]c1ccccn1
17612	0.809091	0.162791	[*]c1ccco1.[*]c1ccncc1
225	0.806122	0.198330	[*]c1ccccc1[*]
6688	0.804598	0.138122	[*]CCO.[*]c1ccccc1[*]
16812	0.802083	0.188732	[*]CCCCC.[*]c1nnc(N)s1
14552	0.801980	0.157407	[*]NC(N)=S.[*]Oc1cccc2ccccc21

Here's a particularly nice example where a small change in the middle of the molecule (N->S) destroys what would otherwise be a fairly high RDKit similarity, but where Fraggle still produces a high score:

In [69]:

subset[subset.index==15634]

Out[69]:

	mol1	mol2	Fraggle	RDKit5	Fragment	FragMol
15634			0.927711	0.191693	[*]c1ncnc2[nH]cnc21

Demonstrate the disproportionate influence of the central S by replacing it with an N and repeating the similarity calculations:

In [70]:

Chem.MolToSmiles(subset.ix[15634]['mol2'],True)

Out[70]:

'CCCSc1ncnc2[nH]ncc21'

In [73]:

tmol = Chem.MolFromSmiles('CCCNc1ncnc2[nH]ncc21')
fp1 = Chem.RDKFingerprint(subset.ix[15634]['mol1'],maxPath=5)
fp2 = Chem.RDKFingerprint(tmol,maxPath=5)
print 'RDKit5: ',DataStructs.TanimotoSimilarity(fp1,fp2)
print 'Fraggle: ',FraggleSim.GetFraggleSimilarity(subset.ix[15634]['mol1'],tmol)

RDKit5:  0.501992031873
Fraggle:  (0.927710843373494, '[*]c1ncnc2[nH]cnc21')

The RDKit5 similarity is now well above the random threshold (0.29 at the 95% level), but the Fraggle score is unchanged.
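
To see why a single central atom can matter so much to a path-based fingerprint, here is a toy model. It is not the RDKit fingerprint: it treats a molecule as a plain string of atom symbols, collects every run of up to five consecutive atoms as a "path", and compares the path sets with Tanimoto similarity. The function names and the example chains are made up for the illustration.

```python
def linear_paths(atoms, max_len=5):
    """Distinct runs of up to max_len consecutive atoms in a linear chain.

    Each run is stored in a direction-independent form, so 'CN' and 'NC'
    count as the same path.
    """
    paths = set()
    n = len(atoms)
    for start in range(n):
        for end in range(start + 1, min(start + max_len, n) + 1):
            run = tuple(atoms[start:end])
            paths.add(min(run, run[::-1]))
    return paths

def tanimoto(a, b):
    """Tanimoto similarity between two sets."""
    union = len(a | b)
    return float(len(a & b)) / union if union else 0.0

ref = linear_paths('CCCCNCCCC')
central = linear_paths('CCCCSCCCC')   # swap the atom in the middle
terminal = linear_paths('SCCCNCCCC')  # swap an atom at the end instead
print("central swap:  %.3f" % tanimoto(ref, central))
print("terminal swap: %.3f" % tanimoto(ref, terminal))
```

In this toy model the central swap leaves a similarity of about 0.18, while the terminal swap leaves about 0.72, because far more short paths run through the middle atom than through an end atom.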

What about the cases where the Fraggle similarity is zero, but RDKit5 has a value?¶

In [59]:

# combine both conditions in a single boolean mask
subset2 = df[(df.RDKit5>.5)&(df.Fraggle<0.1)]
subset2.sort(columns=['RDKit5'],ascending=False,inplace=True)
len(subset2)

Out[59]:

In [60]:

frags = []
for row in subset2.itertuples():
    m1 = row[1]
    m2 = row[2]
    sim,frag= FraggleSim.GetFraggleSimilarity(m1,m2)
    frags.append(frag)   
subset2['Fragment']=frags
subset2

Out[60]:

	RDKit5	Fragment
2958	1.000000	None
4568	1.000000	None
21906	1.000000	None
11718	1.000000	None
11745	1.000000	None
1581	0.987552	None
9838	0.987552	None
20969	0.987552	None
10615	0.986063	None
16125	0.881890	None
17498	0.875472	None
22214	0.849057	None
17812	0.816901	None
22514	0.798701	None
23507	0.720000	None
19071	0.714744	None
19067	0.713948	None
7808	0.711111	None
12405	0.692000	None
21204	0.687764	None
18018	0.657895	None
4167	0.654867	None
8750	0.650000	None
7008	0.630252	None
12990	0.630000	None
8355	0.623377	None
21753	0.602941	None
9773	0.598361	None
3180	0.589474	None
19333	0.576119	None
8083	0.575758	None
22375	0.559633	None
20954	0.555556	None
5359	0.548173	None
20305	0.542601	None
22193	0.542601	None
15618	0.536585	None
12669	0.531056	None

At first these look surprising, but the explanation is simple: the molecules don't generate any fragments, so there is nothing for Fraggle to compare. Here is an example:

In [66]:

subset2.ix[21906]['mol1']

Out[66]:

In [67]:

FraggleSim.generate_fraggle_fragmentation(subset2.ix[21906]['mol1'])

Out[67]:

set()

What about how different the compounds are?¶

This is repeating the last bit of analysis from http://rdkit.blogspot.ch/2013/10/comparing-fingerprints-to-each-other.html

In [78]:

nToDo=200
apl = sorted(scoredLists['AP'],reverse=True)[:nToDo]
ttl = sorted(scoredLists['TT'],reverse=True)[:nToDo]
avl = sorted(scoredLists['Avalon-1024'],reverse=True)[:nToDo]
rdkl = sorted(scoredLists['RDKit5'],reverse=True)[:nToDo]
fragl = sorted(scoredLists['Fraggle'],reverse=True)[:nToDo]

idsToKeep=set()
idsToKeep.update([x[1] for x in apl])
idsToKeep.update([x[1] for x in ttl])
idsToKeep.update([x[1] for x in avl])
idsToKeep.update([x[1] for x in rdkl])
idsToKeep.update([x[1] for x in fragl])

print 'Overall number:',len(idsToKeep)
ids={}
ids['AP']=set([x[1] for x in apl])
ids['TT']=set([x[1] for x in ttl])
ids['Avalon-1024']=set([x[1] for x in avl])
ids['RDKit5']=set([x[1] for x in rdkl])
ids['Fraggle']=set([x[1] for x in fragl])


ks = sorted(ids.keys())
for i,k in enumerate(ks):
    for j in range(i+1,len(ks)):
        overlap=len(ids[k].intersection(ids[ks[j]]))
        print ks[i],ks[j],overlap,'%.2f'%(float(overlap)/len(apl))

Overall number: 475
AP Avalon-1024 102 0.51
AP Fraggle 77 0.39
AP RDKit5 112 0.56
AP TT 137 0.69
Avalon-1024 Fraggle 68 0.34
Avalon-1024 RDKit5 125 0.62
Avalon-1024 TT 111 0.56
Fraggle RDKit5 82 0.41
Fraggle TT 70 0.35
RDKit5 TT 117 0.58

In [79]:

nToDo=100
apl = sorted(scoredLists['AP'],reverse=True)[:nToDo]
ttl = sorted(scoredLists['TT'],reverse=True)[:nToDo]
avl = sorted(scoredLists['Avalon-1024'],reverse=True)[:nToDo]
rdkl = sorted(scoredLists['RDKit5'],reverse=True)[:nToDo]
fragl = sorted(scoredLists['Fraggle'],reverse=True)[:nToDo]

idsToKeep=set()
idsToKeep.update([x[1] for x in apl])
idsToKeep.update([x[1] for x in ttl])
idsToKeep.update([x[1] for x in avl])
idsToKeep.update([x[1] for x in rdkl])
idsToKeep.update([x[1] for x in fragl])

print 'Overall number:',len(idsToKeep)
ids={}
ids['AP']=set([x[1] for x in apl])
ids['TT']=set([x[1] for x in ttl])
ids['Avalon-1024']=set([x[1] for x in avl])
ids['RDKit5']=set([x[1] for x in rdkl])
ids['Fraggle']=set([x[1] for x in fragl])


ks = sorted(ids.keys())
for i,k in enumerate(ks):
    for j in range(i+1,len(ks)):
        overlap=len(ids[k].intersection(ids[ks[j]]))
        print ks[i],ks[j],overlap,'%.2f'%(float(overlap)/len(apl))

Overall number: 240
AP Avalon-1024 58 0.58
AP Fraggle 18 0.18
AP RDKit5 69 0.69
AP TT 86 0.86
Avalon-1024 Fraggle 16 0.16
Avalon-1024 RDKit5 56 0.56
Avalon-1024 TT 60 0.60
Fraggle RDKit5 24 0.24
Fraggle TT 21 0.21
RDKit5 TT 70 0.70

Fraggle is really pulling back different compounds.
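
The two cells above repeat the same steps for different values of `nToDo`. As a sketch, the calculation could be wrapped in one helper. The name `top_n_overlap` and the toy data are made up for the illustration, and the input is assumed to have the same layout as `scoredLists`: a dict mapping a fingerprint name to a list of `(similarity, pair_index)` tuples.

```python
from itertools import combinations

def top_n_overlap(scored_lists, n):
    """Fraction of the top-n pair indices shared by each pair of methods."""
    top_ids = {}
    for name, scored in scored_lists.items():
        best = sorted(scored, reverse=True)[:n]
        top_ids[name] = set(idx for _, idx in best)
    result = {}
    for a, b in combinations(sorted(top_ids), 2):
        shared = len(top_ids[a] & top_ids[b])
        result[(a, b)] = float(shared) / n
    return result

# hypothetical toy scores for four molecule pairs and two methods
toy = {
    'A': [(0.9, 0), (0.8, 1), (0.1, 2), (0.2, 3)],
    'B': [(0.7, 0), (0.9, 1), (0.8, 2), (0.1, 3)],
}
for (a, b), frac in sorted(top_n_overlap(toy, 2).items()):
    print("%s %s %.2f" % (a, b, frac))
```

Calling the helper with `n=200` and `n=100` would reproduce the two comparisons without copying the cell.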
