This is a followup to this post, the original notebook is here.

This time I will use Andrew Dalke's very nice, and super fast, chemfp to do the similarity comparisons. This allows much larger data sets to be run, so I'll look at documents from 2010-2012 and, in a second step, use a lower similarity threshold to define neighbors.

In [1]:

from rdkit import Chem,DataStructs
import time,random
from collections import defaultdict
import psycopg2
from rdkit.Chem import Draw,PandasTools,rdMolDescriptors
from rdkit.Chem.Draw import IPythonConsole
from rdkit import rdBase
from __future__ import print_function
import requests
from xml.etree import ElementTree
import pandas as pd
%load_ext sql
print(rdBase.rdkitVersion)

2014.03.1pre

Preparation¶

Let's start by getting all molecules from papers from the years 2010-2012 from ChEMBL_16.

In [2]:

data = %sql postgresql://localhost/chembl_16 \
    select molregno,canonical_smiles smiles from \
      (select distinct molregno from \
        (select doc_id from docs where year>=2010 and year<=2012) t1 \
        join (select doc_id,molregno from activities) t2 \
        using (doc_id)) tbl \
      join compound_structures using (molregno) ;

234681 rows affected.

In [3]:

outf=file('../data/chembl16_2010-2012.smi','w+')
for i,(molregno,smi) in enumerate(data):
    outf.write('%s\t%d\n'%(smi,molregno))
outf=None

In [17]:

outf=file('../data/chembl16_2010-2012.mfp2.fps','w+')
outf.write("""#FPS1
#num_bits=2048
#software=RDKit/%s
"""%rdBase.rdkitVersion)

for i,(molregno,smi) in enumerate(data):
    mol = Chem.MolFromSmiles(str(smi))
    if not mol:
        continue
    fp = rdMolDescriptors.GetMorganFingerprintAsBitVect(mol,2,nBits=2048)
    outf.write("%s\t%d\n"%(DataStructs.BitVectToFPSText(fp),molregno))
    if i and not i%(5000):
        print("Done: %s"%i)
outf=None

Done: 5000
Done: 10000
Done: 15000
Done: 20000
Done: 25000
Done: 30000
Done: 35000
Done: 40000
Done: 45000
Done: 50000
Done: 55000
Done: 60000
Done: 65000
Done: 70000
Done: 75000
Done: 80000
Done: 85000
Done: 90000
Done: 95000
Done: 100000
Done: 105000
Done: 110000
Done: 115000
Done: 120000
Done: 125000
Done: 130000
Done: 135000
Done: 140000
Done: 145000
Done: 150000
Done: 155000
Done: 160000
Done: 165000
Done: 170000
Done: 175000
Done: 180000
Done: 185000
Done: 190000
Done: 195000
Done: 200000
Done: 205000
Done: 210000
Done: 215000
Done: 220000
Done: 225000
Done: 230000

In [20]:

data=None

In [ ]:

t1=time.time()
rows = !simsearch --NxN -t 0.8 ../data/chembl16_2010-2012.mfp2.fps
t2=time.time()
print("That took %.2f seconds"%(t2-t1))

In [5]:

len(rows)

Out[5]:

To be clear on exactly how amazing chemfp is: I just found all the pairs in a 230Kx230K set of fingerprints in about two minutes on my three year old desktop linux box.

Limit the data to include only rows for molecules that have at least one neighbor:

In [6]:

rowsWithData=[x for x in rows if x[0] not in '#0']
len(rowsWithData)

Out[6]:

In [7]:

rowsWithData[0]

Out[7]:

'2\t23\t631370\t1.00000\t705001\t1.00000'

Grab doc_ids for each compound:

In [8]:

data = %sql \
    select molregno,doc_id from \
    (select doc_id from docs where year>=2010 and year<=2012) t1 \
    join (select doc_id,molregno from activities) t2 \
    using (doc_id) ;
molDocs=defaultdict(list)
for regno,doc_id in data:
    if doc_id not in molDocs[regno]:
        molDocs[regno].append(doc_id)
data=None   

1362384 rows affected.

another data structure to hold the compounds in each document:

In [9]:

docMols=defaultdict(list)
for regno,docs in molDocs.iteritems():
    for docId in docs:
        docMols[docId].append(regno)

And now figure out the document pairs based on the compound pairs:

In [10]:

docPairs=defaultdict(list)
molSims={}
for row in rowsWithData:
    row = row.split('\t')
    nNbrs = int(row[0])
    bRegno = int(row[1])
    for i in range(nNbrs):
        iRegno=int(row[2*i+2])
        if bRegno==iRegno: 
            continue
        sim=float(row[2*i+3])
        if bRegno<iRegno:
            tpl=(bRegno,iRegno)
        else:
            tpl=(iRegno,bRegno)
        molSims[tpl]=sim
        for dId1 in molDocs[bRegno]:
            for dId2 in molDocs[iRegno]:
                if dId1==dId2: 
                    continue
                if dId1<dId2:
                    dtpl = (dId1,dId2)
                else:
                    dtpl = (dId2,dId1)
                if tpl not in docPairs[dtpl]:
                    docPairs[dtpl].append(tpl)
        

Put the data into a DataFrame so that it's easier to look at:

In [11]:

keys=['docid1','docid2','nDoc1','nDoc2','nInCommon','pair_count']
rows=[]
for k,v in docPairs.iteritems():
    row=[]
    row.append(k[0])
    row.append(k[1])
    s0=set(docMols[k[0]])
    s1=set(docMols[k[1]])
    row.append(len(s0))
    row.append(len(s1))
    row.append(len(s0.intersection(s1)))
    row.append(len(v))
    rows.append(row)
df = pd.DataFrame(data=rows,columns=keys)

Add our computed summary properties:

In [12]:

df['compound_id_tani']=df.apply(lambda row: float(row['nInCommon'])/(row['nDoc1']+row['nDoc2']-row['nInCommon']),axis=1)
df['frac_high_sim']=df.apply(lambda row: float(row['pair_count'])/(row['nDoc1']*row['nDoc2']),axis=1)

In [13]:

df[(df.nDoc1>10)&(df.nDoc2>10)].sort('frac_high_sim',ascending=False).head(10)

Out[13]:

	docid1	docid2	nDoc1	nDoc2	nInCommon	pair_count	compound_id_tani	frac_high_sim
23277	50562	59446	14	33	0	190	0.000000	0.411255
1332	52860	66739	15	11	1	63	0.040000	0.381818
21840	57757	61248	17	20	4	121	0.121212	0.355882
58593	54004	65337	14	15	2	47	0.074074	0.223810
87061	51349	55319	15	36	15	114	0.416667	0.211111
19563	57623	60600	12	25	11	61	0.423077	0.203333
8744	65426	65439	15	22	1	63	0.027778	0.190909
24548	54604	57777	16	16	2	48	0.066667	0.187500
25254	53089	57738	27	21	2	106	0.043478	0.186949
21699	55412	58519	13	43	11	103	0.244444	0.184258

Take a look at the document titles from pubmed:

In [14]:

# turn off row count printing
%config SqlMagic.feedback = False

titlePairs=[]
for id_,row in df[(df.nDoc1>10)&(df.nDoc2>10)].sort('frac_high_sim',ascending=False).head(10).iterrows():
    did1=row['docid1']
    did2=row['docid2']
    pmids=%sql  select doc_id,pubmed_id,title from docs where doc_id in (:did1,:did2)
    tpl = pmids[0][1],pmids[1][1]
    if tpl[0] is None:
        print('no pmid found for item: %s'%(str(pmids[0])))
        continue
    if tpl[1] is None:
        print('no pmid found for item: %s'%(str(pmids[1])))
        continue
        
    txt=requests.get('http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&id=%d,%d'%tpl).text
    et = ElementTree.fromstring(txt.encode('utf-8'))
    ts=[x.text for x in et.findall(".//*[@Name='Title']")]
    titlePairs.append(ts)
    print(str(tpl))
    print('   '+ts[0])
    print('   '+ts[1])

(20180534L, 21978682L)
   Discovery of a novel series of semisynthetic vancomycin derivatives effective against vancomycin-resistant bacteria.
   Synthesis and antibacterial activity of N4-mono alkyl derivatives of novel glycopeptide LYV07ww01.
no pmid found for item: (66739, None, None)
(21601450L, 22244938L)
   Novel, potent, and orally bioavailable phosphinic acid inhibitors of the hepatitis C virus NS3 protease.
   Discovery of GS-9256: a novel phosphinic acid derived inhibitor of the hepatitis C virus NS3/4A protease with potent clinical activity.
(20951594L, 22989907L)
   Tuning hydrophobicity of highly cationic tetradecameric Gramicidin S analogues using adamantane amino acids.
   'Inverted' analogs of the antibiotic gramicidin S with an improved biological profile.
(20462756L, 21159410L)
   Synthesis and antibacterial activity of novel 4''-O-arylalkylcarbamoyl and 4''-O-((arylalkylamino)-4-oxo-butyl)carbamoyl clarithromycin derivatives.
   Synthesis and antibacterial evaluation of novel clarithromycin derivatives with C-4″ elongated arylalkyl groups against macrolide-resistant strains.
no pmid found for item: (57623, None, None)
(22975302L, 22995619L)
   Serum stability of selected decapeptide agonists of KISS1R using pseudopeptides.
   Trypsin resistance of a decapeptide KISS1R agonist containing an Nω-methylarginine substitution.
(21211982L, 21497959L)
   Design, synthesis and evaluation of novel tacrine-multialkoxybenzene hybrids as dual inhibitors for cholinesterases and amyloid beta aggregation.
   Synthesis and evaluation of heterobivalent tacrine derivatives as potential multi-functional anti-Alzheimer agents.
(20801039L, 21570844L)
   9-Dihydroerythromycins as non-antibiotic motilin receptor agonists.
   The role of the 4''-hydroxyl on motilin agonist potency in the 9-dihydroerythromycin series.
(21256008L, 21658961L)
   Spirodiketopiperazine-based CCR5 antagonist: discovery of an antiretroviral drug candidate.
   Discovery of 4-[4-({(3R)-1-butyl-3-[(R)-cyclohexyl(hydroxy)methyl]-2,5-dioxo-1,4,9-triazaspiro[5.5]undec-9-yl}methyl)phenoxy]benzoic acid hydrochloride: a highly potent orally available CCR5 selective antagonist.

And then inspect the results:

In [15]:

for id_,row in df[(df.nDoc1>10)&(df.nDoc2>10)].sort('frac_high_sim',ascending=False).head(10).iterrows():
    did1=row['docid1']
    did2=row['docid2']
    pmids=%sql  select doc_id,pubmed_id,chembl_id,year from docs where doc_id in (:did1,:did2)
    print('https://www.ebi.ac.uk/chembl/doc/inspect/%s (%d )- https://www.ebi.ac.uk/chembl/doc/inspect/%s (%d)'%(pmids[0][2],pmids[0][3],pmids[1][2],pmids[1][3]))

https://www.ebi.ac.uk/chembl/doc/inspect/CHEMBL1155426 (2010 )- https://www.ebi.ac.uk/chembl/doc/inspect/CHEMBL1909570 (2011)
https://www.ebi.ac.uk/chembl/doc/inspect/CHEMBL1250472 (2010 )- https://www.ebi.ac.uk/chembl/doc/inspect/CHEMBL2177087 (2012)
https://www.ebi.ac.uk/chembl/doc/inspect/CHEMBL1777720 (2011 )- https://www.ebi.ac.uk/chembl/doc/inspect/CHEMBL1955809 (2012)
https://www.ebi.ac.uk/chembl/doc/inspect/CHEMBL1287679 (2010 )- https://www.ebi.ac.uk/chembl/doc/inspect/CHEMBL2146376 (2012)
https://www.ebi.ac.uk/chembl/doc/inspect/CHEMBL1155521 (2010 )- https://www.ebi.ac.uk/chembl/doc/inspect/CHEMBL1667745 (2011)
https://www.ebi.ac.uk/chembl/doc/inspect/CHEMBL1777703 (2010 )- https://www.ebi.ac.uk/chembl/doc/inspect/CHEMBL1933006 (2011)
https://www.ebi.ac.uk/chembl/doc/inspect/CHEMBL2150961 (2012 )- https://www.ebi.ac.uk/chembl/doc/inspect/CHEMBL2151012 (2012)
https://www.ebi.ac.uk/chembl/doc/inspect/CHEMBL1641506 (2011 )- https://www.ebi.ac.uk/chembl/doc/inspect/CHEMBL1782027 (2011)
https://www.ebi.ac.uk/chembl/doc/inspect/CHEMBL1255554 (2010 )- https://www.ebi.ac.uk/chembl/doc/inspect/CHEMBL1777727 (2011)
https://www.ebi.ac.uk/chembl/doc/inspect/CHEMBL1667838 (2011 )- https://www.ebi.ac.uk/chembl/doc/inspect/CHEMBL1811891 (2011)

Once again, many of these don't have ChEMBL matches identified. This is at least partially due to the fact that papers have to cite each other to show up (thanks to George for pointing that out), but another part is clearly going to be the low absolute overlap between papers.

The exception is CHEMBL1255554 - CHEMBL1777727, which is a pair in the current ChEMBL scheme.

look at some molecules¶

In [16]:

d1,d2=50562,59446
mPairs=docPairs[(d1,d2)][:]
random.seed(d1)
random.shuffle(mPairs)
ms = []
for i,j in mPairs[:3]:
    d = %sql select m from rdk.mols where molregno=:i
    ms.append(Chem.MolFromSmiles(d[0][0]))
    d = %sql select m from rdk.mols where molregno=:j
    ms.append(Chem.MolFromSmiles(d[0][0]))
Draw.MolsToGridImage(ms,molsPerRow=2,subImgSize=(300,300))

Out[16]:

The molecules do certainly seem similar to each other, and they definitely highlight some nice opportunities for improvement of the depiction code (positive attitude! positive attitude!).

In [17]:

d1,d2=52860, 66739
mPairs=docPairs[(d1,d2)][:]
random.seed(d1)
random.shuffle(mPairs)
ms = []
for i,j in mPairs[:3]:
    d = %sql select m from rdk.mols where molregno=:i
    ms.append(Chem.MolFromSmiles(d[0][0]))
    d = %sql select m from rdk.mols where molregno=:j
    ms.append(Chem.MolFromSmiles(d[0][0]))
Draw.MolsToGridImage(ms,molsPerRow=2,subImgSize=(400,200))

Out[17]:

Try a lower similarity threshold¶

In [18]:

t1=time.time()
rows = !simsearch --NxN -t 0.6 ../data/chembl16_2010-2012.mfp2.fps
t2=time.time()
print("That took %.2f seconds"%(t2-t1))

That took 220.68 seconds

It takes a bit longer with the lower threshold, but chemfp is still excellently fast.

In [19]:

len(rows)

Out[19]:

In [20]:

rowsWithData=[x for x in rows if x[0] not in '#0']
len(rowsWithData)

Out[20]:

In [21]:

data = %sql \
    select molregno,doc_id from \
    (select doc_id from docs where year>=2010 and year<=2012) t1 \
    join (select doc_id,molregno from activities) t2 \
    using (doc_id) ;
molDocs=defaultdict(list)
for regno,doc_id in data:
    if doc_id not in molDocs[regno]:
        molDocs[regno].append(doc_id)
data=None

docMols=defaultdict(list)
for regno,docs in molDocs.iteritems():
    for docId in docs:
        docMols[docId].append(regno)
        
docPairs=defaultdict(list)
molSims={}
for row in rowsWithData:
    row = row.split('\t')
    nNbrs = int(row[0])
    bRegno = int(row[1])
    for i in range(nNbrs):
        iRegno=int(row[2*i+2])
        if bRegno==iRegno: 
            continue
        sim=float(row[2*i+3])
        if bRegno<iRegno:
            tpl=(bRegno,iRegno)
        else:
            tpl=(iRegno,bRegno)
        molSims[tpl]=sim
        for dId1 in molDocs[bRegno]:
            for dId2 in molDocs[iRegno]:
                if dId1==dId2: 
                    continue
                if dId1<dId2:
                    dtpl = (dId1,dId2)
                else:
                    dtpl = (dId2,dId1)
                if tpl not in docPairs[dtpl]:
                    docPairs[dtpl].append(tpl)

keys=['docid1','docid2','nDoc1','nDoc2','nInCommon','pair_count']
rows=[]
for k,v in docPairs.iteritems():
    row=[]
    row.append(k[0])
    row.append(k[1])
    s0=set(docMols[k[0]])
    s1=set(docMols[k[1]])
    row.append(len(s0))
    row.append(len(s1))
    row.append(len(s0.intersection(s1)))
    row.append(len(v))
    rows.append(row)
df = pd.DataFrame(data=rows,columns=keys)

df['compound_id_tani']=df.apply(lambda row: float(row['nInCommon'])/(row['nDoc1']+row['nDoc2']-row['nInCommon']),axis=1)
df['frac_high_sim']=df.apply(lambda row: float(row['pair_count'])/(row['nDoc1']*row['nDoc2']),axis=1)

df[(df.nDoc1>10)&(df.nDoc2>10)].sort('frac_high_sim',ascending=False).head(10)

Out[21]:

	docid1	docid2	nDoc1	nDoc2	nInCommon	pair_count	compound_id_tani	frac_high_sim
191796	50562	59446	14	33	0	462	0.000000	1.000000
144402	59446	62617	33	19	0	627	0.000000	1.000000
208144	66523	66968	12	27	1	323	0.026316	0.996914
64413	50562	62617	14	19	1	265	0.031250	0.996241
154921	55780	57070	15	15	2	217	0.071429	0.964444
43395	55788	57575	99	37	1	3461	0.007407	0.944854
177126	60278	62777	54	36	2	1817	0.022727	0.934671
194340	61045	66790	26	29	0	700	0.000000	0.928382
233742	51545	53478	30	13	1	341	0.023810	0.874359
68093	65507	66530	21	13	0	236	0.000000	0.864469

In [22]:

titlePairs=[]
for id_,row in df[(df.nDoc1>10)&(df.nDoc2>10)].sort('frac_high_sim',ascending=False).head(10).iterrows():
    did1=row['docid1']
    did2=row['docid2']
    pmids=%sql  select doc_id,pubmed_id,title from docs where doc_id in (:did1,:did2)
    tpl = pmids[0][1],pmids[1][1]
    if tpl[0] is None:
        print('no pmid found for item: %s'%(str(pmids[0])))
        continue
    if tpl[1] is None:
        print('no pmid found for item: %s'%(str(pmids[1])))
        continue
        
    txt=requests.get('http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&id=%d,%d'%tpl).text
    et = ElementTree.fromstring(txt.encode('utf-8'))
    ts=[x.text for x in et.findall(".//*[@Name='Title']")]
    titlePairs.append(ts)
    print(str(tpl))
    print('   '+ts[0])
    print('   '+ts[1])

(20180534L, 21978682L)
Discovery of a novel series of semisynthetic vancomycin derivatives effective against vancomycin-resistant bacteria.
Synthesis and antibacterial activity of N4-mono alkyl derivatives of novel glycopeptide LYV07ww01.
(21978682L, 22765891L)
Synthesis and antibacterial activity of N4-mono alkyl derivatives of novel glycopeptide LYV07ww01.
Synthesis and antibacterial activity against Clostridium difficile of novel demethylvancomycin derivatives.
(22115594L, 23107481L)
Ceric ammonium nitrate-promoted oxidative coupling reaction for the synthesis and evaluation of a series of anti-tumor amide anhydrovinblastine analogs.
Synthesis and structure-activity relationship studies of cytotoxic vinorelbine amide analogues.
(20180534L, 22765891L)
Discovery of a novel series of semisynthetic vancomycin derivatives effective against vancomycin-resistant bacteria.
Synthesis and antibacterial activity against Clostridium difficile of novel demethylvancomycin derivatives.
(21316958L, 21392990L)
Maintaining potent HTLV-I protease inhibition without the P3-cap moiety in small tetrapeptidic inhibitors.
Design and synthesis of several small-size HTLV-I protease inhibitors with different hydrophilicity profiles.
(21316974L, 21476508L)
An automated, polymer-assisted strategy for the preparation of urea and thiourea derivatives of 15-membered azalides as potential antimalarial chemotherapeutics.
Synthesis, structure-activity relationship, and antimalarial activity of ureas and thioureas of 15-membered azalides.
(21999529L, 22315981L)
Antibacterial optimization of 4-aminothiazolyl analogues of the natural product GE2270 A: identification of the cycloalkylcarboxylic acids.
Discovery of LFF571: an investigational agent for Clostridium difficile infection.
no pmid found for item: (61045, None, None)
(20471843L, 20880702L)
Synthesis and biological evaluation of a new series of berberine derivatives as dual inhibitors of acetylcholinesterase and butyrylcholinesterase.
Berberine derivatives, with substituted amino groups linked at the 9-position, as inhibitors of acetylcholinesterase/butyrylcholinesterase.
no pmid found for item: (65507, None, None)

In [24]:

for id_,row in df[(df.nDoc1>10)&(df.nDoc2>10)].sort('frac_high_sim',ascending=False).head(10).iterrows():
    did1=row['docid1']
    did2=row['docid2']
    pmids=%sql  select doc_id,pubmed_id,chembl_id,year from docs where doc_id in (:did1,:did2)
    print('https://www.ebi.ac.uk/chembl/doc/inspect/%s (%d )- https://www.ebi.ac.uk/chembl/doc/inspect/%s (%d)'%(pmids[0][2],pmids[0][3],pmids[1][2],pmids[1][3]))

https://www.ebi.ac.uk/chembl/doc/inspect/CHEMBL1155426 (2010 )- https://www.ebi.ac.uk/chembl/doc/inspect/CHEMBL1909570 (2011)
https://www.ebi.ac.uk/chembl/doc/inspect/CHEMBL1909570 (2011 )- https://www.ebi.ac.uk/chembl/doc/inspect/CHEMBL2057152 (2012)
https://www.ebi.ac.uk/chembl/doc/inspect/CHEMBL2177090 (2012 )- https://www.ebi.ac.uk/chembl/doc/inspect/CHEMBL2203286 (2012)
https://www.ebi.ac.uk/chembl/doc/inspect/CHEMBL1155426 (2010 )- https://www.ebi.ac.uk/chembl/doc/inspect/CHEMBL2057152 (2012)
https://www.ebi.ac.uk/chembl/doc/inspect/CHEMBL1681720 (2011 )- https://www.ebi.ac.uk/chembl/doc/inspect/CHEMBL1759899 (2011)
https://www.ebi.ac.uk/chembl/doc/inspect/CHEMBL1681728 (2011 )- https://www.ebi.ac.uk/chembl/doc/inspect/CHEMBL1772969 (2011)
https://www.ebi.ac.uk/chembl/doc/inspect/CHEMBL1921735 (2011 )- https://www.ebi.ac.uk/chembl/doc/inspect/CHEMBL2062386 (2012)
https://www.ebi.ac.uk/chembl/doc/inspect/CHEMBL1949557 (2011 )- https://www.ebi.ac.uk/chembl/doc/inspect/CHEMBL2203263 (2012)
https://www.ebi.ac.uk/chembl/doc/inspect/CHEMBL1165944 (2010 )- https://www.ebi.ac.uk/chembl/doc/inspect/CHEMBL1268919 (2010)
https://www.ebi.ac.uk/chembl/doc/inspect/CHEMBL2151031 (2012 )- https://www.ebi.ac.uk/chembl/doc/inspect/CHEMBL2177006 (2012)

In [25]:

d1,d2=65507,66530
mPairs=docPairs[(d1,d2)][:]
random.seed(d1)
random.shuffle(mPairs)
ms = []
for i,j in mPairs[:3]:
    d = %sql select m from rdk.mols where molregno=:i
    ms.append(Chem.MolFromSmiles(d[0][0]))
    d = %sql select m from rdk.mols where molregno=:j
    ms.append(Chem.MolFromSmiles(d[0][0]))
Draw.MolsToGridImage(ms,molsPerRow=2,subImgSize=(300,200))

Out[25]:

In [26]:

d1,d2=61045,66790
mPairs=docPairs[(d1,d2)][:]
random.seed(d1)
random.shuffle(mPairs)
ms = []
for i,j in mPairs[:3]:
    d = %sql select m from rdk.mols where molregno=:i
    ms.append(Chem.MolFromSmiles(d[0][0]))
    d = %sql select m from rdk.mols where molregno=:j
    ms.append(Chem.MolFromSmiles(d[0][0]))
Draw.MolsToGridImage(ms,molsPerRow=2,subImgSize=(300,200))

Out[26]:

In [ ]:

More on finding related documents in ChEMBL.¶

Preparation¶

look at some molecules¶

Try a lower similarity threshold¶