Goal: construct a set of molecular pairs that can be used to compare similarity methods to each other.
This works with ChEMBL30.
I want to start with molecules that have some connection to each other, so I'll pick pairs that have a baseline similarity: a Tanimoto similarity of at least 0.65 using count-based Morgan0 fingerprints. I'll also create a second set of somewhat more closely related molecules where the baseline similarity is at least 0.55 with a Morgan1 fingerprint. The thresholds were selected based on the analysis in this blog post.
I'm going to use ChEMBL as my data source, so I'll start by adding a table with count-based Morgan fingerprints. Here's the SQL for that, assuming that you've installed the RDKit extension and set up an RDKit schema as described in the docs:
select molregno,morgan_fp(m,0) mfp0,morgan_fp(m,1) mfp1,morgan_fp(m,2) mfp2 into rdk.countfps from rdk.mols;
create index cfps_mfp0 on rdk.countfps using gist(mfp0);
create index cfps_mfp1 on rdk.countfps using gist(mfp1);
create index cfps_mfp2 on rdk.countfps using gist(mfp2);
Now create a fingerprint table that only contains molecules with <= 50 heavy atoms and a single fragment (we recognize these because there is no '.' in the SMILES):
select molregno,mfp0,mfp1 into table rdk.tfps_smaller from rdk.countfps join compound_properties using (molregno) join compound_structures using (molregno) where heavy_atoms<=50 and canonical_smiles not like '%.%';
create index sfps_mfp0_idx on rdk.tfps_smaller using gist(mfp0);
create index sfps_mfp1_idx on rdk.tfps_smaller using gist(mfp1);
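For readers who want to apply the same filter outside the database, here's a sketch of the equivalent check in Python. The `passes_filter` helper is hypothetical, it just mirrors the two SQL criteria above:

```python
from rdkit import Chem

def passes_filter(smiles, max_heavy=50):
    """Same criteria as the SQL above: a single fragment
    (no '.' in the SMILES) and at most max_heavy heavy atoms."""
    if '.' in smiles:
        return False
    mol = Chem.MolFromSmiles(smiles)
    return mol is not None and mol.GetNumHeavyAtoms() <= max_heavy

print(passes_filter('CCO'))        # small, single fragment
print(passes_filter('CCO.[Na+]'))  # salt: multiple fragments
```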
And now I'll build the set of pairs using Python. This is definitely doable in SQL, but my SQL-fu isn't that strong.
Start by getting a set of 60K random small molecules:
from rdkit import Chem
from rdkit import rdBase
print(rdBase.rdkitVersion)
import time
print(time.asctime())
2022.09.1
Mon Jan 16 09:20:51 2023
import psycopg2
cn = psycopg2.connect(host='localhost',dbname='chembl_30')
curs = cn.cursor()
curs.execute("select chembl_id,m from rdk.mols join rdk.tfps_smaller using (molregno)"
" join chembl_id_lookup on (molregno=entity_id and entity_type='COMPOUND')"
" order by random() limit 60000")
qs = curs.fetchall()
And now find one neighbor for 50K of those from the mfp0 table of smallish molecules:
cn.rollback()
curs.execute('set rdkit.tanimoto_threshold=0.65')
keep=[]
for i,row in enumerate(qs):
    curs.execute("select chembl_id,m from rdk.mols join (select chembl_id,molregno from rdk.tfps_smaller "
                 "join chembl_id_lookup on (molregno=entity_id and entity_type='COMPOUND') "
                 "where mfp0%%morgan_fp(%s,0) "
                 "and chembl_id!=%s limit 1) t2 using (molregno) "
                 "limit 1",(row[1],row[0]))
    d = curs.fetchone()
    if not d:
        continue
    keep.append((row[0],row[1],d[0],d[1]))
    if len(keep)>=50000:
        break
    if not i%1000: print('Done: %d'%i)
Done: 0 Done: 1000 Done: 2000 Done: 4000 Done: 5000 Done: 6000 Done: 7000 Done: 8000 Done: 9000 Done: 10000 Done: 11000 Done: 12000 Done: 13000 Done: 14000 Done: 15000 Done: 16000 Done: 17000 Done: 18000 Done: 19000 Done: 20000 Done: 21000 Done: 22000 Done: 23000 Done: 24000 Done: 25000 Done: 26000 Done: 27000 Done: 28000 Done: 29000 Done: 30000 Done: 31000 Done: 32000 Done: 33000 Done: 34000 Done: 35000 Done: 36000 Done: 37000 Done: 38000 Done: 39000 Done: 40000 Done: 41000 Done: 42000 Done: 43000 Done: 44000 Done: 45000 Done: 46000 Done: 47000 Done: 48000 Done: 49000 Done: 50000
Finally, write those out to a file so that we can use them elsewhere:
import gzip
outf = gzip.open('../data/chembl30_50K.mfp0.pairs.txt.gz','wb+')
for cid1,smi1,cid2,smi2 in keep:
    outf.write(f'{cid1}\t{smi1}\t{cid2}\t{smi2}\n'.encode('UTF-8'))
outf.close()
Now build the second set of pairs, this time requiring a Tanimoto similarity of at least 0.55 using the MFP1 fingerprints:
cn.rollback()
curs.execute('set rdkit.tanimoto_threshold=0.55')
keep=[]
for i,row in enumerate(qs):
    curs.execute("select chembl_id,m from rdk.mols join (select chembl_id,molregno from rdk.tfps_smaller "
                 "join chembl_id_lookup on (molregno=entity_id and entity_type='COMPOUND') "
                 "where mfp1%%morgan_fp(%s,1) "
                 "and chembl_id!=%s limit 1) t2 using (molregno) "
                 "limit 1",(row[1],row[0]))
    d = curs.fetchone()
    if not d:
        continue
    keep.append((row[0],row[1],d[0],d[1]))
    if len(keep)>=50000:
        break
    if not i%1000: print('Done: %d'%i)
Done: 0 Done: 1000 Done: 2000 Done: 4000 Done: 5000 Done: 6000 Done: 7000 Done: 8000 Done: 9000 Done: 10000 Done: 11000 Done: 12000 Done: 13000 Done: 14000 Done: 15000 Done: 16000 Done: 17000 Done: 18000 Done: 19000 Done: 20000 Done: 21000 Done: 22000 Done: 23000 Done: 24000 Done: 25000 Done: 26000 Done: 27000 Done: 28000 Done: 29000 Done: 30000 Done: 31000 Done: 32000 Done: 33000 Done: 34000 Done: 35000 Done: 36000 Done: 37000 Done: 38000 Done: 39000 Done: 40000 Done: 41000 Done: 42000 Done: 43000 Done: 44000 Done: 45000 Done: 46000 Done: 47000 Done: 48000 Done: 49000 Done: 50000
import gzip
outf = gzip.open('../data/chembl30_50K.mfp1.pairs.txt.gz','wb+')
for cid1,smi1,cid2,smi2 in keep:
    outf.write(f'{cid1}\t{smi1}\t{cid2}\t{smi2}\n'.encode('UTF-8'))
outf.close()
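To use the pair files elsewhere, they just need to be read back in the same tab-separated format they were written in. Here's a small sketch; the `read_pairs` helper is hypothetical:

```python
import gzip

def read_pairs(path):
    """Iterate over (chembl_id1, smiles1, chembl_id2, smiles2) tuples
    from one of the gzipped, tab-separated pair files written above."""
    with gzip.open(path, 'rt', encoding='UTF-8') as inf:
        for line in inf:
            cid1, smi1, cid2, smi2 = line.rstrip('\n').split('\t')
            yield cid1, smi1, cid2, smi2
```

Opening the file in text mode (`'rt'`) avoids having to decode each line by hand.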