MMPA on ChEMBL hERG data using Pandas

The idea here is to try out using Pandas to visualize and work with the output of Jameed Hussain's MMPA (matched molecular pair analysis) code in the IPython notebook. The code is available in the RDKit Contrib directory. Jameed gave a couple of tutorials on the use of the tools at the 2013 UGM. The notebooks from his tutorials are here and here.

I'll use a ChEMBL hERG dataset. This was somewhat inspired/informed by Paul's work here: https://github.com/pzc/herg_chembl_jcim

In [1]:

from __future__ import print_function  # __future__ imports must come first in a plain script
from rdkit import Chem,DataStructs
import time,random
from collections import defaultdict
import psycopg2
from rdkit.Chem import Draw,PandasTools,rdMolDescriptors
from rdkit.Chem.Draw import IPythonConsole
from rdkit import rdBase
import requests
from xml.etree import ElementTree
import pandas as pd
%load_ext sql
print(rdBase.rdkitVersion)

2014.09.1pre

Preparation

Start by finding the hERG data in ChEMBL

In [3]:

%sql postgresql://localhost/chembl_19 \
    select * from chembl_id_lookup where chembl_id = 'CHEMBL240';

1 rows affected.

Out[3]:

chembl_id	entity_type	entity_id	status
CHEMBL240	TARGET	165	ACTIVE

In [10]:

%sql select count(*) from activities join assays using (assay_id) where tid=165;

1 rows affected.

Out[10]:

count
14397

Look at all the activity types available.

In [11]:

%sql select distinct(standard_type) from activities join assays using (assay_id) where tid=165;

27 rows affected.

Out[11]:

standard_type
Ratio
Fold change
Ratio IC50
EC25
Imax
EC50
Activity
IC25
IC60
EC10
IP
IC50
QT interval
Time
Ratio Ki
Log IC50
ED50
pIC50
Inflection point
Inhibition
Ki
IC20
V1/2
Potency
INH
pKi
log IC50

In [12]:

%sql select count(*) from activities join assays using (assay_id) where tid=165 and standard_type='Ki';

1 rows affected.

Out[12]:

count
2327

Create the data set

Pull all hERG assay Ki values where the value is not qualified and the SMILES doesn't include a dot (the MMPA code doesn't get along with dot-separated SMILES).

Reproducibility note: though the queries here are shown against chembl_19, I did the original data export from chembl_18, so the pairs shown may differ somewhat from what you'd get.

In [4]:

data = %sql select canonical_smiles,molregno,activity_id,standard_value,standard_units from activities \
  join assays using (assay_id) \
  join compound_structures using (molregno) \
  where tid=165 and standard_type='Ki' and standard_value is not null and standard_relation='=' \
    and canonical_smiles not like '%.%';

1099 rows affected.
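
The same three conditions can be written in Pandas, which may be handy if the raw activities have already been exported. This is only a sketch on an invented table (the rows below are made up for illustration and are not ChEMBL data); the column names follow the SQL query.

```python
import pandas as pd

# Invented example rows; column names mirror the SQL query.
raw = pd.DataFrame({
    'canonical_smiles': ['CCO', 'CCN.Cl', 'c1ccccc1O', 'CCC'],
    'molregno': [1, 2, 3, 4],
    'standard_value': [100.0, 50.0, None, 2000.0],
    'standard_relation': ['=', '=', '=', '>'],
})

keep = (
    raw['standard_value'].notnull()          # a value is present
    & (raw['standard_relation'] == '=')      # the value is not qualified
    # regex=False so that '.' is a literal dot, not "any character"
    & ~raw['canonical_smiles'].str.contains('.', regex=False)
)
filtered = raw[keep]
print(filtered)
```

Only the first row passes all three tests: the second has a dot-separated SMILES, the third has no value, and the fourth is qualified with ">".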

Convert to a Pandas DataFrame and write it to a text file

In [5]:

df = data.DataFrame()

In [42]:

df.to_csv('../data/herg_data.txt',sep=" ",index=False)

Build the matched pairs

Call the MMPA fragmentation program.

This can take a few minutes.

In [43]:

!python $RDBASE/Contrib/mmpa/rfrag.py < ../data/herg_data.txt > ../data/herg_fragmented.txt

[03:43:47] SMILES Parse Error: syntax error for input: canonical_smiles
Can't generate mol for: canonical_smiles
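
The parse error above comes from the header line that to_csv writes: the fragmenter tries to read the column name canonical_smiles as a SMILES. A minimal sketch of one way to avoid it, assuming the fragmenter only needs a SMILES and an identifier on each line (the toy frame and the in-memory buffer are stand-ins for df and the real file):

```python
import io
import pandas as pd

# Stand-in for the real data frame.
toy = pd.DataFrame({
    'canonical_smiles': ['CCO', 'c1ccccc1O'],
    'molregno': [1, 2],
    'standard_value': [100.0, 250.0],
})

buf = io.StringIO()
# header=False drops the column-name line; only SMILES and id are written.
toy[['canonical_smiles', 'molregno']].to_csv(buf, sep=' ', index=False, header=False)
text = buf.getvalue()
print(text)
```

In the notebook itself the error is harmless, since only the header line is skipped.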

Now generate the MMPs where the change is less than 10% of the molecule (-r 0.1). Generate them symmetrically (-s) so that we can detect transforms (tforms) in both directions.

In [153]:

!python $RDBASE/Contrib/mmpa/indexing.py -s -r 0.1 < ../data/herg_fragmented.txt > ../data/mmps_default.txt

Read those into a Pandas data frame and look at some of the data.

In [6]:

mmps = pd.read_csv('../data/mmps_default.txt',header=None,names=('smiles1','smiles2','molregno1','molregno2','tform','core'))

In [7]:

mmps[mmps.molregno1==290813]

Out[7]:

	smiles1	smiles2	molregno1	molregno2	tform	core
0	CCn1nc(Cc2ccc(OC3CCC3)cc2)cc1C4CCN(C[C@H]5CN(C[C@@H]5c6cccc(F)c6)[C@@H](C(=O)O)C(C)(C)C)CC4	CC[C@@H](C)[C@@H](N1C[C@H](CN2CCC(CC2)c3cc(Cc4ccc(OC5CCC5)cc4)nn3CC)[C@H](C1)c6cccc(F)c6)C(=O)O	290813	290814	[*:1]C(C)(C)C>>[*:1][C@H](C)CC	[*:1][C@H](C(=O)O)N1C[C@H](CN2CCC(c3cc(Cc4ccc(OC5CCC5)cc4)nn3CC)CC2)[C@@H](c2cccc(F)c2)C1
2	CCn1nc(Cc2ccc(OC3CCC3)cc2)cc1C4CCN(C[C@H]5CN(C[C@@H]5c6cccc(F)c6)[C@@H](C(=O)O)C(C)(C)C)CC4	CC[C@H](C)[C@@H](N1C[C@H](CN2CCC(CC2)c3cc(Cc4ccc(OC5CCC5)cc4)nn3CC)[C@H](C1)c6cccc(F)c6)C(=O)O	290813	290815	[*:1]C(C)(C)C>>[*:1][C@@H](C)CC	[*:1][C@H](C(=O)O)N1C[C@H](CN2CCC(c3cc(Cc4ccc(OC5CCC5)cc4)nn3CC)CC2)[C@@H](c2cccc(F)c2)C1
92	CCn1nc(Cc2ccc(OC3CCC3)cc2)cc1C4CCN(C[C@H]5CN(C[C@@H]5c6cccc(F)c6)[C@@H](C(=O)O)C(C)(C)C)CC4	CC[C@@H](C)[C@@H](N1C[C@H](CN2CCC(CC2)c3cc(Cc4ccc(OC5CCC5)cc4)nn3CC)[C@H](C1)c6cccc(F)c6)C(=O)O	290813	290814	[*:1]C([*:2])([*:3])C>>[*:1]CC([*:2])[*:3]	[*:3]C.[*:1]C.[*:2][C@H](C(=O)O)N1C[C@H](CN2CCC(c3cc(Cc4ccc(OC5CCC5)cc4)nn3CC)CC2)[C@@H](c2cccc(F)c2)C1
94	CCn1nc(Cc2ccc(OC3CCC3)cc2)cc1C4CCN(C[C@H]5CN(C[C@@H]5c6cccc(F)c6)[C@@H](C(=O)O)C(C)(C)C)CC4	CC[C@H](C)[C@@H](N1C[C@H](CN2CCC(CC2)c3cc(Cc4ccc(OC5CCC5)cc4)nn3CC)[C@H](C1)c6cccc(F)c6)C(=O)O	290813	290815	[*:1]C([*:2])([*:3])C>>[*:1]CC([*:2])[*:3]	[*:3]C.[*:1]C.[*:2][C@H](C(=O)O)N1C[C@H](CN2CCC(c3cc(Cc4ccc(OC5CCC5)cc4)nn3CC)CC2)[C@@H](c2cccc(F)c2)C1
1368	CCn1nc(Cc2ccc(OC3CCC3)cc2)cc1C4CCN(C[C@H]5CN(C[C@@H]5c6cccc(F)c6)[C@@H](C(=O)O)C(C)(C)C)CC4	CC[C@@H](C)[C@@H](N1C[C@H](CN2CCC(CC2)c3cc(Cc4ccc(OC5CCC5)cc4)nn3CC)[C@H](C1)c6cccc(F)c6)C(=O)O	290813	290814	[*:1]C([*:2])(C)C>>[*:1]C([*:2])CC	[*:1]C.[*:2][C@H](C(=O)O)N1C[C@H](CN2CCC(c3cc(Cc4ccc(OC5CCC5)cc4)nn3CC)CC2)[C@@H](c2cccc(F)c2)C1
1370	CCn1nc(Cc2ccc(OC3CCC3)cc2)cc1C4CCN(C[C@H]5CN(C[C@@H]5c6cccc(F)c6)[C@@H](C(=O)O)C(C)(C)C)CC4	CC[C@@H](C)[C@@H](N1C[C@H](CN2CCC(CC2)c3cc(Cc4ccc(OC5CCC5)cc4)nn3CC)[C@H](C1)c6cccc(F)c6)C(=O)O	290813	290814	[*:1]C([*:2])(C)C>>[*:1]C[C@H]([*:2])C	[*:1]C.[*:2][C@H](C(=O)O)N1C[C@H](CN2CCC(c3cc(Cc4ccc(OC5CCC5)cc4)nn3CC)CC2)[C@@H](c2cccc(F)c2)C1
1372	CCn1nc(Cc2ccc(OC3CCC3)cc2)cc1C4CCN(C[C@H]5CN(C[C@@H]5c6cccc(F)c6)[C@@H](C(=O)O)C(C)(C)C)CC4	CC[C@H](C)[C@@H](N1C[C@H](CN2CCC(CC2)c3cc(Cc4ccc(OC5CCC5)cc4)nn3CC)[C@H](C1)c6cccc(F)c6)C(=O)O	290813	290815	[*:1]C([*:2])(C)C>>[*:1]C[C@@H]([*:2])C	[*:1]C.[*:2][C@H](C(=O)O)N1C[C@H](CN2CCC(c3cc(Cc4ccc(OC5CCC5)cc4)nn3CC)CC2)[C@@H](c2cccc(F)c2)C1
1374	CCn1nc(Cc2ccc(OC3CCC3)cc2)cc1C4CCN(C[C@H]5CN(C[C@@H]5c6cccc(F)c6)[C@@H](C(=O)O)C(C)(C)C)CC4	CC[C@H](C)[C@@H](N1C[C@H](CN2CCC(CC2)c3cc(Cc4ccc(OC5CCC5)cc4)nn3CC)[C@H](C1)c6cccc(F)c6)C(=O)O	290813	290815	[*:1]C([*:2])(C)C>>[*:1]C([*:2])CC	[*:1]C.[*:2][C@H](C(=O)O)N1C[C@H](CN2CCC(c3cc(Cc4ccc(OC5CCC5)cc4)nn3CC)CC2)[C@@H](c2cccc(F)c2)C1
1470	CCn1nc(Cc2ccc(OC3CCC3)cc2)cc1C4CCN(C[C@H]5CN(C[C@@H]5c6cccc(F)c6)[C@@H](C(=O)O)C(C)(C)C)CC4	CCn1nc(Cc2ccc(OC(C)C)cc2)cc1C3CCN(C[C@H]4CN(C[C@@H]4c5cccc(F)c5)[C@@H](C(=O)O)C(C)(C)C)CC3	290813	290921	[*:1]C1CCC1>>[*:1]C(C)C	[*:1]Oc1ccc(Cc2cc(C3CCN(C[C@H]4CN([C@@H](C(=O)O)C(C)(C)C)C[C@@H]4c4cccc(F)c4)CC3)n(CC)n2)cc1
1472	CCn1nc(Cc2ccc(OC3CCC3)cc2)cc1C4CCN(C[C@H]5CN(C[C@@H]5c6cccc(F)c6)[C@@H](C(=O)O)C(C)(C)C)CC4	CCn1nc(Cc2ccc(OC(F)(F)F)cc2)cc1C3CCN(C[C@H]4CN(C[C@@H]4c5cccc(F)c5)[C@@H](C(=O)O)C(C)(C)C)CC3	290813	292218	[*:1]C1CCC1>>[*:1]C(F)(F)F	[*:1]Oc1ccc(Cc2cc(C3CCN(C[C@H]4CN([C@@H](C(=O)O)C(C)(C)C)C[C@@H]4c4cccc(F)c4)CC3)n(CC)n2)cc1
1504	CCn1nc(Cc2ccc(OC3CCC3)cc2)cc1C4CCN(C[C@H]5CN(C[C@@H]5c6cccc(F)c6)[C@@H](C(=O)O)C(C)(C)C)CC4	CC[C@@H](C)[C@@H](N1C[C@H](CN2CCC(CC2)c3cc(Cc4ccc(OC5CCC5)cc4)nn3CC)[C@H](C1)c6cccc(F)c6)C(=O)O	290813	290814	[*:1]C([*:2])C([*:3])(C)C>>[*:1]C([*:2])[C@@H]([*:3])CC	[*:3]C.[*:1]C(=O)O.[*:2]N1C[C@H](CN2CCC(c3cc(Cc4ccc(OC5CCC5)cc4)nn3CC)CC2)[C@@H](c2cccc(F)c2)C1
1506	CCn1nc(Cc2ccc(OC3CCC3)cc2)cc1C4CCN(C[C@H]5CN(C[C@@H]5c6cccc(F)c6)[C@@H](C(=O)O)C(C)(C)C)CC4	CC[C@@H](C)[C@@H](N1C[C@H](CN2CCC(CC2)c3cc(Cc4ccc(OC5CCC5)cc4)nn3CC)[C@H](C1)c6cccc(F)c6)C(=O)O	290813	290814	[*:1]C([*:2])C([*:3])(C)C>>[*:3]C[C@@H](C)C([*:1])[*:2]	[*:3]C.[*:1]C(=O)O.[*:2]N1C[C@H](CN2CCC(c3cc(Cc4ccc(OC5CCC5)cc4)nn3CC)CC2)[C@@H](c2cccc(F)c2)C1
1508	CCn1nc(Cc2ccc(OC3CCC3)cc2)cc1C4CCN(C[C@H]5CN(C[C@@H]5c6cccc(F)c6)[C@@H](C(=O)O)C(C)(C)C)CC4	CC[C@H](C)[C@@H](N1C[C@H](CN2CCC(CC2)c3cc(Cc4ccc(OC5CCC5)cc4)nn3CC)[C@H](C1)c6cccc(F)c6)C(=O)O	290813	290815	[*:1]C([*:2])C([*:3])(C)C>>[*:1]C([*:2])[C@H]([*:3])CC	[*:3]C.[*:1]C(=O)O.[*:2]N1C[C@H](CN2CCC(c3cc(Cc4ccc(OC5CCC5)cc4)nn3CC)CC2)[C@@H](c2cccc(F)c2)C1
1510	CCn1nc(Cc2ccc(OC3CCC3)cc2)cc1C4CCN(C[C@H]5CN(C[C@@H]5c6cccc(F)c6)[C@@H](C(=O)O)C(C)(C)C)CC4	CC[C@H](C)[C@@H](N1C[C@H](CN2CCC(CC2)c3cc(Cc4ccc(OC5CCC5)cc4)nn3CC)[C@H](C1)c6cccc(F)c6)C(=O)O	290813	290815	[*:1]C([*:2])C([*:3])(C)C>>[*:3]C[C@H](C)C([*:1])[*:2]	[*:3]C.[*:1]C(=O)O.[*:2]N1C[C@H](CN2CCC(c3cc(Cc4ccc(OC5CCC5)cc4)nn3CC)CC2)[C@@H](c2cccc(F)c2)C1

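There are duplicates, since the same pair of molecules can be found through more than one fragmentation. Keep just one row per pair: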

In [8]:

mmps=mmps.drop_duplicates(subset=("molregno1","molregno2"))

Add a couple of molecule columns and remove the SMILES columns:

In [9]:

PandasTools.AddMoleculeColumnToFrame(mmps,'smiles1','mol1')
PandasTools.AddMoleculeColumnToFrame(mmps,'smiles2','mol2')
mmps = mmps[['mol1','mol2','molregno1','molregno2','tform','core']]
mmps.head()

Out[9]:

	molregno1	molregno2	tform	core
0	290813	290814	[*:1]C(C)(C)C>>[*:1][C@H](C)CC	[*:1][C@H](C(=O)O)N1C[C@H](CN2CCC(c3cc(Cc4ccc(OC5CCC5)cc4)nn3CC)CC2)[C@@H](c2cccc(F)c2)C1
1	290814	290813	[*:1][C@H](C)CC>>[*:1]C(C)(C)C	[*:1][C@H](C(=O)O)N1C[C@H](CN2CCC(c3cc(Cc4ccc(OC5CCC5)cc4)nn3CC)CC2)[C@@H](c2cccc(F)c2)C1
2	290813	290815	[*:1]C(C)(C)C>>[*:1][C@@H](C)CC	[*:1][C@H](C(=O)O)N1C[C@H](CN2CCC(c3cc(Cc4ccc(OC5CCC5)cc4)nn3CC)CC2)[C@@H](c2cccc(F)c2)C1
3	290815	290813	[*:1][C@@H](C)CC>>[*:1]C(C)(C)C	[*:1][C@H](C(=O)O)N1C[C@H](CN2CCC(c3cc(Cc4ccc(OC5CCC5)cc4)nn3CC)CC2)[C@@H](c2cccc(F)c2)C1
4	290814	290815	[*:1][C@H](C)CC>>[*:1][C@@H](C)CC	[*:1][C@H](C(=O)O)N1C[C@H](CN2CCC(c3cc(Cc4ccc(OC5CCC5)cc4)nn3CC)CC2)[C@@H](c2cccc(F)c2)C1

Now join back on the original data so that we have activities in the table again. Pandas makes this easy.

In [10]:

t1=df[['molregno','standard_value']]
mmpdds = mmps.merge(t1,left_on='molregno1',right_on='molregno',suffixes=("_1","_2")).\
   merge(t1,left_on='molregno2',right_on='molregno',suffixes=("_1","_2"))
mmpdds.head()

Out[10]:

	molregno1	molregno2	tform	core	molregno_1	standard_value_1	molregno_2	standard_value_2
0	290813	290814	[*:1]C(C)(C)C>>[*:1][C@H](C)CC	[*:1][C@H](C(=O)O)N1C[C@H](CN2CCC(c3cc(Cc4ccc(OC5CCC5)cc4)nn3CC)CC2)[C@@H](c2cccc(F)c2)C1	290813	3500	290814	3400
1	290815	290814	[*:1][C@@H](C)CC>>[*:1][C@H](C)CC	[*:1][C@H](C(=O)O)N1C[C@H](CN2CCC(c3cc(Cc4ccc(OC5CCC5)cc4)nn3CC)CC2)[C@@H](c2cccc(F)c2)C1	290815	5700	290814	3400
2	290879	290814	[*:1]C(C)(C)C>>[*:1]C1CCC1	[*:1]Oc1ccc(Cc2cc(C3CCN(C[C@H]4CN([C@@H](C(=O)O)[C@H](C)CC)C[C@@H]4c4cccc(F)c4)CC3)n(CC)n2)cc1	290879	5600	290814	3400
3	290813	290815	[*:1]C(C)(C)C>>[*:1][C@@H](C)CC	[*:1][C@H](C(=O)O)N1C[C@H](CN2CCC(c3cc(Cc4ccc(OC5CCC5)cc4)nn3CC)CC2)[C@@H](c2cccc(F)c2)C1	290813	3500	290815	5700
4	290814	290815	[*:1][C@H](C)CC>>[*:1][C@@H](C)CC	[*:1][C@H](C(=O)O)N1C[C@H](CN2CCC(c3cc(Cc4ccc(OC5CCC5)cc4)nn3CC)CC2)[C@@H](c2cccc(F)c2)C1	290814	3400	290815	5700
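
One consequence of joining this way is that a compound with several measurements in the activity table produces one row per measurement for every pair it takes part in. A small sketch with invented ids and values shows the effect:

```python
import pandas as pd

# One matched pair: compound 10 -> compound 20 (ids are made up).
pairs = pd.DataFrame({'molregno1': [10], 'molregno2': [20],
                      'tform': ['[*:1]F>>[*:1]Cl']})
# Compound 10 was measured twice, compound 20 once.
acts = pd.DataFrame({'molregno': [10, 10, 20],
                     'standard_value': [100.0, 200.0, 50.0]})

joined = (pairs
          .merge(acts, left_on='molregno1', right_on='molregno')
          .merge(acts, left_on='molregno2', right_on='molregno',
                 suffixes=('_1', '_2')))
print(joined[['molregno1', 'molregno2', 'standard_value_1', 'standard_value_2']])
```

The single pair turns into two rows, one for each measurement of compound 10.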

Calculate pKi values and the difference between them:

In [11]:

import math
mmpdds['pKi_1']=mmpdds.apply(lambda row:-1*math.log10(float(row['standard_value_1'])*1e-9),axis=1)
mmpdds['pKi_2']=mmpdds.apply(lambda row:-1*math.log10(float(row['standard_value_2'])*1e-9),axis=1)
mmpdds['delta']=mmpdds['pKi_2']-mmpdds['pKi_1']
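
Since the Ki values are in nM, the conversion is pKi = -log10(Ki * 1e-9) = 9 - log10(Ki). As a sketch, the same calculation can be done on whole columns with numpy instead of a row-wise apply. The toy frame below reuses two of the value pairs from the table above; the astype(float) call is there on the assumption that the values may not arrive as floats.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'standard_value_1': [3500.0, 5700.0],
                    'standard_value_2': [3400.0, 3400.0]})

# Ki in nM -> pKi, applied to whole columns at once
toy['pKi_1'] = 9 - np.log10(toy['standard_value_1'].astype(float))
toy['pKi_2'] = 9 - np.log10(toy['standard_value_2'].astype(float))
# positive delta: the second molecule binds more strongly
toy['delta'] = toy['pKi_2'] - toy['pKi_1']
print(toy.round(6))
```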

And remove some extra columns:

In [12]:

mmpdds=mmpdds[['mol1','mol2','molregno1','molregno2','pKi_1','pKi_2','delta','tform','core']]
mmpdds.head()

Out[12]:

	molregno1	molregno2	pKi_1	pKi_2	delta	tform	core
0	290813	290814	5.455932	5.468521	0.012589	[*:1]C(C)(C)C>>[*:1][C@H](C)CC	[*:1][C@H](C(=O)O)N1C[C@H](CN2CCC(c3cc(Cc4ccc(OC5CCC5)cc4)nn3CC)CC2)[C@@H](c2cccc(F)c2)C1
1	290815	290814	5.244125	5.468521	0.224396	[*:1][C@@H](C)CC>>[*:1][C@H](C)CC	[*:1][C@H](C(=O)O)N1C[C@H](CN2CCC(c3cc(Cc4ccc(OC5CCC5)cc4)nn3CC)CC2)[C@@H](c2cccc(F)c2)C1
2	290879	290814	5.251812	5.468521	0.216709	[*:1]C(C)(C)C>>[*:1]C1CCC1	[*:1]Oc1ccc(Cc2cc(C3CCN(C[C@H]4CN([C@@H](C(=O)O)[C@H](C)CC)C[C@@H]4c4cccc(F)c4)CC3)n(CC)n2)cc1
3	290813	290815	5.455932	5.244125	-0.211807	[*:1]C(C)(C)C>>[*:1][C@@H](C)CC	[*:1][C@H](C(=O)O)N1C[C@H](CN2CCC(c3cc(Cc4ccc(OC5CCC5)cc4)nn3CC)CC2)[C@@H](c2cccc(F)c2)C1
4	290814	290815	5.468521	5.244125	-0.224396	[*:1][C@H](C)CC>>[*:1][C@@H](C)CC	[*:1][C@H](C(=O)O)N1C[C@H](CN2CCC(c3cc(Cc4ccc(OC5CCC5)cc4)nn3CC)CC2)[C@@H](c2cccc(F)c2)C1

Analysis

Let's start by grouping related transforms together and seeing how often they occur:

In [13]:

gs=mmpdds.groupby('tform')

In [14]:

vs = [(len(y),x) for x,y in gs]
vs.sort(reverse=True)
vs[:5]

Out[14]:

[(34, '[*:1]F>>[*:1]Cl'),
 (34, '[*:1]Cl>>[*:1]F'),
 (15, '[*:1]C[*:2]>>[*:1]CC[*:2]'),
 (15, '[*:1]CC[*:2]>>[*:1]C[*:2]'),
 (14, '[*:1]F>>[*:1]C#N')]
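
The same counts can also be read straight from the grouped object. A minimal sketch on an invented frame, assuming a Pandas version that provides sort_values (the transform labels here are placeholders, not real SMILES):

```python
import pandas as pd

toy = pd.DataFrame({'tform': ['A>>B', 'A>>B', 'B>>A', 'A>>B', 'C>>D'],
                    'delta': [0.1, 0.3, -0.1, 0.2, 0.0]})

# size() gives the number of rows per transform; sort so the most
# frequent transforms come first.
counts = toy.groupby('tform').size().sort_values(ascending=False)
print(counts.head())
```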

Look at the summary stats for one of the frequent transformations:

In [15]:

gs['delta'].describe().loc['[*:1]F>>[*:1]Cl']

Out[15]:

count    34.000000
mean      0.419827
std       0.383555
min      -0.421005
25%       0.154902
50%       0.455932
75%       0.675167
max       1.149398
dtype: float64

Gather those summary stats for the tforms that occur at least 5 times and convert them into another data frame:

In [102]:

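# compute the per-transform summary statistics once, outside the loop
descrs=gs['delta'].describe()
rows=[]
for c,k in vs:
    if c>=5:
        # .loc[k] selects the statistics for transform k
        descr=descrs.loc[k]
        rows.append((k,descr['count'],descr['mean'],descr['std']))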

In [103]:

ndf = pd.DataFrame(rows,columns=('tform','count_val','mean_val','std_val'))

In [104]:

ndf.head()

Out[104]:

	tform	count_val	mean_val	std_val
0	[*:1]F>>[*:1]Cl	34	0.419827	0.383555
1	[*:1]Cl>>[*:1]F	34	-0.419827	0.383555
2	[*:1]C[*:2]>>[*:1]CC[*:2]	15	-0.115020	0.222562
3	[*:1]CC[*:2]>>[*:1]C[*:2]	15	0.115020	0.222562
4	[*:1]F>>[*:1]C#N	14	0.155265	0.089983
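
An alternative that avoids the explicit loop is to aggregate once and then filter on the count. This is a sketch on invented data, not the code that produced the table above; the threshold of 2 is chosen only to suit the toy example, and the transform labels are placeholders.

```python
import pandas as pd

toy = pd.DataFrame({'tform': ['A>>B', 'A>>B', 'A>>B', 'B>>A', 'C>>D', 'C>>D'],
                    'delta': [0.1, 0.3, 0.2, -0.1, 0.0, 0.4]})

# One row per transform with count, mean and (sample) standard deviation
stats = toy.groupby('tform')['delta'].agg(['count', 'mean', 'std'])
# Keep only transforms seen often enough, then move tform back to a column
stats = stats[stats['count'] >= 2].reset_index()
stats.columns = ['tform', 'count_val', 'mean_val', 'std_val']
print(stats)
```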

Add the two bits of the transformation as molecules so that we can visualize them

In [105]:

ndf['react']=ndf.apply(lambda row:row['tform'].split('>>')[0],axis=1)
ndf['prod']=ndf.apply(lambda row:row['tform'].split('>>')[1],axis=1)

In [106]:

PandasTools.AddMoleculeColumnToFrame(ndf,'react','reactmol')
PandasTools.AddMoleculeColumnToFrame(ndf,'prod','prodmol')
ndf.head()

Out[106]:

	tform	count_val	mean_val	std_val	react	prod
0	[*:1]F>>[*:1]Cl	34	0.419827	0.383555	[*:1]F	[*:1]Cl
1	[*:1]Cl>>[*:1]F	34	-0.419827	0.383555	[*:1]Cl	[*:1]F
2	[*:1]C[*:2]>>[*:1]CC[*:2]	15	-0.115020	0.222562	[*:1]C[*:2]	[*:1]CC[*:2]
3	[*:1]CC[*:2]>>[*:1]C[*:2]	15	0.115020	0.222562	[*:1]CC[*:2]	[*:1]C[*:2]
4	[*:1]F>>[*:1]C#N	14	0.155265	0.089983	[*:1]F	[*:1]C#N
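
Splitting the transform into its two halves can also be done with the vectorized string methods rather than a row-wise apply. A minimal sketch, assuming a Pandas version where str.split accepts expand=True:

```python
import pandas as pd

toy = pd.DataFrame({'tform': ['[*:1]F>>[*:1]Cl',
                              '[*:1]C[*:2]>>[*:1]CC[*:2]']})

# expand=True returns a frame with one column per piece
parts = toy['tform'].str.split('>>', expand=True)
toy['react'] = parts[0]
toy['prod'] = parts[1]
print(toy)
```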

And now let's see all the transforms that, on average, reduce hERG binding by at least 0.3 log units.

In [107]:

ndf[ndf.mean_val<-.3]

Out[107]:

	tform	count_val	mean_val	std_val	react	prod
1	[*:1]Cl>>[*:1]F	34	-0.419827	0.383555	[*:1]Cl	[*:1]F
13	[*:1]Cl>>[*:1]OC	9	-0.342221	0.694424	[*:1]Cl	[*:1]OC
16	[*:1]Cl>>[*:1]C#N	7	-0.437406	0.431559	[*:1]Cl	[*:1]C#N
21	[*:1]C1CC1>>[*:1]C	6	-0.655480	0.580657	[*:1]C1CC1	[*:1]C
25	[*:1]F>>[*:1]OC	5	-0.462994	0.192751	[*:1]F	[*:1]OC

Not much signal in there when you take the standard deviation into account, but to continue showing what's possible with Pandas, we can at least look at the pairs for the last one:

In [19]:

mmpdds[mmpdds['tform']=='[*:1]F>>[*:1]OC'][['mol1','mol2','pKi_1','pKi_2','delta']]

Out[19]:

	pKi_1	pKi_2	delta
85	6.370590	5.835350	-0.535241
86	7.744727	6.991400	-0.753328
303	5.862329	5.576918	-0.285411
349	7.244125	6.801343	-0.442782
487	6.033389	5.735182	-0.298207

Those are all a consistent structural modification to the same core.

Look at bigger transformations

Try increasing the cutoff when running the pair-generation algorithm to see if we get more/larger tforms.

In [11]:

!python $RDBASE/Contrib/mmpa/indexing.py -s -r 0.25 < ../data/herg_fragmented.txt > ../data/mmps_larger.txt

In [67]:

mmps = pd.read_csv('../data/mmps_larger.txt',header=None,names=('smiles1','smiles2','molregno1','molregno2','tform','core'))
mmps=mmps.drop_duplicates(subset=("molregno1","molregno2"))
PandasTools.AddMoleculeColumnToFrame(mmps,'smiles1','mol1')
PandasTools.AddMoleculeColumnToFrame(mmps,'smiles2','mol2')
mmps = mmps[['mol1','mol2','molregno1','molregno2','tform','core']]
t1=df[['molregno','standard_value']]
mmpdds = mmps.merge(t1,left_on='molregno1',right_on='molregno',suffixes=("_1","_2")).\
   merge(t1,left_on='molregno2',right_on='molregno',suffixes=("_1","_2"))
   
import math
mmpdds['pKi_1']=mmpdds.apply(lambda row:-1*math.log10(float(row['standard_value_1'])*1e-9),axis=1)
mmpdds['pKi_2']=mmpdds.apply(lambda row:-1*math.log10(float(row['standard_value_2'])*1e-9),axis=1)
mmpdds['delta']=mmpdds['pKi_2']-mmpdds['pKi_1']
mmpdds=mmpdds[['mol1','mol2','molregno1','molregno2','pKi_1','pKi_2','delta','tform','core']]
mmpdds.head()

Out[67]:

	molregno1	molregno2	pKi_1	pKi_2	delta	tform	core
0	1333317	1333318	5.823909	5.958607	0.134699	[*:1]Cc1ccccc1[*:2]>>[*:1]Cc1cccc([*:2])c1	[*:2]F.[*:1]n1ccc2c1ncnc2OC1CCN(Cc2cscn2)CC1
1	1333327	1333318	6.221849	5.958607	-0.263241	[*:1]C(F)(F)F>>[*:1]F	[*:1]c1cccc(Cn2ccc3c2ncnc3OC2CCN(Cc3cscn3)CC2)c1
2	1333330	1333318	6.022276	5.958607	-0.063669	[*:1]c1cccc(Cl)c1[*:2]>>[*:1]c1cccc([*:2])c1	[*:2]F.[*:1]Cn1ccc2c1ncnc2OC1CCN(Cc2cscn2)CC1
3	1333321	1333318	5.920819	5.958607	0.037789	[*:1]c1cccc(Cl)c1>>[*:1]c1cccc(F)c1	[*:1]Cn1ccc2c1ncnc2OC1CCN(Cc2cscn2)CC1
4	1333329	1333318	5.638272	5.958607	0.320335	[*:1]c1cccc(F)c1[*:2]>>[*:1]c1cccc([*:2])c1	[*:2]F.[*:1]Cn1ccc2c1ncnc2OC1CCN(Cc2cscn2)CC1

In [68]:

gs=mmpdds.groupby('tform')
vs = [(len(y),x) for x,y in gs]
vs.sort(reverse=True)
vs[:5]

Out[68]:

[(30, '[*:1]F>>[*:1]Cl'),
 (30, '[*:1]Cl>>[*:1]F'),
 (14, '[*:1]F>>[*:1]C#N'),
 (14, '[*:1]C#N>>[*:1]F'),
 (13, '[*:1]c1cccc([*:2])c1>>[*:1]c1ccc([*:2])cc1')]

In [69]:

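# compute the per-transform summary statistics once, outside the loop
descrs=gs['delta'].describe()
rows=[]
for c,k in vs:
    if c>=5:
        # .loc[k] selects the statistics for transform k
        descr=descrs.loc[k]
        rows.append((k,descr['count'],descr['mean'],descr['std']))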
ndf = pd.DataFrame(rows,columns=('tform','count_val','mean_val','std_val'))
ndf['react']=ndf.apply(lambda row:row['tform'].split('>>')[0],axis=1)
ndf['prod']=ndf.apply(lambda row:row['tform'].split('>>')[1],axis=1)
PandasTools.AddMoleculeColumnToFrame(ndf,'react','reactmol')
PandasTools.AddMoleculeColumnToFrame(ndf,'prod','prodmol')

In [70]:

# DataFrame.sort() was removed from later Pandas versions; sort_values() replaces it
ndf[ndf.mean_val<-.3].sort_values('mean_val')

Out[70]:

	tform	count_val	mean_val	std_val	react	prod
17	[*:1]Cc1ccccc1>>[*:1]C	8	-0.847827	0.827364	[*:1]Cc1ccccc1	[*:1]C
31	[*:1]c1ccc(Cl)cc1>>[*:1]c1ccccc1F	6	-0.684998	0.122674	[*:1]c1ccc(Cl)cc1	[*:1]c1ccccc1F
49	[*:1]C1CC1>>[*:1]C	5	-0.647260	0.443866	[*:1]C1CC1	[*:1]C
11	[*:1]C#N>>[*:1]C(N)=O	10	-0.594131	0.339552	[*:1]C#N	[*:1]C(N)=O
1	[*:1]Cl>>[*:1]F	30	-0.469615	0.368912	[*:1]Cl	[*:1]F
24	[*:1]Cl>>[*:1]C#N	7	-0.437406	0.431559	[*:1]Cl	[*:1]C#N
15	[*:1]C(F)(F)F>>[*:1]C	9	-0.376203	0.524325	[*:1]C(F)(F)F	[*:1]C
34	[*:1]F>>[*:1]OC	6	-0.363874	0.297777	[*:1]F	[*:1]OC
36	[*:1]Cl>>[*:1]OC	6	-0.345202	0.855605	[*:1]Cl	[*:1]OC
44	[*:1]c1cccc(C)c1>>[*:1]c1ccccc1	5	-0.310806	0.185063	[*:1]c1cccc(C)c1	[*:1]c1ccccc1
5	[*:1]c1ccc([*:2])cc1>>[*:1]c1cccc([*:2])c1	13	-0.301016	0.518341	[*:1]c1ccc([*:2])cc1	[*:1]c1cccc([*:2])c1

In [95]:

tform='[*:1]c1ccc(Cl)cc1>>[*:1]c1ccccc1F'
mmpdds[mmpdds['tform']==tform][['molregno1','molregno2','mol1','mol2','pKi_1','pKi_2','delta']]

Out[95]:

	molregno1	molregno2	pKi_1	pKi_2	delta
2184	408189	408192	6.688246	6.149354	-0.538892
2711	408196	408198	6.686133	5.886057	-0.800076
2712	408196	408198	6.659556	5.886057	-0.773499
2713	408196	408198	6.669586	5.886057	-0.783530
2714	408196	408198	6.419075	5.886057	-0.533018
2715	408196	408198	6.567031	5.886057	-0.680974

Again, a nice set of modifications to a consistent core.

Note that there are really only two pairs here; the extra rows arise from repeated measurements in the paper (we'll see those below).
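
To check how many distinct pairs are behind a transformation, drop the repeated id combinations. The sketch below re-types the ids and deltas from the table above into a small frame; averaging the deltas per pair is just one possible way to keep repeated measurements from being over-counted, not something done elsewhere in this notebook.

```python
import pandas as pd

rows = pd.DataFrame({
    'molregno1': [408189, 408196, 408196, 408196, 408196, 408196],
    'molregno2': [408192, 408198, 408198, 408198, 408198, 408198],
    'delta': [-0.538892, -0.800076, -0.773499, -0.783530, -0.533018, -0.680974],
})

# Distinct (molregno1, molregno2) combinations
unique_pairs = rows[['molregno1', 'molregno2']].drop_duplicates()
# One averaged delta per pair
per_pair = rows.groupby(['molregno1', 'molregno2'])['delta'].mean()
print(len(unique_pairs))
print(per_pair)
```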

An aside

This is a brief exploration to look at additional data that's available from ChEMBL. This isn't the ideal example for it, but hopefully it will still be a useful start.

We would assume that the last set of examples all came from the same paper, but we can confirm that.

Start by getting the unique ChEMBL compound numbers:

In [101]:

regnos = list(mmpdds[mmpdds['tform']==tform]['molregno1'])
regnos += list(mmpdds[mmpdds['tform']==tform]['molregno2'])
regnos=tuple(set(regnos))

Now get the documents that have Ki values for those compounds:

In [97]:

%sql select distinct(activities.doc_id) from activities join assays using (assay_id) \
  where tid=165 and standard_type='Ki' and molregno in :regnos;

1 rows affected.

Out[97]:

doc_id
37427

Query our local ChEMBL instance to get info about the document:

In [53]:

docid = _[0]['doc_id']
%sql select * from docs where doc_id=:docid;

1 rows affected.

Out[53]:

doc_id	journal	year	volume	issue	first_page	last_page	pubmed_id	doi	chembl_id	title	doc_type	authors	abstract
37427	Bioorg. Med. Chem. Lett.	2007	17	6	1675	1678	17257843	None	CHEMBL1139118	None	PUBLICATION	None	None

ChEMBL doesn't have the article title, but we can get that easily enough from PubMed:

In [54]:

pmid = _[0]['pubmed_id']
txt=requests.get('http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&id=%d'%pmid).text
et = ElementTree.fromstring(txt.encode('utf-8'))
et.findall(".//*[@Name='Title']")[0].text

Out[54]:

'A novel, non-substrate-based series of glycine type 1 transporter inhibitors derived from high-throughput screening.'
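
The findall call relies on ElementTree's limited XPath support: .//*[@Name='Title'] matches any descendant element whose Name attribute equals Title. A self-contained sketch on a hand-written XML snippet that only imitates the general shape of such a response (the snippet and its contents are assumptions for illustration, not a captured reply from the service):

```python
from xml.etree import ElementTree

# Hand-written stand-in for a summary document; values are invented.
sample = """<eSummaryResult>
  <DocSum>
    <Id>12345</Id>
    <Item Name="Source" Type="String">Example J.</Item>
    <Item Name="Title" Type="String">An example title.</Item>
  </DocSum>
</eSummaryResult>"""

root = ElementTree.fromstring(sample.encode('utf-8'))
# Any element, at any depth, with Name="Title"
titles = [el.text for el in root.findall(".//*[@Name='Title']")]
print(titles)
```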

Pull the other assays from the paper:

In [98]:

%sql select * from assays where assay_id in (select distinct(assay_id) from activities where doc_id = :docid);

6 rows affected.

Out[98]:

assay_id	doc_id	description	assay_type	assay_test_type	assay_category	assay_organism	assay_tax_id	assay_strain	assay_tissue	assay_cell_type	assay_subcellular_fraction	tid	relationship_type	confidence_score	curated_by	src_id	src_assay_id	chembl_id	cell_id	bao_format
454229	37427	Metabolic stability in human liver microsomes assessed as half life	A	In vitro	None	Homo sapiens	9606	None	Liver	None	Microsomes	102164	S	2	Autocuration	1	None	CHEMBL903412	None	BAO_0000251
454226	37427	Displacement of [3H]5-hydroxytrytamine from human 5HT1B receptor expressed in HEK293 cells	B	None	None	Homo sapiens	9606	None	None	HEK293	None	106	D	9	Intermediate	1	None	CHEMBL903407	722	BAO_0000219
454227	37427	Displacement of [3H]dofetilide from human ERG channel expressed in HEK293 cells	B	None	None	Homo sapiens	9606	None	None	HEK293	None	165	D	9	Intermediate	1	None	CHEMBL903410	722	BAO_0000219
454228	37427	Inhibition of human recombinant CYP2D6 at 1.5 uM	A	None	None	Homo sapiens	9606	None	None	None	None	11365	D	9	Intermediate	1	None	CHEMBL903411	None	BAO_0000357
454224	37427	Displacement of [3H]NPTS from human GlyT1C expressed in HEK293 cells	B	None	None	Homo sapiens	9606	None	None	HEK293	None	11682	D	9	Intermediate	1	None	CHEMBL903408	722	BAO_0000219
454225	37427	Inhibition of [3H]glycine uptake at human GlyT2 expressed in HEK293 cells	B	None	None	Homo sapiens	9606	None	None	HEK293	None	11596	D	9	Intermediate	1	None	CHEMBL903409	722	BAO_0000219

And look at the values for our compounds:

In [102]:

assayData=%sql select * from activities join assays using (assay_id) \
    where activities.doc_id=:docid \
        and molregno in :regnos \
        and standard_value is not null \
        and assay_id!=454227;
assayData

21 rows affected.

Out[102]:

assay_id	activity_id	doc_id	record_id	molregno	standard_relation	published_value	published_units	standard_value	standard_units	standard_flag	standard_type	activity_comment	published_type	data_validity_comment	potential_duplicate	published_relation	pchembl_value	bao_endpoint	uo_units	qudt_units	doc_id_1	description	assay_type	assay_test_type	assay_category	assay_organism	assay_tax_id	assay_strain	assay_tissue	assay_cell_type	assay_subcellular_fraction	tid	relationship_type	confidence_score	curated_by	src_id	src_assay_id	chembl_id	cell_id	bao_format
454224	2020479	37427	671221	408196	=	26	nM	26	nM	1	IC50	None	IC50	None	None	=	7.59	BAO_0000190	UO_0000065	http://www.openphacts.org/units/Nanomolar	37427	Displacement of [3H]NPTS from human GlyT1C expressed in HEK293 cells	B	None	None	Homo sapiens	9606	None	None	HEK293	None	11682	D	9	Intermediate	1	None	CHEMBL903408	722	BAO_0000219
454224	2020480	37427	671220	408196	=	31	nM	31	nM	1	IC50	None	IC50	None	None	=	7.51	BAO_0000190	UO_0000065	http://www.openphacts.org/units/Nanomolar	37427	Displacement of [3H]NPTS from human GlyT1C expressed in HEK293 cells	B	None	None	Homo sapiens	9606	None	None	HEK293	None	11682	D	9	Intermediate	1	None	CHEMBL903408	722	BAO_0000219
454224	2020481	37427	671219	408196	=	24	nM	24	nM	1	IC50	None	IC50	None	None	=	7.62	BAO_0000190	UO_0000065	http://www.openphacts.org/units/Nanomolar	37427	Displacement of [3H]NPTS from human GlyT1C expressed in HEK293 cells	B	None	None	Homo sapiens	9606	None	None	HEK293	None	11682	D	9	Intermediate	1	None	CHEMBL903408	722	BAO_0000219
454224	2020482	37427	671218	408196	=	38	nM	38	nM	1	IC50	None	IC50	None	None	=	7.42	BAO_0000190	UO_0000065	http://www.openphacts.org/units/Nanomolar	37427	Displacement of [3H]NPTS from human GlyT1C expressed in HEK293 cells	B	None	None	Homo sapiens	9606	None	None	HEK293	None	11682	D	9	Intermediate	1	None	CHEMBL903408	722	BAO_0000219
454224	2020485	37427	671215	408198	=	10.2	nM	10.2	nM	1	Ki	None	Ki	None	None	=	7.99	BAO_0000192	UO_0000065	http://www.openphacts.org/units/Nanomolar	37427	Displacement of [3H]NPTS from human GlyT1C expressed in HEK293 cells	B	None	None	Homo sapiens	9606	None	None	HEK293	None	11682	D	9	Intermediate	1	None	CHEMBL903408	722	BAO_0000219
454224	2020486	37427	671214	408196	=	15.9	nM	15.9	nM	1	Ki	None	Ki	None	None	=	7.80	BAO_0000192	UO_0000065	http://www.openphacts.org/units/Nanomolar	37427	Displacement of [3H]NPTS from human GlyT1C expressed in HEK293 cells	B	None	None	Homo sapiens	9606	None	None	HEK293	None	11682	D	9	Intermediate	1	None	CHEMBL903408	722	BAO_0000219
454224	2020488	37427	671212	408192	=	61.4	nM	61.4	nM	1	Ki	None	Ki	None	None	=	7.21	BAO_0000192	UO_0000065	http://www.openphacts.org/units/Nanomolar	37427	Displacement of [3H]NPTS from human GlyT1C expressed in HEK293 cells	B	None	None	Homo sapiens	9606	None	None	HEK293	None	11682	D	9	Intermediate	1	None	CHEMBL903408	722	BAO_0000219
454224	2020490	37427	671210	408189	=	26.8	nM	26.8	nM	1	Ki	None	Ki	None	None	=	7.57	BAO_0000192	UO_0000065	http://www.openphacts.org/units/Nanomolar	37427	Displacement of [3H]NPTS from human GlyT1C expressed in HEK293 cells	B	None	None	Homo sapiens	9606	None	None	HEK293	None	11682	D	9	Intermediate	1	None	CHEMBL903408	722	BAO_0000219
454225	2020495	37427	671215	408198	=	385	nM	385	nM	1	IC50	None	IC50	None	None	=	6.41	BAO_0000190	UO_0000065	http://www.openphacts.org/units/Nanomolar	37427	Inhibition of [3H]glycine uptake at human GlyT2 expressed in HEK293 cells	B	None	None	Homo sapiens	9606	None	None	HEK293	None	11596	D	9	Intermediate	1	None	CHEMBL903409	722	BAO_0000219
454225	2020496	37427	671214	408196	=	319	nM	319	nM	1	IC50	None	IC50	None	None	=	6.50	BAO_0000190	UO_0000065	http://www.openphacts.org/units/Nanomolar	37427	Inhibition of [3H]glycine uptake at human GlyT2 expressed in HEK293 cells	B	None	None	Homo sapiens	9606	None	None	HEK293	None	11596	D	9	Intermediate	1	None	CHEMBL903409	722	BAO_0000219
454225	2020498	37427	671212	408192	=	31.5	nM	31.5	nM	1	IC50	None	IC50	None	None	=	7.50	BAO_0000190	UO_0000065	http://www.openphacts.org/units/Nanomolar	37427	Inhibition of [3H]glycine uptake at human GlyT2 expressed in HEK293 cells	B	None	None	Homo sapiens	9606	None	None	HEK293	None	11596	D	9	Intermediate	1	None	CHEMBL903409	722	BAO_0000219
454226	2020506	37427	671214	408196	>	1000	nM	1000	nM	1	Ki	None	Ki	None	None	>	None	BAO_0000192	UO_0000065	http://www.openphacts.org/units/Nanomolar	37427	Displacement of [3H]5-hydroxytrytamine from human 5HT1B receptor expressed in HEK293 cells	B	None	None	Homo sapiens	9606	None	None	HEK293	None	106	D	9	Intermediate	1	None	CHEMBL903407	722	BAO_0000219
454226	2020510	37427	671210	408189	=	1490	nM	1490	nM	1	Ki	None	Ki	None	None	=	5.83	BAO_0000192	UO_0000065	http://www.openphacts.org/units/Nanomolar	37427	Displacement of [3H]5-hydroxytrytamine from human 5HT1B receptor expressed in HEK293 cells	B	None	None	Homo sapiens	9606	None	None	HEK293	None	106	D	9	Intermediate	1	None	CHEMBL903407	722	BAO_0000219
454228	2020529	37427	671215	408198	=	43	%	43	%	1	Inhibition	None	Inhibition	None	None	=	None	BAO_0000201	UO_0000187	http://qudt.org/vocab/unit#Percent	37427	Inhibition of human recombinant CYP2D6 at 1.5 uM	A	None	None	Homo sapiens	9606	None	None	None	None	11365	D	9	Intermediate	1	None	CHEMBL903411	None	BAO_0000357
454228	2020530	37427	671214	408196	=	96	%	96	%	1	Inhibition	None	Inhibition	None	None	=	None	BAO_0000201	UO_0000187	http://qudt.org/vocab/unit#Percent	37427	Inhibition of human recombinant CYP2D6 at 1.5 uM	A	None	None	Homo sapiens	9606	None	None	None	None	11365	D	9	Intermediate	1	None	CHEMBL903411	None	BAO_0000357
454228	2020532	37427	671212	408192	=	85	%	85	%	1	Inhibition	None	Inhibition	None	None	=	None	BAO_0000201	UO_0000187	http://qudt.org/vocab/unit#Percent	37427	Inhibition of human recombinant CYP2D6 at 1.5 uM	A	None	None	Homo sapiens	9606	None	None	None	None	11365	D	9	Intermediate	1	None	CHEMBL903411	None	BAO_0000357
454228	2020534	37427	671210	408189	=	47	%	47	%	1	Inhibition	None	Inhibition	None	None	=	None	BAO_0000201	UO_0000187	http://qudt.org/vocab/unit#Percent	37427	Inhibition of human recombinant CYP2D6 at 1.5 uM	A	None	None	Homo sapiens	9606	None	None	None	None	11365	D	9	Intermediate	1	None	CHEMBL903411	None	BAO_0000357
454229	2020539	37427	671215	408198	=	17	min	0.283	hr	1	T1/2	None	t1/2	None	None	=	None	BAO_0002115	UO_0000032	http://qudt.org/vocab/unit#Hour	37427	Metabolic stability in human liver microsomes assessed as half life	A	In vitro	None	Homo sapiens	9606	None	Liver	None	Microsomes	102164	S	2	Autocuration	1	None	CHEMBL903412	None	BAO_0000251
454229	2020540	37427	671214	408196	=	22	min	0.367	hr	1	T1/2	None	t1/2	None	None	=	None	BAO_0002115	UO_0000032	http://qudt.org/vocab/unit#Hour	37427	Metabolic stability in human liver microsomes assessed as half life	A	In vitro	None	Homo sapiens	9606	None	Liver	None	Microsomes	102164	S	2	Autocuration	1	None	CHEMBL903412	None	BAO_0000251
454229	2020542	37427	671212	408192	=	14	min	0.233	hr	1	T1/2	None	t1/2	None	None	=	None	BAO_0002115	UO_0000032	http://qudt.org/vocab/unit#Hour	37427	Metabolic stability in human liver microsomes assessed as half life	A	In vitro	None	Homo sapiens	9606	None	Liver	None	Microsomes	102164	S	2	Autocuration	1	None	CHEMBL903412	None	BAO_0000251
454229	2020544	37427	671210	408189	=	9	min	0.15	hr	1	T1/2	None	t1/2	None	None	=	None	BAO_0002115	UO_0000032	http://qudt.org/vocab/unit#Hour	37427	Metabolic stability in human liver microsomes assessed as half life	A	In vitro	None	Homo sapiens	9606	None	Liver	None	Microsomes	102164	S	2	Autocuration	1	None	CHEMBL903412	None	BAO_0000251
