Multi-dimensional scaling with myChEMBL, RDKit and Pandas¶

myChEMBL team, ChEMBL group, EMBL-EBI.¶

This is a slightly modified version of this article by Greg Landrum , which in turn is based on this notebook by George Papadatos.

Scikit-learn makes it quite easy to apply multi-dimensional scaling (MDS) to either reduce the dimensionality of a dataset or to embed distance data into cartesian space. This enables one of the favorite activities of the cheminformatician: producing plots of where compounds land in an 2D space.

In [1]:

from chembl_webresource_client import *
import pandas as pd
from rdkit.Chem import PandasTools
from rdkit.Chem import AllChem as Chem
from rdkit.Chem import DataStructs
from sklearn import manifold
import warnings
warnings.filterwarnings('ignore')

height has been deprecated.

In [2]:

rcParams['figure.figsize'] = 8,8

Let's configure our client to use local version of web services. Just delete the cell below, if you want to use the official version. In that case, you will need internet access.

In [3]:

from chembl_webresource_client.settings import Settings
Settings.Instance().WEBSERVICE_DOMAIN = 'localhost'
Settings.Instance().WEBSERVICE_PROTOCOL = 'http'

#If your VM is slow, you can consider increasing timeout from default 3 seconds, just uncomment line below:
Settings.Instance().TIMEOUT = 120

Grab the data¶

I'm going to start with a larger dataset so that I can explore the impact of dataset size on the results. The analysis of George's results are below.

I'll use the Dopamine D3 receptor as the target for this exercise. It's one of the targets we used in both the benchmarking and model fusion papers.

In [4]:

targets = TargetResource()
target='CHEMBL234'
bio = targets.bioactivities(target)
data = pd.DataFrame(bio)
data.head()

Out[4]:

	activity_comment	assay_chemblid	assay_description	assay_type	bioactivity_type	ingredient_cmpd_chemblid	name_in_reference	operator	organism	parent_cmpd_chemblid	reference	target_chemblid	target_confidence	target_name	units	value
0	Unspecified	CHEMBL666973	Binding affinity was measured at cloned mammalian dopamine D3 receptor expressed in CHO-K1 cells (using [3H]- spiperone)	B	Ki	CHEMBL278751	2 (+)-UH232	=	Homo sapiens	CHEMBL278751	Bioorg. Med. Chem. Lett., (1994) 4:5:689	CHEMBL234	8	Dopamine D3 receptor	nM	4.2
1	Unspecified	CHEMBL666973	Binding affinity was measured at cloned mammalian dopamine D3 receptor expressed in CHO-K1 cells (using [3H]- spiperone)	B	Ki	CHEMBL161811	9	=	Homo sapiens	CHEMBL161811	Bioorg. Med. Chem. Lett., (1994) 4:5:689	CHEMBL234	8	Dopamine D3 receptor	nM	40
2	Unspecified	CHEMBL666973	Binding affinity was measured at cloned mammalian dopamine D3 receptor expressed in CHO-K1 cells (using [3H]- spiperone)	B	Ki	CHEMBL349843	5	=	Homo sapiens	CHEMBL349843	Bioorg. Med. Chem. Lett., (1994) 4:5:689	CHEMBL234	8	Dopamine D3 receptor	nM	171
3	Unspecified	CHEMBL666973	Binding affinity was measured at cloned mammalian dopamine D3 receptor expressed in CHO-K1 cells (using [3H]- spiperone)	B	Ki	CHEMBL27441	1 (+)-AJ76	=	Homo sapiens	CHEMBL27441	Bioorg. Med. Chem. Lett., (1994) 4:5:689	CHEMBL234	8	Dopamine D3 receptor	nM	26
4	Unspecified	CHEMBL666973	Binding affinity was measured at cloned mammalian dopamine D3 receptor expressed in CHO-K1 cells (using [3H]- spiperone)	B	Ki	CHEMBL161507	6	=	Homo sapiens	CHEMBL161507	Bioorg. Med. Chem. Lett., (1994) 4:5:689	CHEMBL234	8	Dopamine D3 receptor	nM	253

In [5]:

data.shape

Out[5]:

(6812, 16)

Remove duplicates¶

In [6]:

data = data.drop_duplicates(['parent_cmpd_chemblid'])
data.shape

Out[6]:

(4447, 16)

Remove assays with too many or too few compounds¶

In [7]:

assays = data[['assay_chemblid','target_chemblid']].groupby('assay_chemblid').count()
print assays.shape
assays.head(5)

(488, 1)

Out[7]:

	target_chemblid
assay_chemblid
CHEMBL1030601	12
CHEMBL1030747	1
CHEMBL1031083	1
CHEMBL1031097	3
CHEMBL1031401	1

In [8]:

assays.target_chemblid.hist(bins=20)

Out[8]:

<matplotlib.axes.AxesSubplot at 0x7f9820192e10>

In [9]:

goodassays = assays.ix[(assays.target_chemblid >= 10)&(assays.target_chemblid <= 30)]
print goodassays.shape
goodassays.target_chemblid.hist(bins=20)

(114, 1)

Out[9]:

<matplotlib.axes.AxesSubplot at 0x7f982016bf90>

In [10]:

data2 = data.ix[data.assay_chemblid.isin(list(goodassays.index))]
data2.shape

Out[10]:

(1918, 16)

Now get the SMILES¶

In [11]:

compounds = CompoundResource()
cs = compounds.get(list(data2['parent_cmpd_chemblid']))
smiles = pd.Series(map(lambda x: x['smiles'], cs), index=data2.index)
data2['SMILES'] = smiles

In [12]:

PandasTools.AddMoleculeColumnToFrame(data2, smilesCol = 'SMILES',includeFingerprints=True)

Let's start by looking at assays that have between 10 and 12 compounds:

In [13]:

assays = data2[['assay_chemblid','target_chemblid']].groupby('assay_chemblid').count()
print assays.shape
goodassays = assays.ix[(assays.target_chemblid >= 10)&(assays.target_chemblid <= 12)]
subset = data2.ix[data.assay_chemblid.isin(list(goodassays.index))]
print subset.shape

(114, 1)
(262, 18)

In [14]:

mols = subset[['parent_cmpd_chemblid','name_in_reference','SMILES', 'ROMol', 'assay_chemblid']]
mols.shape

Out[14]:

(262, 5)

In [15]:

mols.head(5)

Out[15]:

	parent_cmpd_chemblid	name_in_reference	SMILES	assay_chemblid
23	CHEMBL87260	5	Clc1ccc(cc1)N2CCN(CCCOc3ccc(cc3)c4nc5ccccc5[nH]4)CC2	CHEMBL666891
24	CHEMBL87478	9	CCCSc1ccccc1N2CCN(CCCOc3ccc(cc3)c4nc5ccccc5[nH]4)CC2	CHEMBL666891
25	CHEMBL85441	3	C(COc1ccc(cc1)c2nc3ccccc3[nH]2)CN4CCN(CC4)c5ccccc5	CHEMBL666891
26	CHEMBL85700	4	CN1CCN(CCCOc2ccc(cc2)c3nc4ccccc4[nH]3)CC1	CHEMBL666891
27	CHEMBL314983	11	C(COc1ccc(cc1)c2nc3ccccc3[nH]2)CN4CCN(CC4)c5ccccn5	CHEMBL666891

Construct fingerprints:

In [16]:

fps = [Chem.GetMorganFingerprintAsBitVect(m,2,nBits=2048) for m in mols['ROMol']]

Now the distance matrix:

In [17]:

dist_mat = []
for i,fp in enumerate(fps):
    dist_mat.append(DataStructs.BulkTanimotoSimilarity(fps[i],fps,returnDistance=1))
dist_mat=numpy.array(dist_mat)

And now do the MDS into 2 dimensions:

In [18]:

mds = manifold.MDS(n_components=2, dissimilarity="precomputed", random_state=3, n_jobs = 4, verbose=1,max_iter=1000)
results = mds.fit(dist_mat)
coords = results.embedding_
print 'Final stress:',mds.stress_

Final stress: 3052.83827897
breaking at iteration 276 with stress 3227.03049273
breaking at iteration 248 with stress 3166.89856976
breaking at iteration 258 with stress 3114.92766418
breaking at iteration 297 with stress 3052.83827897

And plot the points:

In [19]:

scatter([x for x,y in coords], [y for x,y in coords],edgecolors='none')

Out[19]:

<matplotlib.collections.PathCollection at 0x7f981fb0d6d0>

Nice picture. How accurate is it?

The strain values above lead me to believe that there are probably some real problems here.

The overall strain value isn't particularly easy to interpret, so we'll check up on things by comparing the embedded distances with the Tanimoto distances we're trying to reproduce.

In [20]:

import random
def distCompare(dmat,coords,nPicks=5000,seed=0xf00d):
    """ picks a random set of pairs of points to compare distances """
    nPts=len(coords)
    random.seed(seed)
    res=[]
    keep=set()
    if nPicks>0:
        while len(res)<nPicks:
            idx1 = random.randint(0,nPts-1)
            idx2 = random.randint(0,nPts-1)
            if idx1==idx2: 
                continue
            if idx1>idx2: 
                idx1,idx2=idx2,idx1
            if (idx1,idx2) in keep: 
                continue
            keep.add((idx1,idx2))
            p1 = coords[idx1]
            p2 = coords[idx2]
            v = p1-p2
            d = sqrt(v.dot(v))
            res.append((dmat[idx1][idx2],d))
    else:
        for idx1 in range(nPts):
            for idx2 in range(idx1+1,nPts):
                p1 = coords[idx1]
                p2 = coords[idx2]
                v = p1-p2
                d = sqrt(v.dot(v))
                res.append((dmat[idx1][idx2],d))
    return res

In [21]:

d = distCompare(dist_mat,coords)

In [22]:

scatter([x for x,y in d],[y for x,y in d],edgecolors='none')

Out[22]:

<matplotlib.collections.PathCollection at 0x7f981f9bded0>

Yikes... that's no good. This really isn't so surprising given how much stress was left.

Let's try increasing the dimensionality to see how much improvement we can get:

In [23]:

mds2 = manifold.MDS(n_components=20, dissimilarity="precomputed", random_state=3, n_jobs = 4, verbose=1,max_iter=1000)
results = mds2.fit(dist_mat)
coords = results.embedding_
print 'Final stress:',mds2.stress_

Final stress: 57.0050716266
breaking at iteration 203 with stress 57.0050716266
breaking at iteration 199 with stress 59.5816001681
breaking at iteration 206 with stress 58.1255498141
breaking at iteration 207 with stress 57.0929415605

In [60]:

d = distCompare(dist_mat,coords)
scatter([x for x,y in d],[y for x,y in d],edgecolors='none')

Out[60]:

<matplotlib.collections.PathCollection at 0x7f402994d150>

Not so bad... of course, we can't visualize 20 dimensions.

What about if we try doing dimensionality reduction on the 20D results instead of using the distance matrix?

In [61]:

mds3 = manifold.MDS(n_components=2, random_state=3, n_jobs = 4, verbose=1,max_iter=1000)
results3 = mds3.fit(coords)
coords3 = results3.embedding_
print 'Final stress:',mds3.stress_

Final stress: 2943.91225488
breaking at iteration 239 with stress 3011.93802972
breaking at iteration 263 with stress 3033.56503189
breaking at iteration 265 with stress 3176.13501461
breaking at iteration 406 with stress 2943.91225488

The strain is still really high. This isn't that surprising given that we already tried the same thing from the distance matrix.

Ok, so MDS doesn't work. Now that we have actual coordinates, we can use some of the other scikit-learn manifold learning approaches for embedding the points.

Instead of jumping all the way to a 2D system, where we're really unlikely to get decent results, start by dropping down to 10D.

First MDS in 10D:

In [62]:

mds3 = manifold.MDS(n_components=10, random_state=3, n_jobs = 4, verbose=1,max_iter=1000)
results3 = mds3.fit(coords)
coords3 = results3.embedding_
print 'Final stress:',results3.stress_
ds = distCompare(dist_mat,coords3)
scatter([x for x,y in d],[y for x,y in d],edgecolors='none')

Final stress: 164.997967472
breaking at iteration 187 with stress 167.970824753
breaking at iteration 195 with stress 167.426219544
breaking at iteration 206 with stress 168.6442109
breaking at iteration 208 with stress 164.997967472

Out[62]:

<matplotlib.collections.PathCollection at 0x7f402afc6e50>

Now try locally linear embedding:

In [63]:

lle = manifold.LocallyLinearEmbedding(n_components=10,random_state=3,max_iter=1000)
results3=lle.fit(coords)
coords3=results3.embedding_
print results3.reconstruction_error_
d = distCompare(dist_mat,coords3)
scatter([x for x,y in d],[y for x,y in d],edgecolors='none')

7.87063669948e-15

Out[63]:

<matplotlib.collections.PathCollection at 0x7f402a453190>

Ick! What about Isomap?

In [64]:

embed = manifold.Isomap(n_components=10,max_iter=1000)
results3=embed.fit(coords)
coords3=results3.embedding_
d = distCompare(dist_mat,coords3)
scatter([x for x,y in d],[y for x,y in d],edgecolors='none')

Out[64]:

<matplotlib.collections.PathCollection at 0x7f4028a24690>

Also pretty grim. Does spectral embedding work?

In [65]:

embed = manifold.SpectralEmbedding(n_components=10)
results3=embed.fit(coords)
coords3=results3.embedding_
d = distCompare(dist_mat,coords3)
scatter([x for x,y in d],[y for x,y in d],edgecolors='none')

Out[65]:

<matplotlib.collections.PathCollection at 0x7f4024241790>

Hmm, there's at least some signal there. How does spectral embedding do with 2D?

In [66]:

embed = manifold.SpectralEmbedding(n_components=2)
results3=embed.fit(coords)
coords3=results3.embedding_
d = distCompare(dist_mat,coords3)
scatter([x for x,y in d],[y for x,y in d],edgecolors='none')

Out[66]:

<matplotlib.collections.PathCollection at 0x7f4024146f10>

Not so hot...

Look at stress vs dimension¶

We're able to do a reasonable job of embedding into 20 dimensions with MDS. 10 even looks somewhat ok. Let's look at behavior of the residual stress as a function of dimension:

In [67]:

vs={}
for nComps in (140,120,100,80,60,40,20,10,5,2):
    mds3 = manifold.MDS(n_components=nComps, random_state=3, dissimilarity="precomputed",n_jobs = 4, verbose=0,max_iter=1000)
    results3 = mds3.fit(dist_mat)
    print 'Final stress at dim=%d:'%nComps,results3.stress_,'stress per element:',results3.stress_/len(dist_mat)
    vs[nComps]=results3.stress_/len(dist_mat)
scatter(vs.keys(),vs.values(),edgecolors='none')
yscale('log')

Final stress at dim=140: 10.2150083984 stress per element: 0.0389885816734
Final stress at dim=120: 10.6831774998 stress per element: 0.0407754866405
Final stress at dim=100: 11.0373652537 stress per element: 0.0421273482967
Final stress at dim=80: 11.9702291564 stress per element: 0.0456878975435
Final stress at dim=60: 14.0144255401 stress per element: 0.0534901738172
Final stress at dim=40: 19.3925818175 stress per element: 0.074017487853
Final stress at dim=20: 57.0050716266 stress per element: 0.217576609262
Final stress at dim=10: 216.325632623 stress per element: 0.825670353524
Final stress at dim=5: 703.559542424 stress per element: 2.68534176498
Final stress at dim=2: 3052.83827897 stress per element: 11.6520544999

Yeah, the stress is going up exponentially with dimension. These data really don't want to be in a 2D space.

Try non-metric embedding

In [68]:

vs={}
for nComps in (140,120,100,80,60,40,20,10,5,2):
    mds3 = manifold.MDS(metric=False,n_components=nComps,random_state=3, dissimilarity="precomputed",n_jobs = 4, verbose=0,max_iter=1000)
    results3 = mds3.fit(dist_mat)
    print 'Final stress at dim=%d:'%nComps,results3.stress_,'stress per element:',results3.stress_/len(dist_mat)
    vs[nComps]=results3.stress_/len(dist_mat)
scatter(vs.keys(),vs.values(),edgecolors='none')
yscale('log')

Final stress at dim=140: 15.2927309221 stress per element: 0.0583692019926
Final stress at dim=120: 17.9694200702 stress per element: 0.0685855727871
Final stress at dim=100: 21.235102664 stress per element: 0.0810500101679
Final stress at dim=80: 26.4784693091 stress per element: 0.101062859958
Final stress at dim=60: 34.8885053235 stress per element: 0.133162234059
Final stress at dim=40: 52.3293595454 stress per element: 0.199730379944
Final stress at dim=20: 111.248026961 stress per element: 0.424610789927
Final stress at dim=10: 243.54472514 stress per element: 0.92956001962
Final stress at dim=5: 550.187026998 stress per element: 2.09995048472
Final stress at dim=2: 1548.48953014 stress per element: 5.91026538222

helps somewhat at lower dimension, but not dramatically. Plus it's a lot slower

A larger data set¶

Maybe it's possible to do better with a larger data set? This seems unlikely, but we can at least try:

In [71]:

assays = data2[['assay_chemblid','target_chemblid']].groupby('assay_chemblid').count()
print assays.shape
goodassays = assays.ix[(assays.target_chemblid >= 10)&(assays.target_chemblid <= 15)]
subset = data2.ix[data.assay_chemblid.isin(list(goodassays.index))]
print subset.shape

(114, 1)
(722, 18)

In [72]:

mols = subset[['parent_cmpd_chemblid','name_in_reference','SMILES', 'ROMol', 'assay_chemblid']]
print mols.shape
fps = [Chem.GetMorganFingerprintAsBitVect(m,2,nBits=2048) for m in mols['ROMol']]
dist_mat = []
for i,fp in enumerate(fps):
    dist_mat.append(DataStructs.BulkTanimotoSimilarity(fps[i],fps,returnDistance=1))
dist_mat=numpy.array(dist_mat)    
print dist_mat.shape 

(722, 5)
(722, 722)

In [73]:

vs={}
for nComps in (140,120,100,80,60,40,20,10,5,2):
    mds3 = manifold.MDS(n_components=nComps, random_state=3, dissimilarity="precomputed",n_jobs = 4, verbose=0,max_iter=1000)
    results3 = mds3.fit(dist_mat)
    print 'Final stress at dim=%d:'%nComps,results3.stress_,'stress per element:',results3.stress_/len(dist_mat)
    vs[nComps]=results3.stress_/len(dist_mat)
scatter(vs.keys(),vs.values(),edgecolors='none')
yscale('log')

Final stress at dim=140: 70.1736432284 stress per element: 0.0971934116736
Final stress at dim=120: 74.5872943448 stress per element: 0.103306501863
Final stress at dim=100: 82.5548881411 stress per element: 0.114341950334
Final stress at dim=80: 97.553472 stress per element: 0.135115612188
Final stress at dim=60: 128.027969207 stress per element: 0.177324057074
Final stress at dim=40: 222.715384813 stress per element: 0.308470062068
Final stress at dim=20: 730.926915054 stress per element: 1.01236414827
Final stress at dim=10: 2299.36788686 stress per element: 3.18472006491
Final stress at dim=5: 6707.58303311 stress per element: 9.29028120929
Final stress at dim=2: 25389.4023638 stress per element: 35.1653772352

Nope, that definitely made things worse.

and larger still¶

In [75]:

assays = data2[['assay_chemblid','target_chemblid']].groupby('assay_chemblid').count()
print assays.shape
goodassays = assays.ix[(assays.target_chemblid >= 10)&(assays.target_chemblid <= 18)]
subset = data2.ix[data.assay_chemblid.isin(list(goodassays.index))]
print subset.shape

(114, 1)
(1039, 18)

In [76]:

mols = subset[['parent_cmpd_chemblid','name_in_reference','SMILES', 'ROMol', 'assay_chemblid']]
print mols.shape
fps = [Chem.GetMorganFingerprintAsBitVect(m,2,nBits=2048) for m in mols['ROMol']]
dist_mat = []
for i,fp in enumerate(fps):
    dist_mat.append(DataStructs.BulkTanimotoSimilarity(fps[i],fps,returnDistance=1))
dist_mat=numpy.array(dist_mat)    
print dist_mat.shape

(1039, 5)
(1039, 1039)

In [77]:

vs={}
for nComps in (140,120,100,80,60,40,20,10,5,2):
    mds3 = manifold.MDS(n_components=nComps, random_state=3, dissimilarity="precomputed",n_jobs = 4, verbose=0,max_iter=1000)
    results3 = mds3.fit(dist_mat)
    print 'Final stress at dim=%d:'%nComps,results3.stress_,'stress per element:',results3.stress_/len(dist_mat)
    vs[nComps]=results3.stress_/len(dist_mat)
scatter(vs.keys(),vs.values(),edgecolors='none')
yscale('log')

Final stress at dim=140: 141.09512988 stress per element: 0.135798970048
Final stress at dim=120: 151.919790513 stress per element: 0.146217315219
Final stress at dim=100: 168.906821392 stress per element: 0.162566719337
Final stress at dim=80: 200.40090571 stress per element: 0.192878638797
Final stress at dim=60: 272.095915211 stress per element: 0.261882497796
Final stress at dim=40: 490.563286058 stress per element: 0.472149457226
Final stress at dim=20: 1628.60787554 stress per element: 1.56747629984
Final stress at dim=10: 5037.04251247 stress per element: 4.84797161932
Final stress at dim=5: 14191.8670891 stress per element: 13.6591598548
Final stress at dim=2: 53210.9813686 stress per element: 51.2136490555

Continuing to get worse.

Try less data¶

In [78]:

assays = data2[['assay_chemblid','target_chemblid']].groupby('assay_chemblid').count()
print assays.shape
goodassays = assays.ix[assays.target_chemblid == 15]
subset = data2.ix[data.assay_chemblid.isin(list(goodassays.index))]
print subset.shape

(114, 1)
(180, 18)

In [79]:

mols = subset[['parent_cmpd_chemblid','name_in_reference','SMILES', 'ROMol', 'assay_chemblid']]
print mols.shape
fps = [Chem.GetMorganFingerprintAsBitVect(m,2,nBits=2048) for m in mols['ROMol']]
dist_mat = []
for i,fp in enumerate(fps):
    dist_mat.append(DataStructs.BulkTanimotoSimilarity(fps[i],fps,returnDistance=1))
dist_mat=numpy.array(dist_mat)    
print dist_mat.shape

(180, 5)
(180, 180)

In [80]:

vs={}
for nComps in (140,120,100,80,60,40,20,10,5,2):
    mds3 = manifold.MDS(n_components=nComps, random_state=3, dissimilarity="precomputed",n_jobs = 4, verbose=0,max_iter=1000)
    results3 = mds3.fit(dist_mat)
    print 'Final stress at dim=%d:'%nComps,results3.stress_,'stress per element:',results3.stress_/len(dist_mat)
    vs[nComps]=results3.stress_/len(dist_mat)
scatter(vs.keys(),vs.values(),edgecolors='none')
yscale('log')

Final stress at dim=140: 4.93318539149 stress per element: 0.0274065855083
Final stress at dim=120: 4.92380830335 stress per element: 0.0273544905742
Final stress at dim=100: 5.15458013048 stress per element: 0.0286365562804
Final stress at dim=80: 5.31751942123 stress per element: 0.0295417745624
Final stress at dim=60: 5.74061848474 stress per element: 0.0318923249152
Final stress at dim=40: 6.83546630575 stress per element: 0.0379748128097
Final stress at dim=20: 12.9022649033 stress per element: 0.0716792494629
Final stress at dim=10: 52.2825877625 stress per element: 0.290458820903
Final stress at dim=5: 217.475840469 stress per element: 1.20819911372
Final stress at dim=2: 1145.28539545 stress per element: 6.36269664139

That's a lot less stress per point than before, but it's still pretty high. Look at the distances to be sure:

In [81]:

mds3 = manifold.MDS(n_components=2, random_state=3, dissimilarity="precomputed",n_jobs = 4, verbose=0,max_iter=1000)
results3 = mds3.fit(dist_mat)
coords3=results3.embedding_
d = distCompare(dist_mat,coords3)
scatter([x for x,y in d],[y for x,y in d],edgecolors='none')

Out[81]:

<matplotlib.collections.PathCollection at 0x7f4007e32510>

In [81]: