This is a slightly modified version of this article by Greg Landrum , which in turn is based on this notebook by George Papadatos.
Scikit-learn makes it quite easy to apply multi-dimensional scaling (MDS) to either reduce the dimensionality of a dataset or to embed distance data into cartesian space. This enables one of the favorite activities of the cheminformatician: producing plots of where compounds land in an 2D space.
from chembl_webresource_client import *
import pandas as pd
from rdkit.Chem import PandasTools
from rdkit.Chem import AllChem as Chem
from rdkit.Chem import DataStructs
from sklearn import manifold
import warnings
warnings.filterwarnings('ignore')
height has been deprecated.
rcParams['figure.figsize'] = 8,8
Let's configure our client to use local version of web services. Just delete the cell below, if you want to use the official version. In that case, you will need internet access.
from chembl_webresource_client.settings import Settings
Settings.Instance().WEBSERVICE_DOMAIN = 'localhost'
Settings.Instance().WEBSERVICE_PROTOCOL = 'http'
#If your VM is slow, you can consider increasing timeout from default 3 seconds, just uncomment line below:
Settings.Instance().TIMEOUT = 120
I'm going to start with a larger dataset so that I can explore the impact of dataset size on the results. The analysis of George's results are below.
I'll use the Dopamine D3 receptor as the target for this exercise. It's one of the targets we used in both the benchmarking and model fusion papers.
targets = TargetResource()
target='CHEMBL234'
bio = targets.bioactivities(target)
data = pd.DataFrame(bio)
data.head()
activity_comment | assay_chemblid | assay_description | assay_type | bioactivity_type | ingredient_cmpd_chemblid | name_in_reference | operator | organism | parent_cmpd_chemblid | reference | target_chemblid | target_confidence | target_name | units | value | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Unspecified | CHEMBL666973 | Binding affinity was measured at cloned mammalian dopamine D3 receptor expressed in CHO-K1 cells (using [3H]- spiperone) | B | Ki | CHEMBL278751 | 2 (+)-UH232 | = | Homo sapiens | CHEMBL278751 | Bioorg. Med. Chem. Lett., (1994) 4:5:689 | CHEMBL234 | 8 | Dopamine D3 receptor | nM | 4.2 |
1 | Unspecified | CHEMBL666973 | Binding affinity was measured at cloned mammalian dopamine D3 receptor expressed in CHO-K1 cells (using [3H]- spiperone) | B | Ki | CHEMBL161811 | 9 | = | Homo sapiens | CHEMBL161811 | Bioorg. Med. Chem. Lett., (1994) 4:5:689 | CHEMBL234 | 8 | Dopamine D3 receptor | nM | 40 |
2 | Unspecified | CHEMBL666973 | Binding affinity was measured at cloned mammalian dopamine D3 receptor expressed in CHO-K1 cells (using [3H]- spiperone) | B | Ki | CHEMBL349843 | 5 | = | Homo sapiens | CHEMBL349843 | Bioorg. Med. Chem. Lett., (1994) 4:5:689 | CHEMBL234 | 8 | Dopamine D3 receptor | nM | 171 |
3 | Unspecified | CHEMBL666973 | Binding affinity was measured at cloned mammalian dopamine D3 receptor expressed in CHO-K1 cells (using [3H]- spiperone) | B | Ki | CHEMBL27441 | 1 (+)-AJ76 | = | Homo sapiens | CHEMBL27441 | Bioorg. Med. Chem. Lett., (1994) 4:5:689 | CHEMBL234 | 8 | Dopamine D3 receptor | nM | 26 |
4 | Unspecified | CHEMBL666973 | Binding affinity was measured at cloned mammalian dopamine D3 receptor expressed in CHO-K1 cells (using [3H]- spiperone) | B | Ki | CHEMBL161507 | 6 | = | Homo sapiens | CHEMBL161507 | Bioorg. Med. Chem. Lett., (1994) 4:5:689 | CHEMBL234 | 8 | Dopamine D3 receptor | nM | 253 |
data.shape
(6812, 16)
data = data.drop_duplicates(['parent_cmpd_chemblid'])
data.shape
(4447, 16)
assays = data[['assay_chemblid','target_chemblid']].groupby('assay_chemblid').count()
print assays.shape
assays.head(5)
(488, 1)
target_chemblid | |
---|---|
assay_chemblid | |
CHEMBL1030601 | 12 |
CHEMBL1030747 | 1 |
CHEMBL1031083 | 1 |
CHEMBL1031097 | 3 |
CHEMBL1031401 | 1 |
assays.target_chemblid.hist(bins=20)
<matplotlib.axes.AxesSubplot at 0x7f9820192e10>
goodassays = assays.ix[(assays.target_chemblid >= 10)&(assays.target_chemblid <= 30)]
print goodassays.shape
goodassays.target_chemblid.hist(bins=20)
(114, 1)
<matplotlib.axes.AxesSubplot at 0x7f982016bf90>
data2 = data.ix[data.assay_chemblid.isin(list(goodassays.index))]
data2.shape
(1918, 16)
compounds = CompoundResource()
cs = compounds.get(list(data2['parent_cmpd_chemblid']))
smiles = pd.Series(map(lambda x: x['smiles'], cs), index=data2.index)
data2['SMILES'] = smiles
PandasTools.AddMoleculeColumnToFrame(data2, smilesCol = 'SMILES',includeFingerprints=True)
Let's start by looking at assays that have between 10 and 12 compounds:
assays = data2[['assay_chemblid','target_chemblid']].groupby('assay_chemblid').count()
print assays.shape
goodassays = assays.ix[(assays.target_chemblid >= 10)&(assays.target_chemblid <= 12)]
subset = data2.ix[data.assay_chemblid.isin(list(goodassays.index))]
print subset.shape
(114, 1) (262, 18)
mols = subset[['parent_cmpd_chemblid','name_in_reference','SMILES', 'ROMol', 'assay_chemblid']]
mols.shape
(262, 5)
mols.head(5)
parent_cmpd_chemblid | name_in_reference | SMILES | ROMol | assay_chemblid | |
---|---|---|---|---|---|
23 | CHEMBL87260 | 5 | Clc1ccc(cc1)N2CCN(CCCOc3ccc(cc3)c4nc5ccccc5[nH]4)CC2 | CHEMBL666891 | |
24 | CHEMBL87478 | 9 | CCCSc1ccccc1N2CCN(CCCOc3ccc(cc3)c4nc5ccccc5[nH]4)CC2 | CHEMBL666891 | |
25 | CHEMBL85441 | 3 | C(COc1ccc(cc1)c2nc3ccccc3[nH]2)CN4CCN(CC4)c5ccccc5 | CHEMBL666891 | |
26 | CHEMBL85700 | 4 | CN1CCN(CCCOc2ccc(cc2)c3nc4ccccc4[nH]3)CC1 | CHEMBL666891 | |
27 | CHEMBL314983 | 11 | C(COc1ccc(cc1)c2nc3ccccc3[nH]2)CN4CCN(CC4)c5ccccn5 | CHEMBL666891 |
Construct fingerprints:
fps = [Chem.GetMorganFingerprintAsBitVect(m,2,nBits=2048) for m in mols['ROMol']]
Now the distance matrix:
dist_mat = []
for i,fp in enumerate(fps):
dist_mat.append(DataStructs.BulkTanimotoSimilarity(fps[i],fps,returnDistance=1))
dist_mat=numpy.array(dist_mat)
And now do the MDS into 2 dimensions:
mds = manifold.MDS(n_components=2, dissimilarity="precomputed", random_state=3, n_jobs = 4, verbose=1,max_iter=1000)
results = mds.fit(dist_mat)
coords = results.embedding_
print 'Final stress:',mds.stress_
Final stress: 3052.83827897 breaking at iteration 276 with stress 3227.03049273 breaking at iteration 248 with stress 3166.89856976 breaking at iteration 258 with stress 3114.92766418 breaking at iteration 297 with stress 3052.83827897
And plot the points:
scatter([x for x,y in coords], [y for x,y in coords],edgecolors='none')
<matplotlib.collections.PathCollection at 0x7f981fb0d6d0>
Nice picture. How accurate is it?
The strain values above lead me to believe that there are probably some real problems here.
The overall strain value isn't particularly easy to interpret, so we'll check up on things by comparing the embedded distances with the Tanimoto distances we're trying to reproduce.
import random
def distCompare(dmat,coords,nPicks=5000,seed=0xf00d):
""" picks a random set of pairs of points to compare distances """
nPts=len(coords)
random.seed(seed)
res=[]
keep=set()
if nPicks>0:
while len(res)<nPicks:
idx1 = random.randint(0,nPts-1)
idx2 = random.randint(0,nPts-1)
if idx1==idx2:
continue
if idx1>idx2:
idx1,idx2=idx2,idx1
if (idx1,idx2) in keep:
continue
keep.add((idx1,idx2))
p1 = coords[idx1]
p2 = coords[idx2]
v = p1-p2
d = sqrt(v.dot(v))
res.append((dmat[idx1][idx2],d))
else:
for idx1 in range(nPts):
for idx2 in range(idx1+1,nPts):
p1 = coords[idx1]
p2 = coords[idx2]
v = p1-p2
d = sqrt(v.dot(v))
res.append((dmat[idx1][idx2],d))
return res
d = distCompare(dist_mat,coords)
scatter([x for x,y in d],[y for x,y in d],edgecolors='none')
<matplotlib.collections.PathCollection at 0x7f981f9bded0>
Yikes... that's no good. This really isn't so surprising given how much stress was left.
Let's try increasing the dimensionality to see how much improvement we can get:
mds2 = manifold.MDS(n_components=20, dissimilarity="precomputed", random_state=3, n_jobs = 4, verbose=1,max_iter=1000)
results = mds2.fit(dist_mat)
coords = results.embedding_
print 'Final stress:',mds2.stress_
Final stress: 57.0050716266 breaking at iteration 203 with stress 57.0050716266 breaking at iteration 199 with stress 59.5816001681 breaking at iteration 206 with stress 58.1255498141 breaking at iteration 207 with stress 57.0929415605
d = distCompare(dist_mat,coords)
scatter([x for x,y in d],[y for x,y in d],edgecolors='none')
<matplotlib.collections.PathCollection at 0x7f402994d150>
Not so bad... of course, we can't visualize 20 dimensions.
What about if we try doing dimensionality reduction on the 20D results instead of using the distance matrix?
mds3 = manifold.MDS(n_components=2, random_state=3, n_jobs = 4, verbose=1,max_iter=1000)
results3 = mds3.fit(coords)
coords3 = results3.embedding_
print 'Final stress:',mds3.stress_
Final stress: 2943.91225488 breaking at iteration 239 with stress 3011.93802972 breaking at iteration 263 with stress 3033.56503189 breaking at iteration 265 with stress 3176.13501461 breaking at iteration 406 with stress 2943.91225488
The strain is still really high. This isn't that surprising given that we already tried the same thing from the distance matrix.
Ok, so MDS doesn't work. Now that we have actual coordinates, we can use some of the other scikit-learn manifold learning approaches for embedding the points.
Instead of jumping all the way to a 2D system, where we're really unlikely to get decent results, start by dropping down to 10D.
First MDS in 10D:
mds3 = manifold.MDS(n_components=10, random_state=3, n_jobs = 4, verbose=1,max_iter=1000)
results3 = mds3.fit(coords)
coords3 = results3.embedding_
print 'Final stress:',results3.stress_
ds = distCompare(dist_mat,coords3)
scatter([x for x,y in d],[y for x,y in d],edgecolors='none')
Final stress: 164.997967472 breaking at iteration 187 with stress 167.970824753 breaking at iteration 195 with stress 167.426219544 breaking at iteration 206 with stress 168.6442109 breaking at iteration 208 with stress 164.997967472
<matplotlib.collections.PathCollection at 0x7f402afc6e50>
Now try locally linear embedding:
lle = manifold.LocallyLinearEmbedding(n_components=10,random_state=3,max_iter=1000)
results3=lle.fit(coords)
coords3=results3.embedding_
print results3.reconstruction_error_
d = distCompare(dist_mat,coords3)
scatter([x for x,y in d],[y for x,y in d],edgecolors='none')
7.87063669948e-15
<matplotlib.collections.PathCollection at 0x7f402a453190>
Ick! What about Isomap?
embed = manifold.Isomap(n_components=10,max_iter=1000)
results3=embed.fit(coords)
coords3=results3.embedding_
d = distCompare(dist_mat,coords3)
scatter([x for x,y in d],[y for x,y in d],edgecolors='none')
<matplotlib.collections.PathCollection at 0x7f4028a24690>
Also pretty grim. Does spectral embedding work?
embed = manifold.SpectralEmbedding(n_components=10)
results3=embed.fit(coords)
coords3=results3.embedding_
d = distCompare(dist_mat,coords3)
scatter([x for x,y in d],[y for x,y in d],edgecolors='none')
<matplotlib.collections.PathCollection at 0x7f4024241790>
Hmm, there's at least some signal there. How does spectral embedding do with 2D?
embed = manifold.SpectralEmbedding(n_components=2)
results3=embed.fit(coords)
coords3=results3.embedding_
d = distCompare(dist_mat,coords3)
scatter([x for x,y in d],[y for x,y in d],edgecolors='none')
<matplotlib.collections.PathCollection at 0x7f4024146f10>
Not so hot...
We're able to do a reasonable job of embedding into 20 dimensions with MDS. 10 even looks somewhat ok. Let's look at behavior of the residual stress as a function of dimension:
vs={}
for nComps in (140,120,100,80,60,40,20,10,5,2):
mds3 = manifold.MDS(n_components=nComps, random_state=3, dissimilarity="precomputed",n_jobs = 4, verbose=0,max_iter=1000)
results3 = mds3.fit(dist_mat)
print 'Final stress at dim=%d:'%nComps,results3.stress_,'stress per element:',results3.stress_/len(dist_mat)
vs[nComps]=results3.stress_/len(dist_mat)
scatter(vs.keys(),vs.values(),edgecolors='none')
yscale('log')
Final stress at dim=140: 10.2150083984 stress per element: 0.0389885816734 Final stress at dim=120: 10.6831774998 stress per element: 0.0407754866405 Final stress at dim=100: 11.0373652537 stress per element: 0.0421273482967 Final stress at dim=80: 11.9702291564 stress per element: 0.0456878975435 Final stress at dim=60: 14.0144255401 stress per element: 0.0534901738172 Final stress at dim=40: 19.3925818175 stress per element: 0.074017487853 Final stress at dim=20: 57.0050716266 stress per element: 0.217576609262 Final stress at dim=10: 216.325632623 stress per element: 0.825670353524 Final stress at dim=5: 703.559542424 stress per element: 2.68534176498 Final stress at dim=2: 3052.83827897 stress per element: 11.6520544999
Yeah, the stress is going up exponentially with dimension. These data really don't want to be in a 2D space.
Try non-metric embedding
vs={}
for nComps in (140,120,100,80,60,40,20,10,5,2):
mds3 = manifold.MDS(metric=False,n_components=nComps,random_state=3, dissimilarity="precomputed",n_jobs = 4, verbose=0,max_iter=1000)
results3 = mds3.fit(dist_mat)
print 'Final stress at dim=%d:'%nComps,results3.stress_,'stress per element:',results3.stress_/len(dist_mat)
vs[nComps]=results3.stress_/len(dist_mat)
scatter(vs.keys(),vs.values(),edgecolors='none')
yscale('log')
Final stress at dim=140: 15.2927309221 stress per element: 0.0583692019926 Final stress at dim=120: 17.9694200702 stress per element: 0.0685855727871 Final stress at dim=100: 21.235102664 stress per element: 0.0810500101679 Final stress at dim=80: 26.4784693091 stress per element: 0.101062859958 Final stress at dim=60: 34.8885053235 stress per element: 0.133162234059 Final stress at dim=40: 52.3293595454 stress per element: 0.199730379944 Final stress at dim=20: 111.248026961 stress per element: 0.424610789927 Final stress at dim=10: 243.54472514 stress per element: 0.92956001962 Final stress at dim=5: 550.187026998 stress per element: 2.09995048472 Final stress at dim=2: 1548.48953014 stress per element: 5.91026538222
helps somewhat at lower dimension, but not dramatically. Plus it's a lot slower
Maybe it's possible to do better with a larger data set? This seems unlikely, but we can at least try:
assays = data2[['assay_chemblid','target_chemblid']].groupby('assay_chemblid').count()
print assays.shape
goodassays = assays.ix[(assays.target_chemblid >= 10)&(assays.target_chemblid <= 15)]
subset = data2.ix[data.assay_chemblid.isin(list(goodassays.index))]
print subset.shape
(114, 1) (722, 18)
mols = subset[['parent_cmpd_chemblid','name_in_reference','SMILES', 'ROMol', 'assay_chemblid']]
print mols.shape
fps = [Chem.GetMorganFingerprintAsBitVect(m,2,nBits=2048) for m in mols['ROMol']]
dist_mat = []
for i,fp in enumerate(fps):
dist_mat.append(DataStructs.BulkTanimotoSimilarity(fps[i],fps,returnDistance=1))
dist_mat=numpy.array(dist_mat)
print dist_mat.shape
(722, 5) (722, 722)
vs={}
for nComps in (140,120,100,80,60,40,20,10,5,2):
mds3 = manifold.MDS(n_components=nComps, random_state=3, dissimilarity="precomputed",n_jobs = 4, verbose=0,max_iter=1000)
results3 = mds3.fit(dist_mat)
print 'Final stress at dim=%d:'%nComps,results3.stress_,'stress per element:',results3.stress_/len(dist_mat)
vs[nComps]=results3.stress_/len(dist_mat)
scatter(vs.keys(),vs.values(),edgecolors='none')
yscale('log')
Final stress at dim=140: 70.1736432284 stress per element: 0.0971934116736 Final stress at dim=120: 74.5872943448 stress per element: 0.103306501863 Final stress at dim=100: 82.5548881411 stress per element: 0.114341950334 Final stress at dim=80: 97.553472 stress per element: 0.135115612188 Final stress at dim=60: 128.027969207 stress per element: 0.177324057074 Final stress at dim=40: 222.715384813 stress per element: 0.308470062068 Final stress at dim=20: 730.926915054 stress per element: 1.01236414827 Final stress at dim=10: 2299.36788686 stress per element: 3.18472006491 Final stress at dim=5: 6707.58303311 stress per element: 9.29028120929 Final stress at dim=2: 25389.4023638 stress per element: 35.1653772352
Nope, that definitely made things worse.
assays = data2[['assay_chemblid','target_chemblid']].groupby('assay_chemblid').count()
print assays.shape
goodassays = assays.ix[(assays.target_chemblid >= 10)&(assays.target_chemblid <= 18)]
subset = data2.ix[data.assay_chemblid.isin(list(goodassays.index))]
print subset.shape
(114, 1) (1039, 18)
mols = subset[['parent_cmpd_chemblid','name_in_reference','SMILES', 'ROMol', 'assay_chemblid']]
print mols.shape
fps = [Chem.GetMorganFingerprintAsBitVect(m,2,nBits=2048) for m in mols['ROMol']]
dist_mat = []
for i,fp in enumerate(fps):
dist_mat.append(DataStructs.BulkTanimotoSimilarity(fps[i],fps,returnDistance=1))
dist_mat=numpy.array(dist_mat)
print dist_mat.shape
(1039, 5) (1039, 1039)
vs={}
for nComps in (140,120,100,80,60,40,20,10,5,2):
mds3 = manifold.MDS(n_components=nComps, random_state=3, dissimilarity="precomputed",n_jobs = 4, verbose=0,max_iter=1000)
results3 = mds3.fit(dist_mat)
print 'Final stress at dim=%d:'%nComps,results3.stress_,'stress per element:',results3.stress_/len(dist_mat)
vs[nComps]=results3.stress_/len(dist_mat)
scatter(vs.keys(),vs.values(),edgecolors='none')
yscale('log')
Final stress at dim=140: 141.09512988 stress per element: 0.135798970048 Final stress at dim=120: 151.919790513 stress per element: 0.146217315219 Final stress at dim=100: 168.906821392 stress per element: 0.162566719337 Final stress at dim=80: 200.40090571 stress per element: 0.192878638797 Final stress at dim=60: 272.095915211 stress per element: 0.261882497796 Final stress at dim=40: 490.563286058 stress per element: 0.472149457226 Final stress at dim=20: 1628.60787554 stress per element: 1.56747629984 Final stress at dim=10: 5037.04251247 stress per element: 4.84797161932 Final stress at dim=5: 14191.8670891 stress per element: 13.6591598548 Final stress at dim=2: 53210.9813686 stress per element: 51.2136490555
Continuing to get worse.
assays = data2[['assay_chemblid','target_chemblid']].groupby('assay_chemblid').count()
print assays.shape
goodassays = assays.ix[assays.target_chemblid == 15]
subset = data2.ix[data.assay_chemblid.isin(list(goodassays.index))]
print subset.shape
(114, 1) (180, 18)
mols = subset[['parent_cmpd_chemblid','name_in_reference','SMILES', 'ROMol', 'assay_chemblid']]
print mols.shape
fps = [Chem.GetMorganFingerprintAsBitVect(m,2,nBits=2048) for m in mols['ROMol']]
dist_mat = []
for i,fp in enumerate(fps):
dist_mat.append(DataStructs.BulkTanimotoSimilarity(fps[i],fps,returnDistance=1))
dist_mat=numpy.array(dist_mat)
print dist_mat.shape
(180, 5) (180, 180)
vs={}
for nComps in (140,120,100,80,60,40,20,10,5,2):
mds3 = manifold.MDS(n_components=nComps, random_state=3, dissimilarity="precomputed",n_jobs = 4, verbose=0,max_iter=1000)
results3 = mds3.fit(dist_mat)
print 'Final stress at dim=%d:'%nComps,results3.stress_,'stress per element:',results3.stress_/len(dist_mat)
vs[nComps]=results3.stress_/len(dist_mat)
scatter(vs.keys(),vs.values(),edgecolors='none')
yscale('log')
Final stress at dim=140: 4.93318539149 stress per element: 0.0274065855083 Final stress at dim=120: 4.92380830335 stress per element: 0.0273544905742 Final stress at dim=100: 5.15458013048 stress per element: 0.0286365562804 Final stress at dim=80: 5.31751942123 stress per element: 0.0295417745624 Final stress at dim=60: 5.74061848474 stress per element: 0.0318923249152 Final stress at dim=40: 6.83546630575 stress per element: 0.0379748128097 Final stress at dim=20: 12.9022649033 stress per element: 0.0716792494629 Final stress at dim=10: 52.2825877625 stress per element: 0.290458820903 Final stress at dim=5: 217.475840469 stress per element: 1.20819911372 Final stress at dim=2: 1145.28539545 stress per element: 6.36269664139
That's a lot less stress per point than before, but it's still pretty high. Look at the distances to be sure:
mds3 = manifold.MDS(n_components=2, random_state=3, dissimilarity="precomputed",n_jobs = 4, verbose=0,max_iter=1000)
results3 = mds3.fit(dist_mat)
coords3=results3.embedding_
d = distCompare(dist_mat,coords3)
scatter([x for x,y in d],[y for x,y in d],edgecolors='none')
<matplotlib.collections.PathCollection at 0x7f4007e32510>