Scaffold analysis of ChEMBL data with pandas and RDKit¶

Dr. Samo Turk¶

BioMed X Innovation Center, Heidelberg¶

Python

[Python](http://www.python.org/) very popular programming language especially in science.

pandas

[Pandas](http://pandas.pydata.org/) is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. No need for R!

RDKit

[RDKit](http://www.rdkit.org/) is an open source chemistry toolkit.

IPython

[IPython](http://ipython.org/) interactive Python shell. Has web-based interactive computational environment IPython Notebook.

chembl_webresource_client

[chembl_webresource_client](https://github.com/chembl/chembl_webresource_client) Python client for accessing ChEMBL webservices.

More notebooks: https://github.com/Team-SKI/snippets ¶

In [1]:

import pandas as pd # Import pandas
from chembl_webresource_client import *

chembl_webresource_client¶

Targets¶

Target object handler for API access and check connection to the db

In [2]:

targets = TargetResource()

In [3]:

print targets.status()

True

ABL1 kinase¶

uniprotID: P00519

In [4]:

targets.get(uniprot=['P00519'])

Out[4]:

[{u'bioactivityCount': 9984,
  u'chemblId': u'CHEMBL1862',
  u'compoundCount': 2997,
  u'description': u'Tyrosine-protein kinase ABL',
  u'geneNames': u'Unspecified',
  u'organism': u'Homo sapiens',
  u'preferredName': u'Tyrosine-protein kinase ABL',
  u'proteinAccession': u'P00519',
  u'synonyms': u'Proto-oncogene c-Abl,JTK7,ABL1,2.7.10.2,Abelson murine leukemia viral oncogene homolog 1,Abelson tyrosine-protein kinase 1,ABL ,Tyrosine-protein kinase ABL1,p150',
  u'targetType': u'SINGLE PROTEIN'}]

Load targets in data frame

In [5]:

targetsDF = pd.DataFrame.from_dict(targets.get(uniprot=['P00519']))

In [6]:

targetsDF

Out[6]:

	bioactivityCount	chemblId	compoundCount	description	geneNames	organism	preferredName	proteinAccession	synonyms	targetType
0	9984	CHEMBL1862	2997	Tyrosine-protein kinase ABL	Unspecified	Homo sapiens	Tyrosine-protein kinase ABL	P00519	Proto-oncogene c-Abl,JTK7,ABL1,2.7.10.2,Abelso...	SINGLE PROTEIN

Bioactivities¶

Download all bioactivities for targets

In [7]:

bioactsDF = pd.DataFrame.from_dict(targets.bioactivities('CHEMBL1862'))

In [8]:

bioactsDF.head(1)

Out[8]:

	activity_comment	assay_chemblid	assay_description	assay_type	bioactivity_type	ingredient_cmpd_chemblid	name_in_reference	operator	organism	parent_cmpd_chemblid	reference	target_chemblid	target_confidence	target_name	units	value
0	Unspecified	CHEMBL1063789	Binding constant for ABL1(E255K) kinase domain	B	Kd	CHEMBL440084	SB-431542	>	Homo sapiens	CHEMBL440084	Nat. Biotechnol., (2008) 26:1:127	CHEMBL1862	9	Tyrosine-protein kinase ABL	nM	10000

Filter bioactivities (safe defaults)

In [9]:

bioactsDF = bioactsDF[(bioactsDF['bioactivity_type'] == 'IC50') &  # keep ony IC50
                      (bioactsDF['operator'] == '=') & # only exact measurements
                      (bioactsDF['assay_type'] == 'B') & # only binding data
                      (bioactsDF['target_confidence'] == 9)] # only high target confidence

In [10]:

len(bioactsDF), len(bioactsDF['ingredient_cmpd_chemblid'].unique())

Out[10]:

(282, 251)

Compouds¶

Compound object handler for API access and check connection to the db

In [11]:

compounds = CompoundResource()

In [12]:

compounds.status()

Out[12]:

True

Load the compounds in the dataframe

In [13]:

cpdsDF = pd.DataFrame.from_dict(compounds.get(list(bioactsDF['ingredient_cmpd_chemblid'].unique())))

In [14]:

cpdsDF.head()

Out[14]:

	acdAcidicPka	acdBasicPka	acdLogd	acdLogp	alogp	chemblId	knownDrug	medChemFriendly	molecularFormula	molecularWeight	numRo5Violations	passesRuleOfThree	preferredCompoundName	rotatableBonds	smiles	species	stdInChiKey	synonyms
0	NaN	8.97	3.71	5.27	3.82	CHEMBL388978	No	Yes	C28H26N4O3	466.53	0	No	STAUROSPORINE	2	CN[C@@H]1C[C@H]2O[C@@](C)([C@@H]1OC)n3c4ccccc4...	BASE	HKSZLNNOFSGOKW-FYTWVXJKSA-N	SID26755744,SID26755745
1	NaN	6.88	4.02	4.13	3.76	CHEMBL1290072	No	Yes	C19H16N2O2	304.34	0	No	NaN	4	COc1ccc(CNc2ccnc3oc4ccccc4c23)cc1	NEUTRAL	MNMQBQOLMHVBJT-UHFFFAOYSA-N	NaN
2	12.27	3.32	4.18	4.18	5.10	CHEMBL565609	No	Yes	C28H28Cl2N6O4	583.47	2	No	NaN	9	CN1C(=O)C(=Cc2cnc(Nc3ccc(NC(=O)CCNC(=O)OC(C)(C...	NEUTRAL	APGJWCRRERWKDQ-UHFFFAOYSA-N	NaN
3	12.80	7.86	5.67	6.62	5.26	CHEMBL1171085	No	Yes	C30H25F3N6O	542.55	2	No	NaN	8	CN(C)Cc1nccn1c2cc(NC(=O)c3ccc(C)c(c3)C#Cc4cnc5...	NEUTRAL	QZYXEXGJVTZDIZ-UHFFFAOYSA-N	NaN
4	8.13	7.54	4.88	5.35	4.75	CHEMBL1171657	No	Yes	C29H27F3N6O	532.56	1	No	NaN	7	CN1CCN(Cc2ccc(NC(=O)c3ccc(C)c(c3)C#Cc4cnc5[nH]...	NEUTRAL	XGDBPKHGUYPWPJ-UHFFFAOYSA-N	NaN

In [15]:

bioactsDF['value'] = bioactsDF['value'].astype(float) # making sure everything is float

Chemoinformatics¶

In [16]:

import rdkit.Chem as Chem
from rdkit.Chem import Draw
from rdkit.Chem import PandasTools
from rdkit.Chem.Draw import IPythonConsole # Enables RDKit IPython integration

In [17]:

molnames = 'chemblId'
smiles = 'smiles'

Add molecule column

In [18]:

PandasTools.AddMoleculeColumnToFrame(cpdsDF, smilesCol='smiles')

In [19]:

cpdsDF = cpdsDF[[molnames, smiles, 'ROMol', 'knownDrug', 'preferredCompoundName']]

Add bioactivities and ligand efficiency¶

Ligand efficiency: LE = 1.4 x pIC50/HAC

In [20]:

import numpy as np
def getBioacts(cpd, target):
    value = bioactsDF[(bioactsDF['ingredient_cmpd_chemblid'] == cpd)& # All rows of compound AND
                     (bioactsDF['target_chemblid'] == target)]['value'].mean() # target. Get mean of values
    return np.log10(value*10**-9)*-1 # returns pIC50

def getLE(mol, pIC50): 
    return 1.4*pIC50/mol.GetNumHeavyAtoms()        

In [21]:

cpdsDF['pIC50'] = cpdsDF.apply(lambda x: getBioacts(x[molnames], 'CHEMBL1862'), axis=1)

In [22]:

cpdsDF['LE'] = cpdsDF.apply(lambda x: getLE(x['ROMol'], x['pIC50']), axis=1)

In [23]:

cpdsDF.head(1)

Out[23]:

	chemblId	smiles	ROMol	knownDrug	preferredCompoundName	pIC50	LE
0	CHEMBL388978	CN[C@@H]1C[C@H]2O[C@@](C)([C@@H]1OC)n3c4ccccc4c5c6CNC(=O)c6c7c8ccccc8n2c7c35		No	STAUROSPORINE	6.677781	0.267111

In [24]:

len(cpdsDF)

Out[24]:

## Murcko scaffold decomposition

Bemis, G. W.; Murcko, M. A. “The Properties of Known Drugs. 1. Molecular Frameworks.” J. Med. Chem. 39:2887-93 (1996).

Decomposition of molecules to scaffolds or generic frameworks

Functionality present in RDKit. Added it to PandasTools

In [25]:

from rdkit.Chem.Scaffolds import MurckoScaffold

In [26]:

mol  = Chem.MolFromSmiles(cpdsDF.ix[2]['smiles'])

In [27]:

scaffold = MurckoScaffold.GetScaffoldForMol(mol)
generic = MurckoScaffold.MakeScaffoldGeneric(MurckoScaffold.GetScaffoldForMol(mol))

In [28]:

Draw.MolsToGridImage([mol, scaffold, generic])

Out[28]:

## AddMurckoToFrame(pandasFrame, MurcoCol=, Generic=False) Add column with SMILES of scaffolds (or generic frameworks)

In [29]:

PandasTools.AddMurckoToFrame(cpdsDF)

##### AlignToScaffold(dataframe, molCol=, scaffoldCol=) Aligns compounds to scaffolds

In [30]:

PandasTools.AlignToScaffold(cpdsDF, molCol='ROMol', scaffoldCol='Murcko_SMILES')

In [31]:

cpdsDF.head(1)

Out[31]:

	chemblId	smiles	ROMol	knownDrug	preferredCompoundName	pIC50	LE	Murcko_SMILES
0	CHEMBL388978	CN[C@@H]1C[C@H]2O[C@@](C)([C@@H]1OC)n3c4ccccc4c5c6CNC(=O)c6c7c8ccccc8n2c7c35		No	STAUROSPORINE	6.677781	0.267111	O=C1NCc2c1c1c3ccccc3n3c1c1c2c2ccccc2n1C1CCCC3O1

Now we can use pandas groupby() functionality and group by scaffolds and create a new frame with scaffolds sorted by number of members

In [32]:

sortedScaffolds = cpdsDF.groupby(['Murcko_SMILES']).count().sort(smiles, ascending=False)

In [33]:

sortedScaffolds = sortedScaffolds[[smiles]] # Keep only smiles column
sortedScaffolds = sortedScaffolds.rename(columns={smiles:'count'}) # rename smiles column to count
sortedScaffolds['Murcko_SMILES'] = sortedScaffolds.index # actual SMILES are only in index column, move it
sortedScaffolds.index = range(len(sortedScaffolds)+1)[1:]
sortedScaffolds.head()

Out[33]:

	count	Murcko_SMILES
1	17	O=c1[nH]c2nc(Nc3ccccc3)ncc2cc1-c1ccccc1
2	11	O=C(Nc1cccc(Nc2nccc(-c3cccnc3)n2)c1)c1ccccc1
3	10	O=C(Nc1ccc(CN2CCNCC2)cc1)c1cccc(C#Cc2cnc[nH]2)c1
4	9	O=C(COc1ccc2c(=O)cc(-c3ccccc3)oc2c1)Nc1ccccn1
5	8	O=C(Nc1ccc(CN2CCNCC2)cc1)c1cccc(C#Cc2cnc3cccnn23)c1

In [34]:

len(sortedScaffolds)

Out[34]:

Add RDKit's ROMol column to scaffolds dataframe so we can visualize it

In [35]:

PandasTools.AddMoleculeColumnToFrame(sortedScaffolds, smilesCol='Murcko_SMILES')

#### FrameToGridImage(pandasFrame, legendsCol=, molsPerRow=) Visualize molecules in data frame as a grid image

In [36]:

PandasTools.FrameToGridImage(sortedScaffolds.dropna().head(8), legendsCol='count', 
                             molsPerRow=4) # dropna() drops compounds without scaffold

Out[36]:

Get compounds that belong to certain scaffold

In [37]:

PandasTools.FrameToGridImage(cpdsDF[cpdsDF['Murcko_SMILES'] == 'O=c1[nH]c2nc(Nc3ccccc3)ncc2cc1-c1ccccc1'].head(4),
                            legendsCol=molnames, molsPerRow=4)

Out[37]:

In [39]:

cpdsDF[cpdsDF['Murcko_SMILES'] == 'O=c1[nH]c2nc(Nc3ccccc3)ncc2cc1-c1ccccc1'
       ]['pIC50'].max() # min(), mean(), sum(), anything you can do with pandas Series

Out[39]:

8.6382721639824069

Exploratory plots¶

In [40]:

%matplotlib inline

In [41]:

import pylab

pIC50s of compounds with certain scaffolds¶

In [42]:

data = [cpdsDF[cpdsDF['Murcko_SMILES'] == x]['pIC50'] for x in sortedScaffolds['Murcko_SMILES'].head(8)]
data.append(cpdsDF[cpdsDF['knownDrug'] == 'Yes']['pIC50'])

In [43]:

pylab.boxplot(data)
pylab.show()

Ligand efficiencies of compounds with certain scaffolds¶

In [44]:

data = [cpdsDF[cpdsDF['Murcko_SMILES'] == x]['LE'] for x in sortedScaffolds['Murcko_SMILES'].head(8)]
data.append(cpdsDF[cpdsDF['knownDrug'] == 'Yes']['LE'])

In [45]:

pylab.boxplot(data)
pylab.show()

Check our GitHub https://github.com/Team-SKI/snippets ¶

Check our web page http://www.bio.mx.¶

Thank you!¶

This work is licensed under the Creative Commons Attribution-ShareAlike 3.0 License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/ or send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.

In [ ]: