Python
[Python](http://www.python.org/) very popular programming language especially in science.
pandas
[Pandas](http://pandas.pydata.org/) is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. No need for R!
RDKit
[RDKit](http://www.rdkit.org/) is an open source chemistry toolkit.
IPython
[IPython](http://ipython.org/) interactive Python shell. Has web-based interactive computational environment IPython Notebook.
chembl_webresource_client
[chembl_webresource_client](https://github.com/chembl/chembl_webresource_client) Python client for accessing ChEMBL webservices.
import pandas as pd # Import pandas
from chembl_webresource_client import *
Target object handler for API access and check connection to the db
targets = TargetResource()
print targets.status()
True
uniprotID: P00519
targets.get(uniprot=['P00519'])
[{u'bioactivityCount': 9984, u'chemblId': u'CHEMBL1862', u'compoundCount': 2997, u'description': u'Tyrosine-protein kinase ABL', u'geneNames': u'Unspecified', u'organism': u'Homo sapiens', u'preferredName': u'Tyrosine-protein kinase ABL', u'proteinAccession': u'P00519', u'synonyms': u'Proto-oncogene c-Abl,JTK7,ABL1,2.7.10.2,Abelson murine leukemia viral oncogene homolog 1,Abelson tyrosine-protein kinase 1,ABL ,Tyrosine-protein kinase ABL1,p150', u'targetType': u'SINGLE PROTEIN'}]
Load targets in data frame
targetsDF = pd.DataFrame.from_dict(targets.get(uniprot=['P00519']))
targetsDF
bioactivityCount | chemblId | compoundCount | description | geneNames | organism | preferredName | proteinAccession | synonyms | targetType | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 9984 | CHEMBL1862 | 2997 | Tyrosine-protein kinase ABL | Unspecified | Homo sapiens | Tyrosine-protein kinase ABL | P00519 | Proto-oncogene c-Abl,JTK7,ABL1,2.7.10.2,Abelso... | SINGLE PROTEIN |
Download all bioactivities for targets
bioactsDF = pd.DataFrame.from_dict(targets.bioactivities('CHEMBL1862'))
bioactsDF.head(1)
activity_comment | assay_chemblid | assay_description | assay_type | bioactivity_type | ingredient_cmpd_chemblid | name_in_reference | operator | organism | parent_cmpd_chemblid | reference | target_chemblid | target_confidence | target_name | units | value | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Unspecified | CHEMBL1063789 | Binding constant for ABL1(E255K) kinase domain | B | Kd | CHEMBL440084 | SB-431542 | > | Homo sapiens | CHEMBL440084 | Nat. Biotechnol., (2008) 26:1:127 | CHEMBL1862 | 9 | Tyrosine-protein kinase ABL | nM | 10000 |
Filter bioactivities (safe defaults)
bioactsDF = bioactsDF[(bioactsDF['bioactivity_type'] == 'IC50') & # keep ony IC50
(bioactsDF['operator'] == '=') & # only exact measurements
(bioactsDF['assay_type'] == 'B') & # only binding data
(bioactsDF['target_confidence'] == 9)] # only high target confidence
len(bioactsDF), len(bioactsDF['ingredient_cmpd_chemblid'].unique())
(282, 251)
Compound object handler for API access and check connection to the db
compounds = CompoundResource()
compounds.status()
True
Load the compounds in the dataframe
cpdsDF = pd.DataFrame.from_dict(compounds.get(list(bioactsDF['ingredient_cmpd_chemblid'].unique())))
cpdsDF.head()
acdAcidicPka | acdBasicPka | acdLogd | acdLogp | alogp | chemblId | knownDrug | medChemFriendly | molecularFormula | molecularWeight | numRo5Violations | passesRuleOfThree | preferredCompoundName | rotatableBonds | smiles | species | stdInChiKey | synonyms | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | 8.97 | 3.71 | 5.27 | 3.82 | CHEMBL388978 | No | Yes | C28H26N4O3 | 466.53 | 0 | No | STAUROSPORINE | 2 | CN[C@@H]1C[C@H]2O[C@@](C)([C@@H]1OC)n3c4ccccc4... | BASE | HKSZLNNOFSGOKW-FYTWVXJKSA-N | SID26755744,SID26755745 |
1 | NaN | 6.88 | 4.02 | 4.13 | 3.76 | CHEMBL1290072 | No | Yes | C19H16N2O2 | 304.34 | 0 | No | NaN | 4 | COc1ccc(CNc2ccnc3oc4ccccc4c23)cc1 | NEUTRAL | MNMQBQOLMHVBJT-UHFFFAOYSA-N | NaN |
2 | 12.27 | 3.32 | 4.18 | 4.18 | 5.10 | CHEMBL565609 | No | Yes | C28H28Cl2N6O4 | 583.47 | 2 | No | NaN | 9 | CN1C(=O)C(=Cc2cnc(Nc3ccc(NC(=O)CCNC(=O)OC(C)(C... | NEUTRAL | APGJWCRRERWKDQ-UHFFFAOYSA-N | NaN |
3 | 12.80 | 7.86 | 5.67 | 6.62 | 5.26 | CHEMBL1171085 | No | Yes | C30H25F3N6O | 542.55 | 2 | No | NaN | 8 | CN(C)Cc1nccn1c2cc(NC(=O)c3ccc(C)c(c3)C#Cc4cnc5... | NEUTRAL | QZYXEXGJVTZDIZ-UHFFFAOYSA-N | NaN |
4 | 8.13 | 7.54 | 4.88 | 5.35 | 4.75 | CHEMBL1171657 | No | Yes | C29H27F3N6O | 532.56 | 1 | No | NaN | 7 | CN1CCN(Cc2ccc(NC(=O)c3ccc(C)c(c3)C#Cc4cnc5[nH]... | NEUTRAL | XGDBPKHGUYPWPJ-UHFFFAOYSA-N | NaN |
bioactsDF['value'] = bioactsDF['value'].astype(float) # making sure everything is float
import rdkit.Chem as Chem
from rdkit.Chem import Draw
from rdkit.Chem import PandasTools
from rdkit.Chem.Draw import IPythonConsole # Enables RDKit IPython integration
molnames = 'chemblId'
smiles = 'smiles'
Add molecule column
PandasTools.AddMoleculeColumnToFrame(cpdsDF, smilesCol='smiles')
cpdsDF = cpdsDF[[molnames, smiles, 'ROMol', 'knownDrug', 'preferredCompoundName']]
Ligand efficiency: LE = 1.4 x pIC50/HAC
import numpy as np
def getBioacts(cpd, target):
value = bioactsDF[(bioactsDF['ingredient_cmpd_chemblid'] == cpd)& # All rows of compound AND
(bioactsDF['target_chemblid'] == target)]['value'].mean() # target. Get mean of values
return np.log10(value*10**-9)*-1 # returns pIC50
def getLE(mol, pIC50):
return 1.4*pIC50/mol.GetNumHeavyAtoms()
cpdsDF['pIC50'] = cpdsDF.apply(lambda x: getBioacts(x[molnames], 'CHEMBL1862'), axis=1)
cpdsDF['LE'] = cpdsDF.apply(lambda x: getLE(x['ROMol'], x['pIC50']), axis=1)
cpdsDF.head(1)
chemblId | smiles | ROMol | knownDrug | preferredCompoundName | pIC50 | LE | |
---|---|---|---|---|---|---|---|
0 | CHEMBL388978 | CN[C@@H]1C[C@H]2O[C@@](C)([C@@H]1OC)n3c4ccccc4c5c6CNC(=O)c6c7c8ccccc8n2c7c35 | No | STAUROSPORINE | 6.677781 | 0.267111 |
len(cpdsDF)
251
Bemis, G. W.; Murcko, M. A. “The Properties of Known Drugs. 1. Molecular Frameworks.” J. Med. Chem. 39:2887-93 (1996).
Decomposition of molecules to scaffolds or generic frameworks
Functionality present in RDKit. Added it to PandasTools
from rdkit.Chem.Scaffolds import MurckoScaffold
mol = Chem.MolFromSmiles(cpdsDF.ix[2]['smiles'])
scaffold = MurckoScaffold.GetScaffoldForMol(mol)
generic = MurckoScaffold.MakeScaffoldGeneric(MurckoScaffold.GetScaffoldForMol(mol))
Draw.MolsToGridImage([mol, scaffold, generic])
PandasTools.AddMurckoToFrame(cpdsDF)
PandasTools.AlignToScaffold(cpdsDF, molCol='ROMol', scaffoldCol='Murcko_SMILES')
cpdsDF.head(1)
chemblId | smiles | ROMol | knownDrug | preferredCompoundName | pIC50 | LE | Murcko_SMILES | |
---|---|---|---|---|---|---|---|---|
0 | CHEMBL388978 | CN[C@@H]1C[C@H]2O[C@@](C)([C@@H]1OC)n3c4ccccc4c5c6CNC(=O)c6c7c8ccccc8n2c7c35 | No | STAUROSPORINE | 6.677781 | 0.267111 | O=C1NCc2c1c1c3ccccc3n3c1c1c2c2ccccc2n1C1CCCC3O1 |
Now we can use pandas groupby() functionality and group by scaffolds and create a new frame with scaffolds sorted by number of members
sortedScaffolds = cpdsDF.groupby(['Murcko_SMILES']).count().sort(smiles, ascending=False)
sortedScaffolds = sortedScaffolds[[smiles]] # Keep only smiles column
sortedScaffolds = sortedScaffolds.rename(columns={smiles:'count'}) # rename smiles column to count
sortedScaffolds['Murcko_SMILES'] = sortedScaffolds.index # actual SMILES are only in index column, move it
sortedScaffolds.index = range(len(sortedScaffolds)+1)[1:]
sortedScaffolds.head()
count | Murcko_SMILES | |
---|---|---|
1 | 17 | O=c1[nH]c2nc(Nc3ccccc3)ncc2cc1-c1ccccc1 |
2 | 11 | O=C(Nc1cccc(Nc2nccc(-c3cccnc3)n2)c1)c1ccccc1 |
3 | 10 | O=C(Nc1ccc(CN2CCNCC2)cc1)c1cccc(C#Cc2cnc[nH]2)c1 |
4 | 9 | O=C(COc1ccc2c(=O)cc(-c3ccccc3)oc2c1)Nc1ccccn1 |
5 | 8 | O=C(Nc1ccc(CN2CCNCC2)cc1)c1cccc(C#Cc2cnc3cccnn23)c1 |
len(sortedScaffolds)
140
Add RDKit's ROMol column to scaffolds dataframe so we can visualize it
PandasTools.AddMoleculeColumnToFrame(sortedScaffolds, smilesCol='Murcko_SMILES')
PandasTools.FrameToGridImage(sortedScaffolds.dropna().head(8), legendsCol='count',
molsPerRow=4) # dropna() drops compounds without scaffold
Get compounds that belong to certain scaffold
PandasTools.FrameToGridImage(cpdsDF[cpdsDF['Murcko_SMILES'] == 'O=c1[nH]c2nc(Nc3ccccc3)ncc2cc1-c1ccccc1'].head(4),
legendsCol=molnames, molsPerRow=4)
cpdsDF[cpdsDF['Murcko_SMILES'] == 'O=c1[nH]c2nc(Nc3ccccc3)ncc2cc1-c1ccccc1'
]['pIC50'].max() # min(), mean(), sum(), anything you can do with pandas Series
8.6382721639824069
%matplotlib inline
import pylab
data = [cpdsDF[cpdsDF['Murcko_SMILES'] == x]['pIC50'] for x in sortedScaffolds['Murcko_SMILES'].head(8)]
data.append(cpdsDF[cpdsDF['knownDrug'] == 'Yes']['pIC50'])
pylab.boxplot(data)
pylab.show()
data = [cpdsDF[cpdsDF['Murcko_SMILES'] == x]['LE'] for x in sortedScaffolds['Murcko_SMILES'].head(8)]
data.append(cpdsDF[cpdsDF['knownDrug'] == 'Yes']['LE'])
pylab.boxplot(data)
pylab.show()
Copyright (C) 2013, 2014 by Samo Turk, BioMed X GmbH
This work is licensed under the Creative Commons Attribution-ShareAlike 3.0 License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/ or send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.