This is a relatively short one because I just wanted to point people to an older (thus buried) dataset from some former colleagues that I've found really useful in the past but that I think a lot of folks aren't aware of. The dataset is in the supplementary material to this paper: https://pubs.acs.org/doi/abs/10.1021/jm020472j "Informative Library Design as an Efficient Strategy to Identify and Optimize Leads: Application to Cyclin-Dependent Kinase 2 Antagonists" by Erin Bradley et al. The paper itself is worth reading, but the buried treasure is the Excel file in the supplementary material, which contains SMILES and measured data for >17K compounds. There's also a very useful PDF which explains the columns in that file.
What's very cool is that the compounds are a mix of things from a small general-purpose screening library and compounds purchased or synthesized for a med chem project. I'm not aware of any other public
Let's look at what's there.
from rdkit import Chem
from rdkit.Chem import PandasTools
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import rdFMCS
import pandas as pd
import rdkit
print(rdkit.__version__)
2020.03.1dev1
Start by reading in the file and adding a molecule column:
df = pd.read_excel('../data/jm020472j_s2.xls')
df.head()
| | Smiles | mol_name | cdk2_ic50 | cdk2_inhib | scaffold | sourcepool | cdk_act_bin_1 |
---|---|---|---|---|---|---|---|
0 | C#CC(/C=C/C)OC(=O)c(ccc1)c(c1)C(=O)O | mol_1 | None | 23.3 | Scaffold_00 | divscreen | 0 |
1 | C#CC(C)(C)N(CC1C)CC\C1=N\OC(=O)c2cc([N+]([O-])... | mol_2 | None | 20.5 | Scaffold_00 | divscreen | 0 |
2 | C#CC(C)(C)N(CN1)CNC1=S | mol_3 | None | 20.7 | Scaffold_00 | divscreen | 0 |
3 | C#CC(C)(C)N(CN1)C\N=C/1SC | mol_4 | None | 1.9 | Scaffold_00 | divscreen | 0 |
4 | C#CC(C)(C)NC(=O)CN(c(c1)cccc1)[S](=O)(=O)c2ccc... | mol_5 | None | 19.8 | Scaffold_00 | divscreen | 0 |
PandasTools.AddMoleculeColumnToFrame(df)
df = df.drop('Smiles',axis=1)
df.head()
| | mol_name | cdk2_ic50 | cdk2_inhib | scaffold | sourcepool | cdk_act_bin_1 | ROMol |
---|---|---|---|---|---|---|---|
0 | mol_1 | None | 23.3 | Scaffold_00 | divscreen | 0 | |
1 | mol_2 | None | 20.5 | Scaffold_00 | divscreen | 0 | |
2 | mol_3 | None | 20.7 | Scaffold_00 | divscreen | 0 | |
3 | mol_4 | None | 1.9 | Scaffold_00 | divscreen | 0 | |
4 | mol_5 | None | 19.8 | Scaffold_00 | divscreen | 0 |
The "sourcepool" column relates to which kind of compound it is:
- divscreen: compounds from the screening library
- simscreen: vendor compounds picked by chemists using similarity screening
- synscreen: compounds made and screened during the course of the project

df.groupby('sourcepool')['mol_name'].count()
sourcepool
divscreen    13359
simscreen      951
synscreen     3240
Name: mol_name, dtype: int64
Another interesting column is scaffold, which contains human-assigned scaffolds for the compounds:
df[df.sourcepool=='divscreen'].groupby('scaffold')['mol_name'].count()
scaffold
Scaffold_00    12280
Scaffold_01       32
Scaffold_02      111
Scaffold_03       57
Scaffold_04      461
Scaffold_05       37
Scaffold_06       16
Scaffold_07       15
Scaffold_08      103
Scaffold_09       88
Scaffold_11        7
Scaffold_12      109
Scaffold_18        2
Scaffold_20       41
Name: mol_name, dtype: int64
df[df.sourcepool=='simscreen'].groupby('scaffold')['mol_name'].count()
scaffold
Scaffold_00    460
Scaffold_01     90
Scaffold_03      2
Scaffold_04     91
Scaffold_05     18
Scaffold_08    216
Scaffold_10     37
Scaffold_11      8
Scaffold_12     29
Name: mol_name, dtype: int64
synscreen = df[df.sourcepool=='synscreen']
synscreen.groupby('scaffold')['mol_name'].count()
scaffold
Scaffold_01    204
Scaffold_02    106
Scaffold_04     78
Scaffold_05    265
Scaffold_06    273
Scaffold_07     76
Scaffold_08     13
Scaffold_09    344
Scaffold_10    502
Scaffold_12     61
Scaffold_13     17
Scaffold_14     12
Scaffold_15     11
Scaffold_16    269
Scaffold_17    157
Scaffold_18     38
Scaffold_19    256
Scaffold_21    558
Name: mol_name, dtype: int64
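The three per-pool counts above can also be collected into a single scaffold-by-pool table. Here's a minimal sketch of that using pd.crosstab; the toy frame and its rows are made up for illustration, and only the column names are taken from the real data.

```python
import pandas as pd

# Toy stand-in for the real frame: the rows are invented, only the
# column names (mol_name, sourcepool, scaffold) match the dataset.
toy = pd.DataFrame({
    'mol_name': ['m1', 'm2', 'm3', 'm4', 'm5', 'm6'],
    'sourcepool': ['divscreen', 'divscreen', 'simscreen',
                   'synscreen', 'synscreen', 'synscreen'],
    'scaffold': ['Scaffold_00', 'Scaffold_01', 'Scaffold_01',
                 'Scaffold_01', 'Scaffold_05', 'Scaffold_05'],
})

# Rows are scaffolds, columns are source pools, cells are compound counts.
# Combinations that never occur show up as 0.
scaffold_by_pool = pd.crosstab(toy['scaffold'], toy['sourcepool'])
print(scaffold_by_pool)
```

On the real data the equivalent call would presumably be pd.crosstab(df.scaffold, df.sourcepool), which makes it easy to spot scaffolds that only appear in one pool.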
Let's look at some compounds:
IPythonConsole.drawOptions.bondLineWidth=1
PandasTools.FrameToGridImage(synscreen[synscreen.scaffold=='Scaffold_18'],molsPerRow=4,maxMols=20)
We don't have the actual scaffold definitions, but we can use the standard MCS trick to guess:
scaff = 'Scaffold_18'
params = rdFMCS.MCSParameters()
params.BondCompareParameters.CompleteRingsOnly=True
params.Threshold=0.8
mcs = rdFMCS.FindMCS(list(synscreen[synscreen.scaffold==scaff].ROMol),params)
print(mcs.smartsString)
tmp = synscreen[synscreen.scaffold==scaff]
tmp[tmp.ROMol>=Chem.MolFromSmarts(mcs.smartsString)]
[#6]-&!@[#8]-&!@[#6](=&!@[#8])-&!@[#7]-&!@[#6]
| | mol_name | cdk2_ic50 | cdk2_inhib | scaffold | sourcepool | cdk_act_bin_1 | ROMol |
---|---|---|---|---|---|---|---|
946 | mol_947 | None | 11.1111 | Scaffold_18 | synscreen | 0 | |
1528 | mol_1529 | None | 12.9218 | Scaffold_18 | synscreen | 0 | |
1529 | mol_1530 | None | 6.47714 | Scaffold_18 | synscreen | 0 | |
4923 | mol_4924 | None | 0 | Scaffold_18 | synscreen | 0 | |
4924 | mol_4925 | None | 0.905505 | Scaffold_18 | synscreen | 0 | |
4927 | mol_4928 | None | 23.1168 | Scaffold_18 | synscreen | 0 | |
5464 | mol_5465 | None | 9.49246 | Scaffold_18 | synscreen | 0 | |
6775 | mol_6776 | None | 0 | Scaffold_18 | synscreen | 0 | |
6779 | mol_6780 | None | 5.73388 | Scaffold_18 | synscreen | 0 | |
6783 | mol_6784 | None | 0 | Scaffold_18 | synscreen | 0 | |
6784 | mol_6785 | None | 0 | Scaffold_18 | synscreen | 0 | |
6785 | mol_6786 | None | 10.5825 | Scaffold_18 | synscreen | 0 | |
6786 | mol_6787 | None | 27.6304 | Scaffold_18 | synscreen | 0 | |
8085 | mol_8086 | None | None | Scaffold_18 | synscreen | 50 | |
8168 | mol_8169 | None | 0 | Scaffold_18 | synscreen | 0 | |
8169 | mol_8170 | None | 0 | Scaffold_18 | synscreen | 0 | |
8170 | mol_8171 | None | 0 | Scaffold_18 | synscreen | 0 | |
8200 | mol_8201 | None | 0 | Scaffold_18 | synscreen | 0 | |
8201 | mol_8202 | None | 0 | Scaffold_18 | synscreen | 0 | |
9035 | mol_9036 | None | 2.12431 | Scaffold_18 | synscreen | 0 | |
9529 | mol_9530 | None | 12.4351 | Scaffold_18 | synscreen | 0 | |
9533 | mol_9534 | None | 2.78529 | Scaffold_18 | synscreen | 0 | |
9576 | mol_9577 | None | 0 | Scaffold_18 | synscreen | 0 | |
9910 | mol_9911 | None | 0 | Scaffold_18 | synscreen | 0 | |
9911 | mol_9912 | None | 0 | Scaffold_18 | synscreen | 0 | |
9912 | mol_9913 | None | 17.9575 | Scaffold_18 | synscreen | 0 | |
10516 | mol_10517 | None | 10.9348 | Scaffold_18 | synscreen | 0 | |
10859 | mol_10860 | None | 0 | Scaffold_18 | synscreen | 0 | |
10920 | mol_10921 | None | 12.4249 | Scaffold_18 | synscreen | 0 | |
11463 | mol_11464 | None | 0 | Scaffold_18 | synscreen | 0 | |
11474 | mol_11475 | None | 0.630415 | Scaffold_18 | synscreen | 0 | |
11481 | mol_11482 | None | 10.5566 | Scaffold_18 | synscreen | 0 | |
11702 | mol_11703 | None | 4.19753 | Scaffold_18 | synscreen | 0 | |
14808 | mol_14809 | None | 0 | Scaffold_18 | synscreen | 0 | |
16613 | mol_16614 | None | 3.23995 | Scaffold_18 | synscreen | 0 | |
17099 | mol_17100 | None | 4.93252 | Scaffold_18 | synscreen | 0 | |
17218 | mol_17219 | None | 10.2165 | Scaffold_18 | synscreen | 0 | |
17221 | mol_17222 | None | 0 | Scaffold_18 | synscreen | 0 |
scaff = 'Scaffold_12'
params = rdFMCS.MCSParameters()
params.BondCompareParameters.CompleteRingsOnly=True
params.Threshold=0.8
mcs = rdFMCS.FindMCS(list(synscreen[synscreen.scaffold==scaff].ROMol),params)
print(mcs.smartsString)
tmp = synscreen[synscreen.scaffold==scaff]
tmp[tmp.ROMol>=Chem.MolFromSmarts(mcs.smartsString)]
[#6]-&!@[#7]1:&@[#6](=&!@[#8]):&@[#6]:&@[#6]:&@[#7]:&@[#6]:&@1=&!@[#8]
| | mol_name | cdk2_ic50 | cdk2_inhib | scaffold | sourcepool | cdk_act_bin_1 | ROMol |
---|---|---|---|---|---|---|---|
1202 | mol_1203 | None | 19.7348 | Scaffold_12 | synscreen | 0 | |
3236 | mol_3237 | None | 32.6813 | Scaffold_12 | synscreen | 50 | |
3318 | mol_3319 | None | 18.6061 | Scaffold_12 | synscreen | 0 | |
3769 | mol_3770 | None | 16.5904 | Scaffold_12 | synscreen | 0 | |
3787 | mol_3788 | None | 5.70459 | Scaffold_12 | synscreen | 0 | |
4215 | mol_4216 | None | 10.2919 | Scaffold_12 | synscreen | 0 | |
4440 | mol_4441 | None | 22.2017 | Scaffold_12 | synscreen | 0 | |
5303 | mol_5304 | None | 7.12303 | Scaffold_12 | synscreen | 0 | |
5304 | mol_5305 | None | 6.22931 | Scaffold_12 | synscreen | 0 | |
5417 | mol_5418 | None | 12.3034 | Scaffold_12 | synscreen | 0 | |
5502 | mol_5503 | None | 15.1068 | Scaffold_12 | synscreen | 0 | |
5503 | mol_5504 | None | 5.26633 | Scaffold_12 | synscreen | 0 | |
7039 | mol_7040 | None | 5.29642 | Scaffold_12 | synscreen | 0 | |
7040 | mol_7041 | None | 10.9841 | Scaffold_12 | synscreen | 0 | |
7115 | mol_7116 | None | 8.00481 | Scaffold_12 | synscreen | 0 | |
8139 | mol_8140 | None | 30.0956 | Scaffold_12 | synscreen | 50 | |
8445 | mol_8446 | None | 2.88896 | Scaffold_12 | synscreen | 0 | |
8782 | mol_8783 | None | 17.0829 | Scaffold_12 | synscreen | 0 | |
8976 | mol_8977 | None | 23.0816 | Scaffold_12 | synscreen | 0 | |
8977 | mol_8978 | None | 13.8128 | Scaffold_12 | synscreen | 0 | |
9068 | mol_9069 | None | 22.7875 | Scaffold_12 | synscreen | 0 | |
9069 | mol_9070 | None | 0 | Scaffold_12 | synscreen | 0 | |
9070 | mol_9071 | None | 20.2282 | Scaffold_12 | synscreen | 0 | |
9234 | mol_9235 | None | 18.2365 | Scaffold_12 | synscreen | 0 | |
9235 | mol_9236 | None | 26.3953 | Scaffold_12 | synscreen | 0 | |
9491 | mol_9492 | None | 8.03491 | Scaffold_12 | synscreen | 0 | |
9952 | mol_9953 | None | 0 | Scaffold_12 | synscreen | 0 | |
10056 | mol_10057 | None | 16.9904 | Scaffold_12 | synscreen | 0 | |
10729 | mol_10730 | None | 11.2248 | Scaffold_12 | synscreen | 0 | |
11005 | mol_11006 | None | 20.7216 | Scaffold_12 | synscreen | 0 | |
... | ... | ... | ... | ... | ... | ... | ... |
12412 | mol_12413 | None | 25.9019 | Scaffold_12 | synscreen | 0 | |
12475 | mol_12476 | None | 18.3472 | Scaffold_12 | synscreen | 0 | |
12964 | mol_12965 | None | 5.44689 | Scaffold_12 | synscreen | 0 | |
13128 | mol_13129 | None | 22.695 | Scaffold_12 | synscreen | 0 | |
13152 | mol_13153 | None | 27.1045 | Scaffold_12 | synscreen | 0 | |
13153 | mol_13154 | None | 20.5982 | Scaffold_12 | synscreen | 0 | |
13156 | mol_13157 | None | 14.3244 | Scaffold_12 | synscreen | 0 | |
13157 | mol_13158 | None | 0 | Scaffold_12 | synscreen | 0 | |
13158 | mol_13159 | None | 19.247 | Scaffold_12 | synscreen | 0 | |
13173 | mol_13174 | None | 14.6778 | Scaffold_12 | synscreen | 0 | |
13251 | mol_13252 | None | 23.2192 | Scaffold_12 | synscreen | 0 | |
13463 | mol_13464 | None | 27.5979 | Scaffold_12 | synscreen | 0 | |
13464 | mol_13465 | None | 16.4046 | Scaffold_12 | synscreen | 0 | |
13476 | mol_13477 | None | 11.0141 | Scaffold_12 | synscreen | 0 | |
13709 | mol_13710 | None | 4.65618 | Scaffold_12 | synscreen | 0 | |
13906 | mol_13907 | None | 17.9958 | Scaffold_12 | synscreen | 0 | |
14008 | mol_14009 | None | 20.1625 | Scaffold_12 | synscreen | 0 | |
14010 | mol_14011 | None | 7.49323 | Scaffold_12 | synscreen | 0 | |
14022 | mol_14023 | None | 16.5814 | Scaffold_12 | synscreen | 0 | |
14092 | mol_14093 | None | 22.5717 | Scaffold_12 | synscreen | 0 | |
14155 | mol_14156 | None | 13.1207 | Scaffold_12 | synscreen | 0 | |
14156 | mol_14157 | None | 11.9771 | Scaffold_12 | synscreen | 0 | |
14565 | mol_14566 | None | 11.4655 | Scaffold_12 | synscreen | 0 | |
14602 | mol_14603 | None | 5.56726 | Scaffold_12 | synscreen | 0 | |
14827 | mol_14828 | None | 10.4123 | Scaffold_12 | synscreen | 0 | |
16927 | mol_16928 | None | 3.55101 | Scaffold_12 | synscreen | 0 | |
17019 | mol_17020 | None | 24.0746 | Scaffold_12 | synscreen | 0 | |
17116 | mol_17117 | None | 0 | Scaffold_12 | synscreen | 0 | |
17479 | mol_17480 | None | 10.7273 | Scaffold_12 | synscreen | 0 | |
17531 | mol_17532 | None | 32.9697 | Scaffold_12 | synscreen | 50 |
61 rows × 7 columns
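The Threshold=0.8 setting used above asks FindMCS for a substructure that is present in at least 80% of the input molecules rather than in all of them, which keeps a few odd compounds from shrinking the result. Here's a small sketch of that effect; the five SMILES are made up for illustration (four share a phenyl-CH2-C core and one is an unrelated alkane), and find_mcs is just a helper name I'm introducing here.

```python
from rdkit import Chem
from rdkit.Chem import rdFMCS

# Made-up molecules: the first four share a phenyl-CH2-C core,
# the last one is an unrelated alkane acting as the outlier.
smiles = ['NCCc1ccccc1', 'OCCc1ccccc1', 'ClCCc1ccccc1',
          'OC(=O)Cc1ccccc1', 'CCCCCCCCCCCC']
mols = [Chem.MolFromSmiles(smi) for smi in smiles]

def find_mcs(mols, threshold):
    """Hypothetical helper: same parameters as in the post, variable threshold."""
    params = rdFMCS.MCSParameters()
    params.BondCompareParameters.CompleteRingsOnly = True
    params.Threshold = threshold
    return rdFMCS.FindMCS(mols, params)

strict = find_mcs(mols, 1.0)   # must be found in every molecule
relaxed = find_mcs(mols, 0.8)  # must be found in at least 4 of the 5

print('Threshold=1.0:', strict.numAtoms, 'atoms', strict.smartsString)
print('Threshold=0.8:', relaxed.numAtoms, 'atoms', relaxed.smartsString)

# Count how many molecules actually contain the relaxed MCS.
core = Chem.MolFromSmarts(relaxed.smartsString)
n_matching = sum(m.HasSubstructMatch(core) for m in mols)
print(n_matching, 'of', len(mols), 'molecules match the relaxed MCS')
```

With the alkane in the set, the strict search should only be able to return a small chain fragment, while the relaxed search can keep the aromatic ring. The flip side is that a thresholded MCS is not guaranteed to match every compound assigned to a scaffold, so it's worth checking the match count as done in the last lines.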
from collections import defaultdict
params = rdFMCS.MCSParameters()
params.BondCompareParameters.CompleteRingsOnly=True
params.Threshold=0.8
scaffs = defaultdict(list)
for scaff in synscreen.scaffold.unique():
    # work on an explicit copy so that assigning to ROMol below
    # does not trigger a SettingWithCopyWarning
    subset = synscreen[synscreen.scaffold==scaff].copy()
    subset['ROMol'] = [Chem.Mol(x) for x in list(subset.ROMol)]
    # if len(subset)>100:
    #     continue
    print(f'Doing {scaff} with {len(subset)} mols')
    mcs = rdFMCS.FindMCS(list(subset.ROMol),params)
    print(mcs.smartsString)
    # keep the first molecule matching the MCS as a representative of the scaffold
    matches = subset[subset.ROMol>=Chem.MolFromSmarts(mcs.smartsString)]
    mol = matches.iloc[0].ROMol
    scaffs['scaffold'].append(scaff)
    scaffs['ROMol'].append(mol)
    scaffs['smarts'].append(mcs.smartsString)
    scaffs['timed_out'].append(mcs.canceled)
scaffs = pd.DataFrame(scaffs)
Doing Scaffold_05 with 265 mols
[#6]-&!@[#7]-&!@[#6]1:&@[#6]:&@[#6]:&@[#7]:&@[#6](:&@[#7]:&@1)-&!@[#7]-&!@[#6]
Doing Scaffold_10 with 502 mols
[#6]12:&@[#6]:&@[#6]:&@[#7]:&@[#7]:&@1:&@[#6](:&@[#6]:&@[#6](:&@[#7]:&@2)-&!@[#6])=&!@[#8]
Doing Scaffold_09 with 344 mols
[#6]-&!@[#6]1=&@[#7]-&@[#7](-&@[#6](-&@[#6]-&@1)-&!@[#6])-&!@[#6]
Doing Scaffold_16 with 269 mols
[#6]-&!@[#8]-&!@[#6](=&!@[#8])-&!@[#6]1(-&!@[#6])-&@[#7]-&@[#6](-&@[#6]2-&@[#6]-&@1-&@[#6](-&@[#7]-&@[#6]-&@2=&!@[#8])=&!@[#8])-&!@[#6]
Doing Scaffold_07 with 76 mols
[#6]-&!@[#6]1=&@[#7]-&@[#7]-&@[#6](-&@[#6]-&@1)=&!@[#8]
Doing Scaffold_02 with 106 mols
[#6]1:&@[#6]:&@[#6]:&@[#6]:&@[#6](:&@[#6]:&@1)-&!@[#6]1:&@[#6](:&@[#16]:&@[#6](:&@[#7]:&@1)-&!@[#7])-&!@[#6]1:&@[#6]:&@[#6]:&@[#6]:&@[#6]:&@[#6]:&@1
Doing Scaffold_06 with 273 mols
[#6]-&!@[#6](-&!@[#6]1:&@[#6]:&@[#6]:&@[#6]:&@[#6]:&@[#6]:&@1-&!@[#8])-&!@[#7]-&!@[#6]
Doing Scaffold_01 with 204 mols
[#6]-&!@[#6]-&!@[#7]1:&@[#6]:&@[#7]:&@[#6]:&@[#6]:&@1-&!@[#6]
Doing Scaffold_13 with 17 mols
[#6]-&!@[#7]-&!@[#6](=&!@[#8])-&!@[#6]1:&@[#6](-&!@[#8]):&@[#6](=&!@[#8]):&@[#6]:&@[#6](:&@[#8]:&@1)-&!@[#6](=&!@[#8])-&!@[#7]-&!@[#6]
Doing Scaffold_19 with 256 mols
[#6]-&!@[#6]-&!@[#7]-&!@[#6]1:&@[#7]:&@[#6]2:&@[#6](:&@[#16]:&@1):&@[#7]:&@[#6]:&@[#7]:&@[#6]:&@2-&!@[#7]-&!@[#6]-&!@[#6]1:&@[#6]:&@[#6]:&@[#6]:&@[#6]:&@[#6]:&@1
Doing Scaffold_18 with 38 mols
[#6]-&!@[#8]-&!@[#6](=&!@[#8])-&!@[#7]-&!@[#6]
Doing Scaffold_04 with 78 mols
[#6]-&!@[#7]-&!@[#6](=&!@[#8])-&!@[#7]-&!@[#6]-&!@[#6]
Doing Scaffold_17 with 157 mols
[#6]-&!@[#7]1:&@[#6]:&@[#6](:&@[#6]:&@[#7]:&@1)-&!@[#6](=&!@[#8])-&!@[#6]1:&@[#6](:&@[#6]:&@[#6]:&@[#6]:&@[#6]:&@1)-&!@[#8]
Doing Scaffold_12 with 61 mols
[#6]-&!@[#7]1:&@[#6](=&!@[#8]):&@[#6]:&@[#6]:&@[#7]:&@[#6]:&@1=&!@[#8]
Doing Scaffold_21 with 558 mols
[#6]-&!@[#6]-&!@[#7]-&!@[#6]1:&@[#6](-&!@[#6]):&@[#7]:&@[#6]:&@[#7]:&@1
Doing Scaffold_15 with 11 mols
[#6]1:&@[#6]:&@[#6]:&@[#6]:&@[#6]:&@[#6]:&@1
Doing Scaffold_14 with 12 mols
[#6]1:&@[#6]:&@[#6](-&!@[#9]):&@[#6]:&@[#6]:&@[#6]:&@1-&!@[#6]1:&@[#6]2:&@[#6](:&@[#7]:&@[#6]:&@1-&!@[#6]1:&@[#6]:&@[#6]:&@[#7]:&@[#6](:&@[#6]:&@1)-&!@[#7]-&!@[#6]):&@[#7]:&@[#6]:&@[#6]:&@[#7]:&@2
Doing Scaffold_08 with 13 mols
[#6]1:&@[#7]:&@[#7]:&@[#6](:&@[#6]:&@1)-&!@[#6]
counts = [len(synscreen[synscreen.scaffold == scaff]) for scaff in synscreen.scaffold.unique()]
#scaffs['count'] = [x for x in counts if x<=100]
scaffs['count'] = counts
scaffs
| | scaffold | ROMol | smarts | timed_out | count |
---|---|---|---|---|---|
0 | Scaffold_05 | | [#6]-&!@[#7]-&!@[#6]1:&@[#6]:&@[#6]:&@[#7]:&@[... | False | 265 |
1 | Scaffold_10 | | [#6]12:&@[#6]:&@[#6]:&@[#7]:&@[#7]:&@1:&@[#6](... | False | 502 |
2 | Scaffold_09 | | [#6]-&!@[#6]1=&@[#7]-&@[#7](-&@[#6](-&@[#6]-&@... | False | 344 |
3 | Scaffold_16 | | [#6]-&!@[#8]-&!@[#6](=&!@[#8])-&!@[#6]1(-&!@[#... | False | 269 |
4 | Scaffold_07 | | [#6]-&!@[#6]1=&@[#7]-&@[#7]-&@[#6](-&@[#6]-&@1... | False | 76 |
5 | Scaffold_02 | | [#6]1:&@[#6]:&@[#6]:&@[#6]:&@[#6](:&@[#6]:&@1)... | False | 106 |
6 | Scaffold_06 | | [#6]-&!@[#6](-&!@[#6]1:&@[#6]:&@[#6]:&@[#6]:&@... | False | 273 |
7 | Scaffold_01 | | [#6]-&!@[#6]-&!@[#7]1:&@[#6]:&@[#7]:&@[#6]:&@[... | False | 204 |
8 | Scaffold_13 | | [#6]-&!@[#7]-&!@[#6](=&!@[#8])-&!@[#6]1:&@[#6]... | False | 17 |
9 | Scaffold_19 | | [#6]-&!@[#6]-&!@[#7]-&!@[#6]1:&@[#7]:&@[#6]2:&... | False | 256 |
10 | Scaffold_18 | | [#6]-&!@[#8]-&!@[#6](=&!@[#8])-&!@[#7]-&!@[#6] | False | 38 |
11 | Scaffold_04 | | [#6]-&!@[#7]-&!@[#6](=&!@[#8])-&!@[#7]-&!@[#6]... | False | 78 |
12 | Scaffold_17 | | [#6]-&!@[#7]1:&@[#6]:&@[#6](:&@[#6]:&@[#7]:&@1... | False | 157 |
13 | Scaffold_12 | | [#6]-&!@[#7]1:&@[#6](=&!@[#8]):&@[#6]:&@[#6]:&... | False | 61 |
14 | Scaffold_21 | | [#6]-&!@[#6]-&!@[#7]-&!@[#6]1:&@[#6](-&!@[#6])... | False | 558 |
15 | Scaffold_15 | | [#6]1:&@[#6]:&@[#6]:&@[#6]:&@[#6]:&@[#6]:&@1 | False | 11 |
16 | Scaffold_14 | | [#6]1:&@[#6]:&@[#6](-&!@[#9]):&@[#6]:&@[#6]:&@... | False | 12 |
17 | Scaffold_08 | | [#6]1:&@[#7]:&@[#7]:&@[#6](:&@[#6]:&@1)-&!@[#6] | False | 13 |
Finally, let's look at the measured data that's present.
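Nearly every compound has a measured % inhibition value (a few rows, such as mol_8086 in the Scaffold_18 table above, show None), and every compound is assigned to an "active", "inactive", or "gray" bin in the cdk_act_bin_1 column (the assignment scheme is described in the PDF that accompanies the Excel file).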
df.groupby('cdk_act_bin_1')['mol_name'].count()
cdk_act_bin_1
0      15276
50      1906
100      368
Name: mol_name, dtype: int64
df.groupby(['sourcepool','cdk_act_bin_1'])['mol_name'].count()
sourcepool  cdk_act_bin_1
divscreen   0                11972
            50                1180
            100                207
simscreen   0                  824
            50                  80
            100                 47
synscreen   0                 2480
            50                 646
            100                114
Name: mol_name, dtype: int64
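Since the three pools differ a lot in size, the raw counts are easier to compare as fractions within each pool. From the numbers above, bin 100 accounts for about 1.5% of divscreen (207/13359), 4.9% of simscreen (47/951), and 3.5% of synscreen (114/3240). Here's a minimal sketch of computing those fractions directly with pandas; the toy frame is made up for illustration and only reuses the real column names.

```python
import pandas as pd

# Toy stand-in for the real frame: invented rows, real column names.
toy = pd.DataFrame({
    'sourcepool': ['divscreen'] * 4 + ['synscreen'] * 4,
    'cdk_act_bin_1': [0, 0, 0, 100, 0, 50, 100, 100],
})

# Fraction of each pool that falls into each activity bin.
# unstack() turns the bins into columns; bins that never occur
# in a pool are filled with 0.
fractions = (toy.groupby('sourcepool')['cdk_act_bin_1']
                .value_counts(normalize=True)
                .unstack(fill_value=0))
print(fractions)
```

Each row of the result sums to 1, so the pools can be compared directly regardless of how many compounds they contain.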
There are also a smaller number of measured IC50 values:
df[df.cdk2_ic50 != 'None'].groupby('sourcepool')['mol_name'].count()
sourcepool
divscreen    51
simscreen    26
synscreen    34
Name: mol_name, dtype: int64
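The filter above works because missing IC50 values are stored as the string 'None', which also means the column isn't numeric. Here's a minimal sketch of one way to get a numeric column before doing any math with the values; the toy rows are made up, and cdk2_ic50_num is a column name I'm introducing for illustration.

```python
import pandas as pd

# Toy stand-in: missing IC50s are assumed to be the string 'None',
# as the comparison in the post suggests. The values are invented.
toy = pd.DataFrame({
    'mol_name': ['m1', 'm2', 'm3'],
    'cdk2_ic50': ['None', '0.25', '3.1'],
})

# errors='coerce' turns anything that cannot be parsed as a number into NaN.
toy['cdk2_ic50_num'] = pd.to_numeric(toy['cdk2_ic50'], errors='coerce')

# Keep only the rows that actually have a measured IC50.
has_ic50 = toy.dropna(subset=['cdk2_ic50_num'])
print(has_ic50)
```

The original string column is left untouched here, so the existing != 'None' filter keeps working alongside the numeric copy.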
I'll be using this dataset in future blog posts, but I think