If you are involved in building predictive models for biological end points in pharma research, you've certainly encountered the problem that our datasets tend to be very highly unbalanced - we usually have many, many more inactive molecules than active ones. The imbalance in the data can present challenges when evaluating model performance and "confuse" some of the machine-learning algorithms we use.
Let's start with evaluating models. Although at first glance a classifier (we'll focus on classifiers here) that shows 99% accuracy on hold-out data sounds pretty good, it's not going to be practically useful if the data set is 99% inactive and the model is just always predicting "inactive". There are a number of balanced accuracy metrics available to solve this problem; for this post we will use Cohen's kappa, which is readily available in scikit-learn. As an example of what a difference it makes, the normal accuracy for this confusion matrix:
[[400 0]
[ 19 1]]
is 95.5% but kappa, which ranges from 0 (no better than chance; negative values are possible for worse-than-chance predictions) to 1 (perfect agreement), is only 0.091. Paul Czodrowski has written a nice article discussing this in more detail that's well worth reading: https://link.springer.com/article/10.1007/s10822-014-9759-6. Paul has also done a presentation summarizing this.
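To make the arithmetic concrete, here's a small pure-Python sketch of how kappa compares observed accuracy to the accuracy expected by chance. The helper `kappa_from_confusion` is mine, written just for illustration; in practice you'd use `sklearn.metrics.cohen_kappa_score` on the raw labels, which gives the same value.

```python
def kappa_from_confusion(tn, fp, fn, tp):
    """Cohen's kappa from the cells of a 2x2 confusion matrix."""
    n = tn + fp + fn + tp
    observed = (tn + tp) / n  # plain accuracy
    # chance agreement: sum over classes of (actual frequency * predicted frequency)
    expected = ((tn + fp) * (tn + fn) + (fn + tp) * (fp + tp)) / n**2
    return (observed - expected) / (1 - expected)

# the confusion matrix from above: [[400 0], [19 1]]
acc = (400 + 1) / 420
kap = kappa_from_confusion(400, 0, 19, 1)
print(round(acc, 3), round(kap, 3))  # 0.955 0.091
```

Because the model almost always predicts "inactive" and the data is almost all inactive, the chance agreement is already ~0.95, so the tiny bit of extra accuracy translates into a kappa near zero.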
There are multiple strategies for improving the performance of predictive models built on unbalanced data sets. Most of them focus on making the training set more balanced, either by oversampling the active examples or by undersampling the inactive ones. There's a broad literature on this, and plenty of additional information is available online; if I end up having time I'll do a post looking at those approaches later. Here we're going to explore a different approach, which I'll call adjusting the decision threshold. I first learned about it from Dean Abbott at a workshop he held at a KNIME User Group Meeting several years ago. I've talked about this in a few venues, but this is the first time that I've been somewhat systematic and actually written things down. There is also a body of literature on the approach; since I've never formally written this up, I haven't yet done a real literature review, but this article from Provost and Fawcett looks to be one of the first publications: https://link.springer.com/article/10.1023/A:1007601015854
Most classification algorithms can provide predicted probabilities for each class; the actual class predicted for any given new instance is the one which has the highest predicted probability. For a two-class problem this means that the class with a predicted probability of more than 0.5 is the final prediction. The key idea I learned from Dean (and that you can find in the literature) is that you often get much better balanced accuracy (as measured by things like Cohen's kappa) on unbalanced datasets by shifting the decision boundary and allowing the model to predict the minority class even when the predicted probability of that class is not the highest. To make that concrete: if I'm building a model to distinguish actives from inactives based on a training set where there are far more inactives than actives, I may benefit by assigning "active" to data points where the predicted probability of being active is smaller, often considerably smaller, than 0.5. For example, if I set the decision threshold for the model that produced the confusion matrix above to 0.1, I get the following confusion matrix:
[[384 16]
[ 6 14]]
This model clearly produces far more false positives (16 instead of none), but it also finds far more of the true actives (14 instead of 1) and has a kappa of 0.533. That's not great, but it's a significant improvement over the original 0.091.
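Mechanically, applying a different decision threshold is just a comparison against the predicted probability of the positive class. A minimal numpy sketch (the probability values here are made up for illustration; with scikit-learn they would come from `predict_proba`):

```python
import numpy as np

# hypothetical predicted probabilities of the "active" class for five compounds
probs = np.array([0.08, 0.45, 0.12, 0.70, 0.03])

default_preds = (probs >= 0.5).astype(int)  # standard 0.5 decision boundary
shifted_preds = (probs >= 0.1).astype(int)  # lowered threshold: more "active" calls

print(default_preds)  # [0 0 0 1 0]
print(shifted_preds)  # [0 1 1 1 0]
```

Nothing about the model changes; we just stop using `predict()` (which applies the 0.5 cutoff for us) and threshold the probabilities ourselves.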
It's possible to imagine many different approaches for picking the "best" decision threshold for a model, as well as multiple definitions of what "best" even means. The lower the threshold is set, the more true actives are retrieved, but also the more false positives. Since I wanted to consider a large number of models here, I chose an automated approach: try a number of different decision thresholds, calculate kappa using the out-of-bag prediction probabilities for the training set, and pick the threshold that produces the best kappa value.
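The scan itself is straightforward. Here's a self-contained sketch of the selection loop; the data is made up, and the small `kappa` helper is a stand-in for `sklearn.metrics.cohen_kappa_score` so the example doesn't depend on scikit-learn:

```python
import numpy as np

def kappa(y_true, y_pred):
    # Cohen's kappa for binary labels (stand-in for sklearn.metrics.cohen_kappa_score)
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    observed = np.mean(y_true == y_pred)
    expected = sum(np.mean(y_true == c) * np.mean(y_pred == c) for c in (0, 1))
    return (observed - expected) / (1 - expected)

def pick_threshold(oob_probs, oob_true, thresholds=np.arange(0.05, 0.55, 0.05)):
    """Return the threshold giving the best kappa on the out-of-bag predictions."""
    scored = [(kappa(oob_true, oob_probs >= t), t) for t in thresholds]
    return max(scored)[1]

# toy OOB data: ten compounds, the four actives score highest
oob_probs = np.array([0.02, 0.06, 0.11, 0.14, 0.22, 0.28, 0.33, 0.41, 0.47, 0.62])
oob_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
print(round(pick_threshold(oob_probs, oob_true), 2))  # 0.3
```

Because the threshold is chosen using only the training set's out-of-bag predictions, the holdout data never leaks into the selection.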
To demonstrate why this method works, and when it doesn't work particularly well, here are two more examples taken from the ChEMBL datasets presented below; both are PubChem qHTS confirmatory assays. In the first example, assay CHEMBL1614421, the balancing approach works well: we start with a kappa of 0.033 on the holdout data, but by using the OOB predictions to pick a threshold of 0.200 we arrive at a significantly better kappa of 0.451.
assay_id CHEMBL1614421, description: PUBCHEM_BIOASSAY: qHTS for Inhibitors of Tau Fibril Formation, Thioflavin T Binding. (Class of assay: confirmatory) [Related pubchem assays: 596 ]
--------- Default -----------
ratio: 0.129 kappa: 0.033, AUC: 0.860, OOB_AUC: 0.850
[[8681 4]
[1101 22]]
--------- Balanced -----------
thresh: 0.200 kappa: 0.451
[[8261 424]
[ 595 528]]
For assay CHEMBL1614249, on the other hand, kappa starts at 0.000 (no compounds from the holdout set are classified as active) and the best the balancing procedure (again based on OOB predictions) can do is a kappa of 0.102 when the threshold is set to 0.050.
assay_id CHEMBL1614249, description: PUBCHEM_BIOASSAY: qHTS Assay for Identification of Novel General Anesthetics. In this assay, a GABAergic mimetic model system, apoferritin and a profluorescent 1-aminoanthracene ligand (1-AMA), was used to construct a competitive binding assay for identification of novel general anesthetics (Class of assay: confirmatory) [Related pubchem assays: 2385 (Probe Development Summary for Identification of Novel General Anesthetics), 2323 (Validation apoferritin assay run on SigmaAldrich LOPAC1280 collection)]
--------- Default -----------
ratio: 0.006 kappa: 0.000, AUC: 0.725, OOB_AUC: 0.741
[[8356 0]
[ 51 0]]
--------- Balanced -----------
thresh: 0.050 kappa: 0.102
[[8302 54]
[ 45 6]]
What's the difference here? Why does the balancing procedure produce reasonable results for one of the datasets and poor quality results for the other? A quite obvious difference between the two models can be seen in the ROC curves generated for the OOB predictions on the training sets. Note that these curves, which are generated solely using the predicted probabilities of being active, are not sensitive to the decision threshold.
The ROC curve for assay CHEMBL1614421 shows significantly better early enrichment than that for CHEMBL1614249, along with a better overall AUC (OOB AUC of 0.850 vs 0.741). The nicer-looking ROC curve indicates that the model for CHEMBL1614421 is actually pretty good at rank-ordering actives ahead of inactives; adjusting the decision threshold allows us to take advantage of this.
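One way to see why the ROC curve is insensitive to the decision threshold: the AUC depends only on how the model ranks actives relative to inactives, not on any cutoff. It's equivalent to the probability that a randomly chosen active gets a higher score than a randomly chosen inactive (the Mann-Whitney statistic). A small sketch with made-up scores (`auc_by_ranking` is my own helper, not a library function):

```python
def auc_by_ranking(probs, labels):
    """AUC = fraction of (active, inactive) pairs where the active scores higher,
    counting ties as half a win."""
    actives = [p for p, l in zip(probs, labels) if l == 1]
    inactives = [p for p, l in zip(probs, labels) if l == 0]
    wins = sum((a > i) + 0.5 * (a == i) for a in actives for i in inactives)
    return wins / (len(actives) * len(inactives))

probs = [0.9, 0.4, 0.35, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0]
print(round(auc_by_ranking(probs, labels), 3))  # 0.889
```

Shifting the threshold moves us along the existing curve; it can't change the curve itself. That's why a model with a good ROC curve but a bad default kappa is exactly the situation where threshold adjustment pays off.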
In order to try and determine how general the method is, I've applied the same balancing procedure to a collection of diverse datasets pulled from ChEMBL:
Here's a summary graphic plotting the balanced kappa values (obtained by shifting the decision threshold) against the original kappa values for all the data sets. The points are colored by the out-of-bag AUROC. You can see that the balancing procedure results in better kappa values, often dramatically better values, in almost all cases and that the datasets with low AUROC values tend to show less improvement.
The full details, along with the code to run the experiment, are in the jupyter notebook associated with this post; view it here. Since a couple of the datasets are a bit large, I haven't checked them all into github along with the notebook; there are links above for downloading the data. I'm also happy to provide the KNIME workflows that I used to construct the datasets.
I plan a couple more blog posts in this series. One will show how to adjust the decision threshold when making predictions within KNIME, one will look at using machine learning methods other than random forests, and I'd also like to include a comparison to the results obtained using standard resampling methods.
I think this approach should be a standard part of our toolbox when building predictive models for unbalanced datasets. What do you think?
from rdkit import Chem, DataStructs
from rdkit.Chem import Draw
from rdkit.Chem import rdFingerprintGenerator
from rdkit.Chem.Draw import IPythonConsole
from collections import defaultdict
import pandas as pd
import numpy as np
import gzip
import pickle
import matplotlib.pyplot as plt
%pylab inline
Populating the interactive namespace from numpy and matplotlib
Here's the function to build and evaluate a random forest for an assay. This won't build the best possible model for the assay - we aren't doing any parameter optimization and we always use the same fingerprints - but that's beside the point for this analysis.
import sklearn
import random
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
def run_it2(acts,inacts):
    # generate fingerprints for the actives and inactives and convert them to numpy arrays so that
    # we can learn from them. This uses the generalized fingerprinter added in the RDKit 2018.09 release
    act_data = []
    for i,fp in enumerate(rdFingerprintGenerator.GetFPs([Chem.MolFromSmiles(x) for x in acts['canonical_smiles']],fpType=rdFingerprintGenerator.MorganFP)):
        bv = np.zeros((len(fp),),np.int16)
        DataStructs.ConvertToNumpyArray(fp,bv)
        act_data.append((acts['compound_chembl_id'].iloc[i],bv,1))
    inact_data = []
    for i,fp in enumerate(rdFingerprintGenerator.GetFPs([Chem.MolFromSmiles(x) for x in inacts['canonical_smiles']],fpType=rdFingerprintGenerator.MorganFP)):
        bv = np.zeros((len(fp),),np.int16)
        DataStructs.ConvertToNumpyArray(fp,bv)
        inact_data.append((inacts['compound_chembl_id'].iloc[i],bv,0))
    # do a random, stratified 80/20 training/holdout split:
    random.seed(0xf00d)
    nAct = len(act_data)
    actis = list(range(nAct))
    nInact = len(inact_data)
    inactis = list(range(nInact))
    random.shuffle(actis)
    random.shuffle(inactis)
    train_pct = 0.80
    train = [act_data[x] for x in actis[:int(train_pct*nAct)]] + [inact_data[x] for x in inactis[:int(train_pct*nInact)]]
    test = [act_data[x] for x in actis[int(train_pct*nAct):]] + [inact_data[x] for x in inactis[int(train_pct*nInact):]]
    random.shuffle(train)
    random.shuffle(test)
    # build the random forest:
    cls = RandomForestClassifier(n_estimators=500,max_depth=15,min_samples_leaf=2,n_jobs=4,oob_score=True)
    cls.fit([y for x,y,z in train],[z for x,y,z in train])
    oob_auroc = metrics.roc_auc_score([z for x,y,z in train],[x[1] for x in cls.oob_decision_function_])
    # generate and show some evaluation stats for the model on the holdout data:
    scores = cls.predict([y for x,y,z in test])
    probs = cls.predict_proba([y for x,y,z in test])[:,1]
    auroc = metrics.roc_auc_score([z for x,y,z in test], probs)
    print('ratio: %.3f kappa: %.3f, AUC: %.3f, OOB_AUC: %.3f'%(nAct/nInact,metrics.cohen_kappa_score([z for x,y,z in test],scores),auroc,oob_auroc))
    print(metrics.confusion_matrix([z for x,y,z in test],scores))
    print(metrics.classification_report([z for x,y,z in test],scores))
    # to allow further analysis we return a tuple with:
    # - predicted probabilities of being active (class 1) for the holdout set
    # - actual values for the holdout set
    # - (actual values for the training set, OOB predicted probabilities for the training set)
    return probs,[z for x,y,z in test],([z for x,y,z in train],cls.oob_decision_function_)
def run_it_(assay_id,acts,inacts,thresholds,archive):
    print("--------- Default -----------")
    ps,tgt,oob_tpl = run_it2(acts,inacts)
    oob_tgt,oob_probs = oob_tpl
    oob_probs = [x[1] for x in oob_probs]
    auroc = metrics.roc_auc_score(oob_tgt, oob_probs)
    scores = [1 if x>=0.5 else 0 for x in ps]
    kappa = metrics.cohen_kappa_score(tgt,scores)
    confusion = metrics.confusion_matrix(tgt,scores)
    archive[assay_id].append((0.5,kappa,confusion,auroc))
    print("--------- Balanced -----------")
    if thresholds is None:
        # pick the best threshold using the OOB kappas
        threshs = np.arange(0.05,0.55,0.05)
        tscores = []
        for thresh in threshs:
            scores = [1 if x>=thresh else 0 for x in oob_probs]
            kappa = metrics.cohen_kappa_score(oob_tgt,scores)
            tscores.append((np.round(kappa,3),thresh))
        tscores.sort(reverse=True)
        thresh = tscores[0][-1]
        thresholds = [thresh]
    for thresh in thresholds:
        scores = [1 if x>=thresh else 0 for x in ps]
        kappa = metrics.cohen_kappa_score(tgt,scores)
        confusion = metrics.confusion_matrix(tgt,scores)
        archive[assay_id].append((thresh,kappa,confusion,None))
        print('thresh: %.3f kappa: %.3f'%(thresh,kappa))
        print(confusion)
This is a set of data about binding to serotonin receptors exported from ChEMBL.
with gzip.open('../data/chembl_5HT.csv.gz') as inf:
    serotonin_d = pd.read_csv(inf)
serotonin_d.head()
|   | standard_relation | standard_value | standard_type | pchembl_value | assay_chembl_id | target_chembl_id | canonical_smiles | compound_chembl_id |
|---|---|---|---|---|---|---|---|---|
| 0 | > | 10000.0 | Ki | NaN | CHEMBL615807 | CHEMBL214 | CCN(C)C1CCCc2ccncc12.OC(=O)\C=C/C(=O)O.OC(=O)\... | CHEMBL1794855 |
| 1 | = | 168.0 | Ki | 6.78 | CHEMBL615460 | CHEMBL214 | CCCN(CCC)[C@@H]1CCc2c(OC)cccc2[C@@H]1C | CHEMBL278751 |
| 2 | = | 181.0 | Ki | 6.74 | CHEMBL615809 | CHEMBL214 | C(N1CCN(CC1)c2ncccn2)c3c[nH]c(n3)c4ccccc4 | CHEMBL103772 |
| 3 | = | 3.9 | Ki | 8.41 | CHEMBL615460 | CHEMBL214 | CCCN1CC[C@@H]2[C@H]1CCc3ccc4ccoc4c23 | CHEMBL328107 |
| 4 | = | 1.6 | Ki | 8.80 | CHEMBL615756 | CHEMBL214 | COc1cccc2OC[C@H](CN3[C@@H]4CC[C@H]3C[C@@](O)(C... | CHEMBL148860 |
Get some info about those assays from the ChEMBL REST services:
import requests,json
assay_lookup = {}
tpls = sorted([(len(v),k) for k,v in serotonin_d.groupby('target_chembl_id').groups.items()],reverse=True)
for v,k in tpls:
    resp=requests.get(f'https://www.ebi.ac.uk/chembl/api/data/target/{k}?format=json')
    info = json.loads(resp.text)
    assay_lookup[k] = info['organism'],info['pref_name']
    print(k,v,info['organism'],info['pref_name'])
CHEMBL3371 3180 Homo sapiens Serotonin 6 (5-HT6) receptor
CHEMBL224 3068 Homo sapiens Serotonin 2a (5-HT2a) receptor
CHEMBL214 2672 Homo sapiens Serotonin 1a (5-HT1a) receptor
CHEMBL3155 1636 Homo sapiens Serotonin 7 (5-HT7) receptor
CHEMBL225 1580 Homo sapiens Serotonin 2c (5-HT2c) receptor
CHEMBL1833 922 Homo sapiens Serotonin 2b (5-HT2b) receptor
CHEMBL1898 542 Homo sapiens Serotonin 1b (5-HT1b) receptor
CHEMBL1983 525 Homo sapiens Serotonin 1d (5-HT1d) receptor
CHEMBL1899 491 Homo sapiens Serotonin 3a (5-HT3a) receptor
CHEMBL1875 369 Homo sapiens Serotonin 4 (5-HT4) receptor
CHEMBL3426 227 Homo sapiens Serotonin 5a (5-HT5a) receptor
CHEMBL2182 86 Homo sapiens Serotonin 1e (5-HT1e) receptor
CHEMBL1805 36 Homo sapiens Serotonin 1f (5-HT1f) receptor
with open('../data/serotonin_data.pkl','wb+') as outf:
    pickle.dump((serotonin_d,assay_lookup),outf)
with open('../data/serotonin_data.pkl','rb') as inf:
    serotonin_d,assay_lookup = pickle.load(inf)
Look at the distributions of activities for some of the assays
for v,k in tpls:
    if v>900:
        serotonin_d.loc[serotonin_d['target_chembl_id']==k].hist('pchembl_value');
def run_serotonin_assay(assay_id,d,assay_lookup,thresholds=None,archive=None):
    if archive is None:
        archive=defaultdict(list)
    # these need to be converted into a classification problem.
    # start with a pretty strict cutoff for active/inactive:
    assay = d.loc[d['target_chembl_id']==assay_id]
    acts = assay.loc[assay['pchembl_value']>9.0]
    if len(acts)>=50:
        inacts = assay.loc[assay['pchembl_value']<8.5]
    else:
        # but relax that if we don't end up with a reasonable number of actives:
        acts = assay.loc[assay['pchembl_value']>8.0]
        inacts = assay.loc[assay['pchembl_value']<7.5]
    print('*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*')
    print(f'assay_id {assay_id}, organism: {assay_lookup[assay_id][0]}, target: {assay_lookup[assay_id][1]}')
    run_it_(assay_id,acts,inacts,thresholds,archive)
    print('*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*\n\n')
serotonin_archive = defaultdict(list)
for v,k in tpls:
    if v>900:
        run_serotonin_assay(k,serotonin_d,assay_lookup,archive=serotonin_archive)
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
assay_id CHEMBL3371, organism: Homo sapiens, target: Serotonin 6 (5-HT6) receptor
--------- Default -----------
ratio: 0.189 kappa: 0.654, AUC: 0.968, OOB_AUC: 0.949
[[418  10]
 [ 32  49]]
              precision    recall  f1-score   support

           0       0.93      0.98      0.95       428
           1       0.83      0.60      0.70        81

   micro avg       0.92      0.92      0.92       509
   macro avg       0.88      0.79      0.83       509
weighted avg       0.91      0.92      0.91       509

--------- Balanced -----------
thresh: 0.350 kappa: 0.803
[[414  14]
 [ 13  68]]
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*

*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
assay_id CHEMBL224, organism: Homo sapiens, target: Serotonin 2a (5-HT2a) receptor
--------- Default -----------
ratio: 0.085 kappa: 0.714, AUC: 0.988, OOB_AUC: 0.958
[[494   2]
 [ 17  26]]
              precision    recall  f1-score   support

           0       0.97      1.00      0.98       496
           1       0.93      0.60      0.73        43

   micro avg       0.96      0.96      0.96       539
   macro avg       0.95      0.80      0.86       539
weighted avg       0.96      0.96      0.96       539

--------- Balanced -----------
thresh: 0.400 kappa: 0.692
[[492   4]
 [ 17  26]]
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*

*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
assay_id CHEMBL214, organism: Homo sapiens, target: Serotonin 1a (5-HT1a) receptor
--------- Default -----------
ratio: 0.108 kappa: 0.471, AUC: 0.951, OOB_AUC: 0.945
[[393   2]
 [ 28  15]]
              precision    recall  f1-score   support

           0       0.93      0.99      0.96       395
           1       0.88      0.35      0.50        43

   micro avg       0.93      0.93      0.93       438
   macro avg       0.91      0.67      0.73       438
weighted avg       0.93      0.93      0.92       438

--------- Balanced -----------
thresh: 0.300 kappa: 0.672
[[381  14]
 [ 12  31]]
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*

*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
assay_id CHEMBL3155, organism: Homo sapiens, target: Serotonin 7 (5-HT7) receptor
--------- Default -----------
ratio: 0.053 kappa: 0.408, AUC: 0.953, OOB_AUC: 0.918
[[270   0]
 [ 11   4]]
              precision    recall  f1-score   support

           0       0.96      1.00      0.98       270
           1       1.00      0.27      0.42        15

   micro avg       0.96      0.96      0.96       285
   macro avg       0.98      0.63      0.70       285
weighted avg       0.96      0.96      0.95       285

--------- Balanced -----------
thresh: 0.350 kappa: 0.384
[[269   1]
 [ 11   4]]
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*

*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
assay_id CHEMBL225, organism: Homo sapiens, target: Serotonin 2c (5-HT2c) receptor
--------- Default -----------
ratio: 0.288 kappa: 0.564, AUC: 0.914, OOB_AUC: 0.912
[[182   4]
 [ 27  27]]
              precision    recall  f1-score   support

           0       0.87      0.98      0.92       186
           1       0.87      0.50      0.64        54

   micro avg       0.87      0.87      0.87       240
   macro avg       0.87      0.74      0.78       240
weighted avg       0.87      0.87      0.86       240

--------- Balanced -----------
thresh: 0.250 kappa: 0.671
[[161  25]
 [  6  48]]
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*

*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
assay_id CHEMBL1833, organism: Homo sapiens, target: Serotonin 2b (5-HT2b) receptor
--------- Default -----------
ratio: 0.168 kappa: 0.545, AUC: 0.965, OOB_AUC: 0.893
[[132   1]
 [ 13  10]]
              precision    recall  f1-score   support

           0       0.91      0.99      0.95       133
           1       0.91      0.43      0.59        23

   micro avg       0.91      0.91      0.91       156
   macro avg       0.91      0.71      0.77       156
weighted avg       0.91      0.91      0.90       156

--------- Balanced -----------
thresh: 0.250 kappa: 0.692
[[130   3]
 [  8  15]]
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
Now we'll work with the "Dataset 1" datasets from the benchmarking set. These are collections of diverse actives from various ChEMBL targets together with assumed inactives pulled from an older version of ZINC.
import glob
datasets = glob.glob('../../Code/benchmarking_platform/compounds/ChEMBL/cmp_list*_actives.dat.gz')
print(len(datasets))
random.seed(0xf00d)
random.shuffle(datasets)
keep = datasets[:20]
keep[:5]
80
['../../Code/benchmarking_platform/compounds/ChEMBL/cmp_list_ChEMBL_10198_actives.dat.gz', '../../Code/benchmarking_platform/compounds/ChEMBL/cmp_list_ChEMBL_10980_actives.dat.gz', '../../Code/benchmarking_platform/compounds/ChEMBL/cmp_list_ChEMBL_11279_actives.dat.gz', '../../Code/benchmarking_platform/compounds/ChEMBL/cmp_list_ChEMBL_87_actives.dat.gz', '../../Code/benchmarking_platform/compounds/ChEMBL/cmp_list_ChEMBL_12252_actives.dat.gz']
import re
with gzip.open('../../Code/benchmarking_platform/compounds/ChEMBL/cmp_list_ChEMBL_zinc_decoys.dat.gz') as inf:
    inactive_df = pd.read_csv(inf,sep='\t')
chembl_active_sets = {}
for fn in datasets:
    nm = re.search(r'cmp_list_(ChEMBL_[0-9]*)_actives',fn).groups()[0]
    with gzip.open(fn) as inf:
        chembl_active_sets[nm] = pd.read_csv(inf,sep='\t')
def run_dataset1_assay(assay_id,chembl_active_sets,inactive_df,factor=20,thresholds=None,archive=None):
    if archive is None:
        archive=defaultdict(list)
    assay = chembl_active_sets[assay_id]
    acts = assay.rename(index=str,columns={'SMILES':'canonical_smiles','# _Name':'compound_chembl_id'})
    inacts = inactive_df.sample(n=factor*len(acts),random_state=hash(assay_id)%0xf00d).rename(index=str,columns={'SMILES':'canonical_smiles','# _Name':'compound_chembl_id'})
    print('*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*')
    print(f'assay_id {assay_id}')
    run_it_(assay_id,acts,inacts,thresholds,archive)
    print('*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*\n\n')
ds1_archive = defaultdict(list)
for k in chembl_active_sets:
    run_dataset1_assay(k,chembl_active_sets,inactive_df,archive=ds1_archive)
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* assay_id ChEMBL_10198 --------- Default ----------- ratio: 0.050 kappa: 0.700, AUC: 1.000, OOB_AUC: 0.981 [[400 0] [ 9 11]] precision recall f1-score support 0 0.98 1.00 0.99 400 1 1.00 0.55 0.71 20 micro avg 0.98 0.98 0.98 420 macro avg 0.99 0.78 0.85 420 weighted avg 0.98 0.98 0.98 420 --------- Balanced ----------- thresh: 0.150 kappa: 0.945 [[400 0] [ 2 18]] *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* assay_id ChEMBL_10980 --------- Default ----------- ratio: 0.050 kappa: 0.000, AUC: 0.994, OOB_AUC: 0.966 [[400 0] [ 20 0]] precision recall f1-score support 0 0.95 1.00 0.98 400 1 0.00 0.00 0.00 20 micro avg 0.95 0.95 0.95 420 macro avg 0.48 0.50 0.49 420 weighted avg 0.91 0.95 0.93 420 --------- Balanced ----------- thresh: 0.150 kappa: 0.802 [[398 2] [ 5 15]] *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* assay_id ChEMBL_11279 --------- Default -----------
/other_linux/home/glandrum/anaconda3/envs/py37_dash/lib/python3.7/site-packages/sklearn/metrics/classification.py:1143: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. 'precision', 'predicted', average, warn_for)
ratio: 0.050 kappa: 0.741, AUC: 0.965, OOB_AUC: 0.977 [[400 0] [ 8 12]] precision recall f1-score support 0 0.98 1.00 0.99 400 1 1.00 0.60 0.75 20 micro avg 0.98 0.98 0.98 420 macro avg 0.99 0.80 0.87 420 weighted avg 0.98 0.98 0.98 420 --------- Balanced ----------- thresh: 0.150 kappa: 0.973 [[400 0] [ 1 19]] *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* assay_id ChEMBL_87 --------- Default ----------- ratio: 0.050 kappa: 0.000, AUC: 0.908, OOB_AUC: 0.910 [[400 0] [ 20 0]] precision recall f1-score support 0 0.95 1.00 0.98 400 1 0.00 0.00 0.00 20 micro avg 0.95 0.95 0.95 420 macro avg 0.48 0.50 0.49 420 weighted avg 0.91 0.95 0.93 420 --------- Balanced ----------- thresh: 0.150 kappa: 0.307 [[399 1] [ 16 4]] *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* assay_id ChEMBL_12252 --------- Default -----------
ratio: 0.050 kappa: 0.252, AUC: 0.983, OOB_AUC: 0.978 [[400 0] [ 17 3]] precision recall f1-score support 0 0.96 1.00 0.98 400 1 1.00 0.15 0.26 20 micro avg 0.96 0.96 0.96 420 macro avg 0.98 0.57 0.62 420 weighted avg 0.96 0.96 0.94 420 --------- Balanced ----------- thresh: 0.150 kappa: 0.802 [[398 2] [ 5 15]] *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* assay_id ChEMBL_10280 --------- Default ----------- ratio: 0.050 kappa: 0.323, AUC: 0.986, OOB_AUC: 0.976 [[400 0] [ 16 4]] precision recall f1-score support 0 0.96 1.00 0.98 400 1 1.00 0.20 0.33 20 micro avg 0.96 0.96 0.96 420 macro avg 0.98 0.60 0.66 420 weighted avg 0.96 0.96 0.95 420 --------- Balanced ----------- thresh: 0.150 kappa: 0.843 [[397 3] [ 3 17]] *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* assay_id ChEMBL_17045 --------- Default ----------- ratio: 0.050 kappa: 0.000, AUC: 0.913, OOB_AUC: 0.886 [[400 0] [ 20 0]] precision recall f1-score support 0 0.95 1.00 0.98 400 1 0.00 0.00 0.00 20 micro avg 0.95 0.95 0.95 420 macro avg 0.48 0.50 0.49 420 weighted avg 0.91 0.95 0.93 420 --------- Balanced ----------- thresh: 0.100 kappa: 0.473 [[387 13] [ 9 11]] *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* assay_id ChEMBL_10188 --------- Default -----------
ratio: 0.050 kappa: 0.000, AUC: 0.945, OOB_AUC: 0.926 [[400 0] [ 20 0]] precision recall f1-score support 0 0.95 1.00 0.98 400 1 0.00 0.00 0.00 20 micro avg 0.95 0.95 0.95 420 macro avg 0.48 0.50 0.49 420 weighted avg 0.91 0.95 0.93 420 --------- Balanced ----------- thresh: 0.150 kappa: 0.746 [[397 3] [ 6 14]] *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* assay_id ChEMBL_10193 --------- Default -----------
ratio: 0.050 kappa: 0.388, AUC: 0.972, OOB_AUC: 0.979 [[400 0] [ 15 5]] precision recall f1-score support 0 0.96 1.00 0.98 400 1 1.00 0.25 0.40 20 micro avg 0.96 0.96 0.96 420 macro avg 0.98 0.62 0.69 420 weighted avg 0.97 0.96 0.95 420 --------- Balanced ----------- thresh: 0.200 kappa: 0.780 [[397 3] [ 5 15]] *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* assay_id ChEMBL_11365 --------- Default ----------- ratio: 0.050 kappa: 0.000, AUC: 0.945, OOB_AUC: 0.898 [[400 0] [ 20 0]] precision recall f1-score support 0 0.95 1.00 0.98 400 1 0.00 0.00 0.00 20 micro avg 0.95 0.95 0.95 420 macro avg 0.48 0.50 0.49 420 weighted avg 0.91 0.95 0.93 420 --------- Balanced ----------- thresh: 0.100 kappa: 0.505 [[379 21] [ 5 15]] *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* assay_id ChEMBL_93 --------- Default -----------
ratio: 0.050 kappa: 0.252, AUC: 0.986, OOB_AUC: 0.977 [[400 0] [ 17 3]] precision recall f1-score support 0 0.96 1.00 0.98 400 1 1.00 0.15 0.26 20 micro avg 0.96 0.96 0.96 420 macro avg 0.98 0.57 0.62 420 weighted avg 0.96 0.96 0.94 420 --------- Balanced ----------- thresh: 0.100 kappa: 0.659 [[385 15] [ 2 18]] *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* assay_id ChEMBL_10773 --------- Default ----------- ratio: 0.050 kappa: 0.741, AUC: 1.000, OOB_AUC: 0.998 [[400 0] [ 8 12]] precision recall f1-score support 0 0.98 1.00 0.99 400 1 1.00 0.60 0.75 20 micro avg 0.98 0.98 0.98 420 macro avg 0.99 0.80 0.87 420 weighted avg 0.98 0.98 0.98 420 --------- Balanced ----------- thresh: 0.150 kappa: 1.000 [[400 0] [ 0 20]] *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* assay_id ChEMBL_10417 --------- Default ----------- ratio: 0.050 kappa: 0.915, AUC: 1.000, OOB_AUC: 1.000 [[400 0] [ 3 17]] precision recall f1-score support 0 0.99 1.00 1.00 400 1 1.00 0.85 0.92 20 micro avg 0.99 0.99 0.99 420 macro avg 1.00 0.93 0.96 420 weighted avg 0.99 0.99 0.99 420 --------- Balanced ----------- thresh: 0.150 kappa: 1.000 [[400 0] [ 0 20]] *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* assay_id ChEMBL_36 --------- Default ----------- ratio: 0.050 kappa: 0.388, AUC: 0.993, OOB_AUC: 0.994 [[400 0] [ 15 5]] precision recall f1-score support 0 0.96 1.00 0.98 400 1 1.00 0.25 0.40 20 micro avg 0.96 0.96 0.96 420 macro avg 0.98 0.62 0.69 420 weighted avg 0.97 0.96 0.95 420 --------- Balanced ----------- thresh: 0.200 kappa: 0.802 [[398 2] [ 5 15]] *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* assay_id ChEMBL_134 --------- Default ----------- ratio: 0.050 kappa: 0.388, AUC: 1.000, OOB_AUC: 0.994 [[400 0] [ 15 5]] precision recall f1-score support 0 
The per-assay output is long, so it's summarized in the table below. Every run used an actives fraction ("ratio") of 0.050; "default kappa" is the kappa obtained with the standard 0.5 decision threshold, and "balanced kappa" is the kappa obtained at the threshold chosen for that assay. (scikit-learn emitted repeated `UndefinedMetricWarning`s — "Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples" — for models that predicted no actives at all; those warnings are omitted here. The first row is the tail of the assay reported in the preceding block.)

assay_id | default kappa | AUROC | OOB AUROC | balanced thresh | balanced kappa
---|---|---|---|---|---
(cont. from above) | | | | 0.150 | 0.950
ChEMBL_20014 | 0.559 | 0.993 | 0.994 | 0.150 | 0.895
ChEMBL_114 | 0.091 | 0.957 | 0.959 | 0.150 | 0.489
ChEMBL_12209 | 0.323 | 0.969 | 0.976 | 0.250 | 0.755
ChEMBL_25 | 0.323 | 0.984 | 0.974 | 0.200 | 0.755
ChEMBL_11359 | 0.700 | 0.991 | 0.983 | 0.350 | 0.851
ChEMBL_51 | 0.175 | 0.937 | 0.980 | 0.150 | 0.704
ChEMBL_10927 | 0.816 | 0.999 | 0.999 | 0.250 | 0.973
ChEMBL_10475 | 0.884 | 1.000 | 1.000 | 0.200 | 1.000
ChEMBL_11631 | 0.700 | 0.995 | 0.980 | 0.150 | 0.828
ChEMBL_10434 | 0.091 | 0.989 | 0.962 | 0.150 | 0.835
ChEMBL_10378 | 0.252 | 0.988 | 0.976 | 0.150 | 0.802
ChEMBL_116 | 0.700 | 0.998 | 0.998 | 0.150 | 0.948
ChEMBL_61 | 0.091 | 0.958 | 0.960 | 0.150 | 0.467
ChEMBL_12261 | 0.388 | 0.986 | 0.974 | 0.150 | 0.884
ChEMBL_11085 | 0.388 | 0.991 | 0.992 | 0.200 | 0.973
ChEMBL_20174 | 0.559 | 0.999 | 1.000 | 0.150 | 0.915
ChEMBL_19905 | 0.000 | 0.986 | 0.973 | 0.150 | 0.835
ChEMBL_11575 | 0.609 | 0.949 | 0.971 | 0.150 | 0.872
ChEMBL_11336 | 0.741 | 0.999 | 0.999 | 0.150 | 0.900
ChEMBL_52 | 0.000 | 0.960 | 0.946 | 0.150 | 0.611
ChEMBL_11536 | 0.816 | 1.000 | 0.994 | 0.150 | 0.974
ChEMBL_108 | 0.000 | 0.961 | 0.938 | 0.150 | 0.564
ChEMBL_107 | 0.000 | 0.949 | 0.950 | 0.150 | 0.633
ChEMBL_130 | 0.091 | 0.934 | 0.966 | 0.150 | 0.694
ChEMBL_100126 | 0.780 | 0.976 | 0.981 | 0.200 | 0.835
ChEMBL_11534 | 0.091 | 0.999 | 0.987 | 0.150 | 0.895
ChEMBL_15 | 0.323 | 0.976 | 0.959 | 0.200 | 0.724
ChEMBL_11265 | 0.973 | 1.000 | 0.989 | 0.200 | 0.948
ChEMBL_12911 | 0.000 | 0.946 | 0.884 | 0.100 | 0.563
ChEMBL_276 | 0.816 | 0.999 | 0.987 | 0.200 | 0.973
ChEMBL_12679 | 0.945 | 1.000 | 1.000 | 0.300 | 0.973
ChEMBL_11489 | 0.000 | 0.852 | 0.912 | 0.150 | 0.417
ChEMBL_12670 | 0.091 | 0.990 | 0.980 | 0.150 | 0.859
ChEMBL_13001 | 0.091 | 0.913 | 0.962 | 0.150 | 0.611
ChEMBL_10579 | 0.851 | 1.000 | 1.000 | 0.150 | 1.000
ChEMBL_11682 | 0.559 | 0.996 | 0.997 | 0.150 | 0.973
ChEMBL_8 | 0.323 | 0.996 | 0.979 | 0.150 | 0.812
ChEMBL_43 | 0.175 | 0.875 | 0.868 | 0.150 | 0.518
ChEMBL_90 | 0.252 | 0.986 | 0.978 | 0.200 | 0.710
ChEMBL_126 | 0.091 | 0.964 | 0.966 | 0.150 | 0.672
ChEMBL_219 | 0.000 | 0.956 | 0.980 | 0.150 | 0.591
ChEMBL_104 | 0.000 | 0.996 | 0.962 | 0.150 | 0.816
ChEMBL_121 | 0.000 | 0.969 | 0.965 | 0.150 | 0.527
ChEMBL_11488 | 0.915 | 1.000 | 0.991 | 0.250 | 0.973
ChEMBL_18061 | 0.449 | 0.972 | 0.989 | 0.200 | 0.973
ChEMBL_28 | 0.816 | 1.000 | 0.998 | 0.150 | 0.974
ChEMBL_12968 | 1.000 | 1.000 | 1.000 | 0.300 | 1.000
ChEMBL_12952 | 0.175 | 0.977 | 0.985 | 0.200 | 0.758
ChEMBL_10498 | 0.323 | 0.987 | 0.996 | 0.150 | 0.835
ChEMBL_72 | 0.323 | 0.926 | 0.969 | 0.200 | 0.791
ChEMBL_237 | 0.851 | 0.988 | 0.994 | 0.300 | 0.895
ChEMBL_259 | 0.000 | 0.968 | 0.914 | 0.150 | 0.611
ChEMBL_11140 | 0.252 | 0.981 | 0.985 | 0.150 | 0.758
ChEMBL_11442 | 0.915 | 1.000 | 0.999 | 0.200 | 1.000
ChEMBL_12840 | 0.609 | 0.997 | 0.992 | 0.150 | 0.851
ChEMBL_10752 | 0.449 | 0.995 | 0.986 | 0.150 | 0.973
ChEMBL_100579 | 0.506 | 1.000 | 0.998 | 0.150 | 0.900
ChEMBL_234 | 0.559 | 0.993 | 0.983 | 0.150 | 0.915
ChEMBL_105 | 0.449 | 1.000 | 0.992 | 0.150 | 0.973
ChEMBL_11225 | 0.609 | 1.000 | 0.988 | 0.150 | 0.919
ChEMBL_65 | 0.252 | 0.995 | 0.981 | 0.200 | 0.755
ChEMBL_165 | 0.000 | 0.910 | 0.933 | 0.100 | 0.370
ChEMBL_100166 | 0.656 | 1.000 | 0.996 | 0.200 | 0.915
ChEMBL_10260 | 0.252 | 0.991 | 0.969 | 0.150 | 0.821
ChEMBL_100 | 0.000 | 0.960 | 0.934 | 0.150 | 0.669
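The "Balanced" numbers come from scanning a set of candidate decision thresholds and keeping the one that maximizes Cohen's kappa on the out-of-bag predictions, so the hold-out data is never used for tuning. Here's a minimal, self-contained sketch of that idea on synthetic data; the variable names and the 0.05–0.50 threshold grid are illustrative, not the exact code used to produce the tables in this post:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split

# build a deliberately unbalanced toy dataset (~5% actives)
X, y = make_classification(n_samples=2000, n_informative=10, weights=[0.95],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=0)

clf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
clf.fit(X_train, y_train)

# pick the threshold that maximizes kappa on the out-of-bag predictions,
# so the test set plays no role in the tuning
oob_probs = clf.oob_decision_function_[:, 1]
candidates = np.arange(0.05, 0.55, 0.05)
best_t = max(candidates,
             key=lambda t: cohen_kappa_score(y_train, (oob_probs >= t).astype(int)))

# compare default (0.5) and tuned thresholds on the hold-out data
test_probs = clf.predict_proba(X_test)[:, 1]
kappa_default = cohen_kappa_score(y_test, clf.predict(X_test))
kappa_balanced = cohen_kappa_score(y_test, (test_probs >= best_t).astype(int))
print(f'thresh: {best_t:.2f} default kappa: {kappa_default:.3f} '
      f'balanced kappa: {kappa_balanced:.3f}')
```

Using the out-of-bag probabilities for the scan is what makes this cheap: no extra cross-validation round is needed to pick the threshold.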
Look at the distribution of AUROC values here:
plt.hist([x[0][-1] for x in ds1_archive.values()]);
# with open('../data/ds1_archive.pkl','wb+') as outf:
#     pickle.dump(ds1_archive,outf)
with gzip.open('../data/ChEMBL_PubChem_HTS.csv.gz') as inf:
    pubchem_d = pd.read_csv(inf)
pubchem_d.head()
| | canonical_smiles | compound_chembl_id | assay_chembl_id | standard_relation | Mean(standard_value) | activity_comment |
|---|---|---|---|---|---|---|
| 0 | Br.Br.C(c1ccncc1)c2cnc[nH]2 | CHEMBL1316355 | CHEMBL1614421 | = | 44668.4 | Inconclusive |
| 1 | Br.Br.NCCSC(=N)N | CHEMBL1256182 | CHEMBL1614249 | = | 31622.8 | Not Active |
| 2 | Br.Br.NCCSC(=N)N | CHEMBL1256182 | CHEMBL1614364 | = | 446.7 | Not Active |
| 3 | Br.Br.NCCSC(=N)N | CHEMBL1256182 | CHEMBL1614421 | = | 17782.8 | Inconclusive |
| 4 | Br.Br.NCCSC(=N)N | CHEMBL1256182 | CHEMBL1794375 | NaN | 35481.3 | active |
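Before modeling, it's worth checking just how unbalanced each of these assays actually is. Note that the `activity_comment` values use inconsistent capitalization ("active" vs "Active", etc.), so they need to be normalized before comparing. A quick sketch, using a made-up frame with the same two relevant columns as `pubchem_d`:

```python
import pandas as pd

# toy frame with the same columns used below; the rows are invented
df = pd.DataFrame({
    'assay_chembl_id': ['CHEMBL_A', 'CHEMBL_A', 'CHEMBL_A',
                        'CHEMBL_B', 'CHEMBL_B'],
    'activity_comment': ['active', 'Not Active', 'Inconclusive',
                         'Active', 'inactive'],
})

# normalize the inconsistent capitalization, then compute the
# fraction of actives per assay
is_active = df['activity_comment'].str.lower() == 'active'
frac_active = is_active.groupby(df['assay_chembl_id']).mean()
print(frac_active)
```

For the real HTS assays these fractions end up in the few-percent range, which is exactly the regime where the default 0.5 threshold starts to fail.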
import requests, json

pubchem_assay_lookup = {}
# sort the assays by the number of data points they contain, largest first
tpls = sorted([(len(v), k) for k, v in pubchem_d.groupby('assay_chembl_id').groups.items()],
              reverse=True)
for v, k in tpls:
    # look up each assay's description via the ChEMBL REST API
    resp = requests.get(f'https://www.ebi.ac.uk/chembl/api/data/assay/{k}?format=json')
    info = json.loads(resp.text)
    pubchem_assay_lookup[k] = info['description']
    print(k, v, info['description'])
CHEMBL1794375 83178 PUBCHEM_BIOASSAY: qHTS for inhibitors of binding or entry into cells for Marburg Virus. (Class of assay: confirmatory) [Related pubchem assays (depositor defined):AID463114, AID540249, AID540278]
CHEMBL1614421 49035 PUBCHEM_BIOASSAY: qHTS for Inhibitors of Tau Fibril Formation, Thioflavin T Binding. (Class of assay: confirmatory) [Related pubchem assays: 596 ]
CHEMBL1614249 42030 PUBCHEM_BIOASSAY: qHTS Assay for Identification of Novel General Anesthetics. In this assay, a GABAergic mimetic model system, apoferritin and a profluorescent 1-aminoanthracene ligand (1-AMA), was used to construct a competitive binding assay for identification of novel general anesthetics (Class of assay: confirmatory) [Related pubchem assays: 2385 (Probe Development Summary for Identification of Novel General Anesthetics), 2323 (Validation apoferritin assay run on SigmaAldrich LOPAC1280 collection)]
CHEMBL1614166 34201 PubChem BioAssay. qHTS Assay for Inhibitors of MBNL1-poly(CUG) RNA binding. (Class of assay: confirmatory)
CHEMBL1614364 11842 PUBCHEM_BIOASSAY: qHTS Assay for Inhibitors of Tyrosyl-DNA Phosphodiesterase (TDP1). (Class of assay: confirmatory)
CHEMBL1613933 7149 PUBCHEM_BIOASSAY: qHTS Assay for Compounds Blocking the Interaction Between CBF-beta and RUNX1 for the Treatment of Acute Myeloid Leukemia. (Class of assay: confirmatory) [Related pubchem assays (depositor defined):AID1484, AID504370, AID504374, AID504375]
CHEMBL3214913 4394 PubChem BioAssay. qHTS for Inhibitors of binding or entry into cells for Marburg Virus. (Class of assay: confirmatory)
CHEMBL3215169 3248 PubChem BioAssay. qHTS for Inhibitors of binding or entry into cells for Lassa Virus. (Class of assay: confirmatory)
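Those REST lookups are slow to repeat every time the notebook re-runs. A small on-disk cache is one way to avoid re-fetching descriptions that have already been retrieved. This is a sketch, not part of the original workflow; the cache filename and helper name are assumptions:

```python
import json
import os

CACHE_FILE = 'assay_descriptions.json'  # hypothetical cache location

def cached_description(assay_id, fetch):
    """Return the description for assay_id, calling fetch(assay_id)
    and updating the on-disk cache only on a cache miss."""
    cache = {}
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE) as f:
            cache = json.load(f)
    if assay_id not in cache:
        cache[assay_id] = fetch(assay_id)
        with open(CACHE_FILE, 'w') as f:
            json.dump(cache, f)
    return cache[assay_id]
```

Here `fetch` would be a small wrapper around the `requests.get` call shown above, which keeps the caching logic independent of the ChEMBL API details.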
with open('../data/pubchem_data.pkl','wb+') as outf:
    pickle.dump((pubchem_d,pubchem_assay_lookup),outf)
with open('../data/pubchem_data.pkl','rb') as inf:
    pubchem_d,pubchem_assay_lookup = pickle.load(inf)
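Before running the assays, it's worth being concrete about what "adjusting the decision threshold" means in scikit-learn terms. Here's a minimal, self-contained sketch on a synthetic imbalanced dataset (using `make_classification`, not the ChEMBL data): instead of taking the classifier's default 0.5 cutoff, we threshold `predict_proba` ourselves and watch how kappa responds. On a highly unbalanced set, lowering the threshold typically trades some false positives for a better kappa.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

# synthetic 95:5 imbalanced two-class problem
X, y = make_classification(n_samples=4000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]   # predicted probability of the "active" class

for thresh in (0.5, 0.2, 0.1):
    preds = (probs >= thresh).astype(int)  # our own cutoff instead of clf.predict()
    print(f'thresh {thresh}: kappa {metrics.cohen_kappa_score(y_test, preds):.3f}')
```

The same pattern—`predict_proba` plus a hand-picked cutoff—is what the `run_it_` helper used throughout this post does under the "Balanced" heading.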
def run_pubchem_assay(assay_id,d,assay_lookup,thresholds=None,archive=None):
    if archive is None:
        archive = defaultdict(list)
    assay = d.loc[d['assay_chembl_id']==assay_id]
    acts = pd.concat((assay.loc[assay['activity_comment'] == 'Active'],
                      assay.loc[assay['activity_comment'] == 'active']))
    inacts = pd.concat((assay.loc[assay['activity_comment'] == 'inactive'],
                        assay.loc[assay['activity_comment'] == 'inconclusive'],
                        assay.loc[assay['activity_comment'] == 'Inconclusive'],
                        assay.loc[assay['activity_comment'] == 'Not Active']))
    print('*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*')
    print(f'assay_id {assay_id}, description: {assay_lookup[assay_id]}')
    run_it_(assay_id,acts,inacts,thresholds,archive)
    print('*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*\n\n')
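The "Balanced" threshold that `run_it_` (defined earlier in the notebook) reports is the probability cutoff that maximizes kappa on the out-of-bag predictions. The idea can be sketched with a hypothetical `pick_threshold` helper (this is an illustration of the approach, not the notebook's actual implementation):

```python
import numpy as np
from sklearn import metrics

def pick_threshold(oob_tgt, oob_probs, candidates=np.arange(0.05, 0.55, 0.05)):
    """Scan candidate cutoffs and return the (threshold, kappa) pair that
    maximizes Cohen's kappa on the out-of-bag predictions."""
    best_t, best_kappa = 0.5, -1.0
    for t in candidates:
        preds = [1 if p >= t else 0 for p in oob_probs]
        kappa = metrics.cohen_kappa_score(oob_tgt, preds)
        if kappa > best_kappa:
            best_t, best_kappa = t, kappa
    return best_t, best_kappa
```

Because the threshold is chosen on out-of-bag predictions rather than the hold-out set, the hold-out kappa reported under "Balanced" is still an honest estimate.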
pubchem_archive = defaultdict(list)
for k in pubchem_assay_lookup:
    run_pubchem_assay(k,pubchem_d,pubchem_assay_lookup,archive=pubchem_archive)
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
assay_id CHEMBL1794375, description: PUBCHEM_BIOASSAY: qHTS for inhibitors of binding or entry into cells for Marburg Virus. (Class of assay: confirmatory) [Related pubchem assays (depositor defined):AID463114, AID540249, AID540278]
--------- Default -----------
ratio: 0.057 kappa: 0.002, AUC: 0.704, OOB_AUC: 0.694
[[15737 1] [ 897 1]]
precision recall f1-score support
0 0.95 1.00 0.97 15738
1 0.50 0.00 0.00 898
micro avg 0.95 0.95 0.95 16636
macro avg 0.72 0.50 0.49 16636
weighted avg 0.92 0.95 0.92 16636
--------- Balanced -----------
thresh: 0.100 kappa: 0.129
[[15542 196] [ 808 90]]
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
assay_id CHEMBL1614421, description: PUBCHEM_BIOASSAY: qHTS for Inhibitors of Tau Fibril Formation, Thioflavin T Binding. (Class of assay: confirmatory) [Related pubchem assays: 596 ]
--------- Default -----------
ratio: 0.129 kappa: 0.030, AUC: 0.861, OOB_AUC: 0.848
[[8680 5] [1103 20]]
precision recall f1-score support
0 0.89 1.00 0.94 8685
1 0.80 0.02 0.03 1123
micro avg 0.89 0.89 0.89 9808
macro avg 0.84 0.51 0.49 9808
weighted avg 0.88 0.89 0.84 9808
--------- Balanced -----------
thresh: 0.200 kappa: 0.443
[[8255 430] [ 603 520]]
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
assay_id CHEMBL1614249, description: PUBCHEM_BIOASSAY: qHTS Assay for Identification of Novel General Anesthetics. In this assay, a GABAergic mimetic model system, apoferritin and a profluorescent 1-aminoanthracene ligand (1-AMA), was used to construct a competitive binding assay for identification of novel general anesthetics (Class of assay: confirmatory) [Related pubchem assays: 2385 (Probe Development Summary for Identification of Novel General Anesthetics), 2323 (Validation apoferritin assay run on SigmaAldrich LOPAC1280 collection)]
--------- Default -----------
ratio: 0.006 kappa: 0.000, AUC: 0.702, OOB_AUC: 0.728
[[8356 0] [ 51 0]]
precision recall f1-score support
0 0.99 1.00 1.00 8356
1 0.00 0.00 0.00 51
micro avg 0.99 0.99 0.99 8407
macro avg 0.50 0.50 0.50 8407
weighted avg 0.99 0.99 0.99 8407
--------- Balanced -----------
/other_linux/home/glandrum/anaconda3/envs/py37_dash/lib/python3.7/site-packages/sklearn/metrics/classification.py:1143: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. 'precision', 'predicted', average, warn_for)
thresh: 0.050 kappa: 0.099
[[8299 57] [ 45 6]]
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
assay_id CHEMBL1614166, description: PubChem BioAssay. qHTS Assay for Inhibitors of MBNL1-poly(CUG) RNA binding. (Class of assay: confirmatory)
--------- Default -----------
ratio: 0.003 kappa: 0.181, AUC: 0.822, OOB_AUC: 0.781
[[6821 0] [ 18 2]]
precision recall f1-score support
0 1.00 1.00 1.00 6821
1 1.00 0.10 0.18 20
micro avg 1.00 1.00 1.00 6841
macro avg 1.00 0.55 0.59 6841
weighted avg 1.00 1.00 1.00 6841
--------- Balanced -----------
thresh: 0.100 kappa: 0.456
[[6814 7] [ 12 8]]
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
assay_id CHEMBL1614364, description: PUBCHEM_BIOASSAY: qHTS Assay for Inhibitors of Tyrosyl-DNA Phosphodiesterase (TDP1). (Class of assay: confirmatory)
--------- Default -----------
ratio: 0.085 kappa: 0.009, AUC: 0.748, OOB_AUC: 0.763
[[2183 1] [ 184 1]]
precision recall f1-score support
0 0.92 1.00 0.96 2184
1 0.50 0.01 0.01 185
micro avg 0.92 0.92 0.92 2369
macro avg 0.71 0.50 0.49 2369
weighted avg 0.89 0.92 0.89 2369
--------- Balanced -----------
thresh: 0.150 kappa: 0.232
[[2100 84] [ 141 44]]
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
assay_id CHEMBL1613933, description: PUBCHEM_BIOASSAY: qHTS Assay for Compounds Blocking the Interaction Between CBF-beta and RUNX1 for the Treatment of Acute Myeloid Leukemia. (Class of assay: confirmatory) [Related pubchem assays (depositor defined):AID1484, AID504370, AID504374, AID504375]
--------- Default -----------
ratio: 0.000 kappa: 0.000, AUC: 1.000, OOB_AUC: 1.000
[[1430 0] [ 1 0]]
precision recall f1-score support
0 1.00 1.00 1.00 1430
1 0.00 0.00 0.00 1
micro avg 1.00 1.00 1.00 1431
macro avg 0.50 0.50 0.50 1431
weighted avg 1.00 1.00 1.00 1431
--------- Balanced -----------
thresh: 0.050 kappa: 1.000
[[1430 0] [ 0 1]]
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
assay_id CHEMBL3214913, description: PubChem BioAssay. qHTS for Inhibitors of binding or entry into cells for Marburg Virus. (Class of assay: confirmatory)
--------- Default -----------
ratio: 0.257 kappa: 0.106, AUC: 0.812, OOB_AUC: 0.799
[[695 5] [166 14]]
precision recall f1-score support
0 0.81 0.99 0.89 700
1 0.74 0.08 0.14 180
micro avg 0.81 0.81 0.81 880
macro avg 0.77 0.54 0.52 880
weighted avg 0.79 0.81 0.74 880
--------- Balanced -----------
thresh: 0.250 kappa: 0.423
[[618 82] [ 83 97]]
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
assay_id CHEMBL3215169, description: PubChem BioAssay. qHTS for Inhibitors of binding or entry into cells for Lassa Virus. (Class of assay: confirmatory)
--------- Default -----------
ratio: 0.215 kappa: 0.230, AUC: 0.745, OOB_AUC: 0.769
[[531 4] [ 96 19]]
precision recall f1-score support
0 0.85 0.99 0.91 535
1 0.83 0.17 0.28 115
micro avg 0.85 0.85 0.85 650
macro avg 0.84 0.58 0.59 650
weighted avg 0.84 0.85 0.80 650
--------- Balanced -----------
thresh: 0.250 kappa: 0.347
[[491 44] [ 69 46]]
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
aid = 'CHEMBL1614249'
assay = pubchem_d.loc[pubchem_d['assay_chembl_id']==aid]
acts = pd.concat((assay.loc[assay['activity_comment'] == 'Active'],
                  assay.loc[assay['activity_comment'] == 'active']))
inacts = pd.concat((assay.loc[assay['activity_comment'] == 'inactive'],
                    assay.loc[assay['activity_comment'] == 'inconclusive'],
                    assay.loc[assay['activity_comment'] == 'Inconclusive'],
                    assay.loc[assay['activity_comment'] == 'Not Active']))
print(len(acts),len(inacts))
ps,tgt,oob_tpl = run_it2(acts,inacts)
oob_tgt,oob_probs = oob_tpl
oob_probs = [x[1] for x in oob_probs]
auroc = metrics.roc_auc_score(oob_tgt, oob_probs)
scores = [1 if x>=0.5 else 0 for x in ps]
kappa = metrics.cohen_kappa_score(tgt,scores)
confusion = metrics.confusion_matrix(tgt,scores)
fpr,tpr,_ = metrics.roc_curve(oob_tgt,oob_probs)
plt.figure()
lw = 2
plt.plot(fpr, tpr, lw=lw)
plt.plot([0, 1], [0, 1], color='k', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title("%s: AUC=%.3f"%(aid,auroc))
plt.show()
251 41779
ratio: 0.006 kappa: 0.000, AUC: 0.733, OOB_AUC: 0.730
[[8356 0] [ 51 0]]
precision recall f1-score support
0 0.99 1.00 1.00 8356
1 0.00 0.00 0.00 51
micro avg 0.99 0.99 0.99 8407
macro avg 0.50 0.50 0.50 8407
weighted avg 0.99 0.99 0.99 8407
aid = 'CHEMBL1614421'
assay = pubchem_d.loc[pubchem_d['assay_chembl_id']==aid]
acts = pd.concat((assay.loc[assay['activity_comment'] == 'Active'],
                  assay.loc[assay['activity_comment'] == 'active']))
inacts = pd.concat((assay.loc[assay['activity_comment'] == 'inactive'],
                    assay.loc[assay['activity_comment'] == 'inconclusive'],
                    assay.loc[assay['activity_comment'] == 'Inconclusive'],
                    assay.loc[assay['activity_comment'] == 'Not Active']))
print(len(acts),len(inacts))
ps,tgt,oob_tpl = run_it2(acts,inacts)
oob_tgt,oob_probs = oob_tpl
oob_probs = [x[1] for x in oob_probs]
auroc = metrics.roc_auc_score(oob_tgt, oob_probs)
scores = [1 if x>=0.5 else 0 for x in ps]
kappa = metrics.cohen_kappa_score(tgt,scores)
confusion = metrics.confusion_matrix(tgt,scores)
fpr,tpr,_ = metrics.roc_curve(oob_tgt,oob_probs)
plt.figure()
lw = 2
plt.plot(fpr, tpr, lw=lw)
plt.plot([0, 1], [0, 1], color='k', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title("%s: AUC=%.3f"%(aid,auroc))
plt.show()
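The two cells above differ only in the assay ID, so the model-building-plus-ROC logic could be wrapped in a small helper. This is a refactoring sketch: `run_it2` is the notebook's own helper (defined earlier), passed in here as `model_fn` so the function below is self-contained; the case-insensitive comment matching reproduces the `pd.concat` filtering above.

```python
import matplotlib.pyplot as plt
from sklearn import metrics

def plot_assay_roc(aid, d, model_fn):
    """Build the model for one assay with model_fn (e.g. the notebook's run_it2),
    plot the out-of-bag ROC curve, and return the OOB AUC."""
    assay = d.loc[d['assay_chembl_id'] == aid]
    comments = assay['activity_comment'].str.lower()
    acts = assay.loc[comments == 'active']
    inacts = assay.loc[comments.isin(['inactive', 'inconclusive', 'not active'])]
    ps, tgt, (oob_tgt, oob_probs) = model_fn(acts, inacts)
    oob_probs = [x[1] for x in oob_probs]   # probability of the active class
    auroc = metrics.roc_auc_score(oob_tgt, oob_probs)
    fpr, tpr, _ = metrics.roc_curve(oob_tgt, oob_probs)
    plt.figure()
    plt.plot(fpr, tpr, lw=2)
    plt.plot([0, 1], [0, 1], color='k', lw=2, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title(f"{aid}: AUC={auroc:.3f}")
    plt.show()
    return auroc
```

With that in place the two cells collapse to calls like `plot_assay_roc('CHEMBL1614249', pubchem_d, run_it2)`.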
5614 43421
ratio: 0.129 kappa: 0.032, AUC: 0.859, OOB_AUC: 0.850
[[8680 5] [1102 21]]
precision recall f1-score support
0 0.89 1.00 0.94 8685
1 0.81 0.02 0.04 1123
micro avg 0.89 0.89 0.89 9808
macro avg 0.85 0.51 0.49 9808
weighted avg 0.88 0.89 0.84 9808
with gzip.open('../data/ChEMBL_DrugMatrix.csv.gz') as inf:
    drugmatrix_d = pd.read_csv(inf)
drugmatrix_d.head()
| | canonical_smiles | compound_chembl_id | assay_chembl_id | standard_relation | Mean(standard_value) | activity_comment |
|---|---|---|---|---|---|---|
| 0 | Br.CN1[C@@H]2CC[C@H]1C[C@H](C2)OC(=O)C(O)c3ccccc3 | CHEMBL1319362 | CHEMBL1909097 | NaN | NaN | Not Active (inhibition < 50% @ 10 uM and thus ... |
| 1 | Br.CN1[C@@H]2CC[C@H]1C[C@H](C2)OC(=O)C(O)c3ccccc3 | CHEMBL1319362 | CHEMBL1909212 | NaN | NaN | Not Active (inhibition < 50% @ 10 uM and thus ... |
| 2 | Br.CN1[C@@H]2CC[C@H]1C[C@H](C2)OC(=O)C(O)c3ccccc3 | CHEMBL1319362 | CHEMBL1909213 | NaN | NaN | Not Active (inhibition < 50% @ 10 uM and thus ... |
| 3 | Br.CN1[C@@H]2CC[C@H]1C[C@H](C2)OC(=O)C(O)c3ccccc3 | CHEMBL1319362 | CHEMBL1909214 | NaN | NaN | Not Active (inhibition < 50% @ 10 uM and thus ... |
| 4 | Br.CN1[C@@H]2CC[C@H]1C[C@H](C2)OC(=O)C(O)c3ccccc3 | CHEMBL1319362 | CHEMBL1909215 | NaN | NaN | Not Active (inhibition < 50% @ 10 uM and thus ... |
import requests,json
drugmatrix_assay_lookup = {}
tpls = sorted([(len(v),k) for k,v in drugmatrix_d.groupby('assay_chembl_id').groups.items()],reverse=True)
for v,k in tpls:
    # we will only keep the subset of assays that have enough actives
    assay = drugmatrix_d.loc[drugmatrix_d['assay_chembl_id']==k]
    # skip the assay if it doesn't have at least 40 actives
    if len([1 for y in assay['activity_comment'] if y.find('Active')==0]) < 40:
        #print(f'Skipping assay: {k}')
        continue
    resp = requests.get(f'https://www.ebi.ac.uk/chembl/api/data/assay/{k}?format=json')
    info = json.loads(resp.text)
    drugmatrix_assay_lookup[k] = info['description']
    print(k,v,info['description'])
    # drugmatrix_assay_lookup[k] = k
    # print(k,v)
CHEMBL1909215 842 DRUGMATRIX: Adenosine A3 radioligand binding (ligand: AB-MECA)
CHEMBL1909211 842 DRUGMATRIX: Serotonin (5-Hydroxytryptamine) 5-HT2A radioligand binding (ligand: [3H] Ketanserin)
CHEMBL1909210 842 DRUGMATRIX: Serotonin (5-Hydroxytryptamine) 5-HT1B radioligand binding (ligand: [125I] Cyanopindolol)
CHEMBL1909209 842 DRUGMATRIX: Serotonin (5-Hydroxytryptamine) 5-HT1A radioligand binding (ligand: [3H] 8-OH-DPAT)
CHEMBL1909204 842 DRUGMATRIX: Protein Tyrosine Kinase, Fyn enzyme inhibition (substrate: Poly(Glu:Tyr))
CHEMBL1909203 842 DRUGMATRIX: Protein Tyrosine Kinase, EGF Receptor enzyme inhibition (substrate: Poly(Glu:Tyr))
CHEMBL1909191 842 DRUGMATRIX: Progesterone radioligand binding (ligand: [3H] R-5020)
CHEMBL1909174 842 DRUGMATRIX: Muscarinic M5 radioligand binding (ligand: [3H] N-Methylscopolamine)
CHEMBL1909173 842 DRUGMATRIX: Muscarinic M4 radioligand binding (ligand: [3H] N-Methylscopolamine)
CHEMBL1909172 842 DRUGMATRIX: Muscarinic M3 radioligand binding (ligand: [3H] N-Methylscopolamine)
CHEMBL1909171 842 DRUGMATRIX: Muscarinic M2 radioligand binding (ligand: [3H] N-Methylscopolamine)
CHEMBL1909170 842 DRUGMATRIX: Muscarinic M1 radioligand binding (ligand: [3H] N-Methylscopolamine)
CHEMBL1909165 842 DRUGMATRIX: Lipoxygenase 15-LO enzyme inhibition (substrate: Linoleic acid)
CHEMBL1909159 842 DRUGMATRIX: Imidazoline I2, Central radioligand binding (ligand: [3H] Idazoxan)
CHEMBL1909157 842 DRUGMATRIX: Histamine H2 radioligand binding (ligand: [125I] Aminopotentidine)
CHEMBL1909156 842 DRUGMATRIX: Histamine H1, Central radioligand binding (ligand: [3H] Pyrilamine)
CHEMBL1909150 842 DRUGMATRIX: Glucocorticoid radioligand binding (ligand: [3H] Dexamethasone)
CHEMBL1909143 842 DRUGMATRIX: Dopamine Transporter radioligand binding (ligand: [125I] RTI-55)
CHEMBL1909141 842 DRUGMATRIX: Dopamine D3 radioligand binding (ligand: [3H] Spiperone)
CHEMBL1909140 842 DRUGMATRIX: Dopamine D2L radioligand binding (ligand: [3H] Spiperone)
CHEMBL1909139 842 DRUGMATRIX: Dopamine D1 radioligand binding (ligand: [3H] SCH-23390)
CHEMBL1909136 842 DRUGMATRIX: CYP450, 2D6 enzyme inhibition (substrate: 3-Cyano-7-ethoxycoumarin)
CHEMBL1909135 842 DRUGMATRIX: CYP450, 2C9 enzyme inhibition (substrate: 3-Cyano-7-ethoxycoumarin)
CHEMBL1909134 842 DRUGMATRIX: CYP450, 2C19 enzyme inhibition (substrate: 3-Cyano-7-ethoxycoumarin)
CHEMBL1909132 842 DRUGMATRIX: CYP450, 1A2 enzyme inhibition (substrate: 3-Cyano-7-ethoxycoumarin)
CHEMBL1909130 842 DRUGMATRIX: Cyclooxygenase COX-1 enzyme inhibition (substrate: Arachidonic acid)
CHEMBL1909121 842 DRUGMATRIX: Calcium Channel Type L, Phenylalkylamine radioligand binding (ligand: [3H] (-)-Desmethoxyverapamil (D-888))
CHEMBL1909115 842 DRUGMATRIX: Androgen (Testosterone) AR radioligand binding (ligand: [3H] Mibolerone)
CHEMBL1909114 842 DRUGMATRIX: Tachykinin NK2 radioligand binding (ligand: [3H] SR-48968)
CHEMBL1909112 842 DRUGMATRIX: Sodium Channel, Site 2 radioligand binding (ligand: [3H] Batrachotoxin)
CHEMBL1909111 842 DRUGMATRIX: Sigma2 radioligand binding (ligand: [3H] Ifenprodil)
CHEMBL1909110 842 DRUGMATRIX: Sigma1 radioligand binding (ligand: [3H] Haloperidol)
CHEMBL1909109 842 DRUGMATRIX: Transporter, Serotonin (5-Hydroxytryptamine) (SERT) radioligand binding (ligand: [3H] Paroxetine)
CHEMBL1909108 842 DRUGMATRIX: Serotonin (5-Hydroxytryptamine) 5-HT6 radioligand binding (ligand: [3H] Lysergic acid diethylamide)
CHEMBL1909105 842 DRUGMATRIX: Serotonin (5-Hydroxytryptamine) 5-HT2C radioligand binding (ligand: [3H] Mesulergine)
CHEMBL1909104 842 DRUGMATRIX: Serotonin (5-Hydroxytryptamine) 5-HT2B radioligand binding (ligand: [3H] Lysergic acid diethylamide)
CHEMBL1909102 842 DRUGMATRIX: Calcium Channel Type L, Benzothiazepine radioligand binding (ligand: [3H] Diltiazem)
CHEMBL1909094 842 DRUGMATRIX: Norepinephrine Transporter radioligand binding (ligand: [125I] RTI-55)
CHEMBL1909090 842 DRUGMATRIX: Adrenergic Alpha-2C radioligand binding (ligand: [3H] MK-912)
CHEMBL1909089 842 DRUGMATRIX: Alpha-2B adrenergic receptor radioligand binding (ligand: Rauwolscine)
CHEMBL1909088 842 DRUGMATRIX: Alpha-2A adrenergic receptor radioligand binding (ligand: MK-912)
CHEMBL1909087 842 DRUGMATRIX: Alpha-1D adrenergic receptor radioligand binding (ligand: prazosin)
CHEMBL1909086 842 DRUGMATRIX: Alpha-1B adrenergic receptor radioligand binding (ligand: prazosin)
CHEMBL1909085 842 DRUGMATRIX: Alpha-1A adrenergic receptor radioligand binding (ligand: prazosin)
with open('../data/drugmatrix_data.pkl','wb+') as outf:
    pickle.dump((drugmatrix_d,drugmatrix_assay_lookup),outf)
with open('../data/drugmatrix_data.pkl','rb') as inf:
    drugmatrix_d,drugmatrix_assay_lookup = pickle.load(inf)
from collections import defaultdict
def run_drugmatrix_assay(assay_id,d,assay_lookup,thresholds=None,archive=None):
    if archive is None:
        archive = defaultdict(list)
    assay = d.loc[d['assay_chembl_id']==assay_id]
    inact_indices = [x for x,y in enumerate(assay['activity_comment']) if y.find('Not Active')==0]
    act_indices = [x for x,y in enumerate(assay['activity_comment']) if y.find('Active')==0]
    acts = assay.iloc[act_indices]
    inacts = assay.iloc[inact_indices]
    print('*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*')
    if len(act_indices) < 40:
        print(f'>>>>> SKIPPING {assay_id} DUE TO INSUFFICIENT ACTIVES <<<<<<')
        print('*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*\n\n')
        return
    print(f'assay_id {assay_id}, description: {assay_lookup[assay_id]}')
    run_it_(assay_id,acts,inacts,thresholds,archive)
    print('*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*\n\n')
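A subtlety worth noting in the function above: the DrugMatrix activity comments are long strings, so the classification is done by prefix. `y.find('Active') == 0` is equivalent to `y.startswith('Active')`, and since a "Not Active ..." comment starts with "Not", the two index lists cannot overlap. A quick check of that prefix logic (the first comment string here is made up for illustration; only the "Not Active ..." form appears in the table shown earlier):

```python
comments = [
    'Active (hypothetical example comment)',
    'Not Active (inhibition < 50% @ 10 uM and thus dose-response curve not measured)',
]
# find(...) == 0 in the function above is the same test as startswith(...)
act_idx = [i for i, c in enumerate(comments) if c.startswith('Active')]
inact_idx = [i for i, c in enumerate(comments) if c.startswith('Not Active')]
print(act_idx, inact_idx)   # → [0] [1]
```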
drugmatrix_archive = defaultdict(list)
for k in drugmatrix_assay_lookup:
    run_drugmatrix_assay(k,drugmatrix_d,drugmatrix_assay_lookup,archive=drugmatrix_archive)
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
assay_id CHEMBL1909215, description: DRUGMATRIX: Adenosine A3 radioligand binding (ligand: AB-MECA)
--------- Default -----------
ratio: 0.073 kappa: 0.000, AUC: 0.839, OOB_AUC: 0.830
[[157 0] [ 12 0]]
precision recall f1-score support
0 0.93 1.00 0.96 157
1 0.00 0.00 0.00 12
micro avg 0.93 0.93 0.93 169
macro avg 0.46 0.50 0.48 169
weighted avg 0.86 0.93 0.89 169
--------- Balanced -----------
thresh: 0.150 kappa: 0.417
[[152 5] [ 7 5]]
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
assay_id CHEMBL1909211, description: DRUGMATRIX: Serotonin (5-Hydroxytryptamine) 5-HT2A radioligand binding (ligand: [3H] Ketanserin)
--------- Default -----------
ratio: 0.127 kappa: 0.366, AUC: 0.924, OOB_AUC: 0.935
[[149 1] [ 14 5]]
precision recall f1-score support
0 0.91 0.99 0.95 150
1 0.83 0.26 0.40 19
micro avg 0.91 0.91 0.91 169
macro avg 0.87 0.63 0.68 169
weighted avg 0.91 0.91 0.89 169
--------- Balanced -----------
thresh: 0.250 kappa: 0.751
[[147 3] [ 5 14]]
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
assay_id CHEMBL1909210, description: DRUGMATRIX: Serotonin (5-Hydroxytryptamine) 5-HT1B radioligand binding (ligand: [125I] Cyanopindolol)
--------- Default -----------
ratio: 0.050 kappa: 0.000, AUC: 0.777, OOB_AUC: 0.903
[[161 0] [ 8 0]]
precision recall f1-score support
0 0.95 1.00 0.98 161
1 0.00 0.00 0.00 8
micro avg 0.95 0.95 0.95 169
macro avg 0.48 0.50 0.49 169
weighted avg 0.91 0.95 0.93 169
--------- Balanced -----------
thresh: 0.200 kappa: 0.282
[[158 3] [ 6 2]]
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
assay_id CHEMBL1909209, description: DRUGMATRIX: Serotonin (5-Hydroxytryptamine) 5-HT1A radioligand binding (ligand: [3H] 8-OH-DPAT)
--------- Default -----------
ratio: 0.089 kappa: 0.308, AUC: 0.914, OOB_AUC: 0.921
[[154 1] [ 11 3]]
precision recall f1-score support
0 0.93 0.99 0.96 155
1 0.75 0.21 0.33 14
micro avg 0.93 0.93 0.93 169
macro avg 0.84 0.60 0.65 169
weighted avg 0.92 0.93 0.91 169
--------- Balanced -----------
thresh: 0.200 kappa: 0.611
[[150 5] [ 5 9]]
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
assay_id CHEMBL1909204, description: DRUGMATRIX: Protein Tyrosine Kinase, Fyn enzyme inhibition (substrate: Poly(Glu:Tyr))
--------- Default -----------
ratio: 0.066 kappa: 0.000, AUC: 0.783, OOB_AUC: 0.891
[[158 0] [ 11 0]]
precision recall f1-score support
0 0.93 1.00 0.97 158
1 0.00 0.00 0.00 11
micro avg 0.93 0.93 0.93 169
macro avg 0.47 0.50 0.48 169
weighted avg 0.87 0.93 0.90 169
--------- Balanced -----------
thresh: 0.100 kappa: 0.385
[[144 14] [ 4 7]]
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
assay_id CHEMBL1909203, description: DRUGMATRIX: Protein Tyrosine Kinase, EGF Receptor enzyme inhibition (substrate: Poly(Glu:Tyr))
--------- Default -----------
ratio: 0.050 kappa: 0.000, AUC: 0.917, OOB_AUC: 0.865
[[161 0] [ 8 0]]
precision recall f1-score support
0 0.95 1.00 0.98 161
1 0.00 0.00 0.00 8
micro avg 0.95 0.95 0.95 169
macro avg 0.48 0.50 0.49 169
weighted avg 0.91 0.95 0.93 169
--------- Balanced -----------
thresh: 0.100 kappa: 0.288
[[149 12] [ 4 4]]
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
assay_id CHEMBL1909191, description: DRUGMATRIX: Progesterone radioligand binding (ligand: [3H] R-5020)
--------- Default -----------
ratio: 0.060 kappa: 0.556, AUC: 0.975, OOB_AUC: 0.936
[[159 0] [ 6 4]]
precision recall f1-score support
0 0.96 1.00 0.98 159
1 1.00 0.40 0.57 10
micro avg 0.96 0.96 0.96 169
macro avg 0.98 0.70 0.78 169
weighted avg 0.97 0.96 0.96 169
--------- Balanced -----------
thresh: 0.250 kappa: 0.883
[[159 0] [ 2 8]]
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
assay_id CHEMBL1909174, description: DRUGMATRIX: Muscarinic M5 radioligand binding (ligand: [3H] N-Methylscopolamine)
--------- Default -----------
ratio: 0.084 kappa: 0.228, AUC: 0.946, OOB_AUC: 0.918
[[155 1] [ 11 2]]
precision recall f1-score support
0 0.93 0.99 0.96 156
1 0.67 0.15 0.25 13
micro avg 0.93 0.93 0.93 169
macro avg 0.80 0.57 0.61 169
weighted avg 0.91 0.93 0.91 169
--------- Balanced -----------
thresh: 0.200 kappa: 0.487
[[152 4] [ 7 6]]
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
assay_id CHEMBL1909173, description: DRUGMATRIX: Muscarinic M4 radioligand binding (ligand: [3H] N-Methylscopolamine)
--------- Default -----------
ratio: 0.082 kappa: -0.011, AUC: 0.915, OOB_AUC: 0.918
[[155 1] [ 13 0]]
precision recall f1-score support
0 0.92 0.99 0.96 156
1 0.00 0.00 0.00 13
micro avg 0.92 0.92 0.92 169
macro avg 0.46 0.50 0.48 169
weighted avg 0.85 0.92 0.88 169
--------- Balanced -----------
thresh: 0.250 kappa: 0.304
[[154 2] [ 10 3]]
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
assay_id CHEMBL1909172, description: DRUGMATRIX: Muscarinic M3 radioligand binding (ligand: [3H] N-Methylscopolamine)
--------- Default -----------
ratio: 0.082 kappa: 0.000, AUC: 0.916, OOB_AUC: 0.906
[[156 0] [ 13 0]]
precision recall f1-score support
0 0.92 1.00 0.96 156
1 0.00 0.00 0.00 13
micro avg 0.92 0.92 0.92 169
macro avg 0.46 0.50 0.48 169
weighted avg 0.85 0.92 0.89 169
--------- Balanced -----------
thresh: 0.200 kappa: 0.301
[[150 6] [ 9 4]]
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
assay_id CHEMBL1909171, description: DRUGMATRIX: Muscarinic M2 radioligand binding (ligand: [3H] N-Methylscopolamine)
--------- Default -----------
ratio: 0.078 kappa: 0.115, AUC: 0.929, OOB_AUC: 0.896
[[156 1] [ 12 1]]
precision recall f1-score support
0 0.93 0.99 0.96 157
1 0.50 0.08 0.13 13
micro avg 0.92 0.92 0.92 170
macro avg 0.71 0.54 0.55 170
weighted avg 0.90 0.92 0.90 170
--------- Balanced -----------
thresh: 0.200 kappa: 0.500
[[151 6] [ 6 7]]
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
assay_id CHEMBL1909170, description: DRUGMATRIX: Muscarinic M1 radioligand binding (ligand: [3H] N-Methylscopolamine)
--------- Default -----------
ratio: 0.086 kappa: 0.106, AUC: 0.947, OOB_AUC: 0.903
[[154 1] [ 13 1]]
precision recall f1-score support
0 0.92 0.99 0.96 155
1 0.50 0.07 0.12 14
micro avg 0.92 0.92 0.92 169
macro avg 0.71 0.53 0.54 169
weighted avg 0.89 0.92 0.89 169
--------- Balanced -----------
thresh: 0.200 kappa: 0.434
[[147 8] [ 7 7]]
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
assay_id CHEMBL1909165, description: DRUGMATRIX: Lipoxygenase 15-LO enzyme inhibition (substrate: Linoleic acid)
--------- Default -----------
ratio: 0.054 kappa: 0.000, AUC: 0.693, OOB_AUC: 0.808
[[160 0] [ 9 0]]
precision recall f1-score support
0 0.95 1.00 0.97 160
1 0.00 0.00 0.00 9
micro avg 0.95 0.95 0.95 169
macro avg 0.47 0.50 0.49 169
weighted avg 0.90 0.95 0.92 169
--------- Balanced -----------
thresh: 0.150 kappa: 0.296
[[154 6] [ 6 3]]
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
assay_id CHEMBL1909159, description: DRUGMATRIX: Imidazoline I2, Central radioligand binding (ligand: [3H] Idazoxan)
--------- Default -----------
ratio: 0.078 kappa: 0.133, AUC: 0.852, OOB_AUC: 0.903
[[157 0] [ 12 1]]
precision recall f1-score support
0 0.93 1.00 0.96 157
1 1.00 0.08 0.14 13
micro avg 0.93 0.93 0.93 170
macro avg 0.96 0.54 0.55 170
weighted avg 0.93 0.93 0.90 170
--------- Balanced -----------
thresh: 0.200 kappa: 0.611
[[153 4] [ 5 8]]
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
assay_id CHEMBL1909157, description: DRUGMATRIX: Histamine H2 radioligand binding (ligand: [125I] Aminopotentidine)
--------- Default -----------
ratio: 0.054 kappa: 0.000, AUC: 0.905, OOB_AUC: 0.824
[[160 0] [ 9 0]]
precision recall f1-score support
0 0.95 1.00 0.97 160
1 0.00 0.00 0.00 9
micro avg 0.95 0.95 0.95 169
macro avg 0.47 0.50 0.49 169
weighted avg 0.90 0.95 0.92 169
--------- Balanced -----------
thresh: 0.150 kappa: 0.443
[[156 4] [ 5 4]]
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
assay_id CHEMBL1909156, description: DRUGMATRIX: Histamine H1, Central radioligand binding (ligand: [3H] Pyrilamine)
--------- Default -----------
ratio: 0.062 kappa: 0.288, AUC: 0.963, OOB_AUC: 0.939
[[158 1] [ 8 2]]
precision recall f1-score support
0 0.95 0.99 0.97 159
1 0.67 0.20 0.31 10
micro avg 0.95 0.95 0.95 169
macro avg 0.81 0.60 0.64 169
weighted avg 0.93 0.95 0.93 169
--------- Balanced -----------
thresh: 0.200 kappa: 0.787
[[157 2] [ 2 8]]
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
assay_id CHEMBL1909150, description: DRUGMATRIX: Glucocorticoid radioligand binding (ligand: [3H] Dexamethasone)
--------- Default -----------
ratio: 0.059 kappa: 0.814, AUC: 0.939, OOB_AUC: 0.953
[[159 0] [ 3 7]]
precision recall f1-score support
0 0.98 1.00 0.99 159
1 1.00 0.70 0.82 10
micro avg 0.98 0.98 0.98 169
macro avg 0.99 0.85 0.91 169
weighted avg 0.98 0.98 0.98 169
--------- Balanced -----------
thresh: 0.400 kappa: 0.814
[[159 0] [ 3 7]]
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
assay_id CHEMBL1909143, description: DRUGMATRIX: Dopamine Transporter radioligand binding (ligand: [125I] RTI-55)
--------- Default -----------
ratio: 0.109 kappa: 0.000, AUC: 0.819, OOB_AUC: 0.871
[[152 0] [ 17 0]]
precision recall f1-score support
0 0.90 1.00 0.95 152
1 0.00 0.00 0.00 17
micro avg 0.90 0.90 0.90 169
macro avg 0.45 0.50 0.47 169
weighted avg 0.81 0.90 0.85 169
--------- Balanced -----------
thresh: 0.200 kappa: 0.393
[[139 13] [ 8 9]]
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
assay_id CHEMBL1909141, description: DRUGMATRIX: Dopamine D3 radioligand binding (ligand: [3H] Spiperone)
--------- Default -----------
ratio: 0.120 kappa: 0.095, AUC: 0.946, OOB_AUC: 0.896
[[151 0] [ 17 1]]
precision recall f1-score support
0 0.90 1.00 0.95 151
1 1.00 0.06 0.11 18
micro avg 0.90 0.90 0.90 169
macro avg 0.95 0.53 0.53 169
weighted avg 0.91 0.90 0.86 169
--------- Balanced -----------
thresh: 0.200 kappa: 0.713
[[147 4] [ 5 13]]
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
assay_id CHEMBL1909140, description: DRUGMATRIX: Dopamine D2L radioligand binding (ligand: [3H] Spiperone)
--------- Default -----------
ratio: 0.075 kappa: 0.000, AUC: 0.924, OOB_AUC: 0.933
[[157 0] [ 12 0]]
precision recall f1-score support
0 0.93 1.00 0.96 157
1 0.00 0.00 0.00 12
micro avg 0.93 0.93 0.93 169
macro avg 0.46 0.50 0.48 169
weighted avg 0.86 0.93 0.89 169
--------- Balanced -----------
thresh: 0.200 kappa: 0.641
[[153 4] [ 4 8]]
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
assay_id CHEMBL1909139, description: DRUGMATRIX: Dopamine D1 radioligand binding (ligand: [3H] SCH-23390)
--------- Default -----------
ratio: 0.081 kappa: -0.011, AUC: 0.917, OOB_AUC: 0.876
[[155 1] [ 13 0]]
precision recall f1-score support
0 0.92 0.99 0.96 156
1 0.00 0.00 0.00 13
micro avg 0.92 0.92 0.92 169
macro avg 0.46 0.50 0.48 169
weighted avg 0.85 0.92 0.88 169
--------- Balanced -----------
thresh: 0.200 kappa: 0.394
[[151 5] [ 8 5]]
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
assay_id CHEMBL1909136, description: DRUGMATRIX: CYP450, 2D6 enzyme inhibition (substrate: 3-Cyano-7-ethoxycoumarin)
--------- Default -----------
ratio: 0.095 kappa: 0.000, AUC: 0.809, OOB_AUC: 0.809
[[154 0] [ 15 0]]
precision recall f1-score support
0 0.91 1.00 0.95 154
1 0.00 0.00 0.00 15
micro avg 0.91 0.91 0.91 169
macro avg 0.46 0.50 0.48 169
weighted avg 0.83 0.91 0.87 169
--------- Balanced -----------
thresh: 0.150 kappa: 0.247
[[134 20] [ 8 7]]
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
assay_id CHEMBL1909135, description: DRUGMATRIX: CYP450, 2C9 enzyme inhibition (substrate: 3-Cyano-7-ethoxycoumarin)
--------- Default -----------
/other_linux/home/glandrum/anaconda3/envs/py37_dash/lib/python3.7/site-packages/sklearn/metrics/classification.py:1143: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. 'precision', 'predicted', average, warn_for)
ratio: 0.063 kappa: 0.320, AUC: 0.819, OOB_AUC: 0.778 [[159 0] [ 8 2]] precision recall f1-score support 0 0.95 1.00 0.98 159 1 1.00 0.20 0.33 10 micro avg 0.95 0.95 0.95 169 macro avg 0.98 0.60 0.65 169 weighted avg 0.95 0.95 0.94 169 --------- Balanced ----------- thresh: 0.200 kappa: 0.414 [[155 4] [ 6 4]] *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* assay_id CHEMBL1909134, description: DRUGMATRIX: CYP450, 2C19 enzyme inhibition (substrate: 3-Cyano-7-ethoxycoumarin) --------- Default ----------- ratio: 0.060 kappa: 0.000, AUC: 0.762, OOB_AUC: 0.718 [[159 0] [ 10 0]] precision recall f1-score support 0 0.94 1.00 0.97 159 1 0.00 0.00 0.00 10 micro avg 0.94 0.94 0.94 169 macro avg 0.47 0.50 0.48 169 weighted avg 0.89 0.94 0.91 169 --------- Balanced ----------- thresh: 0.300 kappa: 0.409 [[158 1] [ 7 3]] *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* assay_id CHEMBL1909132, description: DRUGMATRIX: CYP450, 1A2 enzyme inhibition (substrate: 3-Cyano-7-ethoxycoumarin) --------- Default -----------
/other_linux/home/glandrum/anaconda3/envs/py37_dash/lib/python3.7/site-packages/sklearn/metrics/classification.py:1143: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. 'precision', 'predicted', average, warn_for)
ratio: 0.050 kappa: 0.000, AUC: 0.793, OOB_AUC: 0.676 [[161 0] [ 8 0]] precision recall f1-score support 0 0.95 1.00 0.98 161 1 0.00 0.00 0.00 8 micro avg 0.95 0.95 0.95 169 macro avg 0.48 0.50 0.49 169 weighted avg 0.91 0.95 0.93 169 --------- Balanced ----------- thresh: 0.050 kappa: 0.117 [[103 58] [ 1 7]] *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* assay_id CHEMBL1909130, description: DRUGMATRIX: Cyclooxygenase COX-1 enzyme inhibition (substrate: Arachidonic acid) --------- Default -----------
/other_linux/home/glandrum/anaconda3/envs/py37_dash/lib/python3.7/site-packages/sklearn/metrics/classification.py:1143: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. 'precision', 'predicted', average, warn_for)
ratio: 0.056 kappa: -0.011, AUC: 0.894, OOB_AUC: 0.833 [[159 1] [ 9 0]] precision recall f1-score support 0 0.95 0.99 0.97 160 1 0.00 0.00 0.00 9 micro avg 0.94 0.94 0.94 169 macro avg 0.47 0.50 0.48 169 weighted avg 0.90 0.94 0.92 169 --------- Balanced ----------- thresh: 0.150 kappa: 0.406 [[158 2] [ 6 3]] *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* assay_id CHEMBL1909121, description: DRUGMATRIX: Calcium Channel Type L, Phenylalkylamine radioligand binding (ligand: [3H] (-)-Desmethoxyverapamil (D-888)) --------- Default ----------- ratio: 0.105 kappa: 0.108, AUC: 0.870, OOB_AUC: 0.912 [[153 0] [ 15 1]] precision recall f1-score support 0 0.91 1.00 0.95 153 1 1.00 0.06 0.12 16 micro avg 0.91 0.91 0.91 169 macro avg 0.96 0.53 0.54 169 weighted avg 0.92 0.91 0.87 169 --------- Balanced ----------- thresh: 0.200 kappa: 0.488 [[147 6] [ 8 8]] *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* assay_id CHEMBL1909115, description: DRUGMATRIX: Androgen (Testosterone) AR radioligand binding (ligand: [3H] Mibolerone) --------- Default ----------- ratio: 0.067 kappa: 0.158, AUC: 0.739, OOB_AUC: 0.851 [[158 0] [ 10 1]] precision recall f1-score support 0 0.94 1.00 0.97 158 1 1.00 0.09 0.17 11 micro avg 0.94 0.94 0.94 169 macro avg 0.97 0.55 0.57 169 weighted avg 0.94 0.94 0.92 169 --------- Balanced ----------- thresh: 0.300 kappa: 0.240 [[156 2] [ 9 2]] *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* assay_id CHEMBL1909114, description: DRUGMATRIX: Tachykinin NK2 radioligand binding (ligand: [3H] SR-48968) --------- Default ----------- ratio: 0.051 kappa: 0.000, AUC: 0.850, OOB_AUC: 0.906 [[161 0] [ 9 0]] precision recall f1-score support 0 0.95 1.00 0.97 161 1 0.00 0.00 0.00 9 micro avg 0.95 0.95 0.95 170 macro avg 0.47 0.50 0.49 170 weighted avg 0.90 0.95 0.92 170 --------- Balanced 
----------- thresh: 0.200 kappa: 0.345 [[157 4] [ 6 3]] *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* assay_id CHEMBL1909112, description: DRUGMATRIX: Sodium Channel, Site 2 radioligand binding (ligand: [3H] Batrachotoxin) --------- Default -----------
/other_linux/home/glandrum/anaconda3/envs/py37_dash/lib/python3.7/site-packages/sklearn/metrics/classification.py:1143: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. 'precision', 'predicted', average, warn_for)
ratio: 0.138 kappa: 0.192, AUC: 0.940, OOB_AUC: 0.868 [[146 2] [ 18 3]] precision recall f1-score support 0 0.89 0.99 0.94 148 1 0.60 0.14 0.23 21 micro avg 0.88 0.88 0.88 169 macro avg 0.75 0.56 0.58 169 weighted avg 0.85 0.88 0.85 169 --------- Balanced ----------- thresh: 0.200 kappa: 0.619 [[132 16] [ 2 19]] *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* assay_id CHEMBL1909111, description: DRUGMATRIX: Sigma2 radioligand binding (ligand: [3H] Ifenprodil) --------- Default ----------- ratio: 0.069 kappa: 0.000, AUC: 0.909, OOB_AUC: 0.919 [[158 0] [ 11 0]] precision recall f1-score support 0 0.93 1.00 0.97 158 1 0.00 0.00 0.00 11 micro avg 0.93 0.93 0.93 169 macro avg 0.47 0.50 0.48 169 weighted avg 0.87 0.93 0.90 169 --------- Balanced ----------- thresh: 0.200 kappa: 0.543 [[154 4] [ 5 6]] *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* assay_id CHEMBL1909110, description: DRUGMATRIX: Sigma1 radioligand binding (ligand: [3H] Haloperidol) --------- Default -----------
/other_linux/home/glandrum/anaconda3/envs/py37_dash/lib/python3.7/site-packages/sklearn/metrics/classification.py:1143: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. 'precision', 'predicted', average, warn_for)
ratio: 0.085 kappa: -0.011, AUC: 0.886, OOB_AUC: 0.899 [[155 1] [ 14 0]] precision recall f1-score support 0 0.92 0.99 0.95 156 1 0.00 0.00 0.00 14 micro avg 0.91 0.91 0.91 170 macro avg 0.46 0.50 0.48 170 weighted avg 0.84 0.91 0.88 170 --------- Balanced ----------- thresh: 0.150 kappa: 0.284 [[142 14] [ 8 6]] *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* assay_id CHEMBL1909109, description: DRUGMATRIX: Transporter, Serotonin (5-Hydroxytryptamine) (SERT) radioligand binding (ligand: [3H] Paroxetine) --------- Default ----------- ratio: 0.121 kappa: 0.156, AUC: 0.924, OOB_AUC: 0.938 [[150 1] [ 17 2]] precision recall f1-score support 0 0.90 0.99 0.94 151 1 0.67 0.11 0.18 19 micro avg 0.89 0.89 0.89 170 macro avg 0.78 0.55 0.56 170 weighted avg 0.87 0.89 0.86 170 --------- Balanced ----------- thresh: 0.250 kappa: 0.400 [[146 5] [ 12 7]] *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* assay_id CHEMBL1909108, description: DRUGMATRIX: Serotonin (5-Hydroxytryptamine) 5-HT6 radioligand binding (ligand: [3H] Lysergic acid diethylamide) --------- Default ----------- ratio: 0.084 kappa: 0.133, AUC: 0.893, OOB_AUC: 0.866 [[156 0] [ 12 1]] precision recall f1-score support 0 0.93 1.00 0.96 156 1 1.00 0.08 0.14 13 micro avg 0.93 0.93 0.93 169 macro avg 0.96 0.54 0.55 169 weighted avg 0.93 0.93 0.90 169 --------- Balanced ----------- thresh: 0.200 kappa: 0.500 [[150 6] [ 6 7]] *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* assay_id CHEMBL1909105, description: DRUGMATRIX: Serotonin (5-Hydroxytryptamine) 5-HT2C radioligand binding (ligand: [3H] Mesulergine) --------- Default ----------- ratio: 0.139 kappa: 0.226, AUC: 0.875, OOB_AUC: 0.902 [[148 0] [ 18 3]] precision recall f1-score support 0 0.89 1.00 0.94 148 1 1.00 0.14 0.25 21 micro avg 0.89 0.89 0.89 169 macro avg 0.95 0.57 0.60 169 weighted 
avg 0.91 0.89 0.86 169 --------- Balanced ----------- thresh: 0.250 kappa: 0.504 [[143 5] [ 11 10]] *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* assay_id CHEMBL1909104, description: DRUGMATRIX: Serotonin (5-Hydroxytryptamine) 5-HT2B radioligand binding (ligand: [3H] Lysergic acid diethylamide) --------- Default ----------- ratio: 0.169 kappa: 0.189, AUC: 0.943, OOB_AUC: 0.889 [[144 0] [ 22 3]] precision recall f1-score support 0 0.87 1.00 0.93 144 1 1.00 0.12 0.21 25 micro avg 0.87 0.87 0.87 169 macro avg 0.93 0.56 0.57 169 weighted avg 0.89 0.87 0.82 169 --------- Balanced ----------- thresh: 0.300 kappa: 0.606 [[142 2] [ 12 13]] *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* assay_id CHEMBL1909102, description: DRUGMATRIX: Calcium Channel Type L, Benzothiazepine radioligand binding (ligand: [3H] Diltiazem) --------- Default ----------- ratio: 0.101 kappa: -0.011, AUC: 0.869, OOB_AUC: 0.915 [[152 1] [ 16 0]] precision recall f1-score support 0 0.90 0.99 0.95 153 1 0.00 0.00 0.00 16 micro avg 0.90 0.90 0.90 169 macro avg 0.45 0.50 0.47 169 weighted avg 0.82 0.90 0.86 169 --------- Balanced ----------- thresh: 0.150 kappa: 0.461 [[138 15] [ 5 11]] *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* assay_id CHEMBL1909094, description: DRUGMATRIX: Norepinephrine Transporter radioligand binding (ligand: [125I] RTI-55) --------- Default ----------- ratio: 0.147 kappa: 0.119, AUC: 0.934, OOB_AUC: 0.861 [[145 2] [ 20 2]] precision recall f1-score support 0 0.88 0.99 0.93 147 1 0.50 0.09 0.15 22 micro avg 0.87 0.87 0.87 169 macro avg 0.69 0.54 0.54 169 weighted avg 0.83 0.87 0.83 169 --------- Balanced ----------- thresh: 0.250 kappa: 0.490 [[140 7] [ 11 11]] *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* assay_id CHEMBL1909090, 
description: DRUGMATRIX: Adrenergic Alpha-2C radioligand binding (ligand: [3H] MK-912) --------- Default ----------- ratio: 0.105 kappa: 0.078, AUC: 0.861, OOB_AUC: 0.904 [[151 2] [ 15 1]] precision recall f1-score support 0 0.91 0.99 0.95 153 1 0.33 0.06 0.11 16 micro avg 0.90 0.90 0.90 169 macro avg 0.62 0.52 0.53 169 weighted avg 0.86 0.90 0.87 169 --------- Balanced ----------- thresh: 0.150 kappa: 0.392 [[133 20] [ 5 11]] *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* assay_id CHEMBL1909089, description: DRUGMATRIX: Alpha-2B adrenergic receptor radioligand binding (ligand: Rauwolscine) --------- Default ----------- ratio: 0.141 kappa: 0.391, AUC: 0.873, OOB_AUC: 0.895 [[147 1] [ 15 6]] precision recall f1-score support 0 0.91 0.99 0.95 148 1 0.86 0.29 0.43 21 micro avg 0.91 0.91 0.91 169 macro avg 0.88 0.64 0.69 169 weighted avg 0.90 0.91 0.88 169 --------- Balanced ----------- thresh: 0.200 kappa: 0.497 [[136 12] [ 8 13]] *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* assay_id CHEMBL1909088, description: DRUGMATRIX: Alpha-2A adrenergic receptor radioligand binding (ligand: MK-912) --------- Default ----------- ratio: 0.150 kappa: 0.183, AUC: 0.881, OOB_AUC: 0.917 [[145 2] [ 19 3]] precision recall f1-score support 0 0.88 0.99 0.93 147 1 0.60 0.14 0.22 22 micro avg 0.88 0.88 0.88 169 macro avg 0.74 0.56 0.58 169 weighted avg 0.85 0.88 0.84 169 --------- Balanced ----------- thresh: 0.250 kappa: 0.531 [[136 11] [ 8 14]] *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* assay_id CHEMBL1909087, description: DRUGMATRIX: Alpha-1D adrenergic receptor radioligand binding (ligand: prazosin) --------- Default ----------- ratio: 0.096 kappa: 0.180, AUC: 0.878, OOB_AUC: 0.940 [[152 2] [ 13 2]] precision recall f1-score support 0 0.92 0.99 0.95 154 1 0.50 0.13 0.21 15 micro avg 0.91 0.91 0.91 
169 macro avg 0.71 0.56 0.58 169 weighted avg 0.88 0.91 0.89 169 --------- Balanced ----------- thresh: 0.200 kappa: 0.442 [[142 12] [ 6 9]] *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* assay_id CHEMBL1909086, description: DRUGMATRIX: Alpha-1B adrenergic receptor radioligand binding (ligand: prazosin) --------- Default ----------- ratio: 0.096 kappa: 0.099, AUC: 0.906, OOB_AUC: 0.926 [[153 1] [ 14 1]] precision recall f1-score support 0 0.92 0.99 0.95 154 1 0.50 0.07 0.12 15 micro avg 0.91 0.91 0.91 169 macro avg 0.71 0.53 0.54 169 weighted avg 0.88 0.91 0.88 169 --------- Balanced ----------- thresh: 0.200 kappa: 0.510 [[148 6] [ 7 8]] *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* assay_id CHEMBL1909085, description: DRUGMATRIX: Alpha-1A adrenergic receptor radioligand binding (ligand: prazosin) --------- Default ----------- ratio: 0.094 kappa: 0.099, AUC: 0.893, OOB_AUC: 0.921 [[153 1] [ 14 1]] precision recall f1-score support 0 0.92 0.99 0.95 154 1 0.50 0.07 0.12 15 micro avg 0.91 0.91 0.91 169 macro avg 0.71 0.53 0.54 169 weighted avg 0.88 0.91 0.88 169 --------- Balanced ----------- thresh: 0.250 kappa: 0.398 [[151 3] [ 10 5]] *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
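The "thresh" values in the "Balanced" blocks above come from scanning candidate decision thresholds and keeping the one that gives the best kappa. Here's a minimal sketch of that idea; this is not the code used to generate the results above. The `make_classification` dataset is a stand-in for the ChEMBL data, and picking the threshold on the hold-out set is purely for illustration (in practice the threshold should be chosen using training-set or out-of-bag predictions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import cohen_kappa_score, confusion_matrix
from sklearn.model_selection import train_test_split

# a toy unbalanced dataset: ~7% "active"
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.93],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=0)

cls = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
probs = cls.predict_proba(X_test)[:, 1]  # predicted probability of "active"

# default behavior: predict "active" when P(active) >= 0.5
print('default kappa: %.3f' % cohen_kappa_score(y_test, (probs >= 0.5).astype(int)))

# scan lower thresholds and keep the one giving the best kappa
threshs = np.arange(0.05, 0.55, 0.05)
kappas = [cohen_kappa_score(y_test, (probs >= t).astype(int)) for t in threshs]
best = threshs[int(np.argmax(kappas))]
print('thresh: %.3f kappa: %.3f' % (best, max(kappas)))
print(confusion_matrix(y_test, (probs >= best).astype(int)))
```

Lowering the threshold trades a few false positives for a large drop in false negatives, which is what drives the kappa improvements in the output above.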
Again, look at the distribution of AUROC values:
# AUROC is the last element of the "default" results tuple stored for each dataset
plt.hist([x[0][-1] for x in drugmatrix_archive.values()]);
Plot the balanced kappa value versus the original value for all of the datasets:
with open('../data/unbalanced_data_results.pkl','wb') as outf:
    pickle.dump((drugmatrix_archive,pubchem_archive,ds1_archive,serotonin_archive),outf)
allInitVs = [x[0] for x in drugmatrix_archive.values()] + \
[x[0] for x in pubchem_archive.values()] + \
[x[0] for x in ds1_archive.values()] + \
[x[0] for x in serotonin_archive.values()]
allFinalVs = [x[1] for x in drugmatrix_archive.values()] + \
[x[1] for x in pubchem_archive.values()] + \
[x[1] for x in ds1_archive.values()] + \
[x[1] for x in serotonin_archive.values()]
# each stats tuple has kappa at index 1 and AUROC as its last element
initVs = [x[1] for x in allInitVs if x[-1]>0.9]
finalVs = [y[1] for x,y in zip(allInitVs,allFinalVs) if x[-1]>0.9]
plt.scatter(initVs,finalVs,c='#27647bff',linewidth=0,label='AUROC>0.9');
initVs = [x[1] for x in allInitVs if 0.8<x[-1]<=0.9]
finalVs = [y[1] for x,y in zip(allInitVs,allFinalVs) if 0.8<x[-1]<=0.9]
plt.scatter(initVs,finalVs,c='#849fadff',linewidth=0,label='0.8<AUROC<=0.9');
initVs = [x[1] for x in allInitVs if x[-1]<=0.8]
finalVs = [y[1] for x,y in zip(allInitVs,allFinalVs) if x[-1]<=0.8]
plt.scatter(initVs,finalVs,c='#ca3542ff',linewidth=0,label='AUROC<=0.8');
plt.plot([0,1],[0,1],'k-')
plt.xlabel('kappa (unbalanced)')
plt.legend()
plt.ylabel('kappa (balanced)');