myChEMBL ADMESARfari webservice tutorial

myChEMBL team, ChEMBL Group, EMBL-EBI.

This notebook illustrates how to use the ADMESARfari webservice API from Python. Because the ADMESARfari webservices are written using Cornice (https://github.com/mozilla-services/cornice), they expose a SPORE (https://github.com/SPORE/specifications) endpoint. A Python library such as Respire (https://github.com/spiral-project/respire) can parse the JSON description of the available methods, which the SPORE endpoint (https://www.ebi.ac.uk/chembl/admesarfari/rest/spore) provides, and automatically generate callable Python methods without hand-coding the necessary boilerplate.
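To make the idea of generating callables from a method description concrete, here is a minimal sketch. It is not Respire's implementation: the `description` dictionary is a hypothetical, trimmed-down stand-in for a SPORE document, the `base_url` is an assumption, and `make_url_builder` is an illustrative helper that only builds the request URL rather than performing the request.

```python
import re

# Hypothetical, trimmed-down SPORE-style description. The two paths appear in
# the method listing of this notebook; the base_url is an assumption.
description = {
    "base_url": "http://example.org/chembl/admesarfari",
    "methods": {
        "get_taxids": {"method": "GET", "path": "/rest/taxids"},
        "get_target": {"method": "GET", "path": "/rest/target/:TARGET_ID"},
    },
}


def make_url_builder(base_url, spec):
    """Return a function that fills the ':NAME' placeholders of one method."""
    def build(**params):
        def substitute(match):
            # Replace ':NAME' with the keyword argument of the same name.
            return str(params[match.group(1)])
        path = re.sub(r":([A-Za-z_]+)", substitute, spec["path"])
        return spec["method"], base_url + path
    return build


# One builder per described method, keyed by method name.
builders = {
    name: make_url_builder(description["base_url"], spec)
    for name, spec in description["methods"].items()
}

print(builders["get_taxids"]())
print(builders["get_target"](TARGET_ID=42))
```

A real client would go on to send the request and decode the JSON response; the sketch stops at the URL so that it runs without network access.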

We will cover:

Using Respire to create an API client
Making basic GET requests to the service
Using an input compound to run a prediction
Using an input FASTA sequence to run a prediction
Presentation of results from both

In [3]:

# Let's do our imports first.
import respire, urllib, re
from IPython.display import HTML, JSON

In [6]:

# We need to monkey-patch the URL join function in this instance,
# since the default one truncates the URL due to the way ADMESARfari is hosted

def urljoin_patched(base, path):
    return base + path

respire.client.urljoin = urljoin_patched

# Create our client and associated methods
api_client = respire.client_from_url('http://wwwdev.ebi.ac.uk/chembl/admesarfari/rest/spore')
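The truncation that the patch works around can be reproduced with the standard library alone. The sketch below assumes that the service's base URL carries a path prefix (`/chembl/admesarfari`) and that the method paths start with a slash, as they do in the SPORE listing; the exact base URL Respire receives is an assumption.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2

# Assumed base URL (the service lives under a path prefix) and a method path
# as listed by the SPORE description.
base = "http://wwwdev.ebi.ac.uk/chembl/admesarfari"
path = "/rest/taxids"

standard = urljoin(base, path)  # an absolute path replaces the base's path
patched = base + path           # what urljoin_patched does

print(standard)
print(patched)
```

Under these assumptions the standard join drops the `/chembl/admesarfari` prefix, whereas plain concatenation keeps it.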

In [7]:

# What methods do we have available?
# Iterate over the parsed endpoint, pulling out the applicable methods, their paths and descriptions.
# We'll wrap the output in an HTML table.
tc = []
ts = '<table>'
te = '</table>'

for methodname in api_client.description.methods:
    method = api_client.description.methods[methodname]

    # Skip HEAD methods, which duplicate the GET entries
    if method['method'] != 'HEAD':
        tc.append("<tr><th>" + methodname + "</th></tr>")
        tc.append('<tr><td>' + method['path'] + '</td><td>' + method['description'] + '</td></tr>')

h = HTML(ts + "".join(tc) + te)
h

Out[7]:

get_textsearch
/rest/:TEXT/search	Return a set of target ids where the search term appears
get_blast
/rest/:FASTA/blast	BLAST the input sequence(s) against the ADME SARfari set of target sequences. This requires: * A URL encoded FASTA sequence (May extended to other formats via a keyword)
get_celltypes
/rest/celltypes	Retrieve the list of cell types
get_targetsequence
/rest/targetsequence/:TARGET_ID	Return the Target Sequence and Variation information for a particular Target ID
post_postsimsubsdf
/rest/simsubsdf/:VALUE	Return a set of molecules via either a similarity or substructure search Requires: * The POST body to contain CTAB or SMILES * A similarity cut-off value (100 will perform a sub-structure search) This returns a gzipped SDF file.
get_orthologuematrix
/rest/orthologuematrix/:TAX_IDS	Retrieves the orthologue mapping matrix for a specific set of Taxonomy IDs. Requires: Comma seperated list of Taxonomy IDs
post_modelpredictor2
/rest/modelpredictor2	Run the input CTAB through the ADME SARfari SciKit/RDKit Bayesian model. This requires a URL encoded CTAB
get_target
/rest/target/:TARGET_ID	Return the Target information for a particular Target ID
get_targetalignment
/rest/targetalignment/:TARGET_ID/:TAX_IDS	Return the alignment information for a particular Target ID
get_bioactivity
/rest/:MOLREGNO/bioactivity	Retrieve the bioactivity and assay data for a particular molregno. Requires: Molregno (Int) If the Molregno == -1 then it will bring back all records Returns: Datatables format JSON.
post_postblast
/rest/blast	BLAST the input sequence(s) against the ADME SARfari set of protein databases This requires: * A FASTA sequence as the POST body
get_targetcompounds
/rest/:TARGET_ID/targetcompounds	Retrieve the compound SMILES associated with an ADME SARfari target (via activity) Requires: ADME SARfari Target ID
get_expressionmatrix
/rest/expressionmatrix/:TISSUE_IDS	Retrieves the tissue target expression level matrix
get_taxids
/rest/taxids	Return the list of taxonomy IDs used.
get_alignmentdendrogram
/rest/alignmentdendrogram/:TARGET_ID/:TAX_IDS	Return the dendrogram tree information for a particular Target ID (from the relevant orthologues) Requires a target ID and a comma seperated list of tax IDs
get_targetinvivomatrix
/rest/:TARGET_ID/targetinvivomatrix	Retrieves the invivo matrix for a particular target Requires: ADME SARfari internal target id This will return an object with xcats,ycats and data elements (primarily used with the Highcharts Heatmap plugin.)
get_tissues
/rest/tissues	Retrieve the list of tissues
get_targetbioactivity
/rest/:TARGET_ID/targetbioactivity	Retrieve the bioactivity and assay data for a particular target Requires: ADME SARfari Target ID
get_modelpredictor2
/rest/:CTAB/modelpredictor2	Run the input CTAB through the ADME SARfari SciKit/RDKit Bayesian model. This requires a URL encoded CTAB
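The table-building pattern used to produce this listing recurs in later cells, so it can be factored into a helper. The following is a sketch only: `rows_to_html_table` is a hypothetical name, and escaping the cell text is an addition that the notebook's cells do not perform.

```python
try:
    from html import escape  # Python 3
except ImportError:
    from cgi import escape   # Python 2


def rows_to_html_table(rows):
    """Build an HTML table string from (header, cells) pairs."""
    parts = ["<table>"]
    for header, cells in rows:
        parts.append("<tr><th>%s</th></tr>" % escape(header))
        # Escape each cell so characters such as '<' or '&' display literally.
        parts.append("<tr>%s</tr>" % "".join("<td>%s</td>" % escape(c) for c in cells))
    parts.append("</table>")
    return "".join(parts)


table = rows_to_html_table([
    ("get_taxids", ["/rest/taxids", "Return the list of taxonomy IDs used."]),
])
print(table)
```

In the notebook the returned string would be passed to `HTML(...)` from `IPython.display` to render it.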

In [8]:

# Let's set up a few lookup dictionaries

# Taxonomy ID lookup: get the taxids from the service
taxids = api_client.get_taxids()['results']
t = {}

# Map each taxonomy ID to its name
for taxid in taxids:
    t[taxid['taxid']] = taxid['name']

taxids = t

# Get tissues
tissues = api_client.get_tissues()['results'][0]
alltissues = str(",".join(tissues.keys()))
cells = api_client.get_celltypes()['results'][0]
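The taxonomy lookup built in this cell can also be written as a dictionary comprehension. The sample below is hypothetical: only the `taxid` and `name` keys come from the cell above, and the values (including whether the service returns the IDs as integers) are illustrative assumptions.

```python
# Hypothetical sample of the 'results' list returned by get_taxids().
sample_results = [
    {"taxid": 9606, "name": "Human"},
    {"taxid": 10116, "name": "Rat"},
]

# Map each taxonomy ID to its display name in one expression.
taxid_names = {entry["taxid"]: entry["name"] for entry in sample_results}

print(taxid_names[9606])
```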

In [9]:

# Get human expression levels for all tissues (could take a while!)
expressionlevels = api_client.get_expressionmatrix(TISSUE_IDS=alltissues)['expression_matrix']
print "Levels found:", len(expressionlevels)

Levels found: 459

In [10]:

# Let's use an input compound and predict its ADME profile
# We'll use Gleevec (CHEMBL941) as our input

gleevec_ctab = """
 SciTegic12111210002D

41  0  0  0  0            999 V2000
9208   -3.0042    0.0000 C   0  0
5250   -2.6417    0.0000 N   0  0
2167   -3.0417    0.0000 C   0  0
6875   -3.0167    0.0000 C   0  0
3000   -2.6542    0.0000 N   0  0
8292   -2.6792    0.0000 N   0  0
1417   -2.9833    0.0000 C   0  0
1292   -2.0000    0.0000 N   0  0
0667   -2.6667    0.0000 C   0  0
   -1.1083   -2.6917    0.0000 N   0  0
9250   -3.7167    0.0000 N   0  0
6000   -2.6917    0.0000 C   0  0
4542   -3.0292    0.0000 C   0  0
7542   -2.6125    0.0000 C   0  0
6917   -3.7292    0.0000 C   0  0
2250   -3.7500    0.0000 O   0  0
3458   -1.5375    0.0000 N   0  0
7417   -1.6417    0.0000 C   0  0
6000   -1.9792    0.0000 C   0  0
9875   -3.0500    0.0000 C   0  0
0667   -4.0917    0.0000 C   0  0
1167   -2.7167    0.0000 C   0  0
   -0.4708   -1.6417    0.0000 C   0  0
   -0.5000   -3.0542    0.0000 C   0  0
   -1.1000   -1.9792    0.0000 C   0  0
1583   -3.6958    0.0000 C   0  0
5458   -4.0625    0.0000 C   0  0
3667   -1.9917    0.0000 C   0  0
4542   -3.7417    0.0000 C   0  0
9667   -1.6292    0.0000 C   0  0
3750   -2.7042    0.0000 C   0  0
7417   -1.9083    0.0000 C   0  0
   -1.7375   -3.0292    0.0000 C   0  0
3750   -2.9625    0.0000 C   0  0
9750   -1.8833    0.0000 C   0  0
3042   -4.0875    0.0000 C   0  0
9917   -2.5958    0.0000 C   0  0
1  1  0
6  1  0
5  1  0
1  1  0
13  1  0
2  2  0
18  1  0
4  1  0
25  1  0
1  2  0
3  1  0
9  2  0
7  1  0
4  2  0
3  2  0
32  2  0
28  1  0
12  2  0
12  1  0
15  1  0
8  1  0
8  1  0
22  1  0
23  1  0
7  1  0
11  1  0
31  1  0
21  2  0
19  1  0
20  2  0
14  1  0
10  1  0
14  2  0
37  2  0
15  1  0
34  1  0
26  2  0
13  1  0
35  1  0
30  2  0
10  1  0
M  END
"""

# The service expects the CTAB to be URL encoded
predictions = api_client.post_modelpredictor2(data=urllib.quote(gleevec_ctab))['results']

# How many ADME targets were predicted?
print len(predictions)
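The URL-encoding step matters because a CTAB is whitespace-sensitive and spans several lines. The short sketch below shows what quoting does to spaces and newlines; it uses only the closing line of a molfile as a stand-in for a full CTAB, and the import fallback is there so that it runs on both Python 2 and Python 3.

```python
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote        # Python 2

# The last line of a molfile, used here as a short stand-in for a full CTAB.
fragment = "M  END\n"

# Spaces become %20 and the newline becomes %0A.
encoded = quote(fragment)
print(encoded)
```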

In [11]:

# Let's view the predictions
tc = []
ts = '<table>'
te = '</table>'

for prediction in predictions:
    tc.append("<tr><th>" + prediction['PROTEIN_ACCESSION'] + "</th><th>" + prediction['full_name'] + "</th></tr>")
    # Fall back to 'Unknown' when the target has no function annotation
    if prediction['function'] is not None:
        pfunc = prediction['function']
    else:
        pfunc = 'Unknown'

    tc.append('<tr><td>' + taxids[prediction['taxid']] + '</td><td>' + pfunc + '</td></tr>')

h = HTML(ts + "".join(tc) + te)
h

Out[11]:

CHEMBL3356	Cytochrome P450 1A2
Human	Cytochromes P450 are a group of heme-thiolate monooxygenases. In liver microsomes, this enzyme is involved in an NADPH-dependent electron transport pathway. It oxidizes a variety of structurally unrelated compounds, including steroids, fatty acids, and xenobiotics. Most active in catalyzing 2-hydroxylation. Caffeine is metabolized primarily by cytochrome CYP1A2 in the liver through an initial N3-demethylation. Also acts in the metabolism of aflatoxin B1 and acetaminophen. Participates in the bioactivation of carcinogenic aromatic and heterocyclic amines. Catalizes the N-hydroxylation of heterocyclic amines and the O-deethylation of phenacetin.
CHEMBL5393	ATP-binding cassette sub-family G member 2
Human	Xenobiotic transporter that may play an important role in the exclusion of xenobiotics from the brain. May be involved in brain-to-blood efflux. Appears to play a major role in the multidrug resistance phenotype of several cancer cell lines. When overexpressed, the transfected cells become resistant to mitoxantrone, daunorubicin and doxorubicin, display diminished intracellular accumulation of daunorubicin, and manifest an ATP-dependent increase in the efflux of rhodamine 123.
CHEMBL340	Cytochrome P450 3A4
Human	Cytochromes P450 are a group of heme-thiolate monooxygenases. In liver microsomes, this enzyme is involved in an NADPH-dependent electron transport pathway. It performs a variety of oxidation reactions (e.g. caffeine 8-oxidation, omeprazole sulphoxidation, midazolam 1''''-hydroxylation and midazolam 4-hydroxylation) of structurally unrelated compounds, including steroids, fatty acids, and xenobiotics. Acts as a 1,8-cineole 2-exo-monooxygenase. The enzyme also hydroxylates etoposide.
CHEMBL3397	Cytochrome P450 2C9
Human	Cytochromes P450 are a group of heme-thiolate monooxygenases. In liver microsomes, this enzyme is involved in an NADPH-dependent electron transport pathway. It oxidizes a variety of structurally unrelated compounds, including steroids, fatty acids, and xenobiotics. This enzyme contributes to the wide pharmacokinetics variability of the metabolism of drugs such as S-warfarin, diclofenac, phenytoin, tolbutamide and losartan.
CHEMBL3622	Cytochrome P450 2C19
Human	Responsible for the metabolism of a number of therapeutic agents such as the anticonvulsant drug S-mephenytoin, omeprazole, proguanil, certain barbiturates, diazepam, propranolol, citalopram and imipramine.
CHEMBL3577	Retinal dehydrogenase 1
Human	Binds free retinal and cellular retinol-binding protein-bound retinal. Can convert/oxidize retinaldehyde to retinoic acid (By similarity).
CHEMBL289	Cytochrome P450 2D6
Human	Responsible for the metabolism of many drugs and environmental chemicals that it oxidizes. It is involved in the metabolism of drugs such as antiarrhythmics, adrenoceptor antagonists, and tricyclic antidepressants.
CHEMBL6035	Thioredoxin reductase 1, cytoplasmic
Rat	Unknown

In [12]:

# Now let's look at the expression levels of these targets,
# keeping only the High/Strong ones
tc = []
ts = '<table>'
te = '</table>'

for prediction in predictions:
    tc.append("<tr><th>" + prediction['PROTEIN_ACCESSION'] + "</th><th>" + prediction['full_name'] + "</th></tr>")
    # Scan every tissue and cell type for records belonging to this target
    targetexpression = []

    for humexp in expressionlevels:
        for tissue in tissues:
            percell = humexp[str(tissue)]

            for cell in cells:
                if str(cell) in percell:
                    record = percell[str(cell)]

                    if record['target_id'] == prediction['target_id']:
                        expstring = "Tissue =", record['tissue'], ", Cell =", cells[str(cell)], " Level =", record['exp_level'], " Type =", record['expression_type'], " Reliability =", record['reliability']

                        # Keep only levels that start with High or Strong
                        if re.match('High|Strong', record['exp_level']):
                            targetexpression.append(expstring)

    for exp in targetexpression:
        tc.append('<tr><td>' + " ".join(exp) + '</td></tr>')

h = HTML(ts + "".join(tc) + te)
h

Out[12]:

CHEMBL3356	Cytochrome P450 1A2
Tissue = liver , Cell = hepatocytes Level = High Type = APE Reliability = Medium
CHEMBL5393	ATP-binding cassette sub-family G member 2
Tissue = vulva/anal+skin , Cell = epidermal cells Level = Strong Type = Staining Reliability = Uncertain
Tissue = tonsil , Cell = squamous epithelial cells Level = Strong Type = Staining Reliability = Uncertain
Tissue = lung , Cell = macrophages Level = Strong Type = Staining Reliability = Uncertain
Tissue = nasopharynx , Cell = respiratory epithelial cells Level = Strong Type = Staining Reliability = Uncertain
Tissue = vagina , Cell = squamous epithelial cells Level = Strong Type = Staining Reliability = Uncertain
Tissue = uterus,+post-menopause , Cell = glandular cells Level = Strong Type = Staining Reliability = Uncertain
Tissue = stomach,+upper , Cell = glandular cells Level = Strong Type = Staining Reliability = Uncertain
Tissue = testis , Cell = cells in seminiferus ducts Level = Strong Type = Staining Reliability = Uncertain
Tissue = adrenal+gland , Cell = glandular cells Level = Strong Type = Staining Reliability = Uncertain
Tissue = bone+marrow , Cell = hematopoietic cells Level = Strong Type = Staining Reliability = Uncertain
Tissue = bronchus , Cell = respiratory epithelial cells Level = Strong Type = Staining Reliability = Uncertain
Tissue = colon , Cell = glandular cells Level = Strong Type = Staining Reliability = Uncertain
Tissue = cervix,+uterine , Cell = glandular cells Level = Strong Type = Staining Reliability = Uncertain
Tissue = spleen , Cell = cells in red pulp Level = Strong Type = Staining Reliability = Uncertain
Tissue = spleen , Cell = cells in white pulp Level = Strong Type = Staining Reliability = Uncertain
Tissue = epididymis , Cell = glandular cells Level = Strong Type = Staining Reliability = Uncertain
Tissue = esophagus , Cell = squamous epithelial cells Level = Strong Type = Staining Reliability = Uncertain
Tissue = heart+muscle , Cell = myocytes Level = Strong Type = Staining Reliability = Uncertain
Tissue = gallbladder , Cell = glandular cells Level = Strong Type = Staining Reliability = Uncertain
Tissue = seminal+vesicle , Cell = glandular cells Level = Strong Type = Staining Reliability = Uncertain
Tissue = small+intestine , Cell = glandular cells Level = Strong Type = Staining Reliability = Uncertain
Tissue = skin , Cell = keratinocytes Level = Strong Type = Staining Reliability = Uncertain
Tissue = skin , Cell = Langerhans Level = Strong Type = Staining Reliability = Uncertain
Tissue = skin , Cell = fibroblasts Level = Strong Type = Staining Reliability = Uncertain
Tissue = skin , Cell = melanocytes Level = Strong Type = Staining Reliability = Uncertain
Tissue = skeletal+muscle , Cell = myocytes Level = Strong Type = Staining Reliability = Uncertain
CHEMBL340	Cytochrome P450 3A4
Tissue = duodenum , Cell = glandular cells Level = Strong Type = Staining Reliability = Supportive
Tissue = liver , Cell = hepatocytes Level = Strong Type = Staining Reliability = Supportive
Tissue = small+intestine , Cell = glandular cells Level = Strong Type = Staining Reliability = Supportive
CHEMBL3397	Cytochrome P450 2C9
Tissue = liver , Cell = hepatocytes Level = High Type = APE Reliability = High
CHEMBL3622	Cytochrome P450 2C19
Tissue = liver , Cell = hepatocytes Level = Strong Type = Staining Reliability = Supportive
CHEMBL3577	Retinal dehydrogenase 1
CHEMBL289	Cytochrome P450 2D6
Tissue = cerebellum , Cell = cells in granular layer Level = Strong Type = Staining Reliability = Uncertain
Tissue = duodenum , Cell = glandular cells Level = Strong Type = Staining Reliability = Uncertain
Tissue = liver , Cell = hepatocytes Level = Strong Type = Staining Reliability = Uncertain
Tissue = small+intestine , Cell = glandular cells Level = Strong Type = Staining Reliability = Uncertain
CHEMBL6035	Thioredoxin reductase 1, cytoplasmic
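The nested filter that produced this listing can be expressed as a small function that walks the matrix directly. This is a sketch under an assumption about the data shape inferred from the cell's indexing: a list of dictionaries keyed by tissue ID, each holding a dictionary keyed by cell-type ID whose values are expression records. The sample data and the name `strong_expression` are hypothetical, and unlike the cell above the function visits every key present rather than only the known tissue and cell-type IDs.

```python
import re


def strong_expression(expression_levels, target_id):
    """Collect records for target_id whose level starts with High or Strong."""
    hits = []
    for per_tissue in expression_levels:      # one dict per matrix entry
        for per_cell in per_tissue.values():  # keyed by tissue ID
            for record in per_cell.values():  # keyed by cell-type ID
                if record["target_id"] != target_id:
                    continue
                # re.match anchors at the start of the string.
                if re.match(r"High|Strong", record["exp_level"]):
                    hits.append(record)
    return hits


# Hypothetical sample data in the assumed shape.
sample = [
    {"1": {"10": {"target_id": 7, "tissue": "liver", "exp_level": "High"},
           "11": {"target_id": 7, "tissue": "liver", "exp_level": "Low"}}},
    {"2": {"10": {"target_id": 8, "tissue": "lung", "exp_level": "Strong"}}},
]

hits = strong_expression(sample, 7)
print([(h["tissue"], h["exp_level"]) for h in hits])
```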

In [13]:

# Now let's see how many activity points we have per target
tc = []
ts = '<table>'
te = '</table>'

for prediction in predictions:
    try:
        activity = api_client.get_targetbioactivity(TARGET_ID=str(prediction['target_id']))['results']
        print len(activity), " activity points"
    except Exception:
        # Some targets fail to return data; report and carry on
        print "Error retrieving data points!"
    #tc.append("<tr><th>"+prediction['PROTEIN_ACCESSION']+"</th><th>"+prediction['full_name']+"</th></tr>")
    #tc.append('<tr><td>Activity points</td><td>'+str(len(activity))+'</td></tr>')

h = HTML(ts + "".join(tc) + te)
h

66105  activity points
10257  activity points
Error retrieving data points!
59288  activity points
79250  activity points
Error retrieving data points!
Error retrieving data points!
Error retrieving data points!

Out[13]:
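Since several of the calls above fail, it can help to separate the error handling from the reporting. The sketch below wraps a fetch function so that a failure yields `None` instead of interrupting the loop. `count_results` and `fake_fetch` are hypothetical names; `fake_fetch` stands in for a client method such as `api_client.get_targetbioactivity` so that the example runs without the service.

```python
def count_results(fetch, target_id):
    """Return the number of results for target_id, or None if the call fails."""
    try:
        return len(fetch(TARGET_ID=str(target_id))["results"])
    except Exception:
        return None


# Hypothetical stand-in for a client method; target "2" simulates a failure.
def fake_fetch(TARGET_ID):
    if TARGET_ID == "2":
        raise IOError("simulated service failure")
    return {"results": [1, 2, 3]}


counts = {tid: count_results(fake_fetch, tid) for tid in (1, 2)}
print(counts)
```

Keeping `None` for failed targets makes it possible to distinguish "no data" from "request failed" when the counts are later tabulated.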

In [15]:

# How many compounds do we have per target?
tc = []
ts = '<table>'
te = '</table>'

for prediction in predictions:
    try:
        targetcompounds = api_client.get_targetcompounds(TARGET_ID=str(prediction['target_id']))['results']
        count = len(targetcompounds)
    except Exception:
        count = 0
        print "Error retrieving data points!"
    tc.append("<tr><th>" + prediction['PROTEIN_ACCESSION'] + "</th><th>" + prediction['full_name'] + "</th></tr>")
    tc.append('<tr><td>Compounds</td><td>' + str(count) + '</td></tr>')

h = HTML(ts + "".join(tc) + te)
h

Out[15]:

CHEMBL3356	Cytochrome P450 1A2
Compounds	11791
CHEMBL5393	ATP-binding cassette sub-family G member 2
Compounds	484
CHEMBL340	Cytochrome P450 3A4
Compounds	15913
CHEMBL3397	Cytochrome P450 2C9
Compounds	11124
CHEMBL3622	Cytochrome P450 2C19
Compounds	11707
CHEMBL3577	Retinal dehydrogenase 1
Compounds	75307
CHEMBL289	Cytochrome P450 2D6
Compounds	9717
CHEMBL6035	Thioredoxin reductase 1, cytoplasmic
Compounds	39319
