In [ ]:

#functions to help visualise output

from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
from rdkit.Chem import AllChem

def depict(input):
    if(">>" in input):
        rxn = AllChem.ReactionFromSmarts(input)       
        return Draw.ReactionToImage(rxn)
    else:
        temp = Chem.MolFromSmiles(input)
        return temp
    
def showFile(in_file):
    fo = open(in_file, "r")
    
    mols=[]
    ids=[]
    n=0
    for line in fo:
        if(n < 10):
            smi,id = line.rstrip().split()
            mols.append( Chem.MolFromSmiles(smi) )
            ids.append(id)
        n=n+1
    
    return Draw.MolsToGridImage(mols,molsPerRow=5,legends=ids)

def showMMPs(in_file):
    fo = open(in_file, "r")
    
    mols=[]
    trans=[]
    n=0
    for line in fo:
        if(n < 9):
            m1,m2,id1,id2,t,c = line.rstrip().split(",")
            smi = "%s.%s" % (m1,m2)
            mols.append( Chem.MolFromSmiles(smi) )
            trans.append(t)
        n=n+1
    
    return Draw.MolsToGridImage(mols,molsPerRow=3,subImgSize= ( 300 , 300 ) ,legends=trans)

def showLine(in_string):
    f = in_string.split(",")
    
    rxn =f[-2].split(">>")      
        
    mols=[]
    ids=[]
    
    mols.append( Chem.MolFromSmiles(f[-6]) )
    mols.append( Chem.MolFromSmiles(f[-5]) )
    mols.append( Chem.MolFromSmiles(rxn[0]) )
    mols.append( Chem.MolFromSmiles(rxn[1]) )
    mols.append( Chem.MolFromSmiles(f[-1]) )
    ids.append(f[-3])
    ids.append(f[-4])
    ids.append("LHS")  
    ids.append("RHS")  
    ids.append("CONTEXT")  
    
    return Draw.MolsToGridImage(mols,molsPerRow=6,legends=ids)

To find all the MMPs in your set¶

The program to generate the MMPs from a set is divided into two parts; fragmentation and indexing.

Before running the programs, make sure your input set of SMILES:

Does not contain mixtures (salts etc.)
Does not contain * atoms
Has been canonicalised using RDKit.

If your smiles set doesn't satisfy the conditions above the programs are likely to fail or in the case of canonicalisation result in not identifying MMPs involving H atom substitution.

The file sample.smi has been canonicalised using RDKit.

1) Fragmentation command:¶

In [ ]:

cd t1_files/

In [ ]:

ls

*Note:* You need a ! to run a Linux command in this ipython notebook. It's not needed on the command line.

In [ ]:

!wc -l sample.smi

In [ ]:

!head sample.smi

In [ ]:

showFile('sample.smi')

Example fragmentation command:

python $RDBASE/Contrib/mmpa/rfrag.py <SMILES_FILE >FRAGMENT_OUTPUT

Program help (use the command: rfrag.py -h)

Program that fragments a user input set of smiles.
The program enumerates every single,double and triple acyclic single bond cuts in a molecule.

USAGE: ./rfrag.py <file_of_smiles
Format of smiles file: SMILES ID (space separated)
Output: whole mol smiles,ID,core,context

Lets run an example:

In [ ]:

!python $RDBASE/Contrib/mmpa/rfrag.py <sample.smi >out_fragmented.txt

Lets have a look at the output:

In [ ]:

!head out_fragmented.txt

2) Index command:¶

Example indexing command:

python $RDBASE/Contrib/mmpa/indexing.py <FRAGMENT_OUTPUT >MMP_OUTPUT.CSV

Format of output: SMILES_OF_LEFT_MMP,SMILES_OF_RIGHT_MMP,ID_OF_LEFT_MMP,ID_OF_RIGHT_MMP,SMIRKS_OF_TRANSFORMATION,SMILES_OF_CONTEXT

Program help (use the command: indexing.py -h)

Usage: indexing.py [options]

Program to generate MMPs

Options:
  -h, --help            show this help message and exit
  -s, --symmetric       Output symmetrically equivalent MMPs, i.e output both
                        cmpd1,cmpd2, SMIRKS:A>>B and cmpd2,cmpd1, SMIRKS:B>>A
  -m MAXSIZE, --maxsize=MAXSIZE
                        Maximum size of change (in heavy atoms) allowed in
                        matched molecular pairs identified. DEFAULT=10.
                        Note: This option overrides the ratio option if both
                        are specified.
  -r RATIO, --ratio=RATIO
                        Maximum ratio of change allowed in matched molecular
                        pairs identified. The ratio is: size of change /
                        size of cmpd (in terms of heavy atoms). DEFAULT=0.3.
                        Note: If this option is used with the maxsize option,
                        the maxsize option will be used.

Lets some an examples:

Default settings:

In [ ]:

!python $RDBASE/Contrib/mmpa/indexing.py <out_fragmented.txt >out_mmps_default.csv

In [ ]:

!head out_mmps_default.csv

In [ ]:

showLine("Cc1cccn2cc(-c3cccc(S(=O)(=O)N4CCCCC4)c3)nc12,Cc1cccn2cc(-c3ccc(S(=O)(=O)N4CCCCC4)cc3)nc12,2963575,1156028,[*:1]c1cccc([*:2])c1>>[*:1]c1ccc([*:2])cc1,[*:1]c1cn2cccc(C)c2n1.[*:2]S(=O)(=O)N1CCCCC1")

In [ ]:

showMMPs("out_mmps_default.csv")

Output MMPs where maximum size of change is 3 heavy atoms:

In [ ]:

!python $RDBASE/Contrib/mmpa/indexing.py -m 3 <out_fragmented.txt

In [ ]:

showLine("Nc1ccc(-c2cc3ccccc3oc2=O)cc1,O=c1oc2ccccc2cc1-c1ccc(O)cc1,310860,4055843,[*:1]N>>[*:1]O,[*:1]c1ccc(-c2cc3ccccc3oc2=O)cc1")

Output MMPs where no more that 10% of the compound has changed (and using the symmetric option):

In [ ]:

!python $RDBASE/Contrib/mmpa/indexing.py -r 0.1 <out_fragmented.txt

In [ ]:

showLine("Nc1ccc(-c2cc3ccccc3oc2=O)cc1,O=c1oc2ccccc2cc1-c1ccc(O)cc1,310860,4055843,[*:1]N>>[*:1]O,[*:1]c1ccc(-c2cc3ccccc3oc2=O)cc1")

In [ ]:

!python $RDBASE/Contrib/mmpa/indexing.py -r 0.1 -s <out_fragmented.txt >out_mmps_sym.csv

In [ ]:

!head out_mmps_sym.csv
showMMPs("out_mmps_sym.csv")

SMIRKS canonicalisation¶

Take a look at the following SMIRKS:

In [ ]:

depict("[*:2]c1ccc([*:1])o1>>[*:1]c1ccc([*:2])cc1")

In [ ]:

depict("[*:1]c1ccc([*:2])o1>>[*:2]c1ccc([*:1])cc1")

Notice the SMIRKS strings are different but the change they describe are the same (as the positions on the furan and benzene rings are symmetrically equivalent).

The algorithm used to canonicalise SMIRKS is as follows:

Canonicalise the LHS.
For the LHS the 1st asterisk (attachment point) in the SMILES will have label 1, 2nd asterisk will have label 2 and so on
For the RHS, if you have a choice (ie. two attachment points are symmetrically equivalent), always put the label with lower numerical value on the earlier attachment point in the canonicalised SMILES

The MMP identification script uses a SMIRKS canonicalisation routine so the same change always has the same output SMIRKS. To canonicalise a SMIRKS (generated elsewhere) so it is in the same format as MMP identification scripts use command:

cansmirk.py <SMIRKS_FILE >SMIRKS_OUTPUT_FILE

Format of SMIRKS_FILE (space or comma separated): SMIRKS ID

Format of output: CANONICALISED_SMIRKS ID

Note: The script will NOT deal with SMARTS characters, so the SMIRKS must contain valid SMILES for left and right hand sides.

Let's try an example:

In [ ]:

!head sample_smirks.txt

In [ ]:

!python $RDBASE/Contrib/mmpa/cansmirk.py <sample_smirks.txt

Applying SMIRKS to input compounds¶

If you want to apply a SMIRKS/transform generated by the programs above to a compound, use the mol_transform.py program with the following command:

python $RDBASE/Contrib/mmpa/mol_transform.py -f TRANSFORM_FILE <SMILES_FILE >OUTPUT_FILE

If you want to use a set SMIRKS generated elsewhere, please make sure they have been canonicalised using the cansmirk.py command.

Program help: Usage: mol_transform.py [options]

Program to apply transformations to a set of input molecules

Options:
  -h, --help            show this help message and exit
  -f TRANSFORM_FILE, --file=TRANSFORM_FILE
                    The file containing the transforms to apply to your
                    input SMILES

Example command: mol_transform.py -f TRANSFORM_FILE <SMILES_FILE
Format of smiles file: SMILES ID <space or comma separated>
Format of transform file: transform <one per line>
Output: SMILES,ID,Transfrom,Modified_SMILES

Let's go through an example:

In [ ]:

!head sample_smirks_mol_trans.txt

In [ ]:

depict("[*:1]C(=O)O>>[*:1]C(N)=O")

In [ ]:

!head sample_smiles_mol_trans.smi

In [ ]:

depict("O=C(O)c1ccc(NC(=O)C2COc3ccccc3O2)cc1")

Let's run the command:

In [ ]:

!python $RDBASE/Contrib/mmpa/mol_transform.py -f sample_smirks_mol_trans.txt <sample_smiles_mol_trans.smi

In [ ]:

depict("NC(=O)c1ccc(NC(=O)C2COc3ccccc3O2)cc1")