In [1]:
from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw


# Building Molecules

## Overall process

### From SMILES

• Process SMILES
• Remove explicit Hs from graph
• Sanitize
• Perceive stereochemistry: remove redundant or incorrect stereo information. Assign CIP codes.

### From CTAB

• Process CTAB
• Remove explicit Hs from graph
• Sanitize
• Assign cis/trans markers to double bonds based on geometry
• Perceive stereochemistry: remove redundant or incorrect stereo information. Assign CIP codes.

## Sanitization

Chem.SanitizeMol() calls the following in sequence

• MolOps::cleanUp() : clean up a small-set of substructures that are frequently "incorrectly" drawn. Example: N(=O)=O -> [N+](=O)[O-]
• mol.updatePropertyCache() : calculate explicit and implicit valence of all atoms. FAILS when atoms have illegal valence.
• MolOps::symmetrizeSSSR() : generates a symmetrized set of smallest rings. FAILS in very rare cases.
• MolOps::Kekulize() : generates a Kekule structure for the molecule. FAILS if a Kekule form cannot be found or non-ring bonds are marked as aromatic.
• MolOps::assignRadicals() : determines the number of radicals at each atom
• MolOps::setAromaticity() : finds aromatic ring systems, marks their atoms as aromatic, marks their bonds as aromatic, and converts their bonds to BondType::AROMATIC
• MolOps::setConjugation() : identifies conjugated bond systems
• MolOps::setHybridization() : assigns hybridization to atoms
• MolOps::cleanupChirality() : removes atom-stereochemistry markings from non-sp3 centers.
• MolOps::adjustHs() : assigns explicit Hs to aromatic heteroatoms that need them. e.g. solves the C1=CNC=C1 -> c1c[nH]cc1 problem

## How long does it all take?

As of this writing (21 Sept 2012), building 100K drug-like molecules from SMILES takes ~14 seconds on my linux box.

Here's the percentage of total run time broken down by step:

| Step                            | Time | Percent | Notes                      |
| toMol()                         |  1.8 |    13.1 | SMILES parser              |
| MolOps::removeHs()              |  7.7 |    55.7 |                            |
| MolOps::sanitizeMol()           |  6.8 |    48.8 | Part of the removeHs tally |
| MolOps::assignStereochemistry() |  3.7 |    26.9 |                            |


And here's a detailed view of the sanitization steps:

| Step                       | Time | Percent | Notes            |
| MolOps::cleanUp()          |  0.1 |     1.2 |                  |
| mol.updatePropertyCache()  |  0.4 |     2.8 |                  |
| MolOps::symmetrizeSSSR()   |  2.1 |    15.3 | 2.08 in findSSSR |
| MolOps::Kekulize()         |  1.0 |     6.9 |                  |
| MolOps::assignRadicals()   |  0.1 |     0.6 |                  |
| MolOps::setAromaticity()   |  1.2 |     8.7 |                  |
| MolOps::setConjugation()   |  0.8 |     5.5 |                  |
| MolOps::setHybridization() |  0.3 |     2.0 |                  |
| MolOps::cleanupChirality() |  0.1 |     0.5 |                  |
| MolOps::adjustHs()         |  0.4 |     3.1 |                  |


## Can it go faster?

### Use pickles

Build molecules once, store pickles, build from them.

This is nice because the pickle contains pretty much everything the RDKit knows about the molecule. It's also quite fast, but it's not inter-operable with other systems. It also takes up a fair amount of disk space.

For the 100K drug-like molecules, creating the molecules from the pickle file takes about 1.3 seconds (compare with the 14 seconds for building the molecule from SMILES). The pickle file itself took 0.9 seconds to create and is 28.3MB (the SMILES file is 3.3MB).

### Use RDKit SMILES

Build molecules once, write canonical SMILES, build from that using limited sanitization.

The idea is to trust the input SMILES to only contain correct molecules with a reasonable aromaticity assignment and no bogus stereochemistry info.

For the 100K drug-like molecules, parsing the SMILES and calling mol.updatePropertyCache() so that atoms have valences assigned takes 2.1 seconds.

These input molecules have sufficient information to allow many standard operations

Calling MolOps::fastFindRings() to determine whether or not atoms and bonds are in rings takes 0.4 seconds. MolOps::fastFindRings() does a depth-first traversal of the molecule graph to find rings. This is a fast way to find cycles, but unfortunately it doesn't necessarily find the smallest cycles. In order to get information about about ring sizes, etc. a call to MolOps::symmetrizeSSSR() is required. This takes 2.1 seconds for the 100K molecules.

## An example of doing partial sanitization

Start by building a molecule from SMILES and assigning valences:

In [2]:
m = Chem.MolFromSmiles('N#Cc1ccccc1N1C[C@H](C(=O)[O-])CC1=O',sanitize=False)
m.UpdatePropertyCache()
m

Out[2]:

Notice that the depiction around the triple bond is wrong. This is because no hybridization information is present.

That's easy to fix:

In [3]:
Chem.SetHybridization(m)
m

Out[3]:
In [4]:
tmp=m.GetSubstructMatch(Chem.MolFromSmarts('c1ccccc1'))
m

Out[4]:

Doing a substructure search that looks for ring information fails though:

In [5]:
#m.HasSubstructMatch(Chem.MolFromSmarts('[r]'))


But this can be cleared up using FastFindRings():

In [6]:
Chem.FastFindRings(m)
m.HasSubstructMatch(Chem.MolFromSmarts('[r]'))

Out[6]:
True

In [7]:
[list(x) for x in m.GetSubstructMatches(Chem.MolFromSmarts('[a;r]'))]

Out[7]:
[[2], [3], [4], [5], [6], [7]]

In [8]:
[list(x) for x in m.GetSubstructMatches(Chem.MolFromSmarts('[A;r]'))]

Out[8]:
[[8], [9], [10], [14], [15]]

In [9]:
m

Out[9]:

Here's another example of the problems that a lack of complete ring information can cause.

In [10]:
sm = Chem.MolFromSmiles('O=C(N[C@H]1CCCc2ccccc21)c1n[nH]c2ccccc21')
sm

Out[10]:

Now build the same thing with minimal sanitization.

In [11]:
m = Chem.MolFromSmiles('O=C(N[C@H]1CCCc2ccccc21)c1n[nH]c2ccccc21',sanitize=False)
m.UpdatePropertyCache()
Chem.SetHybridization(m)
Chem.FastFindRings(m)
m

Out[11]:

That happened because the depiction algorithm uses ring-size information, which is not correct after calling FastFindRings()

Ring membership is fine:

In [12]:
[list(x) for x in m.GetSubstructMatches(Chem.MolFromSmarts('[A;r]'))]

Out[12]:
[[3], [4], [5], [6]]

In [13]:
m

Out[13]:

But the ring sizes are not:

In [14]:
[list(x) for x in m.GetSubstructMatches(Chem.MolFromSmarts('[r5]'))]

Out[14]:
[]


ring counts are also wrong:

In [15]:
[list(x) for x in m.GetSubstructMatches(Chem.MolFromSmarts('[R2]'))]

Out[15]:
[[7], [8], [9], [10], [11], [12], [16], [17], [18], [19], [20], [21]]

In [16]:
m

Out[16]:

Here's what the query should have generated:

In [17]:
[list(x) for x in sm.GetSubstructMatches(Chem.MolFromSmarts('[R2]'))]

Out[17]:
[[7], [12], [16], [21]]

In [18]:
sm

Out[18]:

What rings does it find?

In [19]:
m = Chem.MolFromSmiles('O=C(N[C@H]1CCCc2ccccc21)c1n[nH]c2ccccc21',sanitize=False)
m.UpdatePropertyCache()
Chem.FastFindRings(m)
ri = m.GetRingInfo()

In [20]:
ri.AtomRings()

Out[20]:
((12, 11, 10, 9, 8, 7, 6, 5, 4, 3),
(12, 11, 10, 9, 8, 7),
(21, 20, 19, 18, 17, 16, 15, 14, 13),
(21, 20, 19, 18, 17, 16))

In [21]:
Draw.MolToImage(m,highlightAtoms=ri.AtomRings()[0])

Out[21]:

There's an extra bond highlighted in the above drawing (the fusing bond). This is a limitation of the way atom highlighting is handled in drawings: all bonds between highlighted atoms are highlighted.

We can avoid the problem by highlighting bonds instead:

In [22]:
Draw.MolToImage(m,highlightBonds=ri.BondRings()[0])

Out[22]:
In [23]:
Draw.MolToImage(m,highlightBonds=ri.BondRings()[1])

Out[23]:

# Why symmetrize the rings?

This is an old problem...

Use a classic example: cubane only has five smallest rings.

In [24]:
mb="""
Mrv0541 09231207052D

8 12  0  0  0  0            999 V2000
0.4714    0.2946    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
0.4420   -1.0607    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
1.7679    0.3241    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
1.7973   -1.0902    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
-0.4479   -0.4184    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
1.1432   -0.4184    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
1.1137   -1.9800    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
-0.4479   -1.9800    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
1  2  1  0  0  0  0
1  3  1  0  0  0  0
3  4  1  0  0  0  0
4  2  1  0  0  0  0
1  5  1  0  0  0  0
5  6  1  0  0  0  0
6  3  1  0  0  0  0
6  7  1  0  0  0  0
7  4  1  0  0  0  0
5  8  1  0  0  0  0
8  7  1  0  0  0  0
8  2  1  0  0  0  0
M  END
"""
m = Chem.MolFromMolBlock(mb)
m

Out[24]:
In [25]:
m2 = Chem.MolFromMolBlock(mb,sanitize=False)
m2.UpdatePropertyCache()
Chem.GetSSSR(m2)
m2.GetRingInfo().NumRings()

Out[25]:
5

In [26]:
Draw.MolToImage(m2,highlightBonds=m2.GetRingInfo().BondRings()[0])

Out[26]:
In [27]:
Draw.MolToImage(m2,highlightBonds=m2.GetRingInfo().BondRings()[1])

Out[27]:
In [28]:
Draw.MolToImage(m2,highlightBonds=m2.GetRingInfo().BondRings()[2])

Out[28]:
In [29]:
Draw.MolToImage(m2,highlightBonds=m2.GetRingInfo().BondRings()[3])

Out[29]:
In [30]:
Draw.MolToImage(m2,highlightBonds=m2.GetRingInfo().BondRings()[4])

Out[30]:

The front face is being missed.

Isn't this all nitpicking though? Does it actually matter?

In [31]:
m2.GetSubstructMatches(Chem.MolFromSmarts('[R3]'))

Out[31]:
((0,), (1,), (2,), (3,))

In [32]:
m2

Out[32]:

The symmetrization algorithm (run by default) solves this problem:

In [33]:
Chem.GetSymmSSSR(m2)
m2.GetRingInfo().NumRings()

Out[33]:
6

In [34]:
m2.GetSubstructMatches(Chem.MolFromSmarts('[R3]'))

Out[34]:
((0,), (1,), (2,), (3,), (4,), (5,), (6,), (7,))


# Stereochemistry Handling

### Stereochemistry information that doesn't make sense is removed:

In [35]:
Chem.CanonSmiles('C[C@H](F)C') # <- two identical atoms

Out[35]:
'CC(C)F'

In [36]:
Chem.CanonSmiles('C/C=C') # <- only substituted on one side

Out[36]:
'C=CC'


### CIP codes are assigned

In [37]:
m = Chem.MolFromSmiles('C[C@](F)(Cl)Br')
m

Out[37]:
In [38]:
m.GetAtomWithIdx(1).GetProp('_CIPCode')

Out[38]:
'S'


### As are Z and E labels for double bonds:

In [39]:
m = Chem.MolFromSmiles('C/C=C/C')
m

Out[39]:
In [40]:
m.GetBondWithIdx(1).GetStereo()

Out[40]:
rdkit.Chem.rdchem.BondStereo.STEREOE

In [41]:
m = Chem.MolFromSmiles(r'C/C=C\C')
m

Out[41]:
In [42]:
m.GetBondWithIdx(1).GetStereo()

Out[42]:
rdkit.Chem.rdchem.BondStereo.STEREOZ


### Dependent stereochemistry is handled properly:

In [43]:
m = Chem.MolFromSmiles('C[C@H]1CCC[C@@H](C)[C@H]1C')
Draw.MolToImage(m,includeAtomNumbers=True)

Out[43]:
In [44]:
m.GetAtomWithIdx(1).GetProp('_CIPCode'),m.GetAtomWithIdx(5).GetProp('_CIPCode'),m.GetAtomWithIdx(7).GetProp('_CIPCode')

Out[44]:
('S', 'R', 'S')

In [45]:
m = Chem.MolFromSmiles('C[C@H]1CCC[C@H](C)[C@H]1C')
Draw.MolToImage(m,includeAtomNumbers=True)

Out[45]:
In [46]:
m = Chem.MolFromSmiles(r'C[C@H](F)/C([C@H](F)C)=C\Cl')
m

Out[46]:
In [47]:
m.GetAtomWithIdx(1).GetProp('_CIPCode'),m.GetAtomWithIdx(4).GetProp('_CIPCode')

Out[47]:
('S', 'R')

In [48]:
Chem.MolToSmiles(m,True)

Out[48]:
'C[C@@H](F)/C(=C/Cl)[C@H](C)F'

In [49]:
m = Chem.MolFromSmiles(r'C[C@H](F)/C([C@@H](F)C)=C\Cl')
m

Out[49]:
In [50]:
m.GetAtomWithIdx(1).GetProp('_CIPCode'),m.GetAtomWithIdx(4).GetProp('_CIPCode')

Out[50]:
('S', 'S')

In [51]:
Chem.MolToSmiles(m,True)

Out[51]:
'C[C@H](F)C(=CCl)[C@H](C)F'


### Stereochemistry at P and S

In [52]:
Chem.CanonSmiles('F[P@@](=O)(OC(C)C)C') #<- Sarin

Out[52]:
'CC(C)O[P@](C)(=O)F'

In [53]:
m = Chem.MolFromSmiles('F[P@@](=O)(OC(C)C)C')
m

Out[53]:
In [54]:
m.GetAtomWithIdx(1).GetProp('_CIPCode')

Out[54]:
'S'

In [65]:
Chem.CanonSmiles('COc1ccc2nc([nH]c2c1)[S@@](=O)Cc1ncc(C)c(OC)c1C')  # <- Esomeprazole

Out[65]:
'COc1ccc2nc([S@@](=O)Cc3ncc(C)c(OC)c3C)[nH]c2c1'

In [70]:
m = Chem.MolFromSmiles('COc1ccc2nc([nH]c2c1)[S@@](=O)Cc1ncc(C)c(OC)c1C')
m

Out[70]:
In [67]:
m.GetAtomWithIdx(11).GetProp('_CIPCode')

Out[67]:
'S'

In [60]: