When the RDKit has problems processing a molecule, it outputs information to the error console about what those problems were. Here's an example:
In [23]: m = Chem.MolFromSmiles('CO(C)C')
[06:18:04] Explicit valence for atom # 1 O, 3, is greater than permitted
It's sometimes useful to have programmatic access to this information for later use in reporting.
It also would be great if these types of messages were visible in the jupyter notebook.
Brian Kelley recently added functionality to the RDKit to enable both of these things. For anyone interested, the two pull requests for those changes are: #736 and #739.
This is a short note on how to take advantage of that.
A couple of things to note:
Let's start by showing the standard state of affairs:
from rdkit import Chem
from rdkit import rdBase
print(rdBase.rdkitVersion)
2016.03.1.dev1
m = Chem.MolFromSmiles('CO(C)C')
m
At this point there is an error message in the console I launched Jupyter from, but it sure would be nice if it were visible here.
We can enable that by just importing the usual RDKit Jupyter integration code:
from rdkit.Chem.Draw import IPythonConsole
m = Chem.MolFromSmiles('CO(C)C')
RDKit ERROR: [06:46:02] Explicit valence for atom # 1 O, 3, is greater than permitted
Chem.MolFromSmiles('c1cc1')
RDKit ERROR: [06:46:11] Can't kekulize mol RDKit ERROR:
Chem.MolFromSmiles('c1')
RDKit ERROR: [06:46:18] SMILES Parse Error: unclosed ring for input: 'c1'
Chem.MolFromSmiles('Ch')
RDKit ERROR: [06:46:41] SMILES Parse Error: syntax error for input: 'Ch'
So far so good. What if I want to have access to the error messages as strings in Python?
from io import StringIO
import sys
Chem.WrapLogs()
sio = sys.stderr = StringIO()
Chem.MolFromSmiles('Ch')
print("error message:",sio.getvalue())
error message: RDKit ERROR: [06:49:14] SMILES Parse Error: syntax error for input: 'Ch'
I can use this to write a bit of code that processes all of the molecules in an SDF and captures the errors:
def readmols(suppl):
ok=[]
failures=[]
sio = sys.stderr = StringIO()
for i,m in enumerate(suppl):
if m is None:
failures.append((i,sio.getvalue()))
sio = sys.stderr = StringIO() # reset the error logger
else:
ok.append((i,m))
return ok,failures
import gzip,os
from rdkit import RDConfig
inf = gzip.open(os.path.join(RDConfig.RDDataDir,'PubChem','Compound_000200001_000225000.sdf.gz'))
suppl = Chem.ForwardSDMolSupplier(inf)
ok,failures = readmols(suppl)
for i,fail in failures:
print(i,fail)
2035 RDKit ERROR: [07:31:28] Explicit valence for atom # 0 Br, 5, is greater than permitted RDKit ERROR: [07:31:28] ERROR: Could not sanitize molecule ending on line 404864 11460 RDKit ERROR: [07:31:28] ERROR: Explicit valence for atom # 0 Br, 5, is greater than permitted RDKit ERROR: [07:31:32] Explicit valence for atom # 2 Te, 4, is greater than permitted RDKit ERROR: [07:31:32] ERROR: Could not sanitize molecule ending on line 2344967 17016 RDKit ERROR: [07:31:32] ERROR: Explicit valence for atom # 2 Te, 4, is greater than permitted RDKit ERROR: [07:31:34] Explicit valence for atom # 1 Br, 5, is greater than permitted RDKit ERROR: [07:31:34] ERROR: Could not sanitize molecule ending on line 3489884