This is just a short one, mainly to have the information online to use as a reference.
The RDKit has a number of different ways of serializing and deserializing molecules (converting them to and from strings). This post looks at how long it takes to do that with the three serialization approaches I normally recommend:
I looked at just serializing the molecular structure, including atomic coordinates, and serializing a few propertis together with the molecule.
Note that this isn't exactly comparing apples to apples: the binary and JSON formats both capture more or less the full perceived state of the molecule (aromaticity, ring systems, etc), while the CXSMILES variant doesn't include the ring information.
Here are the timing results for 50K molecules from ChEMBL. I ran this on my normal desktop machine, a 3-year old Dell desktop with a 3.1GHz Intel Core9 CPU. I'm running the conda-forge RDKit build.
Times are in seconds
Writing
Method | Mol | Mol + coordinates | Mol + properties |
---|---|---|---|
CXSMILES | 3.9 | 6.4 | N/A |
Pickle | 1.31 | 1.48 | 1.44 |
JSON | 2.4 | 3.1 | 4.1 |
Reading
Method | Mol | Mol + coordinates | Mol + properties |
---|---|---|---|
CXSMILES | 1.56 | 3.78 | N/A |
Pickle | 0.9 | 1.14 | 1.16 |
JSON | 1.66 | 2.08 | 2.14 |
The binary format is, of course, the fastest. The JSON format is slower than that, but it's still faster than using CXSMILES when serializing coordinates and roughly equivalent to CXSMILES when just storing the molecule (but, as noted above, the JSON contains more info than the CXSMILES does).
from rdkit import Chem
import time
import gzip
import rdkit
print(rdkit.__version__)
2022.09.1
Read in a set of 50K ChEMBL molecules that we'll use for the testing.
mols = []
suppl = Chem.SmilesMolSupplier('../data/new_chembl_document_activity_set.smi')
while len(mols)<50000:
try:
m = next(suppl)
except StopIteration:
break;
mols.append(m)
len(mols)
50000
This class makes timing our runs easier:
class timer(object):
def __init__(self, *args, **kwargs):
pass
def __enter__(self):
self.t1=time.time()
def __exit__(self, type, value, traceback):
delta = time.time()-self.t1
print(f'That took {delta:.2f} seconds')
jsons[0]
'{"commonchem":{"version":10},"defaults":{"atom":{"z":6,"impHs":0,"chg":0,"nRad":0,"isotope":0,"stereo":"unspecified"},"bond":{"bo":1,"stereo":"unspecified"}},"molecules":[{"name":"436078","atoms":[{"z":8,"chg":-1},{"z":7,"chg":1},{"z":8},{},{"impHs":1},{"impHs":1},{},{"z":7},{},{"z":8},{"impHs":1,"stereo":"ccw"},{"impHs":1},{"impHs":2},{"impHs":2},{"impHs":1},{"z":8},{"impHs":1,"stereo":"ccw"},{},{"z":8},{},{"impHs":1},{"impHs":1},{"impHs":1},{"impHs":1},{}],"bonds":[{"atoms":[0,1]},{"bo":2,"atoms":[1,2]},{"atoms":[1,3]},{"bo":2,"atoms":[3,4]},{"atoms":[4,5]},{"bo":2,"atoms":[5,6]},{"atoms":[6,7]},{"atoms":[7,8]},{"bo":2,"atoms":[8,9]},{"atoms":[8,10]},{"atoms":[10,11]},{"atoms":[11,12]},{"atoms":[12,13]},{"atoms":[13,14]},{"atoms":[14,15]},{"atoms":[14,16]},{"atoms":[16,17]},{"bo":2,"atoms":[17,18]},{"atoms":[6,19]},{"bo":2,"atoms":[19,20]},{"atoms":[20,21]},{"bo":2,"atoms":[21,22]},{"atoms":[22,23]},{"bo":2,"atoms":[23,24]},{"atoms":[24,3]},{"atoms":[17,7]},{"atoms":[16,10]},{"atoms":[15,11]},{"atoms":[24,19]}],"properties":{"numAtoms":25,"numBonds":29,"numRings":5,"smiles":"O=C1[C@@H]2C3CCC(O3)[C@@H]2C(=O)N1c1ccc([N+](=O)[O-])c2ccccc12"},"extensions":[{"name":"rdkitRepresentation","formatVersion":2,"toolkitVersion":"2022.09.1","aromaticAtoms":[3,4,5,6,19,20,21,22,23,24],"aromaticBonds":[3,4,5,18,19,20,21,22,23,24,28],"cipRanks":[23,19,24,13,7,6,12,18,16,21,8,14,0,1,15,20,9,17,22,10,4,2,3,5,11],"cipCodes":[[10,"S"],[16,"R"]],"atomRings":[[3,24,19,6,5,4],[8,7,17,16,10],[12,11,15,14,13],[15,14,16,10,11],[20,21,22,23,24,19]]}]}]}'
Generating CXSMILES:
with timer() as cm:
smis = [Chem.MolToCXSmiles(m) for m in mols]
That took 3.88 seconds
Generating the RDKit's binary format:
with timer() as cm:
pkls = [m.ToBinary() for m in mols]
That took 1.31 seconds
Generating the RDKit's variant of commonchem JSON:
with timer() as cm:
jsons = [Chem.MolToJSON(m) for m in mols]
That took 2.40 seconds
Generating JSON for all of the molecules at once:
with timer() as cm:
allJson = Chem.MolsToJSON(mols)
That took 2.83 seconds
Now look at reading the molecules.
Start from SMILES, but skip full sanitization since we know that the chemistry and aromaticity are correct:
with timer() as cm:
for smi in smis:
m = Chem.MolFromSmiles(smi,sanitize=False)
m.UpdatePropertyCache()
That took 1.56 seconds
From the binary format:
with timer() as cm:
for pkl in pkls:
m = Chem.Mol(pkl)
That took 0.92 seconds
And the two variants for parsing JSON:
with timer() as cm:
for js in jsons:
m = Chem.JSONToMols(js)[0]
That took 1.66 seconds
with timer() as cm:
_ = Chem.JSONToMols(allJson)
That took 1.49 seconds
Add a conformer to each molecule (which will be written to the output) and re-do the timing runs:
from rdkit.Chem import rdDepictor
for mol in mols:
rdDepictor.Compute2DCoords(mol)
with timer() as cm:
smis = [Chem.MolToCXSmiles(m) for m in mols]
That took 6.43 seconds
smis[0]
'O=C1[C@@H]2C3CCC(O3)[C@@H]2C(=O)N1c1ccc([N+](=O)[O-])c2ccccc12 |(-1.03854,-2.25072,;-1.65866,-0.884909,;-3.12813,-0.583792,;-4.49394,-1.20391,;-5.96341,-0.902797,;-6.13112,0.587798,;-4.7653,1.20792,;-3.75347,0.100581,;-3.29584,0.906803,;-1.93002,1.52692,;-1.62891,2.99639,;-0.918193,0.419586,;0.572402,0.587296,;1.17246,1.96204,;2.66305,2.12976,;3.55359,0.922718,;5.04419,1.09043,;5.93473,-0.116609,;5.64424,2.46518,;2.95354,-0.452031,;3.84407,-1.65907,;3.24402,-3.03382,;1.75342,-3.20153,;0.862885,-1.99449,;1.46294,-0.619741,)|'
with timer() as cm:
pkls = [m.ToBinary() for m in mols]
That took 1.48 seconds
with timer() as cm:
jsons = [Chem.MolToJSON(m) for m in mols]
That took 3.07 seconds
with timer() as cm:
for smi in smis:
m = Chem.MolFromSmiles(smi,sanitize=False)
m.UpdatePropertyCache()
That took 3.78 seconds
with timer() as cm:
for pkl in pkls:
m = Chem.Mol(pkl)
That took 1.14 seconds
with timer() as cm:
for js in jsons:
m = Chem.JSONToMols(js)[0]
That took 2.08 seconds
Since all three formats can also save properties, see how those impact writing and parsing times:
for mol in mols:
mol.RemoveAllConformers()
mol.SetIntProp('numAtoms',mol.GetNumAtoms())
mol.SetIntProp('numBonds',mol.GetNumBonds())
mol.SetIntProp('numRings',mol.GetRingInfo().NumRings())
mol.SetProp('smiles',Chem.MolToSmiles(mol))
with timer() as cm:
pkls = [m.ToBinary(Chem.PropertyPickleOptions.MolProps) for m in mols]
That took 1.44 seconds
with timer() as cm:
jsons = [Chem.MolToJSON(m) for m in mols]
That took 4.09 seconds
with timer() as cm:
for pkl in pkls:
m = Chem.Mol(pkl)
That took 1.16 seconds
with timer() as cm:
for js in jsons:
m = Chem.JSONToMols(js)[0]
That took 2.14 seconds