After reading [this post](http://www.cybertec.at/2016/02/postgresql-on-hardware-vs-postgresql-on-virtualbox/) from Hans-Jürgen Schönig about performance differences between PostgreSQL running natively and PostgreSQL running in a VM, I got curious about the impact of virtualization on the RDKit. This is a brief exploration of that topic.
Some technical details about the experiments first:
- The test script is `$RDBASE/Regress/Scripts/new_timings.py`. This is a more time-consuming version of the standard RDKit benchmarking tests.
- The test set is 50K molecules pulled from ZNP (a subset that no longer exists) a few years ago.
- `HasSubstructMatch()` for the 50K molecules and 100 of the SMARTS (reproducibly randomly selected)
- `GetSubstructMatches()` for the 50K molecules and 100 of the SMARTS (reproducibly randomly selected)
- The 428 SMARTS in `$RDBASE/Data/SmartsLib/RLewis_smarts.txt`
- `HasSubstructMatch()` for the 50K molecules and the 428 SMARTS
- `GetSubstructMatches()` for the 50K molecules and the 428 SMARTS
- `Chem.BRICS.BreakBRICSBonds()` on the 50K molecules

Note that none of these need to do much in the way of I/O.

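The substructure tests listed above can be sketched roughly as follows. This is a minimal illustration and not the code in `new_timings.py`: the helper names and the seed value are assumptions made for the example, and only `HasSubstructMatch()` and `GetSubstructMatches()` come from the list itself.

```python
import random
import time


def pick_reproducibly(items, k, seed=23):
    """Pick k items at random, the same k on every run.

    The seed value is an arbitrary assumption for this sketch.
    """
    rng = random.Random(seed)  # private generator, leaves global state alone
    return rng.sample(list(items), k)


def time_call(fn, *args):
    """Call fn(*args) once; return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def count_has_match(mols, patterns):
    """Number of (molecule, pattern) pairs where HasSubstructMatch() is true."""
    return sum(1 for m in mols for p in patterns if m.HasSubstructMatch(p))


def count_all_matches(mols, patterns):
    """Total number of matches returned by GetSubstructMatches() over all pairs."""
    return sum(len(m.GetSubstructMatches(p)) for m in mols for p in patterns)
```

With the RDKit installed, the molecules and queries could be built with `Chem.MolFromSmiles()` and `Chem.MolFromSmarts()` and then timed with something like `time_call(count_has_match, mols, pick_reproducibly(patterns, 100))`.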
Env | T1 | T2 | T3 | T4 | T5 | T6 | T7 | T8 | T9 | T10 | T11 | T12 | T13 | T14 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Physical | 12.6 | 6.1 | 5.0 | 0.0 | 56.3 | 60.7 | 0.0 | 163.6 | 168.7 | 18.5 | 44.6 | 15.8 | 64.8 | 5.0 |
Vagrant | 12.9 | 6.5 | 5.0 | 0.1 | 56.0 | 61.4 | 0.0 | 164.2 | 168.5 | 19.3 | 45.5 | 16.1 | 68.5 | 5.1 |
Docker | 12.6 | 6.2 | 4.9 | 0.0 | 54.5 | 59.8 | 0.0 | 161.5 | 162.6 | 18.4 | 43.8 | 15.4 | 67.9 | 5.0 |
Comfortingly, running the code in a virtual environment doesn't have much, if any, impact on performance for this CPU-intensive test.
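To put a number on "not much", here is a small sketch that computes the relative change of each environment against the physical machine. The timings are copied from the table above; the function names are made up for this example, and tests whose physical timing rounds to 0.0 are skipped because a relative change is undefined there.

```python
# Timings copied from the table above (T1..T14, in the units reported there).
TIMINGS = {
    "Physical": [12.6, 6.1, 5.0, 0.0, 56.3, 60.7, 0.0,
                 163.6, 168.7, 18.5, 44.6, 15.8, 64.8, 5.0],
    "Vagrant": [12.9, 6.5, 5.0, 0.1, 56.0, 61.4, 0.0,
                164.2, 168.5, 19.3, 45.5, 16.1, 68.5, 5.1],
    "Docker": [12.6, 6.2, 4.9, 0.0, 54.5, 59.8, 0.0,
               161.5, 162.6, 18.4, 43.8, 15.4, 67.9, 5.0],
}


def percent_change(baseline, other):
    """Relative change of other versus baseline, in percent."""
    return 100.0 * (other - baseline) / baseline


def summarize(env, baseline="Physical"):
    """Return (change in summed time, {test name: change}) for one environment."""
    base = TIMINGS[baseline]
    vals = TIMINGS[env]
    total = percent_change(sum(base), sum(vals))
    per_test = {
        f"T{i}": percent_change(b, v)
        for i, (b, v) in enumerate(zip(base, vals), start=1)
        if b > 0  # skip tests too fast to register on the baseline
    }
    return total, per_test


for env in ("Vagrant", "Docker"):
    total, per_test = summarize(env)
    worst = max(per_test, key=lambda t: abs(per_test[t]))
    print(f"{env}: total {total:+.1f}%, "
          f"largest single-test change {worst} ({per_test[worst]:+.1f}%)")
```

On these numbers the summed run time of each virtualized environment stays within 2% of the physical machine, and no individual test moves by more than about 7%.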
Since ContinuumIO makes Docker images with miniconda preconfigured available, setting up the Docker environment turns out to be really simple. Here's the Dockerfile I used:

    FROM continuumio/miniconda3
    MAINTAINER Greg Landrum <greg.landrum@gmail.com>
    ENV PATH /opt/conda/bin:$PATH
    ENV LANG C
    # install the RDKit:
    RUN conda config --add channels https://conda.anaconda.org/rdkit
    RUN conda install -y rdkit

You can put that in an empty directory and then build a local image with the RDKit installed by running:

    docker build -t basic_conda .

I wanted to mirror my local RDKit checkout into the image when I ran it so that I had access to the `Regress` directory. This is easy to do:

    docker run -i -t -v /scratch/RDKit_git:/opt/RDKit_git basic_conda /bin/bash

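Once inside the container, a quick check that the conda-installed RDKit is importable can save a failed benchmark run. The following is a small sketch; the helper name `rdkit_status` is made up for this example, and the version lookup falls back to a placeholder in case the installed build does not expose `__version__`.

```python
import importlib
import importlib.util


def rdkit_status():
    """Return a short message saying whether the rdkit package can be imported."""
    if importlib.util.find_spec("rdkit") is None:
        return "rdkit not found on this Python's path"
    rdkit = importlib.import_module("rdkit")
    version = getattr(rdkit, "__version__", "unknown version")
    return f"rdkit {version} is available"


print(rdkit_status())
```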
And then I ran the benchmark with:

    cd /opt/RDKit_git/Regress/Scripts
    python new_timings.py
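
The interactive steps above could also be collapsed into a single non-interactive `docker run`. The sketch below only assembles the command line in Python. The `--rm` and `-w` flags are standard `docker run` options that were not part of the session described here, and the function name and defaults are assumptions for illustration; the paths and image name are the ones used above.

```python
import shlex


def benchmark_command(host_checkout="/scratch/RDKit_git",
                      container_checkout="/opt/RDKit_git",
                      image="basic_conda",
                      script="new_timings.py"):
    """Build the argv for a one-shot benchmark run inside the container."""
    workdir = f"{container_checkout}/Regress/Scripts"
    return [
        "docker", "run",
        "--rm",  # remove the container when the run finishes
        "-v", f"{host_checkout}:{container_checkout}",  # mirror the checkout
        "-w", workdir,  # start in the directory holding the script
        image,
        "python", script,
    ]


argv = benchmark_command()
print(shlex.join(argv))
```

Passing `argv` to `subprocess.run(argv, check=True)` would execute the run; printing it first makes it easy to compare against the manual commands.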