I got a copy of Thing Explainer a while ago and I think it's pure genius (not that surprising from Randall Munroe). With my thinking infected by reading the book, the other day I wondered what the ten hundred most common "chemical words" are.
After a bit of extremely deep thinking I decided that I would be willing to accept that a "chemical word" is the kind of fragment that comes out of something like BRICS (or RECAP) decomposition of a molecule.
This post takes that simple idea and applies it to a big database of molecules: the almost 110 million molecules that are in the excellent Zinc15.
from rdkit import Chem
from rdkit.Chem import BRICS
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from rdkit import rdBase
%pylab inline
print(rdBase.rdkitVersion)
import os,time,glob
print( time.asctime())
Populating the interactive namespace from numpy and matplotlib 2016.09.1dev1 Fri Apr 29 10:04:28 2016
WARNING: pylab import has clobbered these variables: ['rc'] `%matplotlib` prevents importing * from pylab and numpy
Before diving into the analysis, a bit of explanation about BRICS.
Let's start with the reference: Degen, J., Wegscheid-Gerlach, C., Zaliani, A. & Rarey, M. "On the Art of Compiling and Using ‘Drug-Like’ Chemical Fragment Spaces". ChemMedChem 3:1503–07 (2008).
The idea is to fragment molecules based on synthetically accessible bonds. Here's an example starting from a random ZINC molecule, ZINC00002139:
m = Chem.MolFromSmiles('CC[NH+](CC)CCOc1ccc(cc1)Cc2ccccc2')
m
And here are the fragments from doing a BRICS analysis:
pieces = [Chem.MolFromSmiles(x) for x in BRICS.BRICSDecompose(m)]
Draw.MolsToGridImage(pieces,molsPerRow=4)
The isotope labels on the molecules provide information about the type of bond that was broken and can be used later (using BRICS.BRICSBuild()
) to build new molecules. For example, the rules allow a bond to be formed between an atom connected to a [4*]
and an atom connected to a [5*]
.
One important characteristic (I hesitate to use "feature" in this context) of the current RDKit BRICS implementation that impacts this analysis is that only unique fragments are returned. So we only get one [4*]CC
fragment for that molecule above, even though it appears in the molecule twice.
So the analysis below is of the most common chemical words, but we only count frequency based on the number of molecules they appear in, not how often they appear in those molecules.
We're going to use some parallel processing, so bring that stuff in:
from ipyparallel import Client
rc = Client()
dview = rc[:]
with dview.sync_imports():
from rdkit import Chem
from rdkit.Chem import BRICS
import dbm,struct,os
importing Chem from rdkit on engine(s) importing BRICS from rdkit.Chem on engine(s) importing dbm on engine(s) importing struct on engine(s) importing os on engine(s)
The input is the SMILES files for the individual ZINC tranches:
infs = glob.glob('/scratch/RDKit_git/Data/Zinc15/split_files/*.smi')
print(len(infs))
infs[:5]
120
['/scratch/RDKit_git/Data/Zinc15/split_files/zinc15_HG.smi', '/scratch/RDKit_git/Data/Zinc15/split_files/zinc15_FG.smi', '/scratch/RDKit_git/Data/Zinc15/split_files/zinc15_JD.smi', '/scratch/RDKit_git/Data/Zinc15/split_files/zinc15_DD.smi', '/scratch/RDKit_git/Data/Zinc15/split_files/zinc15_AI.smi']
def process_file(fname):
"""
This runs through the molecules in a SMILES file and keeps track of the BRICS
fragments that they produce and the counts of those fragments.
The results are stored in a dbm database.
"""
outfname = os.path.splitext(fname)[0] + '.dbm'
res = 0
with open(fname,'r') as inf:
inf.readline() # header
nDone=0
with dbm.open(outfname,flag='n') as db:
for inl in inf:
m = Chem.MolFromSmiles(inl.split()[0])
nDone += 1
if m is None or m.GetNumHeavyAtoms()>60: continue
s = BRICS.BRICSDecompose(m)
for entry in s:
cnt = struct.unpack('I',db.get(entry,b'\0\0\0\0'))[0]+1
db[entry] = struct.pack('I',cnt)
res = len(db.keys())
return res
Let's get a sence of how long this might take by running one of the really small files:
import time
t1 = time.time();process_file('/scratch/RDKit_git/Data/Zinc15/split_files/zinc15_AB.smi');print("%.2f"%(time.time()-t1))
329.65
Tranche AB only has 177404 molecules (and they are small molecules), but that already takes >5 minutes. Doing the full dataset is going to take a long time....
We run the whole thing like this:
counts = dview.map_sync(process_file,infs)
However, this is a big dataset (about 110 million molecules) and that takes a couple of days to run on my machine, (using up to 6 threads). So this isn't really interactive. I started it and then came back in a couple of days to work up the data.
from collections import defaultdict
all_counts = defaultdict(int)
for i,fname in enumerate(infs):
print(i+1,fname,len(all_counts.keys()))
dbname = os.path.splitext(fname)[0] + '.dbm'
with dbm.open(dbname,flag='r') as db:
for key,val in db.items():
all_counts[key]+=struct.unpack('I',val)[0]
1 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_HG.smi 0 2 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_FG.smi 27741 3 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_JD.smi 63895 4 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_DD.smi 68815 5 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_AI.smi 105976 6 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_BE.smi 106992 7 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_KI.smi 200809 8 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_GE.smi 211119 9 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_IB.smi 220289 10 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_EA.smi 220930 11 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_IJ.smi 222217 12 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_AE.smi 231661 13 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_AJ.smi 280129 14 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_KF.smi 280251 15 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_CC.smi 283191 16 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_II.smi 283665 17 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_AA.smi 290352 18 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_CJ.smi 302215 19 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_BA.smi 322925 20 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_GJ.smi 327841 21 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_CK.smi 339096 22 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_JH.smi 347612 23 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_FJ.smi 354429 24 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_CG.smi 364413 25 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_IG.smi 442441 26 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_JE.smi 446460 27 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_DF.smi 448929 28 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_GG.smi 471638 29 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_BK.smi 481306 30 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_AH.smi 482642 31 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_FH.smi 488559 32 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_KG.smi 501133 33 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_HA.smi 504284 34 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_CA.smi 504832 35 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_DA.smi 507655 36 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_BG.smi 508613 37 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_GB.smi 553936 38 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_HH.smi 554581 39 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_CH.smi 559540 40 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_DH.smi 605358 41 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_EG.smi 620375 42 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_HD.smi 634350 43 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_JJ.smi 637471 44 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_CF.smi 643712 45 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_IE.smi 707183 46 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_EF.smi 708884 47 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_KJ.smi 721578 48 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_KB.smi 726491 49 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_HI.smi 727240 50 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_EE.smi 732017 51 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_AC.smi 742005 52 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_BD.smi 818337 53 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_FF.smi 911199 54 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_DK.smi 920448 55 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_FC.smi 926110 56 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_ED.smi 928727 57 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_JG.smi 940659 58 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_BC.smi 944324 59 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_CI.smi 984116 60 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_IA.smi 1008049 61 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_CE.smi 1008439 62 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_BJ.smi 1057189 63 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_EB.smi 1060351 64 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_IH.smi 1061424 65 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_BI.smi 1064595 66 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_JF.smi 1073556 67 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_HC.smi 1076161 68 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_FI.smi 1077058 69 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_GK.smi 1085427 70 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_IF.smi 1099430 71 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_DE.smi 1101393 72 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_KD.smi 1113785 73 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_GA.smi 1115400 74 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_AB.smi 1115797 75 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_KE.smi 1145207 76 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_EC.smi 1146402 77 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_JI.smi 1149984 78 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_EI.smi 1154170 79 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_DB.smi 1163206 80 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_AG.smi 1165219 81 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_CD.smi 1176927 82 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_AD.smi 1235508 83 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_EH.smi 1327952 84 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_GI.smi 1338431 85 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_KK.smi 1344377 86 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_KH.smi 1378532 87 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_KC.smi 1381559 88 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_AK.smi 1382380 89 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_FA.smi 1382461 90 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_HK.smi 1382863 91 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_GC.smi 1394691 92 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_DI.smi 1396047 93 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_BF.smi 1405038 94 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_FD.smi 1449423 95 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_FE.smi 1456437 96 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_HF.smi 1462561 97 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_GD.smi 1465505 98 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_CB.smi 1469685 99 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_JA.smi 1476908 100 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_DG.smi 1477427 101 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_JB.smi 1491823 102 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_JK.smi 1492147 103 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_EK.smi 1509296 104 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_BB.smi 1516098 105 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_HJ.smi 1527221 106 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_DJ.smi 1530839 107 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_KA.smi 1535850 108 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_HB.smi 1536674 109 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_ID.smi 1537094 110 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_AF.smi 1538339 111 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_JC.smi 1562125 112 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_EJ.smi 1562876 113 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_IC.smi 1568424 114 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_FB.smi 1568942 115 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_GF.smi 1569531 116 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_BH.smi 1574923 117 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_IK.smi 1594461 118 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_HE.smi 1603131 119 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_DC.smi 1605123 120 /scratch/RDKit_git/Data/Zinc15/split_files/zinc15_FK.smi 1611387
import pickle
pickle.dump(all_counts,open('../data/chemical_words_all_counts.pkl','wb+'))
itms = sorted(((y,x) for x,y in all_counts.items()),reverse=True)
len(itms)
1619397
So we got 1.6 million unique fragments from the dataset.
Let's look at the most and least frequent fragments:
itms[:10]
[(64028217, b'[5*]N[5*]'), (29499142, b'[1*]C([6*])=O'), (27159161, b'[4*]C[8*]'), (21965118, b'[3*]O[3*]'), (18976763, b'[3*]OC'), (13840866, b'[1*]C([1*])=O'), (12942186, b'[16*]c1ccccc1'), (10405361, b'[16*]c1ccc([16*])cc1'), (10357103, b'[4*]CC'), (7947669, b'[5*]N([5*])C')]
itms[-10:]
[(1, b'B1BBB[C@@H]2BB[C@@H](BBB1)B2'), (1, b'B1BBBB[C@H]2B[C@H]2BBBB1'), (1, b'B1BBBB[C@H]2B[C@@H]2BBBB1'), (1, b'B1BBBB[C@@H]2B[C@H]2BBBB1'), (1, b'B1BBBBC2BC(BBB1)B2'), (1, b'B1BBBBB/C=C\\BBBB1'), (1, b'B1BBBBB/C=C/BBBB1'), (1, b'B1=BB=C=B1'), (1, b'B12B3B4B1C234'), (1, b'B(C1CCCCC1)C1CCCCC1')]
It's nicer to actually look at the molecules:
tc,ts = zip(*itms[:10])
Draw.MolsToGridImage([Chem.MolFromSmiles(x.decode('UTF-8')) for x in ts],molsPerRow=4,legends=[str(x) for x in tc])
tc,ts = zip(*itms[-10:])
Draw.MolsToGridImage([Chem.MolFromSmiles(x.decode('UTF-8')) for x in ts],molsPerRow=4,legends=[str(x) for x in tc])
Ew... those aren't particularly nice. Good thing we won't be doing much with them.
Look at the number of times each frequency appears:
import math
def summarize(pts,maxCount=1e7):
bins=[0]*int(ceil(math.log10(maxCount)))
for cnt,smi in itms:
if cnt>maxCount:
bins[-1]+=1
else:
bins[int(floor(math.log10(cnt)))]+=1
return bins
bins=summarize(itms)
scatter(list(range(len(bins))),bins,lw=0)
_=yscale('log')
_=xlabel('log10(freq)')
_=ylabel('count')
The wikipedia says I'm not supposed to call that a power-law distribution, so I'll go with "power-law-like". Is that a thing?
Let's clean up the data a little bit: remove molecules without attachment points and then get rid of the isomeric markers that describe what type of bond was broken:
import re
expr = re.compile(r'[0-9]+\*')
clean_counts = defaultdict(int)
nRejected=0
for k,v in all_counts.items():
k = k.decode('UTF-8')
if k.find('*')<0:
nRejected +=1
continue
k = expr.sub('*',k)
clean_counts[k]+=v
clean_itms = sorted([(v,k) for k,v in clean_counts.items()],reverse=True)
print(len(clean_itms))
clean_itms[:10]
869655
[(64028217, '[*]N[*]'), (43725178, '[*]C([*])=O'), (33363742, '[*]C[*]'), (21965118, '[*]O[*]'), (18976763, '[*]OC'), (16425583, '[*]CC'), (14531526, '[*]C(=O)C[*]'), (12942186, '[*]c1ccccc1'), (10405361, '[*]c1ccc([*])cc1'), (9220870, '[*]CC[*]')]
Most frequent fragments:
tc,ts = zip(*clean_itms[:10])
Draw.MolsToGridImage([Chem.MolFromSmiles(x) for x in ts],molsPerRow=4,legends=[str(x) for x in tc])
Least frequent:
clean_itms[-10:]
[(1, '[*]C(/C=C1\\Oc2ccc([*])cc2N1C)CC'), (1, '[*]C(/C=C1/Sc2ccccc2N1C)CC'), (1, '[*]C(/C=C1/Sc2ccc(Cl)cc2N1C)CC'), (1, '[*]C(/C=C/N(C)C)C(C)(C)C'), (1, '[*]C(/C=C/N(C)C)/C=C/N(C)C'), (1, '[*]C(/C(O)=N/S(=O)(=O)c1ccc(N)cc1)/C(O)=N/S(=O)(=O)c1ccc(N)cc1'), (1, '[*]C(/C(=N/[C@H]1Cc2ccc1cc2C)c1cccc(S(=O)(=O)O)c1)/C(=N/[C@H]1Cc2ccc1cc2C)c1cccc(S(=O)(=O)O)c1'), (1, '[*]C(/C(=N/[C@H]1Cc2ccc1cc2C)c1cccc(S(=O)(=O)O)c1)/C(=N/[C@@H]1Cc2ccc1cc2C)c1cccc(S(=O)(=O)O)c1'), (1, '[*]C(/C(=N/[C@@H]1Cc2ccc1cc2C)c1cccc(S(=O)(=O)O)c1)/C(=N/[C@@H]1Cc2ccc1cc2C)c1cccc(S(=O)(=O)O)c1'), (1, '[*]C(/C(=N/O)c1ccccc1)/C(=N\\O)c1ccccc1')]
tc,ts = zip(*clean_itms[-10:])
Draw.MolsToGridImage([Chem.MolFromSmiles(x) for x in ts],molsPerRow=4,legends=[str(x) for x in tc])
Ok, so what about the 1000 most common BRICS fragments?
thousand = clean_itms[:1000]
scatter(range(1000),[x[0] for x in thousand],lw=0)
_=yscale('log')
_=ylabel('freq')
That's a steep drop off, but even the least common "word" appears >31000 times in our dataset:
thousand[-1]
(31447, '[*]n1ccc2cc([*])ccc21')
How many attachment points do those fragments have?
_=hist([y.count('*') for x,y in thousand])
Who's got 4?
tmp = [x for x in thousand if x[1].count('*')==4]
tc,ts = zip(*tmp)
Draw.MolsToGridImage([Chem.MolFromSmiles(x) for x in ts],molsPerRow=4,legends=[str(x) for x in tc])
What about 3?
tmp = [x for x in thousand if x[1].count('*')==3]
tc,ts = zip(*tmp)
Draw.MolsToGridImage([Chem.MolFromSmiles(x) for x in ts],molsPerRow=4,legends=[str(x) for x in tc])
There are some examples in there that look like duplicates. That's just happening because the removal of the isotopic labels has made some stereoceneters look like they are no longer stereocenters when the SMILES is processed:
[x for x in thousand if x[1].count('*')==3][10:12]
[(217835, '[*][C@H]([*])C[*]'), (214000, '[*][C@@H]([*])C[*]')]
We can do at least a little bit about that by sanitizing the molecules differently:
tc,ts = zip(*tmp)
tms = [Chem.MolFromSmiles(x,sanitize=False) for x in ts]
[Chem.SanitizeMol(x) for x in tms]
Draw.MolsToGridImage(tms,molsPerRow=4,legends=[str(x) for x in tc])
Still, this does highlight a minor problem with the simple string-manipulation approach to removing the isotopes from the fragments. Let's fix that problem by actually re-building the molecules. This time we'll remove stereochemistry that is no longer present when the isotope labels are removed from the dummies.
We've got a bunch of these, so this takes a while:
clean_counts2 = defaultdict(int)
nRejected=0
for k,v in all_counts.items():
k = k.decode('UTF-8')
if k.find('*')<0:
nRejected +=1
continue
k = Chem.MolToSmiles(Chem.MolFromSmiles(expr.sub('*',k)),True)
clean_counts2[k]+=v
import pickle
pickle.dump(clean_counts2,open('../data/chemical_words_clean_counts2.pkl','wb+'))
clean_itms2 = sorted([(v,k) for k,v in clean_counts2.items()],reverse=True)
print(len(clean_itms2),len(clean_itms))
thousand = clean_itms2[:1000]
858146 869655
So we did remove a few.
Let's look at what's left:
tmp = [x for x in thousand if x[1].count('*')==3]
tc,ts = zip(*tmp)
Draw.MolsToGridImage([Chem.MolFromSmiles(x) for x in ts],molsPerRow=4,legends=[str(x) for x in tc])
tc,ts = zip(*tmp)
tms = [Chem.MolFromSmiles(x) for x in ts]
Draw.MolsToGridImage(tms,molsPerRow=4,legends=[str(x) for x in tc])
Write out our thousand common chemical words, this file will be in the github repo:
outf=open('../data/thousand_words.no_iso.smi','w+')
for tc,ts in thousand:
print(ts,tc,file=outf)
I'm going to want to use the ones that are still isotopically labeled later, so write those out too. This file will also be in the github repo:
t_itms = [(v,k.decode('UTF-8')) for k,v in all_counts.items()]
iso_itms = sorted([(v,k) for v,k in t_itms if k.count('*')>0],reverse=True)
print(len(iso_itms))
thousand_iso = iso_itms[:1000]
outf=open('../data/thousand_words.iso.smi','w+')
for tc,ts in thousand_iso:
print(ts,tc,file=outf)
907375
As a teaser, let's make some new molecules:
import random
thousand_iso_frags = [Chem.MolFromSmiles(y) for x,y in thousand_iso]
new_mols = []
# the building algorithm tends to explore around a given point,
# so let's pick a number of starting points and enumerate a few molecules
# around each of them:
for i in range(5):
random.seed(0x1234+i)
brics_gen = BRICS.BRICSBuild(thousand_iso_frags)
for tm in [next(brics_gen) for x in range(4)]:
try:
Chem.SanitizeMol(tm)
except:
continue
new_mols.append(tm)
Draw.MolsToGridImage(new_mols,molsPerRow=4)
That's enough for now...