I recently came across this article by Peter Norvig. It is a very interesting enquiry into the world of spelling correction. In the spirit of learning by doing, I'm going to try to deconstruct what he does and then reuse it and experiment with it inside this notebook. First of all, let's paste the complete code by Peter Norvig in the next cell.
import re, collections
def words(text): return re.findall('[a-z]+', text.lower())
def train(features):
model = collections.defaultdict(lambda: 1)
for f in features:
model[f] += 1
return model
NWORDS = train(words(file('big.txt').read()))
alphabet = 'abcdefghijklmnopqrstuvwxyz'
def edits1(word):
splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
deletes = [a + b[1:] for a, b in splits if b]
transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
inserts = [a + c + b for a, b in splits for c in alphabet]
return set(deletes + transposes + replaces + inserts)
def known_edits2(word):
return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)
def known(words): return set(w for w in words if w in NWORDS)
def correct(word):
candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
return max(candidates, key=NWORDS.get)
--------------------------------------------------------------------------- IOError Traceback (most recent call last) <ipython-input-1-664748b5f142> in <module>() 9 return model 10 ---> 11 NWORDS = train(words(file('big.txt').read())) 12 13 alphabet = 'abcdefghijklmnopqrstuvwxyz' IOError: [Errno 2] No such file or directory: 'big.txt'
Apparently, it fails, which is no surprise given that I don't have the big.txt data file he uses to construct his dictionary of possible words (train his model, as he says). Let's do this step by step. The file we're talking about is online, so I'm just gonna use a local version of it.
%time NWORDS = train(words(file(r'files/big.txt').read()))
CPU times: user 0.47 s, sys: 0.00 s, total: 0.47 s Wall time: 0.47 s
Surprising how fast this was, given that this file is 6 MB and full of words. Let's check how many words actually end up in the dictionary.
len(NWORDS.keys())
29157
How about sampling some of these words?
NWORDS.keys()[:20]
['nunnery', 'presnya', 'woods', 'clotted', 'spiders', 'hanging', 'disobeying', 'scold', 'originality', 'grenadiers', 'pigment', 'appropriation', 'strictest', 'bringing', 'revelers', 'wooded', 'wooden', 'wednesday', 'shows', 'immunities']
I'm going to try to understand the theory Norvig is explaining in this section. We will say that we are trying to find the correction $c$, out of all possible corrections, that maximizes the probability of $c$ given the original input word $w$:
$$c_0 = \argmax_c P(c | w)$$This expression can be expanded using Bayes' theorem, whose demonstration is pasted below as a reminder.
When we expand the orignal expression, we obtain:
$$c_0 = \argmax_c \frac{P(w | c) P(c)}{P(w)}$$Assuming that all typos are equally probable for a given correct word, we can set $P(w)$ to 1 and thus ignore it.
$$c_0 = \argmax_c P(w | c) P(c)$$This leads us to an interesting expression, made up of three terms:
Below is the adapted code, using the file saved to my disk.
import re
def words(text): return re.findall('[a-z]+', text.lower())
def train(features):
model = collections.defaultdict(lambda: 1)
for f in features:
model[f] += 1
return model
NWORDS = train(words(file(r'files/big.txt').read()))
Let's explore some data that can be found in the dictionary.
for word in ['darling', 'eternal', 'flame', 'better', 'host', 'the']:
print word, NWORDS[word]
darling 39 eternal 27 flame 16 better 267 host 24 the 80031
I'm asking myself: what are the most frequent words in this dictionary? Let's use Pandas to find out.
import pandas as pd
words_df = pd.DataFrame(NWORDS.items())
words_df.head(6)
0 | 1 | |
---|---|---|
0 | nunnery | 4 |
1 | presnya | 2 |
2 | woods | 23 |
3 | clotted | 2 |
4 | spiders | 2 |
5 | hanging | 43 |
words_df.sort(columns=1).tail(5)
0 | 1 | |
---|---|---|
11184 | in | 22051 |
15454 | to | 28767 |
22090 | and | 38314 |
13587 | of | 40026 |
9392 | the | 80031 |
We now know that the 5 most frequent words in the corpus are "the", "of", "and", "to" and "in".
As Peter Norvig explains, the measure to say if a word is far from another word is to use the edit distance between them. Here, we can start considering all words that are within 1 edit of a given word for taking into account corrections. As he says:
An edit can be a deletion (remove one letter), a transposition (swap adjacent letters), an alteration (change one letter to another) or an insertion (add a letter).
Peter's function is below.
alphabet = 'abcdefghijklmnopqrstuvwxyz'
def edits1(word):
splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
deletes = [a + b[1:] for a, b in splits if b]
transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
inserts = [a + c + b for a, b in splits for c in alphabet]
return set(deletes + transposes + replaces + inserts)
Let's see what the output of this function produces.
print edits1("bloating")
set(['bloatinsg', 'bfloating', 'blozating', 'blouating', 'eloating', 'bloatzing', 'sbloating', 'cloating', 'bloatuing', 'bloaoting', 'bloatikg', 'wloating', 'bloaiting', 'bloacing', 'brloating', 'bloaeing', 'ibloating', 'bloazting', 'bloyating', 'blzoating', 'bmloating', 'bloatinlg', 'bcloating', 'blocting', 'mbloating', 'bloatineg', 'blolating', 'bloatirg', 'bjoating', 'blkoating', 'blonting', 'blooating', 'bloyting', 'byoating', 'blloating', 'bloatinzg', 'blobting', 'ybloating', 'bloatving', 'bloajting', 'bloatinwg', 'broating', 'bliating', 'bzoating', 'blofting', 'bloatqng', 'bloatigg', 'bloatyng', 'pbloating', 'bloatign', 'bloatqing', 'bgoating', 'bloatxing', 'bloatying', 'bloaiing', 'blofating', 'bloatiug', 'bolating', 'bvloating', 'blowating', 'bloiating', 'lbloating', 'bloqating', 'bloatiqng', 'blogating', 'bloatcng', 'bloatbing', 'bboating', 'blyoating', 'bloateng', 'blqating', 'zbloating', 'bloatung', 'blgating', 'bloatisng', 'bloatging', 'cbloating', 'bloatinyg', 'bloatisg', 'bloatmng', 'bloatding', 'bloatifng', 'bloatting', 'bloatiwng', 'bloatfing', 'blotating', 'bloatsng', 'bloatina', 'bloatinc', 'bloatinb', 'bloatine', 'bloatind', 'bloating', 'bloatinf', 'bloatini', 'bloatinh', 'bloatink', 'bloatinj', 'bloatinm', 'bloatinl', 'bloatino', 'bloatinn', 'bloatinq', 'bloatinp', 'bloatins', 'bloatinr', 'bloatinu', 'bloatint', 'bloatinw', 'bloatinv', 'bloatiny', 'bloatinx', 'bloatinz', 'abloating', 'bloaticg', 'blcoating', 'blovating', 'bloaring', 'jloating', 'bloaping', 'bxloating', 'lboating', 'xbloating', 'bloatiing', 'bloatjing', 'bgloating', 'bloagting', 'nbloating', 'bloatinkg', 'bloatwing', 'bllating', 'bloatixg', 'bloatwng', 'buoating', 'bloazing', 'blosting', 'bloatang', 'bloatfng', 'bloatgng', 'fbloating', 'bloatong', 'blfoating', 'bloatiqg', 'boloating', 'blating', 'bloawting', 'bloayting', 'blotaing', 'blroating', 'bloafing', 'blmating', 'bloatimg', 'rbloating', 'bloqting', 'bzloating', 'bleoating', 'blcating', 'blooting', 'bloaming', 'bloatinrg', 'bwoating', 'uloating', 'bloatiog', 'bloabting', 'bloatijg', 'bloatling', 'bloading', 'bloatning', 'bloaqting', 'blomting', 'bloatitg', 'blfating', 'bloatsing', 'boating', 'bloaking', 'blnating', 'loating', 'btloating', 'bloathing', 'bloatinog', 'buloating', 'bloatindg', 'bpoating', 'bloaxing', 'bhoating', 'bloatinqg', 'blvoating', 'ploating', 'baloating', 'bloatdng', 'bloatinug', 'bloaving', 'nloating', 'bloatiig', 'bloattng', 'bnoating', 'blodating', 'bloaxting', 'blsating', 'bloatvng', 'bloatkng', 'bloatifg', 'bloatinvg', 'bloatrng', 'blomating', 'aloating', 'bloeting', 'bloathng', 'bnloating', 'bmoating', 'bleating', 'blnoating', 'blqoating', 'bloaeting', 'dbloating', 'bloatinjg', 'btoating', 'bvoating', 'bloatiyng', 'bloatintg', 'bloateing', 'bloatring', 'bloaling', 'blotting', 'bloeating', 'blojating', 'blosating', 'bloatipg', 'bloabing', 'bloatirng', 'byloating', 'bloatnng', 'bloatincg', 'blorting', 'blhoating', 'obloating', 'oloating', 'bljoating', 'blobating', 'bloatnig', 'qloating', 'hbloating', 'bloatinxg', 'bloatlng', 'bloatipng', 'bloativng', 'blopating', 'bxoating', 'vloating', 'bloatizng', 'blouting', 'bloatieg', 'blogting', 'jbloating', 'biloating', 'bloawing', 'bloatitng', 'bloatibg', 'blowting', 'blpating', 'bwloating', 'xloating', 'bloaating', 'blwoating', 'bluating', 'bjloating', 'tbloating', 'blojting', 'bloamting', 'bloakting', 'mloating', 'blboating', 'bloatinng', 'bioating', 'bloatingc', 'bloatingb', 'bloatinga', 'bloatingg', 'bloauing', 'bloatinge', 'bloatingd', 'bloatingk', 'bloatingj', 'bloatingi', 'bloatingh', 'bloatingo', 'bloatingn', 'bloatingm', 'bloatingl', 'kbloating', 'bloatingr', 'bloatingq', 'bloatxng', 'bloatingw', 'bloatingv', 'bloatingu', 'bloatbng', 'bloatingz', 'bloatingy', 'bloatingx', 'blonating', 'blaoting', 'blorating', 'bloatixng', 'zloating', 'beoating', 'bloatibng', 'bloahing', 'bloatilg', 'tloating', 'bloiting', 'bbloating', 'bloatiag', 'bloting', 'wbloating', 'bloasing', 'blwating', 'booating', 'blxating', 'blokating', 'ebloating', 'bloalting', 'blocating', 'bloaqing', 'bdloating', 'bloatinig', 'blbating', 'bloauting', 'ubloating', 'vbloating', 'bloatijng', 'bloatiwg', 'bltating', 'bloatilng', 'bloaning', 'bloapting', 'bkoating', 'bloatzng', 'hloating', 'blolting', 'bloarting', 'bloxting', 'bfoating', 'blokting', 'iloating', 'bloatpng', 'blaating', 'blmoating', 'blhating', 'gloating', 'bloatng', 'bloatin', 'bloatinpg', 'blioating', 'bloatig', 'bloatihng', 'bloatihg', 'bqoating', 'bloacting', 'blohating', 'rloating', 'bldoating', 'beloating', 'blopting', 'bloataing', 'blxoating', 'bloatizg', 'blaoating', 'bloatinmg', 'bluoating', 'blpoating', 'bloatming', 'bloafting', 'blsoating', 'blkating', 'bloatingf', 'bsoating', 'bloatking', 'gbloating', 'bkloating', 'bloatinbg', 'bcoating', 'blzating', 'bldating', 'bloasting', 'blrating', 'sloating', 'bljating', 'lloating', 'blozting', 'bloationg', 'bloaing', 'kloating', 'bloahting', 'bloatiung', 'bloatings', 'bloadting', 'bloaying', 'bloatingp', 'yloating', 'bloanting', 'bloajing', 'bloatingt', 'bsloating', 'blyating', 'blvating', 'bloxating', 'bloaticng', 'bloatping', 'bloaging', 'bloatiang', 'bloaaing', 'bloatidng', 'bloatidg', 'bloatcing', 'bhloating', 'bploating', 'bloatinhg', 'blodting', 'bloatiyg', 'bloaoing', 'floating', 'bdoating', 'dloating', 'bqloating', 'bloatjng', 'bloavting', 'baoating', 'blohting', 'blovting', 'bloativg', 'bloatoing', 'bltoating', 'bloatieng', 'blgoating', 'bloatinag', 'bloatikng', 'bloatinfg', 'bloatimng', 'qbloating', 'bloaitng', 'bloatigng'])
That's a number of modifications of the original word! Let's see how this was done. First, let's look at the splitting done in the code. The purpose of this is to easily calculate the alterations considered as modifications of a word.
word = 'bloating'
splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
print splits
[('', 'bloating'), ('b', 'loating'), ('bl', 'oating'), ('blo', 'ating'), ('bloa', 'ting'), ('bloat', 'ing'), ('bloati', 'ng'), ('bloatin', 'g'), ('bloating', '')]
From there it's easy to code deletions of one letter:
deletes = [(a + b[1:]) for (a, b) in splits if b != '']
print deletes
['loating', 'boating', 'blating', 'bloting', 'bloaing', 'bloatng', 'bloatig', 'bloatin']
Then transpositions.
transposes = [(a + b[1] + b[0] + b[2:]) for (a, b) in splits if len(b) > 1]
print transposes
['lboating', 'bolating', 'blaoting', 'blotaing', 'bloaitng', 'bloatnig', 'bloatign']
Than alterations (replace a letter by another).
alphabet = 'abcdefghijklmnopqrstuvwxyz'
replaces = [(a + c + b[1:]) for (a, b) in splits for c in alphabet if b != '']
print replaces
['aloating', 'bloating', 'cloating', 'dloating', 'eloating', 'floating', 'gloating', 'hloating', 'iloating', 'jloating', 'kloating', 'lloating', 'mloating', 'nloating', 'oloating', 'ploating', 'qloating', 'rloating', 'sloating', 'tloating', 'uloating', 'vloating', 'wloating', 'xloating', 'yloating', 'zloating', 'baoating', 'bboating', 'bcoating', 'bdoating', 'beoating', 'bfoating', 'bgoating', 'bhoating', 'bioating', 'bjoating', 'bkoating', 'bloating', 'bmoating', 'bnoating', 'booating', 'bpoating', 'bqoating', 'broating', 'bsoating', 'btoating', 'buoating', 'bvoating', 'bwoating', 'bxoating', 'byoating', 'bzoating', 'blaating', 'blbating', 'blcating', 'bldating', 'bleating', 'blfating', 'blgating', 'blhating', 'bliating', 'bljating', 'blkating', 'bllating', 'blmating', 'blnating', 'bloating', 'blpating', 'blqating', 'blrating', 'blsating', 'bltating', 'bluating', 'blvating', 'blwating', 'blxating', 'blyating', 'blzating', 'bloating', 'blobting', 'blocting', 'blodting', 'bloeting', 'blofting', 'blogting', 'blohting', 'bloiting', 'blojting', 'blokting', 'blolting', 'blomting', 'blonting', 'blooting', 'blopting', 'bloqting', 'blorting', 'blosting', 'blotting', 'blouting', 'blovting', 'blowting', 'bloxting', 'bloyting', 'blozting', 'bloaaing', 'bloabing', 'bloacing', 'bloading', 'bloaeing', 'bloafing', 'bloaging', 'bloahing', 'bloaiing', 'bloajing', 'bloaking', 'bloaling', 'bloaming', 'bloaning', 'bloaoing', 'bloaping', 'bloaqing', 'bloaring', 'bloasing', 'bloating', 'bloauing', 'bloaving', 'bloawing', 'bloaxing', 'bloaying', 'bloazing', 'bloatang', 'bloatbng', 'bloatcng', 'bloatdng', 'bloateng', 'bloatfng', 'bloatgng', 'bloathng', 'bloating', 'bloatjng', 'bloatkng', 'bloatlng', 'bloatmng', 'bloatnng', 'bloatong', 'bloatpng', 'bloatqng', 'bloatrng', 'bloatsng', 'bloattng', 'bloatung', 'bloatvng', 'bloatwng', 'bloatxng', 'bloatyng', 'bloatzng', 'bloatiag', 'bloatibg', 'bloaticg', 'bloatidg', 'bloatieg', 'bloatifg', 'bloatigg', 'bloatihg', 'bloatiig', 'bloatijg', 'bloatikg', 'bloatilg', 'bloatimg', 'bloating', 'bloatiog', 'bloatipg', 'bloatiqg', 'bloatirg', 'bloatisg', 'bloatitg', 'bloatiug', 'bloativg', 'bloatiwg', 'bloatixg', 'bloatiyg', 'bloatizg', 'bloatina', 'bloatinb', 'bloatinc', 'bloatind', 'bloatine', 'bloatinf', 'bloating', 'bloatinh', 'bloatini', 'bloatinj', 'bloatink', 'bloatinl', 'bloatinm', 'bloatinn', 'bloatino', 'bloatinp', 'bloatinq', 'bloatinr', 'bloatins', 'bloatint', 'bloatinu', 'bloatinv', 'bloatinw', 'bloatinx', 'bloatiny', 'bloatinz']
Finally, inserts.
alphabet = 'abcdefghijklmnopqrstuvwxyz'
inserts = [(a + c + b) for (a, b) in splits for c in alphabet]
print inserts
['abloating', 'bbloating', 'cbloating', 'dbloating', 'ebloating', 'fbloating', 'gbloating', 'hbloating', 'ibloating', 'jbloating', 'kbloating', 'lbloating', 'mbloating', 'nbloating', 'obloating', 'pbloating', 'qbloating', 'rbloating', 'sbloating', 'tbloating', 'ubloating', 'vbloating', 'wbloating', 'xbloating', 'ybloating', 'zbloating', 'baloating', 'bbloating', 'bcloating', 'bdloating', 'beloating', 'bfloating', 'bgloating', 'bhloating', 'biloating', 'bjloating', 'bkloating', 'blloating', 'bmloating', 'bnloating', 'boloating', 'bploating', 'bqloating', 'brloating', 'bsloating', 'btloating', 'buloating', 'bvloating', 'bwloating', 'bxloating', 'byloating', 'bzloating', 'blaoating', 'blboating', 'blcoating', 'bldoating', 'bleoating', 'blfoating', 'blgoating', 'blhoating', 'blioating', 'bljoating', 'blkoating', 'blloating', 'blmoating', 'blnoating', 'blooating', 'blpoating', 'blqoating', 'blroating', 'blsoating', 'bltoating', 'bluoating', 'blvoating', 'blwoating', 'blxoating', 'blyoating', 'blzoating', 'bloaating', 'blobating', 'blocating', 'blodating', 'bloeating', 'blofating', 'blogating', 'blohating', 'bloiating', 'blojating', 'blokating', 'blolating', 'blomating', 'blonating', 'blooating', 'blopating', 'bloqating', 'blorating', 'blosating', 'blotating', 'blouating', 'blovating', 'blowating', 'bloxating', 'bloyating', 'blozating', 'bloaating', 'bloabting', 'bloacting', 'bloadting', 'bloaeting', 'bloafting', 'bloagting', 'bloahting', 'bloaiting', 'bloajting', 'bloakting', 'bloalting', 'bloamting', 'bloanting', 'bloaoting', 'bloapting', 'bloaqting', 'bloarting', 'bloasting', 'bloatting', 'bloauting', 'bloavting', 'bloawting', 'bloaxting', 'bloayting', 'bloazting', 'bloataing', 'bloatbing', 'bloatcing', 'bloatding', 'bloateing', 'bloatfing', 'bloatging', 'bloathing', 'bloatiing', 'bloatjing', 'bloatking', 'bloatling', 'bloatming', 'bloatning', 'bloatoing', 'bloatping', 'bloatqing', 'bloatring', 'bloatsing', 'bloatting', 'bloatuing', 'bloatving', 'bloatwing', 'bloatxing', 'bloatying', 'bloatzing', 'bloatiang', 'bloatibng', 'bloaticng', 'bloatidng', 'bloatieng', 'bloatifng', 'bloatigng', 'bloatihng', 'bloatiing', 'bloatijng', 'bloatikng', 'bloatilng', 'bloatimng', 'bloatinng', 'bloationg', 'bloatipng', 'bloatiqng', 'bloatirng', 'bloatisng', 'bloatitng', 'bloatiung', 'bloativng', 'bloatiwng', 'bloatixng', 'bloatiyng', 'bloatizng', 'bloatinag', 'bloatinbg', 'bloatincg', 'bloatindg', 'bloatineg', 'bloatinfg', 'bloatingg', 'bloatinhg', 'bloatinig', 'bloatinjg', 'bloatinkg', 'bloatinlg', 'bloatinmg', 'bloatinng', 'bloatinog', 'bloatinpg', 'bloatinqg', 'bloatinrg', 'bloatinsg', 'bloatintg', 'bloatinug', 'bloatinvg', 'bloatinwg', 'bloatinxg', 'bloatinyg', 'bloatinzg', 'bloatinga', 'bloatingb', 'bloatingc', 'bloatingd', 'bloatinge', 'bloatingf', 'bloatingg', 'bloatingh', 'bloatingi', 'bloatingj', 'bloatingk', 'bloatingl', 'bloatingm', 'bloatingn', 'bloatingo', 'bloatingp', 'bloatingq', 'bloatingr', 'bloatings', 'bloatingt', 'bloatingu', 'bloatingv', 'bloatingw', 'bloatingx', 'bloatingy', 'bloatingz']
possibilities = set(deletes + transposes + replaces + inserts)
print possibilities
set(['bloatinsg', 'bfloating', 'blozating', 'blouating', 'eloating', 'bloatzing', 'sbloating', 'cloating', 'bloatuing', 'bloaoting', 'bloatikg', 'wloating', 'bloaiting', 'bloacing', 'brloating', 'bloaeing', 'ibloating', 'bloazting', 'bloyating', 'blzoating', 'bmloating', 'bloatinlg', 'bcloating', 'blocting', 'mbloating', 'bloatineg', 'blolating', 'bloatirg', 'bjoating', 'blkoating', 'blonting', 'blooating', 'bloyting', 'byoating', 'blloating', 'bloatinzg', 'blobting', 'ybloating', 'bloatving', 'bloajting', 'bloatinwg', 'broating', 'bliating', 'bzoating', 'blofting', 'bloatqng', 'bloatigg', 'bloatyng', 'pbloating', 'bloatign', 'bloatqing', 'bgoating', 'bloatxing', 'bloatying', 'bloaiing', 'blofating', 'bloatiug', 'bolating', 'bvloating', 'blowating', 'bloiating', 'lbloating', 'bloqating', 'bloatiqng', 'blogating', 'bloatcng', 'bloatbing', 'bboating', 'blyoating', 'bloateng', 'blqating', 'zbloating', 'bloatung', 'blgating', 'bloatisng', 'bloatging', 'cbloating', 'bloatinyg', 'bloatisg', 'bloatmng', 'bloatding', 'bloatifng', 'bloatting', 'bloatiwng', 'bloatfing', 'blotating', 'bloatsng', 'bloatina', 'bloatinc', 'bloatinb', 'bloatine', 'bloatind', 'bloating', 'bloatinf', 'bloatini', 'bloatinh', 'bloatink', 'bloatinj', 'bloatinm', 'bloatinl', 'bloatino', 'bloatinn', 'bloatinq', 'bloatinp', 'bloatins', 'bloatinr', 'bloatinu', 'bloatint', 'bloatinw', 'bloatinv', 'bloatiny', 'bloatinx', 'bloatinz', 'abloating', 'bloaticg', 'blcoating', 'blovating', 'bloaring', 'jloating', 'bloaping', 'bxloating', 'lboating', 'xbloating', 'bloatiing', 'bloatjing', 'bgloating', 'bloagting', 'nbloating', 'bloatinkg', 'bloatwing', 'bllating', 'bloatixg', 'bloatwng', 'buoating', 'bloazing', 'blosting', 'bloatang', 'bloatfng', 'bloatgng', 'fbloating', 'bloatong', 'blfoating', 'bloatiqg', 'boloating', 'blating', 'bloawting', 'bloayting', 'blotaing', 'blroating', 'bloafing', 'blmating', 'bloatimg', 'rbloating', 'bloqting', 'bzloating', 'bleoating', 'blcating', 'blooting', 'bloaming', 'bloatinrg', 'bwoating', 'uloating', 'bloatiog', 'bloabting', 'bloatijg', 'bloatling', 'bloading', 'bloatning', 'bloaqting', 'blomting', 'bloatitg', 'blfating', 'bloatsing', 'boating', 'bloaking', 'blnating', 'loating', 'btloating', 'bloathing', 'bloatinog', 'buloating', 'bloatindg', 'bpoating', 'bloaxing', 'bhoating', 'bloatinqg', 'blvoating', 'ploating', 'baloating', 'bloatdng', 'bloatinug', 'bloaving', 'nloating', 'bloatiig', 'bloattng', 'bnoating', 'blodating', 'bloaxting', 'blsating', 'bloatvng', 'bloatkng', 'bloatifg', 'bloatinvg', 'bloatrng', 'blomating', 'aloating', 'bloeting', 'bloathng', 'bnloating', 'bmoating', 'bleating', 'blnoating', 'blqoating', 'bloaeting', 'dbloating', 'bloatinjg', 'btoating', 'bvoating', 'bloatiyng', 'bloatintg', 'bloateing', 'bloatring', 'bloaling', 'blotting', 'bloeating', 'blojating', 'blosating', 'bloatipg', 'bloabing', 'bloatirng', 'byloating', 'bloatnng', 'bloatincg', 'blorting', 'blhoating', 'obloating', 'oloating', 'bljoating', 'blobating', 'bloatnig', 'qloating', 'hbloating', 'bloatinxg', 'bloatlng', 'bloatipng', 'bloativng', 'blopating', 'bxoating', 'vloating', 'bloatizng', 'blouting', 'bloatieg', 'blogting', 'jbloating', 'biloating', 'bloawing', 'bloatitng', 'bloatibg', 'blowting', 'blpating', 'bwloating', 'xloating', 'bloaating', 'blwoating', 'bluating', 'bjloating', 'tbloating', 'blojting', 'bloamting', 'bloakting', 'mloating', 'blboating', 'bloatinng', 'bioating', 'bloatingc', 'bloatingb', 'bloatinga', 'bloatingg', 'bloauing', 'bloatinge', 'bloatingd', 'bloatingk', 'bloatingj', 'bloatingi', 'bloatingh', 'bloatingo', 'bloatingn', 'bloatingm', 'bloatingl', 'kbloating', 'bloatingr', 'bloatingq', 'bloatxng', 'bloatingw', 'bloatingv', 'bloatingu', 'bloatbng', 'bloatingz', 'bloatingy', 'bloatingx', 'blonating', 'blaoting', 'blorating', 'bloatixng', 'zloating', 'beoating', 'bloatibng', 'bloahing', 'bloatilg', 'tloating', 'bloiting', 'bbloating', 'bloatiag', 'bloting', 'wbloating', 'bloasing', 'blwating', 'booating', 'blxating', 'blokating', 'ebloating', 'bloalting', 'blocating', 'bloaqing', 'bdloating', 'bloatinig', 'blbating', 'bloauting', 'ubloating', 'vbloating', 'bloatijng', 'bloatiwg', 'bltating', 'bloatilng', 'bloaning', 'bloapting', 'bkoating', 'bloatzng', 'hloating', 'blolting', 'bloarting', 'bloxting', 'bfoating', 'blokting', 'iloating', 'bloatpng', 'blaating', 'blmoating', 'blhating', 'gloating', 'bloatng', 'bloatin', 'bloatinpg', 'blioating', 'bloatig', 'bloatihng', 'bloatihg', 'bqoating', 'bloacting', 'blohating', 'rloating', 'bldoating', 'beloating', 'blopting', 'bloataing', 'blxoating', 'bloatizg', 'blaoating', 'bloatinmg', 'bluoating', 'blpoating', 'bloatming', 'bloafting', 'blsoating', 'blkating', 'bloatingf', 'bsoating', 'bloatking', 'gbloating', 'bkloating', 'bloatinbg', 'bcoating', 'blzating', 'bldating', 'bloasting', 'blrating', 'sloating', 'bljating', 'lloating', 'blozting', 'bloationg', 'bloaing', 'kloating', 'bloahting', 'bloatiung', 'bloatings', 'bloadting', 'bloaying', 'bloatingp', 'yloating', 'bloanting', 'bloajing', 'bloatingt', 'bsloating', 'blyating', 'blvating', 'bloxating', 'bloaticng', 'bloatping', 'bloaging', 'bloatiang', 'bloaaing', 'bloatidng', 'bloatidg', 'bloatcing', 'bhloating', 'bploating', 'bloatinhg', 'blodting', 'bloatiyg', 'bloaoing', 'floating', 'bdoating', 'dloating', 'bqloating', 'bloatjng', 'bloavting', 'baoating', 'blohting', 'blovting', 'bloativg', 'bloatoing', 'bltoating', 'bloatieng', 'blgoating', 'bloatinag', 'bloatikng', 'bloatinfg', 'bloatimng', 'qbloating', 'bloaitng', 'bloatigng'])
len(possibilities)
442
%time edits1('bloating');
CPU times: user 0.00 s, sys: 0.00 s, total: 0.00 s Wall time: 0.00 s
Apparently, this function is very very fast. If we go one step further, and apply this function recursively to get the set of 2 edit distance words, we obtain the following.
def edits2(word):
return set(e2 for e1 in edits1(word) for e2 in edits1(e1))
However this syntax is a little tough for me. I'd rather write it like this.
one_distance_words = edits1('bloating')
two_distance_words = set([])
for word in one_distance_words:
two_distance_words = two_distance_words.union(edits1(word))
How many words are in the two distance word set from bloating?
len(two_distance_words)
90902
However, as Peter Norvig does it, we only keep the known words in the resulting set of possible words until edit distance of 2.
two_distance_words = [word for word in two_distance_words if word in NWORDS]
len(two_distance_words)
19
print(two_distance_words)
['loafing', 'clotting', 'blowing', 'beating', 'locating', 'coating', 'bleating', 'floating', 'blocking', 'blaming', 'boasting', 'loading', 'looting', 'loathing', 'blazing', 'blotting', 'blooming', 'plating', 'blasting']
test_set = set()
test_set
set([])
test_set2 = set([1, 2, 2, 3])
test_set2
set([1, 2, 3])
test_set = test_set.union(test_set2)
test_set
set([1, 2, 3])
def edits2(word):
one_distance_words = edits1(word)
two_distance_words = set([])
for word in one_distance_words:
two_distance_words = two_distance_words.union(edits1(word))
return two_distance_words
ed2 = edits2('bloating')
len(ed2)
90902
%time edits2('bloating');
CPU times: user 3.64 s, sys: 0.00 s, total: 3.64 s Wall time: 3.64 s
def known_edits2(word):
one_distance_words = edits1(word)
two_distance_words = set([])
for word in one_distance_words:
two_distance_words = two_distance_words.union(edits1(word))
return [word for word in two_distance_words if word in NWORDS]
%time known_edits2('bloating');
CPU times: user 3.72 s, sys: 0.00 s, total: 3.72 s Wall time: 3.72 s
What if we compare with Norvig's function?
def edits2_norvig(word):
return set(e2 for e1 in edits1(word) for e2 in edits1(e1))
%time edits2_norvig('bloating');
CPU times: user 0.26 s, sys: 0.00 s, total: 0.26 s Wall time: 0.26 s
Damn! Why? I have a clue: maybe it's all the unions that get done during the loop. Why not generate a huge list and produce a set at the end? In fact, that's what Peter does (and he knows how to use double loop list comprehensions).
def edits2(word):
one_distance_words = edits1(word)
two_distance_words = []
for word in one_distance_words:
two_distance_words += list(edits1(word))
return set(two_distance_words)
%time edits2('bloating');
CPU times: user 0.23 s, sys: 0.00 s, total: 0.23 s Wall time: 0.23 s
len(edits2('bloating')), len(edits2_norvig('bloating'))
(90902, 90902)
def known(words): return set(w for w in words if w in NWORDS)
def correct(word, verbose=False):
candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
if verbose:
print len(candidates)
if len(candidates) < 20:
print candidates
return max(candidates, key=NWORDS.get)
correct('treheb', verbose=True)
2 ['trees', 'tree']
'trees'
correct('narh')
'are'
What about this 'or' syntax?
print known(['brurr'])
known(['brlur']) or ['true']
set([])
['true']
bool(set([]))
False
From the above line: using this or syntax makes the preceding expression a boolean expression. Which results in an incremental sort of loop. Clever coding, but not easy to figure out for me.
While discussing this with a colleague, we figured out that the code was actually straightforward to understand. Basically, we understood the following "create a list of candidates with the keyboard input measure and use the one that has the highest frequency in the language". However, in the conceptual framework, one of the important things that wasn't really elaborated upon is the fact that the edit measure has an interesting property: it works both way. If you take the word bloat, a distance 2 word would be broad. However, if you use the word broad as input, you can also find bloat in the distance 2 words list.
print known_edits2('bloat')
['boost', 'plot', 'blest', 'lost', 'broad', 'boa', 'boast', 'bout', 'blurt', 'gloat', 'plat', 'blown', 'blows', 'boats', 'flat', 'blast', 'bolt', 'lout', 'aloft', 'blood', 'coat', 'bloom', 'clot', 'loath', 'afloat', 'brat', 'float', 'lot', 'groat', 'loot', 'sloat', 'bloated', 'blunt', 'goat', 'slot', 'cloak', 'boat', 'load', 'loaf', 'loam', 'loan', 'bloch', 'block', 'bat', 'bloke', 'bleak', 'beat', 'blow', 'blot', 'bloc', 'flout', 'boot']
print known_edits2('broad')
['crowd', 'bred', 'bryan', 'broad', 'boa', 'rod', 'browed', 'proud', 'trod', 'brow', 'broaden', 'broader', 'breed', 'read', 'broke', 'abroad', 'blood', 'bored', 'roan', 'broadly', 'brat', 'brag', 'board', 'groan', 'groat', 'dread', 'bond', 'briar', 'brian', 'boat', 'load', 'roads', 'break', 'bread', 'brown', 'brows', 'brood', 'broom', 'brook', 'tread', 'bad', 'bold', 'roar', 'roam', 'road', 'bead', 'brand']
This is something I believe to be very central and beautiful, but is something he doesn't spend time describing.
Things I learnt:
Things to think about: