A recent post on Patrick Vennebush blog Math Jokes for 4 Mathy Folks asserted that the rule of thumb "I before E except after C" was "total bullshit." This got me thinking: the "I before E except after C" rule was almost certainly developed without any research at all, just based on the subjective experience of educators. It's not a horrible rule, but certainly we can more intelligently construct better rules of this kind (for lack of a better term, I'll be refering to these as "trigram rules"). We have the technology, and I have nothing better to do with my night off :)
To start with, we're going to need a corpus of words (i.e. a vocabulary) to analyze. I'll be using the wordlist in the NLTK package.
from nltk.corpus import words, brown
import string
from collections import Counter, defaultdict
from itertools import combinations
from __future__ import division
corpus = set(w.lower() for w in words.words())
print len(corpus)
233621
This analysis will proceed by identifying all of the trigrams in my corpus, and then decomposing each trigram into its starting letter and the trailing bigram. For each bigram in the corpus, we want to count the occurence of the bigram, and the frequency with which that bigram is preceded by each respective "starting letter" in all occurences of trigrams. My favorite datatype for this sort of operation is a defaultdict
initialized by Counter
s. What this means is that if I key into my dict for an key that isn't already there, the dict will initialize that key pointing to an empty Counter
as the value. This way, I can key into my object for each occurence of a bigram and update the counter corresponding to the associated trigram's starting letter.
You'll notice that I refer to this data structure as a "trie" although I'm not positive this is technically accurate. At the very least, it's similar to a trie.
trigram_trie = defaultdict(Counter)
Next, I'll define a simple function to tokenize an input word into ngrams. I could specify a function to just trigram tokenize, but it's easy enough to generalize this to any ngram. I haven't gotten into it yet, but maybe at a later date I'll want to repeat this experiment with 4-grams or something instead of just focusing on trigrams. In any event, having a simple ngram tokenizing function handy certainly can't hurt.
def ngram_tokenize(word, n=3):
m = len(word)
grams = []
for i in range(m-n+1):
j = i + n
grams.append(word[i:j])
return grams
Now that we've got all of the necessary machinery in place, let's process our corpus and tally our trigram rules (unigram->bigram relationshps)
# Process trigrams as unigram->bigram
for word in corpus:
for trigram in ngram_tokenize(word, n=3):
start, bigram = trigram[0], trigram[1:]
trigram_trie[bigram].update([start])
# Count occurence of bigrams
stats = Counter()
for k,v in trigram_trie.iteritems():
stats[k] = sum(v.values())
Just to make my life a little easier, I define a function that prints some simple information for a given rule.
# Investigate what makes something a rule
def print_rule_stats(rule, trigram_trie=trigram_trie, stats=stats):
"""
Print summary statistics for a trigram rule
"""
ab,c = rule
a,b = ab
ba = b+a
print ab, stats[ab], '\t', c+ab, trigram_trie[ab][c]
print ba, stats[ba], '\t', c+ba, trigram_trie[ba][c]
#print c+ab, trigram_trie[ab][c]
#print c+ba, trigram_trie[ba][c]
Our analysis corroborates the research that the "'i' before 'e' except after 'c'" rule is not effective from the naive perspective of frequency in the vocabulary (it might be valid if we consider the occurence of the trigram in written language and not just the vocabulary -- since the usage of different words is absolutely not uniform -- but we haven't investigated this). We can contrast this with a better trigram rule to get a sense of what we're looking for.
print "Weak rule"
print_rule_stats(['ie','c'])
print "\nStrong rule"
print_rule_stats(['ie','a'])
Weak rule ie 3950 cie 256 ei 2607 cei 156 Strong rule ie 3950 aie 16 ei 2607 aei 44
Given a trigram rule of the form "'a' then 'b' except after 'c'," the strength of the rule is relative to the extent with which it exhibits the following properties:
To rank rules, we'll need some mechanism of scoring them. A simple method I've found reasonably effective for the purposes of this experiment is to filter out rules where the ratio of count(ab)/count(ba) is below some threshhold, and then use the inverse ratio of trigram occurences as the score: count(cba)/count(cab). In our best rules, the trigram cab
may not occur in our vocabulary at all, so to mitigate division-by-zero issues, we can Laplace smooth the trigram occurence ratio by adding 1 to both the numerator and the denominator. Our scoring function then becomes:
def score_rule(rule, k=1/2):
ab,c = rule
a,b = ab
ba=b+a
if stats[ab]/(stats[ab] + stats[ba]) > k:
return (trigram_trie[ba][c]+1)/(trigram_trie[ab][c]+1)
else:
return 0
With the notion of a trigram rule defined and a scoring function in place, we can pipeline the rule identification, scoring and ranking process into a function to allow us to easily identify and rank all trigram rules in a given vocabulary/corpus.
def extract_trigram_rules(corpus, scoring_function=score_rule):
"""
Extracts "I before E except after C" style 'trigram rules' from an input corpus,
ranked by the specified scoring function.
"""
# Process trigrams as unigram->bigram
trigram_trie = defaultdict(Counter)
for word in corpus:
for trigram in ngram_tokenize(word, n=3):
start, bigram = trigram[0], trigram[1:]
trigram_trie[bigram].update([start])
# Tally bigrams
stats = Counter()
for k,v in trigram_trie.iteritems():
stats[k] = sum(v.values())
special_cases = []
for a, b in combinations(string.lowercase,2):
#print a, b
comb1 = a+b
comb2 = b+a
if stats[comb2] > stats[comb1]:
comb1, comb2 = comb2, comb1
for c in trigram_trie[comb2].keys():
if trigram_trie[comb2][c] > trigram_trie[comb1][c]:
rule = [comb1, str(c)]
special_cases.append([scoring_function(rule), rule])
special_cases.sort(reverse=True)
return special_cases, trigram_trie, stats
Here are the top 20 trigram rules. The notation ['ab','c']
should be interpreted as "a
before b
except after c
."
Although the highest ranked rule is "P before E, except after C," my favorite is the second one: "E before U, except after Q."
vocab_rules, trigram_trie, stats = extract_trigram_rules(corpus, scoring_function=lambda x:score_rule(x, k=1/2))
vocab_rules[0]
for score, rule in vocab_rules[:5]:
print rule
print_rule_stats(rule, trigram_trie, stats)
print
['pe', 'c'] pe 8052 cpe 0 ep 5053 cep 955 ['eu', 'q'] eu 2620 qeu 0 ue 1981 que 949 ['ic', 'i'] ic 26140 iic 1 ci 6561 ici 1830 ['te', 'm'] te 27265 mte 2 et 11743 met 2684 ['rd', 'n'] rd 3641 nrd 0 dr 2738 ndr 808
If we modify the scoring heuristic to filter on a more aggressive cutoff threshold for the bigram ratio, different rules get highlighted.
vocab_rules, trigram_trie, stats = extract_trigram_rules(corpus, scoring_function=lambda x:score_rule(x, k=3/4))
vocab_rules[0]
for score, rule in vocab_rules[:5]:
print rule
print_rule_stats(rule, trigram_trie, stats)
print
['ic', 'i'] ic 26140 iic 1 ci 6561 ici 1830 ['ns', 's'] ns 6442 sns 0 sn 1121 ssn 298 ['og', 'a'] og 7578 aog 2 go 2459 ago 619 ['ex', 'e'] ex 1688 eex 0 xe 398 exe 201 ['ng', 'n'] ng 13024 nng 1 gn 1767 ngn 373
The above analysis determined trigram rules that are useful across the english vocabulary. But really, the aim of these rules is to help us with our day-to-day spelling, so it might be more useful to conduct the analysis scoring trigrams based not on their appearance in unique words, but in their actual occurence in written English. For this purpose, a useful corpus of reasonably size is the Brown corpus.
brown_corpus = [w.lower() for w in brown.words()]
print "Total words:", len(brown_corpus) # plus a few grammatical tokens
print "Unique words:", len(set(brown_corpus))
print "\nMost common tokens:"
Counter(brown_corpus).most_common(10)
Total words: 1161192 Unique words: 49815 Most common tokens:
[(u'the', 69971), (u',', 58334), (u'.', 49346), (u'of', 36412), (u'and', 28853), (u'to', 26158), (u'a', 23195), (u'in', 21337), (u'that', 10594), (u'is', 10109)]
brown_rules, brown_trigrams, brown_stats = extract_trigram_rules(brown_corpus)
for score, rule in brown_rules[:5]:
print rule
print_rule_stats(rule, brown_trigrams, brown_stats)
print
['pe', 'c'] pe 12235 cpe 0 ep 6038 cep 891 ['ic', 'i'] ic 24054 iic 0 ci 7575 ici 1592 ['te', 'm'] te 39262 mte 2 et 15913 met 1785 ['rd', 'n'] rd 7135 nrd 0 dr 1305 ndr 377 ['iv', 'i'] iv 9448 iiv 0 vi 6956 ivi 1874
We've established that the "I before E" rule doesn't hold up when we just look at the vocabulary, but in the context of the distribution of words in written language is the rule salvageable?
print_rule_stats(['ie','c'], brown_trigrams, brown_stats)
ie 13275 ei 5677 cie 1310 cei 485
We've been lied to for so long... I don't know what's real anymore.