I love xkcd! It reliably provides high-quality insights, humor, or both. It was a thrill for me when I got to introduce Randall Monroe when he gave a talk at Google. But in xkcd #1313,
I found that the hover text, "/bu|[rn]t|[coy]e|[mtg]a|j|iso|n[hl]|[ae]d|lev|sh|[lnd]i|[po]o|ls/ matches the last names of elected US presidents but not their opponents.", contains a confusing contradiction. There are several last names (like "Nixon") that denote both elected presidents and opponents. So no regular expression could both match and not match "Nixon". I could only assume that Randall meant for these names to be winners and not losers (and in fact he later confirmed that was the correct interpretation).
So that got me thinking: can I come up with an algorithm to find a short regex that covers the winners and not the losers?
I started by finding a page that lists winners and losers of US presidential elections through 2000. Adding the 2004-2012 results I get:
winners = set('''washington adams jefferson jefferson madison madison monroe
monroe adams jackson jackson vanburen harrison polk taylor pierce buchanan
lincoln lincoln grant grant hayes garfield cleveland harrison cleveland mckinley
mckinley roosevelt taft wilson wilson harding coolidge hoover roosevelt
roosevelt roosevelt roosevelt truman eisenhower eisenhower kennedy johnson nixon
nixon carter reagan reagan bush clinton clinton bush bush obama obama'''.split())
losers = set('''clinton jefferson adams pinckney pinckney clinton king adams
jackson adams clay vanburen vanburen clay cass scott fremont breckinridge
mcclellan seymour greeley tilden hancock blaine cleveland harrison bryan bryan
parker bryan roosevelt hughes cox davis smith hoover landon wilkie dewey dewey
stevenson stevenson nixon goldwater humphrey mcgovern ford carter mondale
dukakis bush dole gore kerry mccain romney'''.split())
We can see that there are names that are both winners and losers:
winners & losers
set(['hoover', 'jackson', 'carter', 'roosevelt', 'vanburen', 'bush', 'jefferson', 'harrison', 'cleveland', 'clinton', 'nixon', 'adams'])
Note that we are working with names, not people: Bill Clinton was a winner and George Clinton was a loser (in 1792), but they both count as "clinton"
.
To avoid a contradiction and achieve Randall's intent, eliminate all winners from the set of losers:
losers = losers - winners
Next we need to be clear exactly what we're trying to achieve. We're looking for a Python regular expression which, when used with the re.search
function, will match all of the winners but none of the losers. We can define the function verify
to verify a proposed solution (and print helpful debugging information for an incorrect solution):
import re
def verify(regex, winners, losers):
"Return true iff the regex matches all winners but no losers."
missed_winners = {w for w in winners if not re.search(regex, w)}
matched_losers = {L for L in losers if re.search(regex, L)}
if missed_winners:
print "Error: should match but did not:", ', '.join(missed_winners)
if matched_losers:
print "Error: should not match but did:", ', '.join(matched_losers)
return not (missed_winners or matched_losers)
We can use this to test xkcd's proposal:
xkcd = "bu|[rn]t|[coy]e|[mtg]a|j|iso|n[hl]|[ae]d|lev|sh|[lnd]i|[po]o|ls"
verify(xkcd, winners, losers)
Error: should not match but did: fremont
False
That's interesting—we see that the xkcd regular expression incorrectly matches "fremont"
, representing John C. Fremont, opponent of James Buchanan in 1856. Could Randall have made an error? Say it ain't so! Randall must have had Millard Fillmore listed as the opponent in the three-party election of 1856. Fillmore is of course more famous, having served as the 13th president (although he never won an election; he became president when Taylor died in office). But Fillmore only got 8 electoral votes and Fremont got 114 in 1856, so I will stick with Fremont in my list of losers.
However, we can verify that Randall got it right under the interpretation that Fillmore, not Fremont, is the opponent (note that in the algebra of sets, (A & B)
means intersection, (A | B)
means union, and (A - B)
means set difference, or the elements of A
that are not in B
):
verify(xkcd, winners, losers - {'fremont'} | {'fillmore'})
True
We need a strategy to find a regex that matches all the winners but none of the losers. I came up with this approach:
"bu"
or "n.e"
or "r.e$"
)."bu|n.e|j"
).for sure that there is at least one solution: "or" together one component for each winner.
Note that there are four ways in which this strategy can fail to find the shortest possible regex:
"a|b|c|..."
. If there is a shorter regex that is not a disjunction, we can't find it.If the results end up looking bad, we can always go back and address one or more of these four limitations.
The algorithm is below. We create an initial pool with regex_components(winners, losers)
, and will accumulate components into cover
, which starts empty. On each iteration choose the best component according to a scoring metric that allocates 3 points for each winner matched, and penalizes 1 point for each character in the regex component. (I decided on "3" because that seems to be the average length of a conjunction.) We then add the best component to cover
, and remove from winners all the strings that are matched by best
. Finally, we update the pool, keeping only those components that still match one or more of the remaining winners.
def findregex(winners, losers):
"Find a regex that matches all winners but no losers (sets of strings)."
# Make a pool of regex components, then pick from them to cover winners.
# On each iteration, add the 'best' component to 'cover',
# remove winners covered by best, and keep in 'pool' only components
# that still match some winner.
pool = regex_components(winners, losers)
cover = []
while winners:
best = max(pool, key=lambda c: 3 * len(matches(c, winners)) - len(c))
cover.append(best)
winners = winners - matches(best, winners)
pool = {c for c in pool if matches(c, winners)}
return '|'.join(cover)
What component can we pick that is guaranteed to match a winner and not match any loser? We can pick the entire winner itself, which is denoted by the regex .
Now we need to define what the regex components are. Here's what I came up with:
"^winner$"
. That way we know that each winner has at least one component that matches it.I call this regex a whole.
Example: subparts('^it$')
== {'^', 'i', 't', '$', '^i', 'it', 't$', '^it', 'it$', '^it$'}
Example: dotify('it')
== {'it', 'i.', '.t', '..'}
Note that I only used a few of the regular expression mechanisms: '.'
, '^'
, and '$'
. I didn't try to use character classes, [a-z]
, or any of the repetition operators, or other advanced features. Why? I wanted to keep it simple, plus I thought that the advanced features take too many characters, and would rarely be useful for the kinds of lists we are dealing with. One could always add more complicated features later.
def regex_components(winners, losers):
"Return components that match at least one winner, but no loser."
wholes = {'^'+winner+'$' for winner in winners}
parts = {d for w in wholes for p in subparts(w) for d in dotify(p)
if not matches(d, losers)}
return wholes | parts
def subparts(word):
"Return a set of subparts of word, consecutive characters up to length 4."
return set(word[i:i+n] for i in range(len(word)) for n in (1, 2, 3, 4))
def dotify(part):
"Return all ways to replace a subset of chars in part with '.'."
if part == '':
return {''}
else:
return {c+rest for rest in dotify(part[1:]) for c in replacements(part[0])}
def replacements(char):
"Return possible replacement characters for char (char + '.' unless char is special)."
return char if (char in '^$') else char + '.'
def matches(regex, strings):
"Return a set of all the strings that are matched by regex."
return {s for s in strings if re.search(regex, s)}
Let's see some examples of how these functions work, putting the examples into a test suite:
def test():
assert regex_components({'win'}, {'losers', 'bin', 'won'}) == {
'^win$', '^win', '^wi.', 'wi.', 'wi', '^wi', 'win$', 'win', 'wi.$'}
assert subparts('^it$') == {'^', 'i', 't', '$', '^i', 'it', 't$', '^it', 'it$', '^it$'}
assert subparts('this') == {'t', 'h', 'i', 's', 'th', 'hi', 'is', 'thi', 'his', 'this'}
assert dotify('it') == {'it', 'i.', '.t', '..'}
assert dotify('^it$') == {'^it$', '^i.$', '^.t$', '^..$'}
assert dotify('this') == {'this', 'thi.', 'th.s', 'th..', 't.is', 't.i.', 't..s', 't...',
'.his', '.hi.', '.h.s', '.h..', '..is', '..i.', '...s', '....'}
assert replacements('x') == 'x.'
assert replacements('^') == '^'
assert replacements('$') == '$'
assert matches('a|b|c', {'a', 'b', 'c', 'd', 'e'}) == {'a', 'b', 'c'}
assert matches('a|b|c', {'any', 'bee', 'succeed', 'dee', 'eee!'}) == {'any', 'bee', 'succeed'}
assert verify('a+b+', {'ab', 'aaabb'}, {'a', 'bee', 'a b'})
assert findregex({"ahahah", "ciao"}, {"ahaha", "bye"}) == 'a.$'
return 'test passes'
test()
'test passes'
Now we can finally run findregex
, and verify that it works:
findregex(winners, losers)
'a.a|a..i|j|li|a.t|i..n|bu|oo|n.e|ay.|r.e$|tr|ls|po|v.l'
solution = findregex(winners, losers)
verify(solution, winners, losers)
True
We can define a function to do this finding and verifying, and to do it in both directions (e.g. separating winners from losers and losers from winners). We will also report the number of characters in the solution and the competitive ratio of the solution: the ratio between the length of a trivial solution and the solution found (a trivial solution is disjoining the full text of all the winners).
def findboth(list1, list2):
"Find, in both directions, a regex to separate one list from the other. Print summary."
def find1(list1, list2, legend):
solution = findregex(list1, list2)
assert verify(solution, list1, list2)
trivial = '|'.join(list1)
print '%s: %2d chars, %4.1f ratio: %r' % (
legend, len(solution), len(trivial) / float(len(solution)), solution)
find1(list1, list2, '1 vs. 2')
find1(list2, list1, '2 vs. 1')
findboth(winners, losers)
1 vs. 2: 54 chars, 4.9 ratio: 'a.a|a..i|j|li|a.t|i..n|bu|oo|n.e|ay.|r.e$|tr|ls|po|v.l' 2 vs. 1: 60 chars, 4.0 ratio: '^s|^d|r.$|^.re|cc|hu|o.d|n.y|l.i|d.n$|ya|rk|oc|ss|x$|cla|^ki'
And if we look back up at panel two of xkcd 113, we find that indeed "I wrote a program that plays regex golf with arbitrary lists". Let's give it a try, first separating two lists, the top 10 boys and girls names for 2012:
boys = set('jacob mason ethan noah william liam jayden michael alexander aiden'.split())
girls = set('sophia emma isabella olivia ava emily abigail mia madison elizabeth'.split())
findboth(boys, girls)
1 vs. 2: 11 chars, 6.0 ratio: 'e.$|a.$|a.o' 2 vs. 1: 10 chars, 6.7 ratio: 'a$|^..i|ad'
Now let's try another example, separating the 2013 NFL playoff teams from the non-playoff teams:
nfl_in = set('''colts saints chargers 49ers seahawks patriots panthers broncos chiefs eagles bengals packers'''.split())
nfl_out = set('''jets dolphins bills steelers ravens browns titans jaguars texans raiders cowboys giants
redskins bears lions vikings falcons buccaneers cardinals rams'''.split())
findboth(nfl_in, nfl_out)
1 vs. 2: 24 chars, 3.8 ratio: '^p|g..s|4|nc|lt|se|ef|sa' 2 vs. 1: 21 chars, 7.1 ratio: 'ns|^.i|d|j|ee|y|m|ars'
And separating the top ten best-selling drugs from the top 10 cities to visit:
drugs = set('lipitor nexium plavix advair ablify seroquel singulair crestor actos epogen'.split())
cities = set('paris trinidad capetown riga zurich shanghai vancouver chicago adelaide auckland'.split())
findboth(drugs, cities)
1 vs. 2: 15 chars, 5.0 ratio: 'o.$|x|ir|b|q|en' 2 vs. 1: 11 chars, 7.3 ratio: 'ri|an|ca|id'
We can answer the challenge from panel one of the strip:
starwars = set('''The Phantom Menace
Attack of the Clones
Revenge of the Sith
A New Hope
The Empire Strikes Back
Return of the Jedi'''.upper().splitlines())
startrek = set('''The Wrath of Khan
The Search for Spock
The Voyage Home
The Final Frontier
The Undiscovered Country
Generations
First Contact
Insurrection
Nemesis'''.upper().splitlines())
findboth(starwars, startrek)
1 vs. 2: 9 chars, 13.0 ratio: ' T|E.P| N' 2 vs. 1: 14 chars, 10.7 ratio: 'ER|CT|H .|Y|S$'
Neat—the solution ' T|E.P| N'
is one character shorter than Randall's. This solution is equivalent to 'E.P| [TN]'
, so it shares the ' [TN]'
component. You can see why I didn't bother to put character classes (like [TN]
) in my pool of regex components: ' T| N'
is the same nunber of characters as ' [TN]'
.
We can verify that Randall's solution is correct:
verify('M | [TN]|B', starwars, startrek)
True
There are lots of examples to play with over at regex.alf.nu, like this one:
foo = set('''afoot catfoot dogfoot fanfoot foody foolery foolish fooster footage foothot footle footpad footway
hotfoot jawfoot mafoo nonfood padfoot prefool sfoot unfool'''.split())
bar = set('''Atlas Aymoro Iberic Mahran Ormazd Silipan altared chandoo crenel crooked fardo folksy forest
hebamic idgah manlike marly palazzi sixfold tarrock unfold'''.split())
findboth(foo, bar)
1 vs. 2: 3 chars, 52.3 ratio: 'f.o' 2 vs. 1: 28 chars, 5.4 ratio: 'k|r..$|.m|^..l|ld|...n|ga|es'
I assume that one is intended to find 'foo'
, not 'f.o'
. My solution might be the same number of characters, but it is smaller in the amount of ink/pixels it uses.
I hope you found this interesting, and perhaps you can find ways to improve my algorithm, or more interesting lists to apply it to. I found it was fun to play with, and I hope this page gives you an idea of how to address problems like this.
Thanks to Randall for starting this all off, and to Davide Canton and Stefan Pochmann for providing suggestions to improve my code.