xkcd 1313: Regex Golf

Peter Norvig
January 2014

I love ♡ xkcd! It reliably provides top-quality insights, humor, or both. It was a thrill for me when I got to introduce Randall Monroe for his talk at Google. But in xkcd #1313,

I found that the hover text, "/bu|[rn]t|[coy]e|[mtg]a|j|iso|n[hl]|[ae]d|lev|sh|[lnd]i|[po]o|ls/ matches the last names of elected US presidents but not their opponents", contains a confusing contradiction. There are several last names (like "Nixon") that denote both elected presidents and opponents. So no regular expression could both match and not match "Nixon". I could only assume that Randall meant for these names to be winners and not losers (and in fact he later confirmed that was the correct interpretation).

So that got me thinking: can I come up with an algorithm to find a short regex that matches the winners and not the losers?

I started by finding a page that lists presidential elections through 2000. Adding the 2004-2012 results I get these lists of winners and losers:

In [29]:
def words(text): return set(text.lower().split())

winners = words('''washington adams jefferson jefferson madison madison monroe 
    monroe adams jackson jackson van-buren harrison polk taylor pierce buchanan 
    lincoln lincoln grant grant hayes garfield cleveland harrison cleveland mckinley
    mckinley roosevelt taft wilson wilson harding coolidge hoover roosevelt 
    roosevelt roosevelt roosevelt truman eisenhower eisenhower kennedy johnson nixon 
    nixon carter reagan reagan bush clinton clinton bush bush obama obama''')

losers = words('''clinton jefferson adams pinckney pinckney clinton king adams 
    jackson adams clay van-buren van-buren clay cass scott fremont breckinridge 
    mcclellan seymour greeley tilden hancock blaine cleveland harrison bryan bryan 
    parker bryan roosevelt hughes cox davis smith hoover landon wilkie dewey dewey 
    stevenson stevenson nixon goldwater humphrey mcgovern ford carter mondale 
    dukakis bush dole gore kerry mccain romney''')

We can see that there are names that are both winners and losers:

In [2]:
print(winners & losers)
set(['hoover', 'jackson', 'carter', 'roosevelt', 'bush', 'jefferson', 'harrison', 'cleveland', 'clinton', 'nixon', 'van-buren', 'adams'])

Note that we are working with names, not people: Bill Clinton was a winner and George Clinton (the Revolutionary War leader, not the Funkadelic leader) was a loser, but they both count as 'clinton'.

To avoid a contradiction and achieve Randall's intent, eliminate all winners from the set of losers:

In [30]:
losers = losers - winners

Defining the Problem

Next we need to be clear exactly what we're trying to achieve. We're looking for a Python regular expression which, when used with the re.search function, will match all of the winners but none of the losers. We can define the function verify to return True if a regex does that; if not it returns False and prints all the winners/losers it gets wrong:

In [31]:
import re

def verify(regex, winners, losers):
    "Return true iff the regex matches all winners but no losers."
    missed_winners = {W for W in winners if not re.search(regex, W)}
    matched_losers = {L for L in losers if re.search(regex, L)}
    if missed_winners:
        print "Error: should match but did not:", ', '.join(missed_winners)
    if matched_losers:
        print "Error: should not match but did:", ', '.join(matched_losers)
    return not (missed_winners or matched_losers)

We can use this to test xkcd's regex:

In [5]:
xkcd = "bu|[rn]t|[coy]e|[mtg]a|j|iso|n[hl]|[ae]d|lev|sh|[lnd]i|[po]o|ls"
verify(xkcd, winners, losers)
Error: should not match but did: fremont

Out[5]:
False

That's interesting—we see that the xkcd regular expression incorrectly matches "fremont", representing John C. Fremont, opponent of James Buchanan in 1856. Could Randall have made an error? Is someone wrong on the Internet? Investigating the 1856 election, I see that Randall must have had Millard Fillmore, the third-party candidate, listed as the opponent. Fillmore is of course more famous, having served as the 13th president (although he never won an election; he became president when Taylor died in office). But Fillmore only got 8 electoral votes in 1856 and Fremont got 114, so I will stick with Fremont in my list of losers.

However, we can verify that Randall got it right under the interpretation that Fillmore, not Fremont, is the opponent. (Note: in the algebra of sets, (A & B) means intersection, (A | B) means union, and (A - B) means set difference, or the elements of A that are not in B.)

In [6]:
verify(xkcd, winners, losers - {'fremont'} | {'fillmore'})
Out[6]:
True

Strategy for Finding a Regex

We need a strategy to find a regex that matches all the winners but none of the losers. I came up with this approach:

  • Generate a pool of component regexs (small regexs of a few characters, such as "bu" or "n.e" or "r.e$").
  • Each component regex must match at least one winner, but no losers.
  • Our solution will be formed by "or"-ing together some of these components (e.g. "bu|n.e|j").
  • This is a set cover problem: pick some of the components so that when we "or" them together they cover all the winners.
  • Set cover is an NP-hard problem, so I feel justified in using an approximation approach that finds a small but not necessarily smallest solution.
  • A good approximation can be had with a greedy algorithm: Pick the "best" component first (the one that covers the most winners with the fewest characters), and repeat, choosing the "best" each time until there are no more winners to cover.
  • To guarantee that we will find a solution, make sure that each winner has at least one component that matches it, but matches no losers.

Note that there are four ways in which this strategy can fail to find the shortest possible regex:

  • We are limiting the solution to be a disjunction: a regex of the form "a|b|c|...". If there is a shorter regex that is not a disjunction, we can't find it.
  • We start with a restricted pool of components regexes. If a best regex component is not in the pool, we can't find it.
  • The greedy algorithm is inherently sub-optimal on some inputs.
  • The tradeoff I make between the number of winners covered by a component (more is good) and the number of characters in the component (more is bad) may be the wrong tradeoff.

If the solutions end up looking bad, we can always go back and address one or more of these four issues.

The algorithm is below. We create an initial pool with regex_components(winners, losers), and will accumulate components into the list solution, which starts empty. On each iteration choose the best component: the one with a maximum score. (I decided to score 4 points for each winner matched minus one point for each character in the component. The function matches(regex, strings) returns a set of the strings that match.) We then add the best component to solution, and remove from winners all the strings that are matched by best. Finally, we update the pool, keeping only those components that still match one or more of the remaining winners. When there are no more winners left, OR together all the solution components to give the final regular expression string.

In [32]:
def findregex(winners, losers):
    "Find a regex that matches all winners but no losers (sets of strings)."
    # Make a pool of regex components, then pick from them to cover winners.
    # On each iteration, add the 'best' component to 'solution',
    # remove winners covered by best, and keep in 'pool' only components
    # that still match some winner.
    pool = regex_components(winners, losers)
    solution = []
    def score(r): return 4 * len(matches(r, winners)) - len(r)
    while winners:
        best = max(pool, key=score)
        solution.append(best)
        winners = winners - matches(best, winners)
        pool = {r for r in pool if matches(r, winners)}
    return OR(solution)
In [33]:
def matches(regex, strings):
    "Return a set of all the strings that are matched by regex."
    return {s for s in strings if re.search(regex, s)}

OR = '|'.join # Join a sequence of strings with '|' between them

Glossary

In this notebook the following terminology is used:

  • component: A small regular expression, a string, such as 'bu' or 'a.a'.
  • losers: A set of strings; our solution is not allowed to match any of them.
  • part: A component that matches part of at least one winner.
  • pool: A set of components from which we will pick a subset to form the solution.
  • regex: A regular expression; a pattern used to match against a string. Both components and solutions are regular expressions.
  • solution: A regular expression formed by OR-ing components together: 'bu|a.a'. Must match winners but not losers.
  • whole: A component that matches a whole word (and nothing else): '^word$'
  • winners: A set of string; our solution is required to match each of them.

Regex Components

Now we need to define what the regex_components are. Here's what I came up with:

  • For each winner, include a regex that matches the entire string exactly. I call this regex a whole.
    Example: for "winner", include "^winner$"
  • For each whole, generate subparts consisting of 1 to 4 consecutive characters.
    Example: subparts('^it$') == {'^', 'i', 't', '$', '^i', 'it', 't$', '^it', 'it$', '^it$'}
  • For each subpart, generate all ways to replace any of the letters with a dot (the "match any" character).
    Example: dotify('it') == {'it', 'i.', '.t', '..'}
  • Keep only the subparts that do not match any of the losers.

Note that I only used a few of the regular expression mechanisms: '.', '^', and '$'. I didn't try to use character classes, (e.g., [a-z]), nor any of the repetition operators, or other advanced mechanisms. Why? I wanted to keep it simple, and I thought that the advanced features take too many characters. I can always add more complicated mechanisms later. Here is the code:

In [34]:
def regex_components(winners, losers):
    "Return components that match at least one winner, but no loser."
    wholes = {'^'+winner+'$' for winner in winners}
    parts = {d for w in wholes for s in subparts(w) for d in dotify(s)}
    return wholes | {p for p in parts if not matches(p, losers)}

def subparts(word):
    "Return a set of subparts of word, consecutive characters up to length 4."
    return set(word[i:i+n] for i in range(len(word)) for n in (1, 2, 3, 4)) 

def dotify(part):
    "Return all ways to replace a subset of chars in part with '.'."
    if part == '':
        return {''}  
    else:
        return {c+rest for rest in dotify(part[1:]) 
                for c in replacements(part[0])}
       
def replacements(char):
    "Return replacement characters for char (char + '.' unless char is special)."
    if (char == '^' or char == '$'):
        return char
    else:
        return char + '.'

Our program is complete!

But to make sure we understand what eachof these subfunctions does, and to help us debug any problems, here's a test suite:

In [37]:
def test1():
    assert subparts('^it$') == {'^', 'i', 't', '$', '^i', 'it', 't$', '^it', 'it$', '^it$'}
    assert subparts('this') == {'t', 'h', 'i', 's', 'th', 'hi', 'is', 'thi', 'his', 'this'}
    assert dotify('it') == {'it', 'i.', '.t', '..'}
    assert dotify('^it$') == {'^it$', '^i.$', '^.t$', '^..$'}
    assert dotify('this') == {'this', 'thi.', 'th.s', 'th..', 't.is', 't.i.', 't..s', 't...',
                              '.his', '.hi.', '.h.s', '.h..', '..is', '..i.', '...s', '....'}
    assert replacements('x') == 'x.'
    assert replacements('^') == '^'
    assert replacements('$') == '$'
    assert regex_components({'win'}, {'losers', 'bin', 'won'}) == {
        '^win$', '^win', '^wi.', 'wi.',  'wi', '^wi', 'win$', 'win', 'wi.$'}
    assert regex_components({'win'}, {'losers', 'bin', 'won', 'wine'}) == {
        '^win$', 'win$', 'wi.$'}
    assert matches('a|b|c', {'a', 'b', 'c', 'd', 'e'}) == {'a', 'b', 'c'}
    assert matches('a|b|c', {'any', 'bee', 'succeed', 'dee', 'eee!'}) == {
        'any', 'bee', 'succeed'}
    assert verify('a+b+', {'ab', 'aaabb'}, {'a', 'bee', 'a b'})
    assert findregex({"ahahah", "ciao"},  {"ahaha", "bye"}) == 'a.$' 
    assert OR(['a', 'b', 'c']) == 'a|b|c'
    assert OR(['a']) == 'a'
    assert words('This is a TEST this is') == {'this', 'is', 'a', 'test'}
    assert lines('Testing / 1 2 3 / Testing over') == {'TESTING', '1 2 3', 'TESTING OVER'}
    return 'test1 passes'

test1()
Out[37]:
'test1 passes'

Now we can finally run findregex, verify that it works, and compare the length of our solution to Randall's:

In [38]:
findregex(winners, losers)
Out[38]:
'a.a|a..i|j|oo|a.t|i..o|i..n|bu|n.e|ay.|r.e$|tr|po|v.l'
In [39]:
solution = findregex(winners, losers)
verify(solution, winners, losers)
Out[39]:
True
In [41]:
print len(solution), len(xkcd)
53 63

Ours is 10 characters shorter. So we've answered the challenge in the hover text; what else can we do?

Regex Golf with Arbitrary Lists

We can define a convenience function to do this finding and verifying, and we might as well do it in both directions (e.g. separating winners from losers and losers from winners). We will also report the number of characters in the solution and the competitive ratio of the solution: the ratio between the length of a trivial solution and the solution found (a trivial solution for the set of winners {'one', 'two', 'three'} is the disjunction '^(one|two|three)$').

In [42]:
def findboth(A, B):
    "Find a regex to match A but not B, and vice-versa.  Print summary."
    for (W, L, legend) in [(A, B, 'A-B'), (B, A, 'B-A')]:
        solution = findregex(W, L)
        assert verify(solution, W, L)
        ratio = len('^(' + OR(W) + ')$') / float(len(solution))
        print '%3d chars, %4.1f ratio, %2d winners %s: %r' % (
            len(solution), ratio , len(W), legend, solution)
In [43]:
findboth(winners, losers)
 53 chars,  5.0 ratio, 34 winners A-B: 'a.a|a..i|j|oo|a.t|i..o|i..n|bu|n.e|ay.|r.e$|tr|po|v.l'
 60 chars,  4.1 ratio, 34 winners B-A: '^s|^d|r.$|^.re|cc|hu|o.d|n.y|l.i|d.n$|ya|rk|oc|ss|x$|cla|^ki'

Let's try the top 10 boys and girls names for 2012:

In [16]:
boys  = words('jacob mason ethan noah william liam jayden michael alexander aiden')
girls = words('sophia emma isabella olivia ava emily abigail mia madison elizabeth')

findboth(boys, girls)
 11 chars,  6.4 ratio, 10 winners A-B: 'e.$|a.$|a.o'
 10 chars,  7.1 ratio, 10 winners B-A: 'a$|^..i|ad'

We have now fulfilled panel two of the strip. Let's try another example, separating the 2013 NFL playoff teams from the non-playoff teams:

In [17]:
nfl_in = words('colts saints chargers 49ers seahawks patriots panthers broncos chiefs eagles bengals packers')
nfl_out = words('''jets dolphins bills steelers ravens browns titans jaguars texans raiders cowboys giants 
  redskins bears lions vikings falcons buccaneers cardinals rams''')

findboth(nfl_in, nfl_out)
 24 chars,  4.0 ratio, 12 winners A-B: '^p|g..s|4|nc|lt|fs|wk|sa'
 21 chars,  7.3 ratio, 20 winners B-A: 'ns|^.i|d|j|ee|y|m|ars'

And separating the top ten best-selling drugs from the top 10 cities to visit:

In [18]:
drugs = words('lipitor nexium plavix advair ablify seroquel singulair crestor actos epogen')
cities = words('paris trinidad capetown riga zurich shanghai vancouver chicago adelaide auckland')

findboth(drugs, cities)
 15 chars,  5.3 ratio, 10 winners A-B: 'o.$|x|ir|b|q|en'
 11 chars,  7.6 ratio, 10 winners B-A: 'ri|an|ca|id'

We can answer the challenge from panel one of the strip:

In [19]:
def lines(text): return {line.upper().strip() for line in text.split('/')}

starwars = lines('''The Phantom Menace / Attack of the Clones / Revenge of the Sith /
A New Hope / The Empire Strikes Back / Return of the Jedi''')

startrek = lines('''The Wrath of Khan / The Search for Spock / The Voyage Home /
The Final Frontier / The Undiscovered Country / Generations / First Contact /
Insurrection / Nemesis''')

findboth(starwars, startrek)
  9 chars, 13.0 ratio,  6 winners A-B: ' T|E.P| N'
 13 chars, 11.5 ratio,  9 winners B-A: 'ER|CT|H |Y|IS'

Neat—the solution ' T|E.P| N' is one character shorter than Randall's. This solution is equivalent to 'E.P| [TN]', so it shares the ' [TN]' component. You can see why I didn't bother to put character classes (like [TN]) in my pool of regex components: ' T| N' is the same nunber of characters as ' [TN]'.

We can verify that Randall's solution is correct:

In [20]:
verify('M | [TN]|B', starwars, startrek)
Out[20]:
True

There are lots of examples to play with over at regex.alf.nu, like this one:

In [21]:
foo = words('''afoot catfoot dogfoot fanfoot foody foolery foolish fooster 
    footage foothot footle footpad footway hotfoot jawfoot mafoo nonfood padfoot 
    prefool sfoot unfool''')

bar = words('''Atlas Aymoro Iberic Mahran Ormazd Silipan altared chandoo crenel 
    crooked fardo folksy forest hebamic idgah manlike marly palazzi sixfold 
    tarrock unfold''')

findboth(foo, bar)
  3 chars, 53.7 ratio, 21 winners A-B: 'f.o'
 28 chars,  5.5 ratio, 21 winners B-A: 'r..$|k|.m|^..l|ld|...n|ga|es'

I assume that one is intended to find 'foo', not 'f.o'. My solution might be the same number of characters, 3, but it is smaller in the amount of ink/pixels it uses.

What To Do Next?

I see three options:

  • Stop here and declare victory! Yay!
  • Try to make the program run faster (so that it can handle bigger sets of winners and losers).
  • Try to find shorter regexes.

My first inclination was "stop here", and that's what this notebook will do (a few paragraphs from now). But several correspondents offered very interesting suggestions, so I returned to the problem in a second notebook. You'll see some tricks for making code run faster, and some new algorithms to find better (shorter) regexes.

Summary

That was fun! I hope this page gives you an idea of how to think about problems like this. Let's review what we did:

  • Found an interesting problem (in a comic strip) and realized that it would not be hard to write a program to solve the problem.
  • Wrote the function verify to prove that we really understand exactly what the problem is.
  • Came up with an approach: create lots of component regexes, and "or" together a subset of them.
  • Realized that this is an instance of a known problem, set cover.
  • Since set cover is computationally expensive, decide to use a greedy algorithm, which will be efficient (although not optimal).
  • Decided what goes into the pool of component regexes (see the function component_regexes).
  • Implemented an algorithm to greedily pick components from the pool (the function findregex).
  • Tried the algorithm on some examples.
  • Declared victory!

Thanks

Thanks to Randall Monroe for inspiring me to do this, to regex.alf.nu for inspiring Randall to do his strip, and to Davide Canton, Thomas Breuel, and Stefan Pochmann for providing suggestions to improve my code.

Complete Program

Here is the complete program (without the data; you can apply it to any sets of strings you want):

In [26]:
import re

def verify(regex, winners, losers):
    "Return true iff the regex matches all winners but no losers."
    missed_winners = {W for W in winners if not re.search(regex, W)}
    matched_losers = {L for L in losers if re.search(regex, L)}
    if missed_winners:
        print "Error: should match but did not:", ', '.join(missed_winners)
    if matched_losers:
        print "Error: should not match but did:", ', '.join(matched_losers)
    return not (missed_winners or matched_losers)

def findregex(winners, losers):
    "Find a regex that matches all winners but no losers (sets of strings)."
    # Make a pool of regex components, then pick from them to cover winners.
    # On each iteration, add the 'best' component to 'solution',
    # remove winners covered by best, and keep in 'pool' only components
    # that still match some winner.
    pool = regex_components(winners, losers)
    solution = []
    def score(r): return 4 * len(matches(r, winners)) - len(r)
    while winners:
        best = max(pool, key=score)
        solution.append(best)
        winners = winners - matches(best, winners)
        pool = {r for r in pool if matches(r, winners)}
    return OR(solution)

def matches(regex, strings):
    "Return a set of all the strings that are matched by regex."
    return {s for s in strings if re.search(regex, s)}

OR = '|'.join # Join a sequence of strings with '|' between them

def regex_components(winners, losers):
    "Return components that match at least one winner, but no loser."
    wholes = {'^'+winner+'$' for winner in winners}
    parts = {d for w in wholes for s in subparts(w) for d in dotify(s)}
    return wholes | {p for p in parts if not matches(p, losers)}

def subparts(word):
    "Return a set of subparts of word, consecutive characters up to length 4."
    return set(word[i:i+n] for i in range(len(word)) for n in (1, 2, 3, 4)) 

def dotify(part):
    "Return all ways to replace a subset of chars in part with '.'."
    if part == '':
        return {''}  
    else:
        return {c+rest for rest in dotify(part[1:]) 
                for c in replacements(part[0])}
       
def replacements(char):
    "Return replacement characters for char (char + '.' unless char is special)."
    if (char == '^' or char == '$'):
        return char
    else:
        return char + '.'
    
def words(text): return set(text.lower().split())

def lines(text): return {line.upper().strip() for line in text.split('/')}

################ Data

winners = words('''washington adams jefferson jefferson madison madison monroe 
    monroe adams jackson jackson van-buren harrison polk taylor pierce buchanan 
    lincoln lincoln grant grant hayes garfield cleveland harrison cleveland mckinley
    mckinley roosevelt taft wilson wilson harding coolidge hoover roosevelt 
    roosevelt roosevelt roosevelt truman eisenhower eisenhower kennedy johnson nixon 
    nixon carter reagan reagan bush clinton clinton bush bush obama obama''')

losers = words('''clinton jefferson adams pinckney pinckney clinton king adams 
    jackson adams clay van-buren van-buren clay cass scott fremont breckinridge 
    mcclellan seymour greeley tilden hancock blaine cleveland harrison bryan bryan 
    parker bryan roosevelt hughes cox davis smith hoover landon wilkie dewey dewey 
    stevenson stevenson nixon goldwater humphrey mcgovern ford carter mondale 
    dukakis bush dole gore kerry mccain romney''')

starwars = lines('''The Phantom Menace / Attack of the Clones / Revenge of the Sith /
    A New Hope / The Empire Strikes Back / Return of the Jedi''')

startrek = lines('''The Wrath of Khan / The Search for Spock / The Voyage Home /
    The Final Frontier / The Undiscovered Country / Generations / First Contact /
    Insurrection / Nemesis''')

def findboth(A, B):
    "Find a regex to match A but not B, and vice-versa.  Print summary."
    for (W, L, legend) in [(A, B, 'A-B'), (B, A, 'B-A')]:
        solution = findregex(W, L)
        assert verify(solution, W, L)
        ratio = len('^(' + OR(W) + ')$') / float(len(solution))
        print '%3d chars, %4.1f ratio, %2d winners %s: %r' % (
            len(solution), ratio , len(W), legend, solution)
        
################ Tests
        
def test1():
    assert subparts('^it$') == {'^', 'i', 't', '$', '^i', 'it', 't$', '^it', 'it$', '^it$'}
    assert subparts('this') == {'t', 'h', 'i', 's', 'th', 'hi', 'is', 'thi', 'his', 'this'}
    assert dotify('it') == {'it', 'i.', '.t', '..'}
    assert dotify('^it$') == {'^it$', '^i.$', '^.t$', '^..$'}
    assert dotify('this') == {'this', 'thi.', 'th.s', 'th..', 't.is', 't.i.', 't..s', 't...',
                              '.his', '.hi.', '.h.s', '.h..', '..is', '..i.', '...s', '....'}
    assert replacements('x') == 'x.'
    assert replacements('^') == '^'
    assert replacements('$') == '$'
    assert regex_components({'win'}, {'losers', 'bin', 'won'}) == {
        '^win$', '^win', '^wi.', 'wi.',  'wi', '^wi', 'win$', 'win', 'wi.$'}
    assert regex_components({'win'}, {'losers', 'bin', 'won', 'wine'}) == {
        '^win$', 'win$', 'wi.$'}
    assert matches('a|b|c', {'a', 'b', 'c', 'd', 'e'}) == {'a', 'b', 'c'}
    assert matches('a|b|c', {'any', 'bee', 'succeed', 'dee', 'eee!'}) == {
        'any', 'bee', 'succeed'}
    assert verify('a+b+', {'ab', 'aaabb'}, {'a', 'bee', 'a b'})
    assert findregex({"ahahah", "ciao"},  {"ahaha", "bye"}) == 'a.$' 
    assert OR(['a', 'b', 'c']) == 'a|b|c'
    assert OR(['a']) == 'a'
    assert words('This is a TEST this is') == {'this', 'is', 'a', 'test'}
    assert lines('Testing / 1 2 3 / Testing over') == {'TESTING', '1 2 3', 'TESTING OVER'}
    return 'test1 passes'

if __name__ == '__main__':
    print test1()
test passes


Peter Norvig, Jan. 2014

Back to top