xkcd 1313: Regex Golf (Part 2: Infinite Problems)

Peter Norvig
with Stefan Pochmann
February 2014

This Again?

Last month I demonstrated a simple Regex Golf program. I thought that was the end of the topic, but Stefan Pochmann sent me a series of emails suggesting some impressive improvements. (Davide Canton and Thomas Breuel also had good suggestions.) So this post is an exploration of some of Stefan's new ideas, and some of my own. I'll try to show how to think about solving a problem, and how to use exploratory programming to test out ideas.

To review, here's the program from part 1 (with minor modifications):

In [1]:
import re
from __future__ import division
from __builtin__ import any, all, sum # (because ipython imports the numpy versions)

def verify(regex, winners, losers):
    "Return true iff the regex matches all winners but no losers."
    missed_winners = {W for W in winners if not re.search(regex, W)}
    matched_losers = {L for L in losers if re.search(regex, L)}
    if missed_winners:
        print "Error: should match but did not:", ', '.join(missed_winners)
    if matched_losers:
        print "Error: should not match but did:", ', '.join(matched_losers)
    return not (missed_winners or matched_losers)

def findregex(winners, losers):
    "Find a regex that matches all winners but no losers (sets of strings)."
    # Make a pool of regex components, then pick from them to cover winners.
    # On each iteration, add the 'best' component to 'solution',
    # remove winners covered by best, and keep in 'pool' only components
    # that still match some winner.
    pool = regex_components(winners, losers)
    solution = []
    def score(r): return 4 * len(matches(r, winners)) - len(r)
    while winners:
        best = max(pool, key=score)
        solution.append(best)
        winners = winners - matches(best, winners)
        pool = {r for r in pool if matches(r, winners)}
    return OR(solution)

def matches(regex, strings):
    "Return a set of all the strings that are matched by regex."
    return {s for s in strings if re.search(regex, s)}

OR  = '|'.join # Join a sequence of strings with '|' between them

def regex_components(winners, losers):
    "Return components that match at least one winner, but no loser."
    wholes = {'^'+winner+'$' for winner in winners}
    parts = {d for w in wholes for s in subparts(w) for d in dotify(s)}
    return wholes | {p for p in parts if not matches(p, losers)}

def subparts(word):
    "Return a set of subparts of word, consecutive characters up to length 4."
    return set(word[i:i+n] for i in range(len(word)) for n in (1, 2, 3, 4)) 

def dotify(part):
    "Return all ways to replace a subset of chars in part with '.'."
    if part == '':
        return {''}  
    else:
        return {c+rest for rest in dotify(part[1:]) 
                for c in replacements(part[0])}
       
def replacements(char):
    "Return replacement characters for char (char + '.' unless char is special)."
    if (char == '^' or char == '$'):
        return char
    else:
        return char + '.'
    
def words(text): 
    "Return a set of all words in the text, lowercased."
    return frozenset(text.lower().split())

def phrases(text, sep='/'): 
    "Return a set of all 'sep'-separated phrases in text, uppercased."
    return frozenset(line.strip() for line in text.upper().split(sep))

Here are some "arbitrary lists" (see panel two of the comic) which we will be using to test out the code. (Note that I made these frozen sets (immutable objects) this time, because they are constants that shouldn't be changed.)

In [2]:
winners = words('''washington adams jefferson jefferson madison madison monroe 
    monroe adams jackson jackson van-buren harrison polk taylor pierce buchanan 
    lincoln lincoln grant grant hayes garfield cleveland harrison cleveland mckinley
    mckinley roosevelt taft wilson wilson harding coolidge hoover roosevelt 
    roosevelt roosevelt roosevelt truman eisenhower eisenhower kennedy johnson nixon 
    nixon carter reagan reagan bush clinton clinton bush bush obama obama''')

losers = words('''clinton jefferson adams pinckney pinckney clinton king adams 
    jackson adams clay van-buren van-buren clay cass scott fremont breckinridge 
    mcclellan seymour greeley tilden hancock blaine cleveland harrison bryan bryan 
    parker bryan roosevelt hughes cox davis smith hoover landon wilkie dewey dewey 
    stevenson stevenson nixon goldwater humphrey mcgovern ford carter mondale 
    dukakis bush dole gore kerry mccain romney''') - winners

boys = words('jacob mason ethan noah william liam jayden michael alexander aiden')
girls = words('sophia emma isabella olivia ava emily abigail mia madison elizabeth')

nfl_in = words('colts saints chargers 49ers seahawks patriots panthers broncos chiefs eagles bengals packers')
nfl_out = words('''jets dolphins bills steelers ravens browns titans jaguars texans raiders cowboys giants 
    redskins bears lions vikings falcons buccaneers cardinals rams''')

pharma = words('lipitor nexium plavix advair ablify seroquel singulair crestor actos epogen')
cities = words('paris trinidad capetown riga zurich shanghai vancouver chicago adelaide auckland')

foo = words('''afoot catfoot dogfoot fanfoot foody foolery foolish fooster footage foothot footle footpad footway 
    hotfoot jawfoot mafoo nonfood padfoot prefool sfoot unfool''')
bar = words('''Atlas Aymoro Iberic Mahran Ormazd Silipan altared chandoo crenel crooked fardo folksy forest 
    hebamic idgah manlike marly palazzi sixfold tarrock unfold''')

nouns = words('''time year people way day man thing woman life child world school 
    state family student group country problem hand part place case week company 
    system program question work government number night point home water room 
    mother area money story fact month lot right study book eye job word business 
    issue side kind head house service friend father power hour game line end member 
    law car city community name president team minute idea kid body information 
    back parent face others level office door health person art war history party result 
    change morning reason research girl guy moment air teacher force education''')
adverbs = words('''all particularly just less indeed over soon course still yet before 
    certainly how actually better to finally pretty then around very early nearly now 
    always either where right often hard back home best out even away enough probably 
    ever recently never however here quite alone both about ok ahead of usually already 
    suddenly down simply long directly little fast there only least quickly much forward 
    today more on exactly else up sometimes eventually almost thus tonight as in close 
    clearly again no perhaps that when also instead really most why ago off 
    especially maybe later well together rather so far once''') - nouns  
verbs = words('''ask believe borrow break bring buy can be able cancel change clean
    comb complain cough count cut dance draw drink drive eat explain fall
    fill find finish fit fix fly forget give go have hear hurt know learn
    leave listen live look lose make do need open close shut organise pay
    play put rain read reply run say see sell send sign sing sit sleep
    smoke speak spell spend stand start begin study succeed swim take talk
    teach tell think translate travel try turn off turn on type understand
    use wait wake up want watch work worry write''') - nouns

randoms = frozenset(vars(random))
builtins = frozenset(vars(__builtin__)) - randoms

starwars = phrases('''The Phantom Menace / Attack of the Clones / Revenge of the Sith /
    A New Hope / The Empire Strikes Back / Return of the Jedi''')

startrek = phrases('''The Wrath of Khan / The Search for Spock / The Voyage Home /
    The Final Frontier / The Undiscovered Country / Generations / First Contact /
    Insurrection / Nemesis''')

dogs = phrases(''''Labrador Retrievers / German Shepherd Dogs / Golden Retrievers / Beagles / Bulldogs / 
    Yorkshire Terriers / Boxers / Poodles / Rottweilers / Dachshunds / Shih Tzu / Doberman Pinschers / 
    Miniature Schnauzers / French Bulldogs / German Shorthaired Pointers / Siberian Huskies / Great Danes / 
    Chihuahuas / Pomeranians / Cavalier King Charles Spaniels / Shetland Sheepdogs / Australian Shepherds / 
    Boston Terriers / Pembroke Welsh Corgis / Maltese / Mastiffs / Cocker Spaniels / Havanese / 
    English Springer Spaniels / Pugs / Brittanys / Weimaraners / Bernese Mountain Dogs / Vizslas / Collies / 
    West Highland White Terriers / Papillons / Bichons Frises / Bullmastiffs / Basset Hounds / 
    Rhodesian Ridgebacks / Newfoundlands / Russell Terriers / Border Collies / Akitas / 
    Chesapeake Bay Retrievers / Miniature Pinschers / Bloodhounds / St. Bernards / Shiba Inu / Bull Terriers / 
    Chinese Shar-Pei / Soft Coated Wheaten Terriers / Airedale Terriers / Portuguese Water Dogs / Whippets / 
    Alaskan Malamutes / Scottish Terriers / Australian Cattle Dogs / Cane Corso / Lhasa Apsos / 
    Chinese Crested / Cairn Terriers / English Cocker Spaniels / Dalmatians / Italian Greyhounds / 
    Dogues de Bordeaux / Samoyeds / Chow Chows / German Wirehaired Pointers / Belgian Malinois / 
    Great Pyrenees / Pekingese / Irish Setters / Cardigan Welsh Corgis / Staffordshire Bull Terriers / 
    Irish Wolfhounds / Old English Sheepdogs / American Staffordshire Terriers / Bouviers des Flandres / 
    Greater Swiss Mountain Dogs / Japanese Chin / Tibetan Terriers / Brussels Griffons / 
    Wirehaired Pointing Griffons / Border Terriers / English Setters / Basenjis / Standard Schnauzers / 
    Silky Terriers / Flat-Coated Retrievers / Norwich Terriers / Afghan Hounds / Giant Schnauzers / Borzois / 
    Wire Fox Terriers / Parson Russell Terriers / Schipperkes / Gordon Setters / Treeing Walker Coonhounds''')
cats = phrases('''Abyssinian / Aegean cat / Australian Mist / American Curl / American Bobtail / 
    American Polydactyl / American Shorthair / American Wirehair / Arabian Mau / Asian / Asian Semi-longhair / 
    Balinese / Bambino / Bengal / Birman / Bombay / Brazilian Shorthair / British Shorthair / British Longhair / 
    Burmese / Burmilla / California Spangled Cat / Chantilly/Tiffany / Chartreux / Chausie / Cheetoh / 
    Colorpoint Shorthair / Cornish Rex / Cymric / Cyprus cat / Devon Rex / Donskoy or Don Sphynx / Dragon Li / 
    Dwelf / Egyptian Mau / European Shorthair / Exotic Shorthair / German Rex / Havana Brown / Highlander / 
    Himalayan-Colorpoint Persian / Japanese Bobtail / Javanese / Khao Manee / Korat / Korn Ja / 
    Kurilian Bobtail / LaPerm / Maine Coon / Manx / Mekong bobtail / Minskin / Munchkin / Nebelung / Napoleon / 
    Norwegian Forest Cat / Ocicat / Ojos Azules / Oregon Rex / Oriental Bicolor / Oriental Shorthair / 
    Oriental Longhair / Persian / Peterbald / Pixie-bob / Ragamuffin / Ragdoll / Russian Blue / Russian Black / 
    Sam Sawet / Savannah / Scottish Fold / Selkirk Rex / Serengeti cat / Serrade petit / Siamese / Siberian / 
    Singapura / Snowshoe / Sokoke / Somali / Sphynx / Swedish forest cat / Thai / Tonkinese / Toyger / 
    Turkish Angora / Turkish Van / Ukrainian Levkoy / York Chocolate Cat''')

And here we show how it works:

In [3]:
findregex(starwars, startrek)
Out[3]:
' T|E.P| N'
In [4]:
verify(' T|E.P| N', starwars, startrek)
Out[4]:
True

Plan of Attack

To improve the program, I'll take the following steps:

  • Profiling: Figure out where the program spends its time.
  • Speedup: Make the program faster.
  • Benchmarking: Run the program over pairs of arbitrary lists to see how fast it is and how short the solutions are.
  • Studying: Learn something by looking at the benchmark results.
  • Searching: Introduce a better search algorithm.
  • Eliminating Components: Get rid of components that can't possibly be part of an optimal solution.
  • Adding Components: Add new types of components to allow new, shorter solutions.
  • Randomizing Search: Randomness allows us to explore different parts of the search space.
  • Speculating: Think about what we could do next.

Profiling

Let's see how long it takes to separate the top 100 adverbs from top 100 nouns. We can use iPython's magic incantation %time:

In [5]:
%time findregex(adverbs, nouns)
CPU times: user 7.22 s, sys: 290 ms, total: 7.51 s
Wall time: 7.29 s

Out[5]:
'a.l|^..$|st$|et|re$|tl|^al|en$|ver$|nl|^a.o|ui|ll|no|^to|l.s|ard|a.a|o.n$|i..e|f$|ls|yb|ns|rh|wh|hu|ah|nc|^v|imp|hat|muc|far|out|^ra|ow$|ur.|lon|bot|lat'

On my computer it was 7 seconds. I have some ideas for how to make this faster, but over the years I've learned an important lesson: don't waste effort trying to speed up parts of your program that don't take much of the total time. I'll use the cProfile module to see where the time goes:

In [6]:
import cProfile

cProfile.run('findregex(adverbs, nouns)', sort='cumulative')
         15837886 function calls (15832965 primitive calls) in 10.975 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   10.975   10.975 <string>:1(<module>)
        1    0.001    0.001   10.975   10.975 <ipython-input-1-4d0ed760084b>:15(findregex)
    53900    0.056    0.000   10.854    0.000 <ipython-input-1-4d0ed760084b>:31(matches)
    53900    1.362    0.000   10.799    0.000 <ipython-input-1-4d0ed760084b>:33(<setcomp>)
  2831603    2.095    0.000    9.437    0.000 re.py:139(search)
  2831603    2.383    0.000    6.275    0.000 re.py:226(_compile)
       41    0.016    0.000    5.031    0.123 {max}
    25183    0.039    0.000    5.015    0.000 <ipython-input-1-4d0ed760084b>:23(score)
       41    0.024    0.001    4.677    0.114 <ipython-input-1-4d0ed760084b>:28(<setcomp>)
    52999    0.222    0.000    3.223    0.000 sre_compile.py:495(compile)
    52999    0.210    0.000    1.605    0.000 sre_parse.py:663(parse)
    52999    0.109    0.000    1.306    0.000 sre_compile.py:480(_code)
        1    0.000    0.000    1.261    1.261 <ipython-input-1-4d0ed760084b>:37(regex_components)
        1    0.004    0.004    1.230    1.230 <ipython-input-1-4d0ed760084b>:41(<setcomp>)
    52999    0.107    0.000    1.163    0.000 sre_parse.py:301(_parse_sub)
  2831603    1.067    0.000    1.067    0.000 {method 'search' of '_sre.SRE_Pattern' objects}
    52999    0.423    0.000    1.014    0.000 sre_parse.py:379(_parse)
    52999    0.494    0.000    0.958    0.000 sre_compile.py:361(_compile_info)
  2831603    0.577    0.000    0.577    0.000 {method 'get' of 'dict' objects}
   309666    0.165    0.000    0.474    0.000 sre_parse.py:201(get)
   362665    0.339    0.000    0.397    0.000 sre_parse.py:182(__next)
    52999    0.302    0.000    0.346    0.000 sre_parse.py:140(getwidth)
    52999    0.191    0.000    0.233    0.000 sre_compile.py:32(_compile)
  1080196    0.156    0.000    0.156    0.000 {method 'append' of 'list' objects}
    52999    0.050    0.000    0.138    0.000 sre_parse.py:178(__init__)
   203668    0.099    0.000    0.131    0.000 sre_parse.py:138(append)
  1074046    0.112    0.000    0.112    0.000 {len}
   105998    0.066    0.000    0.080    0.000 sre_compile.py:474(isstring)
    52999    0.048    0.000    0.048    0.000 {_sre.compile}
   105998    0.044    0.000    0.044    0.000 {min}
   158997    0.041    0.000    0.041    0.000 {isinstance}
    52999    0.033    0.000    0.033    0.000 sre_parse.py:67(__init__)
        1    0.003    0.003    0.031    0.031 <ipython-input-1-4d0ed760084b>:40(<setcomp>)
    52999    0.030    0.000    0.030    0.000 sre_parse.py:90(__init__)
7052/2131    0.008    0.000    0.026    0.000 <ipython-input-1-4d0ed760084b>:47(dotify)
    52999    0.023    0.000    0.023    0.000 sre_parse.py:195(match)
    68130    0.021    0.000    0.021    0.000 {method 'extend' of 'list' objects}
     4921    0.015    0.000    0.019    0.000 <ipython-input-1-4d0ed760084b>:52(<setcomp>)
   140650    0.017    0.000    0.017    0.000 {ord}
      529    0.011    0.000    0.011    0.000 {method 'clear' of 'dict' objects}
    52999    0.009    0.000    0.009    0.000 {method 'items' of 'dict' objects}
    10829    0.003    0.000    0.003    0.000 <ipython-input-1-4d0ed760084b>:55(replacements)
       98    0.001    0.000    0.002    0.000 <ipython-input-1-4d0ed760084b>:43(subparts)
     2874    0.001    0.000    0.001    0.000 <ipython-input-1-4d0ed760084b>:45(<genexpr>)
       98    0.000    0.000    0.000    0.000 {range}
        1    0.000    0.000    0.000    0.000 <ipython-input-1-4d0ed760084b>:39(<setcomp>)
        1    0.000    0.000    0.000    0.000 {method 'join' of 'str' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}



We can see that about 99% of the cumulative time was spent in matches. And 85% of the time in matches goes to calling re.search. So my first thoughts are:

  • Can we make each call to matches run faster?
  • Can we make fewer calls to matches?
  • Can we use re.search more effectively?
  • Nothing else matters. 99% of the time is in matches; don't waste effort speeding up anything else.

Speedup: Matching Faster by Compiling Regexes

Here's one way to use re.search more effectively (and thus make matches faster): The re package uses strings to specify regular expressions, but internally, when we call re.search on a string, that string is compiled, with the function re.compile, into an object that has a search method, which is then called. If you need to search many times with the same regex, then it is faster to call re.compile (and then call the search method on the compiled object), rather than calling re.search.

One thing to be careful of: the re module manages a cache of recent compilations. When we do our timing measurements we will call re.purge() first to clear the cache, to make sure that we don't get misleading time measurements (because some of the regex-compilation work was already done by a previous match).

In [7]:
def matches(regex, strings):
    "Return a set of all the strings that are matched by regex."
    searcher = re.compile(regex).search
    return {s for s in strings if searcher(s)}
In [8]:
re.purge();
%time findregex(adverbs, nouns)
CPU times: user 3.39 s, sys: 80.3 ms, total: 3.47 s
Wall time: 3.41 s

Out[8]:
'a.l|^..$|st$|et|re$|tl|^al|en$|ver$|nl|^a.o|ui|ll|no|^to|l.s|ard|a.a|o.n$|i..e|f$|ls|yb|ns|rh|wh|hu|ah|nc|^v|imp|hat|muc|far|out|^ra|ow$|ur.|lon|bot|lat'

That was twice as fast! Not bad for adding one line.

Here's one thing about optimizing Python code that is not an issue with, say, C code: there is overhead in executing Python statements, because they go through the Python byte code interpreter. There is no such overhead for the built-in functions (which are written in C), so it often is faster to replace statement logic with built-in functions. We can compare {s for s in strings if searcher(s)} with a call to the built-in function filter. Here I will use %timeit% which runs the statement repeatedly in a loop, repeats the loop 3 times, and reports the resulting best time:

In [9]:
searcher = re.compile('^a.o').search
strings = adverbs

%timeit {s for s in strings if searcher(s)}
%timeit set(filter(searcher, strings))
10000 loops, best of 3: 39.9 µs per loop
10000 loops, best of 3: 32.5 µs per loop

So filter is about 18% faster. We'll keep it:

In [10]:
def matches(regex, strings):
    "Return a set of all the strings that are matched by regex."
    searcher = re.compile(regex).search
    return set(filter(searcher, strings))
In [11]:
re.purge();
%time findregex(adverbs, nouns)
CPU times: user 3.24 s, sys: 159 ms, total: 3.4 s
Wall time: 3.28 s

Out[11]:
'a.l|^..$|st$|et|re$|tl|^al|en$|ver$|nl|^a.o|ui|ll|no|^to|l.s|ard|a.a|o.n$|i..e|f$|ls|yb|ns|rh|wh|hu|ah|nc|^v|imp|hat|muc|far|out|^ra|ow$|ur.|lon|bot|lat'

I've run out of ways to make an individual call to matches faster, so let's try to make fewer matches.

Speedup: Minimizing matches Against Losers

I will start by minimizing the number of matches against losers. In regex_components, for each part, p, we test if not matches(p, losers). This applies searcher to every loser and builds up a set of all those that match. It would be faster if we could exit early: as soon as we find one loser that matches, we don't need to apply searcher to the remaining losers. We could do something like this:

def matches_any(regex, strings):
    searcher = re.compile(regex).search
    return any(searcher(s) for s in strrings)

But Stefan Pochmann came up with a better idea. If we concatenate the list of losers together into one big string, then we can call re.search only once (for each part p) on that big string. If a part matches the big string anywhere, the call will return immediately, and we will know the part is no good. We have effectively moved the work from Python statements into a builtin function, which should be faster (but we will measure to be sure).

There is one catch: because we allow '^' and '$' in our regex parts, we'll have to arrange that the start of each loser matches '^' and the end matches '$'. We can do that by concatenating together the losers with a newline between them, and by using the re.MULTILINE option to say that '^' and '$' match at newlines. (See the re module documentation if this is unfamiliar.)

Here it is, with before-and-after timing:

In [12]:
%timeit re.purge(); regex_components(winners, losers)  
10 loops, best of 3: 137 ms per loop

In [13]:
def regex_components(winners, losers):
    "Return components that match at least one winner, but no loser."
    losers_str = '\n'.join(losers)
    def no_losers(part): return not re.compile(part, re.MULTILINE).search(losers_str)
    wholes = {'^'+winner+'$' for winner in winners}
    parts = {d for w in wholes for p in subparts(w) for d in dotify(p)}
    return wholes | set(filter(no_losers, parts))
In [14]:
%timeit re.purge(); regex_components(winners, losers)
10 loops, best of 3: 116 ms per loop

Speedup: Minimizing matches Against Winners

Here's what findregex currently looks like:

In [15]:
def findregex(winners, losers):
    "Find a regex that matches all winners but no losers (sets of strings)."
    # Make a pool of regex components, then pick from them to cover winners.
    # On each iteration, add the 'best' component to 'solution',
    # remove winners covered by best, and keep in 'pool' only components
    # that still match some winner.
    pool = regex_components(winners, losers)
    solution = []
    def score(r): return 4 * len(matches(r, winners)) - len(r)
    while winners:
        best = max(pool, key=score)
        solution.append(best)
        winners = winners - matches(best, winners)
        pool = {r for r in pool if matches(r, winners)}
    return OR(solution)

Notice that we call matches twice for every component regex in the pool on every iteration of the main loop. If there are 50 iterations of the loop, and 1000 components, that's 100,000 calls to matches. Instead of repeating all these calls, I propose that, for each component regex in the pool, we limit ourselves to this:

  • One re.compile.
  • One re.search against each of the winners.
  • One re.search against the big string of losers.

How can we make sure we don't repeat searches? By doing the search and caching the result. During initialization, we create a dict, which we will call covers, such that covers[r] rerturns the set of all winners that match the regex r. From then on, we look in the dict rather than repeating a search. The new function regex_covers replaces the old function regex_components:

In [16]:
def regex_covers(winners, losers):
    """Generate regex components and return a dict of {regex: {winner...}}.
    Each regex matches at least one winner and no loser."""
    losers_str = '\n'.join(losers)
    wholes = {'^'+winner+'$' for winner in winners}
    parts = {d for w in wholes for p in subparts(w) for d in dotify(p)}
    pool = wholes | parts
    searchers = [re.compile(c, re.MULTILINE).search for c in pool]
    return {r: set(filter(searcher, winners)) 
            for (r, searcher) in zip(pool, searchers)
            if not searcher(losers_str)}

Now we rewrite findregex to call regex_covers, and replace the global matches function with a local function that looks into the covers cache. Notice that we can't just return the cached set of winners, covers[r], we have to go through it and find only those elements that are still in the current set of remaining winners.

In [17]:
def findregex(winners, losers):
    "Find a regex that matches all winners but no losers (sets of strings)."
    # Initialize covers = {regex: {winner,...}} for a large set of regex components.
    # On each iteration, add the 'best' component to 'solution',
    # remove winners covered by best, and keep in 'pool' only components
    # that still match some winner.
    covers = regex_covers(winners, losers)
    pool = set(covers)
    solution = []
        
    def matches(r, winners): return {w for w in covers[r] if w in winners}
    
    def score(r): return 4 * len(matches(r, winners)) - len(r)
    
    while winners:
        best = max(pool, key=score)
        solution.append(best)
        winners = winners - covers[best]
        pool = {r for r in pool if matches(r, winners)}
    return OR(solution)
In [18]:
re.purge()
%time findregex(adverbs, nouns)
CPU times: user 316 ms, sys: 48.9 ms, total: 365 ms
Wall time: 327 ms

Out[18]:
'a.l|^..$|st$|et|re$|tl|^al|en$|ver$|nl|^a.o|ui|ll|no|^to|l.s|ard|a.a|o.n$|i..e|f$|ls|yb|ns|rh|wh|hu|ah|nc|^v|imp|hat|muc|far|out|^ra|ow$|ur.|lon|bot|lat'

Wow! That's approximately 10 times faster than the previous version of findregex, and 20 times faster than the first version! But I don't want to draw too many conclusions from just this one example.

Benchmarking

We will benchmark the program on a suite of 20 different problems, and summarize the results. For each problem we'll print the number of characters in the solution, the number of winners covered, the time it took, and the solution. And we'll store solutions in the dict SOLUTION, because it might be useful to have them.

In [19]:
import time

ALL = [ # An example has two sets of strings, each with a one-character initial identifying it.
    (winners, 'W', 'L', losers),
    (boys, 'B', 'G', girls),
    (nfl_in, 'I', 'O', nfl_out),
    (pharma, 'P', 'C', cities),
    (foo, 'F', 'B', bar),
    (starwars, 'W', 'T', startrek),
    (nouns, 'N', 'A', adverbs),
    (nouns, 'N', 'V', verbs),
    (randoms, 'R', 'B', builtins),
    (dogs, 'D', 'C', cats)]

ALL = [d for datum in ALL for d in (datum, datum[::-1])] # Add in the reverse versions

SOLUTION = {} # Remember solutions; SOLUTION[W, L] will hold a regex
               
def benchmark(examples=ALL): 
    "Run examples; print summaries; return total of solution lengths."
    total = 0
    for (W, Wname, Lname, L) in examples:
        re.purge()
        t0 = time.time()
        solution = SOLUTION[W, L] = findregex(W, L)
        t1 = time.time()
        assert verify(solution, W, L)
        print '%3d ch %3d win %5.2f s %s-%s: %r' % (
            len(solution), len(W), t1-t0, Wname, Lname, solution)
        total += len(solution)
    return total
In [20]:
%time benchmark()
 53 ch  34 win  0.17 s W-L: 'a.a|a..i|j|oo|a.t|i..o|i..n|bu|n.e|ay.|r.e$|tr|po|v.l'
 60 ch  34 win  0.14 s L-W: '^s|^d|r.$|^.re|cc|hu|ney|o.d|l.i|d.n$|ya|rk|oc|ss|x$|cla|^ki'
 11 ch  10 win  0.04 s B-G: 'e.$|a.$|a.o'
 10 ch  10 win  0.04 s G-B: 'a$|^..i|ad'
 24 ch  12 win  0.06 s I-O: 'pa|g..s|4|nc|lt|fs|wk|sa'
 21 ch  20 win  0.08 s O-I: 'ns|^.i|d|j|ee|y|m|ars'
 15 ch  10 win  0.05 s P-C: 'o.$|x|ir|b|q|en'
 11 ch  10 win  0.05 s C-P: 'ri|an|ca|id'
  3 ch  21 win  0.06 s F-B: 'f.o'
 27 ch  21 win  0.09 s B-F: 'r..$|k|or|^..l|ld|...n|b|ga'
  9 ch   6 win  0.07 s W-T: ' T|E.P| N'
 14 ch   9 win  0.10 s T-W: 'ER|N$|F.R|Y|IS'
170 ch 100 win  0.32 s N-A: 'm.n|a.e$|m$|in.$|en.$|ch.|^w.r|io|^g|id|s.n|o.y|..k$|mb|lt|rc|an|me$|s.u|i.y|^wa|hou|f.c|rt$|^.ig|ot.e|tr|b$|eo|if|ca|rv|ey|po|a$|yea|v.l|^da|or$|rty|aw$|ot$|f.t|ir$|^hea'
152 ch  98 win  0.31 s A-N: 'a.l|^..$|st$|et|re$|tl|^al|en$|ver$|nl|^a.o|ui|ll|no|^to|l.s|ard|a.a|o.n$|i..e|f$|ls|yb|ns|rh|wh|hu|ah|nc|^v|imp|hat|muc|far|out|^ra|ow$|ur.|lon|bot|lat'
174 ch 100 win  0.34 s N-V: 'm.n|er$|am|es|^.ar|ho|.on|or.$|id|o.l|f.c|..h.|h..d|ye|dy|ir|^ar|t.m|m..n|b..k|...u|h.ng|j|gu|fr|fe|lt|rv|rc|oi|we|ot|ud|ci|ki|roo|.tr|day|or$|w.y|ase|^la|tat|ine|^en|lac|lev'
177 ch  94 win  0.25 s V-N: '^..l|ak|v.$|^..$|s..n|^fi|a.n|.pe|o.g|^tr|ut$|lo..|w..t|...w|ep|nk|nc|ed$|ur.|^s.n|tc|f$|eg|ru|sw|br|cl|^u|rr|fl|ke|sk|sa|ee$|it$|p.y|tar|l.y|buy|at$|o.b|can|re.d|a.h$|hear|unt$'
 95 ch  64 win  0.30 s R-B: 'a.._|..om|no|r.nd|be|f$|es$|t.st|ei|ia|o$|u..t|ch..|o..s|ga|ld|do|_f|ee|we|hu|Te|if|^la|np$|^sa'
140 ch 141 win  0.70 s B-R: 'E|g$|a.s|I|co|^f.|tr$|^o|^re|u.$|..p$|n.e$|rt|^a|^..x|hr|it|mo|li|ho|up|e.u|l.a|in$|^int|N|ty|cf|ha|ea|w$|ro|fe|bo|id|pr|ev|^le|ct$|ir$|set$'
 76 ch 100 win  0.64 s D-C: 'E.S$|O.S$|DS|.AS|.HI|RD|.D.E|P.M|IES|N.S$|GS|FR|LM|LT|EKI|.SO|AGL|^HAVANESE$'
113 ch  91 win  0.55 s C-D: 'AI.$|T$|AN$|EX|L$|A$|Y$|A..B|NX|M.I|R$|H$|O.O|BAL|RAG|MES|KI.$|^CHA|-B|G$|GY|M$|WN|O |OE|EO|NK|F$|J.V|O.A|H.F|^TH'
CPU times: user 4.36 s, sys: 91.5 ms, total: 4.45 s
Wall time: 4.39 s

Out[20]:
1355

In 4 seconds we solved twenty problems, getting a score of 1355 for the total number of characters in all the solutions.

(Note: if you run this, you might get a slightly different answer. Why? The algorithm has no random parts in it; shouldn't it return the same result every time? It will return the same results if you run it twice in the same Python. But when findregex evaluates max(pool, key=score) and two components are tied for the best score, Python chooses the one that comes first in the iteration of pool. And different versions of Python can iterate over pool in different orders.)

Studying

What can we do to understand the program better? Can we determine what kinds of regex components are being used? What kinds might we add that are missing? How might we make the program faster? Or allow it to find better solutions?

Let's start by understanding what the regex components look like. Glancing over the output of the benchmark, my impression is that most components are short, in the 2-4 character range. We can be more precise:

In [21]:
from collections import Counter, defaultdict

components = [r for solution in SOLUTION.values() for r in solution.split('|')]

Counter(map(len, components)).most_common()
Out[21]:
[(2, 167), (3, 129), (4, 63), (1, 16), (10, 1)]

This says that there are 167 components of length 2, etc. We can also view this data as a histogram:

In [22]:
hist(map(len, components));

There's also one component of length 10; that's such an outlier, I wonder if it is an error? Let's investigate.

In [23]:
max(components, key=len)
Out[23]:
'^HAVANESE$'

I happen to own a Havanese, but I don't think the program knew that. Rather, it looks like the issue is that in the set of cats we have 'HAVANA BROWN' and 'JAVANESE', which means that any 4-letter-or-less subsequence of 'HAVANESE' matches one of these two. So the only component that doesn't match one of these two cats is the whole, '^HAVANESE$'.

What about the individual characters that make up the components? Which ones are popular? We can see:

In [24]:
cat = ''.join

Counter(cat(components)).most_common(20)
Out[24]:
[('.', 149),
 ('$', 74),
 ('a', 63),
 ('e', 54),
 ('r', 53),
 ('o', 49),
 ('t', 42),
 ('^', 41),
 ('n', 40),
 ('i', 37),
 ('l', 33),
 ('s', 27),
 ('c', 24),
 ('u', 23),
 ('d', 21),
 ('h', 20),
 ('f', 19),
 ('y', 18),
 ('E', 13),
 ('b', 13)]

I wonder if there are components that appear in many solutions?

In [25]:
Counter(components).most_common(20)
Out[25]:
[('id', 4),
 ('lt', 3),
 ('nc', 3),
 ('hu', 3),
 ('j', 3),
 ('f$', 3),
 ('v.l', 2),
 ('ga', 2),
 ('ld', 2),
 ('tr', 2),
 ('rc', 2),
 ('b', 2),
 ('ee', 2),
 ('or$', 2),
 ('a.a', 2),
 ('rv', 2),
 ('we', 2),
 ('^la', 2),
 ('ca', 2),
 ('^..$', 2)]

I guess not.

Studying: Set Cover Theory

Now I want to think about the problem as a set cover problem. Wikipedia shows this diagram to explain set cover:

The idea is that each blue dot in set $u$ on the left represents a regex component, and each green dot in set $v$ represents a winner that is matched by the regex component. The problem is to pick some blue dots such that all the green dots are covered. In this example, we can see that picking the blue dots that are second and third from the top will do it.

But in general, what do our graphs look like? How many blue and green dots are there? How many links between them? Let's build a little tool to show show some information. We'll list the $N$ "most popular" blue and green dots—the ones with the most links to the other side. Then we'll show histograms of all the nodes and their numbers of links.

But first some explanation: I use the term multimap for a dictionary which is a mapping {key: collection}. So covers is a multimap of the form {regex: {winner,...}}. If we apply the function invert_multimap to covers, we get a dict of the form {winner: {regex, ...}}.

In [26]:
import matplotlib.pylab as pylab
pylab.rcParams['figure.figsize'] = (8.0, 4.0)

def show(W, L, N=10):
    covers = regex_covers(W, L)
    inv = invert_multimap(covers)
    for ((n1, w), (n2, r)) in zip(top(N, covers), top(N, inv)):
        print "%8s %2d | %3d %s" % (w, n1, n2, r)
    print "TOTAL %5d | %3d TOTAL" % (len(covers), len(W))
    subplot(1, 2, 1); histmm(covers, "regexes", "winners")
    subplot(1, 2, 2); histmm(inv, "winners", "regexes")
    
def top(N, multimap): 
    "The top N longest items in a dict of {key: {val,...})"
    return sorted([(len(vals), key) for (key, vals) in multimap.items()], reverse=True)[:N]

def histmm(multimap, key, val): 
    "Display a histogram of how many values each key has."
    hist(map(len, multimap.values()))
    xlabel('set size of {' + val + ',...}')
    ylabel('number of ' + key + ' mapping to that size')

def invert_multimap(multimap):
    "Covert {key: {val,...}} to {val: [key,...]}."
    result = defaultdict(set)
    for key in multimap:
        for val in multimap[key]:
            result[val].add(key)
    return result
In [27]:
show(winners, losers)
    i..n  4 | 110 van-buren
     a.a  4 |  98 washington
    a..i  4 |  93 jefferson
    r.i.  3 |  93 eisenhower
     r.i  3 |  81 garfield
    oo..  3 |  78 roosevelt
     oo.  3 |  72 buchanan
      oo  3 |  61 obama
      ma  3 |  61 kennedy
    li..  3 |  59 jackson
TOTAL  1615 |  34 TOTAL

What does this say? First the table says that there are only three regexes that have 4 winners; all the rest have 3 or fewer winners. So the links going left-to-right are rather sparse. There are more links going the other direction; 110 links go in to 'van-buren'. (Presumably in large part because the '-' character is unique, so any component containing a '-' will not match any loser.) The histograms give the whole picture mapping set sizes to the number of keys that have that set size. So, on the left we see that over 1400 regexes have a set size of one winner, and 100 or fewer have set sizes of 2 and 3, and we saw that only 3 regexes have a set size of 4 winners. On the right, the histogram shows a wide spread of winners that have set sizes between 20 and 110 regexes.

From looking at this I get two ideas:

  • I notice that both 'oo..' and 'oo.' happen to match 3 winners. But even if I didn't know how many they matched, I would know that the winners matched by 'oo..' must be a subset of the winners matched by 'oo.' (remember that a set is a subset of itself). Furthermore, 'oo.' is shorter. This suggests that we would never want to use 'oo..' in a solution, because we could always make the solution one character shorter (and cover at least all the same winners) by substituting 'oo.'. We say that A dominates B if A is shorter and matches a superset of winners. We might as well throw out all dominated regexes right at the start.

  • It might be a good idea to pay special attention to the characters, like '-', that appear in the winners but not the losers.

Let's look at another example:

In [28]:
show(dogs, cats)
    E.S$ 41 | 186 SOFT COATED WHEATEN TERRIERS
     RS$ 34 | 184 STAFFORDSHIRE BULL TERRIERS
    ERS$ 34 | 175 GREATER SWISS MOUNTAIN DOGS
    .RS$ 34 | 162 TREEING WALKER COONHOUNDS
    IE.S 22 | 157 CAVALIER KING CHARLES SPANIELS
    E.RI 21 | 149 AMERICAN STAFFORDSHIRE TERRIERS
    T..R 20 | 147 WEST HIGHLAND WHITE TERRIERS
    IER. 19 | 146 STANDARD SCHNAUZERS
     IER 19 | 146 FLAT-COATED RETRIEVERS
    I.R. 19 | 146 CHESAPEAKE BAY RETRIEVERS
TOTAL  4570 | 100 TOTAL

My immediate reaction to this is "there are a lot of retrievers and terriers." All ten of the regexes in the table recognize this fact, but almost all the regexes (about 4400 out of 4570) match fewer than 5 dog breeds. In contrast, the right-hand histogram shows that most of the dog breeds have 9 or more regexes that match them.

Let's look at one more example before drawing conclusions:

In [29]:
show(adverbs, nouns)
     a.l 12 |  94 especially
    a.ly 11 |  83 eventually
    a.l. 11 |  74 particularly
    .a.l 11 |  67 actually
      ll  9 |  65 quickly
    ^..$  9 |  65 exactly
     .ll  9 |  65 directly
    ..ll  8 |  63 certainly
     all  7 |  61 usually
     st$  6 |  57 recently
TOTAL  2049 |  98 TOTAL

The pattern looks similar here. Almost all the regexes match only one winner, but most of the winners are matched by many regexes. What can I conclude from this?

  • It would be good to think of new types of regexes that match more winners.
  • Most regexes are 2 or 3 characters long and most match just 1 or 2 winners. That suggests that often we will have a tie in the score function, and greedy search will arbitrarily pick the one that comes first. But I'm worried that another choice might be better. Maybe we should replace greedy search with something that considers more than one choice.

Given all we've seen, I will shift my attention to understanding the search algorithm we have, and thinking about how to improve it.

findregex is a greedy search algorithm. In general, greedy search is used when you are trying to find a solution that consists of components. On each iteration you choose a component that looks good, and you never reconsider that choice; you just keep on adding components until you get a complete solution.

The pseudo-code for a general greedy_search, described as a recursive function, is given below. It takes two arguments: a set of components, and a partial solution indicating components that have previously been chosen. (The invariant condition is that the partial solution plus a solution for the components will always equal a complete solution for the problem.)

# Pseudocode 
def greedy_search(components, partial_solution=None):
    if is_complete(partial_solution):
        return partial_solution
    else:
        best = select_best(components)
        return greedy_search(components - best, partial_solution + best)

An exhaustive search considers every possible choice of components, and selects the best solution. On each iteration exhaustive search picks a component (just like greedy search), but then it considers both using the component and not using the component. You can see that exhaustive search is almost identical to greedy search, except that it has two recursive calls (on lines 7 and 8) instead of one (on line 7). (If you are viewing this in a iPython notebook, not just a web page, you can toggle line numbers by pressing 'ctrl-M L' within a cell.) How do we choose between the results of the two calls? We need a cost function that we are trying to minimize. (For regex golf the cost of a solution is the length of the string.)

# Pseudocode
def exhaustive_search(components, partial_solution=None):
    if is_complete(partial_solution):
        return partial_solution
    else:
        best = select_best(components)
        return min(exhaustive_search(components - best, partial_solution + best),
                   exhaustive_search(components - best, partial_solution),
                   key=cost)

Here's an interesting piece of trivia: the first thing that this exhaustive search does is a greedy search! Think about it: on each recursive call, exhaustive_search selects the best component, and then calls itself recursively with best chosen as part of the partial solution. It keeps going until it reaches a complete solution. (That's exactly what greedy search does.) But then exhaustive search does more. It also considers all the cases where it didn't chose the best component.

Here's something else to ponder: why should we bother with selecting the "best" component? If we're going to explore all possibilities, wouldn't it be the same if we just selected any component, not the "best" one?

Greedy search only considers one path towards the goal of finding a solution. Exhaustive search considers two paths on every iteration. How much work is that all together? If there are R regex components, then there are $2^R$ subsets of components. Now, we don't have to consider every one of these subsets, because the search ends when we have a complete solution. If there are $W$ winners, than an upper bound on the number of combinations we have to consider is the number of ways we can pick $W$ components from a pool of $R$, which is math we write as $R \choose W$, or ($R$ choose $W$). For the presidential candidates, we knows that the number of winners, $W$, is 34, and $R$ is 1615.

So exhaustive search would have to consider roughly ${1615} \choose {34}$ = $10^{70}$ combinations, which is clearly too many. Don't even think about dogs versus cats, where $W$ = 100 and $R$ = 4570, so ${4570} \choose {100}$ = $10^{207}$. Fortunately, there is another way.

Branch and bound is a search strategy that computes the same result as exhaustive search, but runs faster because it avoids considering many combinations that could not possibly lead to an optimal solution. Here's how and why it works:

  • The branch part of the name means that, just like exhaustive search, it picks a best component, and considers two branches, one where best is part of the solution, and one where it is not.
  • The bound part of the name means that the algorithm continually keeps track of the lowest-cost solution it has found so far. This serves as a bound on what the optimal solution can be.
  • The trick is to notice that the cost of a partial solution always increases as we add more components to it. Therefore, if we ever get to a partial solution whose cost is more than the lowest-cost solution we have found so far, then we can immediately discard the partial solution, and never consider all the other combinations that would have come from continuing the partial solution. If there are $M$ components left to consider, that means we just eliminated $2^M$ potential combinations all at once. "Eliminating potential combinations" is also called "pruning the seach tree"—"pruning" means cutting off unwanted branches.

The pseudocode below uses a global variable called SOLUTION to hold the lowest-cost solution found so far. Note that the two branches (two recursive calls in lines 7 and 8) communicate through this global variable. If the first branch sets SOLUTION to an improved solution, then the second branch will see that improved value. That is why it is important to choose a good "best" component first—it means we are more likely to find a good solution early on, which makes it more likely that we can prune away branches.

# Pseudocode
def branch_and_bound_search(components, partial_solution=None):
    if is_complete(partial_solution):
        SOLUTION = min(SOLUTION, partial_solution, key=cost)
    elif cost(partial_solution) < cost(SOLUTION):
        best = select_best(components)
        branch_and_bound_search(components - best, partial_solution + best)
        branch_and_bound_search(components - best, partial_solution)
    else:
        pass 
    return SOLUTION

Notice that we don't bother doing the min calculation on the results of the two recursive calls. We know that every time we find a better solution, we will be setting the global variable SOLUTION, so we can just return SOLUTION at the bottom of the function, without worrying about which branch the solution came from.

Even with all this pruning, we still may be presented with problems that require trillions of recursive calls. Who's got that kind of time? Happily, we know that right at the start the algorithm did the equivalent of a greedy search and set a value for SOLUTION, and that from then on the search is continually updating SOLUTION whenever it finds a better combination. So if at any point we stop the search, we always have some solution. Algorithms that have the property that an answer is always available are called anytime algorithms. (Obviously, if you stop the algorithm before it completes you've lost the guarantee of finding the optimal solution.)

I can think of two easy ways to allow early stopping. We could use the threading.Timer class to set up a thread that will run the search for a given number of seconds, then stop. Or we could have a counter which gets decremented every time we make a call, and when it hits zero, we refuse to make any more recursive calls. I'll take this second approach.

Enough pseudocode—here is a real implementation. Since global variables are considered bad style, I decided to use a class, BranchBound, to hold the lowest-cost solution and the number of calls remaining. The class provides one method, search, which does the work.

In [30]:
def findregex(winners, losers, calls=100000):
    "Find the shortest disjunction of regex components that covers winners but not losers."
    solution = '^(' + OR(winners) + ')$'
    covers = regex_covers(winners, losers)
    return BranchBound(solution, calls).search(covers)

class BranchBound(object):
    "Hold state information for a branch and bound search."
    def __init__(self, solution, calls):
        self.solution, self.calls = solution, calls
    
    def search(self, covers, partial=None):
        """Recursively extend partial regex until it matches all winners in covers.
        Try all reasonable combinations until we run out of calls."""
        if self.calls <= 0: 
            return self.solution
        self.calls -= 1
        covers, partial = simplify_covers(covers, partial)
        if not covers: # Nothing left to cover; solution is complete
            self.solution = min(partial, self.solution, key=len)
        elif len(OR(partial, min(covers, key=len))) < len(self.solution):
            # Try with and without the greedy-best component
            def score(c): return 4 * len(covers[c]) - len(c)
            best = max(covers, key=score) # Best component
            covered = covers[best] # Set of winners covered by best
            covers.pop(best)
            self.search({c:covers[c]-covered for c in covers}, OR(partial, best))
            self.search(covers, partial)
        return self.solution

Simplifying Covers

There's a new function introduced above, simplify_covers. The idea is that you pass it a covers dict and optionally a partial regex, and it does a triage process to divide the components in covers into three categories:

  • Useless components: could never appear in an optimal solution.
  • Necessary components: must appear in any solution (because no other component covers some winner).
  • Remaining components: neither useless nor necessary.

The necessary components are joined to the partial regex that is passed in, to yield a new partial regex that the function returns. The useless components are eliminated, and the remaining ones are returned in a new covers dict.

simplify_covers works by first eliminating useless components, and then selecting necessary components. We keep repeating this process of eliminating and selecting until there are no more changes to covers:

In [31]:
def simplify_covers(covers, partial=None):
    "Eliminate dominated regexes, and select ones that uniquely cover a winner."
    previous = None
    while covers != previous:
        previous = covers
        covers = eliminate_dominated(covers)
        covers, necessary = select_necessary(covers)
        partial = OR(partial, necessary)
    return covers, partial

For convenience we modified the function OR slightly; we made it work just like the function max in that you can call it two ways: either with a single argument, which is a collection of components, or with 2 or more arguments, each a component. We also made it ignore components that are None, so that OR(None, 'a') equals 'a', but OR('a', 'b') equals 'a|b', as does OR(['a', 'b'])

In [32]:
def OR(*regexes):
    """OR together component regexes. Ignore 'None' components.
    Allows both OR(a, b, c) and OR([a, b, c]), similar to max."""
    if len(regexes) == 1: 
        regexes = regexes[0]
    return '|'.join(r for r in regexes if r)

Simplifying Covers: Eliminating Dominated Components

We defined the notion of a regex being dominated if there is some other regex that is shorter (or at least not longer) and covers a superset of winners. How can we efficiently compute which components are dominated? Here's one simple approach: go through each regex component, r, and every time we find one that is not dominated by any another regex component r2, add r (and its winners) to a new dict that we are building up. This approach works, but it is $O(n^2)$.

Here's an alternative that is faster: First, sort all the regexes so that the ones that match more winners come first (in case of tie, put the shorter ones first). (In other words, "best" first.) Initialize a new dict to empty, and go through the sorted regexes, r, one at a time. For each r, compare it only to the regexes, r2, that we have already added to the new dict; if none of them dominate r, then add r. This is still $O(n^2)$, but if, say, 99 out of 100 regexes are eliminated, we will only need $n^2/100$ comparisons.

The code for eliminate_dominated is a slight variant on the version by Stefan Pochmann (because his version turned out to be better than my first version). Note that the function signature represents a regex as a pair of numbers, the negative of the number of winners and the length in characters. The smaller (in lexicographical order) this pair is, the "better" the regex is. Why is it important to put the regexes in sorted order? So that we know that the only possible dominators are the ones that came before, not ones that come after. Note that, as we are going through the component regexes, r, in sorted order, if we ever hit an r that covers no winners, then we know that all the remianing r will also cover no winners (because we're going in best-first order), so we can break out of the loop immediately. And one more thing: in Python the expression (set2 >= set1) means "Is set2 a superset of set1?", so the expression covers[r2] >= covers[r] asks if r2 covers a superset of the winners that r covers.

In [33]:
def eliminate_dominated(covers):
    """Given a dict of {regex: {winner...}}, make a new dict with only the regexes
    that are not dominated by any others. A regex r is dominated by r2 if r2 covers 
    a superset of the matches covered by r, and r2 is shorter."""
    newcovers = {}
    def signature(r): return (-len(covers[r]), len(r))
    for r in sorted(covers, key=signature):
        if not covers[r]: break # All remaining r must not cover anything
        # r goes in newcache if it is not dominated by any other regex
        if not any(covers[r2] >= covers[r] and len(r2) <= len(r) 
                   for r2 in newcovers):
            newcovers[r] = covers[r]
    return newcovers

Simplifying Covers: Selecting Necessary Components

If some winner is covered by only one component, then we know that component must be part of the solution. The function select_necessary finds all such components, ORs them together, and returns that, along with a dict of the remaining covers, after removing the necessary components (and removing all the winners that are covered by the neccessary components).

In [34]:
def select_necessary(covers):
    """Select winners covered by only one component; remove from covers.
    Return a pair of (covers, necessary)."""
    counts = Counter(w for r in covers for w in covers[r])
    necessary = {r for r in covers if any(counts[w] == 1 for w in covers[r])}
    if necessary:
        covered = {w for r in necessary for w in covers[r]}
        covers = {r:covers[r]-covered for r in covers if r not in necessary}
        return covers, OR(necessary)
    else:
        return covers, None

Testing and Benchmarking: Branch and Bound

Let's make up a test suite (showing some examples of how these functions work). Then let's see if findregex works with the new branch and bound search.

In [35]:
def test_bb():
    assert OR(['a', 'b', 'c']) == OR('a', 'b', 'c') == 'a|b|c'
    assert OR(['a|b', 'c|d']) == OR('a|b', 'c|d') == 'a|b|c|d'
    assert OR(None, 'c') == 'c'
    covers1 = {'a': {'ann', 'abe'}, 'ab': {'abe'}}
    assert eliminate_dominated(covers1) == {'a': {'ann', 'abe'}}
    assert simplify_covers(covers1) == ({}, 'a')
    covers2 = {'a': {'abe', 'cab'}, 'b': {'abe', 'cab', 'bee'}, 
               'c': {'cab', 'cee'}, 'c.': {'cab', 'cee'}, 'abe': {'abe'},
               'ab': {'abe', 'cab'}, '.*b': {'abe', 'cab', 'bee'}, 
               'e': {'abe', 'bee', 'cee'}, 'f': {}, 'g': {}}
    assert eliminate_dominated(covers2) == simplify_covers(covers2)[0] == {
        'c': {'cab', 'cee'}, 'b': {'cab', 'abe', 'bee'}, 'e': {'cee', 'abe', 'bee'}}
    covers3 = {'1': {'w1'}, '.1': {'w1'}, '2': {'w2'}}
    assert eliminate_dominated(covers3) == {'1': {'w1'}, '2': {'w2'}}
    assert simplify_covers(covers3) == ({}, '1|2')
    assert select_necessary({'a': {'abe'}, 'c': {'cee'}}) == ({}, 'a|c')
    assert {0, 1, 2} >= {1, 2}
    assert {1, 2} >= {1, 2}
    assert not ({1, 2, 4} >= {1, 3})
    return 'test_bb passes'

test_bb()
Out[35]:
'test_bb passes'

Let's investigate how much the cover simplification process is helping:

In [36]:
covers = regex_covers(winners, losers)
len(covers)
Out[36]:
1615
In [37]:
covers2, partial = simplify_covers(covers)
len(covers2), partial
Out[37]:
(49, 'j|po')
In [38]:
1 - len(covers2)/len(covers)
Out[38]:
0.9696594427244583

We see that simplify_covers gives us a 97% reduction in the number of components!

Now let's do complete benchmarks for branch and bound. I'm going to alter benchmark to print the number of calls taken. The easiest way to do that is to have findregex return the BranchBound object it creates, rather than returning the solution. Since we're changing the signature of findregex, I'll give it a new name, bb_findregex (for "branch and bound findregex").

In [39]:
def benchmark(data=ALL, calls=10000): 
    "Run these data sets; print summaries; return total of solution lengths."
    total = 0
    for (W, Wname, Lname, L) in data:
        re.purge()
        t0 = time.time()
        bb = bb_findregex(W, L, calls)
        t1 = time.time()
        SOLUTION[W, L] = bb.solution
        assert verify(bb.solution, W, L)
        print '{:3d} ch {:3d} win {:6.2f} s {:7,d} calls {}-{}: {!r}'.format(
            len(bb.solution), len(W), t1-t0, (calls - bb.calls), Wname, Lname, bb.solution)
        total += len(bb.solution)
    return total

def bb_findregex(winners, losers, calls=10000):
    "Find the shortest disjunction of regex components that covers winners but not losers."
    solution = '^(' + OR(winners) + ')$'
    bb = BranchBound(solution, calls)
    bb.search(regex_covers(winners, losers))
    return bb
In [40]:
%time benchmark(calls=1000)
 52 ch  34 win   0.33 s   1,000 calls W-L: 'j|po|a.a|a..i|bu|ma|ix|ho|li|v.l|ls|a.t|ay.|n.e|r.e$'
 58 ch  34 win   0.25 s   1,000 calls L-W: 'ss|^ki|^s|x$|^d|go|^.re|ry|hu|cc|o.d|k..$|^.la|ney|oc|d.n$'
 11 ch  10 win   0.04 s       7 calls B-G: 'a.$|c|de|as'
 10 ch  10 win   0.04 s      13 calls G-B: 'a$|e.i|a.i'
 23 ch  12 win   0.05 s       7 calls I-O: 'lt|ch|os|4|pa|sa|e.g|ha'
 21 ch  20 win   0.08 s      11 calls O-I: 'j|e.r|ns|y|te|d|m|^.i'
 15 ch  10 win   0.05 s       3 calls P-C: 'ge|ir|o.$|x|f|q'
 11 ch  10 win   0.05 s      17 calls C-P: 'ri|an|h|a.e'
  3 ch  21 win   0.06 s       1 calls F-B: 'f.o'
 23 ch  21 win   0.14 s     251 calls B-F: 'k|ld|...n|^i|b|ar|or|la'
  9 ch   6 win   0.07 s      25 calls W-T: ' T|B|E..H'
 13 ch   9 win   0.09 s       9 calls T-W: 'Y|IS|RA| F|CT'
167 ch 100 win   0.45 s   1,000 calls N-A: 'rt$|^wa|jo|me$|m.n|ca|v.l|^h.a|fat|air|o.y|lot|^da|law|rty|^g|wo|s.u|i.y|mb|s.n|hou|id|doo|m$|ey|y.a|ch.|eo|in.$|if|tr|an|ac.$|io|end|rc|ice|are|po|ot.e|s..t|..k$|^.ig'
146 ch  98 win   0.41 s     769 calls A-N: 'st$|far|rse|^v|wh|a.l|ui|tha|b.t|u.h|f$|hu|re$|r.t|tl|ns|ard|l.s|lat|ll|s.m|lon|^..$|ver$|ow$|en$|ah|ut$|^to|nl|nc|a.s|a.a|l.e|y..$|i..e|o.n$|^a.o'
170 ch 100 win   0.43 s   1,000 calls N-V: '^ar|t.t|gu|h..d|ace|h.ng|ki|.ne|ir|am|er$|.as|^en|m.n|ot|po|law|ye|nt.|^.ar|w.y|es|lev|ho|ud|lt|ht|..or|o.l|j|ic|fr|if|rc|.d.$|t.m|^i|d.y|.on|ci|we|ro..|b..k|ct|m..n|or.$'
170 ch  94 win   0.37 s   1,000 calls V-N: 'p.y|tar|a.h$|f$|^fi|ru|at$|nk|rr|nc|r..d|^tr|tc|o.b|it$|wi|hear|lay|b.y|ke|.ed|^u|^s.n|can|sa|unt$|sk|fl|br|v.$|.pe|^..l|in$|^..$|ut$|o.g|s..n|lo..|lea|ur.|w..t|...w|^s.e'
 91 ch  64 win   0.42 s   1,000 calls R-B: 'rand|sam|ee|hu|es$|f$|lac|^np|^w|a.._|do|r..l|ia|be|we|c$|o.s|a$|_..i|t.st|Te|o$|mu|if|ch..'
134 ch 141 win   0.89 s   1,000 calls B-R: 'set$|w$|it|as|^int|^a|in$|^f.|ir$|id|fe|pr|hr|E|ev|bo|od|a.s|g$|ro|en$|rr|N|co|tr$|li|ct$|rt|u.$|..p$|t.p|up|o.d|e.u|^..x|n.e$|py|c..l'
 69 ch 100 win   0.96 s   1,000 calls D-C: '^HAVANESE$|E.S$|O.S$|N.S$|.AS|GS|.HI|RD|AGL|OD|G.S|NS$|.SO|S.F|IES|LT'
104 ch  91 win   0.59 s     547 calls C-D: 'AN$|^TH|JAV|Y$|NX|SN|BAL|H$|EX|R$|T$|MES|L$|DW|IX|A$|H F|LI$|M$|FI|N.B|M.I|WN|EO|MAU|KI.$|K..E|^C.A|O..A'
CPU times: user 5.78 s, sys: 91.1 ms, total: 5.87 s
Wall time: 5.8 s

Out[40]:
1300

This is encouraging! For 12 of the 20 cases we found the optimal solution (of course, optimal only with respect to the set of regex components we allow so far). For the other 8 cases we cut off search at 1,000 calls. We took 2 seconds more than greedy search, and reduced the total solution size by 4%.

In [41]:
%time benchmark(calls=10000)
 52 ch  34 win   0.44 s   1,581 calls W-L: 'j|po|a.a|a..i|bu|ma|ix|ho|li|v.l|ls|a.t|ay.|n.e|r.e$'
 57 ch  34 win   0.53 s   3,045 calls L-W: 'ss|^ki|^s|x$|i.$|go|^f|de|br|d.l|hu|k.r|e.l|ney|an.o|^.la'
 11 ch  10 win   0.04 s       7 calls B-G: 'a.$|c|de|as'
 10 ch  10 win   0.04 s      13 calls G-B: 'a$|e.i|a.i'
 23 ch  12 win   0.05 s       7 calls I-O: 'lt|ch|os|4|pa|sa|e.g|ha'
 21 ch  20 win   0.08 s      11 calls O-I: 'j|e.r|ns|y|te|d|m|^.i'
 15 ch  10 win   0.05 s       3 calls P-C: 'ge|ir|o.$|x|f|q'
 11 ch  10 win   0.05 s      17 calls C-P: 'ri|an|h|a.e'
  3 ch  21 win   0.06 s       1 calls F-B: 'f.o'
 23 ch  21 win   0.14 s     251 calls B-F: 'k|ld|...n|^i|b|ar|or|la'
  9 ch   6 win   0.07 s      25 calls W-T: ' T|B|E..H'
 13 ch   9 win   0.09 s       9 calls T-W: 'Y|IS|RA| F|CT'
167 ch 100 win   0.70 s   2,009 calls N-A: 'rt$|^wa|jo|me$|m.n|ca|v.l|^h.a|fat|air|o.y|lot|^da|law|rty|^g|wo|s.u|i.y|mb|s.n|hou|id|doo|m$|ey|y.a|ch.|eo|in.$|if|tr|an|ac.$|io|end|rc|ice|are|po|ot.e|s..t|..k$|^.ig'
146 ch  98 win   0.41 s     769 calls A-N: 'st$|far|rse|^v|wh|a.l|ui|tha|b.t|u.h|f$|hu|re$|r.t|tl|ns|ard|l.s|lat|ll|s.m|lon|^..$|ver$|ow$|en$|ah|ut$|^to|nl|nc|a.s|a.a|l.e|y..$|i..e|o.n$|^a.o'
169 ch 100 win   1.75 s  10,000 calls N-V: '^ar|t.t|gu|h..d|ace|h.ng|ki|.ne|ir|am|er$|.as|^en|m.n|ot|po|law|ye|nt.|^.ar|w.y|es|lev|ho|ud|lt|ht|do.|or.$|we|t.m|^i|d.y|o.l|j|ic|fr|rc|if|.on|ci|ro..|.d.$|b..k|ct|m..n'
170 ch  94 win   0.88 s   3,211 calls V-N: 'p.y|tar|a.h$|f$|^fi|ru|at$|nk|rr|nc|r..d|^tr|tc|o.b|it$|wi|hear|lay|b.y|ke|.ed|^u|^s.n|can|sa|unt$|sk|fl|br|v.$|.pe|^..l|in$|^..$|ut$|o.g|s..n|lo..|lea|ur.|w..t|...w|^s.e'
 90 ch  64 win   0.88 s   3,937 calls R-B: 'rand|sam|ee|hu|es$|f$|lac|^np|^w|a.._|do|r..l|ia|be|we|e.m|o$|a$|if|_..i|o..s|_s|q|hl|^.es'
133 ch 141 win   3.05 s  10,000 calls B-R: 'set$|w$|it|as|^int|^a|in$|^f.|ir$|id|fe|pr|hr|rr|ev|bo|od|a.s|W|ro|N|l.n|co|li|^o|tr$|rt|u.$|..p$|t.p|up|l.a|e.u|ec|^..x|n.e$|py|^..c'
 69 ch 100 win   5.68 s  10,000 calls D-C: '^HAVANESE$|E.S$|O.S$|N.S$|.AS|GS|.HI|RD|AGL|OD|G.S|NS$|.SO|S.F|IES|LT'
104 ch  91 win   0.59 s     547 calls C-D: 'AN$|^TH|JAV|Y$|NX|SN|BAL|H$|EX|R$|T$|MES|L$|DW|IX|A$|H F|LI$|M$|FI|N.B|M.I|WN|EO|MAU|KI.$|K..E|^C.A|O..A'
CPU times: user 15.6 s, sys: 158 ms, total: 15.7 s
Wall time: 15.6 s

Out[41]:
1296

Now there are only 3 cases left where we are cutting off search before finding the optimal solution.

In [42]:
%time benchmark(calls=100000)
 52 ch  34 win   0.44 s   1,581 calls W-L: 'j|po|a.a|a..i|bu|ma|ix|ho|li|v.l|ls|a.t|ay.|n.e|r.e$'
 57 ch  34 win   0.54 s   3,045 calls L-W: 'ss|^ki|^s|x$|i.$|go|^f|de|br|d.l|hu|k.r|e.l|ney|an.o|^.la'
 11 ch  10 win   0.04 s       7 calls B-G: 'a.$|c|de|as'
 10 ch  10 win   0.04 s      13 calls G-B: 'a$|e.i|a.i'
 23 ch  12 win   0.05 s       7 calls I-O: 'lt|ch|os|4|pa|sa|e.g|ha'
 21 ch  20 win   0.08 s      11 calls O-I: 'j|e.r|ns|y|te|d|m|^.i'
 15 ch  10 win   0.05 s       3 calls P-C: 'ge|ir|o.$|x|f|q'
 11 ch  10 win   0.05 s      17 calls C-P: 'ri|an|h|a.e'
  3 ch  21 win   0.06 s       1 calls F-B: 'f.o'
 23 ch  21 win   0.14 s     251 calls B-F: 'k|ld|...n|^i|b|ar|or|la'
  9 ch   6 win   0.07 s      25 calls W-T: ' T|B|E..H'
 13 ch   9 win   0.09 s       9 calls T-W: 'Y|IS|RA| F|CT'
167 ch 100 win   0.71 s   2,009 calls N-A: 'rt$|^wa|jo|me$|m.n|ca|v.l|^h.a|fat|air|o.y|lot|^da|law|rty|^g|wo|s.u|i.y|mb|s.n|hou|id|doo|m$|ey|y.a|ch.|eo|in.$|if|tr|an|ac.$|io|end|rc|ice|are|po|ot.e|s..t|..k$|^.ig'
146 ch  98 win   0.41 s     769 calls A-N: 'st$|far|rse|^v|wh|a.l|ui|tha|b.t|u.h|f$|hu|re$|r.t|tl|ns|ard|l.s|lat|ll|s.m|lon|^..$|ver$|ow$|en$|ah|ut$|^to|nl|nc|a.s|a.a|l.e|y..$|i..e|o.n$|^a.o'
167 ch 100 win  19.04 s 100,000 calls N-V: '^ar|t.t|gu|h..d|ace|h.ng|ki|.ne|ir|am|er$|.as|^en|m.n|ot|po|law|ye|nt.|^.ar|w.y|ho|lt|ht|t.m|.on|ci|or.$|we|do.|.d.$|o.l|j|ic|fr|rc|if|lev|^i|d.y|ent|ro..|b..k|ct|m..n'
170 ch  94 win   0.88 s   3,211 calls V-N: 'p.y|tar|a.h$|f$|^fi|ru|at$|nk|rr|nc|r..d|^tr|tc|o.b|it$|wi|hear|lay|b.y|ke|.ed|^u|^s.n|can|sa|unt$|sk|fl|br|v.$|.pe|^..l|in$|^..$|ut$|o.g|s..n|lo..|lea|ur.|w..t|...w|^s.e'
 90 ch  64 win   0.88 s   3,937 calls R-B: 'rand|sam|ee|hu|es$|f$|lac|^np|^w|a.._|do|r..l|ia|be|we|e.m|o$|a$|if|_..i|o..s|_s|q|hl|^.es'
133 ch 141 win   7.29 s  23,703 calls B-R: 'set$|w$|it|as|^int|^a|in$|^f.|ir$|id|fe|pr|hr|rr|ev|bo|od|a.s|W|ro|N|l.n|co|li|^o|tr$|rt|u.$|..p$|t.p|up|l.a|e.u|ec|^..x|n.e$|py|^..c'
 69 ch 100 win 111.08 s 100,000 calls D-C: '^HAVANESE$|E.S$|O.S$|N.S$|.AS|GS|.HI|RD|AGL|OD|G.S|NS$|.SO|S.F|IES|LT'
104 ch  91 win   0.60 s     547 calls C-D: 'AN$|^TH|JAV|Y$|NX|SN|BAL|H$|EX|R$|T$|MES|L$|DW|IX|A$|H F|LI$|M$|FI|N.B|M.I|WN|EO|MAU|KI.$|K..E|^C.A|O..A'
CPU times: user 2min 22s, sys: 583 ms, total: 2min 22s
Wall time: 2min 22s

Out[42]:
1294

Here's a summary of the total size in characters for all the solutions, and the time taken, for each algorithm:

AlgorithmSize (chars)Time (secs)
Greedy13554
BB(1,000)13006
BB(10,000)129616
BB(100,000)1294142

We've reached the optimal solution on all but 2 cases, N-V and D-C. It seems like the best path to improvement is not to allow more calls, but rather to sanction additional regex components, hopefully ones that cover more winners in fewer characters.

Additional Components: Repetition Characters

Following a suggestion by Stefan Pochmann, I will generate a set of regexes of the form 'A.*B' where 'A' and 'B' are any two characters that appear anywhere in any winner, the dot is always the second character, and the third character can be either '*', '+', or '?'. I call these pairs.

I will also insert a repetition character ('*', '+', or '?') after every character (except '^' and '$') in every part. For example, given the part 'a.c', insert either a '*' or a '+' or a '?' after each of the three characters, to get nine new components: {'a+.c', 'a*.c', 'a?.c', 'a.c+', 'a.*c', 'a.?c', 'a.+c', 'a.c*', 'a.c?'}. I call these new components reps.

Here is a new version of regex_covers that adds the new parts (in the sets pairs and reps in lines 8-10):

In [43]:
def regex_covers(winners, losers):
    """Generate regex components and return a dict of {regex: {winner...}}.
    Each regex matches at least one winner and no loser."""
    losers_str = '\n'.join(losers)
    wholes = {'^'+winner+'$' for winner in winners}
    parts = {d for w in wholes for p in subparts(w) for d in dotify(p)}
    chars = set(cat(winners))
    pairs = {A+'.'+q+B for A in chars for B in chars for q in rep_chars}
    reps = {r for p in parts for r in repetitions(p)}
    pool = wholes | parts | pairs | reps                         
    searchers = [re.compile(c, re.MULTILINE).search for c in pool]
    return {r: set(filter(searcher, winners)) 
            for (r, searcher) in zip(pool, searchers)
            if not searcher(losers_str)}

def repetitions(part):
    """Return a set of strings derived by inserting a single repetition character ('+' or '*' or '?') 
    after each non-special character.  Avoid redundant repetition of dots."""
    splits = [(part[:i], part[i:]) for i in range(1, len(part)+1)]
    return {A + q + B 
            for (A, B) in splits
            # Don't allow '^*' nor '$*' nor '..*' nor '.*.'
            if not (A[-1] in '^$')
            if not A.endswith('..')
            if not (A.endswith('.') and B.startswith('.'))
            for q in rep_chars}

rep_chars = ('*', '+', '?')
In [44]:
def test_rep():
    assert repetitions('a') == {'a+', 'a*', 'a?'}
    assert repetitions('ab') == {'a+b', 'a*b', 'a?b', 
                                 'ab+', 'ab*', 'ab?'}
    assert repetitions('a.c') == {'a+.c', 'a*.c', 'a?.c',
                                  'a.c+', 'a.*c', 'a.?c',
                                  'a.+c', 'a.c*', 'a.c?'}
    assert repetitions('^a..d$') == {'^a+..d$', '^a*..d$', '^a?..d$', 
                                     '^a..d+$', '^a..d*$', '^a..d?$'}
    assert (eliminate_dominated(regex_covers(
           {'one', 'on'}, {'won', 'wuan', 'juan'}))
           == {'e': {'one'}, '^o': {'on', 'one'}})
    return 'test_rep passes'

test_rep()
Out[44]:
'test_rep passes'

Let's take our new components out for a spin and see what they can do:

In [45]:
findregex(starwars, startrek)
Out[45]:
' T|P.*E'

That's pretty cool. Remember, Randall had a 10 character regex in the strip (see below); we previously got it down to 9, but we now have it down to 7 with ' T|P.*E', thanks to the 'P.*E' pair.

How much more work are we doing with these additional components? Before, regex_covers(winners, losers) gave 1615 regexes, which were reduced to 49 by simplify_covers. Now:

In [46]:
covers = regex_covers(winners, losers)
len(covers)
Out[46]:
13200
In [47]:
covers, partial = simplify_covers(covers)
len(covers)
Out[47]:
90

So the total number of regexes is increased almost 10-fold, but after simplification the number is only double what it was before. (To put it another way, simplify_covers eliminated 99.3% of the components.) That's not too bad; let's add more components.

Additional Components: Longer Parts

Previously we generated regex component parts of length up to 4 characters. Why not say "we're doing 5"? The implementation is easy:

In [48]:
subpart_size = 5

def subparts(word):
    "Return a set of subparts of word, consecutive characters up to length 5."
    return set(word[i:i+1+s] for i in range(len(word)) for s in range(subpart_size)) 

But how many more regex components does this give us?

In [49]:
covers = regex_covers(winners, losers)
len(covers)
Out[49]:
39537
In [50]:
covers, partial = simplify_covers(covers)
len(covers)
Out[50]:
112

In the end the longer parts add only 22 new components, and let's hope they are useful ones (because we had to sort through nearly 40,000 components and eliminate 99.7% of them). It is time to benchmark again:

In [51]:
%time benchmark(calls=1000)
 47 ch  34 win   5.66 s   1,000 calls W-L: 'i.*o|o.*o|a.a|po|a.t|j|bu|di|ay.|ma|n.e|ie.|vel'
 58 ch  34 win   4.98 s   1,000 calls L-W: 'r.*y|i.h?$|^s|go|d..e|^f|^.?la|k.*g|hu|n.+k|pa|ss|ox|cc|de'
 11 ch  10 win   1.20 s      25 calls B-G: 'a.$|c|de|as'
  9 ch  10 win   1.29 s      23 calls G-B: 'a$|^m*..i'
 22 ch  12 win   2.03 s     369 calls I-O: '4|pa|lt|ch|e.g|os|s.?a'
 19 ch  20 win   3.13 s     557 calls O-I: 'ns|^.i|e.+r|y|j|d|m'
 15 ch  10 win   1.68 s      83 calls P-C: 'o.$|x|ir|q|b|ep'
 11 ch  10 win   1.86 s      45 calls C-P: 'ri|a.+e|h|k'
  3 ch  21 win   2.34 s       1 calls F-B: 'foo'
 23 ch  21 win   3.24 s   1,000 calls B-F: '^...a|k|...n|ld|b|ar|or'
  7 ch   6 win   3.21 s      23 calls W-T: ' T|P.*E'
 12 ch   9 win   4.25 s     175 calls T-W: 'I.?S|TI|H |Y'
163 ch 100 win   9.79 s   1,000 calls N-A: 'l.v|^wa|m.+n|me?$|.i.*d|....o|a.*c.$|.o..e|wo|^g|ir$|s.u|ch*a|i.y|lo?t|in.$|an|aw*$|o.y|y.a|fe|tr|rt.?$|^..m?b|^..g?h|bu|ice|f.t|o.r$|..k$|p..e|^d*.y|..a.*e|^.?e.d'
136 ch  98 win   9.40 s   1,000 calls A-N: 'e.+y|^..l*$|l.*s|rs*e$|a.l|ei*t|st$|en$|far|ver$|^a.o|i..l|u.h|lat|yb|b.t|a.*a.|^to|^l?on|ll|ow$|ard|n.+d|o.n$|wh|^o..$|ui|hu|hap*.$|r.t'
167 ch 100 win  10.04 s   1,000 calls N-V: 'h.*g|mp*.n|p.*r|..h.|mb*e|ea*s|^k?i|.ne|or.?$|f.*c|fe|^.?ar|.as|way|da?y|h.*d|c.*t.|^en|ir|am|fr|ou.e*$|gu|o.m|ye|em|rl|po|.r.i|b..k|p.+e|lev|j|.t.+e|lt|id|w.*e.|^l..$'
159 ch  94 win   6.57 s   1,000 calls V-N: 'mb$|tar|rr|^..p*l|^s*..$|b.y|f$|as*k|read|ve$|s..n|^s.n|..i.$|nk|^fi|un?t$|^t.?r|.pe|o.g|lo..|ke|^.a.c|eed|^r?u|^.le|c.n|...w|wr|br|ea..$|pl?.y|fl|an?t$|^he*.r'
 88 ch  64 win  12.50 s   1,000 calls R-B: 'nd*o|a.._|do|g.+m|f.?$|np$|rand|s...$|est?$|_.*s|e_|ei|be|are|if|hu|u.a|po.e|hl|^w|am?pl'
106 ch 141 win  28.58 s   1,000 calls B-R: 'E|^...$|ge*$|o.*d|a.?s|^c?o|bo|^f.|a.t|po?r|u.?e|id|t.i*p|N|^i.*t|li|he|ev|r.*w|de|xt|b.*r|ic*t|hr|ec|^c?a'
 43 ch 100 win  34.11 s   1,000 calls D-C: 'E.+S$|I.*S$|H.+N.S|.HI|G.?S|.SO|W |TES|O.D.'
104 ch  91 win  24.49 s   1,000 calls C-D: 'J.V|RA?$|R.*A.$|LD*$|N.*X|M$|M...$|E.*O.$|T$|AN?$|S.+E$|A. *B|Y$|N.*KIN|R.G|DW|EU?X|O..A|B.L.N|T.+A.$|H$'
CPU times: user 2min 49s, sys: 2.86 s, total: 2min 52s
Wall time: 2min 50s

Out[51]:
1203

This is a big improvement! 1,000 calls with the additional components takes about the same amount of time as 100,000 calls with the smaller set of components, and reduces the total solution length from 1294 to 1203, a 7% reduction. And notice we no longer need '^HAVANESE$'; instead we have 'H.+N.S', which saves 4 characters, and, look how many breeds it matches:

In [52]:
print matches('H.+N.S', dogs)
set(['CHINESE CRESTED', 'BASSET HOUNDS', 'AFGHAN HOUNDS', 'DACHSHUNDS', 'TREEING WALKER COONHOUNDS', 'CHINESE SHAR-PEI', 'IRISH WOLFHOUNDS', 'BLOODHOUNDS', 'ITALIAN GREYHOUNDS', 'HAVANESE'])

Let's continue benchmarking:

In [53]:
%time benchmark(calls=10000)
 46 ch  34 win   7.27 s  10,000 calls W-L: 'i.*o|o.*o|bu|a.t|j|ma|po|ay.|n.e|ie.|ar*d|el?a'
 57 ch  34 win   6.04 s  10,000 calls L-W: 'r.*y|i.h?$|^s|go|do|de|^f|k.*g|hu|m.+a|^.la|n.+k|pa|ss|ox'
 11 ch  10 win   1.23 s      25 calls B-G: 'a.$|c|de|as'
  9 ch  10 win   1.30 s      23 calls G-B: 'a$|^m*..i'
 22 ch  12 win   1.97 s     369 calls I-O: '4|pa|lt|ch|e.g|os|s.?a'
 19 ch  20 win   3.16 s     557 calls O-I: 'ns|^.i|e.+r|y|j|d|m'
 15 ch  10 win   1.69 s      83 calls P-C: 'o.$|x|ir|q|b|ep'
 11 ch  10 win   1.87 s      45 calls C-P: 'ri|a.+e|h|k'
  3 ch  21 win   2.33 s       1 calls F-B: 'foo'
 23 ch  21 win   4.10 s   4,983 calls B-F: 'k|...n|ar|ld|^i|or|la|b'
  7 ch   6 win   3.21 s      23 calls W-T: ' T|P.*E'
 12 ch   9 win   4.21 s     175 calls T-W: 'I.?S|TI|H |Y'
162 ch 100 win  11.02 s  10,000 calls N-A: 'l.v|^wa|m.+n|me?$|.i.*d|....o|a.*c.$|.o..e|wo|^g|ir$|s.u|ch*a|i.y|lo?t|in.$|an|aw*$|o.y|y.a|fe|tr|rt.?$|b$|mb|^..g?h|bu|ice|f.t|o.r$|..k$|p..e|^d*.y|..a.*e|^.?e.d'
136 ch  98 win  10.68 s  10,000 calls A-N: 'e.+y|^..l*$|l.*s|rs*e$|a.l|ei*t|st$|en$|far|ver$|^a.o|i..l|u.h|lat|yb|b.t|^l?on|ll|^to|..ea*d|ard|a.a|ow$|o.n$|wh|^o..$|ui|hu|hap*.$|r.t'
164 ch 100 win  11.22 s  10,000 calls N-V: 'h.*g|mp*.n|p.*r|..h.|mb*e|ea*s|^k?i|.ne|or.?$|f.*c|fe|^.?ar|.as|way|da?y|h.*d|c.*t.|^en|ir|am|fr|ye|em|ho|rl|po|ro..|gu|.r.i|b..k|p.+e|lev|j|.t.+e|lt|id|w.*e.|^l..$'
158 ch  94 win   7.77 s  10,000 calls V-N: 'mb$|tar|rr|^..p*l|^s*..$|b.y|f$|as*k|read|ve$|s..n|^s.n|..i.$|nk|^fi|un?t$|^t.?r|.pe|o.g|lo..|ke|nc|eed|^r?u|^.le|c.n|...w|w..t|br|^l*ea|pl?.y|fl|t.+h$|^he*.r'
 88 ch  64 win  13.69 s  10,000 calls R-B: 'nd*o|a.._|do|g.+m|f.?$|np$|rand|s...$|es$|t.st|ei|be|are|if|hu|u.a|_.*l|po.e|hl|^w|am?pl'
106 ch 141 win  29.97 s  10,000 calls B-R: 'E|^...$|ge*$|o.*d|a.?s|^c?o|bo|^f.|a.t|po?r|u.?e|id|t.i*p|N|^i.*t|li|he|ev|r.*w|de|xt|b.*r|ic*t|hr|ec|^c?a'
 43 ch 100 win  36.06 s  10,000 calls D-C: 'E.+S$|I.*S$|H.+N.S|.HI|G.?S|.SO|W |TES|O.D.'
104 ch  91 win  25.44 s  10,000 calls C-D: 'J.V|RA?$|R.*A.$|LD*$|N.*X|M$|M...$|E.*O.$|T$|AN?$|S.+E$|Y$|N.*KIN|A.B|N.B|BAL|R.G|DW|EU?X|O..A|T.+A.$|H$'
CPU times: user 3min 3s, sys: 3.24 s, total: 3min 6s
Wall time: 3min 4s

Out[53]:
1196
In [54]:
%time benchmark(calls=100000)
 46 ch  34 win  29.70 s 100,000 calls W-L: 'i.*o|j|bu|oo|a.t|nr?.e|ma|po|ay.|ie.|ar*d|el?a'
 57 ch  34 win  19.98 s 100,000 calls L-W: 'r.*y|i.h?$|^s|go|do|de|^f|k.*g|hu|m.+a|^.la|n.+k|pa|ss|ox'
 11 ch  10 win   1.23 s      25 calls B-G: 'a.$|c|de|as'
  9 ch  10 win   1.30 s      23 calls G-B: 'a$|^m*..i'
 22 ch  12 win   1.97 s     369 calls I-O: '4|pa|lt|ch|e.g|os|s.?a'
 19 ch  20 win   3.15 s     557 calls O-I: 'ns|^.i|e.+r|y|j|d|m'
 15 ch  10 win   1.70 s      83 calls P-C: 'o.$|x|ir|q|b|ep'
 11 ch  10 win   1.88 s      45 calls C-P: 'ri|a.+e|h|k'
  3 ch  21 win   2.32 s       1 calls F-B: 'foo'
 23 ch  21 win   4.11 s   4,983 calls B-F: 'k|...n|ar|ld|^i|or|la|b'
  7 ch   6 win   3.22 s      23 calls W-T: ' T|P.*E'
 12 ch   9 win   4.22 s     175 calls T-W: 'I.?S|TI|H |Y'
161 ch 100 win  23.42 s 100,000 calls N-A: 'l.v|^wa|m.+n|me?$|.i.*d|....o|a.*c.$|.o..e|wo|^g|ir$|s.u|ch*a|i.y|lo?t|aw*$|o.y|y.a|rt.?$|b$|mb|ow*.r$|^..g?h|ice|f.t|ine|fe|nt$|tr|h.n.|..k$|^d*.y|..a.*e|^.?e.d'
135 ch  98 win  24.42 s 100,000 calls A-N: 'e.+y|^..l*$|l.*s|rs*e$|a.l|ei*t|st$|en$|far|ver$|^a.o|^l?on|..ea*d|ui|u.h|^to|b.t|ard|a.a|ow$|o.n$|wh|l.t.|yb|^o..$|s...l|hu|hap*.$|r.t'
163 ch 100 win  24.16 s 100,000 calls N-V: 'h.*g|mp*.n|p.*r|..h.|mb*e|ea*s|^k?i|.ne|or.?$|f.*c|fe|^.?ar|.as|way|da?y|h.*d|c.*t.|^en|a.+r|we|am|fr|rl|ye|em|ho|po|ro..|gu|.r.i|b..k|p.+e|lev|j|.t.+e|lt|id|^l..$'
158 ch  94 win  22.90 s 100,000 calls V-N: 'mb$|tar|rr|^..p*l|^s*..$|b.y|f$|as*k|read|ve$|s..n|^s.n|..i.$|nk|^fi|un?t$|^t.?r|.pe|o.g|lo..|ke|nc|eed|^r?u|^.le|c.n|...w|w..t|br|^l*ea|pl?.y|fl|t.+h$|^he*.r'
 88 ch  64 win  28.71 s 100,000 calls R-B: 'nd*o|a.._|do|g.+m|f.?$|np$|s...$|^.a.d|es$|t.st|ei|be|ia|are|we|if|hl|hu|mu?t|_.*l|am?pl'
105 ch 141 win  46.55 s 100,000 calls B-R: 'E|^...$|ge*$|o.*d|a.?s|^c?o|bo|^f.|a.t|po?r|u.?e|id|li|t.p|N|e.*y|fe|he|ic*t|hr|rn|pu|de|ev|ec|^c?a|si|xt'
 43 ch 100 win  66.20 s 100,000 calls D-C: 'E.+S$|I.*S$|H.+N.S|.HI|G.?S|.SO|W |TES|O.D.'
103 ch  91 win  37.40 s 100,000 calls C-D: 'J.V|RA?$|R.*A.$|LD*$|N.*X|M$|M...$|E.*O.$|T$|AN?$|O.*E$|AZ|Y$|US+I|A.B|R.G|DW|K.N$|EU?X|B.L.N|T.+A.$|H$'
CPU times: user 5min 47s, sys: 3.69 s, total: 5min 51s
Wall time: 5min 48s

Out[54]:
1191

Here is the updated summary:

parts@4parts@5, reps, pairs
AlgorithmSize (chars)Time (secs)Size (chars)Time (secs)
Greedy13554
BB(1,000)130061203170
BB(10,000)1296161196184
BB(100,000)12941421191348

We're making good progress; 1191 characters is a 12% improvement over the 1355 we started with. (On the other hand, it takes almost 100 times longer.)

What next? Add more components? Actually, now I'm more concerned about searching than about components. Let me explain. In the previous benchmark, with fewer components, 100,000 calls were enough to complete the search on all but two cases, so we knew we had the optimal answer given the components, and the only place to go was to add more components.

With subpart_size = 5, we have so many additional components that benchmark(calls=100000) only completes the search half the time. Suppose one of our first choices was wrong, and the search took a component that does not belong in an optimal solution. We might well spend all 100,000 calls on the half of the search tree that keeps the component, never exploring a part of the tree without it. We need to alter the search algorithm to get us out of that rut.

Searching: Random restart

One option is to run the search for a while, then stop it, and run a different search, and hope it explores a different part of the search space. This is called a random restart search. How do we get a different search when we restart? I can think of two easy things to vary:

  • The '4' in the score function. That is, vary the tradeoff between number of winners matched and number of characters in a regex.
  • The tie-breaker. In case of a tie on max(covers, key=score), Python always picks the first component. Let's make it choose a different one each time.

The first of these is easy; we can use the random.choice function to choose an integer, K, to serve as the tradeoff factor. The second is easy too. We could write an alternative to the max function, say max_random_tiebreaker. That would work, but an easier approach is to build the tiebreaker into the score function. In addition to awarding points for matching winners, and subtracting a point for each characters in the component, we will have score add in a random number. We'll call this the "fudge factor" and use the variable F to determine how big it is. If the fudge factor is a random number between 0 and 1, then all it does is serve as a tiebreaker. But if it is a random number that can be larger than 1, then sometimes it will cause max to pick a winner that does not have quite the best score. We're not sure how much randomness is best, so we'll let the values of K and F vary from one choice to the next.

In [55]:
import random

def bb_findregex(winners, losers, calls=10000, restarts=10):
    "Find the shortest disjunction of regex components that covers winners but not losers."
    solution = '^(' + OR(winners) + ')$'
    bb = BranchBoundRandom(solution, calls)
    covers = eliminate_dominated(regex_covers(winners, losers))
    for restart in range(restarts):
        bb.calls = calls
        bb.search(covers.copy())
        if bb.calls > 0: 
            return bb # If search was not cut off, then stop
    return bb

class BranchBoundRandom(BranchBound):
    def search(self, covers, partial=None):
        """Recursively extend partial regex until it matches all winners in covers.
        Try all reasonable combinations until we run out of calls."""
        if self.calls <= 0: 
            return partial, covers
        self.calls -= 1
        covers, partial = simplify_covers(covers, partial)
        if not covers: # Nothing left to cover; solution is complete
            self.solution = min(partial, self.solution, key=len)
        elif len(OR(partial, min(covers, key=len))) < len(self.solution):
            # Try with and without the greedy-best component
            K = random.choice((2, 3, 4, 4, 5, 6))
            F = random.choice((1., 1., 2.))
            def score(c): return K * len(covers[c]) - len(c) + random.uniform(0., F)
            best = max(covers, key=score) # Best component
            covered = covers[best] # Set of winners covered by r
            covers.pop(best)
            self.search({c:covers[c]-covered for c in covers}, OR(partial, best))
            self.search(covers, partial)
        return self.solution
In [56]:
%time benchmark(calls=100)
 49 ch  34 win   5.71 s     100 calls W-L: 'i.*o|o.*o|a.a|po|.f|rc|bu|n.e|j|ay.|ma|an.$|rt|di'
 53 ch  34 win   5.08 s     100 calls L-W: '^d|r.*y|^s|go|co*c|nd.|^f|.k.?e|^.la|k.*g|de|hu|ss|ox'
 11 ch  10 win   1.23 s      29 calls B-G: 'a.$|de|c|as'
  9 ch  10 win   1.31 s      21 calls G-B: 'a$|^m*..i'
 22 ch  12 win   2.05 s     100 calls I-O: '4|pa|lt|ch|e.g|s.?a|os'
 21 ch  20 win   3.15 s     100 calls O-I: 'ng?s|e.+r|j|m|d|^.i|y'
 15 ch  10 win   1.70 s      99 calls P-C: 'o.$|x|ir|q|b|ep'
 11 ch  10 win   1.87 s      55 calls C-P: 'ri|a.+e|h|k'
  3 ch  21 win   2.30 s       1 calls F-B: 'foo'
 24 ch  21 win   3.21 s     100 calls B-F: '..r..|k|.m|...n|ld|la|ah'
  7 ch   6 win   3.22 s      31 calls W-T: ' T|P.*E'
 12 ch   9 win   4.35 s     100 calls T-W: 'I.?S|TI|H |Y'
158 ch 100 win  10.92 s     100 calls N-A: 'l.v|^wa|m.+n|me?$|....o|.i.*d|a.e$|..k$|^g|ir$|.o..e|wo|s.u|h.n.|s.+r|lo?t|aw*$|o.y|y.a|^..g?h|mb|b$|po|tr|rt.?$|i.y|f.c|f.t|ine|fe|o.r$|c.+r$|en.$|^h.a|^d*.y'
128 ch  98 win  10.14 s     100 calls A-N: 'e.+y|^o*..$|l.*s|ll|rs*e$|ei*t|st$|en$|far|^a.o|a.*a.|^to|ver$|i..l|u.h|lat|ui|b.t|^l?on|ow$|ard|n.+d|o.n$|wh|ay?b|hu|r.t|ap|hat'
162 ch 100 win  11.55 s     100 calls N-V: 'h.*g|m.r*n|p.*r|..h.|mb*e|es|^k?i|.ne|or.?$|f.*c|fe|^.?ar|.as|way|da?y|ye|c.+y|h.*d|ho|ir|am|em|o.l|b..k|ace|ro..|gu|j|lo?t|l.w|fr|s.u*d|t.t|po|^e*.d|^.er*v|w.*e.'
160 ch  94 win   7.48 s     100 calls V-N: 'mb$|tar|rr|^..p*l|^s*..$|b.y|f$|ak|read|ve$|s..n|^s.n|..i.$|nk|^fi|^t.?r|ur?t$|h.ar|.pe|o.g|nc|eed|lo..|l.a|c.n|sl|s.*k|^u|un.?$|...w|pl?.y|fl|an?t$|t.+h$|wr|br'
 88 ch  64 win  12.72 s     100 calls R-B: 'nd*o|a.._|do|g.+m|f.?$|np$|rand|es$|t.st|s...$|be|ei|ia|are|if|hl|hu|mu|_.*l|^w|we|am?pl'
105 ch 141 win  30.79 s     100 calls B-R: 'E|^...$|ge*$|o.*d|a.?s|om?p|N|^f.|tt|u.?e|id|li|i.p|de|ic*t|e.*y|fe|er?c|hr|ev|bo|pr|t.p|he|^c?a|rn|n.x*t'
 43 ch 100 win  34.44 s     100 calls D-C: 'E.+S$|I.*S$|H.+N.S|.HI|G.?S|.SO|W |TES|O.D.'
 96 ch  91 win  25.72 s     100 calls C-D: 'J.V|RA?$|T$|LI?$|AN?$|N.*X|M$|DW|S.+E$|Y$|M.+N$|M..E*$|AI$|LD$|N. *B|BAL|H$|EU|M.I|K..E|AZ|EO|IX'
CPU times: user 2min 58s, sys: 2.61 s, total: 3min
Wall time: 2min 58s

Out[56]:
1177

Remember, when it says "100 calls" here, it really means 1,000 calls, because we are doing 10 random restarts. This looks very promising; even with only 100 calls per restart, we're doing better than 100,000 calls without restarts. (But the results will be slightly different every time you re-run this cell, because of the randomness.) Let's continue:

In [57]:
%time benchmark(calls=1000)
 47 ch  34 win   6.90 s   1,000 calls W-L: 'i.*o|o.*o|a.a|po|a.t|j|bu|rc|r.i|n.e|vel|ay.|ma'
 56 ch  34 win   6.12 s   1,000 calls L-W: 'r.*y|i.e?$|^s|go|d..e|^f|la.$|ox|ss|d.n$|n.+k|k.*g|pa|hu'
 11 ch  10 win   1.22 s      29 calls B-G: 'a.$|c|de|as'
  9 ch  10 win   1.29 s      23 calls G-B: 'a$|^m*..i'
 22 ch  12 win   1.98 s     371 calls I-O: '4|pa|lt|ch|e.g|os|s.?a'
 19 ch  20 win   3.15 s     581 calls O-I: 'ns|d|^.i|e.+r|y|j|m'
 15 ch  10 win   1.71 s      95 calls P-C: 'o.$|x|ir|q|b|ep'
 11 ch  10 win   1.89 s      45 calls C-P: 'ri|h|a.+e|k'
  3 ch  21 win   2.34 s       1 calls F-B: 'foo'
 23 ch  21 win   4.89 s   1,000 calls B-F: 'k|...n|ar|ld|^i|or|la|b'
  7 ch   6 win   3.23 s      23 calls W-T: ' T|P.*E'
 12 ch   9 win   4.22 s     163 calls T-W: 'I.?S|ER|H |Y'
158 ch 100 win  12.19 s   1,000 calls N-A: 'l.v|^wa|m.+n|me?$|.i.*d|....o|a.ea?$|..k$|^g|ir$|.o..e|wo|nt$|.e.+c|lo?t|h.n.|mb|b$|rt.?$|^..g?h|ow*.r$|s.u|ine|fe|o.y|tr|f.c|i.y|ca|f.t|aw*$|y.a|^d*.y|^.?e.d'
128 ch  98 win  11.57 s   1,000 calls A-N: 'e.+y|^o*..$|l.*s|ll|rs*e$|ei*t|st$|en$|far|ver$|^a.o|i..l|u.h|lat|ui|b.t|^l?on|^to|..ea*d|ard|ow$|a.a|ay?b|o.n$|wh|hu|rh?a..|hat'
161 ch 100 win  12.72 s   1,000 calls N-V: 'h.*g|m.r*n|p.*r|..h.|mb*e|^k?i|.ne|or.?$|e.*o|po|f.*c|fe|^.?ar|.as|way|.d.$|h.*d|rl|^en|lo?t|l.w|j|c.+y|ye|am|em|ho|gu|b..k|ace|ro..|a.+r|we|e.e.|fr|ic|.t.+e|d.y'
157 ch  94 win   8.95 s   1,000 calls V-N: 'mb$|tar|rr|^..p*l|^s*..$|b.y|f$|as*k|read|ve$|s..n|^s.n|ink*$|^fi|un?t$|^t.?r|.pe|o.g|w.?i|lo..|ke|nc|eed|l.a|c.n|sl|^r?u|br|...w|an?t$|^he*.r|pl?.y|fl|t.+h$'
 88 ch  64 win  14.03 s   1,000 calls R-B: 'nd*o|a.._|do|g.+m|rand|f.?$|np$|est?$|s...$|_s|be|^w|ia|ch..|if|hu|ei|_.*l|^p.r|we|am?pl'
104 ch 141 win  32.03 s   1,000 calls B-R: 'E|^...$|ge*$|o.*d|a.?s|^c?o|bo|^f.|a.t|e.*y|u.?e|id|fe|^i.*t|li|N|t.p|ev|pr|he|i.p|de|xt|^a|ic*t|hr|c..l'
 43 ch 100 win  35.49 s   1,000 calls D-C: 'E.+S$|I.*S$|H.+N.S|.HI|G.?S|.SO|W |TES|O.D.'
 96 ch  91 win  27.10 s   1,000 calls C-D: 'J.V|RA?$|T$|LD*$|AN?$|N.*X|M$|S.+E$|Y$|M.+N$|M..E*$|H$|N. *B|BAL|M.I|LI$|AI$|DW|NK|EU?X|AZ|EO|IX'
CPU times: user 3min 12s, sys: 2.9 s, total: 3min 14s
Wall time: 3min 13s

Out[57]:
1170
In [58]:
%time benchmark(calls=10000)
 46 ch  34 win  23.52 s  10,000 calls W-L: 'i.*o|o.*o|bu|a.t|j|ma|po|ie.|ay.|ar*d|n.e|el?a'
 52 ch  34 win  19.92 s  10,000 calls L-W: 'r.*y|^d|^s|go|^f|m.+a|....k|^.?la|ki?..$|de|hu|ss|ox'
 11 ch  10 win   1.22 s      25 calls B-G: 'a.$|c|de|as'
  9 ch  10 win   1.32 s      19 calls G-B: 'a$|^m*..i'
 22 ch  12 win   1.97 s     355 calls I-O: '4|pa|lt|ch|s.?a|e.g|os'
 19 ch  20 win   3.11 s     621 calls O-I: 'ns|^.i|e.+r|y|j|d|m'
 15 ch  10 win   1.69 s      97 calls P-C: 'o.$|x|ir|q|b|ep'
 11 ch  10 win   1.87 s      43 calls C-P: 'ri|a.+e|h|k'
  3 ch  21 win   2.30 s       1 calls F-B: 'foo'
 23 ch  21 win   4.18 s   5,223 calls B-F: 'k|...n|or|ar|ld|la|^i|b'
  7 ch   6 win   3.26 s      23 calls W-T: ' T|P.*E'
 12 ch   9 win   4.22 s     167 calls T-W: 'I.?S| F|RA|Y'
159 ch 100 win  25.66 s  10,000 calls N-A: 'l.v|^wa|m.+n|me?$|.i.*d|....o|wo|a.*c.$|.o..e|^g|ir$|nt$|h.n.|lo?t|aw*$|o.y|y.a|s.u|tr|rt.?$|^..g?h|b$|mb|ca|i.y|ow*.r$|ice|f.t|ine|fe|..k$|^d*.y|..a.*e|^.?e.d'
128 ch  98 win  27.69 s  10,000 calls A-N: 'e.+y|^o*..$|l.*s|ll|rs*e$|ei*t|st$|en$|far|ver$|i..l|u.h|lat|ui|ow*$|a.*a.|^to|ay?b|b.t|lon|ard|o.nd*$|wh|^on|n.+d|hu|hap*.$|r.t'
161 ch 100 win  26.15 s  10,000 calls N-V: 'h.*g|m.r*n|p.*r|..h.|mb*e|t.?o|.ne|^k?i|f.*c|fe|^.?ar|.as|way|da?y|h.*d|^en|lo?t|l.w|j|c.+y|ye|or.?$|rl|ho|am|em|p.+e|po|ro..|gu|b..k|id|a.+r|we|fr|e.e.|ic|.t.+e'
158 ch  94 win  23.69 s  10,000 calls V-N: 'mb$|tar|rr|^..p*l|^s*..$|b.y|f$|as*k|read|ve$|s..n|^s.n|ink*$|^fi|un?t$|^t.?r|.pe|o.g|w..t|sw|lo..|ke|nc|eed|sl|l.a|c.n|^ea|^r?u|br|...w|pl?.y|fl|t.+h$|^he*.r'
 86 ch  64 win  28.64 s  10,000 calls R-B: 'nd*o|a.._|do|rand|f$|np$|g.+m|est?$|s...$|_s|u.?a|ei|be|o$|if|hu|_.*l|po.e|hl|^w|am?pl'
104 ch 141 win  47.87 s  10,000 calls B-R: 'E|^...$|ge*$|o.*d|a.?s|^c?o|bo|^f.|a.t|e.*y|u.?e|id|fe|^i.*t|li|N|t.p|pr|he|ev|i.p|de|xt|^c?a|ic*t|hr|ec'
 43 ch 100 win  56.61 s  10,000 calls D-C: 'E.+S$|I.*S$|H.+N.S|.HI|G.?S|.SO|W |TES|O.D.'
 96 ch  91 win  40.46 s  10,000 calls C-D: 'J.V|RA?$|T$|LD*$|AN?$|N.*X|M$|Y$|S.+E$|M.+N$|M..E*$|N. *B|BAL|H$|LI$|AI$|DW|M.I|IX|EU|K..E|EO|AZ'
CPU times: user 5min 44s, sys: 3.71 s, total: 5min 47s
Wall time: 5min 45s

Out[58]:
1165

Just One More Thing: Negative Lookahead

In the foo versus bar example, we have a big asymmetry in the solution lengths: three letter for 'foo' compared to 23 characters in the other direction. But if you are a real regular expression superhero, then you might know about negative lookahead assertions and you could solve bar versus foo like this:

In [59]:
solution = '^(?!.*foo)'
verify(solution, bar, foo)
Out[59]:
True

Whoa—what just happened there? The form '^(?!.*foo) means "starting from the beginning of the line, look ahead for '.*foo' (that is, any number of characters followed by 'foo'). If it is there, then succeed, else fail." That's just what we need to match all the non-foo strings in the bar collection. Let's apply this idea to the complete benchmark and see how many characters we save. We'll use the SOLUTION dict that was compiled by the previous call to benchmark:

In [60]:
def negative_lookahead_solution(W, L): 
    "Find a solution for W but not L by negative lookahead on SOLUTION[L, W]."
    solution = '^(?!.*(' + SOLUTION[L, W] + '))'
    assert verify(solution, W, L)
    return solution

def better_solution(W, L):
    "Return the better of the 'regular' or negative lookahead solutions."
    return min(SOLUTION[W, L], negative_lookahead_solution(W, L),
               key=len)
    
%time sum(len(better_solution(W, L)) for (W, _, _, L) in ALL)
CPU times: user 19 ms, sys: 5.38 ms, total: 24.4 ms
Wall time: 20.2 ms

Out[60]:
1079

Very nice! We've improved from 1163 to 1079 characters, and it took almost no time at all. Here are all the results, in table and plot form:

parts@4parts@5, reps, pairs
AlgorithmSize (chars)Time (secs)Size (chars)Time (secs)
Greedy13554
BB(1,000)130061203170
BB(10,000)1296161196184
BB(100,000)12941421191348
BBR(100 × 10)1177178
BBR(1,000 × 10)1170193
BBR(10,000 × 10)1165345
BBR(10,000 × 10) + Neg Look1079345
In [63]:
m = 60.
plot([4/m], [1355], 'gD-', label='Greedy')
plot([170/m, 184/m, 348/m], [1203, 1196, 1191], 'bo-', label='BB')
plot([178/m, 193/m, 345/m], [1177, 1170, 1165], 'rs-', label='BBR')
plot([345/m], [1079], 'k*-', markersize=12, label='BBR + Neg Look')
xlabel('Time (minutes)'); ylabel('Solution size (chars)')
legend(loc='upper right');

(This plot gives the total solution length as a function of the amount of computation time.)

We've made great strides, improving the total solution length by 20% over the initial version with greedy search. (It is up to you whether that 20% improvement is worth the increase in time from 4 seconds to 6 minutes.)

Speculating: Other Ideas

I'm going to stop here, but I'll briefly mention some other ideas that I didn't get to investigate. Perhaps you can play with some of these, or with other ideas.

  • Completely different approach: Thomas Breuel suggested doing minimization of weighted finite state transducers.
  • Character classes: We never considered character classes, such as '[abc]'. It is not obvious they would help, but it is possible. Consider the fragment 'ld|la', which shows up in one of the solutions. We could replace that with a single component, 'l[da]'. But we haven't gained anything; both are 5 characters long. If we had had 'l.d|l.c|l.b|l.a' then the replacement 'l.[a-d]' would save 8 characters, but I don't see many opportunities like that. It might also be possible to use negative character classes, like l[^y] to explicitly avoid something in the losers set.

  • Prefilter parts: Currently, regex_covers generates a huge set of components, then eliminates the ones that match losers (or don't match winners), and then eliminates the ones that are dominated. We could make regex_covers faster by not generating components that can't possibly contribute. Suppose we're considering a subpart, say 'ab'. If it matches no losers, then there is no reason to extend it to a longer subpart, such as 'abc', because we know that 'abc' is dominated by 'ab'. By filtering out dominated parts before we have to check them, I estimate we should be able to cut the time for regex_covers in half, maybe better.

  • Better component choice: We pick the "best" component based on how many winners it covers and how many characters it takes. Maybe we could have a better measure; perhaps taking into account whether it covers winners that are "easy" to cover by other regexes, or "hard" to cover.
  • Tuning: We have the two parameters K and F, which are randomly chosen from a fairly arbitrary distribution. We could measure which values of the parameters perform better and pick those values more often.
  • Post-editing: We concentrated on the task of generating a solution from scratch; another task is to improve an existing solution. For example, take the set of components in a solution, and see if one component could be dropped—it is possible that all the winners covered by that component might accidentally have been covered by other components, making it unnecessary. Or take all pairs of components, and see if they could be replaced by a shorter set of components.
  • Better bounds: Branch and Bound search works best if the bounds are tight. If we can recognize early that the path we are on cannot possibly be as good as the best solution found so far then we can cut off search early and save time. But the bound estimate we are using now is a poor one. We estimate the cost of a solution as the cost of the partial solution plus the cost of the shortest component remaining. We use this estimate because all we know for sure is that we need at least one more component. Here's one way to get a better bound by recognizing that in many cases we will need more than one more component. We'll define the following quantities:

    • P = the length of the partial solution, plus the "|", if needed. So if the partial solution is None, then P = 0, else P is the length plus 1.
    • S = the length of the shortest regex component in covers.
    • W = the number of winners still in covers.
    • C = the largest number of winners covered by any regex in covers.

      The current estimate is P + S. We can see that a better estimate is P + S × ceil(W / C).

I hope you had fun seeing how I (and Stefan) attacked this problem; now go out and solve some problems of your own.

If you want to download the Python in one file, here it is: xkcd1313.py


Peter Norvig, Feb. 2014