We discuss what versus how programming and resolve performance flaws with memoization.
We use the classic Fibonacci sequence as an initial example:
0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, ...
We finish with an example that uses memoization to minimize repeated web requests.
The Fibonacci sequence is often defined recursively as follows:

             / 0                        if i is 0
    fib(i) = | 1                        if i is 1
             \ fib(i - 1) + fib(i - 2)  otherwise
Because Python supports recursion we can translate this mathematical definition cleanly into code
def fib_what(i):
    if i == 0:
        return 0
    elif i == 1:
        return 1
    else:
        return fib_what(i - 1) + fib_what(i - 2)
map(fib_what, range(11))
[0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
A common algorithm for how to compute the $i^{th}$ Fibonacci number efficiently is to keep a running pair of numbers, summing them together to obtain the next
def fib_how(i):
    a, b = 0, 1
    for _ in range(i):
        a, b = b, a + b
    return a
map(fib_how, range(11))
[0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
These two implementations differ in two ways.
Style: What vs. How:
The first function resembles the mathematical definition; it states what is to be computed without specifying how. How that computation actually proceeds depends on how the Python runtime handles recursive calls.
The second function gives a clear recipe for how to compute the answer; however, it is not immediately clear from the function definition what is being computed.
What solutions tend to be more conceptually clear than how solutions.
Performance
Unfortunately the what solution is painfully slow. This is common with what code.
%timeit fib_what(35)
1 loops, best of 3: 4.1 s per loop
%timeit fib_how(35)
100000 loops, best of 3: 2.14 µs per loop
Note the units. The what solution is roughly two million times slower on this input, and it only gets worse: it requires exponentially more time than the how solution because it calls fib on the same values over and over again.
For example:

    fib(5) calls fib(4) and fib(3)
    fib(4) calls fib(3) and fib(2)
    fib(3) calls fib(2) and fib(1)
    fib(3) also calls fib(2) and fib(1)
    fib(2) calls fib(1) and fib(0)
    ...

so fib(3) is computed twice and fib(2) three times, all to evaluate fib(5) once. For low inputs like 5 this is merely comical. For large values it rapidly becomes tragic.
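To see the blowup concretely, we can instrument the recursion with a call counter. This is a sketch for illustration; the counter name `calls` is ours and not part of the original code.

```python
# Count how many times the naive recursion is entered
calls = {'n': 0}  # a mutable counter held in a dict so the inner function can update it

def fib_counted(i):
    calls['n'] += 1
    if i == 0:
        return 0
    elif i == 1:
        return 1
    else:
        return fib_counted(i - 1) + fib_counted(i - 2)

fib_counted(20)
print(calls['n'])  # 21891 calls to compute a single Fibonacci number
```

Over twenty thousand calls for an input of only 20; the count roughly doubles with each increment of the input.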
The what solution can be saved by caching intermediate values:

cache = dict()

def fib(i):
    if i in cache:         # Have we seen this input before?
        return cache[i]    # If so, return the cached result
    if i == 0:
        return 0
    elif i == 1:
        return 1
    else:
        result = fib(i - 1) + fib(i - 2)
        cache[i] = result  # Remember that fib(i) == result
        return result
%timeit cache.clear(); fib(35)
100000 loops, best of 3: 16.8 µs per loop
By caching the results of expensive functions we decrease computation time at the cost of increased memory use. This is often a worthwhile trade.
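The memory cost here is modest. A quick check, repeating the cached definition from above so the snippet stands alone, shows the cache holds one entry per distinct input:

```python
# Self-contained copy of the cached fib from above
cache = dict()

def fib(i):
    if i in cache:
        return cache[i]
    if i == 0:
        return 0
    elif i == 1:
        return 1
    else:
        result = fib(i - 1) + fib(i - 2)
        cache[i] = result
        return result

fib(35)
print(len(cache))  # 34 entries: linear memory buys an exponential speedup
```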
Historically, functional languages favor what over how implementations. Consider our friend map: the map function defines what should be done while hiding exactly how it is done. We are free to implement map in a variety of interesting ways. Functional abstractions take computational control away from programmers and deliver it to language and infrastructure designers.
Because functional languages favor what solutions, they often run into issues like the repeated computation we saw in the Fibonacci example. Rather than explicitly cache every function we write, we turn caching itself into a higher-order function. This function is called memoize.
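To make the idea concrete, here is a minimal sketch of how such a memoize function might be written. This is an illustration under our own naming (`memoize_sketch`), not toolz's actual implementation, which is more general.

```python
def memoize_sketch(func, cache=None):
    """ Wrap func so repeated calls with the same arguments hit a cache """
    if cache is None:
        cache = {}
    def memoized(*args):
        if args not in cache:
            cache[args] = func(*args)  # compute once, remember forever
        return cache[args]
    return memoized

def fib(i):
    if i < 2:
        return i
    return fib(i - 1) + fib(i - 2)

fib = memoize_sketch(fib)  # rebinding fib means recursive calls also hit the cache
print(fib(35))  # 9227465, computed quickly
```

Rebinding the name fib is the key trick: the recursive calls inside the function body now resolve to the memoized wrapper.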
from toolz import memoize

def fib(i):
    if i == 0:
        return 0
    elif i == 1:
        return 1
    else:
        return fib(i - 1) + fib(i - 2)

cache = dict()
fib = memoize(fib, cache)
%timeit cache.clear(); fib(35)
10000 loops, best of 3: 33.4 µs per loop
In the above example we explicitly create a cache so that we can clear it before each timing. Usually we just call memoize with a single argument, like so:
fib = memoize(fib)
and the cache dictionary is generated automatically.
Alternatively we may prefer to use the Python decorator syntax:

@memoize
def fib(i):
    if i == 0:
        return 0
    elif i == 1:
        return 1
    else:
        return fib(i - 1) + fib(i - 2)

Both of these are equivalent.
Memoization provides a purely how optimization so that we can continue to shamelessly write mathematically clear what code.
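As an aside, on Python 3 the standard library offers a comparable decorator, functools.lru_cache. It is not used in this chapter, which targets Python 2, but with an unbounded cache it behaves much like memoize:

```python
from functools import lru_cache  # available on Python 3.2+

@lru_cache(maxsize=None)  # maxsize=None gives an unbounded cache, like memoize
def fib(i):
    if i == 0:
        return 0
    elif i == 1:
        return 1
    else:
        return fib(i - 1) + fib(i - 2)

fib(35)  # fast, thanks to the cache
```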
The Fibonacci example is a canonical use case of memoization. Sadly we are rarely called upon to compute Fibonacci numbers in real life.
Here we play with a slightly more realistic example. We want to sort a set of webpages by the number of distinct words in their body.
# We'll use the Wikipedia articles for the fifty United States
with open('data/states.txt') as f:
    urls = ['http://en.wikipedia.org/wiki/' + state for state in f.read().strip().split()]
Our task is to sort these URLs by the number of distinct words present in the text of each page. We'll do this using the sorted function's cmp keyword and urlopen from urllib2.
We'll go through an implementation that explicitly caches results for performance. You'll simplify this implementation by adding memoize appropriately.
# Naive Solution
# We can get the raw HTML from a URL with `urlopen`
# Depending on your internet connection this might take a little while
import urllib2
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]  # Trick Wikipedia into thinking that we're not a robot
urlopen = opener.open

def count_distinct_words(url):
    """ Count distinct words of a URL

    >>> count_distinct_words('http://www.example.com')
    86
    """
    text = urlopen(url).read()
    return len(set(text.split()))
def cmp_url(a, b):
    """ A comparator function for URLs

    >>> cmp_url('http://en.wikipedia.org/', 'http://www.example.org')
    1
    """
    return count_distinct_words(a) - count_distinct_words(b)
# sorted(urls, cmp=cmp_url) # We could do this but it would be slow
# Efficient Solution
# To increase performance and reduce the extent to which we hammer the Wikipedia servers
# we'll precompute all of the counts ahead of time in one pass
# and store them in a global dictionary
word_counts = dict()
for url in urls:
    word_counts[url] = count_distinct_words(url)

def cmp_url_cached(a, b):
    """ A cached version of ``cmp_url`` """
    return cmp(word_counts[a], word_counts[b])
# sorted(urls, cmp=cmp_url_cached)
KeyboardInterrupt (traceback elided: the precomputation loop was interrupted while downloading pages)
Your task is simple: use memoize on our naive solution to achieve the same performance as the efficient solution.
Establishing state is often required for performance. Memoization captures a common tactic for higher-performance code, particularly when repeated expensive operations must be avoided. The memoize higher-order function wraps this functionality into a simple function call or decorator.