We discuss what versus how programming and resolve performance flaws with memoization.
We use the classic Fibonacci sequence as an initial example:
0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, ...
We finish with an example that uses memoization to minimize repeated web requests.
The Fibonacci sequence is often defined recursively as follows:

             / 0                        if i is 0
    fib(i) = | 1                        if i is 1
             \ fib(i - 1) + fib(i - 2)  otherwise
Because Python supports recursion we can translate this mathematical definition cleanly into code
def fib_what(i):
    if i == 0:
        return 0
    elif i == 1:
        return 1
    else:
        return fib_what(i - 1) + fib_what(i - 2)
map(fib_what, range(11))
[0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
A common algorithm for how to compute the $i^{th}$ Fibonacci number efficiently is to keep a running pair of numbers, summing them together to obtain the next
def fib_how(i):
    a, b = 0, 1
    for _ in range(i):
        a, b = b, a + b
    return a
map(fib_how, range(11))
[0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
These two implementations differ in two ways.
Style: What vs. How:
The first function resembles the mathematical definition; it states what is to be computed without specifying how. How that computation actually proceeds depends on how the Python runtime handles recursive calls.
The second function gives a clear recipe for how to compute the answer; however, it is not immediately clear from the function definition what is being computed.
What solutions tend to be more conceptually clear than how solutions.
Performance
Unfortunately the what solution is painfully slow. This is common with what code.
%timeit fib_what(35)
1 loops, best of 3: 4.1 s per loop
%timeit fib_how(35)
100000 loops, best of 3: 2.14 µs per loop
Note the units. The what solution is roughly two million times slower on this input, and it only gets worse: it requires exponentially more time than the how solution because it calls fib on the same values over and over again.
For example:

    fib(5) calls fib(4) and fib(3)
    fib(4) calls fib(3) and fib(2)
    fib(3) calls fib(2) and fib(1)
    fib(3) also calls fib(2) and fib(1)
    fib(2) calls fib(1) and fib(0)
    ...

so fib(3) is computed twice and fib(2) three times, all to evaluate fib(5) once. For low inputs like 5 this is merely comical. For large values it rapidly becomes tragic.
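To see the blowup concretely, we can instrument the recursion with a call counter. This is a sketch for illustration; the counter name `calls` is ours and not part of the original code.

```python
# Count how many times the naive recursion is entered
calls = {'n': 0}  # a mutable counter held in a dict so the inner function can update it

def fib_counted(i):
    calls['n'] += 1
    if i == 0:
        return 0
    elif i == 1:
        return 1
    else:
        return fib_counted(i - 1) + fib_counted(i - 2)

fib_counted(20)
print(calls['n'])  # 21891 calls to compute a single Fibonacci number
```

Over twenty thousand calls for an input of only 20; the count roughly doubles with each increment of the input.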
The what solution can be saved by caching intermediate values:

cache = dict()

def fib(i):
    if i in cache:         # Have we seen this input before?
        return cache[i]    # If so, return the cached result
    if i == 0:
        return 0
    elif i == 1:
        return 1
    else:
        result = fib(i - 1) + fib(i - 2)
        cache[i] = result  # Remember that fib(i) == result
        return result
%timeit cache.clear(); fib(35)
100000 loops, best of 3: 16.8 µs per loop
By caching the results of expensive functions we decrease computation time at the cost of increased memory use. This is often a worthwhile trade.
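The memory cost here is modest. A quick check, repeating the cached definition from above so the snippet stands alone, shows the cache holds one entry per distinct input:

```python
# Self-contained copy of the cached fib from above
cache = dict()

def fib(i):
    if i in cache:
        return cache[i]
    if i == 0:
        return 0
    elif i == 1:
        return 1
    else:
        result = fib(i - 1) + fib(i - 2)
        cache[i] = result
        return result

fib(35)
print(len(cache))  # 34 entries: linear memory buys an exponential speedup
```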
Historically, functional languages favor what over how implementations. Consider our friend map: the map function defines what should be done while hiding exactly how it is done. We are free to implement map in a variety of interesting ways. Functional abstractions take computational control away from programmers and deliver it to language and infrastructure designers.
Because functional languages favor what solutions, they often run into issues like the repeated computation we saw in the Fibonacci example. Rather than explicitly cache every function we write, we turn caching itself into a higher-order function. This function is called memoize.
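To make the idea concrete, here is a minimal sketch of how such a memoize function might be written. This is an illustration under our own naming (`memoize_sketch`), not toolz's actual implementation, which is more general.

```python
def memoize_sketch(func, cache=None):
    """ Wrap func so repeated calls with the same arguments hit a cache """
    if cache is None:
        cache = {}
    def memoized(*args):
        if args not in cache:
            cache[args] = func(*args)  # compute once, remember forever
        return cache[args]
    return memoized

def fib(i):
    if i < 2:
        return i
    return fib(i - 1) + fib(i - 2)

fib = memoize_sketch(fib)  # rebinding fib means recursive calls also hit the cache
print(fib(35))  # 9227465, computed quickly
```

Rebinding the name fib is the key trick: the recursive calls inside the function body now resolve to the memoized wrapper.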
from toolz import memoize

def fib(i):
    if i == 0:
        return 0
    elif i == 1:
        return 1
    else:
        return fib(i - 1) + fib(i - 2)

cache = dict()
fib = memoize(fib, cache)
%timeit cache.clear(); fib(35)
10000 loops, best of 3: 33.4 µs per loop
In the above example we explicitly create a cache so that we can clear it before each timing. Usually we just call memoize with a single argument, like so:
fib = memoize(fib)
and the cache dictionary is generated automatically.
Alternatively we may prefer to use the Python decorator syntax:

@memoize
def fib(i):
    if i == 0:
        return 0
    elif i == 1:
        return 1
    else:
        return fib(i - 1) + fib(i - 2)

Both of these are equivalent.
Memoization provides a purely how optimization so that we can continue to shamelessly write mathematically clear what code.
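As an aside, on Python 3 the standard library offers a comparable decorator, functools.lru_cache. It is not used in this chapter, which targets Python 2, but with an unbounded cache it behaves much like memoize:

```python
from functools import lru_cache  # available on Python 3.2+

@lru_cache(maxsize=None)  # maxsize=None gives an unbounded cache, like memoize
def fib(i):
    if i == 0:
        return 0
    elif i == 1:
        return 1
    else:
        return fib(i - 1) + fib(i - 2)

fib(35)  # fast, thanks to the cache
```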
The Fibonacci example is a canonical use case of memoization. Sadly we are rarely called upon to compute Fibonacci numbers in real life.
Here we play with a slightly more realistic example. We want to sort a set of webpages by the number of distinct words in their body.
# We'll use the Wikipedia articles for the fifty United States
with open('data/states.txt') as f:
    urls = ['http://en.wikipedia.org/wiki/' + state for state in f.read().strip().split()]
Our task is to sort these URLs by the number of distinct words present in the text of each page. We'll do this using the sorted function's cmp keyword and urlopen from urllib2.
We'll go through an implementation that explicitly caches results for performance. You'll simplify this implementation by adding memoize appropriately.
# Naive Solution
# We can get the raw HTML from a URL with `urlopen`
# Depending on your internet connection this might take a little while
import urllib2
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]  # Trick Wikipedia into thinking that we're not a robot
urlopen = opener.open

def count_distinct_words(url):
    """ Count distinct words of a URL

    >>> count_distinct_words('http://www.example.com')
    86
    """
    text = urlopen(url).read()
    return len(set(text.split()))
def cmp_url(a, b):
    """ A comparator function for URLs

    >>> cmp_url('http://en.wikipedia.org/', 'http://www.example.org')
    1
    """
    return count_distinct_words(a) - count_distinct_words(b)
# sorted(urls, cmp=cmp_url) # We could do this but it would be slow
# Efficient Solution
# To increase performance and reduce the extent to which we hammer the Wikipedia servers
# we'll precompute all of the counts ahead of time in one pass
# and store them in a global dictionary
word_counts = dict()
for url in urls:
    word_counts[url] = count_distinct_words(url)

def cmp_url_cached(a, b):
    """ A cached version of ``cmp_url`` """
    return cmp(word_counts[a], word_counts[b])
# sorted(urls, cmp=cmp_url_cached)
KeyboardInterrupt (traceback elided: the precomputation loop was interrupted while downloading pages)
Your task is simple: use memoize on our naive solution to achieve the same performance as the efficient solution.
Establishing state is often required for performance. Memoization captures a common tactic for higher-performance code, particularly when repeated expensive operations must be avoided. The memoize higher-order function wraps this functionality into a simple function call or decorator.