See http://ivory.idyll.org/blog/2013-pycon-awesome-big-data-algorithms-talk.html

## IPython Blocks

Below, I will use IPython Blocks (https://github.com/jiffyclub/ipythonblocks) to demonstrate Bloom filters.

IPython Blocks is a nifty little visualization tool built by Matt Davis (@jiffyclub) for use in teaching Python and programming basics. Here's a quick little demo --

In [1]:
from ipythonblocks import BlockGrid

In [2]:
grid = BlockGrid(10,10, fill=(0,0,128))

x = grid.shape[0]
y = grid.shape[1]

for block in grid:
    r = block.row * 255 / float(x)
    g = block.col * 255 / float(y)
    block.red = r
    block.green = g

grid

Out[2]:

## Bloom filters

I'll implement Bloom filters using multiple individual hash tables. (The canonical way to implement them is to use multiple hash functions with one big hash table, but I find that a bit harder to understand.)
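For comparison, here's a minimal sketch of that canonical design -- a single bit array probed by k hash functions. The `CanonicalBloom` name and the use of salted md5 to stand in for k independent hash functions are just illustrative choices, not the code used below:

```python
import hashlib

class CanonicalBloom(object):
    """One bit array, k salted hash functions -- the canonical layout."""
    def __init__(self, size, num_hashes):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [0] * size

    def _positions(self, num):
        # derive k table positions by salting a single hash function
        for k in range(self.num_hashes):
            digest = hashlib.md5(('%d:%d' % (k, num)).encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, num):
        for pos in self._positions(num):
            self.bits[pos] = 1

    def get(self, num):
        return all(self.bits[pos] for pos in self._positions(num))

bf = CanonicalBloom(100, 3)
bf.add(253)
print(bf.get(253))   # True: inserted elements are always found
```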

First, let's build a simple hash table object that doesn't track collisions. Note that to get 'num' from a string, you'd just run the string through a hash function.
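For example, a string can be turned into a suitable 'num' with any hash function; md5 (via a hypothetical `str_to_num` helper) is used here purely for illustration:

```python
import hashlib

def str_to_num(s):
    # interpret the md5 digest of the string as a (large) integer
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

# modding by the table size turns the number into a bucket index
print(str_to_num('bloom') % 11)
```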

In [3]:
import math

class Hash(object):
    def __init__(self, size):
        self.size = size
        self.bits = [0]*size

    def add(self, num):
        num = num % self.size
        self.bits[num] = 1

    def get(self, num):
        num = num % self.size
        return self.bits[num]


Next, build a full Bloom filter that creates multiple hash tables; here, 'add' inserts the element into each hash table, and 'get' checks to see if the element is in all the hash tables.

I've also included three utility methods: fp(), empirical_fp(), and show(). The first calculates the predicted false positive rate based on hash table occupancy; the second calculates the actual false positive rate for a range of numbers; and the third shows the hash table occupancy using IPython Blocks.
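The arithmetic behind fp() is just the product of the per-table occupancies, treating the tables as independent. For example, with 2 bits set in each of the size-5, -7, and -11 tables (illustrative numbers only):

```python
# predicted false positive rate = product of per-table occupancies
occupancies = [2 / 5.0, 2 / 7.0, 2 / 11.0]

predicted = 1.0
for f in occupancies:
    predicted *= f

print(predicted)   # ~0.0208, i.e. about a 2% chance of a false hit
```

fp() below computes the same product by summing base-2 logs and exponentiating, which is equivalent.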

In [4]:
class BloomFilter(object):
    def __init__(self, *sizes):
        self.hashes = [ Hash(size) for size in sizes ]

    def add(self, num):
        for h in self.hashes:
            h.add(num)

    def get(self, num):
        for h in self.hashes:
            if not h.get(num):
                return 0
        return 1

    def fp(self):
        total = 0.
        for h in self.hashes:
            occupancy = sum(h.bits)
            f = occupancy / float(h.size)
            total += math.log(f, 2)

        return 2**total

    def empirical_fp(self, actual, max):
        found_true = 0
        found_false = 0
        for i in range(max):
            if self.get(i):
                if i in actual:
                    found_true += 1
                else:
                    found_false += 1

        return found_false / float(max)

    def show(self):
        rows = len(self.hashes)
        cols = max([ h.size for h in self.hashes ])
        grid = BlockGrid(cols, rows, fill=(0, 0, 0))
        for i, h in enumerate(self.hashes):
            for pos in range(h.size, cols):
                grid[i, pos] = (255, 255, 255)
            for j, bit in enumerate(h.bits):
                if bit:
                    grid[i, j] = (255, 0, 0)
        return grid.show()


Let's create a Bloom filter with three hash tables of sizes 5, 7, and 11, and then show the occupied cells in the three hash tables after adding '253' and '8132' (the numbers have no special significance).

In [5]:
x = BloomFilter(5, 7, 11)
x.add(253)
x.show()
print x.get(253)

x.add(8132)
x.show()
print x.get(8132)

1

1



Next, let's check out what happens when you start filling up the hash tables with lots and lots of entries.

In [6]:
import random, time

x = BloomFilter(5, 7, 11)
actual = set()
for _ in range(10):
    num = random.randint(0, 255)
    x.add(num)
    actual.add(num)
    x.show()
    theory, empirical = x.fp(), x.empirical_fp(actual, 255)
    print 'inserting', num
    print 'predicted FP:', theory, 'diff from actual:', abs(theory-empirical)
    time.sleep(1)

inserting 2
predicted FP: 0.0025974025974 diff from actual: 0.0025974025974


inserting 49
predicted FP: 0.0207792207792 diff from actual: 0.00509294626942


inserting 141
predicted FP: 0.0701298701299 diff from actual: 0.0034632034632


inserting 151
predicted FP: 0.124675324675 diff from actual: 0.0187929717341


inserting 26
predicted FP: 0.194805194805 diff from actual: 0.0183346065699


inserting 54
predicted FP: 0.233766233766 diff from actual: 0.0220015278839


inserting 0
predicted FP: 0.363636363636 diff from actual: 0.0185383244207


inserting 96
predicted FP: 0.363636363636 diff from actual: 0.0224598930481


inserting 248
predicted FP: 0.623376623377 diff from actual: 0.0390628978864


inserting 54
predicted FP: 0.623376623377 diff from actual: 0.0390628978864



## One-sided error

Bloom filters have what's called one-sided error: they may report elements that were never inserted as being members of the set -- false positives -- but they will never miss reporting elements that WERE inserted as being members -- false negatives. This is a straightforward consequence of the "no collision tracking" aspect of the hash tables: collisions lead to false reporting, but you never miss something you've already inserted.
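A minimal single-table sketch of this asymmetry (the numbers are illustrative only):

```python
# bits can only be turned on, never off -- so an inserted element's
# bit is always set, and false negatives are impossible.
size = 11
bits = [0] * size

inserted = [1, 4, 9, 16]
for num in inserted:
    bits[num % size] = 1

# every inserted element is reported present
print(all(bits[num % size] for num in inserted))   # True
# but 12 was never inserted -- it collides with 1 (12 % 11 == 1)
print(bits[12 % size])                             # 1: a false positive
```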

If you know the hash table sizes, it's easy to predict collisions. One simple way is to test a number that's 5\*7\*11 = 385 more than something you've already inserted, i.e. to force a collision based on the moduli of the hash table sizes.

In [7]:
# here, we add '3' and test for '3 + 5*7*11', or 388.
x = BloomFilter(5, 7, 11)
x.add(3)
print x.get(3), x.get(388)

1 1