See http://ivory.idyll.org/blog/2013-pycon-awesome-big-data-algorithms-talk.html

## IPython Blocks

Below, I will use IPython Blocks (https://github.com/jiffyclub/ipythonblocks) to demonstrate Bloom filters.

IPython Blocks is a nifty little visualization tool built by Matt Davis (@jiffyclub) for use in teaching Python and programming basics. Here's a quick little demo --

In [1]:
from ipythonblocks import BlockGrid

In [2]:
grid = BlockGrid(10,10, fill=(0,0,128))

x = grid.shape[0]
y = grid.shape[1]

for block in grid:
    r = block.row * 255 / float(x)
    g = block.col * 255 / float(y)
    block.red = r
    block.green = g

grid

Out[2]:

## Bloom filters

I'll implement Bloom filters using multiple individual hash tables. (The canonical way to implement them is to use multiple hash functions with one big hash table, but I find that a bit harder to understand.)
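For comparison, here's a minimal sketch of that canonical design -- a single bit array probed by k hash functions. The `CanonicalBloom` name and the use of salted md5 to stand in for k independent hash functions are just illustrative choices, not the code used below:

```python
import hashlib

class CanonicalBloom(object):
    """One bit array, k salted hash functions -- the canonical layout."""
    def __init__(self, size, num_hashes):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [0] * size

    def _positions(self, num):
        # derive k table positions by salting a single hash function
        for k in range(self.num_hashes):
            digest = hashlib.md5(('%d:%d' % (k, num)).encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, num):
        for pos in self._positions(num):
            self.bits[pos] = 1

    def get(self, num):
        return all(self.bits[pos] for pos in self._positions(num))

bf = CanonicalBloom(100, 3)
bf.add(253)
print(bf.get(253))   # True: inserted elements are always found
```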

First, let's build a simple hash table object that doesn't track collisions. Note that to get 'num' from a string, you'd just run the string through a hash function.
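For example, a string can be turned into a suitable 'num' with any hash function; md5 (via a hypothetical `str_to_num` helper) is used here purely for illustration:

```python
import hashlib

def str_to_num(s):
    # interpret the md5 digest of the string as a (large) integer
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

# modding by the table size turns the number into a bucket index
print(str_to_num('bloom') % 11)
```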

In [3]:
import math

class Hash(object):
    def __init__(self, size):
        self.size = size
        self.bits = [0]*size

    def add(self, num):
        num = num % self.size
        self.bits[num] = 1

    def get(self, num):
        num = num % self.size
        return self.bits[num]


Next, build a full Bloom filter that creates multiple hash tables; here, 'add' inserts the element into each hash table, and 'get' checks to see if the element is in all the hash tables.

I've also included three utility methods: fp(), empirical_fp(), and show(). The first calculates the predicted false positive rate based on hash table occupancy; the second calculates the actual false positive rate for a range of numbers; and the third shows the hash table occupancy using IPython Blocks.
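The arithmetic behind fp() is just the product of the per-table occupancies, treating the tables as independent. For example, with 2 bits set in each of the size-5, -7, and -11 tables (illustrative numbers only):

```python
# predicted false positive rate = product of per-table occupancies
occupancies = [2 / 5.0, 2 / 7.0, 2 / 11.0]

predicted = 1.0
for f in occupancies:
    predicted *= f

print(predicted)   # ~0.0208, i.e. about a 2% chance of a false hit
```

fp() below computes the same product by summing base-2 logs and exponentiating, which is equivalent.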

In [4]:
class BloomFilter(object):
    def __init__(self, *sizes):
        self.hashes = [ Hash(size) for size in sizes ]

    def add(self, num):
        for h in self.hashes:
            h.add(num)

    def get(self, num):
        for h in self.hashes:
            if not h.get(num):
                return 0
        return 1

    def fp(self):
        total = 0.
        for h in self.hashes:
            occupancy = sum(h.bits)
            f = occupancy / float(h.size)
            total += math.log(f, 2)

        return 2**total

    def empirical_fp(self, actual, max):
        found_true = 0
        found_false = 0
        for i in range(max):
            if self.get(i):
                if i in actual:
                    found_true += 1
                else:
                    found_false += 1

        return found_false / float(max)

    def show(self):
        rows = len(self.hashes)
        cols = max([ h.size for h in self.hashes ])
        grid = BlockGrid(cols, rows, fill=(0, 0, 0))
        for i, h in enumerate(self.hashes):
            for pos in range(h.size, cols):
                grid[i, pos] = (255, 255, 255)
            for j, bit in enumerate(h.bits):
                if bit:
                    grid[i, j] = (255, 0, 0)
        return grid.show()


Let's create a Bloom filter with three hash tables of sizes 5, 7, and 11, and then show the occupied cells in the three hash tables after adding '253' and '8132' (the numbers have no special significance).

In [5]:
x = BloomFilter(5, 7, 11)
x.add(253)
x.show()
print x.get(253)

x.add(8132)
x.show()
print x.get(8132)

1

1



Next, let's check out what happens when you start filling up the hash tables with lots and lots of entries.

In [6]:
import random, time

x = BloomFilter(5, 7, 11)
actual = set()
for _ in range(10):
    num = random.randint(0, 255)
    x.add(num)
    actual.add(num)
    x.show()
    theory, empirical = x.fp(), x.empirical_fp(actual, 255)
    print 'inserting', num
    print 'predicted FP:', theory, 'diff from actual:', abs(theory-empirical)
    time.sleep(1)

inserting 2
predicted FP: 0.0025974025974 diff from actual: 0.0025974025974


inserting 49
predicted FP: 0.0207792207792 diff from actual: 0.00509294626942


inserting 141
predicted FP: 0.0701298701299 diff from actual: 0.0034632034632


inserting 151
predicted FP: 0.124675324675 diff from actual: 0.0187929717341


inserting 26
predicted FP: 0.194805194805 diff from actual: 0.0183346065699


inserting 54
predicted FP: 0.233766233766 diff from actual: 0.0220015278839


inserting 0
predicted FP: 0.363636363636 diff from actual: 0.0185383244207


inserting 96
predicted FP: 0.363636363636 diff from actual: 0.0224598930481


inserting 248
predicted FP: 0.623376623377 diff from actual: 0.0390628978864


inserting 54
predicted FP: 0.623376623377 diff from actual: 0.0390628978864



## One-sided error

Bloom filters have what's called one-sided error: they may report elements that were never inserted as being members of the set -- false positives -- but they will never miss reporting elements that WERE inserted as being members -- false negatives. This is a straightforward consequence of the "no collision tracking" aspect of the hash tables: collisions lead to false reporting, but you never miss something you've already inserted.
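A minimal single-table sketch of this asymmetry (the numbers are illustrative only):

```python
# bits can only be turned on, never off -- so an inserted element's
# bit is always set, and false negatives are impossible.
size = 11
bits = [0] * size

inserted = [1, 4, 9, 16]
for num in inserted:
    bits[num % size] = 1

# every inserted element is reported present
print(all(bits[num % size] for num in inserted))   # True
# but 12 was never inserted -- it collides with 1 (12 % 11 == 1)
print(bits[12 % size])                             # 1: a false positive
```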

If you know the hash table sizes, it's easy to predict collisions. One simple way is to test a number that's 5\*7\*11 = 385 more than something you've already inserted, i.e. to force a collision based on the moduli of the hash table sizes.

In [7]:
# here, we add '3' and test for '3 + 5*7*11', or 388.
x = BloomFilter(5, 7, 11)
x.add(3)
print x.get(3), x.get(388)

1 1