Below, I will use IPython Blocks (https://github.com/jiffyclub/ipythonblocks) to demonstrate Bloom filters.
IPython Blocks is a nifty little visualization tool built by Matt Davis @jiffyclub for use in teaching Python and programming basics. Here's a quick little demo --
```python
from ipythonblocks import BlockGrid

grid = BlockGrid(10, 10, fill=(0, 0, 128))
x, y = grid.shape
for block in grid:
    r = block.row * 255 / float(x)
    g = block.col * 255 / float(y)
    block.red = r
    block.green = g
grid  # displays the grid in the notebook
```
I'll implement Bloom filters using multiple individual hash tables. (The canonical way to implement them is to use multiple hash functions with one big hash table, but I find that a bit harder to understand.)
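For comparison, here's a minimal sketch of that canonical variant: one bit array of m bits queried by k hash functions. The class name, the sizes, and the trick of salting a single cryptographic hash k ways are my own illustration (in Python 3 syntax), not something the code below uses:

```python
import hashlib

class CanonicalBloom(object):
    """Sketch of the canonical layout: one bit array, k hash functions."""
    def __init__(self, m, k):
        self.m = m              # number of bits in the single array
        self.k = k              # number of hash functions
        self.bits = [0] * m

    def _positions(self, item):
        # Simulate k independent hash functions by salting one hash k ways.
        for salt in range(self.k):
            digest = hashlib.sha256(('%d:%s' % (salt, item)).encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def get(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

bf = CanonicalBloom(m=64, k=3)
bf.add('hello')
print(bf.get('hello'))  # True: inserted items are always reported present
```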
First, let's build a simple hash table object that doesn't track collisions. Note that to get 'num' from a string, you'd just run the string through a hash function.
```python
import math

class Hash(object):
    def __init__(self, size):
        self.size = size
        self.bits = [0] * size

    def add(self, num):
        num = num % self.size
        self.bits[num] = 1

    def get(self, num):
        num = num % self.size
        return self.bits[num]
```
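As an aside, here's one way (my choice of hashlib here, not something the text prescribes) to derive 'num' from a string: hash it and read the digest as an integer. hashlib gives stable results across runs, unlike Python's built-in hash() on strings, which is randomized per process.

```python
import hashlib

def string_to_num(s):
    # Hash the string and interpret the hex digest as a big integer;
    # Hash.add() will then reduce it modulo the table size.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

num = string_to_num('bloom')
print(num % 11)  # the slot this string would occupy in a size-11 table
```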
Next, build a full Bloom filter that creates multiple hash tables; here, 'add' inserts the element into each hash table, and 'get' checks to see if the element is in all the hash tables.
I've also included three utility methods: fp(), empirical_fp(), and show(). The first calculates the predicted false positive rate based on hash table occupancy; the second calculates the actual false positive rate for a range of numbers; and the third shows the hash table occupancy using IPython Blocks.
```python
class BloomFilter(object):
    def __init__(self, *sizes):
        self.hashes = [Hash(size) for size in sizes]

    def add(self, num):
        for h in self.hashes:
            h.add(num)

    def get(self, num):
        for h in self.hashes:
            if not h.get(num):
                return 0
        return 1

    def fp(self):
        total = 0.
        for h in self.hashes:
            occupancy = sum(h.bits)
            f = occupancy / float(h.size)
            total += math.log(f, 2)
        return 2**total

    def empirical_fp(self, actual, max):
        found_true = 0
        found_false = 0
        for i in range(max):
            if self.get(i):
                if i in actual:
                    found_true += 1
                else:
                    found_false += 1
        return found_false / float(max)

    def show(self):
        rows = len(self.hashes)
        cols = max([h.size for h in self.hashes])
        grid = BlockGrid(cols, rows, fill=(0, 0, 0))
        for i, h in enumerate(self.hashes):
            for pos in range(h.size, cols):
                grid[i, pos] = (255, 255, 255)
            for j, bit in enumerate(h.bits):
                if bit:
                    grid[i, j] = (255, 0, 0)
        return grid.show()
```
Let's create a Bloom filter with three hash tables of sizes 5, 7, and 11, and then show the occupied cells in the three hash tables after adding '253' and '8132' (there's no special significance to those numbers).
```python
x = BloomFilter(5, 7, 11)
x.show()

x.add(253)
x.show()
print x.get(253)

x.add(8132)
x.show()
print x.get(8132)
```
Next, let's check out what happens when you start filling up the hash tables with lots and lots of entries.
```python
import random, time

x = BloomFilter(5, 7, 11)
actual = set()

for _ in range(10):
    num = random.randint(0, 255)
    actual.add(num)
    x.add(num)

    x.show()
    theory, empirical = x.fp(), x.empirical_fp(actual, 255)
    print 'inserting', num
    print 'predicted FP:', theory, 'diff from actual:', abs(theory - empirical)
    time.sleep(1)
```
```
inserting 2
predicted FP: 0.0025974025974 diff from actual: 0.0025974025974
inserting 49
predicted FP: 0.0207792207792 diff from actual: 0.00509294626942
inserting 141
predicted FP: 0.0701298701299 diff from actual: 0.0034632034632
inserting 151
predicted FP: 0.124675324675 diff from actual: 0.0187929717341
inserting 26
predicted FP: 0.194805194805 diff from actual: 0.0183346065699
inserting 54
predicted FP: 0.233766233766 diff from actual: 0.0220015278839
inserting 0
predicted FP: 0.363636363636 diff from actual: 0.0185383244207
inserting 96
predicted FP: 0.363636363636 diff from actual: 0.0224598930481
inserting 248
predicted FP: 0.623376623377 diff from actual: 0.0390628978864
inserting 54
predicted FP: 0.623376623377 diff from actual: 0.0390628978864
```
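As a quick sanity check on fp(): after the first insertion, each table has exactly one bit set, so the predicted false positive rate is just the product of the occupancy fractions:

```python
# one occupied bit in each of the size-5, size-7, and size-11 tables
rate = (1 / 5.0) * (1 / 7.0) * (1 / 11.0)
print(rate)  # ~0.0026, i.e. 1/385 -- matching the first 'predicted FP' above
```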
Bloom filters have what's called one-sided error: they may report elements that were never inserted as being members of the set -- false positives -- but they will never miss reporting elements that WERE inserted as being members -- false negatives. This is a straightforward consequence of the "no collision tracking" aspect of the hash tables: collisions lead to false reporting, but you never miss something you've already inserted.
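To see the no-false-negatives half concretely, here's a self-contained check (it re-declares stripped-down versions of the classes above, minus the display code, so it runs on its own; Python 3 syntax):

```python
import random

# Stripped-down re-declarations of the classes above, without the
# IPython Blocks display code.
class Hash(object):
    def __init__(self, size):
        self.size = size
        self.bits = [0] * size
    def add(self, num):
        self.bits[num % self.size] = 1
    def get(self, num):
        return self.bits[num % self.size]

class BloomFilter(object):
    def __init__(self, *sizes):
        self.hashes = [Hash(size) for size in sizes]
    def add(self, num):
        for h in self.hashes:
            h.add(num)
    def get(self, num):
        return int(all(h.get(num) for h in self.hashes))

random.seed(42)
bf = BloomFilter(5, 7, 11)
inserted = [random.randint(0, 10000) for _ in range(50)]
for num in inserted:
    bf.add(num)

# every inserted element is still reported present: no false negatives
print(all(bf.get(num) for num in inserted))  # True
```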
If you know the hash table sizes, it's easy to predict the collisions. One simple way is to test a number that is 5*7*11 = 385 more than something you've already inserted, i.e. to force a collision based on the modulus of each hash table size.
```python
# here, we add '3' and test for '3 + 5*7*11', or 388.
x = BloomFilter(5, 7, 11)
x.add(3)
x.show()

print x.get(3), x.get(3 + (5*7*11))
print "oh noes! 388 is falsely said to be present in the data set!"
```
```
1 1
oh noes! 388 is falsely said to be present in the data set!
```
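The offset 385 isn't special to '3': since 5, 7, and 11 are pairwise coprime, two numbers land in the same slot of all three tables exactly when they differ by a multiple of 5*7*11 = 385 (the Chinese Remainder Theorem). A quick arithmetic check:

```python
sizes = (5, 7, 11)
a = 3
b = a + 5 * 7 * 11   # 388, the colliding value from above
# b occupies the same slot as a in every table:
print([a % s == b % s for s in sizes])  # [True, True, True]
```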