We have a text of length n and a pattern of length m, and we want to find all occurrences of the pattern in the text.
Naively, we would compare each of the n-m+1 length-m substrings of the text to the pattern, which has a complexity of $O((n-m+1)m)=O(nm)$.
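As a baseline, the naive approach can be sketched as follows (a minimal sketch; the name find_matches_naive is chosen here for illustration):

```python
def find_matches_naive(pattern, text):
    """ compare the pattern against every substring of the text;
        each comparison costs O(m), for O(nm) overall """
    m = len(pattern)
    return [i for i in range(len(text) - m + 1) if text[i:i + m] == pattern]

print(find_matches_naive("da", "abracadabra"))  # [6]
```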
We want to improve this using a hashtable (as was done for finding repeating substrings in recitation 8). We can then calculate the hash of the pattern and of all substrings in the text and compare the hashes. Calculating each hash, though, takes $O(m)$, and therefore we still have a complexity of $O(mn)$.
The trick of the Rabin-Karp algorithm is using a rolling hash:
if we have already calculated the hash of the length-m substring starting at position i, calculating the hash of the length-m substring starting at position i+1 takes only $O(1)$.
The complexity now is much better: we need to calculate the hash of the pattern and of the substring starting at position 0, each with complexity $O(m)$. In addition, we need to calculate the hashes of the remaining n-m substrings, each taking $O(1)$ thanks to the rolling hash. Altogether this is $O(n+m)$, much smaller than $O(nm)$.
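To see that the $O(1)$ update really reproduces the from-scratch hash, here is a minimal sketch (the function and constants mirror the arithmetize defined below; the sample string and window length are chosen here for illustration):

```python
def arithmetize(text, basis=2**16, r=2**32-3):
    """ Horner evaluation of the string in the given basis, modulo r """
    h = 0
    for ch in text:
        h = (h * basis + ord(ch)) % r
    return h

basis, r, m = 2**16, 2**32 - 3, 4
text = "abracadabra"
h0 = arithmetize(text[0:m], basis, r)
# O(1) update: drop the leading character, shift by one position, append the trailing one
h1 = ((h0 - ord(text[0]) * basis**(m - 1)) * basis + ord(text[m])) % r
assert h1 == arithmetize(text[1:m + 1], basis, r)
```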
def arithmetize(text, basis=2**16, r=2**32-3):
    """ convert substring to number using basis powers
        employs Horner method modulo r """
    partial_sum = 0
    for ch in text:
        partial_sum = (partial_sum * basis + ord(ch)) % r
    return partial_sum
ord("b"), ord("e"), ord("n")
(98, 101, 110)
arithmetize("b"), arithmetize("be"), arithmetize("ben")
(98, 6422629, 6619540)
def arithmetize_text_naive(text, m, basis=2**16, r=2**32-3):
    """ computes arithmetization of all m long substrings
        of text, using basis powers """
    return [arithmetize(text[i:i + m], basis, r) for i in range(len(text) - m + 1)]
def arithmetize_text(text, m, basis=2**16, r=2**32-3):
    """ efficiently computes arithmetization of all m long
        substrings of text, using basis powers """
    b_power = basis ** (m - 1)
    lst = [None] * (len(text) - m + 1)
    # lst[0] equals first word arithmetization
    lst[0] = arithmetize(text[0:m], basis, r)
    for i in range(1, len(lst)):
        # O(1) update: remove the leading character, shift, add the trailing one
        rolling_hash = (lst[i - 1] - ord(text[i - 1]) * b_power) * basis + ord(text[i + m - 1])
        rolling_hash %= r
        lst[i] = rolling_hash
    return lst
# sample text: Hebrew song lyrics, kept verbatim since the printed hashes depend on them
song = '''אני כל כך עצוב לי ושמש על העיר
ודיזנגוף נראה לי כמו רכבת לילה לקהיר
בין כל הצלילים מחפש סימן
יושב בצד הדרך
יושב ליד הזמן
בחדר הסגור בדידות מקיר לקיר
ואם אצא החוצה המצב רק יחמיר
אז אני שומע צליל מאוד מוכר
מגיע מלמטה רחוק מהכיכר'''
print(arithmetize_text(song, 1) == arithmetize_text_naive(song, 1))
True
print(arithmetize_text(song, 2)[:10])
[97519072, 98567641, 98107424, 2098651, 98239964, 98304032, 2098651, 98239962, 98172960, 2098658]
print(arithmetize_text(song, 2, 2**16 - 1)[:10])
[97517584, 98566137, 98105927, 2098619, 98238465, 98302532, 2098619, 98238463, 98171462, 2098626]
%timeit -n 5 arithmetize_text_naive(song, 5)
5 loops, best of 3: 1.19 ms per loop
%timeit -n 5 arithmetize_text(song, 5)
5 loops, best of 3: 417 us per loop
import urllib.request
with urllib.request.urlopen("http://www.gutenberg.org/cache/epub/2701/pg2701.txt") as f:
    book = f.read().decode('utf8')
print(book[:book.index('\n\r\n\r')])
The Project Gutenberg EBook of Moby Dick; or The Whale, by Herman Melville This eBook is for the use of anyone anywhere at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with this eBook or online at www.gutenberg.org
len(book)
1257258
%timeit -n 3 arithmetize_text_naive(book[:len(book)//10], 3)
3 loops, best of 3: 310 ms per loop
%timeit -n 3 arithmetize_text_naive(book, 3)
3 loops, best of 3: 3.14 s per loop
%timeit -n 3 arithmetize_text(book[:len(book)//10], 3)
3 loops, best of 3: 222 ms per loop
%timeit -n 3 arithmetize_text(book, 3)
3 loops, best of 3: 2.23 s per loop
%timeit -n 3 arithmetize_text_naive(book[:len(book)//10], 3)
3 loops, best of 3: 310 ms per loop
%timeit -n 3 arithmetize_text_naive(book[:len(book)//10], 5)
3 loops, best of 3: 480 ms per loop
%timeit -n 3 arithmetize_text_naive(book[:len(book)//10], 10)
3 loops, best of 3: 903 ms per loop
%timeit -n 3 arithmetize_text(book[:len(book)//10], 3)
3 loops, best of 3: 220 ms per loop
%timeit -n 3 arithmetize_text(book[:len(book)//10], 5)
3 loops, best of 3: 229 ms per loop
%timeit -n 3 arithmetize_text(book[:len(book)//10], 10)
3 loops, best of 3: 246 ms per loop
def find_matches(pattern, text, basis=2**16, r=2**32-3):
    """ find all occurrences of pattern in text
        using efficient arithmetization of text """
    assert len(pattern) <= len(text)
    pattern_hash = arithmetize(pattern, basis, r)
    text_hashes = arithmetize_text(text, len(pattern), basis, r)
    return [i for i, hs in enumerate(text_hashes) if hs == pattern_hash]
matches = find_matches("moby", book.lower())
print(matches[0], len(matches), book[matches[0]:matches[0] + 4])
32 124 Moby
matches = find_matches("aye, aye, sir", book.lower())
print(matches[0], len(matches), book[matches[0]:matches[0] + 13])
383330 5 Aye, aye, sir
Now we must remember that a substring can have the same hash as the pattern even if it is not equal to it.
Here is a safe version in which we check that the substring does indeed equal the pattern:
def find_matches_safe(pattern, text, basis=2**16, r=2**32-3):
    """ find all occurrences of pattern in text
        using efficient arithmetization of text,
        verifying each hash match against the actual substring """
    assert len(pattern) <= len(text)
    pattern_hash = arithmetize(pattern, basis, r)
    text_hashes = arithmetize_text(text, len(pattern), basis, r)
    matches = [i for i, hs in enumerate(text_hashes)
               if hs == pattern_hash and text[i:i + len(pattern)] == pattern]
    return matches
Note that the "and" operator short-circuits, so the (potentially expensive) string comparison runs only when the hashes already match:
def foo(v, t):
    print(t)
    return v
print(foo(True, "foo") and foo(True, "bar"))
foo
bar
True
print(foo(False, "foo") and foo(True, "bar"))
foo
False
The worst case is when everything matches:
text = "a" * 10**5
for pattern in ["a"*10**2, "a"*10**3, "a"*10**4]:
    for f in [find_matches, find_matches_safe]:
        %timeit -n 1 f(pattern, text)
1 loops, best of 3: 479 ms per loop
1 loops, best of 3: 567 ms per loop
1 loops, best of 3: 3.09 s per loop
1 loops, best of 3: 3.33 s per loop
1 loops, best of 3: 25.9 s per loop
1 loops, best of 3: 28.9 s per loop
But for normal texts there is hardly any difference:
for pattern in ["moby", "aye, aye, sir", "moby dick was his name"]:
    for f in [find_matches, find_matches_safe]:
        %timeit -n 1 f(pattern, book.lower())
1 loops, best of 3: 2.7 s per loop
1 loops, best of 3: 2.69 s per loop
1 loops, best of 3: 2.9 s per loop
1 loops, best of 3: 2.88 s per loop
1 loops, best of 3: 3.08 s per loop
1 loops, best of 3: 3.1 s per loop
find_matches("da", "abracadabra", basis=2**16, r=2**16)
[2, 4, 6, 9]
find_matches_safe("da", "abracadabra", basis=2**16, r=2**16)
[6]
Because $b=r$, taking the result $\mod r$ keeps only the rightmost character: when searching for "da" we actually search for "?a".
Therefore r should not be a power of the basis used for the arithmetization of the string. More generally, we prefer an r that is not a power of small primes; large primes are a good choice.
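The collapse can also be seen directly on the hash values (a minimal sketch, reusing the arithmetize definition from above; the strings are chosen here for illustration):

```python
def arithmetize(text, basis=2**16, r=2**32-3):
    """ Horner evaluation of the string in the given basis, modulo r """
    h = 0
    for ch in text:
        h = (h * basis + ord(ch)) % r
    return h

# with r == basis, every multiplication by basis vanishes modulo r,
# so only the last character survives
assert arithmetize("da", r=2**16) == arithmetize("xa", r=2**16) == ord("a")
assert arithmetize("da") != arithmetize("xa")  # the default r does distinguish them
```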
We will try a different hash function in which the hash is the sum of the characters of the string.
def arithmetize_sum(text, r=2**32-3):
    """ sum of character values modulo r """
    partial_sum = 0
    for ch in text:
        partial_sum = (partial_sum + ord(ch)) % r
    return partial_sum
def arithmetize_text_sum(text, m, r=2**32-3):
    """ computes the sum-hash of all m long substrings of text """
    lst = []
    lst.append(arithmetize_sum(text[:m], r))
    for i in range(1, len(text) - m + 1):
        # O(1) update: subtract the leaving character, add the entering one
        rolling_hash = lst[i - 1] - ord(text[i - 1]) + ord(text[i + m - 1])
        rolling_hash %= r
        lst.append(rolling_hash)
    return lst
%timeit -n 3 arithmetize_text(song, 3)
3 loops, best of 3: 411 us per loop
%timeit -n 3 arithmetize_text_sum(song, 3)
3 loops, best of 3: 309 us per loop
%timeit -n 3 arithmetize_text(song, 30)
3 loops, best of 3: 540 us per loop
%timeit -n 3 arithmetize_text_sum(song, 30)
3 loops, best of 3: 290 us per loop
%timeit -n 3 arithmetize_text(book, 3)
3 loops, best of 3: 2.36 s per loop
%timeit -n 3 arithmetize_text_sum(book, 3)
3 loops, best of 3: 1.76 s per loop
%timeit -n 3 arithmetize_text(book, 10)
3 loops, best of 3: 2.66 s per loop
%timeit -n 3 arithmetize_text_sum(book, 10)
3 loops, best of 3: 1.73 s per loop
This rolling hash function is efficient: it seems at least as good as the previous one, if not faster, probably because less arithmetic is involved.
arithmetize_sum("I am Lord Voldemort".lower())
1828
arithmetize_sum("Tom Marvolo Riddle ".lower())
1828
The problem is that anagrams collide - permutations of the same string get the same hash value.
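A hash-only search built on the sum hash therefore reports spurious matches at anagram positions (a sketch; find_matches_sum is a hypothetical name introduced here for illustration):

```python
def arithmetize_sum(text, r=2**32-3):
    """ sum of character values modulo r """
    return sum(ord(ch) for ch in text) % r

def find_matches_sum(pattern, text, r=2**32-3):
    """ hash-only matching with the sum hash (no verification step) """
    m = len(pattern)
    ph = arithmetize_sum(pattern, r)
    return [i for i in range(len(text) - m + 1)
            if arithmetize_sum(text[i:i + m], r) == ph]

# "ba" at position 2 is an anagram of "ab", so it collides with the pattern
print(find_matches_sum("ab", "abba"))  # [0, 2]
```

Verifying each candidate against the actual substring, as in find_matches_safe, removes such false positives.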
This notebook is part of the Extended introduction to computer science course at Tel-Aviv University.
The notebook was written using Python 3.2 and IPython 0.13.1.
The code is available at https://raw.github.com/yoavram/CS1001.py/master/recitation10.ipynb.
The notebook can be viewed online at http://nbviewer.ipython.org/urls/raw.github.com/yoavram/CS1001.py/master/recitation10.ipynb.
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.