The LZ77 compression, published by Lempel and Ziv in 1977, is widely used today is formats such as GIF, PNG, DEFLATE and COMPRESS.
The basic idea is that repetitions are common in meaningful information and that space can be saved by replacing occurrences of information that was already seen by a reference to the first occurrence.
The references are coded as (offset, length)
pairs (tuples).
For example, the text 'abracadabra' can be compressed as 'abracad(7,4)'. The 7
refers to the substring 7 position behind, and the 4
defines the repetition length.
Counter-intuitively, the repetition length can be larger than the offset, creating an illusion of a loop or self-reference:
However, as the algorithm is implemented using a state machine (DFA) there is no real problem here.
We begin by writing the repetition finding funciton.
The function recieves a text and the current position in the text, and looks at the text to find a previous occurence of the substring that starts at the current position.
The maximum window length and the maximum repetition length, given with default values, are used to fine tune the memory-running time tradeoff of the algorithm.
def find_max_repetition(text, index, window_length=2**12-1, max_rep_length=2**5-1):
""" finds a maximum repetition in the text.
Returns offset and length of the longest repetition.
Returned offset is smallest one (closest to index) among all max matches"""
assert isinstance(text, str)
repitition_length = 0 # length of max repitition
repitition_offset = 0 # distance backwards to begining of max repitition
for offset in range(1, min(index + 1, window_length)):
current_length = 0
while current_length < min(max_rep_length, len(text) - index) and text[index - offset + current_length] == text[index + current_length]:
current_length += + 1
if repitition_length < current_length:
repitition_length = current_length
repitition_offset = offset
return repitition_offset, repitition_length
def test_find_max_repetition(text, index):
offset, length = find_max_repetition(text, index)
print(text[index:index + length])
print(text[index-offset:index - offset + length])
return text[index:index + length] == text[index - offset:index - offset + length]
find_max_repetition("abracadabra", 7)
(7, 4)
find_max_repetition("xyzxyzwxyzw", 3)
(3, 3)
find_max_repetition("xyzxyzwxyzw", 7)
(4, 4)
find_max_repetition('aaaaa', 1)
(1, 4)
The next step is to write the compression algorithm. The flow of the compression is described by this graph:
We will start with the compression function. It will recieve a text and will iterate over the characters of the text, looking for a repetition at each character.
def lz77_compress(text, window_length=2**12-1, max_rep_length=2**5-1):
"""LZ77 compression of ASCII text.
Produces a list comprising of either ascii character
or by a pair [m,k] where m is an offset and
k is a match (both are non negative integers)"""
result = []
index = 0
while index < len(text):
if ord(text[index]) >= 128:
# non-ascii
index += 1
continue
offset, length = find_max_repetition(text, index, window_length, max_rep_length)
if length < 3:
# no repetition or a single character repetition
result.append(text[index])
index += 1
else:
result.append((offset, length)) # two or more chars in repetition
index += length
return result # produces a list composed of chars and pairs
print(lz77_compress("You know what I hate, you know what I hate, you know what I hate? Repetitions!"))
['Y', 'o', 'u', ' ', 'k', 'n', 'o', 'w', ' ', 'w', 'h', 'a', 't', ' ', 'I', ' ', (6, 3), 'e', ',', ' ', 'y', (22, 31), (22, 10), '?', ' ', 'R', 'e', 'p', 'e', 't', 'i', 't', 'i', 'o', 'n', 's', '!']
welcome = '''Welcome my son, welcome to the machine.
Where have you been? It's alright we know where you've been.
You've been in the pipeline, filling in time,
Provided with toys and Scouting for Boys.
You bought a guitar to punish your ma,
And you didn't like school, and you know you're nobody's fool,
So welcome to the machine.
Welcome my son, welcome to the machine.
What did you dream? It's alright we told you what to dream.
You dreamed of a big star, he played a mean guitar,
He always ate in the Steak Bar. He loved to drive in his Jaguar.
So welcome to the machine.'''
welcome_intermediate = lz77_compress(welcome)
print(welcome_intermediate)
['W', 'e', 'l', 'c', 'o', 'm', 'e', ' ', 'm', 'y', ' ', 's', 'o', 'n', ',', ' ', 'w', (16, 7), 't', 'o', ' ', 't', 'h', (23, 3), 'a', 'c', 'h', 'i', 'n', 'e', '.', '\n', 'W', 'h', 'e', 'r', 'e', ' ', 'h', 'a', 'v', 'e', ' ', 'y', 'o', 'u', ' ', 'b', 'e', 'e', 'n', '?', ' ', 'I', 't', "'", 's', ' ', 'a', 'l', 'r', 'i', 'g', 'h', 't', (58, 3), ' ', 'k', 'n', 'o', 'w', ' ', 'w', (42, 5), (37, 3), "'", (44, 3), (40, 4), '.', '\n', 'Y', (13, 10), ' ', 'i', 'n', (89, 5), 'p', 'i', 'p', 'e', 'l', (90, 3), ',', ' ', 'f', 'i', 'l', (9, 3), 'g', (25, 5), 'i', 'm', (17, 3), '\n', 'P', 'r', 'o', 'v', 'i', 'd', 'e', 'd', ' ', 'w', 'i', 't', 'h', (138, 3), 'y', (101, 3), 'n', 'd', ' ', 'S', 'c', 'o', 'u', 't', (42, 4), 'f', 'o', 'r', ' ', 'B', (22, 3), (89, 5), ' ', 'b', 'o', 'u', (127, 4), 'a', ' ', 'g', 'u', 'i', 't', 'a', 'r', (186, 4), 'p', 'u', 'n', 'i', 's', 'h', (132, 4), 'r', (194, 3), ',', '\n', 'A', (62, 3), (182, 4), 'd', 'i', 'd', 'n', "'", 't', ' ', 'l', 'i', 'k', 'e', ' ', 's', 'c', 'h', 'o', 'o', 'l', ',', (90, 5), (28, 4), (188, 5), (182, 4), (189, 3), 'n', 'o', 'b', 'o', 'd', 'y', (220, 3), 'f', (35, 4), '\n', 'S', 'o', (279, 26), (319, 31), (319, 10), 'a', 't', (127, 4), (135, 6), 'r', 'e', 'a', 'm', (318, 18), 't', 'o', 'l', (32, 6), 'w', (45, 4), (66, 3), (40, 5), (229, 6), (11, 5), (274, 3), 'o', 'f', (233, 3), 'b', 'i', 'g', ' ', 's', (235, 3), ',', ' ', (329, 4), 'l', 'a', 'y', (25, 3), 'a', ' ', 'm', 'e', 'a', 'n', (260, 7), ',', '\n', 'H', 'e', (90, 3), 'w', 'a', (314, 4), 't', 'e', (372, 8), 'S', 't', 'e', 'a', 'k', ' ', 'B', 'a', 'r', '.', ' ', (32, 3), 'l', 'o', 'v', (56, 3), (103, 5), 'i', (413, 3), (36, 3), 'h', 'i', 's', ' ', 'J', 'a', 'g', 'u', (33, 3), (244, 27)]
Next we need to code this list of characters and tuples to a bit string to allow for saving it to a file or transmiting it on the web.
First we need an ASCII coding function:
def char2bits(ch):
''' convert a character to int and then format it to binary
using exactly 7 bits '''
return '{:07b}'.format(ord(ch))
print(char2bits('1'))
print(char2bits('z'))
0110001 1111010
Note the difference between this and:
print(bin(ord('1'))[2:])
print(char2bits('1'))
110001 0110001
Next we write the function to convert the intermediate format to the binary format.
Each word is either a literal (a character) of an offset-length pair (a tuple).
char2bits
above).0
) or an offset-length pair (1
).import math
def intermediate2bits(compressed, window_length=2**12-1, max_rep_length=2**5-1):
""" converts intermediate format compressed list
to a string of bits"""
offset_width = math.ceil(math.log(window_length, 2))
length_width = math.ceil(math.log(max_rep_length,2))
result = []
for item in compressed:
if isinstance(item, str):
result.append("0")
result.append('{:07b}'.format(ord(item)))
elif isinstance(item, (tuple,list)):
result.append("1")
offset,length = item
result.append('{num:0{width}b}'.format
(num=offset, width=offset_width))
result.append('{num:0{width}b}'.
format(num=length, width=length_width))
return "".join(result)
welcome_bits = intermediate2bits(welcome_intermediate)
print(welcome_bits)
010101110110010101101100011000110110111101101101011001010010000001101101011110010010000001110011011011110110111000101100001000000111011110000000100000011101110100011011110010000001110100011010001000000010111000110110000101100011011010000110100101101110011001010010111000001010010101110110100001100101011100100110010100100000011010000110000101110110011001010010000001111001011011110111010100100000011000100110010101100101011011100011111100100000010010010111010000100111011100110010000001100001011011000111001001101001011001110110100001110100100000011101000011001000000110101101101110011011110111011100100000011101111000000101010001011000000100101000110010011110000001011000001110000001010000010000101110000010100101100110000000011010101000100000011010010110111010000010110010010101110000011010010111000001100101011011001000001011010000110010110000100000011001100110100101101100100000000100100011011001111000000011001001010110100101101101100000001000100011000010100101000001110010011011110111011001101001011001000110010101100100001000000111011101101001011101000110100010000100010100001101111001100000110010100011011011100110010000100000010100110110001101101111011101010111010010000001010100010001100110011011110111001000100000010000101000000010110000111000001011001001010010000001100010011011110111010110000011111110010001100001001000000110011101110101011010010111010001100001011100101000010111010001000111000001110101011011100110100101110011011010001000010000100001000111001010000110000100001100101100000010100100000110000001111100001110000101101100010001100100011010010110010001101110001001110111010000100000011011000110100101101011011001010010000001110011011000110110100001101111011011110110110000101100100000101101000101100000001110000100100001011110000101100001011011000100100001011110100011011011100110111101100010011011110110010001111001100001101110000011011001101000000100011001000000101001010011011011111000100010111110101000100111111111111000100111111010100110000101110100100000111111100100100001000011100110011100100110010101100001011011011000100111110100100111010001101111011011001000000100000001100111011110000001011010010010000010000100001110000001010000010110000111001010011010000000010110010110001000100100001101101111011001101000011101001000110110001001101001011001110010000001110011100001110101100011001011000010000010001010010010010001101100011000010111100110000000110010001101100001001000000110110101100101011000010110111010001000001000011100101100000010100100100001100101100000101101000011011101110110000110001001110100010001110100011001011000101110100010000101001101110100011001010110000101101011001000000100001001100001011100100010111000100000100000010000000011011011000110111101110110100000011100000011100000110011100101011010011000110011101000111000000100100000110110100001101001011100110010000001001010011000010110011101110101100000010000100011100001111010011011
We can now write this bitstring to binary file:
def bitstring_to_file(bitstring, filename):
bytestring = ( bitstring[i:i + 8] for i in range(0, len(bitstring), 8) )
intstring = (int(x, 2) for x in bytestring)
chrstring = (chr(x) for x in intstring)
output = ''.join(chrstring)
with open(filename,'wb') as f:
f.write(bytes(output, 'utf-8'))
bitstring_to_file(welcome_bits, 'welcome.lz')
!cat welcome.lz
Welcome my son, wֲ€ֲ�ֳ�ֳˆ .66ֲ†ֲ–ֳ¦Rֳ ֲ¥vֲ†W&Rֲ†fRֲ–ֳ·R&VVֳ£ֳ²ֲ—Bw2ֳ‡&ֲ–vֲ‡H ֲ�ֲֲ¹ֲ½ֳ�ֲ�ֳ�Eֲ�(ֳ‰ֳ X8ֲ¸)fֲ× inֲ‚ֳ‰\\[ ֲ´2ֳ‚fֲ–ֳˆֲ�ֲ�%imֲ€ֲˆֳ‚ֲ” ֲ›ֳ�ֲ�YY ֳ�]!7ֲ˜2ֲ�ֲ¹ֲ�ֲ�Mֲ�ֲ½ֳ•ֳ’Dfor Bֲ€ֲ°ֳ ֲ²R&ֳ·X?ֲ‘ֲ„ֲ�ֲ�ֳ•ֲ¥ֳ‘ֲ…ֳ�Dpunishֲ„! ֲ¡ֲ„2ֳ€ֲ₪ֳ„didn't like school,ֲ‚ֳ‘`8H^ֳ„ֲ…ֳ¨ֳ›ֲ›ֳ˜ֲ›ֳ™ aֲ¸6hֲ�)Mֲ¾"ֳ÷ֲ‰ֳ¿ֳ¢~ֲ¦H?ֲ’ֳ¦reamֲ‰ֳ´ֲ�ֳ› @gxֲ’Cֲ�Aaֳ�hֲ–"Cofֲ‡Hֳ˜ֲ�Yֳˆ ֳ¡ֳ–2ֳ‚ֲ₪ֲ‘ֲ±ֲ…ֳ¦#a meanֲˆ!ֳ‹ֲ’`ֲ´7vֲ�ֳ‘ֲ–.ֲˆSteak Bar. ֲ� ֳ›ֳ�ֲ p83ֲ•ֲ¦3ֲ£ֲ� ֳ�\ֳˆֲ˜Yֳ�`B8z
Next we want to do the reverse - decompress the bitstring to the text:
def bits2intermediate(bitstring, window_length=2**12-1, max_rep_length=2**5-1):
""" converts a compressed string of bits
to intermediate compressed format """
offset_width = math.ceil(math.log(window_length, 2))
length_width = math.ceil(math.log(max_rep_length, 2))
result=[]
i = 0
while i < len(bitstring):
if bitstring[i] == "0": # single ascii char
i += 1
ch = chr(int(bitstring[i:i + 7], 2))
result.append(ch)
i += 7
elif bitstring[i] == "1": # repeat of length >= 2
i += 1
offset = int(bitstring[i:i + offset_width], 2)
i += offset_width
length = int(bitstring[i:i + length_width], 2)
i += length_width
result.append((offset,length))
return result
bits2intermediate(welcome_bits)
['W', 'e', 'l', 'c', 'o', 'm', 'e', ' ', 'm', 'y', ' ', 's', 'o', 'n', ',', ' ', 'w', (16, 7), 't', 'o', ' ', 't', 'h', (23, 3), 'a', 'c', 'h', 'i', 'n', 'e', '.', '\n', 'W', 'h', 'e', 'r', 'e', ' ', 'h', 'a', 'v', 'e', ' ', 'y', 'o', 'u', ' ', 'b', 'e', 'e', 'n', '?', ' ', 'I', 't', "'", 's', ' ', 'a', 'l', 'r', 'i', 'g', 'h', 't', (58, 3), ' ', 'k', 'n', 'o', 'w', ' ', 'w', (42, 5), (37, 3), "'", (44, 3), (40, 4), '.', '\n', 'Y', (13, 10), ' ', 'i', 'n', (89, 5), 'p', 'i', 'p', 'e', 'l', (90, 3), ',', ' ', 'f', 'i', 'l', (9, 3), 'g', (25, 5), 'i', 'm', (17, 3), '\n', 'P', 'r', 'o', 'v', 'i', 'd', 'e', 'd', ' ', 'w', 'i', 't', 'h', (138, 3), 'y', (101, 3), 'n', 'd', ' ', 'S', 'c', 'o', 'u', 't', (42, 4), 'f', 'o', 'r', ' ', 'B', (22, 3), (89, 5), ' ', 'b', 'o', 'u', (127, 4), 'a', ' ', 'g', 'u', 'i', 't', 'a', 'r', (186, 4), 'p', 'u', 'n', 'i', 's', 'h', (132, 4), 'r', (194, 3), ',', '\n', 'A', (62, 3), (182, 4), 'd', 'i', 'd', 'n', "'", 't', ' ', 'l', 'i', 'k', 'e', ' ', 's', 'c', 'h', 'o', 'o', 'l', ',', (90, 5), (28, 4), (188, 5), (182, 4), (189, 3), 'n', 'o', 'b', 'o', 'd', 'y', (220, 3), 'f', (35, 4), '\n', 'S', 'o', (279, 26), (319, 31), (319, 10), 'a', 't', (127, 4), (135, 6), 'r', 'e', 'a', 'm', (318, 18), 't', 'o', 'l', (32, 6), 'w', (45, 4), (66, 3), (40, 5), (229, 6), (11, 5), (274, 3), 'o', 'f', (233, 3), 'b', 'i', 'g', ' ', 's', (235, 3), ',', ' ', (329, 4), 'l', 'a', 'y', (25, 3), 'a', ' ', 'm', 'e', 'a', 'n', (260, 7), ',', '\n', 'H', 'e', (90, 3), 'w', 'a', (314, 4), 't', 'e', (372, 8), 'S', 't', 'e', 'a', 'k', ' ', 'B', 'a', 'r', '.', ' ', (32, 3), 'l', 'o', 'v', (56, 3), (103, 5), 'i', (413, 3), (36, 3), 'h', 'i', 's', ' ', 'J', 'a', 'g', 'u', (33, 3), (244, 27)]
def lz77_decompress(compressed, window_length=2**12-1, max_rep_length=2**5-1):
"""LZ77 decompression from intermediate format to ASCII text"""
result = ''
i = 0
while i < len(compressed):
if isinstance(compressed[i], str): # char, as opposed to a pair
result += compressed[i]
i += 1
else:
offset,rep_length = compressed[i]
i += 1
for j in range(rep_length):
result += result[-offset] # fixed offset"to the left" as result itself grows
return result
print(lz77_decompress(welcome_intermediate))
Welcome my son, welcome to the machine. Where have you been? It's alright we know where you've been. You've been in the pipeline, filling in time, Provided with toys and Scouting for Boys. You bought a guitar to punish your ma, And you didn't like school, and you know you're nobody's fool, So welcome to the machine. Welcome my son, welcome to the machine. What did you dream? It's alright we told you what to dream. You dreamed of a big star, he played a mean guitar, He always ate in the Steak Bar. He loved to drive in his Jaguar. So welcome to the machine.
Full compression in one command and a function to calculate the compression ratio:
compress = lambda text: intermediate2bits(lz77_compress(text))
lz_ratio = lambda text: len(compress(text)) / (len(text)*7)
And to check the compression ratio:
lz_ratio(welcome)
0.733604473817997
from urllib.request import urlopen
with urlopen("http://www.gutenberg.org/cache/epub/28233/pg28233.txt") as f:
newton = f.read().decode('utf-8')
print(newton[:newton.index('\n\r')])
newton = ''.join(ch for ch in newton if ord(ch) < 128)
The Project Gutenberg EBook of Philosophiae Naturalis Principia Mathematica, by Isaac Newton
len(newton), lz_ratio(newton[:len(newton)//100])
(835497, 0.6752282909812237)
%timeit compress(welcome)
%timeit -n 1 compress(newton[:len(newton)//1000])
%timeit -n 1 compress(newton[:len(newton)//100])
10 loops, best of 3: 92.6 ms per loop 1 loops, best of 3: 282 ms per loop 1 loops, best of 3: 12.3 s per loop
Some more interesting examples:
lz77_compress("a"*40)
['a', (1, 31), (1, 8)]
So you need 8 bits for the initial a
, then 1+12+5=18 bits for the first offset-length pair of length 32 and another 18 bits for the second offset-length pair for a total of 44 bits:
len(compress("a"*40))
44
This can be reduced by increasing the max repetition length:
lz77_compress("a"*40, max_rep_length=2**6-1)
['a', (1, 39)]
For 100 repetitions of a
:
lz77_compress("a"*100)
['a', (1, 31), (1, 31), (1, 31), (1, 6)]
len(compress("a"*100))
80
So in general if we have k
repetitions of a
, then we need $(k-1)//31 + 1$ offset-length pairs plus the initial a
, for a total of $8+18((k-1)//31+1)$:
k = 1000
print(len(compress('a'*k)))
print(18*((k-1)//31+1)+8)
602 602
Try at home - compare LZ and Huffman compression ratios on you favorite text.
k_select
¶In homework 4, you were asked to implement recursively a function k_select, which finds a rank-k element in an orderable set. To reinforce your understanding of recursion, let's solve it together on board. We'll also analyze the number of comparisons which k_select performs in the best case and the worst case.
def k_select(lst, k):
n = len(lst)
assert 0 <= k < n
pivot = lst[0]
small = [x for x in lst if x < pivot]
large = [x for x in lst if x > pivot]
if k < len(small):
return k_select(small, k)
elif k >= n - len(large):
return k_select(large, k - (n - len(large)))
else:
return pivot
After one partition the pivot (left most element) is the k-th element and you are done - $O(n)$.
כל פעם מקטינים את הגודל ב- 1 (זה קורה כאשר הרשימה ממוינת / ממוינת הפוך / או באופן כללי כאשר האיבר השמאלי בכל פעם הוא המינ/מקס. בנוסף, אנחנו מגיעים לאיבר שמחפשים רק בסוף. סה"כ O(n^2). At each call you reduce the list length by one. In addition, you reach the number you are looking for only at the end - $O(n^2)$.
In homework 5, question 2 reality show contestents were given a list of missions and a time to do the missions.
The missions list is a list of tuples of missions points and time to complete the mission:
missions = [ (8, 7) , (4, 6) , (9, 8) , (13, 10) ]
You had to write a function that outputs the maximum number for a given missions list and a time interval.
The recursion formula is, for a given list of points $p$ and list ot times $t$ of length $n$ and for a time interval $T$: $$ A(n,0) = 0 \\\\ A(0,T) = 0 \\\\ A(n,T) = max(A(n-1,T-t_n) + p_n \; , \; A(n-1,T)) \; if \; t_n \le T \\\\ A(n,T) = A(n-1,T) \; if \; t_n > T $$
The recursive step is made by either doing the n-th mission or not doing it.
The code for this question is:
def choose_mem(missions, time, mem, i):
key = (time, i)
if key not in mem:
if time == 0 or i < 0:
mem[key] = 0
else:
p, t = missions[i]
if t > time:
mem[key] = choose_mem(missions, time, mem, i - 1)
else:
mem[key] = max(choose_mem(missions, time, mem, i - 1), p + choose_mem(missions, time - t, mem, i - 1))
return mem[key]
def choose(missions, time):
return choose_mem(missions, time, {}, len(missions) - 1)
choose([ (7, 8) , (6, 4) , (8, 9) , (10, 13) ], 15)
14
This notebook is part of the Extended introduction to computer science course at Tel-Aviv University.
The notebook was written using Python 3.2 and IPython 0.13.1.
The code is available at https://raw.github.com/yoavram/CS1001.py/master/recitation12.ipynb.
The notebook can be viewed online at http://nbviewer.ipython.org/urls/raw.github.com/yoavram/CS1001.py/master/recitation12.ipynb.
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.