# Entropic Coding and Compression¶


This numerical tour studies source coding using entropic coders (Huffman and arithmetic).

In [1]:
using PyPlot
using NtToolBox
WARNING: Method definition ndgrid(AbstractArray{T<:Any, 1}) in module NtToolBox at /Users/quentin/.julia/v0.5/NtToolBox/src/ndgrid.jl:3 overwritten at /Users/quentin/.julia/v0.5/NtToolBox/src/ndgrid.jl:3.
WARNING: Method definition ndgrid(AbstractArray{#T<:Any, 1}, AbstractArray{#T<:Any, 1}) in module NtToolBox at /Users/quentin/.julia/v0.5/NtToolBox/src/ndgrid.jl:6 overwritten at /Users/quentin/.julia/v0.5/NtToolBox/src/ndgrid.jl:6.
WARNING: Method definition ndgrid_fill(Any, Any, Any, Any) in module NtToolBox at /Users/quentin/.julia/v0.5/NtToolBox/src/ndgrid.jl:13 overwritten at /Users/quentin/.julia/v0.5/NtToolBox/src/ndgrid.jl:13.
WARNING: Method definition ndgrid(AbstractArray{#T<:Any, 1}...) in module NtToolBox at /Users/quentin/.julia/v0.5/NtToolBox/src/ndgrid.jl:19 overwritten at /Users/quentin/.julia/v0.5/NtToolBox/src/ndgrid.jl:19.
WARNING: Method definition meshgrid(AbstractArray{T<:Any, 1}) in module NtToolBox at /Users/quentin/.julia/v0.5/NtToolBox/src/ndgrid.jl:33 overwritten at /Users/quentin/.julia/v0.5/NtToolBox/src/ndgrid.jl:33.
WARNING: Method definition meshgrid(AbstractArray{#T<:Any, 1}, AbstractArray{#T<:Any, 1}) in module NtToolBox at /Users/quentin/.julia/v0.5/NtToolBox/src/ndgrid.jl:36 overwritten at /Users/quentin/.julia/v0.5/NtToolBox/src/ndgrid.jl:36.
WARNING: Method definition meshgrid(AbstractArray{#T<:Any, 1}, AbstractArray{#T<:Any, 1}, AbstractArray{#T<:Any, 1}) in module NtToolBox at /Users/quentin/.julia/v0.5/NtToolBox/src/ndgrid.jl:44 overwritten at /Users/quentin/.julia/v0.5/NtToolBox/src/ndgrid.jl:44.

## Source Coding and Entropy¶

Entropic coding converts a vector $x$ of integers into a binary stream $y$. Entropic coding exploits the redundancies in the statistical distribution of the entries of $x$ to reduce as much as possible the size of $y$. The lower bound for the number of bits $p$ of $y$ is the Shannon bound :

$$p=-\sum_ih(i)\log_2(h(i))$$

where $h(i)$ is the probability of apparition of symbol $i$ in $x$.

Fist we generate a simple binary signal $x$ so that $0$ has a probability $p$ to appear in $x$.

Probability of 0.

In [2]:
p = 0.1;

Size.

In [3]:
n = 512;

Signal, should be with token 1,2.

In [4]:
x = (rand(n) .> p) + 1;

One can check the probabilities by computing the empirical histogram.

In [5]:
h = [sum(x .== 1); sum(x .== 2)]
h = h/sum(h)

print("Empirical p = $(h[1])") Empirical p = 0.099609375 We can compute the entropy of the distribution represented as a vector$h$of proability that should sum to 1. We take a max to avoid problems with null probabilties. In [6]: e = - sum(h.*log2([max(e,1e-20) for e in h])) print("Entropy =$e")
Entropy = 0.4677561172295487

## Huffman Coding¶

A Hufman code $C$ associates to each symbol $i$ in $\{1,...,m\}$ a binary code $C_i$ whose length is as close as possible to the optimal bound $-\log_2\left(h(i)\right)$, where $h(i)$ is the probability of apparition of the symbol $i$.

We select a set of proabilities.

In [7]:
h = [.1, .15, .4, .15, .2];

The tree $T$ contains the codes and is generated by an iterative algorithm. The initial "tree" is a collection of empty trees, pointing to the symbols numbers.

In [8]:
m = length(h)
T=Array{Any,1}(zeros(m)); # create an empty tree

We build iteratively the Huffman tree by grouping together the two erees that have the smallest probabilities. The merged tree has a probability which is the sum of the two selected probabilities.

Initial probability.

In [9]:
#we use the symbols i = 0,1,2,3,4 (as strings) with the associated probabilities h(i)

for i in 1:m
T[i] = (h[i],string(i))
end

Iterative merging of the leading probabilities.

In [10]:
while length(T) > 1

sort!(T) #sort according to the first values of the tuples (the probabilities)
t = tuple(T[1:2]...)
q = T[1][1] + T[2][1]
T = [T[3:end]; [(q,t)]]
end

We trim the computed tree by removing the probabilities.

In [11]:
function trim(T)
T0 = T[2]
if typeof(T0) == String
return T0
else
return (trim(T0[1]),trim(T0[2]))
end
end
T = trim(T[1]);

We display the computed tree.

In [12]:
plot_hufftree(T);

Once the tree $T$ is computed, one can compute the code $C_{i}$ associated to each symbol $i$. This requires to perform a deep first search in the tree and stop at each node.

In [13]:
codes = Dict()

function huffman_gencode(T,codes,c)
if typeof(T) == String #test if T is a leaf
codes[T] = c
else
huffman_gencode(T[1],codes, string(c, "0"))
huffman_gencode(T[2],codes, string(c, "1"))
end
end

huffman_gencode(T,codes,"") ;

Display the code.

In [14]:
for e in keys(codes)
println(string("Code of token ", e, ": ", codes[e]))
end
Code of token 4: 110
Code of token 1: 100
Code of token 5: 111
Code of token 2: 101
Code of token 3: 0

We draw a vector $x$ according to the distribution $h$.

Size of the signal.

In [15]:
n = 1024;

Randomization.

In [16]:
function rand_discr(p; m = 1)
"""
rand_discr - discrete random generator

y = rand_discr(p, n);

y is a random vector of length n drawn from
a variable X such that
p(i) = Prob( X=i )

"""

# makes sure it sums to 1
p = p/sum(p)

n = length(p)
coin = rand(m)
cumprob = [0.]
append!(cumprob, cumsum(p))
sample = zeros(m)

for j in 1:n
ind = find((coin .> cumprob[j]) & (coin .<= cumprob[j+1]))
sample[ind] = j
end
return sample
end

x = rand_discr(h, m=n);

Exercise 1

Implement the coding of the vector $x$ to obtain a binary vector $y$, which corresponds to replacing each sybmol $x(i)$ by the code $C_{x(i)}$.

In [17]:
include("NtSolutions/coding_2_entropic/exo1.jl")
In [18]:

Compare the length of the code with the entropy bound.

In [19]:
e = - sum(h.*log2([max(e,1e-20) for e in h]))
println("Entropy bound = $(n*e)") println("Huffman code =$(length(y))")
Entropy bound = 2197.9538889431196
Huffman code  = 2254

Decoding is more complicated, since it requires to iteratively parse the tree $T$.

Initial empty decoded stream.

In [20]:
x1 = [];

Perform decoding.

In [21]:
T0 = T
for e in y
if e == '0'
T0 = T0[1]
else
T0 = T0[2]
end
if typeof(T0) == String
append!(x1,T0)
T0 = T
end
end

We test if the decoding is correct.

In [22]:
err = norm(x-map(x->parse(Int, x), x1))
print("Error (should be zero) : $err") Error (should be zero) : 0.0 ## Huffman Block Coding¶ A Huffman coder is inefficient because it can distribute only an integer number of bit per symbol. In particular, distribution where one of the symbol has a large probability are not well coded using a Huffman code. This can be aleviated by replacing the set of$m$symbols by$m^q$symbols obtained by packing the symbols by blocks of$q$(here we use$m=2$for a binary alphabet). This breaks symbols with large probability into many symbols with smaller proablity, thus approaching the Shannon entropy bound. Generate a binary vector with a high probability of having 1, so that the Huffman code is not very efficient (far from Shanon bound). Proability of having 0. In [23]: t = .12; Probability distriution. In [24]: h = [t, 1-t]; Generate signal. In [25]: n = 4096*2 x = (rand(n) .> t) + 1; For block of length$q=3$, create a new vector by coding each block with an integer in$\{1,...,m^q=2^3\}$. The new length of the vector is$n_1/q$where$n_1=\lceil n/q\rceil q$. Block size. In [26]: q = 3; Maximum token value. In [27]: m = 2; New size. In [28]: n1 = Int(ceil(n/q)*q); New vector. In [29]: x1 = zeros(n1) x1[1:length(x)] = x x1[length(x)+1:end] = 1 x1 = x1 - 1 x2 = [] mult = [m^j for j in 0:q-1] for i in 1:q:n1-1 append!(x2,sum(x1[i:i+q-1].*mult)+1) end In [30]: H = h for i in 1:q-1 Hold = H H = [] for j in 1:length(h) append!(H,[e*h[j] for e in Hold]) end end A simpler way to compute this block-histogram is to use the Kronecker product. In [31]: H = h for i in 1:q-1 H = kron(H, h) end Exercise 2 For various values of block size$k\$, Perform the Huffman coding and compute the length of the code. Compare with the entropy lower bound.

In [32]:
include("NtSolutions/coding_2_entropic/exo2.jl");
Entropy bound = 0.5293608652873644
---
Huffman(block size = 1) = 1.0
Huffman(block size = 2) = 0.67041015625
Huffman(block size = 3) = 0.5736083984375
Huffman(block size = 4) = 0.5513916015625
Huffman(block size = 5) = 0.5406494140625
Huffman(block size = 6) = 0.537353515625
Huffman(block size = 7) = 0.5438232421875
Huffman(block size = 8) = 0.5411376953125
Huffman(block size = 9) = 0.541748046875
Huffman(block size = 10) = 0.5364990234375
In [33]: