Dictionary Learning for Denoising¶


This tour follows the Dictionary Learning numerical tour. It uses a learned dictionary to perform image denoising.

In [2]:
addpath('toolbox_signal')


Learning the Dictionary¶

We aim at applying the dictionary learning method to denoising. We thus consider a noisy image obtained by adding Gaussian noise to a clean, unknown image $f_0 \in \RR^N$ where $N=n_0 \times n_0$.

Noise level $\sigma$.

In [3]:
sigma = .06;


Size of the image.

In [4]:
n0 = 256;


In [5]:
name = 'barb';


Display $f_0$.

In [6]:
clf;
imageplot(f0);


Noisy observations $f=f_0+w$ where $w$ is a realization of $\Nn(0,\si^2\text{Id}_N)$.

In [7]:
f = f0 + sigma*randn(n0);


Display $f$.

In [8]:
clf;
imageplot(f);


Perform the numerical tour on Dictionary Learning to obtain a dictionary $D \in \RR^{n \times p}$

In [9]:
if not(exist('D'))
sparsity_4_dictionary_learning;
end

[Warning: Name is nonexistent or not a directory: ../toolbox_signal]
[> In path at 110
In perform_toolbox_installation>@(p)path(p,path) at 9
In perform_toolbox_installation at 12
In sparsity_4_dictionary_learning at 6
In pymat_eval at 38
In matlabserver at 27]
[Warning: Name is nonexistent or not a directory: ../toolbox_general]
[> In path at 110
In perform_toolbox_installation>@(p)path(p,path) at 9
In perform_toolbox_installation at 12
In sparsity_4_dictionary_learning at 6
In pymat_eval at 38
In matlabserver at 27]


Width $w$ of the patches.

In [10]:
w = sqrt(size(D,1));


Number $p$ of atoms.

In [11]:
p = size(D,2);


Dimension $n= w \times w$ of the data to be sparse coded.

In [12]:
n = w*w;


Denoising by Sparse Coding¶

Here we apply sparse coding in the learned dictionary to the problem of image denoising, as detailed in:

M. Elad and M. Aharon, <http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=4011956 Image Denoising Via Sparse and Redundant representations over Learned Dictionaries>, IEEE Trans. on Image Processing, Vol. 15, no. 12, pp. 3736-3745, December 2006.

The method first extract lots of patches from $f$, then perform sparse coding of each patch in $D$, and then average the overlapping patch to obtained the denoised image $f_1$.

We extract a large number $m$ of patches $Y = (y_j)_{j=1}^m \in \RR^{n \times m}$ from the noisy image $f$.

To extract the patche, we use $$y_j(t) = f( a_j + t )$$ where $a_j$ is the location of patch indexed by $j$.

In practice, splitting $j=(j_1,j_2)$ as a 2-D index, we use $a_j = (j_1 q,j_2 q)$ where $q>0$ is an overlap factor ( so that setting $q=w$ implies no overlap). Overlap is important to obtain good denoising performance (reduced blocking artifact).

In [13]:
q = 2;


Define regularly space positions for the extraction of patches.

In [14]:
[y,x] = meshgrid(1:q:n0-w/2, 1:q:n0-w/2);
m = size(x(:),1);
Xp = repmat(dX,[1 1 m]) + repmat( reshape(x(:),[1 1 m]), [w w 1]);
Yp = repmat(dY,[1 1 m]) + repmat( reshape(y(:),[1 1 m]), [w w 1]);


Ensure boundary conditions (reflexion).

In [15]:
Xp(Xp>n0) = 2*n0-Xp(Xp>n0);
Yp(Yp>n0) = 2*n0-Yp(Yp>n0);


Extract the $m$ patches $Y$.

In [16]:
Y = f(Xp+(Yp-1)*n0);
Y = reshape(Y, [n, m]);


Save the mean $\theta_j$ of each patch appart, and remove it.

In [17]:
theta = mean(Y);
Y = Y - repmat( theta, [n 1] );


Denoising of the patches is obtained by performing a sparse coding of each patch $y_j$ in $D$ $$\umin{\norm{D x_j - y_j} \leq \epsilon} \norm{x_j}_1.$$

The value of $\epsilon>0$ is set proportionaly to the noise level $\sqrt{n}\sigma$ that contaminates each patch. We denote $\rho \approx 1$ the proportionality factor, that is a parameter of the method.

In [18]:
rho = .95;
epsilon = rho*sqrt(n)*sigma;


The sparse coding problem can written equivalently in a way that is easier to deal using proximal splitting schemes. We introduce an auxiliary variable $u=D x \in \RR^n$ as follow $$\umin{ z=(x,u) \in \RR^p \times \RR^n } F(z) + G(z)$$ where for $z=(x,u)$, one defines $$F(x,u) = \norm{x}_1 + \iota_{B_\epsilon(y)}(u) \qwhereq B_\epsilon(y) = \enscond{u}{ \norm{u-y} \leq \epsilon }$$ and $$G(x,u) = \iota_{\Cc}(x,u) \qwhereq \Cc = \enscond{(x,u)}{ u=D x }.$$

Here we included the constraints using characteristic functions $$\iota_{A}(z) = \choice{ 0 \qifq z \in A, \\ +\infty \quad \text{otherwise}. }$$

Douglas-Rachford Algorithm¶

To minimize the sparse coding problem, we make use of a proximal splitting scheme to minimize an energy of the form $F(z)+G(z)$. These schemes are adapted to solve structured non-smooth optimization problem.

They basically replace the traditional gradient-descent step (that is not available because neither $F$ nor $G$ are smooth functionals) by proximal mappings, defined as $$\text{Prox}_{\gamma F}(z) = \uargmin{y} \frac{1}{2}\norm{z-y}^2 + \ga F(y)$$ (the same definition applies also for $G$).

The Douglas-Rachford (DR) algorithm is an iterative scheme to minimize functionals of the form $$\umin{z} F(z) + G(z)$$ where $F$ and $G$ are convex functions for which one is able to comptue the proximal mappings $\text{Prox}_{\gamma F}$ and $\text{Prox}_{\gamma G}$.

The important point is that $F$ and $G$ do not need to be smooth. One onely needs then to be "proximable".

A DR iteration reads $$\tilde z_{k+1} = \pa{1-\frac{\mu}{2}} \tilde z_k + \frac{\mu}{2} \text{rPox}_{\gamma G}( \text{rProx}_{\gamma F}(\tilde z_k) ) \qandq z_{k+1} = \text{Prox}_{\gamma F}(\tilde z_{k+1},)$$

We have use the following shortcuts: $$\text{rProx}_{\gamma F}(z) = 2\text{Prox}_{\gamma F}(z)-z$$

It is of course possible to inter-change the roles of $F$ and $G$, which defines another set of iterations.

One can show that for any value of $\gamma>0$, any $0 < \mu < 2$, and any $\tilde z_0$, $z_k \rightarrow z^\star$ which is a minimizer of $F+G$.

Please note that it is actually $z_k$ that converges, and not $\tilde z_k$.

Proximal Splitting Methods in Signal Processing, Patrick L. Combettes and Jean-Christophe Pesquet, in: Fixed-Point Algorithms for Inverse Problems in Science and Engineering, New York: Springer-Verlag, 2010.

Douglas-Rachford for Sparse Coding¶

In the special case of the constrained sparse coding problem, the proximal mapping of $G$ is the orthogonal projection on the convex set $\Cc$: $$(\tilde x, \tilde u) = \text{Prox}_{\ga G}(x,u) = \text{Proj}_\Cc(x,u).$$

It can be computed by solving a linear system of equations since $$\tilde u = D \tilde x \qwhereq \tilde x = (\text{Id} + D^* D)^{-1}(f + D^* u).$$

Define Proj$_\Cc$, by pre-compuring the inverse of $\text{Id} + D^* D$.

In [19]:
U = (eye(p) + D'*D)^(-1);
Replicate = @(z)deal(z, D*z);
ProjC = @(x,u)Replicate( U*( x + D'*u ) );


One has $\text{Prox}_{\ga G} = \text{Proj}_{\Cc}$, whatever the value of $\ga$.

In [20]:
ProxG = @(f,u,gamma)ProjC(f,u);


Function $F(x,u)$ is actully a separable sum of a function that only depends on $x$ and a function that depends only on $u$: $$F(x,u) = \iota_{B_\epsilon(y)}(u) + \norm{x}_1.$$ The proximal operator of $F$ reads $$\text{Prox}_{\ga F}(x,u) = ( \text{Proj}_{B_\epsilon(y)}(u), \text{Prox}_{\ga \norm{\cdot}_1 }(x) ).$$

In order to speed up the implementation, the DR algorithm will be performed in parallel on all the $x_j$. We thus define $y=Y$ to be the set of all the patches.

In [21]:
y = Y(:,1:m);


Define the projector $$\text{Proj}_{B_\epsilon(y)}(u) = y + (u-y) \max\pa{1 , \frac{\epsilon}{\norm{u-y}} }$$

In [22]:
amplitude = @(a)repmat( sqrt(sum(a.^2,1)), [n 1] );
ProjB = @(u)y + (u-y) .* min(1, epsilon./amplitude(u-y) );


The proximal operator of the $\ell^1$ norm $\norm{\cdot}_1$ is a soft thresholding: $$\text{Prox}_{\ga \norm{\cdot}_1}(x)_i = \max\pa{ 0, \frac{\ga}{\abs{x_i}} } x_i.$$

In [23]:
ProxL1 = @(x,gamma)max(0,1-gamma./max(1e-9, abs(x))) .* x;


Define the proximal operator of $F$.

In [24]:
ProxF = @(x,u,gamma)deal( ProxL1(x,gamma), ProjB(u) );


Set the value of $\mu$ and $\gamma$. You might consider using your own value to speed up the convergence.

In [25]:
mu = 1;
gamma = 1;


Number of iterations.

In [26]:
niter = 800;


Exercise 1

Implement the DR iterative algorithm on |niter| iterations. Keep track of the evolution of the minimized energy $\norm{x}_1$ during the iterations.

In [27]:
exo1()

In [28]:
%% Insert your code here.


Patch Averaging¶

Once each $x_j$ is computed, one obtains the denoised patch $\tilde y_j = D x_j$. These denoised patches are then aggregated together to obtained the denoised image: $$f_1(t) = \frac{1}{W_t} \sum_j \tilde y_j(t-a_j)$$ where $W_t$ is the number of patches that overlap at a given pixel location $t$ (note that in this formula, we assumed $\tilde y_j(t-a_j)=0$ when $t-a_j$ falls outside the patch limits).

Approximated patches $Y_1 = (\tilde y_j)_j = D X$.

In [29]:
Y1 = reshape(D*x, [w w m]);


Insert back the mean.

In [30]:
Y1 = Y1 - repmat( mean(mean(Y1)), [w w] );
Y1 = Y1 + reshape(repmat( theta, [n 1] ), [w w m]);


To obtain the denoising, we average the value of the approximated patches $Y_1$ that overlap.

In [31]:
W = zeros(n0,n0);
f1 = zeros(n0,n0);
for i=1:m
x = Xp(:,:,i); y = Yp(:,:,i);
f1(x+(y-1)*n0) = f1(x+(y-1)*n0) + Y1(:,:,i);
W(x+(y-1)*n0) = W(x+(y-1)*n0) + 1;
end
f1 = f1 ./ W;


Display the result.

In [32]:
clf;
imageplot(clamp(f1), ['Denoised, SNR=' num2str(snr(f0,f1),4) 'dB']);


Exercise 2

Compare the obtained result with translation invariant wavelet hard thresholding.

In [33]:
exo2()

In [34]:
%% Insert your code here.


Exercise 3

Study the influence of the parameter $\rho$ on the quality of the denoising. Study the influence of the number $p$ of atoms.

In [35]:
exo3()

In [36]:
%% Insert your code here.