# Dictionary Learning¶


Instead of using a fixed data representation such as wavelets or Fourier, one can learn the representation (the dictionary) to optimize the sparsity of the representation for a large class of exemplar.

In [1]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np

from nt_toolbox.signal import load_image, imageplot, plot_dictionary
from nt_toolbox.general import crop


## Dictionary Learning as a Non-convex Optimization Problem¶

Given a set $Y = (y_j)_{j=1}^m \in \RR^{n \times m}$ of $m$ signals $y_j \in \RR^m$, dictionary learning aims at finding the best dictionary $D=(d_i)_{i=1}^p$ of $p$ atoms $d_i \in \RR^n$ to sparse code all the data.

In this numerical tour, we consider an application to image denoising, so that each $y_j \in \RR^n$ is a patch of size $n=w \times w$ extracted from the noisy image.

The idea of learning dictionaries to sparse code image patch was first proposed in:

Olshausen BA, and Field DJ., <http://www.nature.com/nature/journal/v381/n6583/abs/381607a0.html Emergence of Simple-Cell Receptive Field Properties by Learning a Sparse Code for Natural Images.> Nature, 381: 607-609, 1996.

The sparse coding of a single data $y=y_j$ for some $j=1,\ldots,m$ is obtained by minimizing a $\ell^0$ constrained optimization $$\umin{ \norm{x}_0 \leq k } \frac{1}{2}\norm{y-Dx}^2 .$$ where the $\ell^0$ pseudo-norm of $x \in \RR^p$ is $$\norm{x}_0 = \abs{\enscond{i}{x(i) \neq 0}}.$$

The parameter $k>0$ controls the amount of sparsity.

Dictionary learning performs an optimization both on the dictionary $D$ and the set of coefficients $X = (x_j)_{j=1}^m \in \RR^{p \times m}$ where, for $j=1,\ldots,m$, $x_j$ is the set of coefficients of the data $y_j$. This joint optimization reads $$\umin{ D \in \Dd, X \in \Xx_k } E(X,D) = \frac{1}{2}\norm{Y-DX}^2 = \frac{1}{2} \sum_{j=1}^m \norm{y_j - D x_j}^2.$$

The constraint set on $D$ reads $$\Dd = \enscond{D \in \RR^{n \times p} }{ \forall i=1,\ldots,p, \quad \norm{D_{\cdot,i}} \leq 1 },$$ (the columns of the dictionary are unit normalized). The sparsity constraint set on $X$ reads $$\Xx_k = \enscond{X \in \RR^{p \times m}}{ \forall j, \: \norm{X_{\cdot,j}}_0 \leq k }.$$

We propose to use a block-coordinate descent method to minimize $E$: $$X^{(\ell+1)} \in \uargmin{X \in \Xx_k} E(X,D^{(\ell)}),$$ $$D^{(\ell+1)} \in \uargmin{D \in \Dd} E(X^{(\ell+1)},D).$$

One can show the convergence of this minimization scheme, see for instance

P. Tseng, <http://www.math.washington.edu/~tseng/papers/archive/bcr_jota.pdf Convergence of Block Coordinate Descent Method for Nondifferentiable Minimization>, J. Optim. Theory Appl., 109, 2001, 475-494.

We now define the parameter of the problem.

Width $w$ of the patches.

In [2]:
w = 10


Dimension $n= w \times w$ of the data to be sparse coded.

In [3]:
n = w*w


Number of atoms $p$ in the dictionary.

In [4]:
p = 2*n


Number $m$ of patches used for the training.

In [5]:
m = 20*p


Target sparsity $k$.

In [6]:
k = 4


## Patch Extraction¶

Since the learning is computationnaly intensive, one can only apply it to small patches extracted from an image.

In [7]:
f = crop(load_image("nt_toolbox/data/barb.bmp"), 256)
n0 = f.shape[0]


Display the input image.

In [8]:
plt.figure(figsize = (5,5))
imageplot(f)


Random patch location.

In [9]:
q = 3*m

# Random sampling of coordinates of the top left corner or the patches
x = (np.random.random((1, 1, q))*(n0-w)).astype(int)
y = (np.random.random((1, 1, q))*(n0-w)).astype(int)


Extract lots of patches $y_j \in \RR^n$, and store them in a matrix $Y=(y_j)_{j=1}^m$.

In [10]:
[dY, dX] = np.meshgrid(range(w), range(w))
dX = np.tile(dX, (q, 1, 1)).transpose((1, 2, 0))
dY = np.tile(dY, (q, 1, 1)).transpose((1, 2, 0))
Xp = np.tile(x, (w, w, 1)) + dX
Yp = dY + np.tile(y, (w, w, 1))

# Extract patches
Y = f.flatten()[Yp+Xp*n0]
Y = Y.reshape((w*w, q))


We remove the mean, since we are going to learn a dictionary of zero-mean and unit norm atom.

In [11]:
Y -= Y.mean(axis=0)


Only keep those with largest energy.

In [12]:
energies = np.sum(Y**2, axis=0)
indexes = np.argsort(energies)[::-1]
Y = Y[:, indexes[:m]];


We consider a dictionary $D \in \RR^{n \times p}$ of $p \geq n$ atoms in $\RR^n$. The initial dictionary $D$ is computed by a random selection of patches, and we normalize them to be unit-norm.

In [13]:
def projC(Y):
''' Scale the patches to unit norm '''
norm = np.tile(np.linalg.norm(Y, axis=0), (Y.shape[0], 1))
Y = np.divide(Y, norm)
return  Y

# Pick p indexes at random to pick patches from Y
sel = np.random.permutation(range(m))[:p]
D = projC(Y[:, sel])
D0 = D.copy()


Display the initial dictionary.

In [14]:
plt.figure(figsize=(10,6))
plot_dictionary(D)


## Update of the Coefficients $X$¶

The optimization on the coefficients $X$ requires, for each $y_j = Y_{\cdot,j}$ to compute $x_j = X_{\cdot,j}$ that solves $$\umin{ \norm{x_j}_0 \leq k } \frac{1}{2} \norm{y-D x_j}^2.$$

This is a non-smooth and non-convex minimization, that can be shown to be NP-hard. A heuristic to solve this method is to compute a stationary point of the energy using the Foward-Backward iterative scheme (projected gradient descent): $$x_j \leftarrow \text{Proj}_{\Xx_k}\pa{ x_j - \tau D^* ( D x_j - y ) } \qwhereq \tau < \frac{2}{\norm{D D^*}}.$$

Denoting $\abs{\bar x(1)} \leq \ldots \leq \abs{\bar x(n)}$ the ordered magnitudes of a vector $x \in \RR^n$, the orthogonal projector on $\Xx_k$ reads $z = \text{Proj}_{\Xx_k}(x)$ with $$\forall i=1,\ldots,n, \quad z(i) = \choice{ x(i) \qifq \abs{x(i)} \geq \abs{\bar x(k)}, \\ z(i) = 0 \quad \text{otherwise}. }$$

In [15]:
def projX(X, k):
''' Sparsity projection, keeps the k largest coefficients '''
X = X * (abs(X) >= np.sort(abs(X), axis=0)[-k, :])
return X


Exercise 1

Perform the iterative hard thresholding, and display the decay of the energy $J(x_j) = \norm{y_j-D x_j}^2$ for several $j$. Remark: note that the iteration can be performed in parallel on all $x_j$.

In [16]:
run -i nt_solutions/sparsity_4_dictionary_learning/exo1

In [17]:
## Insert your code here.


## Update the Dictionary $D$¶

Once the sparse coefficients $X$ have been computed, one can udpate the dictionary. This is achieve by performing the minimization $$\umin{D \in \Dd} \frac{1}{2}\norm{Y-D X}^2.$$

One can perform this minimization with a projected gradient descent $$D \leftarrow \text{Proj}_{\Cc}\pa{ D - \tau (DX - Y)X^* }$$ where $\tau < 2/\norm{XX^*}.$

Note that the orthogonal projector $\text{Proj}_{\Cc}$ is implemented in the function |ProjC| already defined.

Exercise 2

Perform this gradient descent, and monitor the decay of the energy.

In [18]:
run -i nt_solutions/sparsity_4_dictionary_learning/exo2

In [19]:
## Insert your code here.


Exercise 3

Perform the dictionary learning by iterating between sparse coding and dictionary update.

In [20]:
run -i nt_solutions/sparsity_4_dictionary_learning/exo3

In [21]:
## Insert your code here.


Display the dictionary.

In [22]:
plt.figure(figsize=(10,6))
plot_dictionary(D)