Important: Please read the installation page for details about how to install the toolboxes. $\newcommand{\dotp}[2]{\langle #1, #2 \rangle}$ $\newcommand{\enscond}[2]{\lbrace #1, #2 \rbrace}$ $\newcommand{\pd}[2]{ \frac{ \partial #1}{\partial #2} }$ $\newcommand{\umin}[1]{\underset{#1}{\min}\;}$ $\newcommand{\umax}[1]{\underset{#1}{\max}\;}$ $\newcommand{\umin}[1]{\underset{#1}{\min}\;}$ $\newcommand{\uargmin}[1]{\underset{#1}{argmin}\;}$ $\newcommand{\norm}[1]{\|#1\|}$ $\newcommand{\abs}[1]{\left|#1\right|}$ $\newcommand{\choice}[1]{ \left\{ \begin{array}{l} #1 \end{array} \right. }$ $\newcommand{\pa}[1]{\left(#1\right)}$ $\newcommand{\diag}[1]{{diag}\left( #1 \right)}$ $\newcommand{\qandq}{\quad\text{and}\quad}$ $\newcommand{\qwhereq}{\quad\text{where}\quad}$ $\newcommand{\qifq}{ \quad \text{if} \quad }$ $\newcommand{\qarrq}{ \quad \Longrightarrow \quad }$ $\newcommand{\ZZ}{\mathbb{Z}}$ $\newcommand{\CC}{\mathbb{C}}$ $\newcommand{\RR}{\mathbb{R}}$ $\newcommand{\EE}{\mathbb{E}}$ $\newcommand{\Zz}{\mathcal{Z}}$ $\newcommand{\Ww}{\mathcal{W}}$ $\newcommand{\Vv}{\mathcal{V}}$ $\newcommand{\Nn}{\mathcal{N}}$ $\newcommand{\NN}{\mathcal{N}}$ $\newcommand{\Hh}{\mathcal{H}}$ $\newcommand{\Bb}{\mathcal{B}}$ $\newcommand{\Ee}{\mathcal{E}}$ $\newcommand{\Cc}{\mathcal{C}}$ $\newcommand{\Gg}{\mathcal{G}}$ $\newcommand{\Ss}{\mathcal{S}}$ $\newcommand{\Pp}{\mathcal{P}}$ $\newcommand{\Ff}{\mathcal{F}}$ $\newcommand{\Xx}{\mathcal{X}}$ $\newcommand{\Mm}{\mathcal{M}}$ $\newcommand{\Ii}{\mathcal{I}}$ $\newcommand{\Dd}{\mathcal{D}}$ $\newcommand{\Ll}{\mathcal{L}}$ $\newcommand{\Tt}{\mathcal{T}}$ $\newcommand{\si}{\sigma}$ $\newcommand{\al}{\alpha}$ $\newcommand{\la}{\lambda}$ $\newcommand{\ga}{\gamma}$ $\newcommand{\Ga}{\Gamma}$ $\newcommand{\La}{\Lambda}$ $\newcommand{\si}{\sigma}$ $\newcommand{\Si}{\Sigma}$ $\newcommand{\be}{\beta}$ $\newcommand{\de}{\delta}$ $\newcommand{\De}{\Delta}$ $\newcommand{\phi}{\varphi}$ $\newcommand{\th}{\theta}$ $\newcommand{\om}{\omega}$ 
$\newcommand{\Om}{\Omega}$
This tour explores the use of the Newton method for the unconstrained optimization of a smooth function.
addpath('toolbox_signal')
addpath('toolbox_general')
addpath('solutions/optim_2_newton')
We test here the Newton method for the minimization of a 2-D function.
We define a highly anisotropic function, the Rosenbrock function
$$ f(x) = (1-x_1)^2 + 100 (x_2-x_1^2)^2. $$
f = @(x1,x2)(1-x1).^2 + 100*(x2-x1.^2).^2;
The minimum of the function is reached at $x^\star=(1,1)$ and $f(x^\star)=0$.
Evaluate the function on a regular grid.
x1 = linspace(-2,2,150);
x2 = linspace(-.5,3,150);
[X2,X1] = meshgrid(x2,x1);
F = f(X1,X2);
3-D display.
clf;
surf(x2,x1, F, perform_hist_eq(F, 'linear') );
shading interp;
camlight;
axis tight;
colormap jet;
2-D display (histogram equalization helps to better visualize the iso-contours).
clf;
imageplot( perform_hist_eq(F, 'linear') );
colormap jet(256);
Gradient descent methods, which only use first-order (gradient) information about $f$, are unable to efficiently minimize this function because of its high anisotropy.
Define the gradient of $f$: $$ \nabla f(x) = \pa{ \pd{f(x)}{x_1}, \pd{f(x)}{x_2} } = \pa{ 2 (x_1-1) + 400 x_1 (x_1^2-x_2), 200 (x_2-x_1^2) } \in \RR^2. $$
gradf = @(x1,x2)[2*(x1-1) + 400*x1.*(x1.^2-x2); 200*(x2-x1.^2)];
Gradf = @(x)gradf(x(1),x(2));
Compute its Hessian $$ Hf(x) = \begin{pmatrix} \frac{\partial^2 f(x)}{\partial x_1^2} & \frac{\partial^2 f(x)}{\partial x_1 \partial x_2} \\ \frac{\partial^2 f(x)}{\partial x_1 \partial x_2} & \frac{\partial^2 f(x)}{\partial x_2^2} \end{pmatrix} = \begin{pmatrix} 2 + 400 (x_1^2-x_2) + 800 x_1^2 & -400 x_1 \\ -400 x_1 & 200 \end{pmatrix} \in \RR^{2 \times 2}. $$
hessf = @(x1,x2)[2 + 400*(x1.^2-x2) + 800*x1.^2, -400*x1; ...
-400*x1, 200];
Hessf = @(x)hessf(x(1),x(2));
The Newton method iterates, starting from some $x^{(0)} \in \RR^2$, $$ x^{(\ell+1)} = x^{(\ell)} - Hf( x^{(\ell)} )^{-1} \nabla f(x^{(\ell)}). $$
Exercise 1
Implement the Newton algorithm. Display the evolution of $f(x^{(\ell)})$ and $\norm{x^{(\ell)}-x^{(+\infty)}}$ during the iterations.
exo1()
%% Insert your code here.
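As a sanity check for this exercise, here is a NumPy transcription (outside the tour's MATLAB toolboxes) of the plain Newton iteration on the Rosenbrock function. The starting point $(-1.2, 1)$ is an arbitrary classical choice, not prescribed by the tour.

```python
import numpy as np

# Rosenbrock function, its gradient and Hessian, transcribed from above.
f = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
gradf = lambda x: np.array([2*(x[0] - 1) + 400*x[0]*(x[0]**2 - x[1]),
                            200*(x[1] - x[0]**2)])
hessf = lambda x: np.array([[2 + 400*(x[0]**2 - x[1]) + 800*x[0]**2, -400*x[0]],
                            [-400*x[0], 200.0]])

x = np.array([-1.2, 1.0])   # classical starting point (arbitrary choice)
fvals = [f(x)]
for _ in range(20):
    # Newton step: x <- x - Hf(x)^{-1} grad f(x)
    x = x - np.linalg.solve(hessf(x), gradf(x))
    fvals.append(f(x))
```

Note that $f(x^{(\ell)})$ is not monotonically decreasing: the undamped Newton step can temporarily increase the energy far from the minimizer, yet convergence to $(1,1)$ is quadratic once the iterates reach the valley.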
Exercise 2
Display the evolution of $x^{(\ell)}$, from several starting points.
exo2()
%% Insert your code here.
Local differential operators like the gradient, divergence and Laplacian are the building blocks for variational image processing.
Load an image $g \in \RR^N$ of $N=n \times n$ pixels.
n = 256;
g = rescale( load_image('lena',n) );
Display it.
clf;
imageplot(g);
For continuous functions, the gradient reads $$ \nabla g(x) = \pa{ \pd{g(x)}{x_1}, \pd{g(x)}{x_2} } \in \RR^2. $$ (note that here, the variable $x$ denotes the 2-D spatial position).
We discretize this differential operator using first order finite differences. $$ (\nabla g)_i = ( g_{i_1,i_2}-g_{i_1-1,i_2}, g_{i_1,i_2}-g_{i_1,i_2-1} ) \in \RR^2. $$ Note that for simplicity we use periodic boundary conditions.
Compute its gradient, using finite differences.
s = [n 1:n-1];
grad = @(f)cat(3, f-f(s,:), f-f(:,s));
One thus has $ \nabla : \RR^N \mapsto \RR^{N \times 2}. $
v = grad(g);
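For readers following along outside MATLAB, the same periodic backward differences can be sketched with NumPy's `np.roll` (a hypothetical transcription, not part of the toolboxes):

```python
import numpy as np

def grad(f):
    # (grad f)_i = (f_{i1,i2} - f_{i1-1,i2}, f_{i1,i2} - f_{i1,i2-1}),
    # with periodic boundary conditions implemented via np.roll.
    return np.stack([f - np.roll(f, 1, axis=0),
                     f - np.roll(f, 1, axis=1)], axis=2)

n = 8
g = np.random.rand(n, n)
v = grad(g)   # maps R^{n x n} to R^{n x n x 2}
```

With periodic boundary conditions each column of differences telescopes, so every component of the gradient sums to zero over the image.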
One can display each of its components.
clf;
imageplot(v(:,:,1), 'd/dx', 1,2,1);
imageplot(v(:,:,2), 'd/dy', 1,2,2);
One can also display it using a color image.
clf;
imageplot(v);
One can display its magnitude $\norm{\nabla g(x)}$, which is large near edges.
clf;
imageplot( sqrt( sum3(v.^2,3) ) );
The divergence operator maps vector fields to images. For continuous vector fields $v(x) \in \RR^2$, it is defined as $$ \text{div}(v)(x) = \pd{v_1(x)}{x_1} + \pd{v_2(x)}{x_2} \in \RR. $$ (note that here, the variable $x$ denotes the 2-D spatial position). It is minus the adjoint of the gradient, i.e. $\text{div} = - \nabla^*$.
It is discretized, for $v=(v^1,v^2)$ as $$ \text{div}(v)_i = v^1_{i_1+1,i_2} - v^1_{i_1,i_2} + v^2_{i_1,i_2+1} - v^2_{i_1,i_2} . $$
t = [2:n 1];
div = @(v)v(t,:,1)-v(:,:,1) + v(:,t,2)-v(:,:,2);
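One can check numerically that this discrete `div` is indeed minus the adjoint of `grad`, i.e. $\dotp{\nabla f}{v} = -\dotp{f}{\text{div}(v)}$. A NumPy transcription of this check (the operator definitions mirror the MATLAB ones above):

```python
import numpy as np

def grad(f):
    # backward differences with periodic boundary conditions
    return np.stack([f - np.roll(f, 1, axis=0),
                     f - np.roll(f, 1, axis=1)], axis=2)

def div(v):
    # div(v)_i = v1_{i1+1,i2} - v1_{i1,i2} + v2_{i1,i2+1} - v2_{i1,i2}
    return (np.roll(v[:, :, 0], -1, axis=0) - v[:, :, 0]
          + np.roll(v[:, :, 1], -1, axis=1) - v[:, :, 1])

n = 16
f = np.random.rand(n, n)
v = np.random.rand(n, n, 2)
lhs = np.sum(grad(f) * v)    # <grad f, v>
rhs = -np.sum(f * div(v))    # -<f, div v> = <f, grad^* v>
```

The two inner products agree to machine precision, which is exactly the adjointness relation used throughout the rest of the tour.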
The Laplacian operator is defined as $\Delta=\text{div} \circ \nabla = -\nabla^* \circ \nabla$. It is thus a negative symmetric operator.
delta = @(f)div(grad(f));
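Composing the two discrete operators recovers the classical 5-point stencil $ (\Delta f)_i = f_{i_1+1,i_2}+f_{i_1-1,i_2}+f_{i_1,i_2+1}+f_{i_1,i_2-1}-4 f_i $. A NumPy sketch of this identity (operators transcribed from above):

```python
import numpy as np

def grad(f):
    return np.stack([f - np.roll(f, 1, axis=0),
                     f - np.roll(f, 1, axis=1)], axis=2)

def div(v):
    return (np.roll(v[:, :, 0], -1, axis=0) - v[:, :, 0]
          + np.roll(v[:, :, 1], -1, axis=1) - v[:, :, 1])

delta = lambda f: div(grad(f))   # Laplacian as div o grad

n = 12
f = np.random.rand(n, n)
# periodic 5-point stencil, written out directly
stencil = (np.roll(f, 1, axis=0) + np.roll(f, -1, axis=0)
         + np.roll(f, 1, axis=1) + np.roll(f, -1, axis=1) - 4*f)
```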
Display $\Delta g$.
clf;
imageplot(delta(g));
Check the relation $ \norm{\nabla g}^2 = - \dotp{\Delta g}{g}. $
dotp = @(a,b)sum(a(:).*b(:));
fprintf('Should be 0: %.3i\n', dotp(grad(g), grad(g)) + dotp(delta(g),g) );
Should be 0: 000
We consider now the problem of denoising an image $y \in \RR^N$ where $N = n \times n$ is the number of pixels ($n$ being the number of rows/columns in the image).
Add noise to the clean image, to simulate a noisy image $y$.
sigma = .1;
y = g + randn(n)*sigma;
Display the noisy image $y$.
clf;
imageplot(clamp(y));
Denoising is obtained by minimizing the following functional $$ \umin{x \in \RR^N} f(x) = \frac{1}{2} \norm{ y-x}^2 + \la J(x) $$ where $J(x)$ is a smoothed total variation of the image. $$ J(x) = \sum_i \norm{ (G x)_i }_{\epsilon} $$ where $ (Gx)_i \in \RR^2 $ is an approximation of the gradient of $x$ at pixel $i$ and for $u \in \RR^2$, we use the following smoothing of the $L^2$ norm in $\RR^2$ $$ \norm{u}_\epsilon = \sqrt{ \epsilon^2 + \norm{u}^2 }, $$ for a small value of $\epsilon>0$.
Value for $\lambda$.
lambda = .3/5;
Value for $\epsilon$.
epsilon = 1e-3;
TV norm.
Amplitude = @(u)sqrt(epsilon^2 + sum(u.^2,3));
J = @(x)sum(sum(Amplitude(grad(x))));
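A NumPy transcription of the smoothed TV functional. As a sanity check, on a constant image the gradient vanishes everywhere, so $J$ reduces to $\epsilon$ times the number of pixels:

```python
import numpy as np

epsilon = 1e-3

def grad(x):
    # periodic backward differences, as before
    return np.stack([x - np.roll(x, 1, axis=0),
                     x - np.roll(x, 1, axis=1)], axis=2)

Amplitude = lambda u: np.sqrt(epsilon**2 + np.sum(u**2, axis=2))
J = lambda x: np.sum(Amplitude(grad(x)))

n = 16
J_const = J(np.ones((n, n)))   # constant image: grad = 0, so J = epsilon * n^2
```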
Function to minimize.
f = @(x)1/2*norm(x-y,'fro')^2 + lambda*J(x);
The gradient of the functional reads $$ \nabla f(x) = x-y + \lambda \nabla J(x) $$ where the gradient of the smoothed TV norm is $$ \nabla J(x) = G^*( u ) \qwhereq u_i = \frac{ (G x)_i }{\norm{ (G x)_i }_\epsilon}, $$ where $G^*$ is the adjoint operator of $G$, which corresponds to minus the discretized divergence.
Define the gradient of the TV norm. Note that |div| implements $-G^*$.
Normalize = @(u)u./repmat(Amplitude(u), [1 1 2]);
GradJ = @(x)-div( Normalize(grad(x)) );
Gradient of the functional.
Gradf = @(x)x-y+lambda*GradJ(x);
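One can validate this expression of $\nabla f$ with a central finite-difference check, $ (f(x+h d) - f(x-h d))/(2h) \approx \dotp{\nabla f(x)}{d} $. A NumPy sketch, where the image $y$ is just random data chosen for the test:

```python
import numpy as np

np.random.seed(0)
n, lam, epsilon = 16, 0.06, 1e-3
y = np.random.rand(n, n)   # stand-in "noisy image" for the test

def grad(x):
    return np.stack([x - np.roll(x, 1, axis=0),
                     x - np.roll(x, 1, axis=1)], axis=2)

def div(v):
    return (np.roll(v[:, :, 0], -1, axis=0) - v[:, :, 0]
          + np.roll(v[:, :, 1], -1, axis=1) - v[:, :, 1])

Amplitude = lambda u: np.sqrt(epsilon**2 + np.sum(u**2, axis=2))
J = lambda x: np.sum(Amplitude(grad(x)))
f = lambda x: 0.5 * np.sum((x - y)**2) + lam * J(x)
GradJ = lambda x: -div(grad(x) / Amplitude(grad(x))[:, :, None])
Gradf = lambda x: x - y + lam * GradJ(x)

x = np.random.rand(n, n)
d = np.random.randn(n, n)        # random test direction
h = 1e-6
fd = (f(x + h*d) - f(x - h*d)) / (2*h)   # directional derivative estimate
ip = np.sum(Gradf(x) * d)                # <grad f(x), d>
```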
The gradient of the functional can be written as $$ \nabla f(x) = x - y + \la \, G^* \circ \delta_{A(x)} \circ G (x) $$ where $A(x) \in \RR^N$ is the norm of the gradient $$ A(x)_i = \norm{(G x)_i} $$ and where, for $a \in \RR^{N}$ and a vector field $u$, $$ \de_{a}(u)_i = \frac{u_i}{ \sqrt{\epsilon^2+a_i^2} }. $$
A = @(x)sqrt(sum(grad(x).^2, 3));
delta = @(a,u)u./sqrt( epsilon^2 + repmat(a.^2,[1 1 2]) );
This shows that the Hessian of $f$ is $$ Hf(x) = \text{Id} + \la \, G^* \circ \delta_{A(x)} \circ G + \la \, G^* \circ \Delta_{x} (G(x)) $$ where $\Delta_{x}(u) $ is the derivative of the mapping $ x \mapsto \delta_{A(x)}(u) $.
Note that since $f$ is strictly convex, $Hf(x)$ is a positive definite matrix.
We use a quasi-Newton scheme $$ x^{(\ell+1)} = x^{(\ell)} - H_\ell^{-1} \nabla f(x^{(\ell)}), $$ where $ H_\ell $ is an approximation of the Hessian $H f(x^{(\ell)})$.
We define this approximation as $$ H_\ell = \tilde H f(x^{(\ell)}) \qwhereq \tilde Hf(x) = \text{Id} + \la \, G^* \circ \delta_{A(x)} \circ G. $$
$H_\ell$ is a symmetric positive definite matrix that is bounded from below by the identity, which ensures the convergence of the method.
Implement $\tilde H f(x)$. Note that it is parameterized by $a=A(x)$.
H = @(z,a)z - lambda*div( delta(a, grad(z)) );
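One can check numerically that $\tilde H f(x)$ is symmetric and bounded from below by the identity: $\dotp{\tilde H u}{v} = \dotp{u}{\tilde H v}$ and $\dotp{\tilde H u}{u} \geq \norm{u}^2$. A NumPy sketch, with $a$ standing in for $A(x)$ at some image $x$:

```python
import numpy as np

np.random.seed(1)
n, lam, epsilon = 16, 0.06, 1e-3

def grad(x):
    return np.stack([x - np.roll(x, 1, axis=0),
                     x - np.roll(x, 1, axis=1)], axis=2)

def div(v):
    return (np.roll(v[:, :, 0], -1, axis=0) - v[:, :, 0]
          + np.roll(v[:, :, 1], -1, axis=1) - v[:, :, 1])

def H(z, a):
    # tilde Hf applied to z: H z = z + lam * G^* delta_a G z  (div = -G^*)
    w = grad(z) / np.sqrt(epsilon**2 + a**2)[:, :, None]
    return z - lam * div(w)

a = np.random.rand(n, n)         # stands for A(x) at some image x
u = np.random.randn(n, n)
v = np.random.randn(n, n)
sym_lhs = np.sum(H(u, a) * v)    # <H u, v>
sym_rhs = np.sum(u * H(v, a))    # <u, H v>
quad = np.sum(H(u, a) * u)       # <H u, u>, should dominate ||u||^2
```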
Implement $\tilde H f(x)^{-1}$ using a conjugate gradient. Note that it takes as argument a previous solution |u_prev|.
flat = @(x)x(:); flatI = @(x)reshape(x,[n,n]);
tol = eps; maxit = 40;
Hinv = @(u,a,u_prev)cgs(@(z)flat(H(flatI(z),a)), flat(u),tol,maxit,[],[],flat(u_prev));
Initialize the algorithm $x^{(0)} = y$.
x = y;
Compute the vector $d = H_\ell^{-1} \nabla f(x^{(\ell)})$.
Important: to get rid of the warning message, you should use the code as follows:
d = zeros(n); % replace this line by the previous iterate value of d
[d,~] = Hinv(Gradf(x), A(x), d);
d = flatI(d);
Compute $x^{(\ell+1)} = x^{(\ell)} - d$.
x = x - d;
Exercise 3
Implement the Newton descent algorithm.
exo3()
%% Insert your code here.
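For Exercise 3, here is a NumPy sketch of the full quasi-Newton loop, with a small hand-rolled conjugate gradient standing in for MATLAB's |cgs|. The image is random data here, just to exercise the scheme; the outer/inner iteration counts are arbitrary choices. Note that since $\nabla f(x) = H_\ell x - y$, each step amounts to the lagged-diffusivity iteration $x^{(\ell+1)} = H_\ell^{-1} y$, so the energy decreases.

```python
import numpy as np

np.random.seed(2)
n, lam, epsilon, sigma = 16, 0.06, 1e-3, 0.1

def grad(x):
    return np.stack([x - np.roll(x, 1, axis=0),
                     x - np.roll(x, 1, axis=1)], axis=2)

def div(v):
    return (np.roll(v[:, :, 0], -1, axis=0) - v[:, :, 0]
          + np.roll(v[:, :, 1], -1, axis=1) - v[:, :, 1])

Amplitude = lambda u: np.sqrt(epsilon**2 + np.sum(u**2, axis=2))
J = lambda x: np.sum(Amplitude(grad(x)))
A = lambda x: np.sqrt(np.sum(grad(x)**2, axis=2))

g0 = np.random.rand(n, n)                 # stand-in for the clean image
y = g0 + sigma * np.random.randn(n, n)    # noisy observation
f = lambda x: 0.5 * np.sum((x - y)**2) + lam * J(x)
Gradf = lambda x: x - y - lam * div(grad(x) / Amplitude(grad(x))[:, :, None])

def H(z, a):
    # approximate Hessian: tilde Hf z = z + lam * G^* delta_a G z
    return z - lam * div(grad(z) / np.sqrt(epsilon**2 + a**2)[:, :, None])

def cg(Hop, b, x0, maxit=40, tol=1e-12):
    # plain conjugate gradient for the SPD operator Hop, warm-started at x0
    x = x0.copy()
    r = b - Hop(x)
    p, rs = r.copy(), np.sum(r * r)
    for _ in range(maxit):
        if rs < tol:
            break
        Hp = Hop(p)
        alpha = rs / np.sum(p * Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = np.sum(r * r)
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

x, d = y.copy(), np.zeros((n, n))
fvals = [f(x)]
for _ in range(8):
    a = A(x)
    d = cg(lambda z: H(z, a), Gradf(x), d)   # d = H_l^{-1} grad f(x^(l))
    x = x - d
    fvals.append(f(x))
```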
Display the result.
clf;
imageplot(clamp(x));
Exercise 4
Compare the Newton descent with gradient descent with a fixed step size, in terms of the decay of the energy.
exo4()
%% Insert your code here.
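For the comparison in Exercise 4, a NumPy sketch of fixed-step gradient descent on the same energy. The step size $\tau$ must stay below $2/L$, where $L \leq 1 + 8\lambda/\epsilon$ bounds the Lipschitz constant of $\nabla f$ (using $\norm{G}^2 \leq 8$ for the discrete gradient); this forces a very small step, which is why the decay is so much slower than Newton's.

```python
import numpy as np

np.random.seed(3)
n, lam, epsilon, sigma = 16, 0.06, 1e-3, 0.1

def grad(x):
    return np.stack([x - np.roll(x, 1, axis=0),
                     x - np.roll(x, 1, axis=1)], axis=2)

def div(v):
    return (np.roll(v[:, :, 0], -1, axis=0) - v[:, :, 0]
          + np.roll(v[:, :, 1], -1, axis=1) - v[:, :, 1])

Amplitude = lambda u: np.sqrt(epsilon**2 + np.sum(u**2, axis=2))
J = lambda x: np.sum(Amplitude(grad(x)))

y = np.random.rand(n, n) + sigma * np.random.randn(n, n)  # stand-in noisy image
f = lambda x: 0.5 * np.sum((x - y)**2) + lam * J(x)
Gradf = lambda x: x - y - lam * div(grad(x) / Amplitude(grad(x))[:, :, None])

L = 1 + 8 * lam / epsilon   # crude upper bound on the Lipschitz constant
tau = 1.8 / L               # fixed step size, safely below 2/L
x = y.copy()
fvals = [f(x)]
for _ in range(200):
    x = x - tau * Gradf(x)
    fvals.append(f(x))
```

With this step rule the energy decreases monotonically, but after 200 iterations it is still far from the value the quasi-Newton scheme reaches in a handful of steps.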
Exercise 5
The direct comparison between the gradient method and Newton's method is not fair in terms of iteration count. Indeed, an iteration of Newton requires several steps of conjugate gradient, which takes some time. Try to set up a fair comparison benchmark that takes into account the running time of the methods. Pay particular attention to the number of steps (or the tolerance criterion) that parameterizes |cgs|.
exo5()
%% Insert your code here.