Description: Introduction to automatic differentiation and the DualNumbers package. Prepared for MIT course 15.S60 "Software Tools for Operations Research", January 2015.

Author: Miles Lubin

# Computing derivatives for nonlinear optimization: Forward mode automatic differentiation

Consider a general constrained nonlinear optimization problem: \begin{align} \min \quad&f(x)\\ \text{s.t.} \quad& g(x) = 0, \\ & h(x) \leq 0. \end{align} where $f : \mathbb{R}^n \to \mathbb{R}, g : \mathbb{R}^n \to \mathbb{R}^r$, and $h: \mathbb{R}^n \to \mathbb{R}^s$.

When $f$ and $h$ are convex and $g$ is affine, we can hope for a globally optimal solution; otherwise we can typically only ask for a locally optimal solution.

What approaches can we use to solve this?

This is not meant to be an exhaustive discussion; see http://plato.asu.edu/sub/nlores.html#general and http://www.neos-guide.org/content/nonlinear-programming for more details on available methods and solvers.

## How are derivatives computed?

• Hand-written by applying chain rule
• Finite difference approximation $\frac{\partial f}{\partial x_i} \approx \frac{f(x+h e_i)-f(x)}{h}$ for small $h > 0$ (the limit as $h \to 0$ is the exact derivative)
• Automatic differentiation
• Idea: Automatically transform code to compute a function into code to compute its derivatives
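To see why finite differences can be unreliable in floating point, here is a quick stand-alone illustration (plain Python, separate from the Julia code below): the truncation error shrinks as $h$ decreases, but rounding/cancellation error eventually dominates.

```python
import math

# Finite-difference trade-off: truncation error ~ h, cancellation error ~ eps/h.
def fd(f, x, h):
    return (f(x + h) - f(x)) / h

exact = math.cos(1.0)  # the exact derivative of sin at 1.0
for h in (1e-2, 1e-8, 1e-14):
    # the error is typically smallest for an intermediate h, not the tiniest one
    print(h, abs(fd(math.sin, 1.0, h) - exact))
```

Automatic differentiation avoids this trade-off entirely, as we will see.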

## Dual Numbers

Consider numbers of the form $x + y\epsilon$ with $x,y \in \mathbb{R}$. We define $\epsilon^2 = 0$, so $$(x_1 + y_1\epsilon)(x_2+y_2\epsilon) = x_1x_2 + (x_1y_2 + x_2y_1)\epsilon.$$ These are called the dual numbers. Think of $\epsilon$ as an infinitesimal perturbation (you've probably seen hand-wavy algebra using $(dx)^2 = 0$ when computing integrals; this is the same idea).
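The multiplication rule is easy to implement in any language with operator overloading. Here is a minimal sketch in Python (a hypothetical `Dual` class for illustration only, not the DualNumbers.jl implementation used later):

```python
# Minimal dual-number arithmetic: epsilon^2 = 0 drives the product rule.
class Dual:
    def __init__(self, x, y=0.0):
        self.x = x  # real part
        self.y = y  # epsilon part
    def __add__(self, other):
        return Dual(self.x + other.x, self.y + other.y)
    def __mul__(self, other):
        # (x1 + y1*eps)(x2 + y2*eps) = x1*x2 + (x1*y2 + x2*y1)*eps
        return Dual(self.x * other.x, self.x * other.y + self.y * other.x)
    def __repr__(self):
        return f"{self.x} + {self.y}eps"

a = Dual(2.0, 2.0)
b = Dual(3.0, 4.0)
print(a * b)  # 6.0 + 14.0eps
```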

If we are given an infinitely differentiable function in Taylor expanded form $$f(x) = \sum_{k=0}^{\infty} \frac{f^{(k)}(a)}{k!} (x-a)^k$$ it follows that $$f(x+y\epsilon) = \sum_{k=0}^{\infty} \frac{f^{(k)}(a)}{k!} (x-a+y\epsilon)^k = \sum_{k=0}^{\infty} \frac{f^{(k)}(a)}{k!} (x-a)^k + y\epsilon\sum_{k=0}^{\infty} \frac{f^{(k)}(a)}{k!}\binom{k}{1} (x-a)^{k-1} = f(x) + yf'(x)\epsilon$$

Let's unpack what's going on here. We started with a function $f : \mathbb{R} \to \mathbb{R}$. Dual numbers are not real numbers, so it doesn't even make sense to ask for the value $f(x+y\epsilon)$ given $x+y\epsilon \in \mathbb{D}$ (the set of dual numbers). But anyway we plugged the dual number into the Taylor expansion, and by using the algebra rule $\epsilon^2 = 0$ we found that $f(x+y\epsilon)$ must be equal to $f(x) + yf'(x)\epsilon$ if we use the Taylor expansion as the definition of $f : \mathbb{D} \to \mathbb{R}$.

Alternatively, for any once differentiable function $f : \mathbb{R} \to \mathbb{R}$, we can define its extension to the dual numbers as $$f(x+y\epsilon) = f(x) + yf'(x)\epsilon.$$ This is essentially equivalent to the previous definition.

Let's verify a very basic property, the chain rule, using this definition.

Suppose $h(x) = f(g(x))$. Then, $$h(x+y\epsilon) = f(g(x+y\epsilon)) = f(g(x) + yg'(x)\epsilon) = f(g(x)) + yg'(x)f'(g(x))\epsilon = h(x) + yh'(x)\epsilon.$$
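We can check this numerically with a small sketch (again a hypothetical Python helper, not DualNumbers.jl): extending $\sin$ and multiplication to duals, $h(x) = \sin(x^2)$ evaluated at $x + \epsilon$ should produce $h(x) + h'(x)\epsilon$ with $h'(x) = 2x\cos(x^2)$.

```python
import math

# Dual numbers with just enough structure to compose sin with squaring.
class Dual:
    def __init__(self, x, y=0.0):
        self.x, self.y = x, y
    def __mul__(self, other):
        return Dual(self.x * other.x, self.x * other.y + self.y * other.x)

def dsin(d):
    # f(x + y*eps) = f(x) + y*f'(x)*eps with f = sin
    return Dual(math.sin(d.x), d.y * math.cos(d.x))

x = 1.5
d = Dual(x, 1.0)
h = dsin(d * d)                  # h(x + eps) for h(x) = sin(x^2)
exact = 2 * x * math.cos(x * x)  # analytic h'(x) via the chain rule
print(abs(h.y - exact))          # agreement to machine precision
```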

Maybe that's not too surprising, but it's actually quite a useful observation.

### Implementation

Dual numbers are implemented in the DualNumbers package in Julia.

In [1]:
using DualNumbers


You construct $x + y\epsilon$ with Dual(x,y). The real and epsilon components are accessed as realpart(d) and epsilon(d):

In [2]:
d = Dual(2.0,1.0)

Out[2]:
2.0 + 1.0ɛ
In [3]:
typeof(d)

Out[3]:
DualNumbers.Dual{Float64}
In [4]:
realpart(d)

Out[4]:
2.0
In [5]:
epsilon(d)

Out[5]:
1.0

How is addition of dual numbers defined?

In [ ]:
@which d+Dual(3.0,4.0)


Clicking on the link, we'll see:

+(z::Dual, w::Dual) = dual(real(z)+real(w), epsilon(z)+epsilon(w))


Multiplication?

In [6]:
Dual(2.0,2.0)*Dual(3.0,4.0)

Out[6]:
6.0 + 14.0ɛ
In [ ]:
@which Dual(2.0,2.0)*Dual(3.0,4.0)


The code is:

*(z::Dual, w::Dual) = dual(real(z)*real(w), epsilon(z)*real(w)+real(z)*epsilon(w))


Basic univariate functions?

In [7]:
log(Dual(2.0,1.0))

Out[7]:
0.6931471805599453 + 0.5ɛ
In [8]:
1/2.0

Out[8]:
0.5

How is this implemented?

In [9]:
@code_lowered log(Dual(2.0,1.0))

Out[9]:
LambdaInfo template for log(z::DualNumbers.Dual) at /home/mlubin/.julia/v0.5/DualNumbers/src/dual.jl:273
:(begin
nothing
x = (DualNumbers.value)(z) # line 274:
xp = (DualNumbers.epsilon)(z) # line 275:
return (DualNumbers.Dual)((DualNumbers.log)(x),xp * (1 / x))
end)

Trig functions?

In [10]:
@code_lowered sin(Dual(2.0,1.0))

Out[10]:
LambdaInfo template for sin(z::DualNumbers.Dual) at /home/mlubin/.julia/v0.5/DualNumbers/src/dual.jl:273
:(begin
nothing
x = (DualNumbers.value)(z) # line 274:
xp = (DualNumbers.epsilon)(z) # line 275:
return (DualNumbers.Dual)((DualNumbers.sin)(x),xp * (DualNumbers.cos)(x))
end)

## Computing derivatives of functions

We can define a function in Julia as:

In [11]:
f(x) = x^2 - log(x)
# Or equivalently
function f(x)
    return x^2 - log(x)
end

WARNING: Method definition f(Any) in module Main at In[11]:1 overwritten at In[11]:4.

Out[11]:
f (generic function with 1 method)
In [12]:
f(Dual(1/sqrt(2),1))

Out[12]:
0.8465735902799727 - 2.220446049250313e-16ɛ

[Exercise]: Differentiate it!

1. Evaluate $f$ at $1 + \epsilon$. What are $f(1)$ and $f'(1)$?
2. Evaluate $f$ at $\frac{1}{\sqrt{2}} + \epsilon$. What are $f(\frac{1}{\sqrt{2}})$ and $f'(\frac{1}{\sqrt{2}})$?
3. Define a new function fprime which returns the derivative of f by using DualNumbers.
4. Use the finite difference formula $$f'(x) \approx \frac{f(x+h)-f(x)}{h}$$ to evaluate $f'(\frac{1}{\sqrt{2}})$ approximately using a range of values of $h$. Visualize the approximation error using @manipulate, plots, or both!

### How general is it?

Recall Newton's iterative method for finding zeros: $$x \leftarrow x - \frac{f(x)}{f'(x)}$$ until $f(x) \approx 0$.

Let's use this method to compute $\sqrt{x}$ by solving $f(z) = 0$ where $f(z) = z^2-x$. So $f'(z) = 2z$, and we can implement the algorithm as:

In [13]:
function squareroot(x)
    z = x # Initial starting point
    while abs(z*z - x) > 1e-13
        z = z - (z*z-x)/(2z)
    end
    return z
end

Out[13]:
squareroot (generic function with 1 method)
In [14]:
squareroot(100)

Out[14]:
10.0

Can we differentiate this code? Yes!

In [15]:
d = squareroot(Dual(100.0,1.0))

Out[15]:
10.0 + 0.049999999999999996ɛ
In [16]:
epsilon(d) # Computed derivative

Out[16]:
0.049999999999999996
In [17]:
1/(2*sqrt(100)) # The exact derivative

Out[17]:
0.05
In [18]:
abs(epsilon(d)-1/(2*sqrt(100)))

Out[18]:
6.938893903907228e-18

### Multivariate functions?

Dual numbers can be used to compute the gradient of a function $f: \mathbb{R}^n \to \mathbb{R}$. This requires $n$ evaluations of $f$ with dual number input, essentially computing the partial derivative in each of the $n$ dimensions. We won't get into the details, but this procedure is implemented in Optim with the autodiff=true keyword.
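The idea behind the $n$ dual evaluations can be sketched as follows (hypothetical Python code, not Optim's actual implementation): seed coordinate $i$ with epsilon part 1 and zeros elsewhere, and the epsilon part of the result is the $i$-th partial derivative.

```python
# Dual numbers supporting +, -, *, integer powers, and mixing with floats.
class Dual:
    def __init__(self, x, y=0.0):
        self.x, self.y = x, y
    def _lift(v):
        return v if isinstance(v, Dual) else Dual(v)
    def __add__(self, o):
        o = Dual._lift(o); return Dual(self.x + o.x, self.y + o.y)
    __radd__ = __add__
    def __sub__(self, o):
        o = Dual._lift(o); return Dual(self.x - o.x, self.y - o.y)
    def __rsub__(self, o):
        return Dual._lift(o) - self
    def __mul__(self, o):
        o = Dual._lift(o); return Dual(self.x * o.x, self.x * o.y + self.y * o.x)
    __rmul__ = __mul__
    def __pow__(self, n):  # integer powers via repeated multiplication
        r = Dual(1.0)
        for _ in range(n):
            r = r * self
        return r

def gradient(f, x):
    # one dual evaluation of f per coordinate
    g = []
    for i in range(len(x)):
        seeded = [Dual(v, 1.0 if j == i else 0.0) for j, v in enumerate(x)]
        g.append(f(seeded).y)
    return g

rosenbrock = lambda x: (1.0 - x[0])**2 + 100.0 * (x[1] - x[0]**2)**2
print(gradient(rosenbrock, [0.0, 0.0]))  # [-2.0, 0.0]
```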

In [ ]:
using Optim
rosenbrock(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2
optimize(rosenbrock, [0.0, 0.0], method = :l_bfgs, autodiff = true)


When $n$ is large, there's an alternative procedure called reverse-mode automatic differentiation which requires only $O(1)$ evaluations of $f$ to compute its gradient. This is the method used internally by JuMP (implemented in ReverseDiffSparse).
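For intuition only, here is a toy reverse-mode sketch in Python (it shares nothing with ReverseDiffSparse beyond the idea): one forward evaluation records each operation's local partial derivatives, and a single backward sweep accumulates the whole gradient at once.

```python
# Toy reverse-mode AD: each Var remembers its parents and local partials.
class Var:
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # (parent Var, local partial derivative) pairs
        self.adjoint = 0.0
    def __add__(self, o):
        return Var(self.value + o.value, ((self, 1.0), (o, 1.0)))
    def __mul__(self, o):
        return Var(self.value * o.value, ((self, o.value), (o, self.value)))

def backward(out):
    # naive reverse sweep; fine for this tree-shaped example
    out.adjoint = 1.0
    stack = [out]
    while stack:
        v = stack.pop()
        for parent, partial in v.parents:
            parent.adjoint += partial * v.adjoint
            stack.append(parent)

x1, x2 = Var(3.0), Var(4.0)
y = x1 * x2 + x1 * x1   # y = x1*x2 + x1^2, so dy/dx1 = x2 + 2*x1, dy/dx2 = x1
backward(y)
print(x1.adjoint, x2.adjoint)  # 10.0 3.0
```

Note that a single backward sweep produced both partial derivatives; with forward mode, two separate dual evaluations would have been needed.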

## Conclusions

• We can compute numerically exact derivatives of any differentiable function that is implemented as a sequence of basic operations.
• In Julia it's very easy to use dual numbers for this!
• Reconsider when derivatives are "not available."

This was just an introduction to one technique from the area of automatic differentiation. For more references, see autodiff.org.