**Description**: Introduction to automatic differentiation and the `DualNumbers` package. Prepared for MIT course 15.S60 "Software Tools for Operations Research", January 2015.

**Author**: Miles Lubin

**License**:

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Consider a general constrained nonlinear optimization problem: $$ \begin{align} \min \quad&f(x)\\ \text{s.t.} \quad& g(x) = 0, \\ & h(x) \leq 0. \end{align} $$ where $f : \mathbb{R}^n \to \mathbb{R}, g : \mathbb{R}^n \to \mathbb{R}^r$, and $h: \mathbb{R}^n \to \mathbb{R}^s$.

When $f$ and $h$ are convex and $g$ is affine, we can hope for a globally optimal solution; otherwise, we can typically only ask for a locally optimal solution.

What approaches can we use to solve this?

- When $r=0$ and $s = 0$ (unconstrained) and $f$ is differentiable, the most classical approach is gradient descent, along with fancier methods like Newton's method and quasi-Newton methods such as BFGS.
- When $f$ is differentiable and $g$ and $h$ are linear, gradient projection methods
- When $f$, $g$, and $h$ are twice differentiable, interior-point methods and sequential quadratic programming
- When derivatives are "not available", derivative-free optimization

This is not meant to be an exhaustive list; see http://plato.asu.edu/sub/nlores.html#general and http://www.neos-guide.org/content/nonlinear-programming for more details.

How can we compute the derivatives these methods require?

- By hand, applying the chain rule
- Finite-difference approximation: $\frac{\partial f}{\partial x_i} \approx \frac{f(x+h e_i)-f(x)}{h}$ for small $h > 0$ (exact in the limit $h \to 0$)
- **Automatic differentiation**: automatically transform the code that computes a function into code that computes its derivatives

Consider numbers of the form $x + y\epsilon$ with $x,y \in \mathbb{R}$. We *define* $\epsilon^2 = 0$, so
$$
(x_1 + y_1\epsilon)(x_2+y_2\epsilon) = x_1x_2 + (x_1y_2 + x_2y_1)\epsilon.
$$
These are called the *dual numbers*. Think of $\epsilon$ as an infinitesimal perturbation (you've probably seen hand-wavy algebra using $(dx)^2 = 0$ when computing integrals - this is the same idea).
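These rules are straightforward to implement with operator overloading. Here is a minimal sketch of a dual-number type in (modern) Julia syntax, for illustration only; the name `MyDual` is ours, and the `DualNumbers` package used below provides the real implementation:

```julia
# Minimal dual number x + y*eps, with the defining rule eps^2 = 0
struct MyDual
    x::Float64  # real part
    y::Float64  # epsilon part
end

# Addition acts componentwise
Base.:+(a::MyDual, b::MyDual) = MyDual(a.x + b.x, a.y + b.y)

# Multiplication drops the eps^2 term:
# (x1 + y1*eps)(x2 + y2*eps) = x1*x2 + (x1*y2 + x2*y1)*eps
Base.:*(a::MyDual, b::MyDual) = MyDual(a.x * b.x, a.x * b.y + a.y * b.x)

d = MyDual(2.0, 2.0) * MyDual(3.0, 4.0)
# d.x == 6.0 and d.y == 14.0, matching x1*x2 and x1*y2 + x2*y1
```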

If we are given an infinitely differentiable function in Taylor expanded form $$ f(x) = \sum_{k=0}^{\infty} \frac{f^{(k)}(a)}{k!} (x-a)^k $$ it follows that $$ f(x+y\epsilon) = \sum_{k=0}^{\infty} \frac{f^{(k)}(a)}{k!} (x-a+y\epsilon)^k = \sum_{k=0}^{\infty} \frac{f^{(k)}(a)}{k!} (x-a)^k + y\epsilon\sum_{k=0}^{\infty} \frac{f^{(k)}(a)}{k!}\binom{k}{1} (x-a)^{k-1} = f(x) + yf'(x)\epsilon $$

Let's unpack what's going on here. We started with a function $f : \mathbb{R} \to \mathbb{R}$. Dual numbers are *not* real numbers, so a priori it doesn't even make sense to ask for the value $f(x+y\epsilon)$ given $x+y\epsilon \in \mathbb{D}$ (the set of dual numbers). Nevertheless, we plugged the dual number into the Taylor expansion, and by using the algebra rule $\epsilon^2 = 0$ (hence $\epsilon^k = 0$ for all $k \geq 2$) we found that $f(x+y\epsilon)$ must equal $f(x) + yf'(x)\epsilon$ if we take the Taylor expansion as the definition of $f : \mathbb{D} \to \mathbb{D}$.

Alternatively, for any once differentiable function $f : \mathbb{R} \to \mathbb{R}$, we can *define* its extension to the dual numbers as
$$
f(x+y\epsilon) = f(x) + yf'(x)\epsilon.
$$
This is essentially equivalent to the previous definition.

Let's verify a very basic property, the chain rule, using this definition.

Suppose $h(x) = f(g(x))$. Then, $$ h(x+y\epsilon) = f(g(x+y\epsilon)) = f(g(x) + yg'(x)\epsilon) = f(g(x)) + yg'(x)f'(g(x))\epsilon = h(x) + yh'(x)\epsilon. $$

Maybe that's not too surprising, but it's actually a quite useful observation.
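We can spot-check this calculation numerically by applying the extension rule $f(x+y\epsilon) = f(x) + yf'(x)\epsilon$ by hand, one function at a time, for a simple composition (a self-contained sketch; the particular functions are arbitrary choices):

```julia
# Numerically check the chain rule for h(x) = outer(inner(x)),
# applying f(x + y*eps) = f(x) + y*f'(x)*eps one function at a time.
inner(x) = x^2          # inner'(x) = 2x
outer(x) = sin(x)       # outer'(x) = cos(x)

x, y = 1.3, 1.0
gx, geps = inner(x), y * 2x          # inner applied to x + y*eps
hx, heps = sin(gx), geps * cos(gx)   # outer applied to gx + geps*eps

# The chain rule predicts h'(x) = inner'(x) * outer'(inner(x))
heps ≈ 2x * cos(x^2)  # true
```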

Dual numbers are implemented in the DualNumbers package in Julia.

In [1]:

```
using DualNumbers
```

You construct $x + y\epsilon$ with `Dual(x,y)`. The real and epsilon components are accessed with `realpart(d)` and `epsilon(d)`:

In [2]:

```
d = Dual(2.0,1.0)
```

Out[2]:

In [3]:

```
typeof(d)
```

Out[3]:

In [4]:

```
realpart(d)
```

Out[4]:

In [5]:

```
epsilon(d)
```

Out[5]:

How is addition of dual numbers defined?

In [ ]:

```
@which d+Dual(3.0,4.0)
```

Clicking on the link, we'll see:

```
+(z::Dual, w::Dual) = dual(real(z)+real(w), epsilon(z)+epsilon(w))
```

Multiplication?

In [6]:

```
Dual(2.0,2.0)*Dual(3.0,4.0)
```

Out[6]:

In [ ]:

```
@which Dual(2.0,2.0)*Dual(3.0,4.0)
```

The code is:

```
*(z::Dual, w::Dual) = dual(real(z)*real(w), epsilon(z)*real(w)+real(z)*epsilon(w))
```

Basic univariate functions?

In [7]:

```
log(Dual(2.0,1.0))
```

Out[7]:

In [8]:

```
1/2.0 # the exact derivative of log at 2, for comparison
```

Out[8]:

How is this implemented?

In [9]:

```
@code_lowered log(Dual(2.0,1.0))
```

Out[9]:

Trig functions?

In [10]:

```
@code_lowered sin(Dual(2.0,1.0))
```

Out[10]:

We can define a function in Julia as:

In [11]:

```
f(x) = x^2 - log(x)
# Or equivalently:
function f(x)
    return x^2 - log(x)
end
```

Out[11]:

In [12]:

```
f(Dual(1/sqrt(2),1))
```

Out[12]:

[Exercise]: Differentiate it!

- Evaluate $f$ at $1 + \epsilon$. What are $f(1)$ and $f'(1)$?
- Evaluate $f$ at $\frac{1}{\sqrt{2}} + \epsilon$. What are $f(\frac{1}{\sqrt{2}})$ and $f'(\frac{1}{\sqrt{2}})$?
- Define a new function `fprime` which returns the derivative of `f` by using `DualNumbers`.
- Use the finite-difference formula $$ f'(x) \approx \frac{f(x+h)-f(x)}{h} $$ to evaluate $f'(\frac{1}{\sqrt{2}})$ approximately for a range of values of $h$. Visualize the approximation error using `@manipulate`, plots, or both!
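As a starting point for the last part, here is one way to tabulate the finite-difference error against the exact derivative $f'(x) = 2x - 1/x$, computed by hand (a sketch, without `@manipulate` or plotting):

```julia
f(x) = x^2 - log(x)
fexact(x) = 2x - 1/x   # derivative by hand, for comparison

x = 1/sqrt(2)
for h in [1e-2, 1e-4, 1e-6, 1e-8, 1e-10, 1e-12]
    approx = (f(x + h) - f(x)) / h
    println("h = $h   error = $(abs(approx - fexact(x)))")
end
```

The error first shrinks as $h$ decreases and then grows again as floating-point cancellation takes over; the dual-number derivative has no such tradeoff.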

Recall Newton's iterative method for finding zeros: $$ x \leftarrow x - \frac{f(x)}{f'(x)} $$ until $f(x) \approx 0$.

Let's use this method to compute $\sqrt{x}$ by solving $f(z) = 0$ where $f(z) = z^2-x$. So $f'(z) = 2z$, and we can implement the algorithm as:

In [13]:

```
function squareroot(x)
    z = x # Initial starting point
    while abs(z*z - x) > 1e-13
        z = z - (z*z-x)/(2z)
    end
    return z
end
```

Out[13]:

In [14]:

```
squareroot(100)
```

Out[14]:

Can we differentiate this code? **Yes!**

In [15]:

```
d = squareroot(Dual(100.0,1.0))
```

Out[15]:

In [16]:

```
epsilon(d) # Computed derivative
```

Out[16]:

In [17]:

```
1/(2*sqrt(100)) # The exact derivative
```

Out[17]:

In [18]:

```
abs(epsilon(d)-1/(2*sqrt(100)))
```

Out[18]:

Dual numbers can be used to compute the gradient of a function $f: \mathbb{R}^n \to \mathbb{R}$. This requires $n$ evaluations of $f$ with dual number input, essentially computing the partial derivative in each of the $n$ dimensions. We won't get into the details, but this procedure is implemented in Optim with the `autodiff=true` keyword.
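The idea can be sketched directly: evaluate $f$ once per coordinate, each time seeding the epsilon component of that coordinate alone. (The helper `dualgradient` below is our own illustration, not part of `DualNumbers` or Optim.)

```julia
using DualNumbers

# Gradient of f at x via n dual evaluations: the i-th call perturbs
# only coordinate i, so epsilon(f(...)) is the i-th partial derivative.
function dualgradient(f, x::Vector{Float64})
    n = length(x)
    g = zeros(n)
    for i in 1:n
        xd = [Dual(x[j], i == j ? 1.0 : 0.0) for j in 1:n]
        g[i] = epsilon(f(xd))
    end
    return g
end

f(x) = x[1]^2 + 3x[1]*x[2]
dualgradient(f, [1.0, 2.0])  # gradient is [2x1 + 3x2, 3x1] = [8.0, 3.0]
```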

In [ ]:

```
using Optim
rosenbrock(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2
optimize(rosenbrock, [0.0, 0.0], method = :l_bfgs, autodiff = true)
```

When $n$ is large, there's an alternative procedure called reverse-mode automatic differentiation which requires only $O(1)$ evaluations of $f$ to compute its gradient. This is the method used internally by JuMP (implemented in ReverseDiffSparse).

- We can compute numerically exact derivatives of any differentiable function which is implemented by using a sequence of basic operations.
- In Julia it's very easy to use dual numbers for this!
- Reconsider when derivatives are "not available."

This was just an introduction to one technique from the area of automatic differentiation. For more references, see autodiff.org.