Description: Introduction to automatic differentiation and the DualNumbers package. Prepared for MIT course 15.S60 "Software Tools for Operations Research", January 2015.

Author: Miles Lubin

License: This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Computing derivatives for nonlinear optimization: Forward mode automatic differentiation

Consider a general constrained nonlinear optimization problem: $$ \begin{align} \min \quad&f(x)\\ \text{s.t.} \quad& g(x) = 0, \\ & h(x) \leq 0. \end{align} $$ where $f : \mathbb{R}^n \to \mathbb{R}, g : \mathbb{R}^n \to \mathbb{R}^r$, and $h: \mathbb{R}^n \to \mathbb{R}^s$.

When $f$ and $h$ are convex and $g$ is affine, we can hope for a globally optimal solution; otherwise, we can typically only ask for a locally optimal solution.

What approaches can we use to solve this?

This is not meant to be an exhaustive list; see http://plato.asu.edu/sub/nlores.html#general and http://www.neos-guide.org/content/nonlinear-programming for more details.

How are derivatives computed?

  • Hand-written by applying chain rule
  • Finite difference approximation $\frac{\partial f}{\partial x_i} \approx \frac{f(x+h e_i)-f(x)}{h}$ for small $h > 0$
  • Automatic differentiation
    • Idea: Automatically transform code to compute a function into code to compute its derivatives

Dual Numbers

Consider numbers of the form $x + y\epsilon$ with $x,y \in \mathbb{R}$. We define $\epsilon^2 = 0$, so $$ (x_1 + y_1\epsilon)(x_2+y_2\epsilon) = x_1x_2 + (x_1y_2 + x_2y_1)\epsilon. $$ These are called the dual numbers. Think of $\epsilon$ as an infinitesimal perturbation (you've probably seen hand-wavy algebra using $(dx)^2 = 0$ when computing integrals - this is the same idea).
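For example, $(2 + 3\epsilon)(1 + 4\epsilon) = 2 \cdot 1 + (2 \cdot 4 + 3 \cdot 1)\epsilon = 2 + 11\epsilon$, since the $12\epsilon^2$ term vanishes.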

If we are given an infinitely differentiable function in Taylor expanded form $$ f(x) = \sum_{k=0}^{\infty} \frac{f^{(k)}(a)}{k!} (x-a)^k $$ it follows that $$ f(x+y\epsilon) = \sum_{k=0}^{\infty} \frac{f^{(k)}(a)}{k!} (x-a+y\epsilon)^k = \sum_{k=0}^{\infty} \frac{f^{(k)}(a)}{k!} (x-a)^k + y\epsilon\sum_{k=0}^{\infty} \frac{f^{(k)}(a)}{k!}\binom{k}{1} (x-a)^{k-1} = f(x) + yf'(x)\epsilon $$

Let's unpack what's going on here. We started with a function $f : \mathbb{R} \to \mathbb{R}$. Dual numbers are not real numbers, so a priori it doesn't even make sense to ask for the value $f(x+y\epsilon)$ given $x+y\epsilon \in \mathbb{D}$ (the set of dual numbers). Nevertheless, we plugged the dual number into the Taylor expansion, and by using the algebra rule $\epsilon^2 = 0$ we found that $f(x+y\epsilon)$ must equal $f(x) + yf'(x)\epsilon$ if we take the Taylor expansion as the definition of $f : \mathbb{D} \to \mathbb{D}$.

Alternatively, for any once differentiable function $f : \mathbb{R} \to \mathbb{R}$, we can define its extension to the dual numbers as $$ f(x+y\epsilon) = f(x) + yf'(x)\epsilon. $$ This is essentially equivalent to the previous definition.
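For example, $\sin(x+y\epsilon) = \sin(x) + y\cos(x)\epsilon$ and $e^{x+y\epsilon} = e^x + ye^x\epsilon$.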

Let's verify a very basic property, the chain rule, using this definition.

Suppose $h(x) = f(g(x))$. Then, $$ h(x+y\epsilon) = f(g(x+y\epsilon)) = f(g(x) + yg'(x)\epsilon) = f(g(x)) + yg'(x)f'(g(x))\epsilon = h(x) + yh'(x)\epsilon. $$
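As a concrete instance, take $h(x) = \sin(x^2)$: $$ h(x+y\epsilon) = \sin((x+y\epsilon)^2) = \sin(x^2 + 2xy\epsilon) = \sin(x^2) + 2xy\cos(x^2)\epsilon, $$ and the $\epsilon$ coefficient is indeed $y\,h'(x)$.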

Maybe that's not too surprising, but it's quite a useful observation.

Implementation

Dual numbers are implemented in the DualNumbers package in Julia.

In [1]:
using DualNumbers

You construct $x + y\epsilon$ with Dual(x,y). The real and epsilon components are accessed with realpart(d) and epsilon(d):

In [2]:
d = Dual(2.0,1.0)
Out[2]:
2.0 + 1.0ɛ
In [3]:
typeof(d)
Out[3]:
DualNumbers.Dual{Float64}
In [4]:
realpart(d)
Out[4]:
2.0
In [5]:
epsilon(d)
Out[5]:
1.0

How is addition of dual numbers defined?

In [ ]:
@which d+Dual(3.0,4.0)

Clicking on the link, we'll see:

+(z::Dual, w::Dual) = dual(real(z)+real(w), epsilon(z)+epsilon(w))
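Addition is componentwise, matching the rule $(x_1 + y_1\epsilon) + (x_2 + y_2\epsilon) = (x_1+x_2) + (y_1+y_2)\epsilon$.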

Multiplication?

In [6]:
Dual(2.0,2.0)*Dual(3.0,4.0)
Out[6]:
6.0 + 14.0ɛ
In [ ]:
@which Dual(2.0,2.0)*Dual(3.0,4.0)

The code is:

*(z::Dual, w::Dual) = dual(real(z)*real(w), epsilon(z)*real(w)+real(z)*epsilon(w))
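This matches the product rule derived above; for Out[6], the real part is $2 \cdot 3 = 6$ and the $\epsilon$ part is $2 \cdot 3 + 2 \cdot 4 = 14$.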

Basic univariate functions?

In [7]:
log(Dual(2.0,1.0))
Out[7]:
0.6931471805599453 + 0.5ɛ
In [8]:
1/2.0 # the exact derivative of log at x = 2, matching the ɛ component above
Out[8]:
0.5

How is this implemented?

In [9]:
@code_lowered log(Dual(2.0,1.0))
Out[9]:
LambdaInfo template for log(z::DualNumbers.Dual) at /home/mlubin/.julia/v0.5/DualNumbers/src/dual.jl:273
:(begin 
        nothing
        x = (DualNumbers.value)(z) # line 274:
        xp = (DualNumbers.epsilon)(z) # line 275:
        return (DualNumbers.Dual)((DualNumbers.log)(x),xp * (1 / x))
    end)
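In words, $\log(x + x'\epsilon) = \log(x) + x'\cdot\frac{1}{x}\epsilon$: the $\epsilon$ component is multiplied by the derivative $1/x$.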

Trig functions?

In [10]:
@code_lowered sin(Dual(2.0,1.0))
Out[10]:
LambdaInfo template for sin(z::DualNumbers.Dual) at /home/mlubin/.julia/v0.5/DualNumbers/src/dual.jl:273
:(begin 
        nothing
        x = (DualNumbers.value)(z) # line 274:
        xp = (DualNumbers.epsilon)(z) # line 275:
        return (DualNumbers.Dual)((DualNumbers.sin)(x),xp * (DualNumbers.cos)(x))
    end)
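The same pattern lets you extend DualNumbers to a function it doesn't already know about. As a hypothetical sketch (myfunc and myfunc_deriv are placeholder names, not part of the package), you supply the derivative yourself:

myfunc(z::Dual) = Dual(myfunc(realpart(z)), epsilon(z) * myfunc_deriv(realpart(z)))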

Computing derivatives of functions

We can define a function in Julia as:

In [11]:
f(x) = x^2 - log(x)
# Or equivalently
function f(x)
    return x^2 - log(x)
end
WARNING: Method definition f(Any) in module Main at In[11]:1 overwritten at In[11]:4.
Out[11]:
f (generic function with 1 method)
In [12]:
f(Dual(1/sqrt(2),1))
Out[12]:
0.8465735902799727 - 2.220446049250313e-16ɛ
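Note that the ɛ component is essentially zero (up to roundoff): $f'(x) = 2x - 1/x$, and $x = \frac{1}{\sqrt{2}}$ is exactly where $f'(x) = 0$.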

[Exercise]: Differentiate it!

  1. Evaluate $f$ at $1 + \epsilon$. What are $f(1)$ and $f'(1)$?
  2. Evaluate $f$ at $\frac{1}{\sqrt{2}} + \epsilon$. What are $f(\frac{1}{\sqrt{2}})$ and $f'(\frac{1}{\sqrt{2}})$?
  3. Define a new function fprime which returns the derivative of f by using DualNumbers.
  4. Use the finite difference formula $$ f'(x) \approx \frac{f(x+h)-f(x)}{h} $$ to evaluate $f'(\frac{1}{\sqrt{2}})$ approximately using a range of values of $h$. Visualize the approximation error using @manipulate, plots, or both!

How general is it?

Recall Newton's iterative method for finding zeros: $$ x \leftarrow x - \frac{f(x)}{f'(x)} $$ until $f(x) \approx 0$.

Let's use this method to compute $\sqrt{x}$ by solving $f(z) = 0$ where $f(z) = z^2-x$. So $f'(z) = 2z$, and we can implement the algorithm as:

In [13]:
function squareroot(x)
    z = x # Initial starting point
    while abs(z*z - x) > 1e-13
        z = z - (z*z-x)/(2z)
    end
    return z
end
Out[13]:
squareroot (generic function with 1 method)
In [14]:
squareroot(100)
Out[14]:
10.0

Can we differentiate this code? Yes!

In [15]:
d = squareroot(Dual(100.0,1.0))
Out[15]:
10.0 + 0.049999999999999996ɛ
In [16]:
epsilon(d) # Computed derivative
Out[16]:
0.049999999999999996
In [17]:
1/(2*sqrt(100)) # The exact derivative
Out[17]:
0.05
In [18]:
abs(epsilon(d)-1/(2*sqrt(100)))
Out[18]:
6.938893903907228e-18
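The ɛ component was carried through every Newton update, so we obtain the derivative of the value the loop actually computes; since the loop converges to $\sqrt{x}$, the result agrees with $\frac{d}{dx}\sqrt{x} = \frac{1}{2\sqrt{x}}$ to roundoff.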

Multivariate functions?

Dual numbers can be used to compute the gradient of a function $f: \mathbb{R}^n \to \mathbb{R}$. This requires $n$ evaluations of $f$ with dual number input, essentially computing the partial derivative in each of the $n$ dimensions. We won't get into the details, but this procedure is implemented in Optim with the autodiff=true keyword.

In [ ]:
using Optim
rosenbrock(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2
optimize(rosenbrock, [0.0, 0.0], method = :l_bfgs, autodiff = true)
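To make the idea concrete, here is a minimal sketch of the $n$-evaluation procedure (this is not the code Optim uses internally; the name dual_gradient is ours):

function dual_gradient(f, x)
    n = length(x)
    g = zeros(n)
    for i in 1:n
        # perturb coordinate i by ɛ, leave the others unperturbed
        xd = [Dual(x[j], i == j ? 1.0 : 0.0) for j in 1:n]
        g[i] = epsilon(f(xd))   # the ɛ component is ∂f/∂x_i
    end
    return g
end

dual_gradient(rosenbrock, [0.0, 0.0])   # should return [-2.0, 0.0]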

When $n$ is large, there's an alternative procedure called reverse-mode automatic differentiation which requires only $O(1)$ evaluations of $f$ to compute its gradient. This is the method used internally by JuMP (implemented in ReverseDiffSparse).

Conclusions

  • We can compute numerically exact derivatives of any differentiable function that is implemented as a sequence of basic operations.
  • In Julia it's very easy to use dual numbers for this!
  • Reconsider when derivatives are "not available."

This was just an introduction to one technique from the area of automatic differentiation. For more references, see autodiff.org.