Version: Python 2.7
Most of the time when we are talking about errors in code, we refer to bugs. Here we will take bugs to be systematic unintended errors, often arising from our misapprehension of what we wrote or of what we think someone else wrote. However, even in well-written code, errors can arise as a result of how numbers are stored in a computer and how algorithms are carried out—this is what we mean when we say numerical error.
Numerical error can be extraordinarily difficult to diagnose—particularly if the equations you are solving are ill-characterized (like nonlinear equations)—but there are a few known techniques. First, let's take a digression into how numbers are represented on the computer.
You are familiar with binary representation: how $1101_b = 1\times 2^3 + 1\times 2^2 + 0\times 2^1 + 1\times 2^0 = 13_d$. The question soon arises of how to represent fractional parts, which is again straightforward with the inclusion of a “binary point” analogous to the decimal point: i.e., $1101.1001_b = 1\times 2^3 + 1\times 2^2 + 0\times 2^1 + 1\times 2^0 + 1\times 2^{-1} + 0\times 2^{-2} + 0\times 2^{-3} + 1\times 2^{-4} = 13.5625_d$.
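The expansion above can be mechanized in a few lines. This is a small sketch with a hypothetical helper name (`binary_to_decimal` is illustrative, not from the text) that evaluates a binary string with a binary point:

```python
def binary_to_decimal(bits):
    """Evaluate a binary string, optionally with a binary point."""
    whole, _, frac = bits.partition('.')
    value = int(whole, 2) if whole else 0
    for i, b in enumerate(frac, start=1):
        value += int(b) * 2.0 ** -i   # fractional digits carry weights 2^-1, 2^-2, ...
    return value

print(binary_to_decimal('1101'))        # 13
print(binary_to_decimal('1101.1001'))   # 13.5625
```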
This notation is functional but exhausts its utility when confronted with very large or very small numbers. Just as with decimal numbers, however, we can adopt a scientific notation which requires us to specify only a few parameters. For instance, instead of 45,230,000,000,000 we write $4.523 \times 10^{13}$—requiring us to specify two numbers (4.523 and 13) to gain an economy of expression (or, to be strict, three if we count the sign).
Floating-point mathematics allows the representation of numbers too large to store as integers, as well as numbers with fractional parts. Modern floating-point arithmetic typically represents numbers as 32-bit `float`s or 64-bit `double`s.
This is achieved by splitting the bits into a sign bit, a binary exponent, and a binary mantissa or significand. (Consult the figure below.) The sign bit indicates whether the number is positive (`0`) or negative (`1`).

from IPython.display import Image
Image(filename='floating-point.png', width=800, embed=True)
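The sign/exponent/mantissa fields can be inspected directly. This sketch (using only the standard `struct` module; the helper name is illustrative) unpacks a Python float, which is a 64-bit `double`, into its three fields:

```python
import struct

def double_fields(x):
    """Split a 64-bit double into its sign, biased exponent, and mantissa fields."""
    (bits,) = struct.unpack('>Q', struct.pack('>d', x))
    sign     = bits >> 63                 # 1 bit
    exponent = (bits >> 52) & 0x7FF       # 11 bits, biased by 1023
    mantissa = bits & ((1 << 52) - 1)     # 52 explicit bits of the significand
    return sign, exponent, mantissa

sign, exponent, mantissa = double_fields(1.1)
print(sign)                       # 0 (positive)
print(exponent - 1023)            # 0, i.e. 1.1 = significand * 2^0
print(format(mantissa, '052b'))   # the mantissa bits of 1.1
```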
Let's compare the result of $1.1-0.8$ and $0.3$. We expect these to be mathematically equal, but will the machine concur with us? (Python's built-in float uses the 64-bit `double` floating-point type.)

0.3
1.1-0.8
There's a difference, but why? Let's look at the binary floating-point representation of 1.1, 0.8, and 0.3 to see.
Decimal | Size | Sign Bit | Exponent | Mantissa |
---|---|---|---|---|
$1.1$ | single | 0 | 01111111 | 00011001100110011001101 |
 | | $1 \times$ | $2^{(127-127)} \times$ | $1.10000002384185791015625$ |
 | double | 0 | 01111111111 | 0001100110011001100110011001100110011001100110011010 |
 | | $1 \times$ | $2^{(1023-1023)} \times$ | $1.1000000000000000888178419700125232338905$ |
$0.8$ | single | 0 | 01111110 | 10011001100110011001101 |
 | | $1 \times$ | $2^{(126-127)} \times$ | $1.60000002384185791015625$ |
 | double | 0 | 01111111110 | 1001100110011001100110011001100110011001100110011010 |
 | | $1 \times$ | $2^{(1022-1023)} \times$ | $1.6000000000000000888178419700125232338905$ |
$1.1-0.8$ | single | 0 | 01111101 | 00110011001100110011010 |
$\,\, = 0.3$ | | $1 \times$ | $2^{(125-127)} \times$ | $1.2000000476837158203125$ |
 | double | 0 | 01111111101 | 0011001100110011001100110011001100110011001100110100 |
$\,\, = 0.30000000000000004$ | | $1 \times$ | $2^{(1021-1023)} \times$ | $1.2000000000000001776356839400250464677810$ |
$0.3$ | single | 0 | 01111101 | 00110011001100110011010 |
 | | $1 \times$ | $2^{(125-127)} \times$ | $1.2000000476837158203125$ |
 | double | 0 | 01111111101 | 0011001100110011001100110011001100110011001100110011 |
 | | $1 \times$ | $2^{(1021-1023)} \times$ | $1.1999999999999999555910790149937383830547$ |
Thus you should never compare floating-point values exactly—check to see that they are within a narrow range of each other (with a relative or absolute tolerance, as necessary).
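A minimal sketch of such a tolerance-based comparison follows (the helper name `near` is illustrative; Python 3.5+ ships `math.isclose` with the same semantics, though this document targets Python 2.7):

```python
def near(a, b, rel_tol=1e-9, abs_tol=0.0):
    """True when a and b agree to within a relative or absolute tolerance."""
    return abs(a - b) <= max(rel_tol * max(abs(a), abs(b)), abs_tol)

print(1.1 - 0.8 == 0.3)        # False: exact comparison fails
print(near(1.1 - 0.8, 0.3))    # True: tolerance comparison succeeds
```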
There are clearly limits to how precisely any decimal number can be represented in binary floating-point format (just as the decimal representation of many fractions requires infinitely repeating digits). Sometimes a number is too large (overflow) or too small (underflow) to fit; sometimes it simply requires too many digits of accuracy to fit into the machine representation.
Machine Parameter | Single Precision | Double Precision | Quadruple Precision |
---|---|---|---|
Size | (4 bytes; 32 bits) | (8 bytes; 64 bits) | (16 bytes; 128 bits) |
Relative machine precision, $\epsilon$ | $2^{-24} \sim 5.96 \cdot 10^{-8}$ | $2^{-53} \sim 1.11 \cdot 10^{-16}$ | $2^{-113} \sim 9.63 \cdot 10^{-35}$ |
Underflow threshold | $2^{-126} \sim 1.18 \cdot 10^{-38}$ | $2^{-1022} \sim 2.23 \cdot 10^{-308}$ | $2^{-16382} \sim 3.36 \cdot 10^{-4932}$ |
Overflow threshold | $2^{128}(1-\epsilon) \sim 3.40 \cdot 10^{38}$ | $2^{1024}(1-\epsilon) \sim 1.79 \cdot 10^{308}$ | $2^{16384}(1-\epsilon) \sim 1.19 \cdot 10^{4932}$ |
Let's take a look at another example related to underflow and overflow. From the table above, we can see that the maximum representable number in 64-bit precision is about $1.79 \cdot 10^{308}$ (in Python, sys.float_info.max). Let's double this value and see what happens—it correctly overflows to $+\infty$. But what about a smaller (but still nonnegligible) addition?
import sys
print sys.float_info
Some special values exist as well: $+\infty$ (Inf), $-\infty$ (-Inf), and NaN (Not a Number). These occur as the result of operations with undefined results (such as division by zero) and allow you to systematically handle errors and bugs as they arise.
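A short sketch of these special values in Python (note that plain `1.0 / 0.0` raises `ZeroDivisionError` in Python rather than returning Inf, so we construct the values directly):

```python
import math

inf = float('inf')
nan = float('nan')
print(inf - inf)               # nan: infinity minus infinity is undefined
print(nan == nan)              # False: NaN compares unequal even to itself
print(math.isnan(inf - inf))   # True: the robust way to test for NaN
```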
For more details, consult Finley, 2000 (from whom the example figure was borrowed) and IEEE-754 Analysis, a calculator which shows you the pieces of the representation for float, double, and quad128 types.
You can consider your numerical system either as an approximate solution of an exact equation (the usual way of thinking) or as the exact solution of an approximate equation. What I mean by the latter is this: the approximations introduced by your numerical method exactly solve a real equation which is similar to the equation you would like to solve. One way to identify errors, then, is to work out what additional physics the numerical method has introduced into your problem, and then look for the signature of those physics in your solution.
In other words, given a finite-difference approximation of a partial differential equation, find another PDE which is better approximated by the finite-difference scheme. Then use the FD method to learn about the behavior of the new PDE—the modified equation—and thus the behavior of the error terms.
First-Order Example. Consider the hyperbolic PDE $u_t + a u_x = 0$ with initial condition $u(x, 0) = u_0 (x)$ and boundary conditions $u(L, t) = u_L (t)$; $u(R, t) = u_R (t)$. The Lax–Friedrichs scheme, a forward-in-time, centered-in-space (FTCS) discretization with artificial viscosity, can be written as
$$\frac{U_{j}^{n+1} - U_{j}^{n}}{\Delta t} + a \frac{U_{j+1}^{n} - U_{j-1}^{n}}{2 \Delta x} - \frac{U_{j+1}^{n} - 2 U_{j}^{n} + U_{j-1}^{n}}{2 \Delta t} = 0.$$A Taylor series expansion of this scheme yields $$u_{t} + a u_{x} - \frac{1}{2 \Delta t} u_{xx} \Delta x^2 + \frac{1}{2} u_{tt} \Delta t + \frac{1}{6} a u_{xxx} \Delta x^2 - \frac{1}{24 \Delta t} u_{xxxx} \Delta x^4 + \,\, ... $$ $$\,\,\,\,\,\,\, = \left(u_{t} + a u_{x} \right) + \frac{1}{2} \left( u_{tt} \Delta t - u_{xx}\frac{\Delta x^2}{\Delta t} \right) + \,\, ... $$ $$\,\,\,\,\,\,\, = \left(u_{t} + a u_{x} \right) + \frac{1}{2} \left( a^2 \Delta t - \frac{\Delta x^2}{\Delta t} \right) u_{xx} + \,\, ... $$ where in the last step we used $u_{tt} = a^2 u_{xx}$, obtained by differentiating the original PDE.
From this last result, we arrive at the modified equation $$u_{t} + a u_{x} = \frac{\Delta x^2}{2 \Delta t} \left(1 - \left(\frac{a \Delta t}{\Delta x}\right)^2 \right) u_{xx} $$ which is an advection–diffusion equation, having added a diffusion term and an antidiffusive correction (by the central differencing). Thus if we see unexpected diffusive behavior, we can suspect that this is due to the numerical method used rather than another source of error. Depending on the size of $a$, we could even find that a smaller time step size relative to $\Delta x$ increases the diffusive error.
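This artificial diffusion is easy to observe in a numerical experiment. The sketch below (grid size, Courant number, and pulse shape are arbitrary illustrative choices; the domain is made periodic for simplicity rather than using the boundary conditions above) advects a square pulse with the Lax–Friedrichs scheme and watches the $u_{xx}$ error term smear it out:

```python
a = 1.0
nx = 200
dx = 1.0 / nx
dt = 0.4 * dx / a      # Courant number a*dt/dx = 0.4, within the stability limit
u = [1.0 if 0.4 < j * dx < 0.6 else 0.0 for j in range(nx)]   # square pulse

for _ in range(300):
    # Lax-Friedrichs update: neighbor average minus centered flux (periodic indexing)
    u = [0.5 * (u[(j + 1) % nx] + u[j - 1])
         - a * dt / (2.0 * dx) * (u[(j + 1) % nx] - u[j - 1])
         for j in range(nx)]

# The exact solution is the same pulse merely translated, with peak 1.0;
# numerically the peak has been visibly eroded by the diffusive error term.
print(max(u))
```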
Beyond general debugging and best practices in software development, there are a few techniques specific to scientific and numerical computing that you should bear in mind.
In the first place, it is good to use characteristic scales such that most of your expected values lie within the range $\left(10^{-3}, 10^{3}\right)$. This allows you to easily identify underflow and overflow errors, as well as gross oscillations and instabilities.
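To see why scale matters, here is a sketch of a silent underflow (the example and the logarithmic rescaling are illustrative, not from the text): a product of 200 modest factors of $10^{-3}$ falls below the double-precision underflow threshold, while working with logarithms keeps every intermediate value near unit scale.

```python
import math

factors = [1e-3] * 200

product = 1.0
for f in factors:
    product *= f          # intermediate values shrink toward 1e-600
print(product)            # 0.0: silent underflow below ~2.23e-308

log_product = sum(math.log(f) for f in factors)
print(log_product)        # about -1381.55, i.e. log(1e-600), comfortably in range
```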
Another way you can detect numerical error is to swap out the 64-bit `double`s in your code for 32-bit `float`s to see what difference, if any, it makes. (Note particularly that in our example the 32-bit `float` was not susceptible to the discrepancy exhibited by the vastly more precise `double`: in single precision, $1.1-0.8$ and $0.3$ round to the same bit pattern.)
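This swap can be sketched in pure Python by emulating single precision with the standard `struct` module (numpy's `float32` would serve the same purpose; the helper name `f32` is illustrative):

```python
import struct

def f32(x):
    """Round a Python float to the nearest 32-bit float value."""
    return struct.unpack('>f', struct.pack('>f', x))[0]

print(1.1 - 0.8 == 0.3)                       # False in double precision
print(f32(f32(1.1) - f32(0.8)) == f32(0.3))   # True in (emulated) single precision
```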
Maintaining precision is important in calculations with very large or very small intermediate values. In these cases, it may be appropriate to use either arbitrary-precision arithmetic (slow) or Kahan compensated summation. Kahan summation tracks the low-order bits lost at each addition in a separate compensation variable and folds them back into subsequent additions, maintaining higher precision than naive accumulation would.
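A minimal sketch of Kahan summation (the data set is an arbitrary illustrative choice; Python's standard library also offers `math.fsum` for correctly rounded summation):

```python
def kahan_sum(values):
    total = 0.0
    c = 0.0                      # running compensation for lost low-order bits
    for x in values:
        y = x - c                # fold the previous step's error into the next term
        t = total + y            # big + small: the low bits of y may be lost here...
        c = (t - total) - y      # ...so recover them algebraically
        total = t
    return total

data = [0.1] * 10000             # exact sum is 1000, but 0.1 is inexact in binary
print(abs(sum(data) - 1000.0))        # naive accumulation drifts away from 1000.0
print(abs(kahan_sum(data) - 1000.0))  # compensated sum stays far closer
```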
Neal Davis developed these materials for Computational Science and Engineering at the University of Illinois at Urbana–Champaign. This content is available under a Creative Commons Attribution 3.0 Unported License.