In Lesson 1 and Lesson 2 of this module we used the Jacobi method to iteratively solve for solutions to elliptic PDEs.
And it worked, so why are we still talking about it? Because it's slow. Very, very slow. It might not have seemed that way in the first two notebooks because our domains were quite tiny, but consider this: for a domain with $nx = ny = 128$, the Jacobi method will require more than 21000 iterations before the residual dips below $10^{-8}$.
Using one core of an Intel i7 1.9 GHz processor, that takes around 4.5 seconds. Now consider this: an incompressible Navier-Stokes solver has to ensure that the pressure field is divergence-free at every timestep. One of the most common ways to enforce that is to relax the pressure field using iterative methods!
In fact, the pressure equation is responsible for the majority of the computational expense of a Navier-Stokes solver. And with the current setup, each timestep could require 4.5 seconds of CPU time just to satisfy the pressure constraint! We'll grow old and die before we ever get cavity flow working!
There has to be a better way. And, of course, there is. There are several!
We'll be using the same example problem that we covered in Lesson 1, with boundary conditions
\begin{equation} \begin{gathered} p=0 \text{ at } x=0\\ \frac{\partial p}{\partial x} = 0 \text{ at } x = L\\ p = 0 \text{ at } y = 0 \\ p = \sin \left( \frac{3\pi x}{2L} \right) \text{ at } y = H \end{gathered} \end{equation}

import numpy
from matplotlib import pyplot
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
%matplotlib inline
Instead of copying and pasting the cells that we wrote in Lesson 1, we have again created a 'helper' file from which you can import some useful functions.
from laplace_helper import p_analytical, plot2D, L2_rel_error
Now we have the `p_analytical`, `plot2D`, and `L2_rel_error` functions in our namespace. If you can't remember how they work, just use `help()` and take advantage of the docstrings.
We are going to use larger grid dimensions in this notebook to better illustrate the speed increases available with different iterative methods.
nx = 128
ny = 128
L = 5.
H = 5.
x = numpy.linspace(0,L,nx)
y = numpy.linspace(0,H,ny)
dx = L/(nx-1)
dy = H/(ny-1)
p0 = numpy.zeros((ny, nx))
p0[-1,:] = numpy.sin(1.5*numpy.pi*x/x[-1])
First things first -- we said above that the Jacobi method takes more than 21000 iterations before it satisfies the target residual of $10^{-8}$, but let's verify that. We can also time how long those iterations take using the `time` module.
import time
def laplace2d(p, y, dx, dy, l2_target):
    '''Solves the Laplace equation using the Jacobi method
    with a 5-point stencil
    
    Parameters:
    ----------
    p: 2D array of float
        Initial potential distribution
    y: array of float
        Nodal coordinates in y
    dx: float
        Mesh size
    dy: float
        Mesh size
    l2_target: float
        Error target
    
    Returns:
    -------
    p: 2D array of float
        Potential distribution after relaxation
    '''
    l2norm = 1
    pn = numpy.empty_like(p)
    iterations = 0
    
    while l2norm > l2_target:
        pn = p.copy()
        p[1:-1,1:-1] = .25 * (pn[1:-1,2:] + pn[1:-1,:-2] +
                              pn[2:,1:-1] + pn[:-2,1:-1])
        
        ##Neumann B.C. along x = L
        p[1:-1,-1] = .25 * (2*p[1:-1,-2] + p[2:,-1] + p[:-2,-1])
        
        l2norm = numpy.sqrt(numpy.sum((p - pn)**2)/numpy.sum(pn**2))
        iterations += 1
    
    return p, iterations
t1 = time.time()
p = p0.copy()
eps = 1e-8
p, iterations = laplace2d(p, y, dx, dy, eps)
t2 = time.time()
print "Jacobi method took {:.4f} seconds over {} iterations at tolerance {}".\
format(t2-t1, iterations, eps)
Would we lie to you? 21268 iterations before the relative $L^2$-norm dips below $10^{-8}$. That's a large number of iterations, but is the result even accurate? Let's compare it to the analytical solution and check.
pan = p_analytical(x,y)
L2_rel_error(p,pan)
Ok, that's a pretty small error. Let's focus on speeding up the process.
You will recall from Lesson 1 that a single Jacobi iteration is written as:
\begin{equation} p^{n+1}_{i,j} = \frac{1}{4} \left(p^{n}_{i,j-1} + p^n_{i,j+1} + p^{n}_{i-1,j} + p^n_{i+1,j} \right) \end{equation}

The Gauss-Seidel method is a simple tweak to this idea: use updated values as soon as they are available.
If you imagine that we sweep through the array row by row, in order of increasing $i$ and $j$, then by the time we reach the point $(i,j)$ the values $p^{n+1}_{i-1,j}$ and $p^{n+1}_{i,j-1}$ have already been updated and can be used to calculate $p^{n+1}_{i,j}$, so the iteration formula now reads:
\begin{equation} p^{n+1}_{i,j} = \frac{1}{4} \left(p^{n+1}_{i,j-1} + p^n_{i,j+1} + p^{n+1}_{i-1,j} + p^n_{i+1,j} \right) \end{equation}

But there's a problem: you can't use NumPy's array operations to evaluate that. Since Gauss-Seidel requires using values immediately after they're updated, we have to abandon our beloved array operations and return to nested `for` loops.
That's not ideal, but if it saves us a bunch of time, then we can manage.
def laplace2d_gauss_seidel(p, y, dx, dy, nx, ny, eps):
    iterations = 0
    error = 2*eps
    
    while error > eps:
        pn = p.copy()
        error = 0.0
        for j in range(1, ny-1):
            for i in range(1, nx-1):
                p[j,i] = .25 * (p[j,i-1] + p[j,i+1] + p[j-1,i] + p[j+1,i])
                error += (p[j,i] - pn[j,i])**2
        
        #Neumann 2nd-order BC
        for j in range(1, ny-1):
            p[j,-1] = .25 * (2*p[j,-2] + p[j+1,-1] + p[j-1,-1])
        
        error = numpy.sqrt(error/numpy.sum(pn**2))
        iterations += 1
    
    return p, iterations
And then we would run this via:
p, iterations = laplace2d_gauss_seidel(p, y, dx, dy, nx, ny, 1e-8)
The Gauss-Seidel method required several thousand fewer iterations than the traditional Jacobi method for this example, but it took more than 7 minutes to run.
And we were complaining about 4.5 seconds!?
If you think back to the far-off days when you first learned about array operations, you might recall that NumPy array operations can drastically improve code performance compared with nested `for` loops. NumPy operations are largely written in compiled C code and they are much faster than vanilla Python. But the Jacobi method converges slowly, and while 4.5 seconds is much better than 7 minutes, it's still too slow. What can we do?
Enter `weave.blitz`

`weave.blitz` takes an existing NumPy expression, converts it into C++ code, and compiles it using the C++ Blitz++ library. Then it wraps the result back up as a Python function, ready for you to use.
But didn't we just explain that array operations don't work with Gauss-Seidel because we can't use immediately updated values? Yes, we did, and that is true.
Except that when `weave.blitz` converts the code, it doesn't use any temporary arrays to store the previous solution. What does that mean for us? It means that it will perform the array operation, but in the underlying C++ code it will use the most recently updated values as it sweeps through the array.
In other words, it will implement Gauss-Seidel for us!
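To see concretely why plain NumPy slicing gives Jacobi-style behavior while an element-by-element loop gives Gauss-Seidel-style behavior, here is a tiny 1D sketch (the values are chosen arbitrarily for illustration). NumPy evaluates the entire right-hand side into a temporary before writing anything back, whereas the explicit loop reads values already updated during the current sweep:

```python
import numpy

# Start from the same 1D guess for both updates.
a = numpy.array([1.0, 0.0, 0.0, 0.0, 1.0])

# NumPy slice assignment: the right-hand side is evaluated in full
# before any element is written, so this is a Jacobi-style update.
jac = a.copy()
jac[1:-1] = 0.5 * (jac[:-2] + jac[2:])

# Explicit loop: each point sees the neighbors already updated on this
# sweep -- exactly the Gauss-Seidel behavior.
gs = a.copy()
for i in range(1, len(gs) - 1):
    gs[i] = 0.5 * (gs[i - 1] + gs[i + 1])

print(jac)  # [1.    0.5   0.    0.5   1.  ]
print(gs)   # [1.    0.5   0.25  0.625 1.  ]
```

The two sweeps produce different arrays, which is precisely the difference `weave.blitz` exploits for us.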
Let's get to it. First, we need to import the `weave` library from SciPy.
from scipy import weave
We only need to make a few changes to our original Jacobi code, and they will seem a little strange: make the array operation that calculates $p^{n+1}_{i,j}$ into a string, then send that string to `weave.blitz` along with the keyword argument `check_size=0`.

The `check_size` keyword performs a few sanity checks when it's set equal to 1, but we can skip those in exchange for an extra speedup. If you have bugs in your code, you can set it equal to 1 to help track them down.
def laplace2d_gauss_seidel_blitz(p, y, dx, dy, eps):
    iterations = 0
    error = 2*eps
    
    while error > eps:
        pn = p.copy()
        
        expr = "p[1:-1,1:-1] = "\
               ".25 * (p[1:-1,:-2] + p[1:-1,2:] + p[:-2,1:-1] + p[2:,1:-1])"
        weave.blitz(expr, check_size=0)
        
        #Neumann 2nd-order BC
        p[1:-1,-1] = .25 * (2*p[1:-1,-2] + p[2:,-1] + p[:-2,-1])
        
        error = numpy.sqrt(numpy.sum((p - pn)**2)/numpy.sum(pn**2))
        iterations += 1
    
    return p, iterations
Note: You might see some unusual output the first time you run `weave.blitz` in a new function. Instead of recompiling the code every time a function runs, weave caches the compiled version to improve performance.
t1 = time.time()
p = p0.copy()
eps = 1e-8
p, iterations = laplace2d_gauss_seidel_blitz(p, y, dx, dy, eps)
t2 = time.time()
print "Blitzed GS method took {:.4f} seconds over {} iterations at tolerance {}".\
format(t2-t1, iterations, eps)
Nice! We're down to around 3 seconds now, with about 7000 fewer iterations! Not bad for two extra lines of code. Don't forget to make sure that the L2 error still looks reasonable!
L2_rel_error(p,pan)
This is definite progress. Think we can do any better?
Successive over-relaxation is an extension of the Gauss-Seidel method. We take the existing Gauss-Seidel method and use a linear combination of the previous and the current solution to accelerate convergence.
\begin{equation} p^{n+1}_{i,j} = (1 - \omega)p^n_{i,j} + \frac{\omega}{4} \left(p^{n+1}_{i,j-1} + p^n_{i,j+1} + p^{n+1}_{i-1,j} + p^n_{i+1,j} \right) \end{equation}

SOR iterations are only stable for $0 < \omega < 2$.
If $\omega < 1$, that is technically an "under-relaxation" and it will converge more slowly than Gauss-Seidel.
If $\omega > 1$, that's over-relaxation and it should converge faster than Gauss-Seidel.
Also note that for $\omega = 1$, the equation collapses into the Gauss-Seidel method.
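To build some intuition for how $\omega$ affects convergence before we tackle the 2D problem, here is a minimal pure-Python SOR sketch on a small 1D Laplace problem, $u'' = 0$ with fixed endpoints (the function name, grid size, and tolerance are just illustrative choices):

```python
import numpy

def sor_iterations(n, omega, tol=1e-6, max_it=20000):
    """Count SOR sweeps needed to solve u'' = 0 on n nodes
    with u[0] = 0 and u[-1] = 1 (the exact solution is a line)."""
    u = numpy.zeros(n)
    u[-1] = 1.0
    for it in range(1, max_it + 1):
        diff = 0.0
        for i in range(1, n - 1):
            # Gauss-Seidel value, blended with the old value via omega
            new = (1 - omega) * u[i] + 0.5 * omega * (u[i - 1] + u[i + 1])
            diff = max(diff, abs(new - u[i]))
            u[i] = new
        if diff < tol:
            return it
    return max_it

# Iteration counts for under-relaxation, Gauss-Seidel, and over-relaxation
for omega in (0.8, 1.0, 1.5, 1.9):
    print(omega, sor_iterations(51, omega))
```

With $\omega = 1$ this is exactly Gauss-Seidel; values above 1 need noticeably fewer sweeps, while values at or beyond 2 never converge.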
Everything we did above using `weave.blitz` still applies here -- the only changes we need are to implement the array operation for $p^{n+1}_{i,j}$ and to add $\omega$ to the function inputs.
def laplace2d_SOR_blitz(p, y, dx, dy, eps, omega):
    iterations = 0
    error = 2*eps
    
    while error > eps:
        pn = p.copy()
        
        expr = "p[1:-1,1:-1] = (1 - omega) * p[1:-1,1:-1] + "\
               ".25 * omega * (p[1:-1,2:] + p[1:-1,:-2] + p[2:,1:-1] + p[:-2,1:-1])"
        weave.blitz(expr, check_size=0)
        
        #Neumann 2nd-order BC
        p[1:-1,-1] = .25 * (2*p[1:-1,-2] + p[2:,-1] + p[:-2,-1])
        
        error = numpy.sqrt(numpy.sum((p - pn)**2)/numpy.sum(pn**2))
        iterations += 1
    
    return p, iterations
That wasn't too bad at all. Let's try this out first with $\omega = 1$ and make sure it matches the Gauss-Seidel results from above.
p = p0.copy()
t1 = time.time()
eps = 1e-8
omega = 1
p, iterations = laplace2d_SOR_blitz(p, y, dx, dy, eps, omega)
t2 = time.time()
print "Blitzed SOR method took {:.4f} seconds and {} iterations\
at tolerance {} with omega = {}".format(t2 - t1, iterations, eps, omega)
The execution time is going to vary depending on what else the computer is up to at the moment, but we have the exact same number of iterations as Gauss-Seidel. That's a good sign that things are working as expected.
Now let's try to over-relax the solution and see what happens. To start, let's try $\omega = 1.5$.
p = p0.copy()
t1 = time.time()
eps = 1e-8
omega = 1.5
p, iterations = laplace2d_SOR_blitz(p, y, dx, dy, eps, omega)
t2 = time.time()
print "Blitzed SOR method took {:.4f} seconds and {} iterations\
at tolerance {} with omega = {}".format(t2 - t1, iterations, eps, omega)
Wow! That really did the trick! Under 1.5 seconds! And we dropped from 15014 iterations down to 7406. Now we're really cooking!
We picked $\omega=1.5$ somewhat arbitrarily. Ideally, we would like to over-relax the solution as much as possible without introducing instability, as that will give us the fewest number of iterations.
For square domains, it turns out that the ideal factor $\omega$ can be computed as a function of the number of nodes in one direction, e.g. `nx`:

\begin{equation} \omega \approx \frac{2}{1 + \frac{\pi}{nx}} \end{equation}

This is not an arbitrary formula, but its derivation lies outside the scope of this course. For now, let's try it out and see how it works.
p = p0.copy()
t1 = time.time()
eps = 1e-8
omega = 2./(1 + numpy.pi/nx)
p, iterations = laplace2d_SOR_blitz(p, y, dx, dy, eps, omega)
t2 = time.time()
print "Blitzed SOR method took {:.4f} seconds and {} iterations\
at tolerance {} with omega = {:.4f}".format(t2 - t1, iterations, eps, omega)
Wow! That's very fast. Also, $\omega$ is very close to the upper limit of 2. SOR tends to work fastest when $\omega$ approaches 2, but don't be tempted to push it. Set $\omega = 2$ and the walls will come crumbling down.
How does the L2 error look with only 1324 iterations?
L2_rel_error(p,pan)
Looking very good, indeed.
We didn't explain it in any detail, but notice the very interesting implication of the formula for the ideal relaxation factor: it depends on the grid size.
The Scheduled Relaxation Jacobi method is the newest addition to the realm of iterative solver methods by a long shot. In fact, it was published just last year, in June of 2014 in the Journal of Computational Physics. There's also a preprint of the paper here that isn't behind a paywall.
SRJ works in much the same way as the SOR method, but where SOR used a single tuned value of $\omega$, SRJ uses a schedule of over- and under-relaxations.
Each SRJ scheme is characterized by the number of levels, $P$: the number of distinct over- or under-relaxation parameters $\omega_i$ used. Each $\omega_i$ is repeated $q_i$ times within one SRJ cycle, which consists of $M$ total iterations, where $$M = \sum_{i=1}^{P} q_i$$
In theory, the order sequencing of various $\omega$s doesn't matter, but in practice, roundoff error and arithmetic overflow cause issues. In order to prevent instabilities, the relaxation parameters within a given SRJ cycle are (more or less) evenly distributed.
The larger the number of relaxation factors, $P$, the better the performance of the SRJ method. Yang and Mittal provide precalculated relaxation factors up to $P=5$ which is what we will use.
For grid dimensions of 128 by 128, Yang and Mittal provide
\begin{align} \Omega &= \{4522.0, 580.86, 50.729, 4.5018, 0.67161\}\\ Q &= \{1, 3, 16, 73, 250 \} \end{align}

The relaxation factor 4522.0 will be applied once, 580.86 will be applied three times... you get the idea.
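As a quick sanity check on those numbers, we can expand the factors into a full cycle and confirm the cycle length $M$. (This naive in-order expansion is only illustrative; as noted below, the real schedules interleave the factors to keep the iteration stable.)

```python
import numpy

# P = 5 relaxation factors and their repetition counts for nx = ny = 128
omegas = [4522.0, 580.86, 50.729, 4.5018, 0.67161]
counts = [1, 3, 16, 73, 250]

# The cycle length M is the sum of the repetition counts
M = sum(counts)
print(M)  # 343

# Naive expansion: each omega_i repeated q_i times, in order
cycle = numpy.repeat(omegas, counts)
print(len(cycle), cycle[0], cycle[-1])  # 343 4522.0 0.67161
```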
We used a Python adapted version of code provided by Yang and Mittal to generate a variety of relaxation schedules (you can check out the code here).
Let's begin by loading the schedule file.
schedule = numpy.load('./data/relaxation_schedules.npz')
We can view the available schedules by asking for a list of files, e.g.
schedule.files.sort()
print schedule.files
Note that each schedule listed is a NumPy array. The number following `nx` is the grid dimension and the number following `p` is the quantity of relaxation factors.
The dimensions of the grid are $nx = ny = 128$ and we want to use $P = 5$.
nx128p5 = schedule['nx128p5']
The schedule is 343 elements long, so we don't need to print the whole array out, but we can take a look at the first 20 elements or so to see how the various relaxation factors are distributed.
print nx128p5[:21]
The eagle-eyed among you may have noticed that in the SOR code above, we used the 2nd-order Neumann boundary condition that we initially derived for the traditional Jacobi method.
That is, despite the fact that the SOR scheme is discretized according to

\begin{equation} p^{n+1}_{i,j} = (1 - \omega)p^n_{i,j} + \frac{\omega}{4} \left(p^{n+1}_{i,j-1} + p^n_{i,j+1} + p^{n+1}_{i-1,j} + p^n_{i+1,j} \right) \end{equation}

we used a Neumann condition based on the discretization

\begin{equation} p^{n+1}_{i,j} = \frac{1}{4} \left(p^{n}_{i,j-1} + p^n_{i,j+1} + p^{n}_{i-1,j} + p^n_{i+1,j} \right) \end{equation}

and with an L2 relative error of $7.78613 \times 10^{-5}$, it doesn't seem to have caused any trouble.
However, SRJ uses very large relaxation factors. Remember, we said that $\omega = 2$ was unstable for SOR (and it is), but SRJ gets away with a relaxation factor of $\omega = 4522$ (albeit for only one iteration).
The result is that we do need a more tailor-made Neumann condition for SRJ if we want to hang on to 2nd-order accuracy (and we do!).
The SRJ discretization is given by

\begin{equation} p^{n+1}_{i,j} = (1 - \omega)p^n_{i,j} + \frac{\omega}{4} \left(p^{n}_{i,j-1} + p^n_{i,j+1} + p^{n}_{i-1,j} + p^n_{i+1,j} \right) \end{equation}

Applying the same procedure as in Lesson 1 to this equation, we get the SRJ-friendly 2nd-order Neumann boundary condition:

\begin{equation} p^{n+1}_{0,j} = (1 - \omega)p^n_{0,j} + \frac{\omega}{4} \left(2p^{n}_{1,j} + p^{n}_{0,j-1} + p^n_{0,j+1}\right) \end{equation}

It's probably a good idea for you to get out some pencil and paper and verify that.
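If you'd rather let the computer do some of the checking, recall that the derivation boils down to substituting a mirrored ghost node ($p_{-1,j} = p_{1,j}$, which enforces $\partial p/\partial x = 0$) into the interior SRJ update. A quick numerical sketch with random values (the variable names here are just placeholders for the stencil neighbors) confirms the two expressions agree:

```python
import random

random.seed(0)
for _ in range(1000):
    omega = random.uniform(0.1, 4522.0)   # SRJ factors can be very large
    p0j, p1j, pjm, pjp = (random.random() for _ in range(4))

    # Interior SRJ update with the ghost node p_{-1,j} replaced by p_{1,j}
    interior = (1 - omega)*p0j + omega/4 * (p1j + p1j + pjm + pjp)

    # Boundary formula quoted in the text
    boundary = (1 - omega)*p0j + omega/4 * (2*p1j + pjm + pjp)

    assert abs(interior - boundary) <= 1e-12 * max(1.0, abs(interior))

print("ghost-node substitution matches the boundary formula")
```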
Ok! We're finally here! The SRJ function should be similar to what we've already done, except for a few key differences: it takes the relaxation schedule as an extra input, loops through each $\omega$ in the schedule in turn, and applies the SRJ-specific Neumann boundary condition derived above.
def laplace2d_SRJ_numpy(p, y, dx, dy, l2_target, w_schedule):
    '''Solves the Laplace equation using the scheduled
    relaxation Jacobi (SRJ) method
    
    Parameters:
    ----------
    p: 2D array of float
        Initial potential distribution
    y: array of float
        Nodal coordinates in y
    dx: float
        Mesh size
    dy: float
        Mesh size
    l2_target: float
        Error target
    w_schedule: array of float
        Relaxation schedule
    
    Returns:
    -------
    p: 2D array of float
        Potential distribution after relaxation
    '''
    l2norm = 1
    pn = numpy.empty_like(p)
    iteration_count = 0
    
    while l2norm > l2_target:
        for omega in w_schedule:
            pn = p.copy()
            p[1:-1,1:-1] = (1 - omega) * pn[1:-1,1:-1] + .25 * omega *\
                (pn[1:-1,2:] + pn[1:-1,:-2] + pn[2:,1:-1] + pn[:-2,1:-1])
            
            ##SRJ-friendly Neumann 2nd-order BC along x = L
            p[1:-1,-1] = (1 - omega) * pn[1:-1,-1] + .25 * omega *\
                (2*pn[1:-1,-2] + pn[2:,-1] + pn[:-2,-1])
            
            l2norm = numpy.sqrt(numpy.sum((p - pn)**2)/numpy.sum(pn**2))
            iteration_count += 1
            if l2norm < l2_target:
                break
    
    return p, iteration_count
Ok, let's try it out!
p = p0.copy()
t1 = time.time()
eps = 1e-8
p, iterations = laplace2d_SRJ_numpy(p, y, dx, dy, eps, nx128p5)
t2 = time.time()
print "SRJ method took {:.4f} seconds and {} iterations\
at tolerance {}".format(t2 - t1, iterations, eps)
That's definitely faster than Jacobi and Gauss-Seidel, but it's slower than SOR. What gives?
It's true that SOR is faster than SRJ... on one processor. That last detail is what makes SRJ so compelling.
Remember, we had to do all sorts of code acrobatics with `weave` to get Gauss-Seidel and SOR to run quickly because they use immediately updated values, but SRJ doesn't. SRJ is just a hair slower than SOR, and we implemented it in straight NumPy.
We haven't talked much about parallelization in this course, but it's at the heart of most computational science.
Without getting into a detailed explanation, suffice it to say that SOR is very difficult to parallelize while SRJ is relatively trivial. On one core, SOR is almost twice as fast as SRJ, but if we have four cores, then SRJ rules the roost.
If we have 8 cores, then SOR starts looking a little embarrassed.
And if we have a cluster with hundreds of cores? You get the idea.
Yang, Xiyang IA, and Rajat Mittal. "Acceleration of the Jacobi iterative method by factors exceeding 100 using scheduled relaxation." Journal of Computational Physics 274 (2014): 695-708.
Thomas, James William. Numerical partial differential equations: conservation laws and elliptic equations. Vol. 3. Berlin: Springer, 1999.
http://www.physics.buffalo.edu/phy410-505/2011/topic3/app1/index.html
from IPython.core.display import HTML
css_file = '../../styles/numericalmoocstyle.css'
HTML(open(css_file, "r").read())