For some search problems, the path to reach the goal is not
important. Discovering a goal state or a state with the best value of
some objective function is the important part. Such problems are
often called **optimization problems**.

What are some examples?

What is the best setting for a control on your stove to cook your soup as quickly as possible without burning it?

What is the best way to pack the words of a word cloud, or a wordle? See an answer here.

What is the best way to divide up the computations of a matrix calculation to be distributed across multiple cores?

**Local search** algorithms operate by maintaining a single state (or
set of pretty good states) and generating another state to try based
on this limited memory. A set of actions is available to transform a
state into a "neighboring" state, meaning a state connected to it by a
single action in an abstract graph of states.

Imagine all of the possible states for a problem laid out along a
line, along an x-axis. The value of the objective function can be
plotted as a height above the points on the x-axis. A local search
algorithm tries to find the state with the best height, where best
could be the maximal or minimal value depending on the search
problem. The best value over all states is the **global optimum** (a
minimum or a maximum depending on the problem).

A local search algorithm is **complete** if it always finds a goal
state. It is **optimal** if it always finds the optimum state, the
one with the global maximum or minimum.

Hill-climbing search is our first example of a local search algorithm. Imagine you are climbing a mountain in a very thick fog. You can only see a distance equal to one step length. To climb, you take the step in the steepest direction, the one that reaches the highest of all the locations you can currently see.

In other words, hill-climbing search simply evaluates the objective function for all states that are neighbors of the current state, and takes the neighbor state with the best objective function value as the new current state. If more than one neighbor ties for the best value, one of them is picked randomly.

Hill-climbing search is sometimes called **greedy search**, because a
step is taken after only considering the immediate neighbors. No time
is spent considering possible future states.
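The loop described above can be sketched in a few lines. This is a minimal illustration for a maximizing search, not a definitive implementation; the names `hill_climb`, `toy_value`, and `toy_neighbors` are hypothetical and the toy problem is an assumption chosen only to make the example runnable.

```python
import random

def hill_climb(start, neighbors, value, max_steps=1000):
    """Steepest-ascent hill climbing for a maximizing search."""
    current = start
    for _ in range(max_steps):
        candidates = neighbors(current)
        if not candidates:
            return current
        best_value = max(value(s) for s in candidates)
        if best_value <= value(current):
            return current  # no neighbor is better: local optimum or plateau
        # Break ties among equally good neighbors at random.
        best = [s for s in candidates if value(s) == best_value]
        current = random.choice(best)
    return current

# Toy problem (assumed): integer states 0..20, objective peaks at x = 13.
def toy_value(x):
    return -(x - 13) ** 2

def toy_neighbors(x):
    return [n for n in (x - 1, x + 1) if 0 <= n <= 20]

print(hill_climb(0, toy_neighbors, toy_value))
```

Because this toy objective has a single peak, the search reaches it from any start; on an objective with several peaks the same code would stop at whichever local optimum it climbs first.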

Hill-climbing is easy to formulate and implement and often finds pretty good states quickly. But, it has the following problems:

- it gets stuck on local optima (hills for maximizing searches, valleys for minimizing searches),
- it may get stuck on a ridge, if no single action can advance the search along the ridge,
- it may get stuck wandering on a plateau for which all neighboring states have equal value.

Common variations include

- allow sideways moves (when on a plateau)
- stochastic hill-climbing: choose next state with probability related to increase in value of objective function
- first-choice hill-climbing: generate neighbors by random choice of available actions and keep first state that has better value,
- random-restart hill climbing: conduct multiple hill-climbing searches from multiple, randomly generated, initial states.

Only this last one, with random-restarts, is **complete**. In the limit, all states will be tried as starting states so the goal, or best state, will eventually be found.

Place eight queens on a chess board so that no queen is attacking another. Each of the 8 columns holds exactly one queen, and that queen can be placed in any of the 8 rows, for a total of $8^8 \approx 17$ million states. Actions are moving a single queen to a different row in the same column. The objective function is the number of pairs of queens attacking each other. This function is to be minimized, of course. See this animation and this nice explanation.
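As a rough sketch of this formulation, a state can be stored as a tuple giving the row of the queen in each column. The function names below are hypothetical, and for simplicity ties between equally good neighbors are broken by taking the first one found rather than at random. The restart loop is included so that the example always ends with a solution.

```python
import random

def attacking_pairs(rows):
    """Number of attacking queen pairs; rows[c] is the row of the queen in column c."""
    n = len(rows)
    count = 0
    for c1 in range(n):
        for c2 in range(c1 + 1, n):
            same_row = rows[c1] == rows[c2]
            same_diagonal = abs(rows[c1] - rows[c2]) == c2 - c1
            if same_row or same_diagonal:
                count += 1
    return count

def neighbors(rows):
    """All states reached by moving one queen to a different row in its column."""
    result = []
    for c in range(len(rows)):
        for r in range(len(rows)):
            if r != rows[c]:
                result.append(rows[:c] + (r,) + rows[c + 1:])
    return result

def descend(rows):
    """Basic hill climbing (minimizing); returns the final state and step count."""
    steps = 0
    while True:
        best = min(neighbors(rows), key=attacking_pairs)
        if attacking_pairs(best) >= attacking_pairs(rows):
            return rows, steps  # stuck, or solved if the count is zero
        rows, steps = best, steps + 1

def solve_with_restarts(n=8, seed=0):
    """Restart from random states until a state with no attacks is found."""
    rng = random.Random(seed)
    restarts = 0
    while True:
        start = tuple(rng.randrange(n) for _ in range(n))
        rows, _ = descend(start)
        if attacking_pairs(rows) == 0:
            return rows, restarts
        restarts += 1

solution, restarts = solve_with_restarts()
print(solution, "found after", restarts, "restarts")
```

Each state has 8 × 7 = 56 neighbors, so every step of `descend` evaluates the objective 56 times.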

Hill-climbing search as described only finds solutions 14% of the time, but solves those instances quickly, in an average of 4 steps. When it gets stuck this is discovered in an average of 3 steps.

Hill-climbing Variation | Percent Solved | Steps to Find Solution | Steps to Know Failure |
---|---|---|---|
basic | 14% | 4 | 3 |
with sideways moves | 94% | 21 | 64 |
with restarts | 100% | 22 | |
with sideways moves and restarts | 100% | 28 | |

Hard problems for hill-climbing are those with many local optima. NP-hard problems often have an exponential number of local optima, but states with pretty good value can often be found with a small number of restarts.

Hill-climbing searches will get stuck on local optima. Only by adding random restarts can you have a hill-climbing algorithm that is complete.

To get off of a local optimum, a search must be defined to allow steps that are "downhill" for maximizing searches, and "uphill" for minimizing searches, away from the optimum.

**Simulated annealing**
is an algorithm that does this probabilistically. Assume we are
doing a maximizing search, meaning we want to find the state with the
maximum value. Let the value of
the current state be $v$. Imagine an action has been applied to that
state and the resulting state has a lower (worse) value $v'$. Simulated
annealing will accept this new state as the current state with
probability $e^{(v' - v)/T}$. $T$ acts like a "temperature": the higher
its value, the more likely we are to take a step to a state with a
worse value. In practice, $T$ starts at a high value and is slowly
decreased towards zero. If it is decreased "slowly enough", the
global optimum will be found with probability approaching 1. In other words, this is a
**complete** algorithm.

```
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

def probOfAcceptance(dE, T):
    # Probability of accepting a change in value dE at temperature T, capped at 1
    r = np.exp(dE / T)
    r[r > 1] = 1.0
    return r

dE = np.linspace(-5, 5, num=100)
plt.figure(figsize=(10, 7))
plt.clf()
legendText = []
# One curve per temperature: higher T accepts worse states more readily
for T in [0.1, 1, 10, 100]:
    plt.plot(dE, probOfAcceptance(dE, T), 'o-')
    legendText.append('Temperature = {:.1f}'.format(T))
plt.xlabel('New Value - Current Value')
plt.ylabel('Probability of Accepting New State')
plt.legend(legendText, loc='lower right');
```
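The acceptance rule can be placed inside a search loop roughly as follows. This is a sketch under stated assumptions: the geometric cooling schedule (`T *= cooling`), the parameter defaults, and the toy objective `bumpy` are all illustrative choices, and a schedule this fast does not carry the "slowly enough" guarantee. The function also remembers the best state seen, which is a convenience rather than part of the basic algorithm.

```python
import math
import random

def simulated_annealing(start, random_neighbor, value,
                        t_start=10.0, t_end=1e-3, cooling=0.995, seed=0):
    """Maximizing simulated annealing with an assumed geometric cooling schedule."""
    rng = random.Random(seed)
    current, best = start, start
    T = t_start
    while T > t_end:
        candidate = random_neighbor(current, rng)
        delta = value(candidate) - value(current)  # v' - v
        # Improvements are always accepted; worse states with probability e^{(v' - v)/T}.
        if delta >= 0 or rng.random() < math.exp(delta / T):
            current = candidate
        if value(current) > value(best):
            best = current
        T *= cooling
    return best

# Toy objective (assumed): one broad peak near x = 20 with many small bumps.
def bumpy(x):
    return -abs(x - 20) + 3 * math.cos(2 * x)

def step(x, rng):
    # Move one unit left or right, staying inside 0..40.
    return min(40, max(0, x + rng.choice((-1, 1))))

best = simulated_annealing(0, step, bumpy)
print(best, bumpy(best))
```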

The above searches keep just one state in memory. **Local beam
search** keeps the best $k$ states in memory. The successors of all
$k$ states are generated, and the best $k$ among them are kept.

A variant is **stochastic beam search** that selects the $k$ states to
keep as a probabilistic function of their values. This search tends
to maintain more diversity in the set of kept $k$ states on each
iteration.
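Both variants can be sketched with one function. The names and the toy problem are hypothetical. Two choices here are assumptions rather than part of the description above: the current states are kept in the candidate pool so the best state found is never lost, and the stochastic variant weights each state by the exponential of its value and samples with replacement.

```python
import heapq
import math
import random

def stochastic_select(pool, value, k, rng):
    """Sample k states with probability proportional to exp(value)."""
    pool = list(pool)
    top = max(value(s) for s in pool)
    weights = [math.exp(value(s) - top) for s in pool]  # shifted to avoid overflow
    return rng.choices(pool, weights=weights, k=k)

def local_beam_search(starts, neighbors, value, iterations=50, rng=None):
    """Keep k states; pass an rng to get the stochastic variant."""
    beam = list(starts)
    k = len(beam)
    for _ in range(iterations):
        pool = set(beam)  # assumption: current states compete with their successors
        for s in beam:
            pool.update(neighbors(s))
        if rng is None:
            beam = heapq.nlargest(k, pool, key=value)
        else:
            beam = stochastic_select(pool, value, k, rng)
    return max(beam, key=value)

# Toy problem (assumed): integer states 0..40, objective peaks at x = 13.
def peak_value(x):
    return -(x - 13) ** 2

def line_neighbors(x):
    return [n for n in (x - 1, x + 1) if 0 <= n <= 40]

print(local_beam_search([0, 5, 30], line_neighbors, peak_value))
```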

**Genetic algorithms** are a lot like stochastic beam search. A set
of $k$ states are kept. The difference is in how successor states are
generated.

Genetic algorithms generate successor states by combining parts of
good states to make new ones (the **crossover** operator) and by
randomly modifying parts (the **mutation** operator). Then the values
of the new ones are used to select the best to keep for the next
iteration, or generation.

Many variations exist on the representation of states as strings of symbols, on the operators, and on the ways of selecting winners for the next generation.
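One such variation is sketched below, with states as lists of bits. Everything specific here is an assumption made for illustration: single-point crossover, parents drawn from the better half of the population, a per-bit mutation rate, and keeping the best $k$ of parents plus children. The function name and defaults are hypothetical.

```python
import random

def genetic_algorithm(fitness, length=20, k=30, generations=60,
                      mutation_rate=0.02, seed=0):
    """Evolve bit strings of the given length, maximizing fitness."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(k)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[:k // 2]  # assumption: breed only from the better half
        children = []
        for _ in range(k):
            mom, dad = rng.sample(parents, 2)
            cut = rng.randrange(1, length)  # single-point crossover
            child = mom[:cut] + dad[cut:]
            # Mutation: flip each bit independently with small probability.
            child = [1 - bit if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(child)
        # Keep the best k of old and new states for the next generation.
        population = sorted(population + children, key=fitness, reverse=True)[:k]
    return population[0]

# Toy fitness (assumed): the number of 1 bits in the string.
best = genetic_algorithm(sum)
print(best, sum(best))
```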

A monotonic function of a single variable can be optimized using binary search.

Here is a useful application of binary search.

How can we compare the efficiency of different search algorithms, or of the A* algorithm using different heuristic functions? We can use the computation time and the maximum amount of memory used during the search.

It would also be nice to know how "focused" a search is. If we use a perfect heuristic function, we should get a search that won't explore any nodes that do not lie on an optimal search path.

A measure related to this is the "effective branching factor". The branching factor of a tree is the number of children at each node. If this is not the same for all nodes, then we can find an average, or "effective" branching factor.

If we do a breadth-first (or depth-first) search on a tree with one root node, three nodes at depth 1, and six nodes at depth 2, we will have explored 1 + 3 + 6 = 10 nodes. What is the effective branching factor for this tree? Or, stated mathematically, what is the value of $b$ in

$$ 1 + b + b^2 = 10$$

Now we can solve this for $b$. What should the value be, roughly? Solve it exactly. What do you get?

We can solve this one exactly, but what if the search is 10 levels deep? We will have to do a search (yay!) to find the value of $b$. How? We could pick a whole bunch of values randomly...

How about a binary search? To start a binary search we need to pick a low and high value of $b$ that we know bracket the true value. For a search that explored $n$ nodes to a maximum depth of $d$, what would good low and high values be? Now do a binary search between these two values and for each new guess at $b$, calculate

$$ 1 + b + b^2 + \cdots + b^d$$

and compare the result with the actual number of nodes, $n$. Use the comparison to continue the binary search. Continue until the range of possible values of $b$ is within a desired precision, such as $0.01$. A faster way to calculate the above quantity is

$$ \frac{1-b^{d+1}}{1-b}$$

How would you derive this?

To estimate the effective branching factor, you could define a
function `ebf(numberNodes, depth, precision=0.01)`. Then, to use it:

```
In [3]: ebf(100, 12, 0.01)
Out[3]: 1.3034343719482422
In [4]: ebf(0, 0)
Out[4]: 0
In [5]: ebf(1, 1,)
Out[5]: 0.0078125
```
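One possible shape for such a function is sketched here. It is an illustration under assumptions, not the implementation that produced the session above: the bracket uses $0$ as the low value and $n^{1/d}$ as the high value (valid because $b^d \le n$), and the search stops when the bracket on $b$ is narrower than `precision`. A different bracket or stopping rule would return slightly different digits.

```python
def ebf(numberNodes, depth, precision=0.01):
    """Estimate b such that 1 + b + ... + b**depth == numberNodes, by bisection."""
    if numberNodes == 0 or depth == 0:
        return 0.0  # no meaningful branching factor in these cases

    def nodes(b):
        # Direct sum; avoids dividing by zero at b == 1 in the closed form.
        return sum(b ** i for i in range(depth + 1))

    # Assumed bracket: b**depth <= numberNodes, so b <= numberNodes**(1/depth).
    low, high = 0.0, float(numberNodes) ** (1.0 / depth)
    while high - low > precision:
        mid = (low + high) / 2
        if nodes(mid) < numberNodes:
            low = mid
        else:
            high = mid
    return (low + high) / 2

print(ebf(10, 2, 1e-6))
```

For the small tree with 10 nodes and depth 2, the result can be checked against the exact root of $1 + b + b^2 = 10$.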

The Nelder-Mead method is not in our text. See the Scholarpedia entry for Nelder-Mead. The method maintains a set of $k$ good states. To generate one next state to evaluate, the worst of the $k$ states is reflected through the centroid, or mean, of the $k$ states. If the new state is better than the worst of the $k$ states, it replaces the worst state.

Also see the Python documentation for the `fmin` function in `scipy.optimize`.
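The reflection step can be illustrated with a small pure-Python sketch for a minimizing search. It is deliberately not the full Nelder-Mead method: it omits the expansion and contraction steps, takes the centroid over the states other than the worst (as the usual formulation does), and, as an assumption added so the search cannot stall, halves every state toward the best one whenever a reflection fails. The function name and the test function are hypothetical.

```python
def reflect_search(f, simplex, iterations=200):
    """Minimize f by reflecting the worst state, shrinking when reflection fails."""
    simplex = [list(p) for p in simplex]
    for _ in range(iterations):
        simplex.sort(key=f)  # best state first, worst state last
        best, worst = simplex[0], simplex[-1]
        others = simplex[:-1]
        dim = len(worst)
        centroid = [sum(p[i] for p in others) / len(others) for i in range(dim)]
        reflected = [2 * centroid[i] - worst[i] for i in range(dim)]
        if f(reflected) < f(worst):
            simplex[-1] = reflected  # the new state replaces the worst one
        else:
            # Assumed fallback: pull every state halfway toward the best state.
            simplex = [[(p[i] + best[i]) / 2 for i in range(dim)] for p in simplex]
    return min(simplex, key=f)

# Toy objective (assumed): a bowl with its minimum at (1, -2).
def bowl(p):
    return (p[0] - 1) ** 2 + (p[1] + 2) ** 2

print(reflect_search(bowl, [(0, 0), (1, 0), (0, 1)]))
```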

CMA-ES stands for Covariance Matrix Adaptation Evolution Strategy. A Gaussian distribution is fit to the set of good states and a new state is generated by sampling probabilistically from the Gaussian distribution. The Gaussian distribution is updated when a new good state is found. Recent results show that certain quasi-random methods are computationally more efficient.

When the derivative of the value function with respect to each component of the state can be calculated, then many derivative-based searches are available. If you know the slope at the current state, you know which direction to go. If you also know the second derivative, you can make some intelligent guesses at how far to go.

If the state has multiple components, as is almost always the case, the first derivatives form a gradient vector and the second derivatives form a Hessian matrix.

Let's find the minimum of the function

$$f(x) = 2 x^4 + 3 x^3 + 3$$

Its first derivative is

$$\frac{df(x)}{dx} = 8 x^3 + 9 x^2$$

and its second derivative is

$$\frac{d^2f(x)}{dx^2} = 24 x^2 + 18 x$$

```
import time
import random
import numpy as np
import matplotlib.pyplot as plt
import IPython.display as ipd  # for display and clear_output

def f(x):
    return 2 * x**4 + 3 * x**3 + 3

def df(x):
    return 8 * x**3 + 9 * x**2

def ddf(x):
    return 24 * x**2 + 18 * x

def taylorf(x, dx):
    # Second-order Taylor approximation of f around x, evaluated at x + dx
    return f(x) + df(x) * dx + 0.5 * ddf(x) * dx**2

xs = np.linspace(-2, 1, num=100)
dxs = np.linspace(-0.5, 0.5, num=100)

fig = plt.figure(figsize=(18, 8))

for rep in range(5):
    x = random.uniform(-2, 1)  # first guess at minimum
    for step in range(10):
        time.sleep(1)  # pause one second between frames
        plt.clf()
        plt.plot(xs, f(xs))
        plt.grid(True)
        plt.plot(x + dxs, taylorf(x, dxs), 'g-', linewidth=5, alpha=0.4)
        plt.plot(x, f(x), 'ro')
        y0, y1 = plt.ylim()
        plt.plot([x, x], [y0, y1], 'r--')
        if step == 0:
            plt.text(x, 10, 'New first x', color='r', fontsize=40)
        plt.text(x + 0.05, (y0 + y1) * 0.5, str(x), color='r', fontsize=20)
        # Newton step: jump to the stationary point of the Taylor approximation
        x = x - df(x) / float(ddf(x))
        plt.plot(x, f(x), 'go')
        plt.legend(('$f(x)$', r'$\hat{f}(x)$'))
        ipd.clear_output(wait=True)
        ipd.display(fig)
ipd.clear_output(wait=True)
```
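Stripped of the plotting, the update used in the animation is Newton's method applied to the derivative. The sketch below repeats the two derivative functions so that it runs on its own; the function name `newton_minimize`, the step limit, and the tolerance are assumptions for illustration.

```python
def df(x):
    return 8 * x**3 + 9 * x**2

def ddf(x):
    return 24 * x**2 + 18 * x

def newton_minimize(x, steps=20, tol=1e-12):
    """Repeat x <- x - f'(x) / f''(x) until the step is tiny."""
    for _ in range(steps):
        curvature = ddf(x)
        if curvature == 0:
            break  # the update is undefined where the second derivative is zero
        step = df(x) / curvature
        x -= step
        if abs(step) < tol:
            break
    return x

print(newton_minimize(-1.5))
```

Starting from $x = -1.5$ the iterates approach $x = -9/8$, where $8x^3 + 9x^2 = x^2(8x + 9) = 0$ and the second derivative is positive. The update only looks for a point where the first derivative is zero, so starts near $x = 0$ can instead settle at the inflection point there, which is not a minimum.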

4.1 What kind of search do we get from

- Local beam search with $k=1$?
- Local beam search with one initial state and no limit on the number of states retained?
- Simulated annealing with $T=0$ at all times?
- Simulated annealing with $T=\infty$?
- Genetic algorithm with population size $k=1$?
- Gradient-ascent in a discrete state and action space?
- Nelder-Mead with $k=2$ in a two-dimensional state space?

For many real-world problems, an agent cannot predict the outcome of applying an action in the current state. It must take the action and observe the new state.

An agent can end up in a dead-end from which it cannot recover. We
will ignore this, and assume that a search problem is **safely
explorable**---some goal state is reachable from every reachable
state.

An on-line search agent cannot explore the successors of an arbitrary node, only the successors of the node it currently occupies. Depth-first search already works this way, so the authors adapt off-line depth-first search into an on-line search algorithm. The result of each action is stored in a map that associates each state-action pair with the resulting state. The algorithm assumes that each action can be undone, or reversed, in order to backtrack.

An on-line version of A* is learning real-time A*, or LRTA*.
Instead of maintaining memory of states as in the above depth-first
algorithm, LRTA* only maintains a memory of the current best estimate
of cost to reach a goal for each node. It updates this estimate after
each action is tried. It assumes it can identify all successors for a
state and compare their remaining cost estimates. An interesting
aspect is that all untried actions are assumed to result in a state
whose heuristic function value is correct, a feature called **optimism
under uncertainty**.
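A simplified sketch of this idea is shown below. It is not the textbook's agent program: it assumes the successors of the current state and the step costs can be queried directly, it stores learned estimates in a dictionary `H`, and it falls back to the heuristic `h` for states that have no learned estimate yet. The function name and the six-state example are hypothetical.

```python
def lrta_star(start, goal, neighbors, cost, h, max_steps=1000):
    """Simplified LRTA*: act greedily on cost estimates and update them in place."""
    H = {}  # learned estimates of remaining cost, filled in as states are visited
    path = [start]
    s = start
    while s != goal and len(path) <= max_steps:
        def through(n):
            # Estimated cost to the goal via neighbor n; unvisited states use h.
            return cost(s, n) + H.get(n, h(n))
        nxt = min(neighbors(s), key=through)
        H[s] = through(nxt)  # update the estimate for the state being left
        s = nxt
        path.append(s)
    return path, H

# Toy problem (assumed): states 0..5 on a line, goal at 5, unit step costs,
# and a heuristic that misleadingly makes state 1 look attractive from state 2.
h_table = [2, 1, 1, 2, 1, 0]

def line_neighbors(s):
    return [n for n in (s - 1, s + 1) if 0 <= n <= 5]

path, H = lrta_star(2, 5, line_neighbors, lambda a, b: 1, lambda s: h_table[s])
print(path)
print(H)
```

In this example the agent first heads away from the goal, raises the estimates of the states it passes through, and then turns around once those estimates exceed the ones in the other direction.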

Learning better estimates of remaining cost based on single steps is
at the heart of **reinforcement learning**, covered in Chapter 21.