For this assignment, you will use reinforcement learning to solve the Towers of Hanoi puzzle with three pegs and five disks.

To accomplish this, modify the code discussed in lecture for learning to play Tic-Tac-Toe so that it learns to solve the five-disk, three-peg Towers of Hanoi puzzle. In some ways, this will be simpler than the Tic-Tac-Toe code.

Steps required to do this include the following:

- Represent the state and the move, and combine them into a tuple to use as a key in the Q dictionary.
- Make sure only valid moves are tried from each state.
- Assign reinforcement of $1$ to each move, even for the move that results in the goal state.

Make a plot of the number of steps required to reach the goal for each trial. Each trial starts from the same initial state. Decay epsilon as in the Tic-Tac-Toe code.
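The epsilon decay mentioned above can be pictured as a geometric schedule. The following is only a minimal sketch, assuming epsilon starts at 1.0 and is multiplied by `epsilon_decay_factor` at the start of each repetition; the helper name `epsilon_schedule` and the starting value are assumptions, not taken from the lecture code.

```python
def epsilon_schedule(n_repetitions, epsilon_decay_factor, initial_epsilon=1.0):
    """Return the epsilon used in each repetition (hypothetical helper).

    Assumes epsilon is multiplied by the decay factor at the start of
    every repetition, so it shrinks geometrically toward zero.
    """
    epsilons = []
    epsilon = initial_epsilon
    for _ in range(n_repetitions):
        epsilon *= epsilon_decay_factor  # decay before the repetition runs
        epsilons.append(epsilon)
    return epsilons


schedule = epsilon_schedule(5, 0.7)
print([round(e, 4) for e in schedule])
```

A factor closer to 1 keeps exploration going for more repetitions, while a smaller factor switches to greedy behavior sooner.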

First, how should we represent the state of this puzzle? We need to keep track of which disks are on which pegs. Name the disks 1, 2, 3, 4, and 5, with 1 being the smallest disk and 5 being the largest. The set of disks on a peg can be represented as a list of integers. Then the state can be a list of three lists.

For example, the starting state with all disks on the left peg would be `[[1, 2, 3, 4, 5], [], []]`. After moving disk 1 to peg 2, we have `[[2, 3, 4, 5], [1], []]`.

To represent the move we just made, we can use a list of two peg numbers, like `[1, 2]`, representing a move of the top disk on peg 1 to peg 2.
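To make the representation concrete, here is one possible way to apply a move to a list-of-lists state. It is a sketch under the conventions of the examples above (pegs numbered from 1, the first element of each list is the top disk); the name `make_move_sketch` is hypothetical and it does not check that the move is valid.

```python
import copy


def make_move_sketch(state, move):
    """Return a new state with the top disk of the source peg moved.

    Hypothetical helper: pegs are numbered from 1, and the first element
    of each peg's list is the top (smallest) disk on that peg.
    """
    source, destination = move
    new_state = copy.deepcopy(state)             # leave the caller's state untouched
    disk = new_state[source - 1].pop(0)          # remove the top disk
    new_state[destination - 1].insert(0, disk)   # place it on top of the target peg
    return new_state


start = [[1, 2, 3, 4, 5], [], []]
after = make_move_sketch(start, [1, 2])
print(after)   # [[2, 3, 4, 5], [1], []]
print(start)   # unchanged: [[1, 2, 3, 4, 5], [], []]
```

Copying the state before changing it matters because the training loop typically still needs the old state to update its Q value.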

Now on to some functions. Define at least the following functions. Examples showing required output appear below.

- `print_state(state)`: prints the state in the form shown below.
- `get_valid_moves(state)`: returns list of moves that are valid from `state`.
- `make_move(state, move)`: returns new (copy of) state after move has been applied.
- `train_Q(n_repetitions, learning_rate, epsilon_decay_factor, get_valid_moves, make_move)`: train the Q function for number of repetitions, decaying epsilon at start of each repetition. Returns Q and list or array of number of steps to reach goal for each repetition.
- `test_Q(Q, max_steps, get_valid_moves, make_move)`: without updating Q, use Q to find greedy action each step until goal is found. Return path of states.
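Of the functions listed above, the validity check is the one most tied to the puzzle rules, so a sketch may help. It assumes the same conventions as the examples (pegs numbered from 1, top disk first in each list); the name `valid_moves_sketch` is hypothetical, and the order in which moves are returned is a choice made here, not a requirement stated in the assignment.

```python
def valid_moves_sketch(state):
    """List every [source, destination] move allowed from `state`.

    Hypothetical helper: pegs are numbered from 1 and each peg's list
    starts with its top disk. A move is valid when the source peg is not
    empty and the destination is empty or has a larger disk on top.
    """
    moves = []
    for source, source_peg in enumerate(state, start=1):
        if not source_peg:
            continue  # nothing to move from an empty peg
        for destination, dest_peg in enumerate(state, start=1):
            if destination == source:
                continue  # a disk cannot move onto its own peg
            if not dest_peg or source_peg[0] < dest_peg[0]:
                moves.append([source, destination])
    return moves


print(valid_moves_sketch([[1, 2, 3, 4, 5], [], []]))   # [[1, 2], [1, 3]]
print(valid_moves_sketch([[2, 3, 4, 5], [1], []]))     # [[1, 3], [2, 1], [2, 3]]
```

Because the loops run over however many pegs the state contains, the same logic would also apply to a state with more than three pegs.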

A function that you might choose to implement is `state_move_tuple(state, move)`, which returns a tuple of the state and move.

This is useful for converting state and move to a key to be used for the Q dictionary.
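Lists cannot be used as dictionary keys in Python because they are not hashable, which is why a conversion to tuples is needed. A minimal sketch of such a conversion follows; the name `state_move_tuple_sketch` is hypothetical, and the value stored in `Q` is only a placeholder.

```python
def state_move_tuple_sketch(state, move):
    """Convert a list-based state and move into a hashable dictionary key."""
    return (tuple(tuple(peg) for peg in state), tuple(move))


Q = {}
key = state_move_tuple_sketch([[1, 2, 3, 4, 5], [], []], [1, 2])
Q[key] = 0.0   # placeholder value, not a learned estimate
print(key)     # (((1, 2, 3, 4, 5), (), ()), (1, 2))
```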

Show the code and results for testing each function. Then experiment with various values of `n_repetitions`, `learning_rate`, and `epsilon_decay_factor` to find values that work reasonably well, meaning that eventually the minimum solution path of 31 steps is found consistently.
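The figure of 31 follows from the standard recurrence for the three-peg puzzle: to move $n$ disks, the top $n-1$ disks must first be moved aside, then the largest disk is moved, then the $n-1$ disks are moved back on top of it. Writing $T(n)$ for the minimum number of moves (a symbol introduced here for illustration only):

$$
T(1) = 1, \qquad T(n) = 2\,T(n-1) + 1 \;\Longrightarrow\; T(n) = 2^n - 1, \qquad T(5) = 2^5 - 1 = 31.
$$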

Make a plot of the number of steps in the solution path versus number of repetitions. The plot should clearly show the number of steps in the solution path eventually reaching the minimum of 31 steps, though the decrease will not be monotonic. Also plot a horizontal, dashed line at a height (value on y axis) of 31 to show the optimal path length.

Use a total of at least 15 sentences to address the following:

- Add markdown cells in which you describe the Q learning algorithm and your implementation of Q learning as applied to the Towers of Hanoi problem.
- Add code cells to examine several Q values from the start state with different moves and discuss if the Q values make sense.
- Also add code cells to examine several Q values from one or two states that are two steps away from the goal and discuss if these Q values make sense.
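When judging whether Q values make sense, it can help to see how a single update moves an estimate. The following is a minimal sketch of a temporal-difference update, assuming a reinforcement of 1 per move and that Q estimates the number of moves remaining, so smaller values are better. The function name `td_update`, the default of 0 for unseen keys, and the string used as a key are all assumptions for illustration, not the lecture code.

```python
def td_update(Q, key, reinforcement, next_value, learning_rate):
    """Move Q[key] a fraction of the way toward reinforcement + next_value."""
    old = Q.get(key, 0.0)   # assumption: unseen state-move pairs start at 0
    Q[key] = old + learning_rate * (reinforcement + next_value - old)
    return Q[key]


Q = {}
key = "one move from goal"   # stand-in for a real (state, move) tuple
for _ in range(20):
    # The move reaches the goal, so no further moves remain (next_value = 0).
    td_update(Q, key, reinforcement=1, next_value=0.0, learning_rate=0.5)
print(round(Q[key], 4))
```

Under these assumptions, the value for a move that reaches the goal should approach 1, and the value for the best move from a state two steps from the goal should approach 2, which gives a concrete expectation to compare against the learned values.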

In [13]:

```
state = [[1, 2, 3, 4, 5], [], []]
print_state(state)
```

In [14]:

```
move = [1, 2]  # Move top (smallest) disk from first peg to second peg
state_move_tuple(state, move)
```

Out[14]:

In [15]:

```
new_state = make_move(state, move)
new_state
```

Out[15]:

In [16]:

```
get_valid_moves(new_state)
```

Out[16]:

In [17]:

```
print_state(new_state)
```

In [18]:

```
Q, steps_to_goal = train_Q(200, 0.5, 0.7, get_valid_moves, make_move)
```

In [19]:

```
steps_to_goal
```

Out[19]:

In [20]:

```
path = test_Q(Q, 100, get_valid_moves, make_move)
```

In [21]:

```
path
```

Out[21]:

In [22]:

```
for s in path:
    print_state(s)
    print()
```

Download and extract `A4grader.py` from A4grader.tar.

In [3]:

```
%run -i A4grader.py
```

Modify your code to solve the Towers of Hanoi puzzle with four pegs and five disks. Name your functions

- `print_state_4pegs`
- `get_valid_moves_4pegs`
- `make_move_4pegs`

Find values for number of repetitions, learning rate, and epsilon decay factor for which `train_Q` learns a Q function that `test_Q` can use to find the shortest solution path. Include the output from the successful calls to `train_Q` and `test_Q`.
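To know what "shortest solution path" means for a given number of pegs, one option is to compute it independently with breadth-first search and compare it with the path length found by the learned Q function. The sketch below is such a checker, not part of the required solution; the function name `shortest_solution_length` is hypothetical, and it assumes the goal is to move all disks from the first peg to the last peg.

```python
from collections import deque


def shortest_solution_length(n_disks, n_pegs):
    """Breadth-first search for the fewest moves from first peg to last peg.

    Hypothetical checker: assumes the goal is all disks on the last peg.
    States are tuples of tuples, with the top disk first on each peg.
    """
    all_disks = tuple(range(1, n_disks + 1))
    start = tuple([all_disks] + [()] * (n_pegs - 1))
    goal = tuple([()] * (n_pegs - 1) + [all_disks])
    distance = {start: 0}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state == goal:
            return distance[state]
        for src in range(n_pegs):
            if not state[src]:
                continue  # empty peg, nothing to move
            disk = state[src][0]
            for dst in range(n_pegs):
                # Skip the same peg and any peg whose top disk is smaller.
                if dst == src or (state[dst] and state[dst][0] < disk):
                    continue
                pegs = list(state)
                pegs[src] = state[src][1:]
                pegs[dst] = (disk,) + state[dst]
                new_state = tuple(pegs)
                if new_state not in distance:
                    distance[new_state] = distance[state] + 1
                    queue.append(new_state)
    return None  # goal unreachable (does not happen for these puzzles)


print(shortest_solution_length(5, 3))   # 31
print(shortest_solution_length(5, 4))   # 13
```

The state space is small (at most 4^5 = 1024 states with four pegs), so the search finishes almost instantly and gives a target number of steps for the plot and for `test_Q`.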