$\newcommand{\trace}[1]{\operatorname{tr}\left\{#1\right\}}$ $\newcommand{\Norm}[1]{\lVert#1\rVert}$ $\newcommand{\RR}{\mathbb{R}}$ $\newcommand{\inner}[2]{\langle #1, #2 \rangle}$ $\newcommand{\DD}{\mathscr{D}}$ $\newcommand{\grad}[1]{\operatorname{grad}#1}$ $\DeclareMathOperator*{\argmin}{arg\,min}$

Setting up the environment

In [ ]:

```
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline
```

We will use an old dataset on the price of housing in Boston (see description). The aim is to predict the median value of owner-occupied homes from various other factors. This is the same data as was used in Tutorial 2. This time, however, we will explore data normalisation, so we use the raw data instead. Please download this from mldata.org.

As in Tutorial 2, use `pandas` to read the data. Remove the 'CHAS' feature from the dataset.

In [ ]:

```
names_full = ['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax', 'ptratio', 'b', 'lstat', 'medv']
data_full = np.loadtxt('regression-datasets-housing.csv', delimiter=',')
ichas = names_full.index('chas')
data = np.delete(data_full, ichas, axis=1)
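# Build a new list without 'chas'; plain assignment would alias names_full and mutate it
names = names_full[:ichas] + names_full[ichas + 1:]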
data.shape
```

Implement a function that will normalise each feature such that the mean value of the feature is zero and the variance is one. Apply this function to each feature in the housing dataset.

In [ ]:

```
# Solution goes here
```
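One possible shape for such a function is sketched below. It is only an illustration: the name `normalise`, the handling of constant columns, and the synthetic `demo` array are assumptions made here, not part of the tutorial.

```python
import numpy as np

def normalise(X):
    """Scale each column of X to zero mean and unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    # Assumption: a constant column has std 0; leave it centred instead of dividing by zero
    std = np.where(std == 0, 1.0, std)
    return (X - mean) / std

# Synthetic stand-in for the housing array (three columns on very different scales)
rng = np.random.default_rng(0)
demo = rng.normal(loc=[5.0, -2.0, 100.0], scale=[1.0, 10.0, 0.1], size=(200, 3))
n_demo = normalise(demo)

print(n_demo.mean(axis=0))  # each entry should be close to 0
print(n_demo.var(axis=0))   # each entry should be close to 1
```

On the housing array this would presumably be called as `n_data = normalise(data)`, which matches the `n_data` name used in the following cells.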

To simplify equations, we introduce an extra input so that the biases can be absorbed into the weights.

In [ ]:

```
num_ex = n_data.shape[0]
n_data_ones = np.hstack((n_data, np.ones(num_ex).reshape(-1,1)))
names_ones = names + ['one']
```
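To see why the extra column of ones absorbs the bias, the small check below compares a linear model with an explicit bias against the same model with the bias appended to the weight vector. The arrays and the names `w`, `b` and `w_aug` are synthetic and chosen here for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 3))   # synthetic inputs: 6 examples, 3 features
w = rng.normal(size=3)        # weights
b = 0.7                       # explicit bias

# Append a column of ones to the inputs and the bias to the weights
X_ones = np.hstack((X, np.ones((X.shape[0], 1))))
w_aug = np.append(w, b)

y_explicit = X @ w + b        # model with a separate bias term
y_absorbed = X_ones @ w_aug   # same model, bias absorbed into the weights

print(np.allclose(y_explicit, y_absorbed))
```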

Compare the normalised data `n_data` to the data from Tutorial 2 by plotting and/or comparing histograms. Discuss the potential effect of the normalisation on the regression task.

In [ ]:

```
# Solution goes here
```

Note that we are considering a regression problem. That is, we want to predict the median value of homes (a real number) from the other features. We use the squared error to measure performance: $$ E = \frac{1}{2} \sum_k (y_k - t_k)^2 $$

Write down the objective function of a neural network with one hidden layer. Use the identity activation function for the hidden units. Write down the equation for 5 hidden units.

How many input units should there be? What should be the activation function of the output units? Explain why these choices are reasonable.

Compute the gradient $\frac{\partial E}{\partial w^{(2)}}$.

One way to check gradient code, in neural networks and in general, is to compare the analytic expression against a numerical estimate. From the lecture we have the central difference approximation $$ \frac{\partial E}{\partial w^{(2)}} \simeq \frac{E(w^{(2)} + \epsilon) - E(w^{(2)} - \epsilon)}{2\epsilon}, $$ where the small perturbation $\epsilon$ is applied to one component of $w^{(2)}$ at a time. For more information see the following wiki.

Implement two functions, one that computes the analytic gradient and the second that computes the numerical gradient.

In [ ]:

```
# Solution goes here
```
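A minimal sketch of the two functions is given below. It rests on several assumptions that the tutorial leaves open: a single output unit, no separate hidden bias, first-layer weights `W1` of shape (inputs, hidden), second-layer weights `w2` of shape (hidden,), and synthetic data in place of the housing array. Under those assumptions the hidden activations are $Z = X W^{(1)}$, the output is $y = Z w^{(2)}$, and the analytic gradient is $Z^\top (y - t)$.

```python
import numpy as np

def error(w2, W1, X, t):
    """Squared error E = 0.5 * sum_k (y_k - t_k)^2 for a linear two-layer network."""
    y = (X @ W1) @ w2
    return 0.5 * np.sum((y - t) ** 2)

def grad_analytic(w2, W1, X, t):
    """Analytic dE/dw2 = Z^T (y - t), with hidden activations Z = X W1."""
    Z = X @ W1
    return Z.T @ (Z @ w2 - t)

def grad_numeric(w2, W1, X, t, eps=1e-6):
    """Central-difference estimate, perturbing one component of w2 at a time."""
    grad = np.zeros_like(w2)
    for i in range(w2.size):
        step = np.zeros_like(w2)
        step[i] = eps
        grad[i] = (error(w2 + step, W1, X, t) - error(w2 - step, W1, X, t)) / (2 * eps)
    return grad

# Synthetic stand-in data: 20 examples, 4 inputs, 5 hidden units
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
t = rng.normal(size=20)
W1 = rng.normal(size=(4, 5))
w2 = rng.normal(size=5)

g_a = grad_analytic(w2, W1, X, t)
g_n = grad_numeric(w2, W1, X, t)
print(np.max(np.abs(g_a - g_n)))  # should be small relative to the gradient entries
```

Because $E$ is quadratic in $w^{(2)}$ here, the central difference is exact up to floating-point round-off, so any sizeable disagreement points to a bug in one of the two functions.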

Using the Boston housing data above, confirm that the two functions return almost the same values of the gradient for various values of $w$.

In [ ]:

```
# Solution goes here
```

Derive and implement the gradients for the hidden layer, giving you the full two-layer neural network. Use this with the experimental setup from Tutorial 2 to analyse the Boston housing data. Recall that since we are using linear activation functions, the network is equivalent to a linear model. Compare and contrast the results of the neural network with regularised linear regression.
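The equivalence with a linear model can be checked numerically. The sketch below uses synthetic weights and inputs, with shapes assumed here for illustration (`W1` of shape (inputs, hidden), `w2` of shape (hidden,)), and shows that the two layers collapse into a single weight vector `W1 @ w2`.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(10, 4))    # synthetic inputs: 10 examples, 4 features
W1 = rng.normal(size=(4, 5))    # input-to-hidden weights, 5 hidden units
w2 = rng.normal(size=5)         # hidden-to-output weights

# Two-layer network with identity activations
y_network = (X @ W1) @ w2

# Equivalent single-layer linear model with collapsed weights
w_linear = W1 @ w2
y_linear = X @ w_linear

print(np.allclose(y_network, y_linear))
```

The network has more parameters than the collapsed model, but it cannot represent any function that the linear model cannot, which is worth keeping in mind when comparing it with regularised linear regression.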