# Neural Networks

###### COMP4670/8600 - Introduction to Statistical Machine Learning - Tutorial 5


Setting up the environment

In [ ]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

%matplotlib inline


## The data set

We will use an old dataset on the price of housing in Boston (see description). The aim is to predict the median value of owner-occupied homes from various other factors. This is the same data as was used in Tutorial 2, but this time we will explore data normalisation and therefore use the raw data instead. Please download the raw data from mldata.org.

As in Tutorial 2, use pandas to read the data. Remove the 'CHAS' feature from the dataset.

In [ ]:
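# Assumes data_full is a NumPy array holding the raw data read with pandas (as in Tutorial 2),
# with one column per entry of names_full.
names_full = ['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax', 'ptratio', 'b', 'lstat', 'medv']
ichas = names_full.index('chas')
# Drop the 'chas' column from the data
data = np.delete(data_full, ichas, axis=1)
# Build a new list so that names_full is left unchanged
names = [name for name in names_full if name != 'chas']
data.shape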


Implement a function that will normalise each feature such that the mean value of the feature is zero and the variance is one. Apply this function to each feature in the housing dataset.

In [ ]:
# Solution goes here
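
One possible sketch of such a function is shown below. It assumes the data is a NumPy array with examples in rows and features in columns; the function name `normalise_features` and the toy array are illustrative and not part of the tutorial.

```python
import numpy as np

def normalise_features(X):
    """Return a copy of X with each column shifted to zero mean and scaled to unit variance."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    # Assumption: constant columns are centred but left unscaled to avoid division by zero
    std = np.where(std == 0, 1.0, std)
    return (X - mean) / std

# Toy example with made-up values (not the housing data)
toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
toy_norm = normalise_features(toy)
```

On the housing data this would be called as `n_data = normalise_features(data)`, which matches the name `n_data` used in the next cell.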


To simplify equations, we introduce an extra input so that the biases can be absorbed into the weights.

In [ ]:
num_ex = n_data.shape[0]
# Append a column of ones so that the bias becomes an ordinary weight
n_data_ones = np.hstack((n_data, np.ones((num_ex, 1))))
names_ones = names + ['one']
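
To see why the extra input absorbs the bias, the following minimal sketch compares a linear unit with an explicit bias against the same unit applied to inputs with an appended column of ones. All names and values here (`X`, `w`, `b`, `X_ones`, `w_ext`) are toy assumptions, not the housing data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))   # 4 toy examples, 3 features
w = rng.normal(size=3)        # weights
b = 0.7                       # bias

# Explicit bias term
y_explicit = X @ w + b

# Bias absorbed: append a column of ones and extend the weight vector with b
X_ones = np.hstack((X, np.ones((X.shape[0], 1))))
w_ext = np.append(w, b)
y_absorbed = X_ones @ w_ext
```

Both computations give the same outputs, so the bias can be treated as one more weight.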


## Comparing two normalisations

Compare the normalised data n_data to the data from Tutorial 2 by plotting and/or comparing histograms. Discuss the potential effect of the normalisation on the regression task.

In [ ]:
# Solution goes here


## Error Backpropagation

Note that we are considering a regression problem. That is, we want to predict the median value of homes (a real number) from the other features. We use the squared error to measure performance: $$E = \frac{1}{2} \sum_k (y_k - t_k)^2$$ where $y_k$ is the network's prediction and $t_k$ is the corresponding target value.

### Objective function

Write down the objective function of a neural network with one hidden layer. Use the identity activation function for the hidden units. Write down the equation for 5 hidden units.

How many input units should there be? What should be the activation function of the output units? Explain why these choices are reasonable.
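
As a sketch of one way to write this down, assume a single output unit with identity activation, $D$ inputs $x_i$ (including the constant input), and $N$ training examples indexed by $n$. The symbols $z_j$, $D$ and $N$ are notation introduced here and are not taken from the tutorial:

$$z_j(\mathbf{x}) = \sum_{i=1}^{D} w^{(1)}_{ji} x_i, \qquad y(\mathbf{x}) = \sum_{j=1}^{5} w^{(2)}_{j} z_j(\mathbf{x}), \qquad E = \frac{1}{2} \sum_{n=1}^{N} \big(y(\mathbf{x}_n) - t_n\big)^2$$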

### Solution description

Compute the gradient $\frac{\partial E}{\partial w^{(2)}}$ of the error with respect to the second-layer weights.
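
A sketch of the derivation, assuming a single identity output, hidden activations $z_{nj} = \sum_i w^{(1)}_{ji} x_{ni}$ for example $n$, and prediction $y_n = \sum_j w^{(2)}_j z_{nj}$ (the indices $n$, $i$, $j$ and the symbol $z_{nj}$ are notation introduced here). Applying the chain rule to $E = \frac{1}{2} \sum_n (y_n - t_n)^2$ gives

$$\frac{\partial E}{\partial w^{(2)}_j} = \sum_{n} (y_n - t_n) \frac{\partial y_n}{\partial w^{(2)}_j} = \sum_{n} (y_n - t_n)\, z_{nj}$$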

## Checking correctness

One way to check that gradient code is correct, for neural networks and for gradient code in general, is to compare the analytic expression against a numerical estimate. From the lecture, the central difference approximation is: $$\frac{\partial E}{\partial w^{(2)}} \simeq \frac{E(w^{(2)} + \epsilon) - E(w^{(2)} - \epsilon)}{2\epsilon}.$$ Here $\epsilon$ is a small perturbation applied to one component of $w^{(2)}$ at a time. For more information see the following wiki.

Implement two functions, one that computes the analytic gradient and the second that computes the numerical gradient.

In [ ]:
# Solution goes here
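
A minimal sketch of the two functions is given below, assuming a single identity output, identity hidden units, and inputs stored with examples in rows. The function names, the argument layout (`W1` of shape hidden-by-input, `w2` as a vector) and the random toy data are assumptions made for illustration.

```python
import numpy as np

def forward(X, W1, w2):
    """Network output with identity hidden units and a single identity output."""
    Z = X @ W1.T          # hidden activations, shape (N, H)
    return Z @ w2         # predictions, shape (N,)

def error(X, t, W1, w2):
    """Sum-of-squares error E = 1/2 * sum (y - t)^2."""
    return 0.5 * np.sum((forward(X, W1, w2) - t) ** 2)

def grad_w2_analytic(X, t, W1, w2):
    """Analytic gradient of E with respect to the second-layer weights."""
    Z = X @ W1.T
    return Z.T @ (Z @ w2 - t)

def grad_w2_numerical(X, t, W1, w2, eps=1e-6):
    """Central-difference estimate, perturbing one component of w2 at a time."""
    grad = np.zeros_like(w2)
    for j in range(w2.size):
        step = np.zeros_like(w2)
        step[j] = eps
        grad[j] = (error(X, t, W1, w2 + step) - error(X, t, W1, w2 - step)) / (2 * eps)
    return grad

# Toy check on random data (not the housing data)
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
t = rng.normal(size=20)
W1 = rng.normal(size=(5, 4))
w2 = rng.normal(size=5)
g_a = grad_w2_analytic(X, t, W1, w2)
g_n = grad_w2_numerical(X, t, W1, w2)
```

The two gradients can then be compared with `np.allclose(g_a, g_n, atol=1e-4)`; repeating the check for several random draws of `w2` gives more confidence than a single point.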


Using the Boston housing data above, confirm that the two functions return almost the same values of the gradient for various values of $w$.

In [ ]:
# Solution goes here


## (optional) Gradients for hidden layer

Derive and implement the gradients for the hidden layer, giving you the full two-layer neural network. Use this with the experimental setup from Tutorial 2 to analyse the Boston housing data. Recall that since we are using linear activation functions, this is equivalent to using a linear model. Compare and contrast the results of the neural network with regularised linear regression.
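
The equivalence with a linear model can be illustrated with a small sketch: with identity activations, the two weight layers collapse into a single effective weight vector. The shapes and the names `W1`, `w2` and `w_effective` are toy assumptions, not taken from the tutorial.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 4))    # 10 toy examples, 4 inputs
W1 = rng.normal(size=(5, 4))    # first-layer weights (5 hidden units)
w2 = rng.normal(size=5)         # second-layer weights (single output)

# Two-layer network with identity activations
y_network = (X @ W1.T) @ w2

# Equivalent linear model with a single weight vector
w_effective = W1.T @ w2         # shape (4,)
y_linear = X @ w_effective
```

The predictions agree, so any difference from regularised linear regression comes from the parameterisation and the training procedure rather than from the set of functions the model can represent.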