import matplotlib
import matplotlib.pyplot as plt
plt.gray()
%matplotlib inline
# put nvcc on the path
import os
os.environ['PATH'] += ':/usr/local/cuda/bin'
import numpy as np
import theano
import theano.tensor as T
from theano import shared, function
theano.config.floatX = 'float32'
rng = np.random.RandomState(42)
import sys
import time
Using gpu device 0: GeForce GT 650M (CNMeM is disabled)
Today we will focus on deep convolutional neural networks. We'll use the handwritten digit recognition task again, building a network with a pair of convolutional layers and two fully connected layers above them.
The first part of this tutorial is similar (though not identical; we skip normalization) to yesterday's notebook on fully-connected networks.
We will make heavy use of the resources in the Theano deep learning tutorial. If you cloned the Workshop git repo, you can just run
git submodule update
Now we have to add this directory to the Python path. Depending on the location of your git repository, you may have to change this path.
sys.path.insert(1,'DeepLearningTutorials/code')
Classifying handwritten digits is the 'hello world' of deep learning methods: it is how everyone normally starts getting used to the models and how best to train them. Interestingly, though, it is not just for beginners. Papers are still published today that use MNIST to evaluate new methods. There is some debate about whether this is still useful, but we won't get into that discussion here.
MNIST consists of 70,000 small image patches, each showing a handwritten digit in white on a black background. There are 10 classes (the digits 0 through 9). Let's load the data and have a look.
from logistic_sgd import load_data
dataset='mnist.pkl.gz'
# If not already cached, this function downloads the data
datasets = load_data(dataset)
train_set_x, train_set_y = datasets[0]
valid_set_x, valid_set_y = datasets[1]
test_set_x, test_set_y = datasets[2]
... loading data
print(train_set_x)
<CudaNdarrayType(float32, matrix)>
Now that looks a bit odd: why can't we see the values of our data matrix? The load_data function provided by the deep learning tutorial downloads the data and loads it into a shared variable. Shared variables are an important and ingenious concept in Theano: the data can be shared between the CPU and the GPU without you having to write a single line of code. You basically do not have to care where your data lives, unless you run out of memory, which is another story we can talk about later. For now, let us only worry about how we can access the actual values of the data matrix:
print(train_set_x.get_value())
[[ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 ...,
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]]
print("Size of the training data matrix: ", train_set_x.get_value().shape)
Size of the training data matrix: (50000, 784)
The data is stored with one training sample per row. The original image patches are $28 \times 28$ pixels, hence we have $28 \cdot 28 = 784$ feature columns in our data matrix. We can reshape each row of our data matrix back to the original $28 \times 28$ image patch to visualize what the data looks like.
from utils import tile_raster_images
samples = tile_raster_images(train_set_x.get_value(), img_shape=(28,28), tile_shape=(3,10), tile_spacing=(0, 0),
scale_rows_to_unit_interval=True,
output_pixel_vals=True)
plt.imshow(samples)
plt.show()
So each image patch contains a handwritten digit in white on a black background. The strokes vary in thickness, and there is quite some variability in the handwriting.
The labels of the different image patches are stored in the variable train_set_y. When we print this variable, it looks different from our training data, and there is no get_value() function that we can call to access the labels.
print(train_set_y)
Elemwise{Cast{int32}}.0
The reason is that the load_data function cast our labels to integers. Why would it do that? The GPU can only store float variables, so all shared variables are stored as floats. Our labels are integers, though, and later on we want to use them as indices, which requires them to be integers. As a workaround, load_data first stores the labels as floats and then casts them back to integers. As you can see from the print statement, Theano does not execute the cast right away. Instead, it keeps a record of all the computations to be performed on the shared variable in a computational graph. This is another ingenious concept in Theano: as these graphs get more complicated, Theano optimizes the whole graph for computational efficiency and numerical stability. If you want to know more about computational graph structures and their optimization in Theano, you can look here.
To access the labels, we need to tell Theano to execute all computations in the corresponding graph. We do this by calling eval():
print(train_set_y.eval()[:10])
[5 0 4 1 9 2 1 3 1 4]
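The float-to-integer round trip described above can be illustrated in plain NumPy (an analogy for the concept, not actual Theano code): labels are stored as float32 so they can live on the GPU, then cast back to int32 before being used as indices.

```python
import numpy as np

# labels shipped as float32 for GPU storage
labels_float = np.array([5., 0., 4., 1., 9.], dtype=np.float32)

# cast back to integers so they can be used as indices
labels_int = labels_float.astype(np.int32)

# e.g. building one-hot targets requires integer indexing
one_hot = np.eye(10, dtype=np.float32)[labels_int]
```

The difference is that Theano records this cast in its graph and only performs it when eval() (or a compiled function) forces the computation.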
We'll create a convolutional network to classify digits. We use code from the Deep Learning Tutorials to make this a little easier, particularly the convolutional layer class LeNetConvPoolLayer, defined in convolutional_mlp.py.
We'll build up a network using the following layers:
- a convolutional layer with 9x9 filters, followed by 2x2 maxpooling
- a second convolutional layer with 5x5 filters, followed by 2x2 maxpooling
- a fully connected sigmoidal (tanh) hidden layer
- a logistic regression output layer over the 10 digit classes
We'll train with minibatches, using a batch size of 50.
# Setup 1: parameters of the network and training
from convolutional_mlp import LeNetConvPoolLayer
from mlp import HiddenLayer
from logistic_sgd import LogisticRegression
# network parameters
num_kernels = [10, 10]
kernel_sizes = [(9, 9), (5, 5)]
sigmoidal_output_size = 20
# training parameters
learning_rate = 0.1
batch_size = 50
For efficiency, we break the data into batches. Our network will take batch_size images at a time as input.
# Setup 2: compute batch sizes for train/test/validation
# borrow=True gets us the value of the variable without making a copy.
n_train_batches = train_set_x.get_value(borrow=True).shape[0]
n_test_batches = test_set_x.get_value(borrow=True).shape[0]
n_valid_batches = valid_set_x.get_value(borrow=True).shape[0]
n_train_batches //= batch_size
n_test_batches //= batch_size
n_valid_batches //= batch_size
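As a quick sanity check of the integer division above: 50,000 training rows at a batch size of 50 give 1,000 minibatches per epoch.

```python
# batch arithmetic: number of rows divided (integer division) by batch size
n_rows, batch_size = 50000, 50
n_batches = n_rows // batch_size  # minibatches per epoch
```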
# Setup 3.
# Declare inputs to network - x and y are placeholders
# that will be used in the training/testing/validation functions below.
x = T.matrix('x') # input image data
y = T.ivector('y') # input label data
The first layer takes (batch_size, 1, 28, 28) as input, convolves it with 10 different 9x9 filters, and then downsamples (via maxpooling) in a 2x2 region. Each filter/maxpool combination produces an output of size (28 - 9 + 1) / 2 = 10 on a side.
The size of the first layer's output is therefore (batch_size, 10, 10, 10).
layer0_input_size = (batch_size, 1, 28, 28) # fixed size from the data
edge0 = (28 - kernel_sizes[0][0] + 1) // 2
layer0_output_size = (batch_size, num_kernels[0], edge0, edge0)
# check that the convolution output divides evenly into 2x2 pooling regions
assert ((28 - kernel_sizes[0][0] + 1) % 2) == 0
# The actual input is the placeholder x reshaped to the input size of the network
layer0_input = x.reshape(layer0_input_size)
layer0 = LeNetConvPoolLayer(rng,
input=layer0_input,
image_shape=layer0_input_size,
filter_shape=(num_kernels[0], 1) + kernel_sizes[0],
poolsize=(2, 2))
The second layer takes (batch_size, 10, 10, 10) as input, convolves it with 10 different 10x5x5 filters, and then downsamples (via maxpooling) in a 2x2 region. Each filter/maxpool combination produces an output of size (10 - 5 + 1) / 2 = 3 on a side.
The size of the second layer's output is therefore (batch_size, 10, 3, 3).
layer1_input_size = layer0_output_size
edge1 = (edge0 - kernel_sizes[1][0] + 1) // 2  # integer division: edge1 is a size
layer1_output_size = (batch_size, num_kernels[1], edge1, edge1)
# check that the convolution output divides evenly into 2x2 pooling regions
assert ((edge0 - kernel_sizes[1][0] + 1) % 2) == 0
layer1 = LeNetConvPoolLayer(rng,
input=layer0.output,
image_shape=layer1_input_size,
filter_shape=(num_kernels[1], num_kernels[0]) + kernel_sizes[1],
poolsize=(2, 2))
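The edge arithmetic is the same for both layers, so it can be factored into a small helper. This is just a sketch for checking shapes, not part of the network code:

```python
def conv_pool_output_edge(input_edge, kernel_edge, pool=2):
    # edge length after a 'valid' convolution followed by
    # non-overlapping pool x pool maxpooling
    conv_edge = input_edge - kernel_edge + 1
    assert conv_edge % pool == 0, "conv output must divide evenly into pools"
    return conv_edge // pool

e0 = conv_pool_output_edge(28, 9)  # first layer: (28 - 9 + 1) / 2 = 10
e1 = conv_pool_output_edge(10, 5)  # second layer: (10 - 5 + 1) / 2 = 3
```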
The sigmoidal layer takes a vector as input.
We flatten all but the first dimension, to get an input of size (batch_size, 10 * 3 * 3).
layer2_input = layer1.output.flatten(2)
layer2 = HiddenLayer(rng,
input=layer2_input,
n_in=num_kernels[1] * edge1 * edge1,
n_out=sigmoidal_output_size,
activation=T.tanh)
A fully connected logistic regression layer converts the sigmoidal layer's output to a class label.
layer3 = LogisticRegression(input=layer2.output,
n_in=sigmoidal_output_size,
n_out=10)
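Under the hood, the logistic regression layer computes a softmax over the 10 class scores and predicts the argmax. A minimal NumPy sketch of that computation (the real, symbolic version lives in logistic_sgd.py):

```python
import numpy as np

def softmax(scores):
    # subtract the row-wise max for numerical stability
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# one example with 10 made-up class scores
scores = np.array([[2.0, 0.1, -1.0, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0]])
probs = softmax(scores)               # each row sums to 1
predicted_label = probs.argmax(axis=1)
```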
To train the network, we have to define a cost function. We'll use the negative log likelihood of the model, relative to the true labels y.
# The cost we minimize during training is the NLL of the model.
# Recall: y is a placeholder we defined above
cost = layer3.negative_log_likelihood(y)
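In NumPy terms, the negative log likelihood is the mean of -log p(correct class) over the batch. A sketch of the idea with made-up probabilities (the actual cost is built symbolically by Theano):

```python
import numpy as np

def negative_log_likelihood(probs, y):
    # probs: (batch, n_classes) predicted probabilities; y: true labels
    return -np.mean(np.log(probs[np.arange(len(y)), y]))

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
y = np.array([0, 1])
cost = negative_log_likelihood(probs, y)  # -(log 0.7 + log 0.8) / 2
```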
We will train with stochastic gradient descent. To do so, we need the gradient of the cost with respect to the parameters of the model. We can get the parameters of each layer via its .params attribute.
# create a list of all model parameters to be fit by gradient descent
params = layer3.params + layer2.params + layer1.params + layer0.params
# create a list of gradients for all model parameters
grads = T.grad(cost, params)
The actual training function takes a batch of inputs (images x and labels y), and updates the model parameters with a small step in the steepest direction that reduces the cost.
This is the
(param_i, param_i - learning_rate * grad_i)
line in the code below.
We use the updates keyword to theano.function() to update these variables in-place. See the documentation for function().
We also make use of the givens keyword to alias x and y to batches (i.e., slices) of train_set_x and train_set_y.
# train_model is a function that updates the model parameters by SGD, and returns the current cost
#
# We create the updates list by automatically looping over all
# (params[i], grads[i]) pairs.
updates = [
(param_i, param_i - learning_rate * grad_i) # <=== SGD update step
for param_i, grad_i in zip(params, grads)
]
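The update rule itself is plain gradient descent. Stripped of Theano, one step looks like this (made-up parameter and gradient values, just to show the arithmetic):

```python
import numpy as np

learning_rate = 0.1
params = [np.array([1.0, -2.0]), np.array([0.5])]
grads  = [np.array([0.2,  0.4]), np.array([-1.0])]

# one SGD step: move each parameter a small step against its gradient
params = [p - learning_rate * g for p, g in zip(params, grads)]
```

Theano performs the same arithmetic, but in-place on the shared parameter variables, so no values are copied back and forth.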
index = T.lscalar() # index to a batch of training/validation/testing examples
train_model = theano.function(
[index],
cost,
updates=updates,
givens={
x: train_set_x[index * batch_size: (index + 1) * batch_size], # <=== batching
y: train_set_y[index * batch_size: (index + 1) * batch_size] # <=== batching
}
)
To track progress on a held-out set, we count the number of misclassified examples in the validation set.
validate_model = theano.function(
[index],
layer3.errors(y),
givens={
x: valid_set_x[index * batch_size: (index + 1) * batch_size],
y: valid_set_y[index * batch_size: (index + 1) * batch_size]
}
)
After training, we check the number of misclassified examples in the test set.
test_model = theano.function(
[index],
layer3.errors(y),
givens={
x: test_set_x[index * batch_size: (index + 1) * batch_size],
y: test_set_y[index * batch_size: (index + 1) * batch_size]
}
)
We run SGD for a fixed number of passes over the full training set (each pass is an "epoch"). Usually we'd use a more elaborate stopping rule, such as iterating until a certain number of epochs fails to produce improvement on the validation set.
for epoch in range(10):
costs = [train_model(i) for i in range(n_train_batches)]
validation_losses = [validate_model(i) for i in range(n_valid_batches)]
print("Epoch {} NLL {:.2} %err in validation set {:.1%}".format(epoch + 1, np.mean(costs), np.mean(validation_losses)))
Epoch 1 NLL 0.47 %err in validation set 4.3%
Epoch 2 NLL 0.13 %err in validation set 3.0%
Epoch 3 NLL 0.097 %err in validation set 2.7%
Epoch 4 NLL 0.078 %err in validation set 2.3%
Epoch 5 NLL 0.067 %err in validation set 2.1%
Epoch 6 NLL 0.058 %err in validation set 2.0%
Epoch 7 NLL 0.052 %err in validation set 2.0%
Epoch 8 NLL 0.046 %err in validation set 1.9%
Epoch 9 NLL 0.042 %err in validation set 1.9%
Epoch 10 NLL 0.038 %err in validation set 1.8%
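The more elaborate stopping rule mentioned above is often implemented with "patience": stop once the validation loss has failed to improve for a given number of consecutive epochs. A sketch of the logic with made-up validation losses (the epochs_until_stop helper is hypothetical, not part of the tutorial code):

```python
def epochs_until_stop(val_losses, patience=2):
    # return the (0-based) epoch at which training would stop
    best, bad_epochs = float('inf'), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, bad_epochs = loss, 0   # improvement: reset the counter
        else:
            bad_epochs += 1
            if bad_epochs >= patience:   # out of patience: stop here
                return epoch
    return len(val_losses) - 1

# improves for 3 epochs, then stalls; stops after 2 bad epochs
stop_epoch = epochs_until_stop([0.043, 0.030, 0.027, 0.028, 0.029, 0.031])
```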
We can visualize the filters learned by the first layer. In most vision tasks, these will be simple features such as lines, points, or edges (see, for example, Figure 2 of this paper).
The MNIST data doesn't contain as much variation as natural images, so the edge features are less obvious, but we still see some stroke-like features appearing at the lowest level. With some standard tricks (adding noise to the training set, shifting examples around, using dropout), we'd probably see smoother features (as in the paper linked above).
filters = tile_raster_images(layer0.W.get_value(borrow=True), img_shape=(9, 9), tile_shape=(1,10), tile_spacing=(3, 3),
scale_rows_to_unit_interval=True,
output_pixel_vals=True)
plt.imshow(filters)
plt.show()
test_errors = [test_model(i) for i in range(n_test_batches)]
print("test errors: {:.1%}".format(np.mean(test_errors)))
test errors: 2.0%