Convolutional networks (also known as convolutional neural networks, or CNNs) are a specialized kind of neural network for processing data that has a known, grid-like topology.
The name “convolutional neural network” indicates that the network employs a mathematical operation called convolution. Convolution is a specialized kind of linear operation.
The convolution operation is typically denoted with an asterisk:
$$s(t) = (x \ast w)(t).$$
In convolutional network terminology, the first argument (in this example, the function x) to the convolution is often referred to as the input, and the second argument (in this example, the function w) as the kernel. The output is sometimes referred to as the feature map.
In machine learning applications, the input is usually a multidimensional array of data and the kernel is usually a multidimensional array of learnable parameters.
We will refer to these multidimensional arrays as tensors.
For example, if we use a two-dimensional image I as our input, we probably also want to use a two-dimensional kernel K:
$$S(i, j) = (I \ast K)(i, j) = \sum_m \sum_n I(m, n)\, K(i - m, j - n).$$
Note that convolution is commutative, meaning we can equivalently write:
$$S(i, j) = (K \ast I)(i, j) = \sum_m \sum_n I(i - m, j - n)\, K(m, n).$$
While the commutative property is useful for writing proofs, it is not usually an important property of a neural network implementation. Instead, many neural network libraries implement a related function called the cross-correlation, which is the same as convolution but without flipping the kernel:
$$S(i, j) = (I \star K)(i, j) = \sum_m \sum_n I(i + m, j + n)\, K(m, n).$$
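To make the distinction concrete, the small check below (a sketch assuming SciPy is available; the array values are arbitrary) computes both operations on the same toy input and verifies that cross-correlating with a flipped kernel reproduces true convolution.

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

I = np.arange(25, dtype=float).reshape(5, 5)   # toy 2-D "image"
K = np.array([[1., 0.],
              [0., -1.]])                       # toy 2-D kernel

conv = convolve2d(I, K, mode='valid')           # true convolution (kernel is flipped)
xcorr = correlate2d(I, K, mode='valid')         # cross-correlation (no flip)

# Cross-correlation with a flipped kernel equals true convolution:
assert np.allclose(conv, correlate2d(I, np.flip(K), mode='valid'))
```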
Discrete convolution can be viewed as multiplication by a matrix.
Viewing convolution as matrix multiplication usually does not help to implement convolution operations, but it is useful for understanding and designing neural networks.
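As an illustration, the sketch below builds the matrix corresponding to a 1-D “valid” cross-correlation and checks it against NumPy's np.correlate; the helper name conv_matrix is only a placeholder. Each row contains the same few kernel entries shifted along the row, which is exactly the sparse, weight-sharing structure discussed next.

```python
import numpy as np

def conv_matrix(k, n):
    """Matrix of shape (n - len(k) + 1, n) that performs 'valid'
    cross-correlation with kernel k when multiplied with a length-n input."""
    m = n - len(k) + 1
    W = np.zeros((m, n))
    for i in range(m):
        W[i, i:i + len(k)] = k      # same kernel entries, shifted along each row
    return W

x = np.random.randn(8)
k = np.array([1., 2., -1.])
assert np.allclose(conv_matrix(k, len(x)) @ x,
                   np.correlate(x, k, mode='valid'))
```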
Convolution leverages three important ideas that can help improve a machine learning system: sparse interactions, parameter sharing, and equivariant representations.
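A rough sense of the savings from sparse interactions and parameter sharing can be had by comparing parameter counts; the sizes below are hypothetical and chosen only for illustration.

```python
# Hypothetical sizes, for illustration only: one input channel, one output channel.
in_units = out_units = 320 * 280        # e.g. a modest greyscale image
kernel_size = 3 * 3                     # one small 2-D kernel

dense_params = in_units * out_units     # fully connected: every output sees every input
conv_params = kernel_size               # convolutional: one small kernel shared everywhere
print(dense_params, conv_params)        # ~8.0e9 weights versus 9
```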
A typical layer of a convolutional network consists of three stages (see Fig. 9.7): a convolution stage that produces a set of linear activations, a detector stage in which each linear activation is run through a nonlinear activation function such as a rectified linear unit, and a pooling stage.
A pooling function replaces the output of the net at a certain location with a summary statistic of the nearby outputs. For example, the max pooling operation reports the maximum output within a rectangular neighborhood.
In all cases, pooling helps make the representation approximately invariant to small translations of the input.
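The effect is easy to see in one dimension; the sketch below (a stride-1 max pooling window over arbitrary example values) pools an input and a copy of it shifted by one position, and most of the pooled values are unchanged.

```python
import numpy as np

def max_pool_1d(x, width=3):
    """Report the maximum within each stride-1 sliding window of the input."""
    return np.array([x[i:i + width].max() for i in range(len(x) - width + 1)])

x = np.array([0., 0., 1., 0., 0., 2., 0., 0.])
shifted = np.roll(x, 1)                 # translate the input by one position

print(max_pool_1d(x))                   # [1. 1. 1. 2. 2. 2.]
print(max_pool_1d(shifted))             # most pooled values are unchanged
```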
KEY IDEA: Invariance to local translation can be a very useful property if we care more about whether some feature is present than exactly where it is.
The use of pooling can be viewed as adding an infinitely strong prior that the function the layer learns must be invariant to small translations. When this assumption is correct, it can greatly improve the statistical efficiency of the network.
Pooling over spatial regions produces invariance to translation, but if we pool over the outputs of separately parametrized convolutions, the features can learn which transformations to become invariant to (see Fig. 9.9).
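A minimal sketch of this idea (with hypothetical shapes): given the outputs of a few separately parametrized convolutions, each responding to the same pattern under a different transformation, we take the maximum across the maps rather than across space.

```python
import numpy as np

# Hypothetical: outputs of three separately parametrized convolutions over the
# same input, e.g. filters tuned to the same feature at different rotations.
feature_maps = np.random.randn(3, 6, 6)

# Pooling across the maps (not across space) makes the unit respond to the
# feature regardless of which transformation produced the strongest response.
pooled = feature_maps.max(axis=0)       # shape (6, 6)
```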
Additionally, the input is usually not just a grid of real values. Rather, it is a grid of vector-valued observations.
We may also want to skip over some positions of the kernel in order to reduce the computational cost (at the expense of not extracting our features as finely).
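This amounts to a strided convolution, which gives the same result as computing the output at every position and then keeping every s-th value, while avoiding the wasted work. A 1-D sketch, using cross-correlation (as most libraries do) and arbitrary values:

```python
import numpy as np

x = np.random.randn(16)
k = np.random.randn(3)
s = 2                                          # stride

full = np.correlate(x, k, mode='valid')        # evaluate at every position
strided = full[::s]                            # then keep every s-th output

# Computing only the needed positions directly gives the same result:
direct = np.array([np.dot(x[i:i + len(k)], k)
                   for i in range(0, len(x) - len(k) + 1, s)])
assert np.allclose(strided, direct)
```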
One essential feature of any convolutional network implementation is the ability to implicitly zero-pad the input V in order to make it wider.
Usually the optimal amount of zero padding (in terms of test set classification accuracy) lies somewhere between “valid” and “same” convolution.
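Here “valid” means no zero padding, so the output shrinks by one less than the kernel width in each dimension, while “same” means adding just enough zeros that the output keeps the size of the input. A quick shape check, assuming SciPy is available:

```python
import numpy as np
from scipy.signal import correlate2d

I = np.random.randn(8, 8)
K = np.random.randn(3, 3)

valid = correlate2d(I, K, mode='valid')   # no padding: output shrinks to 6 x 6
same  = correlate2d(I, K, mode='same')    # enough zero padding to stay 8 x 8
print(valid.shape, same.shape)            # (6, 6) (8, 8)
```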
Tiled convolution (Gregor and LeCun, 2010; Le et al., 2010) offers a compromise between a convolutional layer and a locally connected layer. Rather than learning a separate set of weights at every spatial location, we learn a set of kernels that we rotate through as we move through space.
This means that immediately neighboring locations will have different filters, like in a locally connected layer, but the memory requirements for storing the parameters will increase only by a factor of the size of this set of kernels, rather than by the size of the entire output feature map.
To define tiled convolution algebraically, let k be a 6-D tensor, where two of the dimensions correspond to different locations in the output map. Rather than having a separate index for each location in the output map, output locations cycle through a set of t different choices of kernel stack in each direction. If t is equal to the output width, this is the same as a locally connected layer.
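A stripped-down 1-D, single-channel sketch of the cycling idea (using cross-correlation and arbitrary names; the real operation involves the 6-D tensor described above):

```python
import numpy as np

def tiled_conv1d(v, kernels):
    """1-D tiled convolution sketch: the kernel used at output position i
    cycles through the rows of `kernels` with period t."""
    t, k = kernels.shape
    out_width = len(v) - k + 1
    out = np.empty(out_width)
    for i in range(out_width):
        out[i] = np.dot(v[i:i + k], kernels[i % t])   # kernel choice cycles
    return out

v = np.random.randn(10)
kernels = np.random.randn(2, 3)     # t = 2 kernels of width 3, rotated through space
print(tiled_conv1d(v, kernels))
```

With t = 1 this reduces to ordinary convolution with full parameter sharing, and with t equal to the output width every position gets its own kernel, as in a locally connected layer.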
To train the network, we need to compute the derivatives with respect to the weights in the kernel. Suppose the layer computes Z = c(K, V, s), a strided convolution of the kernel stack K applied to the input V with stride s, and that backpropagation supplies a tensor G containing the gradient of the loss with respect to Z. To obtain the gradient with respect to K, we can use a function g(G, V, s).
If this layer is not the bottom layer of the network, we’ll need to compute the gradient with respect to V in order to backpropagate the error farther down. To do so, we can use a function h(K, G, s).
We could also use h to define the reconstruction of a convolutional autoencoder, or the probability distribution over visible units given hidden units in a convolutional RBM or sparse coding model. Suppose we have hidden units H in the same format as Z, and we define a reconstruction R = h(K, H, s).
In order to train the autoencoder, we will receive the gradient with respect to R as a tensor E. To train the decoder, we need to obtain the gradient with respect to K. This is given by g(H, E, s).
To train the encoder, we need to obtain the gradient with respect to H. This is given by c(K, E, s). It is also possible to differentiate through g using c and h, but these operations are not needed for the backpropagation algorithm on any standard network architectures.
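To make the roles of c, g, and h concrete, here is a minimal 1-D, single-channel sketch (the real operations act on multichannel, higher-rank tensors; the shapes here are assumed to divide evenly by the stride):

```python
import numpy as np

# c: strided forward pass, g: gradient w.r.t. the kernel,
# h: gradient w.r.t. the input. Cross-correlation form (no kernel flip).

def c(K, V, s):
    """Strided forward pass: Z[j] = sum_m V[j*s + m] * K[m]."""
    out_w = (len(V) - len(K)) // s + 1
    return np.array([np.dot(V[j * s:j * s + len(K)], K) for j in range(out_w)])

def g(G, V, s):
    """Gradient w.r.t. the kernel K, given the gradient G w.r.t. the output Z."""
    k_w = len(V) - (len(G) - 1) * s               # recover the kernel width
    return np.array([np.dot(G, V[m:m + (len(G) - 1) * s + 1:s])
                     for m in range(k_w)])

def h(K, G, s):
    """Gradient w.r.t. the input V: scatter each G[j] back through the kernel."""
    dV = np.zeros((len(G) - 1) * s + len(K))
    for j in range(len(G)):
        dV[j * s:j * s + len(K)] += G[j] * K
    return dV

V, K, s = np.random.randn(11), np.random.randn(3), 2
Z = c(K, V, s)                 # forward pass
G = np.random.randn(len(Z))    # pretend gradient arriving from the layer above
dK, dV = g(G, V, s), h(K, G, s)
```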
The data used with a convolutional network usually consists of several channels, each channel being the observation of a different quantity at some point in space or time. See Table 9.1 for examples of data types with different dimensionalities and numbers of channels.
So far we have discussed only the case where every example in the train and test data has the same spatial dimensions. One advantage of convolutional networks is that they can also process inputs with varying spatial extents. These kinds of input simply cannot be represented by traditional, matrix-multiplication-based neural networks.
Table 9.1: Examples of different formats of data that can be used with convolutional networks
Convolution is equivalent to converting both the input and the kernel to the frequency domain using a Fourier transform, performing point-wise multiplication of the two signals, and converting back to the time domain using an inverse Fourier transform. For some problem sizes, this can be faster than the naive implementation of discrete convolution.
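A quick numerical check of this equivalence in one dimension (zero-padding both signals to the length of the full linear convolution, so that the circular convolution computed via the FFT matches it):

```python
import numpy as np

x = np.random.randn(256)
k = np.random.randn(5)

direct = np.convolve(x, k, mode='full')                 # naive O(n*m) convolution

n = len(x) + len(k) - 1                                  # length of the full convolution
via_fft = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(k, n), n)

assert np.allclose(direct, via_fft)
```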
Typically, the most expensive part of convolutional network training is learning the features. The output layer is usually relatively inexpensive due to the small number of features provided as input to this layer after passing through several layers of pooling. When performing supervised training with gradient descent, every gradient step requires a complete run of forward propagation and backward propagation through the entire network.
One way to reduce the cost of convolutional network training is to use features that are not trained in a supervised fashion.
There are two basic strategies for obtaining convolution kernels without supervised training: simply initializing them randomly, or learning them with an unsupervised criterion.
Random filters often work surprisingly well in convolutional networks (Jarrett et al., 2009b; Saxe et al., 2011; Pinto et al., 2011; Cox and Pinto, 2011). Saxe et al. (2011) showed that layers consisting of convolution followed by pooling naturally become frequency selective and translation invariant when assigned random weights. They argue that this provides an inexpensive way to choose the architecture of a convolutional network: first evaluate the performance of several convolutional network architectures by training only the last layer, then take the best of these architectures and train the entire architecture using a more expensive approach.
An intermediate approach is to learn the features, but with methods that do not require full forward and backward propagation at every gradient step. As with multilayer perceptrons, we use greedy layer-wise unsupervised pretraining to train the first layer in isolation, then extract all features from the first layer only once, then train the second layer in isolation given those features, and so on.
As with other approaches to unsupervised pretraining, it remains difficult to tease apart the cause of some of the benefits seen with this approach. Unsupervised pretraining may offer some regularization relative to supervised training, or it may simply allow us to train much larger architectures due to the reduced computational cost of the learning rule.