Sparse autoencoders are a way of automatically learning features from unlabeled data, and images are one such plentiful data source. This notebook demonstrates using a basic autoencoder to extract features from natural grayscale images, following the exercise in the Stanford UFLDL tutorial.
The original tutorial uses MATLAB; this Python implementation uses the numpy, scipy, and matplotlib libraries. Code for the autoencoder class and all related functions is available here.
The network architecture consists of one hidden layer between the input and output layers. Unlike in a typical neural network, the input and output layers hold the same data: the idea is that, due to certain constraints on the hidden layer, the network is forced to learn a compressed representation of the input. The two constraints used here are choosing a small number of hidden units and forcing the average activations of those units to be close to zero. A detailed explanation is available on this UFLDL page.
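To make the sparsity constraint concrete: the tutorial penalizes the KL divergence between a small target activation rho and each hidden unit's average activation. A minimal numpy sketch of that penalty (the function name is hypothetical, and rho = 0.01, beta = 3 are the values suggested in the UFLDL exercise, not necessarily the defaults used here):

import numpy as np

def kl_sparsity_penalty(hidden_acts, rho=0.01, beta=3.0):
    # hidden_acts: (hSize, m) sigmoid activations for m training examples
    rho_hat = hidden_acts.mean(axis=1)  # observed average activation per hidden unit
    kl = (rho * np.log(rho / rho_hat)
          + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
    return beta * kl.sum()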
The data provided is a set of ten 512x512-pixel grayscale images of nature. We look at a couple of these:
from sparseae_functions import *
from sparseae import *
show_full_images(nrows=1,ncols=2)
The scenery is a little hard to see, but there are rocks and a tree trunk in the images.
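Loading and displaying the images takes only a few lines; a sketch of what show_full_images might do, assuming the UFLDL IMAGES.mat file with its 512x512x10 IMAGES array:

import scipy.io
import matplotlib.pyplot as plt

images = scipy.io.loadmat('IMAGES.mat')['IMAGES']  # shape (512, 512, 10)
fig, axes = plt.subplots(nrows=1, ncols=2)
for ax, i in zip(axes, (0, 1)):
    ax.imshow(images[:, :, i], cmap='gray')
    ax.axis('off')
plt.show()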
We will be using 10000 8x8 pixel patches to train the autoencoder. Let's sample some random patches from the set, and look at a few:
patches = sample_images(numpatches=10000,patchsize=8)
display_network(patches,nrows=4,ncols=6)
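Sampling amounts to picking random top-left corners in random images and flattening each patch into a column vector. A minimal sketch (sample_patches is a hypothetical stand-in for sample_images, assuming the images are stacked as in IMAGES.mat):

import numpy as np

def sample_patches(images, numpatches=10000, patchsize=8, seed=0):
    # images: (512, 512, num_images) grayscale array
    rng = np.random.default_rng(seed)
    patches = np.empty((patchsize ** 2, numpatches))
    for i in range(numpatches):
        img = images[:, :, rng.integers(images.shape[2])]
        r = rng.integers(img.shape[0] - patchsize + 1)  # random top-left row
        c = rng.integers(img.shape[1] - patchsize + 1)  # random top-left column
        patches[:, i] = img[r:r + patchsize, c:c + patchsize].ravel()
    return patches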
The data is preprocessed before learning. Normalizing the images involves subtracting the mean, truncating to +/- 3 standard deviations, and scaling to [0.1, 0.9]. The same patches are shown after normalizing:
norm_patches = normalize_data(patches)
display_network(norm_patches,nrows=4,ncols=6)
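The normalization step is small enough to sketch in full; this follows the normalizeData routine from the UFLDL starter code, so the details may differ slightly from normalize_data here:

import numpy as np

def normalize(patches):
    # patches: (patchsize**2, numpatches); each column is one patch
    patches = patches - patches.mean(axis=0)        # remove the per-patch mean
    pstd = 3 * patches.std()
    patches = np.clip(patches, -pstd, pstd) / pstd  # truncate to +/- 3 std, now in [-1, 1]
    return (patches + 1) * 0.4 + 0.1                # rescale to [0.1, 0.9]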
The data is now ready for training. We initialize a network using default parameters. Since we're training on 8x8 patches, the input vector size is 64. The hidden layer here is chosen to consist of 25 units. This implementation uses sigmoid activation functions.
sae = sparseae()
print(sae.vSize)  # size of input and output layers
print(sae.hSize)  # size of hidden layer
64
25
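A sketch of the forward pass through such a network, with hypothetical parameter names (W1, b1 map the input to the hidden layer; W2, b2 map the hidden layer back to the output):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(W1, b1, W2, b2, x):
    # x: (64, m) batch of patches as columns
    a2 = sigmoid(W1 @ x + b1[:, None])   # hidden activations, shape (25, m)
    a3 = sigmoid(W2 @ a2 + b2[:, None])  # reconstruction of x, shape (64, m)
    return a2, a3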
We now train the encoder (this takes about 2 minutes on an 8-core, 16 GB RAM Ubuntu machine). The train function uses the L-BFGS implementation in scipy.optimize:
weights = sae.train(norm_patches)
display_weights(arr = weights, nrows=5, ncols=5, vSide = sae.vSide)
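Under the hood, the call into scipy presumably reduces to something like the following, where cost_and_grad and theta0 (the flattened weights and biases) are hypothetical names, and the 400-iteration budget is the one suggested in the UFLDL exercise:

from scipy.optimize import minimize

# cost_and_grad(theta) -> (cost, gradient); jac=True tells scipy that the
# objective returns the gradient along with the cost
result = minimize(cost_and_grad, theta0, jac=True,
                  method='L-BFGS-B', options={'maxiter': 400})
theta_opt = result.x  # optimized flattened parameters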
Looking at the trained weights, we see that the network has learned a set of edge detectors. Each of these images maximally activates its corresponding hidden unit, so, given new input, the hidden activations should produce a (hopefully) better representation of the data. The activations (possibly alongside the original input) can then be fed to a classifier.
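That feature-extraction step is just the first half of the forward pass; a sketch, again with hypothetical weight names:

import numpy as np

def extract_features(W1, b1, patches):
    # hidden-layer activations serve as learned features for a classifier
    acts = 1.0 / (1.0 + np.exp(-(W1 @ patches + b1[:, None])))
    return np.vstack([patches, acts])  # raw pixels stacked with the learned features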
As another example, let's see what happens when we train a network on 20000 12x12 patches (this takes about 16 minutes to run).
patches2 = sample_images(numpatches=20000,patchsize=12)
patches2 = normalize_data(patches2)
sae2 = sparseae(vSide = 12)
weights2 = sae2.train(patches2)
display_weights(arr = weights2, nrows=5, ncols=5, vSide = sae2.vSide)
The resulting weights are 12x12, and look similar to the 8x8 weights.
To demonstrate the importance of preprocessing, we train another autoencoder on the original (unprocessed) 8x8 patches:
sae3 = sparseae()
weights3 = sae3.train(patches)
display_weights(arr = weights3, nrows=5, ncols=5, vSide = sae3.vSide)
There is no apparent structure in the weights; it looks like the encoder hasn't learned anything useful, likely because the raw patches are not scaled into the (0, 1) range that a sigmoid output layer can reproduce.