# cs231n 2019. A1, part 3. Softmax exercise

#### Solution by Yury Kashnitsky (@yorko)

Complete and hand in this completed worksheet (including its outputs and any supporting code outside of the worksheet) with your assignment submission. For more details see the assignments page on the course website.

This exercise is analogous to the SVM exercise. You will:

- implement a fully-vectorized loss function for the Softmax classifier
- implement the fully-vectorized expression for its analytic gradient
- use a validation set to tune the learning rate and regularization strength
- optimize the loss function with SGD
- visualize the final learned weights

In :
import random
import numpy as np
from cs231n.data_utils import load_CIFAR10
from tqdm import tqdm_notebook
import matplotlib.pyplot as plt

%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'


In :
def get_CIFAR10_data(num_training=49000, num_validation=1000, num_test=1000, num_dev=500):
    """
    Load the CIFAR-10 dataset from disk and perform preprocessing to prepare
    it for the linear classifier. These are the same steps as we used for the
    SVM, but condensed to a single function.
    """
    # Load the raw CIFAR-10 data
    cifar10_dir = '/home/yorko/data/cifar-10-batches-py'
    X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)

    # subsample the data
    mask = list(range(num_training, num_training + num_validation))
    X_val = X_train[mask]
    y_val = y_train[mask]
    mask = list(range(num_training))
    X_train = X_train[mask]
    y_train = y_train[mask]
    mask = list(range(num_test))
    X_test = X_test[mask]
    y_test = y_test[mask]
    mask = np.random.choice(num_training, num_dev, replace=False)
    X_dev = X_train[mask]
    y_dev = y_train[mask]

    # Preprocessing: reshape the image data into rows
    X_train = np.reshape(X_train, (X_train.shape[0], -1))
    X_val = np.reshape(X_val, (X_val.shape[0], -1))
    X_test = np.reshape(X_test, (X_test.shape[0], -1))
    X_dev = np.reshape(X_dev, (X_dev.shape[0], -1))

    # Normalize the data: subtract the mean image
    mean_image = np.mean(X_train, axis=0)
    X_train -= mean_image
    X_val -= mean_image
    X_test -= mean_image
    X_dev -= mean_image

    # add bias dimension and transform into columns
    X_train = np.hstack([X_train, np.ones((X_train.shape[0], 1))])
    X_val = np.hstack([X_val, np.ones((X_val.shape[0], 1))])
    X_test = np.hstack([X_test, np.ones((X_test.shape[0], 1))])
    X_dev = np.hstack([X_dev, np.ones((X_dev.shape[0], 1))])

    return X_train, y_train, X_val, y_val, X_test, y_test, X_dev, y_dev

# Invoke the above function to get our data.
X_train, y_train, X_val, y_val, X_test, y_test, X_dev, y_dev = get_CIFAR10_data()
print('Train data shape: ', X_train.shape)
print('Train labels shape: ', y_train.shape)
print('Validation data shape: ', X_val.shape)
print('Validation labels shape: ', y_val.shape)
print('Test data shape: ', X_test.shape)
print('Test labels shape: ', y_test.shape)
print('dev data shape: ', X_dev.shape)
print('dev labels shape: ', y_dev.shape)

Train data shape:  (49000, 3073)
Train labels shape:  (49000,)
Validation data shape:  (1000, 3073)
Validation labels shape:  (1000,)
Test data shape:  (1000, 3073)
Test labels shape:  (1000,)
dev data shape:  (500, 3073)
dev labels shape:  (500,)


## Softmax Classifier

Your code for this section will all be written inside cs231n/classifiers/softmax.py.

In :
# First implement the naive softmax loss function with nested loops.
# Open the file cs231n/classifiers/softmax.py and implement the
# softmax_loss_naive function.

from cs231n.classifiers.softmax import softmax_loss_naive
import time

# Generate a random softmax weight matrix and use it to compute the loss.
W = np.random.randn(3073, 10) * 0.0001
loss, grad = softmax_loss_naive(W, X_dev, y_dev, 0.0)

# As a rough sanity check, our loss should be something close to -log(0.1).
print('loss: %f' % loss)
print('sanity check: %f' % (-np.log(0.1)))

loss: 2.364941
sanity check: 2.302585


### Inline Question #1

Why do we expect our loss to be close to -log(0.1)? Explain briefly.

$\color{blue}{\textit Your Answer:}$ There are ten classes, and with small random weights the scores are all nearly equal, so the softmax assigns a probability of $\approx 1/10 = 0.1$ to the correct class for every example. The average cross-entropy loss is therefore $\approx -\log(0.1)$.
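This sanity check can be reproduced numerically in a few lines (a toy sketch with all-zero scores standing in for "nearly equal" random scores):

```python
import numpy as np

# With 10 classes and identical scores, softmax assigns probability 1/10
# to every class, so the cross-entropy loss for any label is -log(0.1).
scores = np.zeros(10)                         # equal scores for all 10 classes
probs = np.exp(scores) / np.exp(scores).sum()
loss = -np.log(probs[3])                      # the label choice doesn't matter here
print(probs[0])                               # 0.1
print(np.isclose(loss, -np.log(0.1)))         # True
```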

In :
# Complete the implementation of softmax_loss_naive and implement a (naive)
# version of the gradient that uses nested loops.
loss, grad = softmax_loss_naive(W, X_dev, y_dev, 0.0)

# As we did for the SVM, use numeric gradient checking as a debugging tool.
# The numeric gradient should be close to the analytic gradient.
from cs231n.gradient_check import grad_check_sparse
f = lambda w: softmax_loss_naive(w, X_dev, y_dev, 0.0)[0]
grad_numerical = grad_check_sparse(f, W, grad, 10)

# similar to the SVM case, do another gradient check with regularization
loss, grad = softmax_loss_naive(W, X_dev, y_dev, 5e1)
f = lambda w: softmax_loss_naive(w, X_dev, y_dev, 5e1)[0]
grad_numerical = grad_check_sparse(f, W, grad, 10)

numerical: 4.414098 analytic: 4.414098, relative error: 8.407148e-09
numerical: 0.115917 analytic: 0.115917, relative error: 2.561369e-07
numerical: -1.337843 analytic: -1.337843, relative error: 2.223465e-10
numerical: 1.663433 analytic: 1.663433, relative error: 2.801221e-09
numerical: 0.345482 analytic: 0.345482, relative error: 2.811123e-08
numerical: -2.875655 analytic: -2.875655, relative error: 1.403804e-08
numerical: 3.905061 analytic: 3.905061, relative error: 7.269116e-09
numerical: -1.111623 analytic: -1.111623, relative error: 1.084694e-08
numerical: 1.578582 analytic: 1.578581, relative error: 1.587520e-08
numerical: -2.557901 analytic: -2.557901, relative error: 4.136910e-09
numerical: -0.080359 analytic: -0.080359, relative error: 3.252329e-07
numerical: -1.137684 analytic: -1.137684, relative error: 2.093039e-08
numerical: 0.979711 analytic: 0.979711, relative error: 9.203386e-08
numerical: 1.420139 analytic: 1.420139, relative error: 1.984016e-08
numerical: 3.553916 analytic: 3.553915, relative error: 1.270606e-08
numerical: -3.555908 analytic: -3.555908, relative error: 7.909795e-09
numerical: 0.037696 analytic: 0.037696, relative error: 2.082735e-07
numerical: -2.779842 analytic: -2.779842, relative error: 3.970456e-09
numerical: -1.119180 analytic: -1.119180, relative error: 2.000238e-09
numerical: -1.375382 analytic: -1.375382, relative error: 5.295108e-10
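The idea behind the sparse gradient check above can be sketched in a few lines. This is a simplified stand-in for cs231n's `grad_check_sparse` (the function name and signature here are illustrative, not the course's exact implementation): perturb a few random entries of `W` and compare the centered finite difference against the analytic gradient.

```python
import numpy as np

def grad_check_sparse(f, W, analytic_grad, num_checks=5, h=1e-5):
    """Compare numeric and analytic gradients at a few random coordinates."""
    for _ in range(num_checks):
        ix = tuple(np.random.randint(d) for d in W.shape)
        old = W[ix]
        W[ix] = old + h; fxph = f(W)   # f(x + h)
        W[ix] = old - h; fxmh = f(W)   # f(x - h)
        W[ix] = old                    # restore the original value
        grad_numerical = (fxph - fxmh) / (2 * h)
        grad_analytic = analytic_grad[ix]
        rel_error = abs(grad_numerical - grad_analytic) / (
            abs(grad_numerical) + abs(grad_analytic) + 1e-12)
        print('numerical: %f analytic: %f, relative error: %e'
              % (grad_numerical, grad_analytic, rel_error))

# Toy check on f(W) = sum(W**2), whose gradient is exactly 2*W.
W = np.random.randn(4, 3)
grad_check_sparse(lambda w: np.sum(w ** 2), W, 2 * W)
```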

In :
# Now that we have a naive implementation of the softmax loss function and its gradient,
# implement a vectorized version in softmax_loss_vectorized.
# The two versions should compute the same results, but the vectorized version should be
# much faster.
tic = time.time()
loss_naive, grad_naive = softmax_loss_naive(W, X_dev, y_dev, 0.000005)
toc = time.time()
print('naive loss: %e computed in %fs' % (loss_naive, toc - tic))

from cs231n.classifiers.softmax import softmax_loss_vectorized
tic = time.time()
loss_vectorized, grad_vectorized = softmax_loss_vectorized(W, X_dev, y_dev, 0.000005)
toc = time.time()
print('vectorized loss: %e computed in %fs' % (loss_vectorized, toc - tic))

# As we did for the SVM, we compare the two loss values directly
# (the gradients could likewise be compared with the Frobenius norm)
print('Loss difference: %f' % np.abs(loss_naive - loss_vectorized))

naive loss: 2.364941e+00 computed in 0.132490s
vectorized loss: 2.364941e+00 computed in 0.002477s
Loss difference: 0.000000
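For reference, a vectorized softmax loss along the lines of `softmax_loss_vectorized` might look like the sketch below. It is an assumption-laden illustration, not the graded solution: `X` is `(N, D)`, `W` is `(D, C)`, `y` holds integer labels in `[0, C)`, and the max-score shift keeps the exponentials numerically stable.

```python
import numpy as np

def softmax_loss_vectorized_sketch(W, X, y, reg):
    """Illustrative vectorized softmax loss and gradient (not the course code)."""
    N = X.shape[0]
    scores = X.dot(W)                            # (N, C) class scores
    scores -= scores.max(axis=1, keepdims=True)  # shift for numeric stability
    exp_scores = np.exp(scores)
    probs = exp_scores / exp_scores.sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(N), y]).mean() + reg * np.sum(W * W)
    dscores = probs.copy()
    dscores[np.arange(N), y] -= 1                # d(loss)/d(scores) for softmax
    dW = X.T.dot(dscores) / N + 2 * reg * W
    return loss, dW

# Sanity check: with tiny random weights the loss should be near -log(1/C).
np.random.seed(0)
X, y = np.random.randn(50, 8), np.random.randint(3, size=50)
W = 0.0001 * np.random.randn(8, 3)
loss, dW = softmax_loss_vectorized_sketch(W, X, y, 0.0)
print(loss, -np.log(1 / 3))
```

The gradient trick in `dscores` (subtracting 1 at the correct-class entries of the probability matrix) is what removes the nested loops of the naive version.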

In :
%%time
# Use the validation set to tune hyperparameters (regularization strength and
# learning rate). You should experiment with different ranges for the learning
# rates and regularization strengths; if you are careful you should be able to
# get a classification accuracy of over 0.35 on the validation set.
from cs231n.classifiers import Softmax
results = {}
best_val = -1
best_softmax_clf = None
learning_rates = np.linspace(3e-7, 5e-7, 3)
regularization_strengths = np.linspace(5e3, 5e4, 3)

################################################################################
# TODO:                                                                        #
# Use the validation set to set the learning rate and regularization strength. #
# This should be identical to the validation that you did for the SVM; save    #
# the best trained softmax classifier in best_softmax_clf.                    #
################################################################################
for lr in tqdm_notebook(learning_rates):
    for reg in tqdm_notebook(regularization_strengths):
        softmax_clf = Softmax()
        _ = softmax_clf.train(X_train, y_train, learning_rate=lr, reg=reg,
                              num_iters=1500, verbose=False)
        y_train_pred = softmax_clf.predict(X_train)
        train_acc = np.mean(y_train == y_train_pred)
        y_val_pred = softmax_clf.predict(X_val)
        val_acc = np.mean(y_val == y_val_pred)
        results[(lr, reg)] = (train_acc, val_acc)
        print('lr %e reg %e train accuracy: %f val accuracy: %f' % (
            lr, reg, train_acc, val_acc))
        if val_acc > best_val:
            best_val = val_acc
            best_softmax_clf = softmax_clf
################################################################################
#                              END OF YOUR CODE                                #
################################################################################

print('best validation accuracy achieved during cross-validation: %f' % best_val)

lr 3.000000e-07 reg 5.000000e+03 train accuracy: 0.372755 val accuracy: 0.384000
lr 3.000000e-07 reg 2.750000e+04 train accuracy: 0.321755 val accuracy: 0.331000
lr 3.000000e-07 reg 5.000000e+04 train accuracy: 0.300612 val accuracy: 0.320000

lr 4.000000e-07 reg 5.000000e+03 train accuracy: 0.375265 val accuracy: 0.382000
lr 4.000000e-07 reg 2.750000e+04 train accuracy: 0.327633 val accuracy: 0.335000
lr 4.000000e-07 reg 5.000000e+04 train accuracy: 0.298449 val accuracy: 0.313000

lr 5.000000e-07 reg 5.000000e+03 train accuracy: 0.375714 val accuracy: 0.386000
lr 5.000000e-07 reg 2.750000e+04 train accuracy: 0.318204 val accuracy: 0.335000
lr 5.000000e-07 reg 5.000000e+04 train accuracy: 0.293245 val accuracy: 0.305000

best validation accuracy achieved during cross-validation: 0.386000
CPU times: user 19min 26s, sys: 21.3 s, total: 19min 48s
Wall time: 4min 16s
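Since `results` maps each `(lr, reg)` pair to `(train_acc, val_acc)`, the best setting can also be recovered after the fact with a one-liner. The dict below is a small hypothetical excerpt shaped like the grid-search output above:

```python
# Hypothetical excerpt of the results dict built by the grid search:
results = {
    (3e-7, 5e3): (0.3728, 0.384),
    (4e-7, 5e3): (0.3753, 0.382),
    (5e-7, 5e3): (0.3757, 0.386),
}
# Pick the (lr, reg) pair with the highest validation accuracy (index 1).
best_lr_reg = max(results, key=lambda k: results[k][1])
print(best_lr_reg, results[best_lr_reg][1])   # (5e-07, 5000.0) 0.386
```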

In :
# Evaluate the best softmax on the test set
y_test_pred = best_softmax_clf.predict(X_test)
test_accuracy = np.mean(y_test == y_test_pred)
print('Softmax on raw pixels final test set accuracy: %f' % (test_accuracy, ))

Softmax on raw pixels final test set accuracy: 0.377000


### Inline Question 2 - True or False

Suppose the overall training loss is defined as the sum of the per-datapoint loss over all training examples. It is possible to add a new datapoint to a training set that would leave the SVM loss unchanged, but this is not the case with the Softmax classifier loss.

$\color{blue}{\textit Your Answer:}$ True

$\color{blue}{\textit Your Explanation:}$ The hinge (SVM) loss of a single datapoint is exactly zero once all its margins are satisfied, so adding such a point leaves the total SVM loss unchanged. The softmax (cross-entropy) loss, by contrast, is strictly positive for every datapoint, because the predicted probability of the correct class is always below 1, so adding any point increases the total loss.
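A toy numeric illustration (with made-up scores): a confidently classified point has zero multiclass hinge loss but still contributes a strictly positive softmax loss.

```python
import numpy as np

scores = np.array([10.0, 1.0, 0.5])    # hypothetical scores; correct class is 0
y = 0

# Multiclass hinge (SVM) loss with margin 1: all margins satisfied -> 0.
margins = np.maximum(0, scores - scores[y] + 1)
margins[y] = 0
svm_loss = margins.sum()                # 0.0

# Softmax cross-entropy loss: positive no matter how confident the scores.
probs = np.exp(scores - scores.max())
probs /= probs.sum()
softmax_loss = -np.log(probs[y])        # small, but strictly > 0
print(svm_loss, softmax_loss)
```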

In :
# Visualize the learned weights for each class
w = best_softmax_clf.W[:-1,:] # strip out the bias
w = w.reshape(32, 32, 3, 10)

w_min, w_max = np.min(w), np.max(w)

classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
for i in range(10):
    plt.subplot(2, 5, i + 1)

    # Rescale the weights to be between 0 and 255
    wimg = 255.0 * (w[:, :, :, i].squeeze() - w_min) / (w_max - w_min)
    plt.imshow(wimg.astype('uint8'))
    plt.axis('off')
    plt.title(classes[i])