Credit cs231n.stanford.edu

Do not worry, if you do not understand everything here. Your task now is to use documentation and to understand basics of PyTorch syntax.

- It is OK
**not to know**these: batch, convolution, fully-connected network, Dense layer (you will learn about these at the course). - You
**should know**these: dataset, model, training, weights, accuracy, gradient.**In this case**, we reccomend you to familiarize yourself with machine learning basics (ODS course, Andrew Ng course, fast.ai course, Yandex Coursera Specialization) or at any other place you might like.

We use image classification example for it to be simplier in terms of preprocessing and numerical representation compared to texts (that we will learn at the course)

PyTorch is a system for executing dynamic computational graphs over Tensor objects that behave similarly as numpy ndarray. It comes with a powerful automatic differentiation engine that removes the need for manual back-propagation.

Follow the detailed PyTorch installation instructions on the official site.

Justin Johnson has made an excellent tutorial for PyTorch.

You can also find the detailed API doc here. If you have other questions that are not addressed by the API docs, the PyTorch forum is a much better place to ask than StackOverflow, however PyTorch.org might be blocked by some Internet provides.

This assignment has 4 parts. You will learn PyTorch on different levels of abstractions.

- Preparation: we will use CIFAR-10 dataset.
- Barebones PyTorch: we will work directly with the lowest-level PyTorch Tensors.
- PyTorch Module API: we will use
`nn.Module`

to define arbitrary neural network architecture. - PyTorch Sequential API: we will use
`nn.Sequential`

to define a linear feed-forward network very conveniently.

Here is a table of comparison:

API | Flexibility | Convenience |
---|---|---|

Barebone | High | Low |

`nn.Module` |
High | Medium |

`nn.Sequential` |
Low | High |

First, we load the CIFAR-10 dataset. This might take a couple minutes the first time you do it, but the files should stay cached after that.

In [ ]:

```
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torch.utils.data import sampler
import torchvision.datasets as dset
import torchvision.transforms as T
import numpy as np
```

In [ ]:

```
NUM_TRAIN = 49000
# The torchvision.transforms package provides tools for preprocessing data
# and for performing data augmentation; here we set up a transform to
# preprocess the data by subtracting the mean RGB value and dividing by the
# standard deviation of each RGB value; we've hardcoded the mean and std.
transform = T.Compose([
T.ToTensor(),
T.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
])
# We set up a Dataset object for each split (train / val / test); Datasets load
# training examples one at a time, so we wrap each Dataset in a DataLoader which
# iterates through the Dataset and forms minibatches. We divide the CIFAR-10
# training set into train and val sets by passing a Sampler object to the
# DataLoader telling how it should sample from the underlying Dataset.
cifar10_train = dset.CIFAR10('./datasets', train=True, download=True,
transform=transform)
loader_train = DataLoader(cifar10_train, batch_size=64,
sampler=sampler.SubsetRandomSampler(range(NUM_TRAIN)))
cifar10_val = dset.CIFAR10('./datasets', train=True, download=True,
transform=transform)
loader_val = DataLoader(cifar10_val, batch_size=64,
sampler=sampler.SubsetRandomSampler(range(NUM_TRAIN, 50000)))
cifar10_test = dset.CIFAR10('./datasets', train=False, download=True,
transform=transform)
loader_test = DataLoader(cifar10_test, batch_size=64)
```

You have an option to **use GPU by setting the flag to True below**. It is not necessary to use GPU for this assignment. Note that if your computer does not have CUDA enabled, `torch.cuda.is_available()`

will return False and this notebook will fallback to CPU mode.

The global variables `dtype`

and `device`

will control the data types throughout this assignment.

In [ ]:

```
USE_GPU = True
dtype = torch.float32 # we will be using float throughout this tutorial
if USE_GPU and torch.cuda.is_available():
device = torch.device('cuda')
else:
device = torch.device('cpu')
# Constant to control how frequently we print train loss
print_every = 100
print('using device:', device)
```

PyTorch ships with high-level APIs to help us define model architectures conveniently, which we will cover in Part II of this tutorial. In this section, we will start with the barebone PyTorch elements to understand the autograd engine better. After this exercise, you will come to appreciate the high-level model API more.

We will start with a simple fully-connected ReLU network with two hidden layers and no biases for CIFAR classification. This implementation computes the forward pass using operations on PyTorch Tensors, and uses PyTorch autograd to compute gradients. It is important that you understand every line, because you will write a harder version after the example.

When we create a PyTorch Tensor with `requires_grad=True`

, then operations involving that Tensor will not just compute values; they will also build up a computational graph in the background, allowing us to easily backpropagate through the graph to compute gradients of some Tensors with respect to a downstream loss. Concretely if x is a Tensor with `x.requires_grad == True`

then after backpropagation `x.grad`

will be another Tensor holding the gradient of x with respect to the scalar loss at the end.

A PyTorch Tensor is conceptionally similar to a numpy array: it is an n-dimensional grid of numbers, and like numpy PyTorch provides many functions to efficiently operate on Tensors. As a simple example, we provide a `flatten`

function below which reshapes image data for use in a fully-connected neural network.

Recall that image data is typically stored in a Tensor of shape N x C x H x W, where:

- N is the number of datapoints
- C is the number of channels
- H is the height of the intermediate feature map in pixels
- W is the height of the intermediate feature map in pixels

This is the right way to represent the data when we are doing something like a 2D convolution, that needs spatial understanding of where the intermediate features are relative to each other. When we use fully connected affine layers to process the image, however, we want each datapoint to be represented by a single vector -- it's no longer useful to segregate the different channels, rows, and columns of the data. So, we use a "flatten" operation to collapse the `C x H x W`

values per representation into a single long vector. The flatten function below first reads in the N, C, H, and W values from a given batch of data, and then returns a "view" of that data. "View" is analogous to numpy's "reshape" method: it reshapes x's dimensions to be N x ??, where ?? is allowed to be anything (in this case, it will be C x H x W, but we don't need to specify that explicitly).

In [ ]:

```
def flatten(x):
################################################################################
# TODO: Implement flatten function. #
################################################################################
x_flat = x
################################################################################
# END OF YOUR CODE #
################################################################################
return x_flat
def test_flatten():
x = torch.arange(12).view(2, 1, 3, 2)
print('Before flattening: ', x)
print('After flattening: ', flatten(x))
test_flatten()
```

Here we define a function `two_layer_fc`

which performs the forward pass of a two-layer fully-connected ReLU network on a batch of image data. After defining the forward pass we check that it doesn't crash and that it produces outputs of the right shape by running zeros through the network.

In [ ]:

```
import torch.nn.functional as F # useful stateless functions
def two_layer_fc(x, params):
"""
A fully-connected neural networks; the architecture is:
NN is fully connected -> ReLU -> fully connected layer.
Note that this function only defines the forward pass;
PyTorch will take care of the backward pass for us.
The input to the network will be a minibatch of data, of shape
(N, d1, ..., dM) where d1 * ... * dM = D. The hidden layer will have H units,
and the output layer will produce scores for C classes.
Inputs:
- x: A PyTorch Tensor of shape (N, d1, ..., dM) giving a minibatch of
input data.
- params: A list [w1, w2] of PyTorch Tensors giving weights for the network;
w1 has shape (D, H) and w2 has shape (H, C).
Returns:
- scores: A PyTorch Tensor of shape (N, C) giving classification scores for
the input data x.
"""
# first we flatten the image
x = flatten(x) # shape: [batch_size, C x H x W]
w1, w2 = params
# Forward pass: compute predicted y using operations on Tensors. Since w1 and
# w2 have requires_grad=True, operations involving these Tensors will cause
# PyTorch to build a computational graph, allowing automatic computation of
# gradients. Since we are no longer implementing the backward pass by hand we
# don't need to keep references to intermediate values.
# you can also use `.clamp(min=0)`, equivalent to F.relu()
################################################################################
# TODO: Implement the forward pass for the two-layer fully-connected network. #
# In more details: matrix multiply input x and weight w1, apply ReLu #
# nonlinearity (search the documentation for it, we discuss it in more detail #
# at the course) then matrix multiply result of the previous operation by w2 #
# and return the result. #
#
# Write each shape-chanding operation on a separate line and provide comment #
# with the current shape of the tensor #
################################################################################
pass
################################################################################
# END OF YOUR CODE #
################################################################################
return x
def two_layer_fc_test():
hidden_layer_size = 42
x = torch.zeros((64, 50), dtype=dtype) # minibatch size 64, feature dimension 50
w1 = torch.zeros((50, hidden_layer_size), dtype=dtype)
w2 = torch.zeros((hidden_layer_size, 10), dtype=dtype)
if str(device) == "cuda":
x, w1, w2 = x.to("cuda"), w1.to("cuda"), w2.to("cuda")
scores = two_layer_fc(x, [w1, w2])
print(scores.size()) # you should see [64, 10]
two_layer_fc_test()
```

Let's write a couple utility methods to initialize the weight matrices for our models.

`random_weight(shape)`

initializes a weight tensor with the Kaiming normalization method.`zero_weight(shape)`

initializes a weight tensor with all zeros. Useful for instantiating bias parameters.

The `random_weight`

function uses the Kaiming normal initialization method, described in:

He et al, *Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification*, ICCV 2015, https://arxiv.org/abs/1502.01852

In [ ]:

```
def random_weight(shape):
"""
Create random Tensors for weights; setting requires_grad=True means that we
want to compute gradients for these Tensors during the backward pass.
We use Kaiming normalization: sqrt(2 / fan_in)
"""
if len(shape) == 2: # FC weight
fan_in = shape[0]
else:
fan_in = np.prod(shape[1:]) # conv weight [out_channel, in_channel, kH, kW]
# randn is standard normal distribution generator.
w = torch.randn(shape, device=device, dtype=dtype) * np.sqrt(2. / fan_in)
w.requires_grad = True
return w
def zero_weight(shape):
return torch.zeros(shape, device=device, dtype=dtype, requires_grad=True)
# create a weight of shape [3 x 5]
# you should see the type `torch.cuda.FloatTensor` if you use GPU.
# Otherwise it should be `torch.FloatTensor`
random_weight((3, 5))
```

When training the model we will use the following function to check the accuracy of our model on the training or validation sets.

When checking accuracy we don't need to compute any gradients; as a result we don't need PyTorch to build a computational graph for us when we compute scores. To prevent a graph from being built we scope our computation under a `torch.no_grad()`

context manager.

In [ ]:

```
def check_accuracy_part2(loader, model_fn, params):
"""
Check the accuracy of a classification model.
Inputs:
- loader: A DataLoader for the data split we want to check
- model_fn: A function that performs the forward pass of the model,
with the signature scores = model_fn(x, params)
- params: List of PyTorch Tensors giving parameters of the model
Returns: Nothing, but prints the accuracy of the model
"""
split = 'val' if loader.dataset.train else 'test'
print('Checking accuracy on the %s set' % split)
num_correct, num_samples = 0, 0
with torch.no_grad():
for x, y in loader:
x = x.to(device=device, dtype=dtype) # move to device, e.g. GPU
y = y.to(device=device, dtype=torch.int64)
scores = model_fn(x, params)
_, preds = scores.max(1)
num_correct += (preds == y).sum()
num_samples += preds.size(0)
acc = float(num_correct) / num_samples
print('Got %d / %d correct (%.2f%%)' % (num_correct, num_samples, 100 * acc))
```

We can now set up a basic training loop to train our network. We will train the model using stochastic gradient descent without momentum. We will use `torch.functional.cross_entropy`

to compute the loss; you can read about it here.

The training loop takes as input the neural network function, a list of initialized parameters (`[w1, w2]`

in our example), and learning rate.

In [ ]:

```
def train_part2(model_fn, params, learning_rate):
"""
Train a model on CIFAR-10.
Inputs:
- model_fn: A Python function that performs the forward pass of the model.
It should have the signature scores = model_fn(x, params) where x is a
PyTorch Tensor of image data, params is a list of PyTorch Tensors giving
model weights, and scores is a PyTorch Tensor of shape (N, C) giving
scores for the elements in x.
- params: List of PyTorch Tensors giving weights for the model
- learning_rate: Python scalar giving the learning rate to use for SGD
Returns: Nothing
"""
for t, (x, y) in enumerate(loader_train):
# Move the data to the proper device (GPU or CPU)
x = x.to(device=device, dtype=dtype)
y = y.to(device=device, dtype=torch.long)
# Forward pass: compute scores and loss
scores = model_fn(x, params)
loss = F.cross_entropy(scores, y)
# Backward pass: PyTorch figures out which Tensors in the computational
# graph has requires_grad=True and uses backpropagation to compute the
# gradient of the loss with respect to these Tensors, and stores the
# gradients in the .grad attribute of each Tensor.
loss.backward()
# Update parameters. We don't want to backpropagate through the
# parameter updates, so we scope the updates under a torch.no_grad()
# context manager to prevent a computational graph from being built.
with torch.no_grad():
for w in params:
w -= learning_rate * w.grad
# Manually zero the gradients after running the backward pass
w.grad.zero_()
if t % print_every == 0:
print('Iteration %d, loss = %.4f' % (t, loss.item()))
check_accuracy_part2(loader_val, model_fn, params)
print()
```

What is the line number in the cell above, where

- neural network is executed? (predictions on train data are made)
- gradient descent step is made?
- how convert scores into probability distribution?

In [ ]:

```
1. # YOUR ANSWER HERE
2. # YOUR ANSWER HERE
3. # YOUR ANSWER HERE
```

Now we are ready to run the training loop. We need to explicitly allocate tensors for the fully connected weights, `w1`

and `w2`

.

Each minibatch of CIFAR has 64 examples, so the tensor shape is `[64, 3, 32, 32]`

.

After flattening, `x`

shape should be `[64, 3 * 32 * 32]`

. This will be the size of the first dimension of `w1`

.
The second dimension of `w1`

is the hidden layer size, which will also be the first dimension of `w2`

.

Finally, the output of the network is a 10-dimensional vector that represents the probability distribution over 10 classes.

You don't need to tune any hyperparameters but you should see accuracies above 40% after training for one epoch.

In [ ]:

```
hidden_layer_size = 4000
learning_rate = 1e-2
w1 = random_weight((3 * 32 * 32, hidden_layer_size))
w2 = random_weight((hidden_layer_size, 10))
train_part2(two_layer_fc, [w1, w2], learning_rate)
```

Barebone PyTorch requires that we track all the parameter tensors by hand. This is fine for small networks with a few tensors, but it would be extremely inconvenient and error-prone to track tens or hundreds of tensors in larger networks.

PyTorch provides the `nn.Module`

API for you to define arbitrary network architectures, while tracking every learnable parameters for you. In Part II, we implemented SGD ourselves. PyTorch also provides the `torch.optim`

package that implements all the common optimizers, such as RMSProp, Adagrad, and Adam. It even supports approximate second-order methods like L-BFGS! You can refer to the doc for the exact specifications of each optimizer.

To use the Module API, follow the steps below:

Subclass

`nn.Module`

. Give your network class an intuitive name like`TwoLayerFC`

.In the constructor

`__init__()`

, define all the layers you need as class attributes. Layer objects like`nn.Linear`

are themselves`nn.Module`

subclasses and contain learnable parameters, so that you don't have to instantiate the raw tensors yourself.`nn.Module`

will track these internal parameters for you. Refer to the doc to learn more about the dozens of builtin layers.**Warning**: don't forget to call the`super().__init__()`

first!In the

`forward()`

method, define the*connectivity*of your network. You should use the attributes defined in`__init__`

as function calls that take tensor as input and output the "transformed" tensor. Do*not*create any new layers with learnable parameters in`forward()`

! All of them must be declared upfront in`__init__`

.

After you define your Module subclass, you can instantiate it as an object and call it just like the NN forward function in part II.

**Tip:** Dense, Linear and fully connected (fc) layers are synonims and mean the same thing.

**Tip 2:** Do **not** apply nonlinearity after the last layer of the network.

Here is a concrete example of a 2-layer fully connected network:

In [ ]:

```
class TwoLayerFC(nn.Module):
def __init__(self, input_size, hidden_size, num_classes):
super().__init__()
########################################################################
# TODO: Assign layer objects to class attributes. Use Linear layer #
########################################################################
self.fc1 = None
self.fc2 = None
self.relu = None
########################################################################
# END OF YOUR CODE #
########################################################################
# nn.init package contains convenient initialization methods
# http://pytorch.org/docs/master/nn.html#torch-nn-init
nn.init.kaiming_normal_(self.fc1.weight)
nn.init.kaiming_normal_(self.fc2.weight)
# we do not initialize ReLu, because it does not have parameters
def forward(self, x):
########################################################################
# TODO: Make fully-connected neural network just like before, but #
# using layer objects (self.fc1, self.fc2). Don't forget to flatten x! #
########################################################################
pass
########################################################################
# END OF YOUR CODE #
########################################################################
return x
def test_TwoLayerFC():
input_size = 50
x = torch.zeros((64, input_size), dtype=dtype) # minibatch size 64, feature dimension 50
model = TwoLayerFC(input_size, 42, 10)
scores = model(x)
print(scores.size()) # you should see [64, 10]
test_TwoLayerFC()
```

Given the validation or test set, we can check the classification accuracy of a neural network.

This version is slightly different from the one in part II. You don't manually pass in the parameters anymore.

In [ ]:

```
def check_accuracy_part34(loader, model):
if loader.dataset.train:
print('Checking accuracy on validation set')
else:
print('Checking accuracy on test set')
num_correct = 0
num_samples = 0
model.eval() # set model to evaluation mode
with torch.no_grad():
for x, y in loader:
x = x.to(device=device, dtype=dtype) # move to device, e.g. GPU
y = y.to(device=device, dtype=torch.long)
scores = model(flatten(x))
_, preds = scores.max(1)
num_correct += (preds == y).sum()
num_samples += preds.size(0)
acc = float(num_correct) / num_samples
print('Got %d / %d correct (%.2f)' % (num_correct, num_samples, 100 * acc))
```

We also use a slightly different training loop. Rather than updating the values of the weights ourselves, we use an Optimizer object from the `torch.optim`

package, which abstract the notion of an optimization algorithm and provides implementations of most of the algorithms commonly used to optimize neural networks.

In [ ]:

```
import torch.nn.functional as F
def train_part34(model, optimizer, epochs=1):
"""
Train a model on CIFAR-10 using the PyTorch Module API.
Inputs:
- model: A PyTorch Module giving the model to train.
- optimizer: An Optimizer object we will use to train the model
- epochs: (Optional) A Python integer giving the number of epochs to train for
Returns: Nothing, but prints model accuracies during training.
"""
model = model.to(device=device) # move the model parameters to CPU/GPU
for e in range(epochs):
for t, (x, y) in enumerate(loader_train):
model.train() # put model to training mode
x = x.to(device=device, dtype=dtype) # move to device, e.g. GPU
y = y.to(device=device, dtype=torch.long)
scores = model(flatten(x))
loss = F.cross_entropy(scores, y)
# Zero out all of the gradients for the variables which the optimizer
# will update.
optimizer.zero_grad()
# This is the backwards pass: compute the gradient of the loss with
# respect to each parameter of the model.
loss.backward()
# Actually update the parameters of the model using the gradients
# computed by the backwards pass.
optimizer.step()
if t % print_every == 0:
print('Iteration %d, loss = %.4f' % (t, loss.item()))
check_accuracy_part34(loader_val, model)
print()
```

What is the line number in the cell above, where

- neural network is executed? (predictions on train data are made)
- gradient descent step is made?

In [ ]:

```
1. # YOUR ANSWER HERE
2. # YOUR ANSWER HERE
```

Now we are ready to run the training loop. In contrast to part II, we don't explicitly allocate parameter tensors anymore.

Simply pass the input size, hidden layer size, and number of classes (i.e. output size) to the constructor of `TwoLayerFC`

.

You also need to define an optimizer that tracks all the learnable parameters inside `TwoLayerFC`

.

You don't need to tune any hyperparameters, but you should see model accuracies above 40% after training for one epoch.

In [ ]:

```
hidden_layer_size = 4000
learning_rate = 1e-2
model = TwoLayerFC(3 * 32 * 32, hidden_layer_size, 10)
optimizer = optim.SGD(model.parameters(), lr=learning_rate)
train_part34(model, optimizer)
```

Part III introduced the PyTorch Module API, which allows you to define arbitrary learnable layers and their connectivity.

For simple models like a stack of feed forward layers, you still need to go through 3 steps: subclass `nn.Module`

, assign layers to class attributes in `__init__`

, and call each layer one by one in `forward()`

. Is there a more convenient way?

Fortunately, PyTorch provides a container Module called `nn.Sequential`

, which merges the above steps into one. It is not as flexible as `nn.Module`

, because you cannot specify more complex topology than a feed-forward stack, but it's good enough for many use cases.

Let's see how to rewrite our two-layer fully connected network example with `nn.Sequential`

, and train it using the training loop defined above.

Again, you don't need to tune any hyperparameters here, but you shoud achieve above 40% accuracy after one epoch of training.

In [ ]:

```
# We need to wrap `flatten` function in a module in order to stack it
# in nn.Sequential
class Flatten(nn.Module):
def forward(self, x):
return flatten(x)
hidden_layer_size = 4000
learning_rate = 1e-2
########################################################################
# TODO: use nn.Sequential to make the same network as before #
########################################################################
model = None
########################################################################
# END OF YOUR CODE #
########################################################################
optimizer = optim.SGD(model.parameters(), lr=learning_rate)
train_part34(model, optimizer)
```

The following cell shows the classification results on the test set. Stop running whenever you feel comfortable.

In [ ]:

```
from matplotlib import pyplot
for t, (x, y) in enumerate(loader_test):
scores = model(flatten(x))
_, preds = scores.max(1)
for i, img in enumerate(x):
pyplot.title('Label:'+str(y.tolist()[i])+' Pred:'+str(preds.tolist()[i]))
pyplot.imshow(x[i][0], interpolation='nearest', aspect='auto', cmap=pyplot.get_cmap('gray'))
pyplot.show()
```