# Assignment 2. Language modeling

This task is devoted to language modeling. Your goal is to implement an RNN-based language model in PyTorch. Since word-based language modeling requires long training and consumes a lot of memory due to the large vocabulary, we start with character-based language modeling: we train the model to generate words as sequences of characters, teaching it to predict the characters of the words in the training set.

## Task 1. Character-based language modeling: data preparation (15 points)

We train the language models on the data of the SIGMORPHON 2018 Shared Task. First, download the Russian datasets.

In [ ]:
!wget https://raw.githubusercontent.com/sigmorphon/conll2018/master/task1/surprise/russian-train-high
# dev and test splits from the same directory
!wget https://raw.githubusercontent.com/sigmorphon/conll2018/master/task1/surprise/russian-dev
!wget https://raw.githubusercontent.com/sigmorphon/conll2018/master/task1/surprise/russian-test


1.1 (1 point) Each file contains tab-separated triples <lemma>-<form>-<tags>, where <form> may contain spaces (e.g. будете соответствовать). Write a function that loads the list of all word forms that do not contain spaces.

In [ ]:
def read_infile(infile):
    """
    """
    return words
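A minimal sketch of such a loader, assuming the three-column tab-separated format described above:

```python
def read_infile(infile):
    """Read <lemma>\t<form>\t<tags> triples; return the forms without spaces."""
    words = []
    with open(infile, "r", encoding="utf-8") as fin:
        for line in fin:
            line = line.strip("\n")
            if not line:
                continue  # skip blank lines
            lemma, form, tags = line.split("\t")
            if " " not in form:
                words.append(form)
    return words
```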

In [ ]:
train_words = read_infile("russian-train-high")
dev_words = read_infile("russian-dev")
test_words = read_infile("russian-test")
print(len(train_words), len(dev_words), len(test_words))
print(*train_words[:10])


1.2 (2 points) Write a Vocabulary class that transforms symbols into their indices. The class should have a __call__ method that applies this transformation to sequences of symbols as well as to batches of sequences. You may also use SimpleVocabulary from DeepPavlov. Fit an instance of this class on the training data.

In [ ]:
from deeppavlov.core.data.simple_vocab import SimpleVocabulary
"""
"""

vocab = # == YOUR CODE HERE ==
vocab.fit([list(x) for x in train_words])
print(len(vocab))
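If you do not want to depend on DeepPavlov, a hand-rolled vocabulary along these lines would do. The special symbols and their order are an assumption, chosen so that BEGIN=1 and END=2 match the start_index/end_index defaults used in the generation functions later on:

```python
class Vocabulary:
    """Symbol <-> index mapping; assumed specials: 0=PAD, 1=BEGIN, 2=END, 3=UNK."""

    def __init__(self):
        self.specials = ["<PAD>", "<BEGIN>", "<END>", "<UNK>"]

    def fit(self, sequences):
        symbols = sorted({s for seq in sequences for s in seq})
        self.idx2sym = self.specials + symbols
        self.sym2idx = {s: i for i, s in enumerate(self.idx2sym)}
        return self

    def __call__(self, data):
        # a single symbol maps to its index; otherwise recurse, so both
        # sequences and batches of sequences are handled uniformly
        if isinstance(data, str) and len(data) == 1:
            return self.sym2idx.get(data, self.sym2idx["<UNK>"])
        return [self(x) for x in data]

    def __len__(self):
        return len(self.idx2sym)
```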


1.3 (2 points) Write a Dataset class, which should inherit from torch.utils.data.Dataset. It should take the list of words and the vocabulary as initialization arguments.

In [ ]:
import torch
from torch.utils.data import Dataset as TorchDataset

class Dataset(TorchDataset):

    def __init__(self, data, vocab):
        self.data = data
        self.vocab = vocab

    def __getitem__(self, index):
        """
        Returns one tensor pair (source and target). The source tensor corresponds to the input word,
        with "BEGIN" and "END" symbols attached. The target tensor should contain the answers
        for the language model that receives this word as input.
        """

    def __len__(self):
        """
        """

In [ ]:
train_dataset = Dataset(train_words, vocab)
dev_dataset = Dataset(dev_words, vocab)
test_dataset = Dataset(test_words, vocab)


1.4 (3 points) Use a standard torch.utils.data.DataLoader to obtain an iterable over batches. Print the shapes of the first 10 input batches with batch_size=1.

In [ ]:
from torch.utils.data import DataLoader

"""
"""


(1.5) 1 point Explain why this does not work with a larger batch size.
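To see the failure concretely: the default collate function stacks the sample tensors, and words of different lengths produce tensors of different shapes, so the stack fails (default_collate is importable from torch.utils.data in recent PyTorch versions):

```python
import torch
from torch.utils.data import default_collate

short = torch.tensor([1, 5, 2])       # a 3-symbol word
longer = torch.tensor([1, 7, 8, 2])   # a 4-symbol word
try:
    default_collate([short, longer])  # stacking requires equal shapes
except RuntimeError as err:
    print("collate failed:", err)
```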

(1.6) 5 points Write a collate function that allows you to deal with batches of greater size. See the discussion for an example. Implement your function as a class with a __call__ method to make it more flexible.

In [ ]:
def pad_tensor(vec, length, dim, pad_symbol):
    """
    Pads the tensor vec up to size length along axis dim with the symbol pad_symbol.
    """

class Padder:

    def __init__(self, dim=0, pad_symbol=0):
        self.dim = dim
        self.pad_symbol = pad_symbol

    def __call__(self, batch):
        """
        """


(1.7) 1 point Again, use torch.utils.data.DataLoader to obtain an iterable over batches. Print the shapes of the first 10 input batches with a batch size of your choice.

In [ ]:
from torch.utils.data import DataLoader

"""
"""


## Task 2. Character-based language modeling (35 points)

2.1 (5 points) Write a network that performs language modeling. It should include three layers:

1. Embedding layer that transforms input symbols into vectors.
2. An RNN layer that outputs a sequence of hidden states (you may use https://pytorch.org/docs/stable/nn.html#gru).
3. A Linear layer with softmax activation that produces the output distribution for each symbol.

In [ ]:
import torch.nn as nn

class RNNLM(nn.Module):

    def __init__(self, vocab_size, embeddings_dim, hidden_size):
        super(RNNLM, self).__init__()
        """
        """

    def forward(self, inputs, hidden=None):
        """
        """


2.2 (1 point) Write a function validate_on_batch that takes as input a model, a loss criterion, a batch of inputs and a batch of outputs, and returns the loss tensor for the whole batch. This loss should not be normalized.

In [ ]:
def validate_on_batch(model, criterion, x, y):
    """
    """


2.3 (1 point) Write a function train_on_batch that accepts all the arguments of validate_on_batch plus an optimizer, computes the loss and makes a single step of gradient optimization. This function should call validate_on_batch inside.

In [ ]:
def train_on_batch(model, criterion, x, y, optimizer):
    """
    """


2.4 (3 points) Write a training loop. You should define your RNNLM model, the criterion, the optimizer and the hyperparameters (number of epochs and batch size). Then train the model for a required number of epochs. On each epoch evaluate the average training loss and the average loss on the validation set.

2.5 (3 points) Do not forget to average your loss over non-padding symbols only; otherwise it will be too optimistic.

In [ ]:
"""
"""


2.6 (5 points) Write a function predict_on_batch that outputs letter probabilities of all words in the batch.

In [ ]:
"""
"""


2.7 (1 point) Calculate the letter probabilities for all words in the test dataset and print them for the last 20 words. Do not forget to disable shuffling in the DataLoader.

In [ ]:
"""
"""


2.8 (5 points) Write a function that generates a single word (sequence of indexes) given the model. Do not forget about the hidden state! Be careful about start and end symbol indexes. Use torch.multinomial for sampling.

In [ ]:
def generate(model, max_length=20, start_index=1, end_index=2):
    """
    """


2.9 (1 point) Use generate to sample 20 pseudowords. Do not forget to transform indices to letters.

In [ ]:
for i in range(20):
    """
    """


(2.10) 5 points Write a batched version of the generation function. You should sample the next symbol only for words that are not finished yet, so apply a boolean mask to track the active words.

In [ ]:
def generate_batch(model, batch_size, max_length=20, start_index=1, end_index=2):
    """
    """

In [ ]:
generated = []
for _ in range(2):
    generated += generate_batch(model, batch_size=10)
"""
"""