Preprocessing is not the most exciting part of NLP, but it is still one of the most important ones. Your task is to preprocess raw text (you can use your own, or this one). For this task, text preprocessing mostly consists of tokenizing the text and building a vocabulary of word indices; you can use collections.Counter for that. ATTN: if you use your own data, please attach a download link.
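For instance, a minimal vocabulary-building sketch could look like the following (the function name and the min_count parameter are illustrative, not prescribed by the task):

```python
from collections import Counter

def build_vocab(tokens, min_count=1):
    """Build word <-> index mappings from a list of tokens.
    min_count is a hypothetical threshold for dropping rare words."""
    counts = Counter(tokens)
    # Most frequent words get the smallest indices.
    vocab = [w for w, c in counts.most_common() if c >= min_count]
    word2idx = {w: i for i, w in enumerate(vocab)}
    idx2word = {i: w for w, i in word2idx.items()}
    return word2idx, idx2word

tokens = "first used against early working class radicals including".split()
word2idx, idx2word = build_vocab(tokens)
indices = [word2idx[w] for w in tokens]
```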
Your goal is to make a Batcher class which returns two numpy tensors with word indices. It should be possible to use it for word2vec training. You can implement a batcher for either the Skip-Gram or the CBOW architecture; the picture below can help you remember the difference.
There are several ways to do it right. The shapes could be x_batch.shape = (batch_size, 2*window_size), y_batch.shape = (batch_size,) for CBOW, or x_batch.shape = (batch_size,), y_batch.shape = (batch_size, 2*window_size) for Skip-Gram. You should not do negative sampling here.
The batcher should be adequately parametrized: CBOW(window_size, ...) or SkipGram(window_size, ...). You should implement only one batcher in this task, and it is up to you which one to choose; see the sketch below for one possible shape of such a class.
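A minimal CBOW sketch, assuming the text has already been converted to a Python list of word indices (the class name, the iterator design, and the constructor arguments beyond window_size are illustrative choices, not requirements):

```python
import numpy as np

class CBOWBatcher:
    """Yields (contexts, centers) index batches for CBOW training."""

    def __init__(self, indices, window_size=2, batch_size=4):
        self.indices = indices          # text already mapped to word indices (a list of ints)
        self.window_size = window_size
        self.batch_size = batch_size

    def __iter__(self):
        w, n = self.window_size, len(self.indices)
        xs, ys = [], []
        for center in range(w, n - w):
            # 2*window_size context indices around the center word
            context = (self.indices[center - w:center] +
                       self.indices[center + 1:center + w + 1])
            xs.append(context)
            ys.append(self.indices[center])
            if len(xs) == self.batch_size:
                # shapes: (batch_size, 2*window_size) and (batch_size,)
                yield np.array(xs), np.array(ys)
                xs, ys = [], []
```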
You can write the code in this notebook or in a separate file; it can be reused for the next task. The result of your work should demonstrate that your batch has proper structure (right shapes) and content (the words should come from one context, not be random indices). To show that, translate the indices back to words and print them, producing something like this:
text = ['first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including']
window_size = 2
# CBOW:
indices_to_words(x_batch) = \
[['first', 'used', 'early', 'working'],
['used', 'against', 'working', 'class'],
['against', 'early', 'class', 'radicals'],
['early', 'working', 'radicals', 'including']]
indices_to_words(labels_batch) = ['against', 'early', 'working', 'class']
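One way to do the index-to-word translation shown above is sketched below; it assumes the idx2word dictionary from the vocabulary step and takes it as an explicit second argument, unlike the single-argument call in the example:

```python
def indices_to_words(batch, idx2word):
    """Map a numpy batch of indices back to words using the idx2word dictionary."""
    if batch.ndim == 1:
        # labels batch, shape (batch_size,)
        return [idx2word[i] for i in batch]
    # contexts batch, shape (batch_size, 2*window_size)
    return [[idx2word[i] for i in row] for row in batch]

batcher = CBOWBatcher(indices, window_size=2, batch_size=4)
for x_batch, labels_batch in batcher:
    print(indices_to_words(x_batch, idx2word))
    print(indices_to_words(labels_batch, idx2word))
```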