#!/usr/bin/env python
# coding: utf-8

# ## Recommender Systems

# In this notebook we'll explore a "deep learning" approach to building a recommender system. I say that in quotes because this particular application doesn't actually involve a "deep" network at all. However, it does take advantage of the power of a modern computation framework like Keras to implement a recommender that performs at a very high level with minimal code. We'll try a couple of different approaches using a technique called [collaborative filtering](https://en.wikipedia.org/wiki/Collaborative_filtering). Finally we'll build a true neural network and see how it compares to the collaborative filtering approach.
#
# The data used for this task is the [MovieLens](http://files.grouplens.org/datasets/movielens/ml-latest-small.zip) data set. This content is inspired by the work Jeremy Howard did in his first [fast.ai course](https://course.fast.ai/).
#
# I've already saved the zip file to a local directory, so we can get started with some imports and reading in the ratings.csv file, which is where the data for this task comes from.

# In[2]:


get_ipython().run_line_magic('matplotlib', 'inline')

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

PATH = '/home/paperspace/data/ml-latest-small/'


# In[2]:


ratings = pd.read_csv(PATH + 'ratings.csv')
ratings.head()


# The data is tabular and consists of a user ID, a movie ID, and a rating (there's also a timestamp, but we won't use it for this task). Our task is to predict the rating for a user/movie pair, with the idea that if we had a model that's good at this task then we could predict how a user would rate movies they haven't seen yet and recommend movies with the highest predicted rating.
#
# The zip file also includes a listing of movies and their associated genres. We don't actually need this for the model but it's useful to know about.

# In[3]:


movies = pd.read_csv(PATH + 'movies.csv')
movies.head()


# To get a better sense of what the data looks like, we can turn it into a table by selecting the top 15 users/movies from the data and joining them together. The result shows how each of the top users rated each of the top movies.

# In[4]:


g = ratings.groupby('userId')['rating'].count()
top_users = g.sort_values(ascending=False)[:15]

g = ratings.groupby('movieId')['rating'].count()
top_movies = g.sort_values(ascending=False)[:15]

top_r = ratings.join(top_users, rsuffix='_r', how='inner', on='userId')
top_r = top_r.join(top_movies, rsuffix='_r', how='inner', on='movieId')

pd.crosstab(top_r.userId, top_r.movieId, top_r.rating, aggfunc=np.sum)


# To build our first collaborative filtering model, we need to take care of a few steps first. The user/movie fields are currently non-sequential integers representing some unique ID for that entity. We need them to be sequential integers starting at zero to use for modeling (you'll see why later). We can use scikit-learn's LabelEncoder class to transform the fields. We'll also create variables with the total number of unique users and movies in the data, as well as the min and max ratings present in the data, for reasons that will become apparent shortly.
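# Before running the real encoding, here's a tiny aside (a toy example on made-up IDs, not part of the MovieLens pipeline) showing what LabelEncoder does: it maps arbitrary IDs onto the contiguous range 0..n-1, which is exactly what an embedding layer expects as row indices.

# In[ ]:


# Toy illustration only -- the IDs below are made up.
toy_enc = LabelEncoder()
toy_enc.fit_transform([3, 17, 3, 9042])    # array([0, 1, 0, 2]): sorted unique IDs get sequential indices
toy_enc.inverse_transform([0, 1, 2])       # array([3, 17, 9042]): and we can map back to the raw IDs
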
# In[5]:


user_enc = LabelEncoder()
ratings['user'] = user_enc.fit_transform(ratings['userId'].values)
n_users = ratings['user'].nunique()

item_enc = LabelEncoder()
ratings['movie'] = item_enc.fit_transform(ratings['movieId'].values)
n_movies = ratings['movie'].nunique()

ratings['rating'] = ratings['rating'].values.astype(np.float32)
min_rating = min(ratings['rating'])
max_rating = max(ratings['rating'])

n_users, n_movies, min_rating, max_rating


# Create a traditional (X, y) pairing of data and label, then split the data into training and test sets.

# In[6]:


X = ratings[['user', 'movie']].values
y = ratings['rating'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape


# Another constant we'll need for the model is the number of factors per user/movie. This number can be whatever we want; however, for the collaborative filtering model it does need to be the same for both users and movies. In his class, Jeremy said he played around with different numbers and 50 seemed to work best, so we'll go with that.
#
# Finally, we need to turn the users and movies into separate arrays in the training and test data. This is because in Keras they'll each be defined as distinct inputs, and the way Keras works is that each input needs to be fed in as its own array.

# In[7]:


n_factors = 50

X_train_array = [X_train[:, 0], X_train[:, 1]]
X_test_array = [X_test[:, 0], X_test[:, 1]]


# Now we get to the model itself. The main idea here is that we're going to use embeddings to represent each user and each movie in the data. These embeddings will be vectors (of size n_factors) that start out as random numbers but are fit by the model to capture the essential qualities of each user/movie. To get a predicted rating, we simply compute the dot product between a user vector and a movie vector. The code is fairly simple; there isn't even a traditional neural network layer or activation involved. I stuck some regularization on the embedding layers and used a different initializer, but even that probably isn't necessary. Notice that this is where we need the number of unique users and movies, since those are required to define the size of each embedding matrix.

# In[8]:


from keras.models import Model
from keras.layers import Input, Reshape, Dot, Embedding
from keras.optimizers import Adam
from keras.regularizers import l2


def RecommenderV1(n_users, n_movies, n_factors):
    user = Input(shape=(1,))
    u = Embedding(n_users, n_factors, embeddings_initializer='he_normal',
                  embeddings_regularizer=l2(1e-6))(user)
    u = Reshape((n_factors,))(u)

    movie = Input(shape=(1,))
    m = Embedding(n_movies, n_factors, embeddings_initializer='he_normal',
                  embeddings_regularizer=l2(1e-6))(movie)
    m = Reshape((n_factors,))(m)

    x = Dot(axes=1)([u, m])

    model = Model(inputs=[user, movie], outputs=x)
    opt = Adam(lr=0.001)
    model.compile(loss='mean_squared_error', optimizer=opt)
    return model


# This is kind of a neat example of how flexible and powerful modern computation frameworks like Keras and PyTorch are. Even though these are billed as deep learning libraries, they have the building blocks to quickly create any computation graph you want and get automatic differentiation essentially for free. Below you can see that all of the parameters are in the embedding layers; we don't have any traditional neural net components at all.
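# Before printing the summary, here's a quick sanity check (an added aside, not in the original notebook): since Input, Reshape, and Dot contribute no weights, every parameter should live in the two embedding matrices, each of which has one row per entity and one column per factor.

# In[ ]:


# Expected parameter count: one (n_items x n_factors) matrix per entity type.
expected_params = n_users * n_factors + n_movies * n_factors
expected_params
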
# In[9]:


model = RecommenderV1(n_users, n_movies, n_factors)
model.summary()


# Let's go ahead and train this for a few epochs and see what we get.

# In[10]:


history = model.fit(x=X_train_array, y=y_train, batch_size=64, epochs=5,
                    verbose=1, validation_data=(X_test_array, y_test))


# Not bad for a first try. We can make some improvements, though. The first thing we can do is add a "bias" to each embedding. The concept is similar to the bias in a fully-connected layer or the intercept in a linear model; it just provides an extra degree of freedom. We can implement this idea using new embedding layers with a vector length of one. The bias embeddings get added to the result of the dot product.
#
# The second improvement we can make is running the output of the dot product through a sigmoid layer and then scaling the result using the min and max ratings in the data. This is a neat technique that introduces a non-linearity into the output and results in a modest performance bump.
#
# I also refactored the code a bit by pulling the embedding layer and reshape operation out into a separate class.

# In[11]:


from keras.layers import Add, Activation, Lambda


class EmbeddingLayer:
    def __init__(self, n_items, n_factors):
        self.n_items = n_items
        self.n_factors = n_factors

    def __call__(self, x):
        x = Embedding(self.n_items, self.n_factors, embeddings_initializer='he_normal',
                      embeddings_regularizer=l2(1e-6))(x)
        x = Reshape((self.n_factors,))(x)
        return x


def RecommenderV2(n_users, n_movies, n_factors, min_rating, max_rating):
    user = Input(shape=(1,))
    u = EmbeddingLayer(n_users, n_factors)(user)
    ub = EmbeddingLayer(n_users, 1)(user)

    movie = Input(shape=(1,))
    m = EmbeddingLayer(n_movies, n_factors)(movie)
    mb = EmbeddingLayer(n_movies, 1)(movie)

    x = Dot(axes=1)([u, m])
    x = Add()([x, ub, mb])
    x = Activation('sigmoid')(x)
    x = Lambda(lambda x: x * (max_rating - min_rating) + min_rating)(x)

    model = Model(inputs=[user, movie], outputs=x)
    opt = Adam(lr=0.001)
    model.compile(loss='mean_squared_error', optimizer=opt)
    return model


# The model summary shows the new graph. Notice the additional embedding layers with parameter counts equal to the unique user and movie counts.

# In[12]:


model = RecommenderV2(n_users, n_movies, n_factors, min_rating, max_rating)
model.summary()


# In[13]:


history = model.fit(x=X_train_array, y=y_train, batch_size=64, epochs=5,
                    verbose=1, validation_data=(X_test_array, y_test))


# Those two additions to the model resulted in a pretty sizable improvement. Validation error is now down to ~0.76, which is about as good as what Jeremy got (and, I believe, close to SOTA for this data set).
#
# That pretty much covers the conventional approach to solving this problem, but there's another way we can tackle it. Instead of taking the dot product of the embedding vectors, what if we just concatenated the embeddings together and stuck a fully-connected layer on top of them? It's still not technically "deep", but it would at least be a neural network! To modify the code, we can remove the bias embeddings from V2 and concatenate the embedding layers instead. Then we can add some dropout, insert a dense layer, and stick some dropout on the dense layer as well. Finally, we'll run it through a single-unit dense layer and keep the sigmoid trick at the end.
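# Before building it, here's a quick standalone look at the sigmoid trick we're carrying over from V2 (an added aside with made-up raw scores; it also assumes the MovieLens half-star scale, i.e. min_rating = 0.5 and max_rating = 5.0): the sigmoid squashes the raw score into (0, 1), and multiplying by (max_rating - min_rating) and adding min_rating stretches that into the observed rating range.

# In[ ]:


# Made-up raw scores just to illustrate the output scaling; the scaled output
# can never fall outside [min_rating, max_rating].
raw_scores = np.array([-4.0, 0.0, 4.0])
1 / (1 + np.exp(-raw_scores)) * (max_rating - min_rating) + min_rating    # approx. [0.58, 2.75, 4.92]
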
# In[14]:


from keras.layers import Concatenate, Dense, Dropout


def RecommenderNet(n_users, n_movies, n_factors, min_rating, max_rating):
    user = Input(shape=(1,))
    u = EmbeddingLayer(n_users, n_factors)(user)

    movie = Input(shape=(1,))
    m = EmbeddingLayer(n_movies, n_factors)(movie)

    x = Concatenate()([u, m])
    x = Dropout(0.05)(x)

    x = Dense(10, kernel_initializer='he_normal')(x)
    x = Activation('relu')(x)
    x = Dropout(0.5)(x)

    x = Dense(1, kernel_initializer='he_normal')(x)
    x = Activation('sigmoid')(x)
    x = Lambda(lambda x: x * (max_rating - min_rating) + min_rating)(x)

    model = Model(inputs=[user, movie], outputs=x)
    opt = Adam(lr=0.001)
    model.compile(loss='mean_squared_error', optimizer=opt)
    return model


# Most of the parameters are still in the embedding layers, but we have some added learning capacity from the dense layers.

# In[15]:


model = RecommenderNet(n_users, n_movies, n_factors, min_rating, max_rating)
model.summary()


# In[16]:


history = model.fit(x=X_train_array, y=y_train, batch_size=64, epochs=5,
                    verbose=1, validation_data=(X_test_array, y_test))


# Without doing any tuning at all, we still managed to get a result that's pretty close to the best performance we saw with the traditional approach. This technique has the added benefit that we can easily incorporate additional features into the model. For instance, we could create some date features from the timestamp or throw in the movie genres as a new embedding layer. We could also tune the sizes of the movie and user embeddings independently, since they no longer need to match. Lots of possibilities here.

# In[ ]:

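# As a closing illustration, here's a rough sketch (not part of the original
# notebook and not run against this exact data) of how the trained model could
# actually be used to make recommendations: score every movie a user hasn't
# rated and keep the highest predicted ratings. The helper name recommend_top_n
# and its details are assumptions, not an established API.

def recommend_top_n(model, user_id, n=10):
    # Map the raw userId to the sequential index the embeddings were trained on.
    user_idx = user_enc.transform([user_id])[0]

    # Candidate movies: every encoded movie index the user hasn't rated yet.
    seen = set(ratings.loc[ratings['user'] == user_idx, 'movie'])
    candidates = np.array([m for m in range(n_movies) if m not in seen])

    # Predict a rating for every (user, candidate movie) pair.
    user_array = np.full(len(candidates), user_idx)
    preds = model.predict([user_array, candidates], batch_size=4096).flatten()

    # Keep the top-n predictions and map the encoded indices back to movie titles.
    top = candidates[np.argsort(preds)[::-1][:n]]
    top_movie_ids = item_enc.inverse_transform(top)
    return movies[movies['movieId'].isin(top_movie_ids)][['movieId', 'title']]

# Example usage (assuming userId 1 exists in the ratings data):
# recommend_top_n(model, user_id=1, n=10)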