import matplotlib.pyplot as plt
import numpy as np
import sklearn.cluster
Load the data.
data = np.loadtxt('/tmp/Netflixishdata.txt', dtype=int)
print(list(zip(data.shape, ('customers', 'movies'))))
print(data.size, 'ratings')
[(200, 'customers'), (100, 'movies')]
20000 ratings
Plotting a histogram of all 20,000 ratings shows roughly 4,000 instances of each rating value, which is suspiciously uniform.
plt.hist(data.flatten(), bins=np.linspace(0.5, 5.5, 6))
plt.title("Histogram of all ratings")
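The histogram's uniformity can also be tallied exactly with `np.bincount`. A minimal sketch, run on a synthetic stand-in matrix since the original `/tmp/Netflixishdata.txt` isn't reproduced here:

```python
import numpy as np

# Hypothetical stand-in for the 200x100 ratings matrix
# (the real data comes from /tmp/Netflixishdata.txt).
rng = np.random.default_rng(0)
data = rng.integers(1, 6, size=(200, 100))

# bincount gives an exact count of each rating value, 0 through 5.
counts = np.bincount(data.flatten(), minlength=6)
for rating, count in enumerate(counts):
    print(rating, 'appears', count, 'times')
```

Unlike a histogram, this leaves no ambiguity about bin edges, which is how the stray out-of-range value below becomes visible.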
Looking even closer shows that there's actually one rating of 0!
for i in range(6):
    print((data == i).sum(), "ratings equal to", i)
1 ratings equal to 0
4000 ratings equal to 1
4000 ratings equal to 2
4000 ratings equal to 3
4000 ratings equal to 4
3999 ratings equal to 5
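To see which customer and movie the stray 0 belongs to, `np.argwhere` returns the indices of every matching entry. A sketch on synthetic data; the planted location (42, 7) is made up for illustration:

```python
import numpy as np

# Hypothetical stand-in matrix with one 0 planted to mimic the anomaly.
rng = np.random.default_rng(0)
data = rng.integers(1, 6, size=(200, 100))
data[42, 7] = 0  # planted anomaly; the real location would come from the data

# argwhere returns an array of (row, column) index pairs where the
# condition holds, i.e. (customer, movie) pairs here.
locations = np.argwhere(data == 0)
for customer, movie in locations:
    print('customer', customer, 'gave movie', movie, 'a rating of 0')
```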
Next I average over customers (axis 0) to get each movie's mean rating, and over movies (axis 1) to get each customer's mean rating, then plot both distributions.
fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.hist(np.mean(data, axis=0), bins=25)
ax1.set_title("Average Movie Ratings")
ax2.hist(np.mean(data, axis=1), bins=25)
ax2.set_title("Average Customer Ratings")
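The same per-movie means can also rank the movies from worst to best via `np.argsort`. A sketch, again on a synthetic stand-in matrix rather than the original file:

```python
import numpy as np

# Hypothetical stand-in for the 200x100 ratings matrix.
rng = np.random.default_rng(0)
data = rng.integers(1, 6, size=(200, 100))

movie_means = data.mean(axis=0)   # average rating per movie (length 100)
order = np.argsort(movie_means)   # movie indices, ascending by mean rating

print('lowest-rated movie:  index', order[0],
      'mean', round(movie_means[order[0]], 2))
print('highest-rated movie: index', order[-1],
      'mean', round(movie_means[order[-1]], 2))
```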
Finally, I use K-means to cluster the movies into a varying number of clusters and report the inertia, an overall measure of how close the data points are to their centroids (lower means a tighter fit). Inertia can always be decreased by adding clusters (setting k equal to the number of data points gives a perfect fit), but the plot below shows it leveling off above 4 clusters, which suggests there are 4 main types of movies.
k_range = range(1, 15)
inertias = [sklearn.cluster.KMeans(n_clusters=k).fit(data.T).inertia_ for k in k_range]
plt.plot(k_range, inertias)
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
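Having picked k = 4 from the elbow, the movies can then be assigned to clusters and the cluster sizes inspected. A sketch on synthetic data; the 4 planted movie types are an assumption for illustration, not a property of the real file:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 100x200 (movies x customers) matrix with 4 planted movie
# types: each type has its own rating profile plus small noise.
rng = np.random.default_rng(0)
centers = rng.integers(1, 6, size=(4, 200))   # one rating profile per type
labels_true = np.repeat(np.arange(4), 25)     # 25 movies of each type
data_T = np.clip(centers[labels_true] + rng.integers(-1, 2, size=(100, 200)),
                 1, 5)

# Cluster the movies (rows of data.T) into 4 groups.
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(data_T)
print('movies per cluster:', np.bincount(km.labels_))
```

With well-separated planted profiles, the clusters should come out roughly equal in size, matching the 4-types-of-movies reading of the elbow plot.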