import matplotlib.pyplot as plt
import numpy as np
import sklearn.cluster
Load the data.
data = np.loadtxt('/tmp/Netflixishdata.txt', dtype=int)
print(list(zip(data.shape, ('customers', 'movies'))))
print(data.size, 'ratings')
[(200, 'customers'), (100, 'movies')]
20000 ratings
Plotting a histogram of all 20,000 ratings shows roughly 4,000 instances of each rating value, which is suspiciously uniform.
plt.hist(data.flatten(), bins=np.linspace(0.5, 5.5, 6))
plt.title("Histogram of all ratings")
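The histogram's uniformity can also be tallied exactly with `np.bincount`. A minimal sketch, run on a synthetic stand-in matrix since the original `/tmp/Netflixishdata.txt` isn't reproduced here:

```python
import numpy as np

# Hypothetical stand-in for the 200x100 ratings matrix
# (the real data comes from /tmp/Netflixishdata.txt).
rng = np.random.default_rng(0)
data = rng.integers(1, 6, size=(200, 100))

# bincount gives an exact count of each rating value, 0 through 5.
counts = np.bincount(data.flatten(), minlength=6)
for rating, count in enumerate(counts):
    print(rating, 'appears', count, 'times')
```

Unlike a histogram, this leaves no ambiguity about bin edges, which is how the stray out-of-range value below becomes visible.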
Looking even closer shows that there's actually one rating of 0!
for i in range(6):
    print((data == i).sum(), "ratings equal to", i)
1 ratings equal to 0
4000 ratings equal to 1
4000 ratings equal to 2
4000 ratings equal to 3
4000 ratings equal to 4
3999 ratings equal to 5
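To see which customer and movie the stray 0 belongs to, `np.argwhere` returns the indices of every matching entry. A sketch on synthetic data; the planted location (42, 7) is made up for illustration:

```python
import numpy as np

# Hypothetical stand-in matrix with one 0 planted to mimic the anomaly.
rng = np.random.default_rng(0)
data = rng.integers(1, 6, size=(200, 100))
data[42, 7] = 0  # planted anomaly; the real location would come from the data

# argwhere returns an array of (row, column) index pairs where the
# condition holds, i.e. (customer, movie) pairs here.
locations = np.argwhere(data == 0)
for customer, movie in locations:
    print('customer', customer, 'gave movie', movie, 'a rating of 0')
```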
Next I average over customers (axis 0) to get each movie's mean rating, and over movies (axis 1) to get each customer's mean rating, then plot both distributions.
fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.hist(np.mean(data, axis=0), bins=25)
ax1.set_title("Average Movie Ratings")
ax2.hist(np.mean(data, axis=1), bins=25)
ax2.set_title("Average Customer Ratings")
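The same per-movie means can also rank the movies from worst to best via `np.argsort`. A sketch, again on a synthetic stand-in matrix rather than the original file:

```python
import numpy as np

# Hypothetical stand-in for the 200x100 ratings matrix.
rng = np.random.default_rng(0)
data = rng.integers(1, 6, size=(200, 100))

movie_means = data.mean(axis=0)   # average rating per movie (length 100)
order = np.argsort(movie_means)   # movie indices, ascending by mean rating

print('lowest-rated movie:  index', order[0],
      'mean', round(movie_means[order[0]], 2))
print('highest-rated movie: index', order[-1],
      'mean', round(movie_means[order[-1]], 2))
```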
Finally, I use K-means to cluster the movies into a varying number of clusters and report the inertia, an overall measure of how close the data points are to their centroids (lower means a tighter fit). Inertia can always be decreased by adding clusters (setting k equal to the number of data points gives a perfect fit), but the plot below shows it leveling off above 4 clusters, which suggests there are 4 main types of movies.
k_range = range(1, 15)
inertias = [sklearn.cluster.KMeans(n_clusters=k).fit(data.T).inertia_ for k in k_range]
plt.plot(k_range, inertias)
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
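Having picked k = 4 from the elbow, the movies can then be assigned to clusters and the cluster sizes inspected. A sketch on synthetic data; the 4 planted movie types are an assumption for illustration, not a property of the real file:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 100x200 (movies x customers) matrix with 4 planted movie
# types: each type has its own rating profile plus small noise.
rng = np.random.default_rng(0)
centers = rng.integers(1, 6, size=(4, 200))   # one rating profile per type
labels_true = np.repeat(np.arange(4), 25)     # 25 movies of each type
data_T = np.clip(centers[labels_true] + rng.integers(-1, 2, size=(100, 200)),
                 1, 5)

# Cluster the movies (rows of data.T) into 4 groups.
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(data_T)
print('movies per cluster:', np.bincount(km.labels_))
```

With well-separated planted profiles, the clusters should come out roughly equal in size, matching the 4-types-of-movies reading of the elbow plot.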