I got my start in programming with C++ (not counting hacking DOS video games like Duke Nukem in a hex editor). One summer in college I had an undergraduate grant with a professor who had me write a neural network algorithm using back propagation. The next summer this same professor had me recode it all in Java, as it was the hot new language. So from the start, my programming needs have been in the context of analytics, often exploratory in nature, which is why I quickly gravitated from compiled languages like C++ and Java (statically compiled) towards dynamic languages like MATLAB, R and Python (runtime compiled).
The trade-off in using dynamic languages over compiled ones is about speed. Dynamic languages better enable the exploratory programming needed when tackling new analytical problems, which leads to faster development of solutions. The trade-off is that this flexibility means that compilers do not have enough information about the code and data to optimize for run-time execution speed. In many settings, increasing computer time for saved human time is a good trade-off, but of course sometimes execution speed is critical, even in exploratory work (i.e., it is harder to iteratively refine a modeling approach when code takes many hours to run). Wouldn't it be great if there was a way to have the flexibility and expressiveness afforded by dynamic languages but with more of the execution performance of compiled languages?
This wouldn't it be great if wish has been around for a while, but three events centering on the LLVM compiler over the past year make this wish a lot closer to reality:
While this seems like an unrelated mix of events, the common thread is that important tools in the modern technical computing stack are moving towards using LLVM. So, I figured it was about time I became more familiar with the topic and the rest of this post shows some of my recent foray.
Matrix factorization methods serve as the backbone of many statistical algorithms. Two particular approaches -- singular value decomposition (SVD) and non-negative matrix factorization -- gained broader exposure with the 2009 Netflix Prize for improving movie recommendations. These are both computationally expensive algorithms, the kind that you want implemented in a compiled language. But, there is a nice tutorial on using matrix factorization for recommendation systems that codes the algorithm in Python, making the algorithm easy to follow and test. I'll be using a slightly modified version of the code used in this tutorial as the basis for my explorations.
Mathematically, what we are doing is taking an N by M matrix R of ratings by N people over M items and decomposing it into two constituent matrices, P and Q (I'll stick with the notation used in the tutorial). P will be an N by K matrix that we can think of as N people's preference weightings over K "hidden" (or latent) features of the items. Q will be a K by M matrix that we can think of as the "loadings" on the K features by each of these M items. The algorithm seeks P and Q such that the difference between R and PQ is minimized (in general, this solution is non-unique).
In what follows I'll compare algorithm speed across three scenarios:
Here is the code that will serve as the baseline for our testing (again, a slightly modified version of what is hosted on the tutorial). We'll save this code as a file called mf.py
:
# -*- coding: utf-8 -*-
#!/usr/bin/python
'''
Adapted from example code by Albert Au Yeung (2010)
http://www.quuxlabs.com/blog/2010/09/matrix-factorization-a-simple-tutorial-and-implementation-in-python/#source-code
An implementation of matrix factorization...
@INPUT:
R : a matrix to be factorized, dimension N x M
P : an initial matrix of dimension N x K
Q : an initial matrix of dimension M x K
K : the number of latent features
steps : the maximum number of steps to perform the optimisation
alpha : the learning rate
beta : the regularization parameter
@OUTPUT:
the final matrices P and Q, steps taken and error
'''
import numpy
def matrix_factorization(R, P, Q, K, steps=5000, alpha=0.0002, beta=0.02):
half_beta = beta / 2.0
for step in xrange(steps):
for i in xrange(len(R)):
for j in xrange(len(R[i])):
if R[i,j] > 0:
eij = R[i,j] - numpy.dot(P[i,:],Q[j,:])
for k in xrange(K):
P[i,k] += alpha * (2 * eij * Q[j,k] - beta * P[i,k])
Q[j,k] += alpha * (2 * eij * P[i,k] - beta * Q[j,k])
e = 0
for i in xrange(len(R)):
for j in xrange(len(R[i])):
if R[i,j] > 0:
e += (R[i,j] - numpy.dot(P[i,:],Q[j,:]))**2
for k in xrange(K):
e += half_beta * (P[i,k]*P[i,k] + Q[j,k]*Q[j,k])
if e < 0.001:
break
return P, Q, step, e
Right away, we can see this will probably run very slowly in a language like Python compared to C. We have deeply nested for
loops and such looping is almost always a bottleneck in dynamically typed languages (relative to doing the exact same looping in a compiled language in which the compiler knows about the variable types and can optimize operations).
In two parts of the code there is an important conditional statement to call out with respect to how the algorithm is adapted to this rating recommendation problem:
if R[i,j] > 0:
In the R matrix of peoples' ratings on items, R[i,j] = 0 is a sentinel meaning that person i has not rated item j. In many settings (like Netflix's inventory of thousands of movies), most elements of R will be 0 because most people have only interacted with a small subset of the items. The algorithm above takes this into account when assessing the factorization error by ignoring the differences between the estimated R and the actual R where the actual ratings are unknown. In this way, the algorithm tries to find P and Q such that the known ratings are recovered as accurately as possible. Estimates for the unobserved ratings, i.e., the predicted ratings, can then be taken from PQ wherever R was zero. We assume that ratings are non-negative, e.g. 1-5 stars.
One final thing deserves mention before we run the code: we have to pick K, the number of latent features. But if the features are latent, how do we know how many of them there are? There is no easy answer in most settings. Some amount of experimentation and domain expertise (to interpret the factors) is needed to confidently pick K, but this touches on an important issue that pervades statistical modeling: estimation error due to model selection is often harder to assess and correct than error due to model fitting, but much more is written about the latter. For this example we'll use 8 factors.
OK, finally we are on to testing performance. In support of this testing I am going to define two variants of the baseline code, and finally a piece of code to easily call all three variants.
The first variant we'll save to a file called mf_cython.pyx
. We'll take advantage of a Cython, which essentially lets us add some type information to our Python code above and then machine-generate analogous C code that be compiled and execute (much) more quickly. I won't post this code here but will link to it an the end.
The second variant we'll save to a file called mf_numba.py
and will use numba to add type information and generate "intermediate" code that can be JIT compiled by LLVM. One of the nice things about numba is that you can do this through a simple function decorator. For example, instead of our matrix_factorization
method beginning as
def matrix_factorization(R, P, Q, K):
we import autojit
from numba and then add this decorator. It auto-majically annotates the variables types in the code and prepares it for JIT compilation:
from numba.decorators import autojit
@autojit
def matrix_factorization(R, P, Q, K):
At this point, we have our baseline Python implementation of matrix factorization for recommendation problems, along with two variants that use C and LLVM for improving execution speed. Finally, we need a bit of code to drive the test, which I have in a file called mf_rec_test.py
. This code creates some random ratings data for a desired number of people and items. The run
method then calls one of the three matrix factorization variants we have and measures execution time along with the error in recovering the known ratings. Again, I won't post the code here but will link to it at the end.
With everything in place we can drive the test across the three approaches an compare speed. Again, 8 factors will be used in the solution, and we'll pick a relatively small matrix of 30 people having ratings over 60 items...
In [1]: import mf_rec_test as mfrt
In [2]: mfrt.run(method='python', num_factors=8, num_users=30, num_items=60)
Recommendation test time in minutes: 2.9021
Factorization RMSE: 0.087
In [3]: mfrt.run(method='cython', num_factors=8, num_users=30, num_items=60)
Recommendation test time in minutes: 0.0028
Factorization RMSE: 0.081
So, the Python code takes almost 3 minutes whereas the (machine-generated) C code executes in one thousandth of the time. For languages like Python and R much of the mathematical and statistical methods are actually calling underlying C and Fortran code, so you can get reasonably good execution speed to go along with the development productivity you get. But, as this example shows, there are cases where highly iterative algorithms can suffer horribly. How does the numba-ized code fair?
In [4]: mfrt.run(method='numba', num_factors=8, num_users=30, num_items=60)
Recommendation test time in minutes: 0.0652
Factorization RMSE: 0.088
Pretty impressive: for code that we basically added a one-line decorator to, we sped things up by a factor of 44. Now, watch what happens if we re-execute the exact same call:
In [5]: mfrt.run(method='numba', num_factors=8, num_users=30, num_items=60)
Recommendation test time in minutes: 0.0031
Factorization RMSE: 0.084
The numba-ized code now executes in the same time as the C code, speeding up again by a factor of 21. How so? The first time the numba-ized code is run it gets translated to byte code that LLVM can JIT compile. The next time we run the code, the results from the JIT compilation can be used and we get another big speed up.
Writing this post certainly pushed me past my comfort zone. I have spent most of my focus over the years on applying analytics to business problems and not on lower-level computing issues like compilers and memory management. But, as in many areas of computing these days, more data, more complex questions, and expectations of real-time results means that issues of the past have come back to the fore. For example, mobile developers have to deal with limited screen space and be hyper-focused on keeping memory consumption down, just as people doing any kind of computing had to in the 1980s. It basically means we have to head the advice of Peter Norvig to remember that there is a "computer" in "computer science". To do even exploratory analytical work nowadays means having to develop and maintain some level of maturity around technical computing. And then of course, there is the rest of being a data scientist, like keeping up with the evolutions in analytic methods, software implementations of said methods, domain knowledge for usefully applying said methods, techniques for visualizing results, and best practices for conveying these methods and their results to technical and non-technical audiences. Sigh...
As overwhelming as this collection of issues feels sometimes, it motivates my excitement about projects like LLVM and numba: They provide a single, consistent abstraction layer on which I can maintain reasonably high performance code in languages that are very flexible and efficient for me to develop in. I can target execution against a single core on a CPU, multiple cores, or even completely different hardware like NVIDIA's GPUs that have thousands of cores, all through LLVM and numba. If I were to continue developing this matrix factorization algorithm for a recommendation system I would now have a way of iteratively testing approaches on much larger, more realistic data sets rather than waiting nearly 3 minutes every time I tested against a toy data set. This means I would be more likely to actually explore and evaluate solutions, and that is what is most exciting.