%matplotlib inline
from numpy import exp, log, linspace
import matplotlib
import matplotlib.pyplot as plt
Logistic regression classifiers are a well-known machine learning technique for classifying data. In this notebook we'll go through the basics of how they work and write a simple implementation. Finally we'll test our implementation by using it to detect spam in a corpus of email messages.
Throughout this notebook we'll use spam classification as a convenient example to help explain how the classifier works.
This notebook and the code it uses can be found on GitHub.
To get the most out of this notebook, you'll need the following prerequisites:
A logistic regression classifier gets its name from the function at its core, the logistic, or sigmoid, function:

$$ h(x) = 1 / (1 + e^{a - bx}) \ \ a, b \in \mathbb{R} $$

This function has an interesting shape:
x = linspace(-15,15,100)
a = 0
b = 1
plt.plot(x, 1 / (1 + exp(a - b*x)))
plt.show()
The function varies between 0 and 1 and is close to one of the two extremes over most of its domain. This suggests a way of using this function as a classifier: if we can find some way of representing our problem as points on the x-axis, then we can use this function to put each point into one of two classes, either 0 or 1. If for some point $ x $, $ h(x) < 0.5 $, we say that $ x $ is in class 0. If $ h(x) > 0.5 $, we say that $ x $ is in class 1. We call 0.5 the decision boundary.
For example, consider classifying spam. We'd like to use the sigmoid function $ h(x) $ to classify a message as spam or not-spam. To do this, we need to find a way of mapping each message onto the x-axis. After that the sigmoid function will predict which class the message falls into.

One such mapping could be the length of the longest capitalized word in the message. Spammers sometimes use capitalized words to get our attention, so we expect spam messages to have more and longer capitalized words. We can take the length of the longest capitalized word and use that as the message's position on the x-axis.
This is a good start but it won't quite work. The decision boundary is set at 0.5, which occurs at $ x = 0 $. Since we can't have a capitalized word of negative length, this strategy would classify almost all messages as spam. To solve this we modify the parameters of the sigmoid function, $ a $ and $ b $. We can shift the function left or right by setting the $ a $ parameter:
a = 5
b = 1
plt.plot(x, 1 / (1 + exp(a - b*x)))
plt.show()
The decision boundary now falls at $ x = 5 $ instead of $ x = 0 $, so any message with a capitalized word longer than 5 characters will be classified as spam. By changing the parameters of the sigmoid function we've changed how we classify emails.
In the discussion above we chose the length of the longest capitalized word to determine if a message was spam. However there are many more features of email messages that can be used, for example:
Later on in this notebook we'll use the logistic regression classifier to classify messages with 57 features, so we need some way of generalizing our approach to many dimensions.
Assume we can process any email message to extract a vector of real-valued features. Call this vector $ \vec{x} $. To use this multidimensional feature vector with the sigmoid function we need to find a way of projecting it onto one dimension. Linear algebra provides us with a way of performing this projection operation: the dot product. We need another vector, call it $ \vec{\theta} $, in order to do this.
If we let
$$ \vec{x} = [x_0, x_1, x_2 ... x_m] $$

and

$$ \vec{\theta} = [p_0, p_1, p_2 ... p_m] $$

where $ m \in \mathbb{N} $ is the number of features we've extracted, then the dot product is

$$ \vec{x} \cdot \vec{\theta} = x_0p_0 + x_1p_1 + x_2p_2 ... x_mp_m $$

If we fix $ x_0 = 1 $, then this looks similar to the exponential term in the sigmoid function above. We can rewrite the sigmoid as

$$ h(\vec{x}, \vec{\theta}) = 1 / (1 + e^{\vec{x} \cdot \vec{\theta}}) = 1 / (1 + e^{p_0 + x_1p_1 + x_2p_2 ... x_mp_m}) $$

Set $ m = 1 $ and this becomes

$$ h(\vec{x}, \vec{\theta}) = 1 / (1 + e^{p_0 + x_1p_1}) $$

which is identical to the original sigmoid function with the parameters $ a = p_0 $ and $ b = -p_1 $.
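To make the projection concrete, here is a small sketch (the names are ours) showing that the dot-product form reproduces the one-dimensional sigmoid:

```python
import numpy as np

def h_vec(x, theta):
    # sigmoid applied to the projection x . theta, as in h above
    return 1.0 / (1.0 + np.exp(np.dot(x, theta)))

# fix x_0 = 1 so that p_0 plays the role of a; with p_1 = -1, b = -p_1 = 1
x = np.array([1.0, 7.0])       # one feature, e.g. a longest capitalized word of length 7
theta = np.array([5.0, -1.0])  # p_0 = 5, p_1 = -1
print(h_vec(x, theta))         # equals 1 / (1 + e^(5 - 7)), the 1-D sigmoid with a = 5, b = 1
```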
So if we are given a parameter vector $ \vec{\theta} $ and an email message, we can classify the email by:
It's clear that if we are given a new feature vector $ \vec{x} $ the way we classify it depends entirely on the values in $ \vec{\theta} $. Our classification model is completely defined by the values in the parameter vector $ \vec{\theta} $.
Following the conventions we set above, we call $ \vec{\theta} $ the parameter vector. The problem now becomes choosing the values in $\vec{\theta} $.
In our previous discussion we chose the parameters $ a $ and $ b $ by guesswork. However with multiple dimensions and large training sets we need an algorithmic way of choosing the values of $ \vec{\theta} $. Choosing the parameters is called training the classifier.
The logistic regression classifier is a supervised learning algorithm. This means that in order to train it we must provide a training set of feature vectors $ \{ \vec{x_i} \mid i = 1 .. n \} $ and a set of labels $ \{ y_i \mid y_i \in \{ 0, 1 \},\ i = 1 .. n \} $ which indicate the correct class in which to place each of the $ \vec{x_i} $.
Training the algorithm depends on the concept of a cost function which does the following:
To train the classifier we can use numerical methods to iteratively find the values of $ \vec{\theta} $ which minimize this cost function.
The cost function for the logistic regression classifier is

$$ J(\vec{\theta}, X, \vec{y}) = (-1 / n)[ \sum_{i=1}^{n} y_i \log(h(\vec{x_i}, \vec{\theta})) + (1 - y_i) \log(1 - h(\vec{x_i}, \vec{\theta}))] + \lambda\sum_{j=1}^{m} \theta_j^2 $$

where:
The way we derive this function and the properties which make it useful as a cost function are outside the scope of this discussion. However, we can see that it does what we need it to do:
By finding the cost of each training example $ \vec{x_i} $ and summing them we get a total cost of $ \vec{\theta} $ over the entire training set.
The final term in the cost function, $ \lambda\sum_{j=1}^{m} \theta_j^2 $, is used to regularize the cost. Regularization prevents the components of $ \vec{\theta} $ from becoming too large by penalizing large values. This strategy tends to improve the accuracy of the classifier on the test set because it prevents the classifier from being overfitted to the training set. The parameter $ \lambda \in \mathbb{R} $ controls the degree of regularization and is called the regularization parameter.
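Written directly from the formula, the cost function might look like this. This is a sketch with assumed names, not the notebook's src/logistic.py; note that the regularization sum starts at $ j = 1 $, skipping the intercept term:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(z))

def cost_loop(theta, X, y, lam=1.0):
    # sum the per-example cost over the whole training set
    n = len(y)
    total = 0.0
    for i in range(n):
        h_i = sigmoid(np.dot(X[i], theta))
        total += y[i] * np.log(h_i) + (1 - y[i]) * np.log(1 - h_i)
    # regularization penalizes large parameters, skipping theta_0
    reg = lam * np.sum(theta[1:] ** 2)
    return (-1.0 / n) * total + reg
```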
We now need to find the values of $ \vec{\theta} $ that minimize the cost function $ J $. The problem of minimizing functions is a well-studied branch of mathematics and there are many libraries that can do this for us. Having chosen the cost function $ J $, we can have the `scipy.optimize` library automatically find the best value of $ \vec{\theta} $, thereby training our classifier.
There's one important thing we need to do before training the classifier - we need to vectorize the function $ J $.
The function $ J $ as we wrote it performs an operation on, and sums over, the entire training set. There are various optimizations in the `numpy` and `scipy` libraries that depend on using matrices and vectors, so we can express our cost function in terms of matrix and vector operations to take advantage of them.
We can write our cost function as
$$ J(\vec{\theta}, X, \vec{y}) = (-1 / n) [\vec{y}' \cdot \log(h(X, \vec{\theta})) + (\vec{I}' - \vec{y}') \cdot \log(\vec{I} - h(X, \vec{\theta}))] + \lambda \vec{\theta}\cdot\vec{\theta}' $$

where:
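The same cost in vectorized form might look like the following (again a sketch with assumed names; following the formula above, the regularization term here is $ \lambda \vec{\theta}\cdot\vec{\theta}' $ over the whole vector):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(z))

def cost_vec(theta, X, y, lam=1.0):
    # one matrix-vector product replaces the per-example loop
    n = y.shape[0]
    h = sigmoid(X @ theta)                # vector of n predictions
    unreg = -(y @ np.log(h) + (1.0 - y) @ np.log(1.0 - h)) / n
    return unreg + lam * (theta @ theta)  # lambda * theta . theta'
```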
The code included with this notebook implements the algorithm in this way.
Now we're ready to look at some code. Our classifier algorithm has three components, and implementations of each can be found in `src/logistic.py`:
- The `predict()` function takes a matrix of feature vectors $ X $ and the parameter vector $ \vec{\theta} $. It multiplies the feature matrix by the parameter vector and then applies the sigmoid function to each element of the resulting vector. This creates a vector of $ n $ predictions, one for each of the $ n $ feature vectors in $ X $.
- The `cost()` function computes the cost of a parameter vector $ \vec{\theta} $ over a training set $ \{ X, \vec{y} \} $.
- The `train()` function finds the parameter vector which minimizes the cost function.

To train our classifier, we pass a training set $ \{ X, \vec{y} \} $ and an initial guess at the model parameters $ \vec{\theta} $ into the `train()` function. The optimization routines in `scipy.optimize` will find the parameter vector which minimizes the `cost()` function over the training set $ \{ X, \vec{y} \} $.
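As an illustrative sketch of how those three functions could fit together (this mirrors the description above but is not the actual contents of src/logistic.py; the predictions are clipped away from 0 and 1 so the logarithms stay finite):

```python
import numpy as np
from scipy.optimize import fmin_cg

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(z))

def predict(theta, X):
    # one sigmoid prediction per row of X
    return sigmoid(X @ theta)

def cost(theta, X, y, lam=1.0):
    n = y.shape[0]
    # clip so that log() never sees exactly 0 or 1
    h = np.clip(predict(theta, X), 1e-10, 1.0 - 1e-10)
    return -(y @ np.log(h) + (1.0 - y) @ np.log(1.0 - h)) / n + lam * (theta @ theta)

def train(theta0, X, y, lam=1.0):
    # let scipy.optimize search for the theta that minimizes cost()
    return fmin_cg(cost, theta0, args=(X, y, lam), disp=False)
```

Note that with this notebook's sign convention, $ h = 1 / (1 + e^{\vec{x} \cdot \vec{\theta}}) $, a prediction above 0.5 corresponds to a negative $ \vec{x} \cdot \vec{\theta} $.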
To test the classifier, we'll use the well-known UCI Machine Learning Repository Spambase Data Set. This data set was constructed by extracting feature vectors with 57 dimensions from over 4600 emails. Each email has been manually labelled as spam or not-spam, making this perfect for our purposes.
To download the data set, run the `download()` function in the `src/spambase.py` file:
from src import spambase
spambase.download()
downloading spam data set ... download complete, unzipping data ... complete
We use the `read_data()` function in `src/spambase.py` to read $ X $ and $ \vec{y} $, split them into training and test sets, and generate the initial value of $ \vec{\theta} $:
theta, Xtrain, Xtest, ytrain, ytest = spambase.read_data()
Then we train the classifier using the `train()` function in `src/logistic.py`:
from src import logistic
trained_theta = logistic.train(theta, Xtrain, ytrain)
Optimization terminated successfully. Current function value: 0.615512 Iterations: 5 Function evaluations: 660 Gradient evaluations: 11
`trained_theta` now defines our model and we can use it to make predictions. To test the accuracy of our classifier we generate a set of predictions for `Xtest` and then check the precision and recall against `ytest`.
from numpy import matrix, array
predictions = logistic.predict(matrix(trained_theta).transpose(), Xtest)
# recall that 0.5 is our decision boundary
predictions = [1 if p > 0.5 else 0 for p in predictions]
# make a list of tuples, each containing the predicted class and the labeled class
predictions = list(zip(predictions, [i[0] for i in ytest.tolist()]))
print(predictions[:10])
[(0, 1), (0, 1), (1, 1), (0, 0), (0, 0), (1, 0), (1, 0), (0, 0), (1, 1), (1, 1)]
We can now calculate the precision and recall of our classifier:
tp = len([1 for x in predictions if x[0] == 1 and x[0] == x[1]])
fp = len([1 for x in predictions if x[0] == 1 and x[0] != x[1]])
precision = float(tp) / (float(tp) + float(fp))
tn = len([1 for x in predictions if x[0] == 0 and x[0] == x[1]])
fn = len([1 for x in predictions if x[0] == 0 and x[0] != x[1]])
recall = float(tp) / (float(tp) + float(fn))
print("Precision: {0}".format(precision))
print("Recall: {0}".format(recall))
Precision: 0.916955017301 Recall: 0.864600326264
Of all the messages we marked as spam, about 92% were actually spam. Of all the spam messages in the test set, we found about 86% of them. Not bad for our first try.
There are many improvements that we could make to the algorithm, and many details that were omitted from our discussion for the sake of brevity.
In `src/spambase.py` the initial values of $ \vec{\theta} $ are chosen to be normally distributed about 0. This was done to prevent the `cost()` function from finding an initial cost of `Inf` or `-Inf`, which is possible because of the way `cost()` uses the logarithm. By initializing $ \vec{\theta} $ near 0, we ensure that the values passed into the logarithm stay away from 0 and the function does not blow up.
There are other ways of dealing with this problem, but we did not implement them here to avoid extra complexity. The purpose of this exercise was to learn how to implement the basics of a logistic regression classifier, not to handle all possible edge cases.
The optimization routine we used to train the classifier was `scipy.optimize.fmin_cg()`. This is an implementation of the nonlinear conjugate gradient method, which seeks the optimal solution by following the gradient (the derivative in multiple dimensions) down to a minimum. The algorithm depends on being able to calculate the gradient at each point, and its performance can be improved by deriving the gradient function analytically and passing it in via the `fprime` parameter. Since we did not do this, the `fmin_cg()` function was forced to numerically estimate the gradient at each point, which hurt performance.
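For reference, the analytic gradient is straightforward to derive. With our convention $ h(z) = 1 / (1 + e^z) $ we have $ dh/dz = -h(1-h) $, which gives $ \partial J / \partial \theta_j = \frac{1}{n}\sum_i (y_i - h_i)x_{ij} + 2\lambda\theta_j $. A sketch with assumed names, checked against finite differences:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(z))

def cost(theta, X, y, lam=1.0):
    n = y.shape[0]
    h = np.clip(sigmoid(X @ theta), 1e-10, 1.0 - 1e-10)
    return -(y @ np.log(h) + (1.0 - y) @ np.log(1.0 - h)) / n + lam * (theta @ theta)

def grad(theta, X, y, lam=1.0):
    # analytic gradient of cost(): X'(y - h) / n plus the regularization term
    n = y.shape[0]
    h = sigmoid(X @ theta)
    return X.T @ (y - h) / n + 2.0 * lam * theta

# passing fprime=grad would avoid the numerical estimation, e.g.:
# trained = fmin_cg(cost, theta0, fprime=grad, args=(X, y, lam))
```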
It's important to note that the code included with this notebook is meant as a basic implementation of the logistic regression classifier and you should not use this code for anything other than instruction. If you're looking for a classifier to use in production, you should use one of the standard implementations such as `sklearn.linear_model.LogisticRegression`.
In this discussion we did not say anything about how the logistic regression classifier compares to other classifiers such as naive Bayes or the support vector machine. Different algorithms perform better on different problems, and you should evaluate several before choosing one solution for the problem you are working on.
In this notebook we went through the basics of how a logistic regression classifier works and wrote a simple implementation. We then evaluated its performance on a spam classification problem and found that it performed fairly well.
There are many improvements that could be made but I hope this was useful, if only for a basic introduction to some of the concepts in machine learning and classification.
Thanks for reading!
Scikit-learn's implementation of the logistic regression classifier: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html