%matplotlib inline
from numpy import exp, log, linspace
import matplotlib
import matplotlib.pyplot as plt
Logistic regression classifiers are a well-known machine learning technique for classifying data. In this notebook we'll go through the basics of how they work and write a simple implementation. Finally we'll test our implementation by using it to detect spam in a corpus of email messages.
Throughout this notebook we'll use spam classification as a convenient example to help explain how the classifier works.
This notebook and the code it uses can be found on GitHub.
To get the most out of this notebook, you'll need the following prerequisites:
A logistic regression classifier gets its name from the function at its core, the logistic, or sigmoid, function:

$$ h(x) = 1 / (1 + e^{a - bx}) \ \ a, b \in \mathbb{R} $$

This function has an interesting shape:
x = linspace(-15,15,100)
a = 0
b = 1
plt.plot(x, 1 / (1 + exp(a - b*x)))
plt.show()
The function varies between 0 and 1 and is close to one of the two extremes over most of its domain. This suggests a way of using this function as a classifier: if we can find some way of representing our problem as points on the x-axis, then we can use this function to put each point into one of two classes, either 0 or 1. If for some point $ x $, $ h(x) < 0.5 $, we say that $ x $ is in class 0. If $ h(x) > 0.5 $, we say that $ x $ is in class 1. We call 0.5 the decision boundary.
For example, consider classifying spam. We'd like to use the sigmoid function $ h(x) $ to classify a message as spam or not-spam. To do this, we need to find a way of mapping each message onto the x-axis. After that the sigmoid function will predict which class the message falls into.

One such mapping could be the length of the longest capitalized word in the message. Spammers sometimes use capitalized words to get our attention, so we expect spam messages to have more and longer capitalized words. We can take the length of the longest capitalized word and use that as the message's position on the x-axis.
This is a good start but it won't quite work. The decision boundary is set at 0.5, which occurs at $ x = 0 $. Since we can't have a capitalized word of negative length, this strategy would classify almost all messages as spam. To solve this we modify the parameters of the sigmoid function, $ a $ and $ b $. We can shift the function left or right by setting the $ a $ parameter:
a = 5
b = 1
plt.plot(x, 1 / (1 + exp(a - b*x)))
plt.show()
The decision boundary now falls at $ x = 5 $ instead of $ x = 0 $, so any message with a capitalized word longer than 5 characters will be classified as spam. By changing the parameters of the sigmoid function we've changed how we classify emails.
In the discussion above we chose the length of the longest capitalized word to determine if a message was spam. However there are many more features of email messages that can be used, for example:
Later on in this notebook we'll use the logistic regression classifier to classify messages with 57 features, so we need some way of generalizing our approach to many dimensions.
Assume we can process any email message to extract a vector of real-valued features. Call this vector $ \vec{x} $. To use this multidimensional feature vector with the sigmoid function we need to find a way of projecting it onto one dimension. Linear algebra provides us with a way of performing this projection operation: the dot product. We need another vector, call it $ \vec{\theta} $, in order to do this.
If we let
$$ \vec{x} = [x_0, x_1, x_2 ... x_m] $$

and

$$ \vec{\theta} = [p_0, p_1, p_2 ... p_m] $$

where $ m \in \mathbb{N} $ is the number of features we've extracted, then the dot product is

$$ \vec{x} \cdot \vec{\theta} = x_0p_0 + x_1p_1 + x_2p_2 ... x_mp_m $$

If we fix $ x_0 = 1 $, then this looks similar to the exponential term in the sigmoid function above. We can rewrite the sigmoid as

$$ h(\vec{x}, \vec{\theta}) = 1 / (1 + e^{\vec{x} \cdot \vec{\theta}}) = 1 / (1 + e^{p_0 + x_1p_1 + x_2p_2 ... x_mp_m}) $$

Set $ m = 1 $ and this becomes

$$ h(\vec{x}, \vec{\theta}) = 1 / (1 + e^{p_0 + x_1p_1}) $$

which is identical to the original sigmoid function with the parameters $ a = p_0 $ and $ b = -p_1 $.
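To make the projection concrete, here is a small sketch (the names are ours) showing that the dot-product form reproduces the one-dimensional sigmoid:

```python
import numpy as np

def h_vec(x, theta):
    # sigmoid applied to the projection x . theta, as in h above
    return 1.0 / (1.0 + np.exp(np.dot(x, theta)))

# fix x_0 = 1 so that p_0 plays the role of a; with p_1 = -1, b = -p_1 = 1
x = np.array([1.0, 7.0])       # one feature, e.g. a longest capitalized word of length 7
theta = np.array([5.0, -1.0])  # p_0 = 5, p_1 = -1
print(h_vec(x, theta))         # equals 1 / (1 + e^(5 - 7)), the 1-D sigmoid with a = 5, b = 1
```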
So if we are given a parameter vector $ \vec{\theta} $ and an email message, we can classify the email by:
It's clear that if we are given a new feature vector $ \vec{x} $ the way we classify it depends entirely on the values in $ \vec{\theta} $. Our classification model is completely defined by the values in the parameter vector $ \vec{\theta} $.
Following the conventions we set above, we call $ \vec{\theta} $ the parameter vector. The problem now becomes choosing the values in $\vec{\theta} $.
In our previous discussion we chose the parameters $ a $ and $ b $ by guesswork. However with multiple dimensions and large training sets we need an algorithmic way of choosing the values of $ \vec{\theta} $. Choosing the parameters is called training the classifier.
The logistic regression classifier is a supervised learning algorithm. This means that in order to train it we must provide a training set of feature vectors $ \{ \vec{x_i} \mid i = 1 .. n \} $ and a set of labels $ \{ y_i \mid y_i \in \{ 0, 1 \},\ i = 1 .. n \} $ which indicate the correct class in which to place each of the $ \vec{x_i} $.
Training the algorithm depends on the concept of a cost function which does the following:
To train the classifier we can use numerical methods to iteratively find the values of $ \vec{\theta} $ which minimize this cost function.
The cost function for the logistic regression classifier is

$$ J(\vec{\theta}, X, \vec{y}) = (-1 / n)[ \sum_{i=1}^{n} y_i \log(h(\vec{x_i}, \vec{\theta})) + (1 - y_i) \log(1 - h(\vec{x_i}, \vec{\theta}))] + \lambda\sum_{j=1}^{m} \theta_j^2 $$

where:
The way we derive this function and the properties which make it useful as a cost function are outside the scope of this discussion. However, we can see that it does what we need it to do:
By finding the cost of each training example $ \vec{x_i} $ and summing them we get a total cost of $ \vec{\theta} $ over the entire training set.
The final term in the cost function, $ \lambda\sum_{j=1}^{m} \theta_j^2 $, is used to regularize the cost. Regularization prevents the components of $ \vec{\theta} $ from becoming too large by penalizing large values. This strategy tends to improve the accuracy of the classifier on the test set because it prevents the classifier from being overfitted to the training set. The parameter $ \lambda \in \mathbb{R} $ controls the degree of regularization and is called the regularization parameter.
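Written directly from the formula, the cost function might look like this. This is a sketch with assumed names, not the notebook's src/logistic.py; note that the regularization sum starts at $ j = 1 $, skipping the intercept term:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(z))

def cost_loop(theta, X, y, lam=1.0):
    # sum the per-example cost over the whole training set
    n = len(y)
    total = 0.0
    for i in range(n):
        h_i = sigmoid(np.dot(X[i], theta))
        total += y[i] * np.log(h_i) + (1 - y[i]) * np.log(1 - h_i)
    # regularization penalizes large parameters, skipping theta_0
    reg = lam * np.sum(theta[1:] ** 2)
    return (-1.0 / n) * total + reg
```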
We now need to find the values of $ \vec{\theta} $ that minimize the cost function $ J $. The problem of minimizing functions is a well-studied branch of mathematics and there are many libraries that can do this for us. Having chosen the cost function $ J $, we can have the `scipy.optimize` library automatically find the best value of $ \vec{\theta} $, thereby training our classifier.
There's one important thing we need to do before training the classifier - we need to vectorize the function $ J $.
The function $ J $ as we wrote it performs an operation on, and sums over, the entire training set. There are various optimizations in the `numpy` and `scipy` libraries that depend on using matrices and vectors, so we can express our cost function in terms of matrix and vector operations to take advantage of them.
We can write our cost function as
$$ J(\vec{\theta}, X, \vec{y}) = (-1 / n) [\vec{y}' \cdot \log(h(X, \vec{\theta})) + (\vec{I}' - \vec{y}') \cdot \log(\vec{I} - h(X, \vec{\theta}))] + \lambda \vec{\theta}\cdot\vec{\theta}' $$

where:
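The same cost in vectorized form might look like the following (again a sketch with assumed names; following the formula above, the regularization term here is $ \lambda \vec{\theta}\cdot\vec{\theta}' $ over the whole vector):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(z))

def cost_vec(theta, X, y, lam=1.0):
    # one matrix-vector product replaces the per-example loop
    n = y.shape[0]
    h = sigmoid(X @ theta)                # vector of n predictions
    unreg = -(y @ np.log(h) + (1.0 - y) @ np.log(1.0 - h)) / n
    return unreg + lam * (theta @ theta)  # lambda * theta . theta'
```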
The code included with this notebook implements the algorithm in this way.
Now we're ready to look at some code. Our classifier algorithm has three components, and implementations of each can be found in `src/logistic.py`:
- The `predict()` function takes a matrix of feature vectors $ X $ and the parameter vector $ \vec{\theta} $. It multiplies the feature matrix by the parameter vector and then applies the sigmoid function to each element of the resulting vector. This creates a vector of $ n $ predictions, one for each of the $ n $ feature vectors in $ X $.
- The `cost()` function computes the cost of a parameter vector $ \vec{\theta} $ over a training set $ \{ X, \vec{y} \} $.
- The `train()` function finds the parameter vector which minimizes the cost function.

To train our classifier, we pass a training set $ \{ X, \vec{y} \} $ and an initial guess at the model parameters $ \vec{\theta} $ into the `train()` function. The optimization routines in `scipy.optimize` will find the parameter vector which minimizes the `cost()` function over the training set $ \{ X, \vec{y} \} $.
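As an illustrative sketch of how those three functions could fit together (this mirrors the description above but is not the actual contents of src/logistic.py; the predictions are clipped away from 0 and 1 so the logarithms stay finite):

```python
import numpy as np
from scipy.optimize import fmin_cg

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(z))

def predict(theta, X):
    # one sigmoid prediction per row of X
    return sigmoid(X @ theta)

def cost(theta, X, y, lam=1.0):
    n = y.shape[0]
    # clip so that log() never sees exactly 0 or 1
    h = np.clip(predict(theta, X), 1e-10, 1.0 - 1e-10)
    return -(y @ np.log(h) + (1.0 - y) @ np.log(1.0 - h)) / n + lam * (theta @ theta)

def train(theta0, X, y, lam=1.0):
    # let scipy.optimize search for the theta that minimizes cost()
    return fmin_cg(cost, theta0, args=(X, y, lam), disp=False)
```

Note that with this notebook's sign convention, $ h = 1 / (1 + e^{\vec{x} \cdot \vec{\theta}}) $, a prediction above 0.5 corresponds to a negative $ \vec{x} \cdot \vec{\theta} $.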
To test the classifier, we'll use the well-known UCI Machine Learning Repository Spambase Data Set. This data set was constructed by extracting feature vectors with 57 dimensions from over 4600 emails. Each email has been manually labelled as spam or not-spam, making this perfect for our purposes.
To download the data set, run the `download()` function in the `src/spambase.py` file:
from src import spambase
spambase.download()
downloading spam data set ... download complete, unzipping data ... complete
We use the `read_data()` function in `src/spambase.py` to read $ X $ and $ \vec{y} $, split them into training and test sets, and generate the initial value of $ \vec{\theta} $:
theta, Xtrain, Xtest, ytrain, ytest = spambase.read_data()
Then we train the classifier using the `train()` function in `src/logistic.py`:
from src import logistic
trained_theta = logistic.train(theta, Xtrain, ytrain)
Optimization terminated successfully. Current function value: 0.615512 Iterations: 5 Function evaluations: 660 Gradient evaluations: 11
`trained_theta` now defines our model and we can use it to make predictions. To test the accuracy of our classifier we generate a set of predictions for `Xtest` and then check the precision and recall against `ytest`.
from numpy import matrix, array
predictions = logistic.predict(matrix(trained_theta).transpose(), Xtest)
# recall that 0.5 is our decision boundary
predictions = [1 if p > 0.5 else 0 for p in predictions]
# make a list of tuples, each containing the predicted class and the labeled class
predictions = list(zip(predictions, [i[0] for i in ytest.tolist()]))
print(predictions[:10])
[(0, 1), (0, 1), (1, 1), (0, 0), (0, 0), (1, 0), (1, 0), (0, 0), (1, 1), (1, 1)]
We can now calculate the precision and recall of our classifier:
tp = len([1 for x in predictions if x[0] == 1 and x[0] == x[1]])
fp = len([1 for x in predictions if x[0] == 1 and x[0] != x[1]])
precision = float(tp) / (float(tp) + float(fp))
tn = len([1 for x in predictions if x[0] == 0 and x[0] == x[1]])
fn = len([1 for x in predictions if x[0] == 0 and x[0] != x[1]])
recall = float(tp) / (float(tp) + float(fn))
print("Precision: {0}".format(precision))
print("Recall: {0}".format(recall))
Precision: 0.916955017301 Recall: 0.864600326264
Of all the messages we marked as spam, about 92% were actually spam. Of all the spam messages in the test set, we found about 86% of them. Not bad for our first try.
There are many improvements that we could make to the algorithm, and many details that were omitted from our discussion for the sake of brevity.
In `src/spambase.py` the initial values of $ \vec{\theta} $ are chosen to be normally distributed about 0. This was done to prevent the `cost()` function from finding an initial cost of `Inf` or `-Inf`, which is possible because of the way `cost()` uses the logarithm. By initializing $ \vec{\theta} $ near 0, we ensure that the values passed into the logarithm stay away from 0 and the function does not blow up.
There are other ways of dealing with this problem, but we did not implement them here to avoid extra complexity. The purpose of this exercise was to learn how to implement the basics of a logistic regression classifier, not to handle all possible edge cases.
The optimization routine we used to train the classifier was `scipy.optimize.fmin_cg()`. This is an implementation of the nonlinear conjugate gradient method, which seeks the optimal solution by following the gradient (the derivative in multiple dimensions) down to a minimum. The algorithm depends on being able to calculate the gradient at each point, and its performance can be improved by deriving the gradient function analytically and passing it in via the `fprime` parameter. Since we did not do this, the `fmin_cg()` function was forced to numerically estimate the gradient at each point, which hurt performance.
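For reference, the analytic gradient is straightforward to derive. With our convention $ h(z) = 1 / (1 + e^z) $ we have $ dh/dz = -h(1-h) $, which gives $ \partial J / \partial \theta_j = \frac{1}{n}\sum_i (y_i - h_i)x_{ij} + 2\lambda\theta_j $. A sketch with assumed names, checked against finite differences:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(z))

def cost(theta, X, y, lam=1.0):
    n = y.shape[0]
    h = np.clip(sigmoid(X @ theta), 1e-10, 1.0 - 1e-10)
    return -(y @ np.log(h) + (1.0 - y) @ np.log(1.0 - h)) / n + lam * (theta @ theta)

def grad(theta, X, y, lam=1.0):
    # analytic gradient of cost(): X'(y - h) / n plus the regularization term
    n = y.shape[0]
    h = sigmoid(X @ theta)
    return X.T @ (y - h) / n + 2.0 * lam * theta

# passing fprime=grad would avoid the numerical estimation, e.g.:
# trained = fmin_cg(cost, theta0, fprime=grad, args=(X, y, lam))
```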
It's important to note that the code included with this notebook is meant as a basic implementation of the logistic regression classifier and you should not use this code for anything other than instruction. If you're looking for a classifier to use in production, you should use one of the standard implementations such as `sklearn.linear_model.LogisticRegression`.
In this discussion we did not say anything about how the logistic regression classifier compares to other classifiers such as naive Bayes or the support vector machine. Different algorithms perform better on different problems, and you should evaluate several before choosing one solution for the problem you are working on.
In this notebook we went through the basics of how a logistic regression classifier works and wrote a simple implementation. We then evaluated its performance on a spam classification problem and found that it performed fairly well.
There are many improvements that could be made but I hope this was useful, if only for a basic introduction to some of the concepts in machine learning and classification.
Thanks for reading!
Scikit-learn's implementation of the logistic regression classifier: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html