In this notebook we will see how a general machine learning problem is represented and solved in Shogun. As a primer to Shogun's many capabilities, we will see how various types of data and their attributes are handled, and how predictions are made.
Machine learning means we want to make and improve predictions or decisions based on some data. The learning can be of different types: Supervised, Unsupervised, etc. Shogun provides a host of functionality for most of these. To get off the mark, let us see how Shogun handles the attributes of the data using Features. A feature is generally an individual measurable property of the data being observed.
Shogun supports a wide range of feature representations. Among these are String features, Dense features, Sparse features, etc. To start with, let's see how we can define some easy data using float type values.
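The difference between dense and sparse features is easy to picture: dense storage keeps every entry of the matrix, while sparse storage keeps only the non-zero (index, value) pairs, which pays off when most entries are zero. A tiny sketch in plain Python, just for intuition (this is not shogun's internal layout):

```python
# Dense vs. sparse storage of the same vector (illustration only):
dense = [0.0, 0.0, 3.5, 0.0, 0.0, 0.0, 1.2, 0.0]
# keep only the non-zero entries as index -> value
sparse = {i: v for i, v in enumerate(dense) if v != 0.0}
print(sparse)  # -> {2: 3.5, 6: 1.2}
```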
Let us consider an example, instead of just abstract concepts. We have a dataset with various attributes of individuals, and we know whether or not each of them is diabetic. This is a classic machine learning problem: we know the attribute values of patients who are diabetic and of those who are not, and want to predict whether a new individual with known attributes is diabetic. This type of learning problem falls under Supervised learning.
We will play with two attributes, Plasma glucose concentration and Body Mass Index (BMI), and try to learn something about their relationship with the disease. For now we generate some random values; later we will try this on a real-world dataset.
#To import all shogun classes
from modshogun import *
#numpy and pyplot for data handling and plotting
from numpy import *
from matplotlib.pyplot import *
#Generate some random data
X = 2 * random.randn(20,2)
traindata=r_[X + 5, X + 8].T
print traindata
[[ 6.04738887 5.05927385 8.6155217 1.68794317 3.67849965 3.33080522 7.29608287 4.49767861 3.76506408 7.37308062 2.81125636 4.93952075 6.74775034 4.59171972 2.79037788 7.74826855 4.98350423 4.12205125 3.46974765 4.17738159 9.04738887 8.05927385 11.6155217 4.68794317 6.67849965 6.33080522 10.29608287 7.49767861 6.76506408 10.37308062 5.81125636 7.93952075 9.74775034 7.59171972 5.79037788 10.74826855 7.98350423 7.12205125 6.46974765 7.17738159] [ 2.30753949 6.46830775 4.80789714 4.32066431 1.75856142 3.18375054 4.39286335 3.2282854 6.20755361 5.83269262 3.34168723 5.004786 6.57484942 4.8575268 2.71612347 5.17725946 3.44037076 9.14311705 5.65248302 5.26326961 5.30753949 9.46830775 7.80789714 7.32066431 4.75856142 6.18375054 7.39286335 6.2282854 9.20755361 8.83269262 6.34168723 8.004786 9.57484942 7.8575268 5.71612347 8.17725946 6.44037076 12.14311705 8.65248302 8.26326961]]
We now have a matrix with 2 rows; these rows are our two attributes/features. Let's call it the feature matrix. To convert these features to shogun format we use RealFeatures, which are nothing but the above-mentioned Dense features of 64-bit float type. To do this, call RealFeatures with the feature matrix as the argument.
feats_train=RealFeatures(traindata)
We need to label the data to be able to differentiate between the groups. Shogun provides various types of labels to do this through CLabels. In this particular problem our data can be of two types, either diabetic or non-diabetic, so we need binary labels. This makes it a Binary Classification problem, where we classify the data into two groups: individuals who have diabetes and those who do not.
#create array of labels with 1 and -1
trainlab=concatenate((ones(20),-ones(20)))
print trainlab
#convert to shogun format labels
labels=BinaryLabels(trainlab)
[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1.]
Shogun provides tools for classification, regression, etc. through CMachine. Basically we need to $\it train$ the machine on some training data to be able to learn from it. Then we $\it apply$ it to test data to get predictions.
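As a rough sketch of that interface, here is a hypothetical toy "machine" in plain Python: a nearest-class-mean classifier, not shogun's CMachine, included only to illustrate the train-then-apply pattern:

```python
import numpy as np

class MeanThresholdMachine:
    """Toy classifier illustrating the train/apply pattern:
    learn the two class means, then label new points by the nearer mean."""
    def train(self, X, y):
        # X: (n_features, n_samples) feature matrix, y: labels in {+1, -1}
        self.mean_pos = X[:, y == 1].mean(axis=1)
        self.mean_neg = X[:, y == -1].mean(axis=1)
        return self
    def apply(self, X):
        # distance of each column of X to the two class means
        d_pos = np.linalg.norm(X - self.mean_pos[:, None], axis=0)
        d_neg = np.linalg.norm(X - self.mean_neg[:, None], axis=0)
        return np.where(d_pos < d_neg, 1, -1)

X = np.array([[0., 1., 10., 11.],
              [0., 1., 10., 11.]])
y = np.array([1, 1, -1, -1])
machine = MeanThresholdMachine().train(X, y)
print(machine.apply(np.array([[0.5, 10.5], [0.5, 10.5]])))  # -> [ 1 -1]
```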
# Plot the training data
figure(figsize=(6,4))
gray()
_=scatter(traindata[0, :], traindata[1,:], c=trainlab, s=50)
title("Training Data")
gray()
We can intuitively see from the plot how the data could be separated. Now let us see if our classifier is up to the task; with real data we are unlikely to get such an easy separation, though. Moving on to the prediction part, we will use Liblinear, a linear SVM, to do the classification (more on SVMs in this notebook).
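Under the hood, a linear SVM such as Liblinear learns a weight vector w and bias b, and predicts sign(w·x + b) for a point x. A sketch with hand-picked, purely illustrative values (not the ones LibLinear will learn here):

```python
import numpy as np

# Hypothetical weights and bias for illustration only
w = np.array([0.8, 0.6])
b = -8.0

def predict(X):
    # X: (2, n_samples) feature matrix, same layout as traindata above
    return np.sign(w.dot(X) + b)

# one point from each of the two clusters
points = np.array([[3.0, 9.0],
                   [3.0, 9.0]])
print(predict(points))  # -> [-1.  1.]
```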
#parameters for the svm
C=0.9
epsilon=1e-3
svm=LibLinear(C, feats_train, labels)
svm.set_liblinear_solver_type(L2R_L2LOSS_SVC)
svm.set_epsilon(epsilon)
#train
svm.train()
We will now apply the trained machine on test features to get predictions. Let us use the whole X-Y grid as test data.
size=100
x1=linspace(0, 14, size)
x2=linspace(0, 14, size)
x, y=meshgrid(x1, x2)
#Generate X-Y grid test data
grid=RealFeatures(array((ravel(x), ravel(y))))
#apply on test grid
predictions = svm.apply(grid)
z=predictions.get_values().reshape((size, size))
#plot
jet()
figure(figsize=(8,6))
title("Classification")
c=pcolor(x, y, z)
_=contour(x, y, z, linewidths=1, colors='black', hold=True)
_=colorbar(c)
gray()
_=scatter(traindata[0, :], traindata[1,:], c=trainlab, s=50)
gray()
As we can see, a nice boundary is predicted that classifies the data, and it is close to what we would have drawn by eye! To play with this interactively, have a look at the web demo.
Shogun provides the capability to load datasets of different formats through CFile. Let us now use a real-world dataset related to the previous example: the Pima Indians Diabetes data set. The file is in LibSVM format, so we load it using shogun's LibSVMFile class. Since LibSVM format files have the labels included in the file, we get them with load_with_labels.
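For reference, the LibSVM text format stores one example per line as `<label> <index>:<value> <index>:<value> ...`, listing only the non-zero entries, which is why shogun reads it into SparseRealFeatures. A minimal parser, just for illustration (LibSVMFile does all of this for us; the sample line is made up):

```python
def parse_libsvm_line(line, num_features):
    # "<label> <index>:<value> ..." with only non-zero entries listed
    parts = line.split()
    label = float(parts[0])
    vec = [0.0] * num_features
    for item in parts[1:]:
        idx, val = item.split(':')
        vec[int(idx) - 1] = float(val)   # LibSVM indices are 1-based
    return label, vec

label, vec = parse_libsvm_line("-1 1:0.487437 5:0.21608", 8)
print(label, vec)  # -> -1.0 [0.487437, 0.0, 0.0, 0.0, 0.21608, 0.0, 0.0, 0.0]
```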
f=SparseRealFeatures()
#Load the file and generate labels.
trainlab=f.load_with_labels(LibSVMFile('../../../data/toy/diabetes_scale.svm'))
labels=BinaryLabels(trainlab)
#Get the feature matrix
mat=f.get_full_feature_matrix()
feats=array(mat[1])
feats=vstack((feats, array(mat[5])))
print feats, feats.shape
[[ 0.487437 -0.145729 0.839196 ..., 0.21608 0.266332 -0.0653266 ] [ 0.00149028 -0.207153 -0.305514 ..., -0.219076 -0.102832 -0.0938897 ]] (2, 768)
Once we get hold of the feature matrix, we extract vectors 1 and 5, which are the attributes we are interested in: Plasma glucose concentration and Body Mass Index (BMI).
#convert to shogun format
feats_train=RealFeatures(feats)
#plot the training data
figure(figsize=(6,5))
_=scatter(feats[0, :], feats[1,:], c=trainlab, s=50)
_=xlabel('Plasma glucose concentration')
_=ylabel('Body mass index')
What follows next is the now familiar routine of training and applying, similar to the previous section.
C=0.9
epsilon=1e-3
svm=LibLinear(C, feats_train, labels)
svm.set_liblinear_solver_type(L2R_L2LOSS_SVC)
svm.set_epsilon(epsilon)
#train
svm.train()
True
size=100
x1=linspace(-1.2, 1.2, size)
x2=linspace(-1.2, 1.2, size)
x, y=meshgrid(x1, x2)
#Generate X-Y grid test data
grid=RealFeatures(array((ravel(x), ravel(y))))
#apply on test grid
predictions = svm.apply(grid)
z=predictions.get_values().reshape((size, size))
#plot
jet()
figure(figsize=(8,6))
title("Classification")
c=pcolor(x, y, z)
_=contour(x, y, z, linewidths=1, colors='black', hold=True)
_=colorbar(c)
gray()
_=scatter(feats[0, :], feats[1,:], c=trainlab, s=50)
_=xlabel('Plasma glucose concentration')
_=ylabel('Body mass index')
This seems like a decent enough prediction. We can thus infer that individuals below a certain level of BMI and glucose are most likely safe. For tighter decision boundaries, explore shogun's many other classifiers, including kernel machines.
How do you assess the quality of a prediction? Shogun provides various ways to do this through CEvaluation. To keep things simple, let us split the dataset, train on one part and evaluate performance on the other using ROCEvaluation, which computes the area under the ROC curve.
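For intuition, the area under the ROC curve equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one. A minimal numpy version of that pairwise definition (illustration only, not shogun's implementation):

```python
import numpy as np

def auc(scores, labels):
    """Area under the ROC curve: fraction of (positive, negative) pairs
    where the positive example gets the higher score (ties count half)."""
    pos = scores[labels == 1]
    neg = scores[labels == -1]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / float(len(pos) * len(neg))

scores = np.array([0.9, 0.4, 0.6, 0.1])
labels = np.array([1, 1, -1, -1])
print(auc(scores, labels))  # -> 0.75
```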
#split the 768 examples for training and evaluation
feats=array(mat[1])
feats_t=feats[:700]
feats_e=feats[700:]
feats=array(mat[5])
feats_t1=feats[:700]
feats_e1=feats[700:]
feats_t=vstack((feats_t, feats_t1))
feats_e=vstack((feats_e, feats_e1))
feats_train=RealFeatures(feats_t)
feats_evaluate=RealFeatures(feats_e)
Let's see the accuracy by applying on test features.
label_t=trainlab[:700]
labels=BinaryLabels(label_t)
label_e=trainlab[700:]
labels_true=BinaryLabels(label_e)
svm=LibLinear(C, feats_train, labels)
svm.set_liblinear_solver_type(L2R_L2LOSS_SVC)
svm.set_epsilon(epsilon)
#train and evaluate
svm.train()
output=svm.apply(feats_evaluate)
#use ROCEvaluation to get the area under the ROC curve
evaluator=ROCEvaluation()
print 'auROC(%):'
print evaluator.evaluate(output,labels_true)*100
auROC(%): 80.6684733514
For more reliable evaluation, cross-validation is used. You might also have wondered how the parameters of the classifier are selected: shogun has a model selection framework to find the best parameters. More on these topics in this notebook.
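The idea behind k-fold cross-validation can be sketched with plain index bookkeeping (shogun's CrossValidation machinery handles this together with splitting strategies and evaluation criteria; this is just the concept):

```python
import numpy as np

def kfold_indices(n_samples, k):
    # Partition the sample indices into k folds; for each fold, train on
    # the other k-1 folds, evaluate on the held-out one, average the scores.
    idx = np.arange(n_samples)
    folds = np.array_split(idx, k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

for train_idx, test_idx in kfold_indices(10, 5):
    print(len(train_idx), len(test_idx))  # each split: 8 train, 2 test
```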