By the end of this section you will be able to train a classifier with the fit(...) method and use it to label new samples with the predict(...) method.
In this example we will perform classification of the iris data with several different classifiers.
First we'll load the iris data as we did before:
from sklearn.datasets import load_iris
iris = load_iris()
In the iris dataset example, suppose we are asked to predict the class of an individual flower given the measurements of its petals and sepals. This is a classification task, hence we have:
X = iris.data
y = iris.target
print X.shape
print y.shape
(150, 4) (150,)
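You can decode what the four feature columns and the three target values mean from the metadata bundled with the dataset (feature_names and target_names are attributes of the object returned by load_iris):
print iris.feature_names   # sepal length, sepal width, petal length, petal width (all in cm)
print iris.target_names    # setosa, versicolor, virginica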
Once the data has this format it is trivial to train a classifier, for instance a support vector machine with a linear kernel:
from sklearn.svm import LinearSVC
LinearSVC is an example of a scikit-learn classifier. If you're curious about how it is used, you can use IPython's ? feature to see the documentation:
#LinearSVC?
The first thing to do is to create an instance of the classifier. This can be done simply by calling the class name, with any arguments that the object accepts:
clf = LinearSVC(loss='l2')
clf is a statistical model whose parameters control the learning algorithm (those parameters are sometimes called hyperparameters). The hyperparameters can be supplied by the user in the constructor of the model; we will explain later how to choose a good combination using either simple empirical rules or data-driven selection:
print clf
LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True, intercept_scaling=1, loss='l2', multi_class='ovr', penalty='l2', random_state=None, tol=0.0001, verbose=0)
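As an aside, every scikit-learn estimator also exposes its hyperparameters through the get_params and set_params methods, so they can be inspected or changed after construction; a minimal illustration:
# Inspect the current hyperparameters as a dictionary
print clf.get_params()
# Change a hyperparameter in place (here the regularization strength C, kept at its default value)
clf.set_params(C=1.0)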
At this point the model has not yet been fitted to the data. Its fitted parameters will be learned automatically from the data by calling the fit method with the data X and labels y:
clf = clf.fit(X, y)
We can now see some of the fit parameters within the classifier object.
In scikit-learn, parameters defined by training have a trailing underscore.
clf.coef_
# clf.coef_ has shape (n_classes, n_features): these weights are used in the inner product with the features
array([[ 0.18423472,  0.45122606, -0.80794397, -0.45071444],
       [ 0.05194039, -0.89261957,  0.40495891, -0.93818941],
       [-0.85066873, -0.98658995,  1.38093611,  1.86531807]])
clf.intercept_
# clf.intercept_ has shape (n_classes,): the constants in the decision function
array([ 0.10955953, 1.66695128, -1.70960306])
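To make the role of these two arrays concrete, here is a small check (a sketch reusing the fitted clf from above): the per-class decision scores are the inner products of the samples with coef_ plus intercept_, and the predicted class is the one with the highest score.
import numpy as np
# One decision score per sample and per class
decision_scores = np.dot(X, clf.coef_.T) + clf.intercept_
# These match LinearSVC's own decision_function ...
print np.allclose(decision_scores, clf.decision_function(X))
# ... and taking the argmax over classes reproduces the predictions
print (np.argmax(decision_scores, axis=1) == clf.predict(X)).all()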
Once the model is trained, it can be used to predict the most likely outcome for unseen data. For instance, let us define a new sample that looks like the first sample of the iris dataset:
X_new = [[ 5.0, 3.6, 1.3, 0.25]]
print clf.predict(X_new)
[0]
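The returned value is a label id; indexing into iris.target_names turns it into a species name (a one-line convenience using the dataset object loaded earlier):
print iris.target_names[clf.predict(X_new)]   # ['setosa']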
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm
X = iris.data[:, 2:4]  # we only take two features (petal length and petal width) so the decision boundaries can be plotted in 2D
y = iris.target
h = .02 # step size in the mesh
# we create an instance of SVM and fit our data. We do not scale our
# data since we want to plot the support vectors
C = 1.0 # SVM regularization parameter
svc = svm.SVC(kernel='linear', C=C).fit(X, y)
rbf_svc = svm.SVC(kernel='rbf', gamma=0.7, C=C).fit(X, y)
poly_svc = svm.SVC(kernel='poly', degree=3, C=C).fit(X, y)
lin_svc = svm.LinearSVC(C=C).fit(X, y)
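# Quick sanity check: every scikit-learn classifier provides a score method.
# Training accuracy is optimistic (no held-out data), but it gives a rough comparison of the four models.
for name, model in [('SVC linear', svc), ('LinearSVC', lin_svc),
                    ('SVC rbf', rbf_svc), ('SVC poly', poly_svc)]:
    print name, model.score(X, y)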
# create a mesh to plot in
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))
# title for the plots
titles = ['SVC with linear kernel',
'LinearSVC (linear kernel)',
'SVC with RBF kernel',
'SVC with polynomial (degree 3) kernel']
plt.figure(figsize=(16,10))
for i, clf in enumerate((svc, lin_svc, rbf_svc, poly_svc)):
    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max] x [y_min, y_max].
    plt.subplot(2, 2, i + 1)
    plt.subplots_adjust(wspace=0.1, hspace=0.25)
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)
    # Plot also the training points
    plt.scatter(X[:, 0], X[:, 1], s=80, c=y, cmap=plt.cm.Paired)
    plt.xlabel('Petal length (first feature used)', fontsize=15)
    plt.ylabel('Petal width (second feature used)', fontsize=15)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.xticks(())
    plt.yticks(())
    plt.title(titles[i], fontsize=15)
plt.show()
Now we'll take a few minutes and try out another learning model. Because of scikit-learn's uniform interface, the syntax is identical to that of LinearSVC above.
There are many classifiers to choose from; you could try any of the methods discussed at http://scikit-learn.org/stable/supervised_learning.html. Alternatively, you can explore what's available in scikit-learn using the tab-completion feature. For example, import the linear_model submodule:
from sklearn import linear_model
Then use tab completion to see what's available: type linear_model. followed by the tab key to get an interactive list of the functions within this submodule. The ones which begin with capital letters are the models that are available.
Now select a new classifier and try out a classification of the iris data. Some good choices are:
sklearn.naive_bayes.GaussianNB: a Gaussian Naive Bayes model. This is an unsophisticated model which can be trained very quickly. It is often used to obtain baseline results before moving to a more sophisticated classifier.
sklearn.svm.LinearSVC: Support Vector Machines without kernels, based on liblinear.
sklearn.svm.SVC: Support Vector Machines with kernels, based on libsvm.
sklearn.linear_model.LogisticRegression: regularized logistic regression, based on liblinear.
sklearn.linear_model.SGDClassifier: regularized linear models (SVM or logistic regression) trained with a Stochastic Gradient Descent algorithm written in Cython.
sklearn.neighbors.KNeighborsClassifier: a k-Nearest Neighbors classifier based on a ball tree data structure for low-dimensional data and brute-force search for high-dimensional data.
sklearn.tree.DecisionTreeClassifier: a classifier based on a series of binary decisions. This is another very fast classifier, which can be very powerful.
Choose one of the above, import it, and use the ? feature to learn about it; a short example with GaussianNB follows below.
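For instance, here is a minimal sketch with GaussianNB, the first option above; note that the fit, predict and score calls look exactly like those of LinearSVC:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X, y)
print gnb.predict(X[:5])   # predicted labels for the first five samples
print gnb.score(X, y)      # training accuracy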
Some models have additional prediction modes. For example, if clf is a LogisticRegression classifier, then it is possible to make a probabilistic prediction for any point. This can be done through the predict_proba method:
#linear_model.LogisticRegression?
Now instantiate this model as we did with LinearSVC above.
clf2 = linear_model.LogisticRegression(C=1e5)
Now use our data X and y to train the model, using the fit(...) method:
X = iris.data[:, 2:4]  # again restrict ourselves to petal length and petal width
y = iris.target
clf2.fit(X, y)
LogisticRegression(C=100000.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, penalty='l2', random_state=None, tol=0.0001)
Now call the predict_proba method and find the class probabilities for a new sample X_new (two features, petal length and petal width, matching the columns the model was trained on).
X_new = [[3.6, 0.25]]
print clf2.predict_proba(X_new)
[[ 8.51520645e-02 9.14847935e-01 3.22256261e-10]]
The result gives the probability (between zero and one) that the test point comes from any of the three classes.
This means that the model estimates that the sample in X_new has:
about an 8.5% probability of belonging to class 0 (target = 0),
about a 91.5% probability of belonging to class 1 (target = 1),
and an essentially zero probability of belonging to class 2 (target = 2).
Of course, the predict method, which outputs the label id of the most likely outcome, is also available:
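# The most likely class for X_new (given the probabilities above, this should be class 1)
print clf2.predict(X_new)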
# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max] x [y_min, y_max].
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = clf2.predict(np.c_[xx.ravel(), yy.ravel()])
Zprob = clf2.predict_proba(np.c_[xx.ravel(), yy.ravel()])
# Put the result into a color plot
Z = Z.reshape(xx.shape)
Zprob = Zprob.reshape(xx.shape[0],xx.shape[1],3)
plt.figure(figsize=(16,10))
# title for the plots
titles = ['Logistic regression (LR)',
'LR class 1 - probabilities',
'LR class 2 - probabilities',
'LR class 3 - probabilities']
labels = ['class 1', 'class 2', 'class 3']
# number of classes and plot colors
n_classes = np.amax(y)+1
plot_colors = 'rgy'
for i, boundaries in enumerate((Z, Zprob[:,:,0], Zprob[:,:,1], Zprob[:,:,2])):
    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max] x [y_min, y_max].
    plt.subplot(2, 2, i + 1)
    plt.subplots_adjust(wspace=0.1, hspace=0.25)
    plt.pcolormesh(xx, yy, boundaries, cmap=plt.cm.Paired)
    plt.colorbar()
    # Plot also the training points
    for j, color in zip(range(n_classes), plot_colors):
        idx = np.where(y == j)
        plt.scatter(X[idx, 0], X[idx, 1], s=100, c=color, label=labels[j], cmap=plt.cm.Paired)
    #plt.scatter(X[:, 0], X[:, 1], s=100, c=y, edgecolors='k', cmap=plt.cm.Paired)
    plt.xlabel('Petal length', fontsize=15)
    plt.ylabel('Petal width', fontsize=15)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.xticks(())
    plt.yticks(())
    plt.title(titles[i], fontsize=15)
    plt.legend(loc='upper left')
plt.show()
Predicting a new value is nice, but how do we gauge how well we've done? We'll explore this in more depth later, but here's a quick taste now.
Let's get a rough evaluation of our model by using it to predict the values of the training data:
y_model = clf2.predict(X)
print y_model == y
print "Accuracy:", float(np.sum(y_model == y)) / len(y)
[ True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True False True True True True True True False True True True True True False True True True True True True True True True True True True True True True True True True True True True True False True True True True True True True True True True True True False True True True True True True True True True True True True True False True True True True True True True True True True True True True True True True] Accuracy: 0.96
We see that most of the predictions are correct!
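The same number can also be obtained with the accuracy_score helper from sklearn.metrics (an equivalent one-liner):
from sklearn.metrics import accuracy_score
print accuracy_score(y, y_model)   # fraction of training samples predicted correctly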
Now let's try 10-fold cross-validation, to see if the accuracy holds up more rigorously.
from sklearn.cross_validation import cross_val_score
# evaluate the model using 10-fold cross-validation
scores = cross_val_score(linear_model.LogisticRegression(C=1e2), X, y, scoring='accuracy', cv=10)
print scores
print scores.mean()
[ 1. 1. 1. 0.93333333 0.93333333 0.93333333 0.86666667 1. 1. 1. ]
0.966666666667
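A common way to summarize these ten scores is to report the mean together with the standard deviation (a one-line sketch using the scores array computed above):
print "Mean accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std())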