As a first step we need a classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn import model_selection  # the old cross_validation module was removed in scikit-learn 0.20
from sklearn import metrics
Scikit-learn ships with several datasets that are ready for use:
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
A scikit-learn data object is a container object whose interesting attributes are:
X = data.data
y = data.target
data.target_names
array(['malignant', 'benign'], dtype='<U9')
data.feature_names
array(['mean radius', 'mean texture', 'mean perimeter', 'mean area', 'mean smoothness', 'mean compactness', 'mean concavity', 'mean concave points', 'mean symmetry', 'mean fractal dimension', 'radius error', 'texture error', 'perimeter error', 'area error', 'smoothness error', 'compactness error', 'concavity error', 'concave points error', 'symmetry error', 'fractal dimension error', 'worst radius', 'worst texture', 'worst perimeter', 'worst area', 'worst smoothness', 'worst compactness', 'worst concavity', 'worst concave points', 'worst symmetry', 'worst fractal dimension'], dtype='<U23')
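As a quick sanity check on the container (a sketch, assuming the standard load_breast_cancer dataset), we can also inspect the shapes of X and y and the class encoding:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X = data.data
y = data.target

# 569 examples, each described by 30 features
print(X.shape)        # (569, 30)
print(y.shape)        # (569,)
# the classes are encoded as 0 (malignant) and 1 (benign)
print(np.unique(y))   # [0 1]
```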
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.4, random_state=0)
classifier = LogisticRegression(solver='liblinear')  # liblinear is the solver used for the outputs shown below
classifier.fit(X_train, y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1, penalty='l2', random_state=None, solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
y_pred = classifier.predict(X_test)
Let's compute the accuracy of our predictions (in two different ways):
len(np.where(np.equal(y_pred, y_test))[0])/len(y_test)
np.sum(y_pred==y_test)/len(y_test)
0.9692982456140351
0.9692982456140351
We can do the same using scikit-learn:
metrics.accuracy_score(y_test, y_pred)
0.9692982456140351
Now let's compute accuracy using cross-validation instead:
model_selection.cross_val_score(classifier, X, y, cv=5, scoring='accuracy')
array([ 0.93913043, 0.93913043, 0.97345133, 0.95575221, 0.96460177])
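The five per-fold scores are usually summarized by their mean (and standard deviation). A minimal sketch, assuming the same dataset and classifier as above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

data = load_breast_cancer()
classifier = LogisticRegression(solver='liblinear')

# five per-fold accuracies, summarized by mean and standard deviation
scores = cross_val_score(classifier, data.data, data.target, cv=5, scoring='accuracy')
print(scores.mean(), scores.std())
```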
You can obtain scores for other metrics, such as area under the ROC curve:
model_selection.cross_val_score(classifier, X, y, cv=5, scoring='roc_auc')
array([ 0.99418605, 0.99192506, 0.99731724, 0.98222669, 0.99664655])
It's often more useful to first obtain the cross-validated predictions, and then compute accuracy (or any other metric) from them:
y_predict = model_selection.cross_val_predict(classifier, X, y, cv=5)
metrics.accuracy_score(y, y_predict)
0.95430579964850615
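Having the predictions in hand means any metric can be derived from them, not just accuracy. For example, a confusion matrix (a sketch under the same dataset and classifier as above):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn import metrics

data = load_breast_cancer()
X, y = data.data, data.target

# reuse the cross-validated predictions to compute a different metric
y_predict = cross_val_predict(LogisticRegression(solver='liblinear'), X, y, cv=5)
cm = metrics.confusion_matrix(y, y_predict)  # rows: true class, columns: predicted class
print(cm)
```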
Here's an alternative way of doing cross-validation. We first divide the data into folds:
cv = model_selection.StratifiedKFold(n_splits=5)
Using this division of data into folds we can run cross-validation:
y_predict = model_selection.cross_val_predict(classifier, X, y, cv=cv)
metrics.accuracy_score(y, y_predict)
0.95430579964850615
We can see how examples were divided into folds by recording, for each example, the fold in which it appears as test data:
test_folds = np.zeros(len(y), dtype=int)
for fold, (train_index, test_index) in enumerate(cv.split(X, y)):
    test_folds[test_index] = fold
print(test_folds)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 1 0 0 0 0 1 0 1 1 0 0 0 0 1 0 1 1 0 1 0 1 1 0 0 0 1 1 0 1 1 1 0 0 0 1 0 0 1 1 0 0 0 1 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 1 1 0 1 1 0 0 0 1 1 0 1 0 1 1 0 1 1 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 2 1 1 2 2 1 1 2 2 1 1 1 1 2 1 1 2 2 2 1 2 1 2 1 1 1 2 1 1 2 2 1 2 2 2 2 1 2 2 2 1 2 1 2 1 1 2 1 2 2 2 2 1 1 2 2 1 1 1 2 1 1 1 1 1 2 2 1 1 2 1 1 2 2 1 2 1 1 1 1 2 1 1 1 1 1 2 1 2 2 2 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 3 1 3 1 1 3 1 1 3 1 3 3 1 1 1 1 1 2 2 2 2 2 2 2 2 3 2 2 3 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 3 2 3 2 2 2 2 3 3 3 2 2 2 2 3 2 3 2 3 2 2 2 3 2 2 2 2 2 2 2 3 3 3 2 2 2 2 2 2 2 2 2 2 2 3 3 2 3 3 3 2 3 3 2 2 2 2 2 3 2 2 2 2 3 3 3 3 3 4 3 3 4 4 3 3 3 3 3 3 4 3 3 3 3 3 3 3 4 3 3 3 3 3 4 3 3 4 3 3 3 3 3 3 3 3 3 3 3 3 4 3 4 4 3 4 3 3 3 3 3 4 3 3 4 3 4 3 3 4 3 4 3 3 3 3 3 3 3 3 4 4 3 3 3 3 3 3 4 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4]
The folds simply follow the order of the examples, which is not ideal, so let's shuffle things a bit:
cv = model_selection.StratifiedKFold(n_splits=5, shuffle=True)
test_folds = np.zeros(len(y), dtype=int)
for fold, (train_index, test_index) in enumerate(cv.split(X, y)):
    test_folds[test_index] = fold
print(test_folds)
[3 1 0 3 2 4 3 4 1 2 2 4 1 4 0 0 3 0 1 2 3 1 2 4 2 0 1 1 2 0 3 4 0 1 1 0 1 4 4 3 2 3 3 1 0 1 4 4 2 4 1 2 3 1 0 0 2 2 2 3 0 4 1 4 1 4 0 3 1 2 1 0 4 0 0 0 2 2 1 4 0 1 0 3 3 0 4 3 3 0 4 0 3 3 0 3 2 1 1 4 2 0 3 1 1 4 4 2 2 0 3 1 0 2 2 4 4 0 1 2 3 4 1 4 1 3 1 0 4 4 3 1 2 0 2 2 0 4 0 3 4 2 2 3 1 2 4 4 1 0 0 3 2 3 2 0 4 2 0 0 2 3 0 1 1 2 2 2 2 1 4 0 3 3 3 0 0 3 2 2 3 2 0 3 1 2 1 0 4 0 4 4 2 3 4 3 0 2 0 1 0 0 3 1 4 4 4 2 1 2 3 1 0 3 3 4 1 1 3 2 4 0 0 3 1 2 2 2 3 0 3 3 4 0 1 1 0 1 3 2 3 4 1 1 2 1 3 3 0 3 2 2 2 2 3 4 4 1 4 0 1 3 1 3 0 4 4 1 3 0 4 2 2 2 3 2 2 1 1 0 1 1 1 2 4 2 0 3 3 1 2 3 1 4 0 1 0 4 2 1 3 2 1 2 4 0 4 0 4 2 2 4 4 1 1 2 3 1 1 1 4 4 3 4 4 3 1 0 4 4 2 3 2 4 3 4 4 3 0 1 0 0 2 3 2 4 0 0 3 2 4 1 0 3 4 0 4 3 1 3 4 4 4 4 0 2 3 3 2 4 3 1 0 2 1 4 4 3 1 4 3 4 1 1 2 4 2 2 1 1 3 0 2 3 1 2 2 0 2 4 0 1 0 0 2 1 0 3 3 3 0 0 4 1 0 1 2 0 0 3 0 4 1 0 1 3 3 0 0 2 1 3 0 3 0 1 3 1 3 3 2 4 0 0 0 0 3 0 3 2 1 3 1 3 3 2 1 0 4 0 2 2 4 0 4 2 1 0 4 2 2 0 0 3 4 2 3 2 4 1 2 1 4 3 3 3 1 0 4 0 1 2 4 4 1 3 4 1 2 0 0 4 2 1 3 3 3 1 0 2 2 0 1 4 3 0 0 1 4 4 1 3 4 1 1 4 3 4 3 3 2 0 3 4 2 3 4 1 0 1 4 4 3 1 1 2 0 3 1 4 4 2 4 4 1 1 2 0 1 2 2 0 2 4 4 2 0 3 2]
If you run the division into folds multiple times, you will get a different assignment each time:
cv = model_selection.StratifiedKFold(n_splits=5, shuffle=True)
test_folds = np.zeros(len(y), dtype=int)
for fold, (train_index, test_index) in enumerate(cv.split(X, y)):
    test_folds[test_index] = fold
print(test_folds)
[3 1 0 2 1 4 0 0 4 4 0 4 2 3 3 4 2 0 4 1 2 4 3 3 1 0 1 4 3 3 3 2 0 1 1 0 3 4 1 3 0 0 0 2 0 3 0 4 4 0 0 0 0 4 1 4 3 2 2 1 1 4 2 4 4 2 4 1 1 3 1 1 3 2 4 4 3 3 3 0 4 0 0 4 3 1 0 0 1 1 1 0 0 0 1 2 1 4 1 4 0 1 2 2 2 1 4 3 2 2 4 3 2 3 4 2 1 3 2 0 1 3 4 4 2 2 3 2 2 2 3 0 3 4 1 0 1 0 1 3 3 4 2 2 0 1 4 0 0 2 2 1 3 3 3 0 0 0 2 1 1 0 4 0 1 4 1 0 0 1 3 3 3 2 0 3 0 0 4 0 4 2 0 4 1 2 4 4 3 4 0 1 1 4 0 3 2 2 2 3 2 0 2 3 0 0 1 2 0 2 1 0 4 0 1 2 3 3 2 0 2 1 4 3 0 1 1 1 2 3 0 1 3 2 0 2 4 1 0 0 0 0 4 3 2 4 4 3 4 3 4 4 3 1 0 2 4 1 0 4 3 2 1 1 0 2 2 3 0 4 4 4 3 4 0 3 2 1 2 4 2 4 1 2 1 1 1 1 3 1 2 3 1 0 4 4 3 4 2 2 1 0 4 0 1 3 0 1 1 1 4 3 0 4 2 1 0 1 0 4 2 3 4 4 4 2 2 2 0 1 3 0 4 3 1 3 2 4 3 1 3 3 4 0 1 1 3 0 2 4 0 4 4 3 3 3 4 0 0 4 2 3 0 0 0 4 1 1 2 4 1 4 1 2 4 3 3 2 1 4 4 1 4 1 4 1 4 4 2 4 2 0 2 3 2 2 3 2 3 1 2 2 4 3 0 1 2 4 2 2 2 3 3 3 4 0 3 2 4 3 2 2 2 0 0 3 3 4 0 3 3 1 1 2 0 4 2 3 1 0 3 3 4 0 1 1 4 1 4 2 4 3 2 4 2 3 1 0 3 2 2 3 1 0 2 3 1 2 0 4 3 0 0 3 2 2 4 3 3 3 2 4 1 1 0 1 2 1 4 2 1 1 3 2 0 4 3 0 1 2 2 1 4 1 3 0 1 3 3 2 3 1 3 4 1 3 1 4 3 3 2 4 2 2 1 2 0 0 2 2 2 0 3 4 0 4 0 4 1 4 4 1 0 2 1 0 2 3 0 0 1 4 1 0 0 0 3 1 3 0 1 4 1 3 3 1 2 0 1]
If you want to consistently get the same division into folds, fix the seed:
cv = model_selection.StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
# random_state sets the seed for the random number generator.
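With a fixed random_state the shuffling is reproducible: constructing the splitter twice with the same seed yields identical folds. A small sketch (the `fold_assignment` helper is introduced here for illustration):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold

data = load_breast_cancer()
X, y = data.data, data.target

def fold_assignment(cv):
    # record, for every example, the fold in which it appears as test data
    folds = np.zeros(len(y), dtype=int)
    for fold, (train_index, test_index) in enumerate(cv.split(X, y)):
        folds[test_index] = fold
    return folds

a = fold_assignment(StratifiedKFold(n_splits=5, shuffle=True, random_state=0))
b = fold_assignment(StratifiedKFold(n_splits=5, shuffle=True, random_state=0))
print(np.array_equal(a, b))  # True
```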