
Let's learn to train a machine learning algorithm and test it

This notebook will teach you to use scikit-learn, a popular machine learning package, to train a simple machine learning algorithm for a classification task.

Take a look at this website and different examples to explore further.

I encourage you to play around with the code and see what happens!

We will start by loading the necessary libraries to the workspace.

In [ ]:

# You don't want to change anything here now

import numpy as np   # For some numerical stuff
import matplotlib.pyplot as plt # For making beautiful plots
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier  # A simple machine learning model known as KNN
from sklearn.model_selection import train_test_split # A utility to split data (it lived in sklearn.cross_validation before scikit-learn 0.18)
from sklearn.metrics import precision_score
%matplotlib inline

# You may see some messages below this cell; don't worry about them

Now let's load the data into our workspace

In [ ]:

dataset = load_iris() # Load the complete iris data structure to this variable

# Now let's get the features
features = dataset['data']

# Let's also get the names of the features
feature_names = dataset['feature_names']

# The class labels
labels = dataset['target']

In [ ]:

# Let's look at the feature names, the dimensions (shape) of the feature array, and how many classes are present.
# Verify that the number of feature names equals the number of columns.

print('Feature names are: %s' % feature_names)

print('\nThe feature array has %d rows and %d columns' % (features.shape[0], features.shape[1]))

print('\nThere are %d classes of objects in the dataset' % (len(np.unique(labels))))
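
Beyond the number of classes, it can help to check how many samples each class contains. The snippet below is a minimal sketch of one way to do that; it reloads the data under the name `iris` (a name chosen here only so the snippet runs on its own) and uses the `target_names` entry of the dataset to show the species names.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()  # reloaded here so the snippet is self-contained

# return_counts=True gives the distinct labels and how often each one occurs
class_ids, class_counts = np.unique(iris['target'], return_counts=True)

for class_id, count in zip(class_ids, class_counts):
    # target_names maps the integer label to the species name
    print('Class %d (%s): %d samples' % (class_id, iris['target_names'][class_id], count))
```

If the counts are roughly equal, the dataset is balanced, which makes a simple score easier to interpret.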

Let's plot the data in a two-dimensional space with the first feature on the x-axis and the second on the y-axis

In [ ]:

index_1 = 0 # Modify this to change the x-axis. Right now it takes the first column. [Python indexing starts at 0]
index_2 = 1 # Modify this to change the y-axis

plt.scatter(features[:,index_1],features[:,index_2],c=labels) # Make the scatter plot
plt.xlabel(feature_names[index_1])
plt.ylabel(feature_names[index_2])

Split the data into training and test samples

Generally, when training a machine learning algorithm, we have to validate its accuracy against a set of test data whose labels are known. This test helps us evaluate how well the algorithm has learned. As a general practice, we split our data into training and test samples. A common choice is to use 70% of the data for training and the remaining 30% for testing.

The following piece of code splits the data into training and test sets.

In [ ]:

# train_data --> feature samples for training
# test_data  --> feature samples to evaluate / test
# train_labels --> class labels for the training data
# test_labels --> class labels for the test data

train_data,test_data,train_labels,test_labels = train_test_split(features,labels,test_size=0.3,random_state=0)

In [ ]:

# Let's have a look at the size of the train and test data

print('Train data has %d samples' % (train_data.shape[0]))
print('Test data has %d samples' % (test_data.shape[0]))
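
To make the idea of a random split concrete, here is a rough sketch of doing it by hand with NumPy: shuffle the row indices, then cut them into a test part and a training part. This is an illustration under simple assumptions, not a description of scikit-learn's internals, and it is not guaranteed to select the same rows as `train_test_split`. The names `X`, `y`, `rng`, `train_idx` and `test_idx` are made up for this example.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris['data'], iris['target']

rng = np.random.RandomState(0)       # fixed seed so the shuffle is repeatable
shuffled = rng.permutation(len(X))   # a random ordering of the row indices

n_test = int(round(0.3 * len(X)))    # 30% of the rows go to the test set
test_idx = shuffled[:n_test]
train_idx = shuffled[n_test:]

# Index features and labels with the same indices so they stay aligned
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

print('Manual split: %d training samples, %d test samples' % (len(X_train), len(X_test)))
```

The important point is that the features and the labels are indexed with the same shuffled indices, and that no row ends up in both sets.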

Training the machine

In this example we will train a simple machine learning algorithm called K-nearest neighbors (KNN) to distinguish the 3 classes in the data we have loaded.
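
To get a feel for what K-nearest neighbors does, here is a minimal hand-written sketch: measure the distance from a new sample to every training sample, pick the k closest, and let them vote. The function `knn_predict_one` is a hypothetical helper written for this illustration; it assumes Euclidean distance and equal votes, and it is not scikit-learn's actual implementation (details such as tie handling may differ).

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris['data'], iris['target']

def knn_predict_one(query, train_X, train_y, k=5):
    """Predict the class of one sample by majority vote among its k nearest neighbors."""
    # Euclidean distance from the query to every training row
    distances = np.sqrt(((train_X - query) ** 2).sum(axis=1))
    # Indices of the k closest training rows
    nearest = np.argsort(distances)[:k]
    # Count how often each label occurs among those neighbors
    votes = np.bincount(train_y[nearest])
    # The most common label wins (on a tie, the lower label is returned)
    return int(np.argmax(votes))

# Hold out one flower and classify it using all the others
query_index = 0
mask = np.arange(len(X)) != query_index
predicted = knn_predict_one(X[query_index], X[mask], y[mask], k=5)

print('Predicted class %d, true class %d' % (predicted, y[query_index]))
```

Note that there is no real "learning" step here: the method simply stores the training data and looks things up at prediction time.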

In [ ]:

mymodel = KNeighborsClassifier(n_neighbors=5)  # Create the classifier object and store it in the variable 'mymodel'

mymodel = mymodel.fit(train_data,train_labels) # Train the algorithm; fit() returns the trained model, which we keep in 'mymodel'

That's it! We have trained our first machine learning algorithm. Now let's test it.

Testing the algorithm

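Testing the algorithm is as simple as training it. To evaluate the performance we will use an evaluation metric called the 'precision score'. For a single class, precision is the fraction of samples predicted as that class that truly belong to it. With the micro-averaging used below (average='micro'), and every sample carrying exactly one label, this reduces to the overall fraction of correctly classified samples: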

$\mathrm{precision = \frac{Number \ of \ correctly \ classified \ samples}{Number \ of \ correctly \ classified \ samples \ + \ Number \ of \ incorrectly \ classified \ samples}}$

The higher this number, the better the performance of the machine learning algorithm. A high score on the test data suggests that the algorithm has learned the pattern well.
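
As a small worked example of the formula above, the sketch below computes the score by hand on six made-up labels and compares it with `precision_score` using `average='micro'`. The arrays `true_labels` and `predicted_labels` are invented purely for illustration and have nothing to do with the iris data.

```python
import numpy as np
from sklearn.metrics import precision_score

# Made-up labels for six samples, purely for illustration
true_labels = np.array([0, 0, 1, 1, 2, 2])
predicted_labels = np.array([0, 1, 1, 1, 2, 0])

# Count correctly and incorrectly classified samples
n_correct = np.sum(true_labels == predicted_labels)
n_incorrect = np.sum(true_labels != predicted_labels)

# The formula from the text: correct / (correct + incorrect)
manual_score = n_correct / float(n_correct + n_incorrect)

# The same quantity from scikit-learn; arguments are (true labels, predicted labels)
library_score = precision_score(true_labels, predicted_labels, average='micro')

print('By hand: %.4f, precision_score: %.4f' % (manual_score, library_score))
```

Here four of the six predictions match the true labels, so both numbers come out as 4/6.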

In [ ]:

# Test the performance of the algorithm on the test data which was generated through the splitting before.

predictions = mymodel.predict(test_data)

# Now we have the class labels predicted by the algorithm for each test samples in the variable 'predictions'

In [ ]:

# Time to check the precision score

score = precision_score(test_labels,predictions,average='micro') # Arguments are (true labels, predicted labels)

print('The precision score is %.2f%%' % (score*100))

As an exercise, change the values of the following parameters in the above code and check how they affect the precision score.

test_size=0.3 in train_test_split [Change it to values like 0.5, 0.2, etc.]
n_neighbors=5 in mymodel = KNeighborsClassifier(n_neighbors=5) [Change the value between 1 and 25]
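
If you would rather try several values in one go than edit the cells by hand, a loop like the one below is one possible approach. It is a sketch that assumes scikit-learn 0.18 or newer (for `sklearn.model_selection`); the values of `k` and the names `scores` and `predictions_k` are chosen only for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_score

iris = load_iris()

# Same split settings as in the notebook; change test_size here to experiment
X_train, X_test, y_train, y_test = train_test_split(
    iris['data'], iris['target'], test_size=0.3, random_state=0)

scores = {}
for k in [1, 5, 15, 25]:
    # Train a fresh classifier for each number of neighbors
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    predictions_k = model.predict(X_test)
    # Store the micro-averaged precision on the test set for this k
    scores[k] = precision_score(y_test, predictions_k, average='micro')
    print('n_neighbors=%2d -> precision %.2f%%' % (k, scores[k] * 100))
```

With a test set this small, differences of a few percent between settings can come from the particular split, so try a few values of `random_state` before drawing conclusions.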