Coursera Data Analysis -- Assignment 2¶

In [1]:

import pandas as pd
import numpy as np

from sklearn.ensemble import RandomForestClassifier # random forest
from sklearn.svm import SVC # support vector machine classifier
from sklearn.grid_search import GridSearchCV # hyperparameter grid search to find best model parameters
from sklearn import preprocessing # preprocess string labels into numerics

Get the data:

In [2]:

samsungData = pd.read_csv('/Users/erriza/Dropbox/Public/samsungData.csv')

In [3]:

samsungData['dxy'] = (samsungData['tGravityAcc-mean()-X'] - samsungData['tGravityAcc-mean()-Y'])
samsungData['dxz'] = (samsungData['tGravityAcc-mean()-X'] - samsungData['tGravityAcc-mean()-Z'])
samsungData['dyz'] = (samsungData['tGravityAcc-mean()-Y'] - samsungData['tGravityAcc-mean()-Z'])

In [4]:

samsungData.ix[:,-6::].head(10)

Out[4]:

	angle(Z,gravityMean)	subject	activity	dxy	dxz	dyz
1	-0.058627	1	standing	1.104236	0.848021	-0.256215
2	-0.054317	1	standing	1.108112	0.857182	-0.250930
3	-0.049118	1	standing	1.108888	0.864994	-0.243894
4	-0.047663	1	standing	1.111592	0.867765	-0.243827
5	-0.043892	1	standing	1.116975	0.873739	-0.243236
6	-0.042126	1	standing	1.116158	0.876038	-0.240120
7	-0.043010	1	standing	1.112212	0.874785	-0.237427
8	-0.041976	1	standing	1.115197	0.876783	-0.238414
9	-0.037364	1	standing	1.122855	0.883363	-0.239492
10	-0.034417	1	standing	1.124720	0.887544	-0.237176

In [5]:

samsungData

Out[5]:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7352 entries, 1 to 7352
Columns: 566 entries, tBodyAcc-mean()-X to dyz
dtypes: float64(564), int64(1), object(1)

In [6]:

samsungData.columns[:10]

Out[6]:

Index([tBodyAcc-mean()-X, tBodyAcc-mean()-Y, tBodyAcc-mean()-Z, tBodyAcc-std()-X, tBodyAcc-std()-Y, tBodyAcc-std()-Z, tBodyAcc-mad()-X, tBodyAcc-mad()-Y, tBodyAcc-mad()-Z, tBodyAcc-max()-X], dtype=object)

Check if there is any missing value:

In [7]:

pd.isnull(samsungData).any().sum()

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/frame.py:3576: FutureWarning: rename with inplace=True  will return None from pandas 0.11 onward
  " from pandas 0.11 onward", FutureWarning)

Out[7]:

Look at the subject IDs

In [8]:

np.unique(samsungData['subject'])

Out[8]:

1        1
348      3
689      5
991      6
1316     7
1624     8
1905    11
2221    14
2544    15
2872    16
3238    17
3606    19
3966    21
4374    22
4695    23
5067    25
5476    26
5868    27
6244    28
6626    29
6970    30
Name: subject

How many subjects are there?

In [9]:

len(np.unique(samsungData['subject']))

Out[9]:

How many activities are there?

In [10]:

np.unique(samsungData['activity'])

Out[10]:

52       laying
28      sitting
1      standing
79         walk
126    walkdown
151      walkup
Name: activity

Split the data into training and test dataset

In [11]:

# take the label column name
labels = 'activity'
# take the subject column name
subject = 'subject'
# the the features column name (:= {all columns} - {subject, labels})
features = filter(lambda x: x not in list([labels, subject]), samsungData.columns)

# encode labels as integer
le = preprocessing.LabelEncoder()
le.fit(samsungData[labels])

test_subjects = [27, 28, 29, 30] # what is required: [27, 28, 29, 30]
train_subjects = filter(lambda x: x not in test_subjects, np.unique(samsungData['subject'])) # what is required: [1, 3, 5, 6]

# take training and test set from requested subjects
train = samsungData[samsungData['subject'].isin(train_subjects)]
test = samsungData[samsungData['subject'].isin(test_subjects)]

In [12]:

# look for distribution of feature values
plot(samsungData[features].mean(), 'o', markersize=3);

In [13]:

# training data
X = train[features]
y = le.transform(train[labels])

# test data
Xt = test[features]
yt = le.transform(test[labels])

Random Forest¶

First we try training a Random Forest classifier and see how it performs. Pay attention to the line with GridSearchCV call; there's really a lot of things going on there: it accepts the all hyperparameter combinations in the tuned_parameters variable, it splits the data into training and cross-validation sets, and does it three times to do a 3-fold cross-validation.. phew..

In [14]:

# train random forest classifier
# set the parameters by cross-validation
tuned_parameters = [{'max_features': ['sqrt', 'log2'], 'n_estimators': [100, 200, 500]}]
rf = GridSearchCV( RandomForestClassifier(min_samples_split=1, compute_importances=False, n_jobs=-1), tuned_parameters, cv=3, verbose=2 ).fit(X, y)

[GridSearchCV] max_features=sqrt, n_estimators=100 .............................
[GridSearchCV] .................... max_features=sqrt, n_estimators=100 -  28.9s
[GridSearchCV] max_features=sqrt, n_estimators=100 .............................

[Parallel(n_jobs=1)]: Done   1 jobs       | elapsed:   28.9s

[GridSearchCV] .................... max_features=sqrt, n_estimators=100 -  26.4s
[GridSearchCV] max_features=sqrt, n_estimators=100 .............................
[GridSearchCV] .................... max_features=sqrt, n_estimators=100 -  26.0s
[GridSearchCV] max_features=sqrt, n_estimators=200 .............................
[GridSearchCV] .................... max_features=sqrt, n_estimators=200 -  55.1s
[GridSearchCV] max_features=sqrt, n_estimators=200 .............................
[GridSearchCV] .................... max_features=sqrt, n_estimators=200 -  55.5s
[GridSearchCV] max_features=sqrt, n_estimators=200 .............................
[GridSearchCV] .................... max_features=sqrt, n_estimators=200 -  50.0s
[GridSearchCV] max_features=sqrt, n_estimators=500 .............................
[GridSearchCV] .................... max_features=sqrt, n_estimators=500 - 2.1min
[GridSearchCV] max_features=sqrt, n_estimators=500 .............................
[GridSearchCV] .................... max_features=sqrt, n_estimators=500 - 2.3min
[GridSearchCV] max_features=sqrt, n_estimators=500 .............................
[GridSearchCV] .................... max_features=sqrt, n_estimators=500 - 2.1min
[GridSearchCV] max_features=log2, n_estimators=100 .............................
[GridSearchCV] .................... max_features=log2, n_estimators=100 -  27.3s
[GridSearchCV] max_features=log2, n_estimators=100 .............................
[GridSearchCV] .................... max_features=log2, n_estimators=100 -  26.7s
[GridSearchCV] max_features=log2, n_estimators=100 .............................
[GridSearchCV] .................... max_features=log2, n_estimators=100 -  27.9s
[GridSearchCV] max_features=log2, n_estimators=200 .............................
[GridSearchCV] .................... max_features=log2, n_estimators=200 -  52.6s
[GridSearchCV] max_features=log2, n_estimators=200 .............................
[GridSearchCV] .................... max_features=log2, n_estimators=200 -  56.2s
[GridSearchCV] max_features=log2, n_estimators=200 .............................
[GridSearchCV] .................... max_features=log2, n_estimators=200 -  51.8s
[GridSearchCV] max_features=log2, n_estimators=500 .............................
[GridSearchCV] .................... max_features=log2, n_estimators=500 - 2.3min
[GridSearchCV] max_features=log2, n_estimators=500 .............................
[GridSearchCV] .................... max_features=log2, n_estimators=500 - 2.2min
[GridSearchCV] max_features=log2, n_estimators=500 .............................
[GridSearchCV] .................... max_features=log2, n_estimators=500 - 2.3min

[Parallel(n_jobs=1)]: Done  18 out of  18 | elapsed: 21.4min finished

In [15]:

print 'Best parameters set found on development set:'
print rf.best_estimator_

Best parameters set found on development set:
RandomForestClassifier(bootstrap=True, compute_importances=True,
            criterion=gini, max_depth=None, max_features=log2,
            min_density=0.1, min_samples_leaf=1, min_samples_split=1,
            n_estimators=500, n_jobs=-1, oob_score=False,
            random_state=None, verbose=0)

Now let's see how it performs on the test set. Note the classifier performance is computed as classification accuracy:

$A = \frac{1}{N_T}\sum_{i=1}^{N_T}1(\hat{y}_i = y_i)$

where $N_T$ is the number of samples in the training set, $\hat{y}_i$ is the classifier prediction, and $y_i$ is the label of the test sample. $1(\cdot)$ is an indicator variable that takes the value of one if the expression in the paranthesis evaluates to True, and zero if it evaluates to False.

In [16]:

pred = rf.predict(Xt)
print 'prediction accuracy: %.4f' % (1 - (1. / len(yt) * sum( pred != yt )))

prediction accuracy: 0.9630

96.57% of prediction accuracy. Not bad, but let's see if other models can do better...

But first, let's get some feature importance values for the report.

In [17]:

rfclf = RandomForestClassifier(max_features='log2', n_estimators=500, min_samples_split=1, compute_importances=True, n_jobs=-1).fit(X, y)

In [22]:

rfclf.feature_importances_.shape

Out[22]:

(564,)

In [44]:

numfeat = 11
indices = np.argsort(rfclf.feature_importances_)[::-1][:numfeat]
bar(xrange(numfeat), rfclf.feature_importances_[indices], align='center', alpha=.5)
xticks(xrange(numfeat), X.columns[indices], rotation='vertical', fontsize=12)
xlim([-1, numfeat])

ylabel('Feature importances', fontsize=12)

suptitle('Feature importances computed by Random Forest', fontsize=16)

f = gcf()
subplots_adjust(bottom=0.3)
f.set_size_inches(11, 8);
f.savefig('feature_importance.png', dpi=150);

SVM¶

Let's see how a Support Vector Machine classifier performs. Again, pay attention to the GridSearchCV line...

In [24]:

# train the SVM classifier
# set the parameters by cross-validation
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [0.1, 1e-2, 1e-3], 'C': [10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [10, 100, 1000]}]

svm1 = GridSearchCV( SVC(), tuned_parameters, cv=3, verbose=2 ).fit(X, y)

[GridSearchCV] kernel=rbf, C=10, gamma=0.1 .....................................
[GridSearchCV] ............................ kernel=rbf, C=10, gamma=0.1 -  37.1s
[GridSearchCV] kernel=rbf, C=10, gamma=0.1 .....................................
[GridSearchCV] ............................ kernel=rbf, C=10, gamma=0.1 -  31.1s
[GridSearchCV] kernel=rbf, C=10, gamma=0.1 .....................................
[GridSearchCV] ............................ kernel=rbf, C=10, gamma=0.1 -  29.4s
[GridSearchCV] kernel=rbf, C=10, gamma=0.01 ....................................
[GridSearchCV] ........................... kernel=rbf, C=10, gamma=0.01 -   6.2s
[GridSearchCV] kernel=rbf, C=10, gamma=0.01 ....................................
[GridSearchCV] ........................... kernel=rbf, C=10, gamma=0.01 -   6.3s
[GridSearchCV] kernel=rbf, C=10, gamma=0.01 ....................................
[GridSearchCV] ........................... kernel=rbf, C=10, gamma=0.01 -   6.4s
[GridSearchCV] kernel=rbf, C=10, gamma=0.001 ...................................
[GridSearchCV] .......................... kernel=rbf, C=10, gamma=0.001 -   7.8s
[GridSearchCV] kernel=rbf, C=10, gamma=0.001 ...................................
[GridSearchCV] .......................... kernel=rbf, C=10, gamma=0.001 -   7.9s
[GridSearchCV] kernel=rbf, C=10, gamma=0.001 ...................................
[GridSearchCV] .......................... kernel=rbf, C=10, gamma=0.001 -   7.9s
[GridSearchCV] kernel=rbf, C=100, gamma=0.1 ....................................
[GridSearchCV] ........................... kernel=rbf, C=100, gamma=0.1 -  28.9s
[GridSearchCV] kernel=rbf, C=100, gamma=0.1 ....................................
[GridSearchCV] ........................... kernel=rbf, C=100, gamma=0.1 -  28.9s
[GridSearchCV] kernel=rbf, C=100, gamma=0.1 ....................................
[GridSearchCV] ........................... kernel=rbf, C=100, gamma=0.1 -  29.4s
[GridSearchCV] kernel=rbf, C=100, gamma=0.01 ...................................
[GridSearchCV] .......................... kernel=rbf, C=100, gamma=0.01 -   5.9s
[GridSearchCV] kernel=rbf, C=100, gamma=0.01 ...................................
[GridSearchCV] .......................... kernel=rbf, C=100, gamma=0.01 -   6.0s
[GridSearchCV] kernel=rbf, C=100, gamma=0.01 ...................................
[GridSearchCV] .......................... kernel=rbf, C=100, gamma=0.01 -   5.9s
[GridSearchCV] kernel=rbf, C=100, gamma=0.001 ..................................
[GridSearchCV] ......................... kernel=rbf, C=100, gamma=0.001 -   4.7s
[GridSearchCV] kernel=rbf, C=100, gamma=0.001 ..................................
[GridSearchCV] ......................... kernel=rbf, C=100, gamma=0.001 -   4.7s
[GridSearchCV] kernel=rbf, C=100, gamma=0.001 ..................................
[GridSearchCV] ......................... kernel=rbf, C=100, gamma=0.001 -   4.8s
[GridSearchCV] kernel=rbf, C=1000, gamma=0.1 ...................................
[GridSearchCV] .......................... kernel=rbf, C=1000, gamma=0.1 -  29.0s
[GridSearchCV] kernel=rbf, C=1000, gamma=0.1 ...................................
[GridSearchCV] .......................... kernel=rbf, C=1000, gamma=0.1 -  30.7s
[GridSearchCV] kernel=rbf, C=1000, gamma=0.1 ...................................
[GridSearchCV] .......................... kernel=rbf, C=1000, gamma=0.1 -  29.3s
[GridSearchCV] kernel=rbf, C=1000, gamma=0.01 ..................................
[GridSearchCV] ......................... kernel=rbf, C=1000, gamma=0.01 -   5.9s
[GridSearchCV] kernel=rbf, C=1000, gamma=0.01 ..................................
[GridSearchCV] ......................... kernel=rbf, C=1000, gamma=0.01 -   5.9s
[GridSearchCV] kernel=rbf, C=1000, gamma=0.01 ..................................
[GridSearchCV] ......................... kernel=rbf, C=1000, gamma=0.01 -   5.9s
[GridSearchCV] kernel=rbf, C=1000, gamma=0.001 .................................
[GridSearchCV] ........................ kernel=rbf, C=1000, gamma=0.001 -   4.0s
[GridSearchCV] kernel=rbf, C=1000, gamma=0.001 .................................
[GridSearchCV] ........................ kernel=rbf, C=1000, gamma=0.001 -   4.2s
[GridSearchCV] kernel=rbf, C=1000, gamma=0.001 .................................
[GridSearchCV] ........................ kernel=rbf, C=1000, gamma=0.001 -   4.2s
[GridSearchCV] kernel=linear, C=10 .............................................
[GridSearchCV] .................................... kernel=linear, C=10 -   3.8s
[GridSearchCV] kernel=linear, C=10 .............................................
[GridSearchCV] .................................... kernel=linear, C=10 -   3.9s
[GridSearchCV] kernel=linear, C=10 .............................................
[GridSearchCV] .................................... kernel=linear, C=10 -   5.1s
[GridSearchCV] kernel=linear, C=100 ............................................
[GridSearchCV] ................................... kernel=linear, C=100 -   3.9s
[GridSearchCV] kernel=linear, C=100 ............................................
[GridSearchCV] ................................... kernel=linear, C=100 -   4.1s
[GridSearchCV] kernel=linear, C=100 ............................................
[GridSearchCV] ................................... kernel=linear, C=100 -   4.1s
[GridSearchCV] kernel=linear, C=1000 ...........................................
[GridSearchCV] .................................. kernel=linear, C=1000 -   4.4s
[GridSearchCV] kernel=linear, C=1000 ...........................................
[GridSearchCV] .................................. kernel=linear, C=1000 -   4.2s
[GridSearchCV] kernel=linear, C=1000 ...........................................
[GridSearchCV] .................................. kernel=linear, C=1000 -   4.1s

[Parallel(n_jobs=1)]: Done   1 jobs       | elapsed:   37.1s
[Parallel(n_jobs=1)]: Done  36 out of  36 | elapsed:  6.9min finished

In [25]:

print 'Best parameters set found on development set:'
print svm1.best_estimator_

Best parameters set found on development set:
SVC(C=1000, cache_size=200, class_weight=None, coef0=0.0, degree=3,
  gamma=0.01, kernel=rbf, max_iter=-1, probability=False, shrinking=True,
  tol=0.001, verbose=False)

In [27]:

# prediction on test set
pred = svm1.predict(Xt)
print 'prediction accuracy: %.4f' % (1 - (1. / len(yt) * sum( pred != yt )))

prediction accuracy: 0.9737

So 96.16% of prediction accuracy. Again not bad, but could be better. Let's see how an ensemble of models would perform...

Combined models each trained on bootstrap data¶

In the following we'll combine seven classifiers, two Random Forests and five SVMs. The choice of seven is as we have six types of activities and we want to do a majority voting to get the final prediction. By the pigeonhole principle, we need at least seven prediction outputs to guarantee that there will always be a majority, i.e. an activity class with more than one vote.

Also, for each of the classifier type, we use the original training dataset for one of the models, and bootstrapped data for the others.

First, train two Random Forests:

In [35]:

repeat = 2

classifiers = []

# train random forest classifiers
for i in xrange(repeat):
    # generate bootstrap data
    if i == 0:
        tix = np.random.choice(train.index, size=len(train), replace=False)
    else:
        tix = np.random.choice(train.index, size=len(train), replace=True)
    this_train = train.ix[tix,:]
    
    # training data
    X = this_train[features]
    y = le.transform(this_train[labels])
    
    # set the parameters by cross-validation
    tuned_parameters = [{'n_estimators': [100, 200, 500]}]
    clf = GridSearchCV( RandomForestClassifier(n_jobs=-1, min_samples_split=1), tuned_parameters, cv=3, verbose=1 ).fit(X, y)
    
    classifiers.append(clf)

[Parallel(n_jobs=1)]: Done   1 jobs       | elapsed:   35.4s
[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed: 14.3min finished
[Parallel(n_jobs=1)]: Done   1 jobs       | elapsed:   32.8s
[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed: 11.9min finished

Then train five SVMs:

In [39]:

repeat = 7

classifiers = []

# train SVM classifiers
for i in xrange(repeat):
    # generate bootstrap data
    if i == 0:
        tix = np.random.choice(train.index, size=len(train), replace=False)
    else:
        tix = np.random.choice(train.index, size=len(train), replace=True)
    this_train = train.ix[tix,:]
    
    # training data
    X = this_train[features]
    y = le.transform(this_train[labels])
    
    # set the parameters by cross-validation
    tuned_parameters = [{'gamma': [.1, 1e-2, 1e-3, 1e-4], 'C': [10., 100., 1000.]}]
    clf = GridSearchCV( SVC(), tuned_parameters, cv=3, verbose=1 ).fit(X, y)
    
    classifiers.append(clf)

[Parallel(n_jobs=1)]: Done   1 jobs       | elapsed:   33.5s
[Parallel(n_jobs=1)]: Done  36 out of  36 | elapsed:  7.9min finished
[Parallel(n_jobs=1)]: Done   1 jobs       | elapsed:   25.7s
[Parallel(n_jobs=1)]: Done  36 out of  36 | elapsed:  7.0min finished
[Parallel(n_jobs=1)]: Done   1 jobs       | elapsed:   25.3s
[Parallel(n_jobs=1)]: Done  36 out of  36 | elapsed:  6.9min finished
[Parallel(n_jobs=1)]: Done   1 jobs       | elapsed:   25.8s
[Parallel(n_jobs=1)]: Done  36 out of  36 | elapsed:  6.9min finished
[Parallel(n_jobs=1)]: Done   1 jobs       | elapsed:   25.5s
[Parallel(n_jobs=1)]: Done  36 out of  36 | elapsed:  6.9min finished
[Parallel(n_jobs=1)]: Done   1 jobs       | elapsed:   25.7s
[Parallel(n_jobs=1)]: Done  36 out of  36 | elapsed:  6.9min finished
[Parallel(n_jobs=1)]: Done   1 jobs       | elapsed:   25.9s
[Parallel(n_jobs=1)]: Done  36 out of  36 | elapsed:  7.0min finished

Concatenate the all seven classifiers and perform prediction on the test set:

In [40]:

preds = np.zeros((len(classifiers), len(yt)), dtype=int)

i = 0
for clf in classifiers:
    preds[i] = clf.predict(Xt)
    i += 1

In [41]:

preds.shape

Out[41]:

(7, 1485)

The total/final prediction is obtained as the majority of all the seven predictions:

In [42]:

total_pred = [np.bincount(x).argmax() for x in preds.T]
print 'prediction accuracy: %.4f' % (1 - (1. / len(yt) * sum( total_pred != yt )))

prediction accuracy: 0.9717

Here we get a prediction accuracy of 97.91%, which is better than the individual Random Forest and SVM prediction.

As a final check, let's see where our classifier fails to predict correctly:

In [43]:

tt = np.array(total_pred)

zip(tt[tt != yt], yt[tt != yt])

Out[43]:

[(1, 2),
 (1, 2),
 (1, 2),
 (1, 2),
 (1, 2),
 (2, 1),
 (2, 1),
 (2, 1),
 (2, 1),
 (2, 1),
 (2, 1),
 (2, 1),
 (2, 1),
 (2, 1),
 (2, 1),
 (2, 1),
 (2, 1),
 (2, 1),
 (2, 1),
 (2, 1),
 (2, 1),
 (2, 1),
 (2, 1),
 (2, 1),
 (2, 1),
 (2, 1),
 (2, 1),
 (2, 1),
 (2, 1),
 (2, 1),
 (1, 2),
 (1, 2),
 (1, 2),
 (1, 2),
 (1, 2),
 (1, 2),
 (1, 2),
 (1, 2),
 (1, 2),
 (1, 2),
 (2, 1),
 (2, 1)]

As it turns out, all misclassifications happen for activities 1 (SITTING) and 2 (STANDING). This looks like a systematic misclassification. Further improvement can be achieved if we could search for features (or derivation/combination/transformation of them) that are able to cleanly separate these two activities.

For completeness, let's see if other classifier types can do any better:

naive bayes

In [91]:

# naive bayes
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB().fit(X, y)
yt = gnb.predict(Xt)
print 'prediction accuracy: %.4f' % (1 - (1. / len(yt) * sum( pred != yt )))

prediction accuracy: 0.7421

k-nearest neighbour

In [93]:

# kNN
from sklearn import neighbors

tuned_parameters = [{'n_neighbors' : [5, 10, 15, 20], 'weights' : ['uniform', 'distance']}]
knn = GridSearchCV( neighbors.KNeighborsClassifier(n_neighbors=5), tuned_parameters, cv=3, n_jobs=-1 ).fit(X, y)

In [94]:

print 'Best parameters set found on development set:'
print knn.best_estimator_

Best parameters set found on development set:
KNeighborsClassifier(algorithm=auto, leaf_size=30, n_neighbors=5, p=2,
           warn_on_equidistant=True, weights=distance)

In [95]:

# prediction on test set
pred = knn.predict(Xt)
print 'prediction accuracy: %.4f' % (1 - (1. / len(yt) * sum( pred != yt )))

prediction accuracy: 0.7480

We see that naive bayes and kNN classifiers don't do any better in this case.