import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier # random forest
from sklearn.svm import SVC # support vector machine classifier
from sklearn.grid_search import GridSearchCV # hyperparameter grid search to find best model parameters
from sklearn import preprocessing # preprocess string labels into numerics
Get the data:
samsungData = pd.read_csv('/Users/erriza/Dropbox/Public/samsungData.csv')
samsungData['dxy'] = (samsungData['tGravityAcc-mean()-X'] - samsungData['tGravityAcc-mean()-Y'])
samsungData['dxz'] = (samsungData['tGravityAcc-mean()-X'] - samsungData['tGravityAcc-mean()-Z'])
samsungData['dyz'] = (samsungData['tGravityAcc-mean()-Y'] - samsungData['tGravityAcc-mean()-Z'])
samsungData.ix[:,-6::].head(10)
angle(Z,gravityMean) | subject | activity | dxy | dxz | dyz | |
---|---|---|---|---|---|---|
1 | -0.058627 | 1 | standing | 1.104236 | 0.848021 | -0.256215 |
2 | -0.054317 | 1 | standing | 1.108112 | 0.857182 | -0.250930 |
3 | -0.049118 | 1 | standing | 1.108888 | 0.864994 | -0.243894 |
4 | -0.047663 | 1 | standing | 1.111592 | 0.867765 | -0.243827 |
5 | -0.043892 | 1 | standing | 1.116975 | 0.873739 | -0.243236 |
6 | -0.042126 | 1 | standing | 1.116158 | 0.876038 | -0.240120 |
7 | -0.043010 | 1 | standing | 1.112212 | 0.874785 | -0.237427 |
8 | -0.041976 | 1 | standing | 1.115197 | 0.876783 | -0.238414 |
9 | -0.037364 | 1 | standing | 1.122855 | 0.883363 | -0.239492 |
10 | -0.034417 | 1 | standing | 1.124720 | 0.887544 | -0.237176 |
samsungData
<class 'pandas.core.frame.DataFrame'> Int64Index: 7352 entries, 1 to 7352 Columns: 566 entries, tBodyAcc-mean()-X to dyz dtypes: float64(564), int64(1), object(1)
samsungData.columns[:10]
Index([tBodyAcc-mean()-X, tBodyAcc-mean()-Y, tBodyAcc-mean()-Z, tBodyAcc-std()-X, tBodyAcc-std()-Y, tBodyAcc-std()-Z, tBodyAcc-mad()-X, tBodyAcc-mad()-Y, tBodyAcc-mad()-Z, tBodyAcc-max()-X], dtype=object)
Check if there is any missing value:
pd.isnull(samsungData).any().sum()
/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/frame.py:3576: FutureWarning: rename with inplace=True will return None from pandas 0.11 onward " from pandas 0.11 onward", FutureWarning)
0
Look at the subject IDs
np.unique(samsungData['subject'])
1 1 348 3 689 5 991 6 1316 7 1624 8 1905 11 2221 14 2544 15 2872 16 3238 17 3606 19 3966 21 4374 22 4695 23 5067 25 5476 26 5868 27 6244 28 6626 29 6970 30 Name: subject
How many subjects are there?
len(np.unique(samsungData['subject']))
21
How many activities are there?
np.unique(samsungData['activity'])
52 laying 28 sitting 1 standing 79 walk 126 walkdown 151 walkup Name: activity
Split the data into training and test dataset
# take the label column name
labels = 'activity'
# take the subject column name
subject = 'subject'
# the the features column name (:= {all columns} - {subject, labels})
features = filter(lambda x: x not in list([labels, subject]), samsungData.columns)
# encode labels as integer
le = preprocessing.LabelEncoder()
le.fit(samsungData[labels])
test_subjects = [27, 28, 29, 30] # what is required: [27, 28, 29, 30]
train_subjects = filter(lambda x: x not in test_subjects, np.unique(samsungData['subject'])) # what is required: [1, 3, 5, 6]
# take training and test set from requested subjects
train = samsungData[samsungData['subject'].isin(train_subjects)]
test = samsungData[samsungData['subject'].isin(test_subjects)]
# look for distribution of feature values
plot(samsungData[features].mean(), 'o', markersize=3);
# training data
X = train[features]
y = le.transform(train[labels])
# test data
Xt = test[features]
yt = le.transform(test[labels])
First we try training a Random Forest classifier and see how it performs. Pay attention to the line with GridSearchCV call; there's really a lot of things going on there: it accepts the all hyperparameter combinations in the tuned_parameters variable, it splits the data into training and cross-validation sets, and does it three times to do a 3-fold cross-validation.. phew..
# train random forest classifier
# set the parameters by cross-validation
tuned_parameters = [{'max_features': ['sqrt', 'log2'], 'n_estimators': [100, 200, 500]}]
rf = GridSearchCV( RandomForestClassifier(min_samples_split=1, compute_importances=False, n_jobs=-1), tuned_parameters, cv=3, verbose=2 ).fit(X, y)
[GridSearchCV] max_features=sqrt, n_estimators=100 ............................. [GridSearchCV] .................... max_features=sqrt, n_estimators=100 - 28.9s [GridSearchCV] max_features=sqrt, n_estimators=100 .............................
[Parallel(n_jobs=1)]: Done 1 jobs | elapsed: 28.9s
[GridSearchCV] .................... max_features=sqrt, n_estimators=100 - 26.4s [GridSearchCV] max_features=sqrt, n_estimators=100 ............................. [GridSearchCV] .................... max_features=sqrt, n_estimators=100 - 26.0s [GridSearchCV] max_features=sqrt, n_estimators=200 ............................. [GridSearchCV] .................... max_features=sqrt, n_estimators=200 - 55.1s [GridSearchCV] max_features=sqrt, n_estimators=200 ............................. [GridSearchCV] .................... max_features=sqrt, n_estimators=200 - 55.5s [GridSearchCV] max_features=sqrt, n_estimators=200 ............................. [GridSearchCV] .................... max_features=sqrt, n_estimators=200 - 50.0s [GridSearchCV] max_features=sqrt, n_estimators=500 ............................. [GridSearchCV] .................... max_features=sqrt, n_estimators=500 - 2.1min [GridSearchCV] max_features=sqrt, n_estimators=500 ............................. [GridSearchCV] .................... max_features=sqrt, n_estimators=500 - 2.3min [GridSearchCV] max_features=sqrt, n_estimators=500 ............................. [GridSearchCV] .................... max_features=sqrt, n_estimators=500 - 2.1min [GridSearchCV] max_features=log2, n_estimators=100 ............................. [GridSearchCV] .................... max_features=log2, n_estimators=100 - 27.3s [GridSearchCV] max_features=log2, n_estimators=100 ............................. [GridSearchCV] .................... max_features=log2, n_estimators=100 - 26.7s [GridSearchCV] max_features=log2, n_estimators=100 ............................. [GridSearchCV] .................... max_features=log2, n_estimators=100 - 27.9s [GridSearchCV] max_features=log2, n_estimators=200 ............................. [GridSearchCV] .................... max_features=log2, n_estimators=200 - 52.6s [GridSearchCV] max_features=log2, n_estimators=200 ............................. [GridSearchCV] .................... max_features=log2, n_estimators=200 - 56.2s [GridSearchCV] max_features=log2, n_estimators=200 ............................. [GridSearchCV] .................... max_features=log2, n_estimators=200 - 51.8s [GridSearchCV] max_features=log2, n_estimators=500 ............................. [GridSearchCV] .................... max_features=log2, n_estimators=500 - 2.3min [GridSearchCV] max_features=log2, n_estimators=500 ............................. [GridSearchCV] .................... max_features=log2, n_estimators=500 - 2.2min [GridSearchCV] max_features=log2, n_estimators=500 ............................. [GridSearchCV] .................... max_features=log2, n_estimators=500 - 2.3min
[Parallel(n_jobs=1)]: Done 18 out of 18 | elapsed: 21.4min finished
print 'Best parameters set found on development set:'
print rf.best_estimator_
Best parameters set found on development set: RandomForestClassifier(bootstrap=True, compute_importances=True, criterion=gini, max_depth=None, max_features=log2, min_density=0.1, min_samples_leaf=1, min_samples_split=1, n_estimators=500, n_jobs=-1, oob_score=False, random_state=None, verbose=0)
Now let's see how it performs on the test set. Note the classifier performance is computed as classification accuracy:
where NT is the number of samples in the training set, ˆyi is the classifier prediction, and yi is the label of the test sample. 1(⋅) is an indicator variable that takes the value of one if the expression in the paranthesis evaluates to True, and zero if it evaluates to False.
pred = rf.predict(Xt)
print 'prediction accuracy: %.4f' % (1 - (1. / len(yt) * sum( pred != yt )))
prediction accuracy: 0.9630
96.57% of prediction accuracy. Not bad, but let's see if other models can do better...
But first, let's get some feature importance values for the report.
rfclf = RandomForestClassifier(max_features='log2', n_estimators=500, min_samples_split=1, compute_importances=True, n_jobs=-1).fit(X, y)
rfclf.feature_importances_.shape
(564,)
numfeat = 11
indices = np.argsort(rfclf.feature_importances_)[::-1][:numfeat]
bar(xrange(numfeat), rfclf.feature_importances_[indices], align='center', alpha=.5)
xticks(xrange(numfeat), X.columns[indices], rotation='vertical', fontsize=12)
xlim([-1, numfeat])
ylabel('Feature importances', fontsize=12)
suptitle('Feature importances computed by Random Forest', fontsize=16)
f = gcf()
subplots_adjust(bottom=0.3)
f.set_size_inches(11, 8);
f.savefig('feature_importance.png', dpi=150);
Let's see how a Support Vector Machine classifier performs. Again, pay attention to the GridSearchCV line...
# train the SVM classifier
# set the parameters by cross-validation
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [0.1, 1e-2, 1e-3], 'C': [10, 100, 1000]},
{'kernel': ['linear'], 'C': [10, 100, 1000]}]
svm1 = GridSearchCV( SVC(), tuned_parameters, cv=3, verbose=2 ).fit(X, y)
[GridSearchCV] kernel=rbf, C=10, gamma=0.1 ..................................... [GridSearchCV] ............................ kernel=rbf, C=10, gamma=0.1 - 37.1s [GridSearchCV] kernel=rbf, C=10, gamma=0.1 ..................................... [GridSearchCV] ............................ kernel=rbf, C=10, gamma=0.1 - 31.1s [GridSearchCV] kernel=rbf, C=10, gamma=0.1 ..................................... [GridSearchCV] ............................ kernel=rbf, C=10, gamma=0.1 - 29.4s [GridSearchCV] kernel=rbf, C=10, gamma=0.01 .................................... [GridSearchCV] ........................... kernel=rbf, C=10, gamma=0.01 - 6.2s [GridSearchCV] kernel=rbf, C=10, gamma=0.01 .................................... [GridSearchCV] ........................... kernel=rbf, C=10, gamma=0.01 - 6.3s [GridSearchCV] kernel=rbf, C=10, gamma=0.01 .................................... [GridSearchCV] ........................... kernel=rbf, C=10, gamma=0.01 - 6.4s [GridSearchCV] kernel=rbf, C=10, gamma=0.001 ................................... [GridSearchCV] .......................... kernel=rbf, C=10, gamma=0.001 - 7.8s [GridSearchCV] kernel=rbf, C=10, gamma=0.001 ................................... [GridSearchCV] .......................... kernel=rbf, C=10, gamma=0.001 - 7.9s [GridSearchCV] kernel=rbf, C=10, gamma=0.001 ................................... [GridSearchCV] .......................... kernel=rbf, C=10, gamma=0.001 - 7.9s [GridSearchCV] kernel=rbf, C=100, gamma=0.1 .................................... [GridSearchCV] ........................... kernel=rbf, C=100, gamma=0.1 - 28.9s [GridSearchCV] kernel=rbf, C=100, gamma=0.1 .................................... [GridSearchCV] ........................... kernel=rbf, C=100, gamma=0.1 - 28.9s [GridSearchCV] kernel=rbf, C=100, gamma=0.1 .................................... [GridSearchCV] ........................... kernel=rbf, C=100, gamma=0.1 - 29.4s [GridSearchCV] kernel=rbf, C=100, gamma=0.01 ................................... [GridSearchCV] .......................... kernel=rbf, C=100, gamma=0.01 - 5.9s [GridSearchCV] kernel=rbf, C=100, gamma=0.01 ................................... [GridSearchCV] .......................... kernel=rbf, C=100, gamma=0.01 - 6.0s [GridSearchCV] kernel=rbf, C=100, gamma=0.01 ................................... [GridSearchCV] .......................... kernel=rbf, C=100, gamma=0.01 - 5.9s [GridSearchCV] kernel=rbf, C=100, gamma=0.001 .................................. [GridSearchCV] ......................... kernel=rbf, C=100, gamma=0.001 - 4.7s [GridSearchCV] kernel=rbf, C=100, gamma=0.001 .................................. [GridSearchCV] ......................... kernel=rbf, C=100, gamma=0.001 - 4.7s [GridSearchCV] kernel=rbf, C=100, gamma=0.001 .................................. [GridSearchCV] ......................... kernel=rbf, C=100, gamma=0.001 - 4.8s [GridSearchCV] kernel=rbf, C=1000, gamma=0.1 ................................... [GridSearchCV] .......................... kernel=rbf, C=1000, gamma=0.1 - 29.0s [GridSearchCV] kernel=rbf, C=1000, gamma=0.1 ................................... [GridSearchCV] .......................... kernel=rbf, C=1000, gamma=0.1 - 30.7s [GridSearchCV] kernel=rbf, C=1000, gamma=0.1 ................................... [GridSearchCV] .......................... kernel=rbf, C=1000, gamma=0.1 - 29.3s [GridSearchCV] kernel=rbf, C=1000, gamma=0.01 .................................. [GridSearchCV] ......................... kernel=rbf, C=1000, gamma=0.01 - 5.9s [GridSearchCV] kernel=rbf, C=1000, gamma=0.01 .................................. [GridSearchCV] ......................... kernel=rbf, C=1000, gamma=0.01 - 5.9s [GridSearchCV] kernel=rbf, C=1000, gamma=0.01 .................................. [GridSearchCV] ......................... kernel=rbf, C=1000, gamma=0.01 - 5.9s [GridSearchCV] kernel=rbf, C=1000, gamma=0.001 ................................. [GridSearchCV] ........................ kernel=rbf, C=1000, gamma=0.001 - 4.0s [GridSearchCV] kernel=rbf, C=1000, gamma=0.001 ................................. [GridSearchCV] ........................ kernel=rbf, C=1000, gamma=0.001 - 4.2s [GridSearchCV] kernel=rbf, C=1000, gamma=0.001 ................................. [GridSearchCV] ........................ kernel=rbf, C=1000, gamma=0.001 - 4.2s [GridSearchCV] kernel=linear, C=10 ............................................. [GridSearchCV] .................................... kernel=linear, C=10 - 3.8s [GridSearchCV] kernel=linear, C=10 ............................................. [GridSearchCV] .................................... kernel=linear, C=10 - 3.9s [GridSearchCV] kernel=linear, C=10 ............................................. [GridSearchCV] .................................... kernel=linear, C=10 - 5.1s [GridSearchCV] kernel=linear, C=100 ............................................ [GridSearchCV] ................................... kernel=linear, C=100 - 3.9s [GridSearchCV] kernel=linear, C=100 ............................................ [GridSearchCV] ................................... kernel=linear, C=100 - 4.1s [GridSearchCV] kernel=linear, C=100 ............................................ [GridSearchCV] ................................... kernel=linear, C=100 - 4.1s [GridSearchCV] kernel=linear, C=1000 ........................................... [GridSearchCV] .................................. kernel=linear, C=1000 - 4.4s [GridSearchCV] kernel=linear, C=1000 ........................................... [GridSearchCV] .................................. kernel=linear, C=1000 - 4.2s [GridSearchCV] kernel=linear, C=1000 ........................................... [GridSearchCV] .................................. kernel=linear, C=1000 - 4.1s
[Parallel(n_jobs=1)]: Done 1 jobs | elapsed: 37.1s [Parallel(n_jobs=1)]: Done 36 out of 36 | elapsed: 6.9min finished
print 'Best parameters set found on development set:'
print svm1.best_estimator_
Best parameters set found on development set: SVC(C=1000, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.01, kernel=rbf, max_iter=-1, probability=False, shrinking=True, tol=0.001, verbose=False)
# prediction on test set
pred = svm1.predict(Xt)
print 'prediction accuracy: %.4f' % (1 - (1. / len(yt) * sum( pred != yt )))
prediction accuracy: 0.9737
So 96.16% of prediction accuracy. Again not bad, but could be better. Let's see how an ensemble of models would perform...
In the following we'll combine seven classifiers, two Random Forests and five SVMs. The choice of seven is as we have six types of activities and we want to do a majority voting to get the final prediction. By the pigeonhole principle, we need at least seven prediction outputs to guarantee that there will always be a majority, i.e. an activity class with more than one vote.
Also, for each of the classifier type, we use the original training dataset for one of the models, and bootstrapped data for the others.
First, train two Random Forests:
repeat = 2
classifiers = []
# train random forest classifiers
for i in xrange(repeat):
# generate bootstrap data
if i == 0:
tix = np.random.choice(train.index, size=len(train), replace=False)
else:
tix = np.random.choice(train.index, size=len(train), replace=True)
this_train = train.ix[tix,:]
# training data
X = this_train[features]
y = le.transform(this_train[labels])
# set the parameters by cross-validation
tuned_parameters = [{'n_estimators': [100, 200, 500]}]
clf = GridSearchCV( RandomForestClassifier(n_jobs=-1, min_samples_split=1), tuned_parameters, cv=3, verbose=1 ).fit(X, y)
classifiers.append(clf)
[Parallel(n_jobs=1)]: Done 1 jobs | elapsed: 35.4s [Parallel(n_jobs=1)]: Done 9 out of 9 | elapsed: 14.3min finished [Parallel(n_jobs=1)]: Done 1 jobs | elapsed: 32.8s [Parallel(n_jobs=1)]: Done 9 out of 9 | elapsed: 11.9min finished
Then train five SVMs:
repeat = 7
classifiers = []
# train SVM classifiers
for i in xrange(repeat):
# generate bootstrap data
if i == 0:
tix = np.random.choice(train.index, size=len(train), replace=False)
else:
tix = np.random.choice(train.index, size=len(train), replace=True)
this_train = train.ix[tix,:]
# training data
X = this_train[features]
y = le.transform(this_train[labels])
# set the parameters by cross-validation
tuned_parameters = [{'gamma': [.1, 1e-2, 1e-3, 1e-4], 'C': [10., 100., 1000.]}]
clf = GridSearchCV( SVC(), tuned_parameters, cv=3, verbose=1 ).fit(X, y)
classifiers.append(clf)
[Parallel(n_jobs=1)]: Done 1 jobs | elapsed: 33.5s [Parallel(n_jobs=1)]: Done 36 out of 36 | elapsed: 7.9min finished [Parallel(n_jobs=1)]: Done 1 jobs | elapsed: 25.7s [Parallel(n_jobs=1)]: Done 36 out of 36 | elapsed: 7.0min finished [Parallel(n_jobs=1)]: Done 1 jobs | elapsed: 25.3s [Parallel(n_jobs=1)]: Done 36 out of 36 | elapsed: 6.9min finished [Parallel(n_jobs=1)]: Done 1 jobs | elapsed: 25.8s [Parallel(n_jobs=1)]: Done 36 out of 36 | elapsed: 6.9min finished [Parallel(n_jobs=1)]: Done 1 jobs | elapsed: 25.5s [Parallel(n_jobs=1)]: Done 36 out of 36 | elapsed: 6.9min finished [Parallel(n_jobs=1)]: Done 1 jobs | elapsed: 25.7s [Parallel(n_jobs=1)]: Done 36 out of 36 | elapsed: 6.9min finished [Parallel(n_jobs=1)]: Done 1 jobs | elapsed: 25.9s [Parallel(n_jobs=1)]: Done 36 out of 36 | elapsed: 7.0min finished
Concatenate the all seven classifiers and perform prediction on the test set:
preds = np.zeros((len(classifiers), len(yt)), dtype=int)
i = 0
for clf in classifiers:
preds[i] = clf.predict(Xt)
i += 1
preds.shape
(7, 1485)
The total/final prediction is obtained as the majority of all the seven predictions:
total_pred = [np.bincount(x).argmax() for x in preds.T]
print 'prediction accuracy: %.4f' % (1 - (1. / len(yt) * sum( total_pred != yt )))
prediction accuracy: 0.9717
Here we get a prediction accuracy of 97.91%, which is better than the individual Random Forest and SVM prediction.
As a final check, let's see where our classifier fails to predict correctly:
tt = np.array(total_pred)
zip(tt[tt != yt], yt[tt != yt])
[(1, 2), (1, 2), (1, 2), (1, 2), (1, 2), (2, 1), (2, 1), (2, 1), (2, 1), (2, 1), (2, 1), (2, 1), (2, 1), (2, 1), (2, 1), (2, 1), (2, 1), (2, 1), (2, 1), (2, 1), (2, 1), (2, 1), (2, 1), (2, 1), (2, 1), (2, 1), (2, 1), (2, 1), (2, 1), (2, 1), (1, 2), (1, 2), (1, 2), (1, 2), (1, 2), (1, 2), (1, 2), (1, 2), (1, 2), (1, 2), (2, 1), (2, 1)]
As it turns out, all misclassifications happen for activities 1 (SITTING) and 2 (STANDING). This looks like a systematic misclassification. Further improvement can be achieved if we could search for features (or derivation/combination/transformation of them) that are able to cleanly separate these two activities.
For completeness, let's see if other classifier types can do any better:
naive bayes
# naive bayes
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB().fit(X, y)
yt = gnb.predict(Xt)
print 'prediction accuracy: %.4f' % (1 - (1. / len(yt) * sum( pred != yt )))
prediction accuracy: 0.7421
k-nearest neighbour
# kNN
from sklearn import neighbors
tuned_parameters = [{'n_neighbors' : [5, 10, 15, 20], 'weights' : ['uniform', 'distance']}]
knn = GridSearchCV( neighbors.KNeighborsClassifier(n_neighbors=5), tuned_parameters, cv=3, n_jobs=-1 ).fit(X, y)
print 'Best parameters set found on development set:'
print knn.best_estimator_
Best parameters set found on development set: KNeighborsClassifier(algorithm=auto, leaf_size=30, n_neighbors=5, p=2, warn_on_equidistant=True, weights=distance)
# prediction on test set
pred = knn.predict(Xt)
print 'prediction accuracy: %.4f' % (1 - (1. / len(yt) * sum( pred != yt )))
prediction accuracy: 0.7480
We see that naive bayes and kNN classifiers don't do any better in this case.