We study a 'human activity' dataset with five classes (sitting-down, standing-up, standing, walking, and sitting), collected over eight hours of activities performed by four healthy subjects. The manner in which the subjects performed the five activities was quantified by accelerometers attached to parts of the body during the activity. The goal is to predict the manner in which an activity was performed. An 'outcome' variable in the training set labels the activity, and the features are the data recorded from the accelerometers. We will build a model and make predictions using these features.
Sensors were placed on the waist (1), left thigh (2), right ankle (3), and right arm (4); see Ref. [1] for details. The 12 features selected through this procedure were: (1) sensor on the waist: discretization of the module of the acceleration vector, variance of pitch, and variance of roll; (2) sensor on the left thigh: module of the acceleration vector, its discretization, and variance of pitch; (3) sensor on the right ankle: variance of pitch and variance of roll; (4) sensor on the right arm: discretization of the module of the acceleration vector; and, from all sensors: average acceleration and standard deviation of acceleration.
import numpy as np
from sklearn import cross_validation
from sklearn import svm
from matplotlib import pyplot as plt
from sklearn import preprocessing
from sklearn import metrics
%matplotlib inline
import pandas as pd
import matplotlib
from minepy import MINE
import copy
from mpl_toolkits.mplot3d import Axes3D
Import data set:
mydir = ""
filename = mydir+"dataset_har.csv"
#filename = 'test.csv'
conv = lambda valstr: float(valstr.replace(',','.'))
c = {3:conv, 4:conv, 5:conv}
col = [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
age = np.genfromtxt(filename, delimiter=";", skip_header=1, usecols=(2), dtype=int)
weight = np.genfromtxt(filename, delimiter=";", skip_header=1, usecols=(4), dtype=int)
data_height_bmi = np.genfromtxt(filename, delimiter=";", skip_header=1, usecols=(3,5), dtype=None, converters=c)
data = np.genfromtxt(filename, delimiter=";", skip_header=1, usecols=col, dtype=int)
data = data*1.0
target = np.genfromtxt(filename, delimiter=";", skip_header=1, usecols=18, dtype=str)
Preprocess data for use with machine learning algorithms:
rawdata = copy.copy(data)
data = preprocessing.scale(data)
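preprocessing.scale standardizes each feature column to zero mean and unit variance. A minimal NumPy sketch of the same transformation, on a small made-up array, shows what it does:

```python
import numpy as np

# Small made-up feature matrix: 4 samples, 2 features on very
# different scales.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0]])

# Standardize each column: subtract the column mean and divide by the
# (population) standard deviation -- the transformation that
# preprocessing.scale applies.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0))  # each column mean is ~0
print(X_scaled.std(axis=0))   # each column std is 1
```

This matters for the SVM used later: features on very different raw scales would otherwise dominate the kernel distance computation.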
def digitize(starget):
    """
    Map string output labels to numeric codes.
    The machine learning classifier only accepts numeric output values.
    """
    stringlabels = np.unique(starget)
    mydict = {}
    for dindex, dlabel in enumerate(stringlabels):
        mydict[dlabel] = dindex
    print mydict
    dtarget = copy.copy(starget)
    for i in stringlabels:
        myindex = np.where(starget == i)[0]
        dtarget[myindex] = mydict.get(i)
    return dtarget
The string output labels are converted to numeric codes as follows:
dtarget = digitize(target)
{'standing': 2, 'walking': 4, 'sittingdown': 1, 'standingup': 3, 'sitting': 0}
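The same alphabetical coding can be obtained in one line with np.unique, whose return_inverse option gives each element's index into the sorted unique labels (a sketch on a few label strings from this dataset):

```python
import numpy as np

# A few labels in the style of this dataset (not the real target array).
starget = np.array(['standing', 'walking', 'sitting', 'sittingdown',
                    'standingup', 'sitting'])

# np.unique sorts the labels; return_inverse gives, for every element,
# its position in that sorted list -- the same codes digitize() builds.
labels, codes = np.unique(starget, return_inverse=True)
print(labels)  # ['sitting' 'sittingdown' 'standing' 'standingup' 'walking']
print(codes)   # [2 4 0 1 3 0]
```

scikit-learn's preprocessing.LabelEncoder offers the same mapping as a fit/transform object.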
In the next few steps we create variables for use in data visualization and in the machine learning algorithms. We divide the data by 'class' (standing, walking, etc.) to facilitate visualization.
stand_index= np.where(target=='standing')
walk_index = np.where(target == 'walking')
sitdown_index = np.where(target == 'sittingdown')
standup_index = np.where(target == 'standingup')
sit_index = np.where(target == 'sitting')
height = data_height_bmi[:,0]
BMI = data_height_bmi[:,1]
x1 = data[:,0]
y1 = data[:,1]
z1 = data[:,2]
x2 = data[:,3]
y2 = data[:,4]
z2 = data[:,5]
x3 = data[:,6]
y3 = data[:,7]
z3 = data[:,8]
x4 = data[:,9]
y4 = data[:,10]
z4 = data[:,11]
x1r = rawdata[:,0]
y1r = rawdata[:,1]
z1r = rawdata[:,2]
x2r = rawdata[:,3]
y2r = rawdata[:,4]
z2r = rawdata[:,5]
x3r = rawdata[:,6]
y3r = rawdata[:,7]
z3r = rawdata[:,8]
x4r = rawdata[:,9]
y4r = rawdata[:,10]
z4r = rawdata[:,11]
In the following plot, the markers are colored by the activity 'class'.
position = (dtarget.astype(np.float))
plt.subplot(2,2,1)
plt.scatter(x1r,y1r,c=position, cmap=plt.cm.Paired, alpha=0.9)
plt.xlabel('x1')
plt.ylabel('y1')
plt.subplot(2,2,2)
plt.scatter(x4r,y4r,c=position, cmap=plt.cm.Paired,alpha=0.9)
plt.xlabel('x4')
plt.ylabel('y4')
plt.subplot(2,2,3)
plt.scatter(x2r,y2r,c=position, cmap=plt.cm.Paired,alpha=0.9)
plt.xlabel('x2')
plt.ylabel('y2')
plt.subplot(2,2,4)
plt.scatter(x3r,y3r,c=position, cmap=plt.cm.Paired,alpha=0.9)
plt.xlabel('x3')
plt.ylabel('y3')
plt.subplots_adjust(hspace=0.35,wspace=0.45)
vec1 = rawdata[:,0:3]
vec1_stand = vec1[stand_index]
vec1_walk = vec1[walk_index]
vec1_sitdown = vec1[sitdown_index]
vec1_standup = vec1[standup_index]
vec1_sit = vec1[sit_index]
vec2 = rawdata[:,3:6]
print np.squeeze(vec1_stand).shape
print vec1_stand.shape
(47370, 3)
(47370, 3)
fig = plt.figure()
ax = Axes3D(fig, elev=-150, azim=110)
ax.scatter(vec1_stand[:,0],vec1_stand[:,1],vec1_stand[:,2], marker='x',color='b',label='standing')
ax.scatter(vec1_sit[:,0],vec1_sit[:,1],vec1_sit[:,2], marker='o',color='r',label='sitting',alpha=0.5)
#ax.scatter(vec1_walk[:,0],vec1_walk[:,1],vec1_walk[:,2], marker='s',color='m',label='walking')
ax.scatter(vec1_standup[:,0],vec1_standup[:,1],vec1_standup[:,2], marker='>',color='g',label='stand up',alpha=0.9)
ax.set_title("Sitting, Standing, Walking")
ax.set_xlabel("x-axis")
#ax.w_xaxis.set_ticklabels([])
ax.set_ylabel("y-axis")
#ax.w_yaxis.set_ticklabels([])
ax.set_zlabel("z-axis")
#ax.w_zaxis.set_ticklabels([])
# mplot3d scatter labels are unreliable in this matplotlib version,
# so build the legend from 2D proxy artists instead.
colors = ['b','r','g']
markers = ['x','o','>']
scatter1_proxy = matplotlib.lines.Line2D([0],[0], linestyle="none", c=colors[0], marker = markers[0])
scatter2_proxy = matplotlib.lines.Line2D([0],[0], linestyle="none", c=colors[1], marker = markers[1])
scatter3_proxy = matplotlib.lines.Line2D([0],[0], linestyle="none", c=colors[2], marker = markers[2])
ax.legend([scatter1_proxy, scatter2_proxy,scatter3_proxy], ['standing', 'sitting','standup'], numpoints = 1)
plt.show()
position = (dtarget.astype(np.float))
plt.subplot(2,2,1)
plt.scatter(vec1[:,0],vec1[:,1],c=position, cmap=plt.cm.Paired, alpha=0.5)
plt.xlabel('x1')
plt.ylabel('y1')
plt.subplot(2,2,2)
plt.scatter(vec1_stand[:,0],vec1_stand[:,1],alpha=0.5)
plt.xlabel('x1_stand')
plt.ylabel('y1_stand')
plt.subplot(2,2,3)
plt.scatter(vec1_sit[:,0],vec1_sit[:,1],alpha=0.5)
plt.xlabel('x1_sit')
plt.ylabel('y1_sit')
plt.subplot(2,2,4)
plt.scatter(vec1_walk[:,0],vec1_walk[:,1],alpha=0.5)
plt.xlabel('x1_walk')
plt.ylabel('y1_walk')
plt.subplots_adjust(hspace=0.35,wspace=0.45)
Using the cross_validation module we randomly split the dataset into training and test sets (a 60-40 split). We then build a support vector machine model, make predictions, and estimate accuracy with k-fold cross-validation (k=5). We repeat the accuracy estimation with the metrics module, which simply compares the actual results to the predicted results. Note how the evaluation metric changes depending on the method used to assess the performance of the model.
X_train, X_test, y_train, y_test = cross_validation.train_test_split(data, dtarget, test_size=0.4, random_state=0)
clf = svm.SVC(kernel='rbf',C=10)
clf.fit(X_train, y_train)
prediction = (clf.predict(X_test))
cvscore = cross_validation.cross_val_score(clf, X_test, y_test, scoring='accuracy',cv=5)
print "Accuracy is: ", np.mean(cvscore)
Accuracy is: 0.98264261516
my_accuracy = metrics.accuracy_score(y_test,prediction)
print my_accuracy
0.98516315996
The confusion matrix is quite useful for comparing the actual activity to the prediction made by the model as a function of activity class. Rows are the actual activity and columns are the predicted activity. The confusion matrix is illustrated nicely in a plot a few lines down.
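As a toy illustration with made-up labels (not this dataset's), the matrix counts how often each true class i was predicted as class j:

```python
import numpy as np

# Made-up true and predicted labels for a 3-class problem.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

n = 3  # number of classes
cm = np.zeros((n, n), dtype=int)
for t, p in zip(y_true, y_pred):
    cm[t, p] += 1  # row = actual class, column = predicted class

print(cm)
# [[1 1 0]
#  [0 2 0]
#  [1 0 1]]

# Overall accuracy is the trace (correct predictions, on the diagonal)
# divided by the total count: 4 of 6 correct here.
accuracy = cm.trace() / cm.sum()
print(accuracy)
```

sklearn.metrics.confusion_matrix, used below, performs exactly this counting.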
from sklearn.metrics import confusion_matrix
cm=confusion_matrix(y_test,prediction)
print cm
target_label = ('sitting','sittingdown','standing','standingup','walking')
# {'standing': 2, 'walking': 4, 'sittingdown': 1, 'standingup': 3, 'sitting': 0}
[[20201     7     0     9     2]
 [    7  4506    47    56    38]
 [    0     9 18914    23    57]
 [   29   110    99  4659    32]
 [    0    81   324    53 16991]]
def plot_confusion_matrix(cm, title='Confusion matrix', cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(target_label))
    plt.xticks(tick_marks, target_label, rotation=45)
    plt.yticks(tick_marks, target_label)
    #plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
# Compute confusion matrix
cm = confusion_matrix(y_test, prediction)
np.set_printoptions(precision=2)
print('Confusion matrix, without normalization')
print(cm)
plt.figure()
plot_confusion_matrix(cm)
# Normalize the confusion matrix by row (i.e by the number of samples
# in each class)
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
print('Normalized confusion matrix')
print(cm_normalized)
plt.figure()
plot_confusion_matrix(cm_normalized, title='Normalized confusion matrix')
Confusion matrix, without normalization
[[20201     7     0     9     2]
 [    7  4506    47    56    38]
 [    0     9 18914    23    57]
 [   29   110    99  4659    32]
 [    0    81   324    53 16991]]
Normalized confusion matrix
[[ 1.    0.    0.    0.    0.  ]
 [ 0.    0.97  0.01  0.01  0.01]
 [ 0.    0.    1.    0.    0.  ]
 [ 0.01  0.02  0.02  0.95  0.01]
 [ 0.    0.    0.02  0.    0.97]]
I initially used a support vector machine with a linear kernel to analyze the above data, but this yielded a rather poor accuracy. The radial basis function kernel used above was more successful. It is useful to explore a few models in an effort to optimize classification accuracy. The random forest classifier is a popular model and will be examined below.
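The gap between the linear and RBF kernels can be reproduced on a small synthetic problem. A sketch using scikit-learn's make_circles (two concentric rings, which no straight line can separate; the sample sizes and noise level are illustrative choices):

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: a class boundary that is not linearly separable.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# Same C as the model above; only the kernel differs.
linear_acc = SVC(kernel='linear', C=10).fit(X, y).score(X, y)
rbf_acc = SVC(kernel='rbf', C=10).fit(X, y).score(X, y)

print(linear_acc)  # roughly 0.5: a hyperplane cannot split the rings
print(rbf_acc)     # near 1.0: the RBF kernel captures the curved boundary
```

The RBF kernel implicitly maps the data into a space where such curved boundaries become separable, which is why it also outperforms the linear kernel on the accelerometer features.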
from sklearn import ensemble
rfc = ensemble.RandomForestClassifier(n_estimators=100)
rfc.fit(X_train,y_train)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0, warm_start=False)
Using the random forest classifier we make a prediction and analyze its performance with a confusion matrix and the metrics.accuracy_score function.
rfc_prediction = rfc.predict(X_test)
rfc_cm = confusion_matrix(y_test,rfc_prediction)
print target_label
print rfc_cm
('sitting', 'sittingdown', 'standing', 'standingup', 'walking')
[[20204     7     0     8     0]
 [    1  4555    10    52    36]
 [    0     0 18923     5    75]
 [    5    79    39  4772    34]
 [    0    15    36    14 17384]]
rf_accuracy = metrics.accuracy_score(y_test,rfc_prediction)
print rf_accuracy
0.993721133818
We have successfully classified human activity data from sensors placed on the human body using machine learning techniques.
The data used in this example may be found at http://groupware.les.inf.puc-rio.br/har; the analysis was inspired by Ref. [1].
[1] Ugulino, W.; Cardador, D.; Vega, K.; Velloso, E.; Milidiu, R.; Fuks, H. Wearable Computing: Accelerometers' Data Classification of Body Postures and Movements. Proceedings of the 21st Brazilian Symposium on Artificial Intelligence, Advances in Artificial Intelligence - SBIA 2012, Lecture Notes in Computer Science, pp. 52-61. Curitiba, PR: Springer Berlin / Heidelberg, 2012. ISBN 978-3-642-34458-9. DOI: 10.1007/978-3-642-34459-6_6.