We study a 'human activity' dataset with five classes (sitting-down, standing-up, standing, walking, and sitting), collected over eight hours of activities performed by four healthy subjects. The manner in which the subjects performed the five activities was quantified by accelerometers attached to parts of the body during the activity. The goal is to predict the manner in which an activity was performed. An 'outcome' variable in the training set labels the activity, and the features are the data recorded from the accelerometers. We will build a model and make predictions using these features.
Sensors were placed on the waist (1), left thigh (2), right ankle (3), and right arm (4); see Ref. [1] for details. The 12 features selected through this procedure were: (1) sensor on the waist: discretization of the module of the acceleration vector, variance of pitch, and variance of roll; (2) sensor on the left thigh: module of the acceleration vector, its discretization, and variance of pitch; (3) sensor on the right ankle: variance of pitch and variance of roll; (4) sensor on the right arm: discretization of the module of the acceleration vector; and, from all sensors: average acceleration and standard deviation of acceleration.
import numpy as np
from sklearn import cross_validation
from sklearn import svm
from matplotlib import pyplot as plt
from sklearn import preprocessing
from sklearn import metrics
%matplotlib inline
import pandas as pd
import matplotlib
from minepy import MINE
import copy
from mpl_toolkits.mplot3d import Axes3D
Import data set:
mydir = ""
filename = mydir+"dataset_har.csv"
#filename = 'test.csv'
conv = lambda valstr: float(valstr.replace(',','.'))
c = {3:conv, 4:conv, 5:conv}
col = [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
age = np.genfromtxt(filename, delimiter=";", skip_header=1, usecols=(2), dtype=int)
weight = np.genfromtxt(filename, delimiter=";", skip_header=1, usecols=(4), dtype=int)
data_height_bmi = np.genfromtxt(filename, delimiter=";", skip_header=1, usecols=(3,5), dtype=None, converters=c)
data = np.genfromtxt(filename, delimiter=";", skip_header=1, usecols=col, dtype=int)
data = data*1.0
target = np.genfromtxt(filename, delimiter=";", skip_header=1, usecols=18, dtype=str)
Preprocess data for use with machine learning algorithms:
rawdata = copy.copy(data)
data = preprocessing.scale(data)
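preprocessing.scale standardizes each feature column to zero mean and unit variance. A minimal NumPy sketch of the same transformation, on a small made-up array, shows what it does:

```python
import numpy as np

# Small made-up feature matrix: 4 samples, 2 features on very
# different scales.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0]])

# Standardize each column: subtract the column mean and divide by the
# (population) standard deviation -- the transformation that
# preprocessing.scale applies.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0))  # each column mean is ~0
print(X_scaled.std(axis=0))   # each column std is 1
```

This matters for the SVM used later: features on very different raw scales would otherwise dominate the kernel distance computation.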
def digitize(starget):
    """
    Map string output labels to numeric codes.
    The machine learning classifier only accepts numeric output values.
    """
    stringlabels = np.unique(starget)
    mydict = {}
    for dindex, dlabel in enumerate(stringlabels):
        mydict[dlabel] = dindex
    print mydict
    dtarget = copy.copy(starget)
    for i in stringlabels:
        myindex = np.where(starget == i)[0]
        dtarget[myindex] = mydict.get(i)
    return dtarget
The string output labels are converted to numeric codes as follows:
dtarget = digitize(target)
{'standing': 2, 'walking': 4, 'sittingdown': 1, 'standingup': 3, 'sitting': 0}
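The same alphabetical coding can be obtained in one line with np.unique, whose return_inverse option gives each element's index into the sorted unique labels (a sketch on a few label strings from this dataset):

```python
import numpy as np

# A few labels in the style of this dataset (not the real target array).
starget = np.array(['standing', 'walking', 'sitting', 'sittingdown',
                    'standingup', 'sitting'])

# np.unique sorts the labels; return_inverse gives, for every element,
# its position in that sorted list -- the same codes digitize() builds.
labels, codes = np.unique(starget, return_inverse=True)
print(labels)  # ['sitting' 'sittingdown' 'standing' 'standingup' 'walking']
print(codes)   # [2 4 0 1 3 0]
```

scikit-learn's preprocessing.LabelEncoder offers the same mapping as a fit/transform object.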
In the next few steps we create variables for use in data visualization and in the machine learning algorithms. We divide the data by 'class' (standing, walking, etc.) to facilitate visualization.
stand_index= np.where(target=='standing')
walk_index = np.where(target == 'walking')
sitdown_index = np.where(target == 'sittingdown')
standup_index = np.where(target == 'standingup')
sit_index = np.where(target == 'sitting')
height = data_height_bmi[:,0]
BMI = data_height_bmi[:,1]
x1 = data[:,0]
y1 = data[:,1]
z1 = data[:,2]
x2 = data[:,3]
y2 = data[:,4]
z2 = data[:,5]
x3 = data[:,6]
y3 = data[:,7]
z3 = data[:,8]
x4 = data[:,9]
y4 = data[:,10]
z4 = data[:,11]
x1r = rawdata[:,0]
y1r = rawdata[:,1]
z1r = rawdata[:,2]
x2r = rawdata[:,3]
y2r = rawdata[:,4]
z2r = rawdata[:,5]
x3r = rawdata[:,6]
y3r = rawdata[:,7]
z3r = rawdata[:,8]
x4r = rawdata[:,9]
y4r = rawdata[:,10]
z4r = rawdata[:,11]
In the following plot, the markers are colored by the activity 'class'.
position = (dtarget.astype(np.float))
plt.subplot(2,2,1)
plt.scatter(x1r,y1r,c=position, cmap=plt.cm.Paired, alpha=0.9)
plt.xlabel('x1')
plt.ylabel('y1')
plt.subplot(2,2,2)
plt.scatter(x4r,y4r,c=position, cmap=plt.cm.Paired,alpha=0.9)
plt.xlabel('x4')
plt.ylabel('y4')
plt.subplot(2,2,3)
plt.scatter(x2r,y2r,c=position, cmap=plt.cm.Paired,alpha=0.9)
plt.xlabel('x2')
plt.ylabel('y2')
plt.subplot(2,2,4)
plt.scatter(x3r,y3r,c=position, cmap=plt.cm.Paired,alpha=0.9)
plt.xlabel('x3')
plt.ylabel('y3')
plt.subplots_adjust(hspace=0.35,wspace=0.45)
vec1 = rawdata[:,0:3]
vec1_stand = vec1[stand_index]
vec1_walk = vec1[walk_index]
vec1_sitdown = vec1[sitdown_index]
vec1_standup = vec1[standup_index]
vec1_sit = vec1[sit_index]
vec2 = rawdata[:,3:6]
print np.squeeze(vec1_stand).shape
print vec1_stand.shape
(47370, 3)
(47370, 3)
fig = plt.figure()
ax = Axes3D(fig, elev=-150, azim=110)
ax.scatter(vec1_stand[:,0],vec1_stand[:,1],vec1_stand[:,2], marker='x',color='b',label='standing')
ax.scatter(vec1_sit[:,0],vec1_sit[:,1],vec1_sit[:,2], marker='o',color='r',label='sitting',alpha=0.5)
#ax.scatter(vec1_walk[:,0],vec1_walk[:,1],vec1_walk[:,2], marker='s',color='m',label='walking')
ax.scatter(vec1_standup[:,0],vec1_standup[:,1],vec1_standup[:,2], marker='>',color='g',label='stand up',alpha=0.9)
ax.set_title("Sitting, Standing, Walking")
ax.set_xlabel("x-axis")
#ax.w_xaxis.set_ticklabels([])
ax.set_ylabel("y-axis")
#ax.w_yaxis.set_ticklabels([])
ax.set_zlabel("z-axis")
#ax.w_zaxis.set_ticklabels([])
# mplot3d scatter labels are unreliable in this matplotlib version,
# so build the legend from 2D proxy artists instead.
colors = ['b','r','g']
markers = ['x','o','>']
scatter1_proxy = matplotlib.lines.Line2D([0],[0], linestyle="none", c=colors[0], marker = markers[0])
scatter2_proxy = matplotlib.lines.Line2D([0],[0], linestyle="none", c=colors[1], marker = markers[1])
scatter3_proxy = matplotlib.lines.Line2D([0],[0], linestyle="none", c=colors[2], marker = markers[2])
ax.legend([scatter1_proxy, scatter2_proxy,scatter3_proxy], ['standing', 'sitting','standup'], numpoints = 1)
plt.show()
position = (dtarget.astype(np.float))
plt.subplot(2,2,1)
plt.scatter(vec1[:,0],vec1[:,1],c=position, cmap=plt.cm.Paired, alpha=0.5)
plt.xlabel('x1')
plt.ylabel('y1')
plt.subplot(2,2,2)
plt.scatter(vec1_stand[:,0],vec1_stand[:,1],alpha=0.5)
plt.xlabel('x1_stand')
plt.ylabel('y1_stand')
plt.subplot(2,2,3)
plt.scatter(vec1_sit[:,0],vec1_sit[:,1],alpha=0.5)
plt.xlabel('x1_sit')
plt.ylabel('y1_sit')
plt.subplot(2,2,4)
plt.scatter(vec1_walk[:,0],vec1_walk[:,1],alpha=0.5)
plt.xlabel('x1_walk')
plt.ylabel('y1_walk')
plt.subplots_adjust(hspace=0.35,wspace=0.45)
Using the cross_validation module we randomly split the dataset into training and test sets (a 60-40 split). We then build a support vector machine model, make predictions, and estimate accuracy with k-fold cross-validation (k=5). We repeat the accuracy estimation with the metrics module, which simply compares the actual results to the predicted results. Note how the evaluation metric changes depending on the method used to assess the performance of the model.
X_train, X_test, y_train, y_test = cross_validation.train_test_split(data, dtarget, test_size=0.4, random_state=0)
clf = svm.SVC(kernel='rbf',C=10)
clf.fit(X_train, y_train)
prediction = (clf.predict(X_test))
cvscore = cross_validation.cross_val_score(clf, X_test, y_test, scoring='accuracy',cv=5)
print "Accuracy is: ", np.mean(cvscore)
Accuracy is: 0.98264261516
my_accuracy = metrics.accuracy_score(y_test,prediction)
print my_accuracy
0.98516315996
The confusion matrix is quite useful for comparing the actual activity to the prediction made by the model as a function of activity class. Rows are the actual activity and columns are the predicted activity. The confusion matrix is illustrated nicely in a plot a few lines down.
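As a toy illustration with made-up labels (not this dataset's), the matrix counts how often each true class i was predicted as class j:

```python
import numpy as np

# Made-up true and predicted labels for a 3-class problem.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

n = 3  # number of classes
cm = np.zeros((n, n), dtype=int)
for t, p in zip(y_true, y_pred):
    cm[t, p] += 1  # row = actual class, column = predicted class

print(cm)
# [[1 1 0]
#  [0 2 0]
#  [1 0 1]]

# Overall accuracy is the trace (correct predictions, on the diagonal)
# divided by the total count: 4 of 6 correct here.
accuracy = cm.trace() / cm.sum()
print(accuracy)
```

sklearn.metrics.confusion_matrix, used below, performs exactly this counting.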
from sklearn.metrics import confusion_matrix
cm=confusion_matrix(y_test,prediction)
print cm
target_label = ('sitting','sittingdown','standing','standingup','walking')
# {'standing': 2, 'walking': 4, 'sittingdown': 1, 'standingup': 3, 'sitting': 0}
[[20201     7     0     9     2]
 [    7  4506    47    56    38]
 [    0     9 18914    23    57]
 [   29   110    99  4659    32]
 [    0    81   324    53 16991]]
def plot_confusion_matrix(cm, title='Confusion matrix', cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(target_label))
    plt.xticks(tick_marks, target_label, rotation=45)
    plt.yticks(tick_marks, target_label)
    #plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
# Compute confusion matrix
cm = confusion_matrix(y_test, prediction)
np.set_printoptions(precision=2)
print('Confusion matrix, without normalization')
print(cm)
plt.figure()
plot_confusion_matrix(cm)
# Normalize the confusion matrix by row (i.e by the number of samples
# in each class)
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
print('Normalized confusion matrix')
print(cm_normalized)
plt.figure()
plot_confusion_matrix(cm_normalized, title='Normalized confusion matrix')
Confusion matrix, without normalization
[[20201     7     0     9     2]
 [    7  4506    47    56    38]
 [    0     9 18914    23    57]
 [   29   110    99  4659    32]
 [    0    81   324    53 16991]]
Normalized confusion matrix
[[ 1.    0.    0.    0.    0.  ]
 [ 0.    0.97  0.01  0.01  0.01]
 [ 0.    0.    1.    0.    0.  ]
 [ 0.01  0.02  0.02  0.95  0.01]
 [ 0.    0.    0.02  0.    0.97]]
I initially used a support vector machine with a linear kernel to analyze the above data, but this yielded a rather poor accuracy. The radial basis function kernel used above was more successful. It is useful to explore a few models in an effort to optimize classification accuracy. The random forest classifier is a popular model and will be examined below.
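The gap between the linear and RBF kernels can be reproduced on a small synthetic problem. A sketch using scikit-learn's make_circles (two concentric rings, which no straight line can separate; the sample sizes and noise level are illustrative choices):

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: a class boundary that is not linearly separable.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# Same C as the model above; only the kernel differs.
linear_acc = SVC(kernel='linear', C=10).fit(X, y).score(X, y)
rbf_acc = SVC(kernel='rbf', C=10).fit(X, y).score(X, y)

print(linear_acc)  # roughly 0.5: a hyperplane cannot split the rings
print(rbf_acc)     # near 1.0: the RBF kernel captures the curved boundary
```

The RBF kernel implicitly maps the data into a space where such curved boundaries become separable, which is why it also outperforms the linear kernel on the accelerometer features.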
from sklearn import ensemble
rfc = ensemble.RandomForestClassifier(n_estimators=100)
rfc.fit(X_train,y_train)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0, warm_start=False)
Using the random forest classifier we make a prediction and analyze its performance with a confusion matrix and the metrics.accuracy_score function.
rfc_prediction = rfc.predict(X_test)
rfc_cm = confusion_matrix(y_test,rfc_prediction)
print target_label
print rfc_cm
('sitting', 'sittingdown', 'standing', 'standingup', 'walking')
[[20204     7     0     8     0]
 [    1  4555    10    52    36]
 [    0     0 18923     5    75]
 [    5    79    39  4772    34]
 [    0    15    36    14 17384]]
rf_accuracy = metrics.accuracy_score(y_test,rfc_prediction)
print rf_accuracy
0.993721133818
We have successfully classified human activity data from sensors placed on the human body using machine learning techniques.
The data used in this example may be found at http://groupware.les.inf.puc-rio.br/har; the analysis was inspired by Ref. [1].
[1] Ugulino, W.; Cardador, D.; Vega, K.; Velloso, E.; Milidiu, R.; Fuks, H. Wearable Computing: Accelerometers' Data Classification of Body Postures and Movements. Proceedings of the 21st Brazilian Symposium on Artificial Intelligence, Advances in Artificial Intelligence - SBIA 2012, Lecture Notes in Computer Science, pp. 52-61. Curitiba, PR: Springer Berlin / Heidelberg, 2012. ISBN 978-3-642-34458-9. DOI: 10.1007/978-3-642-34459-6_6.