author: @amanqa
source: github.com/amanahuja/change-detection-tutorial
# Modified Stylesheet for notebook.
from IPython.core.display import HTML
def css_styling():
    styles = open("custom.css", "r").read()
    return HTML(styles)
css_styling()
Part 06: EKG Data
import matplotlib.pyplot as plt
import numpy as np
np.set_printoptions(precision=2)
from IPython.display import Image
import os, sys
The EKG / ECG is the measurement of electrical activity in the heart: the electrical polarizations and depolarizations of the cardiac tissue.
When translated to waveform, the signal is composed of various components, labeled P Q R S T U. Three of the major features are the P-Wave, the R-Wave, and the T-Wave.
Image('https://upload.wikimedia.org/wikipedia/commons/thumb/3/34/EKG_Complex_en.svg/300px-EKG_Complex_en.svg.png')
# via Wikipedia
The P wave is the first positive hump, representing the contraction of the atria (right and left).
The R wave is the large positive spike wrapped in the "QRS complex". It represents the contraction of the ventricles, which, being larger than the atria, require a larger polarization. The Q and S are negative polarizations to either side of the R.
The T wave is the refractory bump, as the polarization of the ventricles returns to normal.
Specialists and diagnosticians look at the EKG to help determine if a heart is healthy and behaving normally.
The QRS complex in the waveform stands out and looks quite important, but in fact, the timing, shape, and size of the T-wave often carry more useful diagnostic information. The intervals between the features, such as the PR interval and QT interval shown in the diagram above, can also be diagnostically important. It may also be important to look at polarity: the PR segment and the ST segment are both of neutral polarization in a healthy heart.
This analysis requires experience and training.
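To get a rough numeric feel for these intervals, here is a minimal sketch. The peak sample indices and the 100 samples-per-second rate are hypothetical illustration values, not measurements from our data, and the intervals are measured peak-to-peak for simplicity (clinically, PR and QT are measured from wave onsets and ends).

```python
# Hypothetical peak sample indices for one heartbeat, at an assumed
# sampling rate of 100 samples per second (illustration values only).
samp_freq = 100.0
peaks = {'P': 10, 'Q': 26, 'R': 30, 'S': 34, 'T': 58}

def interval_ms(start_wave, end_wave, peaks, samp_freq):
    """Time between two waveform features, in milliseconds."""
    return (peaks[end_wave] - peaks[start_wave]) / samp_freq * 1000.0

pr_interval = interval_ms('P', 'R', peaks, samp_freq)  # peak-to-peak approximation
qt_interval = interval_ms('Q', 'T', peaks, samp_freq)
```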
Our goal is to create an automated anomaly detection system by "training" the system on EKG data of normal healthy hearts, and recognizing deviations from this normal.
There are many approaches to this problem. For the purposes of this tutorial, we will use Ted Dunning's approach, described in his talk at the Strata Conference. We'll measure our success, and discuss limitations and potential improvements.
We'll get our data from Physionet's CinC Challenge.
A sample of this data is included in this tutorial's repository on GitHub, under the '/data' folder.
Input file description:
This file is 6.1 MB in size and contains several hours of recorded EKG data from a patient in a sleep apnea study. It contains about 3.2 million samples, of which we use the first 2,000,000 for training.
http://physionet.org/physiobank/database/apnea-ecg/
Data from Physionet
To read in this data we will use the ecgtk and wfdbtools packages by @rajajs.
https://twitter.com/rajajs
https://github.com/RajaS/ecgtk
sys.path.append(os.path.join('..', 'ecgtk'))
import wfdbtools
data_path = os.path.join('..', 'data')
os.listdir(data_path)
['a02.qrs', 'a02.dat', 'a01.dat', 'a02.apn', 'a01.hea', 'a01.apn', 'a03.hea', 'a03.dat', 'a02.hea']
fpath = os.path.join(data_path, 'a02.hea')
print "Reading data header file {}".format(fpath)
print "-" * 10
with open(fpath, 'r') as f:
    for ii in range(5):
        print f.readline()
#!head 'data/a02.qrs'
#!head 'data/a02.apn'
Reading data header file ../data/a02.hea
----------
a02 1 100 3182000
a02.dat 16 200 12 0 -4 28438 0 ECG
"a02.dat 16" tells us the data format we are dealing with is PhysioNet Format 16 (not, for example, Format 212). We need to know this so we can tell wfdbtools how to read our data.
For more information on how to read these data files, please refer to
http://physionet.org/physiobank/database/apnea-ecg/
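As a sanity check on what wfdbtools does for us: Format 16 stores each sample as a 16-bit two's-complement integer, least significant byte first, and the gain and zero values from the header convert raw counts to millivolts. A minimal sketch of reading such a file directly with numpy, using the gain (200) and zero (0) from the a02 header above, verified against a tiny synthetic file:

```python
import numpy as np
import os
import tempfile

def read_format16(path, gain=200.0, zero=0.0):
    """Read a PhysioNet Format 16 signal file: 16-bit little-endian
    two's-complement samples, converted to physical units (mV)."""
    raw = np.fromfile(path, dtype='<i2')  # raw ADC counts
    return (raw - zero) / gain            # counts -> mV

# Round-trip check on a tiny synthetic file:
samples = np.array([-4, 0, 100, -32768], dtype='<i2')
fd, tmp = tempfile.mkstemp(suffix='.dat')
os.close(fd)
samples.tofile(tmp)
mv = read_format16(tmp)
os.remove(tmp)
```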
# Note: rstrip('.hea') would strip *characters* from the set {., h, e, a},
# not the suffix, so slice the extension off instead.
wfpath = fpath[:-len('.hea')]
# here we use wfdbtools to load our data and metadata
dat, info = wfdbtools.rdsamp(
wfpath,
start=0, # define time range instead of reading
end=20, # reading in the whole file.
)
# info is a dict of metadata
for k, v in info.iteritems():
    print "{:<15} : {}".format(k, v)
# dat is our actual data
print "\nData shape: {}".format(dat.shape)
samp_freq       : 100.0
signal_count    : 1
gains           : [200.0]
file_format     : ['16']
samp_count      : 3182000
first_values    : [-4.0]
zero_values     : [0.0]
signal_names    : ['ECG']
units           : ['mV']

Data shape: (2000, 3)
# wfdbtools includes a simple way to plot our data
wfdbtools.plot_data(dat, info)
have 1 signals...
Instead of relying on wfdbtools' built-in plot, we'll use pandas. Why? Well, I like control over the plots I make. Pandas provides an easy way to plot our data exactly the way we want it. We'll also be able to leverage the extensive time series analysis tools available in pandas.
import pandas as pd
# Let's reload the data
# This time ALL the data not just the first 20 seconds
dat, info = wfdbtools.rdsamp(wfpath)
print dat.shape
(3182000, 3)
# Load numpy array into pandas and label columns
df = pd.DataFrame(dat, columns= ['#', 'sec', 'EKG'])
# Set index so the x-axis is properly labeled in our plot.
df.set_index('sec', drop=False, inplace=True)
# Just plot all the data
# This is a lot of data. Takes a few seconds to plot, and looks messy.
df.EKG.plot()
<matplotlib.axes.AxesSubplot at 0x3d68950>
# Plot customization
# Using a slice of just the first few seconds of EKG data
ekg_signal = df.EKG[:6]
def plot_ekg(ekg, title="EKG Plot, Customized"):
    fig, ax = plt.subplots(ncols=1, nrows=1, figsize=(10, 5))
    # data points as markers
    ekg.plot(
        marker='x',
        linestyle='',
        alpha=0.8,
        ax=ax,
    )
    # connecting line
    ekg.plot(
        c='b',
        linestyle='-',
        alpha=0.5,
        ax=ax,
    )
    plt.title(title);
plot_ekg(ekg_signal)
Now that we've loaded the data, we can build our anomaly detection system.
See @tdunning's method here
http://www.slideshare.net/tdunning/strata-2014-anomaly-detection
https://github.com/tdunning/anomaly-detection/
Here's what we'll be doing:
The first step is to break the EKG data into many small, fixed-size windows.
window_size = 32
print "Window size: " , window_size
step_size = window_size
print "Step size: ", step_size
total_time_available = info['samp_count'] / info['samp_freq'] # in seconds
print "Total data available: {} seconds".format(total_time_available)
print "Sample Frequency: {} samples per second".format(info['samp_freq'])
print "Size of data being processed: {} observations".format(len(fullsignal))
print "\t({} seconds)".format(len(fullsignal) / info['samp_freq'])
Window size: 32 Step size: 32 Total data available: 31820.0 seconds Sample Frequency: 100.0 samples per second Size of data being processed: 1000000 observations (10000.0 seconds)
#Choose how much of the signal to use here:
training_size = 2000000
fullsignal = dat[:training_size, 2]
# selecting just the 3rd column:
# 1st column is index, 2nd is time, 3rd is EKG data.
print "Using about {:.02f}% of the total available signal".format(
    len(fullsignal) / float(info['samp_count']) * 100
)
Using about 62.85% of the total available signal
window_list = []
nwindows = 0
for range_start in xrange(0, len(fullsignal), step_size):
    nwindows += 1
    range_end = range_start + step_size
    if range_end > len(fullsignal):
        continue  # drop a trailing partial window
    window = fullsignal[range_start:range_end]
    # Taper each window by a half-period of sin()
    window = window * np.sin(np.linspace(0, np.pi, num=window_size))
    window_list.append(window)
print "\nFull signal of size {} was broken up into {} pieces of length {} each.".format(
    len(fullsignal),
    len(window_list),
    len(window_list[0]),
)
Full signal of size 2000000 was broken up into 62500 pieces of length 32 each.
# All windows should be the same size.
window_lengths = np.array([len(w) for w in window_list])
print "Sanity check: Each window is the same size:",
print np.all(window_lengths == window_lengths[0])
Sanity check: Each window is the same size: True
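As an aside: because the step size equals the window size (no overlap), the same chopping and tapering can be done in one shot with a numpy reshape. This is a sketch equivalent to the loop above, not the tutorial's code:

```python
import numpy as np

def make_windows(signal, window_size):
    """Split a 1-D signal into non-overlapping windows and apply the
    same half-sine taper used above. Trailing samples that do not fill
    a whole window are dropped, as in the loop version."""
    n = (len(signal) // window_size) * window_size
    windows = signal[:n].reshape(-1, window_size)
    taper = np.sin(np.linspace(0, np.pi, num=window_size))
    return windows * taper  # taper broadcasts over each row

demo = make_windows(np.ones(100), 32)  # 100 samples -> 3 windows of 32
```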
Now we'll cluster all the windows into a smaller number of "representative" windows, which will compose a dictionary of windows.
k-means clustering will give us a "dictionary" of whatever size we want. The cluster centroids will be our dictionary values.
from sklearn.cluster import MiniBatchKMeans
"Minibatch" k-means is a variation of the usual k-means algorithm that incrementally updates the cluster centers using small batches of the data. It runs much faster on large datasets.
More info at the sklearn documentation
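Looking ahead: once the dictionary is fit, the anomaly score for a new window in this scheme is essentially its reconstruction error, i.e. the distance to the nearest dictionary entry. Here is a numpy-only sketch of that scoring step, using a toy two-entry dictionary with made-up values rather than our fitted centroids:

```python
import numpy as np

def reconstruction_error(window, dictionary):
    """Distance from a window to its nearest dictionary entry.
    Large values suggest a shape the dictionary has never seen."""
    dists = np.linalg.norm(dictionary - window, axis=1)
    return dists.min()

# Toy dictionary of two "normal" shapes (hypothetical values):
dictionary = np.array([[0.0, 1.0, 0.0],
                       [1.0, 0.0, 1.0]])

normal = reconstruction_error(np.array([0.0, 1.0, 0.1]), dictionary)
anomaly = reconstruction_error(np.array([5.0, 5.0, 5.0]), dictionary)
```

A windowed signal that resembles the training data scores low; a novel shape scores high, which is the signal we will threshold on.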
# Do the clustering here
print "Number of windows", len(window_list)
k = 200 # number of clusters
print "Number of clusters: ", k
#testdata = window_list[:500] # To help with testing, limit
model = MiniBatchKMeans(
n_clusters=k,
verbose=1,
)
model.fit_predict(window_list)
Number of windows 62500
Number of clusters:  200
Init 1/3 with method: k-means++
Inertia for init 1/3: 4.481913
Init 2/3 with method: k-means++
Inertia for init 2/3: 3.801283
Init 3/3 with method: k-means++
Inertia for init 3/3: 4.195376
Minibatch iteration 1/62500: mean batch inertia: 0.048536, ewa inertia: 0.048536
Minibatch iteration 2/62500: mean batch inertia: 0.027129, ewa inertia: 0.048467
Minibatch iteration 3/62500: mean batch inertia: 0.038008, ewa inertia: 0.048434
...
[MiniBatchKMeans] Reassigning 51 cluster centers.
...
Minibatch iteration 318/62500: mean batch inertia: 0.095787, ewa inertia: 0.035755
Minibatch iteration 319/62500: mean batch inertia: 0.023504, ewa inertia: 0.035715
[MiniBatchKMeans] Reassigning 31 cluster centers.
(verbose output truncated: the EWA inertia falls steadily from ~0.049 to ~0.036 over the first 319 minibatch iterations)
Minibatch iteration 320/62500:mean batch inertia: 0.025410, ewa inertia: 0.035682 Minibatch iteration 321/62500:mean batch inertia: 0.023598, ewa inertia: 0.035644 Minibatch iteration 322/62500:mean batch inertia: 0.046006, ewa inertia: 0.035677 Minibatch iteration 323/62500:mean batch inertia: 0.166553, ewa inertia: 0.036096 Minibatch iteration 324/62500:mean batch inertia: 0.037800, ewa inertia: 0.036101 Minibatch iteration 325/62500:mean batch inertia: 0.022410, ewa inertia: 0.036057 Minibatch iteration 326/62500:mean batch inertia: 0.023663, ewa inertia: 0.036018 Minibatch iteration 327/62500:mean batch inertia: 0.018579, ewa inertia: 0.035962 Converged (lack of improvement in inertia) at iteration 327/62500 Computing label assignment and total inertia
array([171, 91, 170, ..., 3, 76, 16], dtype=int32)
Set the k-means cluster centers as our basis dictionary. This dictionary will be used to reconstruct all our signals; it represents the most commonly encountered windows in our "normal" training data set.
## Set the k-means cluster centers as our basis dictionary
basis_dict = model.cluster_centers_
import math

# Visually examine the basis dictionary.
ksmall = min(k, 15)
if k > ksmall:
    print "Large dictionary. Plotting just the first {}.".format(ksmall)
    print "... there are {} additional windows".format(k - ksmall)
    print "    in the dictionary that we will not plot."

ncols = 5
nrows = int(math.ceil(ksmall / 5.))

fig, axes = plt.subplots(
    ncols=ncols,
    nrows=nrows,
    figsize=(3 * ncols, 3 * nrows)
)
axiter = axes.flat

for clust_idx in xrange(ksmall):
    axiter[clust_idx].plot(basis_dict[clust_idx])
Large dictionary. Plotting just the first 15. ... there are 185 additional windows in the dictionary that we will not plot.
k, the number of clusters in k-means, is also the number of items in the basis dictionary we will use to reconstruct EKG signals.
Intuitively, we want a large dictionary, so our reconstruction has the most flexibility and, hopefully, the smallest error.
However, since k-means will force the dataset into however many clusters we specify, we must be careful about the quality of our dictionary. If k is too large, we may end up using very unusual windows to reconstruct our signals, which is bad.
We can get a sense for this by looking at how many windows are in each cluster.
from collections import Counter
clust_counter = Counter(model.labels_)
tt = clust_counter.most_common(5)
print [idx for idx,size in tt]
[11, 126, 138, 171, 82]
# These are the 5 largest clusters
# and the number of windows they contain
print "Largest clusters:"
print "\t(cluster_index, n_windows_in_cluster)"

large_clusters = clust_counter.most_common(5)
print "\t", large_clusters
fig, axes = plt.subplots(ncols=5, nrows=1, figsize=(15, 3))
axiter = axes.flat

for ii, clust_idx in enumerate([idx for idx, size in large_clusters]):
    axiter[ii].plot(basis_dict[clust_idx])
Largest clusters: (cluster_index, n_windows_in_cluster) [(11, 1158), (126, 821), (138, 814), (171, 805), (82, 801)]
# These are the 5 smallest clusters
# and the number of windows they contain
print "Smallest clusters:"
print "\t(cluster_index, n_windows_in_cluster)"

small_clusters = clust_counter.most_common()[:-6:-1]
print "\t", small_clusters
fig, axes = plt.subplots(ncols=5, nrows=1, figsize=(15, 3))
axiter = axes.flat

for ii, clust_idx in enumerate([idx for idx, size in small_clusters]):
    axiter[ii].plot(basis_dict[clust_idx])
Smallest clusters: (cluster_index, n_windows_in_cluster) [(188, 8), (58, 10), (153, 58), (48, 90), (19, 94)]
There are several clusters with only a handful of windows in them. Are these important to have in the dictionary, or are they worsening our performance? Visualizing the results suggests that these are probably windows we do not want to include in our dictionary.
Remove some of the smallest clusters from the dictionary to see if this affects our performance.
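One way to try this exercise is to drop basis vectors whose clusters fall below a minimum membership count. A minimal sketch with a toy dictionary follows; `prune_dictionary` is a hypothetical helper, written against the same shapes as `model.cluster_centers_` and `model.labels_` above.

```python
import numpy as np
from collections import Counter

def prune_dictionary(basis, labels, min_windows=10):
    """Drop basis vectors whose clusters have fewer than min_windows members."""
    counts = Counter(labels)
    keep = [idx for idx in range(len(basis)) if counts.get(idx, 0) >= min_windows]
    return basis[keep]

# Toy example: 4 basis vectors; cluster 3 has only one member, so it is pruned.
basis = np.arange(8, dtype=float).reshape(4, 2)
labels = np.array([0, 0, 1, 1, 2, 2, 3])
pruned = prune_dictionary(basis, labels, min_windows=2)
print(pruned.shape)  # (3, 2)
```

On the real model you would call `prune_dictionary(model.cluster_centers_, model.labels_)` and then re-run the reconstruction experiments with the smaller dictionary.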
Why redo all our work every time? sklearn has an easy way of saving and loading models, which makes it very easy to experiment.
Here we save the sklearn model we've constructed. This is the trained minibatch k-means produced by the model.fit()
command above, from which we got the basis dictionary.
Alternately, you may just save the basis dictionary instead using np.savez()
. Joblib's model persistence is really just a wrapper around Python pickling.
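The np.savez() alternative looks like this. A minimal sketch with a toy stand-in for the cluster centers; the filename and array key are arbitrary choices, not fixed by the tutorial.

```python
import numpy as np
import tempfile, os

# Toy stand-in for the k-means cluster centers (k x window_size).
basis_dict = np.random.rand(5, 32)

# Save just the dictionary to a .npz archive under a chosen key...
path = os.path.join(tempfile.gettempdir(), "basis_dict.npz")
np.savez(path, basis=basis_dict)

# ...and reload it later by that key.
reloaded = np.load(path)["basis"]
print(np.allclose(basis_dict, reloaded))  # True
```

This saves only the array, not the fitted estimator, so it is enough for reconstruction but not for calling model methods later.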
modelfilename = "model_trainsize={}_stepsize={}_windowsize={}_nclusters={}.pkl".format(
len(fullsignal),
step_size,
window_size,
k
)
from sklearn.externals import joblib
joblib.dump(model, modelfilename, compress=9)
['model_trainsize=2000000_stepsize=32_windowsize=32_nclusters=200.pkl']
Use the following method to load a model you've previously saved.
modelfilename = <filename_of_your_saved_model>
from sklearn.externals import joblib
model_reloaded = joblib.load(modelfilename)
We'll now use this "dictionary" of normal windows to reconstruct signals.
For clarity, we'll do this in two steps. First we'll just look at one target window, and try to find the window in the dictionary that best represents the target. We'll measure success using mean squared error.
(Yeah, probably "reconstruction" is not the right word here. We are just picking one item from the dictionary.)
In the next step, we'll reconstruct a longer signal (composed of multiple windows).
from sklearn.metrics import mean_squared_error

## Reconstructing a single window
def reconstruct_window(window, basis, plot=False):
    proj = np.apply_along_axis(window.dot, 1, basis)

    # Find the closest basis vector
    maxidx = np.abs(proj).argmax()

    if plot:
        print "Best Projection: ", proj[maxidx]

        # Visual inspection
        fig, axes = plt.subplots(2, 1, figsize=(6, 6))
        axes[0].plot(window)
        axes[0].set_title('Original')
        axes[1].plot(basis[maxidx])
        axes[1].set_title('Reconstructed')

        # Report the reconstruction error
        print "Error: ", mean_squared_error(window, basis[maxidx])

    return basis[maxidx]
# Select any window for testing
some_window = window_list[100]
# Reconstruct this window
basis_window = reconstruct_window(some_window, basis_dict, plot=True)
Best Projection: 0.307076572506 Error: 0.000165651622713
These two -- the original and the selection from the basis dictionary -- should look pretty similar, if all is going well. And of course, the error should be very small.
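Note that reconstruct_window selects the basis vector with the largest absolute projection. An alternative criterion (not the one used in this notebook) is to pick the dictionary entry with the smallest mean squared error directly; a self-contained toy sketch:

```python
import numpy as np

def nearest_by_mse(window, basis):
    """Pick the dictionary row with the smallest mean squared error."""
    errors = ((basis - window) ** 2).mean(axis=1)
    return basis[errors.argmin()], errors.min()

# Toy check: the window itself is in the dictionary, so the error is zero.
basis = np.array([[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]])
window = np.array([1.0, 0.0])
best, err = nearest_by_mse(window, basis)
print(err)  # 0.0
```

The two criteria can disagree when windows differ mostly in amplitude; the projection criterion favors alignment of shape, while the MSE criterion penalizes scale differences as well.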
Now, given a longer signal, we will reconstruct it by combining multiple windows from the basis dictionary.
## Reconstruct a longer signal
def reconstruct_signal(sig, basis, plot=False):
    dict_size, window_size = basis.shape
    n_segments = 2 * int(len(sig) / window_size) - 1
    rec = np.zeros(len(sig), dtype=float)

    print "n segments to use is:", n_segments
    print "dictionary (basis set) size is:", dict_size

    # Split the signal into half-overlapping windows
    for ii in xrange(n_segments):
        # Obtain index range
        start_idx = int(ii * window_size / 2)
        end_idx = start_idx + window_size

        # Slice the signal over this range to get a window
        s = sig[start_idx:end_idx]

        # Reconstruct this window
        bvec = reconstruct_window(s, basis, plot=False)

        # Update the complete reconstruction
        rec[start_idx:end_idx] += bvec

    if plot:
        print "window size is: ", window_size
        print "Error: ", mean_squared_error(sig, rec)

        fig, axes = plt.subplots(nrows=3, ncols=1,
                                 figsize=(int(n_segments / 3), 4 * 3))
        axes[0].plot(sig)
        axes[0].plot(rec)
        axes[0].set_title('Original and reconstructed signal')
        ylims = axes[0].get_ylim()
        axes[1].plot(sig)
        axes[1].set_title('Original signal')
        axes[2].plot(rec)
        axes[2].set_title('Reconstructed signal')
        #axes[3].plot(sig - rec)
        #axes[3].set_ylim(ylims)
        #axes[3].set_title('Residual')

    return rec
# Select a longer signal to reconstruct for testing.
signal = fullsignal[:800]
signal_reconstructed = reconstruct_signal(signal, basis_dict, plot=True)
n segments to use is: 49 dictionary (basis set) size is: 200 window size is: 32 Error: 0.0114822651959
The residual is just the difference between the original signal and the reconstructed signal. The better our reconstruction, the smaller and less significant the residual will be.
This section is TODO
np.set_printoptions(precision=2, linewidth=100)
# explicitly calculate the residual here.
residual = signal - signal_reconstructed
# bounds to look at:
start = 10; stop = 20
print signal[start:stop]
print signal_reconstructed[start:stop]
print residual[start:stop]
print '-'
# Sanity check on error calculation
tt = np.power(signal[start:stop] - signal_reconstructed[start:stop], 2)
print tt.mean()
print mean_squared_error(signal[start:stop], signal_reconstructed[start:stop])
[-0.04 -0.06 -0.04 -0.06 -0.06 -0.04 0.01 0.03 0.06 0.06] [-0.05 -0.07 -0.09 -0.08 -0.1 -0.08 -0.1 -0.09 -0.06 0.2 ] [ 0.02 0.02 0.05 0.03 0.04 0.04 0.11 0.12 0.12 -0.15] - 0.00680137869974 0.00680137869974
plt.plot(residual)
[<matplotlib.lines.Line2D at 0xc358890>]
Now let's reconstruct a much larger signal. We won't plot this, but we'll check the mean squared error to see how we did.
Note that we are still reconstructing part of our training data set. We should expect to see a small "training error" if things are going well.
Next we'll try loading data we didn't use to construct our basis dictionary, and see how well the reconstruction method performs. This "test error" will be a more trustworthy way to evaluate our success.
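The train/test distinction can be illustrated with a standalone toy sketch (not the EKG data): if the dictionary is built from one slice of a signal, it reconstructs that slice with zero error by construction, while unseen windows incur a real error. All names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy signal: a noisy sine split into a training half and a test half.
t = np.linspace(0, 20 * np.pi, 2000)
signal = np.sin(t) + 0.05 * rng.standard_normal(t.size)
train, test = signal[:1000], signal[1000:]

window_size = 50

def windows(x, w):
    # Chop a 1-D signal into non-overlapping windows of length w.
    return x[: len(x) // w * w].reshape(-1, w)

train_w, test_w = windows(train, window_size), windows(test, window_size)

def recon_error(targets, dictionary):
    # Mean over windows of the best-match MSE against the dictionary.
    return np.mean([((dictionary - w) ** 2).mean(axis=1).min() for w in targets])

# The dictionary *is* the set of training windows, so the training error is
# exactly zero, while unseen test windows incur a nonzero error.
print(recon_error(train_w, train_w))     # 0.0
print(recon_error(test_w, train_w) > 0)  # True
```

The k-means dictionary is a compressed version of this idea: it cannot match training windows exactly, but the same optimism of training error applies.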
# Warning: Depending on the signal size, this can take a long time.
# Below we'll use %%timeit to check the running time, for reference.
sig_size = 50000
long_signal = fullsignal[:sig_size]
long_signal_reconstructed = reconstruct_signal(long_signal,
basis_dict,
plot=False)
n segments to use is: 3123 dictionary (basis set) size is: 200
%%timeit
long_signal_reconstructed = reconstruct_signal(long_signal,
basis_dict,
plot=False)
print """
Parameters:
Signal size to reconstruct = {}
Size of basis dictionary = {}
Window size = {}
Step size = {}
Results in Mean Squared Error: {}""".format(
sig_size,
k,
window_size,
step_size,
mean_squared_error(long_signal, long_signal_reconstructed)
)
Parameters: Signal size to reconstruct = 50000 Size of basis dictionary = 200 Window size = 32 Step size = 32 Results in Mean Squared Error: 0.016030055768
There are many things you might want to try to improve this model. I've listed some ideas below. Which do you think would yield the most improvement?
A good model for anomaly detection is one that has a very low error (a small residual) for normal cases, and a high error (a large residual) for anomalies of interest.
We have a small error for the signal used to build the basis, but that's to be expected. We need to confirm that the error stays small on normal data the model was not trained on, and grows large on anomalous data.
We'd like to find EKG data from an unhealthy heart. We can try to reconstruct that anomalous signal using our trained model -- which uses a dictionary constructed from healthy heart data only.
We therefore expect poor performance (high error) when reconstructing anomalous EKG signals, and we can use the error rate as an estimate of how likely a signal is to be anomalous.
# Load a signal segment known to be anomalous.
anomalous_dataset = 'a01'
start = 4825
end = start + 15
wfpath = os.path.join(data_path, anomalous_dataset)
dat, info = wfdbtools.rdsamp(wfpath,
start=start,
end=end
)
print info
print dat.shape
{'samp_freq': 100.0, 'signal_count': 1, 'gains': [200.0], 'file_format': ['16'], 'samp_count': 2957000, 'first_values': [-12.0], 'zero_values': [0.0], 'signal_names': ['ECG'], 'units': ['mV']} (1500, 3)
anomalous_ekg = dat[:, 2]
# Plot to see the anomalous signal
plt.figure(figsize=(16,3))
plt.plot(anomalous_ekg)
plt.title('{}, Start: {}'.format(wfpath, start))
<matplotlib.text.Text at 0x853d510>
reconstruct_signal(anomalous_ekg, basis_dict, plot=True)
n segments to use is: 91 dictionary (basis set) size is: 200 window size is: 32 Error: 0.0429759867909
array([ 0. , -0.01, -0.01, ..., 0. , 0. , 0. ])
We should expect to see a larger error in reconstructing the anomalous EKG than when we reconstructed a normal signal. The more dramatic the difference, the better our model. We have a twofold goal: a low reconstruction error on normal signals, and a high error on anomalous ones.
Once our model is satisfactory, we could imagine creating an alarm that goes off when a signal has an error rate above some threshold. This would be our "stopping condition", as we discussed in previous sections.
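Such an alarm can be sketched in a few lines. The per-segment errors and the threshold below are hypothetical, chosen only to match the scale of the MSEs observed above (~0.016 for normal reconstruction, ~0.043 for the anomalous segment).

```python
import numpy as np

def anomaly_alarm(errors, threshold):
    """Flag segments whose reconstruction error exceeds the threshold."""
    return np.asarray(errors) > threshold

# Hypothetical per-segment reconstruction errors.
segment_errors = [0.016, 0.012, 0.043, 0.015]
alarms = anomaly_alarm(segment_errors, threshold=0.03)
print(alarms.tolist())  # [False, False, True, False]
```

In practice the threshold would be set from the distribution of reconstruction errors on held-out normal data, trading off false alarms against missed anomalies.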
Discuss limitations and potential improvements.
Return to the Change Detection Tutorial Table of Contents