author: @amanqa
source: github.com/amanahuja/change-detection-tutorial
# Modified Stylesheet for notebook.
from IPython.core.display import HTML
def css_styling():
styles = open("custom.css", "r").read()
return HTML(styles)
css_styling()
Lesson Plan
import matplotlib.pyplot as plt
import numpy as np
from collections import defaultdict
np.random.seed(seed=111111)
np.set_printoptions(precision=3, suppress=True)
So far, we have
We'll import the utility code so we can use it:
import sys; sys.path.append('../src/')
from change_detector import ChangeDetector, OnlineSimulator
How well would our previous change detector, the MeanDetector, work on a signal that had a seasonal component?
# Create a seasonal signal
# I imagined a metric that rises from 0 to 5 each calendar month
sig = np.linspace(0, 5, num=30)
sig = np.concatenate([sig for x in xrange(12)])
# Add a jump
jump_size = 5
sig[250:] = sig[250:] + jump_size
# Noise
noise = np.random.normal(
size=sig.shape,
scale=jump_size * 0.02)
plt.figure(figsize=(15,5))
plt.plot(sig + noise, 'b.', linestyle='')
plt.plot(sig + noise, 'b-', alpha=0.15)
plt.ylim(0,15)
plt.xlim(0,365)
plt.title("Imaginary Seasonal signal")
<matplotlib.text.Text at 0x39ba850>
We can imagine that the MeanDetector will be ineffective against such a signal. Why? What kind of change detector would work?
Streaming windows is an approach to change detection where we maintain statistics (such as mean) on two separate samples (or windows) of the signal.
Commonly, for one of the windows we use just the $n$ most recent data points. For the other window, we might use a much larger window or all of the signal that we've seen since $t=0$.
Then, we can compare what's been going on "recently" with what the signal has done as a whole "so far". There's plenty of research on how to compare two populations, which we'll be able to take advantage of.
There's a practical limitation, though, in streaming problems. It's usually impossible to store the entire history of data points and compute statistics at each step. We need clever ways of computing the statistics we need using on-line algorithms.
In this section, we'll use Welford's Method to compute mean and variance. This allows us to contimually update an estimate of these two parameters as our detector gets each new data point. We can do this without storing a history of all the data points we've seen before.
A separate section of the tutorial will take a closer look at Welford's method for those that are interested. Refer to the Tutorial Table of Contents if you'd like to read this material before proceeding.
When comparing two gaussian models, a z-score is the distance from the sample mean to the population mean in units of the standard error. It's often used to determine if a sample of data points is deviating from the expected statistics of the population.
$$z_{score} = \frac{(M - \sigma )}{SE}$$For our next change detector, we'll maintain two sets of statistics about our signal, and compare them using the z-score. Referring to the formulae above, we'll need to compute the mean and standard deviation for each set.
We'll use the global stream -- the entire signal that we've seen so far -- to compute the first set of statistics. Imagine this is our "population". We'll use a moving window of the $n$ most recent points for the second set of statistics, we'll think of that as our "sample".
The z-score, then, will tell us the "difference" between our local window and the global stream.
import change_detector
reload(change_detector)
from change_detector import ChangeDetector, OnlineSimulator
from collections import deque
class ZScoreDetector(ChangeDetector):
"""
A change detection algorithm based on z-scores
---
1. We will use Welford's algorithm to collect mean and variance
of the full stream.
2. At the same time, we will collect a window of recent
observations. We are storing all the recent observations for this
window, and compute local statistics traditionally.
3. Then, we compute a Z-score to compare the sample (window)
distribution to the population (global) distribution.
Our stopping rule is based on this computed z-score
"""
def __init__(self, window_size = 100, threshold=0.05):
super( ZScoreDetector, self ).__init__()
# hyper-parameters
self.threshold = threshold
self.window_size = window_size
# Interim and calculated values
self.k = 0 # total signal_size
self.g_mean = 0.0 # global mean
self.s = 0.0 # for Welford's method. variance = s / (k + 1)
# This is the window
self.window = deque(maxlen = window_size)
# ... and, finally, our residuals
self.z_score_ = np.nan
self.z_array = []
def update_residuals(self, new_signal_value):
self._update_base_residuals(new_signal_value)
x = new_signal_value
# Add new value to local window (deque will
# automatically drop a value to maintain window size)
self.window.append(x)
"""Calculate Statistics on global and local window """
# Calculate global statistics using welford's method
oldm = self.g_mean
newm = oldm + (x - oldm) / (self.k + 1)
s = self.s + (x - newm) * (x - oldm)
g_mean = newm # Global mean
g_std = np.sqrt(s / (self.k+1)) # Global std
# Calculate local statistics on the window
# We have all values stored for the window, so
# can use built-in numpy stats methods
w_mean = np.mean(self.window) # Window mean
w_std = np.std(self.window) # Window std
"""Calculate variables required for Zscore"""
# Calculate SE, see formula above
std_diff = (g_std - w_std) / g_std
SE = g_std / np.sqrt(self.window_size)
# Difference between the means
mean_diff = (g_mean - w_mean) / g_mean
# Z-score (residual)
self.z_score_ = (w_mean - g_mean) / SE
self.z_array.append(self.z_score_)
# Store attributes
self.g_mean = g_mean
self.s = s
# Update k (size of global window).
# This must be done at the end!
self.k += 1
def check_stopping_rules(self, new_signal_value):
# Check stopping rule!
if np.abs(self.z_score_) > self.threshold:
self.rules_triggered = True
To test our detector, let's see what happens when we pass some signals that we looked at in previous sections.
# Our first, most basic, signal.
basic_sig = np.ones(150)
basic_sig[:100] *= 50
basic_sig[100:] *= 4
# On Signal 1
detector = ZScoreDetector(window_size = 10, threshold=10.0)
OnlineSimulator(detector, basic_sig).run(
signal_name='Original first signal, sig1'
)
Residuals: ['z_score_'] Change detected. Stopping Rule triggered at 109.
-c:64: RuntimeWarning: invalid value encountered in double_scalars -c:71: RuntimeWarning: invalid value encountered in double_scalars
True
The stopping rules were triggered, but not immediately. This is an effect of having a local window. With a window size of 10, it took until data point 109 for the stopping rule to be triggered.
In general, we would now only be able to say that "the change took place somewhere in our local window." The larger the window, the less sure we are where the change took place, and the more delay in detection time.
This is a characteristic of many online change detectors. There is a trade off between the delay for detection and the frequency of false alarms.
TODO: insert Basseville Quote? This one isn't the one I was looking for...
"Apart from the tradeoff between the mean time between false alarms and the delay for detection -- both increasing when the sensitivity of the detector to high frequencies decreases -- there exists another tradeoff to be kept in mind which is closely related to the first one: efficiency vs complexity. -- Basseville"
plt.figure(figsize=(15,5))
plt.plot(sig + noise, 'b.', linestyle='')
plt.plot(sig + noise, 'b-', alpha=0.15)
plt.ylim(0,15)
plt.xlim(0,365)
plt.title("Imaginary Seasonal signal")
<matplotlib.text.Text at 0x3d77190>
detector = ZScoreDetector(window_size = 30, threshold=10.0)
OnlineSimulator(detector, sig).run(signal_name = "Imaginary Seasonal Signal")
Residuals: ['z_score_'] Change detected. Stopping Rule triggered at 276.
-c:68: RuntimeWarning: invalid value encountered in double_scalars
True
jump_size = 30
sig2 = np.linspace(40, 50, num=20)
sig2 = np.concatenate([sig2 for x in xrange(8)])
# TODO need a change to detect
sig2[120:] = sig2[120:] + jump_size
noise = np.random.normal(
size=sig2.shape,
scale=jump_size * 0.05
)
sig2 = sig2 + noise
detector = ZScoreDetector(window_size = 20, threshold=10.0)
OnlineSimulator(detector, sig2).run(signal_name = "Second Imaginary Seasonal Signal")
Residuals: ['z_score_'] Change detected. Stopping Rule triggered at 138.
True
This is pretty impressive, eh?
Even visually inspecting the original signal, it seems the change is small. Yet the detector does a good job here.
The residual stays at fairly low values, and then jumps dramatically after the point of the change.
The detector is resistant to the noise in the signal, and managed well with the seasonality.
The dramatic jump in our residual suggests that we're not too sensitive to the chosen threshold. The jump in the residual is large, even though the change (jump size) is not much larger than the seasonality magnitude and noise.
So far, the noise we've added has been gaussian in shape. What if we had signal points that were several standard deviations away from the mean?
This is a type of noise commonly found in signals. They can occur due to data entry or import errors, or missing data points.
We looked at a noisy signal before. This time, we'll increase the noise level and add some outliers to the signal, too.
jump_size = 20
signal_size = 300
# Create a signal with a small jump (change)
sig3 = np.ones(signal_size)
sig3[:150] = 30
sig3[150:] = 30 + jump_size
# Add noise
noise = np.random.normal(
scale = 0.10 * jump_size,
size=signal_size)
sig3 = sig3 + noise
# Add outliers
outlier_idx = np.random.choice(signal_size, size=3)
sig3[outlier_idx] = 0
outlier_idx = np.random.choice(signal_size, size=2)
sig3[outlier_idx] = 90
# What does this look like?
plt.figure(figsize=(15, 5))
plt.plot(sig3, 'b.')
plt.plot(sig3, '-', alpha=0.15)
plt.ylim(0,100)
(0, 100)
Early on, I said I was adding "noise" to the system, and I generated this noise by sampling a gaussian distribution. Now I'm adding these radical data points to the system and calling them anomalies.
It's time to get some definitions straight!
What do we mean when we talk about noise, outliers, and anomalies? How are they different?
[quote]
Since data entry errors and missing data can be considered to have been generated by a different mechanism, it would be sensible to call them anomalies.
But the real question is, do we want to set off the alarm when we see them, or do we want to ignore them? That's the critical question, and the answer depends on the application we're working on and the needs of its users.
[left-to-right chart: --> outlier (noise) --> outlier --> anomaly]
This is one of the first questions you'll have to address when designing any change detection or anomaly detection system: what kind of anomalies are important to identify? And what should be ignored?
Okay, let's see how our zscore detector does on sig3, which we created above.
# On Signal 3
detector = ZScoreDetector(window_size = 40, threshold=8.0)
OnlineSimulator(detector, sig3).run(signal_name = "Signal with noise and outlier")
Residuals: ['z_score_'] Change detected. Stopping Rule triggered at 180.
True
Return to the Change Detection Tutorial Table of Contents