This notebook demonstrates using BigQuery to slice and dice request logs, construct a timeseries over an observable metric (such as 99th percentile latency), and then train an anomaly detector over that timeseries.
import numpy as np
import samples.cloud as cloud
from samples.timeseries import AnomalyDetector
We have a simple log table. Here is the schema of the collected data.
logs_table = cloud.BigQuery.table('[requestlogs.logs]')
logs_table.schema()
[{u'name': u'timestamp', u'type': u'TIMESTAMP'},
 {u'name': u'latency', u'type': u'INTEGER'},
 {u'name': u'status', u'type': u'INTEGER'},
 {u'name': u'method', u'type': u'STRING'},
 {u'name': u'endpoint', u'type': u'STRING'}]
Here is a query to fetch a sample of the data in the table, to get a sense of the log data. We'll issue the query and convert the results to a pandas DataFrame, which provides easy slicing and formatted tabular display.
%%bigQuery logs
SELECT timestamp,latency,status,method,endpoint
FROM $logs_table
ORDER BY timestamp
LIMIT 100
df = logs.dataFrame()
df[:10]
| | endpoint | latency | method | status | timestamp |
| --- | --- | --- | --- | --- | --- |
| 0 | Interact3 | 122 | GET | 200 | 2014-06-15 07:00:00.003772 |
| 1 | Interact3 | 144 | GET | 200 | 2014-06-15 07:00:00.428897 |
| 2 | Interact3 | 48 | GET | 200 | 2014-06-15 07:00:00.536486 |
| 3 | Interact2 | 28 | GET | 405 | 2014-06-15 07:00:00.652760 |
| 4 | Interact3 | 103 | GET | 200 | 2014-06-15 07:00:00.670100 |
| 5 | Interact2 | 121 | GET | 405 | 2014-06-15 07:00:00.834251 |
| 6 | Other | 28 | GET | 200 | 2014-06-15 07:00:00.943075 |
| 7 | Interact2 | 124 | GET | 405 | 2014-06-15 07:00:01.000102 |
| 8 | Interact3 | 49 | GET | 200 | 2014-06-15 07:00:01.071107 |
| 9 | Other | 119 | GET | 200 | 2014-06-15 07:00:01.159701 |
We're going to build a timeseries over latency. To make it a useful metric, we'll look at the 99th percentile latency of requests within fixed 5-minute windows, using the SQL query below (notice the grouping over a truncated timestamp, and the quantile aggregation).
%%bigQuery latency
SELECT INTEGER(timestamp / 300000000) * 300000000 AS window,
       NTH(99, QUANTILES(latency, 100)) AS latency
FROM $logs_table
WHERE endpoint = 'Recent'
GROUP BY window
ORDER BY window
df_latency = latency.dataFrame()
df_latency[:10]
| | latency | window |
| --- | --- | --- |
| 0 | 427 | 1.402816e+15 |
| 1 | 329 | 1.402816e+15 |
| 2 | 293 | 1.402816e+15 |
| 3 | 242 | 1.402817e+15 |
| 4 | 332 | 1.402818e+15 |
| 5 | 288 | 1.402818e+15 |
| 6 | 299 | 1.402819e+15 |
| 7 | 294 | 1.402820e+15 |
| 8 | 111 | 1.402820e+15 |
| 9 | 361 | 1.402821e+15 |
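The window values are legacy BigQuery TIMESTAMPs: microseconds since the epoch, truncated down to 5-minute (300000000 microsecond) boundaries by the query. As a minimal sketch, here is the same arithmetic in Python, using a made-up timestamp and a small helper to turn a window back into a readable datetime:

import datetime

FIVE_MINUTES_US = 5 * 60 * 1000000  # 300000000, the constant in the query

def window_start(timestamp_us):
    # Mirror INTEGER(timestamp / 300000000) * 300000000: truncate a
    # microsecond timestamp down to its 5-minute window boundary.
    return (timestamp_us // FIVE_MINUTES_US) * FIVE_MINUTES_US

def window_to_datetime(window_us):
    # Convert a window boundary back into a readable UTC datetime.
    return datetime.datetime.utcfromtimestamp(window_us / 1000000)

window_to_datetime(window_start(1402816512345678))
# -> datetime.datetime(2014, 6, 15, 7, 15)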
latency_timeseries = df_latency['latency'].values
latency_timeseries_length = len(latency_timeseries)
figsize(11, 7)
latency_ticks = np.arange(latency_timeseries_length)
plot(latency_ticks, latency_timeseries)
yscale('log')
grid()
We'll create an anomaly detector (implemented as part of this sample). This one uses a simple strategy: it tracks the mean and standard deviation of the values it observes. It takes a parameter indicating the number of initial values to use for training; beyond that, it continues to train itself on every observation, to account for slow-evolving changes in the application.
Any value that is off from the mean by more than 3x the standard deviation is flagged as an anomaly.
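The real AnomalyDetector ships with this sample's code; purely as an illustration of the strategy described above, a minimal sketch of such a detector (assuming it simply keeps the values it has seen and applies a 3-sigma rule after the training period) could look like this:

class AnomalyDetector(object):
    # Illustrative sketch only; the actual implementation ships with the
    # sample. Tracks the mean and standard deviation of observed values,
    # and flags any value more than 3 standard deviations from the mean.

    def __init__(self, training_count):
        self._training_count = training_count  # values consumed purely for training
        self._values = []

    def observation(self, value):
        anomaly = 0
        if len(self._values) >= self._training_count:
            mean = np.mean(self._values)
            std = np.std(self._values)
            if std > 0 and abs(value - mean) > 3 * std:
                anomaly = 1
        # Keep training on every value, to track slow-evolving changes.
        self._values.append(value)
        return anomaly, np.mean(self._values)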
latency_anomalies = np.zeros(latency_timeseries_length)
latency_means = np.zeros(latency_timeseries_length)
latency_anomaly_detector = AnomalyDetector(36)
for i, value in enumerate(latency_timeseries):
    anomaly, mean = latency_anomaly_detector.observation(value)
    latency_anomalies[i] = anomaly
    latency_means[i] = mean
plot(latency_ticks, latency_timeseries)
plot(latency_ticks[latency_anomalies == 1],
     latency_timeseries[latency_anomalies == 1], 'ro')
plot(latency_ticks, latency_means, 'g', linewidth=1)
yscale('log')
grid()
This same technique works for other metrics, e.g. request load: the count of requests within a specific time window.
%%bigQuery load
SELECT INTEGER(timestamp / 300000000) * 300000000 AS window,
       COUNT(timestamp) AS requests
FROM $logs_table
WHERE endpoint = 'Home'
GROUP BY window
ORDER BY window
df_load = load.dataFrame()
df_load[:10]
| | requests | window |
| --- | --- | --- |
| 0 | 47 | 1.403075e+15 |
| 1 | 11 | 1.403075e+15 |
| 2 | 17 | 1.403075e+15 |
| 3 | 31 | 1.403076e+15 |
| 4 | 17 | 1.403076e+15 |
| 5 | 25 | 1.403076e+15 |
| 6 | 24 | 1.403077e+15 |
| 7 | 26 | 1.403077e+15 |
| 8 | 30 | 1.403077e+15 |
| 9 | 12 | 1.403078e+15 |
load_timeseries = df_load['requests'].values
load_timeseries_length = len(load_timeseries)
load_ticks = np.arange(load_timeseries_length)
plot(load_ticks, load_timeseries)
yscale('log')
grid()
load_anomalies = np.zeros(load_timeseries_length)
load_means = np.zeros(load_timeseries_length)
load_anomaly_detector = AnomalyDetector(36)
for i, value in enumerate(load_timeseries):
    anomaly, mean = load_anomaly_detector.observation(value)
    load_anomalies[i] = anomaly
    load_means[i] = mean
plot(load_ticks, load_timeseries)
plot(load_ticks[load_anomalies == 1],
     load_timeseries[load_anomalies == 1], 'ro')
plot(load_ticks, load_means, 'g', linewidth=1)
yscale('log')
grid()
BigQuery provides the ability to slice data, group it, and aggregate it (as well as to stream in data as it is collected, not shown here).
This notebook demonstrated interactive analysis over historical data. The same techniques could be applied to recent data (e.g. the last 5 minutes' worth of logs) on a periodic basis, to detect anomalies on an ongoing basis.
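As a sketch of what that periodic loop might look like (the fetch_latest_latency and alert helpers here are hypothetical, not part of this sample): a long-lived detector fed one window of data every 5 minutes.

import time

detector = AnomalyDetector(36)

while True:
    # Hypothetical helper: issue the same windowed quantile query, with a
    # WHERE clause restricting it to the last 5 minutes of logs.
    value = fetch_latest_latency()
    anomaly, mean = detector.observation(value)
    if anomaly:
        # Hypothetical helper: page someone, post to a dashboard, etc.
        alert('99th percentile latency anomaly: %d (mean %.1f)' % (value, mean))
    time.sleep(300)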