#!/usr/bin/env python
# coding: utf-8
# ## Having Fun with Syslog.
#
#
# In the next exercise we're going to look at some syslog data. We'll take it up a notch by computing similarities with 'Banded MinHash' and running a hierarchical clustering algorithm.
#
# System logs are particularly challenging because they often lack any kind of structure at all. For this exercise we're going to look at the /var/log/system.log of a typical Mac OS X laptop. The first steps will be standard data inspection and plots; after that we'll pull out the big guns!
#
#
#
#
# So, let's take it up a notch with the good ol' spice weasel! Bam!
# 1. **Compute similarities between all rows within the system log using LSH**
# Unlike conventional hash functions, the goal of LSH (Locality Sensitive Hashing) is to maximize the probability of a "collision" for similar items rather than to avoid collisions.
# * [MinHash](http://en.wikipedia.org/wiki/MinHash)
# * [Locality Sensitive Hashing](http://en.wikipedia.org/wiki/Locality_sensitive_hashing)
# * [Mining of Massive Datasets](http://infolab.stanford.edu/~ullman/mmds/ch3.pdf)
# 2. **Use those similarities as the basis of a Hierarchical Clustering Algorithm**
# Single-linkage clustering is one of several methods for agglomerative
# hierarchical clustering.
# - [Single Linkage Clustering](http://en.wikipedia.org/wiki/Single-linkage_clustering)
# - [Hierarchical Clustering](http://en.wikipedia.org/wiki/Hierarchical_clustering)
#
# Other popular clustering algorithms:
# - [DBSCAN](http://en.wikipedia.org/wiki/DBSCAN)
# - [OPTICS Algorithm](http://en.wikipedia.org/wiki/OPTICS_algorithm)
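# The single-linkage idea above can be sketched without any library: treat every
# similarity pair above a threshold as an edge and merge clusters with union-find.
# (A sketch only -- it assumes the similarities come as (similarity, i, j) tuples,
# the shape the lsh_sims output in this notebook takes.)

```python
def single_linkage_clusters(sims, num_items, threshold):
    """Union-find sketch of single-linkage clustering: any pair with
    similarity >= threshold links its two clusters together."""
    parent = list(range(num_items))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for sim, i, j in sims:
        if sim >= threshold:
            parent[find(i)] = find(j)

    # Gather items by their cluster representative
    groups = {}
    for x in range(num_items):
        groups.setdefault(find(x), []).append(x)
    return list(groups.values())
```

# This merges greedily on single links, so one strong pair anywhere between two
# groups is enough to join them -- the defining (and chaining-prone) property of
# single linkage.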
# In[15]:
# Even for small syslogs the number of similarity pairs to compute quickly
# becomes quite large, O(N**2): for 100k rows that's roughly 5 billion
# unique pairs. Using Banded MinHash drastically reduces the number of
# candidate pairs that we have to compute.
import data_hacking.lsh_sims as lsh_sims
# Note: The parameters here are set up for feeding the results into a Hierarchical
# Clustering algorithm, which needs as many similarities as you can get.
# In general you'd use parameters like num_hashes:20, lsh_bands:5, lsh_rows:4.
# Note: lsh_bands * lsh_rows *must* equal num_hashes
params = {'num_hashes':20, 'lsh_bands':20, 'lsh_rows':1, 'drop_duplicates':True}
lsh = lsh_sims.LSHSimilarities(dataframe['features'], mh_params=params)
sims = lsh.batch_compute_similarities(distance_metric='jaccard', threshold=.2)
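# Under the hood, banded MinHash presumably works like the classic trick below
# (an illustrative sketch only -- the names here are not the lsh_sims API):
# hash each row's feature set down to a short signature, then split the
# signature into bands, so only rows sharing at least one band key become
# candidate pairs.

```python
import random

def minhash_signature(items, num_hashes=20, seed=42):
    """MinHash signature: for each of num_hashes salted hash functions,
    keep the minimum hash value seen over the item set.
    Note: Python's hash() for strings is process-local (PYTHONHASHSEED),
    so signatures are only comparable within one run."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_hashes)]
    return [min(hash((salt, item)) for item in items) for salt in salts]

def lsh_band_keys(signature, lsh_bands=5, lsh_rows=4):
    """Split the signature into lsh_bands bands of lsh_rows values each;
    two rows are candidate pairs if any band key matches.
    lsh_bands * lsh_rows must equal the signature length."""
    assert lsh_bands * lsh_rows == len(signature)
    return [hash(tuple(signature[b * lsh_rows:(b + 1) * lsh_rows]))
            for b in range(lsh_bands)]
```

# More bands with fewer rows (like the 20x1 setup above) catches lower-similarity
# pairs; fewer bands with more rows only surfaces near-duplicates.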
# #### The LSH Sims Python class has two distance metrics
#
# 1) Jaccard Index: a set-based distance metric (overlap in the sets of elements)
#
# - [Jaccard Index](http://en.wikipedia.org/wiki/Jaccard_index)
#
# 2) Levenshtein Distance: based on the edit distance between the elements (so order matters)
#
# - [Levenshtein Distance](http://en.wikipedia.org/wiki/Levenshtein_distance)
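# The two metrics presumably behave like the minimal sketches below
# (illustrative implementations, not the library's actual code): Jaccard
# ignores order entirely, while Levenshtein counts element-level edits.

```python
def jaccard_sim(a, b):
    """Jaccard index: |A intersect B| / |A union B| over the element sets."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / float(len(sa | sb))

def levenshtein(a, b):
    """Classic dynamic-programming edit distance over element sequences."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def l_sim(a, b):
    """Levenshtein similarity: 1 - distance normalized by the longer length."""
    return 1.0 - levenshtein(a, b) / float(max(len(a), len(b)))
```

# With ['a','b','c','d'] vs ['a','b','d','c'] the sets are identical (Jaccard 1.0),
# but the swap costs two substitutions, so the Levenshtein similarity drops to 0.5.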
# In[16]:
# Let's look at the difference between Jaccard similarity and Levenshtein similarity.
# Here similarity is a normalized measure of inverse distance...
print('Jaccard Index (Sim): %f' % lsh.jaccard_sim(['a','b','c','d'], ['a','b','d','c']))
print('Levenshtein Distance: %f' % lsh.levenshtein(['a','b','c','d'], ['a','b','d','c']))
print('Levenshtein (Sim): %f' % lsh.l_sim(['a','b','c','d'], ['a','b','d','c']))
# In[17]:
# One more example for intuition (also note the lists don't have to be the same size)
print('Jaccard Index (Sim): %f' % lsh.jaccard_sim(['a','b','c'], ['a','b','c','x']))
print('Levenshtein Distance: %f' % lsh.levenshtein(['a','b','c'], ['a','b','c','x']))
print('Levenshtein (Sim): %f' % lsh.l_sim(['a','b','c'], ['a','b','c','x']))
# In[18]:
# Okay now that we have the similarities between all the rows in our syslog
# we can start to investigate the results.
sims.sort(reverse=True)
print('*** Top sims ***')
sims[:10]
#sims[-10:]
# In[20]:
print(dataframe.iloc[376]['features'])
# In[21]:
print(dataframe.iloc[1090]['features'])
# In[22]:
# The two feature sets should look quite similar (recall that this
# is just our syslog row split on white space and thrown into a list)
# So now for any row in our syslog we can see what rows are highly
# similar to that row.
query_item = ['Google', 'Chrome', 'Helper[11545]:', 'Process', 'unable', 'to', 'create', 'connection', 'because', 'the', 'sandbox', 'denied', 'the', 'right', 'to', 'lookup', 'com.apple.coreservices.launchservicesd', 'and', 'so', 'this', 'process', 'cannot', 'talk', 'to', 'launchservicesd.', ':', 'LSXPCClient.cp', '#426', '___ZN26LSClientToServerConnection21setupServerConnectionEiPK14__CFDictionary_block_invoke()', 'q=com.apple.main-thread']
lsh.top_N(query_item, dataframe['label'], 5)
# In[23]:
# Note the query object doesn't have all the original features
query_item = ['Google', 'Chrome', 'Process', 'unable', 'to', 'sandbox']
lsh.top_N(query_item, dataframe['label'], 5)
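# Conceptually, top_N is just "rank rows by similarity to the query"; a
# brute-force stand-in is sketched below (illustrative only -- the real class
# uses the LSH buckets so it never has to scan every row).

```python
def top_n(query, corpus, labels, n=5):
    """Brute-force top-N lookup: score every row by Jaccard similarity
    to the query and return the n best (score, label) pairs."""
    def jaccard(a, b):
        sa, sb = set(a), set(b)
        return len(sa & sb) / float(len(sa | sb))

    scored = sorted(((jaccard(query, row), label)
                     for row, label in zip(corpus, labels)), reverse=True)
    return scored[:n]
```

# This also shows why a partial query still works: Jaccard only needs the
# overlap between the query's features and each row's features.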
#
# > cd data_hacking/fun_with_syslog
# > python -m SimpleHTTPServer 9999 &
#
# Now point your browser at the html file [http://localhost:9999/syslog_vis.html](http://localhost:9999/syslog_vis.html)
#
# ### ToDo
# Really want to improve the interactive D3 based Hierarchical Tree Visualization [D3 Data Driven Documents](http://d3js.org)
#
# ### Conclusions
# We pulled some syslog data into a Pandas dataframe, made some plots, computed row similarities with 'Banded MinHash', and used single-linkage clustering to build an agglomerative hierarchical cluster. Lots of possibilities from here; you could use just the LSH data structure or the H-Cluster...
#
# - LogFile Viewer
#     - Click on a row, filter out everything that's similar
#     - Click on a row, color code all the other rows based on similarity
#     - Super fancy D3 zoom in/filter awesomeness...