Contributions:
Approach:
import pandas as pd
pd.__version__
'0.13.1'
import numpy as np
np.__version__
'1.8.0'
# Just some plotting defaults
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['font.size'] = 14.0
plt.rcParams['figure.figsize'] = 12.0, 8.0
# Create a BRO log file reader and pull from the logfile
import bro_log_reader
bro_log = bro_log_reader.BroLogReader()
headers = bro_log.read_log('data/http_headers.log')
# Nice, so let's look at some of the outputs by tossing them into a pandas dataframe
dataframe = pd.DataFrame(headers)
# What do we have
print 'Number of Rows: %d Columns:%d' % (dataframe.shape[0], dataframe.shape[1])
dataframe.head()
Number of Rows: 4576 Columns:4
| | header_events_json | origin | ts | useragent |
|---|---|---|---|---|
| 0 | [{"ACCEPT":"*\/*"},{"ACCEPT-LANGUAGE":"en-US"}... | client | 2012-03-30 17:32:57.382264 | Mozilla/4.0 (compatible; MSIE 7.0; Windows NT ... |
| 1 | [{"CACHE-CONTROL":"no-cache"},{"DATE":"Fri, 30... | server | 2012-03-30 17:32:57.382264 | NA |
| 2 | [{"ACCEPT":"*\/*"},{"ACCEPT-LANGUAGE":"en-US"}... | client | 2012-03-30 17:32:57.382264 | Mozilla/4.0 (compatible; MSIE 7.0; Windows NT ... |
| 3 | [{"CACHE-CONTROL":"no-cache"},{"DATE":"Fri, 30... | server | 2012-03-30 17:32:57.382264 | NA |
| 4 | [{"ACCEPT":"*\/*"},{"ACCEPT-LANGUAGE":"en-US"}... | client | 2012-03-30 17:32:57.382264 | Mozilla/4.0 (compatible; MSIE 7.0; Windows NT ... |
5 rows × 4 columns
- We have both client and server events; keep only the client events for this exercise
- Transform the complicated user-agent string into something more manageable (short-agent)
- Generate a 'feature vector' from the header keys
# Okay so we're only interested in client header requests for this exercise
dataframe = dataframe[dataframe['origin']=='client']
# Okay we also want to process the header events (that are in a JSON blob)
# into a header feature vector (just pulling 'keys' not values).
import json
def make_header_features(json_header_info_series):
    header_features = []
    for header_info in json_header_info_series:
        try:
            header_list = json.loads(unicode(header_info, 'utf8'))
            features = [item.keys()[0] for item in header_list]
        # There are some lines w/no features
        except Exception as e:
            features = ''
        header_features.append(features)
    return header_features
# Create a nicely formatted feature vector and a string representation
dataframe['feature_vector'] = make_header_features(dataframe['header_events_json'])
dataframe['features'] = dataframe['feature_vector'].map(lambda x: ':'.join(x))
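To make the key-extraction step concrete, here is a minimal standalone sketch written for Python 3 (the notebook itself runs on Python 2). The sample blob is made up for illustration and only mimics the shape of `header_events_json`; `header_keys` is a hypothetical helper name, not part of the notebook.

```python
import json

def header_keys(json_blob):
    """Return the header names, in order, from a JSON list of one-key objects."""
    try:
        header_list = json.loads(json_blob)
        return [list(item.keys())[0] for item in header_list]
    except (ValueError, TypeError, AttributeError, IndexError):
        # Malformed or empty entries yield no features
        return []

# Illustrative blob (assumed shape: a list of single-key objects)
sample = '[{"ACCEPT": "*/*"}, {"ACCEPT-LANGUAGE": "en-US"}, {"HOST": "example.com"}]'
keys = header_keys(sample)
feature_string = ':'.join(keys)
print(keys)
print(feature_string)
```

Only the header names and their order survive; the values are deliberately thrown away, which is what lets requests to different hosts share one feature string.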
The user-agent strings are verbose, with lots of information and variety based on agent versions, platforms, layout engines, linked DLLs, builds, etc. The logic behind short-agent strings is to capture the essence of what the agent IS. For instance, this user-agent string becomes this short-agent:
- User-agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.0.3705; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; InfoPath.3; SRC 2.7.1 E1; MS-RTC LM 8; BRI/2; BOIE8;ENUSMSNIP)
- short-agent: mozilla/4.0:msie:7.0:windows:nt:5.1:trident/4.0:n1.0:n1.1:n2.0:n3.0:n3.5
# Making shorter agent names based on information from
# http://msdn.microsoft.com/library/ms537503.aspx
import re
def replace_stuff(m):
    return 'n'+m.group(1) if '.net clr' in m.group() else ''
def short_agent_names(useragent_series, resolution=12):
    short_agent_list = []
    excludes = re.compile(r',|;|\(|\)|compatible|\.net clr ([0-9].[0-9])[^;]*;|khtml,|like')
    for useragent in useragent_series:
        processed_user_agent = re.sub(excludes, replace_stuff, useragent.lower()).strip()
        short_agent = ':'.join(processed_user_agent.split()[:resolution])
        short_agent_list.append(short_agent)
    return short_agent_list
# Generate shorter agent names
dataframe['short_agent'] = short_agent_names(dataframe['useragent'])
# Remove any 'na' agents
dataframe = dataframe.replace('na',np.nan)
dataframe = dataframe.dropna()
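As a quick check of the short-agent logic, the sketch below applies the same regular expression to the example user-agent string quoted earlier. It is a single-string variant written for Python 3; `short_agent` and `_replace` are illustrative names, not functions from the notebook.

```python
import re

# Same pattern as the notebook: strip punctuation and filler words, and
# capture the major.minor version of each ".net clr" entry
EXCLUDES = re.compile(r',|;|\(|\)|compatible|\.net clr ([0-9].[0-9])[^;]*;|khtml,|like')

def _replace(match):
    # Keep a compact 'n<version>' token for .NET CLR entries, drop everything else
    return 'n' + match.group(1) if '.net clr' in match.group() else ''

def short_agent(useragent, resolution=12):
    # Lowercase, scrub, then keep only the first `resolution` tokens
    tokens = EXCLUDES.sub(_replace, useragent.lower()).split()
    return ':'.join(tokens[:resolution])

ua = ('Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; '
      '.NET CLR 1.0.3705; .NET CLR 1.1.4322; .NET CLR 2.0.50727; '
      '.NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; InfoPath.3; SRC 2.7.1 E1; '
      'MS-RTC LM 8; BRI/2; BOIE8;ENUSMSNIP)')
result = short_agent(ua)
print(result)
```

The `resolution=12` cutoff is what drops `infopath.3` and everything after it. Note also that the `.net clr` alternative requires a trailing semicolon, so an entry that is last inside the parentheses is not collapsed; that likely explains short agents ending in `.net:clr:1.1.4322`.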
- We can use groupby on the dataframe to see the different header request keys for various agents
# Okay lets exercise some of the pandas dataframe functionality
dataframe['count'] = 1
agent_group_df = dataframe.groupby(['short_agent','features']).sum()
agent_group_df.head(20)
| short_agent | features | count |
|---|---|---|
| memeo:autobackup:/4.60.0.7923:/platform=1 | ACCEPT-LANGUAGE:ACCEPT:USER-AGENT:HOST:CONNECTION | 2 |
| microsoft-cryptoapi/6.1 | CACHE-CONTROL:CONNECTION:ACCEPT:IF-MODIFIED-SINCE:IF-NONE-MATCH:USER-AGENT:HOST | 2 |
| | CACHE-CONTROL:CONNECTION:ACCEPT:IF-MODIFIED-SINCE:USER-AGENT:HOST | 3 |
| | CONNECTION:ACCEPT:IF-MODIFIED-SINCE:IF-NONE-MATCH:USER-AGENT:HOST | 2 |
| | CONNECTION:ACCEPT:USER-AGENT:HOST | 3 |
| mozilla/4.0 | USER-AGENT:HOST | 3 |
| | USER-AGENT:HOST:IF-MODIFIED-SINCE:IF-NONE-MATCH:CONNECTION | 3 |
| mozilla/4.0:msie:6.0:windows:nt:5.1:sv1:.net:clr:1.1.4322 | ACCEPT:ACCEPT-LANGUAGE:XXXXXXXXXXXXXXX:USER-AGENT:HOST:CONNECTION | 1 |
| mozilla/4.0:msie:7.0:windows:nt:6.1:wow64:trident/5.0:slcc2:n2.0:n3.5:n3.0 | ACCEPT:ACCEPT-LANGUAGE:REFERER:X-FLASH-VERSION:ACCEPT-ENCODING:USER-AGENT:HOST:CONNECTION | 1 |
| | ACCEPT:ACCEPT-LANGUAGE:X-FLASH-VERSION:ACCEPT-ENCODING:USER-AGENT:HOST:CONNECTION:COOKIE | 4 |
| mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.2:slcc2:n2.0:n3.5 | ACCEPT:ACCEPT-ENCODING:USER-AGENT:HOST:CONNECTION | 2 |
| | ACCEPT:ACCEPT-LANGUAGE:REFERER:X-FLASH-VERSION:ACCEPT-ENCODING:USER-AGENT:HOST:CONNECTION | 15 |
| | ACCEPT:ACCEPT-LANGUAGE:REFERER:X-FLASH-VERSION:ACCEPT-ENCODING:USER-AGENT:HOST:CONNECTION:COOKIE | 12 |
| | ACCEPT:ACCEPT-LANGUAGE:REFERER:X-FLASH-VERSION:CONTENT-TYPE:CONTENT-LENGTH:ACCEPT-ENCODING:USER-AGENT:HOST:CONNECTION:CACHE-CONTROL:COOKIE | 1 |
| | ACCEPT:ACCEPT-LANGUAGE:REFERER:X-FLASH-VERSION:CONTENT-TYPE:X-VERIFY:CONTENT-LENGTH:ACCEPT-ENCODING:USER-AGENT:HOST:CONNECTION:CACHE-CONTROL | 1 |
| | ACCEPT:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION | 2 |
| | ACCEPT:ACCEPT-LANGUAGE:X-FLASH-VERSION:ACCEPT-ENCODING:USER-AGENT:HOST:CONNECTION | 5 |
| | ACCEPT:ACCEPT-LANGUAGE:X-FLASH-VERSION:ACCEPT-ENCODING:USER-AGENT:HOST:CONNECTION:COOKIE | 6 |
| | ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:COOKIE:CONNECTION:HOST | 1 |
| | ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION | 77 |
20 rows × 1 columns
# Now lets get the number of different header sequence permutations per agent
agent_counts = agent_group_df.count(level=0)
# Looks like MSIE agents have a higher number of permutations than all the other stuff
# So we 'groupby' a conditional statement (do you have msie in your agent string)
agent_types = agent_counts.groupby(by=lambda x: 'msie' if 'msie' in x else 'other')
agent_types.head(20)
| | short_agent | count |
|---|---|---|
| other | memeo:autobackup:/4.60.0.7923:/platform=1 | 1 |
| | microsoft-cryptoapi/6.1 | 4 |
| | mozilla/4.0 | 2 |
| msie | mozilla/4.0:msie:6.0:windows:nt:5.1:sv1:.net:clr:1.1.4322 | 1 |
| | mozilla/4.0:msie:7.0:windows:nt:6.1:wow64:trident/5.0:slcc2:n2.0:n3.5:n3.0 | 2 |
| | mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.2:slcc2:n2.0:n3.5 | 12 |
| | mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.4:slcc2:n2.0:n3.5 | 17 |
| | mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0 | 4 |
| | mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06 | 3 |
| other | mozilla/5.0:windows:nt:6.1:wow64:rv:12.0:gecko/20100101:firefox/12.0 | 2 |
| | mozilla/5.0:windows:nt:6.1:wow64:rv:14.0:gecko/20100101:firefox/14.0.1 | 5 |
| | mozilla/5.0:windows:nt:6.1:wow64:rv:16.0:gecko/20100101:firefox/16.0 | 25 |
| | mozilla/5.0:windows:u:windows:nt:6.1:en-us:rv:1.9.2.18:gecko/20110614:firefox/3.6.18 | 4 |
| | nis/19.9.0.9:mid/{grpgbpdjzi9qbdsno/32ukaorrc}:sid/fhkuuaaaaaa | 1 |
| | nis/19.9.0.9:mid/{grpgbpdjzi9qbdsno/32ukaorrc}:sid/fhkuuaaaaaa:lue/1.8.2.10:windows6.1sp1.0x64enu | 1 |
| | shasta | 1 |
| | shockwave:flash | 6 |
17 rows × 1 columns
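The two-step counting (distinct header sequences per agent, then an msie/other split) can also be followed without pandas. The sketch below is dependency-free Python 3 and uses only a handful of (short_agent, features) pairs copied from this notebook's grouped tables; `rows` is an illustrative sample, not the full dataset.

```python
from collections import defaultdict

# A few (short_agent, features) pairs copied from the notebook's grouped tables
rows = [
    ('mozilla/4.0', 'USER-AGENT:HOST'),
    ('mozilla/4.0', 'USER-AGENT:HOST'),
    ('mozilla/4.0', 'USER-AGENT:HOST:IF-MODIFIED-SINCE:IF-NONE-MATCH:CONNECTION'),
    ('microsoft-cryptoapi/6.1', 'CONNECTION:ACCEPT:USER-AGENT:HOST'),
    ('mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06',
     'ACCEPT:ACCEPT-ENCODING:USER-AGENT:HOST:CONNECTION'),
    ('mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06',
     'ACCEPT:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION'),
]

# Step 1: distinct header sequences per agent
sequences = defaultdict(set)
for agent, features in rows:
    sequences[agent].add(features)
permutations = {agent: len(seqs) for agent, seqs in sequences.items()}

# Step 2: bucket agents by whether 'msie' appears in the short-agent string
by_type = defaultdict(list)
for agent, count in permutations.items():
    by_type['msie' if 'msie' in agent else 'other'].append(count)

print(permutations)
print(dict(by_type))
```

Repeated requests with the same sequence collapse in step 1, so the numbers describe variety of header orderings rather than traffic volume.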
# Get some quick descriptive stats and plot it!
fig, ax = plt.subplots(subplot_kw={'axisbg':'#EEEEE5'})
ax.grid(color='lightgrey', linestyle='solid')
agent_types.boxplot(False)
# Now lets flip the group by around
features = dataframe[['short_agent','features','count']].groupby(['features','short_agent']).sum()
print features.shape
features.head(20)
(91, 1)
| features | short_agent | count |
|---|---|---|
| ACCEPT-LANGUAGE:ACCEPT:USER-AGENT:HOST:CONNECTION | memeo:autobackup:/4.60.0.7923:/platform=1 | 2 |
| ACCEPT:ACCEPT-ENCODING:USER-AGENT:HOST:CONNECTION | mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.2:slcc2:n2.0:n3.5 | 2 |
| | mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.4:slcc2:n2.0:n3.5 | 5 |
| | mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0 | 1 |
| | mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06 | 3 |
| ACCEPT:ACCEPT-ENCODING:USER-AGENT:HOST:CONNECTION:COOKIE | mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.4:slcc2:n2.0:n3.5 | 9 |
| ACCEPT:ACCEPT-ENCODING:USER-AGENT:IF-MODIFIED-SINCE:HOST:CONNECTION:COOKIE | mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.4:slcc2:n2.0:n3.5 | 8 |
| ACCEPT:ACCEPT-LANGUAGE:REFERER:X-FLASH-VERSION:ACCEPT-ENCODING:USER-AGENT:HOST:CONNECTION | mozilla/4.0:msie:7.0:windows:nt:6.1:wow64:trident/5.0:slcc2:n2.0:n3.5:n3.0 | 1 |
| | mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.2:slcc2:n2.0:n3.5 | 15 |
| ACCEPT:ACCEPT-LANGUAGE:REFERER:X-FLASH-VERSION:ACCEPT-ENCODING:USER-AGENT:HOST:CONNECTION:COOKIE | mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.2:slcc2:n2.0:n3.5 | 12 |
| ACCEPT:ACCEPT-LANGUAGE:REFERER:X-FLASH-VERSION:ACCEPT-ENCODING:USER-AGENT:IF-MODIFIED-SINCE:HOST:CONNECTION | mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.4:slcc2:n2.0:n3.5 | 1 |
| ACCEPT:ACCEPT-LANGUAGE:REFERER:X-FLASH-VERSION:CONTENT-TYPE:CONTENT-LENGTH:ACCEPT-ENCODING:USER-AGENT:HOST:CONNECTION:CACHE-CONTROL:COOKIE | mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.2:slcc2:n2.0:n3.5 | 1 |
| ACCEPT:ACCEPT-LANGUAGE:REFERER:X-FLASH-VERSION:CONTENT-TYPE:X-VERIFY:CONTENT-LENGTH:ACCEPT-ENCODING:USER-AGENT:HOST:CONNECTION:CACHE-CONTROL | mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.2:slcc2:n2.0:n3.5 | 1 |
| ACCEPT:ACCEPT-LANGUAGE:REFERER:X-SVN-REV:ACCEPT-ENCODING:USER-AGENT:HOST:CONNECTION:COOKIE | mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.4:slcc2:n2.0:n3.5 | 1 |
| ACCEPT:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION | mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.2:slcc2:n2.0:n3.5 | 2 |
| | mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06 | 2 |
| ACCEPT:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION:COOKIE | mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.4:slcc2:n2.0:n3.5 | 5 |
| | mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0 | 2 |
| ACCEPT:ACCEPT-LANGUAGE:X-FLASH-VERSION:ACCEPT-ENCODING:USER-AGENT:HOST:CONNECTION | mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.2:slcc2:n2.0:n3.5 | 5 |
| ACCEPT:ACCEPT-LANGUAGE:X-FLASH-VERSION:ACCEPT-ENCODING:USER-AGENT:HOST:CONNECTION:COOKIE | mozilla/4.0:msie:7.0:windows:nt:6.1:wow64:trident/5.0:slcc2:n2.0:n3.5:n3.0 | 4 |
20 rows × 1 columns
Other popular online clustering algorithms
- [DBSCAN](http://en.wikipedia.org/wiki/DBSCAN)
- [OPTICS Algorithms](http://en.wikipedia.org/wiki/OPTICS_algorithm)
Jaccard Index: a set based distance metric (overlaps in sets of elements)
Levenshtein Distance: based on the edit distance of the elements (so order matters).
In this case we are using Levenshtein not on individual letters in strings but tokens in sequences.
Examples:
a = ['ACCEPT', 'USER-AGENT', 'HOST', 'COOKIE']
b = ['ACCEPT', 'USER-AGENT', 'HOST']
c = ['ACCEPT', 'USER-AGENT', 'DORSEYS-MOM']
d = ['COOKIE', 'ACCEPT', 'USER-AGENT', 'HOST']
levenshtein(a,b) = 1.0
levenshtein(b,c) = 1.0
levenshtein(a,d) = 2.0
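For readers without the `data_hacking` package, the two metrics can be sketched in a few lines of plain Python 3. This is the textbook dynamic-programming edit distance applied to tokens, plus a plain set-based Jaccard; it is not the `lsh_sims` implementation, and it returns integers rather than floats.

```python
def jaccard(a, b):
    """Set-overlap similarity: |A & B| / |A | B| (order is ignored)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / float(len(union)) if union else 1.0

def levenshtein(a, b):
    """Edit distance between two token sequences (each edit costs 1)."""
    previous = list(range(len(b) + 1))
    for i, token_a in enumerate(a, 1):
        current = [i]
        for j, token_b in enumerate(b, 1):
            cost = 0 if token_a == token_b else 1
            current.append(min(previous[j] + 1,          # delete token_a
                               current[j - 1] + 1,       # insert token_b
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]

a = ['ACCEPT', 'USER-AGENT', 'HOST', 'COOKIE']
b = ['ACCEPT', 'USER-AGENT', 'HOST']
c = ['ACCEPT', 'USER-AGENT', 'DORSEYS-MOM']
d = ['COOKIE', 'ACCEPT', 'USER-AGENT', 'HOST']

print(levenshtein(a, b), levenshtein(b, c), levenshtein(a, d))
print(jaccard(a, d))
```

The pair `a`/`d` shows why order matters here: the two lists hold the same set of headers, so Jaccard treats them as identical, while the edit distance is 2 because `COOKIE` has to move from the end to the front.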
# Let's look at a few examples of Levenshtein distance
import data_hacking.lsh_sims as lsh_sims
lsh = lsh_sims.LSHSimilarities([])
a = ['ACCEPT', 'USER-AGENT', 'HOST', 'COOKIE']
b = ['ACCEPT', 'USER-AGENT', 'HOST']
c = ['ACCEPT', 'USER-AGENT', 'DORSEYS-MOM']
d = ['COOKIE', 'ACCEPT', 'USER-AGENT', 'HOST']
print 'Levenshtein: %s -- %s ( %f )' % (a, b, lsh.levenshtein(a, b))
print 'Levenshtein: %s -- %s ( %f )' % (b, c, lsh.levenshtein(b, c))
print 'Levenshtein: %s -- %s ( %f )' % (a, d, lsh.levenshtein(a, d))
Levenshtein: ['ACCEPT', 'USER-AGENT', 'HOST', 'COOKIE'] -- ['ACCEPT', 'USER-AGENT', 'HOST'] ( 1.000000 )
Levenshtein: ['ACCEPT', 'USER-AGENT', 'HOST'] -- ['ACCEPT', 'USER-AGENT', 'DORSEYS-MOM'] ( 1.000000 )
Levenshtein: ['ACCEPT', 'USER-AGENT', 'HOST', 'COOKIE'] -- ['COOKIE', 'ACCEPT', 'USER-AGENT', 'HOST'] ( 2.000000 )
# Lets compute levenshtein distance between the header sequences for each agent
params = {'num_hashes':20, 'lsh_bands':20, 'lsh_rows':1, 'drop_duplicates':True}
agent_distances = {}
agent_groups = dataframe.groupby(['short_agent'])
for name, group in agent_groups:
    lsh = lsh_sims.LSHSimilarities(group['feature_vector'], mh_params=params)
    distances = lsh.batch_compute_similarities(distance_metric='levenshtein_tapered', threshold=10)
    distances.sort()
    agent_distances[name] = distances
# For one agent show the top 5 closest (levenshtein) header sequences
agent = 'mozilla/4.0:msie:7.0:windows:nt:6.1:wow64:trident/5.0:slcc2:n2.0:n3.5:n3.0'
distances = agent_distances[agent]
print '\nAgent: %s' % agent
print 'Distances:'
features = agent_groups.get_group(agent)['feature_vector']
for distance in distances[:5]:
    print '\n%s\n%s' % (features.iloc[distance[1]], features.iloc[distance[2]])

Agent: mozilla/4.0:msie:7.0:windows:nt:6.1:wow64:trident/5.0:slcc2:n2.0:n3.5:n3.0
Distances:

[u'ACCEPT', u'ACCEPT-LANGUAGE', u'X-FLASH-VERSION', u'ACCEPT-ENCODING', u'USER-AGENT', u'HOST', u'CONNECTION', u'COOKIE']
[u'ACCEPT', u'ACCEPT-LANGUAGE', u'REFERER', u'X-FLASH-VERSION', u'ACCEPT-ENCODING', u'USER-AGENT', u'HOST', u'CONNECTION']
We're using a bottom-up method (the image is flipped :). You simply sort the similarities and start building your tree from the bottom: if B and C are the most similar you link them, then D/E, and so on until you complete the tree. The devil is definitely in the details of the implementation, so luckily we have a Python class that does it for us.
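The bottom-up idea can be sketched as: sort the pairwise similarities, then link the most similar pair whose members are not already in the same cluster. The code below is a generic single-linkage sketch in Python 3 using a small union-find; it is not the `hcluster.HCluster` implementation, and the similarity values are made up for illustration.

```python
def bottom_up_merges(similarities):
    """similarities: list of (sim, a, b). Returns the links made, most similar first."""
    parent = {}

    def find(x):
        # Follow parent pointers to the cluster representative
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps chains short
            x = parent[x]
        return x

    merges = []
    for sim, a, b in sorted(similarities, reverse=True):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:          # skip pairs already in the same cluster
            parent[root_b] = root_a
            merges.append((a, b, sim))
    return merges

# Made-up similarities between five items
sims = [(0.9, 'B', 'C'), (0.8, 'D', 'E'), (0.5, 'A', 'B'),
        (0.4, 'C', 'A'), (0.3, 'C', 'D')]
merges = bottom_up_merges(sims)
for a, b, sim in merges:
    print('link %s-%s at similarity %.1f' % (a, b, sim))
```

Here B/C link first, then D/E, then A joins B/C; the C/A pair is skipped because both are already in one cluster, and the final link at 0.3 joins the two remaining clusters into a single tree.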
# mpld3 is a cool python module for using D3 as a back end to matplotlib
# go to https://github.com/jakevdp/mpld3 and behold the awesome.
# Note we're commenting this out so that nbviewer works correctly,
# but feel free to uncomment if you download the notebook and play
# with it yourself.
'''
try:
    import mpld3
    mpld3.enable_notebook(d3_url="/files/d3/d3.v3.js")
except ImportError:
    print 'Info: Could not load mpld3 module. No worries stuff will still work fine...'
'''
# Compute a hierarchical clustering from the header similarities for each agent
import data_hacking.hcluster as hcluster
agent_h_graphs = {}
groups = dict(list(agent_groups))
for name, group in groups.iteritems():
    lsh = lsh_sims.LSHSimilarities(group['feature_vector'], mh_params=params)
    distances = lsh.batch_compute_similarities(distance_metric='l_tapered_sim', threshold=0)
    h_clustering = hcluster.HCluster(group['feature_vector'])
    h_clustering.set_sim_method(lsh.l_sim)
    h_graph, root = h_clustering.sims_to_hcluster(distances, agg_sim=.2)
    agent_h_graphs[name] = {'graph':h_graph, 'root':root}
<<<< WTF Error: Looks like an empty graph >>>>>
Graph 0 nodes 0 edges
<<<< WTF Error: Looks like an empty graph >>>>>
Graph 0 nodes 0 edges
<<<< WTF Error: Looks like an empty graph >>>>>
Graph 0 nodes 0 edges
<<<< WTF Error: Looks like an empty graph >>>>>
Graph 0 nodes 0 edges
<<<< WTF Error: Looks like an empty graph >>>>>
Graph 0 nodes 0 edges
# Plot a couple of agents
import networkx as nx
def plot_h_tree(graph, layout='neato'):
    pos = nx.graphviz_layout(graph, prog=layout)
    labels = {node[0]:node[1]['label'] for node in graph.nodes(data=True)}
    nx.draw_networkx(graph, pos, node_size=800, alpha=.7, node_color=[.6,.4,.6], labels=labels)
    edge_labels=dict([((u,v,),str(d['weight'])[:4]) for u,v,d in graph.edges(data=True)])
    nx.draw_networkx_edge_labels(graph,pos,edge_labels=edge_labels)
# MSIE 8
msie_8 = 'mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.4:slcc2:n2.0:n3.5'
plot_h_tree(agent_h_graphs[msie_8]['graph'])
msie_9 = 'mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0'
plot_h_tree(agent_h_graphs[msie_9]['graph'])
flash = 'shockwave:flash'
plot_h_tree(agent_h_graphs[flash]['graph'])
import collections
def subtree_labels(g, root):
    labels = nx.get_node_attributes(g,'label')
    sub_labels = collections.defaultdict(list)
    leaves = [k for k,v in g.out_degree().iteritems() if v == 0]
    for leaf in leaves:
        sub_labels[g.predecessors(leaf)[0]].append(labels[leaf])
    return sub_labels
import pprint
g = agent_h_graphs[good_test_agent]['graph']
root = agent_h_graphs[good_test_agent]['root']
# Hmph, well just for fun we made a RE Morpher class; you simply keep adding
# strings to it and it figures out the RE that matches all the strings.
# It's very hack-tastic so a better way to auto-generate regular expressions
# will be a fun task for some contributor :)
import re
import re_morpher
# Lets experiment a bit
a = [u'HOST', u'CONNECTION', u'ACCEPT', u'USER-AGENT', u'ACCEPT-ENCODING']
b = [u'HOST', u'CONNECTION', u'AUTHORIZATION', u'ACCEPT', u'USER-AGENT', u'ACCEPT-ENCODING']
b = [u'HOST', u'CONNECTION', u'AUTHORIZATION', u'ACCEPT', u'USER-AGENT', u'DORSEYS-MOM']
my_re_morpher = re_morpher.REMorpher()
my_re_morpher.add_sequence(a)
print my_re_morpher.get_re_pattern()
my_re_morpher.add_sequence(b)
print my_re_morpher.get_re_pattern()
^HOSTCONNECTIONACCEPTUSER-AGENTACCEPT-ENCODING$
^HOSTCONNECTION(AUTHORIZATION)?ACCEPTUSER-AGENT(DORSEYS-MOM)?(ACCEPT-ENCODING)?$
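To see what such a pattern accepts, it helps to run it with the standard `re` module. The sketch below (Python 3) copies the second pattern printed above and matches it against header lists joined with no separator; the `reordered` and `never_seen` lists are made up for illustration.

```python
import re

# Pattern copied from the REMorpher output above
pattern = re.compile(
    r'^HOSTCONNECTION(AUTHORIZATION)?ACCEPTUSER-AGENT(DORSEYS-MOM)?(ACCEPT-ENCODING)?$')

def matches(tokens):
    # Header names are concatenated with no separator before matching
    return pattern.match(''.join(tokens)) is not None

a = ['HOST', 'CONNECTION', 'ACCEPT', 'USER-AGENT', 'ACCEPT-ENCODING']
b = ['HOST', 'CONNECTION', 'AUTHORIZATION', 'ACCEPT', 'USER-AGENT', 'DORSEYS-MOM']
reordered = ['CONNECTION', 'HOST', 'ACCEPT', 'USER-AGENT', 'ACCEPT-ENCODING']
never_seen = ['HOST', 'CONNECTION', 'AUTHORIZATION', 'ACCEPT', 'USER-AGENT',
              'DORSEYS-MOM', 'ACCEPT-ENCODING']

for name, seq in [('a', a), ('b', b), ('reordered', reordered), ('never_seen', never_seen)]:
    print('%s: %s' % (name, matches(seq)))
```

Both training sequences match and the reordered one does not, as intended. The `never_seen` list also matches even though it was never added: making tokens independently optional generalizes beyond the training set, which is worth keeping in mind when reading the longer generated patterns.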
# Alright now try it out on our agents header sequences
import collections
agent_res = collections.defaultdict(list)
for agent, graph_info in agent_h_graphs.iteritems():
    #for agent, graph_info in zip(good_test_agent,agent_h_graphs[good_test_agent]):
    graph = graph_info['graph']
    root = graph_info['root']
    if graph:
        # Get the re patterns for this agent
        for sub_key, feature_list in subtree_labels(graph,root).iteritems():
            for f in feature_list:
                my_re_morpher.add_sequence(f.split(':'))
            # Append to my re list
            agent_res[agent].append(my_re_morpher.get_re_pattern())
            my_re_morpher.reset_re()
# Print out the agent sets just to get an idea
for agent, graph_info in agent_h_graphs.iteritems():
    print '\n%s' % agent
    for my_re in agent_res[agent]:
        print '\t%s' % my_re
mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.4:slcc2:n2.0:n3.5
    ^(ACCEPT)?(ACCEPT-LANGUAGE)?(REFERER)?(X-SVN-REV)?(ACCEPT)?(ACCEPT-ENCODING)?(ACCEPT-LANGUAGE)?(USER-AGENT)?(X-FLASH-VERSION)?(ACCEPT-ENCODING)?(ACCEPT-LANGUAGE)?(USER-AGENT)?(CONTENT-TYPE)?(ACCEPT-ENCODING)?(IF-MODIFIED-SINCE)?(IF-NONE-MATCH)?(HOST)?(CONTENT-LENGTH)?(CONNECTION)?(COOKIE)?(IF-NONE-MATCH)?$
    ^ACCEPTACCEPT-ENCODINGUSER-AGENT(IF-MODIFIED-SINCE)?HOSTCONNECTION(COOKIE)?$
    ^X-REQUESTED-WITHACCEPT-LANGUAGEREFERERACCEPTCONTENT-TYPEACCEPT-ENCODINGUSER-AGENTIF-MODIFIED-SINCEIF-NONE-MATCHHOSTCONNECTIONCOOKIE$

mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06
    ^ACCEPT(ACCEPT-ENCODING)?(ACCEPT-LANGUAGE)?USER-AGENT(ACCEPT-ENCODING)?HOSTCONNECTION$

mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.2:slcc2:n2.0:n3.5
    ^ACCEPT(REFERER)?ACCEPT-LANGUAGEUSER-AGENTACCEPT-ENCODING(COOKIE)?(CONNECTION)?HOST(CONNECTION)?(COOKIE)?$
    ^(X-REQUESTED-WITH)?(ACCEPT)?(ACCEPT-LANGUAGE)?(REFERER)?(ACCEPT)?(CONTENT-LENGTH)?ACCEPT-ENCODINGUSER-AGENTHOSTCONNECTION(COOKIE)?(CACHE-CONTROL)?$
    ^ACCEPTACCEPT-LANGUAGE(REFERER)?X-FLASH-VERSION(CONTENT-TYPE)?(CONTENT-LENGTH)?ACCEPT-ENCODINGUSER-AGENTHOSTCONNECTION(CACHE-CONTROL)?(COOKIE)?$

mozilla/5.0:windows:nt:6.1:wow64:rv:14.0:gecko/20100101:firefox/14.0.1
    ^HOSTUSER-AGENTACCEPTACCEPT-LANGUAGEACCEPT-ENCODINGCONNECTION(X-REQUESTED-WITH)?(X-YAHOO-MSGR-USER-AGENT)?(REFERER)?(COOKIE)?(IF-MODIFIED-SINCE)?(IF-NONE-MATCH)?$

mozilla/4.0:msie:7.0:windows:nt:6.1:wow64:trident/5.0:slcc2:n2.0:n3.5:n3.0
    ^ACCEPTACCEPT-LANGUAGE(REFERER)?X-FLASH-VERSIONACCEPT-ENCODINGUSER-AGENTHOSTCONNECTION(COOKIE)?$

mozilla/4.0:msie:6.0:windows:nt:5.1:sv1:.net:clr:1.1.4322

shasta

mozilla/4.0
    ^USER-AGENTHOST(IF-MODIFIED-SINCE)?(IF-NONE-MATCH)?(CONNECTION)?$

mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0
    ^ACCEPT(REFERER)?ACCEPT-LANGUAGEUSER-AGENTACCEPT-ENCODINGHOSTCONNECTION(COOKIE)?$
    ^ACCEPTACCEPT-ENCODINGUSER-AGENTHOSTCONNECTION$

mozilla/5.0:windows:nt:6.1:wow64:rv:12.0:gecko/20100101:firefox/12.0
    ^HOSTUSER-AGENTACCEPTACCEPT-LANGUAGEACCEPT-ENCODINGCONNECTION(REFERER)?$

memeo:autobackup:/4.60.0.7923:/platform=1

mozilla/5.0:windows:nt:6.1:wow64:rv:16.0:gecko/20100101:firefox/16.0
    ^HOSTUSER-AGENTACCEPTACCEPT-LANGUAGEACCEPT-ENCODINGCONNECTION(REFERER)?(ORIGIN)?(COOKIE)?(IF-MODIFIED-SINCE)?(CONTENT-TYPE)?(X-REQUESTED-WITH)?(REFERER)?(CONTENT-LENGTH)?(COOKIE)?(PRAGMA)?(IF-NONE-MATCH)?(CACHE-CONTROL)?$

nis/19.9.0.9:mid/{grpgbpdjzi9qbdsno/32ukaorrc}:sid/fhkuuaaaaaa:lue/1.8.2.10:windows6.1sp1.0x64enu

shockwave:flash
    ^CONTENT-TYPEUSER-AGENTHOSTCONTENT-LENGTHCONNECTIONCACHE-CONTROL(COOKIE)?$
    ^(REFERER)?X-FLASH-VERSIONUSER-AGENTHOST(CACHE-CONTROL)?(CONNECTION)?$

nis/19.9.0.9:mid/{grpgbpdjzi9qbdsno/32ukaorrc}:sid/fhkuuaaaaaa

microsoft-cryptoapi/6.1
    ^(CACHE-CONTROL)?CONNECTIONACCEPT(IF-MODIFIED-SINCE)?(IF-NONE-MATCH)?USER-AGENTHOST$

mozilla/5.0:windows:u:windows:nt:6.1:en-us:rv:1.9.2.18:gecko/20110614:firefox/3.6.18
    ^HOSTUSER-AGENTACCEPTACCEPT-LANGUAGEACCEPT-ENCODINGACCEPT-CHARSETKEEP-ALIVECONNECTION(REFERER)?(COOKIE)?(IF-MODIFIED-SINCE)?$
- Make sure the regular expressions match all the agents/features in the training set.
- Test the expressions against data/PCAPs that have known bad/sneaky agents.
Well, by definition the regular expressions are supposed to match the training set, so the first evaluation is more of a sanity check. For the second test we find 'matching' agents in the PCAP file and test their header sequences.
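Because the same request shape tends to repeat many times, it can be easier to count distinct alerts than to print one line per request. The sketch below (Python 3) is one way to do that with `collections.Counter`; `agent_patterns` holds two patterns copied from the generated output above, while `unmatched_requests` and the sample `requests` list are illustrative and not part of the notebook.

```python
import re
from collections import Counter

NP06 = 'mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06'

# Patterns copied from the generated output above for two agents
agent_patterns = {
    'mozilla/4.0': [r'^USER-AGENTHOST(IF-MODIFIED-SINCE)?(IF-NONE-MATCH)?(CONNECTION)?$'],
    NP06: [r'^ACCEPT(ACCEPT-ENCODING)?(ACCEPT-LANGUAGE)?USER-AGENT(ACCEPT-ENCODING)?HOSTCONNECTION$'],
}

def unmatched_requests(requests):
    """Count (agent, features) pairs that match none of the agent's patterns."""
    compiled = {agent: [re.compile(p) for p in patterns]
                for agent, patterns in agent_patterns.items()}
    alerts = Counter()
    for agent, features in requests:
        sequence = features.replace(':', '')
        # An agent with no patterns at all is also counted as unmatched here
        if not any(p.match(sequence) for p in compiled.get(agent, [])):
            alerts[(agent, features)] += 1
    return alerts

# Illustrative requests: one matching per agent, one repeated mismatch
requests = [
    ('mozilla/4.0', 'USER-AGENT:HOST'),
    (NP06, 'ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION'),
    (NP06, 'ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION'),
    (NP06, 'ACCEPT:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION'),
]
alerts = unmatched_requests(requests)
for (agent, features), count in alerts.most_common():
    print('%d x no match: %s %s' % (count, agent, features))
```

The `REFERER` sequence fails for this agent because its single pattern has no `(REFERER)?` group, so two identical requests collapse into one alert with a count of 2.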
# An evaluation method for our auto-magically-generated RE expressions
import re
def evaluate_agents(agent_list, feature_list):
    print 'Evaluating %d requests' % len(agent_list)
    for agent, features in zip(agent_list, feature_list):
        my_res = [re.compile(my_re) for my_re in agent_res[agent]]
        match = any([my_re.match(features.replace(':','')) for my_re in my_res])
        if not match:
            print '\nAlert: No Match on Agent(%s) Sequence(%s)' % (agent,features)
# Evaluation against the training set (there should be no alerts)
t_agents = [(len(agent_res[agent])>0) for agent in dataframe['short_agent']] # Degenerate case where no H-Tree was built
training_agents = dataframe[t_agents]
evaluate_agents(training_agents['short_agent'], training_agents['features'])
Evaluating 2268 requests Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION) Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION) Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION) Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION) Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION) Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION) Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION) Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION) Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION) Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION) Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION) Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) 
Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION) Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION) Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION) Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION) Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION) Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION) Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION) Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION) Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION) Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION) Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION) Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION) Alert: No Match on 
Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION) Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION) Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION) Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION) Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION) Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION) Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION) Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION) Alert: No Match on Agent(mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.4:slcc2:n2.0:n3.5) Sequence(X-REQUESTED-WITH:ACCEPT-LANGUAGE:REFERER:ACCEPT:ACCEPT-ENCODING:USER-AGENT:HOST:CONNECTION:COOKIE) Alert: No Match on Agent(mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.4:slcc2:n2.0:n3.5) Sequence(X-REQUESTED-WITH:ACCEPT-LANGUAGE:REFERER:ACCEPT:ACCEPT-ENCODING:USER-AGENT:HOST:CONNECTION:COOKIE) Alert: No Match on Agent(mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.4:slcc2:n2.0:n3.5) Sequence(X-REQUESTED-WITH:ACCEPT-LANGUAGE:REFERER:ACCEPT:ACCEPT-ENCODING:USER-AGENT:HOST:CONNECTION:COOKIE) Alert: No Match on 
Agent(mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.4:slcc2:n2.0:n3.5) Sequence(X-REQUESTED-WITH:ACCEPT-LANGUAGE:REFERER:ACCEPT:ACCEPT-ENCODING:USER-AGENT:HOST:CONNECTION:COOKIE) Alert: No Match on Agent(mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.4:slcc2:n2.0:n3.5) Sequence(X-REQUESTED-WITH:ACCEPT-LANGUAGE:REFERER:ACCEPT:ACCEPT-ENCODING:USER-AGENT:HOST:CONNECTION:COOKIE) Alert: No Match on Agent(mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.4:slcc2:n2.0:n3.5) Sequence(X-REQUESTED-WITH:ACCEPT-LANGUAGE:REFERER:ACCEPT:ACCEPT-ENCODING:USER-AGENT:HOST:CONNECTION:COOKIE) Alert: No Match on Agent(mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.4:slcc2:n2.0:n3.5) Sequence(X-REQUESTED-WITH:ACCEPT-LANGUAGE:REFERER:ACCEPT:ACCEPT-ENCODING:USER-AGENT:HOST:CONNECTION:COOKIE) Alert: No Match on Agent(mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.4:slcc2:n2.0:n3.5) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION:COOKIE:IF-MODIFIED-SINCE:IF-NONE-MATCH) Alert: No Match on Agent(mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.4:slcc2:n2.0:n3.5) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION:COOKIE:IF-MODIFIED-SINCE:IF-NONE-MATCH) Alert: No Match on Agent(mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.4:slcc2:n2.0:n3.5) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION:COOKIE:IF-MODIFIED-SINCE:IF-NONE-MATCH) Alert: No Match on Agent(mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.4:slcc2:n2.0:n3.5) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION:COOKIE:IF-MODIFIED-SINCE:IF-NONE-MATCH) Alert: No Match on Agent(mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.4:slcc2:n2.0:n3.5) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:COOKIE:CONNECTION:HOST) Alert: No Match on 
Agent(mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.4:slcc2:n2.0:n3.5) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:CONTENT-TYPE:ACCEPT-ENCODING:HOST:CONTENT-LENGTH:CONNECTION:CACHE-CONTROL) Alert: No Match on Agent(mozilla/5.0:windows:nt:6.1:wow64:rv:14.0:gecko/20100101:firefox/14.0.1) Sequence(HOST:USER-AGENT:ACCEPT:ACCEPT-LANGUAGE:ACCEPT-ENCODING:CONNECTION:CONTENT-TYPE:X-YAHOO-MSGR-USER-AGENT:REFERER:COOKIE) Alert: No Match on Agent(mozilla/5.0:windows:nt:6.1:wow64:rv:16.0:gecko/20100101:firefox/16.0) Sequence(HOST:USER-AGENT:ACCEPT:ACCEPT-LANGUAGE:ACCEPT-ENCODING:CONNECTION:CONTENT-LENGTH:CONTENT-TYPE) Alert: No Match on Agent(mozilla/5.0:windows:nt:6.1:wow64:rv:16.0:gecko/20100101:firefox/16.0) Sequence(HOST:USER-AGENT:ACCEPT:ACCEPT-LANGUAGE:ACCEPT-ENCODING:CONNECTION:CONTENT-LENGTH:CONTENT-TYPE) Alert: No Match on Agent(mozilla/5.0:windows:nt:6.1:wow64:rv:16.0:gecko/20100101:firefox/16.0) Sequence(HOST:USER-AGENT:ACCEPT:ACCEPT-LANGUAGE:ACCEPT-ENCODING:CONNECTION:CONTENT-LENGTH:CONTENT-TYPE) Alert: No Match on Agent(mozilla/5.0:windows:nt:6.1:wow64:rv:16.0:gecko/20100101:firefox/16.0) Sequence(HOST:USER-AGENT:ACCEPT:ACCEPT-LANGUAGE:ACCEPT-ENCODING:CONNECTION:CONTENT-LENGTH:CONTENT-TYPE) Alert: No Match on Agent(mozilla/5.0:windows:nt:6.1:wow64:rv:16.0:gecko/20100101:firefox/16.0) Sequence(HOST:USER-AGENT:ACCEPT:ACCEPT-LANGUAGE:ACCEPT-ENCODING:CONNECTION:REFERER:ORIGIN:RANGE:IF-RANGE) Alert: No Match on Agent(mozilla/5.0:windows:nt:6.1:wow64:rv:16.0:gecko/20100101:firefox/16.0) Sequence(HOST:USER-AGENT:ACCEPT:ACCEPT-LANGUAGE:ACCEPT-ENCODING:CONNECTION:REFERER:ORIGIN:RANGE:IF-RANGE) Alert: No Match on Agent(mozilla/5.0:windows:nt:6.1:wow64:rv:16.0:gecko/20100101:firefox/16.0) Sequence(HOST:USER-AGENT:ACCEPT:ACCEPT-LANGUAGE:ACCEPT-ENCODING:CONNECTION:CONTENT-LENGTH:CONTENT-TYPE) Alert: No Match on Agent(mozilla/4.0:msie:8.0:windows:nt:6.1:wow64:trident/4.0:gtb7.2:slcc2:n2.0:n3.5) 
Sequence(ACCEPT:ACCEPT-LANGUAGE:REFERER:X-FLASH-VERSION:CONTENT-TYPE:X-VERIFY:CONTENT-LENGTH:ACCEPT-ENCODING:USER-AGENT:HOST:CONNECTION:CACHE-CONTROL) Alert: No Match on Agent(mozilla/5.0:windows:nt:6.1:wow64:rv:16.0:gecko/20100101:firefox/16.0) Sequence(HOST:USER-AGENT:ACCEPT:ACCEPT-LANGUAGE:ACCEPT-ENCODING:CONNECTION:CONTENT-LENGTH:CONTENT-TYPE) Alert: No Match on Agent(mozilla/5.0:windows:nt:6.1:wow64:rv:16.0:gecko/20100101:firefox/16.0) Sequence(HOST:USER-AGENT:ACCEPT:ACCEPT-LANGUAGE:ACCEPT-ENCODING:CONNECTION:CONTENT-LENGTH:CONTENT-TYPE) Alert: No Match on Agent(mozilla/5.0:windows:nt:6.1:wow64:rv:16.0:gecko/20100101:firefox/16.0) Sequence(HOST:USER-AGENT:ACCEPT:ACCEPT-LANGUAGE:ACCEPT-ENCODING:CONNECTION:CONTENT-LENGTH:CONTENT-TYPE) Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION) Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION) Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION) Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION) Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION) Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION) Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION) Alert: No Match on 
Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION) Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION) Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION) Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION) Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION) Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION) Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION) Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION) Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION) Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION) Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION) Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) 
Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION) Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION) Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION) Alert: No Match on Agent(mozilla/5.0:msie:9.0:windows:nt:6.1:wow64:trident/5.0:np06) Sequence(ACCEPT:REFERER:ACCEPT-LANGUAGE:USER-AGENT:ACCEPT-ENCODING:HOST:CONNECTION) Alert: No Match on Agent(mozilla/5.0:windows:nt:6.1:wow64:rv:16.0:gecko/20100101:firefox/16.0) Sequence(HOST:USER-AGENT:ACCEPT:ACCEPT-LANGUAGE:ACCEPT-ENCODING:CONNECTION:CONTENT-LENGTH:CONTENT-TYPE) Alert: No Match on Agent(mozilla/5.0:windows:nt:6.1:wow64:rv:16.0:gecko/20100101:firefox/16.0) Sequence(HOST:USER-AGENT:ACCEPT:ACCEPT-LANGUAGE:ACCEPT-ENCODING:CONNECTION:CONTENT-LENGTH:CONTENT-TYPE) Alert: No Match on Agent(mozilla/5.0:windows:nt:6.1:wow64:rv:16.0:gecko/20100101:firefox/16.0) Sequence(HOST:USER-AGENT:ACCEPT:ACCEPT-LANGUAGE:ACCEPT-ENCODING:CONNECTION:CONTENT-LENGTH:CONTENT-TYPE) Alert: No Match on Agent(mozilla/5.0:windows:nt:6.1:wow64:rv:16.0:gecko/20100101:firefox/16.0) Sequence(HOST:USER-AGENT:ACCEPT:ACCEPT-LANGUAGE:ACCEPT-ENCODING:CONNECTION:CONTENT-LENGTH:CONTENT-TYPE) Alert: No Match on Agent(mozilla/5.0:windows:nt:6.1:wow64:rv:16.0:gecko/20100101:firefox/16.0) Sequence(HOST:USER-AGENT:ACCEPT:ACCEPT-LANGUAGE:ACCEPT-ENCODING:CONNECTION:CONTENT-LENGTH:CONTENT-TYPE) Alert: No Match on Agent(mozilla/5.0:windows:nt:6.1:wow64:rv:16.0:gecko/20100101:firefox/16.0) Sequence(HOST:USER-AGENT:ACCEPT:ACCEPT-LANGUAGE:ACCEPT-ENCODING:CONNECTION:COOKIE:REFERER:CONTENT-TYPE:CONTENT-LENGTH)
# Read in headers extracted from the Contagio Dump PCAP samples for evaluation testing
bro_log = bro_log_reader.BroLogReader()
contagio_headers = bro_log.read_log('data/contagio.headers.txt')
contagio_df = pd.DataFrame(contagio_headers)
contagio_df.head()
| | header_events_json | origin | ts | useragent |
|---|---|---|---|---|
| 0 | [{"ACCEPT":"application\/octet-stream"},{"CONT... | client | 2013-08-10 23:26:48.150406 | Alina v5.3 |
| 1 | [{"DATE":"Sun, 11 Aug 2013 05:25:27 GMT"},{"SE... | server | 2013-08-10 23:26:48.150406 | NA |
| 2 | [{"ACCEPT":"application\/octet-stream"},{"CONT... | client | 2013-08-10 23:28:40.198085 | Alina v5.3 |
| 3 | [{"DATE":"Sun, 11 Aug 2013 05:27:19 GMT"},{"SE... | server | 2013-08-10 23:28:40.198085 | NA |
| 4 | [{"ACCEPT":"application\/octet-stream"},{"CONT... | client | 2013-08-10 23:32:41.074339 | Alina v5.3 |
5 rows × 4 columns
# A bit of processing on the raw data to prepare it for evaluation
contagio_df = contagio_df[contagio_df['origin']=='client']
contagio_df['short_agent'] = short_agent_names(contagio_df['useragent'])
contagio_df['feature_vector'] = make_header_features(contagio_df['header_events_json'])
contagio_df['features'] = contagio_df['feature_vector'].map(lambda x: ':'.join(x))
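To make the last two steps concrete, here is a minimal sketch of how one JSON header blob could become a colon-joined feature string. The helper name `header_keys` and the sample blob are assumptions made for illustration; the notebook's own `make_header_features` is defined earlier and may differ in detail.

```python
import json

def header_keys(header_events_json):
    """Return the header names, in order, from a JSON list of one-key objects."""
    events = json.loads(header_events_json)
    # Each event is a dict holding a single header name -> value pair;
    # only the names (keys) are kept, the values are discarded.
    return [key for event in events for key in event]

# Hypothetical sample in the same shape as the header_events_json column
sample = r'[{"ACCEPT":"application\/octet-stream"},{"CONTENT-TYPE":"application\/octet-stream"},{"HOST":"example.com"}]'

feature_vector = header_keys(sample)
features = ':'.join(feature_vector)
print(features)  # ACCEPT:CONTENT-TYPE:HOST
```

Keeping only the keys, in the order they were sent, is what lets the header sequence act as a fingerprint that is independent of the header values.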
# Let's look at the overlap of agents between our training set and the contagio set
trained_agents = set(dataframe['short_agent'].unique())
evil_agents = set(contagio_df['short_agent'].unique())
evil_agents = evil_agents.intersection(trained_agents)
contagio_subset = contagio_df[contagio_df['short_agent'].isin(evil_agents)]
evil_agents
# Well, only a few agents overlap with our training data, but that's okay:
# it is still a reasonable set of header requests to test against.
{'microsoft-cryptoapi/6.1', 'mozilla/4.0', 'mozilla/4.0:msie:6.0:windows:nt:5.1:sv1:.net:clr:1.1.4322'}
# Let's see how the Contagio CrimeWare PCAP requests measure up against our dataset of computed regexes
evaluate_agents(contagio_subset['short_agent'],contagio_subset['features'])
Evaluating 33 requests
Alert: No Match on Agent(mozilla/4.0) Sequence(CACHE-CONTROL:CONNECTION:PRAGMA:CONTENT-TYPE:USER-AGENT:CONTENT-LENGTH:HOST)
Alert: No Match on Agent(mozilla/4.0) Sequence(CACHE-CONTROL:CONNECTION:PRAGMA:CONTENT-TYPE:USER-AGENT:CONTENT-LENGTH:HOST)
Alert: No Match on Agent(mozilla/4.0:msie:6.0:windows:nt:5.1:sv1:.net:clr:1.1.4322) Sequence(HOST:KEEP-ALIVE:CONNECTION:USER-AGENT)
Alert: No Match on Agent(mozilla/4.0:msie:6.0:windows:nt:5.1:sv1:.net:clr:1.1.4322) Sequence(HOST:KEEP-ALIVE:CONNECTION:USER-AGENT)
Alert: No Match on Agent(mozilla/4.0:msie:6.0:windows:nt:5.1:sv1:.net:clr:1.1.4322) Sequence(HOST:KEEP-ALIVE:CONNECTION:USER-AGENT)
Alert: No Match on Agent(mozilla/4.0:msie:6.0:windows:nt:5.1:sv1:.net:clr:1.1.4322) Sequence(HOST:KEEP-ALIVE:CONNECTION:USER-AGENT)
Alert: No Match on Agent(mozilla/4.0:msie:6.0:windows:nt:5.1:sv1:.net:clr:1.1.4322) Sequence(HOST:KEEP-ALIVE:CONNECTION:USER-AGENT)
Alert: No Match on Agent(mozilla/4.0) Sequence(HOST:USER-AGENT:CONTENT-TYPE:CONTENT-LENGTH:CONNECTION)
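As a rough illustration of what an evaluation pass like this involves, the sketch below checks each request's header sequence against a per-agent regex and reports the ones that fail. The function `find_unmatched`, the `agent_regexes` dictionary and its single pattern are all hypothetical; this is not the notebook's `evaluate_agents`, only a guess at the general shape of such a check.

```python
import re

def find_unmatched(agent_regexes, agents, feature_strings):
    """Return (agent, features) pairs whose header sequence fails the agent's regex."""
    unmatched = []
    for agent, features in zip(agents, feature_strings):
        pattern = agent_regexes.get(agent)
        # An agent with no learned pattern is treated as a non-match too
        if pattern is None or re.search(pattern, features) is None:
            unmatched.append((agent, features))
    return unmatched

# Hypothetical learned pattern for one short agent name
agent_regexes = {'mozilla/4.0': r'^ACCEPT:.*HOST'}

agents = ['mozilla/4.0', 'mozilla/4.0']
feature_strings = [
    'ACCEPT:REFERER:USER-AGENT:HOST:CONNECTION',
    'HOST:USER-AGENT:CONTENT-TYPE:CONTENT-LENGTH:CONNECTION',
]

alerts = find_unmatched(agent_regexes, agents, feature_strings)
for agent, features in alerts:
    print('Alert: No Match on Agent(%s) Sequence(%s)' % (agent, features))
```

Only the second request is reported here, because its sequence starts with `HOST` rather than `ACCEPT`.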
In general the material in this notebook represents fairly embryonic work.
Bartoli, Davanzo, De Lorenzo, Mauri, Medvet, Sorio, Automatic Generation of Regular Expressions from Examples with Genetic Programming, ACM Genetic and Evolutionary Computation Conference (GECCO), 2012, Philadelphia (US)
De Lorenzo, Medvet, Bartoli, Automatic String Replace by Examples, ACM Genetic and Evolutionary Computation Conference (GECCO), 2013, Amsterdam (Netherlands). The string replace functionality described in this paper extends the work showcased in the authors' web app; it is currently not exposed on the web.
The IPython notebook uses a strategy that, given two Python sequences of strings, finds a regex that matches every string in the first sequence and none in the second. It does this with a set cover technique, OR-ing the selected components together. Please see http://nbviewer.ipython.org/url/norvig.com/ipython/xkcd1313.ipynb for more info.
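A much simplified sketch of that set cover idea is shown below. It is an assumption-laden illustration rather than the code from either notebook: the names `candidate_parts`, `matches` and `find_regex` are hypothetical, the candidate components are limited to whole strings and short substrings, and the example sequences are made up.

```python
import re

def candidate_parts(winners):
    """Candidate components: each whole winner anchored with ^...$,
    plus every substring of length 1 to 4 (escaped so it is matched literally)."""
    parts = set('^' + re.escape(w) + '$' for w in winners)
    for w in winners:
        for n in range(1, 5):
            for i in range(len(w) - n + 1):
                parts.add(re.escape(w[i:i + n]))
    return parts

def matches(pattern, strings):
    """Return the subset of strings that the pattern matches somewhere."""
    return set(s for s in strings if re.search(pattern, s))

def find_regex(winners, losers):
    """Greedy set cover: repeatedly pick the component that covers the most
    remaining winners while matching no losers, then OR the picks together."""
    remaining = set(winners)
    if remaining & set(losers):
        raise ValueError('a string appears in both sets')
    pool = [p for p in candidate_parts(remaining) if not matches(p, losers)]
    chosen = []
    while remaining:
        # Prefer more coverage, then shorter components; the final key makes ties deterministic
        best = max(pool, key=lambda p: (len(matches(p, remaining)), -len(p), p))
        covered = matches(best, remaining)
        if not covered:
            raise ValueError('no candidate separates the two sets')
        chosen.append(best)
        remaining -= covered
    return '|'.join(chosen)

# Hypothetical header sequences: match the first list, reject the second
winners = ['HOST:USER-AGENT:ACCEPT', 'HOST:ACCEPT:CONNECTION']
losers = ['ACCEPT:HOST:CONNECTION', 'USER-AGENT:HOST']

pattern = find_regex(winners, losers)
print(pattern)
print(all(re.search(pattern, w) for w in winners))     # True
print(not any(re.search(pattern, l) for l in losers))  # True
```

The greedy choice does not guarantee the shortest possible regex, but it keeps the search cheap, and filtering the pool against the losers up front ensures the OR-ed result cannot match any of them.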