In [2]:
%pylab inline
Welcome to pylab, a matplotlib-based Python environment [backend: module://IPython.kernel.zmq.pylab.backend_inline].
For more information, type 'help(pylab)'.

Apache log analysis with Pandas

Nikolay Koldunov
[email protected]

In this notebook, I am going to show a simple example of Apache access log analysis with pandas. It's my first experience with pandas, and I am sure there are better and more efficient ways to do some of the things shown here. So comments, suggestions and corrections of my broken English are very welcome. You can send me an email, or create a PR for this notebook at github.

Load and parse the data

We will need the apachelog module for parsing the log. We also have to know the format of the log, which is set in the Apache config. In my case I have no access to the Apache config, but the hoster provides a description of the format on its help page. Below is the format itself and a short description of every element:

format = r'%V %h  %l %u %t \"%r\" %>s %b \"%i\" \"%{User-Agent}i\" %T'

Here (copied mostly from this SO post):

%V          - the server name according to the UseCanonicalName setting
%h          - remote host (ie the client IP)
%l          - identity of the user determined by identd (not usually used since not reliable)
%u          - user name determined by HTTP authentication
%t          - time the server finished processing the request.
%r          - request line from the client. ("GET / HTTP/1.0")
%>s         - status code sent from the server to the client (200, 404 etc.)
%b          - size of the response to the client (in bytes)
\"%i\"      - Referer is the page that linked to this URL.
%{User-Agent}i - the browser identification string
%T          - Apache request time  
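The apachelog parser builds a regular expression from a format string like this. As a rough illustration (not apachelog's actual internals), a hand-written regex for a line in this format could look as follows; the example line and the exact pattern are made up for the sketch:

```python
import re

# A sample line in the format above (hostname and IP are invented):
line = ('example.com 1.2.3.4 - - [16/Mar/2013:00:19:43 +0400] '
        '"GET /?p=364 HTTP/1.0" 200 65237 "-" "Mozilla/5.0" 0')

# Capture each field: unquoted tokens, the bracketed time,
# and the three quoted strings (request, referer, user agent).
pattern = (r'(\S+) (\S+) (\S+) (\S+) \[(.+?)\] '
           r'"(.*?)" (\d{3}) (\S+) "(.*?)" "(.*?)" (\S+)')
m = re.match(pattern, line)

fields = dict(zip(
    ['%V', '%h', '%l', '%u', '%t', '%r',
     '%>s', '%b', '%i', '%{User-Agent}i', '%T'],
    m.groups()))
```

This yields the same kind of dictionary that apachelog produces below.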
In [3]:
import apachelog, sys

Set the format:

In [4]:
fformat = r'%V %h %l %u %t \"%r\" %>s %b \"%i\" \"%{User-Agent}i\" %T'

Create the parser:

In [5]:
p = apachelog.parser(fformat)

Sample string: - - [16/Mar/2013:00:19:43 +0400] "GET /?p=364 HTTP/1.0" 200 65237 "" "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11" 0
In [6]:
sample_string = ' - - [16/Mar/2013:00:19:43 +0400] "GET /?p=364 HTTP/1.0" 200 65237 "" "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11" 0'
In [7]:
data = p.parse(sample_string)
In [8]:
{'%>s': '200',
 '%T': '0',
 '%V': '',
 '%b': '65237',
 '%h': '',
 '%i': '',
 '%l': '-',
 '%r': 'GET /?p=364 HTTP/1.0',
 '%t': '[16/Mar/2013:00:19:43 +0400]',
 '%u': '-',
 '%{User-Agent}i': 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11'}

Looks like the parser works. Now let's load some real-world data (sample file located here and here):

In [9]:
log = open('access_log_for_pandas').readlines()

Parse every line of it and create a list of dictionaries:

In [10]:
log_list = []
for line in log:
    try:
        data = p.parse(line)
    except:
        sys.stderr.write("Unable to parse %s" % line)
        continue
    # rearrange the timestamp so pandas can parse it later
    data['%t'] = data['%t'][1:12]+' '+data['%t'][13:21]+' '+data['%t'][22:27]
    log_list.append(data)

We had to tweak the time format a bit, since otherwise pandas will not be able to parse it.
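The slicing in the loop just reshuffles Apache's bracketed timestamp into a form pandas understands; in isolation it does this:

```python
# The bracketed Apache timestamp and what the slicing turns it into:
t = '[16/Mar/2013:00:19:43 +0400]'
# t[1:12]  -> '16/Mar/2013' (date)
# t[13:21] -> '00:19:43'    (time)
# t[22:27] -> '+0400'       (timezone offset)
fixed = t[1:12] + ' ' + t[13:21] + ' ' + t[22:27]
```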

Create and tune Data Frame

This will create a list of dictionaries that can be transformed into a Data Frame:

In [11]:
import pandas as pd
import numpy as np
from pandas import Series, DataFrame, Panel
In [12]:
df = DataFrame(log_list)

Show the first two lines of the Data Frame:

In [13]:
%>s %T %V %b %h %i %l %r %t %u %{User-Agent}i
0 200 0 26126 - - GET /index.php?option=com_content&task=section... 16/Mar/2013 08:00:25 +0400 - Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/20...
1 200 0 10532 - GET /templates/ja_procyon/css/template_css.css... 16/Mar/2013 08:00:25 +0400 - Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/20...

We are not going to use all the data, so let's delete some of the columns:

In [14]:
del df['%T']; del df['%V']; del df['%i']; del df['%l']; del df['%u']; del df['%{User-Agent}i']

and rename columns to something that humans can understand:

In [15]:
df = df.rename(columns={'%>s': 'Status', '%b':'b', 
                        '%h':'IP', '%r':'Request', '%t': 'Time'})

First five rows of resulting Data Frame:

In [16]:
Status b IP Request Time
0 200 26126 GET /index.php?option=com_content&task=section... 16/Mar/2013 08:00:25 +0400
1 200 10532 GET /templates/ja_procyon/css/template_css.css... 16/Mar/2013 08:00:25 +0400
2 200 1853 GET /templates/ja_procyon/switcher.js HTTP/1.0 16/Mar/2013 08:00:25 +0400
3 200 37153 GET /includes/js/overlib_mini.js HTTP/1.0 16/Mar/2013 08:00:25 +0400
4 200 3978 GET /modules/ja_transmenu/transmenuh.css HTTP/1.0 16/Mar/2013 08:00:25 +0400

Convert Time column to datetime format and make an index out of it (pop will drop original Time column):

In [17]:
df.index = pd.to_datetime(df.pop('Time'))
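If to_datetime has trouble guessing the format, an explicit format string can be passed instead; the one below is an assumption matching the layout produced by the parsing loop:

```python
import pandas as pd

# Explicit format for '16/Mar/2013 00:19:43 +0400';
# %z keeps the timezone offset, giving a tz-aware timestamp.
ts = pd.to_datetime('16/Mar/2013 00:19:43 +0400',
                    format='%d/%b/%Y %H:%M:%S %z')
```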

The Status variable is a string type, so we have to change it to int:

In [18]:
df['Status'] = df['Status'].astype('int')

Some of the rows in the b column contain a '-' string, so we can't convert them with astype:

In [19]:

We can apply a custom function to this column that will convert all dashes to NaN and the rest to floats, additionally converting from bytes to megabytes:

In [20]:
def dash2nan(x):
    if x == '-':
        x = np.nan
    else:
        x = float(x)/1048576.
    return x
In [21]:
df['b'] = df['b'].apply(dash2nan)

I am sure there is a more elegant way to do this :)
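For reference, one such more elegant way in newer pandas is to_numeric with errors='coerce', which vectorizes the whole dash-to-NaN conversion in a single call:

```python
import numpy as np
import pandas as pd

b = pd.Series(['65237', '-', '1048576'])
# '-' (and anything else non-numeric) becomes NaN,
# everything else a float; then bytes -> megabytes.
mb = pd.to_numeric(b, errors='coerce') / 1048576.
```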

Traffic analysis

Our first, simplest plot: outgoing traffic from the website:

In [22]:
<matplotlib.axes.AxesSubplot at 0xbf7574c>

Looks like somebody downloaded something big from the website around 9 a.m.

But actually the first thing that you probably want to know is how many visits (actually hits) your site gets, and how they are distributed in time. We resample the series of the variable b with 5-minute intervals and calculate the number of requests during every time span. In this case it doesn't matter which variable we use; these numbers just indicate how many times information from the website was requested.

In [23]:
df_s = df['b'].resample('5t', how='count')
<matplotlib.axes.AxesSubplot at 0xc14588c>
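Note that resample(..., how='count') is the older spelling; newer pandas chains the aggregation as a method instead. A small self-contained sketch with made-up timestamps:

```python
import pandas as pd

idx = pd.to_datetime(['2013-03-16 08:00:10',
                      '2013-03-16 08:02:30',
                      '2013-03-16 08:07:00'])
b = pd.Series([1.0, 2.0, 3.0], index=idx)

# Requests per 5-minute bin; equivalent to resample('5t', how='count').
hits = b.resample('5min').count()
```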

We can not only count the number of requests, but also calculate the sum of the traffic for every time span:

In [24]:
df_b = df['b'].resample('10t', how=['count','sum'])
df_b['count'].plot( color='r')
<matplotlib.axes.AxesSubplot at 0xc2d53ac>

On the plot you can see that the number of server requests does not always coincide with the amount of traffic, and the correlation is actually not extremely high:

In [25]:
count sum
count 1.000000 0.512629
sum 0.512629 1.000000
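A correlation matrix like the one above is produced by DataFrame.corr(); a sketch with made-up count/sum numbers:

```python
import pandas as pd

# Synthetic per-interval request counts and traffic sums:
df_b = pd.DataFrame({'count': [10, 20, 30, 40],
                     'sum':   [1.0, 5.0, 2.0, 9.0]})
# Pairwise Pearson correlation between the columns.
c = df_b.corr()
```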

We can have a closer look at the curious morning peak:

In [26]:
df_b['2013-03-16 6:00':'2013-03-16 10:00']['sum'].plot()
<matplotlib.axes.AxesSubplot at 0xc3f5dac>

Seems that this traffic spike was caused by only one request. Let's find out what this request looks like. Choose all requests with a response size larger than 20 Mb (remember we converted bytes to Mb):

In [27]:
Status b IP Request
2013-03-16 09:02:59 200 21.365701 GET /books/Bondarenko.pdf HTTP/1.0

It was a .pdf file of a book (look at the Request field), and this explains the peak in outgoing traffic at 2013-03-16 09:02:59.

Clearly 20 Mb is a large request (for our website at least :)). But what is the typical size of the server's response? The histogram of response sizes (less than 20 Mb) looks like this:

In [28]:
cc = df[df['b']<20]
<matplotlib.axes.AxesSubplot at 0xc52374c>

So, most of the files are smaller than 0.5 Mb. In fact they are even smaller:

In [29]:
cc = df[df['b']<0.3]
<matplotlib.axes.AxesSubplot at 0xc5760ec>

Very small responses can be all kinds of stuff, but larger ones are probably some relatively big files. Let's have a look at these little spikes between 0.2 and 0.25 Mb:

In [30]:
cc = df[(df['b']>0.2)&(df['b']<0.25)]
<matplotlib.axes.AxesSubplot at 0xc57cacc>

We now can locate the size of the file more precisely, and find out its name:

In [31]:
cc = df[(df['b']>0.220)&(df['b']<0.224)]
Status b IP Request
2013-03-16 10:13:51 200 0.223101 GET /mypict/itpocover.png HTTP/1.0
2013-03-16 11:02:12 200 0.223101 GET /mypict/itpocover.png HTTP/1.0
2013-03-16 11:18:48 200 0.223101 GET /mypict/itpocover.png HTTP/1.0
2013-03-16 11:33:36 200 0.223101 GET /mypict/itpocover.png HTTP/1.0
2013-03-16 11:34:03 200 0.223101 GET /mypict/itpocover.png HTTP/1.0

This is one of the image files from the front page of the website.

Server status codes

I would like to know how the different server responses are distributed through time. Let's first try a not very elegant approach. We create several variables, each containing a time series of Status values restricted to one particular code. Then a Data Frame is created out of these time series.

In [32]:
t_span = '2H'
df_404 = df['Status'][df['Status'] == 404].resample(t_span, how='count')
df_403 = df['Status'][df['Status'] == 403].resample(t_span, how='count')
df_301 = df['Status'][df['Status'] == 301].resample(t_span, how='count')
df_304 = df['Status'][df['Status'] == 304].resample(t_span, how='count')
df_200 = df['Status'][df['Status'] == 200].resample(t_span, how='count')

status_df = DataFrame({'Not Found':df_404, 'Forbidden':df_403, 'Moved Permanently':df_301, 'Not Modified':df_304, 'OK':df_200,})
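For what it's worth, the five per-code resamples can be collapsed into a single groupby with a time-based Grouper (newer pandas syntax; the timestamps below are made up for the sketch):

```python
import pandas as pd

idx = pd.to_datetime(['2013-03-16 08:05', '2013-03-16 08:30',
                      '2013-03-16 10:10', '2013-03-16 10:20'])
df = pd.DataFrame({'Status': [200, 404, 200, 200]}, index=idx)

# One call: count rows per (2-hour bin, status code),
# then pivot status codes into columns.
status_counts = (df.groupby([pd.Grouper(freq='2h'), 'Status'])
                   .size()
                   .unstack(fill_value=0))
```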
In [33]:
Forbidden Moved Permanently Not Found Not Modified OK
2013-03-16 08:00:00 22 17 3 4 375
2013-03-16 10:00:00 26 76 2 2 607
2013-03-16 12:00:00 45 83 2 23 780
2013-03-16 14:00:00 29 34 4 2 699
2013-03-16 16:00:00 27 94 5 3 886

We plot all values at once:

In [34]:
status_df.plot(figsize=(10, 3))
<matplotlib.axes.AxesSubplot at 0xcda1e6c>

Show only Error and Redirection status codes as a stacked plot:

In [35]:
status_df[['Not Found','Forbidden','Moved Permanently','Not Modified']].plot(kind='barh', stacked=True, figsize=(10, 7))
<matplotlib.axes.AxesSubplot at 0xce27e2c>

But there is a better way to obtain statistics about different groups of values. Here we group our data by Status:

In [36]:
grouped_status = df.groupby('Status')

Data are now rearranged into several groups, each corresponding to a certain value of Status (here, only the first two values of every group are shown):

In [37]:
Status b IP Request
Status Time
200 2013-03-16 08:00:25 200 0.024916 GET /index.php?option=com_content&task=section...
2013-03-16 08:00:25 200 0.010044 GET /templates/ja_procyon/css/template_css.css...
206 2013-03-16 11:42:53 206 0.031250 GET /images/stories/pdf/0104_044.pdf HTTP/1.0
2013-03-16 11:42:53 206 0.224717 GET /images/stories/pdf/0104_044.pdf HTTP/1.0
301 2013-03-16 08:13:08 301 0.000232 GET /forum HTTP/1.0
2013-03-16 08:13:09 301 0.000234 GET /podcast HTTP/1.0
302 2013-03-16 08:47:04 302 NaN POST /podcast/wp-comments-post.php HTTP/1.0
2013-03-16 11:27:40 302 NaN POST /podcast/wp-comments-post.php HTTP/1.0
304 2013-03-16 08:19:41 304 NaN GET /images/stories/researchers/laplace.jpg HT...
2013-03-16 08:33:14 304 NaN GET /images/stories/researchers/treshnikov.jpg...
400 2013-03-16 08:13:09 400 0.000999 GET /.. HTTP/1.0
403 2013-03-16 08:04:23 403 0.001054 GET /forum/download/file.php?avatar=156.gif HT...
2013-03-16 08:04:23 403 0.001054 GET /forum/download/file.php?avatar=216.jpg HT...
404 2013-03-16 08:04:25 404 0.001063 GET /apple-touch-icon-precomposed.png HTTP/1.0
2013-03-16 08:04:26 404 0.001063 GET /apple-touch-icon.png HTTP/1.0

We can count the number of values in each group:

In [38]:
<matplotlib.axes.AxesSubplot at 0xcfe76ac>
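The counting itself is just size() on the grouped frame; a minimal sketch with made-up statuses:

```python
import pandas as pd

df = pd.DataFrame({'Status': [200, 200, 404, 301, 200]})
# Number of rows per status code - the numbers behind the bar plot.
counts = df.groupby('Status').size()
```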

We can look at individual groups:

In [39]:
t_span = '30t'
grouped_status.get_group(301)['Status'].resample(t_span, how='count').plot(color='g', label='301')
grouped_status.get_group(200)['Status'].resample(t_span, how='count').plot(color='b', secondary_y=True, label='200')
<matplotlib.axes.AxesSubplot at 0xd0799ec>


Now I want to group by IPs and calculate how many requests we get from each of them:

In [40]:
ips = df.groupby('IP').size()

I don't want to see all of them, just the top 10:

In [41]:
<matplotlib.axes.AxesSubplot at 0xd13c0ec>

I just want a nice table with the top 10 IPs, so I convert them to a Data Frame :)

In [42]:
ips_fd = DataFrame({'Number of requests':ips[-10:]})
ips_fd = ips_fd.sort(columns='Number of requests',  ascending=False)
Number of requests
IP 814 168 165 127 126 120 111 93 74 72
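Note that Series.sort() and DataFrame.sort(columns=...) were later removed from pandas; with a newer version the same top-N table can be built with nlargest (IPs below are made up for the sketch):

```python
import pandas as pd

df = pd.DataFrame({'IP': ['a', 'b', 'a', 'c', 'a', 'b']})
# Requests per IP, top 2, largest first - replaces
# the sort() + slicing dance above.
top = df.groupby('IP').size().nlargest(2)
```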

We also can group by both IP and Status:

In [43]:
ips_status = df.groupby(['IP', 'Status']).size()
In [44]:
<matplotlib.axes.AxesSubplot at 0xd3425cc>

Geographical information

I would like to know from which countries people (or robots) come to the website. It turns out that precise geolocation information is not free, but for our purposes free databases are fine. We will use pygeoip to get information from the GeoIP database, which has to be downloaded from here.

Import the module and initialise the database:

In [46]:
import pygeoip
gi = pygeoip.GeoIP('./GeoLiteCity.dat', pygeoip.MEMORY_CACHE)

This is how the output of pygeoip looks (we test our top IP from the previous section):

In [47]:
ipcon = gi.record_by_addr('')
{'area_code': 650,
 'city': 'Mountain View',
 'continent': 'NA',
 'country_code': 'US',
 'country_code3': 'USA',
 'country_name': 'United States',
 'dma_code': 807,
 'latitude': 37.41919999999999,
 'longitude': -122.0574,
 'metro_code': 'San Francisco, CA',
 'postal_code': '94043',
 'region_name': 'CA',
 'time_zone': 'America/Los_Angeles'}

Guess who it is :) Looks like Google Bot is my main reader. That's rather sad :(

Loop to create a list of dictionaries with information about the IP numbers. We also add the number of requests coming from each of the IPs:

In [48]:
ipcon = []
for iip in ips.index:
    rres = gi.record_by_addr(iip)
#    rres['ip']                 = iip
    rres['Number'] = ips[iip]
    # delete some fields we don't need
    del rres['area_code']
    del rres['dma_code']
    del rres['metro_code']
    del rres['postal_code']
    del rres['region_name']
    del rres['time_zone']
    del rres['country_code']
    ipcon.append(rres)

Create a Data Frame from this list, using the ips index:

In [49]:
reg = DataFrame(ipcon, index = ips.index)
Number city continent country_code3 country_name latitude longitude
IP
1 Jakarta AS IDN Indonesia -6.1744 106.8294
1 Shenzhen AS CHN China 22.5333 114.1333
1 Agra AS IND India 27.1833 78.0167
1 Zhangzhou AS CHN China 24.5133 117.6556
1 Guangzhou AS CHN China 23.1167 113.2500

Group by Country:

In [50]:
country = reg.groupby('country_code3')
In [51]:
ff = country.Number.agg('sum').copy()
ff.sort()
<matplotlib.axes.AxesSubplot at 0xd6c68cc>

The website is in Russian, so no surprises here.

Group by city:

In [52]:
city = reg.groupby('city')
In [53]:
ff = city.Number.agg('sum').copy()
ff.sort()
ff[-20:].plot(kind='barh', figsize=(5,8))
<matplotlib.axes.AxesSubplot at 0xd86fa2c>

The first 3 are kind of expected, but Tyumen, a Russian city located in the middle of Eurasia thousands of kilometers away from the nearest ocean coast, is a bit of a surprise (the website is about Oceanography). This is something to investigate later on.

We have lats and lons for the data, so it is natural to put them on the map:

In [55]:
from mpl_toolkits.basemap import Basemap
import matplotlib.cm as cm
In [56]:
m = Basemap(projection='robin',lon_0=0,resolution='c')
x, y = m(reg['longitude'],reg['latitude'])
In [67]:

m.scatter(x,y,s=reg['Number']*3,c=reg['Number']/5,marker='o',zorder=4, cmap=cm.Paired,alpha=0.5)
<matplotlib.collections.PathCollection at 0x15facacc>

Let's have a closer look at Europe:

In [68]:
m = Basemap(projection='cyl', llcrnrlat=35, urcrnrlat=72,
            llcrnrlon=-15, urcrnrlon=50, resolution='c')  # lon bounds assumed: rough extent of Europe
x, y = m(reg['longitude'],reg['latitude'])
In [69]:
m.scatter(x,y,s=reg['Number']*30,c=reg['Number'],marker='o',zorder=4, cmap=cm.gist_ncar ,alpha=0.3)
<matplotlib.collections.PathCollection at 0x169381ec>

Catching spammers (request methods)

Spammers are a huge problem, and it would be nice to filter them out. My website has very low user activity, so if somebody tries to POST something more than a couple of times, it's already suspicious. First we have to find out which request method is used for each request. For this we analyse the Request column, which contains the request strings. The request method is always located at the beginning of the request string, so it's trivial to extract:

In [70]:
def gmet(x):
    x = x.split()
    return x[0]

We apply this little function to the Request column and create new Method column:

In [71]:
df['Method'] = df.Request.apply(gmet)
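An equivalent without the helper function, using pandas' vectorized string methods (the requests below are made up for the sketch):

```python
import pandas as pd

df = pd.DataFrame({'Request': ['GET /index.php HTTP/1.0',
                               'POST /forum/ucp.php HTTP/1.0']})
# Take the first whitespace-separated token of each request string.
df['Method'] = df['Request'].str.split().str[0]
```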

Group by Method and IP:

In [72]:
met = df.groupby(['Method', 'IP']).size()
In [73]:
<bound method Series.head of Method  IP             
GET       26    23     12     4     14     56     32      13    45     9    13     2       4     13       10
POST      5      5      6      10      58      4      1     2     1        5     5       5       5       5       6
Length: 470>

Create the POST time series (we need a copy of the series in order to sort the data):

In [74]:
post = met['POST'].copy()

Plot the top 10 IPs that use the POST method:

In [75]:
<matplotlib.axes.AxesSubplot at 0x16ddaeec>
In [76]:
IP    10      10    11      15      58

What is the winner doing?

In [77]:
df[df.IP ==''][0:5]
Status b IP Request Method
2013-03-16 08:28:43 403 0.001054 POST /podcast/wp-cron.php?doing_wp_cron=136340... POST
2013-03-16 08:45:08 403 0.001054 POST /podcast/wp-cron.php?doing_wp_cron=136340... POST
2013-03-16 09:19:52 403 0.001054 POST /podcast/wp-cron.php?doing_wp_cron=136341... POST
2013-03-16 09:36:26 403 0.001054 POST /podcast/wp-cron.php?doing_wp_cron=136341... POST
2013-03-16 09:40:22 403 0.001054 POST /podcast/wp-cron.php?doing_wp_cron=136341... POST

Making some Wordpress cron requests... This IP looks familiar... Oh, it's actually the IP of my website - this is my Wordpress doing something. I have to investigate later whether I can disable this (anyway it's not working, it gets a 403 error) :)

Number two:

In [78]:
df[df.IP ==''][0:10]
Status b IP Request Method
2013-03-16 17:27:12 200 0.002224 GET /administrator/index.php HTTP/1.0 GET
2013-03-16 17:27:12 200 0.002224 GET /administrator/index.php HTTP/1.0 GET
2013-03-16 17:27:12 200 0.002224 GET /administrator/index.php HTTP/1.0 GET
2013-03-16 17:27:12 200 0.002224 GET /administrator/index.php HTTP/1.0 GET
2013-03-16 17:27:12 200 0.002224 GET /administrator/index.php HTTP/1.0 GET
2013-03-16 17:27:12 200 0.002224 POST /administrator/index.php HTTP/1.0 POST
2013-03-16 17:27:12 200 0.002224 POST /administrator/index.php HTTP/1.0 POST
2013-03-16 17:27:12 200 0.002224 GET /administrator/index.php HTTP/1.0 GET
2013-03-16 17:27:12 200 0.002224 POST /administrator/index.php HTTP/1.0 POST
2013-03-16 17:27:12 200 0.002224 POST /administrator/index.php HTTP/1.0 POST

This is suspicious; it looks like it's trying to get into the website's admin area - a candidate for blocking.

Number three:

In [79]:
df[df.IP ==''][0:10]
Status b IP Request Method
2013-03-16 11:00:34 200 0.012539 GET /forum/ucp.php?mode=register&sid=8203c4843... GET
2013-03-16 11:00:34 200 0.026559 GET /forum/index.php HTTP/1.0 GET
2013-03-16 11:00:34 200 0.012011 GET /forum/ucp.php?mode=register&sid=c78851b8f... GET
2013-03-16 11:00:39 200 0.016414 POST /forum/ucp.php?mode=register&sid=12986e06... POST
2013-03-16 11:00:39 200 0.016414 POST /forum/ucp.php?mode=register HTTP/1.0 POST
2013-03-16 11:00:44 200 0.015024 POST /forum/ucp.php?mode=register HTTP/1.0 POST
2013-03-16 11:00:45 200 0.015028 POST /forum/ucp.php?mode=register HTTP/1.0 POST
2013-03-16 11:00:49 200 0.015024 POST /forum/ucp.php?mode=register HTTP/1.0 POST
2013-03-16 11:00:50 200 0.015028 POST /forum/ucp.php?mode=register HTTP/1.0 POST
2013-03-16 11:00:55 200 0.015024 POST /forum/ucp.php?mode=register HTTP/1.0 POST

This one tries to register on the forum every 5 seconds or so - clearly a forum spammer. Block.