In [2]:

%pylab inline

Welcome to pylab, a matplotlib-based Python environment [backend: module://IPython.kernel.zmq.pylab.backend_inline].
For more information, type 'help(pylab)'.

Apache log analysis with Pandas¶

Nikolay Koldunov
koldunovn@gmail.com

koldunov.net

In this notebook, I am going to show simple example of an Apache access log analysis with pandas. It's my first experience with pandas, and I am sure there are better and more efficient ways to do some of the things shown here. So comments, suggestions and corrections of my broken English are very welcome. You can send me an email, or create a PR for this notebook at github.

Load and parse the data¶

We will need apachelog module, for parsing the log. We also have to know the format of the log that is set in the Apache config. In my case I have no access to the Apache config, but the hoster provide description of the format on his help page. Below is the format itself and a short description of every element:

format = r'%V %h  %l %u %t \"%r\" %>s %b \"%i\" \"%{User-Agent}i\" %T'

Here (copied mostly from this SO post):

%V          - the server name according to the UseCanonicalName setting
%h          - remote host (ie the client IP)
%l          - identity of the user determined by identd (not usually used since not reliable)
%u          - user name determined by HTTP authentication
%t          - time the server finished processing the request.
%r          - request line from the client. ("GET / HTTP/1.0")
%>s         - status code sent from the server to the client (200, 404 etc.)
%b          - size of the response to the client (in bytes)
\"%i\"      - Referer is the page that linked to this URL.
User-agent  - the browser identification string
%T          - Apache request time

In [3]:

import apachelog, sys

Set the format:

In [4]:

fformat = r'%V %h %l %u %t \"%r\" %>s %b \"%i\" \"%{User-Agent}i\" %T'

Create the parcer:

In [5]:

p = apachelog.parser(fformat)

Sample string:

    koldunov.net 85.26.235.202 - - [16/Mar/2013:00:19:43 +0400] "GET /?p=364 HTTP/1.0" 200 65237 "http://koldunov.net/?p=364" "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11" 0

In [6]:

sample_string = 'koldunov.net 85.26.235.202 - - [16/Mar/2013:00:19:43 +0400] "GET /?p=364 HTTP/1.0" 200 65237 "http://koldunov.net/?p=364" "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11" 0'

In [7]:

data = p.parse(sample_string)

In [8]:

data

Out[8]:

{'%>s': '200',
 '%T': '0',
 '%V': 'koldunov.net',
 '%b': '65237',
 '%h': '85.26.235.202',
 '%i': 'http://koldunov.net/?p=364',
 '%l': '-',
 '%r': 'GET /?p=364 HTTP/1.0',
 '%t': '[16/Mar/2013:00:19:43 +0400]',
 '%u': '-',
 '%{User-Agent}i': 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11'}

Looks like the parser works. Now to load some real world data (sample file located here and here):

In [9]:

log = open('access_log_for_pandas').readlines()

Parse every line of it and create a list of dictionaries:

In [10]:

log_list = []
for line in log:
       try:
          data = p.parse(line)
       except:
          sys.stderr.write("Unable to parse %s" % line)
       data['%t'] = data['%t'][1:12]+' '+data['%t'][13:21]+' '+data['%t'][22:27]
      
       log_list.append(data)

We had to tweak the time format a bit, since otherwise pandas will not be able to parse it.

Create and tune Data Frame¶

This will create a list of dictionaries, that can be transformed in to a Data Frame:

In [11]:

import pandas as pd
import numpy as np
from pandas import Series, DataFrame, Panel

In [12]:

df = DataFrame(log_list)

Show the first two lines of the Data Frame:

In [13]:

df[0:2]

Out[13]:

	%>s	%T	%V	%b	%h	%i	%l	%r	%t	%u	%{User-Agent}i
0	200	0	www.oceanographers.ru	26126	109.165.31.156	-	-	GET /index.php?option=com_content&task=section...	16/Mar/2013 08:00:25 +0400	-	Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/20...
1	200	0	www.oceanographers.ru	10532	109.165.31.156	http://www.oceanographers.ru/index.php?option=...	-	GET /templates/ja_procyon/css/template_css.css...	16/Mar/2013 08:00:25 +0400	-	Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/20...

We are not going to use all the data, so let's delete some of the columns:

In [14]:

del df['%T']; del df['%V']; del df['%i']; del df['%l']; del df['%u']; del df['%{User-Agent}i']

and rename columns to something that humans can understand:

In [15]:

df = df.rename(columns={'%>s': 'Status', '%b':'b', 
                        '%h':'IP', '%r':'Request', '%t': 'Time'})

First five rows of resulting Data Frame:

In [16]:

df.head()

Out[16]:

	Status	b	IP	Request	Time
0	200	26126	109.165.31.156	GET /index.php?option=com_content&task=section...	16/Mar/2013 08:00:25 +0400
1	200	10532	109.165.31.156	GET /templates/ja_procyon/css/template_css.css...	16/Mar/2013 08:00:25 +0400
2	200	1853	109.165.31.156	GET /templates/ja_procyon/switcher.js HTTP/1.0	16/Mar/2013 08:00:25 +0400
3	200	37153	109.165.31.156	GET /includes/js/overlib_mini.js HTTP/1.0	16/Mar/2013 08:00:25 +0400
4	200	3978	109.165.31.156	GET /modules/ja_transmenu/transmenuh.css HTTP/1.0	16/Mar/2013 08:00:25 +0400

Convert Time column to datetime format and make an index out of it (pop will drop original Time column):

In [17]:

df.index = pd.to_datetime(df.pop('Time'))

The Status variable is a string type, so we have to change it to int:

In [18]:

df['Status'] = df['Status'].astype('int')

Some of the rows in the b column contain '-' string, so we can't convert them with astype:

In [19]:

df['b'][93]

Out[19]:

'-'

We can apply a custom function to this column, that will convert all dashes to NaN, and the rest to floats, additionally converting from bytes to megabytes:

In [20]:

def dash2nan(x):
    if x == '-':
        x = np.nan
    else:
        x = float(x)/1048576.
    
    return x

In [21]:

df['b'] = df['b'].apply(dash2nan)

I am sure there is a more elegant way to do this :)

Traffic analysis¶

Our first, simplest plot: outgoing traffic from the website:

In [22]:

df['b'].plot()

Out[22]:

<matplotlib.axes.AxesSubplot at 0xbf7574c>

Look's like somebody downloaded something big from the website around 9 a.m.

But actually the first thing that you probably want to know is how many visits (actually hits) your site has, and how they are distributed in time. We resample the series of variable b with 5 minute intervals, and calculate number of requests during every time span. Actually, in this case it doesn't matter what variable we use, these numbers will indicate just how many times information from the the website was requested.

In [23]:

df_s = df['b'].resample('5t', how='count')
df_s.plot()

Out[23]:

<matplotlib.axes.AxesSubplot at 0xc14588c>

We can not only count number of requests per time, but also calculate the sum of the traffic for every time span:

In [24]:

df_b = df['b'].resample('10t', how=['count','sum'])
df_b['count'].plot( color='r')
legend()
df_b['sum'].plot(secondary_y=True)

Out[24]:

<matplotlib.axes.AxesSubplot at 0xc2d53ac>

On the plot you can see, that the number of server requests do not always coincide with the amount of traffic, and correlation is actually not extremely high:

In [25]:

df_b.corr()

Out[25]:

	count	sum
count	1.000000	0.512629
sum	0.512629	1.000000

We can have a closer look at the curious morning peak:

In [26]:

df_b['2013-03-16 6:00':'2013-03-16 10:00']['sum'].plot()

Out[26]:

<matplotlib.axes.AxesSubplot at 0xc3f5dac>

Seems that this traffic spike was caused by only one request. Let's find out how this request looks. Choose all requests with the size of the response larger than 20 Mb (remember we convert bytes to Mb):

In [27]:

df[df['b']>20]

Out[27]:

	Status	b	IP	Request
Time
2013-03-16 09:02:59	200	21.365701	77.50.248.20	GET /books/Bondarenko.pdf HTTP/1.0

It was a .pdf file of a book (look at Request field), and this explains the peak in outgoing trafic at 2013-03-16 09:02:59.

Clearly 20 Mb is a large request (for our website at least :)). But what is the typical size of the server's response? The histogram of response sizes (less than 20 Mb) looks like this:

In [28]:

cc = df[df['b']<20]
cc.b.hist(bins=10)

Out[28]:

<matplotlib.axes.AxesSubplot at 0xc52374c>

So, most of the files are less than 0.5 Mb large. In fact they are even smaller:

In [29]:

cc = df[df['b']<0.3]
cc.b.hist(bins=100)

Out[29]:

<matplotlib.axes.AxesSubplot at 0xc5760ec>

Very small responses can be all kinds of stuff, but larger ones are probably some relatively big files. Let's have a look at these little spikes between 0.2 and 0.25 Mb:

In [30]:

cc = df[(df['b']>0.2)&(df['b']<0.25)]
cc.b.hist(bins=100)

Out[30]:

<matplotlib.axes.AxesSubplot at 0xc57cacc>

We now can locate the size of the file more precisely, and find out its name:

In [31]:

cc = df[(df['b']>0.220)&(df['b']<0.224)]
cc.head()

Out[31]:

	Status	b	IP	Request
Time
2013-03-16 10:13:51	200	0.223101	178.238.18.194	GET /mypict/itpocover.png HTTP/1.0
2013-03-16 11:02:12	200	0.223101	2.95.140.217	GET /mypict/itpocover.png HTTP/1.0
2013-03-16 11:18:48	200	0.223101	213.87.138.145	GET /mypict/itpocover.png HTTP/1.0
2013-03-16 11:33:36	200	0.223101	91.207.106.252	GET /mypict/itpocover.png HTTP/1.0
2013-03-16 11:34:03	200	0.223101	212.3.137.202	GET /mypict/itpocover.png HTTP/1.0

This is one of the image files from the front page of the website.

Server status codes¶

I would like to know how different server responses are distributed through time. Let's first try a not very elegant approach. We create several variables with time series of Status values, each containing only a time series with one particular value. Then a Data Frame is created out of this time series.

In [32]:

t_span = '2H'
df_404 = df['Status'][df['Status'] == 404].resample(t_span, how='count')
df_403 = df['Status'][df['Status'] == 403].resample(t_span, how='count')
df_301 = df['Status'][df['Status'] == 301].resample(t_span, how='count')
df_304 = df['Status'][df['Status'] == 304].resample(t_span, how='count')
df_200 = df['Status'][df['Status'] == 200].resample(t_span, how='count')

status_df = DataFrame({'Not Found':df_404, 'Forbidden':df_403, 'Moved Permanently':df_301, 'Not Modified':df_304, 'OK':df_200,})

In [33]:

status_df.head()

Out[33]:

	Forbidden	Moved Permanently	Not Found	Not Modified	OK
Time
2013-03-16 08:00:00	22	17	3	4	375
2013-03-16 10:00:00	26	76	2	2	607
2013-03-16 12:00:00	45	83	2	23	780
2013-03-16 14:00:00	29	34	4	2	699
2013-03-16 16:00:00	27	94	5	3	886

We plot all values at once:

In [34]:

status_df.plot(figsize=(10, 3))

Out[34]:

<matplotlib.axes.AxesSubplot at 0xcda1e6c>

Show only Error and Redirection status codes as a stacked plot:

In [35]:

status_df[['Not Found','Forbidden','Moved Permanently','Not Modified']].plot(kind='barh', stacked=True, figsize=(10, 7))

Out[35]:

<matplotlib.axes.AxesSubplot at 0xce27e2c>

But there is a better way to receive statistics about different groups of values. Here we group our data by Status:

In [36]:

grouped_status = df.groupby('Status')

Data are now rearranged in to several groups, each corresponding to a certain value of Status (here, only the first two values of every group are shown):

In [37]:

grouped_status.head(2)

Out[37]:

		Status	b	IP	Request
Status	Time
200	2013-03-16 08:00:25	200	0.024916	109.165.31.156	GET /index.php?option=com_content&task=section...
200	2013-03-16 08:00:25	200	0.010044	109.165.31.156	GET /templates/ja_procyon/css/template_css.css...
206	2013-03-16 11:42:53	206	0.031250	109.171.109.164	GET /images/stories/pdf/0104_044.pdf HTTP/1.0
206	2013-03-16 11:42:53	206	0.224717	109.171.109.164	GET /images/stories/pdf/0104_044.pdf HTTP/1.0
301	2013-03-16 08:13:08	301	0.000232	38.100.21.63	GET /forum HTTP/1.0
301	2013-03-16 08:13:09	301	0.000234	38.100.21.63	GET /podcast HTTP/1.0
302	2013-03-16 08:47:04	302	NaN	176.8.91.244	POST /podcast/wp-comments-post.php HTTP/1.0
302	2013-03-16 11:27:40	302	NaN	78.30.212.183	POST /podcast/wp-comments-post.php HTTP/1.0
304	2013-03-16 08:19:41	304	NaN	178.154.206.250	GET /images/stories/researchers/laplace.jpg HT...
304	2013-03-16 08:33:14	304	NaN	178.154.206.250	GET /images/stories/researchers/treshnikov.jpg...
400	2013-03-16 08:13:09	400	0.000999	38.100.21.63	GET /.. HTTP/1.0
403	2013-03-16 08:04:23	403	0.001054	5.79.199.116	GET /forum/download/file.php?avatar=156.gif HT...
403	2013-03-16 08:04:23	403	0.001054	5.79.199.116	GET /forum/download/file.php?avatar=216.jpg HT...
404	2013-03-16 08:04:25	404	0.001063	5.79.199.116	GET /apple-touch-icon-precomposed.png HTTP/1.0
404	2013-03-16 08:04:26	404	0.001063	5.79.199.116	GET /apple-touch-icon.png HTTP/1.0

We can count the number of values in each group:

In [38]:

grouped_status.size().plot(kind='bar')

Out[38]:

<matplotlib.axes.AxesSubplot at 0xcfe76ac>

We can look at individual groups:

In [39]:

t_span = '30t'
grouped_status.get_group(301)['Status'].resample(t_span, how='count').plot(color='g', label='301')
legend()
grouped_status.get_group(200)['Status'].resample(t_span, how='count').plot(color='b', secondary_y=True, label='200')

Out[39]:

<matplotlib.axes.AxesSubplot at 0xd0799ec>

IPs¶

Now I want to group by IPs and calculate how many requests we get from each of them:

In [40]:

ips = df.groupby('IP').size()

I don't want to see all of them, just the top 10:

In [41]:

ips.sort()
ips[-10:].plot(kind='barh')

Out[41]:

<matplotlib.axes.AxesSubplot at 0xd13c0ec>

I just want to have a nice table with the top 10 IPs, so I convert them to Data Frame :)

In [42]:

ips_fd = DataFrame({'Number of requests':ips[-10:]})
ips_fd = ips_fd.sort(columns='Number of requests',  ascending=False)
ips_fd

Out[42]:

	Number of requests
IP
66.249.73.137	814
178.154.206.250	168
85.114.191.235	165
173.199.115.187	127
85.142.104.227	126
178.70.210.234	120
85.142.104.228	111
95.118.89.119	93
217.69.133.68	74
92.255.161.124	72

We also can group by both IP and Status:

In [43]:

ips_status = df.groupby(['IP', 'Status']).size()

In [44]:

ips_status.sort()
ips_status[-20:].plot(kind='barh')

Out[44]:

<matplotlib.axes.AxesSubplot at 0xd3425cc>

Geographical information¶

I would like to know from which countries people or robots come to the website. It turns out that precise geolocation information is not free, but for our purposes free data bases are fine. We will use pygeoip to get information from GeoIP database, and that has to be downloaded from here.

Import module and initialise data base:

In [46]:

import pygeoip
gi = pygeoip.GeoIP('./GeoLiteCity.dat', pygeoip.MEMORY_CACHE)

This is how output of the pygeoip looks (we test our top IP from the previous section):

In [47]:

ipcon = gi.record_by_addr('64.233.161.99')
ipcon

Out[47]:

{'area_code': 650,
 'city': 'Mountain View',
 'continent': 'NA',
 'country_code': 'US',
 'country_code3': 'USA',
 'country_name': 'United States',
 'dma_code': 807,
 'latitude': 37.41919999999999,
 'longitude': -122.0574,
 'metro_code': 'San Francisco, CA',
 'postal_code': '94043',
 'region_name': 'CA',
 'time_zone': 'America/Los_Angeles'}

Guess who it is :) Looks like Google Bot is my main reader. That's rather sad :(

Loop to create a list of dictionaries with information about IP numbers. We also add information about the number of requests from each of the IPs:

In [48]:

ipcon = []
for iip in ips.index:
    rres = gi.record_by_addr(iip)
#    rres['ip']                 = iip
    rres['Number'] = ips[iip]
    #delete some fields we don't need
    del rres['area_code']
    del rres['dma_code']
    del rres['metro_code']
    del rres['postal_code']
    del rres['region_name']
    del rres['time_zone']
    del rres['country_code']
    ipcon.append(rres)
    

Create Data Frame from this list, use ips index:

In [49]:

reg = DataFrame(ipcon, index = ips.index)
reg.head()

Out[49]:

	Number	city	continent	country_code3	country_name	latitude	longitude
IP
112.78.139.130	1	Jakarta	AS	IDN	Indonesia	-6.1744	106.8294
116.25.249.140	1	Shenzhen	AS	CHN	China	22.5333	114.1333
117.199.193.19	1	Agra	AS	IND	India	27.1833	78.0167
117.27.175.158	1	Zhangzhou	AS	CHN	China	24.5133	117.6556
119.147.6.60	1	Guangzhou	AS	CHN	China	23.1167	113.2500

Group by Country:

In [50]:

country = reg.groupby('country_code3')

In [51]:

ff = country.Number.agg('sum').copy()
ff.sort( )
ff[-10:].plot(kind='barh')

Out[51]:

<matplotlib.axes.AxesSubplot at 0xd6c68cc>

The website is in Russian, so no surprises here.

Group by city:

In [52]:

city = reg.groupby('city')

In [53]:

ff = city.Number.agg('sum').copy()
ff.sort( )
ff[-20:].plot(kind='barh', figsize=(5,8))

Out[53]:

<matplotlib.axes.AxesSubplot at 0xd86fa2c>

The first 3 are kind of expected, but Tyumen, a Russian city located in the middle of Eurasia thousands of kilometers away from the nearest ocean coast, is a bit of a surprise (the website is about Oceanography). This is something to investigate later on.

We have lats and lons for the data, so it is natural to put them on the map:

In [55]:

from mpl_toolkits.basemap import Basemap
import matplotlib.cm as cm

In [56]:

m = Basemap(projection='robin',lon_0=0,resolution='c')
x, y = m(reg['longitude'],reg['latitude'])

In [67]:

figure(figsize=(15,15))
m.drawcoastlines(linewidth=0.25)
m.drawcountries(linewidth=0.25)

m.fillcontinents(color='coral',lake_color='aqua')
m.drawmapboundary(fill_color='white')
m.drawmeridians(np.arange(0,360,30))
m.drawparallels(np.arange(-90,90,30))
m.scatter(x,y,s=reg['Number']*3,c=reg['Number']/5,marker='o',zorder=4, cmap=cm.Paired,alpha=0.5)

Out[67]:

<matplotlib.collections.PathCollection at 0x15facacc>

Have closer look at Europe:

In [68]:

m = Basemap(projection='cyl',llcrnrlat=35,urcrnrlat=72,\
            llcrnrlon=-10,urcrnrlon=50,resolution='l')
x, y = m(reg['longitude'],reg['latitude'])

In [69]:

figure(figsize=(15,15))
m.drawcoastlines(linewidth=0.25)
m.drawcountries(linewidth=0.25)
m.fillcontinents(color='white',lake_color='aqua')
m.drawmapboundary(fill_color='aqua')
m.drawmeridians(np.arange(0,360,30))
m.drawparallels(np.arange(-90,90,30))
m.scatter(x,y,s=reg['Number']*30,c=reg['Number'],marker='o',zorder=4, cmap=cm.gist_ncar ,alpha=0.3)

Out[69]:

<matplotlib.collections.PathCollection at 0x169381ec>

Catching spammers (request methods)¶

Spammers are a huge problem, and it would be nice to filter them out. My website has very low user activity, and if somebody tries to POST something more than couple of times it's already suspicious. First we have to get information about what request method is being used for the requests. For this we have to analyse the Request column, which contain strings with requests. Request methods are always located in the beginning of the request string, so it's trivial to get it:

In [70]:

def gmet(x):
    x = x.split()
    return x[0]

We apply this little function to the Request column and create new Method column:

In [71]:

df['Method'] = df.Request.apply(gmet)

Group by Method and IP:

In [72]:

met = df.groupby(['Method', 'IP']).size()

In [73]:

met.head

Out[73]:

<bound method Series.head of Method  IP             
GET     100.44.124.8       26
        108.171.252.242    23
        109.165.31.156     12
        109.171.109.164     4
        109.191.82.110     14
        109.200.149.88     56
        109.205.252.77     32
        109.225.55.49      13
        109.226.101.186    45
        109.230.128.168     9
        109.230.128.169    13
        109.230.132.244     2
        109.87.146.37       4
        109.87.148.141     13
        111.73.45.49       10
...
POST    46.191.247.15      5
        46.29.255.194      5
        5.164.227.102      6
        66.23.234.10      10
        77.222.40.65      58
        78.30.212.183      4
        81.200.28.126      1
        85.114.191.235     2
        85.142.104.227     1
        91.207.4.78        5
        94.190.190.121     5
        94.45.180.57       5
        95.105.29.56       5
        95.27.28.213       5
        95.84.137.56       6
Length: 470>

Create POST time series (we have to have a copy of the time series to sort the data):

In [74]:

post = met['POST'].copy()
post.sort()

Plot the top 10 IPs that use POST method:

In [75]:

post[-10:].plot(kind='barh')

Out[75]:

<matplotlib.axes.AxesSubplot at 0x16ddaeec>

In [76]:

post[-5:]

Out[76]:

IP
173.208.108.10    10
66.23.234.10      10
213.248.47.229    11
160.75.185.7      15
77.222.40.65      58

What is the winner doing?

In [77]:

df[df.IP =='77.222.40.65'][0:5]

Out[77]:

	Status	b	IP	Request	Method
Time
2013-03-16 08:28:43	403	0.001054	77.222.40.65	POST /podcast/wp-cron.php?doing_wp_cron=136340...	POST
2013-03-16 08:45:08	403	0.001054	77.222.40.65	POST /podcast/wp-cron.php?doing_wp_cron=136340...	POST
2013-03-16 09:19:52	403	0.001054	77.222.40.65	POST /podcast/wp-cron.php?doing_wp_cron=136341...	POST
2013-03-16 09:36:26	403	0.001054	77.222.40.65	POST /podcast/wp-cron.php?doing_wp_cron=136341...	POST
2013-03-16 09:40:22	403	0.001054	77.222.40.65	POST /podcast/wp-cron.php?doing_wp_cron=136341...	POST

Making some Wordpress cron requests... This IP looks familiar... Oh, it's actually the IP of my website - this is my Wordpress doing something. Have to investigate later if I can disable this thing (anyway it's not working, getting 403 error) :)

Number two:

In [78]:

df[df.IP =='160.75.185.7'][0:10]

Out[78]:

	Status	b	IP	Request	Method
Time
2013-03-16 17:27:12	200	0.002224	160.75.185.7	GET /administrator/index.php HTTP/1.0	GET
2013-03-16 17:27:12	200	0.002224	160.75.185.7	GET /administrator/index.php HTTP/1.0	GET
2013-03-16 17:27:12	200	0.002224	160.75.185.7	GET /administrator/index.php HTTP/1.0	GET
2013-03-16 17:27:12	200	0.002224	160.75.185.7	GET /administrator/index.php HTTP/1.0	GET
2013-03-16 17:27:12	200	0.002224	160.75.185.7	GET /administrator/index.php HTTP/1.0	GET
2013-03-16 17:27:12	200	0.002224	160.75.185.7	POST /administrator/index.php HTTP/1.0	POST
2013-03-16 17:27:12	200	0.002224	160.75.185.7	POST /administrator/index.php HTTP/1.0	POST
2013-03-16 17:27:12	200	0.002224	160.75.185.7	GET /administrator/index.php HTTP/1.0	GET
2013-03-16 17:27:12	200	0.002224	160.75.185.7	POST /administrator/index.php HTTP/1.0	POST
2013-03-16 17:27:12	200	0.002224	160.75.185.7	POST /administrator/index.php HTTP/1.0	POST

This is suspicious; looks like it tries to get to the website's admin area - candidate for blocking.

Number three:

In [79]:

df[df.IP =='213.248.47.229'][0:10]

Out[79]:

	Status	b	IP	Request	Method
Time
2013-03-16 11:00:34	200	0.012539	213.248.47.229	GET /forum/ucp.php?mode=register&sid=8203c4843...	GET
2013-03-16 11:00:34	200	0.026559	213.248.47.229	GET /forum/index.php HTTP/1.0	GET
2013-03-16 11:00:34	200	0.012011	213.248.47.229	GET /forum/ucp.php?mode=register&sid=c78851b8f...	GET
2013-03-16 11:00:39	200	0.016414	213.248.47.229	POST /forum/ucp.php?mode=register&sid=12986e06...	POST
2013-03-16 11:00:39	200	0.016414	213.248.47.229	POST /forum/ucp.php?mode=register HTTP/1.0	POST
2013-03-16 11:00:44	200	0.015024	213.248.47.229	POST /forum/ucp.php?mode=register HTTP/1.0	POST
2013-03-16 11:00:45	200	0.015028	213.248.47.229	POST /forum/ucp.php?mode=register HTTP/1.0	POST
2013-03-16 11:00:49	200	0.015024	213.248.47.229	POST /forum/ucp.php?mode=register HTTP/1.0	POST
2013-03-16 11:00:50	200	0.015028	213.248.47.229	POST /forum/ucp.php?mode=register HTTP/1.0	POST
2013-03-16 11:00:55	200	0.015024	213.248.47.229	POST /forum/ucp.php?mode=register HTTP/1.0	POST

This one tries to register on the forum every 5 second or so - clearly forum spammer. Block.