An IPython Notebook to analyze the Gaza-Israel 2012 crisis

The Guardian is tracking and mapping live (link) the recent incidents in Gaza and Israel. In keeping with their data-journalism spirit, they are sharing the data as a publicly accessible Google Fusion Table.

This notebook is an attempt to show, on the one hand, how the Python stack can be used for a real-world data hack and, on the other, to offer analysis that goes beyond mapping the events, exploiting both the spatial and the temporal dimensions of the data.

  • The source document (.ipynb file) is stored on Github as a gist here, which means you can fork it and use it as a start for your own data-hack.
  • A viewable version is available here, via the IPython Notebook Viewer.

Collaborate on the notebook!!!

In its initial version (Nov. 20th), the notebook only contains code to stream the data from the Google Fusion Table into a pandas DataFrame (which means you get the data ready to hack!). Step in and collaborate to make it a good example of how Python can help analyze real-world data. Add a new view, a quick visualization, a summary statistic, or a fancy model that helps understand the data better!

To contribute, just fork the gist as you would with any git repository.

Happy hacking!

In [91]:
import datetime
import urllib2, urllib           # Python 2 HTTP libraries
import pandas as pd
from StringIO import StringIO    # wraps the raw CSV string as a file-like object

# NOTE: the bare plotting calls further down (figure, scatter, title, cm,
# legend) assume the notebook runs in pylab mode (e.g. started with %pylab inline)
In [92]:
# Trick from http://stackoverflow.com/questions/7800213/can-i-use-pythons-csv-reader-with-google-fusion-tables

request_url = 'https://www.google.com/fusiontables/api/query' 
query = 'SELECT * FROM 1KlX4PFF81wlx_TJ4zGudN_NoV_gq_GwrxuVau_M'

url = "%s?%s" % (request_url, urllib.urlencode({'sql': query}))
serv_req = urllib2.Request(url=url)
serv_resp = urllib2.urlopen(serv_req)
table = serv_resp.read()
print '\nLast pull of data from the Google FusionTable: ', datetime.datetime.now()

Last pull of data from the Google FusionTable:  2012-11-20 23:15:46.881851
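
The cell above uses Python 2 idioms (urllib2, StringIO, print statements). For readers on Python 3, a minimal sketch of the same request (assuming the endpoint behaves the same, and reusing the query string defined above) could look like this:

In [ ]:
from urllib.parse import urlencode
from urllib.request import urlopen

# Same endpoint and SQL query as above; the response body is a CSV string
url = 'https://www.google.com/fusiontables/api/query?' + urlencode({'sql': query})
table = urlopen(url).read().decode('utf-8')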

In [93]:
def parse_loc(loc, ret_lon=True):
    '''
    Parse a 'lon, lat' location string and return one coordinate as a
    float: the first if `ret_lon` is True, the second otherwise. Missing
    or malformed values return None.
    '''
    try:
        lon, lat = loc.split(',')
        lon, lat = lon.strip(' '), lat.strip(' ')
        lon, lat = map(float, [lon, lat])
        if ret_lon:
            return lon
        else:
            return lat
    except (AttributeError, ValueError):
        # NaN (float) values raise AttributeError on .split();
        # malformed strings raise ValueError on unpacking or float()
        return None
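
As a quick sanity check, the parser can be exercised on a made-up coordinate string (the values below are illustrative, not taken from the table):

In [ ]:
print parse_loc('31.52, 34.45')         # 31.52 (first coordinate)
print parse_loc('31.52, 34.45', False)  # 34.45 (second coordinate)
print parse_loc(None)                   # None (missing value)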
In [94]:
# Load the CSV payload into a DataFrame and derive numeric coordinates
db = pd.read_csv(StringIO(table))
db['lon'] = db['Location (approximate)'].apply(lambda x: parse_loc(x))
db['lat'] = db['Location (approximate)'].apply(lambda x: parse_loc(x, ret_lon=False))
db['Date'] = db['Date'].apply(pd.to_datetime)
db
Out[94]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 97 entries, 0 to 96
Data columns:
Date                      97  non-null values
Day                       97  non-null values
Name of place             96  non-null values
Location (approximate)    94  non-null values
Details                   97  non-null values
Source url                97  non-null values
Image url                 13  non-null values
Icon 1                    97  non-null values
lon                       92  non-null values
lat                       92  non-null values
dtypes: datetime64[ns](1), float64(2), object(7)
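
A few rows lack usable coordinates (92 non-null lon/lat values out of 97 entries). A quick way to inspect those incomplete records, using standard pandas boolean indexing:

In [ ]:
# Rows whose location string could not be parsed into coordinates
db[db['lon'].isnull()][['Date', 'Name of place', 'Location (approximate)']]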

Very basic descriptive analysis

  • Volume of incidents by day
In [95]:
t = db['Date']
t = t.reindex(t)                            # index the series by its own timestamps
by_day = t.groupby(lambda x: x.day).size()  # count events per day of the month
by_day.plot(kind='bar')
title('Number of events by day')
Out[95]:
<matplotlib.text.Text at 0x49b4b10>
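
The same daily tabulation can also be obtained without plotting, directly from the parsed dates (a minimal pandas-only sketch):

In [ ]:
# Count events per day of the month; these are the numbers behind the bar chart
db['Date'].apply(lambda x: x.day).value_counts().sort_index()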
  • Location of events coloured by day
In [139]:
f = figure(figsize=(10, 6))
ax = f.add_subplot(111)
# All events as black dots in the background
x, y = db['lon'], db['lat']
s = scatter(x, y, marker='.', color='k')
# Overlay each day's events in a different colour from the Set1 colormap
for d, day in db.set_index('Date').groupby(lambda x: x.day):
    x, y = day['lon'], day['lat']
    c = cm.Set1(d/30.)
    s = scatter(x, y, marker='^', color=c, label=str(d), s=20)
ax.get_yaxis().set_visible(False)
ax.get_xaxis().set_visible(False)
legend(loc=2)
title('Spatial distribution of events by day')
ax.set_axis_bgcolor("0.2")
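
One possible next step, building on the same day-by-day grouping, would be to check whether the spatial focus of events drifts over time by computing each day's mean centre (a simple sketch, assuming the DataFrame built above):

In [ ]:
# Mean centre (average lon/lat) of the events recorded on each day
centers = db.set_index('Date')[['lon', 'lat']].groupby(lambda x: x.day).mean()
centers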