Department of Computational Social Science, George Mason University
If you're reading this, chances are you're already excited about Global Data on Events, Location and Tone, better known as GDELT. If you aren't, you should be. Lots has been written about how revolutionary this dataset might be, and I won't try to add to it here.
Instead, let's dive right in! In this tutorial, I'll go through extracting some basic time series from GDELT.
To follow along, go download the data from the GDELT website and unzip it. The data is about 4.6 GB uncompressed, in a series of text files, one per year.
(First, some code to style the IPython notebook and make it more readable. I've adapted the CSS styling from the excellent Probabilistic Programming and Bayesian Methods for Hackers.)
from IPython.core.display import HTML
styles = open("Style.css").read()
HTML(styles)
We're going to need only a few libraries to start with: Matplotlib for visualization, datetime for handling date objects, and Pandas for handling, aggregating and reshaping some of the data. Pandas provides great functionality to easily plot time series, so we'll use it for that too. We'll also import defaultdict while we're at it, since it's often useful for data collection.
import datetime as dt
from collections import defaultdict
import matplotlib.pyplot as plt
import pandas
plt.rcParams['figure.figsize'] = [8, 4] # Set default figure size
# Set this variable to the directory where the GDELT data files are
PATH = "GDELT.1979-2012.reduced/"
# Peeking at the data:
!head -n 5 GDELT.1979-2012.reduced/2010.reduced.txt
Day Actor1Code Actor2Code EventCode QuadCategory GoldsteinScale Actor1Geo_Lat Actor1Geo_Long Actor2Geo_Lat Actor2Geo_Long ActionGeo_Lat ActionGeo_Long
20100101 AFG AFGCOP 173 4 -5.0 34.9669 69.265 34.9669 69.265 25 45
20100101 AFG AFGCVL 080 1 5.0 31 64 31 64 31 64
20100101 AFG AFGCVL 190 4 -10.0 35.3472 70.1485 35.3472 70.1485 35.3472 70.1485
20100101 AFG AFGGOV 043 2 2.8 31 64 31 64 31 64
It's important to know how big our dataset is. It's also important to know if the data available over time is biased -- does GDELT have more events for recent years than for distant ones? If so, is that because more has happened recently, or because the data collection has gotten better?
The paper introducing GDELT (warning: large PDF) goes over this, but it'll be good practice to replicate some basic diagnostics.
So let's start with a simple count of just how many events -- all events -- the dataset has per month (which is a common unit of temporal aggregation). To do that, we'll open each file, figure out which month each event (meaning each row) occurred in, and add them up.
monthly_data = defaultdict(int) # We'll use this to store the counts
count = 0 # While we're at it, let's count how many records there are, total.
for year in range(1979, 2013):
    #print(year) # Uncomment this line to see the program's progress.
    with open(PATH + str(year) + ".reduced.txt") as f:
        next(f) # Skip the header row.
        for raw_row in f:
            try:
                row = raw_row.split("\t")
                # Get the date, which is in YYYYMMDD format:
                date_str = row[0]
                event_year = int(date_str[:4])
                event_month = int(date_str[4:6])
                date = dt.datetime(event_year, event_month, 1)
                monthly_data[date] += 1
                count += 1
            except (ValueError, IndexError):
                pass # Skip malformed rows for now.
print("Total rows processed: %d" % count)
print("Total months: %d" % len(monthly_data))
Total rows processed: 67927691
Total months: 402
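The month-bucketing logic can be tried out without the full dataset. The following is a minimal sketch, not the code used for the counts above: `count_by_month` is a hypothetical helper that takes any iterable of tab-separated rows, and the sample rows are made up for illustration. It also counts the rows it skips, which makes it easier to tell how much data the error handling is discarding.

```python
import datetime as dt
from collections import defaultdict


def count_by_month(rows):
    """Count events per month from tab-separated rows whose first field is YYYYMMDD.

    Returns (counts, skipped): a dict mapping the first day of each month to the
    number of rows in that month, and the number of rows that could not be parsed.
    """
    counts = defaultdict(int)
    skipped = 0
    for raw_row in rows:
        date_str = raw_row.split("\t")[0]
        try:
            month_start = dt.datetime(int(date_str[:4]), int(date_str[4:6]), 1)
        except ValueError:
            # Raised for non-numeric text or an impossible month such as 13.
            skipped += 1
            continue
        counts[month_start] += 1
    return dict(counts), skipped


# Made-up sample rows in the same layout as the reduced GDELT files.
sample_rows = [
    "20100101\tAFG\tAFGCOP\t173\t4\t-5.0",
    "20100115\tAFG\tAFGCVL\t080\t1\t5.0",
    "20100203\tISR\tPSE\t043\t2\t2.8",
    "not-a-date\tISR\tPSE\t043\t2\t2.8",
]

counts, skipped = count_by_month(sample_rows)
for month_start in sorted(counts):
    print(month_start.strftime("%Y-%m"), counts[month_start])
print("Skipped rows:", skipped)
```

Passing an open file object instead of `sample_rows` should work the same way, since both are iterables of lines.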
Now we just turn this dictionary into a Pandas series, and plot it. Pandas will automatically recognize that we're dealing with a time series, because it's useful like that.
monthly_events = pandas.Series(monthly_data)
monthly_events.plot()
As we might expect, the number of events in the dataset isn't uniform, and goes up rapidly in the later years.
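To put rough numbers on that growth, the monthly series can be rolled up to yearly totals. This is a minimal sketch using made-up counts rather than the real GDELT figures; `toy_monthly` stands in for the `monthly_events` series built above.

```python
import datetime as dt

import pandas

# Made-up monthly counts standing in for the real monthly_events series.
toy_counts = {
    dt.datetime(1979, 1, 1): 100,
    dt.datetime(1979, 2, 1): 120,
    dt.datetime(2012, 1, 1): 5000,
    dt.datetime(2012, 2, 1): 5500,
}
toy_monthly = pandas.Series(toy_counts).sort_index()

# Group by the year of each timestamp and sum the months within it.
yearly_totals = toy_monthly.groupby(toy_monthly.index.year).sum()
print(yearly_totals)

# Ratio of the last year's volume to the first year's volume.
growth_ratio = yearly_totals.iloc[-1] / float(yearly_totals.iloc[0])
print("Growth ratio:", growth_ratio)
```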
One important and useful feature of GDELT is the QuadCategory classification of each event. Per the documentation, each event has one of the following quad categories:
1. Material Cooperation
2. Verbal Cooperation
3. Verbal Conflict
4. Material Conflict
Let's repeat the analysis above, but now examine material cooperation and conflict. Very (very very) roughly, is the world becoming more cooperative, or more violent?
material_coop = defaultdict(int)
material_conf = defaultdict(int)
for year in range(1979, 2013):
    with open(PATH + str(year) + ".reduced.txt") as f:
        next(f) # Skip the header row.
        for raw_row in f:
            try:
                row = raw_row.split("\t")
                # Check the quadcat, and skip if not relevant:
                if row[4] not in ['1', '4']:
                    continue
                # Get the date, which is in YYYYMMDD format:
                date_str = row[0]
                event_year = int(date_str[:4])
                event_month = int(date_str[4:6])
                date = dt.datetime(event_year, event_month, 1)
                if row[4] == '1':
                    material_coop[date] += 1
                elif row[4] == '4':
                    material_conf[date] += 1
            except (ValueError, IndexError):
                pass # Skip malformed rows for now.
# Convert both into time series:
monthly_coop = pandas.Series(material_coop)
monthly_conf = pandas.Series(material_conf)
# Join the time series together into a DataFrame
trends = pandas.DataFrame({"Material_Cooperation": monthly_coop,
"Material_Conflict": monthly_conf})
trends.plot()
Both series seem to have roughly the same shape as the total counts, with material conflict events slightly but persistently outnumbering material cooperation events.
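Because both series track the overall growth in event volume, raw counts say little about whether conflict is becoming relatively more common. One hedged way to check is to divide each series by the total number of events in the same month. The sketch below uses made-up numbers; `toy_total`, `toy_coop` and `toy_conf` are stand-ins for `monthly_events`, `monthly_coop` and `monthly_conf`, and this normalization is an illustration, not part of the analysis above.

```python
import datetime as dt

import pandas

months = [dt.datetime(2010, 1, 1), dt.datetime(2010, 2, 1), dt.datetime(2010, 3, 1)]

# Made-up monthly counts: all events, material cooperation, material conflict.
toy_total = pandas.Series([1000, 2000, 4000], index=months)
toy_coop = pandas.Series([50, 100, 200], index=months)
toy_conf = pandas.Series([100, 240, 400], index=months)

# Division aligns on the date index, so each month is divided by its own total.
shares = pandas.DataFrame({
    "Material_Cooperation": toy_coop / toy_total,
    "Material_Conflict": toy_conf / toy_total,
})
print(shares)
```

In this toy data the raw counts quadruple, while the cooperation share stays flat and the conflict share only rises briefly in the second month.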
The Israeli-Palestinian conflict gets a lot of media attention, so we would expect it to be well-represented in the dataset. It's generally considered to be fairly important, with effects spilling over far from where it is actually taking place. It is also one of the case studies that Leetaru and Schrodt use to compare GDELT against a similar dataset in their paper.
All GDELT events have a source and a target actor. These are coded down to an impressive level of specificity, often down to whether a political party is a member of the government or the opposition when the event occurs. For a first pass, however, the top-level actor codes will suffice: ISR for Israel and all Israeli actors, and either PSE or PAL for all Palestinian actors. We'll grab only those events which involve Israel-coded actors acting on Palestinian-coded actors, or vice versa.
Incidentally: learn from my mistakes, and RTFM. My first pass of this analysis was way off because I didn't read the GDELT documentation closely enough, and thought that the actor prefix for Palestine was PAL. In fact, almost all of the events are coded as PSE, the UN code for the Palestinian Occupied Territories. RTFM.
data = []
for year in range(1979, 2013):
    with open(PATH + str(year) + ".reduced.txt") as f:
        next(f) # Skip the header row.
        for raw_row in f:
            row = raw_row.split("\t")
            actor1 = row[1][:3]
            actor2 = row[2][:3]
            both = actor1 + actor2
            if "ISR" in both and ("PAL" in both or "PSE" in both):
                event_year = int(row[0][:4])
                month = int(row[0][4:6])
                day = int(row[0][6:])
                quad_cat = row[4]
                data.append([event_year, month, day, actor1, actor2, quad_cat])
print("Israeli-Palestinian Conflict Records: %d" % len(data))
Israeli-Palestinian Conflict Records: 528698
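The filter above tests the two concatenated three-letter prefixes. A more explicit way to express the same intent is to check each actor separately, requiring one Israeli actor and one Palestinian actor. The sketch below is an illustration only: the helper and constant names are hypothetical, and the actor codes in `examples` are made up rather than taken from the data.

```python
ISRAELI_PREFIX = "ISR"
PALESTINIAN_PREFIXES = ("PSE", "PAL")


def is_israeli_palestinian_dyad(actor1_code, actor2_code):
    """Return True if one actor is Israeli and the other is Palestinian."""
    a1 = actor1_code[:3]
    a2 = actor2_code[:3]
    return ((a1 == ISRAELI_PREFIX and a2 in PALESTINIAN_PREFIXES)
            or (a2 == ISRAELI_PREFIX and a1 in PALESTINIAN_PREFIXES))


# Made-up actor code pairs for illustration.
examples = [
    ("ISRGOV", "PSE"),     # Israeli actor, Palestinian target
    ("PALINS", "ISRMIL"),  # Palestinian actor, Israeli target
    ("ISR", "USA"),        # Israeli actor, but no Palestinian counterpart
    ("AFG", "AFGCOP"),     # unrelated dyad
]
results = [is_israeli_palestinian_dyad(a1, a2) for a1, a2 in examples]
print(results)
```

Keeping the prefixes in named constants makes it easy to drop `PAL` and see how few records depend on it.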
Next, we can turn this data into a Pandas DataFrame; essentially a big table we can manipulate.
ilpalcon = pandas.DataFrame(data,
columns=["Year", "Month", "Day", "Actor1", "Actor2", "QuadCat"])
ilpalcon.head()
| | Year | Month | Day | Actor1 | Actor2 | QuadCat |
---|---|---|---|---|---|---|
0 | 1979 | 1 | 3 | ISR | PSE | 2 |
1 | 1979 | 1 | 3 | PSE | ISR | 2 |
2 | 1979 | 1 | 4 | ISR | PSE | 3 |
3 | 1979 | 1 | 4 | PSE | ISR | 3 |
4 | 1979 | 1 | 5 | ISR | PSE | 3 |
Pandas provides some powerful table manipulation tools; I'm partial to pivot tables, possibly due to several years of using Excel heavily for work. Let's pivot the data so that we count the number of events by QuadCat for each month.
pivot = pandas.pivot_table(ilpalcon, values="Day", index=["Year", "Month"], columns="QuadCat", aggfunc=len)
pivot = pivot.fillna(0) # Replace any missing data with zeros
pivot = pivot.reset_index() # Make Year and Month regular columns
pivot.head()
QuadCat | Year | Month | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|---|
0 | 1979 | 1 | 1 | 16 | 8 | 13 |
1 | 1979 | 2 | 0 | 14 | 5 | 5 |
2 | 1979 | 3 | 17 | 47 | 13 | 15 |
3 | 1979 | 4 | 2 | 18 | 10 | 56 |
4 | 1979 | 5 | 14 | 55 | 26 | 40 |
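The same monthly counts can be cross-checked with a `groupby`. The following is a small sketch on made-up rows; `toy` is a stand-in for the `ilpalcon` table, and the `groupby`/`unstack` route is an alternative shown for comparison, not the method used above.

```python
import pandas

toy = pandas.DataFrame(
    [
        [1979, 1, 3, "ISR", "PSE", "2"],
        [1979, 1, 3, "PSE", "ISR", "2"],
        [1979, 1, 4, "ISR", "PSE", "3"],
        [1979, 2, 9, "PSE", "ISR", "4"],
    ],
    columns=["Year", "Month", "Day", "Actor1", "Actor2", "QuadCat"],
)

# Count rows per (Year, Month, QuadCat), then spread QuadCat across columns.
# fill_value=0 plays the role of fillna(0) for month/category pairs with no events.
counts = toy.groupby(["Year", "Month", "QuadCat"]).size().unstack(fill_value=0)
print(counts)
```

Note that a category with no events anywhere in the table (here, "1") does not get a column at all, so it is worth checking the column list before selecting by name.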
Now that we have a nice table of monthly event counts, we need to index it by date. It would also be nice to rename the columns to the QuadCat description. To create a date from the Year and Month, we need to create a function that generates a datetime object from them, and apply it to each row.
# date-generating function:
get_date = lambda x: dt.datetime(year=int(x["Year"]), month=int(x["Month"]), day=1)
pivot["date"] = pivot.apply(get_date, axis=1) # Apply row-wise
pivot = pivot.set_index("date") # Set the new date column as the index
# Now we no longer need the Year and Month columns, so let's drop them:
pivot = pivot[["1", "2", "3", "4"]]
# Rename the QuadCat columns
pivot = pivot.rename(columns={"1": "Material Cooperation",
"2": "Verbal Cooperation",
"3": "Verbal Conflict",
"4": "Material Conflict"})
pivot.plot(figsize=(8,4))
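The row-wise `apply` used to build the date index is easy to read, but it calls a Python function once per row. A vectorized alternative is `pandas.to_datetime`, which can assemble dates from columns named `year`, `month` and `day`. The sketch below uses a made-up table; `toy` stands in for the pivot table after `reset_index()`.

```python
import datetime as dt

import pandas

toy = pandas.DataFrame({
    "Year": [1979, 1979, 1980],
    "Month": [1, 2, 12],
    "1": [1, 0, 17],
})

# Build a frame with the column names to_datetime expects, pinning every
# date to the first day of its month.
date_parts = pandas.DataFrame({
    "year": toy["Year"],
    "month": toy["Month"],
    "day": 1,
})
toy["date"] = pandas.to_datetime(date_parts)
toy = toy.set_index("date")
print(toy.index)
```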
Interestingly, it looks like Verbal Cooperation is the most common form of interaction, even when violence (Material Conflict) spikes. We can also clearly see the peace process of the 1990s, where Verbal Cooperation events significantly outnumber all others, and the spike in Material Conflict when the Second Intifada breaks out.
Finally, let's see what a general 'peace index' might look like, measuring the difference in volume between cooperation and conflict events.
pivot["Peace_Index"] = (pivot["Material Cooperation"]
+ pivot["Verbal Cooperation"]
- pivot["Verbal Conflict"]
- pivot["Material Conflict"])
pivot["Peace_Index"].plot(figsize=(8,4))
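Since the raw difference grows with overall event volume, a volume-adjusted variant may be easier to compare across decades. The sketch below is an illustration on made-up numbers, not part of the analysis above: it divides the difference by the total number of events in each month, giving a value between -1 (all conflict) and 1 (all cooperation). `toy` stands in for the `pivot` table, and `Normalized_Peace_Index` is a hypothetical column name.

```python
import pandas

# Made-up monthly counts for the four quad categories.
toy = pandas.DataFrame({
    "Material Cooperation": [10, 0, 5],
    "Verbal Cooperation": [30, 10, 5],
    "Verbal Conflict": [5, 20, 5],
    "Material Conflict": [5, 10, 5],
})

cooperation = toy["Material Cooperation"] + toy["Verbal Cooperation"]
conflict = toy["Verbal Conflict"] + toy["Material Conflict"]

# Raw index, as in the text: cooperation events minus conflict events.
toy["Peace_Index"] = cooperation - conflict
# Hypothetical normalized variant: share-based, so bounded in [-1, 1].
toy["Normalized_Peace_Index"] = toy["Peace_Index"] / (cooperation + conflict)
print(toy[["Peace_Index", "Normalized_Peace_Index"]])
```

A month with no events at all would produce a missing value in the normalized column, so it may need separate handling.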
This is barely scratching the surface of what we can do with the GDELT data. Hopefully it'll help you get started interacting with the data, so that you can do real work with it. Happy analyzing!