GDELT Data Wrangler

James Houghton; April 8th, 2014; [email protected]

One of the challenges of dealing with GDELT is that its size makes working with it in a SQL database difficult. It is easier (although slower) to extract the subset of the data we wish to work with, and do our actual data investigation afterwards. If we're interested in data broken down by date, we have it easy: GDELT files are provided that way. If instead we want to look at a single country, we have some work to do. This notebook parses through each of the GDELT files, one at a time, extracts the relevant lines, and exports them to a smaller set of csv files. It then gives the option to load these files into a Pandas DataFrame, pickle the result, and remove the intermediate files.

There are certainly more computationally efficient methods for doing this, but this one works well enough.

1. Identify the files we need to download

Links to each of the datafiles are available on the GDELT website. We extract this list to help collect and process the data. We probably want to perform this task every time we rebuild a dataset, so that we know we have the most up-to-date set of files.

In [ ]:
import requests
import lxml.html as lh

gdelt_base_url = 'http://data.gdeltproject.org/events/'

# get the list of all the links on the gdelt file page
page = requests.get(gdelt_base_url+'index.html')
doc = lh.fromstring(page.content)
link_list = doc.xpath("//*/ul/li/a/@href")

# separate out those links that begin with four digits 
file_list = [x for x in link_list if str.isdigit(x[0:4])]
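
As a quick sanity check (not part of the original workflow), we can print how many file links we found and what a few of them look like; the exact names depend on the state of the GDELT index page at the time you run this.

In [ ]:
# quick look at what we collected: the number of links and a sample of the names
print len(file_list)
print file_list[:5]
print file_list[-5:]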

2. Extract Relevant GDELT Rows to Intermediate Files

In this example, we pull out all of the rows with a specific country code, and create a set of files which mirrors that of GDELT itself in quantity and format. Each output file is smaller than its corresponding input file.

As we're pulling out rows based upon location, we'll use the following GDELT fields (the numbers in parentheses are the zero-based column indices used in the code below):

  • ActionGeo_CountryCode (51)
  • Actor1Geo_CountryCode (37)
  • Actor2Geo_CountryCode (44)

These use FIPS country codes rather than ISO 3166 codes, which can be a source of confusion; a few examples of where the two differ are shown below.
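
This is a hand-picked, illustrative sample of ISO-to-FIPS differences, not a complete lookup; double-check the FIPS code for your country of interest against the GDELT documentation before running the extraction.

In [ ]:
# a few hand-picked ISO 3166 -> FIPS 10-4 differences (illustrative only, not exhaustive)
iso_to_fips_examples = {
    'GB': 'UK',  # United Kingdom
    'DE': 'GM',  # Germany
    'ES': 'SP',  # Spain
    'SE': 'SW',  # Sweden
    'CH': 'SZ',  # Switzerland
}
print iso_to_fips_examples['GB']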

The general algorithm we will follow can be loosely described as:

  • Get a list of the GDELT files
  • For each element in the list
    • Check whether the file is available locally (We assume we can afford to keep a local copy of the compressed files, so that we don't have to continually ping the GDELT server.)
      • If not, download it from the web
    • Unzip the compressed raw GDELT file
    • For each resulting CSV file (probably only one)
      • Create and open an output file
      • For each line of the input file
        • Read into a string
        • Split string by tab delimiter
        • Check for the desired values in the appropriate list indexes
          • If they are not there, continue the loop
          • If they are there, write the line to the output file
      • Increment a current-file counter
      • Close the input and output files
      • Delete the input file

We reset infilecounter and outfilecounter outside the main algorithm cell so that if the algorithm encounters an error and quits, we can pick up where we left off.

We check for (and potentially download) each file as we come to it, rather than collecting all of the files in advance. That way, we can start running the processing code as soon as possible, which helps with debugging because we "fail fast".

Remember to set the local_path to somewhere you're happy to store large files.
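
One practical note: the extraction loop below writes its output into a country/ subdirectory of local_path (and unzips the raw files into tmp/), and Python's open() will not create missing directories for us. A minimal sketch to create them up front, assuming the same local_path used in the main cell:

In [ ]:
import os

local_path = '/Users/me/Desktop/GDELT_Data/'  # same path as in the extraction cell below

# open() will not create missing directories, so make sure the working
# subdirectories used by the extraction loop exist before we start
for subdir in ['tmp/', 'country/']:
    if not os.path.exists(local_path + subdir):
        os.makedirs(local_path + subdir)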

In [ ]:
infilecounter = 0
outfilecounter = 0
In [ ]:
import os.path
import urllib
import zipfile
import glob
import operator

local_path = '/Users/me/Desktop/GDELT_Data/'

fips_country_code = 'UK'

for compressed_file in file_list[infilecounter:]:
    print compressed_file,
    
    # if we don't have the compressed file stored locally, go get it. Keep trying if necessary.
    while not os.path.isfile(local_path+compressed_file): 
        print 'downloading,',
        urllib.urlretrieve(url=gdelt_base_url+compressed_file, 
                           filename=local_path+compressed_file)
        
    # extract the contents of the compressed file to a temporary directory    
    print 'extracting,',
    z = zipfile.ZipFile(file=local_path+compressed_file, mode='r')    
    z.extractall(path=local_path+'tmp/')
    
    # parse each of the csv files in the working directory, 
    print 'parsing,',
    for infile_name in glob.glob(local_path+'tmp/*'):
        outfile_name = local_path+'country/'+fips_country_code+'%04i.tsv'%outfilecounter
        # open the infile and outfile
        with open(infile_name, mode='r') as infile, open(outfile_name, mode='w') as outfile:
            for line in infile:
                # extract lines with our interest country code
                if fips_country_code in operator.itemgetter(51, 37, 44)(line.split('\t')):    
                    outfile.write(line)
            outfilecounter +=1
            
        # delete the temporary file
        os.remove(infile_name)
    infilecounter +=1
    print 'done'
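
Once the loop has finished (or even while it is still running), a quick optional check shows how many intermediate country files have been written so far:

In [ ]:
import glob
# optional check: how many intermediate country files have been written?
country_files = glob.glob(local_path + 'country/' + fips_country_code + '*')
print len(country_files)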

3. Build intermediate files into a Pandas DataFrame

We may be content to use the data we just extracted into csv files as it stands. However, if we are working in Python, it is convenient to load the files into a DataFrame, save that DataFrame to a pickle, and delete the temporary files. This saves space on disk and makes our future analysis of the data simpler.

Our algorithm here is simple: we build a DataFrame from each of the temporary files, concatenate them into one big DataFrame, save that DataFrame, and delete the temporary files.

We use a helper file here which lists the column names. You can download the file to your working directory with this link:

http://gdeltproject.org/data/lookups/CSV.header.fieldids.xlsx

In [ ]:
import glob
import pandas as pd

# Get the GDELT field names from a helper file
colnames = pd.read_excel('CSV.header.fieldids.xlsx', sheetname='Sheet1', 
                         index_col='Column ID', parse_cols=1)['Field Name']

# Build DataFrames from each of the intermediary files
files = glob.glob(local_path+'country/'+fips_country_code+'*')
DFlist = []
for active_file in files:
    print active_file
    DFlist.append(pd.read_csv(active_file, sep='\t', header=None, dtype=str,
                              names=colnames, index_col=['GLOBALEVENTID']))

# Merge the file-based dataframes and save a pickle
DF = pd.concat(DFlist)
DF.to_pickle(local_path+'backup'+fips_country_code+'.pickle')    
    
# once everything is safely stored away, remove the temporary files
for active_file in files:
    os.remove(active_file)
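
When you return to the analysis later, the pickled DataFrame can be reloaded directly (using the same local_path and fips_country_code as above):

In [ ]:
import pandas as pd

# reload the saved DataFrame for later analysis
DF = pd.read_pickle(local_path + 'backup' + fips_country_code + '.pickle')
print DF.shape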