GDELT Data Wrangler

James Houghton; April 8th, 2014; [email protected]

One of the challenges of dealing with GDELT is that its size makes working with it in a SQL database difficult. It is easier (although slower) to extract the subset of the data we wish to work with, and do our actual data investigation afterwards. If we're interested in data broken down by date, we have it easy: GDELT files are provided that way. If instead we want to look at a single country, we have some work to do. This notebook parses through each of the GDELT files, one at a time, extracts the relevant lines, and exports them to a smaller set of csv files. It then gives the option to load these files into a Pandas DataFrame, pickle the result, and remove the intermediate files.

There are certainly more computationally efficient methods for doing this, but this one works well enough.

1. Identify the files we need to download

Links to each of the datafiles are available on the GDELT website. We extract this list to help collect and process the data. We probably want to perform this task every time we rebuild a dataset, so that we know we have the most up-to-date set of files.

In [ ]:
import requests
import lxml.html as lh

gdelt_base_url = 'http://data.gdeltproject.org/events/'

# get the list of all the links on the gdelt file page
page = requests.get(gdelt_base_url+'index.html')
doc = lh.fromstring(page.content)
link_list = doc.xpath("//*/ul/li/a/@href")

# separate out those links that begin with four digits 
file_list = [x for x in link_list if str.isdigit(x[0:4])]
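
As a quick sanity check (not part of the original workflow), we can print how many file links we found and what a few of them look like; the exact names depend on the state of the GDELT index page at the time you run this.

In [ ]:
# quick look at what we collected: the number of links and a sample of the names
print len(file_list)
print file_list[:5]
print file_list[-5:]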

2. Extract Relevant GDELT Rows to Intermediate Files

In this example, we pull out all of the rows with a specific country code, and create a set of files which mirrors that of GDELT itself in quantity and format. Each output file is smaller than its corresponding input file.

As we're pulling out rows based upon location, we'll use the following GDELT fields (the numbers in parentheses are the zero-based column indices used in the code below):

  • ActionGeo_CountryCode (51)
  • Actor1Geo_CountryCode (37)
  • Actor2Geo_CountryCode (44)

These use FIPS country codes rather than ISO 3166 codes, which can be a source of confusion; a few examples of where the two differ are shown below.
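
This is a hand-picked, illustrative sample of ISO-to-FIPS differences, not a complete lookup; double-check the FIPS code for your country of interest against the GDELT documentation before running the extraction.

In [ ]:
# a few hand-picked ISO 3166 -> FIPS 10-4 differences (illustrative only, not exhaustive)
iso_to_fips_examples = {
    'GB': 'UK',  # United Kingdom
    'DE': 'GM',  # Germany
    'ES': 'SP',  # Spain
    'SE': 'SW',  # Sweden
    'CH': 'SZ',  # Switzerland
}
print iso_to_fips_examples['GB']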

The general algorithm we will follow can be loosely described as:

  • Get a list of the GDELT files
  • For each element in the list
    • Check whether the file is available locally (We assume we can afford to keep a local copy of the compressed files, so that we don't have to continually ping the GDELT server.)
      • If not, download it from the web
    • Unzip the compressed raw GDELT file
    • For each resulting CSV file (probably only one)
      • Create and open an output file
      • For each line of the input file
        • Read into a string
        • Split string by tab delimiter
        • Check for the desired values in the appropriate list indexes
          • If they are not there, continue the loop
          • If they are there, write the line to the output file
      • Increment a current-file counter
      • Close the input and output files
      • Delete the input file

We reset infilecounter and outfilecounter outside the main algorithm cell so that if the algorithm encounters an error and quits, we can pick up where we left off.

We check for (and potentially download) each file as we come to it, rather than collecting all of the files in advance. That way, we can start running the processing code as soon as possible, which helps with debugging because we "fail fast".

Remember to set the local_path to somewhere you're happy to store large files.
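
One practical note: the extraction loop below writes its output into a country/ subdirectory of local_path (and unzips the raw files into tmp/), and Python's open() will not create missing directories for us. A minimal sketch to create them up front, assuming the same local_path used in the main cell:

In [ ]:
import os

local_path = '/Users/me/Desktop/GDELT_Data/'  # same path as in the extraction cell below

# open() will not create missing directories, so make sure the working
# subdirectories used by the extraction loop exist before we start
for subdir in ['tmp/', 'country/']:
    if not os.path.exists(local_path + subdir):
        os.makedirs(local_path + subdir)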

In [ ]:
infilecounter = 0
outfilecounter = 0
In [ ]:
import os.path
import urllib
import zipfile
import glob
import operator

local_path = '/Users/me/Desktop/GDELT_Data/'

fips_country_code = 'UK'

for compressed_file in file_list[infilecounter:]:
    print compressed_file,
    
    # if we don't have the compressed file stored locally, go get it. Keep trying if necessary.
    while not os.path.isfile(local_path+compressed_file): 
        print 'downloading,',
        urllib.urlretrieve(url=gdelt_base_url+compressed_file, 
                           filename=local_path+compressed_file)
        
    # extract the contents of the compressed file to a temporary directory    
    print 'extracting,',
    z = zipfile.ZipFile(file=local_path+compressed_file, mode='r')    
    z.extractall(path=local_path+'tmp/')
    
    # parse each of the csv files in the working directory, 
    print 'parsing,',
    for infile_name in glob.glob(local_path+'tmp/*'):
        outfile_name = local_path+'country/'+fips_country_code+'%04i.tsv'%outfilecounter
        # open the infile and outfile
        with open(infile_name, mode='r') as infile, open(outfile_name, mode='w') as outfile:
            for line in infile:
                # extract lines with our interest country code
                if fips_country_code in operator.itemgetter(51, 37, 44)(line.split('\t')):    
                    outfile.write(line)
            outfilecounter +=1
            
        # delete the temporary file
        os.remove(infile_name)
    infilecounter +=1
    print 'done'
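
Once the loop has finished (or even while it is still running), a quick optional check shows how many intermediate country files have been written so far:

In [ ]:
import glob
# optional check: how many intermediate country files have been written?
country_files = glob.glob(local_path + 'country/' + fips_country_code + '*')
print len(country_files)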

3. Build intermediate files into a Pandas DataFrame

We may be content to use the data we just extracted into csv files as it stands. However, if we are working in Python, it is convenient to load the files into a DataFrame, save that DataFrame to a pickle, and delete the temporary files. This saves space on disk and makes our future analysis of the data simpler.

Our algorithm here is simple: we build a DataFrame from each of the temporary files, concatenate them into one big DataFrame, save that DataFrame, and delete the temporary files.

We use a helper file here which lists the column names. You can download the file to your working directory with this link:

http://gdeltproject.org/data/lookups/CSV.header.fieldids.xlsx

In [ ]:
import glob
import pandas as pd

# Get the GDELT field names from a helper file
colnames = pd.read_excel('CSV.header.fieldids.xlsx', sheetname='Sheet1', 
                         index_col='Column ID', parse_cols=1)['Field Name']

# Build DataFrames from each of the intermediary files
files = glob.glob(local_path+'country/'+fips_country_code+'*')
DFlist = []
for active_file in files:
    print active_file
    DFlist.append(pd.read_csv(active_file, sep='\t', header=None, dtype=str,
                              names=colnames, index_col=['GLOBALEVENTID']))

# Merge the file-based dataframes and save a pickle
DF = pd.concat(DFlist)
DF.to_pickle(local_path+'backup'+fips_country_code+'.pickle')    
    
# once everything is safely stored away, remove the temporary files
for active_file in files:
    os.remove(active_file)
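
When you return to the analysis later, the pickled DataFrame can be reloaded directly (using the same local_path and fips_country_code as above):

In [ ]:
import pandas as pd

# reload the saved DataFrame for later analysis
DF = pd.read_pickle(local_path + 'backup' + fips_country_code + '.pickle')
print DF.shape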