By Felipe Hoffa (@felipehoffa, /r/bigquery)
# Some pre-flight work:
# - Libraries.
# - Patch to handle authentication automatically on Google Compute Engine.
from oauth2client.gce import AppAssertionCredentials
from bigquery_client import BigqueryClient
from pandas.io import gbq
import pandas as pd
def GetMetadata(path):
    import urllib2
    BASE_PATH = 'http://metadata/computeMetadata/v1/'
    request = urllib2.Request(BASE_PATH + path, headers={'Metadata-Flavor': 'Google'})
    return urllib2.urlopen(request).read()
credentials = AppAssertionCredentials(scope='https://www.googleapis.com/auth/bigquery')
client = BigqueryClient(credentials=credentials,
                        api='https://www.googleapis.com',
                        api_version='v2',
                        project_id=GetMetadata('project/project-id'))
gbq._authenticate = lambda: client
How many rows does GDELT have for each country?
We are going to use a view specially defined for that: 'sample_views.country_date_matconf_numarts' counts the number of material conflicts per day per country.
To see the view definition, go to https://bigquery.cloud.google.com/table/gdelt-bq:sample_views.country_date_matconf_numarts and click on 'Details'.
query = """
SELECT country, SUM(c) AS count
FROM [gdelt-bq:sample_views.country_date_matconf_numarts]
GROUP BY country
ORDER BY count DESC
LIMIT 10
"""
gbq.read_gbq(query)
Waiting on bqjob_r3494ef2663a3c45e_00000147cc145563_1 ... (0s) Current status: DONE
| | country | count |
|---|---|---|
| 0 | Israel | 13138702 |
| 1 | Pakistan | 11547139 |
| 2 | Afghanistan | 11334579 |
| 3 | Iraq | 10181510 |
| 4 | Syria | 9538310 |
| 5 | United Kingdom | 8764009 |
| 6 | Russia | 7569903 |
| 7 | India | 6988340 |
| 8 | Egypt | 6511839 |
| 9 | China | 6268915 |
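The same aggregation can be reproduced locally with pandas once daily counts are in a DataFrame. A minimal sketch on made-up numbers (the countries and counts below are invented, not GDELT data):

```python
import pandas as pd

# Synthetic stand-in for rows from the view:
# one row per (country, day) with a count of material-conflict articles.
df = pd.DataFrame({
    'country': ['Egypt', 'Egypt', 'Chad', 'Chad', 'Chad'],
    'c':       [120,      80,      30,     45,     25],
})

# Equivalent of: SELECT country, SUM(c) AS count ... GROUP BY country ORDER BY count DESC
totals = (df.groupby('country')['c'].sum()
            .sort_values(ascending=False)
            .rename('count'))
print(totals)
```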
Focusing solely on Egypt, let's plot the number of material conflicts reported for each day over the last 35 years:
query = """
SELECT date, c AS count
FROM [gdelt-bq:sample_views.country_date_matconf_numarts]
WHERE country='Egypt'
ORDER BY date
"""
data = gbq.read_gbq(query)
data.index = pd.to_datetime(data['date'] * 1000)
x = data['count'].plot()
Waiting on bqjob_r46d7af66323c4ad9_00000147cc18e43f_8 ... (0s) Current status: DONE
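BigQuery returns `date` here as microseconds since the Unix epoch, while `pd.to_datetime` treats bare integers as nanoseconds, hence the `* 1000` above. A quick check of that conversion (the timestamp below is just an illustrative value):

```python
import pandas as pd

# 2011-01-27 00:00:00 UTC expressed as microseconds since the epoch.
usec = 1296086400000000

# Multiplying by 1000 turns microseconds into nanoseconds,
# the default unit pd.to_datetime assumes for integers.
ts = pd.to_datetime(usec * 1000)
print(ts)  # 2011-01-27 00:00:00

# An equivalent, more explicit spelling:
ts2 = pd.to_datetime(usec, unit='us')
```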
What if we only focus on the 30 days prior to January 27th, 2011?
query = """
SELECT country, date, c AS count
FROM [gdelt-bq:sample_views.country_date_matconf_numarts] a
CROSS JOIN (SELECT i FROM [fh-bigquery:public_dump.numbers_255] WHERE i < 30) b
WHERE country='Egypt'
AND date+i*86400000000 = PARSE_UTC_USEC('2011-01-27')
"""
data_egypt = gbq.read_gbq(query)
data_egypt.index = pd.to_datetime(data_egypt['date'] * 1000)
x = data_egypt['count'].plot()
Waiting on bqjob_r2f64a01b0c420ec3_00000147cc356c4a_11 ... (0s) Current status: DONE
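The `CROSS JOIN` against a small numbers table is how this legacy-SQL query materializes a window: each row is duplicated with offsets i = 0..29, and the filter `date + i*86400000000 = PARSE_UTC_USEC('2011-01-27')` keeps exactly the rows whose date falls in the 30 days ending at the anchor. A sketch of the same selection in pandas, on a synthetic daily series (column names are assumptions):

```python
import pandas as pd

# Synthetic daily series for one country.
days = pd.date_range('2010-12-01', '2011-02-10', freq='D')
df = pd.DataFrame({'date': days, 'c': range(len(days))})

anchor = pd.Timestamp('2011-01-27')

# Keep rows where date + i days == anchor for some i in 0..29,
# i.e. the 30 days ending at the anchor (inclusive).
window = df[df['date'].isin([anchor - pd.Timedelta(days=i) for i in range(30)])]
print(len(window))  # 30 rows
```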
With that timeline defined, we can look for similar timelines across all years and countries.
To make this work, we used the trick explained on http://stackoverflow.com/questions/24923101/computing-a-moving-maximum-in-bigquery/24943950#24943950.
query = """
SELECT
STRFTIME_UTC_USEC(a.ending_at, "%Y-%m-%d") ending_at1,
STRFTIME_UTC_USEC(b.ending_at-30*86400000000, "%Y-%m-%d") starting_at2,
STRFTIME_UTC_USEC(b.ending_at, "%Y-%m-%d") ending_at2,
a.country, b.country, CORR(a.c, b.c) corr, COUNT(*) c
FROM (
SELECT country, date+i*86400000000 ending_at, c, i
FROM [gdelt-bq:sample_views.country_date_matconf_numarts] a
CROSS JOIN (SELECT i FROM [fh-bigquery:public_dump.numbers_255] WHERE i < 30) b
) b
JOIN (
SELECT country, date+i*86400000000 ending_at, c, i
FROM [gdelt-bq:sample_views.country_date_matconf_numarts] a
CROSS JOIN (SELECT i FROM [fh-bigquery:public_dump.numbers_255] WHERE i < 30) b
WHERE country='Egypt'
AND date+i*86400000000 = PARSE_UTC_USEC('2011-01-27')
) a
ON a.i=b.i
WHERE a.ending_at != b.ending_at
GROUP EACH BY ending_at1, ending_at2, starting_at2, a.country, b.country
HAVING (c = 30 AND ABS(corr) > 0.254)
ORDER BY corr DESC
LIMIT 10
"""
data=gbq.read_gbq(query)
data
Waiting on bqjob_r5ccda16b63fbcd4a_00000147cc1a0174_11 ... (203s) Current status: DONE
| | ending_at1 | starting_at2 | ending_at2 | a_country | b_country | corr | c |
|---|---|---|---|---|---|---|---|
| 0 | 2011-01-27 | 2013-10-19 | 2013-11-18 | Egypt | Chad | 0.875190 | 30 |
| 1 | 2011-01-27 | 2011-05-15 | 2011-06-14 | Egypt | Jordan | 0.862724 | 30 |
| 2 | 2011-01-27 | 2005-08-07 | 2005-09-06 | Egypt | Turkey | 0.856022 | 30 |
| 3 | 2011-01-27 | 2006-10-03 | 2006-11-02 | Egypt | Italy | 0.847174 | 30 |
| 4 | 2011-01-27 | 2006-05-29 | 2006-06-28 | Egypt | Syria | 0.846571 | 30 |
| 5 | 2011-01-27 | 2013-05-15 | 2013-06-14 | Egypt | South Korea | 0.842728 | 30 |
| 6 | 2011-01-27 | 1995-06-11 | 1995-07-11 | Egypt | Iraq | 0.842362 | 30 |
| 7 | 2011-01-27 | 2011-10-10 | 2011-11-09 | Egypt | Portugal | 0.841235 | 30 |
| 8 | 2011-01-27 | 2002-08-04 | 2002-09-03 | Egypt | Sudan | 0.835436 | 30 |
| 9 | 2011-01-27 | 2009-08-07 | 2009-09-06 | Egypt | Germany | 0.832848 | 30 |
Those are the 10 periods that most closely resemble the timeline of material conflicts in Egypt before January 27th, 2011.
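The heavy lifting is `CORR(a.c, b.c)`: the Pearson correlation between the 30 daily counts of the Egypt window and each candidate window, aligned by offset `i`. pandas computes the same statistic with `Series.corr`; a small check on two invented, perfectly linearly related windows:

```python
import pandas as pd

# Two aligned 30-value windows; b is a scaled and shifted copy of a,
# so their Pearson correlation must be 1.
a = pd.Series(range(30), dtype=float)
b = a * 2.5 + 7

corr = a.corr(b)  # Pearson by default, like BigQuery's CORR
print(round(corr, 6))  # 1.0
```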
Let's plot Chad over the 30 days ending November 18th, 2013 (starting October 19th).
query = """
SELECT country, date, c AS count
FROM [gdelt-bq:sample_views.country_date_matconf_numarts] a
CROSS JOIN (SELECT i FROM [fh-bigquery:public_dump.numbers_255] WHERE i < 30) b
WHERE country='Chad'
AND date+i*86400000000 = PARSE_UTC_USEC('2013-11-18')
"""
data_chad = gbq.read_gbq(query)
data_chad.index = pd.to_datetime(data_chad['date'] * 1000)
data_chad['count'].plot()
Waiting on bqjob_r39c0b56b485a6249_00000147cc384b90_16 ... (0s) Current status: DONE
<matplotlib.axes.AxesSubplot at 0x7feb5c1ecbd0>
In fact, both periods look remarkably similar (but not necessarily significant; more analysis is needed before reaching any conclusions):
data_egypt['count'].plot()
<matplotlib.axes.AxesSubplot at 0x7feb5cd93890>
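To compare the two windows visually on a common scale, one option (not in the original notebook) is to min-max normalize each series before overlaying them. A sketch on invented stand-in values:

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # render off-screen; omit this line in a notebook
import matplotlib.pyplot as plt

def normalize(s):
    """Scale a series into [0, 1] so only the shapes are compared."""
    return (s - s.min()) / (s.max() - s.min())

# Invented stand-ins for the two 30-day windows queried above.
egypt = pd.Series([10, 40, 35, 90, 120, 60])
chad = pd.Series([2, 9, 8, 20, 27, 13])

ax = normalize(egypt).plot(label='Egypt, ending 2011-01-27')
normalize(chad).plot(ax=ax, label='Chad, ending 2013-11-18')
ax.legend()
```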
To look further into how to read these results to understand history, look at: