By Felipe Hoffa (@felipehoffa, /r/bigquery)
# Some pre-flight work:
# - Libraries.
# - Patch to handle authentication automatically on Google Compute Engine.
from oauth2client.gce import AppAssertionCredentials
from bigquery_client import BigqueryClient
from pandas.io import gbq
import pandas as pd
def GetMetadata(path):
    import urllib2
    BASE_PATH = 'http://metadata/computeMetadata/v1/'
    request = urllib2.Request(BASE_PATH + path, headers={'Metadata-Flavor': 'Google'})
    return urllib2.urlopen(request).read()
credentials = AppAssertionCredentials(scope='https://www.googleapis.com/auth/bigquery')
client = BigqueryClient(credentials=credentials,
                        api='https://www.googleapis.com',
                        api_version='v2',
                        project_id=GetMetadata('project/project-id'))
gbq._authenticate = lambda: client
How many rows does GDELT have for each country?
We are going to use a view specially defined for that: 'sample_views.country_date_matconf_numarts' counts the number of material conflicts per day per country.
To see the view definition, go to https://bigquery.cloud.google.com/table/gdelt-bq:sample_views.country_date_matconf_numarts and click on 'Details'.
query = """
SELECT country, SUM(c) AS count
FROM [gdelt-bq:sample_views.country_date_matconf_numarts]
GROUP BY country
ORDER BY count DESC
LIMIT 10
"""
gbq.read_gbq(query)
Waiting on bqjob_r3494ef2663a3c45e_00000147cc145563_1 ... (0s) Current status: DONE
| | country | count |
|---|---|---|
| 0 | Israel | 13138702 |
| 1 | Pakistan | 11547139 |
| 2 | Afghanistan | 11334579 |
| 3 | Iraq | 10181510 |
| 4 | Syria | 9538310 |
| 5 | United Kingdom | 8764009 |
| 6 | Russia | 7569903 |
| 7 | India | 6988340 |
| 8 | Egypt | 6511839 |
| 9 | China | 6268915 |
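The same aggregation can be reproduced locally with pandas once daily counts are in a DataFrame. A minimal sketch on made-up numbers (the countries and counts below are invented, not GDELT data):

```python
import pandas as pd

# Synthetic stand-in for rows from the view:
# one row per (country, day) with a count of material-conflict articles.
df = pd.DataFrame({
    'country': ['Egypt', 'Egypt', 'Chad', 'Chad', 'Chad'],
    'c':       [120,      80,      30,     45,     25],
})

# Equivalent of: SELECT country, SUM(c) AS count ... GROUP BY country ORDER BY count DESC
totals = (df.groupby('country')['c'].sum()
            .sort_values(ascending=False)
            .rename('count'))
print(totals)
```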
Focusing solely on Egypt, let's plot the number of material conflicts reported for each day over the last 35 years:
query = """
SELECT date, c AS count
FROM [gdelt-bq:sample_views.country_date_matconf_numarts]
WHERE country='Egypt'
ORDER BY date
"""
data = gbq.read_gbq(query)
data.index = pd.to_datetime(data['date'] * 1000)
x = data['count'].plot()
Waiting on bqjob_r46d7af66323c4ad9_00000147cc18e43f_8 ... (0s) Current status: DONE
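BigQuery returns `date` here as microseconds since the Unix epoch, while `pd.to_datetime` treats bare integers as nanoseconds, hence the `* 1000` above. A quick check of that conversion (the timestamp below is just an illustrative value):

```python
import pandas as pd

# 2011-01-27 00:00:00 UTC expressed as microseconds since the epoch.
usec = 1296086400000000

# Multiplying by 1000 turns microseconds into nanoseconds,
# the default unit pd.to_datetime assumes for integers.
ts = pd.to_datetime(usec * 1000)
print(ts)  # 2011-01-27 00:00:00

# An equivalent, more explicit spelling:
ts2 = pd.to_datetime(usec, unit='us')
```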
What if we only focus on the 30 days prior to January 27th, 2011?
query = """
SELECT country, date, c AS count
FROM [gdelt-bq:sample_views.country_date_matconf_numarts] a
CROSS JOIN (SELECT i FROM [fh-bigquery:public_dump.numbers_255] WHERE i < 30) b
WHERE country='Egypt'
AND date+i*86400000000 = PARSE_UTC_USEC('2011-01-27')
"""
data_egypt = gbq.read_gbq(query)
data_egypt.index = pd.to_datetime(data_egypt['date'] * 1000)
x = data_egypt['count'].plot()
Waiting on bqjob_r2f64a01b0c420ec3_00000147cc356c4a_11 ... (0s) Current status: DONE
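The `CROSS JOIN` against a small numbers table is how this legacy-SQL query materializes a window: each row is duplicated with offsets i = 0..29, and the filter `date + i*86400000000 = PARSE_UTC_USEC('2011-01-27')` keeps exactly the rows whose date falls in the 30 days ending at the anchor. A sketch of the same selection in pandas, on a synthetic daily series (column names are assumptions):

```python
import pandas as pd

# Synthetic daily series for one country.
days = pd.date_range('2010-12-01', '2011-02-10', freq='D')
df = pd.DataFrame({'date': days, 'c': range(len(days))})

anchor = pd.Timestamp('2011-01-27')

# Keep rows where date + i days == anchor for some i in 0..29,
# i.e. the 30 days ending at the anchor (inclusive).
window = df[df['date'].isin([anchor - pd.Timedelta(days=i) for i in range(30)])]
print(len(window))  # 30 rows
```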
With that timeline defined, we can look for similar timelines across all years and countries.
To make this work, we used the trick explained on http://stackoverflow.com/questions/24923101/computing-a-moving-maximum-in-bigquery/24943950#24943950.
query = """
SELECT
STRFTIME_UTC_USEC(a.ending_at, "%Y-%m-%d") ending_at1,
STRFTIME_UTC_USEC(b.ending_at-30*86400000000, "%Y-%m-%d") starting_at2,
STRFTIME_UTC_USEC(b.ending_at, "%Y-%m-%d") ending_at2,
a.country, b.country, CORR(a.c, b.c) corr, COUNT(*) c
FROM (
SELECT country, date+i*86400000000 ending_at, c, i
FROM [gdelt-bq:sample_views.country_date_matconf_numarts] a
CROSS JOIN (SELECT i FROM [fh-bigquery:public_dump.numbers_255] WHERE i < 30) b
) b
JOIN (
SELECT country, date+i*86400000000 ending_at, c, i
FROM [gdelt-bq:sample_views.country_date_matconf_numarts] a
CROSS JOIN (SELECT i FROM [fh-bigquery:public_dump.numbers_255] WHERE i < 30) b
WHERE country='Egypt'
AND date+i*86400000000 = PARSE_UTC_USEC('2011-01-27')
) a
ON a.i=b.i
WHERE a.ending_at != b.ending_at
GROUP EACH BY ending_at1, ending_at2, starting_at2, a.country, b.country
HAVING (c = 30 AND ABS(corr) > 0.254)
ORDER BY corr DESC
LIMIT 10
"""
data=gbq.read_gbq(query)
data
Waiting on bqjob_r5ccda16b63fbcd4a_00000147cc1a0174_11 ... (203s) Current status: DONE
| | ending_at1 | starting_at2 | ending_at2 | a_country | b_country | corr | c |
|---|---|---|---|---|---|---|---|
| 0 | 2011-01-27 | 2013-10-19 | 2013-11-18 | Egypt | Chad | 0.875190 | 30 |
| 1 | 2011-01-27 | 2011-05-15 | 2011-06-14 | Egypt | Jordan | 0.862724 | 30 |
| 2 | 2011-01-27 | 2005-08-07 | 2005-09-06 | Egypt | Turkey | 0.856022 | 30 |
| 3 | 2011-01-27 | 2006-10-03 | 2006-11-02 | Egypt | Italy | 0.847174 | 30 |
| 4 | 2011-01-27 | 2006-05-29 | 2006-06-28 | Egypt | Syria | 0.846571 | 30 |
| 5 | 2011-01-27 | 2013-05-15 | 2013-06-14 | Egypt | South Korea | 0.842728 | 30 |
| 6 | 2011-01-27 | 1995-06-11 | 1995-07-11 | Egypt | Iraq | 0.842362 | 30 |
| 7 | 2011-01-27 | 2011-10-10 | 2011-11-09 | Egypt | Portugal | 0.841235 | 30 |
| 8 | 2011-01-27 | 2002-08-04 | 2002-09-03 | Egypt | Sudan | 0.835436 | 30 |
| 9 | 2011-01-27 | 2009-08-07 | 2009-09-06 | Egypt | Germany | 0.832848 | 30 |
Those are the 10 periods that most closely resemble the timeline of material conflicts in Egypt before January 27th, 2011.
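The heavy lifting is `CORR(a.c, b.c)`: the Pearson correlation between the 30 daily counts of the Egypt window and each candidate window, aligned by offset `i`. pandas computes the same statistic with `Series.corr`; a small check on two invented, perfectly linearly related windows:

```python
import pandas as pd

# Two aligned 30-value windows; b is a scaled and shifted copy of a,
# so their Pearson correlation must be 1.
a = pd.Series(range(30), dtype=float)
b = a * 2.5 + 7

corr = a.corr(b)  # Pearson by default, like BigQuery's CORR
print(round(corr, 6))  # 1.0
```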
Let's plot Chad over the 30 days ending November 18th, 2013 (starting October 19th).
query = """
SELECT country, date, c AS count
FROM [gdelt-bq:sample_views.country_date_matconf_numarts] a
CROSS JOIN (SELECT i FROM [fh-bigquery:public_dump.numbers_255] WHERE i < 30) b
WHERE country='Chad'
AND date+i*86400000000 = PARSE_UTC_USEC('2013-11-18')
"""
data_chad = gbq.read_gbq(query)
data_chad.index = pd.to_datetime(data_chad['date'] * 1000)
data_chad['count'].plot()
Waiting on bqjob_r39c0b56b485a6249_00000147cc384b90_16 ... (0s) Current status: DONE
<matplotlib.axes.AxesSubplot at 0x7feb5c1ecbd0>
In fact, both periods look remarkably similar (but not necessarily significant; more analysis is needed before reaching any conclusions):
data_egypt['count'].plot()
<matplotlib.axes.AxesSubplot at 0x7feb5cd93890>
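To compare the two windows visually on a common scale, one option (not in the original notebook) is to min-max normalize each series before overlaying them. A sketch on invented stand-in values:

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # render off-screen; omit this line in a notebook
import matplotlib.pyplot as plt

def normalize(s):
    """Scale a series into [0, 1] so only the shapes are compared."""
    return (s - s.min()) / (s.max() - s.min())

# Invented stand-ins for the two 30-day windows queried above.
egypt = pd.Series([10, 40, 35, 90, 120, 60])
chad = pd.Series([2, 9, 8, 20, 27, 13])

ax = normalize(egypt).plot(label='Egypt, ending 2011-01-27')
normalize(chad).plot(ax=ax, label='Chad, ending 2013-11-18')
ax.legend()
```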
To look further into how to read these results to understand history, look at: