Notebook by: François Maillet
The 2015 Canadian federal election is in its final stretch and at Datacratic, we thought it would be a great opportunity to collect and play with some data.
Each party has a different view on what is important to move Canada forward. This has the effect of centering their official message around certain themes. The question we asked is: can we find patterns in the words used in each party’s press releases? Do some deal more with certain themes than others? Can we gain insights into their communication strategies?
This is what we came up with:
This image is a static scatter plot of press releases. Each dot represents a press release, colored with its party's official color (red for the Liberals, blue for the Conservatives, orange for the New Democrats and green for the Greens), and press releases are laid out so that similar ones end up near each other. Similarity is derived from the textual content of each press release. The axes are unitless.
The white text labels in the image above were placed by hand to give a sense of what the various groupings mean by inspecting how the algorithm laid out the press releases.
The plot also contains pink dots that represent major themes we hand-selected and then projected into the 2D space computed by our algorithm. The distinction with the white labels is important: while the white labels were placed by hand based on our interpretation of each region's press release titles, the pink dots were positioned by the algorithm itself. For the pink dot representing health, for example, you can think of it as asking the algorithm where it would position a press release containing only the word health. The following image shows what the pink dots represent, with yellow labels.
The rest of this notebook is the exact code that was used to generate the scatter plot with Datacratic's Machine Learning Database (MLDB), along with an interactive plot at the bottom. The algorithms used to produce this image include word2vec and t-SNE.
import json, pandas, requests
from datetime import datetime
import numpy
from pymldb import Connection
mldb = Connection()
We wrote a small scraper using the scrapy Python library. We scraped the press releases of the NDP, the Liberals, the Green Party and the Conservatives. The resulting data was saved in a CSV file with one press release per line.
As mentioned above, we added a few lines by hand representing the major themes we wish to project into our space, using "category" as the party name.
ds = mldb.v1.datasets("raw")
ds.put({
"type": "text.csv.tabular",
"params": {
"dataFileUrl":"file:///mldb_data/press_releases.csv.gz",
"ignoreBadLines": True,
"named": "'pr' + lineNumber()",
}
})
{
"status": {
"rowCount": 1141,
"numLineErrors": 4
},
"config": {
"params": {
"ignoreBadLines": true,
"named": "'pr' + lineNumber()",
"dataFileUrl": "file:///mldb_data/election2015/press_releases.csv.gz"
},
"type": "text.csv.tabular",
"id": "raw"
},
"state": "ok",
"type": "text.csv.tabular",
"id": "raw"
}
mldb.query("""
SELECT party, count(*)
FROM raw
WHERE to_timestamp(date) > to_timestamp('2015-08-01')
GROUP BY party
""")
| _rowName | count(*) | party |
|---|---|---|
| ["conservative"] | 104 | conservative |
| ["green"] | 90 | green |
| ["liberal"] | 289 | liberal |
| ["ndp"] | 162 | ndp |
mldb.query("SELECT * FROM raw LIMIT 1")
| _rowName | date | full_text | link | party | title |
|---|---|---|---|---|---|
| pr443 | 2015-01-09T00:00:00 | Official Opposition Critic for Human Rights Wa... | http://www.ndp.ca/news/ndp-calls-release-saudi... | ndp | NDP calls for the release of Saudi free speech... |
The next step is to take the full_text column and tokenize its contents. We do this because we will need to compare press releases using their content, word by word. The tokenize function splits the column into individual words, removes words shorter than 3 characters and applies a stoplist.
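MLDB's tokenize runs server-side; as a rough pure-Python sketch of the same preprocessing (split on the same punctuation characters, drop short words, apply a stoplist — the stoplist below is illustrative, not MLDB's actual list):

```python
import re
from collections import Counter

# Illustrative stoplist; the real pipeline uses its own configuration.
STOPLIST = {"the", "and", "for", "that", "with"}

def tokenize(text):
    """Split on whitespace and punctuation, drop words shorter than 3
    characters and stopwords, and return per-word frequencies."""
    words = re.split(r"[ ?!;/\[\]*\"]+", text.lower())
    kept = [w for w in words if len(w) >= 3 and w not in STOPLIST]
    return Counter(kept)

bag = tokenize("NDP calls for the release of Saudi free-speech advocate")
# 'for' and 'the' hit the stoplist, 'of' is too short; the rest are counted.
```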
bag = mldb.v1.procedures("baggify").put({
"type": "transform",
"params": {
"inputDataset": "raw",
"outputDataset": {
"id": "bag_of_words",
"type": "sparse.mutable"
},
"select": """
tokenize(full_text,
{splitchars: ' ?!;/[]*"',
quotechar: ''}
) as *
""",
"where": """full_text IS NOT NULL AND
title IS NOT NULL AND
to_timestamp(date) > to_timestamp('2015-08-01')
""",
"runOnCreation": True
}
})
The result is a clean list of words that is easy to compare across press releases. Cells that contain a number give the frequency of the word in the press release, while cells containing NaN mean that the press release did not contain the given word. Because MLDB supports sparse datasets, it is more efficient to simply not record anything for press release/word pairs with zero frequency, which is why they show up as NaN.
mldb.query("SELECT * FROM bag_of_words LIMIT 5")
| _rowName | - | -30- | 000 | 1 | 10 | 100% | 2014 | 2015 | 30 | 5 | ... | with | womans | work | world | worried | would | wrong | year | years | you |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| pr189 | 1 | 1 | NaN | NaN | NaN | 1 | NaN | 1 | NaN | NaN | ... | 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1 | NaN |
| pr201 | 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1 | NaN | ... | 1 | 1 | 1 | 1 | NaN | NaN | 1 | NaN | NaN | NaN |
| pr19 | NaN | NaN | 1 | NaN | 1 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | 1 | 2 | NaN | 2 | NaN | NaN | 1 | NaN |
| pr803 | NaN | NaN | NaN | 1 | 1 | NaN | 2 | 1 | NaN | 1 | ... | 2 | NaN | NaN | NaN | 1 | NaN | NaN | 1 | NaN | 3 |
| pr915 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |

5 rows × 500 columns
df = mldb.query("SELECT sum({*}) as * FROM bag_of_words")
df2 = df.T
df2.columns = ["count"]
df2.sort_values(by="count", ascending=False)[:15]
| _rowName | count |
|---|---|
| the | 7175 |
| and | 5364 |
| to | 5092 |
| of | 3750 |
| in | 2806 |
| a | 2538 |
| for | 1937 |
| that | 1832 |
| will | 1548 |
| is | 1500 |
| on | 1184 |
| our | 1067 |
| Canadians | 1063 |
| said | 979 |
| with | 974 |
The word2vec tool is used to embed words into a high-dimensional space. Word2vec gives us word embeddings, but since what interests us is press releases, and each one is represented by a bag of words, we actually need to compute document embeddings. This means that for a given press release, we embed each of its words and then represent the press release as the centroid of the coordinates of its component words.
You can obtain the trained word2vec embedding here.
w2v = mldb.v1.procedures("w2vimport").put({
"type": 'import.word2vec',
"params": {
"dataFileUrl": 'file:///mldb_data/GoogleNews-vectors-negative300.bin',
"outputDataset": {
"type": 'embedding',
"id": 'w2v'
},
"runOnCreation": True
}
})
mldb.query("SELECT * FROM w2v LIMIT 5")
| _rowName | 000000 | 000001 | 000002 | 000003 | 000004 | 000005 | 000006 | 000007 | 000008 | 000009 | ... | 000290 | 000291 | 000292 | 000293 | 000294 | 000295 | 000296 | 000297 | 000298 | 000299 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.152344 | -0.121094 | 0.102051 | -0.083984 | -0.184570 | 0.015320 | 0.238281 | -0.478516 | 0.072754 | 0.218750 | ... | 0.028320 | -0.164062 | -0.173828 | 0.361328 | -0.201172 | -0.142578 | -0.021606 | 0.013794 | -0.057373 | 0.277344 |
| null | -0.115234 | -0.304688 | 0.292969 | -0.179688 | 0.000885 | -0.074707 | 0.091309 | -0.030884 | 0.240234 | 0.111328 | ... | 0.019775 | 0.073242 | -0.078125 | 0.051758 | 0.201172 | 0.394531 | -0.322266 | -0.166992 | -0.433594 | 0.100098 |
| Pragmatix | 0.007721 | 0.123047 | 0.028076 | -0.157227 | -0.034424 | -0.053955 | 0.037598 | -0.071777 | -0.149414 | -0.072754 | ... | -0.100586 | -0.160156 | -0.042969 | -0.013062 | 0.075684 | -0.002808 | 0.014893 | -0.011414 | 0.184570 | -0.047852 |
| griefing | 0.091797 | -0.113281 | 0.029907 | 0.048584 | -0.241211 | 0.195312 | 0.034912 | 0.302734 | -0.034424 | -0.082520 | ... | 0.104980 | 0.233398 | -0.118652 | -0.089355 | -0.142578 | 0.062988 | 0.061768 | 0.054199 | -0.225586 | 0.187500 |
| Richard_Altomare | 0.056396 | 0.455078 | -0.189453 | -0.055420 | 0.096191 | -0.296875 | 0.195312 | 0.240234 | 0.224609 | -0.104004 | ... | -0.139648 | -0.141602 | 0.003479 | 0.159180 | 0.281250 | 0.010376 | -0.174805 | 0.060547 | -0.025269 | -0.174805 |

5 rows × 300 columns
We now have the w2v dataset containing embeddings for words, and the bag_of_words dataset containing the frequency of words in each press release. What we want is to embed each press release in the word2vec space. To do this, we apply a pooling function that calculates the centroid of all the words in each press release; that centroid in the word2vec space becomes the press release's representation.
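Outside MLDB, this centroid pooling can be sketched with numpy (toy 3-dimensional vectors below stand in for the 300-dimensional word2vec embeddings; the words and values are made up):

```python
import numpy as np

# Toy word embeddings; the real ones are 300-dimensional word2vec vectors.
embeddings = {
    "health":  np.array([0.9, 0.1, 0.0]),
    "care":    np.array([0.7, 0.3, 0.0]),
    "doctors": np.array([0.8, 0.2, 0.1]),
}

def document_embedding(words):
    """Embed a document as the centroid (mean) of its word vectors,
    skipping words missing from the vocabulary."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    return np.mean(vectors, axis=0)

doc = document_embedding(["health", "care", "unknownword"])
# centroid of 'health' and 'care': [0.8, 0.2, 0.0]
```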
pooler = mldb.v1.functions("pooler").put({
"type": "pooling",
"params": {
"aggregators": ["avg"],
"embeddingDataset": "w2v"
}
})
print mldb.v1.procedures("word2vec").put({
"type": "transform",
"params": {
"inputDataset": "bag_of_words",
"outputDataset": {
"id": "pr_word2vec",
"type": "sparse.mutable"
},
"select": "pooler({words: {*}}) as word2vec",
"runOnCreation": True
}
})
<Response [201]>
In the new pr_word2vec dataset, each row is still a press release and the columns are the word2vec coordinates.
mldb.query("SELECT * FROM pr_word2vec LIMIT 5")
| _rowName | word2vec.embedding.000000 | word2vec.embedding.000001 | word2vec.embedding.000002 | word2vec.embedding.000003 | word2vec.embedding.000004 | word2vec.embedding.000005 | word2vec.embedding.000006 | word2vec.embedding.000007 | word2vec.embedding.000008 | word2vec.embedding.000009 | ... | word2vec.embedding.000290 | word2vec.embedding.000291 | word2vec.embedding.000292 | word2vec.embedding.000293 | word2vec.embedding.000294 | word2vec.embedding.000295 | word2vec.embedding.000296 | word2vec.embedding.000297 | word2vec.embedding.000298 | word2vec.embedding.000299 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| pr189 | 0.015025 | 0.035481 | 0.025070 | 0.081255 | -0.055098 | -0.027271 | 0.023139 | -0.059069 | 0.078178 | 0.057806 | ... | -0.088459 | 0.012187 | -0.065785 | 0.027803 | -0.016705 | -0.021275 | 0.016819 | -0.011929 | 0.054700 | -0.030355 |
| pr201 | -0.003131 | 0.033511 | 0.029290 | 0.060336 | -0.046791 | -0.029143 | 0.012239 | -0.077588 | 0.082869 | 0.013007 | ... | -0.066464 | -0.012209 | -0.065420 | 0.023533 | -0.072667 | -0.016398 | 0.052933 | -0.019730 | 0.036446 | -0.002662 |
| pr19 | 0.016468 | 0.050475 | 0.049212 | 0.075096 | -0.058997 | -0.027791 | 0.029848 | -0.050204 | 0.093230 | 0.049287 | ... | -0.092226 | -0.024152 | -0.058700 | 0.025048 | -0.047662 | 0.012693 | 0.044788 | 0.002387 | 0.084831 | -0.009792 |
| pr803 | 0.012424 | 0.036466 | 0.022585 | 0.083785 | -0.054085 | -0.032957 | 0.041109 | -0.048618 | 0.091955 | 0.040198 | ... | -0.101424 | 0.024314 | -0.060021 | 0.030629 | -0.052504 | -0.003902 | 0.010874 | 0.019480 | 0.059965 | -0.022820 |
| pr915 | -0.003221 | -0.007836 | 0.045823 | 0.066040 | 0.048532 | -0.128305 | -0.021231 | -0.022651 | 0.061974 | -0.007080 | ... | -0.085088 | -0.039598 | -0.110807 | 0.087637 | -0.078998 | -0.054260 | 0.031264 | 0.015263 | 0.123915 | -0.008639 |

5 rows × 300 columns
The t-SNE algorithm is a powerful technique for reducing the dimensionality of data to 2D or 3D in order to visualise it. We use it to generate the layout of the press releases in the scatter plot.
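MLDB runs t-SNE server-side via the tsne.train procedure below; as a rough standalone analogue, scikit-learn's TSNE (assumed to be installed; not what MLDB uses internally) reduces high-dimensional document embeddings to 2D in the same spirit:

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for the press-release embeddings: 50 documents x 300 dimensions.
rng = np.random.RandomState(0)
X = rng.randn(50, 300)

# A low perplexity suits small datasets, matching the MLDB call below.
coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(X)
print(coords.shape)  # (50, 2): one (x, y) point per document
```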
print mldb.v1.procedures("pr_embed_tsne").put({
"type" : "tsne.train",
"params" : {
"trainingData" : "SELECT * FROM pr_word2vec",
"rowOutputDataset" : "pr_embed_tsne",
"modelFileUrl": "file:///mldb_data/tsne.bin",
"functionName": "tsne_embed",
"perplexity": 5,
"runOnCreation": True
}
})
<Response [201]>
mldb.v1.datasets("pr_embed_tsne_merged").put({
"type" : "merged",
"params" : {
"datasets": [
{"id": "raw"},
{"id": "pr_embed_tsne"}
]
}
})
{
"status": [
{
"rowCount": 1141,
"numLineErrors": 4
},
null
],
"config": {
"params": {
"datasets": [
{
"id": "raw"
},
{
"id": "pr_embed_tsne"
}
]
},
"type": "merged",
"id": "pr_embed_tsne_merged"
},
"state": "ok",
"type": "merged",
"id": "pr_embed_tsne_merged"
}
We now have the pr_embed_tsne_merged dataset, which represents each of our press releases with $(x,y)$ coordinates in the embedding space.
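The merged dataset type aligns its input datasets by row name; a pandas analogue of that row-wise join, with a couple of made-up rows, might look like:

```python
import pandas as pd

# Metadata from the raw dataset, indexed by press-release row name.
raw = pd.DataFrame(
    {"party": ["ndp", "green"], "title": ["A", "B"]},
    index=["pr189", "pr201"],
)
# 2D coordinates from the t-SNE output, with the same row names.
tsne = pd.DataFrame(
    {"x": [-14.5, 21.2], "y": [22.8, 37.6]},
    index=["pr189", "pr201"],
)

# Join on the shared index, like MLDB's 'merged' dataset type.
merged = raw.join(tsne)
```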
mldb.query("""
SELECT party, title, x, y
FROM pr_embed_tsne_merged
WHERE to_timestamp(date) > to_timestamp('2015-08-01')
LIMIT 5""")
| _rowName | party | title | x | y |
|---|---|---|---|---|
| pr189 | green | Greens Challenge other Parties to Run Carbon N... | -14.476276 | 22.831806 |
| pr201 | green | Elizabeth May successfully uses social media t... | 21.241373 | 37.598019 |
| pr19 | conservative | Harper Announces Support for Religious Freedom | 54.134609 | 13.339911 |
| pr803 | liberal | Peggy Nash vs Peggy Nash: | -47.633690 | 15.637717 |
| pr915 | liberal | Justin Trudeau celebrates the Mid-Autumn Festival | 34.224525 | 70.258011 |
df = mldb.query("""
SELECT party, title, x, y
FROM pr_embed_tsne_merged
WHERE to_timestamp(date) > to_timestamp('2015-08-01')
""")
import numpy as np
colormap = {
"ndp": "#FF8000",
"liberal": "#DF0101",
"conservative": "#0000FF",
"green": "#01DF01",
"category": "#FE2EC8"
}
import bokeh.plotting as bp
from bokeh.models import HoverTool
press_releases = np.array([str(x.encode('ascii','ignore').split("|")[0]) for x in list(df.title.values)])
x = bp.figure(plot_width=900, plot_height=700, title="Press Releases of Canadian Federal Parties During 2015 Elections",
tools="pan,wheel_zoom,box_zoom,reset,hover,previewsave",
x_axis_type=None, y_axis_type=None, min_border=1)
x.scatter(
x = df.x.values,
y = df.y.values,
color=[colormap[k] for k in df.party.values],
radius=1,
fill_alpha=0.5,
source=bp.ColumnDataSource({"title": press_releases})
).select(dict(type=HoverTool)).tooltips = {"title":"@title"}
bp.show(x)
This plot is interactive: hover over a dot and the tooltip will show the title of that press release.
Take a look at some patterns and insights we found on the companion blog post.