In this tutorial we will use a Wikipedia corpus to extract semantically similar words for a given word. We will apply a technique called "Latent Semantic Analysis" (LSA) to a collocation matrix extracted from the Bavarian Wikipedia. The tutorial is based on ideas from the paper "Visualisation Techniques for Analysing Meaning" by Widdows and Dorow (2002). You can read more about LSA and related techniques in the book Introduction to Information Retrieval. The data was prepared with the help of the Wikipedia Extractor, which is probably the easiest way to extract a pure text corpus from the Wikipedia dumps. The dumps contain all kinds of markup used in Wikipedia that we need to remove before we can process the data. This was already done and the data was transformed to LAF/GrAF. We will use a Python GrAF parser to read the data. For most of the calculations we will use numpy, scipy and sparsesvd. Finally, matplotlib will be used to visualize the space of semantic similarities.
You need to install the following Python libraries in order to execute this notebook. The easiest way to install them is via easy_install. If you are on Windows you can download setup packages for numpy and scipy here.
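On a standard Python setup with setuptools, the installation might look like this (package names on PyPI are assumed here; pip works just as well):

```shell
easy_install requests numpy scipy matplotlib sparsesvd graf-python
```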
First we change the directory to somewhere where we can download and extract the data. If you want to download and extract to the current directory you can just skip this step:
%cd "h:\ProjectsWin\git-github\poio-corpus\build\"
h:\ProjectsWin\git-github\poio-corpus\build
Next, we import all the Python modules that we need:
import re
import io
import math
import codecs
import zipfile
import requests
import numpy as np
import matplotlib.pyplot as plt
import scipy.spatial
import scipy.sparse
import scipy.linalg
from sparsesvd import sparsesvd
import graf
A list of Bavarian stopwords was already compiled and is available for download. The next block of code downloads this list of stopwords and stores it in a variable stopwords. We will also download a list of characters that we want to ignore in Bavarian words and store it in the variable ignorechars:
r = requests.get("https://www.poio.eu/static/stopwords/bar.txt")
stopwords = r.content.decode("utf-8").split()
r = requests.get("https://www.poio.eu/static/ignorechars/bar.txt")
ignorechars = r.content.decode("utf-8")
In the next step we will download and extract the corpus. The corpus was pre-compiled from Wikipedia dumps and converted to a set of GrAF files. To parse the corpus we will use the library graf-python (see "Prerequisites" above). We will store each Wikipedia document as a Unicode string in the list documents:
r = requests.get("https://www.poio.eu/static/corpus/barwiki-20130813.zip")
with open("barwiki-20130813.zip", "wb") as f:
    f.write(r.content)
z = zipfile.ZipFile("barwiki-20130813.zip")
z.extractall()
gp = graf.GraphParser()
g = gp.parse("barwiki-20130813.hdr")
with codecs.open("barwiki-20130813.txt", "r", "utf-8") as text:
    txt = text.read()
documents = list()
for n in g.nodes:
    if n.id.startswith("doc..") and len(n.links) > 0 and len(n.links[0]) > 0:
        doc = txt[n.links[0][0].start:n.links[0][0].end]
        documents.append(doc)
re_ignore_chars = re.compile(u"[{0}]".format(ignorechars))
def _words_for_document(doc):
    words = doc.split()
    words2 = list()
    for w in words:
        w = re_ignore_chars.sub("", w.lower())
        if not w or w in stopwords:
            continue
        words2.append(w)
    return words2
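As a quick illustration of what this filtering does, here is a self-contained sketch with hypothetical stand-ins for the downloaded stopword list and ignore characters (the real lists come from the downloads above):

```python
import re

# Hypothetical stand-ins for the downloaded lists
demo_stopwords = ["de", "und", "da"]
demo_ignore = re.compile(u"[.,!?]")

def demo_tokenize(doc):
    # lowercase, strip ignored characters, drop empty tokens and stopwords
    result = []
    for w in doc.split():
        w = demo_ignore.sub("", w.lower())
        if w and w not in demo_stopwords:
            result.append(w)
    return result

demo_tokenize(u"De Sunn und da Mond!")  # -> ['sunn', 'mond']
```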
wdict = {}
for i, d in enumerate(documents):
    for w in _words_for_document(d):
        if w in wdict:
            wdict[w].append(i)
        else:
            wdict[w] = [i]
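The if/else above is the classic inverted-index idiom: wdict maps each word to the list of document indices it occurs in. With collections.defaultdict the same mapping can be written more compactly; a sketch on toy documents (tokenized by a plain split here instead of _words_for_document):

```python
from collections import defaultdict

demo_docs = ["a b c", "b c", "c"]
demo_wdict = defaultdict(list)
for i, d in enumerate(demo_docs):
    for w in d.split():
        demo_wdict[w].append(i)

# "c" appears in all three documents, "a" only in the first
```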
# Which 1000 words occur most often?
top_words = [k for k in sorted(wdict, key=lambda k: len(wdict[k]), reverse=True)][:1000]
# get all words that appear at least 3 times and sort them
keys = [k for k in wdict.keys() if len(wdict[k]) > 2]
keys.sort()
keys_indices = { w: i for i, w in enumerate(keys) }
# create an empty count matrix
A = np.zeros([len(keys), len(top_words)])
for d in documents:
    words = _words_for_document(d)
    len_words = len(words) - 1
    for i, w in enumerate(words):
        if w not in keys_indices:
            continue
        start = i - 15
        if start < 0:
            start = 0
        end = len_words
        if end > i + 15:
            end = i + 15
        for j, t in enumerate(top_words):
            if w == t:
                continue
            if t in words[start:end]:
                A[keys_indices[w], j] += 1
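To make the windowing logic concrete, here is a self-contained sketch of the same co-occurrence counting on a toy corpus, with a window of 2 instead of 15 and the boundary clamping written via max/min:

```python
import numpy as np

# Toy corpus: a single tokenized document
demo_words = ["x", "y", "x", "z", "y"]
demo_keys = sorted(set(demo_words))          # rows: all words
demo_top = ["x", "y"]                        # columns: frequent context words
demo_indices = {w: i for i, w in enumerate(demo_keys)}

window = 2  # the notebook uses 15
demo_A = np.zeros((len(demo_keys), len(demo_top)))
for i, w in enumerate(demo_words):
    start = max(0, i - window)
    end = min(len(demo_words), i + window + 1)
    for j, t in enumerate(demo_top):
        # count how often top word t appears near w (ignoring w == t)
        if w != t and t in demo_words[start:end]:
            demo_A[demo_indices[w], j] += 1
```

Each row of the resulting matrix is a context profile: "z" co-occurs once with "x" and once with "y", while "x" sees "y" in two of its windows.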
words_per_top = np.sum(A, axis=0)
tops_per_word = np.sum(np.asarray(A > 0, 'i'), axis=1)
rows, cols = A.shape
for i in range(rows):
    for j in range(cols):
        if words_per_top[j] == 0 or tops_per_word[i] == 0:
            A[i,j] = 0
        else:
            A[i,j] = (A[i,j] / words_per_top[j]) * math.log(float(cols) / tops_per_word[i])
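The double loop implements a tf-idf-style weighting: each count is normalized by its column total and scaled by the log of the inverse fraction of top words the row co-occurs with. With numpy broadcasting the same weighting can be written without loops; a sketch on a toy count matrix, assuming no all-zero rows or columns:

```python
import numpy as np

demo_counts = np.array([[2.0, 0.0],
                        [1.0, 3.0]])
col_totals = demo_counts.sum(axis=0)             # words_per_top
nonzero_per_row = (demo_counts > 0).sum(axis=1)  # tops_per_word
n_cols = demo_counts.shape[1]

# normalize per column, then scale each row by its log-idf factor
demo_W = (demo_counts / col_totals) * np.log(float(n_cols) / nonzero_per_row)[:, None]
```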
# with how many of the top words does "bia" co-occur?
tops_per_word[keys_indices['bia']]
311
#U, S, Vt = scipy.sparse.linalg.svds(A, 100)
s_A = scipy.sparse.csc_matrix(A)
ut, s, vt = sparsesvd(s_A, 100)
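If sparsesvd is not available, the commented-out scipy.sparse.linalg.svds call above computes the same truncated factorization; note that sparsesvd returns ut with shape (k, m), i.e. the transpose of the U that svds returns. A sketch on random data:

```python
import numpy as np
import scipy.sparse
import scipy.sparse.linalg

demo_M = scipy.sparse.csc_matrix(np.random.rand(20, 10))

# truncated SVD keeping the 5 largest singular values
U, S, Vt = scipy.sparse.linalg.svds(demo_M, k=5)

# rank-5 approximation of the original matrix
demo_rec = np.dot(U, np.dot(np.diag(S), Vt))
```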
with open("bar-ut.bin", "wb") as out:
    np.save(out, ut)
with open("bar-s.bin", "wb") as out:
    np.save(out, s)
with open("bar-vt.bin", "wb") as out:
    np.save(out, vt)
import pickle
with open("bar-indices.pickle", "wb") as f:
    pickle.dump(keys_indices, f, 2)
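The saved factors and the index can later be loaded back with numpy.load and pickle.load; a round-trip sketch with hypothetical file names:

```python
import pickle
import numpy as np

demo_arr = np.arange(6).reshape(2, 3)
with open("demo.bin", "wb") as f:
    np.save(f, demo_arr)
with open("demo.bin", "rb") as f:
    demo_loaded = np.load(f)

demo_index = {"bia": 0}
with open("demo.pickle", "wb") as f:
    pickle.dump(demo_index, f, 2)
with open("demo.pickle", "rb") as f:
    demo_index2 = pickle.load(f)
```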
plt.plot(s)
plt.show()
#import scipy.linalg
#reconstructed_matrix = np.dot(np.dot(U, scipy.linalg.diagsvd(S,len(S),len(S))), Vt)
reconstructed_matrix = np.dot(ut.T, np.dot(np.diag(s), vt))
tree = scipy.spatial.cKDTree(reconstructed_matrix)
neighbours = tree.query(reconstructed_matrix[keys_indices[u"bia"]], k=100)
subset = reconstructed_matrix[neighbours[1]]
words = [keys[i] for i in neighbours[1]]
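cKDTree builds a k-d tree over the row vectors, and query returns a (distances, indices) pair; since we query with a vector that is itself in the tree, the nearest neighbour is always the word itself at distance 0. A toy example:

```python
import numpy as np
import scipy.spatial

demo_pts = np.array([[0.0, 0.0],
                     [1.0, 0.0],
                     [5.0, 5.0]])
demo_tree = scipy.spatial.cKDTree(demo_pts)

# the two nearest neighbours of the first point: itself, then (1, 0)
dists, idx = demo_tree.query(demo_pts[0], k=2)
```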
tempU, tempS, tempVt = scipy.linalg.svd(subset)
plt.plot(tempS)
plt.show()
coords = tempU[:,1:3]
plt.figure(1, figsize=(16,12))
plt.plot(tempU[:,1], tempU[:,2], marker="o", linestyle="None")
for label, x, y in zip(words, tempU[:,1], tempU[:,2]):
    plt.annotate(
        label,
        xy=(x, y), xytext=(-5, 5),
        textcoords='offset points', ha='right', va='bottom',
        bbox=dict(boxstyle='round,pad=0.5', fc='yellow', alpha=0.5))
plt.show()