In this tutorial we will use a Wikipedia corpus to extract semantically similar words for a given word. We will apply a technique called "Latent Semantic Analysis" (LSA) to a collocation matrix extracted from the Bavarian Wikipedia. The tutorial is based on ideas from the paper "Visualisation Techniques for Analysing Meaning" by Widdows and Dorow (2002). You can read more about LSA and related techniques in the book Introduction to Information Retrieval. The data was prepared with the help of the Wikipedia Extractor, which is probably the easiest way to extract a pure text corpus from the Wikipedia dumps. The dumps contain all kinds of markup used in Wikipedia that we need to remove before we can process the data. This was already done and the data was transformed to LAF/GrAF. We will use a Python GrAF parser to read the data. For most of the calculations we will use numpy, scipy and sparsesvd. Finally, matplotlib will be used to visualize the space of semantic similarities.
You need to install the following Python libraries in order to execute this notebook. The easiest way to install them is via easy_install. If you are on Windows you can download setup packages for numpy and scipy here.
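On a standard Python setup with setuptools, the installation might look like this (package names on PyPI are assumed here; pip works just as well):

```shell
easy_install requests numpy scipy matplotlib sparsesvd graf-python
```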
First we change the directory to somewhere where we can download and extract the data. If you want to download and extract to the current directory you can just skip this step:
%cd "h:\ProjectsWin\git-github\poio-corpus\build\"
h:\ProjectsWin\git-github\poio-corpus\build
Next, we import all the Python modules that we need:
import re
import io
import math
import codecs
import zipfile
import requests
import numpy as np
import matplotlib.pyplot as plt
import scipy.spatial
import scipy.sparse
import scipy.linalg
from sparsesvd import sparsesvd
import graf
A list of Bavarian stopwords was already compiled and is available for download. The next block of code downloads this list of stopwords and stores it in a variable stopwords. We will also download a list of characters that we want to ignore in Bavarian words and store it in the variable ignorechars:
r = requests.get("https://www.poio.eu/static/stopwords/bar.txt")
stopwords = r.content.decode("utf-8").split()
r = requests.get("https://www.poio.eu/static/ignorechars/bar.txt")
ignorechars = r.content.decode("utf-8")
In the next step we will download and extract the corpus. The corpus was pre-compiled from Wikipedia dumps and converted to a set of GrAF files. To parse the corpus we will use the library graf-python (see "Prerequisites" above). We will store each Wikipedia document as a Unicode string in the list documents:
r = requests.get("https://www.poio.eu/static/corpus/barwiki-20130813.zip")
with open("barwiki-20130813.zip", "wb") as f:
    f.write(r.content)
z = zipfile.ZipFile("barwiki-20130813.zip")
z.extractall()
gp = graf.GraphParser()
g = gp.parse("barwiki-20130813.hdr")
with codecs.open("barwiki-20130813.txt", "r", "utf-8") as text:
    txt = text.read()
documents = list()
for n in g.nodes:
    if n.id.startswith("doc..") and len(n.links) > 0 and len(n.links[0]) > 0:
        doc = txt[n.links[0][0].start:n.links[0][0].end]
        documents.append(doc)
re_ignore_chars = re.compile(u"[{0}]".format(ignorechars))
def _words_for_document(doc):
    words = doc.split()
    words2 = list()
    for w in words:
        w = re_ignore_chars.sub("", w.lower())
        if not w or w in stopwords:
            continue
        words2.append(w)
    return words2
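As a quick illustration of what this filtering does, here is a self-contained sketch with hypothetical stand-ins for the downloaded stopword list and ignore characters (the real lists come from the downloads above):

```python
import re

# Hypothetical stand-ins for the downloaded lists
demo_stopwords = ["de", "und", "da"]
demo_ignore = re.compile(u"[.,!?]")

def demo_tokenize(doc):
    # lowercase, strip ignored characters, drop empty tokens and stopwords
    result = []
    for w in doc.split():
        w = demo_ignore.sub("", w.lower())
        if w and w not in demo_stopwords:
            result.append(w)
    return result

demo_tokenize(u"De Sunn und da Mond!")  # -> ['sunn', 'mond']
```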
wdict = {}
for i, d in enumerate(documents):
    for w in _words_for_document(d):
        if w in wdict:
            wdict[w].append(i)
        else:
            wdict[w] = [i]
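The if/else above is the classic inverted-index idiom: wdict maps each word to the list of document indices it occurs in. With collections.defaultdict the same mapping can be written more compactly; a sketch on toy documents (tokenized by a plain split here instead of _words_for_document):

```python
from collections import defaultdict

demo_docs = ["a b c", "b c", "c"]
demo_wdict = defaultdict(list)
for i, d in enumerate(demo_docs):
    for w in d.split():
        demo_wdict[w].append(i)

# "c" appears in all three documents, "a" only in the first
```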
# Which 1000 words occur most often?
top_words = [k for k in sorted(wdict, key=lambda k: len(wdict[k]), reverse=True)][:1000]
# get all words that appear at least 3 times and sort them
keys = [k for k in wdict.keys() if len(wdict[k]) > 2]
keys.sort()
keys_indices = { w: i for i, w in enumerate(keys) }
# create an empty count matrix
A = np.zeros([len(keys), len(top_words)])
for d in documents:
    words = _words_for_document(d)
    len_words = len(words) - 1
    for i, w in enumerate(words):
        if w not in keys_indices:
            continue
        start = i - 15
        if start < 0:
            start = 0
        end = len_words
        if end > i + 15:
            end = i + 15
        for j, t in enumerate(top_words):
            if w == t:
                continue
            if t in words[start:end]:
                A[keys_indices[w], j] += 1
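To make the windowing logic concrete, here is a self-contained sketch of the same co-occurrence counting on a toy corpus, with a window of 2 instead of 15 and the boundary clamping written via max/min:

```python
import numpy as np

# Toy corpus: a single tokenized document
demo_words = ["x", "y", "x", "z", "y"]
demo_keys = sorted(set(demo_words))          # rows: all words
demo_top = ["x", "y"]                        # columns: frequent context words
demo_indices = {w: i for i, w in enumerate(demo_keys)}

window = 2  # the notebook uses 15
demo_A = np.zeros((len(demo_keys), len(demo_top)))
for i, w in enumerate(demo_words):
    start = max(0, i - window)
    end = min(len(demo_words), i + window + 1)
    for j, t in enumerate(demo_top):
        # count how often top word t appears near w (ignoring w == t)
        if w != t and t in demo_words[start:end]:
            demo_A[demo_indices[w], j] += 1
```

Each row of the resulting matrix is a context profile: "z" co-occurs once with "x" and once with "y", while "x" sees "y" in two of its windows.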
words_per_top = np.sum(A, axis=0)
tops_per_word = np.sum(np.asarray(A > 0, 'i'), axis=1)
rows, cols = A.shape
for i in range(rows):
    for j in range(cols):
        if words_per_top[j] == 0 or tops_per_word[i] == 0:
            A[i,j] = 0
        else:
            A[i,j] = (A[i,j] / words_per_top[j]) * math.log(float(cols) / tops_per_word[i])
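The double loop implements a tf-idf-style weighting: each count is normalized by its column total and scaled by the log of the inverse fraction of top words the row co-occurs with. With numpy broadcasting the same weighting can be written without loops; a sketch on a toy count matrix, assuming no all-zero rows or columns:

```python
import numpy as np

demo_counts = np.array([[2.0, 0.0],
                        [1.0, 3.0]])
col_totals = demo_counts.sum(axis=0)             # words_per_top
nonzero_per_row = (demo_counts > 0).sum(axis=1)  # tops_per_word
n_cols = demo_counts.shape[1]

# normalize per column, then scale each row by its log-idf factor
demo_W = (demo_counts / col_totals) * np.log(float(n_cols) / nonzero_per_row)[:, None]
```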
# with how many of the top words does "bia" co-occur?
tops_per_word[keys_indices['bia']]
311
#U, S, Vt = scipy.sparse.linalg.svds(A, 100)
s_A = scipy.sparse.csc_matrix(A)
ut, s, vt = sparsesvd(s_A, 100)
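If sparsesvd is not available, the commented-out scipy.sparse.linalg.svds call above computes the same truncated factorization; note that sparsesvd returns ut with shape (k, m), i.e. the transpose of the U that svds returns. A sketch on random data:

```python
import numpy as np
import scipy.sparse
import scipy.sparse.linalg

demo_M = scipy.sparse.csc_matrix(np.random.rand(20, 10))

# truncated SVD keeping the 5 largest singular values
U, S, Vt = scipy.sparse.linalg.svds(demo_M, k=5)

# rank-5 approximation of the original matrix
demo_rec = np.dot(U, np.dot(np.diag(S), Vt))
```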
with open("bar-ut.bin", "wb") as out:
    np.save(out, ut)
with open("bar-s.bin", "wb") as out:
    np.save(out, s)
with open("bar-vt.bin", "wb") as out:
    np.save(out, vt)
import pickle
with open("bar-indices.pickle", "wb") as f:
    pickle.dump(keys_indices, f, 2)
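The saved factors and the index can later be loaded back with numpy.load and pickle.load; a round-trip sketch with hypothetical file names:

```python
import pickle
import numpy as np

demo_arr = np.arange(6).reshape(2, 3)
with open("demo.bin", "wb") as f:
    np.save(f, demo_arr)
with open("demo.bin", "rb") as f:
    demo_loaded = np.load(f)

demo_index = {"bia": 0}
with open("demo.pickle", "wb") as f:
    pickle.dump(demo_index, f, 2)
with open("demo.pickle", "rb") as f:
    demo_index2 = pickle.load(f)
```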
plt.plot(s)
plt.show()
#import scipy.linalg
#reconstructed_matrix = np.dot(np.dot(U, scipy.linalg.diagsvd(S,len(S),len(S))), Vt)
reconstructed_matrix = np.dot(ut.T, np.dot(np.diag(s), vt))
tree = scipy.spatial.cKDTree(reconstructed_matrix)
neighbours = tree.query(reconstructed_matrix[keys_indices[u"bia"]], k=100)
subset = reconstructed_matrix[neighbours[1]]
words = [keys[i] for i in neighbours[1]]
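cKDTree builds a k-d tree over the row vectors, and query returns a (distances, indices) pair; since we query with a vector that is itself in the tree, the nearest neighbour is always the word itself at distance 0. A toy example:

```python
import numpy as np
import scipy.spatial

demo_pts = np.array([[0.0, 0.0],
                     [1.0, 0.0],
                     [5.0, 5.0]])
demo_tree = scipy.spatial.cKDTree(demo_pts)

# the two nearest neighbours of the first point: itself, then (1, 0)
dists, idx = demo_tree.query(demo_pts[0], k=2)
```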
tempU, tempS, tempVt = scipy.linalg.svd(subset)
plt.plot(tempS)
plt.show()
coords = tempU[:,1:3]
plt.figure(1, figsize=(16,12))
plt.plot(tempU[:,1], tempU[:,2], marker="o", linestyle="None")
for label, x, y in zip(words, tempU[:,1], tempU[:,2]):
    plt.annotate(
        label,
        xy=(x, y), xytext=(-5, 5),
        textcoords='offset points', ha='right', va='bottom',
        bbox=dict(boxstyle='round,pad=0.5', fc='yellow', alpha=0.5))
plt.show()