This notebook is a short helper for Python beginners and Wikipedia API users who want to learn how to work with the results of queries that ask for diffs between page revisions. We will also cover very basic usage of Natural Language Processing (NLP) with the nltk library.
The code shown below has been produced within the wekeypedia project to build term-editor and term-page bipartite networks. Equivalent procedures have been implemented within our Python library.
Most likely you will perform queries on the Wikipedia API that look something like:
{
  "format": "json",
  "action": "query",
  "titles": [page title],
  "redirects": "true",
  "prop": "info|revisions",
  "inprop": "url",
  "rvdiffto": "prev"
}
As this notebook is not about making queries, we are going to use the wrappers that have been packaged within our wekeypedia Python library directly. There is a bundled module, wekeypedia.wikipedia.api, that allows you to build queries and get back the JSON result. You can also use WikipediaPage.get_diff.
import os, sys, pprint, random
from collections import defaultdict
from bs4 import BeautifulSoup
import nltk
from IPython.display import display, HTML
sys.path.append(os.path.abspath('../../WKP-python-toolkit'))
import wekeypedia
p = wekeypedia.WikipediaPage("Love")
revisions_list = p.get_revisions_list()
# diff = p.get_diff(100000308)
diff = p.get_diff(194033798)
When you ask for a diff between two revisions, the Wikipedia API will likely give you back something like this:
<tr>
<td colspan="2" class="diff-lineno">Line 172:</td>
<td colspan="2" class="diff-lineno">Line 172:</td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td class="diff-context"><div>''[[Adve&#7779;a]]'' and ''[[metta|maitr&#299;]]'' are benevolent love. This love is unconditional and requires considerable self-acceptance. This is quite different from the ordinary love, which is usually about attachment and sex, which rarely occur without self-interest. Instead, in Buddhism it refers to detachment and unselfish interest in others' welfare.</div></td>
<td class="diff-marker"> </td>
<td class="diff-context"><div>''[[Adve&#7779;a]]'' and ''[[metta|maitr&#299;]]'' are benevolent love. This love is unconditional and requires considerable self-acceptance. This is quite different from the ordinary love, which is usually about attachment and sex, which rarely occur without self-interest. Instead, in Buddhism it refers to detachment and unselfish interest in others' welfare.</div></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td class="diff-context"></td>
<td class="diff-marker"> </td>
<td class="diff-context"></td>
</tr>
<tr>
<td class="diff-marker">−</td>
<td class="diff-deletedline"><div>The Bodhisattva ideal in <del class="diffchange diffchange-inline">Tibetan</del> Buddhism involves the complete renunciation of oneself in order to take on the burden of a suffering world. The strongest motivation one has in order to take the path of the Bodhisattva is the idea of salvation within unselfish love for <del class="diffchange diffchange-inline">others</del>.</div></td>
<td class="diff-marker">+</td>
<td class="diff-addedline"><div>The <ins class="diffchange diffchange-inline">[[</ins>Bodhisattva<ins class="diffchange diffchange-inline">]]</ins> ideal in <ins class="diffchange diffchange-inline">Mahayana</ins> Buddhism involves the complete renunciation of oneself in order to take on the burden of a suffering world. The strongest motivation one has in order to take the path of the Bodhisattva is the idea of salvation within unselfish<ins class="diffchange diffchange-inline">, altustic</ins> love for <ins class="diffchange diffchange-inline">all sentient beings</ins>.</div></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td class="diff-context"></td>
<td class="diff-marker"> </td>
<td class="diff-context"></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td class="diff-context"><div>===Hindu===</div></td>
<td class="diff-marker"> </td>
<td class="diff-context"><div>===Hindu===</div></td>
</tr>
display(HTML("<h3>raw html query result</h3>"))
css = """
<style>
.rendered_html tr, .rendered_html td { border: none; border-collapse:collapse; }
.diff-deletedline { background-color : #FFADC6; }
.diff-deletedline del { background-color : #F05151; }
.diff-addedline { background-color : #99FFC3; }
.diff-addedline ins { background-color : #4EF277; }
</style>
"""
display(HTML(css))
display(HTML(diff))
There are several kinds of information we can extract: <ins>, <del>, and various combinations of both within <td class="diff-addedline"> and <td class="diff-deletedline"> cells. The only tricky part is to avoid registering false positives, because class="diff-addedline" and class="diff-deletedline" are also used, respectively, to show the previous state of a deletion or the current state of an addition. That is why the following code targets rows (<tr>) instead of cells. The strategy is to keep only added blocks that are preceded by an empty cell (<td class="diff-empty">) and deleted blocks that are followed by an empty cell.
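To make the false-positive issue concrete, here is a small synthetic example (the snippet is handcrafted for illustration, not real API output). The first row is an inline substitution, so its diff-addedline cell is only the new state of a change and must be skipped; the second row is a pure block addition, marked by an empty cell, and should be kept.

```python
from bs4 import BeautifulSoup

# Row 1: inline substitution (has <ins>, no diff-empty cell).
# Row 2: block addition (diff-empty cell on the left, no <ins>).
snippet = """
<table>
<tr>
  <td class="diff-deletedline"><div>old <del>words</del></div></td>
  <td class="diff-addedline"><div>new <ins>words</ins></div></td>
</tr>
<tr>
  <td colspan="2" class="diff-empty">&#160;</td>
  <td class="diff-addedline"><div>a whole new paragraph</div></td>
</tr>
</table>
"""

soup = BeautifulSoup(snippet, "html.parser")

# Keep only rows with an empty cell and no inline <ins> markers.
added_blocks = [
    tr.find("td", "diff-addedline").get_text()
    for tr in soup.find_all("tr")
    if len(tr.find_all("ins")) == 0
    and len(tr.find_all("td", "diff-empty")) > 0
]
```

Only the second row survives the filter; the first row's diff-addedline cell is correctly ignored.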
def extract(diff_html):
    diff = { "added": [], "deleted": [] }

    d = BeautifulSoup(diff_html, 'html.parser')
    tr = d.find_all("tr")

    for what in [ ["added", "ins"], ["deleted", "del"] ]:
        # checking blocks
        # we also check that the row is not only context shown for non-substitution edits
        a = [ t.find("td", "diff-%sline" % (what[0])) for t in tr
              if len(t.find_all(what[1])) == 0 and len(t.find_all("td", "diff-empty")) > 0 ]

        # checking inline
        a.extend(d.find_all(what[1]))

        # filtering empty extractions
        a = [ x for x in a if x is not None ]

        # registering
        diff[what[0]] = [ tag.get_text() for tag in a ]

    return diff
def print_plusminus_overview(diff):
    for minus in diff["deleted"]:
        print("- %s" % (minus))
    for plus in diff["added"]:
        print("+ %s" % (plus))
display(HTML("<h3>plus/minus overview</h3>"))
diff = extract(diff)
print_plusminus_overview(diff)
- Tibetan
- others
+ [[
+ ]]
+ Mahayana
+ , altustic
+ all sentient beings
We are now going to do a little bit of language processing. NLTK provides very useful starter tools to manipulate bits of natural language. The core of the workflow is tokenization and normalization.
The first step is to be able to count words correctly; this is where normalization comes in.
For now, we apply lemmatization without grammatical information. This is just to prepare for more advanced NLP work.
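As a quick illustration of what the stemming step alone does (the Porter stemmer ships with nltk and needs no extra corpus downloads, unlike the WordNet lemmatizer), inflected forms are reduced to crude root forms:

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# The Porter stemmer strips inflectional endings.
stems = [stemmer.stem(w) for w in ["beings", "loves", "running"]]
```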
def normalize(word):
    lemmatizer = nltk.WordNetLemmatizer()
    stemmer = nltk.stem.porter.PorterStemmer()

    word = word.lower()
    word = stemmer.stem(word)
    word = lemmatizer.lemmatize(word)

    return word
The process of counting stems is mainly about mapping over the tokenized plus/minus contents. Dividing sentences into parts and words can be very tedious without the right parser, or if you are looking for a universal grammar. It is also closely tied to the language itself: parsing English and parsing German, for example, are very different tasks. For the moment, we are going to use the Punkt tokenizer, because we are only dealing with English sentences.
Tokenization, stemming, and lemmatization are very sensitive steps. It is possible to develop more precise strategies depending on what you are looking for. We are going to leave them fuzzy, to give space for later uses and to keep a broad mindset about what can be done with diff information. Meanwhile, for counting purposes, the basic implementations of these methods are largely sufficient.
def count_stems(sentences, inflections=None):
    stems = defaultdict(int)
    ignore_list = "{}()[]<>./,;\"':!?&#=*&%"

    if inflections is None:
        inflections = defaultdict(dict)

    for sentence in sentences:
        for word in nltk.word_tokenize(sentence):
            old = word
            word = normalize(word)

            if word not in ignore_list:
                stems[word] += 1

                # keeping track of inflection usages
                inflections[word].setdefault(old, 0)
                inflections[word][old] += 1

    return stems
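The shape of this counting loop can be illustrated without nltk at all. The sketch below is a simplified stand-in where str.split and str.lower replace tokenization and normalization (the function name and input are hypothetical, chosen only for the example):

```python
from collections import defaultdict

def count_simple(sentences):
    # Same structure as count_stems, with str.split standing in for
    # nltk.word_tokenize and str.lower standing in for normalize.
    stems = defaultdict(int)
    inflections = defaultdict(dict)
    for sentence in sentences:
        for old in sentence.split():
            word = old.lower()
            stems[word] += 1
            # remember which surface form produced this "stem"
            inflections[word].setdefault(old, 0)
            inflections[word][old] += 1
    return stems, inflections

demo_stems, demo_inflections = count_simple(["Love loves love"])
```

Here "Love" and "love" collapse onto the same key while the inflections dictionary keeps both surface forms apart.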
def print_plusminus_terms_overview(stems):
    print("\n%s|%s\n" % ("+" * len(stems["added"]), "-" * len(stems["deleted"])))

def print_plusminus_terms(stems):
    for k in stems.keys():
        display(HTML("<h4>%s:</h4>" % (k)))
        for term in stems[k]:
            print(term)
inflections = defaultdict(dict)
display(HTML("<h3>plus/minus ---> terms</h3>"))
stems = {}
stems["added"] = count_stems(diff["added"], inflections)
stems["deleted"] = count_stems(diff["deleted"], inflections)
print_plusminus_terms(stems)
deleted:
other
tibetan
added:
altust
mahayana
all
sentient
be
We have also kept track of inflections. This is not very important over a single diff, but it becomes interesting once you have collected inflections over a large set of words. For example, you might want to use the most common inflection instead of the stem form to produce more readable/pretty word clouds.
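Picking the most common inflection per stem is a one-liner over the collected counts. The counts below are made up for the example; any mapping of stem to form-count dictionaries works the same way:

```python
# Hypothetical inflection counts gathered over many diffs.
inflections = {
    "be": {"beings": 3, "being": 1, "be": 2},
    "tibetan": {"Tibetan": 2},
}

# For each stem, keep the surface form seen most often.
pretty = {
    stem: max(forms.items(), key=lambda kv: kv[1])[0]
    for stem, forms in inflections.items()
}
```

With these counts, "be" would be displayed as "beings" and "tibetan" as "Tibetan" in a word cloud.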
display(HTML("<h3>inflections</h3>"))
for stem, i in inflections.iteritems():
print "[%s] %s" % (stem, ", ".join(map(lambda x: "%s (%s)" % (x[0], x[1]), i.items())))
[be] beings (1)
[mahayana] Mahayana (1)
[sentient] sentient (1)
[altust] altustic (1)
[all] all (1)
[other] others (1)
[tibetan] Tibetan (1)
This procedure is extensively used in the words of wisdom and love notebook, which counts recurring terms in the diffs of the love, ethics, wisdom, and morality pages.