This notebook is a short helper for Python beginners and Wikipedia API users who want to learn how to work with the results of queries that ask for diffs between page revisions. We will also cover very basic usage of Natural Language Processing (NLP) with the nltk library.
The code shown below has been produced within the wekeypedia project to build term-editor and term-page bipartite networks. Equivalent procedures have been implemented within our Python library.
Most likely you will perform queries on the Wikipedia API that look something like:
{
  "format": "json",
  "action": "query",
  "titles": [page title],
  "redirects": "true",
  "prop": "info|revisions",
  "inprop": "url",
  "rvdiffto": "prev"
}
As this notebook is not about making queries, we are going to use the wrappers that have been packaged within our wekeypedia Python library directly. There is a bundled module, wekeypedia.wikipedia.api, that allows you to build queries and get back the JSON result. You can also use WikipediaPage.get_diff.
import os, sys, pprint, random
from collections import defaultdict
from bs4 import BeautifulSoup
import nltk
from IPython.display import display, HTML
sys.path.append(os.path.abspath('../../WKP-python-toolkit'))
import wekeypedia
p = wekeypedia.WikipediaPage("Love")
revisions_list = p.get_revisions_list()
# diff = p.get_diff(100000308)
diff = p.get_diff(194033798)
When you ask for a diff between two revisions, the Wikipedia API will likely give you back something like this:
<tr>
<td colspan="2" class="diff-lineno">Line 172:</td>
<td colspan="2" class="diff-lineno">Line 172:</td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td class="diff-context"><div>''[[Adve&#7779;a]]'' and ''[[metta|maitr&#299;]]'' are benevolent love. This love is unconditional and requires considerable self-acceptance. This is quite different from the ordinary love, which is usually about attachment and sex, which rarely occur without self-interest. Instead, in Buddhism it refers to detachment and unselfish interest in others' welfare.</div></td>
<td class="diff-marker"> </td>
<td class="diff-context"><div>''[[Adve&#7779;a]]'' and ''[[metta|maitr&#299;]]'' are benevolent love. This love is unconditional and requires considerable self-acceptance. This is quite different from the ordinary love, which is usually about attachment and sex, which rarely occur without self-interest. Instead, in Buddhism it refers to detachment and unselfish interest in others' welfare.</div></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td class="diff-context"></td>
<td class="diff-marker"> </td>
<td class="diff-context"></td>
</tr>
<tr>
<td class="diff-marker">−</td>
<td class="diff-deletedline"><div>The Bodhisattva ideal in <del class="diffchange diffchange-inline">Tibetan</del> Buddhism involves the complete renunciation of oneself in order to take on the burden of a suffering world. The strongest motivation one has in order to take the path of the Bodhisattva is the idea of salvation within unselfish love for <del class="diffchange diffchange-inline">others</del>.</div></td>
<td class="diff-marker">+</td>
<td class="diff-addedline"><div>The <ins class="diffchange diffchange-inline">[[</ins>Bodhisattva<ins class="diffchange diffchange-inline">]]</ins> ideal in <ins class="diffchange diffchange-inline">Mahayana</ins> Buddhism involves the complete renunciation of oneself in order to take on the burden of a suffering world. The strongest motivation one has in order to take the path of the Bodhisattva is the idea of salvation within unselfish<ins class="diffchange diffchange-inline">, altustic</ins> love for <ins class="diffchange diffchange-inline">all sentient beings</ins>.</div></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td class="diff-context"></td>
<td class="diff-marker"> </td>
<td class="diff-context"></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td class="diff-context"><div>===Hindu===</div></td>
<td class="diff-marker"> </td>
<td class="diff-context"><div>===Hindu===</div></td>
</tr>
display(HTML("<h3>raw html query result</h3>"))
css = """
<style>
.rendered_html tr, .rendered_html td { border: none; border-collapse:collapse; }
.diff-deletedline { background-color : #FFADC6; }
.diff-deletedline del { background-color : #F05151; }
.diff-addedline { background-color : #99FFC3; }
.diff-addedline ins { background-color : #4EF277; }
</style>
"""
display(HTML(css))
display(HTML(diff))
There are several kinds of information we can extract: <ins>, <del>, and various combinations of both within <td class="diff-addedline"> and <td class="diff-deletedline"> cells. The only tricky part is to avoid registering false positives, because class="diff-addedline" and class="diff-deletedline" are also used, respectively, to show the previous state of a deletion or the current state of an addition. That is why the following code targets rows (<tr>) instead of cells. The strategy is to keep only added blocks that are preceded by an empty cell (<td class="diff-empty">) and deleted blocks that are followed by an empty cell.
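To make the false-positive issue concrete, here is a small synthetic example (the snippet is handcrafted for illustration, not real API output). The first row is an inline substitution, so its diff-addedline cell is only the new state of a change and must be skipped; the second row is a pure block addition, marked by an empty cell, and should be kept.

```python
from bs4 import BeautifulSoup

# Row 1: inline substitution (has <ins>, no diff-empty cell).
# Row 2: block addition (diff-empty cell on the left, no <ins>).
snippet = """
<table>
<tr>
  <td class="diff-deletedline"><div>old <del>words</del></div></td>
  <td class="diff-addedline"><div>new <ins>words</ins></div></td>
</tr>
<tr>
  <td colspan="2" class="diff-empty">&#160;</td>
  <td class="diff-addedline"><div>a whole new paragraph</div></td>
</tr>
</table>
"""

soup = BeautifulSoup(snippet, "html.parser")

# Keep only rows with an empty cell and no inline <ins> markers.
added_blocks = [
    tr.find("td", "diff-addedline").get_text()
    for tr in soup.find_all("tr")
    if len(tr.find_all("ins")) == 0
    and len(tr.find_all("td", "diff-empty")) > 0
]
```

Only the second row survives the filter; the first row's diff-addedline cell is correctly ignored.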
def extract(diff_html):
    diff = { "added": [], "deleted": [] }

    d = BeautifulSoup(diff_html, 'html.parser')
    tr = d.find_all("tr")

    for what in [ ["added", "ins"], ["deleted", "del"] ]:
        # checking blocks
        # we also check that the row is not only context shown for non-substitution edits
        a = [ t.find("td", "diff-%sline" % (what[0])) for t in tr
              if len(t.find_all(what[1])) == 0 and len(t.find_all("td", "diff-empty")) > 0 ]

        # checking inline
        a.extend(d.find_all(what[1]))

        # filtering empty extractions
        a = [ x for x in a if x is not None ]

        # registering
        diff[what[0]] = [ tag.get_text() for tag in a ]

    return diff
def print_plusminus_overview(diff):
    for minus in diff["deleted"]:
        print("- %s" % (minus))
    for plus in diff["added"]:
        print("+ %s" % (plus))
display(HTML("<h3>plus/minus overview</h3>"))
diff = extract(diff)
print_plusminus_overview(diff)
- Tibetan
- others
+ [[
+ ]]
+ Mahayana
+ , altustic
+ all sentient beings
We are now going to do a little bit of language processing. NLTK provides very useful starter tools to manipulate bits of natural language. The core of the workflow is tokenization and normalization.
The first step is to be able to count words correctly; this is where normalization comes in.
For now, we apply lemmatization without grammatical information. This is just to prepare for more advanced NLP work.
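As a quick illustration of what the stemming step alone does (the Porter stemmer ships with nltk and needs no extra corpus downloads, unlike the WordNet lemmatizer), inflected forms are reduced to crude root forms:

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# The Porter stemmer strips inflectional endings.
stems = [stemmer.stem(w) for w in ["beings", "loves", "running"]]
```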
def normalize(word):
    lemmatizer = nltk.WordNetLemmatizer()
    stemmer = nltk.stem.porter.PorterStemmer()

    word = word.lower()
    word = stemmer.stem(word)
    word = lemmatizer.lemmatize(word)

    return word
The process of counting stems is mainly about mapping over the tokenized plus/minus contents. Dividing sentences into parts and words can be very tedious without the right parser, or if you are looking for a universal grammar. It is also closely tied to the language itself: parsing English and parsing German, for example, are very different tasks. For the moment, we are going to use the Punkt tokenizer, because we are only dealing with English sentences.
Tokenization, stemming, and lemmatization are very sensitive steps. It is possible to develop more precise strategies depending on what you are looking for. We are going to leave them fuzzy, to give space for later uses and to keep a broad mindset about what can be done with diff information. Meanwhile, for counting purposes, the basic implementations of these methods are largely sufficient.
def count_stems(sentences, inflections=None):
    stems = defaultdict(int)
    ignore_list = "{}()[]<>./,;\"':!?&#=*&%"

    if inflections is None:
        inflections = defaultdict(dict)

    for sentence in sentences:
        for word in nltk.word_tokenize(sentence):
            old = word
            word = normalize(word)

            if word not in ignore_list:
                stems[word] += 1

                # keeping track of inflection usages
                inflections[word].setdefault(old, 0)
                inflections[word][old] += 1

    return stems
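The shape of this counting loop can be illustrated without nltk at all. The sketch below is a simplified stand-in where str.split and str.lower replace tokenization and normalization (the function name and input are hypothetical, chosen only for the example):

```python
from collections import defaultdict

def count_simple(sentences):
    # Same structure as count_stems, with str.split standing in for
    # nltk.word_tokenize and str.lower standing in for normalize.
    stems = defaultdict(int)
    inflections = defaultdict(dict)
    for sentence in sentences:
        for old in sentence.split():
            word = old.lower()
            stems[word] += 1
            # remember which surface form produced this "stem"
            inflections[word].setdefault(old, 0)
            inflections[word][old] += 1
    return stems, inflections

demo_stems, demo_inflections = count_simple(["Love loves love"])
```

Here "Love" and "love" collapse onto the same key while the inflections dictionary keeps both surface forms apart.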
def print_plusminus_terms_overview(stems):
    print("\n%s|%s\n" % ("+" * len(stems["added"]), "-" * len(stems["deleted"])))

def print_plusminus_terms(stems):
    for k in stems.keys():
        display(HTML("<h4>%s:</h4>" % (k)))
        for term in stems[k]:
            print(term)
inflections = defaultdict(dict)
display(HTML("<h3>plus/minus ---> terms</h3>"))
stems = {}
stems["added"] = count_stems(diff["added"], inflections)
stems["deleted"] = count_stems(diff["deleted"], inflections)
print_plusminus_terms(stems)
deleted:
other
tibetan
added:
altust
mahayana
all
sentient
be
We have also kept track of inflections. This is not very important over a single diff, but it becomes interesting once you have collected inflections over a large set of words. For example, you might want to use the most common inflection instead of the stem form to produce more readable/pretty word clouds.
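Picking the most common inflection per stem is a one-liner over the collected counts. The counts below are made up for the example; any mapping of stem to form-count dictionaries works the same way:

```python
# Hypothetical inflection counts gathered over many diffs.
inflections = {
    "be": {"beings": 3, "being": 1, "be": 2},
    "tibetan": {"Tibetan": 2},
}

# For each stem, keep the surface form seen most often.
pretty = {
    stem: max(forms.items(), key=lambda kv: kv[1])[0]
    for stem, forms in inflections.items()
}
```

With these counts, "be" would be displayed as "beings" and "tibetan" as "Tibetan" in a word cloud.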
display(HTML("<h3>inflections</h3>"))
for stem, i in inflections.iteritems():
print "[%s] %s" % (stem, ", ".join(map(lambda x: "%s (%s)" % (x[0], x[1]), i.items())))
[be] beings (1)
[mahayana] Mahayana (1)
[sentient] sentient (1)
[altust] altustic (1)
[all] all (1)
[other] others (1)
[tibetan] Tibetan (1)
This procedure is extensively used in the words of wisdom and love notebook, which counts recurring terms in the diffs of the love, ethics, wisdom, and morality pages.