Record linkage with Nazca - part 2 - Normalization and blockings

This IPython notebook shows some features of the Python Nazca library:

  • original notebook: here!
  • date: 2014-07-01
  • author: Vincent Michel ([email protected], [email protected]) @HowIMetYourData

Normalization functions - nazca.utils.normalize

This module provides several normalization and preprocessing functions.

In [1]:
import nazca.utils.normalize as nno

unormalize

Replace diacritical characters with their corresponding ASCII characters.

In [2]:
print nno.unormalize(u'toto')
print nno.unormalize(u'tété')
print nno.unormalize(u'tuté')
print nno.unormalize(u'tÉté')
toto
tete
tute
tEte

For non-ASCII characters, an error is raised, but you can give a substitute character to avoid it.

In [3]:
print nno.unormalize(u'Βίκτωρ Οὑγκώ', substitute='_')
______ _____

There is also a lunormalize function that additionally sets the sentence to lower case.

In [4]:
print nno.lunormalize(u'tÉté')
tete

simplify

Simplify a given sentence by performing the following actions:

  1. If remove_stopwords, then remove the stop words (for now, only French stopwords are available)
  2. If lemmas are given, the sentence is lemmatized
  3. Set the sentence to lower case
  4. Remove punctuation
In [5]:
print nno.simplify(u"""Écrivain. - Artiste graphiste, auteur de lavis. - 
                       Membre de l'Institut, Académie française (élu en 1841)""")
écrivain  artiste graphiste auteur lavis  
            membre l institut académie française élu 1841

In [6]:
from nazca.data import FRENCH_LEMMAS
print nno.simplify(u""" Victor Hugo occupe une place importante dans l'histoire 
                        des lettres françaises et celle du XIX siècle,
                        dans des genres et des domaines d'une remarquable variété.""",
                   lemmas=FRENCH_LEMMAS)
victor hugo occuper placer important histoire lettre françaises celui xix siècle genre domaine remarquable variété

A lemmatize function is also available.
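
A minimal usage sketch, assuming the function is exposed as lemmatized and takes the sentence followed by a lemma dictionary (check nazca.utils.normalize for the exact name and signature):

  from nazca.data import FRENCH_LEMMAS
  # hypothetical call, mirroring the simplify(..., lemmas=FRENCH_LEMMAS) example above
  print nno.lemmatized(u'Victor Hugo occupe une place importante', FRENCH_LEMMAS)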

tokenize

Create a list of tokens from a sentence. A tokenizer object may be given, and should have a tokenize() method returning a list of tokens from a string.

In [7]:
print nno.tokenize(u""" Victor Hugo occupe une place importante dans l'histoire 
                        des lettres françaises et celle du XIX siècle,
                        dans des genres et des domaines d'une remarquable variété.""")
[u'Victor', u'Hugo', u'occupe', u'une', u'place', u'importante', u'dans', u"l'", u'histoire', u'des', u'lettres', u'fran\xe7aises', u'et', u'celle', u'du', u'XIX', u'si\xe8cle,', u'dans', u'des', u'genres', u'et', u'des', u'domaines', u"d'", u'une', u'remarquable', u'vari\xe9t\xe9.']
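
Any object with such a tokenize() method can serve as the tokenizer. A minimal sketch of a conforming object (purely illustrative, splitting on whitespace only; check the signature of nno.tokenize for how to pass it in):

  class WhitespaceTokenizer(object):
      """ Toy tokenizer exposing the expected tokenize() interface """
      def tokenize(self, sentence):
          # naive whitespace split; a real tokenizer would also handle punctuation
          return sentence.split()

  print WhitespaceTokenizer().tokenize(u'Victor Hugo occupe une place importante')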

roundstr

Return a unicode string of a number rounded to a given precision in decimal digits.

In [8]:
print nno.roundstr('3.1415', 2)
3.14

rgxformat

Apply a regular expression to a string and return a formatted string with information extracted by the regular expression.

In [9]:
print nno.rgxformat(u'[Victor Hugo - 26 fev 1802 / 22 mai 1885]',
                    r'\[(?P<firstname>\w+) (?P<lastname>\w+) - (?P<birthdate>.*) \/ (?P<deathdate>.*?)\]',
                    u'%(lastname)s, %(firstname)s (%(birthdate)s - %(deathdate)s)')
Hugo, Victor (26 fev 1802 - 22 mai 1885)

BaseNormalizer - nazca.utils.normalize

The class BaseNormalizer is used to hold all necessary information for a normalization of records. It requires:

  • the index of the attribute to be normalized
  • the normalization function to be used

The prototype of the BaseNormalizer class is:

  class BaseNormalizer(object):

     def normalize(self, record):
        """ Normalize a record"""

     def normalize_dataset(self, dataset, inplace=False): 
        """ Normalize a dataset"""

Nazca provides a normalization class for each of the normalization functions previously described.

In [10]:
normalizer = nno.RoundNormalizer(attr_index=2)
print normalizer.normalize(('uri', 'toto', '3.1415'))
['uri', 'toto', '3']

If attr_index is None, the whole record is passed to the callback. You can also give a list of indexes:

In [11]:
normalizer = nno.RoundNormalizer(attr_index=[1, 2])
print normalizer.normalize(('uri', '22.36', '3.1415'))
['uri', '22', '3']
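
With attr_index left to None, the callback receives the whole record. A minimal sketch using BaseNormalizer directly, with an illustrative callback (our own example, not part of Nazca) that strips every field:

  normalizer = nno.BaseNormalizer(callback=lambda record: [v.strip() for v in record])
  print normalizer.normalize((' uri ', ' toto ', ' 3.1415 '))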

Of course, it is possible to define your own processing, based on a specific normalization:

In [12]:
def iamvictorhugo(value):
    """Return Victor Hugo """
    return u'Victor Hugo'


class VictorHugoNormalizer(nno.BaseNormalizer):

    def __init__(self, attr_index=None):
        super(VictorHugoNormalizer, self).__init__(iamvictorhugo, attr_index)

normalizer = VictorHugoNormalizer(1)
print normalizer.normalize(('http://data.bnf.fr/12249911/michel_houellebecq/', u'Michel Houellebecq'))
['http://data.bnf.fr/12249911/michel_houellebecq/', u'Victor Hugo']

JoinNormalizer

The JoinNormalizer is used to join multiple fields into a single one. This new field is appended at the end of the resulting record.

In [13]:
normalizer = nno.JoinNormalizer((1,2))     
print normalizer.normalize((1, 'ab', 'cd', 'e', 5))
[1, 'e', 5, 'ab, cd']

NormalizerPipeline

Finally, it is possible to chain BaseNormalizer objects in a pipeline, using the NormalizerPipeline. This pipeline takes a list of BaseNormalizer objects that will be applied in the given order.

First, we define a RegexpNormalizer:

In [14]:
regexp = r'(?P<id>\d+);{["]?(?P<firstname>.+[^"])["]?};{(?P<surname>.*)};{};{};(?P<date>.*)'
output = u'%(id)s\t%(firstname)s\t%(surname)s\t%(date)s'
n1 = nno.RegexpNormalizer(regexp, output)

Then, we define a plain BaseNormalizer that splits data on tabs:

In [15]:
n2 = nno.BaseNormalizer(callback=lambda x: x.split('\t'))

The third normalizer is a UnicodeNormalizer:

In [16]:
n3 = nno.UnicodeNormalizer(attr_index=(1, 2, 3))

We can now assemble the pipeline:

In [17]:
pipeline = nno.NormalizerPipeline((n1, n2, n3))
r1 = u'1111;{"Toto tàtà"};{Titi};{};{};'
print pipeline.normalize(r1)
[u'1111', u'toto tata', u'titi', u'']

Blockings - nazca.rl.blocking

Blockings are used to reduce the total number of possible comparisons, by only comparing records that are roughly similar.

In Nazca, the BaseBlocking class is defined as:

 class BaseBlocking(object):

     def __init__(self, ref_attr_index, target_attr_index):
         """ Build the blocking object """

     def _fit(self, refset, targetset): 
         """ Internal fit"""
         raise NotImplementedError      

     def _iter_blocks(self):       
         """ Internal iteration function over blocks"""
         raise NotImplementedError    

     def _cleanup(self):         
         """ Internal cleanup blocking for further use (e.g. in pipeline)"""
         raise NotImplementedError

A BaseBlocking should be built by defining ref_attr_index and target_attr_index, which are respectively the indexes of the attributes of the reference set and of the target set that will be used in the blocking.

The fit() method fits the blocking technique on the reference and target datasets. Once the fit is done, it is possible to call the iter_blocks() method, which yields, for each block, two lists (one for the reference set and one for the target set) of (index, id) pairs of records. It is also possible to iterate on the ids of the records only, using iter_id_blocks().

Other methods provide different iterative access to the blocks.

The cleanup() method is used to clean up the blocking (e.g. in a pipeline) by resetting all possible internal states.

In [18]:
from nazca.rl import blocking as nrb
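
To make this protocol concrete, here is a minimal sketch of a custom blocking that puts every record in one single block (the _ref_block and _target_block attributes are our own storage choice; only the three overridden methods come from the prototype above):

 class SingleBlockBlocking(nrb.BaseBlocking):
     """ Toy blocking putting all the records in one single block """

     def _fit(self, refset, targetset):
         # store (index, id) pairs, the format yielded by iter_blocks()
         self._ref_block = [(i, r[0]) for i, r in enumerate(refset)]
         self._target_block = [(i, r[0]) for i, r in enumerate(targetset)]

     def _iter_blocks(self):
         yield self._ref_block, self._target_block

     def _cleanup(self):
         self._ref_block = self._target_block = None

Such a class can then be fitted and iterated exactly like the built-in blockings shown below.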

KeyBlocking

The KeyBlocking is based on a blocking criterion (or blocking key) that is used to divide the datasets.

The main idea here is:

  1. to create an index of f(x) for each x in the reference set
  2. to create an index of f(y) for each y in the target set
  3. to iterate on each distinct value of f(x) and to return the identifiers of the records of both sets for this value.

In this example, we use the KeyBlocking jointly with soundexcode() to create blocks of records sharing the same Soundex code.

In [19]:
from functools import partial
from nazca.utils.distances import soundexcode

refset = (('a1', 'smith'), ('a2', 'neighan'), ('a3', 'meier'),  ('a4', 'smithers'),
          ('a5', 'nguyen'), ('a6', 'faulkner'),  ('a7', 'sandy'))
targetset =  (('b1', 'meier'), ('b2', 'meier'),  ('b3', 'smith'),
              ('b4', 'nguyen'), ('b5', 'fawkner'), ('b6', 'santi'), ('b7', 'cain'))

blocking = nrb.KeyBlocking(ref_attr_index=1, target_attr_index=1,  
                           callback=partial(soundexcode, language='english'))  
In [20]:
blocking.fit(refset, targetset)
for refblock, targetblock in blocking.iter_blocks():
    print refblock
    for i, _id in refblock:
        print refset[i]
    print targetblock
    for i, _id in targetblock:
        print targetset[i]
    print 10*'*'
[(0, 'a1'), (6, 'a7')]
('a1', 'smith')
('a7', 'sandy')
[(2, 'b3'), (5, 'b6')]
('b3', 'smith')
('b6', 'santi')
**********
[(1, 'a2'), (4, 'a5')]
('a2', 'neighan')
('a5', 'nguyen')
[(3, 'b4')]
('b4', 'nguyen')
**********
[(2, 'a3')]
('a3', 'meier')
[(0, 'b1'), (1, 'b2')]
('b1', 'meier')
('b2', 'meier')
**********
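
The iter_id_blocks() method mentioned earlier gives access to the same blocks with the record identifiers only. A minimal sketch on the blocking fitted above (we assume each block comes as a pair of id lists):

  for ref_ids, target_ids in blocking.iter_id_blocks():
      print ref_ids, target_ids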

SoundexBlocking

This is a specific case of KeyBlocking based on the Soundex code.

In [21]:
blocking = nrb.SoundexBlocking(ref_attr_index=1, target_attr_index=1)
blocking.fit(refset, targetset)
for refblock, targetblock in blocking.iter_blocks():
    print refblock
    for i, _id in refblock:
        print refset[i]
    print targetblock
    for i, _id in targetblock:
        print targetset[i]
    print 10*'*'
[(0, 'a1'), (6, 'a7')]
('a1', 'smith')
('a7', 'sandy')
[(2, 'b3'), (5, 'b6')]
('b3', 'smith')
('b6', 'santi')
**********
[(1, 'a2'), (4, 'a5')]
('a2', 'neighan')
('a5', 'nguyen')
[(3, 'b4')]
('b4', 'nguyen')
**********
[(2, 'a3')]
('a3', 'meier')
[(0, 'b1'), (1, 'b2')]
('b1', 'meier')
('b2', 'meier')
**********

NGramBlocking

The NGramBlocking is based on the construction of n-grams for a given attribute. It creates a dictionary with n-grams as keys, and records as values. A depth can be given, to create nested sub-dictionaries.

In [22]:
blocking = nrb.NGramBlocking(ref_attr_index=1, target_attr_index=1, ngram_size=2, depth=1)
blocking.fit(refset, targetset)
print blocking.reference_index
{'me': [(2, 'a3')], 'ne': [(1, 'a2')], 'ng': [(4, 'a5')], 'fa': [(5, 'a6')], 'sm': [(0, 'a1'), (3, 'a4')], 'sa': [(6, 'a7')]}

In [23]:
blocking = nrb.NGramBlocking(ref_attr_index=1, target_attr_index=1, ngram_size=4, depth=1)
blocking.fit(refset, targetset)
print blocking.reference_index
{'meie': [(2, 'a3')], 'nguy': [(4, 'a5')], 'faul': [(5, 'a6')], 'neig': [(1, 'a2')], 'sand': [(6, 'a7')], 'smit': [(0, 'a1'), (3, 'a4')]}

In [24]:
blocking = nrb.NGramBlocking(ref_attr_index=1, target_attr_index=1, ngram_size=2, depth=2)
blocking.fit(refset, targetset)
print blocking.reference_index
{'me': {'ie': [(2, 'a3')]}, 'ne': {'ig': [(1, 'a2')]}, 'ng': {'uy': [(4, 'a5')]}, 'fa': {'ul': [(5, 'a6')]}, 'sm': {'it': [(0, 'a1'), (3, 'a4')]}, 'sa': {'nd': [(6, 'a7')]}}

SortedNeighborhoodBlocking

The SortedNeighborhoodBlocking is based on a sliding window over records sorted according to a given function callback (aka key). By default, the key is the identity function.

In [25]:
blocking = nrb.SortedNeighborhoodBlocking(ref_attr_index=1, target_attr_index=1, window_width=2)
blocking.fit(refset, targetset)
print blocking.sorted_dataset
[((6, 'b7'), 'cain', 1), ((5, 'a6'), 'faulkner', 0), ((4, 'b5'), 'fawkner', 1), ((2, 'a3'), 'meier', 0), ((0, 'b1'), 'meier', 1), ((1, 'b2'), 'meier', 1), ((1, 'a2'), 'neighan', 0), ((4, 'a5'), 'nguyen', 0), ((3, 'b4'), 'nguyen', 1), ((6, 'a7'), 'sandy', 0), ((5, 'b6'), 'santi', 1), ((0, 'a1'), 'smith', 0), ((2, 'b3'), 'smith', 1), ((3, 'a4'), 'smithers', 0)]

In [26]:
for refblock, targetblock in blocking.iter_blocks():
    print refblock
    for i, _id in refblock:
        print refset[i]
    print targetblock
    for i, _id in targetblock:
        print targetset[i]
    print 10*'*'
[(5, 'a6')]
('a6', 'faulkner')
[(6, 'b7'), (4, 'b5')]
('b7', 'cain')
('b5', 'fawkner')
**********
[(2, 'a3')]
('a3', 'meier')
[(4, 'b5'), (0, 'b1'), (1, 'b2')]
('b5', 'fawkner')
('b1', 'meier')
('b2', 'meier')
**********
[(1, 'a2')]
('a2', 'neighan')
[(0, 'b1'), (1, 'b2'), (3, 'b4')]
('b1', 'meier')
('b2', 'meier')
('b4', 'nguyen')
**********
[(4, 'a5')]
('a5', 'nguyen')
[(1, 'b2'), (3, 'b4')]
('b2', 'meier')
('b4', 'nguyen')
**********
[(6, 'a7')]
('a7', 'sandy')
[(3, 'b4'), (5, 'b6')]
('b4', 'nguyen')
('b6', 'santi')
**********
[(0, 'a1')]
('a1', 'smith')
[(5, 'b6'), (2, 'b3')]
('b6', 'santi')
('b3', 'smith')
**********
[(3, 'a4')]
('a4', 'smithers')
[(2, 'b3')]
('b3', 'smith')
**********
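
The sorting key can be customized. A minimal sketch reusing the soundexcode() imported earlier; note that the exact name of the keyword argument (key below) is an assumption, so check the class signature:

  blocking = nrb.SortedNeighborhoodBlocking(ref_attr_index=1, target_attr_index=1,
                                            window_width=2,
                                            key=partial(soundexcode, language='english'))
  blocking.fit(refset, targetset)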

MergeBlocking

This blocking technique keeps only one appearance of each given value, and removes all the other records having this value. The merge is based on a score function, and the record with the highest score is kept.

This is only done on ONE set (the one whose attr_index is not None).

In [27]:
refset = [('http://fr.wikipedia.org/wiki/Paris_%28Texas%29', 'Paris', 25898), 
          ('http://fr.wikipedia.org/wiki/Paris', 'Paris', 12223100),        
          ('http://fr.wikipedia.org/wiki/Saint-Malo', 'Saint-Malo', 46342)] 
targetset = [('Paris (Texas)', 25000),
             ('Paris (France)', 12000000)]
blocking = nrb.MergeBlocking(ref_attr_index=1, target_attr_index=None, score_func=lambda x:x[2])
blocking.fit(refset, targetset)
print blocking.merged_dataset
[(1, 'http://fr.wikipedia.org/wiki/Paris'), (2, 'http://fr.wikipedia.org/wiki/Saint-Malo')]

KmeansBlocking

This blocking technique is based on k-means clustering and uses the scikit-learn implementation. If the number of clusters is not defined, it is set to the size of the reference set divided by 10 or, if the reference set is too small, to the size of the reference set divided by 2.

In [28]:
refset = [['R1', 'ref1', (6.14194444444, 48.67)],
          ['R2', 'ref2', (6.2, 49)],
          ['R3', 'ref3', (5.1, 48)],
          ['R4', 'ref4', (5.2, 48.1)]]         
targetset = [['T1', 'target1', (6.2, 48.9)],        
             ['T2', 'target2', (5.3, 48.2)], 
             ['T3', 'target3', (6.25, 48.91)]]
blocking = nrb.KmeansBlocking(ref_attr_index=2, target_attr_index=2, n_clusters=2)
blocking.fit(refset, targetset)
In [29]:
for refblock, targetblock in blocking.iter_blocks():
    print refblock
    for i, _id in refblock:
        print refset[i]
    print targetblock
    for i, _id in targetblock:
        print targetset[i]
    print 10*'*'
[(0, 'R1'), (1, 'R2')]
['R1', 'ref1', (6.14194444444, 48.67)]
['R2', 'ref2', (6.2, 49)]
[(0, 'T1'), (2, 'T3')]
['T1', 'target1', (6.2, 48.9)]
['T3', 'target3', (6.25, 48.91)]
**********
[(2, 'R3'), (3, 'R4')]
['R3', 'ref3', (5.1, 48)]
['R4', 'ref4', (5.2, 48.1)]
[(1, 'T2')]
['T2', 'target2', (5.3, 48.2)]
**********

KdTreeBlocking

This blocking technique is based on kd-trees, which are used to partition a k-dimensional space.

It is efficient at creating clusters of numerical values (with small k).

In [30]:
blocking = nrb.KdTreeBlocking(ref_attr_index=2, target_attr_index=2, threshold=0.3)
blocking.fit(refset, targetset)
In [31]:
for refblock, targetblock in blocking.iter_blocks():
    print refblock
    for i, _id in refblock:
        print refset[i]
    print targetblock
    for i, _id in targetblock:
        print targetset[i]
    print 10*'*'
[(0, 'R1')]
['R1', 'ref1', (6.14194444444, 48.67)]
[(0, 'T1'), (2, 'T3')]
['T1', 'target1', (6.2, 48.9)]
['T3', 'target3', (6.25, 48.91)]
**********
[(1, 'R2')]
['R2', 'ref2', (6.2, 49)]
[(0, 'T1'), (2, 'T3')]
['T1', 'target1', (6.2, 48.9)]
['T3', 'target3', (6.25, 48.91)]
**********
[(2, 'R3')]
['R3', 'ref3', (5.1, 48)]
[(1, 'T2')]
['T2', 'target2', (5.3, 48.2)]
**********
[(3, 'R4')]
['R4', 'ref4', (5.2, 48.1)]
[(1, 'T2')]
['T2', 'target2', (5.3, 48.2)]
**********

MinHashingBlocking

This blocking technique is based on the MinHash algorithm. It is very efficient at creating clusters of textual values.

In [32]:
from nazca.data import FRENCH_LEMMAS
from nazca.utils.normalize import SimplifyNormalizer
refset = [['R1', 'ref1', u"Un nuage flotta dans le grand ciel bleu."],  
          ['R2', 'ref2', u"Pour quelle occasion vous êtes-vous apprêtée ?"],  
          ['R3', 'ref3', u"Je les vis ensemble à plusieurs occasions."],      
          ['R4', 'ref4', u"Je n'aime pas ce genre de bandes dessinées tristes."], 
          ['R5', 'ref5', u"Ensemble et à plusieurs occasions, je les vis"]]        
targetset = [['T1', 'target1', u"Des grands nuages noirs flottent dans le ciel."], 
             ['T2', 'target2', u"Je les ai vus ensemble à plusieurs occasions."],  
             ['T3', 'target3', u"J'aime les bandes dessinées de genre comiques."]]
normalizer = SimplifyNormalizer(attr_index=2, lemmas=FRENCH_LEMMAS)
refset = normalizer.normalize_dataset(refset)
targetset = normalizer.normalize_dataset(targetset)
blocking = nrb.MinHashingBlocking(threshold=0.4, ref_attr_index=2, target_attr_index=2) 
blocking.fit(refset, targetset)
In [33]:
for refblock, targetblock in blocking.iter_blocks():
    print refblock
    for i, _id in refblock:
        print refset[i]
    print targetblock
    for i, _id in targetblock:
        print targetset[i]
    print 10*'*'
[(0, 'R1')]
['R1', 'ref1', u'nuage flotter grand ciel bleu']
[(0, 'T1')]
['T1', 'target1', u'grand nuage noir flotter ciel']
**********
[(3, 'R4')]
['R4', 'ref4', u'aimer genre bande dessin\xe9es tristes']
[(2, 'T3')]
['T3', 'target3', u'aimer bande dessin\xe9es genre comiques']
**********
[(2, 'R3'), (4, 'R5')]
['R3', 'ref3', u'vis ensemble \xe0 plusieurs occasions']
['R5', 'ref5', u'ensemble \xe0 plusieurs occasions vis']
[(1, 'T2')]
['T2', 'target2', u'voir ensemble \xe0 plusieurs occasions']
**********

PipelineBlocking - nazca.rl.blocking

Nazca also provides a PipelineBlocking for pipelining multiple blockings within an alignment.

In [34]:
refset = [['1', 'aabb', 'ccdd'],                   
          ['2', 'aabb', 'ddcc'],                
          ['3', 'ccdd', 'aabb'],                
          ['4', 'ccdd', 'bbaa']]         
targetset = [['a', 'aabb', 'ccdd'],  
             ['b', 'aabb', 'ddcc'],
             ['c', 'ccdd', 'aabb'],
             ['d', 'ccdd', 'bbaa']]
blocking_1 = nrb.NGramBlocking(ref_attr_index=1, target_attr_index=1, ngram_size=2, depth=1)  
blocking_2 = nrb.NGramBlocking(ref_attr_index=2, target_attr_index=2, ngram_size=2, depth=1)
blocking = nrb.PipelineBlocking((blocking_1, blocking_2))         
blocking.fit(refset, targetset)
In [35]:
for refblock, targetblock in blocking.iter_blocks():
    print refblock
    for i, _id in refblock:
        print refset[i]
    print targetblock
    for i, _id in targetblock:
        print targetset[i]
    print 10*'*'
[(0, '1')]
['1', 'aabb', 'ccdd']
[(0, 'a')]
['a', 'aabb', 'ccdd']
**********
[(1, '2')]
['2', 'aabb', 'ddcc']
[(1, 'b')]
['b', 'aabb', 'ddcc']
**********
[(2, '3')]
['3', 'ccdd', 'aabb']
[(2, 'c')]
['c', 'ccdd', 'aabb']
**********
[(3, '4')]
['4', 'ccdd', 'bbaa']
[(3, 'd')]
['d', 'ccdd', 'bbaa']
**********

It is also possible to collect statistics on the different sub-blockings using the option collect_stats=True, which gives the lengths of the different blocks across the different sub-blockings: for each sub-blocking (the keys of the dictionary below), a list of (reference block size, target block size) pairs.

In [36]:
blocking_1 = nrb.NGramBlocking(ref_attr_index=1, target_attr_index=1, ngram_size=2, depth=1)  
blocking_2 = nrb.NGramBlocking(ref_attr_index=2, target_attr_index=2, ngram_size=2, depth=1)
blocking = nrb.PipelineBlocking((blocking_1, blocking_2), collect_stats=True)         
blocking.fit(refset, targetset)
In [37]:
print blocking.stats
{0: [(2, 2), (2, 2)], 1: [(1, 1), (1, 1), (1, 1), (1, 1)]}