Record linkage with Nazca - Example Dbpedia - INSEE

This IPython notebook shows some features of the Python Nazca library:

In [4]:
import nazca.utils.dataio as nio
import nazca.utils.distances as ndi
import nazca.utils.normalize as nun
import nazca.rl.blocking as nbl
import nazca.rl.aligner as nal

1 - Datasets creation

First, we have to create both a reference set and a target set.

We get all the (URI, name, INSEE code) triples from the Dbpedia data

In [5]:
refset = nio.sparqlquery('',
                         '''PREFIX dbonto:<>
                            SELECT ?p ?n ?c WHERE {?p a dbonto:PopulatedPlace.
                                                   ?p dbonto:country dbpedia:France.
                                                   ?p foaf:name ?n.
                                                   ?p dbpprop:insee ?c}''',
In [6]:
print len(refset)
print refset[0]
[u'', u'Ajaccio', 2]

We get all the (URI, name, INSEE code) triples from the INSEE data

In [7]:
targetset = nio.sparqlquery('',
                            '''PREFIX igeo:<>
                               SELECT ?commune ?nom ?code WHERE {?commune igeo:codeCommune ?code.
                                                                 ?commune igeo:nom ?nom}''',
In [8]:
print len(targetset)
print targetset[0]
[u'', u'Mazerolles', 64374]

Definition of the distance functions and the Processing

We use a distance based on difflib, where distance(a, b) == 0 if and only if a == b
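
To make this property concrete, here is a minimal sketch of a difflib-based distance written with the standard library only. It assumes that the distance is one minus the similarity ratio of difflib.SequenceMatcher; the helper name difflib_distance is hypothetical, and the Nazca implementation may differ in its details.

```python
import difflib


def difflib_distance(a, b):
    """Return a distance in [0, 1]; 0 means the two strings are identical."""
    # SequenceMatcher.ratio() is 1.0 for identical strings and gets
    # closer to 0.0 as the strings share fewer matching characters.
    return 1.0 - difflib.SequenceMatcher(None, a, b).ratio()


same = difflib_distance(u'Ajaccio', u'Ajaccio')          # identical names
different = difflib_distance(u'Ajaccio', u'Mazerolles')  # different names
```

With this definition, same is exactly 0.0, while different is strictly positive.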

In [9]:
processing = ndi.DifflibProcessing(ref_attr_index=1, target_attr_index=1)
In [10]:
print refset[0], targetset[0]
print processing.distance(refset[0], targetset[0])
[u'', u'Ajaccio', 2] [u'', u'Mazerolles', 64374]

Preprocessings and normalization

We now define a preprocessing step to normalize the string values of each record.

In [11]:
normalizer = nun.SimplifyNormalizer(attr_index=1)

The simplify normalizer is based on the simplify() function, which:

  1. Removes the stopwords;
  2. Lemmatizes the sentence;
  3. Sets the sentence to lower case;
  4. Removes punctuation.
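
The four steps above can be sketched with the standard library as follows. This is only an illustration: the names simplify_sketch and normalize_record, the tiny STOPWORDS and LEMMAS tables, and the order in which the steps are applied are all assumptions of this sketch, not the Nazca implementation.

```python
import string

# Tiny illustrative resources (assumptions, not the data shipped with Nazca).
STOPWORDS = frozenset([u'le', u'la', u'les', u'de', u'du', u'des'])
LEMMAS = {u'saints': u'saint'}


def simplify_sketch(sentence, lemmas=LEMMAS, stopwords=STOPWORDS):
    """Lower-case, strip punctuation, drop stopwords and apply lemmas."""
    lowered = sentence.lower()
    # Replace punctuation by spaces so that 'a-b' yields two words.
    cleaned = u''.join(u' ' if ch in string.punctuation else ch
                       for ch in lowered)
    words = [w for w in cleaned.split() if w not in stopwords]
    # Map each remaining word to its lemma when one is known.
    return u' '.join(lemmas.get(w, w) for w in words)


def normalize_record(record, attr_index):
    """Return a copy of the record with one attribute simplified."""
    normalized = list(record)
    normalized[attr_index] = simplify_sketch(record[attr_index])
    return normalized


simplified = simplify_sketch(u'Les Saints-de-la-Mer!')
record = normalize_record([u'', u'Ajaccio', 2], attr_index=1)
```

Here simplified is u'saint mer', and the name in record becomes u'ajaccio' while the other attributes are left untouched.
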
In [12]:
print normalizer.normalize(refset[0])
[u'', u'ajaccio', 2]
In [13]:
print normalizer.normalize(targetset[0])
[u'', u'mazerolles', 64374]


Blockings

Blockings act similarly to divide-and-conquer approaches. They create subsets of both datasets (blocks), and only the records of a same block are compared, rather than making all the possible comparisons.

We create a simple NGramBlocking that builds subsets of records by looking at the first N characters of each value. In our case, we use only the first 5 characters (ngram_size=5), with a depth of one (i.e. we don't do a recursive blocking).
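
The idea of blocking on the first characters can be sketched in a few lines of plain Python. The function ngram_blocks and the record identifiers below are hypothetical and only illustrate the principle; the sketch does not reproduce the internals of NGramBlocking.

```python
from collections import defaultdict


def ngram_blocks(refset, targetset, attr_index, ngram_size):
    """Yield (ref_block, target_block) for records sharing a prefix."""
    ref_index = defaultdict(list)
    target_index = defaultdict(list)
    for position, record in enumerate(refset):
        key = record[attr_index][:ngram_size]
        ref_index[key].append((position, record[0]))
    for position, record in enumerate(targetset):
        key = record[attr_index][:ngram_size]
        target_index[key].append((position, record[0]))
    # Only prefixes present on both sides can lead to comparisons.
    for key in sorted(ref_index):
        if key in target_index:
            yield ref_index[key], target_index[key]


# Hypothetical normalized records: [identifier, name, code].
refs = [[u'ref:1', u'ajaccio', 1],
        [u'ref:2', u'mazerolles', 2],
        [u'ref:3', u'mazeres', 3]]
targets = [[u'tgt:1', u'mazerolles', 10],
           [u'tgt:2', u'ajaccio', 11],
           [u'tgt:3', u'paris', 12]]

blocks = list(ngram_blocks(refs, targets, attr_index=1, ngram_size=5))
comparisons = sum(len(r) * len(t) for r, t in blocks)
```

On this toy data, the two blocks lead to 3 comparisons instead of the 9 needed to compare every reference record with every target record.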

In [14]:
blocking = nbl.NGramBlocking(1, 1, ngram_size=5, depth=1)

The blocking is first 'fit' on both refset and targetset, and then used to iterate over the blocks

In [15]:
blocking.fit(refset, targetset)
blocks = list(blocking._iter_blocks())
print blocks[0]
([(1722, u',_Ille-et-Vilaine')], [(17045, u'')])

We clean up the blocking for now, as it may have stored some internal data

In [16]:

Define Aligner

Finally, we can create the Aligner object that will perform the whole alignment processing. We set the threshold to 0.1, and we only have one processing

In [17]:
aligner = nal.BaseAligner(threshold=0.1, processings=(processing,))

Then, we register the normalizer and the blocking with the aligner

In [18]:

The aligner has a get_aligned_pairs() function that will yield the comparisons that are below the threshold:

In [19]:
pairs = list(aligner.get_aligned_pairs(refset[:1000], targetset[:1000]))
In [20]:
print len(pairs)
for p in pairs[:5]:
    print p
((u',_Haute-Corse', 8), (u'', 268), 1e-10)
((u',_Calvados', 910), (u'', 117), 1e-10)
((u'', 272), (u'', 301), 1e-10)
((u',_Ain', 273), (u'', 146), 1e-10)
((u'', 530), (u'', 669), 1e-10)

Each pair has the following structure: ((id in refset, index in refset), (id in targetset, index in targetset), distance).
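
To illustrate this structure, the following sketch compares every reference record with every target record and keeps the pairs whose distance does not exceed the threshold. It is a brute-force illustration without normalization or blocking; the function aligned_pairs_sketch, the record identifiers and the use of "<=" for the threshold test are assumptions of this sketch, not the behaviour of BaseAligner.

```python
import difflib


def aligned_pairs_sketch(refset, targetset, attr_index, threshold):
    """Yield ((ref id, ref index), (target id, target index), distance)."""
    for ref_idx, ref in enumerate(refset):
        for tgt_idx, tgt in enumerate(targetset):
            matcher = difflib.SequenceMatcher(
                None, ref[attr_index], tgt[attr_index])
            distance = 1.0 - matcher.ratio()
            # Keep only the comparisons that are close enough.
            if distance <= threshold:
                yield (ref[0], ref_idx), (tgt[0], tgt_idx), distance


# Hypothetical records: [identifier, name].
refs = [[u'ref:1', u'ajaccio'], [u'ref:2', u'mazerolles']]
targets = [[u'tgt:1', u'mazerolles'],
           [u'tgt:2', u'ajaccio'],
           [u'tgt:3', u'mazeroles']]

pairs = list(aligned_pairs_sketch(refs, targets, attr_index=1,
                                  threshold=0.1))
```

The two exact matches come out with a distance of 0.0, and the near match between u'mazerolles' and u'mazeroles' is also kept because its distance stays under 0.1.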

Introduction to Nazca - Using pipeline

It can be useful to pipeline the blockings, i.e. to start with a coarse blocking technique that is fast and has a good recall (i.e. it rarely puts two matching records in different blocks), even if its precision is poor (i.e. its blocks still contain many records that do not match).

Then, we can apply a more time-consuming but more precise blocking inside each block.
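
The principle of a blocking pipeline can be sketched as follows. For simplicity, both stages of this sketch are prefix blockings of increasing length; the second stage only stands in for a costlier technique such as minhashing, and the helpers prefix_blocks and pipeline_blocks are hypothetical names, not part of Nazca.

```python
from collections import defaultdict


def prefix_blocks(refs, targets, size):
    """Group (index, name) entries sharing the same first characters."""
    ref_index, target_index = defaultdict(list), defaultdict(list)
    for entry in refs:
        ref_index[entry[1][:size]].append(entry)
    for entry in targets:
        target_index[entry[1][:size]].append(entry)
    for key in sorted(ref_index):
        if key in target_index:
            yield ref_index[key], target_index[key]


def pipeline_blocks(refs, targets, sizes):
    """Apply each blocking stage inside the blocks of the previous one."""
    blocks = [(refs, targets)]
    for size in sizes:
        blocks = [sub for ref_block, target_block in blocks
                  for sub in prefix_blocks(ref_block, target_block, size)]
    return blocks


def count_comparisons(blocks):
    return sum(len(r) * len(t) for r, t in blocks)


# Hypothetical (index, normalized name) entries.
refs = [(0, u'mazerolles'), (1, u'mazeres'),
        (2, u'marseille'), (3, u'ajaccio')]
targets = [(0, u'mazerolles'), (1, u'marseille'),
           (2, u'mazeres'), (3, u'paris')]

coarse = count_comparisons(pipeline_blocks(refs, targets, sizes=(2,)))
refined = count_comparisons(pipeline_blocks(refs, targets, sizes=(2, 5)))
```

On this toy data, the coarse stage alone leaves 9 comparisons out of the 16 possible ones, and the second stage reduces them to 5 while only ever looking inside the blocks produced by the first stage.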

In [21]:
blocking_1 = nbl.NGramBlocking(1, 1, ngram_size=3, depth=1)
blocking_2 = nbl.MinHashingBlocking(1, 1)
blocking = nbl.PipelineBlocking((blocking_1, blocking_2), collect_stats=True)
In [22]:
pairs = list(aligner.get_aligned_pairs(refset, targetset))
In [23]:
print len(pairs)
for p in pairs[:5]:
    print p
((u'', 0), (u'', 18455), 1e-10)
((u'', 2), (u'', 2742), 1e-10)
((u'', 3), (u'', 15569), 1e-10)
((u'', 4), (u'', 2993), 1e-10)
((u',_Corse-du-Sud', 6), (u'', 20886), 1e-10)