Record linkage with Nazca - Example Dbpedia - INSEE

This IPython notebook shows some features of the Python Nazca library:

  • original notebook : here !
  • date: 2014-07-01
  • author: Vincent Michel (@HowIMetYourData)
In [4]:
import nazca.utils.dataio as nio
import nazca.utils.distances as ndi
import nazca.utils.normalize as nun
import nazca.rl.blocking as nbl
import nazca.rl.aligner as nal

1 - Datasets creation

First, we have to create both the reference set and the target set.

We get all the triples (URI, name, INSEE code) from the DBpedia data:

In [5]:
refset = nio.sparqlquery('http://demo.cubicweb.org/sparql',
                         '''PREFIX dbonto:<http://dbpedia.org/ontology/>
                            SELECT ?p ?n ?c WHERE {?p a dbonto:PopulatedPlace.
                                                   ?p dbonto:country dbpedia:France.
                                                   ?p foaf:name ?n.
                                                   ?p dbpprop:insee ?c}''',
                         autocast_data=True)
In [6]:
print len(refset)
print refset[0]
3636
[u'http://dbpedia.org/resource/Ajaccio', u'Ajaccio', 2]

We get all the triples (URI, name, INSEE code) from the INSEE data:

In [7]:
targetset = nio.sparqlquery('http://rdf.insee.fr/sparql',
                            '''PREFIX igeo:<http://rdf.insee.fr/def/geo#>
                               SELECT ?commune ?nom ?code WHERE {?commune igeo:codeCommune ?code.
                                                                 ?commune igeo:nom ?nom}''',
                            autocast_data=True)
In [8]:
print len(targetset)
print targetset[0]
36700
[u'http://id.insee.fr/geo/commune/64374', u'Mazerolles', 64374]

Definition of the distance functions and the Processing

We use a distance based on difflib, where distance(a, b) == 0 if and only if a == b.

In [9]:
processing = ndi.DifflibProcessing(ref_attr_index=1, target_attr_index=1)
In [10]:
print refset[0], targetset[0]
print processing.distance(refset[0], targetset[0])
[u'http://dbpedia.org/resource/Ajaccio', u'Ajaccio', 2] [u'http://id.insee.fr/geo/commune/64374', u'Mazerolles', 64374]
0.764705882353
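
One way to build such a distance on top of difflib is to take one minus the similarity ratio of SequenceMatcher. The sketch below assumes this definition; it is not taken from Nazca's source, and the function name difflib_distance is hypothetical. For the two names compared above it gives 13/17, which is consistent with the printed value.

```python
from difflib import SequenceMatcher


def difflib_distance(a, b):
    """Return 1 - similarity ratio, so identical strings are at distance 0."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


# Identical strings have a ratio of 1, hence a distance of 0
print(difflib_distance(u'Ajaccio', u'Ajaccio'))
# Only two characters ('a' and 'o') match here, so the ratio is 4/17
print(difflib_distance(u'Ajaccio', u'Mazerolles'))
```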

Preprocessings and normalization

We now define a preprocessing step to normalize the string values of each record.

In [11]:
normalizer = nun.SimplifyNormalizer(attr_index=1)

The simplify normalizer is based on the simplify() function, which:

  1. Removes the stopwords;
  2. Lemmatizes the sentence;
  3. Sets the sentence to lower case;
  4. Removes punctuation.
In [12]:
print normalizer.normalize(refset[0])
[u'http://dbpedia.org/resource/Ajaccio', u'ajaccio', 2]

In [13]:
print normalizer.normalize(targetset[0])
[u'http://id.insee.fr/geo/commune/64374', u'mazerolles', 64374]
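
To make the four normalization steps more concrete, here is a minimal standalone sketch. It does not reproduce Nazca's simplify(): the stopword list, the lemma table (a stand-in for a real lemmatizer), the order of the steps and the names simplify_sketch and normalize_record are all assumptions made for illustration.

```python
import string

# Hypothetical resources, for illustration only: a tiny stopword list and a
# lookup table standing in for a real lemmatizer.
STOPWORDS = frozenset(['le', 'la', 'les', 'de', 'du', 'des', 'sur'])
LEMMAS = {'ponts': 'pont'}


def simplify_sketch(sentence, stopwords=STOPWORDS, lemmas=LEMMAS):
    # Lower-case first so that the lookups below are case-insensitive
    sentence = sentence.lower()
    # Replace each ASCII punctuation character with a space
    sentence = ''.join(' ' if c in string.punctuation else c
                       for c in sentence)
    # Drop stopwords, then map each remaining token to its lemma if known
    tokens = [lemmas.get(t, t) for t in sentence.split()
              if t not in stopwords]
    return ' '.join(tokens)


def normalize_record(record, attr_index=1):
    # Return a copy of the record with one attribute simplified
    record = list(record)
    record[attr_index] = simplify_sketch(record[attr_index])
    return record


print(normalize_record([u'http://dbpedia.org/resource/Ajaccio', u'Ajaccio', 2]))
print(simplify_sketch(u'Les Ponts-de-Ce!'))
```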

Blockings

Blockings act similarly to divide-and-conquer approaches. They create subsets of both datasets (blocks) that will be compared, rather than making all the possible comparisons.

We create a simple NGramBlocking that creates subsets of records by looking at the first N characters of each value. In our case, we use the first 5 characters (ngram_size=5), with a depth of one (i.e. we don't do recursive blocking).

In [14]:
blocking = nbl.NGramBlocking(1, 1, ngram_size=5, depth=1)
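
The idea of blocking on the first N characters can be sketched without Nazca as follows. This is an illustration under assumptions, not the library's implementation: the function ngram_blocks and its parameters are hypothetical, and only a depth of one is handled. Each block is returned as a pair of lists of (index, id) tuples, one list per dataset.

```python
from collections import defaultdict


def ngram_blocks(refset, targetset, attr_index=1, ngram_size=2):
    """Group records of both datasets by the first characters of a value."""
    # key -> ([(index, id) in refset], [(index, id) in targetset])
    blocks = defaultdict(lambda: ([], []))
    for side, dataset in enumerate((refset, targetset)):
        for index, record in enumerate(dataset):
            key = record[attr_index][:ngram_size]
            blocks[key][side].append((index, record[0]))
    # Keep only the blocks that contain candidates from both datasets
    return {key: block for key, block in blocks.items()
            if block[0] and block[1]}


refs = [[u'r/Ajaccio', u'Ajaccio'], [u'r/Bastia', u'Bastia']]
targets = [[u't/2A004', u'Ajaccio'], [u't/64374', u'Mazerolles'],
           [u't/2B033', u'Bastia']]
for key, block in sorted(ngram_blocks(refs, targets).items()):
    print(key, block)
```

Only records that share a key are compared later, so 'Mazerolles' is never compared with anything in this toy example.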

The blocking is first fitted on both refset and targetset, and we can then iterate over the resulting blocks:

In [15]:
blocking.fit(refset, targetset)
blocks = list(blocking._iter_blocks())
print blocks[0]
([(1722, u'http://dbpedia.org/resource/Redon,_Ille-et-Vilaine')], [(17045, u'http://id.insee.fr/geo/commune/35236')])

Clean up the blocking for now, as it may have stored some internal data:

In [16]:
blocking.cleanup()

Define Aligner

Finally, we can create the Aligner object that will perform the whole alignment process. We set the threshold to 0.1 and use a single processing:

In [17]:
aligner = nal.BaseAligner(threshold=0.1, processings=(processing,))

Then, we register the normalizer and the blocking:

In [18]:
aligner.register_ref_normalizer(normalizer)
aligner.register_target_normalizer(normalizer)
aligner.register_blocking(blocking)

The aligner has a get_aligned_pairs() method that yields the pairs whose distance is below the threshold:

In [19]:
pairs = list(aligner.get_aligned_pairs(refset[:1000], targetset[:1000]))
In [20]:
print len(pairs)
for p in pairs[:5]:
    print p
24
((u'http://dbpedia.org/resource/Calvi,_Haute-Corse', 8), (u'http://id.insee.fr/geo/commune/2B050', 268), 1e-10)
((u'http://dbpedia.org/resource/Livry,_Calvados', 910), (u'http://id.insee.fr/geo/commune/14372', 117), 1e-10)
((u'http://dbpedia.org/resource/Saint-Jean-sur-Veyle', 272), (u'http://id.insee.fr/geo/commune/01365', 301), 1e-10)
((u'http://dbpedia.org/resource/Saint-Just,_Ain', 273), (u'http://id.insee.fr/geo/commune/35285', 146), 1e-10)
((u'http://dbpedia.org/resource/Aix-en-Provence', 530), (u'http://id.insee.fr/geo/commune/13001', 669), 1e-10)

Each pair has the following structure: ((id in refset, index in refset), (id in targetset, index in targetset), distance).
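
The thresholding step and the pair structure can be illustrated with a brute-force sketch that compares every record of one set with every record of the other, without any blocking or normalization. The function aligned_pairs_sketch is hypothetical, and the distance (one minus the difflib similarity ratio) is an assumption rather than Nazca's actual code.

```python
from difflib import SequenceMatcher


def aligned_pairs_sketch(refset, targetset, threshold=0.1, attr_index=1):
    """Yield ((ref id, ref index), (target id, target index), distance)."""
    for i, ref in enumerate(refset):
        for j, target in enumerate(targetset):
            matcher = SequenceMatcher(None, ref[attr_index],
                                      target[attr_index])
            distance = 1.0 - matcher.ratio()
            # Keep only the pairs that are close enough
            if distance < threshold:
                yield (ref[0], i), (target[0], j), distance


refs = [[u'r/Ajaccio', u'Ajaccio'], [u'r/Bastia', u'Bastia']]
targets = [[u't/2B033', u'Bastia'], [u't/64374', u'Mazerolles']]
for pair in aligned_pairs_sketch(refs, targets):
    print(pair)
```

This makes len(refset) * len(targetset) comparisons, which is exactly the cost that blocking is meant to avoid.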

Introduction to Nazca - Using pipeline

It can be useful to pipeline the blockings: first apply a coarse blocking technique that is fast and has good recall (i.e. it rarely separates true matches), even if its precision is poor (i.e. its blocks still contain many non-matching pairs).

Then, we can apply a more time-consuming but more precise blocking within each block.
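
The principle of refining blocks stage by stage can be sketched with plain key functions. This is only an illustration: pipeline_blocks and block_by are hypothetical helpers, the second key (the last three characters) merely stands in for a more selective technique, and none of this reflects how PipelineBlocking or MinHashingBlocking are implemented.

```python
from collections import defaultdict


def block_by(records, key):
    """Group records by the value returned by key(record)."""
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    return groups


def pipeline_blocks(refset, targetset, keys):
    """Return (ref_block, target_block) pairs, refined by each key in turn."""
    stages = [(list(refset), list(targetset))]
    for key in keys:
        refined = []
        for refs, targets in stages:
            ref_groups = block_by(refs, key)
            target_groups = block_by(targets, key)
            # Keep only the keys present in both datasets
            for k in sorted(set(ref_groups) & set(target_groups)):
                refined.append((ref_groups[k], target_groups[k]))
        stages = refined
    return stages


refs = [(u'r1', u'Saint-Just'), (u'r2', u'Saint-Jean'), (u'r3', u'Bastia')]
targets = [(u't1', u'Saint-Just'), (u't2', u'Saint-Jean-sur-Veyle'),
           (u't3', u'Bastia')]
coarse = lambda record: record[1][:3].lower()   # fast, low precision
fine = lambda record: record[1][-3:].lower()    # applied inside each block
for block in pipeline_blocks(refs, targets, (coarse, fine)):
    print(block)
```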

In [21]:
blocking_1 = nbl.NGramBlocking(1, 1, ngram_size=3, depth=1)
blocking_2 = nbl.MinHashingBlocking(1, 1)
blocking = nbl.PipelineBlocking((blocking_1, blocking_2), collect_stats=True)
aligner.register_blocking(blocking)
In [22]:
pairs = list(aligner.get_aligned_pairs(refset, targetset))
In [23]:
print len(pairs)
for p in pairs[:5]:
    print p
3408
((u'http://dbpedia.org/resource/Ajaccio', 0), (u'http://id.insee.fr/geo/commune/2A004', 18455), 1e-10)
((u'http://dbpedia.org/resource/Bastia', 2), (u'http://id.insee.fr/geo/commune/2B033', 2742), 1e-10)
((u'http://dbpedia.org/resource/Sart%C3%A8ne', 3), (u'http://id.insee.fr/geo/commune/2A272', 15569), 1e-10)
((u'http://dbpedia.org/resource/Corte', 4), (u'http://id.insee.fr/geo/commune/2B096', 2993), 1e-10)
((u'http://dbpedia.org/resource/Bonifacio,_Corse-du-Sud', 6), (u'http://id.insee.fr/geo/commune/2A041', 20886), 1e-10)