Record linkage with Nazca - part 3 - Putting it all together

This IPython notebook shows some features of the Python Nazca library:

  • original notebook : here !
  • date: 2014-07-01
  • author: Vincent Michel ([email protected], [email protected]) @HowIMetYourData

Aligner - nazca.rl.aligner

Once you have created your datasets and defined your preprocessings and blockings, you can use the BaseAligner object to perform the alignment.

The BaseAligner is defined as:

class BaseAligner(object):

    def register_ref_normalizer(self, normalizer):
        """ Register normalizers to be applied before alignment """

    def register_target_normalizer(self, normalizer):
        """ Register normalizers to be applied before alignment """

    def register_blocking(self, blocking):
        self.blocking = blocking

    def align(self, refset, targetset, get_matrix=True):
        """ Perform the alignment on the referenceset and the targetset """

    def get_aligned_pairs(self, refset, targetset, unique=True):
        """ Get the pairs of aligned elements """

The align() method returns the global distance matrix and the matched elements as a dictionary. The keys are the indices of the reference records, and the values are the lists of aligned target records, each given as an (index, distance) pair.

In [1]:
from nazca.utils.distances import GeographicalProcessing 
from nazca.rl.aligner import BaseAligner

refset = [['R1', 'ref1', (6.14194444444, 48.67)],
          ['R2', 'ref2', (6.2, 49)],
          ['R3', 'ref3', (5.1, 48)],
          ['R4', 'ref4', (5.2, 48.1)]]
targetset = [['T1', 'target1', (6.17, 48.7)],
             ['T2', 'target2', (5.3, 48.2)],
             ['T3', 'target3', (6.25, 48.91)]]
processings = (GeographicalProcessing(2, 2, units='km'),)
aligner = BaseAligner(threshold=30, processings=processings)
mat, matched = aligner.align(refset, targetset)
print mat
print matched
[[   4.55325174  107.09278107   29.12484169]
 [  33.33169937  133.59967041   11.39668941]
 [ 141.97203064   31.38606644  162.75946045]
 [ 126.65346527   15.69240952  147.18429565]]
{0: [(0, 4.5532517), (2, 29.124842)], 1: [(2, 11.396689)], 3: [(1, 15.69241)]}
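To make the link between the two returned values concrete, here is a minimal pure-Python sketch that rebuilds a matched dictionary of the same shape from the distance matrix printed above. It is not Nazca's implementation: the helper name threshold_matches is hypothetical, and the inclusive comparison against the threshold is an assumption.

```python
# Distance matrix copied from the output above
# (rows: reference records, columns: target records, values in km).
mat = [[4.55325174, 107.09278107, 29.12484169],
       [33.33169937, 133.59967041, 11.39668941],
       [141.97203064, 31.38606644, 162.75946045],
       [126.65346527, 15.69240952, 147.18429565]]


def threshold_matches(matrix, threshold):
    """Hypothetical helper: map each row index to the (column, distance)
    pairs whose distance is within the threshold (assumed inclusive)."""
    matched = {}
    for ref_idx, row in enumerate(matrix):
        close = [(tgt_idx, dist) for tgt_idx, dist in enumerate(row)
                 if dist <= threshold]
        if close:  # references without any close target are left out
            matched[ref_idx] = close
    return matched


matched = threshold_matches(mat, threshold=30)
print(matched)
```

Reference 2 (R3) has no target within 30 km (its closest one, T2, is at about 31.4 km), which is why the key 2 is absent from the dictionary.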

The get_aligned_pairs() method directly yields the aligned pairs it finds, together with their distance.

In [2]:
aligner = BaseAligner(threshold=30, processings=processings)
for pair in aligner.get_aligned_pairs(refset, targetset):
    print pair
(('R1', 0), ('T1', 0), 4.5532517)
(('R2', 1), ('T3', 2), 11.396689)
(('R4', 3), ('T2', 1), 15.69241)

Plugging preprocessings and blocking

We can plug in the preprocessings using register_ref_normalizer() and register_target_normalizer(), and the blocking using register_blocking(). Only ONE blocking is allowed, so use PipelineBlocking to combine multiple blockings.

In [3]:
import nazca.utils.normalize as nno
from nazca.rl import blocking as nrb

normalizer = nno.SimplifyNormalizer(attr_index=1)
blocking = nrb.KdTreeBlocking(ref_attr_index=2, target_attr_index=2, threshold=0.3)
aligner = BaseAligner(threshold=30, processings=processings)
aligner.register_ref_normalizer(normalizer)
aligner.register_target_normalizer(normalizer)
aligner.register_blocking(blocking)
In [4]:
for pair in aligner.get_aligned_pairs(refset, targetset):
    print pair
(('R1', 0), ('T1', 0), 4.5532517433166504)
(('R2', 1), ('T3', 2), 11.396689414978027)
(('R4', 3), ('T2', 1), 15.692409515380859)
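The pairs are the same as without blocking; what a blocking changes is how many candidate comparisons are made. The sketch below illustrates the idea with a brute-force radius search on the raw coordinates. It is a stand-in rather than the KD-tree used by KdTreeBlocking: the helper candidate_targets is hypothetical, and reading threshold=0.3 as a Euclidean radius in coordinate units is an assumption.

```python
import math

# Coordinates (attribute index 2) of the records used in this notebook.
ref_coords = [(6.14194444444, 48.67), (6.2, 49), (5.1, 48), (5.2, 48.1)]
target_coords = [(6.17, 48.7), (5.3, 48.2), (6.25, 48.91)]


def candidate_targets(ref_points, target_points, radius):
    """Hypothetical helper: for each reference index, list the target
    indices lying within `radius` (Euclidean, raw coordinate units)."""
    candidates = {}
    for i, (rx, ry) in enumerate(ref_points):
        candidates[i] = [j for j, (tx, ty) in enumerate(target_points)
                         if math.hypot(rx - tx, ry - ty) <= radius]
    return candidates


candidates = candidate_targets(ref_coords, target_coords, radius=0.3)
# Number of distance computations left after blocking.
n_comparisons = sum(len(targets) for targets in candidates.values())
print(n_comparisons)
```

Under these assumptions only 5 of the 12 possible pairs remain to be compared. The aligner threshold still applies afterwards, so a candidate such as R3/T2 (about 31.4 km apart) is kept by the blocking but rejected by the 30 km threshold.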

The unique argument can be set to False to get all the alignments found, and not just a single target record for each reference record.

In [5]:
for pair in aligner.get_aligned_pairs(refset, targetset, unique=False):
    print pair
(('R1', 0), ('T3', 2), 29.124841690063477)
(('R1', 0), ('T1', 0), 4.5532517433166504)
(('R2', 1), ('T3', 2), 11.396689414978027)
(('R4', 3), ('T2', 1), 15.692409515380859)
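Judging from the outputs shown in this notebook, unique=True keeps the closest target for each reference record. The sketch below reproduces that behaviour on the matched dictionary printed earlier; the helper iter_pairs is hypothetical and only mimics the observed results, it is not taken from Nazca's source.

```python
# Matched dictionary copied from the output of align() shown earlier.
matched = {0: [(0, 4.5532517), (2, 29.124842)],
           1: [(2, 11.396689)],
           3: [(1, 15.69241)]}


def iter_pairs(matched, unique=True):
    """Hypothetical helper: yield (ref_idx, target_idx, distance) triples."""
    for ref_idx in sorted(matched):
        candidates = matched[ref_idx]
        if unique:
            # Keep only the closest target for this reference record.
            candidates = [min(candidates, key=lambda c: c[1])]
        for tgt_idx, dist in candidates:
            yield ref_idx, tgt_idx, dist


best_only = list(iter_pairs(matched, unique=True))
all_pairs = list(iter_pairs(matched, unique=False))
print(len(best_only), len(all_pairs))
```

With unique=False, the target T3 (index 2) shows up under both R1 and R2, so in this sketch uniqueness is enforced per reference record, not across the target set.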

PipelineAligner - nazca.rl.aligner

A pipeline of aligners can be created using PipelineAligner.

In [6]:
from nazca.utils.distances import LevenshteinProcessing, GeographicalProcessing
from nazca.rl.aligner import PipelineAligner

processings = (GeographicalProcessing(2, 2, units='km'),)
aligner_1 = BaseAligner(threshold=30, processings=processings)
processings = (LevenshteinProcessing(1, 1),)
aligner_2 = BaseAligner(threshold=1, processings=processings)

pipeline = PipelineAligner((aligner_1, aligner_2))
In [7]:
for pair in pipeline.get_aligned_pairs(refset, targetset):
    print pair
(('R1', 0), ('T1', 0))
(('R2', 1), ('T3', 2))
(('R4', 3), ('T2', 1))
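One plausible way to chain aligners is to let each one work only on the reference records that earlier aligners left unmatched. The sketch below shows that design on toy data. Whether PipelineAligner behaves exactly this way is an assumption, and every name here (make_aligner, run_pipeline, the toy records and distances) is hypothetical.

```python
def make_aligner(distance, threshold):
    """Build an aligner that pairs each reference record with its closest
    target record, provided the distance is within the threshold."""
    def aligner(refset, targetset, ref_indices):
        for i in ref_indices:
            scored = [(distance(refset[i], target), j)
                      for j, target in enumerate(targetset)]
            best_dist, best_j = min(scored)
            if best_dist <= threshold:
                yield i, best_j
    return aligner


def run_pipeline(aligners, refset, targetset):
    """Run the aligners in order; each later aligner only sees the
    reference records that are still unmatched."""
    remaining = list(range(len(refset)))
    pairs = []
    for aligner in aligners:
        found = list(aligner(refset, targetset, remaining))
        pairs.extend(found)
        matched_refs = set(i for i, _ in found)
        remaining = [i for i in remaining if i not in matched_refs]
    return pairs


# Toy records: [identifier, name, numeric value].
toy_refset = [['R1', 'paris', 10.0], ['R2', 'lyon', 50.0], ['R3', 'nice', 90.0]]
toy_targetset = [['T1', 'paris', 10.5], ['T2', 'nice', 200.0], ['T3', 'lille', 300.0]]

by_value = make_aligner(lambda r, t: abs(r[2] - t[2]), threshold=1)
by_name = make_aligner(lambda r, t: 0 if r[1] == t[1] else 1, threshold=0)

toy_pairs = run_pipeline([by_value, by_name], toy_refset, toy_targetset)
print(toy_pairs)
```

Here the first aligner matches R1 with T1 on the numeric value, and the second one picks up R3 with T2 on the name, while R2 stays unmatched.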