Record linkage with Nazca - part 1

This IPython notebook show some features of the Python Nazca library :

Introduction on Record Linkage

In [1]:
from IPython.display import HTML
HTML('<iframe src= width=700 height=350></iframe>')

Conventions on datasets


List of attributes of an entry/object. The first attribute is ALWAYS considered as the identifier (e.g. URI) of the record. We note $v$ the number of attributes in a record.

In [2]:
('', 'Victor Hugo', 1802, 1885, u'Besançon (Doubs)', u'Paris')
 'Victor Hugo',
 u'Besan\xe7on (Doubs)',

Reference set / Refset

Set of reference data with size $n \times v$. We note $r_i$ the ith element of the reference set.

Target set / Targetset

Set of target data, with size $m \times v$. We note $t_j$ the jth element of the targset set. Note: If we do not ask the record linkage to be surjective, i.e. one record of the target set is only aligned to one record in the reference set, the notions of reference set and target set are commutable.

Comparison result / Pair

A pair is the result of a comparison between a record of the refset and a record of the targetset. We note $p_{ij}$ the pair resulting from the comparison of $r_i$ and $t_j$.

Standard approach in Nazca

The standard approach in Nazca is the following:

  1. definition of both reference and target sets
  2. definition of some functions to compare different attribute
  3. computation of the global distance by calculating a (possibly averaged) sum of all the distances
  4. keep only the pairs that with a global distance below a given threshold

Problematic and blocking

The major issue in record linkage is the size of the sets to be linked. Lets assume a simple Levenshtein distance :

In [3]:
import timeit
t = timeit.timeit('from nazca.utils.distances import levenshtein\nlevenshtein("abcd", "abde")', number=1000)
print '%s (s)' % t
0.0652930736542 (s)

Lets now assume both sets of size $n=m=10^4$. Thus the total computation time for all the possible comparisons is:

In [4]:
total = t*10000*10000/1000.
print '%s (s) = %s (h) = %s (d)' % (total, total/3600., total/(3600.*24.))
6529.30736542 (s) = 1.81369649039 (h) = 0.0755706870997 (d)

Blocking aims at dividing the global comparisons matrix in small subsets (kind of divide-and-conquer approach). This may be seen as a block diagonalisation of the comparisons matrix. The number of comparisons (may) dramatically decrease !

Structure of the Nazca library

IO - nazca.utils.dataio

This module provides several helpers for input/ouput, and for creating datasets.

In [5]:
import nazca.utils.dataio as nio
In [6]:
refset = nio.sparqlquery('',
                         '''PREFIX dbonto:<>
                            SELECT ?p ?n ?c WHERE {?p a dbonto:PopulatedPlace.
                                                   ?p dbonto:country dbpedia:France.
                                                   ?p foaf:name ?n.
                                                   ?p dbpprop:insee ?c}''',
print refset[0]
[u'', u'Ajaccio', u'2']

The autocast_data option may be use to automatically try to cast attributes to some specific types

In [7]:
refset = nio.sparqlquery('',
                         '''PREFIX dbonto:<>
                            SELECT ?p ?n ?c WHERE {?p a dbonto:PopulatedPlace.
                                                   ?p dbonto:country dbpedia:France.
                                                   ?p foaf:name ?n.
                                                   ?p dbpprop:insee ?c}''')
print refset[0]
[u'', u'Ajaccio', 2]
In [8]:
refset = nio.rqlquery('', 'Any U, N WHERE X is Project, X name N, X cwuri U')
print refset[0]
[u'', u'cubicweb-mobile']

Distances - nazca.utils.distances

This module provides several distance functions.

In [9]:
import nazca.utils.distances as ndi

Exact match distance

The simplest distance, defined as 0 if both values are equal, 1 elsewise.

In [10]:
for sa, sb in (('abcd', 'abcd'), ('abcd', 'abce'), ((1, 2), (1, 2)), ((1, 2, 'abd'), (2, 1, 'abd'))):
    print sa, sb, ndi.exact_match(sa, sb)
abcd abcd 0
abcd abce 1
(1, 2) (1, 2) 0
(1, 2, 'abd') (2, 1, 'abd') 1

Levenshtein distance

The Levenshtein distance is defined as the minimal cost to transform string a into string b, where 3 operators are allowed:

  • Replace one character of string a into a character of string b
  • Add one character of string b into string a
  • Remove one character of string b
In [11]:
for sa, sb in (('abcd', 'abcd'), ('abcd', 'abce'), ('abcd', 'abc'), ('abc', 'abcd'), ('abcd', 'efgh')):
    print sa, sb, ndi.levenshtein(sa, sb)
abcd abcd 0
abcd abce 1
abcd abc 1
abc abcd 1
abcd efgh 4

Soundex distance

The Soundex distance is return 1 if the soundex codes of the two strings are different, and 0 otherwise:

In [12]:
for sa, sb in (('victor', 'victor'), ('victor', 'viktor'), ('victor', 'victo'), ('victor', 'viktaur')):
    print sa, sb, ndi.soundex(sa, sb)
victor victor 0
victor viktor 0
victor victo 1
victor viktaur 0

Jaccard distance

The Jaccard distance between two strings is defined as one minus Jaccard distance $J$ between the two sets of tokens built from the strings, with: $$ J(A, B) = (A \cap B)/(A \cup B) $$ Thus: $$ d(string A, string B) = 1 - J(tokens A, tokens B) $$

In [13]:
for sa, sb in (('victor hugo', 'victor hugo'), ('victor hugo', 'victor'),
               ('victor hugo', 'victor hugo Besancon 1802 Paris 1885')):
    print sa, sb, ndi.jaccard(sa, sb)
victor hugo victor hugo 0.0
victor hugo victor 0.5
victor hugo victor hugo Besancon 1802 Paris 1885 0.666666666667

Difflib distance

Distance based on the SequenceMatcher class of the difflib modulen, which is based on the gestalt pattern matching algorithm. The Nazca distance return 1 - difflib.SequenceMatcher(None, stra, strb).ratio().

In [14]:
for sa, sb in (('abcd', 'abcd'), ('abcd', 'abce'), ('abcd', 'abc'), ('abc', 'abcd'), ('abcd', 'efgh'),,
               ('victor', 'victor'), ('victor', 'viktor'), ('victor', 'victo'), ('victor', 'viktaur')):
    print sa, sb, ndi.difflib_match(sa, sb)
  File "<ipython-input-14-f01d54be2f60>", line 1
    for sa, sb in (('abcd', 'abcd'), ('abcd', 'abce'), ('abcd', 'abc'), ('abc', 'abcd'), ('abcd', 'efgh'),,
SyntaxError: invalid syntax

Temporal distance

The Temporal distance is based on the dateutil Python module and is used to compute the distance between two strings representing dates.

In [15]:
for d1, d2 in (('14 aout 1991', '14/08/1991'), ('14 aout 1991', '08/14/1991'), ('14 aout 1991', '08/15/1992'),
               ('1er mai 2012', '01/05/2012'), ('Jean est né le 1er octobre 1958', 'Le 01-10-1958, Jean est né')):
    print d1, d2, ndi.temporal(d1, d2)
14 aout 1991 14/08/1991 0
14 aout 1991 08/14/1991 0
14 aout 1991 08/15/1992 367
1er mai 2012 01/05/2012 0
Jean est né le 1er octobre 1958 Le 01-10-1958, Jean est né 0

Different granularies may be used:

In [16]:
print ndi.temporal('13 mars', '13 mai', 'months')
In [17]:
print ndi.temporal('14 aout 1991', '08/15/1992', 'years')

Dates are extracted from the text using french informations by default, but an english parser exists, and additional parsers may be built:

In [18]:
from dateutil import parser as dateparser
print ndi.temporal('13 march', '13 may', 'months', parserinfo=dateparser.parserinfo)

Geographical distance

The geographical distance between two points on Earth. Results are in meters, but the units can also be in 'km'. The planet radius can also be changed (if it may be useful...).

In [19]:
paris = (48.856578, 2.351828)     
london = (51.504872, -0.07857)      
print ndi.geographical(paris, london, in_radians=False)
In [20]:
print ndi.geographical(paris, london, in_radians=False, units='km')

BaseProcessing - nazca.utils.distances

The class BaseProcessing is used to hold all necessary information for a distance between records

  • indice of the attribute to be used in the reference set records
  • indice of the attribute to be used in the target set records
  • the distance function to be used
  • the weight of the distance in the global distance matrix
  • the normalization of the matrix. If True, the distance between two points is arbitrary set to [0, 1], doing: $$d = 1 - 1/(1 + d(x, y))$$

The prototype of the BaseProcessing class is:

 class BaseProcessing(object):

    def distance(self, reference_record, target_record):         
        """ Compute the distance between two records"""

    def cdist(self, refset, targetset, ref_indexes=None, target_indexes=None):        
        """ Compute the metric matrix, given two datasets and a metric"""

    def pdist(self, dataset):         
        """ Compute the upper triangular matrix in a way similar to scipy.spatial.metrical"""

Nazca provides a processing class for all the distances previously described.

Of course, it is possible to define your own processing, based on a specific distance:

In [21]:
def iamvictorhugo(stra, strb):
    """Return 0 if both strings include 'victor hugo', 0 otherwise """
    return 0 if ('victor hugo' in stra.lower() and 'victor hugo' in strb.lower()) else 1

class VictorHugoProcessing(ndi.BaseProcessing):
          def __init__(self, ref_attr_index=None, target_attr_index=None,     
                       weight=1, matrix_normalized=False):         
            super(VictorHugoProcessing, self).__init__(ref_attr_index, target_attr_index,  
                                                       iamvictorhugo, weight, matrix_normalized)

processing = VictorHugoProcessing(1, 1)
print processing.distance(('', 'Victor Hugo'),
                          ('', 'Victor Hugo'))
print processing.distance(('', 'Victor Hugo'),
                          ('', 'Yu guo'))