Author: Rory Creedon
Date: 28 November 2013
Purpose: To examine the possibility of using the python JellyFish library for complex "fuzzy" string matching.
CONTENTS:
import pandas as pd
import numpy as np
import datetime
import os
from pandas import DataFrame
from numpy import nan as NA
from IPython.core.display import HTML
from IPython.core.display import Image
from IPython.display import Math
from IPython.display import Latex
import collections
import jellyfish as jf
import re
With regard to style names in production data we face two distinct problems:
At this point, the notebook will mostly relate to the former problem, as it is most salient at this stage of the data cleaning.
DISCLAIMER I am not an expert in this type of computer science. Much of this notebook is based upon Wikipedia entries. I suggest you do your own research, and use this notebook as a jumping off point.
Jellyfish is a python library for doing approximate and phonetic matching of strings. It allows for string comparison using the following algorithms:
All of the above are metrics for measuring the difference between two sequences. Each will now be briefly introduced and an example given:
The Levenshtein distance between two words is the minimum number of single-character edits (insertion, deletion, substitution) required to transform one string into another. The lower the score, the lower the number of edits. If the score is 0, then no edits are needed so the strings are exactly the same.
For example, the Levenshtein distance between "kitten" and "sitting" is 3, since the following three edits change one into the other, and there is no way to do it with fewer than three edits:
The measure is most useful when looking to match short strings with longer strings. Computing it for long strings takes a lot of computational power, so it may not be appropriate there.
For more information about the measure see
http://en.wikipedia.org/wiki/Levenshtein_distance
Now some examples follow:
A = 'Rory'
B = 'Ronnie'
jf.levenshtein_distance(A, B)
4
As the following demonstrates, the algorithm also considers case, which is a strong argument for converting all strings to the same case. This is a pretty standard way of working with strings so will come as no surprise.
A = 'Rory'
B = 'rory'
jf.levenshtein_distance(A, B)
1
# Now the levenshtein score of 0 means the strings are an exact match.
jf.levenshtein_distance(A.lower(), B.lower())
0
This measure is very similar to the Levenshtein distance, in that it is a measure of the minimum number of edits needed to transform one string into another. The permissible edits in the Levenshtein distance are insertions, deletions and substitutions, whereas the Damerau Levenshtein distance also allows the transposition of two adjacent characters. Damerau claimed that these four edit types correspond to 80% of human spelling errors.
As with the Levenshtein distance, a score of zero indicates an exact match.
This measure may be an improvement on the Levenshtein distance: under the Damerau Levenshtein distance, strings where two adjacent letters are simply the wrong way around receive a lower score (indicating a better match) than they would under the Levenshtein measure.
A simple example will suffice:
jf.levenshtein_distance('jellyfihs', 'jellyfish')
2
jf.damerau_levenshtein_distance('jellyfihs', 'jellyfish')
1
In this example, the Levenshtein distance works like this:
This is only one way it could happen. For instance 'h' could have been deleted, and then inserted (and so on), but the minimum number of edits is always 2
The Damerau Levenshtein distance works like this:
The measure may therefore be the better one, in that it recognises strings as closer than the Levenshtein measure does in cases where there has been a simple mix-up of characters.
NB I have observed some odd behaviour of the jf.damerau_levenshtein_distance algorithm. The odd behaviour may stem from my misunderstanding the nature of the measure, or it may be a problem with the library's code. I have written to the developer for clarification. If you are interested, see the following; if not, skip to the next measure below.
jf.damerau_levenshtein_distance('jellyifhs', 'jellyfish')
2
jf.damerau_levenshtein_distance('ifhs', 'fish')
3
I find the above output very odd, because the scores should be the same: the edits required in both instances are exactly the same:
In the first example:
In the second example:
Why the outputs are different remains to be determined.
Update: It appears from the source code that in some cases the function returns the OSA (optimal string alignment) measure rather than the Damerau Levenshtein measure. The developer has been made aware. Use this measure with caution.
More information on the measure can be found at http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
The Jaro distance is a measure that considers the number of matching characters in both strings being compared, and also the number of transpositions which is defined as the number of matching characters (in a different sequence order) divided by two. The measure returns a score between 0 and 1, 0 being no match whatsoever (as defined in the calculation) and 1 being a perfect match.
Beware that this measure will ignore matching characters that are more than a certain distance from each other. This could either be a good thing (to ignore spurious matches) or a bad thing (ignoring correct matches). In any event, it is important to be aware of this, so it is explained in detail below.
It is calculated as follows:
Latex(r"""\begin{eqnarray}
d_j = \left\{
\begin{array}{1 1}
0 & \quad \text{if $m$ = 0 <}\\
\\
\frac{1}{3} \bigg(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m} & \quad \text{otherwise}
\end{array} \right.
\end{eqnarray}""")
If the number of matching characters is 0, then the measure equals 0
A character in each string is only considered to be matching if it is the same and obeys the following rule as to distance:
Latex(r"""\begin{eqnarray}
\frac{max(|s_1|,|s_2|)}{2} -1
\end{eqnarray}""")
Observe the following:
S1 = 'AzzzzB'
S2 = 'BxxA'
jf.jaro_distance(S1, S2)
0.0
Although the characters A and B appear in both strings, m = 0 because they are farther apart than the matching window allows:
Latex(r"""\begin{eqnarray}
\frac{\max(6,4)}{2} - 1 = 2
\end{eqnarray}""")
Transpositions are also important as already mentioned. The number of transpositions is calculated as "The number of matching (but different sequence order) characters divided by 2". Observe the following:
S3 = 'Poverty'
S4 = 'Poervty'
jf.jaro_distance(S3, S4)
0.9523809523809524
Sadly it looks like this is also being calculated incorrectly. To my mind the calculation should be as follows:
Latex(r"""\begin{eqnarray}
\Bigg(\frac{7}{7} + \frac{7}{7} + \frac{7- \frac{3}{2}}{7}\Bigg) = 0.9285714
\end{eqnarray}""")
Whereas it appears that it is being calculated as follows:
Latex(r"""\begin{eqnarray}
\Bigg(\frac{7}{7} + \frac{7}{7} + \frac{7- \frac{2}{2}}{7}\Bigg) = 0.9523809
\end{eqnarray}""")
The critical difference is that I calculate the number of matching (but different sequence order) characters as 3, those being:
Whereas it appears that JellyFish thinks there are only two.
Again, I have raised this issue with the developer. It may be resolved in future, but for now use this measure with caution.
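As a cross-check, here is a minimal sketch of the Jaro calculation following the Wikipedia definition directly (my own code, not jellyfish's):
# Minimal Jaro distance, following the Wikipedia definition step by step
def jaro_reference(s1, s2):
    # Characters only count as matching within this window
    window = max(len(s1), len(s2)) // 2 - 1
    used = [False] * len(s2)
    matches1 = []
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not used[j] and s2[j] == c:
                used[j] = True
                matches1.append(c)
                break
    matches2 = [s2[j] for j in range(len(s2)) if used[j]]
    m = len(matches1)
    if m == 0:
        return 0.0
    # Transpositions: matched characters in a different order, divided by two
    t = sum(1 for a, b in zip(matches1, matches2) if a != b) / 2.0
    return (float(m) / len(s1) + float(m) / len(s2) + (m - t) / m) / 3.0

print(jaro_reference('Poverty', 'Poervty'))  # 0.92857...
This sketch counts three matched but out-of-order characters for 'Poverty' and 'Poervty', which supports the hand calculation of 0.9285714 rather than jellyfish's 0.9523809.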
The Jaro-Winkler Distance measure builds upon the Jaro measure, but uses a prefix scale which gives more favorable ratings to strings that match from the beginning for a set prefix length. This will become clear when the calculation is viewed:
Latex(r"""\begin{eqnarray}
d_w = d_j + (\ell p(1 - d_j))
\end{eqnarray}""")
The expression is made up of the following:
The key takeaway is that strings that begin with the same prefixes score more highly...
Observe the following:
S5 = 'Innovaiosn'
S6 = 'Innovations'
jf.jaro_distance(S5, S6)
0.9363636363636364
jf.jaro_winkler(S5, S6)
0.9618181818181818
Although it is not stated anywhere in the jellyfish documentation, it is clear that the value of p is 0.1. This can be seen by rearranging the formula to solve for p, with the matching prefix length ℓ capped at 4 (the two strings share the six-character prefix 'Innova', which is capped to four):
Latex(r"""\begin{eqnarray}
d_w = d_j + (\ell p(1 - d_j))\\
\\
\\
p = \frac{d_w - d_j}{1 - d_j} * \frac{1}{\ell} = \frac{d_w - d_j}{1 - d_j} * \frac{1}{4}\\
\\
\\
0.1 = \frac{0.96182-0.936364}{1-0.936364} * \frac{1}{4}
\end{eqnarray}""")
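The same rearrangement can be checked numerically, assuming the matching prefix length is capped at 4 as in Winkler's original scheme:
# Back out p from jellyfish's own outputs, assuming l = 4
dj = jf.jaro_distance(S5, S6)
dw = jf.jaro_winkler(S5, S6)
print((dw - dj) / ((1 - dj) * 4))  # 0.1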
In some implementations of Jaro-Winkler, the prefix bonus is only added when the compared strings have a Jaro distance above a set "boost threshold". In Winkler's own work this threshold was set at 0.7. In other words, if the Jaro measure is less than 0.7, the prefix bonus is not applied even if the strings share a prefix of up to four characters. The calculation then looks like this:
Latex(r"""\begin{eqnarray}
d_w = \left\{
\begin{array}{1 1}
d_j & \quad \text{if $d_j$ < $b_t$}\\
d_j + (\ell p(1 - d_j)) & \quad \text{otherwise}
\end{array} \right.
\end{eqnarray}""")
Again, as there is no documentation for JellyFish, it is hard to know whether this threshold is applied and, if so, at what level b_t is set. However, a little experimentation can demonstrate it: first I compare strings that have a Jaro score just below 0.7, and then strings with a score just above 0.7.
S7 = 'ABCDqwerty'
S8 = 'ABCDpoiuyt'
jf.jaro_distance(S7, S8)
0.6777777777777777
jf.jaro_winkler(S7, S8)
0.6777777777777777
S9 = 'ABCDqwerty'
S10 = 'ABCDpoiuty'
jf.jaro_distance(S9, S10)
0.7333333333333334
jf.jaro_winkler(S9, S10)
0.8400000000000001
The above output indicates that the implementation of the Jaro-Winkler measure in JellyFish does indeed use a threshold of 0.7 for b_t.
This technique is based upon a phonetic algorithm. As such I don't see much use for this function in the work I am doing, and will therefore not explore it further. I may return to it at a later date. For information about the measure see:
"In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different. In another way, it measures the minimum number of substitutions required to change one string into the other, or the minimum number of errors that could have transformed one string into the other."
In actual fact this measure has the glory of being technically a 'metric': unlike the Jaro measures presented here, it satisfies the triangle inequality.
However, it is also the least useful for our purpose, as it can only operate on strings of equal length and effectively only permits substitutions as edits. The Levenshtein distance is therefore already a much subtler measure, and the remaining measures more so still.
Therefore, this measure will not be explored.
In order to experiment with the algorithms I am going to read in some style data from a factory and play with it, to see how/if they might be useful to us.
# Read the 'style' column from the factory's CSV file and keep the unique style names in a Series named 'Styles'
Styles = pd.Series(pd.read_csv(r'C:\Users\rcreedon\Dropbox\GIZSupervisor\DATA\Production_Data\STP_Data\Data_Sets\Wave1\1005\1005_all_merged.csv')['style'].unique())
Styles
0     Denim Jacket
1     Denim Vest
2     Jonas Long
3     2618 (Longer Denim)
4     Lilen Trs
5     Ride
6     Tye Dye Jean
7     Jonas Short
8     Jacket (2002)
9     Dobby Short
10    Parker Chino
11    550 (Pant)
12    Ride (510)
13    Chambrey(117)
14    Space Treggins
15    Jhonny Boxer
16    New skinny
17    Clint
18    Todd Jkt
19    Jacket (2001)
20    Back to School
21    Reidun Parka
22    Mixmatch
23    Reidun Parka(Lining)
24    Parko Chino
25    Camden Jkt
26    NaN
27    CB32405510
28    Bomber Jkt
29    5 Pkt Cord
30    Shairt-615
31    Shairt-537
32    Shairt-564
33    Love Trs.(340)
34    1ZB-153T(Pant)
35    1ZB-153T(JKT)
36    Jacket(S.wear)
37    Jacket(Bk to School)Boys
38    Jacket(Bk to School)Girls
39    Jacket-253KE
40    14011876
dtype: object
# Drop the missing (NaN) entry
Styles = Styles[pd.notnull(Styles)]
One way to work with this data could be to create a dictionary with an entry for each string, then iterate through the remaining strings and append any that meet some threshold on the returned distance score.
The first thing that should always be done is to strip away all white spaces and non-alphanumeric characters and convert the case to lower (or upper). This can be achieved in the following manner:
# Build a string of every non-alphanumeric character in the 8-bit range
delchars = ''.join(c for c in map(chr, range(256)) if not c.isalnum())

# Strip those characters from each style name and convert to lower case
# (str.translate with a None table is Python 2 behaviour)
for row in Styles.index:
    Styles[row] = Styles[row].translate(None, delchars).lower()
Styles
0     denimjacket
1     denimvest
2     jonaslong
3     2618longerdenim
4     lilentrs
5     ride
6     tyedyejean
7     jonasshort
8     jacket2002
9     dobbyshort
10    parkerchino
11    550pant
12    ride510
13    chambrey117
14    spacetreggins
15    jhonnyboxer
16    newskinny
17    clint
18    toddjkt
19    jacket2001
20    backtoschool
21    reidunparka
22    mixmatch
23    reidunparkalining
24    parkochino
25    camdenjkt
27    cb32405510
28    bomberjkt
29    5pktcord
30    shairt615
31    shairt537
32    shairt564
33    lovetrs340
34    1zb153tpant
35    1zb153tjkt
36    jacketswear
37    jacketbktoschoolboys
38    jacketbktoschoolgirls
39    jacket253ke
40    14011876
dtype: object
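As an aside, the same cleaning can be written with the re module imported above, which avoids the Python 2-only str.translate(None, ...) signature (a sketch, equivalent in effect):
# Drop anything that is not a letter or a digit, then lower-case
Styles = Styles.map(lambda s: re.sub(r'[^0-9a-zA-Z]', '', s).lower())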
Firstly I turn to the Levenshtein distance.
# For each style, list every other style whose Levenshtein distance to it is below 5
LMatches1 = {_style:[] for _style in Styles}
for _style in LMatches1.keys():
    for row in Styles.index:
        if _style != Styles[row] and jf.levenshtein_distance(_style, Styles[row]) < 5:
            LMatches1[_style].append(Styles[row])
LMatches1
{'14011876': [], '1zb153tjkt': ['1zb153tpant'], '1zb153tpant': ['1zb153tjkt'], '2618longerdenim': [], '550pant': [], '5pktcord': [], 'backtoschool': [], 'bomberjkt': ['camdenjkt'], 'camdenjkt': ['bomberjkt'], 'cb32405510': [], 'chambrey117': [], 'clint': ['ride'], 'denimjacket': [], 'denimvest': [], 'dobbyshort': ['jonasshort'], 'jacket2001': ['jacket2002', 'jacket253ke'], 'jacket2002': ['jacket2001', 'jacket253ke'], 'jacket253ke': ['jacket2002', 'jacket2001'], 'jacketbktoschoolboys': ['jacketbktoschoolgirls'], 'jacketbktoschoolgirls': ['jacketbktoschoolboys'], 'jacketswear': [], 'jhonnyboxer': [], 'jonaslong': ['jonasshort'], 'jonasshort': ['jonaslong', 'dobbyshort'], 'lilentrs': [], 'lovetrs340': [], 'mixmatch': [], 'newskinny': [], 'parkerchino': ['parkochino'], 'parkochino': ['parkerchino'], 'reidunparka': [], 'reidunparkalining': [], 'ride': ['ride510', 'clint'], 'ride510': ['ride'], 'shairt537': ['shairt615', 'shairt564'], 'shairt564': ['shairt615', 'shairt537'], 'shairt615': ['shairt537', 'shairt564'], 'spacetreggins': [], 'toddjkt': [], 'tyedyejean': []}
The above has matched styles based upon whether the Levenshtein score is less than 5. What does this tell me?
Well, firstly, it seems that to make the score useful it should be normalized by some measure of the length of the strings. For example, it is pretty clear that a style with a short name like 'ride' will fall within the threshold of any other string of the same length, since at most four substitutions separate two four-character strings.
This is why the Wikipedia entry notes that the measure is best for comparing strings of similar length. One strategy could be to match strings that need fewer edits than some proportion of the average length of the two strings being compared.
For example, if the condition were that strings are matched only when the score is less than 0.35 times their average length, then two strings of average length 10 would be matched only if fewer than 3.5 edits were needed.
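As a sketch of that idea (a helper of my own, not part of jellyfish), the raw score can simply be divided by the average length:
# Hypothetical helper: Levenshtein distance as a fraction of the average
# length of the two strings; 0.0 means identical, larger means more different
def norm_levenshtein(a, b):
    return jf.levenshtein_distance(a, b) / ((len(a) + len(b)) / 2.0)

print(norm_levenshtein('ride', 'ride510'))           # 3 / 5.5 ~ 0.55
print(norm_levenshtein('jacket2001', 'jacket2002'))  # 1 / 10  = 0.1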
LMatches2 = {_style:[] for _style in Styles}
for _style in LMatches2.keys():
for row in Styles.index:
if _style != Styles[row] and jf.levenshtein_distance(_style, Styles[row]) < \
((len(_style) + len(Styles[row])/2)*0.35):
LMatches2[_style].append(Styles[row])
LMatches2
{'14011876': [], '1zb153tjkt': ['1zb153tpant'], '1zb153tpant': ['1zb153tjkt'], '2618longerdenim': [], '550pant': [], '5pktcord': [], 'backtoschool': [], 'bomberjkt': ['camdenjkt'], 'camdenjkt': ['bomberjkt'], 'cb32405510': [], 'chambrey117': [], 'clint': [], 'denimjacket': ['denimvest'], 'denimvest': [], 'dobbyshort': ['jonasshort'], 'jacket2001': ['jacket2002', 'jacketswear', 'jacket253ke'], 'jacket2002': ['jacket2001', 'jacketswear', 'jacket253ke'], 'jacket253ke': ['jacket2002', 'jacket2001', 'jacketswear'], 'jacketbktoschoolboys': ['backtoschool', 'jacketbktoschoolgirls'], 'jacketbktoschoolgirls': ['jacketbktoschoolboys'], 'jacketswear': ['jacket2002', 'jacket2001', 'jacket253ke'], 'jhonnyboxer': [], 'jonaslong': ['jonasshort'], 'jonasshort': ['jonaslong', 'dobbyshort'], 'lilentrs': [], 'lovetrs340': [], 'mixmatch': [], 'newskinny': [], 'parkerchino': ['parkochino'], 'parkochino': ['parkerchino'], 'reidunparka': ['reidunparkalining'], 'reidunparkalining': ['reidunparka'], 'ride': [], 'ride510': ['ride'], 'shairt537': ['shairt615', 'shairt564'], 'shairt564': ['shairt615', 'shairt537'], 'shairt615': ['shairt537', 'shairt564'], 'spacetreggins': [], 'toddjkt': [], 'tyedyejean': []}
I'm not sure I am approaching this in the most scientific way; I am simply playing with the numbers. If the multiplier is too large, too many matches are made; too small, and some are missed. For example, I would hope that 'Ride' and 'Ride510' would be matched, but getting there requires a multiplier of 0.5, which creates too many matches.
I will return to this later. Let's now see how the second measure holds up.
Now I turn to the Damerau Levenshtein distance.
DLMatches1 = {_style:[] for _style in Styles}
for _style in DLMatches1.keys():
for row in Styles.index:
if _style != Styles[row] and jf.damerau_levenshtein_distance(_style, Styles[row]) < 5:
DLMatches1[_style].append(Styles[row])
DLMatches1
{'14011876': [], '1zb153tjkt': ['1zb153tpant'], '1zb153tpant': ['1zb153tjkt'], '2618longerdenim': [], '550pant': [], '5pktcord': [], 'backtoschool': [], 'bomberjkt': ['camdenjkt'], 'camdenjkt': ['bomberjkt'], 'cb32405510': [], 'chambrey117': [], 'clint': ['ride'], 'denimjacket': [], 'denimvest': [], 'dobbyshort': ['jonasshort'], 'jacket2001': ['jacket2002', 'jacket253ke'], 'jacket2002': ['jacket2001', 'jacket253ke'], 'jacket253ke': ['jacket2002', 'jacket2001'], 'jacketbktoschoolboys': ['jacketbktoschoolgirls'], 'jacketbktoschoolgirls': ['jacketbktoschoolboys'], 'jacketswear': [], 'jhonnyboxer': [], 'jonaslong': ['jonasshort'], 'jonasshort': ['jonaslong', 'dobbyshort'], 'lilentrs': [], 'lovetrs340': [], 'mixmatch': [], 'newskinny': [], 'parkerchino': ['parkochino'], 'parkochino': ['parkerchino'], 'reidunparka': [], 'reidunparkalining': [], 'ride': ['ride510', 'clint'], 'ride510': ['ride'], 'shairt537': ['shairt615', 'shairt564'], 'shairt564': ['shairt615', 'shairt537'], 'shairt615': ['shairt537', 'shairt564'], 'spacetreggins': [], 'toddjkt': [], 'tyedyejean': []}
Again, the same problems with short strings seem to have occurred. But what, if anything, is the difference between the output of the two measures?
for key in DLMatches1:
if LMatches1[key] != DLMatches1[key]:
print 'This was the Leven result ' + key + " " + str(LMatches1[key])
print 'This was the D - Leven result ' + key + " " + str(DLMatches1[key])
In fact they were identical: the comparison loop printed nothing.
Now looking at the more nuanced measure matching based upon a proportion of the average string length:
DLMatches2 = {_style:[] for _style in Styles}
for _style in DLMatches2.keys():
for row in Styles.index:
if _style != Styles[row] and jf.damerau_levenshtein_distance(_style, Styles[row]) < \
((len(_style) + len(Styles[row])/2)*0.35):
DLMatches2[_style].append(Styles[row])
DLMatches2
{'14011876': [], '1zb153tjkt': ['1zb153tpant'], '1zb153tpant': ['1zb153tjkt'], '2618longerdenim': [], '550pant': [], '5pktcord': [], 'backtoschool': [], 'bomberjkt': ['camdenjkt'], 'camdenjkt': ['bomberjkt'], 'cb32405510': [], 'chambrey117': [], 'clint': [], 'denimjacket': ['denimvest'], 'denimvest': [], 'dobbyshort': ['jonasshort'], 'jacket2001': ['jacket2002', 'jacketswear', 'jacket253ke'], 'jacket2002': ['jacket2001', 'jacketswear', 'jacket253ke'], 'jacket253ke': ['jacket2002', 'jacket2001', 'jacketswear'], 'jacketbktoschoolboys': ['backtoschool', 'jacketbktoschoolgirls'], 'jacketbktoschoolgirls': ['jacketbktoschoolboys'], 'jacketswear': ['jacket2002', 'jacket2001', 'jacket253ke'], 'jhonnyboxer': [], 'jonaslong': ['jonasshort'], 'jonasshort': ['jonaslong', 'dobbyshort'], 'lilentrs': [], 'lovetrs340': [], 'mixmatch': [], 'newskinny': [], 'parkerchino': ['parkochino'], 'parkochino': ['parkerchino'], 'reidunparka': ['reidunparkalining'], 'reidunparkalining': ['reidunparka'], 'ride': [], 'ride510': ['ride'], 'shairt537': ['shairt615', 'shairt564'], 'shairt564': ['shairt615', 'shairt537'], 'shairt615': ['shairt537', 'shairt564'], 'spacetreggins': [], 'toddjkt': [], 'tyedyejean': []}
Again, this is just trial and error, but let's see whether the matches differ when the same multiplier (0.35) is used:
for key in DLMatches2:
if LMatches2[key] != DLMatches2[key]:
print 'This was the Leven result ' + key + " " + str(LMatches2[key])
print 'This was the D - Leven result ' + key + " " + str(DLMatches2[key])
print
Again, no difference.
Now I turn to the Jaro distance.
For the Jaro distance an arbitrary score again has to be chosen as the cutoff. Note that since Jaro returns a similarity between 0 and 1, the condition is now 'greater than' rather than 'less than'.
JMatches1 = {_style:[] for _style in Styles}
for _style in JMatches1.keys():
for row in Styles.index:
if _style != Styles[row] and jf.jaro_distance(_style, Styles[row]) > 0.75:
JMatches1[_style].append(Styles[row])
JMatches1
{'14011876': [], '1zb153tjkt': ['1zb153tpant'], '1zb153tpant': ['1zb153tjkt'], '2618longerdenim': [], '550pant': [], '5pktcord': [], 'backtoschool': ['jacketbktoschoolboys', 'jacketbktoschoolgirls'], 'bomberjkt': [], 'camdenjkt': [], 'cb32405510': [], 'chambrey117': [], 'clint': ['lilentrs'], 'denimjacket': ['denimvest'], 'denimvest': ['denimjacket'], 'dobbyshort': [], 'jacket2001': ['jacket2002', 'jacket253ke'], 'jacket2002': ['jacket2001', 'jacket253ke'], 'jacket253ke': ['jacket2002', 'jacket2001', 'jacketswear'], 'jacketbktoschoolboys': ['backtoschool', 'jacketbktoschoolgirls'], 'jacketbktoschoolgirls': ['backtoschool', 'jacketbktoschoolboys'], 'jacketswear': ['jacket253ke'], 'jhonnyboxer': [], 'jonaslong': ['jonasshort'], 'jonasshort': ['jonaslong'], 'lilentrs': ['clint'], 'lovetrs340': [], 'mixmatch': [], 'newskinny': [], 'parkerchino': ['parkochino'], 'parkochino': ['parkerchino'], 'reidunparka': ['reidunparkalining'], 'reidunparkalining': ['reidunparka'], 'ride': ['ride510'], 'ride510': ['ride'], 'shairt537': ['shairt615', 'shairt564'], 'shairt564': ['shairt615', 'shairt537'], 'shairt615': ['shairt537', 'shairt564'], 'spacetreggins': [], 'toddjkt': [], 'tyedyejean': []}
OK, this is where my non-scientific approach shows its limits: I have arbitrarily chosen a cutoff of greater than 0.75. Even so, the results look much closer to what I would hope for. In fact they closely match the questions I actually sent to this factory about their style names, which I drew up simply by eyeballing the data.
Let's compare:
for key in JMatches1:
if DLMatches2[key] != JMatches1[key]:
print 'This was the D- Leven result ' + key + " " + str(DLMatches2[key])
print 'This was the Jaro result ' + key + " " + str(JMatches1[key])
print
This was the D- Leven result lilentrs []
This was the Jaro result lilentrs ['clint']

This was the D- Leven result clint []
This was the Jaro result clint ['lilentrs']

This was the D- Leven result denimvest []
This was the Jaro result denimvest ['denimjacket']

This was the D- Leven result jacketswear ['jacket2002', 'jacket2001', 'jacket253ke']
This was the Jaro result jacketswear ['jacket253ke']

This was the D- Leven result camdenjkt ['bomberjkt']
This was the Jaro result camdenjkt []

This was the D- Leven result backtoschool []
This was the Jaro result backtoschool ['jacketbktoschoolboys', 'jacketbktoschoolgirls']

This was the D- Leven result jonasshort ['jonaslong', 'dobbyshort']
This was the Jaro result jonasshort ['jonaslong']

This was the D- Leven result jacketbktoschoolgirls ['jacketbktoschoolboys']
This was the Jaro result jacketbktoschoolgirls ['backtoschool', 'jacketbktoschoolboys']

This was the D- Leven result jacket2002 ['jacket2001', 'jacketswear', 'jacket253ke']
This was the Jaro result jacket2002 ['jacket2001', 'jacket253ke']

This was the D- Leven result bomberjkt ['camdenjkt']
This was the Jaro result bomberjkt []

This was the D- Leven result ride []
This was the Jaro result ride ['ride510']

This was the D- Leven result jacket2001 ['jacket2002', 'jacketswear', 'jacket253ke']
This was the Jaro result jacket2001 ['jacket2002', 'jacket253ke']

This was the D- Leven result dobbyshort ['jonasshort']
This was the Jaro result dobbyshort []
It seems to me as though the Jaro result is much more in line with what I would consider to be actual matches.
Now I turn to the Jaro Winkler distance.
JWMatches1 = {_style:[] for _style in Styles}
for _style in JWMatches1.keys():
for row in Styles.index:
if _style != Styles[row] and jf.jaro_winkler(_style, Styles[row]) > 0.75:
JWMatches1[_style].append(Styles[row])
JWMatches1
{'14011876': [], '1zb153tjkt': ['1zb153tpant'], '1zb153tpant': ['1zb153tjkt'], '2618longerdenim': [], '550pant': [], '5pktcord': [], 'backtoschool': ['jacketbktoschoolboys', 'jacketbktoschoolgirls'], 'bomberjkt': [], 'camdenjkt': [], 'cb32405510': [], 'chambrey117': [], 'clint': ['lilentrs'], 'denimjacket': ['denimvest'], 'denimvest': ['denimjacket'], 'dobbyshort': [], 'jacket2001': ['jacket2002', 'jacketswear', 'jacket253ke'], 'jacket2002': ['jacket2001', 'jacketswear', 'jacket253ke'], 'jacket253ke': ['jacket2002', 'jacket2001', 'jacketswear'], 'jacketbktoschoolboys': ['backtoschool', 'jacketbktoschoolgirls'], 'jacketbktoschoolgirls': ['backtoschool', 'jacketswear', 'jacketbktoschoolboys'], 'jacketswear': ['jacket2002', 'jacket2001', 'jacketbktoschoolgirls', 'jacket253ke'], 'jhonnyboxer': [], 'jonaslong': ['jonasshort'], 'jonasshort': ['jonaslong'], 'lilentrs': ['clint'], 'lovetrs340': [], 'mixmatch': [], 'newskinny': [], 'parkerchino': ['parkochino'], 'parkochino': ['parkerchino'], 'reidunparka': ['reidunparkalining'], 'reidunparkalining': ['reidunparka'], 'ride': ['ride510'], 'ride510': ['ride'], 'shairt537': ['shairt615', 'shairt564'], 'shairt564': ['shairt615', 'shairt537'], 'shairt615': ['shairt537', 'shairt564'], 'spacetreggins': [], 'toddjkt': [], 'tyedyejean': []}
It seems to me that, as the J-W score gives a boost to matches with a similar prefix, the threshold score for matching could be pushed a little higher to eliminate more spurious matches. But first let's see how it compares with the Jaro measure:
for key in JWMatches1:
if JMatches1[key] != JWMatches1[key]:
print 'This was the Jaro result ' + key + " " + str(JMatches1[key])
print 'This was the Jaro-Winkler result ' + key + " " + str(JWMatches1[key])
print
This was the Jaro result jacketswear ['jacket253ke']
This was the Jaro-Winkler result jacketswear ['jacket2002', 'jacket2001', 'jacketbktoschoolgirls', 'jacket253ke']

This was the Jaro result jacketbktoschoolgirls ['backtoschool', 'jacketbktoschoolboys']
This was the Jaro-Winkler result jacketbktoschoolgirls ['backtoschool', 'jacketswear', 'jacketbktoschoolboys']

This was the Jaro result jacket2002 ['jacket2001', 'jacket253ke']
This was the Jaro-Winkler result jacket2002 ['jacket2001', 'jacketswear', 'jacket253ke']

This was the Jaro result jacket2001 ['jacket2002', 'jacket253ke']
This was the Jaro-Winkler result jacket2001 ['jacket2002', 'jacketswear', 'jacket253ke']
The prefix boost associated with the J-W measure is pushing styles with the same prefix higher. As prefixes are likely to be common when working with style names, this is perhaps a disadvantage of the method.
One fix might be to push the threshold higher.
JWMatches2 = {_style:[] for _style in Styles}
for _style in JWMatches2.keys():
for row in Styles.index:
if _style != Styles[row] and jf.jaro_winkler(_style, Styles[row]) > 0.8:
JWMatches2[_style].append(Styles[row])
JWMatches2
{'14011876': [], '1zb153tjkt': ['1zb153tpant'], '1zb153tpant': ['1zb153tjkt'], '2618longerdenim': [], '550pant': [], '5pktcord': [], 'backtoschool': ['jacketbktoschoolboys', 'jacketbktoschoolgirls'], 'bomberjkt': [], 'camdenjkt': [], 'cb32405510': [], 'chambrey117': [], 'clint': [], 'denimjacket': ['denimvest'], 'denimvest': ['denimjacket'], 'dobbyshort': [], 'jacket2001': ['jacket2002', 'jacketswear', 'jacket253ke'], 'jacket2002': ['jacket2001', 'jacketswear', 'jacket253ke'], 'jacket253ke': ['jacket2002', 'jacket2001', 'jacketswear'], 'jacketbktoschoolboys': ['backtoschool', 'jacketbktoschoolgirls'], 'jacketbktoschoolgirls': ['backtoschool', 'jacketswear', 'jacketbktoschoolboys'], 'jacketswear': ['jacket2002', 'jacket2001', 'jacketbktoschoolgirls', 'jacket253ke'], 'jhonnyboxer': [], 'jonaslong': ['jonasshort'], 'jonasshort': ['jonaslong'], 'lilentrs': [], 'lovetrs340': [], 'mixmatch': [], 'newskinny': [], 'parkerchino': ['parkochino'], 'parkochino': ['parkerchino'], 'reidunparka': ['reidunparkalining'], 'reidunparkalining': ['reidunparka'], 'ride': ['ride510'], 'ride510': ['ride'], 'shairt537': ['shairt615', 'shairt564'], 'shairt564': ['shairt615', 'shairt537'], 'shairt615': ['shairt537', 'shairt564'], 'spacetreggins': [], 'toddjkt': [], 'tyedyejean': []}
for key in JWMatches2:
if JMatches1[key] != JWMatches2[key]:
print 'This was the Jaro result ' + key + " " + str(JMatches1[key])
print 'This was the Jaro-Winkler (0.8) result ' + key + " " + str(JWMatches2[key])
print
This was the Jaro result lilentrs ['clint']
This was the Jaro-Winkler (0.8) result lilentrs []

This was the Jaro result clint ['lilentrs']
This was the Jaro-Winkler (0.8) result clint []

This was the Jaro result jacketswear ['jacket253ke']
This was the Jaro-Winkler (0.8) result jacketswear ['jacket2002', 'jacket2001', 'jacketbktoschoolgirls', 'jacket253ke']

This was the Jaro result jacketbktoschoolgirls ['backtoschool', 'jacketbktoschoolboys']
This was the Jaro-Winkler (0.8) result jacketbktoschoolgirls ['backtoschool', 'jacketswear', 'jacketbktoschoolboys']

This was the Jaro result jacket2002 ['jacket2001', 'jacket253ke']
This was the Jaro-Winkler (0.8) result jacket2002 ['jacket2001', 'jacketswear', 'jacket253ke']

This was the Jaro result jacket2001 ['jacket2002', 'jacket253ke']
This was the Jaro-Winkler (0.8) result jacket2001 ['jacket2002', 'jacketswear', 'jacket253ke']
In fact no: increasing the J-W cutoff by even 0.05 is not enough to get rid of the effect of the prefix on the 'jacket' styles, although it does drop the spurious 'clint'/'lilentrs' pairing.
It is too soon to say exactly which of the measures will be most useful, or to what extent they can help. If you are reading this notebook as a non-colleague, the usefulness of the measures will depend upon the type of strings you are analyzing, and I daresay the measures I did not consider could also be useful.
For my purposes the Jaro measure seems the most promising, although in reality I will use all four measures and compare their outputs to get a sense of where my strings might be matching.
It is possible that I may return to this notebook and add further thoughts on experimentation at a later date.
One thing these algorithms may not help with is where style names are incorporated into much larger strings. To identify those as matches, some sort of algorithm based upon the matching of subsequences may be needed. More on this to follow; possibly I will write my own.
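As a first sketch of that idea, the standard library's difflib.SequenceMatcher can already locate a style name embedded in a longer string (the longer string below is a hypothetical example, not factory data):
import difflib

style = 'ride510'
cell = 'line3 ride510 repeat order'  # hypothetical messy production entry
sm = difflib.SequenceMatcher(None, style, cell)
m = sm.find_longest_match(0, len(style), 0, len(cell))
print(cell[m.b:m.b + m.size])  # 'ride510' -- the embedded style name
print(sm.ratio())              # overall similarity of the two strings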