Author: Rory Creedon
Date: 28 November 2013
Purpose: To examine the possibility of using the python JellyFish library for complex "fuzzy" string matching.
CONTENTS:
import pandas as pd
import numpy as np
import datetime
import os
from pandas import DataFrame
from numpy import nan as NA
from IPython.core.display import HTML
from IPython.core.display import Image
from IPython.display import Math
from IPython.display import Latex
import collections
import jellyfish as jf
import re
With regard to style names in production data we face two distinct problems:
At this point, the notebook will mostly relate to the former problem, as it is most salient at this stage of the data cleaning.
DISCLAIMER I am not an expert in this type of computer science. Much of this notebook is based upon Wikipedia entries. I suggest you do your own research, and use this notebook as a jumping off point.
Jellyfish is a python library for doing approximate and phonetic matching of strings. It allows for string comparison using the following algorithms:
All of the above are metrics for measuring the difference between two sequences. Each will now be briefly introduced and an example given:
The Levenshtein distance between two words is the minimum number of single-character edits (insertion, deletion, substitution) required to transform one string into another. The lower the score, the lower the number of edits. If the score is 0, then no edits are needed so the strings are exactly the same.
For example, the Levenshtein distance between "kitten" and "sitting" is 3, since the following three edits change one into the other, and there is no way to do it with fewer than three edits:
The measure is most useful when looking to match short strings with longer strings. Computing it for long strings takes a lot of computational power, so it may not be appropriate there.
For more information about the measure see
http://en.wikipedia.org/wiki/Levenshtein_distance
Now some examples follow:
A = 'Rory'
B = 'Ronnie'
jf.levenshtein_distance(A, B)
4
As the following demonstrates, the algorithm also considers case, which is a strong argument for converting all strings to the same case. This is a pretty standard way of working with strings so will come as no surprise.
A = 'Rory'
B = 'rory'
jf.levenshtein_distance(A, B)
1
# Now the levenshtein score of 0 means the strings are an exact match.
jf.levenshtein_distance(A.lower(), B.lower())
0
This measure is very similar to the Levenshtein distance, in that it is a measure of the minimum number of edits needed to transform one string into another. The permissible edits in the Levenshtein distance are insertions, deletions and substitutions, whereas the Damerau Levenshtein distance also allows the transposition of two adjacent characters. Damerau claimed that these four edit types correspond to 80% of human spelling errors.
As with the Levenshtein distance, a score of zero indicates an exact match.
This measure may be an improvement on the Levenshtein distance: under the Damerau Levenshtein distance, strings where two adjacent letters are simply the wrong way around receive a lower score (indicating a better match) than they would under the Levenshtein measure.
A simple example will suffice:
jf.levenshtein_distance('jellyfihs', 'jellyfish')
2
jf.damerau_levenshtein_distance('jellyfihs', 'jellyfish')
1
In this example, the Levenshtein distance works like this:
This is only one way it could happen. For instance 'h' could have been deleted, and then inserted (and so on), but the minimum number of edits is always 2
The Damerau Levenshtein distance works like this:
The measure may therefore be the better one, in that it recognises strings as closer than the Levenshtein measure does in cases where there has been a simple mix-up of characters.
NB I have observed some odd behaviour of the jf.damerau_levenshtein_distance algorithm. The odd behaviour may stem from my misunderstanding the nature of the measure, or it may be a problem with the library's code. I have written to the developer for clarification. If you are interested, see the following; if not, skip to the next measure below.
jf.damerau_levenshtein_distance('jellyifhs', 'jellyfish')
2
jf.damerau_levenshtein_distance('ifhs', 'fish')
3
I find the above output very odd, because the scores should be the same: the edits required in both instances are exactly the same:
In the first example:
In the second example:
Why the outputs are different remains to be determined.
Update: It appears from the source code that in some cases the function returns the OSA (optimal string alignment) measure rather than the Damerau Levenshtein measure. The developer has been made aware. Use this measure with caution.
More information on the measure can be found at http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
The Jaro distance is a measure that considers the number of matching characters in both strings being compared, and also the number of transpositions which is defined as the number of matching characters (in a different sequence order) divided by two. The measure returns a score between 0 and 1, 0 being no match whatsoever (as defined in the calculation) and 1 being a perfect match.
Beware that this measure will ignore matching characters that are more than a certain distance from each other. This could either be a good thing (to ignore spurious matches) or a bad thing (ignoring correct matches). In any event, it is important to be aware of this, so it is explained in detail below.
It is calculated as follows:
Latex(r"""\begin{eqnarray}
d_j = \left\{
\begin{array}{1 1}
0 & \quad \text{if $m$ = 0 <}\\
\\
\frac{1}{3} \bigg(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m} & \quad \text{otherwise}
\end{array} \right.
\end{eqnarray}""")
If the number of matching characters is 0, then the measure equals 0
A character in each string is only considered to be matching if it is the same and obeys the following rule as to distance:
Latex(r"""\begin{eqnarray}
\frac{max(|s_1|,|s_2|)}{2} -1
\end{eqnarray}""")
Observe the following:
S1 = 'AzzzzB'
S2 = 'BxxA'
jf.jaro_distance(S1, S2)
0.0
Although the characters A and B appear in both strings, m = 0 because they are farther apart than the matching window allows:
Latex(r"""\begin{eqnarray}
\frac{\max(6,4)}{2} - 1 = 2
\end{eqnarray}""")
Transpositions are also important as already mentioned. The number of transpositions is calculated as "The number of matching (but different sequence order) characters divided by 2". Observe the following:
S3 = 'Poverty'
S4 = 'Poervty'
jf.jaro_distance(S3, S4)
0.9523809523809524
Sadly it looks like this is also being calculated incorrectly. To my mind the calculation should be as follows:
Latex(r"""\begin{eqnarray}
\Bigg(\frac{7}{7} + \frac{7}{7} + \frac{7- \frac{3}{2}}{7}\Bigg) = 0.9285714
\end{eqnarray}""")
Whereas it appears that it is being calculated as follows:
Latex(r"""\begin{eqnarray}
\Bigg(\frac{7}{7} + \frac{7}{7} + \frac{7- \frac{2}{2}}{7}\Bigg) = 0.9523809
\end{eqnarray}""")
The critical difference is that I calculate the number of matching (but different sequence order) characters as 3, those being:
Whereas it appears that JellyFish thinks there are only two.
Again, I have raised this issue with the developer. It may be resolved in future, but for now use this measure with caution.
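As a cross-check, here is a minimal sketch of the Jaro calculation following the Wikipedia definition directly (my own code, not jellyfish's):
# Minimal Jaro distance, following the Wikipedia definition step by step
def jaro_reference(s1, s2):
    # Characters only count as matching within this window
    window = max(len(s1), len(s2)) // 2 - 1
    used = [False] * len(s2)
    matches1 = []
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not used[j] and s2[j] == c:
                used[j] = True
                matches1.append(c)
                break
    matches2 = [s2[j] for j in range(len(s2)) if used[j]]
    m = len(matches1)
    if m == 0:
        return 0.0
    # Transpositions: matched characters in a different order, divided by two
    t = sum(1 for a, b in zip(matches1, matches2) if a != b) / 2.0
    return (float(m) / len(s1) + float(m) / len(s2) + (m - t) / m) / 3.0

print(jaro_reference('Poverty', 'Poervty'))  # 0.92857...
This sketch counts three matched but out-of-order characters for 'Poverty' and 'Poervty', which supports the hand calculation of 0.9285714 rather than jellyfish's 0.9523809.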
The Jaro-Winkler Distance measure builds upon the Jaro measure, but uses a prefix scale which gives more favorable ratings to strings that match from the beginning for a set prefix length. This will become clear when the calculation is viewed:
Latex(r"""\begin{eqnarray}
d_w = d_j + (\ell p(1 - d_j))
\end{eqnarray}""")
The expression is made up of the following:
The key takeaway is that strings that begin with the same prefixes score more highly...
Observe the following:
S5 = 'Innovaiosn'
S6 = 'Innovations'
jf.jaro_distance(S5, S6)
0.9363636363636364
jf.jaro_winkler(S5, S6)
0.9618181818181818
Although it is not stated anywhere in the jellyfish documentation, it is clear that the value of p is 0.1. This can be seen by rearranging the formula to solve for p, with the matching prefix length ℓ capped at 4 (the two strings share the six-character prefix 'Innova', which is capped to four):
Latex(r"""\begin{eqnarray}
d_w = d_j + (\ell p(1 - d_j))\\
\\
\\
p = \frac{d_w - d_j}{1 - d_j} * \frac{1}{\ell} = \frac{d_w - d_j}{1 - d_j} * \frac{1}{4}\\
\\
\\
0.1 = \frac{0.96182-0.936364}{1-0.936364} * \frac{1}{4}
\end{eqnarray}""")
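The same rearrangement can be checked numerically, assuming the matching prefix length is capped at 4 as in Winkler's original scheme:
# Back out p from jellyfish's own outputs, assuming l = 4
dj = jf.jaro_distance(S5, S6)
dw = jf.jaro_winkler(S5, S6)
print((dw - dj) / ((1 - dj) * 4))  # 0.1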
In some implementations of Jaro-Winkler, the prefix bonus is only added when the compared strings have a Jaro distance above a set "boost threshold". In Winkler's own work this threshold was set at 0.7. In other words, if the Jaro measure is less than 0.7, the prefix bonus is not applied even if the strings share a prefix of up to four characters. The calculation then looks like this:
Latex(r"""\begin{eqnarray}
d_w = \left\{
\begin{array}{1 1}
d_j & \quad \text{if $d_j$ < $b_t$}\\
d_j + (\ell p(1 - d_j)) & \quad \text{otherwise}
\end{array} \right.
\end{eqnarray}""")
Again, as there is no documentation for JellyFish, it is hard to know whether this threshold is applied and, if so, at what level b_t is set. However, a little experimentation can demonstrate it: first I compare strings that have a Jaro score just below 0.7, and then strings with a score just above 0.7.
S7 = 'ABCDqwerty'
S8 = 'ABCDpoiuyt'
jf.jaro_distance(S7, S8)
0.6777777777777777
jf.jaro_winkler(S7, S8)
0.6777777777777777
S9 = 'ABCDqwerty'
S10 = 'ABCDpoiuty'
jf.jaro_distance(S9, S10)
0.7333333333333334
jf.jaro_winkler(S9, S10)
0.8400000000000001
The above output indicates that the implementation of the Jaro-Winkler measure in JellyFish does indeed use a threshold of 0.7 for b_t.
This technique is based upon a phonetic algorithm. As such I don't see much use for this function in the work I am doing, and will therefore not explore it further. I may return to it at a later date. For information about the measure see:
"In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different. In another way, it measures the minimum number of substitutions required to change one string into the other, or the minimum number of errors that could have transformed one string into the other."
In actual fact this measure has the glory of being technically a 'metric': unlike the Jaro measures presented here, it satisfies the triangle inequality.
However, it is also the least useful for our purpose, as it can only operate on strings of equal length and effectively only permits substitutions as edits. The Levenshtein distance is therefore already a much subtler measure, and the remaining measures more so still.
Therefore, this measure will not be explored.
In order to experiment with the algorithms I am going to read in some style data from a factory and play with it, to see how/if they might be useful to us.
# Read the 'style' column from the factory's CSV file and keep the unique style names in a Series named 'Styles'
Styles = pd.Series(pd.read_csv(r'C:\Users\rcreedon\Dropbox\GIZSupervisor\DATA\Production_Data\STP_Data\Data_Sets\Wave1\1005\1005_all_merged.csv')['style'].unique())
Styles
0     Denim Jacket
1     Denim Vest
2     Jonas Long
3     2618 (Longer Denim)
4     Lilen Trs
5     Ride
6     Tye Dye Jean
7     Jonas Short
8     Jacket (2002)
9     Dobby Short
10    Parker Chino
11    550 (Pant)
12    Ride (510)
13    Chambrey(117)
14    Space Treggins
15    Jhonny Boxer
16    New skinny
17    Clint
18    Todd Jkt
19    Jacket (2001)
20    Back to School
21    Reidun Parka
22    Mixmatch
23    Reidun Parka(Lining)
24    Parko Chino
25    Camden Jkt
26    NaN
27    CB32405510
28    Bomber Jkt
29    5 Pkt Cord
30    Shairt-615
31    Shairt-537
32    Shairt-564
33    Love Trs.(340)
34    1ZB-153T(Pant)
35    1ZB-153T(JKT)
36    Jacket(S.wear)
37    Jacket(Bk to School)Boys
38    Jacket(Bk to School)Girls
39    Jacket-253KE
40    14011876
dtype: object
# Drop the missing (NaN) entry
Styles = Styles[pd.notnull(Styles)]
One way to work with this data could be to create a dictionary with an entry for each string, then iterate through the remaining strings and append any that meet some threshold on the returned distance score.
The first thing that should always be done is to strip away all white spaces and non-alphanumeric characters and convert the case to lower (or upper). This can be achieved in the following manner:
# Build a string of every non-alphanumeric character in the 8-bit range
delchars = ''.join(c for c in map(chr, range(256)) if not c.isalnum())

# Strip those characters from each style name and convert to lower case
# (str.translate with a None table is Python 2 behaviour)
for row in Styles.index:
    Styles[row] = Styles[row].translate(None, delchars).lower()
Styles
0     denimjacket
1     denimvest
2     jonaslong
3     2618longerdenim
4     lilentrs
5     ride
6     tyedyejean
7     jonasshort
8     jacket2002
9     dobbyshort
10    parkerchino
11    550pant
12    ride510
13    chambrey117
14    spacetreggins
15    jhonnyboxer
16    newskinny
17    clint
18    toddjkt
19    jacket2001
20    backtoschool
21    reidunparka
22    mixmatch
23    reidunparkalining
24    parkochino
25    camdenjkt
27    cb32405510
28    bomberjkt
29    5pktcord
30    shairt615
31    shairt537
32    shairt564
33    lovetrs340
34    1zb153tpant
35    1zb153tjkt
36    jacketswear
37    jacketbktoschoolboys
38    jacketbktoschoolgirls
39    jacket253ke
40    14011876
dtype: object
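As an aside, the same cleaning can be written with the re module imported above, which avoids the Python 2-only str.translate(None, ...) signature (a sketch, equivalent in effect):
# Drop anything that is not a letter or a digit, then lower-case
Styles = Styles.map(lambda s: re.sub(r'[^0-9a-zA-Z]', '', s).lower())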
Firstly I turn to the Levenshtein distance.
# For each style, list every other style whose Levenshtein distance to it is below 5
LMatches1 = {_style:[] for _style in Styles}
for _style in LMatches1.keys():
    for row in Styles.index:
        if _style != Styles[row] and jf.levenshtein_distance(_style, Styles[row]) < 5:
            LMatches1[_style].append(Styles[row])
LMatches1
{'14011876': [], '1zb153tjkt': ['1zb153tpant'], '1zb153tpant': ['1zb153tjkt'], '2618longerdenim': [], '550pant': [], '5pktcord': [], 'backtoschool': [], 'bomberjkt': ['camdenjkt'], 'camdenjkt': ['bomberjkt'], 'cb32405510': [], 'chambrey117': [], 'clint': ['ride'], 'denimjacket': [], 'denimvest': [], 'dobbyshort': ['jonasshort'], 'jacket2001': ['jacket2002', 'jacket253ke'], 'jacket2002': ['jacket2001', 'jacket253ke'], 'jacket253ke': ['jacket2002', 'jacket2001'], 'jacketbktoschoolboys': ['jacketbktoschoolgirls'], 'jacketbktoschoolgirls': ['jacketbktoschoolboys'], 'jacketswear': [], 'jhonnyboxer': [], 'jonaslong': ['jonasshort'], 'jonasshort': ['jonaslong', 'dobbyshort'], 'lilentrs': [], 'lovetrs340': [], 'mixmatch': [], 'newskinny': [], 'parkerchino': ['parkochino'], 'parkochino': ['parkerchino'], 'reidunparka': [], 'reidunparkalining': [], 'ride': ['ride510', 'clint'], 'ride510': ['ride'], 'shairt537': ['shairt615', 'shairt564'], 'shairt564': ['shairt615', 'shairt537'], 'shairt615': ['shairt537', 'shairt564'], 'spacetreggins': [], 'toddjkt': [], 'tyedyejean': []}
The above has matched styles based upon whether the Levenshtein score is less than 5. What does this tell me?
Well, firstly, it seems that to make the score useful it should be normalized by some measure of the length of the strings. For example, it is pretty clear that a style with a short name like 'ride' will fall within the threshold of any other string of the same length, since at most four substitutions separate two four-character strings.
This is why the Wikipedia entry notes that the measure is best for comparing strings of similar length. One strategy could be to match strings that need fewer edits than some proportion of the average length of the two strings being compared.
For example, if the condition were that strings are matched only when the score is less than 0.35 times their average length, then two strings of average length 10 would be matched only if fewer than 3.5 edits were needed.
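As a sketch of that idea (a helper of my own, not part of jellyfish), the raw score can simply be divided by the average length:
# Hypothetical helper: Levenshtein distance as a fraction of the average
# length of the two strings; 0.0 means identical, larger means more different
def norm_levenshtein(a, b):
    return jf.levenshtein_distance(a, b) / ((len(a) + len(b)) / 2.0)

print(norm_levenshtein('ride', 'ride510'))           # 3 / 5.5 ~ 0.55
print(norm_levenshtein('jacket2001', 'jacket2002'))  # 1 / 10  = 0.1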
LMatches2 = {_style:[] for _style in Styles}
for _style in LMatches2.keys():
for row in Styles.index:
if _style != Styles[row] and jf.levenshtein_distance(_style, Styles[row]) < \
((len(_style) + len(Styles[row])/2)*0.35):
LMatches2[_style].append(Styles[row])
LMatches2
{'14011876': [], '1zb153tjkt': ['1zb153tpant'], '1zb153tpant': ['1zb153tjkt'], '2618longerdenim': [], '550pant': [], '5pktcord': [], 'backtoschool': [], 'bomberjkt': ['camdenjkt'], 'camdenjkt': ['bomberjkt'], 'cb32405510': [], 'chambrey117': [], 'clint': [], 'denimjacket': ['denimvest'], 'denimvest': [], 'dobbyshort': ['jonasshort'], 'jacket2001': ['jacket2002', 'jacketswear', 'jacket253ke'], 'jacket2002': ['jacket2001', 'jacketswear', 'jacket253ke'], 'jacket253ke': ['jacket2002', 'jacket2001', 'jacketswear'], 'jacketbktoschoolboys': ['backtoschool', 'jacketbktoschoolgirls'], 'jacketbktoschoolgirls': ['jacketbktoschoolboys'], 'jacketswear': ['jacket2002', 'jacket2001', 'jacket253ke'], 'jhonnyboxer': [], 'jonaslong': ['jonasshort'], 'jonasshort': ['jonaslong', 'dobbyshort'], 'lilentrs': [], 'lovetrs340': [], 'mixmatch': [], 'newskinny': [], 'parkerchino': ['parkochino'], 'parkochino': ['parkerchino'], 'reidunparka': ['reidunparkalining'], 'reidunparkalining': ['reidunparka'], 'ride': [], 'ride510': ['ride'], 'shairt537': ['shairt615', 'shairt564'], 'shairt564': ['shairt615', 'shairt537'], 'shairt615': ['shairt537', 'shairt564'], 'spacetreggins': [], 'toddjkt': [], 'tyedyejean': []}
I'm not sure I am approaching this in the most scientific way; I am simply playing with the numbers. If the multiplier is too large, too many matches are made; too small, and some are missed. For example, I would hope that 'Ride' and 'Ride510' would be matched, but getting there requires a multiplier of 0.5, which creates too many matches.
I will return to this later. Let's now see how the second measure holds up.
Now I turn to the Damerau Levenshtein distance.
DLMatches1 = {_style:[] for _style in Styles}
for _style in DLMatches1.keys():
for row in Styles.index:
if _style != Styles[row] and jf.damerau_levenshtein_distance(_style, Styles[row]) < 5:
DLMatches1[_style].append(Styles[row])
DLMatches1
{'14011876': [], '1zb153tjkt': ['1zb153tpant'], '1zb153tpant': ['1zb153tjkt'], '2618longerdenim': [], '550pant': [], '5pktcord': [], 'backtoschool': [], 'bomberjkt': ['camdenjkt'], 'camdenjkt': ['bomberjkt'], 'cb32405510': [], 'chambrey117': [], 'clint': ['ride'], 'denimjacket': [], 'denimvest': [], 'dobbyshort': ['jonasshort'], 'jacket2001': ['jacket2002', 'jacket253ke'], 'jacket2002': ['jacket2001', 'jacket253ke'], 'jacket253ke': ['jacket2002', 'jacket2001'], 'jacketbktoschoolboys': ['jacketbktoschoolgirls'], 'jacketbktoschoolgirls': ['jacketbktoschoolboys'], 'jacketswear': [], 'jhonnyboxer': [], 'jonaslong': ['jonasshort'], 'jonasshort': ['jonaslong', 'dobbyshort'], 'lilentrs': [], 'lovetrs340': [], 'mixmatch': [], 'newskinny': [], 'parkerchino': ['parkochino'], 'parkochino': ['parkerchino'], 'reidunparka': [], 'reidunparkalining': [], 'ride': ['ride510', 'clint'], 'ride510': ['ride'], 'shairt537': ['shairt615', 'shairt564'], 'shairt564': ['shairt615', 'shairt537'], 'shairt615': ['shairt537', 'shairt564'], 'spacetreggins': [], 'toddjkt': [], 'tyedyejean': []}
Again, the same problems with short strings seem to have occurred. But what, if anything, is the difference between the output of the two measures?
for key in DLMatches1:
if LMatches1[key] != DLMatches1[key]:
print 'This was the Leven result ' + key + " " + str(LMatches1[key])
print 'This was the D - Leven result ' + key + " " + str(DLMatches1[key])
In fact they were identical: the comparison loop printed nothing.
Now looking at the more nuanced measure matching based upon a proportion of the average string length:
DLMatches2 = {_style:[] for _style in Styles}
for _style in DLMatches2.keys():
for row in Styles.index:
if _style != Styles[row] and jf.damerau_levenshtein_distance(_style, Styles[row]) < \
((len(_style) + len(Styles[row])/2)*0.35):
DLMatches2[_style].append(Styles[row])
DLMatches2
{'14011876': [], '1zb153tjkt': ['1zb153tpant'], '1zb153tpant': ['1zb153tjkt'], '2618longerdenim': [], '550pant': [], '5pktcord': [], 'backtoschool': [], 'bomberjkt': ['camdenjkt'], 'camdenjkt': ['bomberjkt'], 'cb32405510': [], 'chambrey117': [], 'clint': [], 'denimjacket': ['denimvest'], 'denimvest': [], 'dobbyshort': ['jonasshort'], 'jacket2001': ['jacket2002', 'jacketswear', 'jacket253ke'], 'jacket2002': ['jacket2001', 'jacketswear', 'jacket253ke'], 'jacket253ke': ['jacket2002', 'jacket2001', 'jacketswear'], 'jacketbktoschoolboys': ['backtoschool', 'jacketbktoschoolgirls'], 'jacketbktoschoolgirls': ['jacketbktoschoolboys'], 'jacketswear': ['jacket2002', 'jacket2001', 'jacket253ke'], 'jhonnyboxer': [], 'jonaslong': ['jonasshort'], 'jonasshort': ['jonaslong', 'dobbyshort'], 'lilentrs': [], 'lovetrs340': [], 'mixmatch': [], 'newskinny': [], 'parkerchino': ['parkochino'], 'parkochino': ['parkerchino'], 'reidunparka': ['reidunparkalining'], 'reidunparkalining': ['reidunparka'], 'ride': [], 'ride510': ['ride'], 'shairt537': ['shairt615', 'shairt564'], 'shairt564': ['shairt615', 'shairt537'], 'shairt615': ['shairt537', 'shairt564'], 'spacetreggins': [], 'toddjkt': [], 'tyedyejean': []}
Again, this is just trial and error, but let's see whether the matches differ when the same multiplier (0.35) is used:
for key in DLMatches2:
if LMatches2[key] != DLMatches2[key]:
print 'This was the Leven result ' + key + " " + str(LMatches2[key])
print 'This was the D - Leven result ' + key + " " + str(DLMatches2[key])
print
Again, no difference.
Now I turn to the Jaro distance.
For the Jaro distance an arbitrary score again has to be chosen as the cutoff. Note that since Jaro returns a similarity between 0 and 1, the condition is now 'greater than' rather than 'less than'.
JMatches1 = {_style:[] for _style in Styles}
for _style in JMatches1.keys():
for row in Styles.index:
if _style != Styles[row] and jf.jaro_distance(_style, Styles[row]) > 0.75:
JMatches1[_style].append(Styles[row])
JMatches1
{'14011876': [], '1zb153tjkt': ['1zb153tpant'], '1zb153tpant': ['1zb153tjkt'], '2618longerdenim': [], '550pant': [], '5pktcord': [], 'backtoschool': ['jacketbktoschoolboys', 'jacketbktoschoolgirls'], 'bomberjkt': [], 'camdenjkt': [], 'cb32405510': [], 'chambrey117': [], 'clint': ['lilentrs'], 'denimjacket': ['denimvest'], 'denimvest': ['denimjacket'], 'dobbyshort': [], 'jacket2001': ['jacket2002', 'jacket253ke'], 'jacket2002': ['jacket2001', 'jacket253ke'], 'jacket253ke': ['jacket2002', 'jacket2001', 'jacketswear'], 'jacketbktoschoolboys': ['backtoschool', 'jacketbktoschoolgirls'], 'jacketbktoschoolgirls': ['backtoschool', 'jacketbktoschoolboys'], 'jacketswear': ['jacket253ke'], 'jhonnyboxer': [], 'jonaslong': ['jonasshort'], 'jonasshort': ['jonaslong'], 'lilentrs': ['clint'], 'lovetrs340': [], 'mixmatch': [], 'newskinny': [], 'parkerchino': ['parkochino'], 'parkochino': ['parkerchino'], 'reidunparka': ['reidunparkalining'], 'reidunparkalining': ['reidunparka'], 'ride': ['ride510'], 'ride510': ['ride'], 'shairt537': ['shairt615', 'shairt564'], 'shairt564': ['shairt615', 'shairt537'], 'shairt615': ['shairt537', 'shairt564'], 'spacetreggins': [], 'toddjkt': [], 'tyedyejean': []}
OK, this is where my non-scientific approach shows its limits: I have arbitrarily chosen a cutoff of greater than 0.75. Even so, the results look much closer to what I would hope for. In fact they closely match the questions I actually sent to this factory about their style names, which I drew up simply by eyeballing the data.
Let's compare:
for key in JMatches1:
if DLMatches2[key] != JMatches1[key]:
print 'This was the D- Leven result ' + key + " " + str(DLMatches2[key])
print 'This was the Jaro result ' + key + " " + str(JMatches1[key])
print
This was the D- Leven result lilentrs []
This was the Jaro result lilentrs ['clint']

This was the D- Leven result clint []
This was the Jaro result clint ['lilentrs']

This was the D- Leven result denimvest []
This was the Jaro result denimvest ['denimjacket']

This was the D- Leven result jacketswear ['jacket2002', 'jacket2001', 'jacket253ke']
This was the Jaro result jacketswear ['jacket253ke']

This was the D- Leven result camdenjkt ['bomberjkt']
This was the Jaro result camdenjkt []

This was the D- Leven result backtoschool []
This was the Jaro result backtoschool ['jacketbktoschoolboys', 'jacketbktoschoolgirls']

This was the D- Leven result jonasshort ['jonaslong', 'dobbyshort']
This was the Jaro result jonasshort ['jonaslong']

This was the D- Leven result jacketbktoschoolgirls ['jacketbktoschoolboys']
This was the Jaro result jacketbktoschoolgirls ['backtoschool', 'jacketbktoschoolboys']

This was the D- Leven result jacket2002 ['jacket2001', 'jacketswear', 'jacket253ke']
This was the Jaro result jacket2002 ['jacket2001', 'jacket253ke']

This was the D- Leven result bomberjkt ['camdenjkt']
This was the Jaro result bomberjkt []

This was the D- Leven result ride []
This was the Jaro result ride ['ride510']

This was the D- Leven result jacket2001 ['jacket2002', 'jacketswear', 'jacket253ke']
This was the Jaro result jacket2001 ['jacket2002', 'jacket253ke']

This was the D- Leven result dobbyshort ['jonasshort']
This was the Jaro result dobbyshort []
It seems to me as though the Jaro result is much more in line with what I would consider to be actual matches.
Now I turn to the Jaro Winkler distance.
JWMatches1 = {_style:[] for _style in Styles}
for _style in JWMatches1.keys():
for row in Styles.index:
if _style != Styles[row] and jf.jaro_winkler(_style, Styles[row]) > 0.75:
JWMatches1[_style].append(Styles[row])
JWMatches1
{'14011876': [], '1zb153tjkt': ['1zb153tpant'], '1zb153tpant': ['1zb153tjkt'], '2618longerdenim': [], '550pant': [], '5pktcord': [], 'backtoschool': ['jacketbktoschoolboys', 'jacketbktoschoolgirls'], 'bomberjkt': [], 'camdenjkt': [], 'cb32405510': [], 'chambrey117': [], 'clint': ['lilentrs'], 'denimjacket': ['denimvest'], 'denimvest': ['denimjacket'], 'dobbyshort': [], 'jacket2001': ['jacket2002', 'jacketswear', 'jacket253ke'], 'jacket2002': ['jacket2001', 'jacketswear', 'jacket253ke'], 'jacket253ke': ['jacket2002', 'jacket2001', 'jacketswear'], 'jacketbktoschoolboys': ['backtoschool', 'jacketbktoschoolgirls'], 'jacketbktoschoolgirls': ['backtoschool', 'jacketswear', 'jacketbktoschoolboys'], 'jacketswear': ['jacket2002', 'jacket2001', 'jacketbktoschoolgirls', 'jacket253ke'], 'jhonnyboxer': [], 'jonaslong': ['jonasshort'], 'jonasshort': ['jonaslong'], 'lilentrs': ['clint'], 'lovetrs340': [], 'mixmatch': [], 'newskinny': [], 'parkerchino': ['parkochino'], 'parkochino': ['parkerchino'], 'reidunparka': ['reidunparkalining'], 'reidunparkalining': ['reidunparka'], 'ride': ['ride510'], 'ride510': ['ride'], 'shairt537': ['shairt615', 'shairt564'], 'shairt564': ['shairt615', 'shairt537'], 'shairt615': ['shairt537', 'shairt564'], 'spacetreggins': [], 'toddjkt': [], 'tyedyejean': []}
It seems to me that, as the J-W score gives a boost to matches with a similar prefix, the threshold score for matching could be pushed a little higher to eliminate more spurious matches. But first let's see how it compares with the Jaro measure:
for key in JWMatches1:
if JMatches1[key] != JWMatches1[key]:
print 'This was the Jaro result ' + key + " " + str(JMatches1[key])
print 'This was the Jaro-Winkler result ' + key + " " + str(JWMatches1[key])
print
This was the Jaro result jacketswear ['jacket253ke']
This was the Jaro-Winkler result jacketswear ['jacket2002', 'jacket2001', 'jacketbktoschoolgirls', 'jacket253ke']

This was the Jaro result jacketbktoschoolgirls ['backtoschool', 'jacketbktoschoolboys']
This was the Jaro-Winkler result jacketbktoschoolgirls ['backtoschool', 'jacketswear', 'jacketbktoschoolboys']

This was the Jaro result jacket2002 ['jacket2001', 'jacket253ke']
This was the Jaro-Winkler result jacket2002 ['jacket2001', 'jacketswear', 'jacket253ke']

This was the Jaro result jacket2001 ['jacket2002', 'jacket253ke']
This was the Jaro-Winkler result jacket2001 ['jacket2002', 'jacketswear', 'jacket253ke']
The prefix boost associated with the J-W measure is pushing styles with the same prefix higher. As prefixes are likely to be common when working with style names, this is perhaps a disadvantage of the method.
One fix might be to push the threshold higher.
JWMatches2 = {_style:[] for _style in Styles}
for _style in JWMatches2.keys():
for row in Styles.index:
if _style != Styles[row] and jf.jaro_winkler(_style, Styles[row]) > 0.8:
JWMatches2[_style].append(Styles[row])
JWMatches2
{'14011876': [], '1zb153tjkt': ['1zb153tpant'], '1zb153tpant': ['1zb153tjkt'], '2618longerdenim': [], '550pant': [], '5pktcord': [], 'backtoschool': ['jacketbktoschoolboys', 'jacketbktoschoolgirls'], 'bomberjkt': [], 'camdenjkt': [], 'cb32405510': [], 'chambrey117': [], 'clint': [], 'denimjacket': ['denimvest'], 'denimvest': ['denimjacket'], 'dobbyshort': [], 'jacket2001': ['jacket2002', 'jacketswear', 'jacket253ke'], 'jacket2002': ['jacket2001', 'jacketswear', 'jacket253ke'], 'jacket253ke': ['jacket2002', 'jacket2001', 'jacketswear'], 'jacketbktoschoolboys': ['backtoschool', 'jacketbktoschoolgirls'], 'jacketbktoschoolgirls': ['backtoschool', 'jacketswear', 'jacketbktoschoolboys'], 'jacketswear': ['jacket2002', 'jacket2001', 'jacketbktoschoolgirls', 'jacket253ke'], 'jhonnyboxer': [], 'jonaslong': ['jonasshort'], 'jonasshort': ['jonaslong'], 'lilentrs': [], 'lovetrs340': [], 'mixmatch': [], 'newskinny': [], 'parkerchino': ['parkochino'], 'parkochino': ['parkerchino'], 'reidunparka': ['reidunparkalining'], 'reidunparkalining': ['reidunparka'], 'ride': ['ride510'], 'ride510': ['ride'], 'shairt537': ['shairt615', 'shairt564'], 'shairt564': ['shairt615', 'shairt537'], 'shairt615': ['shairt537', 'shairt564'], 'spacetreggins': [], 'toddjkt': [], 'tyedyejean': []}
for key in JWMatches2:
if JMatches1[key] != JWMatches2[key]:
print 'This was the Jaro result ' + key + " " + str(JMatches1[key])
print 'This was the Jaro-Winkler (0.8) result ' + key + " " + str(JWMatches2[key])
print
This was the Jaro result lilentrs ['clint']
This was the Jaro-Winkler (0.8) result lilentrs []

This was the Jaro result clint ['lilentrs']
This was the Jaro-Winkler (0.8) result clint []

This was the Jaro result jacketswear ['jacket253ke']
This was the Jaro-Winkler (0.8) result jacketswear ['jacket2002', 'jacket2001', 'jacketbktoschoolgirls', 'jacket253ke']

This was the Jaro result jacketbktoschoolgirls ['backtoschool', 'jacketbktoschoolboys']
This was the Jaro-Winkler (0.8) result jacketbktoschoolgirls ['backtoschool', 'jacketswear', 'jacketbktoschoolboys']

This was the Jaro result jacket2002 ['jacket2001', 'jacket253ke']
This was the Jaro-Winkler (0.8) result jacket2002 ['jacket2001', 'jacketswear', 'jacket253ke']

This was the Jaro result jacket2001 ['jacket2002', 'jacket253ke']
This was the Jaro-Winkler (0.8) result jacket2001 ['jacket2002', 'jacketswear', 'jacket253ke']
In fact no: increasing the J-W cutoff by even 0.05 is not enough to get rid of the effect of the prefix on the 'jacket' styles, although it does drop the spurious 'clint'/'lilentrs' pairing.
It is too soon to say exactly which of the measures will be most useful, or to what extent they can help. If you are reading this notebook as a non-colleague, the usefulness of the measures will depend upon the type of strings you are analyzing, and I daresay the measures I did not consider could also be useful.
For my purposes the Jaro measure seems the most promising, although in reality I will use all four measures and compare their outputs to get a sense of where my strings might be matching.
It is possible that I may return to this notebook and add further thoughts on experimentation at a later date.
One thing these algorithms may not help with is where style names are incorporated into much larger strings. To identify those as matches, some sort of algorithm based upon the matching of subsequences may be needed. More on this to follow; possibly I will write my own.
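As a first sketch of that idea, the standard library's difflib.SequenceMatcher can already locate a style name embedded in a longer string (the longer string below is a hypothetical example, not factory data):
import difflib

style = 'ride510'
cell = 'line3 ride510 repeat order'  # hypothetical messy production entry
sm = difflib.SequenceMatcher(None, style, cell)
m = sm.find_longest_match(0, len(style), 0, len(cell))
print(cell[m.b:m.b + m.size])  # 'ride510' -- the embedded style name
print(sm.ratio())              # overall similarity of the two strings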