In previous labs, we've explored the power tables as a data management abstraction, in particular with the Pandas DataFrame object. Tables let us select rows and columns of interest, group data, and measure aggregates.
But what happens when we have more than one table? Traditional relational databases usually contain many tables. Moreover, when integrating multiple data sets, we necessarily need tools to combine them.
In this lab, we will use Panda's take on the database join operation to see how tables can be linked together. Specifically, we're going to perform a "fuzzy join" based on string edit-distance as another approach to finding duplicate records.
Today we'll be using a small data set of restaurants. Download the data from here. Put the data file, "restaurants.csv", in the same directory as this notebook.
We're going to be using a string-similarity python library to compute "edit distance". Install it on your VM by running the following:
sudo apt-get install python-levenshtein
NOTE: You may also need to run sudo apt-get update
.
To test that it works, the following should run OK:
import Levenshtein as L
A join is a way to connect rows in two different data tables based on some criteria. Suppose the university has a database for student records with two tables in it: Students and Grades.
import pandas as pd
Students = pd.DataFrame({'student_id': [1, 2], 'name': ['Alice', 'Bob']})
Students
name | student_id | |
---|---|---|
0 | Alice | 1 |
1 | Bob | 2 |
2 rows × 2 columns
Grades = pd.DataFrame({'student_id': [1, 1, 2, 2], 'class_id': [1, 2, 1, 3], 'grade': ['A', 'C', 'B', 'B']})
Grades
class_id | grade | student_id | |
---|---|---|---|
0 | 1 | A | 1 |
1 | 2 | C | 1 |
2 | 1 | B | 2 |
3 | 3 | B | 2 |
4 rows × 3 columns
Let's say we want to know all of Bob's grades. Then, we can look up Bob's student ID in the Students table, and with the ID, look up his grades in the Grades table. Joins naturally express this process: when two tables share a common type of column (student ID in this case), we can join the tables together to get a complete view.
In Pandas, we can use the merge method to perform a join. Pass the two tables to join as the first arguments, then the "on" parameter is set to the join column name.
pd.merge(Students, Grades, on='student_id')
name | student_id | class_id | grade | |
---|---|---|---|---|
0 | Alice | 1 | 1 | A |
1 | Alice | 1 | 2 | C |
2 | Bob | 2 | 1 | B |
3 | Bob | 2 | 3 | B |
4 rows × 4 columns
Classes = pd.DataFrame({'class_id': [1, 2, 3], 'title': ['Math', 'English', 'Spanish']})
pd.merge(pd.merge(Students, Grades, on='student_id'), Classes, on='class_id')
name | student_id | class_id | grade | title | |
---|---|---|---|---|---|
0 | Alice | 1 | 1 | A | Math |
1 | Bob | 2 | 1 | B | Math |
2 | Alice | 1 | 2 | C | English |
3 | Bob | 2 | 3 | B | Spanish |
4 rows × 5 columns
Now let's load the restaurant data that we will be analyzing:
resto = pd.read_csv('restaurants.csv')
resto.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 858 entries, 0 to 857 Data columns (total 4 columns): id 858 non-null int64 cluster 858 non-null int64 name 858 non-null object city 858 non-null object dtypes: int64(2), object(2)
resto[:10]
id | cluster | name | city | |
---|---|---|---|---|
0 | 560 | 453 | 2223 | san francisco |
1 | 781 | 675 | 103 west | atlanta |
2 | 279 | 172 | 20 mott | new york |
3 | 43 | 23 | 21 club | new york |
4 | 44 | 23 | 21 club | new york city |
5 | 280 | 173 | 9 jones street | new york |
6 | 486 | 379 | abbey | atlanta |
7 | 145 | 74 | abruzzi | atlanta |
8 | 146 | 74 | abruzzi | atlanta |
9 | 561 | 454 | acquarello | san francisco |
10 rows × 4 columns
The restaurant data has four columns. id is a unique ID field (unique for each row), name is the name of the restaurant, and city is where it is located. The fourth column, cluster, is a "gold standard" column. If two records have the same cluster, that means they are both about the same restaurant.
The type of join we made above between Students and Grades, where we link records with equal values in a common column, is called an equijoin. Equijoins may join on more than one column, too (both value have to match).
Let's use an equijoin to find pairs of duplicate restaurant records. We join the data to itself, on the cluster column.
Note: a join between a table and itself is called a self-join.
The result ("clusters" below) has a lot of extra records in it. For example, since we're joining a table to itself, every record matches itself. We can filter on IDs to get rid of these extra join results. Note that when Pandas joins two tables that have columns with the same name, it appends "_x" and "_y" to the names to distinguish them.
clusters = pd.merge(resto, resto, on='cluster')
clusters = clusters[clusters.id_x != clusters.id_y]
clusters[:10]
id_x | cluster | name_x | city_x | id_y | name_y | city_y | |
---|---|---|---|---|---|---|---|
4 | 43 | 23 | 21 club | new york | 44 | 21 club | new york city |
5 | 44 | 23 | 21 club | new york city | 43 | 21 club | new york |
10 | 145 | 74 | abruzzi | atlanta | 146 | abruzzi | atlanta |
11 | 146 | 74 | abruzzi | atlanta | 145 | abruzzi | atlanta |
20 | 184 | 94 | alain rondelli | san francisco | 185 | alain rondelli | san francisco |
21 | 185 | 94 | alain rondelli | san francisco | 184 | alain rondelli | san francisco |
36 | 186 | 95 | aqua | san francisco | 187 | aqua | san francisco |
37 | 187 | 95 | aqua | san francisco | 186 | aqua | san francisco |
40 | 45 | 24 | aquavit | new york | 46 | aquavit | new york city |
41 | 46 | 24 | aquavit | new york city | 45 | aquavit | new york |
10 rows × 7 columns
Filter clusters so that we only keep one instance of each matching pair (HINT: use the IDs again).
clusters = clusters[clusters.id_x < clusters.id_y]
clusters[:10]
id_x | cluster | name_x | city_x | id_y | name_y | city_y | |
---|---|---|---|---|---|---|---|
4 | 43 | 23 | 21 club | new york | 44 | 21 club | new york city |
10 | 145 | 74 | abruzzi | atlanta | 146 | abruzzi | atlanta |
20 | 184 | 94 | alain rondelli | san francisco | 185 | alain rondelli | san francisco |
36 | 186 | 95 | aqua | san francisco | 187 | aqua | san francisco |
40 | 45 | 24 | aquavit | new york | 46 | aquavit | new york city |
46 | 1 | 0 | arnie morton's of chicago | los angeles | 2 | arnie morton's of chicago | los angeles |
51 | 3 | 1 | art's delicatessen | studio city | 4 | art's deli | studio city |
58 | 47 | 25 | aureole | new york | 48 | aureole | new york city |
62 | 147 | 75 | bacchanalia | atlanta | 148 | bacchanalia | atlanta |
79 | 5 | 2 | hotel bel-air | bel air | 6 | bel-air hotel | bel air |
10 rows × 7 columns
Sometimes an equijoin isn't good enough.
Say you want to match up records that are almost equal in a column. Or where a function of a columns is equal. Or maybe you don't care about equality: maybe "less than" or "greater than or equal to" is what you want. These cases call for a more general join than equijoin.
We are going to make one of these joins between the restaurants data and itself. Specifically, we want to match up pairs of records whose restaurant names are almost the same. We call this a fuzzy join.
To do a fuzzy join in Pandas we need to go about it in a few steps:
SQL Aside: In SQL, all of joins are supported in about the same way as equijoins are. Essentially, you write a boolean expression on columns from the join-tables, and whenever that expression is true, you join the records together. This is very similar to writing an if statement in python or Java.
Let's do an example to get the hang of it.
We use a "dummy" column to compute the Cartesian product of the data with itself. dummy takes the same value for every record, so we can do an equijoin and get back all pairs.
resto['dummy'] = 0
prod = pd.merge(resto, resto, on='dummy')
# Clean up
del prod['dummy']
del resto['dummy']
# Show that prod is the size of "resto" squared:
print len(prod), len(resto)**2
736164 736164
prod[:10]
id_x | cluster_x | name_x | city_x | id_y | cluster_y | name_y | city_y | |
---|---|---|---|---|---|---|---|---|
0 | 560 | 453 | 2223 | san francisco | 560 | 453 | 2223 | san francisco |
1 | 560 | 453 | 2223 | san francisco | 781 | 675 | 103 west | atlanta |
2 | 560 | 453 | 2223 | san francisco | 279 | 172 | 20 mott | new york |
3 | 560 | 453 | 2223 | san francisco | 43 | 23 | 21 club | new york |
4 | 560 | 453 | 2223 | san francisco | 44 | 23 | 21 club | new york city |
5 | 560 | 453 | 2223 | san francisco | 280 | 173 | 9 jones street | new york |
6 | 560 | 453 | 2223 | san francisco | 486 | 379 | abbey | atlanta |
7 | 560 | 453 | 2223 | san francisco | 145 | 74 | abruzzi | atlanta |
8 | 560 | 453 | 2223 | san francisco | 146 | 74 | abruzzi | atlanta |
9 | 560 | 453 | 2223 | san francisco | 561 | 454 | acquarello | san francisco |
10 rows × 8 columns
prod = prod[prod.id_x < prod.id_y]
prod[:10]
id_x | cluster_x | name_x | city_x | id_y | cluster_y | name_y | city_y | |
---|---|---|---|---|---|---|---|---|
1 | 560 | 453 | 2223 | san francisco | 781 | 675 | 103 west | atlanta |
9 | 560 | 453 | 2223 | san francisco | 561 | 454 | acquarello | san francisco |
12 | 560 | 453 | 2223 | san francisco | 708 | 602 | afghan kebab house | new york city |
20 | 560 | 453 | 2223 | san francisco | 782 | 676 | alon's at the terrace | atlanta |
24 | 560 | 453 | 2223 | san francisco | 762 | 656 | andre's french restaurant | las vegas |
28 | 560 | 453 | 2223 | san francisco | 640 | 534 | apple pan the | west la |
33 | 560 | 453 | 2223 | san francisco | 709 | 603 | arcadia | new york city |
40 | 560 | 453 | 2223 | san francisco | 641 | 535 | asahi ramen | west la |
47 | 560 | 453 | 2223 | san francisco | 642 | 536 | baja fresh | westlake village |
48 | 560 | 453 | 2223 | san francisco | 783 | 677 | baker's cajun cafe | atlanta |
10 rows × 8 columns
In the homework assignment, we used a string similarity metric called cosine similarity which measured how many "tokens" two strings shared in common. Now, we're going to use an alternative measure of string similarity called edit-distance. Edit-distance counts the number of simple changes you have to make to a string to turn it into another string.
Import the edit distance library:
import Levenshtein as L
L.distance('Hello, World!', 'Hallo, World!')
1
Next, we add a computed column, named distance, that measures the edit distance between the names of two restaurants:
# This takes a minute or two to run
prod['distance'] = prod.apply(lambda r: L.distance(r['name_x'], r['name_y']), axis=1)
Now we complete the join by filtering out pairs or records that aren't equal enough for our liking. As in the first homework assignment, we can only figure out how similar is "similar enough" by trying out some different options. Let's try maximum edit-distance from 0 to 10 and compute precision and recall.
%matplotlib inline
import pylab
def accuracy(max_distance):
similar = prod[prod.distance < max_distance]
correct = float(sum(similar.cluster_x == similar.cluster_y))
precision = correct / len(similar)
recall = correct / len(clusters)
return (precision, recall)
thresholds = range(1, 11)
p = []
r = []
for t in thresholds:
acc = accuracy(t)
p.append(acc[0])
r.append(acc[1])
pylab.plot(thresholds, p)
pylab.plot(thresholds, r)
pylab.legend(['precision', 'recall'], loc=2)
<matplotlib.legend.Legend at 0x11da6f950>
Create a scatterplot with precision on one axis and recall on the other. Where are "good" points on the plot, and where are "bad" ones.
pylab.scatter(p, r)
# Top-right of chart is "good" because precision and recall are both high (close to 1)
<matplotlib.collections.PathCollection at 0x11d9969d0>
L.ratio(s1, s1)
).ratio
gives a similarity score between 0 and 1, with higher meaning more similar.
Add a column to "prod" with the ratio
similarities of the name columns, and redo the precision/recall tradeoff analysis with the new metric.
(Note: you will have to alter the accuracy
method and the threshold range.)
On this data, does Levenshtein.ratio
do better than Levenshtein.distance
?
prod['ratio'] = prod.apply(lambda r: L.ratio(r['name_x'], r['name_y']), axis=1)
prod[:10]
id_x | cluster_x | name_x | city_x | id_y | cluster_y | name_y | city_y | distance | ratio | |
---|---|---|---|---|---|---|---|---|---|---|
1 | 560 | 453 | 2223 | san francisco | 781 | 675 | 103 west | atlanta | 8 | 0.166667 |
9 | 560 | 453 | 2223 | san francisco | 561 | 454 | acquarello | san francisco | 10 | 0.000000 |
12 | 560 | 453 | 2223 | san francisco | 708 | 602 | afghan kebab house | new york city | 18 | 0.000000 |
20 | 560 | 453 | 2223 | san francisco | 782 | 676 | alon's at the terrace | atlanta | 21 | 0.000000 |
24 | 560 | 453 | 2223 | san francisco | 762 | 656 | andre's french restaurant | las vegas | 25 | 0.000000 |
28 | 560 | 453 | 2223 | san francisco | 640 | 534 | apple pan the | west la | 14 | 0.000000 |
33 | 560 | 453 | 2223 | san francisco | 709 | 603 | arcadia | new york city | 7 | 0.000000 |
40 | 560 | 453 | 2223 | san francisco | 641 | 535 | asahi ramen | west la | 11 | 0.000000 |
47 | 560 | 453 | 2223 | san francisco | 642 | 536 | baja fresh | westlake village | 10 | 0.000000 |
48 | 560 | 453 | 2223 | san francisco | 783 | 677 | baker's cajun cafe | atlanta | 18 | 0.000000 |
10 rows × 10 columns
def accuracy_ratio(min_ratio):
similar = prod[prod.ratio > min_ratio]
correct = float(sum(similar.cluster_x == similar.cluster_y))
precision = correct / len(similar)
recall = correct / len(clusters)
return (precision, recall)
thresholds = range(1, 10)
p = []
r = []
for t in thresholds:
acc = accuracy_ratio(float(t)/10)
p.append(acc[0])
r.append(acc[1])
pylab.plot(thresholds, p)
pylab.plot(thresholds, r)
pylab.legend(['precision', 'recall'], loc=2)
<matplotlib.legend.Legend at 0x11a465b90>
pylab.scatter(p, r)
# Ratio appears to do a little better than distance---it achieves higher precision at reasonable levels of recall (say >0.8)
<matplotlib.collections.PathCollection at 0x11a497810>