CS194-16 Introduction to Data Science
NOTE: Click near here to select this cell; Esc-Enter will get you into cell edit mode, and Shift-Enter gets you back.
Name: Please put your name
Student ID: Please put your student ID
Entity resolution is a common, yet difficult problem in data cleaning and integration. In this assignment, we will use powerful and scalable text analysis techniques to perform entity resolution across two data sets of commercial products.
Entity resolution, also known as record deduplication, is the process of identifying rows in one or more data sets that refer to the same real-world entity. Consider an example. You're on eBay looking for a hip data science accessory, but you're on a budget, so you decide to scrape the eBay listings for a few days to get a feel for the market. Unfortunately, the listings are confusing and you don't know how to aggregate them. Entity resolution to the rescue! You find an authoritative database and map all the eBay listings to it. Now you can comparison shop, get that sweet Keuffel and Esser for way cheap, and impress all the data hipsters.
But finding matching records is a hard problem in general. A major reason is that the criteria for identifying duplicates are often vague and impossible to encode in rules. In addition, from a purely computational perspective, the problem is quadratic in the size of its inputs: naively, all pairs of records need to be compared to find all the duplicates. In this assignment, we will begin to address both these challenges.
Your assignment is to perform entity resolution over two web-scraped data sets of commercial product listings, one from Amazon and one from Google. The goal is to build a unified database of all products listed on the Internet: a one-stop shop for all your shopping needs. (Elevator pitch: it's like Kayak.com for e-commerce!)
The web has made unprecedented amounts of data available publicly, but scraped data frequently needs to be de-duplicated. These data sets are typical examples of what you can collect with some simple scripting. The data is not especially large (just a few thousand records), but even so, you will find that entity resolution is a major challenge (top results with this data are ~50% success rate). Don't get discouraged; the goal is to get acquainted with techniques to tackle the problem, and apply them to a representative example.
Data files for this assignment can be found at:
https://github.com/biddata/datascience/raw/master/F15/hw1/hw1data.tar.gz
The archive includes the following files: Google.csv and Amazon.csv (the complete data sets), Google_small.csv and Amazon_small.csv (samples), stopwords.txt, and a gold standard mapping file.
Besides the complete data files, there are "sample" data files for each data set. Use these for Part 1. In addition, there is a "gold standard" file that contains all of the true mappings between duplicate entities in the two data sets. Every row in the gold standard file has a pair of record IDs (one Google, one Amazon) that belong to two records that describe the same thing in the real world. We will use the gold standard to evaluate our algorithms.
Complete all the exercises below and turn in a write-up in the form of an IPython notebook, that is, an .ipynb file. The write-up should include your code, answers to exercise questions, and plots of results. Use the submission process described on the bCourses page for this homework.
You can use this notebook and fill in answers inline, or, if you prefer, do your write-up in a separate notebook. In this notebook, we provide code templates for many of the exercises. They are intended to help with code reuse, since the exercises build on each other, and are highly recommended. Don't forget to include answers to questions that ask for natural language responses, i.e., in English, not code!
We would prefer to test some of your code automatically, so please try to submit a notebook that uses the function names requested by the questions and that can be executed with "Cell > Run all".
This assignment can be done with basic Python and matplotlib. Feel free to use pandas, too, which you may find well suited to several exercises. As for other libraries, please check with course staff whether they're allowed. In general, we want you to use whatever is comfortable, except for libraries (e.g., NLTK) that include functionality covered in the assignment.
You're not required to do your coding in IPython, so feel free to use your favorite editor or IDE. But when you're done, remember to put your code into a notebook for your write-up.
This assignment is to be done individually. Everyone should be getting hands-on experience in this course. You are free to discuss course material with fellow students, and we encourage you to use Internet resources to aid your understanding, but the work you turn in, including all code and answers, must be your own work.
Download the data and uncompress/unarchive it (tar -xzf will do it). Read each file in from the file system, and store them as lists of lines.
For each of the data files ("Google.csv", "Amazon.csv", and the samples), we want to parse the IDs out of each record. The IDs are the first column of the file (they are URLs for Google, and alphanumeric strings for Amazon). Omitting the headers, load these data files into dictionaries mapping ID to a single string containing the rest of the record, excluding the price field. Insert spaces between the other fields to make this string.
# Please preserve the format of this line so we can use it for automated testing.
DATA_PATH = "" # Make this the /path/to/the/data`
import csv
# TODO Load data files here...
def loadprodfile(fname):
pass
# amazon_small = loadprodfile(DATA_PATH + "Amazon_small.csv")
# google_small = loadprodfile(DATA_PATH + "Google_small.csv")
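For reference, here is one possible sketch of loadprodfile. It assumes the layout described above: the ID in the first column, the price in the last column, and a header row to skip (the exact field names are not needed for this step). Depending on your Python version and the file encodings, you may need to open the files in binary mode or decode them explicitly.

import csv

def loadprodfile(fname):
    """Map each record ID to the rest of the record (minus the price), as one string."""
    records = {}
    with open(fname) as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for row in reader:
            if not row:
                continue
            # row[0] is the ID; row[-1] is the price, which we drop.
            records[row[0]] = " ".join(row[1:-1])
    return records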
A simple approach to entity resolution is to treat all records as strings and compute their similarity with a string distance function. In this section, we will build some components for bag-of-words text-analysis, and use them to compute record similarity.
Bag-of-words is a conceptually simple yet powerful approach to text analysis. The idea is to treat strings, a.k.a. documents, as unordered collections of words, or tokens, i.e., as bags of words.
Note on terminology: "token" is more general than what we ordinarily mean by "word" and includes things like numbers, acronyms, and other exotica like word-roots and fixed-length character strings. Bag of words techniques all apply to any sort of token, so when we say "bag-of-words" we really mean "bag-of-tokens," strictly speaking.
Tokens become the atomic unit of text comparison. If we want to compare two documents, we count how many tokens they share in common. If we want to search for documents with keyword queries (this is what Google does), then we turn the keywords into tokens and find documents that contain them.
The power of this approach is that it makes string comparisons insensitive to small differences that probably do not affect meaning much, for example, punctuation and word order.
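As a tiny illustration (not one of the exercises), comparing two strings as bags of words boils down to counting the tokens they share:

from collections import Counter

bag1 = Counter("the quick brown fox".split())
bag2 = Counter("the lazy brown dog".split())

shared = set(bag1) & set(bag2)
print(shared)       # the two shared tokens: 'the' and 'brown'
print(len(shared))  # a crude similarity score: 2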
a. Implement the function simple_tokenize(string) that takes a string and returns a list of the tokens in the string. simple_tokenize should split strings using the provided regular expression (it's OK to use re.split). Since we want to make token-matching case insensitive, make sure all tokens are lower-case. Give an interpretation, in natural language, of what the regular expression, split_regex, matches.
import re
quickbrownfox = "A quick brown fox jumps over the lazy dog."
split_regex = r'\W+'
# TODO Implement this
def simple_tokenize(string):
    pass
print simple_tokenize(quickbrownfox) # Should give ['a', 'quick', 'brown', ... ]
# Simple testcases
assert(simple_tokenize(" ") == [])
assert(simple_tokenize("!!!!123A/456_B/789C.123A") == ["123a","456_b","789c","123a"])
assert(simple_tokenize(quickbrownfox) ==
['a','quick','brown','fox','jumps','over','the','lazy','dog'])
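A minimal sketch that would pass the test cases above, using re.split with the split_regex defined in the cell and filtering out the empty strings that re.split can produce:

def simple_tokenize(string):
    # Lower-case first, split on runs of non-word characters, drop empty tokens.
    return [t for t in re.split(split_regex, string.lower()) if t]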
TODO Answer question a above here (click here, hit esc-Enter to edit, enter your answer, then shift-Enter to exit)
b. Stopwords are common words that do not contribute much to the content or meaning of a document (e.g., "the", "a", "is", "to", etc.). Stopwords add noise to bag-of-words comparisons, so they are usually excluded. Using the included file "stopwords.txt", implement tokenize, an improved tokenizer that does not emit stopwords.
# TODO Implement this
def tokenize(string):
    pass
print tokenize(quickbrownfox) # Should give ['quick', 'brown', ... ]
assert(tokenize("Why a the?") == [])
assert(tokenize(quickbrownfox) == ['quick','brown','fox','jumps','lazy','dog'])
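A sketch of tokenize. It assumes stopwords.txt contains one lower-case stopword per line and sits under DATA_PATH, and that the list includes words like "why", "a", and "the" (otherwise the asserts above will not hold).

# Load the stopword list once (assumed: one lower-case word per line).
stopwords = set(line.strip() for line in open(DATA_PATH + "stopwords.txt"))

def tokenize(string):
    # Tokenize as before, then drop anything in the stopword set.
    return [t for t in simple_tokenize(string) if t not in stopwords]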
c. Now let's tokenize the two small data sets. For each one, build a dictionary of tokens, i.e., a dictionary where the record IDs are the keys and the output of tokenize is the values. Include tokens for the name, description, and manufacturer fields, but not the price field. How many tokens, total, are there in the two data sets? Which Amazon record has the largest number of tokens?
def makedict(items):
    pass
# TODO Compute these (dict() or DataFrame OK)
amazon_rec2tok = makedict(amazon_small)
google_rec2tok = makedict(google_small)
total_tokens = 0
print 'There are %s tokens in the combined data sets' % total_tokens
biggest_record = ""
biggest_len = 0
print 'The Amazon record with ID "%s" has the most tokens (%d)' % (biggest_record, biggest_len)
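A sketch of makedict, assuming items is one of the Part 0 dictionaries (record ID mapped to the concatenated name/title, description, and manufacturer string, with the price already removed). The token total is then the sum of the token-list lengths over both dictionaries, and the largest Amazon record is the key of amazon_rec2tok whose token list is longest.

def makedict(items):
    # Map each record ID to the list of (non-stopword) tokens in its text.
    return dict((rec_id, tokenize(text)) for rec_id, text in items.items())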
Bag-of-words comparisons are not very good when all tokens are treated the same: some tokens are more important than others. Weights give us a way to specify which tokens to favor. With weights, when we compare documents, instead of counting common tokens, we sum up the weights of common tokens.
A good heuristic for assigning weights is called "Term-Frequency/Inverse-Document-Frequency," or TF-IDF for short.
TF rewards tokens that appear many times in the same document. It is computed as the frequency of a token in a document: if a token $t$ appears 5 times in a document $d$, then the TF of $t$ in $d$ is 5. The intuition for TF is that if a word occurs often in a document, then it is more important to the meaning of the document.

IDF rewards tokens that are rare across the documents in a data set. The intuition is that it is more significant if two documents share a rare word than a common one. The IDF weight for a token, $t$, in a set of documents, $U$, is computed as follows:

$$\mathrm{IDF}(t, U) = \log\left(\frac{N}{n(t)}\right)$$

where $N = |U|$ is the total number of documents in $U$ and $n(t)$ is the number of documents in $U$ that contain $t$ (the natural log is used here).

Note we are actually using the log of the inverse document frequency, which typically works better than the raw inverse frequency. But the terminology "IDF" still applies.
Finally, to bring it all together, the total TF-IDF weight for a token in a document is the product of its TF and IDF weights.
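For concreteness, a small worked example with made-up numbers (not taken from the assignment data): suppose a token appears 3 times in a document, and 25 of the 400 documents in the corpus contain it.

import math

tf_weight = 3                        # the token appears 3 times in this document
idf_weight = math.log(400.0 / 25)    # 400 documents total, 25 contain the token
print(tf_weight * idf_weight)        # 3 * ln(16), roughly 8.32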
a. Implement tf(tokens) that takes a list of tokens belonging to a single document and returns a dictionary mapping tokens to TF weights.
# TODO Implement this
def tf(tokens):
    pass
print tf(tokenize(quickbrownfox + quickbrownfox)) # Should give {'brown': 2, 'lazy': 2, 'jumps': 2...}
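A possible sketch of tf, using raw counts as the weights (matching the definition above):

def tf(tokens):
    # Count occurrences of each token in this single document.
    counts = {}
    for t in tokens:
        counts[t] = counts.get(t, 0) + 1
    return counts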
b. Implement idfs that assigns an IDF weight to every unique token in a collection of data called corpus. You may structure corpus however you want, but idfs should return a dictionary mapping tokens to weights. Use idfs to compute the IDF weights for all tokens in the combined small data sets. How many unique tokens are there?
import math
# TODO Implement this
def idfs(corpus):
    pass
idfs_small = {} # TODO use idfs here
unique_tokens = len(idfs_small)
print "There are %s unique tokens in the small data sets." % unique_tokens
idfs_small['product'] # should be 2.436116509460426
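One sketch of idfs. It assumes corpus is a dictionary mapping record IDs to token lists (for example, the two dictionaries from exercise 1c merged together) and uses the natural log:

import math

def idfs(corpus):
    num_docs = len(corpus)
    doc_counts = {}                   # token -> number of documents containing it
    for tokens in corpus.values():
        for t in set(tokens):         # count each token at most once per document
            doc_counts[t] = doc_counts.get(t, 0) + 1
    return dict((t, math.log(float(num_docs) / n)) for t, n in doc_counts.items())

# One way to build the combined corpus (assumes the exercise 1c dictionaries):
# combined = dict(amazon_rec2tok)
# combined.update(google_rec2tok)
# idfs_small = idfs(combined)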
c. What are the 10 tokens with the smallest IDF in the combined small data set? Do you think they are useful for search? Why or why not?
smallest_idf_tokens = [] # TODO Compute me
print smallest_idf_tokens
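A one-line sketch for the ten smallest-IDF tokens, sorting the dictionary's keys by their IDF values:

smallest_idf_tokens = sorted(idfs_small, key=idfs_small.get)[:10]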
TODO Answer question c here (click, esc-Enter, edit, shift-Enter)
d. Plot a histogram of IDF values. Use 20 buckets for the histogram. What conclusions can you draw from the distribution of weights?
import pylab
%matplotlib inline
# TODO Make a plot. HINT: You can use pylab.hist
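A minimal sketch of the plot, following the hint:

pylab.hist(list(idfs_small.values()), bins=20)
pylab.xlabel('IDF weight')
pylab.ylabel('Number of tokens')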
TODO Answer question d
e. Use tf to implement tfidf(tokens, idfs) that takes a list of tokens from a document and a dictionary of IDF weights and returns a dictionary mapping tokens to total TF-IDF weight. Use tfidf to compute the weights of Amazon product record 'b000hkgj8k'.
# TODO Implement this
def tfidf(tokens, idfs):
    pass
rec_b000hkgj8k_weights = {} # TODO Fix me
print "Amazon record 'b000hkgj8k' has tokens and weights:\n%s" % rec_b000hkgj8k_weights
Now we are ready to do text comparisons in a formal way. The metric of string distance we will use is called cosine similarity. We will treat each document as a vector in some high dimensional space. Then, to compare two documents we compute the cosine of the angle between their two document vectors. This is easier than it sounds.
The first question to answer is how do we represent documents as vectors? The answer is familiar: bag-of-words! We treat each unique token as a dimension, and treat token weights as magnitudes in their respective token dimensions. For example, suppose we use simple counts as weights, and we want to interpret the string "Hello, world! Goodbye, world!" as a vector. Then in the "hello" and "goodbye" dimensions the vector has value 1, in the "world" dimension it has value 2, and it is zero in all other dimensions.
Next question is: given two vectors how do we find the cosine of the angle between them? Recall the formula for the dot product of two vectors:
$$a \cdot b = \|a\| \, \|b\| \cos\theta$$

Here $a \cdot b = \sum_{i=1}^{n} a_i b_i$ is the ordinary dot product of the two vectors, and $\|a\| = \sqrt{\sum_{i=1}^{n} a_i^2}$ is the norm of $a$.
We can rearrange terms and solve for the cosine to find it is simply the normalized dot product of the vectors. With our vector model, the dot product and norm computations are simple functions of the bag-of-words document representations, so we now have a formal way to compute similarity:
$$\text{similarity} = \cos\theta = \frac{a \cdot b}{\|a\| \, \|b\|} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \, \sqrt{\sum_{i=1}^{n} b_i^2}}$$

Setting aside the algebra, the geometric interpretation is more intuitive. The angle between two document vectors is small if they share many tokens in common, because they are pointing in roughly the same direction. Then the cosine of the angle will be large. Otherwise, if the angle is large (and they have few words in common), the cosine is small. So the cosine scales proportionally with our intuitive sense of similarity.
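As a quick worked example with simple count weights (not the TF-IDF weights used below), compare "Hello, world! Goodbye, world!" with "Hello, world!":

import math

a = {'hello': 1, 'world': 2, 'goodbye': 1}   # "Hello, world! Goodbye, world!"
b = {'hello': 1, 'world': 1}                 # "Hello, world!"

dot = sum(a[t] * b[t] for t in a if t in b)          # 1*1 + 2*1 = 3
norm_a = math.sqrt(sum(v * v for v in a.values()))   # sqrt(6)
norm_b = math.sqrt(sum(v * v for v in b.values()))   # sqrt(2)
print(dot / (norm_a * norm_b))                       # about 0.866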
a. Implement cosine_similarity(tokens1, tokens2, idfs) that takes two lists of tokens and computes their cosine similarity in the context of some global IDF weights. Use tfidf and the IDF weights from exercise 2b.
import math
# Optional utility
def dotprod(a, b):
    pass
# Optional utility
def norm(a):
    pass
# Optional freebie
def cossim(a, b):
    return dotprod(a, b) / norm(a) / norm(b)
test_vec1 = {'foo': 2, 'bar': 3, 'baz': 5 }
test_vec2 = {'foo': 1, 'bax': 5, 'baz': 4 }
print dotprod(test_vec1, test_vec2), norm(test_vec1) # Should be 22 6.16441400297
# TODO Implement this
def cosine_similarity(tokens1, tokens2, idfs):
    pass

print cosine_similarity(["adobe", "photoshop"],
                        ["adobe", "illustrator"],
                        idfs_small) # Should be 0.321248173318
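For reference, a sketch of the optional utilities and of cosine_similarity itself, reusing tfidf from exercise 2e and the cossim freebie above:

import math

def dotprod(a, b):
    # Sum of products over the keys the two sparse vectors share.
    return sum(a[k] * b[k] for k in a if k in b)

def norm(a):
    # Euclidean length of a sparse vector represented as a dict.
    return math.sqrt(sum(v * v for v in a.values()))

def cosine_similarity(tokens1, tokens2, idfs):
    # Turn each token list into a TF-IDF vector, then take the normalized dot product.
    return cossim(tfidf(tokens1, idfs), tfidf(tokens2, idfs))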
b. Now we can finally do some entity resolution! For every product record in the small Google data set, use cosine_similarity to compute its similarity to every record in the small Amazon data set. Build a dictionary mapping (Amazon ID, Google ID) tuples to similarity scores between 0 and 1. What is the similarity between Amazon record 'b000o24l3q' and Google record http://www.google.com/base/feeds/snippets/17242822440574356561?
# TODO Compute similarities
similarities = {}
print 'Requested similarity is %s.' % similarities[('b000o24l3q',
'http://www.google.com/base/feeds/snippets/17242822440574356561')]
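A brute-force sketch of the all-pairs comparison (quadratic in the number of records, which is manageable at this scale). It assumes the exercise 1c dictionaries amazon_rec2tok and google_rec2tok, and that no record tokenizes to an empty list (which would make its norm zero):

# Fill the similarities dictionary declared above.
for a_id, a_tokens in amazon_rec2tok.items():
    for g_id, g_tokens in google_rec2tok.items():
        similarities[(a_id, g_id)] = cosine_similarity(a_tokens, g_tokens, idfs_small)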
c. Use the "gold standard" data (loaded from the supplied file) to answer the following questions. How many true duplicate pairs are there in the small data sets? What is the average similarity score for true duplicates? What about for non-duplicates? Based on this, is cosine similarity doing a good job, qualitatively speaking, of identifying duplicates? Why or why not?
gold_standard = [] # Load this if not already loaded
true_dups = 0 # Fix me
avg_sim_dups = 0.0 # Fix me
avg_sim_non = 0.0 # Fix me
print "There are %s true duplicates." % true_dups
print "The average similarity of true duplicates is %s." % avg_sim_dups
print "And for non duplicates, it is %s." % avg_sim_non
TODO Answer question c here