Plagiarism detection is concerned with automatically identifying text that has been copied from one document to another. It generally targets text that was copied without permission, so it involves both detecting copied text and matching that text to citations. The methods for identifying copied text have broader applicability in areas such as sociology, where researchers may be concerned with studying communication patterns or the spread of information in social media.
As an applied field, plagiarism detection borrows heavily from other fields such as natural language processing and bioinformatics. For example, Wikipedia cites string matching and alignment techniques from bioinformatics as among the popular methods.
In this blog post, I'm going to explore a basic method for identifying copied text using the bag-of-words model (an unordered set of words) and Naive Bayes classifiers. I'll explain the Naive Bayes classifier and test the method on Wikipedia articles.
For my approach, I want to match each word in a sample document against a corpus of reference documents. To do so, I create a bag of words for each word and its neighbors in both the sample and reference documents. I like the bag-of-words approach since it is robust to perturbations such as:
word reordering within the window
insertion or deletion of a few words
substitution of individual words
To extract the bags of words, I use a sliding window approach. For example, consider the sentence
The quick brown fox jumped over the lazy dog.
We get the following bags of words:
A = {the quick brown fox jumped}
B = {quick brown fox jumped over}
C = {brown fox jumped over the}
D = {fox jumped over the lazy}
E = {jumped over the lazy dog}
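This sliding-window extraction can be sketched in a few lines (the helper name `extract_bags` is my own):

```python
def extract_bags(text, window_length=5):
    # Slide a fixed-length window across the words; each window
    # becomes an unordered set of words (a bag).
    words = text.lower().replace(".", "").split()
    return [frozenset(words[i:i + window_length])
            for i in range(len(words) - window_length + 1)]

bags = extract_bags("The quick brown fox jumped over the lazy dog.")
# bags[0] == {"the", "quick", "brown", "fox", "jumped"}
```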
Now that we have a representation for the words, we can study Naive Bayes classifiers.
Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes' theorem with the "naive" assumption of independence between every pair of features. Given a class variable $Y$ and a dependent feature vector $x_1$ through $x_n$, Bayes' theorem states the following relationship:
$$ P(Y \mid x_1, \dots, x_n) = \frac{P(Y) P(x_1, \dots, x_n \mid Y)} {P(x_1, \dots, x_n)} $$Using the naive independence assumption that
$$ P(x_i \mid Y, x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n) = P(x_i \mid Y) $$for all $i$, this relationship simplifies to
$$ P(Y \mid x_1, \dots, x_n) = \frac{P(Y) \prod_{i} P(x_i \mid Y)} {P(x_1, \dots, x_n)} $$Since $P(x_1, \dots, x_n)$ is constant given the input, we can use the following classification rule:
$$ P(Y \mid x_1, \dots, x_n) \propto P(Y) \prod_{i} P(x_i \mid Y) \\ \Downarrow \\ \hat{Y} = \arg\max_Y P(Y) \prod_{i} P(x_i \mid Y) $$The different Naive Bayes classifiers differ mainly in the assumptions they make regarding the distribution of $P(x_i \mid Y)$.
When creating a Naive Bayes classifier, we need to make three decisions:
the prior $P(Y)$
the feature estimator $\hat{\theta}(w \mid Y)$
the probability rule for $P(x_w \mid Y)$
We will look at three variants of Naive Bayes.
The multinomial estimator is given by:
$$ \hat{\theta}(w \mid Y) = \frac{y_w + \alpha}{N_y + \alpha n} $$where $y_w$ is a binary variable indicating whether the word $w$ is in class $Y$, $N_y$ is the number of words in class $Y$, $n$ is the number of distinct words in the training set, and $\alpha$ is a smoothing parameter, often set to 1.
The probability rule is given by:
$$ P(x_w | Y ) \propto \hat{\theta}(w \mid Y)^{x_w} $$Note that since $x_w = 0$ and $ P(x_w | Y ) = 1 $ for words not in the instance $X$, we can simplify the probability rules so that we only need to consider the words in $X$ instead of all of the words in the training set:
$$ P(x_w | Y ) \propto \hat{\theta}(w \mid Y)\\ \\ P(Y \mid x_1, \dots, x_n) \propto P(Y) \prod_{w \in X} P(x_w \mid Y) $$This observation can reduce the computation time considerably, since we can exclude training classes that have no overlapping features, as well as words in the training classes that are not in the validation instances.
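In code, this pruning amounts to building an inverted index from words to the classes that contain them; a minimal sketch using the example bags (the helper names are my own):

```python
# Map each word to the set of classes containing it, so that only
# classes sharing at least one word with an instance are ever scored.
training = {
    "B": {"quick", "brown", "fox", "jumped", "over"},
    "C": {"brown", "fox", "jumped", "over", "the"},
    "D": {"fox", "jumped", "over", "the", "lazy"},
}

index = {}
for label, words in training.items():
    for w in words:
        index.setdefault(w, set()).add(label)

def candidate_classes(instance):
    # Union of the classes that contain any word of the instance
    return set().union(*(index.get(w, set()) for w in instance))

A = {"the", "quick", "brown", "fox", "jumped"}
F = {"alpha", "beta", "delta", "gamma", "kappa"}
# candidate_classes(F) is empty: F shares no words with any class
```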
Let's run through a calculation using the example from the introduction.
We'll use the following bags of words for our validation set:
A = {the quick brown fox jumped}
F = {alpha beta delta gamma kappa}
Let's use bags of words B, C, and D for our training set:
B = {quick brown fox jumped over}
C = {brown fox jumped over the}
D = {fox jumped over the lazy}
The total set of training words is:
{quick brown fox jumped over the lazy}
Since each training class has 5 words and there are a total of 7 distinct words, we can find the feature estimator as:
$$ \hat{\theta}(w | Y) = \frac{y_w + \alpha}{N_y + \alpha n} = \frac{y_w + 1}{5 + 7} = \frac{y_w + 1}{12} $$As an example, let's find the likelihood of observing "the" and "quick" in class B.
$$ \hat{\theta}(\text{the} \mid B) = \frac{B_{\text{the}} + 1}{12} = \frac{0 + 1}{12} = \frac{1}{12} $$Since the word "the" is not in class B, the occurrence value is 0, resulting in a likelihood of 1/12.
$$ \hat{\theta}(\text{quick} \mid B) = \frac{B_{\text{quick}} + 1}{12} = \frac{1 + 1}{12} = \frac{1}{6} $$For the word "quick", which is in class B, we find a likelihood of 1/6.
We can compute the probabilities for our validation instances as follows:
$$ P(A \mid B) \propto \frac{1}{3} \Big(\frac{1}{12}\Big) \Big(\frac{2}{12}\Big)^4 = \frac{16}{746496} $$$$ P(A \mid C) \propto \frac{1}{3} \Big(\frac{2}{12}\Big) \Big(\frac{1}{12}\Big) \Big(\frac{2}{12}\Big)^3 = \frac{16}{746496} $$$$ P(A \mid D) \propto \frac{1}{3} \Big(\frac{2}{12}\Big) \Big(\frac{1}{12}\Big)^2 \Big(\frac{2}{12}\Big)^2 = \frac{8}{746496} $$Validation instance $A$ has an equal probability of belonging to classes $B$ and $C$ and a lower probability of belonging to class $D$. Thus, $A$ would be predicted to belong to either class $B$ or $C$.
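We can check these hand calculations numerically; the sketch below implements the smoothed multinomial estimator with exact fractions (the helper names are my own):

```python
from fractions import Fraction

training = {
    "B": {"quick", "brown", "fox", "jumped", "over"},
    "C": {"brown", "fox", "jumped", "over", "the"},
    "D": {"fox", "jumped", "over", "the", "lazy"},
}
A = {"the", "quick", "brown", "fox", "jumped"}

vocab = set().union(*training.values())   # 7 distinct training words
prior = Fraction(1, len(training))        # uniform prior of 1/3

def multinomial_prob(instance, class_words):
    # P(Y) * product over instance words of (y_w + 1) / (N_y + n)
    p = prior
    for w in instance:
        y_w = 1 if w in class_words else 0
        p *= Fraction(y_w + 1, len(class_words) + len(vocab))
    return p

probs = {label: multinomial_prob(A, words) for label, words in training.items()}
# probs["B"] == probs["C"] == 16/746496 and probs["D"] == 8/746496
```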
Since instance $F$ doesn't have any words in common with any of the training instances, $F$ is computed as having an equal probability of belonging to all three classes:
$$ P(F | B) \propto \frac{1}{3} \Big(\frac{1}{12}\Big)^5 = \frac{1}{746496} $$$$ P(F | C) \propto \frac{1}{3} \Big(\frac{1}{12}\Big)^5 = \frac{1}{746496} $$$$ P(F | D) \propto \frac{1}{3} \Big(\frac{1}{12}\Big)^5 = \frac{1}{746496} $$Since the multinomial estimator does not weight the effect of each occurrence or non-occurrence by the word itself, it has difficulty distinguishing between cases that have the same number of missing words, as we just saw. Instead, we can weight the words by their "inverse document frequency" (IDF), that is, give higher weights to words that appear in fewer documents. The idea behind IDF is that words appearing in few documents are stronger indicators than words appearing in many documents.
The IDF estimator is given by:
$$ \hat{\theta}(w \mid Y) = \Big(\frac{N_c}{N_w}\Big)^{y_w} $$where $N_c$ is the total number of classes, $y_w$ is a binary variable indicating whether the word $w$ is in class $Y$, and $N_w$ is the number of classes containing the word $w$.
We can use the same probability rules as above.
To review, the total set of words (features) is:
{quick brown fox jumped over the lazy}
The class counts for each word are as follows:
quick = 1, brown = 2, fox = 3, jumped = 3,
over = 3, the = 2, lazy = 1
Since there are 3 training classes, we can find the feature estimator as:
$$ P(x \mid Y) \propto \Big(\frac{N_c}{N_w}\Big)^{y_w} = \Big(\frac{3}{N_w}\Big)^{y_w} $$As an example, let's find the proportional likelihood of observing "the" and "quick" in class B.
$$ P(\text{the} \mid B) \propto \Big(\frac{3}{N_\text{the}}\Big)^{y_\text{the}} = \Big(\frac{3}{2}\Big)^0 = 1 $$Since the word "the" is not in class B, the occurrence value is 0, resulting in a proportional likelihood of 1.
$$ P(\text{quick} \mid B) \propto \Big(\frac{3}{N_\text{quick}}\Big)^{y_\text{quick}} = \Big(\frac{3}{1}\Big)^1 = 3 $$For the word "quick", which is in class B, we find a proportional likelihood of 3.
The resulting proportional probabilities for our validation set instances under the IDF model are as follows:
$$ P(A \mid B) \propto \frac{1}{3} \Big(\frac{3}{2}\Big)^0 \Big(\frac{3}{1}\Big)^1 \Big(\frac{3}{2}\Big)^1 \Big(\frac{3}{3}\Big)^1 \Big(\frac{3}{3}\Big)^1 = \frac{81}{54} = 1.5 $$$$ P(A \mid C) \propto \frac{1}{3} \Big(\frac{3}{2}\Big)^1 \Big(\frac{3}{1}\Big)^0 \Big(\frac{3}{2}\Big)^1 \Big(\frac{3}{3}\Big)^1 \Big(\frac{3}{3}\Big)^1 = \frac{81}{108} = 0.75 $$$$ P(A \mid D) \propto \frac{1}{3} \Big(\frac{3}{2}\Big)^1 \Big(\frac{3}{1}\Big)^0 \Big(\frac{3}{2}\Big)^0 \Big(\frac{3}{3}\Big)^1 \Big(\frac{3}{3}\Big)^1 = \frac{27}{54} = 0.5 $$Validation instance $A$ has the highest probability of belonging to class $B$.
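As before, these values can be verified with a short script implementing the IDF estimator with exact fractions (the helper names are my own):

```python
from fractions import Fraction

training = {
    "B": {"quick", "brown", "fox", "jumped", "over"},
    "C": {"brown", "fox", "jumped", "over", "the"},
    "D": {"fox", "jumped", "over", "the", "lazy"},
}
A = {"the", "quick", "brown", "fox", "jumped"}
n_classes = len(training)

def idf_prob(instance, class_words):
    # P(Y) * product over instance words of (N_c / N_w)^(y_w);
    # words absent from the class contribute a factor of 1.
    p = Fraction(1, n_classes)
    for w in instance:
        if w in class_words:
            n_w = sum(1 for words in training.values() if w in words)
            p *= Fraction(n_classes, n_w)
    return p

probs = {label: idf_prob(A, words) for label, words in training.items()}
# probs["B"] == 3/2, probs["C"] == 3/4, probs["D"] == 1/2
```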
Since $F$ doesn't have any words in common with any of the training instances, $F$ is computed as having equal probabilities of belonging to all three classes:
$$ P(F \mid B) \propto \frac{1}{3} \Big(1\Big)^5 = \frac{1}{3} $$$$ P(F \mid C) \propto \frac{1}{3} \Big(1\Big)^5 = \frac{1}{3} $$$$ P(F \mid D) \propto \frac{1}{3} \Big(1\Big)^5 = \frac{1}{3} $$Unlike the multinomial Naive Bayes, the Bernoulli Naive Bayes penalizes a class for the words it is missing. We can use the same estimator as the multinomial Naive Bayes, but the Bernoulli probability rule is given by:
$$ P(x_w \mid Y) \propto \hat{\theta}(w \mid Y)^{x_w} \big(1 - \hat{\theta}(w \mid Y)\big)^{(1 - {x_w})} $$As a reminder, the total set of words (features) is:
{the quick brown fox jumped over lazy}
Now, on to the examples. We'll use the multinomial feature estimator from above. Since we're using the Bernoulli combination rule, however, we need to consider every word in the vocabulary, unlike with the multinomial combination rule. This means we have four cases for every word:
in the instance and in the class: $\hat{\theta} = 2/12$
in the instance but not the class: $\hat{\theta} = 1/12$
not in the instance but in the class: $1 - \hat{\theta} = 10/12$
in neither the instance nor the class: $1 - \hat{\theta} = 11/12$
As a result, we find the following proportional probabilities:
$$ P(A | B) \propto \frac{1}{3} \Big(\frac{1}{12}\Big) \Big(\frac{2}{12}\Big)^4 \Big(\frac{10}{12}\Big) \Big(\frac{11}{12}\Big) = \frac{1760}{107495424} $$$$ P(A | C) \propto \frac{1}{3} \Big(\frac{2}{12}\Big) \Big(\frac{1}{12}\Big) \Big(\frac{2}{12}\Big)^3 \Big(\frac{10}{12}\Big) \Big(\frac{11}{12}\Big) = \frac{1760}{107495424} $$$$ P(A | D) \propto \frac{1}{3} \Big(\frac{2}{12}\Big) \Big(\frac{1}{12}\Big)^2 \Big(\frac{2}{12}\Big)^2 \Big(\frac{10}{12}\Big) \Big(\frac{10}{12}\Big) = \frac{800}{107495424} $$Validation instance $A$ has equal probabilities for belonging in classes $B$ and $C$ and a lower probability of belonging to class $D$. Thus, $A$ would be predicted to belong to either class $B$ or $C$.
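These Bernoulli values can be checked the same way; the sketch below multiplies over every word in the training vocabulary (the helper names are my own):

```python
from fractions import Fraction

training = {
    "B": {"quick", "brown", "fox", "jumped", "over"},
    "C": {"brown", "fox", "jumped", "over", "the"},
    "D": {"fox", "jumped", "over", "the", "lazy"},
}
A = {"the", "quick", "brown", "fox", "jumped"}
vocab = sorted(set().union(*training.values()))   # 7 distinct words

def bernoulli_prob(instance, class_words):
    # P(Y) * product over ALL vocabulary words of
    # theta^(x_w) * (1 - theta)^(1 - x_w), with theta = (y_w + 1)/(N_y + n)
    p = Fraction(1, len(training))
    for w in vocab:
        y_w = 1 if w in class_words else 0
        theta = Fraction(y_w + 1, len(class_words) + len(vocab))
        p *= theta if w in instance else (1 - theta)
    return p

probs = {label: bernoulli_prob(A, words) for label, words in training.items()}
# probs["B"] == probs["C"] == 1760/107495424 and probs["D"] == 800/107495424
```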
Since $F$ doesn't have any words in common with any of the training instances, $F$ is computed as having equal probabilities of belonging to all three classes:
$$ P(F | B) \propto \frac{1}{3} \Big(\frac{15}{17}\Big)^{5} \Big(\frac{16}{17}\Big)^{2} \Big(\frac{1}{17}\Big)^{5} = \frac{194400000}{1747866711689283} $$$$ P(F | C) \propto \frac{1}{3} \Big(\frac{15}{17}\Big)^{5} \Big(\frac{16}{17}\Big)^{2} \Big(\frac{1}{17}\Big)^{5} = \frac{194400000}{1747866711689283} $$$$ P(F | D) \propto \frac{1}{3} \Big(\frac{15}{17}\Big)^{5} \Big(\frac{16}{17}\Big)^{2} \Big(\frac{1}{17}\Big)^{5} = \frac{194400000}{1747866711689283} $$Here, the five unseen words of $F$ expand the vocabulary to 12 features, so the estimator denominator becomes $5 + 12 = 17$: each of the 5 in-class words (all absent from $F$) contributes $1 - 2/17 = 15/17$, each of the 2 out-of-class words contributes $1 - 1/17 = 16/17$, and each word of $F$ contributes $1/17$. Now that we understand how the three methods work, we can evaluate their real-world performance on Wikipedia articles. The English Wikipedia articles can be downloaded from Wikimedia's database dumps. I used the smallest file of the complete pages, current versions only, in XML format, bzipped.
The first step is to extract the articles from the XML dump and remove non-content pages (e.g., Talk pages) and short articles.
%load_ext autoreload
%autoreload 2
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import random
import xml.etree.ElementTree as ET
def read_articles(flname):
    tree = ET.parse(flname)
    root = tree.getroot()
    formatted_articles = []
    for page in root.iterfind("{http://www.mediawiki.org/xml/export-0.8/}page"):
        title = page.find("{http://www.mediawiki.org/xml/export-0.8/}title").text
        revision = page.find("{http://www.mediawiki.org/xml/export-0.8/}revision")
        text = revision.find("{http://www.mediawiki.org/xml/export-0.8/}text").text
        formatted_articles.append((title, text))
    return formatted_articles
articles = read_articles("data/enwiki-20140707-pages-meta-current1.xml-p000000010p000010000")
articles = filter(lambda t: not t[0].startswith("Talk:"), articles)
articles = filter(lambda t: len(t[1]) > 5000, articles)
sampled_articles = list(random.sample(articles, 30))
training_articles = sampled_articles[:10]
validation_articles = sampled_articles[10:20]
near_match_articles = sampled_articles[20:]
The second step is to clean the data by removing the MediaWiki formatting. This removes hyperlinks and other syntax that could confuse the classification process.
import re
def remove_block_regex(text, start_delim, end_delim):
    block_regex = re.compile(start_delim + r".+?" + end_delim, flags=re.DOTALL)
    new_string = ""
    last_end = 0
    for match in block_regex.finditer(text):
        new_string += text[last_end:match.start(0)] + " "
        last_end = match.end(0)
    new_string += text[last_end:]
    return new_string

def replace_link_block(block):
    if block.startswith("[[image:") or \
       block.startswith("[[file:"):
        return " "
    elif block[1] != "[":
        start = block.find(" ")
        if start == -1:
            return " "
        else:
            return block[start+1:-1]
    elif "|" in block:
        start = block.find("|")
        return block[start+1:-2]
    return block[2:-2]
def substitute_links(text):
    pos = 0
    start_pos = 0
    while pos < len(text):
        if text[pos:pos+2] == "[[":
            start_pos = pos
            pos += 2
        elif text[pos] == "[":
            start_pos = pos
            pos += 1
        elif text[pos:pos+2] == "]]":
            block = text[start_pos:pos+2]
            block = replace_link_block(block)
            text = text[:start_pos] + block + text[pos+2:]
            pos = 0
        elif text[pos] == "]":
            block = text[start_pos:pos+1]
            block = replace_link_block(block)
            text = text[:start_pos] + block + text[pos+1:]
            pos = 0
        else:
            pos += 1
    return text
def remove_tail(text):
    headers = ["==See Also==", "== See Also ==", "==References==", "== References ==",
               "==External Links==", "== External Links ==", "==Further Reading==",
               "== Further Reading =="]
    positions = [text.find(s.lower()) for s in headers]
    positions = filter(lambda p: p != -1, positions)
    if len(positions) == 0:
        return text
    pos = min(positions)
    return text[:pos]
TAG_REGEX = re.compile(r"<\s*(\w+).+?/\1\s*>", flags=re.DOTALL)
def remove_html_tags(text):
    # first pass: remove self-closing tags like <ref ... />
    pos = 0
    start_pos = 0
    last_block_end = 0
    new_string = ""
    while pos < len(text):
        if text[pos] == "<":
            start_pos = pos
        elif text[pos:pos+2] == "/>":
            new_string += text[last_block_end:start_pos] + " "
            pos += 2
            last_block_end = pos
        pos += 1
    new_string += text[last_block_end:]
    # second pass: remove paired tags like <ref>...</ref>
    text = new_string
    new_string = ""
    last_end = 0
    for match in TAG_REGEX.finditer(text):
        new_string += text[last_end:match.start(0)] + " "
        last_end = match.end(0)
    new_string += text[last_end:]
    return new_string
def remove_wiki_formatting(text):
    text = text.lower()
    text = remove_tail(text)
    text = remove_block_regex(text, r"\{\{", r"\}\}")
    text = substitute_links(text)
    text = remove_block_regex(text, r"<!--", r"-->")
    text = remove_html_tags(text)
    text = text.replace(">", " ")
    text = text.replace("<", " ")
    text = text.replace("-", " ")
    text = text.replace("&nbsp;", " ")
    text = text.replace("&gt;", ">")
    text = text.replace("&lt;", "<")
    text = text.replace("&amp;", " ")
    text = text.replace("|", " ")
    text = text.replace("/", " ")
    text = text.replace("}}", " ")
    text = text.replace("{{", " ")
    text = text.replace("===", "")
    text = text.replace("==", "")
    text = text.replace("=", "")
    text = text.replace("'''", "")
    text = text.replace("''", "")
    text = text.replace(".", "")
    text = text.replace(",", "")
    text = text.replace("\"", "")
    text = text.replace("(", "")
    text = text.replace(")", "")
    text = text.replace("'", "")
    text = text.replace("?", "")
    text = text.replace(";", "")
    text = text.replace(":", "")
    text = text.replace("!", "")
    text = text.replace("*", "")
    return " ".join(text.split())
stripped_training_articles = map(lambda t: (t[0], remove_wiki_formatting(t[1])), training_articles)
stripped_validation_articles = map(lambda t: (t[0], remove_wiki_formatting(t[1])), validation_articles)
stripped_near_match_articles = map(lambda t: (t[0], remove_wiki_formatting(t[1])), near_match_articles)
Now that we have each article in unformatted, plain text, we can extract the bags and prepare our test sets. The bags are found by scanning over the articles' text with a sliding window. For each word, we extract the word, its $n$ neighbors to the left, and its $n$ neighbors to the right. We represent each word set as a tuple containing an unordered set of the extracted words, the index of the center word, and the title of the article. The center word index and title of the article are used to identify the sources of the bags and verify the correctness of the classification.
We build training and validation sets for four data sets:
random bags only, with no true matches
true matches copied exactly (offset 0)
true matches with windows shifted by one word (offset 1)
true matches with windows shifted by two words (offset 2)
In each data set, we include not only the instances mimicking our test case but also randomly-drawn bags to mimic the normal "background noise" we would find.
def extract_word_sets(article, window_length):
    title, text = article
    words = text.split()
    word_sets = list()
    # the + 1 ensures the final window is included
    for i in xrange(len(words) - window_length + 1):
        word_set = ((title, i + ((window_length - 1) / 2)), frozenset(words[i:i+window_length]))
        word_sets.append(word_set)
    return word_sets
WINDOW_LENGTH = 7
training_article_word_sets = map(lambda article: extract_word_sets(article, WINDOW_LENGTH), stripped_training_articles)
validation_article_word_sets = map(lambda article: extract_word_sets(article, WINDOW_LENGTH), stripped_validation_articles)
near_match_article_word_sets = map(lambda article: extract_word_sets(article, WINDOW_LENGTH), stripped_near_match_articles)
training_word_sets = random.sample(reduce(lambda x, y: x + y, training_article_word_sets), 3000)
validation_word_sets = random.sample(reduce(lambda x, y: x + y, validation_article_word_sets), 3000)
word_set_pairs_0 = []
for article_word_sets in near_match_article_word_sets:
    for i in xrange(len(article_word_sets)):
        word_set_pairs_0.append((article_word_sets[i], article_word_sets[i]))
word_set_pairs_1 = []
for article_word_sets in near_match_article_word_sets:
    for i in xrange(len(article_word_sets) - 1):
        word_set_pairs_1.append(article_word_sets[i:i+2])
word_set_pairs_2 = []
for article_word_sets in near_match_article_word_sets:
    for i in xrange(len(article_word_sets) - 2):
        word_set_pairs_2.append((article_word_sets[i], article_word_sets[i+2]))
training_word_set_0 = training_word_sets[:1500]
validation_word_set_0 = validation_word_sets[:1500]
validation_set_labels_0 = dict()
for ws1, ws2 in random.sample(word_set_pairs_0, 1500):
    training_word_set_0.append(ws1)
    validation_word_set_0.append(ws2)
    validation_set_labels_0[ws2[0]] = ws1[0]
training_word_set_1 = training_word_sets[:1500]
validation_word_set_1 = validation_word_sets[:1500]
validation_set_labels_1 = dict()
for ws1, ws2 in random.sample(word_set_pairs_1, 1500):
    training_word_set_1.append(ws1)
    validation_word_set_1.append(ws2)
    validation_set_labels_1[ws2[0]] = ws1[0]
training_word_set_2 = training_word_sets[:1500]
validation_word_set_2 = validation_word_sets[:1500]
validation_set_labels_2 = dict()
for ws1, ws2 in random.sample(word_set_pairs_2, 1500):
    training_word_set_2.append(ws1)
    validation_word_set_2.append(ws2)
    validation_set_labels_2[ws2[0]] = ws1[0]
print len(training_word_sets), len(validation_word_sets)
print training_word_sets[0]
3000 3000 (('David Beckham', 9313), frozenset([u'england', u'beckham', u'lead', u'into', u'1\u20130', u'put', u'the']))
To determine the viability of using a Naive Bayes approach, we want to examine the distribution of overlaps between the word sets. The distance function is the number of words in $s_1$ that are not in $s_2$, that is, the size of the set difference of $s_1$ and $s_2$:
$$ d_a(s_1, s_2) = | s_1 - s_2 | $$
from collections import defaultdict
def compute_asymm_dist_distr(word_set1, word_set2):
    min_dist_distribution = defaultdict(lambda: 0)
    for key, s1 in word_set1:
        min_dist = len(s1)
        for key2, s2 in word_set2:
            dist = 0
            for w in s1:
                if w not in s2:
                    dist += 1
            min_dist = min(min_dist, dist)
            if min_dist == 0:
                break
        min_dist_distribution[min_dist] += 1
    return min_dist_distribution
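On the toy bags from the introduction, the asymmetric distance $d_a$ behaves as expected (a self-contained sketch):

```python
def asymm_dist(s1, s2):
    # d_a(s1, s2) = |s1 - s2|: the words of s1 missing from s2
    return len(s1 - s2)

A = frozenset("the quick brown fox jumped".split())
B = frozenset("quick brown fox jumped over".split())
D = frozenset("fox jumped over the lazy".split())

# A is one word ("the") away from B and two words away from D,
# so its minimum distance to {B, D} is 1
min_dist = min(asymm_dist(A, s) for s in (B, D))
```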
dist_random_distr = compute_asymm_dist_distr(validation_word_sets, training_word_sets)
dist_offset0_distr = compute_asymm_dist_distr(validation_word_set_0, training_word_set_0)
dist_offset1_distr = compute_asymm_dist_distr(validation_word_set_1, training_word_set_1)
dist_offset2_distr = compute_asymm_dist_distr(validation_word_set_2, training_word_set_2)
distances_random = list(sorted(dist_random_distr.keys()))
distances_offset0 = list(sorted(dist_offset0_distr.keys()))
distances_offset1 = list(sorted(dist_offset1_distr.keys()))
distances_offset2 = list(sorted(dist_offset2_distr.keys()))
densities_random = [dist_random_distr[d] / float(sum(dist_random_distr.values())) for d in distances_random]
densities_offset0 = [dist_offset0_distr[d] / float(sum(dist_offset0_distr.values())) for d in distances_offset0]
densities_offset1 = [dist_offset1_distr[d] / float(sum(dist_offset1_distr.values())) for d in distances_offset1]
densities_offset2 = [dist_offset2_distr[d] / float(sum(dist_offset2_distr.values())) for d in distances_offset2]
print densities_random
plt.plot(distances_random, densities_random, color="c", label="Random")
plt.hold(True)
plt.plot(distances_offset0, densities_offset0, color="r", label="Offset 0")
plt.plot(distances_offset1, densities_offset1, color="g", label="Offset 1")
plt.plot(distances_offset2, densities_offset2, color="b", label="Offset 2")
plt.xlabel("Distance (missing words)", fontsize=16)
plt.ylabel("Density", fontsize=16)
plt.legend(loc="upper right")
[0.0006666666666666666, 0.01, 0.099, 0.4073333333333333, 0.3863333333333333, 0.08833333333333333, 0.008333333333333333]
The background noise, or probability of finding another bag of words with a minimum distance of $d$, will have a large impact on our ability to distinguish between true and false matches. With a window length of 7 words, we find that the probabilities (cyan) of finding another bag of words with the same words, 1 word different, and 2 words different in a randomly-chosen set are 0.07%, 1.0%, and 9.9%, respectively. The majority of the bags have a minimum distance between 3 and 6.
As true positives with differences of 0, 1, and 2 words are added into the test sets, we see that the distribution becomes a superposition of the original random distribution and the new distributions. This change suggests that the presence of true matches should be readily observable.
Below, we've implemented the multinomial, IDF, and Bernoulli variants discussed above.
from collections import defaultdict
class BaseNaiveBayes(object):
    def __init__(self):
        self.word_classes = defaultdict(list)
        self.total_classes = 0.0

    def train(self, labelled_classes):
        self.total_classes = float(len(labelled_classes))
        for label, word_set in labelled_classes:
            for w in word_set:
                self.word_classes[w].append((label, word_set))
        # Gotcha! defaultdict inserts a default entry whenever a
        # missing key is indexed, so record the vocabulary size here
        self.n_total_features = len(self.word_classes)

    def _log_likelihood(self, class_word_set, instance_word_set, word):
        raise NotImplementedError

    def classify(self, instance):
        instance_label, instance_word_set = instance
        # base case -- no match
        best_match = (0.0, None)
        possible_classes = []
        for word in instance_word_set:
            possible_classes.extend(self.word_classes[word])
        for label, class_word_set in possible_classes:
            log_sum_likelihood = -np.log(self.total_classes)
            for word in instance_word_set:
                log_sum_likelihood += self._log_likelihood(class_word_set, instance_word_set, word)
            likelihood = np.exp(log_sum_likelihood)
            best_match = max((likelihood, label), best_match)
        return instance_label, best_match[1], best_match[0]
class MultinomialNaiveBayes(BaseNaiveBayes):
    def __init__(self, alpha=1.0):
        super(MultinomialNaiveBayes, self).__init__()
        self.alpha = alpha

    def _log_likelihood(self, class_word_set, instance_word_set, word):
        if word in class_word_set:
            occurrence = 1.0
        else:
            occurrence = 0.0
        n_class_features = float(len(class_word_set))
        log_likelihood = np.log(occurrence + self.alpha) - np.log(n_class_features + self.alpha * self.n_total_features)
        return log_likelihood
class IDFNaiveBayes(BaseNaiveBayes):
    def _log_likelihood(self, class_word_set, instance_word_set, word):
        if word not in class_word_set:
            return np.log(1.0)
        n_word_classes = float(len(self.word_classes[word]))
        log_likelihood = np.log(self.total_classes) - np.log(n_word_classes)
        return log_likelihood
class BernoulliNaiveBayes(BaseNaiveBayes):
    def __init__(self, alpha=1.0):
        super(BernoulliNaiveBayes, self).__init__()
        self.alpha = alpha

    def _log_likelihood(self, class_word_set, instance_word_set, word):
        if word in class_word_set:
            occurrence = 1.0
        else:
            occurrence = 0.0
        n_class_features = float(len(class_word_set))
        log_likelihood = np.log(occurrence + self.alpha) - np.log(n_class_features + self.alpha * self.n_total_features)
        if word in instance_word_set:
            return log_likelihood
        else:
            return np.log(1.0 - np.exp(log_likelihood))

    def classify(self, instance):
        instance_label, instance_word_set = instance
        # base case -- no match
        best_match = (0.0, None)
        possible_classes = []
        for word in instance_word_set:
            possible_classes.extend(self.word_classes[word])
        for label, class_word_set in possible_classes:
            log_sum_likelihood = -np.log(self.total_classes)
            # the Bernoulli rule considers every word in the
            # vocabulary, not just the words in the instance
            for word in self.word_classes.iterkeys():
                log_sum_likelihood += self._log_likelihood(class_word_set,
                                                           instance_word_set,
                                                           word)
            likelihood = np.exp(log_sum_likelihood)
            best_match = max((likelihood, label), best_match)
        return instance_label, best_match[1], best_match[0]
We'll evaluate the multinomial and IDF Naive Bayes classifiers using the test sets with true matches with distances of 0, 1, and 2 words.
mnb = MultinomialNaiveBayes(alpha=1.0)
mnb.train(training_word_set_0)
offset0_word_set_classifications = map(lambda instance: mnb.classify(instance), validation_word_set_0)
mnb = MultinomialNaiveBayes(alpha=1.0)
mnb.train(training_word_set_1)
offset1_word_set_classifications = map(lambda instance: mnb.classify(instance), validation_word_set_1)
mnb = MultinomialNaiveBayes(alpha=1.0)
mnb.train(training_word_set_2)
offset2_word_set_classifications = map(lambda instance: mnb.classify(instance), validation_word_set_2)
def plot_roc_curve(*triplets):
    colors = ["r", "g", "b"]
    c = 0
    for label, word_classifications, mappings in triplets:
        sorted_classif = sorted([(prob, orig, classif) for orig, classif, prob in word_classifications])
        sorted_classif.reverse()
        tp = 0.0
        fp = 0.0
        xs = []
        ys = []
        for _, orig, classif in sorted_classif:
            if orig not in mappings:
                continue
            if mappings[orig] == classif:
                tp += 1.0
            else:
                fp += 1.0
            xs.append(fp / 1500.0)
            ys.append(tp / 1500.0)
        xs.append(1.0)
        ys.append(1.0)
        print label, tp / 1500.0, fp / 1500.0
        plt.plot(xs, ys, color=colors[c], label=label)
        c += 1
    plt.hold(True)
    plt.xlabel("Incorrect Class", fontsize=16)
    plt.ylabel("Correct Class", fontsize=16)
    plt.legend(loc="lower right")
triplets = [("Offset 0", offset0_word_set_classifications, validation_set_labels_0),
("Offset 1", offset1_word_set_classifications, validation_set_labels_1),
("Offset 2", offset2_word_set_classifications, validation_set_labels_2)]
plot_roc_curve(*triplets)
Offset 0 0.998666666667 0.00133333333333 Offset 1 0.933333333333 0.0666666666667 Offset 2 0.858 0.142
With the multinomial NB model, we get true positive rates of 99.9%, 93.3%, and 85.8% for the test sets with 0, 1, and 2 word differences, respectively.
idfnb = IDFNaiveBayes()
idfnb.train(training_word_set_0)
offset0_word_set_classifications = map(lambda instance: idfnb.classify(instance), validation_word_set_0)
idfnb = IDFNaiveBayes()
idfnb.train(training_word_set_1)
offset1_word_set_classifications = map(lambda instance: idfnb.classify(instance), validation_word_set_1)
idfnb = IDFNaiveBayes()
idfnb.train(training_word_set_2)
offset2_word_set_classifications = map(lambda instance: idfnb.classify(instance), validation_word_set_2)
triplets = [("Offset 0", offset0_word_set_classifications, validation_set_labels_0),
("Offset 1", offset1_word_set_classifications, validation_set_labels_1),
("Offset 2", offset2_word_set_classifications, validation_set_labels_2)]
plot_roc_curve(*triplets)
Offset 0 0.995333333333 0.00466666666667 Offset 1 0.93 0.07 Offset 2 0.860666666667 0.139333333333
With the IDF NB model, we get true positive rates of 99.5%, 93.0%, and 86.1% for the test sets with 0, 1, and 2 word differences, respectively. The IDF and multinomial models perform similarly, suggesting that either would be a good choice.
By this point, we've covered quite a bit. We reviewed plagiarism detection and the problem of identifying copied text. We looked at the bag-of-words model for feature extraction and showed how it works with a simple example. We looked at Naive Bayes classifiers and three variants, the multinomial, IDF, and Bernoulli, and demonstrated how the calculations are performed by hand. We then implemented all three classifiers and evaluated the multinomial and IDF variants on Wikipedia articles. We found that both models performed well, correctly classifying about 86% to 99.9% of matching bags, depending on the number of words that had been changed.
As a preliminary example, there is still a lot of work to do. We didn't consider how to handle misspellings. One approach would be to cluster the words so that variants such as "definitely" and "defenitely" are grouped together. Then, all of the variants could be replaced by a single canonical version.
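As a rough sketch of that idea using Python's standard difflib (the `canonicalize` helper, the toy vocabulary, and the cutoff value are my own choices and would need tuning):

```python
import difflib

VOCABULARY = ["definitely", "plagiarism", "detection"]

def canonicalize(word, vocabulary=VOCABULARY, cutoff=0.8):
    # Map a (possibly misspelled) word to its closest known variant;
    # fall back to the word itself when nothing is similar enough.
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word

# canonicalize("defenitely") -> "definitely"
```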
We could also improve our validation. From a statistical standpoint, we should draw multiple samples for each data set and compute the variation in the distributions and error rates.
We also didn't test on real data -- we built synthetic data sets from the Wikipedia articles. Since the bags were sampled from a small set of articles, words may repeat at a higher rate than if we had sampled from all of Wikipedia. For example, an article on psychology is likely to use the word "psychology" multiple times. Our results could be better in the more general case.