We need to start thinking about how to translate collections of texts into quantifiable phenomena. The easiest way to start is to think about word frequencies.
I'm going to try to stay away from NLTK and scikit-learn as much as I can. Let's nail down some basic concepts in Python first.
First, let's review how to get a count of terms per document: a term frequency vector.
#examples taken from here: http://stackoverflow.com/a/1750187
mydoclist = ['Julie loves me more than Linda loves me',
             'Jane likes me more than Julie loves me',
             'He likes basketball more than baseball']
#mydoclist = ['sun sky bright', 'sun sun bright']

from collections import Counter

for doc in mydoclist:
    tf = Counter()
    for word in doc.split():
        tf[word] += 1
    print tf.items()
[('me', 2), ('Julie', 1), ('loves', 2), ('Linda', 1), ('than', 1), ('more', 1)]
[('me', 2), ('Julie', 1), ('likes', 1), ('loves', 1), ('Jane', 1), ('than', 1), ('more', 1)]
[('basketball', 1), ('baseball', 1), ('likes', 1), ('He', 1), ('than', 1), ('more', 1)]
Here, we've introduced a new Python object called a Counter. Counters are only available in Python 2.7 and higher. They're neat because they're built for exactly this kind of job: counting things in a loop. A key that hasn't been seen yet simply counts as 0.
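As a side note, a Counter can also do the counting without an explicit loop if you hand it the list of words directly. A minimal sketch, written with Python 3 print syntax and using only the first example document:

```python
from collections import Counter

doc = 'Julie loves me more than Linda loves me'

# Passing an iterable to Counter counts every element in one step
tf = Counter(doc.split())

print(tf['loves'])       # count for a word that appears in the document
print(tf['basketball'])  # missing words count as 0 instead of raising KeyError
```

The loop version in the notebook does the same thing; it just makes each increment visible.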
Let's call this a first stab at representing documents quantitatively, just by their word counts. But for those of you who are already tipped off by the "vector" in the vector space model, these counts aren't really comparable across documents. This is because they're not in the same vocabulary space.
What we really want is for every document to be the same length, where length is determined by the total vocabulary of our corpus.
# format() is a built-in, so no import is needed for it

def build_lexicon(corpus):
    lexicon = set()
    for doc in corpus:
        lexicon.update([word for word in doc.split()])
    return lexicon

def tf(term, document):
    return freq(term, document)

def freq(term, document):
    return document.split().count(term)

vocabulary = build_lexicon(mydoclist)

doc_term_matrix = []
print 'Our vocabulary vector is [' + ', '.join(list(vocabulary)) + ']'
for doc in mydoclist:
    print 'The doc is "' + doc + '"'
    tf_vector = [tf(word, doc) for word in vocabulary]
    tf_vector_string = ', '.join(format(count, 'd') for count in tf_vector)
    print 'The tf vector for Document %d is [%s]' % ((mydoclist.index(doc)+1), tf_vector_string)
    doc_term_matrix.append(tf_vector)

    # here's a test: why did I wrap mydoclist.index(doc)+1 in parens? it returns an int...
    # try it! type(mydoclist.index(doc) + 1)

print 'All combined, here is our master document term matrix: '
print doc_term_matrix
Our vocabulary vector is [me, basketball, Julie, baseball, likes, loves, Jane, Linda, He, than, more]
The doc is "Julie loves me more than Linda loves me"
The tf vector for Document 1 is [2, 0, 1, 0, 0, 2, 0, 1, 0, 1, 1]
The doc is "Jane likes me more than Julie loves me"
The tf vector for Document 2 is [2, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1]
The doc is "He likes basketball more than baseball"
The tf vector for Document 3 is [0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1]
All combined, here is our master document term matrix: 
[[2, 0, 1, 0, 0, 2, 0, 1, 0, 1, 1], [2, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1], [0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1]]
Okay, that seems reasonable enough. If any of you have any experience with machine learning, what you've just seen is the creation of a feature space. Now every document is in the same feature space, meaning that we can represent the entire corpus in the same dimensional space without having lost too much information.
Once you've got your data in the same feature space, you can start applying some machine learning methods: classifying, clustering, and so on. But actually, we've got a few problems. Words aren't all equally informative.
If words appear too frequently in a single document, they're going to muck up our analysis. We want to perform some scaling of each of these term frequency vectors into something a bit more representative. In other words, we need to do some vector normalizing.
We don't really have the time to go into the intense math of this. Just accept for now that we need to ensure that the L2 norm of each vector (the square root of the sum of its squared elements) is equal to 1. Here's some code that shows how this is done.
import math
import numpy as np

def l2_normalizer(vec):
    # divide every element by the vector's L2 norm
    denom = np.sum([el**2 for el in vec])
    return [(el / math.sqrt(denom)) for el in vec]

doc_term_matrix_l2 = []
for vec in doc_term_matrix:
    doc_term_matrix_l2.append(l2_normalizer(vec))

print 'A regular old document term matrix: '
print np.matrix(doc_term_matrix)
print '\nA document term matrix with row-wise L2 norms of 1:'
print np.matrix(doc_term_matrix_l2)

# if you want to check this math, perform the following:
# from numpy import linalg as la
# la.norm(doc_term_matrix[0])
# la.norm(doc_term_matrix_l2[0])
A regular old document term matrix: 
[[2 0 1 0 0 2 0 1 0 1 1]
 [2 0 1 0 1 1 1 0 0 1 1]
 [0 1 0 1 1 0 0 0 1 1 1]]

A document term matrix with row-wise L2 norms of 1:
[[ 0.57735027 0. 0.28867513 0. 0. 0.57735027 0. 0.28867513 0. 0.28867513 0.28867513]
 [ 0.63245553 0. 0.31622777 0. 0.31622777 0.31622777 0.31622777 0. 0. 0.31622777 0.31622777]
 [ 0. 0.40824829 0. 0.40824829 0.40824829 0. 0. 0. 0.40824829 0.40824829 0.40824829]]
Not bad. Without getting too deeply mired in the linear algebra, you can see immediately that we've scaled down the vectors such that each element falls in the range [0, 1], without losing too much valuable information. You see how it's no longer the case that a term count of 1 is the same value in one vector as another?
Why would we care about this kind of normalizing? Think about it this way: if you wanted to make a document seem more related to a particular topic than it actually was, you might try boosting the likelihood of its inclusion into a topic by repeating the same word over and over and over again. Frankly, at a certain point, we're getting a diminishing return on the informative value of the word. So we need to scale down words that appear too frequently in a document.
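To make the normalization concrete, here is the arithmetic for Document 1 worked through in plain Python (Python 3 syntax). This is only a sketch; the helper name l2_norm is ours and does not appear in the notebook code. The squared counts sum to 4 + 1 + 4 + 1 + 1 + 1 = 12, so every element gets divided by the square root of 12.

```python
import math

def l2_norm(vec):
    """Square root of the sum of squared elements."""
    return math.sqrt(sum(el ** 2 for el in vec))

# Term frequency vector for Document 1, copied from the matrix above
doc1 = [2, 0, 1, 0, 0, 2, 0, 1, 0, 1, 1]

norm = l2_norm(doc1)                 # sqrt(12)
unit = [el / norm for el in doc1]    # same direction, length 1

print(norm)
print(unit[0])        # a count of 2 becomes 2 / sqrt(12)
print(l2_norm(unit))  # the normalized vector has length 1, up to rounding
```

A count of 1 in Document 3 ends up larger than a count of 1 in Document 1 because Document 3 has fewer words overall, so each word carries more of the vector's length.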
But we're still not there yet. Just as all words aren't equally valuable within a document, not all words are valuable across all documents. We can try reweighting every word by its inverse document frequency. Let's see what's involved in that.
def numDocsContaining(word, doclist):
    doccount = 0
    for doc in doclist:
        if freq(word, doc) > 0:
            doccount += 1
    return doccount

def idf(word, doclist):
    n_samples = len(doclist)
    df = numDocsContaining(word, doclist)
    # careful: division binds tighter than addition, so this is (n_samples / 1) + df
    return np.log(n_samples / 1+df)

my_idf_vector = [idf(word, mydoclist) for word in vocabulary]

print 'Our vocabulary vector is [' + ', '.join(list(vocabulary)) + ']'
print 'The inverse document frequency vector is [' + ', '.join(format(value, 'f') for value in my_idf_vector) + ']'

Our vocabulary vector is [me, basketball, Julie, baseball, likes, loves, Jane, Linda, He, than, more]
The inverse document frequency vector is [1.609438, 1.386294, 1.609438, 1.386294, 1.609438, 1.609438, 1.386294, 1.386294, 1.386294, 1.791759, 1.791759]

Now we have a general sense of information values per term in our vocabulary, accounting for their relative frequency across the entire corpus. One caveat about the numbers above: as written, n_samples / 1+df evaluates to n_samples + df, so the printed values grow with document frequency ('than' and 'more' appear in all three documents and score highest). A true inverse document frequency, log(n_samples / (1 + df)) computed with float division, runs the other way: the more documents a term appears in, the lower its weight, and a term that appears in every document goes negative.
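The usual smoothed inverse document frequency is log(N / (1 + df)), where N is the number of documents and df is the number of documents containing the term. Here is a minimal sketch of that form with the parentheses and the float division made explicit. It uses Python 3 syntax and the standard math module, and the names num_docs_containing and idf_conventional are ours, not part of the notebook.

```python
import math

mydoclist = ['Julie loves me more than Linda loves me',
             'Jane likes me more than Julie loves me',
             'He likes basketball more than baseball']

def num_docs_containing(word, doclist):
    """Number of documents in which the word appears at least once."""
    return sum(1 for doc in doclist if word in doc.split())

def idf_conventional(word, doclist):
    """log(N / (1 + df)), with float division forced explicitly."""
    n_samples = len(doclist)
    df = num_docs_containing(word, doclist)
    return math.log(float(n_samples) / (1 + df))

# One word each for df = 1, df = 2 and df = 3
for word in ['Linda', 'me', 'than']:
    print(word, idf_conventional(word, mydoclist))
```

In this form 'Linda' (one document) gets a positive weight, 'me' (two documents) gets exactly 0, and 'than' (all three documents) goes negative. On a corpus this small the +1 in the denominator dominates, which is one reason library implementations adjust the formula further.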
We're almost there. To get TF-IDF weighted word vectors, you have to perform the simple calculation of tf * idf.
Time to take a step back. Recall from linear algebra that the dot product of two vectors of the same length (a 1 x B row vector times a B x 1 column vector) is a single scalar. This won't do, since what we want is a term vector of the same dimensions (1 x number of terms), where each element has been scaled by its idf weight. One way to get that is to multiply the 1 x B term vector by a B x B matrix that has the idf values on its diagonal and zeros everywhere else. How do we do that in Python?
We could write the whole function out here, but instead we're going to give a brief introduction to numpy.
import numpy as np

def build_idf_matrix(idf_vector):
    idf_mat = np.zeros((len(idf_vector), len(idf_vector)))
    np.fill_diagonal(idf_mat, idf_vector)
    return idf_mat

my_idf_matrix = build_idf_matrix(my_idf_vector)

#print my_idf_matrix
Awesome! Now we have converted our IDF vector into a matrix of size B x B, where the diagonal is the IDF vector. That means we can now multiply every term frequency vector by the inverse document frequency matrix. Then, to make sure we are also accounting for words that appear too frequently within documents, we'll normalize each document such that the L2 norm = 1.
doc_term_matrix_tfidf = []

#performing tf-idf matrix multiplication
for tf_vector in doc_term_matrix:
    doc_term_matrix_tfidf.append(np.dot(tf_vector, my_idf_matrix))

#normalizing
doc_term_matrix_tfidf_l2 = []
for tf_vector in doc_term_matrix_tfidf:
    doc_term_matrix_tfidf_l2.append(l2_normalizer(tf_vector))

print vocabulary
print np.matrix(doc_term_matrix_tfidf_l2) # np.matrix() just to make it easier to look at
set(['me', 'basketball', 'Julie', 'baseball', 'likes', 'loves', 'Jane', 'Linda', 'He', 'than', 'more'])
[[ 0.57211257 0. 0.28605628 0. 0. 0.57211257 0. 0.24639547 0. 0.31846153 0.31846153]
 [ 0.62558902 0. 0.31279451 0. 0.31279451 0.31279451 0.26942653 0. 0. 0.34822873 0.34822873]
 [ 0. 0.36063612 0. 0.36063612 0.41868557 0. 0. 0. 0.36063612 0.46611542 0.46611542]]
Awesome! You've just seen an example of how to tediously construct a TF-IDF weighted document term matrix.
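If the diagonal-matrix step felt like a detour, it may help to see on a tiny example that multiplying a row vector by a diagonal matrix is the same as scaling each element by its own weight. This is a sketch in plain Python 3 with made-up numbers. The two vectors and both helper functions are hypothetical and only stand in for the numpy calls used in the notebook.

```python
tf_vector = [2, 0, 1]          # hypothetical term counts for one document
idf_vector = [0.5, 2.0, 1.5]   # hypothetical idf weights, one per term

def diagonal_matrix(values):
    """Square matrix with `values` on the diagonal and zeros elsewhere."""
    size = len(values)
    return [[values[i] if i == j else 0.0 for j in range(size)]
            for i in range(size)]

def row_times_matrix(row, matrix):
    """Multiply a 1 x B row vector by a B x B matrix."""
    return [sum(row[i] * matrix[i][j] for i in range(len(row)))
            for j in range(len(matrix[0]))]

via_matrix = row_times_matrix(tf_vector, diagonal_matrix(idf_vector))
elementwise = [t * w for t, w in zip(tf_vector, idf_vector)]

print(via_matrix)
print(elementwise)
```

Both lines print the same vector. With numpy arrays the * operator already multiplies element by element, so the diagonal matrix is a teaching device rather than a requirement.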
Here comes the best part: you don't even have to do this by hand.
Remember that everything in Python is an object, and objects take up memory (and performing actions on them takes up time). Using scikit-learn means that you don't have to worry about the efficiency of all the previous steps.
NOTE: The values you get from the TfidfVectorizer/TfidfTransformer will be different from what we have computed by hand. This is because scikit-learn uses an adapted version of tf-idf to deal with divide-by-zero errors. It also lowercases every token and orders the vocabulary alphabetically, so the columns come out in a different order from our hand-built matrix. There is a more in-depth discussion here.
from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = CountVectorizer(min_df=1)
term_freq_matrix = count_vectorizer.fit_transform(mydoclist)
print "Vocabulary:", count_vectorizer.vocabulary_
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer(norm="l2")
tfidf.fit(term_freq_matrix)
tf_idf_matrix = tfidf.transform(term_freq_matrix)
print tf_idf_matrix.todense()
Vocabulary: {u'me': 8, u'basketball': 1, u'julie': 4, u'baseball': 0, u'likes': 5, u'loves': 7, u'jane': 3, u'linda': 6, u'more': 9, u'than': 10, u'he': 2}
[[ 0. 0. 0. 0. 0.28945906 0. 0.38060387 0.57891811 0.57891811 0.22479078 0.22479078]
 [ 0. 0. 0. 0.41715759 0.3172591 0.3172591 0. 0.3172591 0.6345182 0.24637999 0.24637999]
 [ 0.48359121 0.48359121 0.48359121 0. 0. 0.36778358 0. 0. 0. 0.28561676 0.28561676]]
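To see where these numbers come from: with its default settings (smooth_idf=True and norm="l2"), scikit-learn computes idf as ln((1 + n) / (1 + df)) + 1 and then L2-normalizes each row. The sketch below reproduces that in plain Python 3 without importing scikit-learn. The helper names smooth_idf and tfidf_row are ours, and splitting on whitespace is a simplification that happens to match scikit-learn's tokenizer for these three sentences.

```python
import math

mydoclist = ['Julie loves me more than Linda loves me',
             'Jane likes me more than Julie loves me',
             'He likes basketball more than baseball']

# Lowercase and split, then build an alphabetically sorted vocabulary
docs = [doc.lower().split() for doc in mydoclist]
vocab = sorted(set(word for doc in docs for word in doc))
n = len(docs)

def smooth_idf(term):
    """ln((1 + n) / (1 + df)) + 1, the smoothed idf described above."""
    df = sum(1 for doc in docs if term in doc)
    return math.log((1.0 + n) / (1.0 + df)) + 1.0

def tfidf_row(doc):
    """Raw count times smoothed idf, then scaled to unit L2 norm."""
    raw = [doc.count(term) * smooth_idf(term) for term in vocab]
    norm = math.sqrt(sum(value ** 2 for value in raw))
    return [value / norm for value in raw]

rows = [tfidf_row(doc) for doc in docs]

print(vocab)
for row in rows:
    print([round(value, 8) for value in row])
```

Because of the "+ 1" at the end, a word that appears in every document (like 'more') still keeps a weight of 1 before normalization instead of dropping to zero.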
In fact, you can combine these steps into one with the TfidfVectorizer:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(min_df = 1)
tfidf_matrix = tfidf_vectorizer.fit_transform(mydoclist)
print tfidf_matrix.todense()
[[ 0. 0. 0. 0. 0.28945906 0. 0.38060387 0.57891811 0.57891811 0.22479078 0.22479078]
 [ 0. 0. 0. 0.41715759 0.3172591 0.3172591 0. 0.3172591 0.6345182 0.24637999 0.24637999]
 [ 0.48359121 0.48359121 0.48359121 0. 0. 0.36778358 0. 0. 0. 0.28561676 0.28561676]]
And we can fit new observations into this vocabulary space like so:
new_docs = ['He watches basketball and baseball', 'Julie likes to play basketball', 'Jane loves to play baseball']
new_term_freq_matrix = tfidf_vectorizer.transform(new_docs)
print tfidf_vectorizer.vocabulary_
print new_term_freq_matrix.todense()
{u'me': 8, u'basketball': 1, u'julie': 4, u'baseball': 0, u'likes': 5, u'loves': 7, u'jane': 3, u'linda': 6, u'more': 9, u'than': 10, u'he': 2}
[[ 0.57735027 0.57735027 0.57735027 0. 0. 0. 0. 0. 0. 0. 0. ]
 [ 0. 0.68091856 0. 0. 0.51785612 0.51785612 0. 0. 0. 0. 0. ]
 [ 0.62276601 0. 0. 0.62276601 0. 0. 0. 0.4736296 0. 0. 0. ]]
Note that we didn't get words like 'watches' in the new_term_freq_matrix. That's because we trained the object on the documents in mydoclist, and that word doesn't appear in the vocabulary from that corpus. In other words, it's out of the lexicon.
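The same effect is easy to reproduce by hand: once the vocabulary is fixed, a new document can only be counted against the terms already in it. A small sketch in plain Python 3, where the vocabulary is typed out from the printed output above and the variable names are ours:

```python
# Lowercased vocabulary learned from mydoclist, in alphabetical order
vocabulary = ['baseball', 'basketball', 'he', 'jane', 'julie', 'likes',
              'linda', 'loves', 'me', 'more', 'than']

new_doc = 'He watches basketball and baseball'
tokens = new_doc.lower().split()

# Count only the terms the vocabulary already knows about
counts = [tokens.count(term) for term in vocabulary]

# Anything else has no column to land in and is silently dropped
out_of_lexicon = [token for token in tokens if token not in vocabulary]

print(counts)
print(out_of_lexicon)
```

Only 'baseball', 'basketball' and 'he' survive, which is why the first row of the matrix above has exactly three nonzero entries.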
import os
import csv

#os.chdir('/Users/rweiss/Dropbox/presentations/IRiSS2013/text1/fileformats/')

with open('amazon/sociology_2010.csv', 'rb') as csvfile:
    amazon_reader = csv.DictReader(csvfile, delimiter=',')
    amazon_reviews = [row['review_text'] for row in amazon_reader]

#your code here!!!