Notebook

Basic Text Analysis¶

We're going to cover two areas of text analysis: computing similarity between documents and training a very simple text classification.

Similarity metrics¶

An early exploratory level analysis that you might want to perform is similarity. How similar are your documents to each other? Once you are operating under the vector space model, you can compute different types of distances between word vectors; Euclidean, Hamming, Jaccard, cosine, and so forth.

It's currently quite popular to use cosine distances, so let's review what a cosine distance is and how to compute it.

Cosine distance¶

Think back to your days in grade school geometry. Cosine is a measure of an angle between two vectors. If the vectors are heading towards points very distant from the other, the angle between the two vectors will be very large. If they are heading towards a similar point in space, the angle will be very small.

Since we have represented documents as word vectors, we can find the cosine distance between documents and treat the resulting angle as an estimate of similarity based on word occurrences and frequencies. Similarity tends to be recalculated as 1 - distance so a cosine score of 1.0 means exactly identical.

Here are some excellent, more in-depth summaries of the math behind cosine distances as they relate to text in vector space:

We could go into more detail about how to compute cosine distance, but luckily scikits-learn has made this easy for us. We can multiply the tfidf sparse matrix object by its transpose and get cosine distances.

In [1]:

#according to one of the guys who helped write the tfidf implementation in scikits-learn: http://stackoverflow.com/a/8897648
from sklearn.feature_extraction.text import TfidfVectorizer

mydoclist = ['Julie loves me more than Linda loves me',
'Jane likes me more than Julie loves me',
'He likes basketball more than baseball']

tfidf_vectorizer = TfidfVectorizer(min_df = 1)
tfidf_matrix = tfidf_vectorizer.fit_transform(mydoclist)

document_distances = (tfidf_matrix * tfidf_matrix.T)
print 'Created a ' + str(document_distances.get_shape()[0]) + ' by ' + str(document_distances.get_shape()[1]) + ' document-document cosine distance matrix.'
print document_distances.toarray()

Created a 3 by 3 document-document cosine distance matrix.
[[ 1.          0.75360253  0.12840803]
 [ 0.75360253  1.          0.2574232 ]
 [ 0.12840803  0.2574232   1.        ]]

Finding similar documents¶

So now you have documents as word vectors. You want to find the ones that are really similar to each other.

Here's a tip: the cosine distance between two vectors is a reasonably easy way to perform a first test of document similarity.

Think back to your days in grade school geometry. Cosine is a measure of an angle between two vectors of some length. If the vectors are heading towards points very distant from the other, the angle between the two vectors will be very large. If they are heading towards a similar point in space, the angle will be very small.

Since we have represented documents as word vectors, we can find the cosine distance between documents and treat the resulting angle as an estimate of similarity based on tf-idf weighted word vectors.

This is how cheating detection works, in a nutshell.

In [2]:

from sklearn.metrics.pairwise import linear_kernel

#code taken from here: http://stackoverflow.com/a/12128777
from sklearn.metrics.pairwise import linear_kernel 

#linear kernel is the same as cosine distance when using tfidf + euclidean normalized vectors (L2 Norm=1))
#this is the benefit of sticking to scikits-learn from beginning to end of an analysis

cosine_similarities = linear_kernel(tfidf_matrix[0:1], tfidf_matrix).flatten() # let's look at similarity to the very first document
related_docs_indices = cosine_similarities.argsort()[:-len(mydoclist)-1:-1] #what is the order of most to least similar?

print mydoclist
print cosine_similarities[related_docs_indices] # what are the cosine distances?

['Julie loves me more than Linda loves me', 'Jane likes me more than Julie loves me', 'He likes basketball more than baseball']
[ 1.          0.75360253  0.12840803]

Classifying documents¶

First, let's create a massive list of dictionaries, where each dictionary is {field_source: review_text}.

In [3]:

import os
import csv
#os.chdir('/path/to/wherever/you/downloaded/data/from/textcleaning')
os.chdir('/Users/rweiss/Dropbox/presentations/IRiSS2013/text2/extra/amazon')

amazon_reviews = []
target_labels = []

for infile in os.listdir(os.path.join(os.getcwd())):
    if infile.endswith('csv'):
        label = infile.split('.')[0]
        target_labels.append(label)
        
        with open(infile, 'rb') as csvfile:
            amazon_reader = csv.DictReader(csvfile, delimiter=',')
            infile_rows = [{ label: row['review_text'] } for row in amazon_reader]
            
        for doc in infile_rows:
            amazon_reviews.append(doc)
        
print 'There are ' + str(len(amazon_reviews)) + ' total reviews.'

print 'The labels are '+ ', '.join(target_labels) + '.'

There are 5522 total reviews.
The labels are biologicalsciences_2010, literature_2010, sociology_2010.

So now we have a key-value mapping of label-body text for all the Amazon reviews. Let's look into building a classifier based on this text.

In [4]:

#first, we need to shuffle the docs into random order
#this is to make it easier for me to make train and test sets

from random import shuffle
x = [amazon_reviews[i] for i in range(len(amazon_reviews))]
shuffle(x)

In [5]:

from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from operator import itemgetter

trainset_size = int(round(len(amazon_reviews)*0.75)) # i chose this threshold arbitrarily...
print 'The training set size for this classifier is ' + str(trainset_size) + '\n'

X_train = np.array([''.join(el.values()) for el in x[0:trainset_size]])
y_train = np.array([''.join(el.keys()) for el in x[0:trainset_size]])

X_test = np.array([''.join(el.values()) for el in x[trainset_size+1:len(amazon_reviews)]])   
y_test = np.array([''.join(el.keys()) for el in x[trainset_size+1:len(amazon_reviews)]])  

vectorizer = TfidfVectorizer(min_df=1, ngram_range=(1, 1), stop_words='english', strip_accents='unicode', norm='l2')
                             
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

classifier = MultinomialNB().fit(X_train, y_train)
y_predicted = classifier.predict(X_test)

print 'The precision for this classifier is ' + str(metrics.precision_score(y_test, y_predicted))
print 'The recall for this classifier is ' + str(metrics.recall_score(y_test, y_predicted))
print 'The f1 for this classifier is ' + str(metrics.f1_score(y_test, y_predicted))

#hey, not bad!  shouldn't be surprising; there's a lot of data
#simple thing to do would be to up the n-grams to bigrams; try varying ngram_range from (1, 1) to (1, 2)
#we could also modify the vectorizer to stem or lemmatize
print '\nHere is the confusion matrix:'
print metrics.confusion_matrix(y_test, y_predicted)

#What are the top N most predictive features per class?
N = 10
vocabulary = np.array([t for t, i in sorted(vectorizer.vocabulary_.iteritems(), key=itemgetter(1))])

for i, label in enumerate(target_labels):
    topN = np.argsort(classifier.coef_[i])[-N:]
    print "\nThe top %d most informative features for %s: \n%s" % (N, label, " ".join(vocabulary[topN]))

The training set size for this classifier is 4142

The precision for this classifier is 0.764007077933
The recall for this classifier is 0.800580130529
The f1 for this classifier is 0.722836562295

Here is the confusion matrix:
[[  18  169    0]
 [   0 1086    0]
 [   0  106    0]]

The top 10 most informative features for biologicalsciences_2010: 
information gavin women intuition violence gift becker read fear book

The top 10 most informative features for literature_2010: 
like reading time novel quot story great gatsby read book

The top 10 most informative features for sociology_2010: 
really turow stickers lerner relationships dance quot read book anger

etcML demonstration!¶

First, let's reformat the Amazon data into one of the two structures that etcML expects

In [6]:

import os
import zipfile

for review in amazon_reviews:
    label = ''.join(review.keys())
    text = ''.join(review.values())
    
    etcMLdir = os.path.join(os.getcwd() + '/etcML/' + label)
    
    if not os.path.exists(etcMLdir):
        try:
            os.makedirs(etcMLdir)
        except OSError:
            print "Skipping creation of %s because it exists already." % etcMLdir
    
    #would probably be better to create a dictionary that stores the DOI and then names the file the DOI rather than the index number
    with open(os.path.join(etcMLdir + os.sep + 'review_' + str(amazon_reviews.index(review)) + '.txt'), 'wb' ) as outfile:
        outfile.write(text)

#note that it wasn't really necessary to write these files out to a directory first...
#we could have written a function that added to a zipfile dynamically
        
def zipdir(path, zip):
    for root, dirs, files in os.walk(path):
        for file in files:
            zip.write(os.path.join(root, file))

zip = zipfile.ZipFile('amazon.zip', 'w')
zipdir('etcML/', zip)
zip.close()