Notebook

Document-level text analysis¶

Document-level analysis is when you are interested in the whole text article, not tokens (sentences or words). The most basic example is labeling documents against some classification scheme, hence text classification. When you don't know your scheme ahead of time or you're interested in exploring a large set of data, you can try topic modeling.

We're going to go over a couple of examples of document-level text analysis using some very most common classifiers models. We're going to go over the code to train your own model and discuss the results we see.

Supervised learning: Text classification in Python¶

We're going to go over examples of how to use the excellent Scikits-Learn library to train some text classifiers.

The dataset used are the titles and topic codes from the NYTimes dataset that comes with the RTextTools library in R. It consists of titles from NYTimes front page news and associated codes according to Amber Boydstun's classification scheme.

In [1]:

from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import metrics
from operator import itemgetter
from sklearn.metrics import classification_report
import csv
import os

os.chdir('/Users/rweiss/Dropbox/presentations/MozFest2013/data/')

#note that if you generated this from R, you will need to delete the row
#"NYT_sample.Topic.Code","NYT_sample.Title"
#from the top of the file.
nyt = open('../data/nyt_title_data.csv') # check the structure of this file!
nyt_data = []
nyt_labels = []
csv_reader = csv.reader(nyt)

for line in csv_reader:
 nyt_labels.append(int(line[0]))
 nyt_data.append(line[1])

nyt.close()

trainset_size = int(round(len(nyt_data)*0.75)) # i chose this threshold arbitrarily...to discuss
print 'The training set size for this classifier is ' + str(trainset_size) + '\n'

X_train = np.array([''.join(el) for el in nyt_data[0:trainset_size]])
y_train = np.array([el for el in nyt_labels[0:trainset_size]])

X_test = np.array([''.join(el) for el in nyt_data[trainset_size+1:len(nyt_data)]]) 
y_test = np.array([el for el in nyt_labels[trainset_size+1:len(nyt_labels)]]) 

#print(X_train)

vectorizer = TfidfVectorizer(min_df=2, 
 ngram_range=(1, 2), 
 stop_words='english', 
 strip_accents='unicode', 
 norm='l2')
 
test_string = unicode(nyt_data[0])

print "Example string: " + test_string
print "Preprocessed string: " + vectorizer.build_preprocessor()(test_string)
print "Tokenized string:" + str(vectorizer.build_tokenizer()(test_string))
print "N-gram data string:" + str(vectorizer.build_analyzer()(test_string))
print "\n"
 

The training set size for this classifier is 1621

Example string: Dole Courts Democrats
Preprocessed string: dole courts democrats
Tokenized string:[u'Dole', u'Courts', u'Democrats']
N-gram data string:[u'dole', u'courts', u'democrats', u'dole courts', u'courts democrats']

In [2]:

X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

nb_classifier = MultinomialNB().fit(X_train, y_train)

y_nb_predicted = nb_classifier.predict(X_test)

print "MODEL: Multinomial Naive Bayes\n"

print 'The precision for this classifier is ' + str(metrics.precision_score(y_test, y_nb_predicted))
print 'The recall for this classifier is ' + str(metrics.recall_score(y_test, y_nb_predicted))
print 'The f1 for this classifier is ' + str(metrics.f1_score(y_test, y_nb_predicted))
print 'The accuracy for this classifier is ' + str(metrics.accuracy_score(y_test, y_nb_predicted))

print '\nHere is the classification report:'
print classification_report(y_test, y_nb_predicted)

#simple thing to do would be to up the n-grams to bigrams; try varying ngram_range from (1, 1) to (1, 2)
#we could also modify the vectorizer to stem or lemmatize
print '\nHere is the confusion matrix:'
print metrics.confusion_matrix(y_test, y_nb_predicted, labels=unique(nyt_labels))

MODEL: Multinomial Naive Bayes

The precision for this classifier is 0.678610380886
The recall for this classifier is 0.549165120594
The f1 for this classifier is 0.506785046956
The accuracy for this classifier is 0.549165120594

Here is the classification report:
 precision recall f1-score support

 3 1.00 0.23 0.38 47
 12 0.75 0.08 0.15 37
 15 1.00 0.10 0.19 39
 16 0.59 0.56 0.58 112
 19 0.46 0.88 0.60 162
 20 0.70 0.64 0.67 99
 29 1.00 0.21 0.35 43

avg / total 0.68 0.55 0.51 539


Here is the confusion matrix:
[[ 11 0 0 5 24 7 0]
 [ 0 3 0 5 24 5 0]
 [ 0 0 4 4 26 5 0]
 [ 0 1 0 63 44 4 0]
 [ 0 0 0 18 143 1 0]
 [ 0 0 0 8 28 63 0]
 [ 0 0 0 4 25 5 9]]

In [3]:

#What are the top N most predictive features per class?
N = 10
vocabulary = np.array([t for t, i in sorted(vectorizer.vocabulary_.iteritems(), key=itemgetter(1))])

for i, label in enumerate(nyt_labels):
 if i == 7: # hack...
 break
 topN = np.argsort(nb_classifier.coef_[i])[-N:]
 print "\nThe top %d most informative features for topic code %s: \n%s" % (N, label, " ".join(vocabulary[topN]))
 #print topN

The top 10 most informative features for topic code 20: 
study hospitals aids medicare cancer health tobacco care drug new

The top 10 most informative features for topic code 29: 
special special report new drug sniper suspect report crime police case

The top 10 most informative features for topic code 3: 
market chief wall new billion big stocks enron deal microsoft

The top 10 most informative features for topic code 16: 
iraqi military baghdad 11 bush challenged nation challenged nation war iraq

The top 10 most informative features for topic code 19: 
russia japan war mideast russian india leader new israel china

The top 10 most informative features for topic code 19: 
2000 campaign 2000 clinton testing testing president politics bush democrats campaign president

The top 10 most informative features for topic code 20: 
victory bowl knicks win world yankees playoffs series game baseball

In [4]:

from sklearn.svm import LinearSVC

svm_classifier = LinearSVC().fit(X_train, y_train)

y_svm_predicted = svm_classifier.predict(X_test)
print "MODEL: Linear SVC\n"

print 'The precision for this classifier is ' + str(metrics.precision_score(y_test, y_svm_predicted))
print 'The recall for this classifier is ' + str(metrics.recall_score(y_test, y_svm_predicted))
print 'The f1 for this classifier is ' + str(metrics.f1_score(y_test, y_svm_predicted))
print 'The accuracy for this classifier is ' + str(metrics.accuracy_score(y_test, y_svm_predicted))

print '\nHere is the classification report:'
print classification_report(y_test, y_svm_predicted)

#simple thing to do would be to up the n-grams to bigrams; try varying ngram_range from (1, 1) to (1, 2)
#we could also modify the vectorizer to stem or lemmatize
print '\nHere is the confusion matrix:'
print metrics.confusion_matrix(y_test, y_svm_predicted, labels=unique(nyt_labels))

MODEL: Linear SVC

The precision for this classifier is 0.63396261297
The recall for this classifier is 0.623376623377
The f1 for this classifier is 0.620561226142
The accuracy for this classifier is 0.623376623377

Here is the classification report:
 precision recall f1-score support

 3 0.69 0.47 0.56 47
 12 0.43 0.49 0.46 37
 15 0.68 0.44 0.53 39
 16 0.60 0.59 0.59 112
 19 0.60 0.77 0.67 162
 20 0.72 0.66 0.69 99
 29 0.73 0.56 0.63 43

avg / total 0.63 0.62 0.62 539


Here is the confusion matrix:
[[ 22 5 1 6 7 5 1]
 [ 1 18 1 2 10 2 3]
 [ 1 3 17 1 10 6 1]
 [ 1 7 3 66 28 6 1]
 [ 4 5 2 23 124 4 0]
 [ 3 2 1 9 16 65 3]
 [ 0 2 0 3 12 2 24]]

In [5]:

#What are the top N most predictive features per class?
N = 10
vocabulary = np.array([t for t, i in sorted(vectorizer.vocabulary_.iteritems(), key=itemgetter(1))])

for i, label in enumerate(nyt_labels):
 if i == 7: # hack...
 break
 topN = np.argsort(svm_classifier.coef_[i])[-N:]
 print "\nThe top %d most informative features for topic code %s: \n%s" % (N, label, " ".join(vocabulary[topN]))
 #print topN

The top 10 most informative features for topic code 20: 
schiavo tissue baby scientists fat cancer gene medicare hospitals tobacco

The top 10 most informative features for topic code 29: 
fallen limiting charged police rampage suspect murder crime sniper gun

The top 10 most informative features for topic code 3: 
profit workers merger response storm pricing deal stocks enron microsoft

The top 10 most informative features for topic code 16: 
base generals afghanistan navy force hussein nation nato 11 iraq

The top 10 most informative features for topic code 19: 
pakistan india russian japan europe china africa mideast israel chinese

The top 10 most informative features for topic code 19: 
impeachment whitewater race gingrich lewinsky senate president politics campaign democrats

The top 10 most informative features for topic code 20: 
bowl armstrong match knicks playoffs play yankees series baseball game

In [6]:

from sklearn.linear_model import LogisticRegression

maxent_classifier = LogisticRegression().fit(X_train, y_train)

y_maxent_predicted = maxent_classifier.predict(X_test)
print "MODEL: Maximum Entropy\n"

print 'The precision for this classifier is ' + str(metrics.precision_score(y_test, y_maxent_predicted))
print 'The recall for this classifier is ' + str(metrics.recall_score(y_test, y_maxent_predicted))
print 'The f1 for this classifier is ' + str(metrics.f1_score(y_test, y_maxent_predicted))
print 'The accuracy for this classifier is ' + str(metrics.accuracy_score(y_test, y_maxent_predicted))

print '\nHere is the classification report:'
print classification_report(y_test, y_maxent_predicted)

#simple thing to do would be to up the n-grams to bigrams; try varying ngram_range from (1, 1) to (1, 2)
#we could also modify the vectorizer to stem or lemmatize
print '\nHere is the confusion matrix:'
print metrics.confusion_matrix(y_test, y_maxent_predicted, labels=unique(nyt_labels))

MODEL: Maximum Entropy

The precision for this classifier is 0.654004346593
The recall for this classifier is 0.549165120594
The f1 for this classifier is 0.524774091511
The accuracy for this classifier is 0.549165120594

Here is the classification report:
 precision recall f1-score support

 3 0.94 0.32 0.48 47
 12 0.67 0.16 0.26 37
 15 0.71 0.13 0.22 39
 16 0.64 0.54 0.59 112
 19 0.44 0.85 0.58 162
 20 0.72 0.60 0.65 99
 29 1.00 0.28 0.44 43

avg / total 0.65 0.55 0.52 539


Here is the confusion matrix:
[[ 15 0 0 4 25 3 0]
 [ 1 6 0 3 24 3 0]
 [ 0 1 5 1 30 2 0]
 [ 0 2 0 61 43 6 0]
 [ 0 0 1 19 138 4 0]
 [ 0 0 1 7 32 59 0]
 [ 0 0 0 1 25 5 12]]

In [7]:

#What are the top N most predictive features per class?
N = 10
vocabulary = np.array([t for t, i in sorted(vectorizer.vocabulary_.iteritems(), key=itemgetter(1))])

for i, label in enumerate(nyt_labels):
 if i == 7: # hack...
 break
 topN = np.argsort(maxent_classifier.coef_[i])[-N:]
 print "\nThe top %d most informative features for topic code %s: \n%s" % (N, label, " ".join(vocabulary[topN]))
 #print topN

The top 10 most informative features for topic code 20: 
study scientists aids health hospitals medicare cancer care drug tobacco

The top 10 most informative features for topic code 29: 
murder mexico officer drug gun suspect sniper police case crime

The top 10 most informative features for topic code 3: 
big chief wall pay market billion stocks deal enron microsoft

The top 10 most informative features for topic code 16: 
challenged iraqis nato force arms nation challenged war nation 11 iraq

The top 10 most informative features for topic code 19: 
europe leader russia chinese japan russian mideast india israel china

The top 10 most informative features for topic code 19: 
2000 dole clinton bush race senate politics president democrats campaign

The top 10 most informative features for topic code 20: 
team play armstrong bowl knicks yankees playoffs series game baseball

Unsupervised learning: Topic modeling in Python¶

Now we're going to go over some typical topic modeling by using the popular Gensim library.

The nice thing about Gensim is that it's ready to be applied to large datasets as it incorporates both the online version of LDA and distributed computing capability.

We won't go over those features in this tutorial, since that would take hours to show a single example, and the NYTimes dataset is really quite small and can be run on a single machine.

In [8]:

from gensim import corpora, models, similarities
from itertools import chain
import nltk
from nltk.corpus import stopwords
from operator import itemgetter
import re

url_pattern = r'https?:\/\/(.*[\r\n]*)+'

documents = [nltk.clean_html(document) for document in nyt_data]
stoplist = stopwords.words('english')
texts = [[word for word in document.lower().split() if word not in stoplist]
 for document in documents]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

tfidf = models.TfidfModel(corpus) 
corpus_tfidf = tfidf[corpus]

#lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=100)
#lsi.print_topics(20)

n_topics = 60
lda = models.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=n_topics)

for i in range(0, n_topics):
 temp = lda.show_topic(i, 10)
 terms = []
 for term in temp:
 terms.append(term[1])
 print "Top 10 terms for topic #" + str(i) + ": "+ ", ".join(terms)
 
print 
print 'Which LDA topic maximally describes a document?\n'
print 'Original document: ' + documents[1]
print 'Preprocessed document: ' + str(texts[1])
print 'Matrix Market format: ' + str(corpus[1])
print 'Topic probability mixture: ' + str(lda[corpus[1]])
print 'Maximally probable topic: topic #' + str(max(lda[corpus[1]],key=itemgetter(1))[0])

Top 10 terms for topic #0: company, delays, reforms, money;, 6, whose, rights, giving, nazis, running
Top 10 terms for topic #1: garden, he's, lavish, networks,, trade-off, scripts, head, plan, pentagon, hit
Top 10 terms for topic #2: revival, now,, last,, guns,, killing,, chamber, tries, split, friends;, russians;
Top 10 terms for topic #3: legal, beckons, best, resistance, won't, privileged, mexico's, presidency,, tilts, production
Top 10 terms for topic #4: europe, r, entrepreneurs,, ambitious, moscow, tough, home,, said, call, one
Top 10 terms for topic #5: pakistan, gathering, claimed, signs, raise, taliban;, supplying, hesitantly, ended, reviewing
Top 10 terms for topic #6: gaining,, losing,, torricelli, take, super, beijing, bowl, two, find, bush's
Top 10 terms for topic #7: terror, turmoil, must, shift, suspects, level, 2, vast, left, anguish
Top 10 terms for topic #8: colombia, swing, justices, effort, holding, suspect, pledge, talks, peres, military
Top 10 terms for topic #9: california;, title, balance, takes, governor, move, u.n., tough, hill's, seat-squirmer
Top 10 terms for topic #10: syria, baseball;, afghanistan,, pushes, swiftly, sharon, report, halt, israel, ace
Top 10 terms for topic #11: adoptions, rewarding, pause, recovery,, saudis, hundreds, diplomacy;, betrayed, laotians, mideast
Top 10 terms for topic #12: sars, timetable, clash, hotel, aide, urges, close, iraq, bush, toronto
Top 10 terms for topic #13: inside, care, business, planes, take, rebels, ill, mexico, town, war:
Top 10 terms for topic #14: insiders, kerry, suspected, woman, bid, giuliani, describes, saudi, foster, court
Top 10 terms for topic #15: aid, briefly, airstrike, zarqawi, survived, balks, numbers, bush, get, game
Top 10 terms for topic #16: cardinals, choosy, loner,, series, world, region, short, blasts, control, feared,
Top 10 terms for topic #17: fronts, disease, project, stretching,, sadly, lessons, excruciating, air, came, trial
Top 10 terms for topic #18: alzheimer's, drugs, cover, treatments, emergency, program, streets, doctor's, many,, room,
Top 10 terms for topic #19: weigh, documents, incentives, dissuade, gifts, bush, east, stay, leaves, trial
Top 10 terms for topic #20: veto, resolution, pursue, islamic, shiite, leadership;, iraq,, northeast, back, g.o.p.
Top 10 terms for topic #21: game, fear, c.i.a., sees, intelligence;, hindered, hindsight,, terror, tide, all-around
Top 10 terms for topic #22: public, conservatives, protest, terrorism, amid, starts, tobacco, impasse, white, rules
Top 10 terms for topic #23: arafat, fled, detainee, homes, mentally, heart, chechen, afghanistan, approves, money
Top 10 terms for topic #24: iran, rivera, send, confession, star, deaths, spy, yankees, croatian's, scale
Top 10 terms for topic #25: talks, defines, palestinians, german, vote, final, taliban, plan, war:, nation
Top 10 terms for topic #26: international, business;, passes, opportunity, vows, expand, seek, sell, crime, asks
Top 10 terms for topic #27: all,, treatment, crash, economic, executive, medicare, covering, goes, '96;, plea
Top 10 terms for topic #28: fall, coach, analysis;, news, charges, caught, (again), costly, near, feet
Top 10 terms for topic #29: missile, chief, taiwan, forced, mccain, test, dole's, canceled, vote, unable
Top 10 terms for topic #30: states, murder, shift,, 3, locked, sets, it's, backing, koreans, sites
Top 10 terms for topic #31: middle, executives, twice, widowed, india,, scorned, sent, inquiry, (again), hand,
Top 10 terms for topic #32: challenged:, nation, time,, insurance, efforts, sept., 11, keep, peace, panel
Top 10 terms for topic #33: president:, testing, peace, overview;, israel, rehiring, ex-president, morgan, rises, hope
Top 10 terms for topic #34: saudi, blast, high, 13, insurance, pact, government, costs, day, lowest
Top 10 terms for topic #35: proposes, medicare, found, troops;, guilty, tighten, remains, set, crisis, case
Top 10 terms for topic #36: search, baseball, 5, upsets, move, one, playoffs, bush, may, swiftly
Top 10 terms for topic #37: responses:, threats, strategy, disease;, budget, rich, security;, presence, damage, toss
Top 10 terms for topic #38: al, qaeda, witness, personal, let, $1, memo;, russia:, tour, bold
Top 10 terms for topic #39: failure, benefit, cutback, cheney's, companies, drug, bush, slave, adopt, fund
Top 10 terms for topic #40: military, losses, target, bars, cloning, ban, california's, clear, book,, novel
Top 10 terms for topic #41: smoke, front;, role, battle:, vast, spin, fierce, contest, vote:, clear
Top 10 terms for topic #42: life, he'll, court, dean's, reshape, faulted, contest, choice, falling, steps
Top 10 terms for topic #43: toward, children, light, hurrying, warmth, aiding, russians, try, presents, missiles
Top 10 terms for topic #44: surges, mode,, rightist, bolster, pick, york, attack, dies, new, cries
Top 10 terms for topic #45: confinement, guantanamo,, interrogation, rebuilding, corporations, goldman, hurting, deal;, appears, iowa
Top 10 terms for topic #46: tells, record, vow, israel, yankees, temper, leaders, runner, mile, working
Top 10 terms for topic #47: points, arms, 10%, arbitration, approval, three, picked, woo, populist, f.d.a.
Top 10 terms for topic #48: uneasiness, tigers, iraq, inflamed:, reconstruction;, repayment, fee, sees, worst, region
Top 10 terms for topic #49: sosa, mcgwire, expert's, point, plan, blair, gains, caribbean, coverage, jet
Top 10 terms for topic #50: 1998, resolve, base, campaign:, aids, family, foley, cleric, republicans, elections:
Top 10 terms for topic #51: peace,, politics, crashes, hollywood, enclaves, cockfights, flourishing, war, die, u.s.,
Top 10 terms for topic #52: green, cloning, bus, flaws, hubris, eccentric's, frenzy, defense, avoid, pro
Top 10 terms for topic #53: giants, declares, fence, victory, paler, rosy, sometimes, investors,, forecasts, results
Top 10 terms for topic #54: bridgeport, dispute, health, ballot, insurer, options, away, mccain,, crack, uninsured
Top 10 terms for topic #55: time, milosevic, travel, savor, stalls, tuesdays:, clues, among, grants, africans
Top 10 terms for topic #56: victims, texaco, democracy, batter, jews, using, back, nigerians, lurching, nafta
Top 10 terms for topic #57: city, school, shaken, weapons, pills., sunscreen., spray., checklist, bug, camp:
Top 10 terms for topic #58: g.i.'s, counting, command, ways, east, marines, charter, john, changes, assailing
Top 10 terms for topic #59: jordan, egypt, rape, brings, asian, rise, young, kabul, first, leader

Which LDA topic maximally describes a document?

Original document: Yanks End Drought; Mets Fall in Opener
Preprocessed document: ['yanks', 'end', 'drought;', 'mets', 'fall', 'opener']
Matrix Market format: [(3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)]
Topic probability mixture: [(19, 0.43120206027031704), (27, 0.14505309958538579), (28, 0.14521071520874509), (38, 0.14520079160221899)]
Maximally probable topic: topic #19

/Users/rweiss/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/gensim-0.8.7-py2.7.egg/gensim/__init__.py:12: UserWarning: Module IPython was already imported from /Applications/Canopy.app/appdata/canopy-1.0.3.1262.macosx-x86_64/Canopy.app/Contents/lib/python2.7/site-packages/IPython/__init__.pyc, but /Users/rweiss/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/ipython-1.0.0-py2.7.egg is being added to sys.path
 __version__ = __import__('pkg_resources').get_distribution('gensim').version

Big picture questions:¶

How do the different supervised models compare against each other?
What's the tradeoffs between the metrics per model?
What about per class? Are some models better than others are certain classes?
What if we had had more data? Would some models get better than others?
What if our observations had more data? Instead of titles, we used lead paragraphs or even the full document?
What if our feature space was different? Instead of unigrams or bigrams, we used trigrams? Parts-of-speech?
Is there something about the underlying language structure that leads certain models to being better than others?
How do the supervised models compare against the unsupervised model?
Are they "better?" If so, how?
What did we need to train a supervised model? What did we need to train an unsupervised model?
On that note, when is it more appropriate to use an unsupervised model over a supervised model?
How do you choose k number of topics for an unsupervised model?
What happens if you run the unsupervised model again? What about the supervised model?