Document-level analysis is when you are interested in the whole text article, not tokens (sentences or words). The most basic example is labeling documents against some classification scheme, hence text classification. When you don't know your scheme ahead of time or you're interested in exploring a large set of data, you can try topic modeling.
We're going to go over a couple of examples of document-level text analysis using some very most common classifiers models. We're going to go over the code to train your own model and discuss the results we see.
We're going to go over examples of how to use the excellent Scikits-Learn library to train some text classifiers.
The dataset used are the titles and topic codes from the NYTimes
dataset that comes with the RTextTools library in R
. It consists of titles from NYTimes front page news and associated codes according to Amber Boydstun's classification scheme.
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import metrics
from operator import itemgetter
from sklearn.metrics import classification_report
import csv
import os
os.chdir('/Users/rweiss/Dropbox/presentations/MozFest2013/data/')
#note that if you generated this from R, you will need to delete the row
#"NYT_sample.Topic.Code","NYT_sample.Title"
#from the top of the file.
nyt = open('../data/nyt_title_data.csv') # check the structure of this file!
nyt_data = []
nyt_labels = []
csv_reader = csv.reader(nyt)
for line in csv_reader:
nyt_labels.append(int(line[0]))
nyt_data.append(line[1])
nyt.close()
trainset_size = int(round(len(nyt_data)*0.75)) # i chose this threshold arbitrarily...to discuss
print 'The training set size for this classifier is ' + str(trainset_size) + '\n'
X_train = np.array([''.join(el) for el in nyt_data[0:trainset_size]])
y_train = np.array([el for el in nyt_labels[0:trainset_size]])
X_test = np.array([''.join(el) for el in nyt_data[trainset_size+1:len(nyt_data)]])
y_test = np.array([el for el in nyt_labels[trainset_size+1:len(nyt_labels)]])
#print(X_train)
vectorizer = TfidfVectorizer(min_df=2,
ngram_range=(1, 2),
stop_words='english',
strip_accents='unicode',
norm='l2')
test_string = unicode(nyt_data[0])
print "Example string: " + test_string
print "Preprocessed string: " + vectorizer.build_preprocessor()(test_string)
print "Tokenized string:" + str(vectorizer.build_tokenizer()(test_string))
print "N-gram data string:" + str(vectorizer.build_analyzer()(test_string))
print "\n"
The training set size for this classifier is 1621 Example string: Dole Courts Democrats Preprocessed string: dole courts democrats Tokenized string:[u'Dole', u'Courts', u'Democrats'] N-gram data string:[u'dole', u'courts', u'democrats', u'dole courts', u'courts democrats']
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
nb_classifier = MultinomialNB().fit(X_train, y_train)
y_nb_predicted = nb_classifier.predict(X_test)
print "MODEL: Multinomial Naive Bayes\n"
print 'The precision for this classifier is ' + str(metrics.precision_score(y_test, y_nb_predicted))
print 'The recall for this classifier is ' + str(metrics.recall_score(y_test, y_nb_predicted))
print 'The f1 for this classifier is ' + str(metrics.f1_score(y_test, y_nb_predicted))
print 'The accuracy for this classifier is ' + str(metrics.accuracy_score(y_test, y_nb_predicted))
print '\nHere is the classification report:'
print classification_report(y_test, y_nb_predicted)
#simple thing to do would be to up the n-grams to bigrams; try varying ngram_range from (1, 1) to (1, 2)
#we could also modify the vectorizer to stem or lemmatize
print '\nHere is the confusion matrix:'
print metrics.confusion_matrix(y_test, y_nb_predicted, labels=unique(nyt_labels))
MODEL: Multinomial Naive Bayes The precision for this classifier is 0.678610380886 The recall for this classifier is 0.549165120594 The f1 for this classifier is 0.506785046956 The accuracy for this classifier is 0.549165120594 Here is the classification report: precision recall f1-score support 3 1.00 0.23 0.38 47 12 0.75 0.08 0.15 37 15 1.00 0.10 0.19 39 16 0.59 0.56 0.58 112 19 0.46 0.88 0.60 162 20 0.70 0.64 0.67 99 29 1.00 0.21 0.35 43 avg / total 0.68 0.55 0.51 539 Here is the confusion matrix: [[ 11 0 0 5 24 7 0] [ 0 3 0 5 24 5 0] [ 0 0 4 4 26 5 0] [ 0 1 0 63 44 4 0] [ 0 0 0 18 143 1 0] [ 0 0 0 8 28 63 0] [ 0 0 0 4 25 5 9]]
#What are the top N most predictive features per class?
N = 10
vocabulary = np.array([t for t, i in sorted(vectorizer.vocabulary_.iteritems(), key=itemgetter(1))])
for i, label in enumerate(nyt_labels):
if i == 7: # hack...
break
topN = np.argsort(nb_classifier.coef_[i])[-N:]
print "\nThe top %d most informative features for topic code %s: \n%s" % (N, label, " ".join(vocabulary[topN]))
#print topN
The top 10 most informative features for topic code 20: study hospitals aids medicare cancer health tobacco care drug new The top 10 most informative features for topic code 29: special special report new drug sniper suspect report crime police case The top 10 most informative features for topic code 3: market chief wall new billion big stocks enron deal microsoft The top 10 most informative features for topic code 16: iraqi military baghdad 11 bush challenged nation challenged nation war iraq The top 10 most informative features for topic code 19: russia japan war mideast russian india leader new israel china The top 10 most informative features for topic code 19: 2000 campaign 2000 clinton testing testing president politics bush democrats campaign president The top 10 most informative features for topic code 20: victory bowl knicks win world yankees playoffs series game baseball
from sklearn.svm import LinearSVC
svm_classifier = LinearSVC().fit(X_train, y_train)
y_svm_predicted = svm_classifier.predict(X_test)
print "MODEL: Linear SVC\n"
print 'The precision for this classifier is ' + str(metrics.precision_score(y_test, y_svm_predicted))
print 'The recall for this classifier is ' + str(metrics.recall_score(y_test, y_svm_predicted))
print 'The f1 for this classifier is ' + str(metrics.f1_score(y_test, y_svm_predicted))
print 'The accuracy for this classifier is ' + str(metrics.accuracy_score(y_test, y_svm_predicted))
print '\nHere is the classification report:'
print classification_report(y_test, y_svm_predicted)
#simple thing to do would be to up the n-grams to bigrams; try varying ngram_range from (1, 1) to (1, 2)
#we could also modify the vectorizer to stem or lemmatize
print '\nHere is the confusion matrix:'
print metrics.confusion_matrix(y_test, y_svm_predicted, labels=unique(nyt_labels))
MODEL: Linear SVC The precision for this classifier is 0.63396261297 The recall for this classifier is 0.623376623377 The f1 for this classifier is 0.620561226142 The accuracy for this classifier is 0.623376623377 Here is the classification report: precision recall f1-score support 3 0.69 0.47 0.56 47 12 0.43 0.49 0.46 37 15 0.68 0.44 0.53 39 16 0.60 0.59 0.59 112 19 0.60 0.77 0.67 162 20 0.72 0.66 0.69 99 29 0.73 0.56 0.63 43 avg / total 0.63 0.62 0.62 539 Here is the confusion matrix: [[ 22 5 1 6 7 5 1] [ 1 18 1 2 10 2 3] [ 1 3 17 1 10 6 1] [ 1 7 3 66 28 6 1] [ 4 5 2 23 124 4 0] [ 3 2 1 9 16 65 3] [ 0 2 0 3 12 2 24]]
#What are the top N most predictive features per class?
N = 10
vocabulary = np.array([t for t, i in sorted(vectorizer.vocabulary_.iteritems(), key=itemgetter(1))])
for i, label in enumerate(nyt_labels):
if i == 7: # hack...
break
topN = np.argsort(svm_classifier.coef_[i])[-N:]
print "\nThe top %d most informative features for topic code %s: \n%s" % (N, label, " ".join(vocabulary[topN]))
#print topN
The top 10 most informative features for topic code 20: schiavo tissue baby scientists fat cancer gene medicare hospitals tobacco The top 10 most informative features for topic code 29: fallen limiting charged police rampage suspect murder crime sniper gun The top 10 most informative features for topic code 3: profit workers merger response storm pricing deal stocks enron microsoft The top 10 most informative features for topic code 16: base generals afghanistan navy force hussein nation nato 11 iraq The top 10 most informative features for topic code 19: pakistan india russian japan europe china africa mideast israel chinese The top 10 most informative features for topic code 19: impeachment whitewater race gingrich lewinsky senate president politics campaign democrats The top 10 most informative features for topic code 20: bowl armstrong match knicks playoffs play yankees series baseball game
from sklearn.linear_model import LogisticRegression
maxent_classifier = LogisticRegression().fit(X_train, y_train)
y_maxent_predicted = maxent_classifier.predict(X_test)
print "MODEL: Maximum Entropy\n"
print 'The precision for this classifier is ' + str(metrics.precision_score(y_test, y_maxent_predicted))
print 'The recall for this classifier is ' + str(metrics.recall_score(y_test, y_maxent_predicted))
print 'The f1 for this classifier is ' + str(metrics.f1_score(y_test, y_maxent_predicted))
print 'The accuracy for this classifier is ' + str(metrics.accuracy_score(y_test, y_maxent_predicted))
print '\nHere is the classification report:'
print classification_report(y_test, y_maxent_predicted)
#simple thing to do would be to up the n-grams to bigrams; try varying ngram_range from (1, 1) to (1, 2)
#we could also modify the vectorizer to stem or lemmatize
print '\nHere is the confusion matrix:'
print metrics.confusion_matrix(y_test, y_maxent_predicted, labels=unique(nyt_labels))
MODEL: Maximum Entropy The precision for this classifier is 0.654004346593 The recall for this classifier is 0.549165120594 The f1 for this classifier is 0.524774091511 The accuracy for this classifier is 0.549165120594 Here is the classification report: precision recall f1-score support 3 0.94 0.32 0.48 47 12 0.67 0.16 0.26 37 15 0.71 0.13 0.22 39 16 0.64 0.54 0.59 112 19 0.44 0.85 0.58 162 20 0.72 0.60 0.65 99 29 1.00 0.28 0.44 43 avg / total 0.65 0.55 0.52 539 Here is the confusion matrix: [[ 15 0 0 4 25 3 0] [ 1 6 0 3 24 3 0] [ 0 1 5 1 30 2 0] [ 0 2 0 61 43 6 0] [ 0 0 1 19 138 4 0] [ 0 0 1 7 32 59 0] [ 0 0 0 1 25 5 12]]
#What are the top N most predictive features per class?
N = 10
vocabulary = np.array([t for t, i in sorted(vectorizer.vocabulary_.iteritems(), key=itemgetter(1))])
for i, label in enumerate(nyt_labels):
if i == 7: # hack...
break
topN = np.argsort(maxent_classifier.coef_[i])[-N:]
print "\nThe top %d most informative features for topic code %s: \n%s" % (N, label, " ".join(vocabulary[topN]))
#print topN
The top 10 most informative features for topic code 20: study scientists aids health hospitals medicare cancer care drug tobacco The top 10 most informative features for topic code 29: murder mexico officer drug gun suspect sniper police case crime The top 10 most informative features for topic code 3: big chief wall pay market billion stocks deal enron microsoft The top 10 most informative features for topic code 16: challenged iraqis nato force arms nation challenged war nation 11 iraq The top 10 most informative features for topic code 19: europe leader russia chinese japan russian mideast india israel china The top 10 most informative features for topic code 19: 2000 dole clinton bush race senate politics president democrats campaign The top 10 most informative features for topic code 20: team play armstrong bowl knicks yankees playoffs series game baseball
Now we're going to go over some typical topic modeling by using the popular Gensim library.
The nice thing about Gensim is that it's ready to be applied to large datasets as it incorporates both the online version of LDA and distributed computing capability.
We won't go over those features in this tutorial, since that would take hours to show a single example, and the NYTimes dataset is really quite small and can be run on a single machine.
from gensim import corpora, models, similarities
from itertools import chain
import nltk
from nltk.corpus import stopwords
from operator import itemgetter
import re
url_pattern = r'https?:\/\/(.*[\r\n]*)+'
documents = [nltk.clean_html(document) for document in nyt_data]
stoplist = stopwords.words('english')
texts = [[word for word in document.lower().split() if word not in stoplist]
for document in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
#lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=100)
#lsi.print_topics(20)
n_topics = 60
lda = models.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=n_topics)
for i in range(0, n_topics):
temp = lda.show_topic(i, 10)
terms = []
for term in temp:
terms.append(term[1])
print "Top 10 terms for topic #" + str(i) + ": "+ ", ".join(terms)
print
print 'Which LDA topic maximally describes a document?\n'
print 'Original document: ' + documents[1]
print 'Preprocessed document: ' + str(texts[1])
print 'Matrix Market format: ' + str(corpus[1])
print 'Topic probability mixture: ' + str(lda[corpus[1]])
print 'Maximally probable topic: topic #' + str(max(lda[corpus[1]],key=itemgetter(1))[0])
Top 10 terms for topic #0: company, delays, reforms, money;, 6, whose, rights, giving, nazis, running Top 10 terms for topic #1: garden, he's, lavish, networks,, trade-off, scripts, head, plan, pentagon, hit Top 10 terms for topic #2: revival, now,, last,, guns,, killing,, chamber, tries, split, friends;, russians; Top 10 terms for topic #3: legal, beckons, best, resistance, won't, privileged, mexico's, presidency,, tilts, production Top 10 terms for topic #4: europe, r, entrepreneurs,, ambitious, moscow, tough, home,, said, call, one Top 10 terms for topic #5: pakistan, gathering, claimed, signs, raise, taliban;, supplying, hesitantly, ended, reviewing Top 10 terms for topic #6: gaining,, losing,, torricelli, take, super, beijing, bowl, two, find, bush's Top 10 terms for topic #7: terror, turmoil, must, shift, suspects, level, 2, vast, left, anguish Top 10 terms for topic #8: colombia, swing, justices, effort, holding, suspect, pledge, talks, peres, military Top 10 terms for topic #9: california;, title, balance, takes, governor, move, u.n., tough, hill's, seat-squirmer Top 10 terms for topic #10: syria, baseball;, afghanistan,, pushes, swiftly, sharon, report, halt, israel, ace Top 10 terms for topic #11: adoptions, rewarding, pause, recovery,, saudis, hundreds, diplomacy;, betrayed, laotians, mideast Top 10 terms for topic #12: sars, timetable, clash, hotel, aide, urges, close, iraq, bush, toronto Top 10 terms for topic #13: inside, care, business, planes, take, rebels, ill, mexico, town, war: Top 10 terms for topic #14: insiders, kerry, suspected, woman, bid, giuliani, describes, saudi, foster, court Top 10 terms for topic #15: aid, briefly, airstrike, zarqawi, survived, balks, numbers, bush, get, game Top 10 terms for topic #16: cardinals, choosy, loner,, series, world, region, short, blasts, control, feared, Top 10 terms for topic #17: fronts, disease, project, stretching,, sadly, lessons, excruciating, air, came, trial Top 10 terms for topic #18: alzheimer's, drugs, cover, treatments, emergency, program, streets, doctor's, many,, room, Top 10 terms for topic #19: weigh, documents, incentives, dissuade, gifts, bush, east, stay, leaves, trial Top 10 terms for topic #20: veto, resolution, pursue, islamic, shiite, leadership;, iraq,, northeast, back, g.o.p. Top 10 terms for topic #21: game, fear, c.i.a., sees, intelligence;, hindered, hindsight,, terror, tide, all-around Top 10 terms for topic #22: public, conservatives, protest, terrorism, amid, starts, tobacco, impasse, white, rules Top 10 terms for topic #23: arafat, fled, detainee, homes, mentally, heart, chechen, afghanistan, approves, money Top 10 terms for topic #24: iran, rivera, send, confession, star, deaths, spy, yankees, croatian's, scale Top 10 terms for topic #25: talks, defines, palestinians, german, vote, final, taliban, plan, war:, nation Top 10 terms for topic #26: international, business;, passes, opportunity, vows, expand, seek, sell, crime, asks Top 10 terms for topic #27: all,, treatment, crash, economic, executive, medicare, covering, goes, '96;, plea Top 10 terms for topic #28: fall, coach, analysis;, news, charges, caught, (again), costly, near, feet Top 10 terms for topic #29: missile, chief, taiwan, forced, mccain, test, dole's, canceled, vote, unable Top 10 terms for topic #30: states, murder, shift,, 3, locked, sets, it's, backing, koreans, sites Top 10 terms for topic #31: middle, executives, twice, widowed, india,, scorned, sent, inquiry, (again), hand, Top 10 terms for topic #32: challenged:, nation, time,, insurance, efforts, sept., 11, keep, peace, panel Top 10 terms for topic #33: president:, testing, peace, overview;, israel, rehiring, ex-president, morgan, rises, hope Top 10 terms for topic #34: saudi, blast, high, 13, insurance, pact, government, costs, day, lowest Top 10 terms for topic #35: proposes, medicare, found, troops;, guilty, tighten, remains, set, crisis, case Top 10 terms for topic #36: search, baseball, 5, upsets, move, one, playoffs, bush, may, swiftly Top 10 terms for topic #37: responses:, threats, strategy, disease;, budget, rich, security;, presence, damage, toss Top 10 terms for topic #38: al, qaeda, witness, personal, let, $1, memo;, russia:, tour, bold Top 10 terms for topic #39: failure, benefit, cutback, cheney's, companies, drug, bush, slave, adopt, fund Top 10 terms for topic #40: military, losses, target, bars, cloning, ban, california's, clear, book,, novel Top 10 terms for topic #41: smoke, front;, role, battle:, vast, spin, fierce, contest, vote:, clear Top 10 terms for topic #42: life, he'll, court, dean's, reshape, faulted, contest, choice, falling, steps Top 10 terms for topic #43: toward, children, light, hurrying, warmth, aiding, russians, try, presents, missiles Top 10 terms for topic #44: surges, mode,, rightist, bolster, pick, york, attack, dies, new, cries Top 10 terms for topic #45: confinement, guantanamo,, interrogation, rebuilding, corporations, goldman, hurting, deal;, appears, iowa Top 10 terms for topic #46: tells, record, vow, israel, yankees, temper, leaders, runner, mile, working Top 10 terms for topic #47: points, arms, 10%, arbitration, approval, three, picked, woo, populist, f.d.a. Top 10 terms for topic #48: uneasiness, tigers, iraq, inflamed:, reconstruction;, repayment, fee, sees, worst, region Top 10 terms for topic #49: sosa, mcgwire, expert's, point, plan, blair, gains, caribbean, coverage, jet Top 10 terms for topic #50: 1998, resolve, base, campaign:, aids, family, foley, cleric, republicans, elections: Top 10 terms for topic #51: peace,, politics, crashes, hollywood, enclaves, cockfights, flourishing, war, die, u.s., Top 10 terms for topic #52: green, cloning, bus, flaws, hubris, eccentric's, frenzy, defense, avoid, pro Top 10 terms for topic #53: giants, declares, fence, victory, paler, rosy, sometimes, investors,, forecasts, results Top 10 terms for topic #54: bridgeport, dispute, health, ballot, insurer, options, away, mccain,, crack, uninsured Top 10 terms for topic #55: time, milosevic, travel, savor, stalls, tuesdays:, clues, among, grants, africans Top 10 terms for topic #56: victims, texaco, democracy, batter, jews, using, back, nigerians, lurching, nafta Top 10 terms for topic #57: city, school, shaken, weapons, pills., sunscreen., spray., checklist, bug, camp: Top 10 terms for topic #58: g.i.'s, counting, command, ways, east, marines, charter, john, changes, assailing Top 10 terms for topic #59: jordan, egypt, rape, brings, asian, rise, young, kabul, first, leader Which LDA topic maximally describes a document? Original document: Yanks End Drought; Mets Fall in Opener Preprocessed document: ['yanks', 'end', 'drought;', 'mets', 'fall', 'opener'] Matrix Market format: [(3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)] Topic probability mixture: [(19, 0.43120206027031704), (27, 0.14505309958538579), (28, 0.14521071520874509), (38, 0.14520079160221899)] Maximally probable topic: topic #19
/Users/rweiss/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/gensim-0.8.7-py2.7.egg/gensim/__init__.py:12: UserWarning: Module IPython was already imported from /Applications/Canopy.app/appdata/canopy-1.0.3.1262.macosx-x86_64/Canopy.app/Contents/lib/python2.7/site-packages/IPython/__init__.pyc, but /Users/rweiss/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/ipython-1.0.0-py2.7.egg is being added to sys.path __version__ = __import__('pkg_resources').get_distribution('gensim').version