The Naive Bayes classifier is a statistical tool with diverse uses in text analysis. It is famously used to determine whether an email is "spam" or "ham" (the opposite of spam). It also has uses in sentiment analysis (for example, determining whether a movie review is positive or negative), in authorship attribution, and in categorizing documents into topics. Today, however, we will use a Naive Bayes classifier not to classify, but to analyze the competing sides of political debates on the floor of the US Congress.
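Before turning to congressional speeches, here is a minimal sketch of the spam/ham use case on a tiny invented corpus (the training messages below are made up purely for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus invented for illustration -- four labeled training messages
train_docs = ["win free money now", "free prize click now",
              "meeting agenda for monday", "lunch with the team monday"]
train_labels = ["spam", "spam", "ham", "ham"]

# Turn each message into a vector of word counts
vec = CountVectorizer()
X = vec.fit_transform(train_docs)

# Fit the classifier and predict the label of an unseen message
clf = MultinomialNB()
clf.fit(X, train_labels)
print(clf.predict(vec.transform(["free money prize"])))  # ['spam']
```

The classifier learns which words are more probable under each label and picks the label that makes the new message most likely; we will use exactly the same machinery on speeches below.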
%pylab inline
from __future__ import division
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.cross_validation import cross_val_score
from sklearn.cross_validation import train_test_split
from speech import Speech
phrase = "abortion"
# rows=0 here: first ask only for the count of matching speeches, then fetch them all
num_speeches = Speech.get(0, 0, phrase=phrase, congress="", start_date="1995-05-04", speaker_party="*")['count']
print "Downloading %i speeches" % num_speeches
speeches = Speech.get(start=0, rows=num_speeches, phrase=phrase, speaker_party="*")['speeches']
print len(speeches), "speeches downloaded"
naive_bayes = MultinomialNB(alpha=1.0, fit_prior=True)  # Laplace smoothing; learn class priors from the data
vectorizer = TfidfVectorizer(min_df=.1, max_df=.6, stop_words='english')  # drop terms in <10% or >60% of speeches
data = [" ".join(speech['speaking']) for speech in speeches]  # join each speech's text blocks into one string
data = vectorizer.fit_transform(data)
target = [speech['speaker_party'] for speech in speeches]
target = [0 if x == "D" else 1 for x in target]  # Democrats -> 0, everyone else -> 1
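Note that this encoding lumps every non-Democrat (Republicans and any independents) into class 1. A quick sanity check of the pattern on hypothetical party labels, invented to mimic what the API returns:

```python
from collections import Counter

# Hypothetical party labels, invented for illustration
parties = ["D", "R", "D", "I", "R", "D"]

# Same encoding as above: Democrats become class 0, everyone else class 1
encoded = [0 if p == "D" else 1 for p in parties]
print(encoded)           # [0, 1, 0, 1, 1, 0]
print(Counter(encoded))  # quick check that the classes are not wildly imbalanced
```

Checking the class balance is worthwhile because `fit_prior=True` means a lopsided split will tilt the classifier toward the majority party.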
data.shape
X_train, X_test, Y_train, Y_test = train_test_split(data, target, test_size=0.2)  # hold out 20% for evaluation
print X_train.shape, X_test.shape, len(Y_train), len(Y_test)
naive_bayes.fit(X_train, Y_train)
naive_bayes.score(X_test, Y_test)  # accuracy on the held-out 20%
cross_val_score(naive_bayes, data, target, scoring='accuracy', verbose=1, cv=5)  # 5-fold cross-validation
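Cross-validation averages accuracy over several train/test splits, which is more stable than a single hold-out score. A self-contained sketch on synthetic count data (note that from scikit-learn 0.18 onward, `cross_val_score` lives in `sklearn.model_selection` rather than `sklearn.cross_validation`):

```python
import numpy as np
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation in older releases
from sklearn.naive_bayes import MultinomialNB

rng = np.random.RandomState(0)
# Synthetic term counts: class 0 favors the first two "words", class 1 the last two
X = np.vstack([rng.poisson([5, 5, 1, 1], size=(50, 4)),
               rng.poisson([1, 1, 5, 5], size=(50, 4))])
y = np.array([0] * 50 + [1] * 50)

scores = cross_val_score(MultinomialNB(), X, y, scoring='accuracy', cv=5)
print(scores)         # one accuracy per fold
print(scores.mean())  # the summary number usually reported
```

Reporting the mean (and spread) of the fold scores guards against getting lucky or unlucky with one particular split.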
terms = vectorizer.get_feature_names()
# Per-term log-likelihoods, weighted by each class's share of the training data
t1 = [(naive_bayes.feature_log_prob_[0][i] * (naive_bayes.class_count_[0] / naive_bayes.class_count_.sum())) for i in range(len(terms))]
t2 = [(naive_bayes.feature_log_prob_[1][i] * (naive_bayes.class_count_[1] / naive_bayes.class_count_.sum())) for i in range(len(terms))]
[(terms[i], t1[i]) for i in np.array(t1).argsort()]  # Terms for class 0 (Democrats); ascending sort, so the most indicative appear last
[(terms[i], t2[i]) for i in np.array(t2).argsort()]  # Terms for class 1 (Republicans); ascending sort, so the most indicative appear last
What other contexts might you be able to apply this methodology to? How can a classifier be useful in your work? How about the particular way in which we were able to find the words most indicative of a certain class?