%matplotlib inline
import json
import requests
import sklearn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 30)
critics = pd.read_csv('rt_critics.csv')
critics.describe()
| | imdb | rtid |
---|---|---|
count | 14072.000000 | 1.407200e+04 |
mean | 155048.688104 | 5.594059e+07 |
std | 157531.635841 | 1.805150e+08 |
min | 13442.000000 | 1.100000e+01 |
25% | 97240.000000 | 1.129200e+04 |
50% | 115798.000000 | 1.337500e+04 |
75% | 134119.000000 | 1.645000e+04 |
max | 1190539.000000 | 7.710318e+08 |
8 rows × 2 columns
critics.head()
| | critic | fresh | imdb | publication | quote | review_date | rtid | title |
---|---|---|---|---|---|---|---|---|
0 | Derek Adams | fresh | 114709 | Time Out | So ingenious in concept, design and execution ... | 2009-10-04 | 9559 | Toy story |
1 | Richard Corliss | fresh | 114709 | TIME Magazine | The year's most inventive comedy. | 2008-08-31 | 9559 | Toy story |
2 | David Ansen | fresh | 114709 | Newsweek | A winning animated feature that has something ... | 2008-08-18 | 9559 | Toy story |
3 | Leonard Klady | fresh | 114709 | Variety | The film sports a provocative and appealing st... | 2008-06-09 | 9559 | Toy story |
4 | Jonathan Rosenbaum | fresh | 114709 | Chicago Reader | An entertaining computer-generated, hyperreali... | 2008-03-10 | 9559 | Toy story |
5 rows × 8 columns
| | critic | fresh | imdb | publication | quote | review_date | rtid | title |
---|---|---|---|---|---|---|---|---|
5090 | Chris Nashawaty | fresh | 73486 | Entertainment Weekly | There's a lot here. But with a classic like Cu... | 2010-09-09 | 12965 | One Flew Over the Cuckoo's Nest |
5091 | Richard Schickel | rotten | 73486 | TIME Magazine | One Flew over the Cuckoo 's Nest is an earnest... | 2009-02-20 | 12965 | One Flew Over the Cuckoo's Nest |
5092 | James Berardinelli | fresh | 73486 | ReelViews | Viewed 30 years after its release, One Flew Ov... | 2008-11-04 | 12965 | One Flew Over the Cuckoo's Nest |
5097 | Roger Ebert | fresh | 73486 | Chicago Sun-Times | Is One Flew Over the Cuckoo's Nest not a great... | 2003-03-25 | 12965 | One Flew Over the Cuckoo's Nest |
4 rows × 8 columns
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
Multinomial vs. Bernoulli:
# The multinomial model uses how many times each word occurs in a document, so repeated words carry extra weight - it tends to work better with larger vocabularies (more features)
# The Bernoulli model uses only whether each word is present in a document - it tends to work better with smaller vocabularies (fewer features)
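To make the distinction concrete, here is a minimal sketch on two toy documents. The documents, the labels, and the variable names `counts` and `presence` are invented for illustration and are not part of the critics data. `CountVectorizer(binary=True)` is used only to show the presence/absence view of the same text; `BernoulliNB` also binarizes its input by default.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Toy documents and labels, invented purely for illustration
docs = ['Math is great', 'Exciting exciting Math']
labels = [1, 0]

# Raw word counts: the representation a multinomial model works with
counts = CountVectorizer().fit_transform(docs).toarray()

# Presence/absence only: the representation a Bernoulli model works with
presence = CountVectorizer(binary=True).fit_transform(docs).toarray()

# Columns are the sorted vocabulary: exciting, great, is, math
print(counts)
print(presence)

# Each model is fitted on the representation that matches its assumptions
multinomial = MultinomialNB().fit(counts, labels)
bernoulli = BernoulliNB().fit(presence, labels)
```

The only cell that differs between the two matrices is the repeated word "exciting" in the second document: it counts twice for the multinomial model but only once for the Bernoulli model.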
#
### How the Count Vectorizer Works
#
from sklearn.feature_extraction.text import CountVectorizer
text = ['Math is great', 'Math is really great', 'Exciting exciting Math']
print("Original text is\n" + '\n'.join(text))
vectorizer = CountVectorizer(ngram_range=(1,2))
vectorizer2 = CountVectorizer()
# call `fit` to build the vocabulary
vectorizer.fit(text)
Original text is
Math is great
Math is really great
Exciting exciting Math
CountVectorizer(analyzer=u'word', binary=False, charset=None, charset_error=None, decode_error=u'strict', dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content', lowercase=True, max_df=1.0, max_features=None, min_df=1, ngram_range=(1, 2), preprocessor=None, stop_words=None, strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b', tokenizer=None, vocabulary=None)
vectorizer?
# call `transform` to convert text to a bag of words
x = vectorizer.transform(text)
print(x) # A compressed version
(0, 1) 1 (0, 2) 1 (0, 3) 1
(1, 1) 1 (1, 2) 1 (1, 3) 1 (1, 4) 1
(2, 0) 2 (2, 3) 1
# CountVectorizer uses a sparse array to save memory, but it's easier in this assignment to
# convert back to a "normal" numpy array
#x = x.toarray()
#x
array([[0, 1, 1, 1, 0], [0, 1, 1, 1, 1], [2, 0, 0, 1, 0]])
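The relationship between the compressed printout and the dense array can be sketched with SciPy directly. This is an illustration under the assumption that the vectorizer returns a SciPy CSR matrix; the name `dense` is just a stand-in for the array shown above.

```python
import numpy as np
from scipy.sparse import csr_matrix

# The dense counts shown above: 3 documents by 5 vocabulary entries
dense = np.array([[0, 1, 1, 1, 0],
                  [0, 1, 1, 1, 1],
                  [2, 0, 0, 1, 0]])

# A sparse matrix stores only the non-zero cells as (row, column) value entries
sparse = csr_matrix(dense)
print(sparse.nnz)   # number of stored non-zero entries
print(sparse)       # coordinate-style listing of the non-zero cells

# toarray() expands the sparse matrix back into an ordinary numpy array
roundtrip = sparse.toarray()
```

For a real vocabulary with thousands of columns, almost every cell is zero, which is why the sparse form saves so much memory.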
print("")
print("Transformed text vector is \n" + str(x))
# `get_feature_names_out` tracks which word or bigram is associated with each column of the transformed x
print("")
print("Words for each feature:")
print(vectorizer.get_feature_names_out().tolist())
# Notice that the bag of words treatment doesn't preserve information about the overall *order* of words,
# just their frequency (the bigram features only record which words were adjacent)
Transformed text vector is
(0, 3) 1 (0, 4) 1 (0, 5) 1 (0, 7) 1 (0, 8) 1
(1, 3) 1 (1, 4) 1 (1, 6) 1 (1, 7) 1 (1, 8) 1 (1, 9) 1 (1, 10) 1
(2, 0) 2 (2, 1) 1 (2, 2) 1 (2, 7) 1
Words for each feature:
[u'exciting', u'exciting exciting', u'exciting math', u'great', u'is', u'is great', u'is really', u'math', u'math is', u'really', u'really great']
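The fit and transform steps can be imitated in a few lines of plain Python. This is a rough sketch, not scikit-learn's actual implementation: it assumes lowercasing, the default token pattern shown in the `CountVectorizer` repr above, and `ngram_range=(1, 2)`. The helper names `tokenize` and `ngrams` are made up for this example.

```python
import re
from collections import Counter

def tokenize(doc):
    # Lowercase, then keep tokens of two or more word characters
    return re.findall(r'(?u)\b\w\w+\b', doc.lower())

def ngrams(tokens, n_min=1, n_max=2):
    # Every contiguous run of n_min..n_max tokens, joined by spaces
    return [' '.join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]

text = ['Math is great', 'Math is really great', 'Exciting exciting Math']
doc_terms = [ngrams(tokenize(doc)) for doc in text]

# "fit": the vocabulary is the sorted set of all terms seen in any document
vocabulary = sorted(set(term for terms in doc_terms for term in terms))

# "transform": one row per document, one count per vocabulary entry
counts = []
for terms in doc_terms:
    tally = Counter(terms)
    counts.append([tally[term] for term in vocabulary])

print(vocabulary)
for row in counts:
    print(row)
```

The vocabulary comes out in the same order as the feature names printed above, and the third row has a 2 in the `exciting` column because that word appears twice in the last document.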
The vectorizer is used to turn the critics data into two arrays:
X: an (nreview, nwords) array. Each row corresponds to a bag-of-words representation for a single review. This will be the input to the model.
Y: an nreview-element 1/0 array, encoding whether a review is Fresh (1) or Rotten (0). This is the desired output.
critics.quote[2]
'A winning animated feature that has something for everyone on the age spectrum.'
# Create a vector where each row is bag-of-words for a single quote
X = vectorizer.fit_transform(critics.quote)
# We can see the bag-of-words representation
#ViewX = X.toarray()
#ViewX[30]
array([0, 0, 0, ..., 0, 0, 0])
# Create an array where each element encodes whether the review is Fresh (1) or Rotten (0)
Y = (critics.fresh == 'fresh').values.astype(int)
critics.quote[1]
"The year's most inventive comedy."
train_test_split?
# Use SKLearn's train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(X, Y)
# Create our classifier
clf = MultinomialNB().fit(xtrain, ytrain)
print("Accuracy: %0.2f%%" % (100 * clf.score(xtest, ytest)))
Accuracy: 77.54%
MultinomialNB().fit
<bound method MultinomialNB.fit of MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)>
# Print the accuracy on the training and test datasets
training_accuracy = clf.score(xtrain, ytrain)
test_accuracy = clf.score(xtest, ytest)
print("Accuracy on training data: %0.2f" % (training_accuracy))
print("Accuracy on test data: %0.2f" % (test_accuracy))
Accuracy on training data: 0.99
Accuracy on test data: 0.78
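For a classifier, `score` reports mean accuracy: the fraction of predictions that match the true labels. A small sketch with made-up labels (the arrays `y_true` and `y_pred` are hypothetical, not taken from the critics data) shows the arithmetic:

```python
import numpy as np

# Hypothetical true labels and predictions for five reviews (1 = fresh, 0 = rotten)
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1])

# Accuracy is the share of positions where prediction and truth agree
accuracy = np.mean(y_pred == y_true)
print("Accuracy: %0.2f" % accuracy)
```

Three of the five predictions agree with the labels, so the accuracy here is 0.60.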
# Save prediction and probability
# Column 0 of predict_proba is P(rotten), because the classes are ordered [0, 1]
prob = clf.predict_proba(X)[:, 0]
predict = clf.predict(X)
# Review errors: Rotten reviews with the lowest P(rotten), Fresh reviews with the highest P(rotten)
bad_rotten = np.argsort(prob[Y == 0])[:5]
bad_fresh = np.argsort(prob[Y == 1])[-5:]
print("Mis-predicted Rotten quotes")
print('---------------------------')
for row in bad_rotten:
    print(critics[Y == 0].quote.iloc[row])
    print("")
print("Mis-predicted Fresh quotes")
print('--------------------------')
for row in bad_fresh:
    print(critics[Y == 1].quote.iloc[row])
    print("")
Mis-predicted Rotten quotes
---------------------------
Peter's Friends won't win over anyone looking for depth - as drama, it's pure popcorn - but the vignettes are swept along by Branagh's richly theatrical temperament and by the exuberant wit of the cast.
Where the Wild Things Are is audacious in its refusal to be reassuring, which makes it hard to love, but also hard to dismiss.
Despite its arresting visual style, its wave after wave of creative and hypnotic images, "The Pillow Book," as its name hints, slowly but inexorably leads to sleep.
The thought that he may yet return for further adventures with his body and Lugosi's sconce fills us with mortal terror. That is the most fearful prospect which the picture manages to convey.
One Flew over the Cuckoo 's Nest is an earnest attempt to make a serious film. But in the end the movie backs away from both the human reality and the cloudy but potent symbolism that Ken Kesey found in the asylum.
Mis-predicted Fresh quotes
--------------------------
It's a one-joke movie, a funhouse ride, the cinematic equivalent of having a rubber spider thrown in your lap. But it doesn't matter if you reject the wispy script or the plot, which has as much substance as a spider's web; you'll jump every time.
It isn't without some zip, though you have to wonder why the producers bothered when the censors demanded that the dancers be shown only from the neck up.
A gooey, swooning swatch of romantic hyperventilation, its queasy charms. And let it be said that surrendering to those charms could be as guilt-inducing as polishing off a pint of Haagen-Dazs chocolate ice cream before lunch.
This tough-to-peg whodunit keeps you going for two hours, despite a few James Bond-ish (or Jane Bond-ish) turns that play less preposterously than you might assume were they to be divulged.
The movie's own payoff is compelling enough, but the project has a weightless feel that limits involvement. Better you give it an hour-and-a-half on video someday, surrounded by wine and snacks.
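The selection of these quotes relies on sorting P(rotten) within each true class. A toy sketch with invented probabilities (the arrays below are hypothetical, not model output) shows how the two `np.argsort` slices pick the most confidently wrong predictions and how the positions map back to the original rows:

```python
import numpy as np

# Hypothetical P(rotten) for six reviews and their true labels (1 = fresh, 0 = rotten)
prob_rotten = np.array([0.90, 0.10, 0.60, 0.95, 0.30, 0.05])
y = np.array([0, 0, 1, 1, 0, 1])

# Truly rotten reviews with the LOWEST P(rotten): the model was most wrong about these
worst_rotten = np.argsort(prob_rotten[y == 0])[:2]

# Truly fresh reviews with the HIGHEST P(rotten): likewise the worst fresh mistakes
worst_fresh = np.argsort(prob_rotten[y == 1])[-2:]

# The positions are relative to each filtered subset, so map them back to original rows
rotten_rows = np.flatnonzero(y == 0)[worst_rotten]
fresh_rows = np.flatnonzero(y == 1)[worst_fresh]
print(rotten_rows, fresh_rows)
```

Because the indices returned by `argsort` refer to the filtered subset, the notebook code applies the same filter (`critics[Y == 0]` or `critics[Y == 1]`) before looking up each quote by position.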
critics[critics['quote'].str.contains("audacious in its")]
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-1-dd56d2d6384d> in <module>()
----> 1 critics[critics['quote'].str.contains("audacious in its")]
NameError: name 'critics' is not defined
critics[critics['quote'].str.contains("own payoff is")]
| | critic | fresh | imdb | publication | quote | review_date | rtid | title |
---|---|---|---|---|---|---|---|---|
5965 | Mike Clark | fresh | 115710 | USA Today | The movie's own payoff is compelling enough, b... | 2000-01-01 | 364525542 | Blood and Wine |
1 rows × 8 columns