Natural Language Processing (NLP)¶

Part 1: Introduction¶

Adapted from NLP Crash Course by Charlie Greenbacker and Introduction to NLP by Dan Jurafsky

What is NLP?¶

Using computers to process (analyze, understand, generate) natural human languages

Why is NLP useful?¶

Most knowledge created by humans is unstructured text
Need some way to make sense of it
Enables quantitative analysis of text data

What are some of the higher level task areas?¶

Speech recognition and generation: Apple Siri
- Speech to text
- Text to speech
Question answering: IBM Watson
- Match query with knowledge base
- Reasoning about intent of question
Machine translation: Google Translate
- One language to another to another
Information retrieval: Google
- Finding relevant results
- Finding similar results
Information extraction: Gmail
- Structured information from unstructured documents
Assistive technologies: Google autocompletion
- Predictive text input
- Text simplification
Natural Language Generation: computer-generated articles
- Generating text from data
Automatic summarization: Google News
- Extractive summarization
- Abstractive summarization
Sentiment analysis: Twitter analysis
- Attitude of speaker

What are some of the lower level components?¶

Tokenization: breaking text into tokens (words, sentences, n-grams)
Stopword removal: a/an/the
Stemming and lemmatization: root word
TF-IDF: word importance
Part-of-speech tagging: noun/verb/adjective
Named entity recognition: person/organization/location
Spelling correction: "New Yrok City"
Word sense disambiguation: "buy a mouse"
Segmentation: "New York City subway"
Language detection: "translate this page"
Machine learning

Why is NLP hard?¶

Ambiguity:
- Teacher Strikes Idle Kids
- Red Tape Holds Up New Bridges
- Hospitals are Sued by 7 Foot Doctors
- Juvenile Court to Try Shooting Defendant
- Local High School Dropouts Cut in Half
Non-standard English: tweets/text messages
Idioms: "throw in the towel"
Newly coined words: "retweet"
Tricky entity names: "Where is A Bug's Life playing?"
World knowledge: "Mary and Sue are sisters", "Mary and Sue are mothers"

How does NLP work?¶

Build probabilistic model using data about a language
Requires an understanding of the language
Requires an understanding of the world (or a particular domain)

Part 2: Reading in the Yelp Reviews¶

"corpus" = collection of documents
"corpora" = plural form of corpus

In [1]:

import pandas as pd
import numpy as np
import scipy as sp
from sklearn.cross_validation import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from textblob import TextBlob, Word
from nltk.stem.snowball import SnowballStemmer
%matplotlib inline

In [2]:

# read yelp.csv into a DataFrame
url = 'https://raw.githubusercontent.com/justmarkham/DAT7/master/data/yelp.csv'
yelp = pd.read_csv(url)

# create a new DataFrame that only contains the 5-star and 1-star reviews
yelp_best_worst = yelp[(yelp.stars==5) | (yelp.stars==1)]

# split the new DataFrame into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(yelp_best_worst.text, yelp_best_worst.stars, random_state=1)

Part 3: Tokenization¶

What: Separate text into units such as sentences or words
Why: Gives structure to previously unstructured text
Notes: Relatively easy with English language text, not easy with some languages

In [3]:

# use CountVectorizer to create document-term matrices from X_train and X_test
vect = CountVectorizer()
train_dtm = vect.fit_transform(X_train)
test_dtm = vect.transform(X_test)

In [4]:

# rows are documents, columns are terms (aka "tokens" or "features")
train_dtm.shape

Out[4]:

(3064, 16825)

In [5]:

# last 50 features
print vect.get_feature_names()[-50:]

[u'yyyyy', u'z11', u'za', u'zabba', u'zach', u'zam', u'zanella', u'zankou', u'zappos', u'zatsiki', u'zen', u'zero', u'zest', u'zexperience', u'zha', u'zhou', u'zia', u'zihuatenejo', u'zilch', u'zin', u'zinburger', u'zinburgergeist', u'zinc', u'zinfandel', u'zing', u'zip', u'zipcar', u'zipper', u'zippers', u'zipps', u'ziti', u'zoe', u'zombi', u'zombies', u'zone', u'zones', u'zoning', u'zoo', u'zoyo', u'zucca', u'zucchini', u'zuchinni', u'zumba', u'zupa', u'zuzu', u'zwiebel', u'zzed', u'\xe9clairs', u'\xe9cole', u'\xe9m']

In [6]:

# show vectorizer options
vect

Out[6]:

CountVectorizer(analyzer=u'word', binary=False, charset=None,
        charset_error=None, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

CountVectorizer documentation

lowercase: boolean, True by default
Convert all characters to lowercase before tokenizing.

In [7]:

# don't convert to lowercase
vect = CountVectorizer(lowercase=False)
train_dtm = vect.fit_transform(X_train)
train_dtm.shape

Out[7]:

(3064, 20838)

token_pattern: string
Regular expression denoting what constitutes a "token". The default regexp select tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).

In [8]:

# allow tokens of one character
vect = CountVectorizer(token_pattern=r'(?u)\b\w+\b')
train_dtm = vect.fit_transform(X_train)
train_dtm.shape

Out[8]:

(3064, 16861)

ngram_range: tuple (min_n, max_n)
The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.

In [9]:

# include 1-grams and 2-grams
vect = CountVectorizer(ngram_range=(1, 2))
train_dtm = vect.fit_transform(X_train)
train_dtm.shape

Out[9]:

(3064, 169847)

In [10]:

# last 50 features
print vect.get_feature_names()[-50:]

[u'zone out', u'zone when', u'zones', u'zones dolls', u'zoning', u'zoning issues', u'zoo', u'zoo and', u'zoo is', u'zoo not', u'zoo the', u'zoo ve', u'zoyo', u'zoyo for', u'zucca', u'zucca appetizer', u'zucchini', u'zucchini and', u'zucchini bread', u'zucchini broccoli', u'zucchini carrots', u'zucchini fries', u'zucchini pieces', u'zucchini strips', u'zucchini veal', u'zucchini very', u'zucchini with', u'zuchinni', u'zuchinni again', u'zuchinni the', u'zumba', u'zumba class', u'zumba or', u'zumba yogalates', u'zupa', u'zupa flavors', u'zuzu', u'zuzu in', u'zuzu is', u'zuzu the', u'zwiebel', u'zwiebel kr\xe4uter', u'zzed', u'zzed in', u'\xe9clairs', u'\xe9clairs napoleons', u'\xe9cole', u'\xe9cole len\xf4tre', u'\xe9m', u'\xe9m all']

Predicting the star rating:

In [11]:

# use default options for CountVectorizer
vect = CountVectorizer()

# create document-term matrices
train_dtm = vect.fit_transform(X_train)
test_dtm = vect.transform(X_test)

# use Naive Bayes to predict the star rating
nb = MultinomialNB()
nb.fit(train_dtm, y_train)
y_pred_class = nb.predict(test_dtm)

# calculate accuracy
print metrics.accuracy_score(y_test, y_pred_class)

0.918786692759

In [12]:

# calculate null accuracy
y_test_binary = np.where(y_test==5, 1, 0)
y_test_binary.mean()

Out[12]:

0.81996086105675148

In [13]:

# define a function that accepts a vectorizer and returns the accuracy
def tokenize_test(vect):
    train_dtm = vect.fit_transform(X_train)
    print 'Features: ', train_dtm.shape[1]
    test_dtm = vect.transform(X_test)
    nb = MultinomialNB()
    nb.fit(train_dtm, y_train)
    y_pred_class = nb.predict(test_dtm)
    print 'Accuracy: ', metrics.accuracy_score(y_test, y_pred_class)

In [14]:

# include 1-grams and 2-grams
vect = CountVectorizer(ngram_range=(1, 2))
tokenize_test(vect)

Features:  169847
Accuracy:  0.854207436399

Part 4: Stopword Removal¶

What: Remove common words that will likely appear in any text
Why: They don't tell you much about your text

In [15]:

# show vectorizer options
vect

Out[15]:

CountVectorizer(analyzer=u'word', binary=False, charset=None,
        charset_error=None, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 2), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

stop_words: string {'english'}, list, or None (default)
If 'english', a built-in stop word list for English is used.
If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens.
If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra corpus document frequency of terms.

In [16]:

# remove English stop words
vect = CountVectorizer(stop_words='english')
tokenize_test(vect)

Features:  16528
Accuracy:  0.915851272016

In [17]:

# set of stop words
print vect.get_stop_words()

frozenset(['all', 'six', 'less', 'being', 'indeed', 'over', 'move', 'anyway', 'four', 'not', 'own', 'through', 'yourselves', 'fify', 'where', 'mill', 'only', 'find', 'before', 'one', 'whose', 'system', 'how', 'somewhere', 'with', 'thick', 'show', 'had', 'enough', 'should', 'to', 'must', 'whom', 'seeming', 'under', 'ours', 'has', 'might', 'thereafter', 'latterly', 'do', 'them', 'his', 'around', 'than', 'get', 'very', 'de', 'none', 'cannot', 'every', 'whether', 'they', 'front', 'during', 'thus', 'now', 'him', 'nor', 'name', 'several', 'hereafter', 'always', 'who', 'cry', 'whither', 'this', 'someone', 'either', 'each', 'become', 'thereupon', 'sometime', 'side', 'two', 'therein', 'twelve', 'because', 'often', 'ten', 'our', 'eg', 'some', 'back', 'up', 'go', 'namely', 'towards', 'are', 'further', 'beyond', 'ourselves', 'yet', 'out', 'even', 'will', 'what', 'still', 'for', 'bottom', 'mine', 'since', 'please', 'forty', 'per', 'its', 'everything', 'behind', 'un', 'above', 'between', 'it', 'neither', 'seemed', 'ever', 'across', 'she', 'somehow', 'be', 'we', 'full', 'never', 'sixty', 'however', 'here', 'otherwise', 'were', 'whereupon', 'nowhere', 'although', 'found', 'alone', 're', 'along', 'fifteen', 'by', 'both', 'about', 'last', 'would', 'anything', 'via', 'many', 'could', 'thence', 'put', 'against', 'keep', 'etc', 'amount', 'became', 'ltd', 'hence', 'onto', 'or', 'con', 'among', 'already', 'co', 'afterwards', 'formerly', 'within', 'seems', 'into', 'others', 'while', 'whatever', 'except', 'down', 'hers', 'everyone', 'done', 'least', 'another', 'whoever', 'moreover', 'couldnt', 'throughout', 'anyhow', 'yourself', 'three', 'from', 'her', 'few', 'together', 'top', 'there', 'due', 'been', 'next', 'anyone', 'eleven', 'much', 'call', 'therefore', 'interest', 'then', 'thru', 'themselves', 'hundred', 'was', 'sincere', 'empty', 'more', 'himself', 'elsewhere', 'mostly', 'on', 'fire', 'am', 'becoming', 'hereby', 'amongst', 'else', 'part', 'everywhere', 'too', 'herself', 'former', 'those', 'he', 'me', 'myself', 'made', 'twenty', 'these', 'bill', 'cant', 'us', 'until', 'besides', 'nevertheless', 'below', 'anywhere', 'nine', 'can', 'of', 'toward', 'my', 'something', 'and', 'whereafter', 'whenever', 'give', 'almost', 'wherever', 'is', 'describe', 'beforehand', 'herein', 'an', 'as', 'itself', 'at', 'have', 'in', 'seem', 'whence', 'ie', 'any', 'fill', 'again', 'hasnt', 'inc', 'thereby', 'thin', 'no', 'perhaps', 'latter', 'meanwhile', 'when', 'detail', 'same', 'wherein', 'beside', 'also', 'that', 'other', 'take', 'which', 'becomes', 'you', 'if', 'nobody', 'see', 'though', 'may', 'after', 'upon', 'most', 'hereupon', 'eight', 'but', 'serious', 'nothing', 'such', 'your', 'why', 'a', 'off', 'whereby', 'third', 'i', 'whole', 'noone', 'sometimes', 'well', 'amoungst', 'yours', 'their', 'rather', 'without', 'so', 'five', 'the', 'first', 'whereas', 'once'])

Part 5: Other CountVectorizer Options¶

max_features: int or None, default=None
If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus.

In [18]:

# remove English stop words and only keep 100 features
vect = CountVectorizer(stop_words='english', max_features=100)
tokenize_test(vect)

Features:  100
Accuracy:  0.869863013699

In [19]:

# all 100 features
print vect.get_feature_names()

[u'amazing', u'area', u'atmosphere', u'awesome', u'bad', u'bar', u'best', u'better', u'big', u'came', u'cheese', u'chicken', u'clean', u'coffee', u'come', u'day', u'definitely', u'delicious', u'did', u'didn', u'dinner', u'don', u'eat', u'excellent', u'experience', u'favorite', u'feel', u'food', u'free', u'fresh', u'friendly', u'friends', u'going', u'good', u'got', u'great', u'happy', u'home', u'hot', u'hour', u'just', u'know', u'like', u'little', u'll', u'location', u'long', u'looking', u'lot', u'love', u'lunch', u'make', u'meal', u'menu', u'minutes', u'need', u'new', u'nice', u'night', u'order', u'ordered', u'people', u'perfect', u'phoenix', u'pizza', u'place', u'pretty', u'prices', u'really', u'recommend', u'restaurant', u'right', u'said', u'salad', u'sandwich', u'sauce', u'say', u'service', u'staff', u'store', u'sure', u'table', u'thing', u'things', u'think', u'time', u'times', u'took', u'town', u'tried', u'try', u've', u'wait', u'want', u'way', u'went', u'wine', u'work', u'worth', u'years']

In [20]:

# include 1-grams and 2-grams, and limit the number of features
vect = CountVectorizer(ngram_range=(1, 2), max_features=100000)
tokenize_test(vect)

Features:  100000
Accuracy:  0.885518590998

min_df: float in range [0.0, 1.0] or int, default=1
When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts.

In [21]:

# include 1-grams and 2-grams, and only include terms that appear at least 2 times
vect = CountVectorizer(ngram_range=(1, 2), min_df=2)
tokenize_test(vect)

Features:  43957
Accuracy:  0.932485322896

Part 6: Introduction to TextBlob¶

TextBlob: "Simplified Text Processing"

In [22]:

# print the first review
print yelp_best_worst.text[0]

My wife took me here on my birthday for breakfast and it was excellent.  The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure.  Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning.  It looked like the place fills up pretty quickly so the earlier you get here the better.

Do yourself a favor and get their Bloody Mary.  It was phenomenal and simply the best I've ever had.  I'm pretty sure they only use ingredients from their garden and blend them fresh when you order it.  It was amazing.

While EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious.  It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete.  It was the best "toast" I've ever had.

Anyway, I can't wait to go back!

In [23]:

# save it as a TextBlob object
review = TextBlob(yelp_best_worst.text[0])

In [24]:

# list the words
review.words

Out[24]:

WordList(['My', 'wife', 'took', 'me', 'here', 'on', 'my', 'birthday', 'for', 'breakfast', 'and', 'it', 'was', 'excellent', 'The', 'weather', 'was', 'perfect', 'which', 'made', 'sitting', 'outside', 'overlooking', 'their', 'grounds', 'an', 'absolute', 'pleasure', 'Our', 'waitress', 'was', 'excellent', 'and', 'our', 'food', 'arrived', 'quickly', 'on', 'the', 'semi-busy', 'Saturday', 'morning', 'It', 'looked', 'like', 'the', 'place', 'fills', 'up', 'pretty', 'quickly', 'so', 'the', 'earlier', 'you', 'get', 'here', 'the', 'better', 'Do', 'yourself', 'a', 'favor', 'and', 'get', 'their', 'Bloody', 'Mary', 'It', 'was', 'phenomenal', 'and', 'simply', 'the', 'best', 'I', "'ve", 'ever', 'had', 'I', "'m", 'pretty', 'sure', 'they', 'only', 'use', 'ingredients', 'from', 'their', 'garden', 'and', 'blend', 'them', 'fresh', 'when', 'you', 'order', 'it', 'It', 'was', 'amazing', 'While', 'EVERYTHING', 'on', 'the', 'menu', 'looks', 'excellent', 'I', 'had', 'the', 'white', 'truffle', 'scrambled', 'eggs', 'vegetable', 'skillet', 'and', 'it', 'was', 'tasty', 'and', 'delicious', 'It', 'came', 'with', '2', 'pieces', 'of', 'their', 'griddled', 'bread', 'with', 'was', 'amazing', 'and', 'it', 'absolutely', 'made', 'the', 'meal', 'complete', 'It', 'was', 'the', 'best', 'toast', 'I', "'ve", 'ever', 'had', 'Anyway', 'I', 'ca', "n't", 'wait', 'to', 'go', 'back'])

In [25]:

# list the sentences
review.sentences

Out[25]:

[Sentence("My wife took me here on my birthday for breakfast and it was excellent."),
 Sentence("The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure."),
 Sentence("Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning."),
 Sentence("It looked like the place fills up pretty quickly so the earlier you get here the better."),
 Sentence("Do yourself a favor and get their Bloody Mary."),
 Sentence("It was phenomenal and simply the best I've ever had."),
 Sentence("I'm pretty sure they only use ingredients from their garden and blend them fresh when you order it."),
 Sentence("It was amazing."),
 Sentence("While EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious."),
 Sentence("It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete."),
 Sentence("It was the best "toast" I've ever had."),
 Sentence("Anyway, I can't wait to go back!")]

In [26]:

# some string methods are available
review.lower()

Out[26]:

TextBlob("my wife took me here on my birthday for breakfast and it was excellent.  the weather was perfect which made sitting outside overlooking their grounds an absolute pleasure.  our waitress was excellent and our food arrived quickly on the semi-busy saturday morning.  it looked like the place fills up pretty quickly so the earlier you get here the better.

do yourself a favor and get their bloody mary.  it was phenomenal and simply the best i've ever had.  i'm pretty sure they only use ingredients from their garden and blend them fresh when you order it.  it was amazing.

while everything on the menu looks excellent, i had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious.  it came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete.  it was the best "toast" i've ever had.

anyway, i can't wait to go back!")

Part 7: Stemming and Lemmatization¶

Stemming:

What: Reduce a word to its base/stem/root form
Why: Often makes sense to treat related words the same way
Notes:
- Uses a "simple" and fast rule-based approach
- Stemmed words are usually not shown to users (used for analysis/indexing)
- Some search engines treat words with the same stem as synonyms

In [27]:

# initialize stemmer
stemmer = SnowballStemmer('english')

# stem each word
print [stemmer.stem(word) for word in review.words]

[u'my', u'wife', u'took', u'me', u'here', u'on', u'my', u'birthday', u'for', u'breakfast', u'and', u'it', u'was', u'excel', u'the', u'weather', u'was', u'perfect', u'which', u'made', u'sit', u'outsid', u'overlook', u'their', u'ground', u'an', u'absolut', u'pleasur', u'our', u'waitress', u'was', u'excel', u'and', u'our', u'food', u'arriv', u'quick', u'on', u'the', u'semi-busi', u'saturday', u'morn', u'it', u'look', u'like', u'the', u'place', u'fill', u'up', u'pretti', u'quick', u'so', u'the', u'earlier', u'you', u'get', u'here', u'the', u'better', u'do', u'yourself', u'a', u'favor', u'and', u'get', u'their', u'bloodi', u'mari', u'it', u'was', u'phenomen', u'and', u'simpli', u'the', u'best', u'i', u've', u'ever', u'had', u'i', u"'m", u'pretti', u'sure', u'they', u'onli', u'use', u'ingredi', u'from', u'their', u'garden', u'and', u'blend', u'them', u'fresh', u'when', u'you', u'order', u'it', u'it', u'was', u'amaz', u'while', u'everyth', u'on', u'the', u'menu', u'look', u'excel', u'i', u'had', u'the', u'white', u'truffl', u'scrambl', u'egg', u'veget', u'skillet', u'and', u'it', u'was', u'tasti', u'and', u'delici', u'it', u'came', u'with', u'2', u'piec', u'of', u'their', u'griddl', u'bread', u'with', u'was', u'amaz', u'and', u'it', u'absolut', u'made', u'the', u'meal', u'complet', u'it', u'was', u'the', u'best', u'toast', u'i', u've', u'ever', u'had', u'anyway', u'i', u'ca', u"n't", u'wait', u'to', u'go', u'back']

Lemmatization

What: Derive the canonical form ('lemma') of a word
Why: Can be better than stemming
Notes: Uses a dictionary-based approach (slower than stemming)

In [28]:

# assume every word is a noun
print [word.lemmatize() for word in review.words]

['My', 'wife', 'took', 'me', 'here', 'on', 'my', 'birthday', 'for', 'breakfast', 'and', 'it', u'wa', 'excellent', 'The', 'weather', u'wa', 'perfect', 'which', 'made', 'sitting', 'outside', 'overlooking', 'their', u'ground', 'an', 'absolute', 'pleasure', 'Our', 'waitress', u'wa', 'excellent', 'and', 'our', 'food', 'arrived', 'quickly', 'on', 'the', 'semi-busy', 'Saturday', 'morning', 'It', 'looked', 'like', 'the', 'place', u'fill', 'up', 'pretty', 'quickly', 'so', 'the', 'earlier', 'you', 'get', 'here', 'the', 'better', 'Do', 'yourself', 'a', 'favor', 'and', 'get', 'their', 'Bloody', 'Mary', 'It', u'wa', 'phenomenal', 'and', 'simply', 'the', 'best', 'I', "'ve", 'ever', 'had', 'I', "'m", 'pretty', 'sure', 'they', 'only', 'use', u'ingredient', 'from', 'their', 'garden', 'and', 'blend', 'them', 'fresh', 'when', 'you', 'order', 'it', 'It', u'wa', 'amazing', 'While', 'EVERYTHING', 'on', 'the', 'menu', u'look', 'excellent', 'I', 'had', 'the', 'white', 'truffle', 'scrambled', u'egg', 'vegetable', 'skillet', 'and', 'it', u'wa', 'tasty', 'and', 'delicious', 'It', 'came', 'with', '2', u'piece', 'of', 'their', 'griddled', 'bread', 'with', u'wa', 'amazing', 'and', 'it', 'absolutely', 'made', 'the', 'meal', 'complete', 'It', u'wa', 'the', 'best', 'toast', 'I', "'ve", 'ever', 'had', 'Anyway', 'I', 'ca', "n't", 'wait', 'to', 'go', 'back']

In [29]:

# assume every word is a verb
print [word.lemmatize(pos='v') for word in review.words]

['My', 'wife', u'take', 'me', 'here', 'on', 'my', 'birthday', 'for', 'breakfast', 'and', 'it', u'be', 'excellent', 'The', 'weather', u'be', 'perfect', 'which', u'make', u'sit', 'outside', u'overlook', 'their', u'ground', 'an', 'absolute', 'pleasure', 'Our', 'waitress', u'be', 'excellent', 'and', 'our', 'food', u'arrive', 'quickly', 'on', 'the', 'semi-busy', 'Saturday', 'morning', 'It', u'look', 'like', 'the', 'place', u'fill', 'up', 'pretty', 'quickly', 'so', 'the', 'earlier', 'you', 'get', 'here', 'the', 'better', 'Do', 'yourself', 'a', 'favor', 'and', 'get', 'their', 'Bloody', 'Mary', 'It', u'be', 'phenomenal', 'and', 'simply', 'the', 'best', 'I', "'ve", 'ever', u'have', 'I', "'m", 'pretty', 'sure', 'they', 'only', 'use', 'ingredients', 'from', 'their', 'garden', 'and', 'blend', 'them', 'fresh', 'when', 'you', 'order', 'it', 'It', u'be', u'amaze', 'While', 'EVERYTHING', 'on', 'the', 'menu', u'look', 'excellent', 'I', u'have', 'the', 'white', 'truffle', u'scramble', u'egg', 'vegetable', 'skillet', 'and', 'it', u'be', 'tasty', 'and', 'delicious', 'It', u'come', 'with', '2', u'piece', 'of', 'their', u'griddle', 'bread', 'with', u'be', u'amaze', 'and', 'it', 'absolutely', u'make', 'the', 'meal', 'complete', 'It', u'be', 'the', 'best', 'toast', 'I', "'ve", 'ever', u'have', 'Anyway', 'I', 'ca', "n't", 'wait', 'to', 'go', 'back']

In [30]:

# define a function that accepts text and returns a list of lemmas
def split_into_lemmas(text):
    text = unicode(text, 'utf-8').lower()
    words = TextBlob(text).words
    return [word.lemmatize() for word in words]

In [31]:

# use split_into_lemmas as the feature extraction function
vect = CountVectorizer(analyzer=split_into_lemmas)
tokenize_test(vect)

Features:  16452
Accuracy:  0.920743639922

In [32]:

# last 50 features
print vect.get_feature_names()[-50:]

[u'yuyuyummy', u'yuzu', u'z', u'z-grill', u'z11', u'zach', u'zam', u'zanella', u'zankou', u'zappos', u'zatsiki', u'zen', u'zen-like', u'zero', u'zero-star', u'zest', u'zexperience', u'zha', u'zhou', u'zia', u'zilch', u'zin', u'zinburger', u'zinburgergeist', u'zinc', u'zinfandel', u'zing', u'zip', u'zipcar', u'zipper', u'zipps', u'ziti', u'zoe', u'zombi', u'zombie', u'zone', u'zoning', u'zoo', u'zoyo', u'zucca', u'zucchini', u'zuchinni', u'zumba', u'zupa', u'zuzu', u'zwiebel-kr\xe4uter', u'zzed', u'\xe9clairs', u'\xe9cole', u'\xe9m']

Part 8: Term Frequency - Inverse Document Frequency (TF-IDF)¶

What: Computes "relative frequency" that a word appears in a document compared to its frequency across all documents
Why: More useful than "term frequency" for identifying "important" words in each document (high frequency in that document, low frequency in other documents)
Notes: Used for search engine scoring, text summarization, document clustering

In [33]:

# example documents
train_simple = ['call you tonight',
                'Call me a cab',
                'please call me... PLEASE!']

In [34]:

# CountVectorizer
vect = CountVectorizer()
pd.DataFrame(vect.fit_transform(train_simple).toarray(), columns=vect.get_feature_names())

Out[34]:

	cab	call	me	please	tonight	you
0	0	1	0	0	1	1
1	1	1	1	0	0	0
2	0	1	1	2	0	0

In [35]:

# TfidfVectorizer
vect = TfidfVectorizer()
pd.DataFrame(vect.fit_transform(train_simple).toarray(), columns=vect.get_feature_names())

Out[35]:

	cab	call	me	please	tonight	you
0	0.000000	0.385372	0.000000	0.000000	0.652491	0.652491
1	0.720333	0.425441	0.547832	0.000000	0.000000	0.000000
2	0.000000	0.266075	0.342620	0.901008	0.000000	0.000000

Part 9: Using TF-IDF to Summarize a Yelp Review¶

In [36]:

# create a document-term matrix using TF-IDF
vect = TfidfVectorizer(stop_words='english')
dtm = vect.fit_transform(yelp.text)
features = vect.get_feature_names()
dtm.shape

Out[36]:

(10000, 28881)

In [37]:

def summarize():
    
    # choose a random review that is at least 300 characters
    review_length = 0
    while review_length < 300:
        review_id = np.random.randint(0, len(yelp))
        review_text = unicode(yelp.text[review_id], 'utf-8')
        review_length = len(review_text)
    
    # create a dictionary of words and their TF-IDF scores
    word_scores = {}
    for word in TextBlob(review_text).words:
        word = word.lower()
        if word in features:
            word_scores[word] = dtm[review_id, features.index(word)]
    
    # print words with the top 5 TF-IDF scores
    print 'TOP SCORING WORDS:'
    top_scores = sorted(word_scores.items(), key=lambda x: x[1], reverse=True)[:5]
    for word, score in top_scores:
        print word
    
    # print 5 random words
    print '\n' + 'RANDOM WORDS:'
    random_words = np.random.choice(word_scores.keys(), size=5, replace=False)
    for word in random_words:
        print word
    
    # print the review
    print '\n' + review_text

In [38]:

summarize()

TOP SCORING WORDS:
insurance
surgery
dr
estimate
fish

RANDOM WORDS:
check
credits
small
called
allowed

Dr Fish is an ok oral surgeon.  However, his financial practices are shady.  He gave us a high estimate for surgery, 6k; we had to pay 4k before the surgery.  He then lowered the estimate after the surgery but refused to credit our card; we had insurance that paid 2k.  Two months later, we received a check for a small amount of the overage.  Their office person told me they dont do credit card credits.  I then called our insurance company and learned Dr. Fish had received $1800 more from us than the insurance co allowed in Dr Fish's agreement with the co.  Now we have to go fight him for the refund. His office person said, oh, insurance co's dont like to pay.  Hmmm--why contract with them then?  I would say, look elsewhere!

Part 10: Sentiment Analysis¶

In [39]:

print review

My wife took me here on my birthday for breakfast and it was excellent.  The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure.  Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning.  It looked like the place fills up pretty quickly so the earlier you get here the better.

Do yourself a favor and get their Bloody Mary.  It was phenomenal and simply the best I've ever had.  I'm pretty sure they only use ingredients from their garden and blend them fresh when you order it.  It was amazing.

While EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious.  It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete.  It was the best "toast" I've ever had.

Anyway, I can't wait to go back!

In [40]:

# polarity ranges from -1 (most negative) to 1 (most positive)
review.sentiment.polarity

Out[40]:

0.40246913580246907

In [41]:

# understanding the apply method
yelp['length'] = yelp.text.apply(len)

In [42]:

# define a function that accepts text and returns the polarity
def detect_sentiment(text):
    return TextBlob(text.decode('utf-8')).sentiment.polarity

In [43]:

# create a new DataFrame column for sentiment
yelp['sentiment'] = yelp.text.apply(detect_sentiment)

In [44]:

# boxplot of sentiment grouped by stars
yelp.boxplot(column='sentiment', by='stars')

Out[44]:

<matplotlib.axes._subplots.AxesSubplot at 0x1a88c390>

In [45]:

# reviews with most positive sentiment
yelp[yelp.sentiment == 1].text.head()

Out[45]:

254    Our server Gary was awesome. Food was amazing....
347    3 syllables for this place. \r\nA-MAZ-ING!\r\n...
420                                    LOVE the food!!!!
459    Love it!!! Wish we still lived in Arizona as C...
679                                     Excellent burger
Name: text, dtype: object

In [46]:

# reviews with most negative sentiment
yelp[yelp.sentiment == -1].text.head()

Out[46]:

773     This was absolutely horrible. I got the suprem...
1517                  Nasty workers and over priced trash
3266    Absolutely awful... these guys have NO idea wh...
4766                                       Very bad food!
5812        I wouldn't send my worst enemy to this place.
Name: text, dtype: object

In [47]:

# widen the column display
pd.set_option('max_colwidth', 500)

In [48]:

# negative sentiment in a 5-star review
yelp[(yelp.stars == 5) & (yelp.sentiment < -0.3)].head()

Out[48]:

	business_id	date	review_id	stars	text	type	user_id	cool	useful	funny	length	sentiment
390	106JT5p8e8Chtd0CZpcARw	2009-08-06	KowGVoP_gygzdSu6Mt3zKQ	5	RIP AZ Coffee Connection. :( I stopped by two days ago unaware that they had closed. I am severely bummed. This place is irreplaceable! Damn you, Starbucks and McDonalds!	review	jKeaOrPyJ-dI9SNeVqrbww	1	0	0	175	-0.302083
1287	57-dgZzOnLox6eudArRKgw	2008-08-28	sksXE8krD3WvqSOhtlSUyQ	5	Obsessed. Like, I've-got-the-Twangy-Tart-withdrawal-shakes level of addiction to this place. Please make one in Arcadia! Pleeeaaassse.	review	gEnU4BqTK-4abqYl_Ljjfg	3	3	5	134	-0.625000
3075	PwtYeGu-19v9bU4nbP9UbA	2011-12-05	8yfOlQGxQlCgQL9TnnzQkw	5	Unfortunately Out of Business.	review	0fOPM1H03gF5EJooYvkL1Q	0	2	0	30	-0.500000
3516	Bc4DoKgrKCtCuN-0O5He3A	2009-12-19	-qqrl4101KbQKIdar1lMRw	5	Cashew brittle, almond brittle, bacon brittle! Go now, before it's too late!	review	wHg1YkCzdZq9WBJOTRgxHQ	9	8	6	77	-0.375000
6726	FURgKkRFtMK5yKbjYZVVwA	2012-08-13	8xx8i94sKvBhWZv8ZVyfBA	5	Brown bag chicken sammich, mac n cheese, fried okra, and the bourbon drink. Nuff said.	review	hFP7Si9jvdOUmmMesg4ghw	0	0	0	87	-0.600000

In [49]:

# positive sentiment in a 1-star review
yelp[(yelp.stars == 1) & (yelp.sentiment > 0.5)].head()

Out[49]:

	business_id	date	review_id	stars	text	type	user_id	cool	useful	funny	length	sentiment
1781	53YGfwmbW73JhFiemNeyzQ	2012-06-22	Gi-4O3EhE175vujbFGDIew	1	If you like the stuck up Scottsdale vibe this is a good place for you. The food isn't impressive. Nice outdoor seating.	review	Hqgx3IdJAAaoQjvrUnbNvw	0	1	2	119	0.766667
2353	3Srfy_VeCgwDbo4iyUFOtw	2006-08-23	K8tXedC2NMBEZ8p77zg23Q	1	My co-workers and I refer to this place as "Pizza n' Ants". The staff will be happy to serve you with bare hands, right after using the till. Also, as the nickname suggests, there has been a noticable insect problem. \r\n\r\n\r\n\r\nAs if that could all be overlooked, the pizza isn't even good. If you are in this part of town, go to Z Pizza or Slices for great pizza instead!	review	rPGZttaVjRoVi3GYbs62cg	0	1	0	372	0.567143
5257	cXx-fHY11Se8rFHkkUeaUg	2009-10-27	2yHyr0N_XNZggmIfZ7JaHw	1	Remember how I said that the Trivia was the best thing about this place? Well, they got rid of long time Triva host, Dave (who had been featured in the College Times and was the best thing about the trivia). Without Dave's personality, this place just doesn't cut it. Will never go here again. Bummer.	review	nx2PS25Qe3MCEFUdO_XOtw	2	4	0	304	0.650000
6222	fDZzCjlxaA4OOmnFO-i0vw	2012-07-09	F5aRE4oqmHthiHudmnShLQ	1	My mother always told me, if I didn't have anything nice to say, say nothing!	review	J92bzxYVmyoLHULzh9xNCA	1	2	1	77	0.750000
6702	77oW-QeIXbUoTbUbrdD2aA	2012-01-05	oVYk9Gxa3TY63FAeoeCEzg	1	Most livable city my eye!\r\nPlastic yuppies around every corner looking for a reason to belong. I can't wait for the homosexuals to take control of this dog park and give it some class.\r\n\r\nAvoid at all cost.	review	ek4GWXatDshMorJwGC2JAw	1	2	4	207	0.625000

In [50]:

# reset the column display width
pd.reset_option('max_colwidth')

Part 11: Adding Features to a Document-Term Matrix¶

In [51]:

# create a new DataFrame that only contains the 5-star and 1-star reviews
yelp_best_worst = yelp[(yelp.stars==5) | (yelp.stars==1)]

# split the new DataFrame into training and testing sets
feature_cols = ['text', 'sentiment', 'cool', 'useful', 'funny']
X = yelp_best_worst[feature_cols]
y = yelp_best_worst.stars
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [52]:

# use CountVectorizer with text column only
vect = CountVectorizer()
train_dtm = vect.fit_transform(X_train[:, 0])
test_dtm = vect.transform(X_test[:, 0])
print train_dtm.shape
print test_dtm.shape

(3064, 16825)
(1022, 16825)

In [53]:

# shape of other four feature columns
X_train[:, 1:].shape

Out[53]:

(3064L, 4L)

In [54]:

# cast other feature columns to float and convert to a sparse matrix
extra = sp.sparse.csr_matrix(X_train[:, 1:].astype(float))
extra.shape

Out[54]:

(3064, 4)

In [55]:

# combine sparse matrices
train_dtm_extra = sp.sparse.hstack((train_dtm, extra))
train_dtm_extra.shape

Out[55]:

(3064, 16829)

In [56]:

# repeat for testing set
extra = sp.sparse.csr_matrix(X_test[:, 1:].astype(float))
test_dtm_extra = sp.sparse.hstack((test_dtm, extra))
test_dtm_extra.shape

Out[56]:

(1022, 16829)

In [57]:

# use logistic regression with text column only
logreg = LogisticRegression(C=1e9)
logreg.fit(train_dtm, y_train)
y_pred_class = logreg.predict(test_dtm)
print metrics.accuracy_score(y_test, y_pred_class)

0.917808219178

In [58]:

# use logistic regression with all features
logreg = LogisticRegression(C=1e9)
logreg.fit(train_dtm_extra, y_train)
y_pred_class = logreg.predict(test_dtm_extra)
print metrics.accuracy_score(y_test, y_pred_class)

0.922700587084

Part 12: Fun TextBlob Features¶

In [59]:

# spelling correction
TextBlob('15 minuets late').correct()

Out[59]:

TextBlob("15 minutes late")

In [60]:

# spellcheck
Word('parot').spellcheck()

Out[60]:

[('part', 0.9929478138222849), (u'parrot', 0.007052186177715092)]

In [61]:

# definitions
Word('bank').define('v')

Out[61]:

[u'tip laterally',
 u'enclose with a bank',
 u'do business with a bank or keep an account at a bank',
 u'act as the banker in a game or in gambling',
 u'be in the banking business',
 u'put into a bank account',
 u'cover with ashes so to control the rate of burning',
 u'have confidence or faith in']

In [62]:

# language identification
TextBlob('Hola amigos').detect_language()

Out[62]:

u'es'