I have (I think/hope) about 90 minutes of material. This session is supposed to go for about 110 minutes. I thought we could break it up as follows:
This talk is about introductory natural language techniques using Python. Most of the talk will focus on NLTK but towards the end we'll look at a few other frameworks such as scikit-learn, TextBlob and BeautifulSoup.
NLTK is the Natural Language Toolkit. It is a set of software, data and documentation for performing Natural Language Processing (NLP) in Python.
Natural language is spoken and/or written language used by humans for communication. We will be focusing on English (because today's host only speaks English) but NLTK has some facilities for other languages as well.
Natural language processing is the automated (usually by a computer) analysis of natural language.
You must have Python version 2.6+ or 3.2+ (for NLTK 3). After that, installation is simple:
pip install nltk
NLTK has soft dependencies on numpy and matplotlib.
You'll want to have some data, or corpora, to experiment with:
nltk.download_gui()
NLTK comes with over 40 corpora, or bodies of text, that we can experiment with. The corpora have a variety of features. Here are a few of the more popular ones:
Corpora are imported as modules:
Let's take a look at a few of them in code.
import nltk
from nltk.corpus import gutenberg
gutenberg
The Gutenberg corpus is a PlaintextCorpusReader and contains a collection of text documents with no metadata.
for fileid in gutenberg.fileids():
print fileid
fileid = 'shakespeare-macbeth.txt'
words = gutenberg.words(fileid)
words[:20]
raw = gutenberg.raw(fileid)
raw[:1000]
len(words)
sents = gutenberg.sents(fileid)
sents
len(sents)
len(raw)
A 'stopword' is a word that is usually filtered out before processing. This includes words such as 'the', 'and', 'or', etc. Note that there is no definitive list, but NLTK has a fairly complete one.
from nltk.corpus import stopwords
words = stopwords.words('english')
for word in words[:25]: print word,
NLTK also has lists of stopwords in many languages.
languages = stopwords.fileids()
for lang in languages:
words = stopwords.words(lang)
print lang
for word in words[:25]: print word,
print '\n'
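Once you have a stopword list, filtering is just a membership test. Here is a minimal sketch using a small hand-picked stopword set (the real NLTK list is much longer):

```python
# A tiny hand-picked stopword set for illustration only;
# NLTK's English list has over 100 entries.
STOPWORDS = {'the', 'and', 'or', 'a', 'an', 'of', 'to', 'in', 'is'}

def remove_stopwords(tokens):
    """Keep only the tokens that are not stopwords (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

tokens = 'The quick brown fox jumped over the lazy dog'.split()
print(remove_stopwords(tokens))
# ['quick', 'brown', 'fox', 'jumped', 'over', 'lazy', 'dog']
```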
from nltk.corpus import treebank
words = treebank.tagged_words()[:25]
print ', '.join(["('%s', '%s')" % (word[0], word[1]) for word in words])
sent = treebank.tagged_sents()[0]
sent
We'll use the Penn Treebank again to tag our own text.
WordNet is a different animal. This corpus contains what are called 'synsets', or synonym rings. To get the synsets for a particular word:
from nltk.corpus import wordnet
good_synsets = wordnet.synsets('good')
for synset in good_synsets[:20]:
print synset.name
print synset.pos
print synset.definition
print '-' * 40
We can see that each synset has a definition and a part of speech or POS tag.
Suppose we are only interested in nouns.
good_nouns = wordnet.synsets('good', pos='n')
for noun in good_nouns:
print noun.name
print noun.pos
print noun.definition
print '-' * 40
Each synset also (potentially) has examples of usage.
for noun in good_nouns:
print noun.definition
if len(noun.examples) == 0:
print '**NO EXAMPLES'
else:
for example in noun.examples:
print example
print '-' * 40
Each synset can also have a set of lemmas, and each lemma can have a set of antonyms. An antonym is itself a lemma, with a structure much like that of a synset's lemmas.
noun = good_nouns[1]
for lemma in good_nouns[1].lemmas:
print lemma.name
antonyms = lemma.antonyms()
for antonym in antonyms: print '-', antonym.name
Since you'll probably have your own domain specific documents that you want to work with, NLTK lets you create custom corpora. We'll see how to do that later on.
Tokenization is the process of breaking a body of text into meaningful pieces. This is similar to parsing a math equation into tokens such as operators and operands. With natural language we might want sentences or words. NLTK provides many ways to tokenize text. Let's grab a book from Project Gutenberg to experiment with.
url = 'http://www.gutenberg.org/cache/epub/1661/pg1661.txt' # The Adventures of Sherlock Holmes
import urllib
book = urllib.urlopen(url).read()
Obviously, we could try a naive method.
tokens = book.split(' ')
print tokens[:100]
This has some drawbacks. The NLTK methods eliminate a lot of the noise for us.
We can tokenize by sentences.
from nltk import sent_tokenize
sents = sent_tokenize(book)
print 'There are', len(sents), 'sentences'
print sents[1002:1004]
And words.
from nltk import word_tokenize
words = word_tokenize(book)
print 'There are', len(words), 'words'
print word_tokenize(' '.join(sents[1002:1004]))
The word_tokenize function uses a TreebankWordTokenizer to generate tokens. As you can see, it separates contractions. To avoid this we could use a WhitespaceTokenizer.
from nltk.tokenize import WhitespaceTokenizer
tokenizer = WhitespaceTokenizer()
print tokenizer.tokenize(' '.join(sents[1002:1004]))
The WhitespaceTokenizer only splits on whitespace characters and does not separate punctuation from words. The WordPunctTokenizer will make punctuation its own token.
from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()
print tokenizer.tokenize(' '.join(sents[1002:1004]))
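The WordPunctTokenizer is, in effect, a single regular expression: runs of word characters, or runs of punctuation. Here is a sketch of the same idea using only the stdlib re module (the pattern matches the one NLTK documents for this tokenizer):

```python
import re

# Runs of word characters, or runs of non-word, non-space characters (punctuation).
WORD_PUNCT = re.compile(r'\w+|[^\w\s]+')

def word_punct_tokenize(text):
    """Split text into alternating word and punctuation tokens."""
    return WORD_PUNCT.findall(text)

print(word_punct_tokenize("I can't believe it's not butter!"))
# ['I', 'can', "'", 't', 'believe', 'it', "'", 's', 'not', 'butter', '!']
```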
Stemming is the process of reducing inflected words to their root or stem. Inflection is the modification of a word to express tense, gender, number and others. For example, the stem of 'running' is 'run'. The following is a naive approach.
def stem(word):
for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']:
if word.endswith(suffix):
return word[:-len(suffix)]
return word
print 'runs:', stem('runs')
print 'played:', stem('played')
print 'playing:', stem('playing')
print 'running:', stem('running')
This does not work for all situations, especially those where the stem is modified.
NLTK gives us several stemmer classes to accomplish this. The most common is the PorterStemmer.
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
print 'talking:', stemmer.stem('talking')
print 'talks:', stemmer.stem('talks')
print 'talked:', stemmer.stem('talked')
print 'running:', stemmer.stem('running')
NLTK also provides a RegexpStemmer, which is more of a brute-force method.
from nltk.stem import RegexpStemmer
stemmer = RegexpStemmer('ing')
print 'talking:', stemmer.stem('talking')
print 'talks:', stemmer.stem('talks')
This has the limitation of working only with the specified suffix.
stemmer = RegexpStemmer('ed')
print 'talked:', stemmer.stem('talked')
And it fails in instances where the stem itself is modified before adding a suffix:
stemmer = RegexpStemmer('ing')
print 'running:', stemmer.stem('running')
The SnowballStemmer is another place where NLTK supports multiple languages; it handles conjugation as well.
from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer('spanish')
print 'hablar:', stemmer.stem('hablar') # to speak
print 'hablo:', stemmer.stem('hablo') # I speak
Words that frequently appear in proximity are collocated. A collocation of two words is a bigram; of three, a trigram. NLTK makes finding these easy.
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures
finder = BigramCollocationFinder.from_words(gutenberg.words('shakespeare-macbeth.txt'))
bigram = BigramAssocMeasures()
finder.nbest(bigram.pmi, 25) # pmi => pointwise mutual information, scoring method
from nltk.collocations import TrigramAssocMeasures, TrigramCollocationFinder
tri_finder = TrigramCollocationFinder.from_words(gutenberg.words('shakespeare-macbeth.txt'))
trigram = TrigramAssocMeasures()
tri_finder.nbest(trigram.pmi, 25)
finder.apply_freq_filter(5)
finder.nbest(bigram.pmi, 25)
def remove_my(word):
return word == 'my'
finder.apply_word_filter(remove_my)
finder.nbest(bigram.pmi, 25)
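The PMI score used above is easy to compute directly: PMI(w1, w2) = log2(P(w1, w2) / (P(w1) * P(w2))). Here is a simplified sketch over a toy token list, estimating probabilities from adjacent-pair counts (NLTK's scorer differs in some details, so treat this as an illustration of the idea, not its exact implementation):

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """Pointwise mutual information for each adjacent word pair."""
    word_freq = Counter(tokens)
    bigram_freq = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scores = {}
    for (w1, w2), count in bigram_freq.items():
        p_w1 = word_freq[w1] / n
        p_w2 = word_freq[w2] / n
        p_bigram = count / (n - 1)      # probability of the pair among all pairs
        scores[(w1, w2)] = math.log(p_bigram / (p_w1 * p_w2), 2)
    return scores

tokens = 'to be or not to be that is the question'.split()
for pair, score in sorted(bigram_pmi(tokens).items(), key=lambda kv: -kv[1])[:3]:
    print(pair, round(score, 3))
```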
nltk.pos_tag(nltk.word_tokenize('PyOhio is an awesome software development conference.'))
from nltk.tag import DefaultTagger
default_tagger = DefaultTagger('NN')
sent = default_tagger.tag(nltk.word_tokenize('PyOhio is an awesome software development conference.'))
sent
As you can see, the DefaultTagger simply tags each token with the given tag. This is not very useful by itself (but we'll come back to it later). Let's try training a tagger on the Penn Treebank.
from nltk.tag import UnigramTagger
tagger = UnigramTagger(treebank.tagged_sents())
sent = tagger.tag(nltk.word_tokenize('PyOhio is an awesome software development conference.'))
sent
The Penn Treebank sample has about 3,900 sentences, so it's not going to have everything. That it didn't tag 'PyOhio' is not really a surprise, but it also missed 'awesome'.
Let's try a better-known set of words:
sent = tagger.tag(nltk.word_tokenize('The quick brown fox jumped over the lazy dog.'))
sent
And it didn't get all of these either.
The Brown Corpus was the first public corpus with over 1 million tagged words. Let's see if it can do any better. With about 57K sentences, it will take a little longer to train.
from nltk.corpus import brown
tagger = UnigramTagger(brown.tagged_sents(brown.fileids()))
sent = tagger.tag(nltk.word_tokenize('The quick brown fox jumped over the lazy dog.'))
sent
That's more like it! Let's try our other sentence:
sent = tagger.tag(nltk.word_tokenize('PyOhio is an awesome software development conference.'))
sent
Not any better.
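Conceptually, a unigram tagger just memorizes the most frequent tag for each word in the training data and looks words up at tagging time. Here is a minimal sketch over a made-up toy corpus (the words and tags are invented for illustration):

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Map each word to its most frequent tag in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag(model, words):
    # Words never seen in training get None, just like NLTK's UnigramTagger.
    return [(w, model.get(w)) for w in words]

train = [[('the', 'DT'), ('dog', 'NN'), ('runs', 'VBZ')],
         [('the', 'DT'), ('cat', 'NN'), ('sleeps', 'VBZ')]]
model = train_unigram(train)
print(tag(model, ['the', 'dog', 'sleeps', 'PyOhio']))
# [('the', 'DT'), ('dog', 'NN'), ('sleeps', 'VBZ'), ('PyOhio', None)]
```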
The taggers we've seen are backoff taggers: they can be chained together, and words not tagged by one tagger are handled by the next tagger in the chain. Let's put our default tagger at the end of the chain; any words missed will then be tagged 'NN'.
tagger._taggers = [tagger, default_tagger] # set the backoff chain directly (relies on a private attribute)
sent = tagger.tag(nltk.word_tokenize('PyOhio is an awesome software development conference.'))
sent
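The backoff mechanism itself is just a fall-through: each tagger in the chain handles the words the previous one left untagged. Here is a minimal sketch of the idea (the helper and toy taggers are hypothetical, not NLTK's implementation):

```python
def tag_with_backoff(taggers, words):
    """Try each tagger in order; the first non-None tag for a word wins."""
    result = []
    for word in words:
        tag = None
        for tagger in taggers:
            tag = tagger(word)
            if tag is not None:
                break
        result.append((word, tag))
    return result

# Two toy taggers: a tiny lookup, then a default that tags everything 'NN'.
lookup = {'the': 'DT', 'is': 'VBZ'}.get   # returns None for unknown words
default = lambda word: 'NN'

print(tag_with_backoff([lookup, default], ['the', 'conference', 'is', 'awesome']))
# [('the', 'DT'), ('conference', 'NN'), ('is', 'VBZ'), ('awesome', 'NN')]
```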
Taggers can also be pickled.
import pickle
f = open('/Users/douglasstarnes/nltk_data/taggers/dumbtagger.pickle', 'wb')
pickle.dump(tagger, f)
f.close()
The reason I stored the pickle in the nltk_data directory is so I could load it with the nltk.data.load function:
dumbtagger = nltk.data.load('taggers/dumbtagger.pickle')
sent = dumbtagger.tag(nltk.word_tokenize('PyOhio is an awesome software development conference.'))
sent
You can also remove tags with the untag function:
nltk.tag.untag(sent)
Fortunately, we don't have to do that ourselves. The community (via GitHub) has provided us with scripts to automate the training process.
python train_tagger.py treebank
loading treebank
3914 tagged sents, training on 3914
training AffixTagger with affix -3 and backoff <DefaultTagger: tag=-None->
training <class 'nltk.tag.sequential.UnigramTagger'> tagger with backoff <AffixTagger: size=2536>
training <class 'nltk.tag.sequential.BigramTagger'> tagger with backoff <UnigramTagger: size=4935>
training <class 'nltk.tag.sequential.TrigramTagger'> tagger with backoff <BigramTagger: size=2324>
evaluating TrigramTagger
accuracy: 0.992362
Training on the Brown corpus this way takes forever, so I'm going to use the Penn Treebank. The script saves to nltk_data, so we can use the nltk library to load the pickle.
import nltk.data
tagger = nltk.data.load('taggers/treebank_aubt.pickle')
sent = tagger.tag(treebank.sents()[0])
sent
words = nltk.word_tokenize('PyOhio is an awesome software development conference.')
sent = tagger.tag(words)
sent
words = nltk.word_tokenize('The quick brown fox jumped over the lazy dog.')
sent = tagger.tag(words)
sent
NLTK has several different classifiers. The most common one is the NaiveBayesClassifier
.
from nltk.corpus import movie_reviews
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
p_f = [] # positive features
n_f = [] # negative features
We need to extract the features using the 'bag of words' method, which just indicates word presence. Since the classifier requires a dict, we'll make the value for each word True.
for fileid in movie_reviews.fileids('pos'):
words = movie_reviews.words(fileid)
words = dict([(word, True) for word in words])
p_f.append((words, 'pos'))
for fileid in movie_reviews.fileids('neg'):
words = movie_reviews.words(fileid)
words = dict([(word, True) for word in words])
n_f.append((words, 'neg'))
We'll use 90% to train the classifier and then 10% to test it.
training_set = p_f[:900] + n_f[:900]
test_set = p_f[900:] + n_f[900:]
Then we'll train the classifier and test its accuracy.
classifier = NaiveBayesClassifier.train(training_set)
nltk.classify.util.accuracy(classifier, test_set)
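With the bag-of-words representation, the naive Bayes decision is just a log prior plus a sum of per-word log likelihoods. Here is a toy sketch with add-one (Laplace) smoothing; the training data is made up, and NLTK's implementation differs in its details:

```python
import math
from collections import Counter

def train_nb(labeled_docs):
    """labeled_docs: list of (word_list, label). Returns a classify function."""
    label_counts = Counter(label for _, label in labeled_docs)
    word_counts = {label: Counter() for label in label_counts}
    vocab = set()
    for words, label in labeled_docs:
        word_counts[label].update(words)
        vocab.update(words)

    def classify(words):
        best_label, best_score = None, float('-inf')
        for label in label_counts:
            total = sum(word_counts[label].values())
            # log prior + sum of log likelihoods with add-one smoothing
            score = math.log(label_counts[label] / len(labeled_docs))
            for w in words:
                score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    return classify

classify = train_nb([(['great', 'fun', 'film'], 'pos'),
                     (['boring', 'awful', 'film'], 'neg')])
print(classify(['great', 'film']))   # 'pos'
```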
To measure the importance of a word in a document within a corpus we can use a formula known as TFIDF, or 'term frequency-inverse document frequency'.
The term frequency is the number of times a word appears in a document (normalized by the document's length); it measures how important the word is within that document.
The inverse document frequency discounts words that appear in many documents; it measures how distinctive a word is across the corpus.
The product of the two is TFIDF.
import math
from nltk.tokenize import WhitespaceTokenizer
documents = [
'I like to play golf and tennis',
'The local court is a place I like',
'I do not like to play tennis but I like to play golf',
'My neighbor went bowling yesterday'
]
tokenizer = WhitespaceTokenizer()
def tf(term, doc):
words = tokenizer.tokenize(doc)
terms = sum([1 for t in words if t == term])
return float(terms) / float(len(words))
def idf(term):
docs = sum([1 for doc in documents if term in tokenizer.tokenize(doc)])
return math.log(float(len(documents)) / float(1 + docs))
def tfidf(term, doc):
return tf(term, doc) * idf(term)
for doc in documents:
print tfidf('bowling', doc) # golf, play, tennis, bowling
In a production situation we would have a much larger corpus, filter out stopwords, normalize the text, etc. But this example fits on a screen.
To compute document similarity, scikit-learn provides a nice implementation in TfidfVectorizer
.
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(min_df=1) # keep terms that appear in at least one document
doc_matrix = vectorizer.fit_transform(['happy dog', 'sad dog', 'happy cat', 'sad cat'])
(doc_matrix * doc_matrix.T).A
vectorizer = TfidfVectorizer(min_df=1)
doc_matrix = vectorizer.fit_transform(documents)
(doc_matrix * doc_matrix.T).A
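The matrix product above works because TfidfVectorizer L2-normalizes each row, so the row-by-row dot products are cosine similarities. Here is a quick sketch of cosine similarity itself on hand-built term-count vectors (the vectors and vocabulary are made up to mirror the 'happy dog' example):

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of magnitudes."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy term-count vectors over the vocabulary ['happy', 'sad', 'dog', 'cat']
happy_dog = [1, 0, 1, 0]
sad_dog   = [0, 1, 1, 0]

print(round(cosine(happy_dog, sad_dog), 3))    # 0.5 (they share 'dog')
print(round(cosine(happy_dog, happy_dog), 3))  # 1.0 (identical documents)
```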
url_osu_home = 'http://www.osu.edu/'
html = urllib.urlopen(url_osu_home).read()
print html[:1000]
html = nltk.util.clean_html(html)
nltk.word_tokenize(html)
from bs4 import BeautifulSoup
html = urllib.urlopen(url_osu_home).read()
osu_home = BeautifulSoup(html)
osu_home.title
metas = osu_home.findAll('meta')
for tag in metas:
try:
print tag.attrs['content']
except KeyError:
pass
TextBlob is an NLP library built on top of NLTK and Pattern. It has a simplified API.
from textblob import TextBlob
text = 'the quick brown fox jumped over the lazy dog'
blob = TextBlob(text)
POS tagging is a breeze:
blob.tags
Translating between languages is a snap!
fr = TextBlob(text)
fr = fr.translate(to="fr")
print fr.string
And it works with non-Western/European languages:
cn = TextBlob(text)
cn = cn.translate(to="zh-CN")
print cn.string
And right to left languages:
iw = TextBlob(text)
iw = iw.translate(to="iw")
print iw.string
Misspelled words? No problem!
text = 'I spell like an amatur. I beleive it is not a serious issue. But ignorence is bliss.'
blob = TextBlob(text)
print text
print blob.correct().string
And remember all the trouble we went through with sentiment analysis before?
blobs = [
'Today is a beautiful day.',
'Today is a terrible day.',
'Today is a rainy day.',
'I love rainy days.',
'I think rainy days are beautiful.',
'Today is a sunny day.',
'I think sunny days are awful.'
]
for blob in blobs:
print '****',blob
print TextBlob(blob).sentiment