I have (I think/hope) about 90 minutes of material. This session is supposed to go for about 110 minutes. I thought we could break it up as follows:
This talk is about introductory natural language techniques using Python. Most of the talk will focus on NLTK but towards the end we'll look at a few other frameworks such as scikit-learn, TextBlob and BeautifulSoup.
NLTK is the Natural Language Toolkit. It is a set of software, data and documentation for performing Natural Language Processing (NLP) in Python.
Natural language is spoken and/or written language used by humans for communication. We will be focusing on English (because today's host only speaks English) but NLTK has some facilities for other languages as well.
Natural language processing is the automated (usually by a computer) analysis of natural language.
You must have Python version 2.6+ or 3.2+ (for NLTK 3). After that, installation is simple:
pip install nltk
NLTK has soft dependencies on numpy and matplotlib.
You'll want to have some data, or corpora, to experiment with:
nltk.download_gui()
NLTK comes with over 40 corpora, or bodies of text, that we can experiment with. The corpora have a variety of features. Here are a few of the more popular ones:
Corpora are imported as modules:
Let's take a look at a few of them in code.
import nltk
from nltk.corpus import gutenberg
gutenberg
The Gutenberg corpus is a PlaintextCorpusReader and contains a collection of text documents with no metadata.
for fileid in gutenberg.fileids():
print fileid
fileid = 'shakespeare-macbeth.txt'
words = gutenberg.words(fileid)
words[:20]
raw = gutenberg.raw(fileid)
raw[:1000]
len(words)
sents = gutenberg.sents(fileid)
sents
len(sents)
len(raw)
A 'stopword' is a word that is usually filtered out before processing. This includes words such as 'the', 'and', 'or', etc. Note that there is no definitive list, but NLTK has a fairly complete one.
from nltk.corpus import stopwords
words = stopwords.words('english')
for word in words[:25]: print word,
NLTK also has lists of stopwords in many languages.
languages = stopwords.fileids()
for lang in languages:
words = stopwords.words(lang)
print lang
for word in words[:25]: print word,
print '\n'
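Once you have a stopword list, filtering is just a membership test. Here is a minimal sketch using a small hand-picked stopword set (the real NLTK list is much longer):

```python
# A tiny hand-picked stopword set for illustration only;
# NLTK's English list has over 100 entries.
STOPWORDS = {'the', 'and', 'or', 'a', 'an', 'of', 'to', 'in', 'is'}

def remove_stopwords(tokens):
    """Keep only the tokens that are not stopwords (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

tokens = 'The quick brown fox jumped over the lazy dog'.split()
print(remove_stopwords(tokens))
# ['quick', 'brown', 'fox', 'jumped', 'over', 'lazy', 'dog']
```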
from nltk.corpus import treebank
words = treebank.tagged_words()[:25]
print ', '.join(["('%s', '%s')" % (word[0], word[1]) for word in words])
sent = treebank.tagged_sents()[0]
sent
We'll use the Penn Treebank again to tag our own text.
WordNet is a different animal. This corpus contains what are called 'synsets', or synonym rings. To get the synsets for a particular word:
from nltk.corpus import wordnet
good_synsets = wordnet.synsets('good')
for synset in good_synsets[:20]:
print synset.name
print synset.pos
print synset.definition
print '-' * 40
We can see that each synset has a definition and a part of speech or POS tag.
Suppose we are only interested in nouns.
good_nouns = wordnet.synsets('good', pos='n')
for noun in good_nouns:
print noun.name
print noun.pos
print noun.definition
print '-' * 40
Each synset also (potentially) has examples of usage.
for noun in good_nouns:
print noun.definition
if len(noun.examples) == 0:
print '**NO EXAMPLES'
else:
for example in noun.examples:
print example
print '-' * 40
Each synset can also have a set of lemmas, and each lemma can have a set of antonyms. An antonym is itself a lemma, with a structure much like that of a synset's lemmas.
noun = good_nouns[1]
for lemma in good_nouns[1].lemmas:
print lemma.name
antonyms = lemma.antonyms()
for antonym in antonyms: print '-', antonym.name
Since you'll probably have your own domain specific documents that you want to work with, NLTK lets you create custom corpora. We'll see how to do that later on.
Tokenization is the process of breaking a body of text into meaningful pieces. This is similar to parsing a math equation into tokens such as operators and operands. With natural language we might want sentences or words. NLTK provides many ways to tokenize text. Let's grab a book from Project Gutenberg to experiment with.
url = 'http://www.gutenberg.org/cache/epub/1661/pg1661.txt' # The Adventures of Sherlock Holmes
import urllib
book = urllib.urlopen(url).read()
Obviously, we could try a naive method.
tokens = book.split(' ')
print tokens[:100]
This has some drawbacks. The NLTK methods eliminate a lot of the noise for us.
We can tokenize by sentences.
from nltk import sent_tokenize
sents = sent_tokenize(book)
print 'There are', len(sents), 'sentences'
print sents[1002:1004]
And words.
from nltk import word_tokenize
words = word_tokenize(book)
print 'There are', len(words), 'words'
print word_tokenize(' '.join(sents[1002:1004]))
The word_tokenize function uses a TreebankWordTokenizer to generate tokens. As you can see, it separates contractions. To avoid this we could use a WhitespaceTokenizer.
from nltk.tokenize import WhitespaceTokenizer
tokenizer = WhitespaceTokenizer()
print tokenizer.tokenize(' '.join(sents[1002:1004]))
The WhitespaceTokenizer only splits on whitespace characters and does not separate punctuation from words. The WordPunctTokenizer will make punctuation its own token.
from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()
print tokenizer.tokenize(' '.join(sents[1002:1004]))
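The WordPunctTokenizer is, in effect, a single regular expression: runs of word characters, or runs of punctuation. Here is a sketch of the same idea using only the stdlib re module (the pattern matches the one NLTK documents for this tokenizer):

```python
import re

# Runs of word characters, or runs of non-word, non-space characters (punctuation).
WORD_PUNCT = re.compile(r'\w+|[^\w\s]+')

def word_punct_tokenize(text):
    """Split text into alternating word and punctuation tokens."""
    return WORD_PUNCT.findall(text)

print(word_punct_tokenize("I can't believe it's not butter!"))
# ['I', 'can', "'", 't', 'believe', 'it', "'", 's', 'not', 'butter', '!']
```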
Stemming is the process of reducing inflected words to their root or stem. Inflection is the modification of a word to express tense, gender, number and others. For example, the stem of 'running' is 'run'. The following is a naive approach.
def stem(word):
for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']:
if word.endswith(suffix):
return word[:-len(suffix)]
return word
print 'runs:', stem('runs')
print 'played:', stem('played')
print 'playing:', stem('playing')
print 'running:', stem('running')
This does not work for all situations, especially those where the stem is modified.
NLTK gives us several stemmer classes to accomplish this. The most common is the PorterStemmer.
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
print 'talking:', stemmer.stem('talking')
print 'talks:', stemmer.stem('talks')
print 'talked:', stemmer.stem('talked')
print 'running:', stemmer.stem('running')
NLTK also provides a RegexpStemmer, which is more of a brute-force method.
from nltk.stem import RegexpStemmer
stemmer = RegexpStemmer('ing')
print 'talking:', stemmer.stem('talking')
print 'talks:', stemmer.stem('talks')
This has the limitation of working only with the specified suffix.
stemmer = RegexpStemmer('ed')
print 'talked:', stemmer.stem('talked')
And it fails in instances where the stem itself is modified before adding a suffix:
stemmer = RegexpStemmer('ing')
print 'running:', stemmer.stem('running')
The SnowballStemmer is another place where NLTK supports multiple languages; it handles conjugation as well.
from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer('spanish')
print 'hablar:', stemmer.stem('hablar') # to speak
print 'hablo:', stemmer.stem('hablo') # I speak
Words that frequently appear in proximity are collocated. A collocation of two words is a bigram; of three, a trigram. NLTK makes finding these easy.
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures
finder = BigramCollocationFinder.from_words(gutenberg.words('shakespeare-macbeth.txt'))
bigram = BigramAssocMeasures()
finder.nbest(bigram.pmi, 25) # pmi => pointwise mutual information, scoring method
from nltk.collocations import TrigramAssocMeasures, TrigramCollocationFinder
tri_finder = TrigramCollocationFinder.from_words(gutenberg.words('shakespeare-macbeth.txt'))
trigram = TrigramAssocMeasures()
tri_finder.nbest(trigram.pmi, 25)
finder.apply_freq_filter(5)
finder.nbest(bigram.pmi, 25)
def remove_my(word):
return word == 'my'
finder.apply_word_filter(remove_my)
finder.nbest(bigram.pmi, 25)
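The PMI score used above is easy to compute directly: PMI(w1, w2) = log2(P(w1, w2) / (P(w1) * P(w2))). Here is a simplified sketch over a toy token list, estimating probabilities from adjacent-pair counts (NLTK's scorer differs in some details, so treat this as an illustration of the idea, not its exact implementation):

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """Pointwise mutual information for each adjacent word pair."""
    word_freq = Counter(tokens)
    bigram_freq = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scores = {}
    for (w1, w2), count in bigram_freq.items():
        p_w1 = word_freq[w1] / n
        p_w2 = word_freq[w2] / n
        p_bigram = count / (n - 1)      # probability of the pair among all pairs
        scores[(w1, w2)] = math.log(p_bigram / (p_w1 * p_w2), 2)
    return scores

tokens = 'to be or not to be that is the question'.split()
for pair, score in sorted(bigram_pmi(tokens).items(), key=lambda kv: -kv[1])[:3]:
    print(pair, round(score, 3))
```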
nltk.pos_tag(nltk.word_tokenize('PyOhio is an awesome software development conference.'))
from nltk.tag import DefaultTagger
default_tagger = DefaultTagger('NN')
sent = default_tagger.tag(nltk.word_tokenize('PyOhio is an awesome software development conference.'))
sent
As you can see, the DefaultTagger simply tags each token with the given tag. This is not very useful by itself (but we'll come back to it later). Let's try training a tagger on the Penn Treebank.
from nltk.tag import UnigramTagger
tagger = UnigramTagger(treebank.tagged_sents())
sent = tagger.tag(nltk.word_tokenize('PyOhio is an awesome software development conference.'))
sent
The Penn Treebank sample has about 3,900 sentences, so it's not going to have everything. That it didn't tag 'PyOhio' is not really a surprise, but it also missed 'awesome'.
Let's try a better-known set of words:
sent = tagger.tag(nltk.word_tokenize('The quick brown fox jumped over the lazy dog.'))
sent
And it didn't get all of these either.
The Brown Corpus was the first public corpus with over 1 million tagged words. Let's see if it can do any better. With about 57K sentences, it will take a little longer to train.
from nltk.corpus import brown
tagger = UnigramTagger(brown.tagged_sents(brown.fileids()))
sent = tagger.tag(nltk.word_tokenize('The quick brown fox jumped over the lazy dog.'))
sent
That's more like it! Let's try our other sentence:
sent = tagger.tag(nltk.word_tokenize('PyOhio is an awesome software development conference.'))
sent
Not any better.
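Conceptually, a unigram tagger just memorizes the most frequent tag for each word in the training data and looks words up at tagging time. Here is a minimal sketch over a made-up toy corpus (the words and tags are invented for illustration):

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Map each word to its most frequent tag in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag(model, words):
    # Words never seen in training get None, just like NLTK's UnigramTagger.
    return [(w, model.get(w)) for w in words]

train = [[('the', 'DT'), ('dog', 'NN'), ('runs', 'VBZ')],
         [('the', 'DT'), ('cat', 'NN'), ('sleeps', 'VBZ')]]
model = train_unigram(train)
print(tag(model, ['the', 'dog', 'sleeps', 'PyOhio']))
# [('the', 'DT'), ('dog', 'NN'), ('sleeps', 'VBZ'), ('PyOhio', None)]
```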
The taggers we've seen are backoff taggers: they can be chained together, and words not tagged by one tagger are handled by the next tagger in the chain. Let's put our default tagger at the end of the chain; any words missed will then be tagged 'NN'.
tagger._taggers = [tagger, default_tagger] # set the backoff chain directly (relies on a private attribute)
sent = tagger.tag(nltk.word_tokenize('PyOhio is an awesome software development conference.'))
sent
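The backoff mechanism itself is just a fall-through: each tagger in the chain handles the words the previous one left untagged. Here is a minimal sketch of the idea (the helper and toy taggers are hypothetical, not NLTK's implementation):

```python
def tag_with_backoff(taggers, words):
    """Try each tagger in order; the first non-None tag for a word wins."""
    result = []
    for word in words:
        tag = None
        for tagger in taggers:
            tag = tagger(word)
            if tag is not None:
                break
        result.append((word, tag))
    return result

# Two toy taggers: a tiny lookup, then a default that tags everything 'NN'.
lookup = {'the': 'DT', 'is': 'VBZ'}.get   # returns None for unknown words
default = lambda word: 'NN'

print(tag_with_backoff([lookup, default], ['the', 'conference', 'is', 'awesome']))
# [('the', 'DT'), ('conference', 'NN'), ('is', 'VBZ'), ('awesome', 'NN')]
```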
Taggers can also be pickled.
import pickle
f = open('/Users/douglasstarnes/nltk_data/taggers/dumbtagger.pickle', 'wb')
pickle.dump(tagger, f)
f.close()
The reason I stored the pickle in the nltk_data directory is so I could load it with the nltk.data.load function:
dumbtagger = nltk.data.load('taggers/dumbtagger.pickle')
sent = dumbtagger.tag(nltk.word_tokenize('PyOhio is an awesome software development conference.'))
sent
You can also remove tags with the untag function:
nltk.tag.untag(sent)
Fortunately, we don't have to do that ourselves. The community (via GitHub) has provided us with scripts to automate the training process.
python train_tagger.py treebank
loading treebank
3914 tagged sents, training on 3914
training AffixTagger with affix -3 and backoff <DefaultTagger: tag=-None->
training <class 'nltk.tag.sequential.UnigramTagger'> tagger with backoff <AffixTagger: size=2536>
training <class 'nltk.tag.sequential.BigramTagger'> tagger with backoff <UnigramTagger: size=4935>
training <class 'nltk.tag.sequential.TrigramTagger'> tagger with backoff <BigramTagger: size=2324>
evaluating TrigramTagger
accuracy: 0.992362
Training on the Brown corpus this way takes forever, so I'm going to use the Penn Treebank. The script saves to nltk_data, so we can use the nltk library to load the pickle.
import nltk.data
tagger = nltk.data.load('taggers/treebank_aubt.pickle')
sent = tagger.tag(treebank.sents()[0])
sent
words = nltk.word_tokenize('PyOhio is an awesome software development conference.')
sent = tagger.tag(words)
sent
words = nltk.word_tokenize('The quick brown fox jumped over the lazy dog.')
sent = tagger.tag(words)
sent
NLTK has several different classifiers. The most common one is the NaiveBayesClassifier
.
from nltk.corpus import movie_reviews
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
p_f = [] # positive features
n_f = [] # negative features
We need to extract the features using the 'bag of words' method, which just indicates word presence. Since the classifier requires a dict, we'll make the value for each word True.
for fileid in movie_reviews.fileids('pos'):
words = movie_reviews.words(fileid)
words = dict([(word, True) for word in words])
p_f.append((words, 'pos'))
for fileid in movie_reviews.fileids('neg'):
words = movie_reviews.words(fileid)
words = dict([(word, True) for word in words])
n_f.append((words, 'neg'))
We'll use 90% to train the classifier and then 10% to test it.
training_set = p_f[:900] + n_f[:900]
test_set = p_f[900:] + n_f[900:]
Then we'll train the classifier and test its accuracy.
classifier = NaiveBayesClassifier.train(training_set)
nltk.classify.util.accuracy(classifier, test_set)
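With the bag-of-words representation, the naive Bayes decision is just a log prior plus a sum of per-word log likelihoods. Here is a toy sketch with add-one (Laplace) smoothing; the training data is made up, and NLTK's implementation differs in its details:

```python
import math
from collections import Counter

def train_nb(labeled_docs):
    """labeled_docs: list of (word_list, label). Returns a classify function."""
    label_counts = Counter(label for _, label in labeled_docs)
    word_counts = {label: Counter() for label in label_counts}
    vocab = set()
    for words, label in labeled_docs:
        word_counts[label].update(words)
        vocab.update(words)

    def classify(words):
        best_label, best_score = None, float('-inf')
        for label in label_counts:
            total = sum(word_counts[label].values())
            # log prior + sum of log likelihoods with add-one smoothing
            score = math.log(label_counts[label] / len(labeled_docs))
            for w in words:
                score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    return classify

classify = train_nb([(['great', 'fun', 'film'], 'pos'),
                     (['boring', 'awful', 'film'], 'neg')])
print(classify(['great', 'film']))   # 'pos'
```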
To measure the importance of a word in a document within a corpus we can use a formula known as TFIDF, or 'term frequency-inverse document frequency'.
The term frequency is the number of times a word appears in a document (normalized by the document's length); it measures how important the word is within that document.
The inverse document frequency discounts words that appear in many documents; it measures how distinctive a word is across the corpus.
The product of the two is TFIDF.
import math
from nltk.tokenize import WhitespaceTokenizer
documents = [
'I like to play golf and tennis',
'The local court is a place I like',
'I do not like to play tennis but I like to play golf',
'My neighbor went bowling yesterday'
]
tokenizer = WhitespaceTokenizer()
def tf(term, doc):
words = tokenizer.tokenize(doc)
terms = sum([1 for t in words if t == term])
return float(terms) / float(len(words))
def idf(term):
docs = sum([1 for doc in documents if term in tokenizer.tokenize(doc)])
return math.log(float(len(documents)) / float(1 + docs))
def tfidf(term, doc):
return tf(term, doc) * idf(term)
for doc in documents:
print tfidf('bowling', doc) # golf, play, tennis, bowling
In a production situation we would have a much larger corpus, filter out stopwords, normalize the text, etc. But this example fits on a screen.
To compute document similarity, scikit-learn provides a nice implementation in TfidfVectorizer
.
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(min_df=1) # keep terms that appear in at least one document
doc_matrix = vectorizer.fit_transform(['happy dog', 'sad dog', 'happy cat', 'sad cat'])
(doc_matrix * doc_matrix.T).A
vectorizer = TfidfVectorizer(min_df=1)
doc_matrix = vectorizer.fit_transform(documents)
(doc_matrix * doc_matrix.T).A
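The matrix product above works because TfidfVectorizer L2-normalizes each row, so the row-by-row dot products are cosine similarities. Here is a quick sketch of cosine similarity itself on hand-built term-count vectors (the vectors and vocabulary are made up to mirror the 'happy dog' example):

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of magnitudes."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy term-count vectors over the vocabulary ['happy', 'sad', 'dog', 'cat']
happy_dog = [1, 0, 1, 0]
sad_dog   = [0, 1, 1, 0]

print(round(cosine(happy_dog, sad_dog), 3))    # 0.5 (they share 'dog')
print(round(cosine(happy_dog, happy_dog), 3))  # 1.0 (identical documents)
```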
url_osu_home = 'http://www.osu.edu/'
html = urllib.urlopen(url_osu_home).read()
print html[:1000]
html = nltk.util.clean_html(html)
nltk.word_tokenize(html)
from bs4 import BeautifulSoup
html = urllib.urlopen(url_osu_home).read()
osu_home = BeautifulSoup(html)
osu_home.title
metas = osu_home.findAll('meta')
for tag in metas:
try:
print tag.attrs['content']
except KeyError:
pass
TextBlob is an NLP library built on top of NLTK and Pattern. It has a simplified API.
from textblob import TextBlob
text = 'the quick brown fox jumped over the lazy dog'
blob = TextBlob(text)
POS tagging is a breeze:
blob.tags
Translating between languages is a snap!
fr = TextBlob(text)
fr = fr.translate(to="fr")
print fr.string
And it works with non-Western/European languages:
cn = TextBlob(text)
cn = cn.translate(to="zh-CN")
print cn.string
And right to left languages:
iw = TextBlob(text)
iw = iw.translate(to="iw")
print iw.string
Misspelled words? No problem!
text = 'I spell like an amatur. I beleive it is not a serious issue. But ignorence is bliss.'
blob = TextBlob(text)
print text
print blob.correct().string
And remember all the trouble we went through with sentiment analysis before?
blobs = [
'Today is a beautiful day.',
'Today is a terrible day.',
'Today is a rainy day.',
'I love rainy days.',
'I think rainy days are beautiful.',
'Today is a sunny day.',
'I think sunny days are awful.'
]
for blob in blobs:
print '****',blob
print TextBlob(blob).sentiment