NLTK¶

NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, and an active discussion forum.

Library documentation: http://www.nltk.org/

In [1]:

# needed to display the graphs
%matplotlib inline

In [2]:

# import the library and download sample texts
import nltk
nltk.download()

showing info http://nltk.github.com/nltk_data/

Out[2]:

True

In [3]:

from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

In [4]:

# examine concordances (word + context)
text1.concordance("monstrous")

Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us , 
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But 
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u

In [5]:

text1.similar("monstrous")

imperial subtly impalpable pitiable curious abundant perilous
trustworthy untoward singular lamentable few determined maddens
horrible tyrannical lazy mystifying christian exasperate

In [6]:

text2.common_contexts(["monstrous", "very"])

a_pretty is_pretty a_lucky am_glad be_glad

In [7]:

# see where in a text certain words are found to occur
text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])

In [8]:

# count of all tokens (including punctuation)
len(text3)

Out[8]:

In [9]:

# number of distinct tokens
len(set(text3))

Out[9]:

In [10]:

# the texts are just lists of strings
text2[141525:]

Out[10]:

[u'among',
 u'the',
 u'merits',
 u'and',
 u'the',
 u'happiness',
 u'of',
 u'Elinor',
 u'and',
 u'Marianne',
 u',',
 u'let',
 u'it',
 u'not',
 u'be',
 u'ranked',
 u'as',
 u'the',
 u'least',
 u'considerable',
 u',',
 u'that',
 u'though',
 u'sisters',
 u',',
 u'and',
 u'living',
 u'almost',
 u'within',
 u'sight',
 u'of',
 u'each',
 u'other',
 u',',
 u'they',
 u'could',
 u'live',
 u'without',
 u'disagreement',
 u'between',
 u'themselves',
 u',',
 u'or',
 u'producing',
 u'coolness',
 u'between',
 u'their',
 u'husbands',
 u'.',
 u'THE',
 u'END']

In [11]:

# build a frequency distribution
fdist1 = FreqDist(text1) 
fdist1

Out[11]:

FreqDist({u',': 18713, u'the': 13721, u'.': 6862, u'of': 6536, u'and': 6024, u'a': 4569, u'to': 4542, u';': 4072, u'in': 3916, u'that': 2982, ...})

In [12]:

fdist1.most_common(20)

Out[12]:

[(u',', 18713),
 (u'the', 13721),
 (u'.', 6862),
 (u'of', 6536),
 (u'and', 6024),
 (u'a', 4569),
 (u'to', 4542),
 (u';', 4072),
 (u'in', 3916),
 (u'that', 2982),
 (u"'", 2684),
 (u'-', 2552),
 (u'his', 2459),
 (u'it', 2209),
 (u'I', 2124),
 (u's', 1739),
 (u'is', 1695),
 (u'he', 1661),
 (u'with', 1659),
 (u'was', 1632)]

In [13]:

fdist1['whale']

Out[13]:

In [14]:

fdist1.plot(20, cumulative=True)

In [15]:

# apply a list comprehension to get words over 15 characters
V = set(text1)
long_words = [w for w in V if len(w) > 15]
sorted(long_words)

Out[15]:

[u'CIRCUMNAVIGATION',
 u'Physiognomically',
 u'apprehensiveness',
 u'cannibalistically',
 u'characteristically',
 u'circumnavigating',
 u'circumnavigation',
 u'circumnavigations',
 u'comprehensiveness',
 u'hermaphroditical',
 u'indiscriminately',
 u'indispensableness',
 u'irresistibleness',
 u'physiognomically',
 u'preternaturalness',
 u'responsibilities',
 u'simultaneousness',
 u'subterraneousness',
 u'supernaturalness',
 u'superstitiousness',
 u'uncomfortableness',
 u'uncompromisedness',
 u'undiscriminating',
 u'uninterpenetratingly']

In [16]:

fdist2 = FreqDist(text5)
sorted(w for w in set(text5) if len(w) > 7 and fdist2[w] > 7)

Out[16]:

[u'#14-19teens',
 u'#talkcity_adults',
 u'((((((((((',
 u'........',
 u'Question',
 u'actually',
 u'anything',
 u'computer',
 u'cute.-ass',
 u'everyone',
 u'football',
 u'innocent',
 u'listening',
 u'remember',
 u'seriously',
 u'something',
 u'together',
 u'tomorrow',
 u'watching']

In [17]:

# word sequences that appear together unusually often
text4.collocations()

United States; fellow citizens; four years; years ago; Federal
Government; General Government; American people; Vice President; Old
World; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice;
God bless; every citizen; Indian tribes; public debt; one another;
foreign nations; political parties

Raw Text Processing¶

In [18]:

# download raw text from an online repository
import urllib2
url = "http://www.gutenberg.org/files/2554/2554.txt"
response = urllib2.urlopen(url)
raw = response.read().decode('utf8')
len(raw)

Out[18]:

In [19]:

raw[:75]

Out[19]:

u'The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r\n'

In [20]:

# tokenize the raw text
from nltk import word_tokenize
tokens = word_tokenize(raw)
len(tokens)

Out[20]:

In [21]:

tokens[:10]

Out[21]:

[u'The',
 u'Project',
 u'Gutenberg',
 u'EBook',
 u'of',
 u'Crime',
 u'and',
 u'Punishment',
 u',',
 u'by']

In [22]:

text = nltk.Text(tokens)
text[1024:1062]

Out[22]:

[u'CHAPTER',
 u'I',
 u'On',
 u'an',
 u'exceptionally',
 u'hot',
 u'evening',
 u'early',
 u'in',
 u'July',
 u'a',
 u'young',
 u'man',
 u'came',
 u'out',
 u'of',
 u'the',
 u'garret',
 u'in',
 u'which',
 u'he',
 u'lodged',
 u'in',
 u'S.',
 u'Place',
 u'and',
 u'walked',
 u'slowly',
 u',',
 u'as',
 u'though',
 u'in',
 u'hesitation',
 u',',
 u'towards',
 u'K.',
 u'bridge',
 u'.']

In [23]:

text.collocations()

Katerina Ivanovna; Pyotr Petrovitch; Pulcheria Alexandrovna; Avdotya
Romanovna; Rodion Romanovitch; Marfa Petrovna; Sofya Semyonovna; old
woman; Project Gutenberg-tm; Porfiry Petrovitch; Amalia Ivanovna;
great deal; Nikodim Fomitch; young man; Ilya Petrovitch; n't know;
Project Gutenberg; Dmitri Prokofitch; Andrey Semyonovitch; Hay Market

In [24]:

raw.find("PART I")

Out[24]:

In [25]:

# HTML parsing using the Beautiful Soup library
from bs4 import BeautifulSoup
url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = urllib2.urlopen(url).read().decode('utf8')
raw = BeautifulSoup(html).get_text()
tokens = word_tokenize(raw)
tokens[0:10]

Out[25]:

[u'BBC',
 u'NEWS',
 u'|',
 u'Health',
 u'|',
 u'Blondes',
 u"'to",
 u'die',
 u'out',
 u'in']

In [26]:

# isolate just the article text
tokens = tokens[110:390]
text = nltk.Text(tokens)
text.concordance('gene')

Displaying 5 of 5 matches:
hey say too few people now carry the gene for blondes to last beyond the next 
blonde hair is caused by a recessive gene . In order for a child to have blond
 have blonde hair , it must have the gene on both sides of the family in the g
ere is a disadvantage of having that gene or by chance . They do n't disappear
des would disappear is if having the gene was a disadvantage and I do not thin

Regular Expressions¶

In [27]:

# regular expression library
import re
wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]

In [28]:

# match the end of a word
[w for w in wordlist if re.search('ed$', w)][0:10]

Out[28]:

[u'abaissed',
 u'abandoned',
 u'abased',
 u'abashed',
 u'abatised',
 u'abed',
 u'aborted',
 u'abridged',
 u'abscessed',
 u'absconded']

In [29]:

# wildcard matches any single character
[w for w in wordlist if re.search('^..j..t..$', w)][0:10]

Out[29]:

[u'abjectly',
 u'adjuster',
 u'dejected',
 u'dejectly',
 u'injector',
 u'majestic',
 u'objectee',
 u'objector',
 u'rejecter',
 u'rejector']

In [30]:

# combination of caret (start of word) and sets
[w for w in wordlist if re.search('^[ghi][mno][jlk][def]$', w)]

Out[30]:

[u'gold', u'golf', u'hold', u'hole']

In [31]:

chat_words = sorted(set(w for w in nltk.corpus.nps_chat.words()))

# plus symbol matches any number of times repeating
[w for w in chat_words if re.search('^m+i+n+e+$', w)]

Out[31]:

[u'miiiiiiiiiiiiinnnnnnnnnnneeeeeeeeee',
 u'miiiiiinnnnnnnnnneeeeeeee',
 u'mine',
 u'mmmmmmmmiiiiiiiiinnnnnnnnneeeeeeee']

In [32]:

wsj = sorted(set(nltk.corpus.treebank.words()))

# more advanced regex example
[w for w in wsj if re.search('^[0-9]+\.[0-9]+$', w)][0:10]

Out[32]:

[u'0.0085',
 u'0.05',
 u'0.1',
 u'0.16',
 u'0.2',
 u'0.25',
 u'0.28',
 u'0.3',
 u'0.4',
 u'0.5']

In [33]:

[w for w in wsj if re.search('^[A-Z]+\$$', w)]

Out[33]:

[u'C$', u'US$']

In [34]:

[w for w in wsj if re.search('^[0-9]{4}$', w)][0:10]

Out[34]:

[u'1614',
 u'1637',
 u'1787',
 u'1901',
 u'1903',
 u'1917',
 u'1925',
 u'1929',
 u'1933',
 u'1934']

In [35]:

[w for w in wsj if re.search('^[0-9]+-[a-z]{3,5}$', w)][0:10]

Out[35]:

[u'10-day',
 u'10-lap',
 u'10-year',
 u'100-share',
 u'12-point',
 u'12-year',
 u'14-hour',
 u'15-day',
 u'150-point',
 u'190-point']

In [36]:

[w for w in wsj if re.search('^[a-z]{5,}-[a-z]{2,3}-[a-z]{,6}$', w)][0:10]

Out[36]:

[u'black-and-white',
 u'bread-and-butter',
 u'father-in-law',
 u'machine-gun-toting',
 u'savings-and-loan']

In [37]:

[w for w in wsj if re.search('(ed|ing)$', w)][0:10]

Out[37]:

[u'62%-owned',
 u'Absorbed',
 u'According',
 u'Adopting',
 u'Advanced',
 u'Advancing',
 u'Alfred',
 u'Allied',
 u'Annualized',
 u'Anything']

In [38]:

# using "findall" to extract partial matches from words
fd = nltk.FreqDist(vs for word in wsj 
                      for vs in re.findall(r'[aeiou]{2,}', word))
fd.most_common(12)

Out[38]:

[(u'io', 549),
 (u'ea', 476),
 (u'ie', 331),
 (u'ou', 329),
 (u'ai', 261),
 (u'ia', 253),
 (u'ee', 217),
 (u'oo', 174),
 (u'ua', 109),
 (u'au', 106),
 (u'ue', 105),
 (u'ui', 95)]

Normalizing Text¶

In [39]:

# NLTK has several word stemmers built in
porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()

In [40]:

[porter.stem(t) for t in tokens][0:10]

Out[40]:

[u'UK',
 u'Blond',
 u"'to",
 u'die',
 u'out',
 u'in',
 u'200',
 u"years'",
 u'Scientist',
 u'believ']

In [41]:

[lancaster.stem(t) for t in tokens][0:10]

Out[41]:

[u'uk',
 u'blond',
 u"'to",
 u'die',
 u'out',
 u'in',
 u'200',
 u"years'",
 u'sci',
 u'believ']

In [42]:

wnl = nltk.WordNetLemmatizer()
[wnl.lemmatize(t) for t in tokens][0:10]

Out[42]:

[u'UK',
 u'Blondes',
 u"'to",
 u'die',
 u'out',
 u'in',
 u'200',
 u"years'",
 u'Scientists',
 u'believe']

In [43]:

# also has a tokenizer that takes a regular expression as a parameter
text = 'That U.S.A. poster-print costs $12.40...'
pattern = r'''(?x)    # set flag to allow verbose regexps
     ([A-Z]\.)+        # abbreviations, e.g. U.S.A.
   | \w+(-\w+)*        # words with optional internal hyphens
   | \$?\d+(\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
   | \.\.\.            # ellipsis
   | [][.,;"'?():-_`]  # these are separate tokens; includes ], [
'''
nltk.regexp_tokenize(text, pattern)

Out[43]:

['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']

Tagging¶

In [44]:

# Use a built-in tokenizer and tagger
text = word_tokenize("They refuse to permit us to obtain the refuse permit")
nltk.pos_tag(text)

Out[44]:

[('They', 'PRP'),
 ('refuse', 'VBP'),
 ('to', 'TO'),
 ('permit', 'VB'),
 ('us', 'PRP'),
 ('to', 'TO'),
 ('obtain', 'VB'),
 ('the', 'DT'),
 ('refuse', 'NN'),
 ('permit', 'NN')]

In [45]:

# Word similarity using a pre-tagged text
text = nltk.Text(word.lower() for word in nltk.corpus.brown.words())
text.similar('woman')

man time day year car moment world family house country child boy
state job way war girl place word work

In [46]:

# Tagged words are saved as tuples
nltk.corpus.brown.tagged_words()[0:10]

Out[46]:

[(u'The', u'AT'),
 (u'Fulton', u'NP-TL'),
 (u'County', u'NN-TL'),
 (u'Grand', u'JJ-TL'),
 (u'Jury', u'NN-TL'),
 (u'said', u'VBD'),
 (u'Friday', u'NR'),
 (u'an', u'AT'),
 (u'investigation', u'NN'),
 (u'of', u'IN')]

In [47]:

nltk.corpus.brown.tagged_words(tagset='universal')[0:10]

Out[47]:

[(u'The', u'DET'),
 (u'Fulton', u'NOUN'),
 (u'County', u'NOUN'),
 (u'Grand', u'ADJ'),
 (u'Jury', u'NOUN'),
 (u'said', u'VERB'),
 (u'Friday', u'NOUN'),
 (u'an', u'DET'),
 (u'investigation', u'NOUN'),
 (u'of', u'ADP')]

In [48]:

from nltk.corpus import brown
brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
tag_fd.most_common()

Out[48]:

[(u'NOUN', 30640),
 (u'VERB', 14399),
 (u'ADP', 12355),
 (u'.', 11928),
 (u'DET', 11389),
 (u'ADJ', 6706),
 (u'ADV', 3349),
 (u'CONJ', 2717),
 (u'PRON', 2535),
 (u'PRT', 2264),
 (u'NUM', 2166),
 (u'X', 106)]

In [49]:

# Part of speech tag count for words following "often" in a text
brown_lrnd_tagged = brown.tagged_words(categories='learned', tagset='universal')
tags = [b[1] for (a, b) in nltk.bigrams(brown_lrnd_tagged) if a[0] == 'often']
fd = nltk.FreqDist(tags)
fd.tabulate()

VERB  ADV  ADP  ADJ    .  PRT 
  37    8    7    6    4    2

In [50]:

# Load some raw sentences to tag
from nltk.corpus import brown
brown_tagged_sents = brown.tagged_sents(categories='news')
brown_sents = brown.sents(categories='news')

In [51]:

# Default tagger (assigns same tag to each token)
tags = [tag for (word, tag) in brown.tagged_words(categories='news')]
nltk.FreqDist(tags).max()
raw = 'I do not like green eggs and ham, I do not like them Sam I am!'
tokens = word_tokenize(raw)
default_tagger = nltk.DefaultTagger('NN')
default_tagger.tag(tokens)

Out[51]:

[('I', 'NN'),
 ('do', 'NN'),
 ('not', 'NN'),
 ('like', 'NN'),
 ('green', 'NN'),
 ('eggs', 'NN'),
 ('and', 'NN'),
 ('ham', 'NN'),
 (',', 'NN'),
 ('I', 'NN'),
 ('do', 'NN'),
 ('not', 'NN'),
 ('like', 'NN'),
 ('them', 'NN'),
 ('Sam', 'NN'),
 ('I', 'NN'),
 ('am', 'NN'),
 ('!', 'NN')]

In [52]:

# Evaluate the performance against a tagged corpus
default_tagger.evaluate(brown_tagged_sents)

Out[52]:

0.13089484257215028

In [53]:

# Training a unigram tagger
from nltk.corpus import brown
brown_tagged_sents = brown.tagged_sents(categories='news')
brown_sents = brown.sents(categories='news')
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
unigram_tagger.tag(brown_sents[2007])

Out[53]:

[(u'Various', u'JJ'),
 (u'of', u'IN'),
 (u'the', u'AT'),
 (u'apartments', u'NNS'),
 (u'are', u'BER'),
 (u'of', u'IN'),
 (u'the', u'AT'),
 (u'terrace', u'NN'),
 (u'type', u'NN'),
 (u',', u','),
 (u'being', u'BEG'),
 (u'on', u'IN'),
 (u'the', u'AT'),
 (u'ground', u'NN'),
 (u'floor', u'NN'),
 (u'so', u'QL'),
 (u'that', u'CS'),
 (u'entrance', u'NN'),
 (u'is', u'BEZ'),
 (u'direct', u'JJ'),
 (u'.', u'.')]

In [54]:

# Now evalute it
unigram_tagger.evaluate(brown_tagged_sents)

Out[54]:

0.9349006503968017

In [55]:

# Combining taggers
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(brown_tagged_sents, backoff=t0)
t2 = nltk.BigramTagger(brown_tagged_sents, backoff=t1)
t2.evaluate(brown_tagged_sents)

Out[55]:

0.9730592517453309

Classifying Text¶

In [56]:

# Define a feature extractor
def gender_features(word):
        return {'last_letter': word[-1]}
gender_features('Shrek')

Out[56]:

{'last_letter': 'k'}

In [57]:

# Prepare a list of examples
from nltk.corpus import names
labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
    [(name, 'female') for name in names.words('female.txt')])
import random
random.shuffle(labeled_names)

In [58]:

# Process the names data
featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [59]:

classifier.classify(gender_features('Neo'))

Out[59]:

'male'

In [60]:

classifier.classify(gender_features('Trinity'))

Out[60]:

'female'

In [61]:

print(nltk.classify.accuracy(classifier, test_set))

0.752

In [62]:

classifier.show_most_informative_features(5)

Most Informative Features
             last_letter = u'a'           female : male   =     35.4 : 1.0
             last_letter = u'k'             male : female =     31.9 : 1.0
             last_letter = u'f'             male : female =     17.4 : 1.0
             last_letter = u'p'             male : female =     11.3 : 1.0
             last_letter = u'm'             male : female =     10.2 : 1.0

In [63]:

# Document classification
from nltk.corpus import movie_reviews
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

In [64]:

all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = all_words.keys()[:2000]

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features

In [65]:

featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [66]:

print(nltk.classify.accuracy(classifier, test_set))

0.64

In [67]:

classifier.show_most_informative_features(5)

Most Informative Features
          contains(sans) = True              neg : pos    =      8.4 : 1.0
     contains(uplifting) = True              pos : neg    =      8.2 : 1.0
    contains(mediocrity) = True              neg : pos    =      7.7 : 1.0
     contains(dismissed) = True              pos : neg    =      7.0 : 1.0
   contains(overwhelmed) = True              pos : neg    =      6.3 : 1.0