Tokenization


Tokenization, or text segmentation, is the problem of dividing a string of written language into its component words.

The simplest way to divide a text into a list of its words is to split on whitespace.

In [53]:
text = "Let's eat, grandpa"
print(text.split())
["Let's", 'eat,', 'grandpa']

The problem with that approach is that contractions are not handled (Let's is not split into Let + 's) and punctuation marks stay attached to the nearest word (eat, -> eat + ,).
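
For illustration, here is a minimal regex-based sketch that matches either runs of word characters or single punctuation marks; it already separates the punctuation and splits the contraction:

In [ ]:
import re

text = "Let's eat, grandpa"
# match runs of word characters, or any single character that is
# neither a word character nor whitespace (i.e. punctuation)
print(re.findall(r"\w+|[^\w\s]", text))
# ['Let', "'", 's', 'eat', ',', 'grandpa']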

Rather than maintaining such rules by hand, the right way to tokenize is to use a dedicated tokenizer. Most NLP libraries offer their own tokenizers; here we will use tokenizers from the NLTK library.

The NLTK library offers many tokenizers. We'll work with the WordPunctTokenizer.

But first let's install NLTK and download the necessary resources.

In [65]:
!pip install nltk 
Requirement already satisfied: nltk in /Users/alexis/anaconda3/envs/amcp/lib/python3.8/site-packages (3.5)
Requirement already satisfied: click in /Users/alexis/anaconda3/envs/amcp/lib/python3.8/site-packages (from nltk) (7.1.2)
Requirement already satisfied: tqdm in /Users/alexis/anaconda3/envs/amcp/lib/python3.8/site-packages (from nltk) (4.49.0)
Requirement already satisfied: joblib in /Users/alexis/anaconda3/envs/amcp/lib/python3.8/site-packages (from nltk) (0.16.0)
Requirement already satisfied: regex in /Users/alexis/anaconda3/envs/amcp/lib/python3.8/site-packages (from nltk) (2020.7.14)
In [66]:
import nltk
nltk.download('popular')
[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to
[nltk_data]    |     /Users/alexis/nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package gazetteers to
[nltk_data]    |     /Users/alexis/nltk_data...
[nltk_data]    |   Package gazetteers is already up-to-date!
[nltk_data]    | Downloading package genesis to
[nltk_data]    |     /Users/alexis/nltk_data...
[nltk_data]    |   Package genesis is already up-to-date!
[nltk_data]    | Downloading package gutenberg to
[nltk_data]    |     /Users/alexis/nltk_data...
[nltk_data]    |   Package gutenberg is already up-to-date!
[nltk_data]    | Downloading package inaugural to
[nltk_data]    |     /Users/alexis/nltk_data...
[nltk_data]    |   Package inaugural is already up-to-date!
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /Users/alexis/nltk_data...
[nltk_data]    |   Package movie_reviews is already up-to-date!
[nltk_data]    | Downloading package names to
[nltk_data]    |     /Users/alexis/nltk_data...
[nltk_data]    |   Package names is already up-to-date!
[nltk_data]    | Downloading package shakespeare to
[nltk_data]    |     /Users/alexis/nltk_data...
[nltk_data]    |   Package shakespeare is already up-to-date!
[nltk_data]    | Downloading package stopwords to
[nltk_data]    |     /Users/alexis/nltk_data...
[nltk_data]    |   Package stopwords is already up-to-date!
[nltk_data]    | Downloading package treebank to
[nltk_data]    |     /Users/alexis/nltk_data...
[nltk_data]    |   Package treebank is already up-to-date!
[nltk_data]    | Downloading package twitter_samples to
[nltk_data]    |     /Users/alexis/nltk_data...
[nltk_data]    |   Package twitter_samples is already up-to-date!
[nltk_data]    | Downloading package omw to /Users/alexis/nltk_data...
[nltk_data]    |   Package omw is already up-to-date!
[nltk_data]    | Downloading package wordnet to
[nltk_data]    |     /Users/alexis/nltk_data...
[nltk_data]    |   Package wordnet is already up-to-date!
[nltk_data]    | Downloading package wordnet_ic to
[nltk_data]    |     /Users/alexis/nltk_data...
[nltk_data]    |   Package wordnet_ic is already up-to-date!
[nltk_data]    | Downloading package words to
[nltk_data]    |     /Users/alexis/nltk_data...
[nltk_data]    |   Package words is already up-to-date!
[nltk_data]    | Downloading package maxent_ne_chunker to
[nltk_data]    |     /Users/alexis/nltk_data...
[nltk_data]    |   Package maxent_ne_chunker is already up-to-date!
[nltk_data]    | Downloading package punkt to
[nltk_data]    |     /Users/alexis/nltk_data...
[nltk_data]    |   Package punkt is already up-to-date!
[nltk_data]    | Downloading package snowball_data to
[nltk_data]    |     /Users/alexis/nltk_data...
[nltk_data]    |   Package snowball_data is already up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /Users/alexis/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | 
[nltk_data]  Done downloading collection popular
Out[66]:
True

Apply the WordPunctTokenizer

We get a different result: the punctuation marks are now handled as separate tokens and the contraction is split.

In [55]:
from nltk.tokenize import WordPunctTokenizer
tokens = WordPunctTokenizer().tokenize("Let's eat your soup, Grandpa.")
print(tokens)
['Let', "'", 's', 'eat', 'your', 'soup', ',', 'Grandpa', '.']

Let's tokenize the text from the Wikipedia Earth page and look at the frequency of the most common words.

In [56]:
from nltk.tokenize import WordPunctTokenizer
from collections import Counter
import requests

def wikipedia_page(title):
    '''
    This function returns the raw text of a wikipedia page 
    given a wikipedia page title
    '''
    params = { 
        'action': 'query', 
        'format': 'json', # request json formatted content
        'titles': title, # title of the wikipedia page
        'prop': 'extracts', 
        'explaintext': True
    }
    # send a request to the wikipedia api 
    response = requests.get(
         'https://en.wikipedia.org/w/api.php',
         params= params
     ).json()

    # Parse the result
    page = next(iter(response['query']['pages'].values()))
    # return the page content 
    if 'extract' in page.keys():
        return page['extract']
    else:
        return "Page not found"
In [57]:
text = wikipedia_page('Earth').lower()
tokens = WordPunctTokenizer().tokenize(text)
print(Counter(tokens).most_common(20))
[('the', 672), (',', 525), ('.', 455), ('of', 318), ('and', 239), ('earth', 206), ('is', 169), ('to', 154), ('in', 147), ('a', 126), ('s', 120), ("'", 119), ('(', 99), ('-', 76), ('by', 72), ('from', 69), ('that', 68), ('as', 67), ('at', 57), ('with', 52)]

We now see that earth and earth's, for instance, are no longer counted as separate tokens and that the punctuation marks are standalone tokens. This will come in handy if we want to remove them.
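
For example, here is a minimal sketch of how we could filter out the punctuation tokens: it simply drops any token that contains no alphanumeric character.

In [ ]:
# a minimal sketch: drop tokens made up entirely of punctuation
word_tokens = [t for t in tokens if any(c.isalnum() for c in t)]
print(Counter(word_tokens).most_common(10))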

Tokenization on characters

We can also tokenize on characters instead of words.

In [58]:
# example of character tokenization
char_tokens = list(text)
print("Most common characters in the text")
print(Counter(char_tokens).most_common(20))
print()
print(f"All characters in the text: \n{set(char_tokens)}")
Most common characters in the text
[(' ', 8005), ('e', 4931), ('t', 3855), ('a', 3640), ('i', 2954), ('o', 2921), ('s', 2715), ('r', 2661), ('n', 2645), ('h', 1944), ('l', 1826), ('c', 1414), ('d', 1270), ('m', 1169), ('u', 997), ('p', 841), ('f', 837), ('g', 719), ('y', 592), (',', 568)]

All characters in the text: 
{' ', 'f', 'm', '9', '5', 'b', 'k', 'r', '4', '−', ':', 'ñ', '1', 'w', 'ō', '.', '+', 'c', ',', '0', '–', 'h', 'č', '°', 'g', 'þ', '3', ';', 'v', 'z', 'æ', '(', '/', '%', '*', 'i', '-', 'q', 'p', 't', 's', 'u', 'ῆ', '8', '=', 'ð', 'l', 'ū', 'ē', 'ć', '?', 'ʻ', 'n', 'o', "'", 'a', '\n', '—', 'á', '7', 'j', 'e', '6', ')', 'ö', '"', '×', 'x', '±', 'γ', 'd', 'µ', 'y', '2'}

N-grams

Some words are better taken together: New York, happy end, Wall Street, linear regression, etc. When tokenizing, we want to consider all possible adjacent pairs of words in the text. We can do this with the NLTK ngrams function.

In [59]:
from nltk import ngrams

text = "How much wood would a woodchuck chuck if a woodchuck could chuck wood?".lower()

# Tokenize
tokens = WordPunctTokenizer().tokenize(text)

# bigrams 
bigrams = list(ngrams(tokens, n=2))
print(bigrams)

print()
bigrams = ['_'.join(bg) for bg in bigrams]
print(bigrams)
[('how', 'much'), ('much', 'wood'), ('wood', 'would'), ('would', 'a'), ('a', 'woodchuck'), ('woodchuck', 'chuck'), ('chuck', 'if'), ('if', 'a'), ('a', 'woodchuck'), ('woodchuck', 'could'), ('could', 'chuck'), ('chuck', 'wood'), ('wood', '?')]

['how_much', 'much_wood', 'wood_would', 'would_a', 'a_woodchuck', 'woodchuck_chuck', 'chuck_if', 'if_a', 'a_woodchuck', 'woodchuck_could', 'could_chuck', 'chuck_wood', 'wood_?']
In [60]:
# and for trigrams

trigrams = ['_'.join(w) for w in  ngrams(tokens,n=3)]

print(trigrams)
['how_much_wood', 'much_wood_would', 'wood_would_a', 'would_a_woodchuck', 'a_woodchuck_chuck', 'woodchuck_chuck_if', 'chuck_if_a', 'if_a_woodchuck', 'a_woodchuck_could', 'woodchuck_could_chuck', 'could_chuck_wood', 'chuck_wood_?']

Add ngrams to the list of tokens

Let's add the bigrams and trigrams to the list of tokens from the Wikipedia Earth page and look at the frequency of ngrams.

In [61]:
text = wikipedia_page('Earth').lower()
unigrams = WordPunctTokenizer().tokenize(text)
bigrams = ['_'.join(w) for w in  ngrams(unigrams,n=2)]
trigrams = ['_'.join(w) for w in  ngrams(unigrams,n=3)]
In [62]:
tokens = unigrams + bigrams + trigrams
In [63]:
print(f"we have a total of {len(tokens)} tokens, including: \n- {len(unigrams)} unigrams \n- {len(bigrams)} bigrams \n- {len(trigrams)} trigrams. ")
we have a total of 30201 tokens, including: 
- 10068 unigrams 
- 10067 bigrams 
- 10066 trigrams. 
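
As a side note, NLTK also provides an everygrams helper that builds unigrams, bigrams, and trigrams in a single pass; this sketch should produce the same combined list of tokens as the manual concatenation above.

In [ ]:
from nltk.util import everygrams

# build all n-grams from length 1 to 3 in one pass
tokens_alt = ['_'.join(g) for g in everygrams(unigrams, min_len=1, max_len=3)]
print(len(tokens_alt))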
In [64]:
Counter(tokens).most_common(50)
Out[64]:
[('the', 672),
 (',', 525),
 ('.', 455),
 ('of', 318),
 ('and', 239),
 ('earth', 206),
 ('is', 169),
 ('to', 154),
 ('in', 147),
 ('a', 126),
 ('s', 120),
 ("'", 119),
 ("'_s", 119),
 ('(', 99),
 ("earth_'", 98),
 ("earth_'_s", 98),
 ('of_the', 87),
 ('._the', 80),
 ('-', 76),
 ('by', 72),
 ('from', 69),
 ('that', 68),
 ('as', 67),
 ('at', 57),
 (',_and', 56),
 ('with', 52),
 ('are', 52),
 ('sun', 49),
 ('surface', 49),
 (',_the', 49),
 ('in_the', 48),
 (')', 46),
 ('on', 45),
 ('about', 43),
 ('the_sun', 42),
 ('===', 40),
 ('of_earth', 40),
 ('to_the', 40),
 ('this', 39),
 ('million', 35),
 ('atmosphere', 34),
 ('1', 34),
 ('its', 34),
 ('life', 33),
 ('for', 32),
 ('moon', 32),
 ('solar', 31),
 ('%', 30),
 ('water', 30),
 ('years', 30)]

We have multiple bigrams in the top 50 tokens:

  • of_the
  • of_earth
  • in_the

Adding ngrams to a list of tokens may help down the line when classifying text.
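
For instance, if we were to build features for a classifier with scikit-learn (a sketch, assuming scikit-learn is installed; it is not used elsewhere in this notebook), CountVectorizer can generate unigram, bigram, and trigram counts directly:

In [ ]:
from sklearn.feature_extraction.text import CountVectorizer

# hypothetical example: turn a couple of documents into unigram-to-trigram count features
docs = ["Let's eat your soup, Grandpa.",
        "How much wood would a woodchuck chuck if a woodchuck could chuck wood?"]
vectorizer = CountVectorizer(ngram_range=(1, 3))
X = vectorizer.fit_transform(docs)
print(X.shape)
print(vectorizer.get_feature_names_out()[:10])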
