Stemming and Lemmatization

Lemmatization is the algorithmic process of determining the lemma of a word based on its intended meaning.

Stemming is the process of reducing inflected (and sometimes derived) words to their word stem, base, or root form.
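To make the difference concrete, here is a minimal sketch contrasting the two with NLTK's PorterStemmer and WordNetLemmatizer (it assumes NLTK and its 'wordnet' resource are installed, as in the install cell below):

In [ ]:
from nltk.stem import PorterStemmer, WordNetLemmatizer

stem  = PorterStemmer().stem
lemma = WordNetLemmatizer().lemmatize

for word in ['studies', 'meetings', 'was']:
    print(f"{word:10} stem: {stem(word):10} lemma: {lemma(word)}")
# the stems come out as 'studi', 'meet', 'wa';
# the lemmas as 'study', 'meeting', 'was'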

Let's see how stemming works on the Wikipedia page for Earth.

In [52]:
import requests
def wikipedia_page(title):
    '''
    This function returns the raw text of a wikipedia page 
    given a wikipedia page title
    '''
    params = { 
        'action': 'query', 
        'format': 'json', # request json formatted content
        'titles': title, # title of the wikipedia page
        'prop': 'extracts', 
        'explaintext': True
    }
    # send a request to the wikipedia api 
    response = requests.get(
        'https://en.wikipedia.org/w/api.php',
        params=params
    ).json()

    # Parse the result
    page = next(iter(response['query']['pages'].values()))
    # return the page content 
    if 'extract' in page:
        return page['extract']
    else:
        return "Page not found"
In [53]:
# uncomment these lines to install NLTK and download relevant resources
#!pip install nltk 
#import nltk
#nltk.download('popular')
In [54]:
from nltk.tokenize import WordPunctTokenizer
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

# Get the text, for instance from Wikipedia. 
text    = wikipedia_page('Earth').lower()

# Tokenize and remove stopwords
tokens  = WordPunctTokenizer().tokenize(text)
stop_words = set(stopwords.words('english'))  # build the set once, not per token
tokens = [tk for tk in tokens if tk not in stop_words]
In [55]:
# Instantiate a stemmer
ps      = PorterStemmer()

# and stem
stems   = [ps.stem(tk) for tk in tokens ]
In [56]:
# look at a random selection of stemmed tokens
import numpy as np
for i in range(5):
    print()
    print(np.random.choice(stems, size = 10))
['gone' 'approxim' 'oxygen' 'orbit' ',' 'sea' 'earth' 'earth' '78' 'brief']

['19th' 'water' ',' 'mantl' ',' '.' 'properti' ',' 'although' 'classif']

['iron' ',' '89' '(' 'call' 'earli' 'height' 'much' 'temperatur' 'sun']

['-' 'multicellular' ',' 'motion' 'fresh' '1' 'age' 'microbi' ',' ',']

['dramat' 'myth' '918' 'complet' '%' ',' 'view' '232' 'atmospher' 'within']

Your results will differ, but we can see that some words are brutally truncated.
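To see what happened, we can feed a few plausible source words back through the stemmer (hedged guesses, since we only see the stems), and measure how much the vocabulary shrank:

In [ ]:
# reproduce a few of the stems above from plausible source words
for word in ['approximately', 'properties', 'classification', 'dramatically']:
    print(f"{word:15} -> {ps.stem(word)}")

# stemming also reduces the number of distinct forms
print('unique tokens:', len(set(tokens)))
print('unique stems :', len(set(stems)))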

Lemmatize with spaCy

Since stemming can be brutal, we need a smarter way to reduce the number of word forms. Lemmatization reduces a word to its lemma, which is the form you would find in a dictionary.

Let's see how we can tokenize and lemmatize with the spaCy library (https://spacy.io).

See https://spacy.io/usage for instructions on installing spaCy and downloading its models.
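For reference, in a notebook the setup typically boils down to the following (commented out, as with NLTK above):

In [ ]:
# uncomment these lines to install spaCy and download the small English model
#!pip install spacy
#!python -m spacy download en_core_web_sm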

In [57]:
#import the library
import spacy

# load the small English model
nlp = spacy.load("en_core_web_sm")

Tokenization

Right out of the box

In [58]:
# parse a text
doc = nlp("Roads? Where we’re going we don’t need roads!")

for token in doc:
    print(token)
Roads
?
Where
we
’re
going
we
do
n’t
need
roads
!

Lemmatization

Also right out of the box

The lemma of a token is directly available via the token.lemma_ attribute.

In [59]:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("I came in and met with her teammates at the meeting.")

print(f"{'Token':10}\t Lemma ")
print(f"{'-------':10}\t ------- ")
for token in doc:
    print(f"{token.text:10}\t {token.lemma_} ")
Token     	 Lemma 
-------   	 ------- 
I         	 -PRON- 
came      	 come 
in        	 in 
and       	 and 
met       	 meet 
with      	 with 
her       	 -PRON- 
teammates 	 teammate 
at        	 at 
the       	 the 
meeting   	 meeting 
.         	 . 

Notice how the verb "met" was correctly lemmatized to "meet", while the noun "meeting" was left as "meeting". The lemmatization of a word depends on its context and its grammatical role.
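We can check this by putting the same surface form in two different grammatical roles (a small illustrative sketch; with this model, the verb should come out as "meet" and the noun as "meeting"):

In [ ]:
# the same surface form, lemmatized differently depending on its role
for sent in ["They are meeting at noon.", "The meeting ran late."]:
    doc = nlp(sent)
    print([(token.text, token.lemma_) for token in doc if token.text == "meeting"])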

Form detection

spaCy offers many other functions, including some handy word characterization methods:

  • is_space
  • is_punct
  • is_upper
  • is_digit
In [60]:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("All aboard! \t Train NXH123 departs from platform 22 at 3:16 sharp.")


print(f"token \t\tspace? \tpunct?\tupper?\tdigit?")


for token in doc:
    print(f"{str(token):10} \t{token.is_space} \t{token.is_punct} \t{token.is_upper} \t{token.is_digit}")
token 		space? 	punct?	upper?	digit?
All        	False 	False 	False 	False
aboard     	False 	False 	False 	False
!          	False 	True 	False 	False
	          	True 	False 	False 	False
Train      	False 	False 	False 	False
NXH123     	False 	False 	True 	False
departs    	False 	False 	False 	False
from       	False 	False 	False 	False
platform   	False 	False 	False 	False
22         	False 	False 	False 	True
at         	False 	False 	False 	False
3:16       	False 	False 	False 	False
sharp      	False 	False 	False 	False
.          	False 	True 	False 	False

There's plenty more to spaCy, which we will explore in a future notebook.
