Lemmatization is the algorithmic process of determining the lemma of a word based on its intended meaning.
Stemming is the process of reducing inflected (and sometimes derived) words to their word stem, base, or root form.
Let's see how stemming works on the Wikipedia page for Earth.
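To get a first feel for the difference, here is a minimal sketch contrasting NLTK's PorterStemmer and WordNetLemmatizer on a few words (it assumes NLTK and its wordnet data are already installed; the install cell appears further down):
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ['studies', 'meeting', 'are']:
    # the stemmer chops off suffixes; the lemmatizer maps to a dictionary form
    print(word, '->', stemmer.stem(word), '|', lemmatizer.lemmatize(word, pos='v'))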
import requests

def wikipedia_page(title):
    '''
    This function returns the raw text of a wikipedia page
    given a wikipedia page title
    '''
    params = {
        'action': 'query',
        'format': 'json',      # request json formatted content
        'titles': title,       # title of the wikipedia page
        'prop': 'extracts',
        'explaintext': True
    }
    # send a request to the wikipedia api
    response = requests.get(
        'https://en.wikipedia.org/w/api.php',
        params=params
    ).json()
    # parse the result
    page = next(iter(response['query']['pages'].values()))
    # return the page content
    if 'extract' in page.keys():
        return page['extract']
    else:
        return "Page not found"
# uncomment these lines to install NLTK and download relevant resources
#!pip install nltk
#import nltk
#nltk.download('popular')
from nltk.tokenize import WordPunctTokenizer
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
# Get the text, for instance from Wikipedia.
text = wikipedia_page('Earth').lower()
# Tokenize and remove stopwords
tokens = WordPunctTokenizer().tokenize(text)
tokens = [tk for tk in tokens if tk not in stopwords.words('english')]
# Instantiate a stemmer
ps = PorterStemmer()
# and stem
stems = [ps.stem(tk) for tk in tokens ]
# look at a random selection of stemmed tokens
import numpy as np
for i in range(5):
    print()
    print(np.random.choice(stems, size=10))
Your results will differ, but you can see that some words are brutally truncated.
Since stemming can be brutal, we need a smarter way to reduce the number of word forms. Lemmatization reduces a word to its lemma, and the lemma is the form of the word you would find in a dictionary.
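One way to see how much stemming collapses the vocabulary is to compare the number of distinct tokens before and after stemming (a short sketch reusing the tokens and stems lists defined above):
# compare vocabulary sizes before and after stemming
print('unique tokens:', len(set(tokens)))
print('unique stems :', len(set(stems)))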
Let's see how we can tokenize and lemmatize with the spaCy library (spacy.io).
See https://spacy.io/usage to install spaCy and download the models.
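For reference, a typical installation follows the same pattern as the NLTK cell above (run in a notebook cell; adapt to your environment):
# uncomment these lines to install spaCy and download the small English model
#!pip install spacy
#!python -m spacy download en_core_web_sm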
#import the library
import spacy
# load the small English model
nlp = spacy.load("en_core_web_sm")
Right out of the box, spaCy parses a text and splits it into tokens:
# parse a text
doc = nlp("Roads? Where we’re going we don’t need roads!")
for token in doc:
    print(token)
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("I came in and met with her teammates at the meeting.")
print(f"{'Token':10}\t Lemma ")
print(f"{'-------':10}\t ------- ")
for token in doc:
    print(f"{token.text:10}\t {token.lemma_}")
Notice how the verb "met" was correctly lemmatized to "meet", while the noun "meeting" was kept as "meeting". The lemmatization of a word depends on its context and its grammatical role.
spaCy offers many other functions, including some handy word characterization methods.
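To make the influence of grammatical role visible, we can print the part-of-speech tag spaCy assigns next to each lemma (a short sketch reusing the doc parsed above):
# show the part-of-speech tag that informs the lemma
for token in doc:
    print(f"{token.text:10}\t {token.pos_:6}\t {token.lemma_}")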
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("All aboard! \t Train NXH123 departs from platform 22 at 3:16 sharp.")
print(f"token \t\tspace? \tpunct?\tupper?\tdigit?")
token, token.is_space, token.is_punct, token.is_upper, token.is_digit
for token in doc:
print(f"{str(token):10} \t{token.is_space} \t{token.is_punct} \t{token.is_upper} \t{token.is_digit}")
There's plenty more to spaCy that we will explore in a future notebook.