Let's count the frequency of the words in the Wikipedia Earth page.
We use the same function as before to download content from Wikipedia.
import requests

def wikipedia_page(title):
    '''
    This function returns the raw text of a wikipedia page
    given a wikipedia page title
    '''
    params = {
        'action': 'query',
        'format': 'json',     # request json formatted content
        'titles': title,      # title of the wikipedia page
        'prop': 'extracts',
        'explaintext': True
    }
    # send a request to the wikipedia api
    response = requests.get(
        'https://en.wikipedia.org/w/api.php',
        params=params
    ).json()
    # Parse the result
    page = next(iter(response['query']['pages'].values()))
    # return the page content
    if 'extract' in page.keys():
        return page['extract']
    else:
        return "Page not found"
text = wikipedia_page('Earth').lower()
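As a quick sanity check, we can look at the size of the downloaded text and a short preview of its beginning:
# print the size of the text and a short preview
print(len(text))
print(text[:200])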
Next, we count the number of times each word appears in the text.
First we split the text on the space character and then use the Counter class to find the 20 most common words.
from collections import Counter
# we transform the text into a list of words
# by splitting over the space character ' '
word_list = text.split(' ')
# and count the words
word_counts = Counter(word_list)
word_counts.most_common(20)
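Note that splitting only on the space character leaves punctuation and newlines attached to the words, so for example 'earth' and 'earth.' end up counted separately. A rough alternative is to tokenize with a regular expression; the sketch below keeps only runs of lowercase letters, which is just one possible choice of pattern:
import re

# keep only runs of lowercase letters as tokens
# (the text was already lowercased above)
tokens = re.findall(r'[a-z]+', text)
Counter(tokens).most_common(20)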
We define a list of stopwords, frequent words that are mostly meaningless, and remove them from the text.
# transform the text into a list of words
words_list = text.split(' ')
# define the list of words you want to remove from the text
stopwords = ['the', 'of', 'and', 'is', 'to', 'in', 'a', 'from', 'by', 'that',
             'with', 'this', 'as', 'an', 'are', 'its', 'at', 'for']
# use a python list comprehension to remove the stopwords from words_list
words_without_stopwords = [word for word in words_list if word not in stopwords]
We get a very different list of frequent words, one that is much more relevant to the text.
Counter(words_without_stopwords).most_common(20)
We can now generate a wordcloud from the text with the stopwords removed.
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Instantiate a new wordcloud.
wordcloud = WordCloud(
    random_state=8,
    normalize_plurals=False,
    width=600,
    height=300,
    max_words=300,
    stopwords=[])

# Transform the list of words back into a string
text_without_stopwords = ' '.join(words_without_stopwords)

# Apply the wordcloud to the text.
wordcloud.generate(text_without_stopwords)

# And plot
fig, ax = plt.subplots(1, 1, figsize=(9, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
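If you want to keep the image, the wordcloud object can also be written directly to an image file (the filename below is just an example):
# save the rendered wordcloud as a png file
wordcloud.to_file('earth_wordcloud.png')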
The wordcloud library comes with its own predefined list of stopwords:
print(f"Wordcloud has {len(WordCloud().stopwords)} stopwords:")
print()
print(list(WordCloud().stopwords))
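If you'd rather not maintain a hand-made list, one option is to reuse that default set directly when filtering the word list, as sketched below:
from wordcloud import STOPWORDS

# filter the word list with wordcloud's default stopword set
words_filtered = [word for word in words_list if word not in STOPWORDS]
Counter(words_filtered).most_common(20)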
You don't need to use the default lists of stopwords. There are a couple of GitHub repos that offer much more exhaustive lists.
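For instance, if you download one of those lists as a plain text file, loading it is straightforward. A minimal sketch, assuming a file named stopwords.txt with one stopword per line (the filename is hypothetical):
# load an external stopword list, assuming one word per line
# ('stopwords.txt' is a placeholder for whatever list you downloaded)
with open('stopwords.txt', encoding='utf-8') as f:
    external_stopwords = set(line.strip() for line in f if line.strip())

# filter the word list with the larger stopword set
words_without_stopwords = [word for word in words_list if word not in external_stopwords]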
Looking at the frequency of words in the original text, we notice a pattern.
The frequency of the nth word is roughly proportional to 1/n: the most frequent word occurs approximately twice as often as the second most frequent word, three times as often as the third most frequent word, and so on.
Let's calculate the observed relative frequency of a token:
occurrences of the token / occurrences of "the"
where "the" is the most common token, and compare it to the inverse of the rank of the token.
import numpy as np

text = wikipedia_page('Earth').lower()
word_list = text.split(' ')
word_counts = Counter(word_list).most_common(10)
maxfreq = word_counts[0][1]

print("rank  word        observed frequency ~= Zipf frequency")
for i in range(10):
    print(f"{i+1:4}) {word_counts[i][0]:10} freq: {np.round(word_counts[i][1] / maxfreq, 2):5} ~= {np.round(1/(i+1), 2)}")
Zipf's law is an empirical law observed in multiple domains. The Wikipedia article explains it all.
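A common way to visualize Zipf's law is to plot word frequency against rank on log-log axes, where the relationship shows up as a roughly straight line. A minimal sketch, reusing the word_list computed above:
import numpy as np
import matplotlib.pyplot as plt

# frequencies of all words, sorted from most to least common
frequencies = sorted(Counter(word_list).values(), reverse=True)
ranks = np.arange(1, len(frequencies) + 1)

plt.figure(figsize=(6, 4))
plt.loglog(ranks, frequencies, marker='.', linestyle='none')
plt.xlabel('rank')
plt.ylabel('frequency')
plt.title('Word frequency vs. rank (log-log)')
plt.show()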