Stopwords and word frequency

Let's count the frequency of the words in the Wikipedia Earth page.

We use the same function as before to download the content of a page from Wikipedia.

In [2]:
import requests

def wikipedia_page(title):
    '''
    This function returns the raw text of a wikipedia page 
    given a wikipedia page title
    '''
    params = { 
        'action': 'query', 
        'format': 'json', # request json formatted content
        'titles': title, # title of the wikipedia page
        'prop': 'extracts', 
        'explaintext': True
    }
    # send a request to the wikipedia api 
    response = requests.get(
         'https://en.wikipedia.org/w/api.php',
         params= params
     ).json()

    # Parse the result
    page = next(iter(response['query']['pages'].values()))
    # return the page content 
    if 'extract' in page.keys():
        return page['extract']
    else:
        return "Page not found"
In [3]:
text = wikipedia_page('Earth').lower()

Next, we count the number of times each word is present in the text.

First, we split the text on the space character and then use the Counter class to find the 20 most common words.

In [5]:
from collections import Counter
# we transform the text into a list of words 
# by splitting over the space character ' '
word_list = text.split(' ')
# and count the words
word_counts = Counter(word_list)
In [6]:
word_counts.most_common(20)
Out[6]:
[('the', 638),
 ('of', 317),
 ('and', 233),
 ('is', 167),
 ('to', 153),
 ('in', 140),
 ('a', 119),
 ("earth's", 82),
 ('by', 71),
 ('from', 69),
 ('that', 68),
 ('earth', 66),
 ('as', 64),
 ('at', 57),
 ('with', 52),
 ('are', 52),
 ('on', 44),
 ('about', 43),
 ('this', 38),
 ('surface', 36)]

Let's remove the stopwords

We define a list of stopwords, frequent words that carry little meaning, and remove them from the text.

In [7]:
# transform the text into a list of words
words_list = text.split(' ')
# define the list of words you want to remove from the text
stopwords = ['the', 'of', 'and', 'is','to','in','a','from','by','that', 'with', 'this', 'as', 'an', 'are','its', 'at', 'for']
# use a python list comprehension to remove the stopwords from words_list
words_without_stopwords = [ word for word in words_list if word not in stopwords ]

We get a very different list of frequent words, one that is much more relevant to the text.

In [9]:
Counter(words_without_stopwords).most_common(20)
Out[9]:
[("earth's", 82),
 ('earth', 66),
 ('on', 44),
 ('about', 43),
 ('surface', 36),
 ('million', 35),
 ('solar', 31),
 ('life', 29),
 ('other', 28),
 ('it', 28),
 ('which', 27),
 ('sun', 26),
 ('have', 26),
 ('into', 25),
 ('or', 25),
 ('over', 24),
 ('has', 24),
 ('was', 21),
 ('crust', 21),
 ('atmosphere', 20)]

We can now generate a wordcloud from the text with the stopwords removed.

In [11]:
from wordcloud import WordCloud
# Instantiate a new wordcloud.
wordcloud = WordCloud(
        random_state = 8,
        normalize_plurals = False,
        width = 600, 
        height= 300,
        max_words = 300,
        stopwords = [])

# Transform the list of words back into a string 
text_without_stopwords  = ' '.join(words_without_stopwords)

# Apply the wordcloud to the text.
wordcloud.generate(text_without_stopwords)

# And plot
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1,1, figsize = (9,6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
Out[11]:
(-0.5, 599.5, 299.5, -0.5)

WordCloud stopwords

WordCloud comes with its own predefined list of stopwords.

In [15]:
print(f"Wordcloud has {len(WordCloud().stopwords)} stopwords:") 
print()
print(list(WordCloud().stopwords))
Wordcloud has 192 stopwords:

['about', "didn't", 'her', 'against', 'for', "couldn't", 'his', 'more', "why's", 'by', 'this', 'but', "hasn't", 'itself', "here's", "we'll", 'herself', "i'll", 'otherwise', 'our', "she'd", "we'd", "it's", "you've", 'own', 'where', 'themselves', 'ought', 'very', 'all', 'were', "haven't", 'is', 'why', 'your', 'during', 'because', 'they', 'however', 'she', 'be', 'again', 'below', "hadn't", 'been', 'whom', "they're", 'me', 'should', 'could', 'down', "aren't", 'are', 'we', 'any', "wasn't", 'since', 'or', 'then', 'he', 'their', 'some', 'ever', "he'll", 'once', 'of', "you're", 'after', 'such', 'cannot', "don't", 'too', 'would', 'ourselves', 'my', 'himself', 'from', 'having', "you'd", "weren't", 'until', 'him', "what's", 'in', 'an', 'under', 'r', 'was', 'both', "he'd", 'between', 'ours', 'hers', "can't", "who's", 'yourself', 'on', 'shall', 'www', 'does', 'each', "we're", "shan't", 'as', 'into', "when's", "they'd", 'so', 'out', 'theirs', "isn't", 'that', "mustn't", 'same', 'get', 'http', 'who', 'had', 'with', 'k', 'them', 'here', 'the', 'also', "i'd", "i'm", "where's", 'can', 'no', 'these', "they've", "we've", 'which', 'while', 'further', 'just', 'there', 'to', 'if', 'have', "there's", "he's", 'off', "how's", 'those', 'up', 'when', "doesn't", 'nor', 'you', 'and', 'being', "shouldn't", 'than', 'do', 'before', 'few', 'has', "let's", 'at', 'over', 'myself', 'did', 'most', "i've", 'hence', 'doing', "they'll", "she's", 'else', 'not', "wouldn't", 'a', 'it', 'other', 'only', "she'll", 'how', 'above', 'its', "that's", 'what', 'yours', 'therefore', 'yourselves', 'like', 'through', "won't", "you'll", 'com', 'am', 'i']
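If we rely on this built-in list, we don't have to filter the text ourselves. Here is a minimal sketch (not in the original notebook), reusing the text variable from above: leaving the stopwords parameter unset makes WordCloud fall back to its own STOPWORDS set.

In [ ]:
from wordcloud import WordCloud, STOPWORDS

# leaving stopwords=None (the default) makes WordCloud use its
# built-in STOPWORDS set when generating the cloud
wordcloud_default = WordCloud(
        random_state = 8,
        width = 600,
        height = 300,
        max_words = 300)

# generate directly from the raw text; stopwords are filtered internally
wordcloud_default.generate(text)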

More exhaustive lists of stopwords

You don't need to use the default lists of stopwords. There are a couple of GitHub repos that offer much more exhaustive lists.

For instance
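Another readily available option is NLTK's English stopword list, which is noticeably larger than the short list we wrote by hand. A minimal sketch, assuming nltk is installed and its 'stopwords' corpus can be downloaded:

In [ ]:
import nltk
from nltk.corpus import stopwords

# fetch the stopword corpus if it is not already present
nltk.download('stopwords', quiet=True)
nltk_stopwords = set(stopwords.words('english'))

print(f"NLTK has {len(nltk_stopwords)} English stopwords")

# filter the Earth page with this larger list
words_without_nltk_stopwords = [word for word in words_list if word not in nltk_stopwords]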

Zipf's law

Looking at the frequency of words from the original text we notice a pattern.

The frequency of the nth word is roughly proportional to 1/n. The most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.

Let's calculate the observed relative frequency of a token:

    occurrences of the token / occurrences of "the"

where "the" is the most common token, and compare it to the inverse of the token's rank.

In [38]:
import numpy as np

text = wikipedia_page('Earth').lower()
word_list = text.split(' ')
word_counts = Counter(word_list).most_common(10)

# frequency of the most common token, used to normalize
maxfreq = word_counts[0][1]
print(" rank word  observed frequency ~= Zipf frequency")

for i in range(10):
    print(f"{i+1:4}) {word_counts[i][0]:10} freq: {np.round(word_counts[i][1] / maxfreq, 2):5} ~= {np.round(1/(i+1), 2)}")
 rank word  observed frequency ~= Zipf frequency
   1) the        freq:   1.0 ~= 1.0
   2) of         freq:   0.5 ~= 0.5
   3) and        freq:  0.37 ~= 0.33
   4) is         freq:  0.26 ~= 0.25
   5) to         freq:  0.24 ~= 0.2
   6) in         freq:  0.22 ~= 0.17
   7) a          freq:  0.19 ~= 0.14
   8) earth's    freq:  0.13 ~= 0.12
   9) by         freq:  0.11 ~= 0.11
  10) from       freq:  0.11 ~= 0.1
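A quick way to see the pattern is to plot word frequency against rank on a log-log scale; under Zipf's law the points fall roughly on a straight line. A minimal sketch (not in the original notebook), reusing word_list from above:

In [ ]:
import matplotlib.pyplot as plt
from collections import Counter

# word counts sorted from most to least frequent
counts = [count for _, count in Counter(word_list).most_common()]
ranks = range(1, len(counts) + 1)

fig, ax = plt.subplots(figsize=(6, 4))
ax.loglog(ranks, counts, marker='.', linestyle='none')
ax.set_xlabel('rank')
ax.set_ylabel('frequency')
ax.set_title("Word frequency vs. rank on the Earth page")
plt.show()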

Zipf's law is an empirical law observed in multiple domains. The Wikipedia article explains it all.
