Analysis of Pronoun Usage In Presidential Addresses

This notebook is designed to look at how presidents have used first person vs. second person pronouns during their speeches.

In [20]:
import pandas as pd
import json
import nltk

Load in Data

The data used in this notebook comes from Vocativ's collection of presidential addressses, which can be found here: https://github.com/Vocativ-data/presidents_readability

In [2]:
objects = json.loads(open("../../vocativ_president_data/The original speeches.json").read())["objects"]
In [3]:
speeches_df = pd.DataFrame(objects)
In [4]:
speeches_df["word_count"] = speeches_df["Text"].apply(lambda x: len(x.split()))
In [3]:
json_data = open().read()
In [5]:
speeches_df["tokens"] = speeches_df["Text"].apply(lambda x: nltk.word_tokenize(x))

Find and Count All First-Person Singular Pronouns

In [6]:
speeches_df["i"] = speeches_df.apply(lambda x: len([ t for t in x["tokens"] if t.lower() == "i"]), axis=1)
speeches_df["me"] = speeches_df.apply(lambda x: len([ t for t in x["tokens"] if t.lower() == "me"]), axis=1)
speeches_df["my"] = speeches_df.apply(lambda x: len([ t for t in x["tokens"] if t.lower() == "my"]), axis=1)
speeches_df["mine"] = speeches_df.apply(lambda x: len([ t for t in x["tokens"] if t.lower() == "mine"]), axis=1)
speeches_df["myself"] = speeches_df.apply(lambda x: len([ t for t in x["tokens"] if t.lower() == "myself"]), axis=1)
In [7]:
speeches_df["first_person_singular"] = speeches_df.apply(lambda x: x["i"] + x["me"] + x["my"] +\
                                                                x["mine"] + x["myself"], axis=1)

Find And Count All First-Person Plural Pronouns

In [8]:
speeches_df["we"] = speeches_df.apply(lambda x: len([ t for t in x["tokens"] if t.lower() == "we"]), axis=1)
speeches_df["our"] = speeches_df.apply(lambda x: len([ t for t in x["tokens"] if t.lower() == "our"]), axis=1)
speeches_df["ours"] = speeches_df.apply(lambda x: len([ t for t in x["tokens"] if t.lower() == "ours"]), axis=1)
speeches_df["ourselves"] = speeches_df.apply(lambda x: len([ t for t in x["tokens"] if t.lower() == "ourselves"]), axis=1)
speeches_df["us"] = speeches_df.apply(lambda x: len([ t for t in x["tokens"] if t.lower() == "us"]), axis=1)
In [9]:
speeches_df["first_person_plural"] = speeches_df.apply(lambda x: x["we"] + x["our"] + x["ours"] + x["ourselves"] + x["us"], axis=1)
In [10]:
speeches_df["first_person"] = speeches_df.apply(lambda x: x["first_person_singular"] + x["first_person_singular"], axis=1)

Segment Off Necessary Data Points

In [11]:
speech_analysis = speeches_df[["word_count", "tokens", "President", "first_person", 
                               "first_person_singular", "first_person_plural"]]

We only want modern presidents (since 1929) because that's the data that's available for our news conference analysis. This is a list of all the presidents with names matching the data found in the President column of the address dataframe.

In [12]:
news_conf_presidents = ["Richard Nixon", "Gerald Ford", "George H. W. Bush", "Lyndon B. Johnson", "Jimmy Carter", 
                        "Bill Clinton", "Harry S. Truman", "Ronald Reagan", "Barack Obama", "John F. Kennedy", 
                        "Franklin D. Roosevelt", "Dwight D. Eisenhower", "Herbert Hoover", "George W. Bush"]
In [13]:
modern_presidents = speech_analysis[speech_analysis["President"].isin(news_conf_presidents)]
In [14]:
presidents = pd.DataFrame(modern_presidents.groupby("President").sum())

Analyze Each President's Total Corpus of Speeches

In [15]:
presidents["pct_first"] = presidents.apply(lambda x: round(100.0 * x["first_person"] / x["word_count"], 2), axis=1)
In [16]:
presidents["pct_first_singular"] = presidents.apply(lambda x: round(100.0 * x["first_person_singular"] / x["word_count"], 2), axis=1)
In [17]:
presidents["pct_first_plural"] = presidents.apply(lambda x: round(100.0 * x["first_person_plural"] / x["word_count"], 2), axis=1)
In [18]:
presidents.sort("pct_first_singular", ascending=False)
Out[18]:
word_count first_person first_person_singular first_person_plural pct_first pct_first_singular pct_first_plural
President
Richard Nixon 67445 3368 1684 1943 4.99 2.50 2.88
Gerald Ford 40301 1950 975 1323 4.84 2.42 3.28
George H. W. Bush 89646 4308 2154 2878 4.81 2.40 3.21
Lyndon B. Johnson 246786 10116 5058 8062 4.10 2.05 3.27
Jimmy Carter 91936 3642 1821 2997 3.96 1.98 3.26
Bill Clinton 145846 5234 2617 5694 3.59 1.79 3.90
Harry S. Truman 31802 1132 566 852 3.56 1.78 2.68
Ronald Reagan 206217 6592 3296 6679 3.20 1.60 3.24
Barack Obama 33672 1046 523 1292 3.11 1.55 3.84
John F. Kennedy 160468 4670 2335 4907 2.91 1.46 3.06
Franklin D. Roosevelt 130024 3034 1517 3222 2.33 1.17 2.48
Dwight D. Eisenhower 17919 354 177 429 1.98 0.99 2.39
George W. Bush 45437 808 404 1818 1.78 0.89 4.00
Herbert Hoover 10718 178 89 303 1.66 0.83 2.83

Do a quick calculation to find the overall average so that you can compare it to Obama's 1.55 in table above.

In [19]:
round(100.0 * presidents["first_person_singular"].sum() / presidents["word_count"].sum(), 2)
Out[19]:
1.76