Using Python to see how the Times writes about men and women

Neal Caren - University of North Carolina, Chapel Hill

Do men and women come up in different contexts in the newspaper? One quick way to answer that question is to compare the words in sentences that discuss women with the words in sentences that discuss men. Here's an example of how to do this sort of analysis using Python.

The data comes from last week's (February 27, 2013-March 6, 2013) New York Times. I downloaded all the articles available through LexisNexis, excluding only the corrections and paid obituaries. This totals 1,379 articles, or about 200 per day. Using a modified version of an old Python script, I removed all the metadata, put the text of each article in its own file, and placed all of the text files in a folder called articles. It is not the most efficient way to go about it, but sometimes the text data comes that way, so I thought it would be useful to set it up that way for didactic purposes.

We begin by loading a few modules. The only module that you might need to install is nltk, which is a powerful suite for text processing and analysis. For this analysis, I'm only using the NLTK function that splits text into sentences. glob is a useful module for retrieving the contents of a directory, and string.punctuation is just a string with all the ASCII punctuation marks, that is !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~.

In [35]:
from __future__ import division

import glob
import nltk
from string import punctuation

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
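
To see what stripping with string.punctuation does to individual tokens, here is a minimal sketch on a made-up sentence. The sentence and the variable names are my own illustration, not part of the Times data.

```python
from string import punctuation

# A made-up sentence, split on whitespace the same way the analysis does.
raw_words = '"Well," she said, "it\'s 5 p.m. -- time to go!"'.split()

# strip() only removes punctuation from the two ends of each token,
# so "it's" and "p.m" keep their internal marks. Tokens that are
# nothing but punctuation (like "--") become empty and are dropped.
cleaned = [w.strip(punctuation) for w in raw_words if w.strip(punctuation)]
print(cleaned)
```

This is why tokens such as "it's" and "p.m" can show up as words in frequency tables built this way.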

The heart of the analysis will be figuring out whether a sentence is talking about a man, a woman, both or neither. As a first pass, I'm going to assume that the sentence is talking about a man if it uses words like "he", "dad" or "Mr.", and is probably talking about a woman if it uses words like "she", "mother", or "Ms.". It isn't perfect, but depending on the text, it can be quite useful. Rather than start from scratch, I build off of Danielle Sucher's list from her Jailbreak the Patriarchy browser plugin.

In [36]:
#Two sets of words that are used when a man or woman is present, based on Danielle Sucher's https://github.com/DanielleSucher/Jailbreak-the-Patriarchy
male_words=set(['guy','spokesman','chairman',"men's",'men','him',"he's",'his','boy','boyfriend','boyfriends','boys','brother','brothers','dad','dads','dude','father','fathers','fiance','gentleman','gentlemen','god','grandfather','grandpa','grandson','groom','he','himself','husband','husbands','king','male','man','mr','nephew','nephews','priest','prince','son','sons','uncle','uncles','waiter','widower','widowers'])
female_words=set(['heroine','spokeswoman','chairwoman',"women's",'actress','women',"she's",'her','aunt','aunts','bride','daughter','daughters','female','fiancee','girl','girlfriend','girlfriends','girls','goddess','granddaughter','grandma','grandmother','herself','ladies','lady','lady','mom','moms','mother','mothers','mrs','ms','niece','nieces','priestess','princess','queens','she','sister','sisters','waitress','widow','widows','wife','wives','woman'])

I'm storing them as sets rather than lists because later on I want to look at whether or not words in a sentence overlap with these words, and Python will return the intersection of sets, but not lists.
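
A small sketch of the set behavior described above, using a toy word set of my own rather than the full lists:

```python
# Toy word set; the real male_words set is much longer.
toy_male_words = set(['he', 'his', 'father'])
sentence_words = ['he', 'said', 'his', 'father', 'called']

# A set can be intersected with any iterable, including a plain list.
overlap = toy_male_words.intersection(sentence_words)
print(sorted(overlap))

# Lists have no intersection method of their own; converting the list
# to a set and using the & operator gives the same result.
same_overlap = set(sentence_words) & toy_male_words
```

Either form works; the method call has the small convenience of accepting the sentence as a list without converting it first.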

The function below takes a word list and returns the gender of the person being talked about, if any, based on the number of words a sentence has in common with either the male or female word lists.

In [37]:
def gender_the_sentence(sentence_words):
    mw_length=len(male_words.intersection(sentence_words))
    fw_length=len(female_words.intersection(sentence_words))

    if mw_length>0 and fw_length==0:
        gender='male'
    elif mw_length==0 and fw_length>0: 
        gender='female'
    elif mw_length>0 and fw_length>0: 
        gender='both'
    else:
        gender='none'
    return gender
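
To check the four possible outcomes, here is a self-contained sketch that applies the same decision rule to toy word sets. The names toy_male, toy_female and toy_gender are hypothetical stand-ins so the example runs on its own.

```python
# Toy word sets standing in for the full male_words and female_words.
toy_male = set(['he', 'his', 'mr'])
toy_female = set(['she', 'her', 'ms'])

def toy_gender(sentence_words):
    """Apply the same rule as gender_the_sentence to the toy sets."""
    has_male = len(toy_male.intersection(sentence_words)) > 0
    has_female = len(toy_female.intersection(sentence_words)) > 0
    if has_male and has_female:
        return 'both'
    if has_male:
        return 'male'
    if has_female:
        return 'female'
    return 'none'

# One made-up sentence (already lower-cased and split) per outcome.
examples = [
    ['he', 'missed', 'the', 'goal'],
    ['she', 'wrote', 'a', 'memoir'],
    ['he', 'thanked', 'her'],
    ['the', 'market', 'fell'],
]
labels = [toy_gender(words) for words in examples]
print(labels)
```

Note that a single gendered word is enough to label the whole sentence, which is what makes the approach quick but imperfect.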

I don't really care about proper nouns, especially people's names (e.g. it is boring that 'Boehner' is always male), so I need a way to identify them. To do that, I'm going to count how many times a word's first letter is capitalized and how many times it isn't. With a large enough text, and if you ignore the first words of sentences, this is a pretty robust way to identify proper nouns.

In [38]:
def is_it_proper(word):
    #Record whether the word's first letter is upper or lower case
    if word[0]==word[0].upper():
        case='upper'
    else:
        case='lower'

    word_lower=word.lower()
    try:
        proper_nouns[word_lower][case] = proper_nouns[word_lower].get(case,0)+1
    except KeyError:
        #This is triggered when the word hasn't been seen yet
        proper_nouns[word_lower]= {case:1}

Note that here I'm using .get() to retrieve the values stored in the proper noun dictionary. This is one way to avoid error messages when the key isn't in the dictionary. Here, proper_nouns[word_lower].get(case,0) returns the count stored for word_lower if that combination of word and capitalization has been seen before and 0 if it has not. The except is only triggered when the word hasn't been seen yet.
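
Here is a short sketch of the .get() pattern on made-up observations, next to a standard-library alternative (defaultdict with Counter) that avoids the missing-key case altogether. The alternative is my own suggestion, not what the analysis uses.

```python
from collections import defaultdict, Counter

# Hypothetical (word, case) observations, in the form is_it_proper tallies.
observations = [('obama', 'upper'), ('house', 'upper'), ('house', 'lower'),
                ('house', 'lower'), ('obama', 'upper')]

# Version 1: plain dicts with .get(), supplying a default when a key is missing.
counts = {}
for word, case in observations:
    inner = counts.get(word, {})
    inner[case] = inner.get(case, 0) + 1
    counts[word] = inner

# Version 2: defaultdict(Counter) creates missing inner counters on demand,
# so no try/except or default value is needed.
counts_dd = defaultdict(Counter)
for word, case in observations:
    counts_dd[word][case] += 1

print(counts)
```

Both versions end up with the same tallies; the second is shorter, while the first makes the handling of unseen keys explicit.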

I'm going to keep track of the words in each sentence with a couple of counters. This function doesn't return anything, but it does increment the word_freq, word_counter, and sentence_counter dictionaries.

In [39]:
def increment_gender(sentence_words,gender):
    sentence_counter[gender]+=1
    word_counter[gender]+=len(sentence_words)
    for word in sentence_words:
        word_freq[gender][word]=word_freq[gender].get(word,0)+1

And so we begin. I set up the counters to store the various quantities of interest. These are the ones that are modified in the increment_gender function. Some of the values probably don't need to be entered now, particularly for the word and sentence counters, but starting with zeroes helps remind me what they are for.

In [40]:
sexes=['male','female','none','both']
sentence_counter={sex:0 for sex in sexes}
word_counter={sex:0 for sex in sexes}
word_freq={sex:{} for sex in sexes}
proper_nouns={}

I've stored all the files as text files in a directory called articles, and I want to grab all their names.

In [41]:
file_list=glob.glob('articles/*.txt')
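
As a hedged illustration of how the glob pattern behaves, this sketch builds a throwaway folder (the file names are invented) and lists only the .txt files in it. Wrapping the call in sorted() is my addition: glob returns names in arbitrary order, which does not matter for the counts here but helps when you want reproducible runs.

```python
import glob
import os
import tempfile

# Build a temporary folder with two text files and one other file.
folder = tempfile.mkdtemp()
for name in ['b.txt', 'a.txt', 'notes.md']:
    with open(os.path.join(folder, name), 'w') as handle:
        handle.write('placeholder')

# The *.txt pattern skips notes.md; sorted() makes the order predictable.
found = sorted(glob.glob(os.path.join(folder, '*.txt')))
names = [os.path.basename(path) for path in found]
print(names)
```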

The basic idea is to read each file, split it into sentences, and then process each sentence. The processing begins by splitting the sentence into words and removing punctuation. Then, for each word that doesn't begin the sentence, I figure out if it is capitalized or not as part of the hunt for proper nouns. Next, I estimate whether the sentence is likely talking about a man or a woman, based on the occurrences of words from the two gender lists. Finally, I add each word that is used to the appropriate gender word frequency counter. So the sentence "She is lovely." would add 'she', 'is', and 'lovely' to our count of words used when talking about a female. It would also increment the lower case counters for 'is' and 'lovely'.

In [42]:
for file_name in file_list:
    #Open the file
    text=open(file_name,'rb').read()
    
    #Split into sentences
    sentences=tokenizer.tokenize(text)
    
    for sentence in sentences:
        #word tokenize and strip punctuation
        sentence_words=sentence.split()
        sentence_words=[w.strip(punctuation) for w in sentence_words 
                        if len(w.strip(punctuation))>0]

        #figure out how often each word is capitalized, skipping the first word
        for word in sentence_words[1:]:
            is_it_proper(word)

        #lower case it
        sentence_words=set([w.lower() for w in sentence_words])

        #Figure out if there are gendered words in the sentence by computing the length of the intersection of the sets
        gender=gender_the_sentence(sentence_words)

        #Increment some counters
        increment_gender(sentence_words,gender)
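
Tracing the "She is lovely." example by hand may help. This is a self-contained sketch of the per-sentence steps, with local variable names of my own (case_tally stands in for the proper_nouns dictionary).

```python
from string import punctuation

sentence = 'She is lovely.'

# Split on whitespace and strip punctuation from each token.
words = [w.strip(punctuation) for w in sentence.split() if w.strip(punctuation)]

# Capitalization is only tallied for words after the first one, because
# the first word of a sentence is capitalized whatever it is.
case_tally = {}
for w in words[1:]:
    case = 'upper' if w[0] == w[0].upper() else 'lower'
    tally = case_tally.setdefault(w.lower(), {})
    tally[case] = tally.get(case, 0) + 1

# Lower-casing into a set means each word counts once per sentence.
sentence_words = set(w.lower() for w in words)
print(sorted(sentence_words))
```

Because sentence_words is a set, a word repeated within one sentence is counted once, so the word frequencies are effectively counts of sentences that contain the word.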

After all the articles are parsed, it is time to start analyzing the word frequencies.

First, I create a set consisting of all words which were capitalized more often than not.

In [43]:
proper_nouns=set([word for word in proper_nouns if  
                  proper_nouns[word].get('upper',0) / 
                  (proper_nouns[word].get('upper',0) + 
                   proper_nouns[word].get('lower',0))>.50])
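
A toy version of the same majority rule, with invented tallies, shows which words would be treated as proper nouns. The names toy_tallies and share_upper are hypothetical.

```python
# Hypothetical tallies in the shape built up by is_it_proper.
toy_tallies = {
    'boehner': {'upper': 12},
    'house': {'upper': 30, 'lower': 45},
    'senate': {'upper': 20, 'lower': 1},
    'march': {'upper': 5, 'lower': 5},
}

def share_upper(tally):
    """Fraction of non-sentence-initial uses that were capitalized."""
    upper = tally.get('upper', 0)
    lower = tally.get('lower', 0)
    # float() keeps this a true division even without the __future__ import.
    return upper / float(upper + lower)

# Keep only words capitalized strictly more than half the time.
toy_proper = set(w for w, t in toy_tallies.items() if share_upper(t) > 0.5)
print(sorted(toy_proper))
```

A word that is capitalized exactly half the time, like 'march' in this toy data, falls on the common-word side of the threshold.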

I don't really care about rare words, so I select the top 1,000 words, based on frequencies, from both the male and female word dictionaries. From that list, I subtract the words used to identify the sentence as either male or female along with the proper nouns.

In [44]:
common_words=set(sorted(word_freq['female'],key=word_freq['female'].get,reverse=True)[:1000]+
                 sorted(word_freq['male'],key=word_freq['male'].get,reverse=True)[:1000])

common_words=list(common_words-male_words-female_words-proper_nouns)
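
Here is a scaled-down sketch of the selection step, using made-up frequencies and a top 3 instead of a top 1,000. It also shows collections.Counter.most_common as an alternative to sorting by the dictionary's .get method; that alternative is my suggestion, and the two can order tied counts differently.

```python
from collections import Counter

# Invented frequency dictionaries, with no tied counts inside the top 3.
toy_female_freq = {'she': 9, 'said': 7, 'dress': 4, 'memoir': 2, 'opera': 1}
toy_male_freq = {'he': 12, 'said': 10, 'league': 5, 'bank': 3, 'opera': 1}
toy_gender_words = set(['he', 'she'])

top_n = 3
# Sorting the keys by their counts, as in the analysis.
top_female = sorted(toy_female_freq, key=toy_female_freq.get, reverse=True)[:top_n]
# The same idea with Counter.most_common, which returns (word, count) pairs.
top_male = [w for w, _ in Counter(toy_male_freq).most_common(top_n)]

# Union of the two top lists, minus the words used to gender the sentence.
toy_common = set(top_female + top_male) - toy_gender_words
print(sorted(toy_common))
```

Since the two top lists overlap and some words are subtracted, the final list can hold anywhere from a handful of words up to twice the cutoff.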

I compute how likely a word is to appear in a male-subject sentence versus a female-subject sentence. (My first instinct was to create ratios, but they are undefined when a word is not used to talk about the sex used in the denominator.) I also need to control for the fact that there is likely an imbalance in how many words are written about men and women. If 'hair' is mentioned in 10 male-subject sentences and 10 female-subject sentences, that could be taken as a sign of parity, but not if there are a total of 20 female-subject sentences (50%) and 100 male-subject sentences (10%). I'll score 'hair' as 16.7% male, which is (10%)/(50%+10%). Later on, if we want, we can recover the ratios by computing (100-16.7)/16.7, which is 5x, the same as 50%/10%. The code below divides by the total number of words counted for each group rather than by the number of sentences, but the logic is the same.
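
The arithmetic in the 'hair' example can be checked with a few lines. This is only a worked illustration of the hypothetical numbers in the paragraph above, not output from the Times data.

```python
# Hypothetical counts from the 'hair' example.
male_hits, male_total = 10, 100      # 10 of 100 male-subject sentences
female_hits, female_total = 10, 20   # 10 of 20 female-subject sentences

# Rates within each group; float() forces true division.
male_rate = male_hits / float(male_total)
female_rate = female_hits / float(female_total)

# Share of the combined rate that comes from male-subject sentences.
male_share = male_rate / (male_rate + female_rate)

# Recover the female-to-male ratio from the share.
female_to_male_ratio = (1 - male_share) / male_share
print('%.1f%% male, ratio %.1f' % (100 * male_share, female_to_male_ratio))
```

Unlike a raw ratio, the share is always defined as long as the word appears in at least one of the two groups, and it is bounded between 0 and 1.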

In [45]:
male_percent={word:(word_freq['male'].get(word,0) / word_counter['male']) 
              / (word_freq['female'].get(word,0) / word_counter['female']+word_freq['male'].get(word,0)/word_counter['male']) for word in common_words}

We can print out some basic statistics based on our counters about overall rates of coverage.

In [46]:
print '%.1f%% gendered' % (100*(sentence_counter['male']+sentence_counter['female'])/
                           (sentence_counter['male']+sentence_counter['female']+sentence_counter['both']+sentence_counter['none']))
print '%s sentences about men.' % sentence_counter['male']
print '%s sentences about women.' % sentence_counter['female']
print '%.1f sentences about men for each sentence about women.' % (sentence_counter['male']/sentence_counter['female'])
25.9% gendered
19681 sentences about men.
6242 sentences about women.
3.2 sentences about men for each sentence about women.

Finally, I print out the words that are disproportionately found in the male and female subject sentences. For the 50 most distinctive female and male words, I print the ratio of the gendered percentages along with the count of the number of male-subject and female-subject sentences that had the word. This script isn't particularly pretty, but it gets the job done.

In [47]:
header ='Ratio\tMale\tFemale\tWord'
print 'Male words'
print header
for word in sorted (male_percent,key=male_percent.get,reverse=True)[:50]:
    try:
        ratio=male_percent[word]/(1-male_percent[word])
    except ZeroDivisionError:
        ratio=100
    print '%.1f\t%02d\t%02d\t%s' % (ratio,word_freq['male'].get(word,0),word_freq['female'].get(word,0),word)

print '\n'*2
print 'Female words'
print header
for word in sorted (male_percent,key=male_percent.get,reverse=False)[:50]:
    try:
        ratio=(1-male_percent[word])/male_percent[word]
    except ZeroDivisionError:
        ratio=100
    print '%.1f\t%01d\t%01d\t%s' % (ratio,word_freq['male'].get(word,0),word_freq['female'].get(word,0),word)
Male words
Ratio	Male	Female	Word
11.2	72	02	prime
10.8	70	02	baseball
9.5	92	03	official
9.5	61	02	capital
9.5	61	02	governor
5.8	75	04	fans
5.3	120	07	minister
5.3	51	03	sequester
5.2	118	07	league
4.5	58	04	failed
4.4	57	04	cardinals
4.2	54	04	finance
4.0	78	06	reporters
3.9	50	04	winning
3.8	73	06	finally
3.6	116	10	players
3.5	56	05	acknowledged
3.5	67	06	address
3.4	66	06	attack
3.3	108	10	opposition
3.3	54	05	rest
3.3	53	05	camp
3.2	52	05	costs
3.1	91	09	goal
3.1	50	05	crowd
3.0	118	12	bank
2.9	57	06	referring
2.9	66	07	sports
2.9	56	06	surgery
2.9	56	06	missed
2.8	55	06	pressure
2.8	64	07	teammates
2.8	91	10	economy
2.8	54	06	release
2.7	123	14	pope
2.7	130	15	meeting
2.6	84	10	victory
2.6	58	07	veteran
2.5	226	28	political
2.5	104	13	spending
2.5	64	08	effect
2.5	56	07	spend
2.5	72	09	continue
2.5	95	12	foreign
2.4	71	09	injury
2.4	94	12	election
2.4	78	10	running
2.4	116	15	manager
2.4	54	07	elected
2.4	99	13	tax



Female words
Ratio	Male	Female	Word
100.0	0	29	pregnant
100.0	0	17	husband's
51.6	1	16	suffrage
40.3	2	25	breast
12.9	4	16	gender
11.8	6	22	pregnancy
6.8	10	21	dresses
5.7	13	23	birth
5.5	13	22	memoir
4.8	25	37	baby
4.7	17	25	disease
4.6	14	20	interviewed
4.6	12	17	abortion
4.6	24	34	dress
4.5	23	32	married
4.3	12	16	activist
4.3	25	33	author
4.1	14	18	drama
3.9	30	36	hair
3.8	18	21	rape
3.6	24	27	dog
3.6	19	21	novel
3.5	99	108	children
3.4	16	17	statue
3.4	17	18	victim
3.4	51	53	cancer
3.3	41	42	violence
3.2	32	32	younger
3.2	20	20	festival
3.1	34	33	study
3.1	30	29	teacher
3.1	27	26	sex
3.1	43	41	fashion
3.1	20	19	opera
3.0	18	17	singing
3.0	62	57	child
2.8	23	20	wear
2.8	30	26	native
2.6	34	27	dance
2.6	29	23	graduated
2.5	33	26	writer
2.5	23	18	favor
2.5	41	32	eyes
2.5	22	17	becomes
2.5	47	36	kids
2.5	21	16	eat
2.4	29	22	domestic
2.4	29	22	traditional
2.4	77	58	parents
2.4	32	24	drug

My quick interpretation: If your knowledge of men's and women's roles in society came just from reading last week's New York Times, you would think that men play sports and run the government. Women do feminine and domestic things. To be honest, I was a little shocked at how stereotypical the words used in the female-subject sentences were.

Now this is only data from one week, and certainly some of the findings are driven by that. Coverage of suffrage, for example, was presumably driven by the 100th anniversary of the Woman Suffrage Procession. Similarly, the male list is also tied to recent news events, as one would expect from newspaper data. These lists also report just the extreme words, many of which were only used in a handful of articles. A more rigorous analysis would probably look at the complete distribution of words.

I should also add that after I ran this analysis for the first time, I noticed a few words, like 'spokesman' and 'actress', that should have been included on the original lists.

If you wanted to output the full table, you could easily write it to a tab-delimited file.

In [48]:
outfile_name='gender.tsv'
tsv_outfile=open(outfile_name,'wb')
header='percent_male\tmale_count\tfemale_count\tword\n'
tsv_outfile.write(header)
for word in common_words:
    row = '%.2f\t%01d\t%01d\t%s\n' % (100*male_percent[word],word_freq['male'].get(word,0),word_freq['female'].get(word,0),word)
    tsv_outfile.write(row)
tsv_outfile.close()
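
An alternative sketch, written for Python 3 and using the standard csv module with a tab delimiter, is shown here with invented rows. The csv writer takes care of quoting if a field ever contains a tab or a quote character. It writes to an in-memory buffer so the example leaves no file behind; swapping the buffer for an open file handle would write to disk instead.

```python
import csv
import io

# Invented rows: (percent_male, male_count, female_count, word).
toy_rows = [
    (75.0, 6, 2, 'alpha'),
    (40.0, 4, 6, 'beta'),
]

# An in-memory text buffer stands in for the output file.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter='\t', lineterminator='\n')
writer.writerow(['percent_male', 'male_count', 'female_count', 'word'])
for percent, male_count, female_count, word in toy_rows:
    writer.writerow(['%.2f' % percent, male_count, female_count, word])

tsv_text = buffer.getvalue()
print(tsv_text)
```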

As an addendum, we can look at the most popular words. In this case, we will look at the 100 most frequently used words, and then compare what proportion of male subject sentences had those words and what proportion of female subject sentences had those words.

In [49]:
all_words=[w for w in word_freq['none']]+[w for w in word_freq['both']]+[w for w in word_freq['male']]+[w for w in word_freq['female']]
all_words={w:(word_freq['male'].get(w,0)+word_freq['female'].get(w,0)+word_freq['both'].get(w,0)+word_freq['none'].get(w,0)) for w in set(all_words)}

print 'word\tMale\tFemale'
for word in sorted (all_words,key=all_words.get,reverse=True)[:100]:
    print '%s\t%.1f%%\t%.1f%%' % (word,100*word_freq['male'].get(word,0)/sentence_counter['male'],100*word_freq['female'].get(word,0)/sentence_counter['female'])
word	Male	Female
the	66.4%	63.1%
and	41.3%	43.0%
to	42.7%	40.3%
a	45.2%	44.8%
of	40.0%	39.0%
in	38.7%	37.7%
that	23.8%	21.7%
for	18.6%	19.4%
is	14.1%	16.0%
on	17.2%	14.7%
with	16.0%	15.5%
said	24.6%	20.6%
was	18.9%	16.4%
at	12.4%	13.2%
he	48.3%	0.0%
it	10.3%	10.5%
as	12.5%	12.2%
by	8.8%	9.4%
but	10.7%	9.3%
from	9.8%	9.3%
his	32.5%	0.0%
an	9.2%	9.7%
be	7.5%	7.5%
have	6.4%	6.9%
are	4.6%	5.5%
not	8.2%	7.7%
has	8.8%	7.0%
this	5.7%	5.7%
who	9.4%	9.8%
i	6.6%	7.6%
they	3.9%	4.5%
mr	21.9%	0.0%
or	3.8%	4.4%
had	8.4%	7.6%
more	4.8%	4.3%
about	5.8%	6.2%
one	5.5%	5.6%
will	4.3%	3.8%
their	3.1%	4.6%
which	4.7%	4.5%
would	5.5%	4.3%
new	4.3%	4.3%
were	3.7%	4.6%
when	6.2%	5.8%
we	3.6%	3.4%
its	2.6%	2.4%
you	2.7%	3.4%
been	4.6%	4.2%
she	0.0%	41.6%
than	3.1%	3.0%
if	3.6%	3.4%
up	3.7%	3.6%
after	5.3%	3.8%
out	4.1%	3.7%
her	0.0%	33.9%
all	3.0%	3.5%
like	3.4%	4.0%
there	2.9%	3.1%
also	3.3%	3.4%
other	2.8%	2.8%
what	3.3%	3.3%
two	3.4%	3.2%
no	2.9%	2.6%
some	2.8%	2.8%
so	3.0%	3.3%
can	2.1%	2.4%
last	3.6%	2.4%
into	3.3%	3.1%
first	3.6%	4.1%
it's	1.9%	2.7%
time	3.2%	2.9%
over	2.9%	2.2%
years	3.2%	3.1%
people	2.5%	2.5%
just	2.5%	2.6%
through	2.0%	1.9%
could	2.8%	2.6%
p.m	0.5%	0.7%
year	2.3%	2.0%
them	2.1%	2.3%
most	2.2%	1.9%
do	1.9%	1.9%
now	2.5%	2.2%
because	2.6%	2.6%
even	2.2%	1.9%
my	2.2%	3.7%
many	1.9%	2.0%
only	2.2%	2.1%
him	8.1%	0.0%
how	1.9%	2.0%
where	2.3%	2.4%
those	1.5%	1.4%
before	2.6%	2.0%
get	1.7%	1.6%
percent	0.9%	0.8%
work	1.7%	3.1%
make	1.7%	1.8%
then	1.9%	1.9%
made	2.2%	2.1%
way	1.7%	1.8%

While there are a couple of interesting findings here, for the most part, the basic building blocks of sentences are fairly similar in the male and female subject sentences. Now, this is just based on word frequencies, and a more nuanced examination would probably discover additional findings of interest. For example, my guess is that 'work', near the bottom of this list, is used not only more frequently in the female subject sentences, but also in a different context and as a different part of speech (e.g. 'men work', 'women juggle home and work responsibilities.'). Comparing word frequencies only gets you so far, but it is a pretty quick and easy way to conduct some preliminary data analysis.