
William Li, wli@csail.mit.edu

Introduction¶

csail-related@csail.mit.edu ("csail-related") is the lab-wide email list of the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). The archives of csail-related, dating back to July 2004, are publicly available at http://lists.csail.mit.edu/pipermail/csail-related.

What is the participation by gender on csail-related, and how does it compare to the gender ratio in CSAIL? Relative to their population in CSAIL, do males contribute disproportionately more to csail-related?

Methodology¶

We propose the following:

Scrape the csail-related archives by month.
Apply a gender checker to people's first names and compute the fraction of emails originating from males for each month. A common approach is to use name lists derived from U.S. Census data; many open-source packages do this.
Estimate the fraction of males and females in CSAIL from the CSAIL directory.
Test the following null hypothesis under $\alpha = 0.05$: the fraction of emails sent by males to csail-related is equal to, or less than, the fraction of males in CSAIL.

Code/Implementation¶

Most of the code is wrapped in functions at the end of this document. This notebook and related code can be found at https://github.com/wpli/mailman-stats.

From July 2004 to January 2014, there have been 115 months:

In [66]:

print len( get_all_ym_tuples() )
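
As a cross-check on the count of 115, the same range of months can be generated arithmetically. This is a minimal sketch rather than the notebook's get_all_ym_tuples(); the helper name ym_range is hypothetical.

```python
def ym_range(start, end):
    """Return all (year, month) tuples from start to end, inclusive."""
    # Convert each endpoint to a running month index (year * 12 + zero-based month)
    start_index = start[0] * 12 + (start[1] - 1)
    end_index = end[0] * 12 + (end[1] - 1)
    # Convert each index back to (year, one-based month)
    return [(idx // 12, idx % 12 + 1) for idx in range(start_index, end_index + 1)]

months = ym_range((2004, 7), (2014, 1))
print(len(months))  # 115
```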

We will get "ym_email_dict", which maps each ( year, month ) tuple to the list of authors of the emails sent in that month (one entry per email, so an author can appear more than once):

In [67]:

ym_email_dict = get_emails_by_month()

# sample entry: July 2004
print ym_email_dict[(2004,7)]

['C. Scott Ananian', 'C. Scott Ananian', 'Rodney Brooks', 'Michael McGeachie', 'Michael G. Ross', 'Richard Stallman']

In [68]:

print "Total Emails:", numpy.sum( [ len(i) for i in ym_email_dict.values() ] )
print "Total Unique Authors:", len( set( [ i for sublist in ym_email_dict.values() for i in sublist ] ) )

Total Emails: 12259
Total Unique Authors: 2072

Next, we apply our gender checker on first names. We apply some rule-based logic for names that are in the format "Last, First" and for names with initials (e.g. K. Bob Smith). We use a dictionary available from this repository, which contains 5177 names mapping to gender. Note that, if the name dictionary does not include the name, the name is excluded from the analysis.

In [69]:

ym_percentage_dict = get_ym_percentage_dict()

#sample entry: October 2013
print "Example entry (October 2013):", ym_percentage_dict[(2013,10)]
print "Average male fraction:", numpy.average( ym_percentage_dict.values() )

Example entry (October 2013): 0.731543624161
Average male fraction: 0.746385800605
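
The name-handling rules can be illustrated on a few sample strings. The sketch below is a simplified stand-in for the notebook's get_name() and dictionary lookup, not the code that produced the numbers above. TOY_GENDER is a made-up three-entry dictionary, not the 5177-name dictionary used in the analysis, and first_name_for_lookup and infer_gender are hypothetical helpers.

```python
TOY_GENDER = {"SCOTT": "male", "RODNEY": "male", "ALICE": "female"}  # illustrative only

def first_name_for_lookup(full_name):
    """Pick the word to look up, handling "Last, First" and leading initials."""
    words = full_name.replace('"', '').split()
    if not words:
        return None
    # "Last, First": skip the surname that carries the comma
    if "," in words[0]:
        words = words[1:]
    # Skip a leading initial such as "C." in "C. Scott Ananian"
    if words and len(words[0].replace(".", "")) == 1:
        words = words[1:]
    return words[0].upper() if words else None

def infer_gender(full_name):
    """Return 'male', 'female', or None when the name is not in the dictionary."""
    return TOY_GENDER.get(first_name_for_lookup(full_name))

print(infer_gender("C. Scott Ananian"))  # male
print(infer_gender("Brooks, Rodney"))    # male
print(infer_gender("Zed Example"))       # None (excluded from the analysis)
```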

We need an estimate of the percentage of males among the members of csail-related. We do not have direct access to this information. As a proxy, we parse the names from http://www.csail.mit.edu/peoplesearch (also publicly available) and apply our gender checker. It appears that CSAIL is approximately 70% male.

In [70]:

population_male_fraction = get_male_female_counts()
print population_male_fraction

0.696124031008

On average, 74.6% of each month's emails came from males, while an estimated 69.6% of CSAIL is male.
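
Note that the 74.6% figure comes from numpy.average over the monthly fractions, so every month counts equally regardless of how many emails it contains. A pooled fraction over all emails can differ. The sketch below shows the distinction using invented counts; monthly_counts is hypothetical data, not taken from the archive.

```python
# (male_emails, female_emails) per month; invented numbers for illustration
monthly_counts = [(3, 1), (60, 40), (8, 2)]

# Unweighted mean of per-month fractions (the quantity averaged in the notebook)
monthly_fractions = [m / float(m + f) for m, f in monthly_counts]
mean_of_fractions = sum(monthly_fractions) / len(monthly_fractions)

# Pooled fraction: total male emails over total classified emails
total_male = sum(m for m, f in monthly_counts)
total_all = sum(m + f for m, f in monthly_counts)
pooled_fraction = total_male / float(total_all)

print(round(mean_of_fractions, 3))  # 0.717
print(round(pooled_fraction, 3))    # 0.623
```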

It might be interesting to show how the fraction has changed month to month. There doesn't seem to be a particularly strong upward or downward trend.

In [71]:

plot( [ ym_percentage_dict[ym] for ym in get_all_ym_tuples() ] )
xlabel( "Months Since July 2004" )
ylabel( "Fraction of Emails Sent by Males" )

Out[71]:

<matplotlib.text.Text at 0x106f22410>

Here is a histogram of the fractions for each month:

In [72]:

hist( ym_percentage_dict.values() )
xlabel('Fraction of Emails Sent by Males' )
ylabel( 'Number of Months' )

Out[72]:

<matplotlib.text.Text at 0x1007cca50>

Analysis¶

According to the above procedure, just under three-quarters of emails to csail-related are from males.

We formulate the following null hypothesis: the fraction of emails sent by males is less than or equal to 0.70 (the estimated population of males).

We apply a one-sample, one-sided t-test to the monthly fractions. See http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_1samp.html for the scipy.stats documentation (note that this function returns the two-tailed p-value, and we want the one-tailed p-value).
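
Stated formally, let $\mu$ denote the mean monthly fraction of emails sent by males and let $\mu_0 = 0.70$. The symbols $\mu$, $\mu_0$, $\bar{x}$, $s$, and $n$ are notation introduced here for illustration; they do not appear in the original analysis.

$$
H_0: \mu \le \mu_0, \qquad H_1: \mu > \mu_0, \qquad t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}
$$

Here $\bar{x}$ and $s$ are the sample mean and sample standard deviation of the $n = 115$ monthly fractions, so the statistic has $n - 1 = 114$ degrees of freedom. When $t > 0$, the one-tailed p-value for $H_1$ is half the two-tailed p-value.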

In [73]:

from scipy import stats
data = numpy.array( ym_percentage_dict.values() )
ttest = stats.ttest_1samp(data,0.7)
print "t-statistic:", ttest[0]
print "one-tailed p-value:", ttest[1] / 2

t-statistic: 5.64971091117
one-tailed p-value: 6.00082482668e-08

The p-value is $6\times10^{-8}$, which is below our significance threshold of $\alpha=0.05$, so we reject the null hypothesis.
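
Halving the two-tailed p-value is only valid when the t-statistic has the sign predicted by the alternative hypothesis, as it does here. A minimal sketch of the general conversion for a "greater than" alternative follows; one_tailed_p_greater is a hypothetical helper, not part of scipy.

```python
def one_tailed_p_greater(t_statistic, two_tailed_p):
    """Convert a two-tailed p-value to a one-tailed one for H1: mean > mu_0."""
    if t_statistic > 0:
        return two_tailed_p / 2.0
    # A statistic at or below zero points away from H1
    return 1.0 - two_tailed_p / 2.0

# Values reported above: t = 5.6497, one-tailed p = 6.0e-08 (so two-tailed p = 1.2e-07)
p = one_tailed_p_greater(5.64971091117, 2 * 6.00082482668e-08)
print(p < 0.05)  # True
```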

Conclusions and Further Work¶

This analysis suggests that males contribute disproportionately more to csail-related than females. The difference is small (on average, 74.6% of a month's emails are sent by males, while 69.6% of the listings in the people directory are male) but statistically significant.

Limitations: The gender checker is imperfect: some names are not in the dictionary and some gender inferences may be incorrect. In addition, our estimate of the population of csail-related members may be incorrect, since we used a single snapshot (the January 2014 CSAIL directory) for emails spanning July 2004 to January 2014. However, these potential sources of error seem unlikely to affect the general direction of the results.
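
One way to gauge the first limitation would be to report how many names the dictionary fails to classify. The sketch below is an assumption about how that could be done, using an invented dictionary and author list (TOY_GENDER and sample_first_names are hypothetical). The appendix code collects unknown names in gender_unknown but does not report them.

```python
from collections import Counter

TOY_GENDER = {"ALICE": "female", "BOB": "male", "CAROL": "female"}  # illustrative only
sample_first_names = ["Alice", "Bob", "Bob", "Xiu", "Carol", "Bob"]  # invented data

# Label each email's author; names missing from the dictionary become "unknown"
labels = [TOY_GENDER.get(name.upper(), "unknown") for name in sample_first_names]
counts = Counter(labels)

classified = counts["male"] + counts["female"]
coverage = classified / float(len(labels))          # share of emails with a known gender
male_fraction = counts["male"] / float(classified)  # computed over classified emails only

print(round(coverage, 3))  # 0.833
print(male_fraction)       # 0.6
```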

Thanks to Ramesh Sridharan for feedback.

Appendix: Code¶

In [1]:

import sys
import collections
import calendar
import urllib2
from StringIO import StringIO

import numpy
from lxml import etree

# local copy of the name-to-gender dictionary
sys.path.append( '../third-party/gender-from-name/' )
import gender



def get_male_female_counts():
    with open( '../data/mitcsailpeople.html' ) as f:
        html_string = f.read()
        parser = etree.HTMLParser()
        tree = etree.parse( StringIO( html_string ), parser )

    root = tree.getroot()
    for i in list(root):
        if i.tag == 'body': 
            body_element = i
        
    table_element = [ i for i in body_element if i.tag == 'table' ][0]

    csail_first_names = []
    
    for elem in list( table_element ):
        if elem.tag == 'tbody':
            tr_elems = [ i for i in list( elem ) if i.tag == 'tr' ]
            for tr_elem in tr_elems:
                td_elems = [ i for i in list( tr_elem ) if i.tag == 'td' ]
                if len( td_elems ) == 6:
                    first_name_elem = td_elems[1]
                    assert first_name_elem.getchildren()[0].tag == 'a'
                    csail_first_names.append( first_name_elem.getchildren()[0].text )
        else:
            pass
    

    ct = 0
    current_csail_gender_list = []
    for name in csail_first_names:
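        if name is None:
            # skip directory rows with no first name; otherwise name_to_use
            # would be undefined (or left over from the previous row)
            continue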
        elif " " in name:
            split_name = name.split()
            if len( split_name[0] ) > 1:
                name_to_use = split_name[0]
            else:                
                name_to_use = split_name[1]        
        else:
            name_to_use = name
            
        if name_to_use.upper() in gender.gender:
            inferred_gender = gender.gender[name_to_use.upper()]
            current_csail_gender_list.append( inferred_gender )
        else:
            #print repr(name_to_use.upper()),
            ct += 1

    total_counts = collections.Counter( current_csail_gender_list )
    total_male = total_counts['male']
    total_female = total_counts['female']
    
    return float( total_male ) / ( total_male + total_female )

def get_all_ym_tuples():
    all_month_numbers = range(1,13)
    
    year_month_tuples = []
    
    year = 2004
    months = range(7,13)
    for m in months:
        year_month_tuples.append( ( year, m ) )
    
    years = range( 2005, 2014 )
    
    for y in years:
        for m in all_month_numbers:
            year_month_tuples.append( ( y, m ) )
            
    year_month_tuples.append( ( 2014, 1 ) )

    return year_month_tuples


def get_emails_by_month():
    ym_email_dict = {}
    year_month_tuples = get_all_ym_tuples()
    base_url = "http://lists.csail.mit.edu/pipermail/csail-related/"
    for ym in year_month_tuples:
        year = ym[0]
        month_name = calendar.month_name[ym[1]]
        
        folder = str( year ) + "-" + month_name + "/"
        full_url = base_url + folder + "author.html"
        data = urllib2.urlopen( full_url )
        html_string = data.read()
        parser = etree.HTMLParser()
        tree = etree.parse( StringIO( html_string), parser )
        root = tree.getroot()
        for i in list(root):
            if i.tag == 'body': 
                body_element = i 
        
        #print body_element
        
        ul_elements = [ i for i in list( body_element ) if i.tag == 'ul' ]
        email_thread_elements = ul_elements[1]
        all_authors = []
        for elem in email_thread_elements.getchildren():
            if elem.tag == 'li':
                for elem2 in elem.getchildren():
                    if elem2.tag == 'i':
                        all_authors.append( elem2.text.strip('\n' ) )
        #print year, month_name, len( flatten_threads( all_authors ) )
        ym_email_dict[ ym ] = all_authors 
        
    return ym_email_dict
        
    
def get_ym_percentage_dict():
    ym_percentage_dict = {}        
    for ym in get_all_ym_tuples():
        # relies on the module-level ym_email_dict built by get_emails_by_month()
        all_authors = ym_email_dict[ym]
        gender_list = []
        gender_unknown = []
        for name in list(all_authors):

                first_name = get_name( name )
                if first_name != None:
                    first_name_upper = first_name.upper()
                    
                    if first_name_upper in gender.gender:    
                        gender_result = gender.gender[ first_name.upper() ]
                        gender_list.append( gender_result )
                        #if type(gender_result) is tuple:
                        #    print first_name_upper, gender_result
                    else:
                        gender_unknown.append( first_name_upper )
                        #print first_name, "not in dictionary"
        
        total_counts = collections.Counter( gender_list )
        total_male = total_counts['male']
        total_female = total_counts['female']
    
        ym_percentage_dict[ym] = float( total_male ) / ( total_male + total_female )
    return ym_percentage_dict

def get_name( full_name ):
    name_to_check = None
    if full_name == '':
        name_to_check = None
    else:
        name_words = full_name.split()
        if "," in name_words[0]:
            first_name_start_idx = 1
        else: 
            first_name_start_idx = 0
            
        possible_first_name = name_words[first_name_start_idx]
        possible_first_name_no_initials = possible_first_name.replace( '.', '' )
        if len( possible_first_name_no_initials ) == 1:
            if first_name_start_idx+1 < len( name_words ):
                name_to_check = name_words[first_name_start_idx+1]
            else:
                name_to_check = None
                
        else:
            name_to_check = name_words[first_name_start_idx]
    
    # clean name (name_to_check is None when no usable first name was found)
    if name_to_check is not None:
        name_to_check = name_to_check.replace( '"', '' )
    
    return name_to_check
        
    
    

Male and Female Participation on csail-related@csail.mit.edu¶