William Li, wli@csail.mit.edu
csail-related@csail.mit.edu ("csail-related") is the lab-wide email list of the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). The archives of csail-related, dating back to July 2004, are publicly available at http://lists.csail.mit.edu/pipermail/csail-related.
What is the participation by gender on csail-related, and how does it compare to the gender ratio in CSAIL? Relative to their population in CSAIL, do males contribute disproportionately more to csail-related?
We propose the following:
Most of the code is wrapped in functions at the end of this document. This notebook and related code can be found at https://github.com/wpli/mailman-stats.
From July 2004 to January 2014, there have been 115 months:
print len( get_all_ym_tuples() )
115
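The count of 115 can be checked with inclusive month arithmetic. The snippet below is a minimal sketch written for Python 3 (the notebook itself uses Python 2); `count_months` is a hypothetical helper and is not part of the mailman-stats code.

```python
def count_months(start, end):
    """Number of calendar months from start to end, both inclusive.

    start and end are (year, month) tuples, matching the keys used
    elsewhere in this notebook.
    """
    (y0, m0), (y1, m1) = start, end
    return (y1 - y0) * 12 + (m1 - m0) + 1


# July 2004 through January 2014: 6 + 9 * 12 + 1 months
print(count_months((2004, 7), (2014, 1)))  # 115
```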
We will get "ym_email_dict", which has ( year, month ) tuples as keys and the list of authors (one entry per email) as values:
ym_email_dict = get_emails_by_month()
# sample entry: July 2004
print ym_email_dict[(2004,7)]
['C. Scott Ananian', 'C. Scott Ananian', 'Rodney Brooks', 'Michael McGeachie', 'Michael G. Ross', 'Richard Stallman']
print "Total Emails:", numpy.sum( [ len(i) for i in ym_email_dict.values() ] )
print "Total Unique Authors:", len( set( [ i for sublist in ym_email_dict.values() for i in sublist ] ) )
Total Emails: 12259
Total Unique Authors: 2072
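To make the two totals concrete, here is a small Python 3 sketch (the notebook itself uses Python 2) that applies the same counting to a toy dictionary holding only the July 2004 entry shown above. Authors are compared by exact display name, so one person posting under two spellings would count as two unique authors.

```python
# Toy dictionary holding only the July 2004 sample entry from above.
ym_email_dict = {
    (2004, 7): ['C. Scott Ananian', 'C. Scott Ananian', 'Rodney Brooks',
                'Michael McGeachie', 'Michael G. Ross', 'Richard Stallman'],
}

# One list entry per email, so the total is the sum of list lengths.
total_emails = sum(len(authors) for authors in ym_email_dict.values())

# Unique authors are distinct display-name strings across all months.
unique_authors = {a for authors in ym_email_dict.values() for a in authors}

print(total_emails, len(unique_authors))  # 6 5
```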
Next, we apply our gender checker on first names. We apply some rule-based logic for names that are in the format "Last, First" and for names with initials (e.g. K. Bob Smith). We use a dictionary available from this repository, which contains 5177 names mapping to gender. Note that, if the name dictionary does not include the name, the name is excluded from the analysis.
ym_percentage_dict = get_ym_percentage_dict()
#sample entry: October 2013
print "Example entry (October 2013):", ym_percentage_dict[(2013,10)]
print "Average male fraction:", numpy.average( ym_percentage_dict.values() )
Example entry (October 2013): 0.731543624161
Average male fraction: 0.746385800605
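The rule-based name handling and the per-month fraction can be sketched as follows. This is a simplified Python 3 illustration (the notebook itself uses Python 2), not the notebook's own functions: `TOY_GENDER` is a made-up three-entry stand-in for the real name dictionary, and `first_name` and `male_fraction` are hypothetical helpers.

```python
# Made-up stand-in for the name dictionary (an assumption for illustration).
TOY_GENDER = {"ALICE": "female", "BOB": "male", "CAROL": "female"}


def first_name(full_name):
    """Return the likely first name, or None if none can be found."""
    words = full_name.replace('"', "").split()
    if not words:
        return None
    # "Last, First" format: the first name follows the comma-terminated word.
    start = 1 if "," in words[0] else 0
    if start >= len(words):
        return None
    candidate = words[start]
    # A lone initial such as "K." is skipped in favour of the next word.
    if len(candidate.replace(".", "")) == 1:
        if start + 1 >= len(words):
            return None
        candidate = words[start + 1]
    return candidate


def male_fraction(authors, gender_dict):
    """Fraction of male-labelled emails among emails with a known gender."""
    labels = []
    for author in authors:
        name = first_name(author)
        if name is not None and name.upper() in gender_dict:
            labels.append(gender_dict[name.upper()])
    male = labels.count("male")
    female = labels.count("female")
    if male + female == 0:
        return None  # no classifiable names
    return male / (male + female)


authors = ["Smith, Alice", "K. Bob Smith", "Carol Jones", "Zed Quux"]
print(first_name("K. Bob Smith"))  # Bob
print(male_fraction(authors, TOY_GENDER))
```

"Zed Quux" is not in the toy dictionary, so that email is dropped from both the numerator and the denominator, leaving one male-labelled email out of three classified.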
We need an estimate of the percentage of males among the members of csail-related. We do not have direct access to this information. As a proxy, we parse the names from http://www.csail.mit.edu/peoplesearch (also publicly available) and apply our gender checker. It appears that CSAIL is approximately 70% male.
population_male_fraction = get_male_female_counts()
print population_male_fraction
0.696124031008
Averaged over months, 74.6% of emails came from males, while CSAIL is 69.6% male.
It might be interesting to show how the fraction has changed month to month. There doesn't seem to be a particularly strong upward or downward trend.
plot( [ ym_percentage_dict[ym] for ym in get_all_ym_tuples() ] )
xlabel( "Months Since July 2004" )
ylabel( "Fraction of Emails Sent by Males" )
Here is a histogram of the fractions for each month:
hist( ym_percentage_dict.values() )
xlabel('Fraction of Emails Sent by Males' )
ylabel( 'Number of Months' )
According to the above procedure, just under three-quarters of emails to csail-related are from males.
We formulate the following null hypothesis: the mean monthly fraction of emails sent by males is less than or equal to 0.70 (the estimated fraction of males in the population). The alternative hypothesis is that it is greater than 0.70.
We apply a one-sided t-test to our data. See http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_1samp.html for the scipy.stats documentation (note that this function returns the two-tailed p-value, and we want the one-tailed p-value).
from scipy import stats
data = numpy.array( ym_percentage_dict.values() )
ttest = stats.ttest_1samp(data,0.7)
print "t-statistic:", ttest[0]
print "one-tailed p-value:", ttest[1] / 2
t-statistic: 5.64971091117
one-tailed p-value: 6.00082482668e-08
The p-value is $6\times10^{-8}$, which is well below our significance threshold of $\alpha=0.05$, so we reject the null hypothesis.
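For reference, the one-sample t-statistic is the difference between the sample mean and the hypothesized mean, divided by the standard error of the mean. The sketch below is written for Python 3 with only the standard library (the notebook itself uses Python 2 and scipy); `one_sample_t` is a hypothetical helper and the three data points are made up for illustration.

```python
import math
import statistics


def one_sample_t(data, mu0):
    """t = (sample mean - mu0) / (sample std dev / sqrt(n))."""
    n = len(data)
    mean = statistics.mean(data)
    # statistics.stdev uses the n - 1 denominator (sample standard deviation).
    std_err = statistics.stdev(data) / math.sqrt(n)
    return (mean - mu0) / std_err


# Made-up monthly fractions, not taken from the csail-related data.
toy_fractions = [0.70, 0.75, 0.80]
t_stat = one_sample_t(toy_fractions, 0.70)
print(t_stat)  # approximately 1.732
```

Halving the two-tailed p-value gives the one-tailed p-value only when the t-statistic has the sign predicted by the alternative hypothesis (positive here, as it is for the real data). Recent SciPy releases also accept `alternative='greater'` in `ttest_1samp`, which returns the one-tailed value directly.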
This analysis suggests that males contribute disproportionately more to csail-related than females. The difference is small (on average, 74.6% of each month's emails are sent by males, while 69.6% of the listings in the people directory are male) but statistically significant.
Limitations: The gender checker is imperfect: some names are not in the dictionary and some gender inferences may be incorrect. In addition, our estimate of the population of members of csail-related may be incorrect, since we used the January 2014 CSAIL directory. However, these potential sources of error seem unlikely to affect the general direction of the results.
Thanks to Ramesh Sridharan for feedback.
import sys
from lxml import etree
from StringIO import StringIO
import urllib2
# make the bundled name-to-gender dictionary importable
sys.path.append( '../third-party/gender-from-name/' )
import gender
import collections
import calendar
def get_male_female_counts():
    # parse a saved copy of the CSAIL people directory
    with open( '../data/mitcsailpeople.html' ) as f:
        html_string = f.read()
    parser = etree.HTMLParser()
    tree = etree.parse( StringIO( html_string ), parser )
    root = tree.getroot()
    for i in list(root):
        if i.tag == 'body':
            body_element = i
    table_element = [ i for i in body_element if i.tag == 'table' ][0]
    # collect the first-name cell of every six-column row
    csail_first_names = []
    for elem in list( table_element ):
        if elem.tag == 'tbody':
            tr_elems = [ i for i in list( elem ) if i.tag == 'tr' ]
            for tr_elem in tr_elems:
                td_elems = [ i for i in list( tr_elem ) if i.tag == 'td' ]
                if len( td_elems ) == 6:
                    first_name_elem = td_elems[1]
                    assert first_name_elem.getchildren()[0].tag == 'a'
                    csail_first_names.append( first_name_elem.getchildren()[0].text )
    ct = 0  # number of names not found in the dictionary
    current_csail_gender_list = []
    for name in csail_first_names:
        if name is None:
            # skip empty cells instead of reusing the previous name
            continue
        elif " " in name:
            # skip a leading initial, e.g. "K. Bob" -> "Bob"
            split_name = name.split()
            if len( split_name[0] ) > 1:
                name_to_use = split_name[0]
            else:
                name_to_use = split_name[1]
        else:
            name_to_use = name
        if name_to_use.upper() in gender.gender:
            inferred_gender = gender.gender[name_to_use.upper()]
            current_csail_gender_list.append( inferred_gender )
        else:
            ct += 1
    total_counts = collections.Counter( current_csail_gender_list )
    total_male = total_counts['male']
    total_female = total_counts['female']
    return float( total_male ) / ( total_male + total_female )
def get_all_ym_tuples():
    # July-December 2004, all of 2005-2013, then January 2014
    year_month_tuples = []
    for m in range( 7, 13 ):
        year_month_tuples.append( ( 2004, m ) )
    for y in range( 2005, 2014 ):
        for m in range( 1, 13 ):
            year_month_tuples.append( ( y, m ) )
    year_month_tuples.append( ( 2014, 1 ) )
    return year_month_tuples
def get_emails_by_month():
    ym_email_dict = {}
    year_month_tuples = get_all_ym_tuples()
    base_url = "http://lists.csail.mit.edu/pipermail/csail-related/"
    for ym in year_month_tuples:
        # archive folders are named like "2004-July/"
        year = ym[0]
        month_name = calendar.month_name[ym[1]]
        folder = str( year ) + "-" + month_name + "/"
        full_url = base_url + folder + "author.html"
        data = urllib2.urlopen( full_url )
        html_string = data.read()
        parser = etree.HTMLParser()
        tree = etree.parse( StringIO( html_string), parser )
        root = tree.getroot()
        for i in list(root):
            if i.tag == 'body':
                body_element = i
        # the second <ul> in the page body holds the per-email entries
        ul_elements = [ i for i in list( body_element ) if i.tag == 'ul' ]
        email_thread_elements = ul_elements[1]
        all_authors = []
        for elem in email_thread_elements.getchildren():
            if elem.tag == 'li':
                for elem2 in elem.getchildren():
                    # the author name is the text of the <i> element
                    if elem2.tag == 'i':
                        all_authors.append( elem2.text.strip('\n' ) )
        ym_email_dict[ ym ] = all_authors
    return ym_email_dict
def get_ym_percentage_dict():
    # note: reads the global ym_email_dict built by get_emails_by_month()
    ym_percentage_dict = {}
    for ym in get_all_ym_tuples():
        all_authors = ym_email_dict[ym]
        gender_list = []
        gender_unknown = []
        for name in list(all_authors):
            first_name = get_name( name )
            if first_name is not None:
                first_name_upper = first_name.upper()
                if first_name_upper in gender.gender:
                    gender_result = gender.gender[ first_name_upper ]
                    gender_list.append( gender_result )
                    #if type(gender_result) is tuple:
                    #    print first_name_upper, gender_result
                else:
                    gender_unknown.append( first_name_upper )
        # only values exactly equal to 'male' or 'female' enter the fraction
        total_counts = collections.Counter( gender_list )
        total_male = total_counts['male']
        total_female = total_counts['female']
        ym_percentage_dict[ym] = float( total_male ) / ( total_male + total_female )
    return ym_percentage_dict
def get_name( full_name ):
    name_to_check = None
    if full_name == '':
        name_to_check = None
    else:
        name_words = full_name.split()
        # "Last, First" format: the first name is the second word
        if "," in name_words[0]:
            first_name_start_idx = 1
        else:
            first_name_start_idx = 0
        possible_first_name = name_words[first_name_start_idx]
        possible_first_name_no_initials = possible_first_name.replace( '.', '' )
        if len( possible_first_name_no_initials ) == 1:
            # leading initial (e.g. "K. Bob Smith"): use the next word if any
            if first_name_start_idx+1 < len( name_words ):
                name_to_check = name_words[first_name_start_idx+1]
            else:
                name_to_check = None
        else:
            name_to_check = name_words[first_name_start_idx]
        # clean name (guarded: name_to_check may be None here)
        if name_to_check is not None:
            name_to_check = name_to_check.replace( '"', '' )
    return name_to_check