''' ------------------------------------------------------------------------------- Filename : ch4.ipynb Date : 2012-09-17 Author : C. Vogel Purpose : Replicate the naive Bayes e-mail classifier in Chapter 3 of : _Machine Learning for Hackers_. Input Data : e-mail files, split into spam and ham (non-spam) folders are available : at the book's github repository at https://github.com/johnmyleswhite/ : ML_for_Hackers.git. Libraries : dateutil, Matplotlib, NLTK, textmining, Numpy 1.6.1, pandas 0.9.0 : statsmodels ------------------------------------------------------------------------------- This notebook is a Python port of the R code in Chapter 4 of _Machine Learning for Hackers_ by D. Conway and J.M. White. E-mail files, split into folders classified as spam or ham (non-spam), should be located in a /data/ subfolder of the working directory. See the paths defined just after the import statements below to see what directory structure this script requires. Copying complete data folder from the book's github repository should be sufficient. For a detailed description of the analysis and the process of porting it to Python, see: slendrmeans.wordpress.com/will-it-python. '''


This program identifies 'priority' e-mails in a dataset of e-mails. E-mail $i$ is a priority e-mail if Rank($i$) $\ge$ threshold, where Rank($i$) is the product of 5 weighting factors. So, if:

$\prod_{k=1}^{5}\omega_k(i) \ge \textrm{threshold}$

then e-mail $i$ is a priority message. The weighting functions $\omega_k(i)$ and the threshold are calculated from a training dataset (train_df).

The five weights functions are:

  1. Sender Weight: the more often the sender's email address occurs in the training
    dataset, the more weight the e-mail recieves. Addresses not in the training set get weight of 1.
  2. Thread-Sender Weight: If the more often a sender's address occurs in threads in the training set, the more weight the e-mail recieves. Addresses not in threads in the training dataset get a weight of 1.
  3. Thread-Activity Weight: If the e-mail is part of a thread in the training set, it gets weighted based on how 'active' the thread is, in terms of messages-per-unit-of-time.
  4. Subject Term Weight: If the terms in an e-mails subject line occur in the subjects of the training set thread, it gets a weight proportional to the activities of the threads its terms occur in.
  5. Message Term Weight: An e-mail gets a message-term-weight that's proportional to how freqently the term occurs in the messages of the training set.

The program is broken into sections of functionality:

  • Parse the e-mails: Read each e-mail file into a dataframe that contains the e-mail's sender, subject, and message.
  • Split this dataframe into a training and testing part.
  • Identify thread e-mails in the training set
  • Build the five weight functions based on the training set e-mails
  • Rank the e-mails in the training set using the weight functions; compute a threshold weight
  • Rank the e-mails in the test set. Flag e-mails with ranks higher than the threshold as 'priority'
  • Plot and explore the results.
In [1]:
%pylab inline
Welcome to pylab, a matplotlib-based Python environment [backend: module://IPython.zmq.pylab.backend_inline].
For more information, type 'help(pylab)'.
In [3]:
import os
import re
import math
import random
import numpy as np
import datetime as dt
import dateutil.parser as dtp
import matplotlib.pyplot as plt
import textmining as txtm
import string
from pandas import *
import sys
from statsmodels.nonparametric import KDE
from tdm_df import tdm_df  # Term-document matrix builder from CH3
In [4]:
# Define paths to the data. We'll be prioritizing e-mails in the "Easy Ham" folder
data_path = os.path.abspath(os.path.join('..','ch3', 'data'))
easyham_path = os.path.abspath(os.path.join(data_path, 'easy_ham'))   


A set of functions that, combined, generates the next e-mail from a list of files, and returns a tuple containing its sender, subject, message, and the name of the file.

In [5]:
def parse_email_list(file_list):
    Create a generator that iterates through the e-mails in
    a folder, and parses them as the generator is called.
    for f in file_list:
        yield parse_email(os.path.join(easyham_path, f))

def parse_email(path):
    Get important elements of an e-mail file (date, sender's
    e-mail address, subject, and message contents.
    filename        = os.path.split(path)[-1]
    header, message = get_header_and_message(path)
    date            = get_date(header)
    sender          = get_sender(header)
    subj            = get_subject(header)
    return date, sender, subj, message, filename

def get_header_and_message(path):
    Split e-mail's header and its message into two
    separate strings.
    with open(path, 'rU') as con:
        email = con.readlines()
        first_blank_index = email.index('\n')
        header = email[:first_blank_index]
        message = ''.join(email[(first_blank_index + 1): ])
    return header, message

# Regex pattern that finds the date line
# (it's a line that starts with 'Date')
date_pattern = re.compile('^Date')

def get_date(header):
    Find the date line of the e-mail's header, and parse
    the date into a datetime object.
    # Get the first line that matches the date-line
    # regex pattern
    dateline = [l for l in header if 
                re.search(date_pattern, l) != None][0]
    # Grab the text after 'Date: ' (6 chars)
    dateline = dateline[6:].strip()
    # Return the date as parsed by dateutil.parser.parse()
    return dtp.parse(dateline)

# Characters that may separate words in the From line
splitfrom_pattern = re.compile('[\"\:<> ]')

def get_sender(header):
    Find the 'From:' line in the e-mail's header data, and 
    extract the sender's e-mail address from it.
    # Find line in header that contains 'From: '
    sender = filter(lambda s: s.find('From: ') != -1, header)[0]
    # Get 'words' in From line
    sender = re.split(splitfrom_pattern, sender)
    sender = filter(lambda s: s != ' ' and s != '', sender)
    # Find the first word that is an e-mail address (contains @)
    sender = filter(lambda s: '@' in s, sender)[0]
    sender = re.sub('\\[a-z]', '', sender)
    return sender.lower().rstrip()

def get_subject(header):
    Find the subject line of the e-mail's header, and extract
    the subject text.
    subject = filter(lambda s: s.find('Subject: ') != -1, header)
    if len(subject) > 0:
        subject_start = subject[0].find('Subject: ') + 9
        subject = subject[0][subject_start:]
        return subject.lower()
        return ''

This function will using the parse_email_list function to iterate through the e-mail generator and collect the e-mails' date, sender, subject, message (and the name of the file containing the e-mail) into a dictionary of lists. The dictionary is then converted to a DataFrame.

In [6]:
def make_email_df(email_dir):
    Parse each e-mail in a directory and return a dataframe
     of each e-mail's date, sender, subject, and message
    email_dict = {'date'     : [],
                  'sender'   : [], 
                  'subject'  : [], 
                  'message'  : [],
                  'filename' : []}

    # Get a list of e-mails to parse
    file_list = os.listdir(email_dir)
    file_list = [f for f in file_list if  f != 'cmds'] 
    # A generator that that returns the parsed components
    # of each subsequent e-mail as called
    parsed_emails = parse_email_list(file_list)
    # Fill up the dictionary with the generator
    for pe in parsed_emails:
        date, sender, subject, message, filename = pe
    # Make the dictionary a data frame.
    email_df = DataFrame(email_dict, 
                         columns = ['date', 'sender', 'subject', 
                                    'message', 'filename'])  
    return email_df
In [7]:
# Make a dataframe of information in the
# 'Easy Ham' emails
email_df = make_email_df(easyham_path)

The first half of the e-mails will be used as the 'training' set. From this set we'll extract features that we'll use to classify the second half of e-mails. Typically, we might shuffle the data before splitting the set, so that we get a representative view of features. But, since we're classifying e-mails, it's more appropriate to take the earlier e-mails as the training set, and see how well they identify 'priority' e-mails amongst subsequently-recieved e-mails.

In [8]:
def train_test_split(df, train_fraction = .5, shuffle = True, 
                     preserve_index = True, seed = None):
    Split a dataframe into training and test sets by randomly assigning
    its observations to one set or the other.

    df: a DataFrame
    train_fraction: the fraction of `df` to be assigned to the 
        training set (1-train_fraction will go to the test set).
    preserve_index: If True, the split dataframes keep their index
        values from `df`. If False, each set gets a new integer index
        of arange(# rows).
    seed: the random number generator seed for row shuffling.
    if seed: 
    nrows = df.shape[0]
    split_point = int(train_fraction * nrows)
    rows = range(nrows)
    if shuffle: 
    train_rows = rows[:split_point]
    test_rows  = rows[split_point:]
    train_index = df.index[train_rows]
    test_index  = df.index[test_rows]
    train_df = df.ix[train_index, :]
    test_df  = df.ix[test_index, :]
    if not preserve_index:
        train_df.index = arange(train_df.shape[0])
        test_df.index  = arange(test_df.shape[0])
    return train_df, test_df

Some dates lack time-zone info, making them impossible to sort We'll sort ignoring timezone, knowing that it's buggy, but for comparison with the book. The R code in ML4H ignores timezones, and this leads to some errors. See note at the end.

In [9]:
email_df['sort_date'] = email_df['date'].map(lambda d: 
    dt.datetime(d.year, d.month, d.day, d.hour, d.minute, d.second))
email_df = email_df.sort('sort_date')
train_df, test_df = train_test_split(email_df, shuffle = False,
                                     preserve_index = False, 
                                     seed = 224)
In [10]:
# Each sender address get's a weight equal to the log of 
# 1 + the number of e-mails from that address.
def get_sender_weights(email_df):
    freq = email_df['sender'].value_counts()
    sender_weights = DataFrame({'freq'   : freq,
                                'weight' : np.log(1.0 + freq)})
    sender_weights = sender_weights.sort('weight', ascending = False)
    return sender_weights

Plot sender frequencies of the top 30 senders in the training set. Get the frequencies of these same addresses from the test set, and plot for comparisons. This will give us some idea of senders we'll end up flagging in the test set.

In [11]:
sender_weights = get_sender_weights(train_df)
sender_weights_test = get_sender_weights(test_df)

nshow = 30
top_emails = sender_weights[:nshow].index

plt.figure(figsize = (6, 14))
plt.barh(np.arange(nshow), sender_weights['freq'][top_emails], 
         align = 'center', 
         label = 'Training')
plt.barh(np.arange(nshow), sender_weights_test['freq'][top_emails], 
         align = 'center', 
         left  = sender_weights['freq'][top_emails].fillna(0),
         fc    = 'orange', 
         alpha = .8,
         label = 'Test')
plt.ylim((0 - .5, nshow - .5))
plt.title('Frequency of top %i sender addresses' % nshow)
plt.yticks(np.arange(nshow), top_emails)
plt.legend(frameon = False)
<matplotlib.legend.Legend at 0x1070c7190>


We want to identify threads amongst the e-mails and exctract information from them separately. We'll find threads that have been replied-to or forwarded, and collect all e-mails with the same subject (after stripping reply and forward tags out) into the same thread.

For example:

  1. 'Hi there'
  2. 'Re: Hi there'
  3. 'Re: Re: Hi there'
  4. 'Fw: Hi there'
  5. 'Re[3]: Hi there'

will all be in the same thread. This differs slightly from how the book identifies thread messages. They only flag messages with 'Re:' prefixes, so they would miss messages [1] and [4]. (TODO: Check how they would flag [5])

After flagging them, we'll collect info about these threads# into a separate DataFrame.

Regex patterns indicating a thread. Looking for reply and forward markers. Note that some mail services use 'Re:Re:Re:'-type prexifes to mark multiple replies (or forwards), while other use 'Re[3]:'.

In [12]:
reply_pattern   = '(re:|re\[\d\]:)'
fwd_pattern = '(fw:|fw[\d]:)'
In [13]:
def thread_flag(s):
    Returns True if string s matches the thread patterns.
    If s is a pandas Series, returns a Series of booleans.
    if isinstance(s, basestring):
        return re.search(reply_pattern, s) is not None
        return s.str.contains(reply_pattern, re.I)

def clean_subject(s):
    Removes all the reply and forward labeling from a 
    string (an e-mail subject) s. 
    If s is a pandas Series, returns a Series of cleaned
    This will help find the initial message in the thread
    (which won't have any of the reply/forward labeling.
    if isinstance(s, basestring):
        s_clean = re.sub(reply_pattern, '', s, re.I)
        s_clean = re.sub(fwd_pattern, '', s_clean, re.I)
        s_clean = s_clean.strip()
        s_clean = s.str.replace(reply_pattern, '', re.I)
        s_clean = s_clean.str.replace(fwd_pattern, '', re.I)
        s_clean = s_clean.str.strip()
    return s_clean

def get_thread_df(email_df):
    Identify threads in an e-mail DataFrame, and extract
    them into a new DataFrame.
    # Find threads by e-mails with reply patterns in their subjects.
    # Then get a set of thread subjects.
    is_thread = thread_flag(email_df['subject'])
    thread_subj = email_df['subject'][is_thread]
    # Clean up the subjects by removing reply and forward labels
    thread_subj = clean_subject(thread_subj) 
    thread_subj = thread_subj.unique()
    # Search for these thread subjects in the original
    # e-mail DataFrame (so we pick up the original e-mail in the
    # thread (which won't have a reply pattern in its subject)
    # Prepare the DataFrame for searching by cleaning up the 
    # subjects.
    search_df = email_df[['date', 'sender', 'subject']]
    search_df['subject'] = clean_subject(search_df['subject'])
    # Find subject matches
    thread_matches = [subj in thread_subj for subj in search_df['subject']]
    match_df = search_df.ix[thread_matches, :]
    return match_df 
In [14]:
# Get a DataFrame of threads in the training e-mails;
# compute sender weights within this subset of threads.
# To do the latter, we can use the same function we
# used on the whole DataFrame above.
thread_df = get_thread_df(train_df)
thread_sender_weights = get_sender_weights(thread_df)
In [15]:
def get_thread_activity(thread_df):
    Compute 'activity' statistics on threads in a DataFrame:
    frequency: Number of e-mails observed in the thread
    span: Time before the first and last e-mail observed (seconds)
    weight: Number e-mails-per-second in the thread (log-scale)
    clean_times = thread_df['date'].map(lambda t: t.tzinfo is not None)
    thread_df_clean = thread_df.ix[clean_times, :]
    freq_by_thread = thread_df['subject'].value_counts()
    # NB: I'm not sure if total_seconds() is daylight-savings aware
    # so this may be mildly buggy.
    seconds_span = lambda x: (
        (np.max(x) - np.min(x)).total_seconds())
    span_by_thread = thread_df_clean.groupby('subject')
    span_by_thread = span_by_thread['date'].aggregate(seconds_span)
    activity = DataFrame({'freq'   : freq_by_thread,
                          'span'   : span_by_thread,
                          'weight' : 10 + np.log10(freq_by_thread/span_by_thread)})
    # Restricting to threads with more than 2 e-mails may be unecessary, but we 
    # could get 1-e-mail threads for a couple of reasons: misclassification (it's 
    # not really a thread), and left-censored threads -- where we found the last 
    # reply in the thread, but the earlier messages were recieved before the data 
    # was collected.
    activity = activity[activity['freq'] >= 2]
    activity = activity[notnull(activity['weight'])]
    return activity
In [16]:
thread_activity_df = get_thread_activity(thread_df)

Check against table on p. 116

Since book ignores time-zone info., spans in book are often wrong (see, e.g., 'please help newbie compile mplayer :-)')

In [17]:
threads_to_check = ['please help a newbie compile mplayer :-)',
                    'prob. w/ install/uninstall',

print thread_activity_df.ix[threads_to_check, :]
                                          freq    span    weight
please help a newbie compile mplayer :-)     4   13509  6.471437
prob. w/ install/uninstall                   4   12114  6.518772
http://apt.nixia.no/                        10  265303  5.576258


We create term-document-matrices for the e-mails subjects and messages. (E.g., each subject is a document, or each message is a document). Then we make two different sets of term weighting functions.

  1. Subject-term weights:
    • The set of all thread subjects in the training set has terms $\\{t_1, \ldots, t_n\\}$ .
    • Each term, $t_i$, occurs in the subjects of threads $\\{s_1(t_i), \ldots, s_k(t_i)\\}$
  • Each thread subject $s_j$ has an activity weight.
  • Each term, then, gets a weight that is the average of the activity weights of the thread subjects it occurs in. Note that only terms that are in the subjects of the threads in the training data get weight.
    • An e-mail's subject-term-weight is the average of its subject's term weights. If none of the terms in its subject have a weight, then its subject-term-weight is 1.
  1. Message-term weights:
  • Each message in the training set has terms $\\{t_1, \ldots, t_n\\}$.
  • Each of these terms has a weight based on its frequency (the number of times it appear across all messages). The weight is $\log_{10}(\textrm{freq}(t_i))\ \ $ .
  • An e-mails message-term-weight is the average of the term-weights of all the terms in its message. If the e-mail contains no terms that match the weighted terms (i.e., no terms that are message terms from training set messages), the e-mail gets a message-term-weight of 1.
In [18]:
# Stopwords from R. NLTK has a set of stopwords we could use as 
# well, but this minimizes discrepancies with the book.
rsw = read_csv('../ch3/r_stopwords.csv')['x'].values.tolist()

def get_thread_subject_term_weights(thread_activity_df):
    Creates a term->weight map based on a DataFrame containing
    thread subjects and their activity weights
    thread_subjects = thread_activity_df.index
    thread_tdm = tdm_df(thread_subjects, remove_punctuation = False,
                                         remove_digits = False,
                                         stopwords = rsw)
    def calc_term_weight(term): 
        threads_with_term = np.where(thread_tdm[term] > 0.0)
        weight_vec = thread_activity_df['weight'].ix[threads_with_term] 
        return weight_vec.mean()
    term_weights = Series({t: calc_term_weight(t) for t in thread_tdm})
    return term_weights
In [19]:
thread_subject_terms_weights = \
In [20]:
# Example term weights
print thread_subject_terms_weights.head(10)
aa           5.759571
adam         5.975084
adhesion     4.814012
adsl         6.397131
advice       4.915926
alsa         5.817899
angry        6.028029
announces    6.093981
anolther     5.317371
app          5.699035
In [21]:
def get_message_term_weights(email_df):
    Creates a term->weight map for terms in the messages of the training
    e-mail DataFrame
    messages = email_df['message']
    term_freq = tdm_df(messages, stopwords = rsw).sum()
    term_weight_df = DataFrame({'freq'   : term_freq,
                                'weight' : np.log10(term_freq)})
    return term_weight_df   
In [22]:
message_terms_weights = get_message_term_weights(train_df)

The following function comprises the two term-weighting functions. It computes either a message's subject-term-weight or message-term-weight, depending on what arguments we pass to it.

In [23]:
def get_weight_from_terms(term_list, weight_df, subject = False):
    Given a term-list from an e-mail's message, and a term->weights
    map contained in a DataFrame, calculate the e-mail's message or
    subject term-weight. (default is message)
    if isinstance(term_list, basestring):
        term_list = [term_list]
    if subject:
        weights = weight_df
        weights = weight_df['weight']
    term_list_weight = weights[term_list].mean()
    if np.isnan(term_list_weight):
        term_list_weight = 1.0
    return term_list_weight

Rank the e-mails

The following functions iterate through a DataFrame of e-mails and compute the weights of its e-mails by combining all the functions we've defined above.

In [24]:
def rank_email(email_df, row):
    Ranks an e-mail (as contained in the row of a DataFrame) by
    computing and combining its five weights.
    email   = email_df.ix[row, :]
    date    = email['date']
    sender  = email['sender']
    subject = email['subject']
    message = email['message']
    # 1. Get sender weights (all messages)
    sender_weight = (sender_weights['weight']
                     .get(sender) or 1.0)
    # 2. Get sender weights (threads)
    thread_sender_weight = (thread_sender_weights['weight']
                           .get(sender) or 1.0)
    # 3. Get thread activity weight
    is_thread = thread_flag(subject)
    subject = clean_subject(subject)
    if is_thread:
        activity_weight = (thread_activity_df['weight']
                           .get(subject) or 1.0)
        activity_weight = 1.0
    # 4. Get subject line weight via thread-subject term weights
    subj_terms = tdm_df(subject, remove_punctuation = False,
                                 remove_digits = False,
                                 stopwords = rsw).columns
    subj_term_weight = get_weight_from_terms(subj_terms, 
                            subject = True)
    # 5. Message term weight
    message_terms = tdm_df(message, stopwords = rsw).columns
    message_term_weight = get_weight_from_terms(message_terms,
    weights = [sender_weight, 
    # The e-mail's final rank is just the product of the weights.
    rank = np.array(weights).prod()
    return rank
In [25]:
def make_rank_df(email_df):
    Rank each e-mail in a DataFrame.
    n_emails = email_df.shape[0]
    sender_weight_results = []
    thread_sender_weight_results = []
    activity_weight_results = []
    subj_term_weight_results = []
    message_term_weight_results = []
    rank_results = []
    rank_df = email_df.copy()
    for e in xrange(n_emails):
        weights_rank = rank_email(email_df, e)
    rank_df['rank'] = rank_results
    return rank_df

Rank the e-mails in the training set.

In [26]:
train_ranks = make_rank_df(train_df)

Calculate the threshold for flagging a 'priority' message. New messages that have a rank greater than the threshold get flagged as priority.

In [27]:
threshold = train_ranks['rank'].median()

Rank the e-mails in the test set.

In [28]:
test_ranks = make_rank_df(test_df)


Plot the density of e-mail ranks in the test and training set. Note the mass of e-mails in the test set are ranked lower than priority, and only a tail of e-mails are flagged.

In [29]:
train_kde = KDE(train_ranks['rank'])
test_kde = KDE(test_ranks['rank'])

plt.figure(figsize(8, 6))
plt.fill(train_kde.support, train_kde.density, color = 'steelblue', alpha = .7, 
         label = 'Train')
plt.fill(test_kde.support, test_kde.density, color = 'red', alpha = .7, 
         label = 'Test')
plt.xlim(0, 400)
plt.ylim(0, np.max(test_kde.density))
plt.axvline(threshold, linestyle = '--', label = 'Priority threshold')
plt.title('Distribution of ranks for training and test e-mails')
plt.legend(frameon = False)
<matplotlib.legend.Legend at 0x10f450f90>


How many e-mails do we flag as priority in the test set?

In [30]:
(test_ranks['rank'] > threshold).sum()

What sender addresses get flagged in the test set? We should see some familiar addresses from the graph above.

Which training set e-mail is ranked highest?

In [32]:
maxrank_email = train_ranks.ix[train_ranks['rank'].idxmax(), :]
print 'Maximum rank in training set: ', train_ranks['rank'].max()
Maximum rank in training set:  1020.94633354

This turns out to be a lot higher than the largest found by the authors' R program. Let's check it out.

It turns out this e-mail's thread activity weight is substantially higher than found in the R code. As shown next, this is the result of a time-zone bug in the R code.

In [33]:
thread_activity_df.ix[clean_subject(maxrank_email['subject']), :]
freq      2.00000
span      4.00000
weight    9.69897
Name: [sadev] [bug 840] spam_level_char option change/removal

There are two messages in this thread (in the training set).

In [34]:
734    2002-09-06 10:56:23-07:00
763    2002-09-06 13:56:19-04:00
Name: date

E-mail 734 come 4 seconds after 763. But if we mistakenly ignore the time-zone (like in the R code) 734 comes 2:59:56 before 763.

All else equal, this will reduce the rank of this e-mail by 35%.

In [35]:
correct_weight = 10 + np.log10(2 / 4.)
incorrect_weight = 10 + np.log10(2 / (2* 3600 + 59 * 60 + 56.))
print 'Activity weight correct time-zone: ', correct_weight
print 'Activity weight incorrect time-zone: ', incorrect_weight
print 'Rank difference: ' , '{:2.1%}'.format(1 - incorrect_weight / correct_weight)
Activity weight correct time-zone:  9.69897000434
Activity weight incorrect time-zone:  6.26776711978
Rank difference:  35.4%