This IPython notebook details our work building a classifier for the Avazu click-through rate prediction competition.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import brewer2mpl
from matplotlib import rcParams
#colorbrewer2 Dark2 qualitative color table
dark2_cmap = brewer2mpl.get_map('Dark2', 'Qualitative', 7)
dark2_colors = dark2_cmap.mpl_colors
rcParams['figure.figsize'] = (10, 6)
rcParams['figure.dpi'] = 150
rcParams['axes.color_cycle'] = dark2_colors
rcParams['lines.linewidth'] = 2
rcParams['axes.facecolor'] = 'white'
rcParams['font.size'] = 14
rcParams['patch.edgecolor'] = 'white'
rcParams['patch.facecolor'] = dark2_colors[0]
rcParams['font.family'] = 'StixGeneral'
def remove_border(axes=None, top=False, right=False, left=True, bottom=True):
"""
Minimize chartjunk by stripping out unnecessary plot borders and axis ticks
The top/right/left/bottom keywords toggle whether the corresponding plot border is drawn
"""
ax = axes or plt.gca()
ax.spines['top'].set_visible(top)
ax.spines['right'].set_visible(right)
ax.spines['left'].set_visible(left)
ax.spines['bottom'].set_visible(bottom)
#turn off all ticks
ax.yaxis.set_ticks_position('none')
ax.xaxis.set_ticks_position('none')
#now re-enable visibles
if top:
ax.xaxis.tick_top()
if bottom:
ax.xaxis.tick_bottom()
if left:
ax.yaxis.tick_left()
if right:
ax.yaxis.tick_right()
import csv
csv_file_object = csv.reader(open(r"train_rev2","rb"))
header = csv_file_object.next()
At this stage, we realised that the size of the dataset (9 GB) was prohibitive for loading and manipulating within an IPython notebook. We therefore decided to explore a small sample of the dataset to try to understand the feature space.
This allows us to subsequently plot the growth of the feature space for each categorical variable, since we intend to use a boolean one-hot encoding. We can then observe whether the features for a particular category grow with the number of examples or whether they tail off; this helps us determine whether that category is a useful feature to include in the model (see the Vowpal Wabbit section below).
results = []
results2 = []
results3 = []
results4 = []
results5 = []
counter = 0
path = "train_rev2"
with open(path, "r") as data:
#Count the number of lines: 47686352
# for i, l in enumerate(data):
# pass
# print i + 1
# header = data.readline()
for line in data:
counter += 1
line = line.strip("\n")
line = line.strip()
        if counter <= 1000:
            results.append(line.split(","))
        elif counter <= 2000:
            results2.append(line.split(","))
        elif counter <= 3000:
            results3.append(line.split(","))
        elif counter <= 4000:
            results4.append(line.split(","))
        elif counter <= 5000:
            results5.append(line.split(","))
We then load the first 1000 datapoints into a pandas dataframe to use as a training set.
testing = pd.DataFrame(data=np.asarray(results[1:]), columns=header)
testing.click = testing.click.astype(int)
print testing.irow(0)
testing.head()
id                    10000222510487979663
click                                    0
hour                              14100100
C1                                    1005
banner_pos                               0
site_id                           d41d8cd9
site_domain                       d41d8cd9
site_category                     d41d8cd9
app_id                            ee72efa5
app_domain                        85262c2b
app_category                      7e5068fc
device_id                         d41d8cd9
device_ip                         22f9c6ba
device_os                         c31b3236
device_make                       3d517f89
device_model                      3e238c9b
device_type                              1
device_conn_type                         0
device_geo_country                fc9fdf08
C17                                  11999
C18                                    320
C19                                     50
C20                                   1248
C21                                      2
C22                                     39
C23                                     -1
C24                                     13
Name: 0, dtype: object
id | click | hour | C1 | banner_pos | site_id | site_domain | site_category | app_id | app_domain | ... | device_conn_type | device_geo_country | C17 | C18 | C19 | C20 | C21 | C22 | C23 | C24 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 10000222510487979663 | 0 | 14100100 | 1005 | 0 | d41d8cd9 | d41d8cd9 | d41d8cd9 | ee72efa5 | 85262c2b | ... | 0 | fc9fdf08 | 11999 | 320 | 50 | 1248 | 2 | 39 | -1 | 13 |
1 | 10000335031004381249 | 0 | 14100100 | 1005 | 0 | d41d8cd9 | d41d8cd9 | d41d8cd9 | 7ddd1e29 | 85262c2b | ... | 0 | e22428cc | 12026 | 320 | 50 | 1248 | 2 | 39 | -1 | 13 |
2 | 10000413097548171036 | 0 | 14100100 | 1010 | 1 | d41d8cd9 | d41d8cd9 | d41d8cd9 | 7dd0bcc4 | d41d8cd9 | ... | 2 | 5343b21a | 5470 | 320 | 50 | 394 | 2 | 303 | -1 | 15 |
3 | 10000436876114817886 | 0 | 14100100 | 1002 | 0 | d5589b4a | d41d8cd9 | d41d8cd9 | d41d8cd9 | d41d8cd9 | ... | 0 | 0b3b97fa | 16723 | 320 | 50 | 1876 | 2 | 291 | -1 | 33 |
4 | 10000488446663934007 | 1 | 14100100 | 1005 | 0 | d41d8cd9 | d41d8cd9 | d41d8cd9 | aa55fc10 | 85262c2b | ... | 2 | 75778bf8 | 17012 | 320 | 50 | 1871 | 3 | 35 | 100053 | 23 |
5 rows × 27 columns
At this point, we convert the time from its current categorical format into a numerical variable (the "hour" field has format YYMMDDHH). We add this new column to the dataframe under the name "timestamp".
NB: at a later stage we realised that all training examples came from the same month and year but vary by day and hour. For our feature selection we therefore split the hour data into two separate categories instead, namely day of week and time of day.
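A minimal sketch of that later split (the column names day_of_week and hour_of_day are our own choice, not ones used elsewhere in this notebook):
from datetime import datetime
# "hour" is a YYMMDDHH string, e.g. "14100100" -> 2014-10-01, hour 00
testing['day_of_week'] = testing['hour'].map(
    lambda x: datetime.strptime(x[:6], "%y%m%d").strftime("%A"))
testing['hour_of_day'] = testing['hour'].map(lambda x: int(x[6:8]))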
from datetime import datetime, date, time
testing['timestamp'] = testing['hour'].map(lambda x: int(datetime.strptime(x[4:6] + "/" + x[2:4] + "/" + x[0:2] + " " + x[6:8], "%d/%m/%y %H").strftime("%s")))
testing.tail()
id | click | hour | C1 | banner_pos | site_id | site_domain | site_category | app_id | app_domain | ... | device_geo_country | C17 | C18 | C19 | C20 | C21 | C22 | C23 | C24 | timestamp | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
994 | 10085753362715434105 | 0 | 14100100 | 1005 | 1 | b01fd8c0 | a56a5285 | 7e5068fc | d41d8cd9 | d41d8cd9 | ... | 959848ca | 14915 | 320 | 50 | 1623 | 3 | 175 | 100156 | 42 | 1412096400 |
995 | 10085762866609622098 | 0 | 14100100 | 1005 | 0 | d41d8cd9 | d41d8cd9 | d41d8cd9 | ca1502d1 | 96b73ddc | ... | 3d26b0b1 | 16966 | 320 | 50 | 1919 | 0 | 169 | 100108 | 17 | 1412096400 |
996 | 10085798392965113138 | 0 | 14100100 | 1005 | 0 | d41d8cd9 | d41d8cd9 | d41d8cd9 | ca1502d1 | 96b73ddc | ... | 3d26b0b1 | 16966 | 320 | 50 | 1919 | 0 | 169 | 100108 | 17 | 1412096400 |
997 | 10085816395998880696 | 1 | 14100100 | 1005 | 0 | d41d8cd9 | d41d8cd9 | d41d8cd9 | 8374fbd9 | 4f04e5f8 | ... | e529a9ce | 16615 | 320 | 50 | 1863 | 3 | 39 | 100248 | 23 | 1412096400 |
998 | 10085854709161323225 | 0 | 14100100 | 1005 | 0 | d41d8cd9 | d41d8cd9 | d41d8cd9 | ca1502d1 | 96b73ddc | ... | 0d149b90 | 16839 | 320 | 50 | 1883 | 0 | 1451 | 100094 | 17 | 1412096400 |
5 rows × 28 columns
We convert the categorical variables into indicator variables for the classifier using the get_dummies method. We do not require the id, click, or hour columns.
trial_text = header[3:-1]
df = pd.DataFrame()
for elem in trial_text:
    interim = pd.get_dummies(testing[elem])
    # "d41d8cd9" appears to be a hashed empty/null value, so we drop it
    if "d41d8cd9" in interim.columns.values:
        interim.drop("d41d8cd9", axis=1, inplace=True)
    interim.rename(columns=lambda x: elem + "_" + x, inplace=True)
    df = df.join(interim)
We populate the dataframe with a one-hot (0-1) encoding for each categorical column. The cell below adds the 1s.
for ix, value in enumerate(testing.ix[:,3:-1].values):
mydict = {}
indy = []
indy.append(ix)
for j, title in enumerate(value):
mydict[header[j+3] + "_" + title] = 1
df2 = pd.DataFrame(mydict, index = indy)
df = df.append(df2)
This cell adds the 0s and the quantitative timestamps to the dataframe.
df.fillna(0, inplace=True)
df['timestamp'] = testing['timestamp']
print df.shape
df.head()
(999, 2486)
C17_10198 | C17_10199 | C17_10200 | C17_10229 | C17_10289 | C17_1037 | C17_1039 | C17_10704 | C17_10901 | C17_10941 | ... | site_id_f3c495d0 | site_id_f440e761 | site_id_f4e01d44 | site_id_f701c177 | site_id_f91a85b6 | site_id_fb5ff023 | site_id_fe1972d4 | site_id_ff2e3304 | site_id_ffa60702 | timestamp | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1412096400 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1412096400 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1412096400 |
3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1412096400 |
4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1412096400 |
5 rows × 2486 columns
We plot the different variables to explore the significance of the features and their relation to click events. In blue, a simple sequential bar chart of all the unique values in each category; in red, a line chart of the percentage of that category's examples which clicked.
# Bar charts of counts per category, with click-through percentage overlaid on a second axis
fig, axes=plt.subplots(figsize=(20, 15), nrows=6, ncols=5)
fig.subplots_adjust(hspace = 0.8)
fig.subplots_adjust(wspace = 0.8)
bar_width = 1.
for ix in np.arange(6):
for ix2 in np.arange(5):
if (ix*5 + ix2) < len(header):
count_data = testing.groupby(header[ix*5 + ix2]).click.count() #Count plot of clicks
sum_data = testing.groupby(header[ix*5 + ix2]).click.sum() #Sum plot of clicks
percentage_data = sum_data / count_data
index = np.arange(len(count_data))
axes[ix][ix2].bar(index, count_data, bar_width, color='b', label='category count')
axes[ix][ix2].set_ylabel("Category count")
axes[ix][ix2].set_xlabel("Sequential Categories")
axes[ix][ix2].set_title(header[ix*5 + ix2])
remove_border(axes[ix][ix2], top=False, right=True, left=True, bottom=True)
secondAxis = axes[ix][ix2].twinx()
secondAxis.plot(index + bar_width / 2., percentage_data, color='r', label='percentage from category who clicked')
secondAxis.set_ylabel("Percentage Click")
else:
fig.delaxes(axes[ix][ix2])
From the above plots, we can get an overview of the different categories and their variability, as well as how they influence clicking. For example, we observe that in category C1, although not many examples of type "0" are present, those available seem consistently to lead to a click event, suggesting that the feature C1-0 might be a good predictor of clicking.
We import the required libraries to experiment with the Naive Bayes and Logistic Regression classifiers.
from sklearn.cross_validation import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
We define the supervised learning vector y.
y = testing.click
We split the data into a training and test set so that we can validate our model.
xtrain, xtest, ytrain, ytest = train_test_split(df, y)
We build a Naive Bayes model and fit it to our training set.
clf = MultinomialNB().fit(xtrain, ytrain)
print "Training accuracy: %0.2f%%" % (100 * clf.score(xtrain, ytrain))
print "Test accuracy: %0.2f%%" % (100 * clf.score(xtest, ytest))
Training accuracy: 92.26%
Test accuracy: 90.40%
prob = clf.predict_proba(xtest)
plt.hist(prob[:,1])
remove_border()
plt.ylabel("Count")
plt.xlabel("Probability")
plt.title("Naive Bayes Probability Histogram")
The model shows a high accuracy which is misleading: because so few examples correspond to a click, always predicting "no click" would also achieve high accuracy, misclassifying only the few examples where a click actually occurred. Our model also suffers from overfitting, due to the large number of features and, in our case, the low number of training examples, since we could only load 1000 examples at a time.
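To make the imbalance concrete, here is a quick check (using the ytest split from above) of the accuracy achieved by always predicting "no click":
# accuracy of the constant "no click" prediction on the held-out set
baseline = 1.0 - np.mean(ytest)
print "Majority-class baseline accuracy: %0.2f%%" % (100 * baseline)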
We now experiment with fitting a logistic regression model; this does not suffer from the same limitations as Naive Bayes with regard to independence of features. However, we would expect it to suffer from similar over-fitting problems.
clf2 = LogisticRegression().fit(xtrain, ytrain)
print "Training accuracy: %0.2f%%" % (100 * clf2.score(xtrain, ytrain))
print "Test accuracy: %0.2f%%" % (100 * clf2.score(xtest, ytest))
Training accuracy: 92.26%
Test accuracy: 90.40%
prob = clf2.predict_proba(xtest)
plt.hist(prob[:,1])
remove_border()
plt.ylabel("Count")
plt.xlabel("Probability")
plt.title("Logistic Regression Probability Histogram")
This model suffers from the same problems as the Naive Bayes above: the model is over-fit, and training on a much larger dataset is required.
In order to better understand the feature space and potentially reduce it, or remove problems with the data, we aim to visualise the features.
In order to check how the feature space grows, we plot a progressive bar chart of the cumulative number of unique category variables on adding 1000 additional data rows at a time. If this plot grows linearly, then we can assume there is not much generalisable information in this category and potentially remove it from our feature space.
testing2 = pd.DataFrame(data=np.asarray(results2[:]), columns=header)
testing2.click = testing2.click.astype(int)
testing2['timestamp'] = testing2['hour'].map(lambda x: int(datetime.strptime(x[4:6] + "/" + x[2:4] + "/" + x[0:2] + " " + x[6:8], "%d/%m/%y %H").strftime("%s")))
fig, axes=plt.subplots(figsize=(20,15), nrows=6, ncols=5)
fig.subplots_adjust(hspace = 0.8)
fig.subplots_adjust(wspace = 0.6)
for ix in np.arange(6):
for ix2 in np.arange(5):
if (ix*5 + ix2) < len(header):
first_count = testing[header[ix*5 + ix2]].unique()
first_count_len = len(first_count)
first_count_set = set(first_count)
second_count_set = set(testing2[header[ix*5 + ix2]].unique())
second_count_len = len(second_count_set.difference(first_count_set))
index = np.arange(2)
axes[ix][ix2].bar(index, [first_count_len, first_count_len+second_count_len], color='b', label='count')
axes[ix][ix2].set_title(header[ix*5 + ix2])
axes[ix][ix2].set_ylabel("Count")
axes[ix][ix2].set_xlabel("Index")
remove_border(axes[ix][ix2])
else:
fig.delaxes(axes[ix][ix2])
We now do this a few more times for feature-exploration purposes.
testing3 = pd.DataFrame(data=np.asarray(results3[:]), columns=header)
testing3.click = testing3.click.astype(int)
testing3['timestamp'] = testing3['hour'].map(lambda x: int(datetime.strptime(x[4:6] + "/" + x[2:4] + "/" + x[0:2] + " " + x[6:8], "%d/%m/%y %H").strftime("%s")))
testing4 = pd.DataFrame(data=np.asarray(results4[:]), columns=header)
testing4.click = testing4.click.astype(int)
testing4['timestamp'] = testing4['hour'].map(lambda x: int(datetime.strptime(x[4:6] + "/" + x[2:4] + "/" + x[0:2] + " " + x[6:8], "%d/%m/%y %H").strftime("%s")))
testing5 = pd.DataFrame(data=np.asarray(results5[:]), columns=header)
testing5.click = testing5.click.astype(int)
testing5['timestamp'] = testing5['hour'].map(lambda x: int(datetime.strptime(x[4:6] + "/" + x[2:4] + "/" + x[0:2] + " " + x[6:8], "%d/%m/%y %H").strftime("%s")))
fig, axes=plt.subplots(figsize=(20,15), nrows=6, ncols=5)
fig.subplots_adjust(hspace = 0.8)
fig.subplots_adjust(wspace = 0.6)
for ix in np.arange(6):
for ix2 in np.arange(5):
if (ix*5 + ix2) < len(header):
first_count = testing[header[ix*5 + ix2]].unique()
first_count_len = len(first_count)
first_count_set = set(first_count)
second_count_set = set(testing2[header[ix*5 + ix2]].unique())
second_count_set = second_count_set.union(first_count_set)
second_count_len = len(second_count_set)
third_count_set = set(testing3[header[ix*5 + ix2]].unique())
third_count_set = third_count_set.union(second_count_set)
third_count_len = len(third_count_set)
fourth_count_set = set(testing4[header[ix*5 + ix2]].unique())
fourth_count_set = fourth_count_set.union(third_count_set)
fourth_count_len = len(fourth_count_set)
fifth_count_set = set(testing5[header[ix*5 + ix2]].unique())
fifth_count_set = fifth_count_set.union(fourth_count_set)
fifth_count_len = len(fifth_count_set)
index = np.arange(5)
axes[ix][ix2].bar(index, [first_count_len, second_count_len, third_count_len, fourth_count_len, fifth_count_len], color='b', label='count')
axes[ix][ix2].set_title(header[ix*5 + ix2])
axes[ix][ix2].set_ylabel("Count")
axes[ix][ix2].set_xlabel("Index")
remove_border(axes[ix][ix2])
else:
fig.delaxes(axes[ix][ix2])
From the plots we can see that a few of the categories seem to scale with the size of the dataset. In this initial analysis, we decide to remove these features to reduce the feature space and the time required to create models.
We decide to re-run the model but not include the following features:
- device_ip
- device_model
- device_id
trial_text_new = header[3:]
df_new = pd.DataFrame()
for elem_new in trial_text_new:
if elem_new != "device_ip" and elem_new != "device_id" and elem_new != "device_model":
interim_new = pd.get_dummies(testing[elem_new])
if "d41d8cd9" in interim_new.columns.values:
interim_new.drop("d41d8cd9", axis=1, inplace=True)
interim_new.rename(columns=lambda x: elem_new + "_" + x, inplace=True)
df_new = df_new.join(interim_new)
COLnames = df_new.columns.values
df_new.head()
C1_1002 | C1_1005 | C1_1010 | banner_pos_0 | banner_pos_1 | site_id_02e31e62 | site_id_05222bb6 | site_id_060f567a | site_id_09ab1430 | site_id_0be51948 | ... | C24_32 | C24_33 | C24_42 | C24_46 | C24_48 | C24_52 | C24_62 | C24_79 | C24_82 | C24_95 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 rows × 949 columns
for ix, value in enumerate(testing.ix[:,3:-1].values):
mydict = {}
indy = []
indy.append(ix)
    for j, title in enumerate(value):
        # j is offset by 3 from header: j=8 is device_id, j=9 is device_ip,
        # j=12 is device_model; also skip the null hash value
        if j != 8 and j != 9 and j != 12 and title != "d41d8cd9":
            mydict[header[j+3] + "_" + title] = 1
df2_new = pd.DataFrame(mydict, index = indy)
df_new = df_new.append(df2_new)
df_new['timestamp'] = testing['timestamp']
df_new.fillna(0, inplace=True)
y=testing.click
xtrain, xtest, ytrain, ytest = train_test_split(df_new, y)
clf = MultinomialNB().fit(xtrain, ytrain)
print "Testing Accuracy: %0.2f%%" % (100 * clf.score(xtest, ytest))
print "Training Accuracy: %0.2f%%" % (100 * clf.score(xtrain, ytrain))
prob = clf.predict_proba(xtest)
plt.hist(prob[:,1])
remove_border()
plt.ylabel("Count")
plt.xlabel("Probability")
plt.title("Naive Bayes Probability Histogram")
Testing Accuracy: 90.40%
Training Accuracy: 92.26%
clf2 = LogisticRegression().fit(xtrain, ytrain)
print "Testing Accuracy: %0.2f%%" % (100 * clf2.score(xtest, ytest))
print "Training Accuracy: %0.2f%%" % (100 * clf2.score(xtrain, ytrain))
prob = clf2.predict_proba(xtest)
plt.hist(prob[:,1])
remove_border()
plt.ylabel("Count")
plt.xlabel("Probability")
plt.title("Logistic Regression Probability Histogram")
Testing Accuracy: 90.40%
Training Accuracy: 92.26%
The accuracy of both models has not changed and remains very high. Although the above was a useful exploration and practice exercise, with such a large number of features we need to train the models on much larger datasets in order to avoid overfitting, particularly as the data is available. We therefore decided to learn about Vowpal Wabbit, which seems to be the current first choice for large-scale data analysis on Kaggle. It is an online learner, which we introduce below along with the steps we took to train it.
Due to the size of the dataset (~9 GB), it is very difficult to train models using approaches best suited to smaller datasets in scikit-learn, as the machine cannot fit the dataset into memory. After frustrating attempts to read the dataset into an IPython notebook, we explored methods for machine learning (ML) on 'Big Data':
After exploring the Kaggle forums and approaches on large datasets, we were introduced to the Vowpal Wabbit tool. This is an online learner which is able to deal with very large datasets on local machines by iteratively training a model using each new example.
Our strategy for using Vowpal Wabbit was as follows:
- The variables are encoded with a one-hot encoding: each unique value of a categorical variable is used as a boolean predictor.
- From our previous analysis (see section 2), we discovered that the training data came from the same month and year, so we extracted day-of-week and hour-of-day information from the "hour" variable in the dataset. The categorical variable "hour" itself was therefore not included as a feature in the model.
- Also from the previous analysis, we observed that the "device_id" and "device_ip" categories grow linearly with the number of examples, and we therefore removed them from our features.
- The categorical variable "id" was not used as a feature either, as it consists of a unique id for each row.
- The entire input file was then turned into vw encoding using another python script [github_link - adapted from Triskelion info@mlwave.com]; a simplified sketch of the encoding is shown after this list.
- The loss function used was the logistic loss, as defined by the Kaggle competition; this also makes sense for evaluating probabilities, since it is derived from a Bernoulli likelihood.
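As an illustration only, a simplified sketch of the row-to-VW encoding described above (the namespace f and the derived feature names are our own; the real conversion was done by the script linked above):
from datetime import datetime

def to_vw_line(row, header):
    # VW's logistic loss expects labels in {-1, 1}
    label = "1" if row[header.index("click")] == "1" else "-1"
    hh = row[header.index("hour")]  # YYMMDDHH string
    dow = datetime.strptime(hh[:6], "%y%m%d").strftime("%a")
    # drop id/click/hour plus the linearly-growing device columns
    skip = set(["id", "click", "hour", "device_id", "device_ip"])
    feats = " ".join("%s_%s" % (col, val)
                     for col, val in zip(header, row) if col not in skip)
    return "%s |f %s dow_%s hod_%s" % (label, feats, dow, hh[6:8])
Applied to the first data row shown earlier, this would yield a line beginning "-1 |f C1_1005 banner_pos_0 ...".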
The logarithmic loss (interpreted as the validation error) from our trained model showed an initial decrease (as expected from convergence) but then started to increase again. Looking into how to debug the model, we expect this could be due to one of the following:
1. The data is not shuffled but is ordered by some systematic component, meaning the learner is trained first on one subset of the data and then experiences another subset, and so on; the most obvious candidate is time 'seasonality'. A simple way to shuffle the encoded file is shown after this list.
2. The learning rate is too high and the algorithm is not converging properly. In section 7.4 we explain how to optimise the learning rate to help the algorithm converge.
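To address the first point, one simple option (assuming GNU coreutils; train.vw is a placeholder name of ours) is to shuffle the encoded file on disk before training:
shuf train.vw -o train_shuffled.vw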
In order to reduce the feature space further, we need to examine the contribution each feature makes to our Vowpal Wabbit logistic regression model. To do this we use the vw-varinfo utility to output the weighting of each feature within the logistic regression.
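A plausible invocation (vw-varinfo accepts the usual vw arguments; the file name train_reduced.vw is a placeholder of ours):
vw-varinfo --loss_function logistic train_reduced.vw > varinfo.txt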
We read in the varinfo text file to explore the features and their weights.
tester = []
path = "varinfo.txt"
with open(path,"r") as infile:
for row in infile:
row = row.strip("c^").split("\n")
tester.append(row[0])
the_brain = {}
feature_set = []
totalScores = []
for value in tester[1:]:
totalScores.append((value.split()[0], value.split()[5]))
feature = value.split()[0].rsplit('_', 1)[0]
score = float(value.split()[5].strip("%"))
if feature not in the_brain:
feature_set.append(feature)
the_brain[feature] = [score]
else:
the_brain[feature].append(score)
print len(feature_set)
26
fig, axes=plt.subplots(figsize=(23,15), nrows=6, ncols=5)
fig.subplots_adjust(hspace = 0.8)
fig.subplots_adjust(wspace = 0.6)
for ix in np.arange(6):
for ix2 in np.arange(5):
        if (ix*5 + ix2) < len(feature_set):
            axes[ix][ix2].hist(the_brain[feature_set[ix*5 + ix2]], bins=10)
            axes[ix][ix2].set_title(feature_set[ix*5 + ix2])
axes[ix][ix2].set_ylabel("Count")
axes[ix][ix2].set_xlabel("Weight")
remove_border(axes[ix][ix2])
else:
fig.delaxes(axes[ix][ix2])
We create an array collecting all the features whose relative weight lies between -1% and +1%, with the aim of removing those features from our training set to reduce the feature space.
lessThanAbsOne = [e1[0] for e1 in totalScores if abs(float(e1[1].strip("%"))) < 1]
We then create a csv file with these features, which we read into the python script that creates the vw encoding with those features removed.
with open('featureRemove.csv', 'w') as fp:
    writer = csv.writer(fp)
    # write one feature name per row
    writer.writerows([feature] for feature in lessThanAbsOne)
The learning rate for our model may be too high, preventing convergence and causing the logarithmic loss to increase. Vowpal Wabbit provides a golden-section search function to optimise the learning rate within a user-defined range, yielding the lowest validation error and improving convergence.
We run the following command in Vowpal Wabbit to search for the best learning rate parameter (on our training set with reduced features):
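A plausible form of this command uses the vw-hypersearch wrapper, which implements the golden-section search; the search bounds and the file name train_reduced.vw below are placeholders of ours, and % marks the parameter being searched:
vw-hypersearch 0.1 10 vw -l % --loss_function logistic -d train_reduced.vw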
In our case, the optimised learning rate turned out to be 4.5, which is quite high compared to the default value of 0.5. When we trained our model with this new parameter the log loss did decrease as expected, but the effect was negligible.
Vowpal Wabbit provides two parameters to add regularization to the model, namely L1 and L2, as described in this very useful post: http://fastml.com/large-scale-l1-feature-selection-with-vowpal-wabbit/
We run the following command in Vowpal Wabbit to search for the best regularization parameter l1 (on our training set with reduced features):
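As above, a plausible form of the command (the bounds and file name are again placeholders of ours; % marks the searched parameter):
vw-hypersearch -L 1e-10 1e-4 vw --l1 % --loss_function logistic -d train_reduced.vw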
The -L flag makes the search run in log-space, which we chose to increase speed.
In our case, the optimised L1 regularization parameter was of the order of 1e-8; since this is so close to zero, we chose to leave regularization off and train our dataset without it.
Considering the possible bias of the model towards a sub-section of the training data (observed as a bump in the error rate), bootstrapping and multiple passes through the training data were explored to improve the prediction. Below is a graph of the log-loss error for our initial model and for the model using bootstrapping and multiple passes. Vowpal Wabbit allows multiple passes over the dataset during training, so each row is used multiple times; however, this risks overfitting to the training data, reducing the error rate at the cost of a non-generalisable model.
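A sketch of the kind of invocation this involves (the pass and bootstrap counts and the file names are our assumptions; --passes requires the cache flag -c):
vw -d train_reduced.vw --loss_function logistic -c --passes 5 --bootstrap 10 -f model.vw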
path = "log_loss_.csv"
counter = 0
boot_res_x = []
boot_res_y = []
first_res_x = []
first_res_y = []
with open(path, "r") as data:
for line in data:
line = line.replace(" ", ",")
line = line.split(",")
line = [e.strip("\n") for e in line if e != ""]
        # convert the parsed strings to floats so matplotlib plots numeric values
        if counter > 1 and line[6] != '26':
            boot_res_y.append(float(line[0]))
            boot_res_x.append(float(line[3]))
        if counter > 1 and line[6] == '26':
            first_res_y.append(float(line[0]))
            first_res_x.append(float(line[3]))
counter = counter + 1
# Plots
plt.plot(boot_res_x, boot_res_y, color = "b", label = "bootstrap samples")
plt.plot(first_res_x, first_res_y, color = "r", label = "simple model")
plt.ylim(0.2, 0.5)
remove_border()
plt.title("Log loss versus number of training examples")
plt.ylabel("Log loss")
plt.xlabel("Number of training examples")
plt.legend()
Note: the blue plot shows many more training examples because multiple passes were run over the dataset.
The above plot shows that even with bootstrapping and multiple passes we still observe the "bump" of increasing log-loss error before stabilisation; this suggests the data would still benefit from being shuffled and probably suffers from some seasonality effects. However, the bump has been reduced, so bootstrapping and multiple passes do seem to have helped.