This IPython notebook details our work building a classifier for the Avazu click-through rate prediction competition.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import brewer2mpl
from matplotlib import rcParams
#colorbrewer2 Dark2 qualitative color table
dark2_cmap = brewer2mpl.get_map('Dark2', 'Qualitative', 7)
dark2_colors = dark2_cmap.mpl_colors
rcParams['figure.figsize'] = (10, 6)
rcParams['figure.dpi'] = 150
rcParams['axes.color_cycle'] = dark2_colors
rcParams['lines.linewidth'] = 2
rcParams['axes.facecolor'] = 'white'
rcParams['font.size'] = 14
rcParams['patch.edgecolor'] = 'white'
rcParams['patch.facecolor'] = dark2_colors[0]
rcParams['font.family'] = 'StixGeneral'
def remove_border(axes=None, top=False, right=False, left=True, bottom=True):
"""
Minimize chartjunk by stripping out unnecessary plot borders and axis ticks
The top/right/left/bottom keywords toggle whether the corresponding plot border is drawn
"""
ax = axes or plt.gca()
ax.spines['top'].set_visible(top)
ax.spines['right'].set_visible(right)
ax.spines['left'].set_visible(left)
ax.spines['bottom'].set_visible(bottom)
#turn off all ticks
ax.yaxis.set_ticks_position('none')
ax.xaxis.set_ticks_position('none')
#now re-enable visibles
if top:
ax.xaxis.tick_top()
if bottom:
ax.xaxis.tick_bottom()
if left:
ax.yaxis.tick_left()
if right:
ax.yaxis.tick_right()
import csv
csv_file_object = csv.reader(open(r"train_rev2","rb"))
header = csv_file_object.next()
At this stage, we realised that the size of the dataset (9 GB) was prohibitive for loading and manipulating within an IPython notebook. We therefore decided to explore a small sample of the dataset to try to understand the feature space.
This allows us to subsequently plot the growth of the feature space for each categorical variable, since we intend to use a boolean one-hot encoding. We can then observe whether the features for a particular category grow with the number of examples or whether they tail off; this helps us determine whether that category is a useful feature to include in the model (see the Vowpal Wabbit section below).
results = []
results2 = []
results3 = []
results4 = []
results5 = []
counter = 0
path = "train_rev2"
with open(path, "r") as data:
#Count the number of lines: 47686352
# for i, l in enumerate(data):
# pass
# print i + 1
# header = data.readline()
for line in data:
counter += 1
line = line.strip("\n")
line = line.strip()
        if counter <= 1000:
            results.append(line.split(","))
        elif counter <= 2000:
            results2.append(line.split(","))
        elif counter <= 3000:
            results3.append(line.split(","))
        elif counter <= 4000:
            results4.append(line.split(","))
        elif counter <= 5000:
            results5.append(line.split(","))
We then load the first 1000 datapoints into a pandas dataframe to use as a training set.
testing = pd.DataFrame(data=np.asarray(results[1:]), columns=header)
testing.click = testing.click.astype(int)
print testing.irow(0)
testing.head()
id                    10000222510487979663
click                                    0
hour                              14100100
C1                                    1005
banner_pos                               0
site_id                           d41d8cd9
site_domain                       d41d8cd9
site_category                     d41d8cd9
app_id                            ee72efa5
app_domain                        85262c2b
app_category                      7e5068fc
device_id                         d41d8cd9
device_ip                         22f9c6ba
device_os                         c31b3236
device_make                       3d517f89
device_model                      3e238c9b
device_type                              1
device_conn_type                         0
device_geo_country                fc9fdf08
C17                                  11999
C18                                    320
C19                                     50
C20                                   1248
C21                                      2
C22                                     39
C23                                     -1
C24                                     13
Name: 0, dtype: object
id | click | hour | C1 | banner_pos | site_id | site_domain | site_category | app_id | app_domain | ... | device_conn_type | device_geo_country | C17 | C18 | C19 | C20 | C21 | C22 | C23 | C24 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 10000222510487979663 | 0 | 14100100 | 1005 | 0 | d41d8cd9 | d41d8cd9 | d41d8cd9 | ee72efa5 | 85262c2b | ... | 0 | fc9fdf08 | 11999 | 320 | 50 | 1248 | 2 | 39 | -1 | 13 |
1 | 10000335031004381249 | 0 | 14100100 | 1005 | 0 | d41d8cd9 | d41d8cd9 | d41d8cd9 | 7ddd1e29 | 85262c2b | ... | 0 | e22428cc | 12026 | 320 | 50 | 1248 | 2 | 39 | -1 | 13 |
2 | 10000413097548171036 | 0 | 14100100 | 1010 | 1 | d41d8cd9 | d41d8cd9 | d41d8cd9 | 7dd0bcc4 | d41d8cd9 | ... | 2 | 5343b21a | 5470 | 320 | 50 | 394 | 2 | 303 | -1 | 15 |
3 | 10000436876114817886 | 0 | 14100100 | 1002 | 0 | d5589b4a | d41d8cd9 | d41d8cd9 | d41d8cd9 | d41d8cd9 | ... | 0 | 0b3b97fa | 16723 | 320 | 50 | 1876 | 2 | 291 | -1 | 33 |
4 | 10000488446663934007 | 1 | 14100100 | 1005 | 0 | d41d8cd9 | d41d8cd9 | d41d8cd9 | aa55fc10 | 85262c2b | ... | 2 | 75778bf8 | 17012 | 320 | 50 | 1871 | 3 | 35 | 100053 | 23 |
5 rows × 27 columns
At this point, we convert the time from its current categorical format into a numerical variable (the "hour" field has format YYMMDDHH). We add this new column to the dataframe under the name "timestamp".
NB: at a later stage we realised that all training examples came from the same month and year but vary by day and hour. For our feature selection we therefore split the hour data into two separate categories instead, namely day of week and time of day.
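A minimal sketch of that later split (the column names day_of_week and hour_of_day are our own choice, not ones used elsewhere in this notebook):
from datetime import datetime
# "hour" is a YYMMDDHH string, e.g. "14100100" -> 2014-10-01, hour 00
testing['day_of_week'] = testing['hour'].map(
    lambda x: datetime.strptime(x[:6], "%y%m%d").strftime("%A"))
testing['hour_of_day'] = testing['hour'].map(lambda x: int(x[6:8]))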
from datetime import datetime, date, time
testing['timestamp'] = testing['hour'].map(lambda x: int(datetime.strptime(x[4:6] + "/" + x[2:4] + "/" + x[0:2] + " " + x[6:8], "%d/%m/%y %H").strftime("%s")))
testing.tail()
id | click | hour | C1 | banner_pos | site_id | site_domain | site_category | app_id | app_domain | ... | device_geo_country | C17 | C18 | C19 | C20 | C21 | C22 | C23 | C24 | timestamp | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
994 | 10085753362715434105 | 0 | 14100100 | 1005 | 1 | b01fd8c0 | a56a5285 | 7e5068fc | d41d8cd9 | d41d8cd9 | ... | 959848ca | 14915 | 320 | 50 | 1623 | 3 | 175 | 100156 | 42 | 1412096400 |
995 | 10085762866609622098 | 0 | 14100100 | 1005 | 0 | d41d8cd9 | d41d8cd9 | d41d8cd9 | ca1502d1 | 96b73ddc | ... | 3d26b0b1 | 16966 | 320 | 50 | 1919 | 0 | 169 | 100108 | 17 | 1412096400 |
996 | 10085798392965113138 | 0 | 14100100 | 1005 | 0 | d41d8cd9 | d41d8cd9 | d41d8cd9 | ca1502d1 | 96b73ddc | ... | 3d26b0b1 | 16966 | 320 | 50 | 1919 | 0 | 169 | 100108 | 17 | 1412096400 |
997 | 10085816395998880696 | 1 | 14100100 | 1005 | 0 | d41d8cd9 | d41d8cd9 | d41d8cd9 | 8374fbd9 | 4f04e5f8 | ... | e529a9ce | 16615 | 320 | 50 | 1863 | 3 | 39 | 100248 | 23 | 1412096400 |
998 | 10085854709161323225 | 0 | 14100100 | 1005 | 0 | d41d8cd9 | d41d8cd9 | d41d8cd9 | ca1502d1 | 96b73ddc | ... | 0d149b90 | 16839 | 320 | 50 | 1883 | 0 | 1451 | 100094 | 17 | 1412096400 |
5 rows × 28 columns
We convert the categorical variables into indicator variables for the classifier using the get_dummies method. We do not require the id, click, or hour columns.
trial_text = header[3:-1]
df = pd.DataFrame()
for elem in trial_text:
    interim = pd.get_dummies(testing[elem])
    # "d41d8cd9" appears to be a hashed empty/null value, so we drop it
    if "d41d8cd9" in interim.columns.values:
        interim.drop("d41d8cd9", axis=1, inplace=True)
    interim.rename(columns=lambda x: elem + "_" + x, inplace=True)
    df = df.join(interim)
We populate the dataframe with a one-hot (0-1) encoding for each categorical column. The cell below adds the 1s.
for ix, value in enumerate(testing.ix[:,3:-1].values):
mydict = {}
indy = []
indy.append(ix)
for j, title in enumerate(value):
mydict[header[j+3] + "_" + title] = 1
df2 = pd.DataFrame(mydict, index = indy)
df = df.append(df2)
This cell adds the 0s and the quantitative timestamps to the dataframe.
df.fillna(0, inplace=True)
df['timestamp'] = testing['timestamp']
print df.shape
df.head()
(999, 2486)
C17_10198 | C17_10199 | C17_10200 | C17_10229 | C17_10289 | C17_1037 | C17_1039 | C17_10704 | C17_10901 | C17_10941 | ... | site_id_f3c495d0 | site_id_f440e761 | site_id_f4e01d44 | site_id_f701c177 | site_id_f91a85b6 | site_id_fb5ff023 | site_id_fe1972d4 | site_id_ff2e3304 | site_id_ffa60702 | timestamp | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1412096400 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1412096400 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1412096400 |
3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1412096400 |
4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1412096400 |
5 rows × 2486 columns
We plot the different variables to explore the significance of the features and their relation to click events. In blue, a simple sequential bar chart of all the unique values in each category; in red, a line chart of the percentage of that category's examples which clicked.
# Bar charts of counts per category, with click-through percentage overlaid on a second axis
fig, axes=plt.subplots(figsize=(20, 15), nrows=6, ncols=5)
fig.subplots_adjust(hspace = 0.8)
fig.subplots_adjust(wspace = 0.8)
bar_width = 1.
for ix in np.arange(6):
for ix2 in np.arange(5):
if (ix*5 + ix2) < len(header):
count_data = testing.groupby(header[ix*5 + ix2]).click.count() #Count plot of clicks
sum_data = testing.groupby(header[ix*5 + ix2]).click.sum() #Sum plot of clicks
percentage_data = sum_data / count_data
index = np.arange(len(count_data))
axes[ix][ix2].bar(index, count_data, bar_width, color='b', label='category count')
axes[ix][ix2].set_ylabel("Category count")
axes[ix][ix2].set_xlabel("Sequential Categories")
axes[ix][ix2].set_title(header[ix*5 + ix2])
remove_border(axes[ix][ix2], top=False, right=True, left=True, bottom=True)
secondAxis = axes[ix][ix2].twinx()
secondAxis.plot(index + bar_width / 2., percentage_data, color='r', label='percentage from category who clicked')
secondAxis.set_ylabel("Percentage Click")
else:
fig.delaxes(axes[ix][ix2])
From the above plots, we can get an overview of the different categories and their variability, as well as how they influence clicking. For example, we observe that in category C1, although not many examples of type "0" are present, those available seem consistently to lead to a click event, suggesting that the feature C1-0 might be a good predictor of clicking.
We import the required libraries to experiment with the Naive Bayes and Logistic Regression classifiers.
from sklearn.cross_validation import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
We define the supervised learning vector y.
y = testing.click
We split the data into a training and test set so that we can validate our model.
xtrain, xtest, ytrain, ytest = train_test_split(df, y)
We build a Naive Bayes model and fit it to our training set.
clf = MultinomialNB().fit(xtrain, ytrain)
print "Training accuracy: %0.2f%%" % (100 * clf.score(xtrain, ytrain))
print "Test accuracy: %0.2f%%" % (100 * clf.score(xtest, ytest))
Training accuracy: 92.26%
Test accuracy: 90.40%
prob = clf.predict_proba(xtest)
plt.hist(prob[:,1])
remove_border()
plt.ylabel("Count")
plt.xlabel("Probability")
plt.title("Naive Bayes Probability Histogram")
The model shows a high accuracy which is misleading: because so few examples correspond to a click, always predicting "no click" would also achieve high accuracy, misclassifying only the few examples where a click actually occurred. Our model also suffers from overfitting, due to the large number of features and, in our case, the low number of training examples, since we could only load 1000 examples at a time.
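To make the imbalance concrete, here is a quick check (using the ytest split from above) of the accuracy achieved by always predicting "no click":
# accuracy of the constant "no click" prediction on the held-out set
baseline = 1.0 - np.mean(ytest)
print "Majority-class baseline accuracy: %0.2f%%" % (100 * baseline)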
We now experiment with fitting a logistic regression model; this does not suffer from the same limitations as Naive Bayes with regard to independence of features. However, we would expect it to suffer from similar over-fitting problems.
clf2 = LogisticRegression().fit(xtrain, ytrain)
print "Training accuracy: %0.2f%%" % (100 * clf2.score(xtrain, ytrain))
print "Test accuracy: %0.2f%%" % (100 * clf2.score(xtest, ytest))
Training accuracy: 92.26%
Test accuracy: 90.40%
prob = clf2.predict_proba(xtest)
plt.hist(prob[:,1])
remove_border()
plt.ylabel("Count")
plt.xlabel("Probability")
plt.title("Logistic Regression Probability Histogram")
This model suffers from the same problems as the Naive Bayes above: the model is over-fit, and training on a much larger dataset is required.
In order to better understand the feature space and potentially reduce it, or remove problems with the data, we aim to visualise the features.
In order to check how the feature space grows, we plot a progressive bar chart of the cumulative number of unique category variables on adding 1000 additional data rows at a time. If this plot grows linearly, then we can assume there is not much generalisable information in this category and potentially remove it from our feature space.
testing2 = pd.DataFrame(data=np.asarray(results2[:]), columns=header)
testing2.click = testing2.click.astype(int)
testing2['timestamp'] = testing2['hour'].map(lambda x: int(datetime.strptime(x[4:6] + "/" + x[2:4] + "/" + x[0:2] + " " + x[6:8], "%d/%m/%y %H").strftime("%s")))
fig, axes=plt.subplots(figsize=(20,15), nrows=6, ncols=5)
fig.subplots_adjust(hspace = 0.8)
fig.subplots_adjust(wspace = 0.6)
for ix in np.arange(6):
for ix2 in np.arange(5):
if (ix*5 + ix2) < len(header):
first_count = testing[header[ix*5 + ix2]].unique()
first_count_len = len(first_count)
first_count_set = set(first_count)
second_count_set = set(testing2[header[ix*5 + ix2]].unique())
second_count_len = len(second_count_set.difference(first_count_set))
index = np.arange(2)
axes[ix][ix2].bar(index, [first_count_len, first_count_len+second_count_len], color='b', label='count')
axes[ix][ix2].set_title(header[ix*5 + ix2])
axes[ix][ix2].set_ylabel("Count")
axes[ix][ix2].set_xlabel("Index")
remove_border(axes[ix][ix2])
else:
fig.delaxes(axes[ix][ix2])
We now do this a few more times for feature-exploration purposes.
testing3 = pd.DataFrame(data=np.asarray(results3[:]), columns=header)
testing3.click = testing3.click.astype(int)
testing3['timestamp'] = testing3['hour'].map(lambda x: int(datetime.strptime(x[4:6] + "/" + x[2:4] + "/" + x[0:2] + " " + x[6:8], "%d/%m/%y %H").strftime("%s")))
testing4 = pd.DataFrame(data=np.asarray(results4[:]), columns=header)
testing4.click = testing4.click.astype(int)
testing4['timestamp'] = testing4['hour'].map(lambda x: int(datetime.strptime(x[4:6] + "/" + x[2:4] + "/" + x[0:2] + " " + x[6:8], "%d/%m/%y %H").strftime("%s")))
testing5 = pd.DataFrame(data=np.asarray(results5[:]), columns=header)
testing5.click = testing5.click.astype(int)
testing5['timestamp'] = testing5['hour'].map(lambda x: int(datetime.strptime(x[4:6] + "/" + x[2:4] + "/" + x[0:2] + " " + x[6:8], "%d/%m/%y %H").strftime("%s")))
fig, axes=plt.subplots(figsize=(20,15), nrows=6, ncols=5)
fig.subplots_adjust(hspace = 0.8)
fig.subplots_adjust(wspace = 0.6)
for ix in np.arange(6):
for ix2 in np.arange(5):
if (ix*5 + ix2) < len(header):
first_count = testing[header[ix*5 + ix2]].unique()
first_count_len = len(first_count)
first_count_set = set(first_count)
second_count_set = set(testing2[header[ix*5 + ix2]].unique())
second_count_set = second_count_set.union(first_count_set)
second_count_len = len(second_count_set)
third_count_set = set(testing3[header[ix*5 + ix2]].unique())
third_count_set = third_count_set.union(second_count_set)
third_count_len = len(third_count_set)
fourth_count_set = set(testing4[header[ix*5 + ix2]].unique())
fourth_count_set = fourth_count_set.union(third_count_set)
fourth_count_len = len(fourth_count_set)
fifth_count_set = set(testing5[header[ix*5 + ix2]].unique())
fifth_count_set = fifth_count_set.union(fourth_count_set)
fifth_count_len = len(fifth_count_set)
index = np.arange(5)
axes[ix][ix2].bar(index, [first_count_len, second_count_len, third_count_len, fourth_count_len, fifth_count_len], color='b', label='count')
axes[ix][ix2].set_title(header[ix*5 + ix2])
axes[ix][ix2].set_ylabel("Count")
axes[ix][ix2].set_xlabel("Index")
remove_border(axes[ix][ix2])
else:
fig.delaxes(axes[ix][ix2])
From the plots we can see that a few of the categories seem to scale with the size of the dataset. In this initial analysis, we decide to remove these features to reduce the feature space and the time required to create models.
We decide to re-run the model but not include the following features:
- device_ip
- device_model
- device_id
trial_text_new = header[3:]
df_new = pd.DataFrame()
for elem_new in trial_text_new:
if elem_new != "device_ip" and elem_new != "device_id" and elem_new != "device_model":
interim_new = pd.get_dummies(testing[elem_new])
if "d41d8cd9" in interim_new.columns.values:
interim_new.drop("d41d8cd9", axis=1, inplace=True)
interim_new.rename(columns=lambda x: elem_new + "_" + x, inplace=True)
df_new = df_new.join(interim_new)
COLnames = df_new.columns.values
df_new.head()
C1_1002 | C1_1005 | C1_1010 | banner_pos_0 | banner_pos_1 | site_id_02e31e62 | site_id_05222bb6 | site_id_060f567a | site_id_09ab1430 | site_id_0be51948 | ... | C24_32 | C24_33 | C24_42 | C24_46 | C24_48 | C24_52 | C24_62 | C24_79 | C24_82 | C24_95 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 rows × 949 columns
for ix, value in enumerate(testing.ix[:,3:-1].values):
mydict = {}
indy = []
indy.append(ix)
    for j, title in enumerate(value):
        # j is offset by 3 from header: j=8 is device_id, j=9 is device_ip,
        # j=12 is device_model; also skip the null hash value
        if j != 8 and j != 9 and j != 12 and title != "d41d8cd9":
            mydict[header[j+3] + "_" + title] = 1
df2_new = pd.DataFrame(mydict, index = indy)
df_new = df_new.append(df2_new)
df_new['timestamp'] = testing['timestamp']
df_new.fillna(0, inplace=True)
y=testing.click
xtrain, xtest, ytrain, ytest = train_test_split(df_new, y)
clf = MultinomialNB().fit(xtrain, ytrain)
print "Testing Accuracy: %0.2f%%" % (100 * clf.score(xtest, ytest))
print "Training Accuracy: %0.2f%%" % (100 * clf.score(xtrain, ytrain))
prob = clf.predict_proba(xtest)
plt.hist(prob[:,1])
remove_border()
plt.ylabel("Count")
plt.xlabel("Probability")
plt.title("Naive Bayes Probability Histogram")
Testing Accuracy: 90.40%
Training Accuracy: 92.26%
clf2 = LogisticRegression().fit(xtrain, ytrain)
print "Testing Accuracy: %0.2f%%" % (100 * clf2.score(xtest, ytest))
print "Training Accuracy: %0.2f%%" % (100 * clf2.score(xtrain, ytrain))
prob = clf2.predict_proba(xtest)
plt.hist(prob[:,1])
remove_border()
plt.ylabel("Count")
plt.xlabel("Probability")
plt.title("Logistic Regression Probability Histogram")
Testing Accuracy: 90.40%
Training Accuracy: 92.26%
The accuracy of both models has not changed and remains very high. Although the above was a useful exploration and practice exercise, with such a large number of features we need to train the models on much larger datasets in order to avoid overfitting, particularly as the data is available. We therefore decided to learn about Vowpal Wabbit, which seems to be the current first choice for large-scale data analysis on Kaggle. It is an online learner, which we introduce below along with the steps we took to train it.
Due to the size of the dataset (~9 GB), it is very difficult to train models using approaches best suited to smaller datasets in scikit-learn, as the machine cannot fit the dataset into memory. After frustrating attempts to read the dataset into an IPython notebook, we explored methods for machine learning (ML) on 'Big Data':
After exploring the Kaggle forums and approaches on large datasets, we were introduced to the Vowpal Wabbit tool. This is an online learner which is able to deal with very large datasets on local machines by iteratively training a model using each new example.
Our strategy for using Vowpal Wabbit was as follows:
- The variables are encoded with a one-hot encoding: each unique value of a categorical variable is used as a boolean predictor.
- From our previous analysis (see section 2), we discovered that the training data came from the same month and year, so we extracted day-of-week and hour-of-day information from the "hour" variable in the dataset. The categorical variable "hour" itself was therefore not included as a feature in the model.
- Also from the previous analysis, we observed that the "device_id" and "device_ip" categories grow linearly with the number of examples, and we therefore removed them from our features.
- The categorical variable "id" was not used as a feature either, as it consists of a unique id for each row.
- The entire input file was then turned into vw encoding using another python script [github_link - adapted from Triskelion info@mlwave.com]; a simplified sketch of the encoding is shown after this list.
- The loss function used was the logistic loss, as defined by the Kaggle competition; this also makes sense for evaluating probabilities, since it is derived from a Bernoulli likelihood.
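As an illustration only, a simplified sketch of the row-to-VW encoding described above (the namespace f and the derived feature names are our own; the real conversion was done by the script linked above):
from datetime import datetime

def to_vw_line(row, header):
    # VW's logistic loss expects labels in {-1, 1}
    label = "1" if row[header.index("click")] == "1" else "-1"
    hh = row[header.index("hour")]  # YYMMDDHH string
    dow = datetime.strptime(hh[:6], "%y%m%d").strftime("%a")
    # drop id/click/hour plus the linearly-growing device columns
    skip = set(["id", "click", "hour", "device_id", "device_ip"])
    feats = " ".join("%s_%s" % (col, val)
                     for col, val in zip(header, row) if col not in skip)
    return "%s |f %s dow_%s hod_%s" % (label, feats, dow, hh[6:8])
Applied to the first data row shown earlier, this would yield a line beginning "-1 |f C1_1005 banner_pos_0 ...".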
The logarithmic loss (interpreted as the validation error) from our trained model showed an initial decrease (as expected from convergence) but then started to increase again. Looking into how to debug the model, we expect this could be due to one of the following:
1. The data is not shuffled but is ordered by some systematic component, meaning the learner is trained first on one subset of the data and then experiences another subset, and so on; the most obvious candidate is time 'seasonality'. A simple way to shuffle the encoded file is shown after this list.
2. The learning rate is too high and the algorithm is not converging properly. In section 7.4 we explain how to optimise the learning rate to help the algorithm converge.
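To address the first point, one simple option (assuming GNU coreutils; train.vw is a placeholder name of ours) is to shuffle the encoded file on disk before training:
shuf train.vw -o train_shuffled.vw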
In order to reduce the feature space further, we need to examine the contribution each feature makes to our Vowpal Wabbit logistic regression model. To do this we use the vw-varinfo utility to output the weighting of each feature within the logistic regression.
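A plausible invocation (vw-varinfo accepts the usual vw arguments; the file name train_reduced.vw is a placeholder of ours):
vw-varinfo --loss_function logistic train_reduced.vw > varinfo.txt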
We read in the varinfo text file to explore the features and their weights.
tester = []
path = "varinfo.txt"
with open(path,"r") as infile:
for row in infile:
row = row.strip("c^").split("\n")
tester.append(row[0])
the_brain = {}
feature_set = []
totalScores = []
for value in tester[1:]:
totalScores.append((value.split()[0], value.split()[5]))
feature = value.split()[0].rsplit('_', 1)[0]
score = float(value.split()[5].strip("%"))
if feature not in the_brain:
feature_set.append(feature)
the_brain[feature] = [score]
else:
the_brain[feature].append(score)
print len(feature_set)
26
fig, axes=plt.subplots(figsize=(23,15), nrows=6, ncols=5)
fig.subplots_adjust(hspace = 0.8)
fig.subplots_adjust(wspace = 0.6)
for ix in np.arange(6):
for ix2 in np.arange(5):
        if (ix*5 + ix2) < len(feature_set):
            axes[ix][ix2].hist(the_brain[feature_set[ix*5 + ix2]], bins=10)
            axes[ix][ix2].set_title(feature_set[ix*5 + ix2])
axes[ix][ix2].set_ylabel("Count")
axes[ix][ix2].set_xlabel("Weight")
remove_border(axes[ix][ix2])
else:
fig.delaxes(axes[ix][ix2])
We create an array collecting all the features whose relative weight lies between -1% and +1%, with the aim of removing those features from our training set to reduce the feature space.
lessThanAbsOne = [e1[0] for e1 in totalScores if abs(float(e1[1].strip("%"))) < 1]
We then create a csv file with these features, which we read into the python script that creates the vw encoding with those features removed.
with open('featureRemove.csv', 'w') as fp:
    writer = csv.writer(fp)
    # write one feature name per row
    writer.writerows([feature] for feature in lessThanAbsOne)
The learning rate for our model may be too high, preventing convergence and causing the logarithmic loss to increase. Vowpal Wabbit provides a golden-section search function to optimise the learning rate within a user-defined range, yielding the lowest validation error and improving convergence.
We run the following command in Vowpal Wabbit to search for the best learning rate parameter (on our training set with reduced features):
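A plausible form of this command uses the vw-hypersearch wrapper, which implements the golden-section search; the search bounds and the file name train_reduced.vw below are placeholders of ours, and % marks the parameter being searched:
vw-hypersearch 0.1 10 vw -l % --loss_function logistic -d train_reduced.vw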
In our case, the optimised learning rate turned out to be 4.5, which is quite high compared to the default value of 0.5. When we trained our model with this new parameter the log loss did decrease as expected, but the effect was negligible.
Vowpal Wabbit provides two parameters to add regularization to the model, namely L1 and L2, as described in this very useful post: http://fastml.com/large-scale-l1-feature-selection-with-vowpal-wabbit/
We run the following command in Vowpal Wabbit to search for the best regularization parameter l1 (on our training set with reduced features):
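As above, a plausible form of the command (the bounds and file name are again placeholders of ours; % marks the searched parameter):
vw-hypersearch -L 1e-10 1e-4 vw --l1 % --loss_function logistic -d train_reduced.vw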
The -L flag makes the search run in log-space, which we chose to increase speed.
In our case, the optimised L1 regularization parameter was of the order of 1e-8; since this is so close to zero, we chose to leave regularization off and train our dataset without it.
Considering the possible bias of the model towards a sub-section of the training data (observed as a bump in the error rate), bootstrapping and multiple passes through the training data were explored to improve the prediction. Below is a graph of the log-loss error for our initial model and for the model using bootstrapping and multiple passes. Vowpal Wabbit allows multiple passes over the dataset during training, so each row is used multiple times; however, this risks overfitting to the training data, reducing the error rate at the cost of a non-generalisable model.
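A sketch of the kind of invocation this involves (the pass and bootstrap counts and the file names are our assumptions; --passes requires the cache flag -c):
vw -d train_reduced.vw --loss_function logistic -c --passes 5 --bootstrap 10 -f model.vw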
path = "log_loss_.csv"
counter = 0
boot_res_x = []
boot_res_y = []
first_res_x = []
first_res_y = []
with open(path, "r") as data:
for line in data:
line = line.replace(" ", ",")
line = line.split(",")
line = [e.strip("\n") for e in line if e != ""]
        # convert the parsed strings to floats so matplotlib plots numeric values
        if counter > 1 and line[6] != '26':
            boot_res_y.append(float(line[0]))
            boot_res_x.append(float(line[3]))
        if counter > 1 and line[6] == '26':
            first_res_y.append(float(line[0]))
            first_res_x.append(float(line[3]))
counter = counter + 1
# Plots
plt.plot(boot_res_x, boot_res_y, color = "b", label = "bootstrap samples")
plt.plot(first_res_x, first_res_y, color = "r", label = "simple model")
plt.ylim(0.2, 0.5)
remove_border()
plt.title("Log loss versus number of training examples")
plt.ylabel("Log loss")
plt.xlabel("Number of training examples")
plt.legend()
Note: the blue plot shows many more training examples because multiple passes were run over the dataset.
The above plot shows that even with bootstrapping and multiple passes we still observe the "bump" of increasing log-loss error before stabilisation; this suggests the data would still benefit from being shuffled and probably suffers from some seasonality effects. However, the bump has been reduced, so bootstrapping and multiple passes do seem to have helped.