The time has arrived: the time to learn. Will we succeed? Let's find out.
%matplotlib inline
%config InlineBackend.figure_format='retina'
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import cm as cmap
from sklearn.cross_validation import StratifiedShuffleSplit  # moved to sklearn.model_selection in scikit-learn 0.18
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import LabelEncoder
sns.set(font='sans')
Random forest
When I needed to select an algorithm for training, I didn't know which of the ones I had studied would work best. So I asked my thesis supervisor, Fernando Sancho, Ph.D., and his recommendation, based on his experience with other related projects, was to use random forests.
In the feature_columns
variable shown in the next piece of code, we can see which attributes are going to be used for predicting.
labelize_columns = ['medallion', 'hack_license', 'vendor_id']
interize_columns = ['pickup_month', 'pickup_weekday', 'pickup_non_working_today', 'pickup_non_working_tomorrow']
feature_columns = ['medallion', 'hack_license', 'vendor_id', 'pickup_month', 'pickup_weekday', 'pickup_day',
'pickup_time_in_mins', 'pickup_non_working_today', 'pickup_non_working_tomorrow', 'fare_amount',
'surcharge', 'tolls_amount', 'passenger_count', 'trip_time_in_secs', 'trip_distance', 'pickup_longitude',
'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude']
class_column = 'tip_label'
data = pd.read_csv('../data/dataset/dataset.csv')
Before starting the training, we need to transform the non-numeric attributes into numeric ones so that they can be used with scikit-learn.
for column in labelize_columns:
    real_column = data[column].values
    le = LabelEncoder()
    le.fit(real_column)
    labelized_column = le.transform(real_column)
    data[column] = labelized_column
le = None
real_column = None
labelized_column = None
for column in interize_columns:
    data[column] = data[column].astype(int)
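As a quick illustration of what the loop above does (using made-up values, not rows from the dataset), LabelEncoder assigns each distinct string an integer code, in sorted order of the unique values:

```python
from sklearn.preprocessing import LabelEncoder

# Toy column standing in for something like vendor_id:
le = LabelEncoder()
encoded = le.fit_transform(['VTS', 'CMT', 'VTS', 'CMT', 'VTS'])
print(list(le.classes_))  # ['CMT', 'VTS']
print(list(encoded))      # [1, 0, 1, 0, 1]
```

The same mapping can be obtained with plain NumPy via np.unique(values, return_inverse=True), which is handy when scikit-learn is not around.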
Let's start the training! We are going to train a random forest with 256 trees, evaluating it over 10 iterations of a stratified shuffle split (90% train / 10% test).
data_features = data[feature_columns].values
data_classes = data[class_column].values
cross_validation = StratifiedShuffleSplit(data_classes, n_iter=10, test_size=0.1, random_state=0)
scores = []
confusion_matrices = []
for train_index, test_index in cross_validation:
    data_features_train, data_classes_train = data_features[train_index], data_classes[train_index]
    data_features_test, data_classes_test = data_features[test_index], data_classes[test_index]
    # You need at least 16 GB of RAM for predicting 6 classes with 256 trees.
    # Of course, you can use a lower number of trees, but you'll gradually
    # notice worse performance.
    clf = RandomForestClassifier(n_estimators=256, n_jobs=-1)
    clf.fit(data_features_train, data_classes_train)
    # Saving the scores.
    test_score = clf.score(data_features_test, data_classes_test)
    scores.append(test_score)
    # Saving the confusion matrices.
    data_classes_pred = clf.predict(data_features_test)
    cm = confusion_matrix(data_classes_test, data_classes_pred)
    confusion_matrices.append(cm)
    clf = None
print('Accuracy mean: ' + str(np.mean(scores)))
print('Accuracy std: ' + str(np.std(scores)))
Accuracy mean: 0.529469 Accuracy std: 0.000902712024956
A prediction accuracy of 52.95%. What happened?
As I am not a machine learning expert, I'm not 100% sure what caused this bad result. It is an indicator that I still have more machine learning theory to study, something I'm willing to do (spoiler!), especially after the results we will obtain in the [next notebook](6. Learning, a better way.ipynb).
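One quick sanity check (not part of the original analysis) is to compare the accuracy against a majority-class baseline: a classifier that always predicts the most frequent class. The class counts below are hypothetical, just to show the arithmetic:

```python
import numpy as np

# Hypothetical per-class trip counts for the six tip classes:
counts = np.array([50, 120, 260, 180, 90, 40])

# Accuracy of always predicting the most frequent class:
majority_baseline = counts.max() / counts.sum()
print(round(majority_baseline, 4))  # 0.3514
```

If 52.95% sits well above this kind of baseline, the model is learning something; if it sits close to it, the features carry little signal for the chosen classes.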
To look for the reason behind the bad accuracy, let's use another tool for measuring performance: a confusion matrix.
classes = [' ', '[0-10)', '[10-15)', '[15-20)', '[20-25)', '[25-30)', '[30-inf)']
first = True
cm = None
for cm_iter in confusion_matrices:
    if first:
        cm = cm_iter.copy()
        first = False
    else:
        cm = cm + cm_iter
fig, axes = plt.subplots()
colorbar = axes.matshow(cm, cmap=cmap.Blues)
fig.colorbar(colorbar, ticks=[0, 25000, 50000, 75000, 100000, 125000, 150000, 175000, 200000, 225000, 250000])
axes.set_xlabel('Predicted class', fontsize=15)
axes.set_ylabel('True class', fontsize=15)
axes.set_xticklabels(classes)
axes.set_yticklabels(classes)
axes.tick_params(labelsize=12)
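As a side note, the first/copy accumulation loop in the cell above can be collapsed into a single NumPy call, since summing a list of equally-shaped matrices element-wise is exactly np.sum over axis 0 (the matrices below are made-up stand-ins for the real per-split ones):

```python
import numpy as np

# Toy per-split confusion matrices (hypothetical values):
confusion_matrices = [np.array([[5, 1], [2, 4]]),
                      np.array([[6, 0], [1, 5]])]

# Element-wise sum across the list, equivalent to the accumulation loop:
cm = np.sum(confusion_matrices, axis=0)
print(cm)  # [[11  1]
           #  [ 3  9]]
```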
This is pretty strange. It looks like nearly every prediction falls into just two of the classes! Let's check how the tip is distributed in the dataset.
tip = data.groupby('tip_perc').size()
tip.index = np.floor(tip.index)
ax = tip.groupby(tip.index).sum().plot(kind='bar', figsize=(15, 5))
ax.set_xlabel('floor(tip_perc)', fontsize=18)
ax.set_ylabel('number of trips', fontsize=18)
ax.tick_params(labelsize=12)
tip = None
By looking at the previous figure, we can say that the social norm is to tip 20% of the charge. Perhaps that is the real question: whether a tip will be above or below that norm.
To answer it, let's reduce the classes to only two:
$$ ``{<}\,20\text{''} \quad \text{and} \quad ``{\geq}\,20\text{''} $$
tip_labels = ['< 20', '>= 20']
tip_ranges_by_label = [[0.0, 20.0], [20.0, 51.0]]
for i, tip_label in enumerate(tip_labels):
    tip_mask = ((data.tip_perc >= tip_ranges_by_label[i][0]) & (data.tip_perc < tip_ranges_by_label[i][1]))
    data.loc[tip_mask, 'tip_label'] = tip_label
tip_mask = None
data.to_csv('../data/dataset/dataset.csv', index=False)
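Since there are only two classes split at a single threshold, the relabeling loop above reduces to one vectorized comparison. A minimal sketch on a hypothetical mini-frame (assuming, as in the original ranges, that tip_perc stays below the 51.0 upper bound):

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for the real dataset:
data = pd.DataFrame({'tip_perc': [0.0, 12.5, 20.0, 35.0]})

# Everything below 20 is '< 20'; everything else is '>= 20':
data['tip_label'] = np.where(data.tip_perc < 20.0, '< 20', '>= 20')
print(data.tip_label.tolist())  # ['< 20', '< 20', '>= 20', '>= 20']
```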
Will this change work? Let's find out in the [next notebook](6. Learning, a better way.ipynb).