The size of the dataset makes it impossible to perform the learning in a reasonably short time. One solution is to select a sample from the dataset and learn from it. Let's take a random sample of 1,000,000 trips, distributed proportionally across the months of the year.
%matplotlib inline
%config InlineBackend.figure_format='retina'
from __future__ import division
import datetime
import os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
sns.set(font='sans')
read_path = '../data/cleaned/cleaned_{0}.csv'
save_path = '../data/dataset/dataset.csv'
# This auxiliary function applies another one, chunk by chunk, to every row in a DataFrame to fill new columns.
def iterate_and_apply(dataframe, function, necessary_columns):
    perform = True
    step = 100000
    start = 0
    to = step
    while perform:
        new_columns = dataframe[start:to][necessary_columns].apply(function, axis=1)
        if len(new_columns) == 0:
            perform = False
        else:
            dataframe.update(new_columns)
            new_columns = None
            start += step
            to += step
    return dataframe
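To see the pattern in isolation, here is a minimal sketch of the same chunked apply-and-update loop on a toy DataFrame, with a small step so several chunks are exercised (the names `toy` and the doubling function are only for illustration):

```python
import numpy as np
import pandas as pd

# Same chunked-apply pattern as iterate_and_apply, with a tiny step
# so that several chunks (and the final empty chunk) are exercised.
def chunked_apply(dataframe, function, necessary_columns, step=3):
    start, to = 0, step
    while True:
        new_columns = dataframe[start:to][necessary_columns].apply(function, axis=1)
        if len(new_columns) == 0:
            break
        dataframe.update(new_columns)  # aligns on index, fills the target columns
        start += step
        to += step
    return dataframe

toy = pd.DataFrame({'value': range(10)})
toy['double'] = np.nan  # the target column must exist before update() can fill it
toy = chunked_apply(toy, lambda row: pd.Series({'double': row.value * 2}), ['value'])
print(toy['double'].tolist())
```

`DataFrame.update` only overwrites existing columns, which is why the target columns are created (filled with `NaN`) before the loop runs, as the notebook does below.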
complete_length = 88156805
final_length = 1000000
current_length = 0
first = True
data = None
months = range(1, 13)
for month in months:
    data_aux = pd.read_csv(read_path.format(month), index_col=0)
    if month != 12:
        this_length = int(final_length * (data_aux.shape[0] / complete_length))
    else:
        this_length = final_length - current_length
    current_length += this_length
    data_aux = data_aux.loc[np.random.choice(data_aux.index, this_length, replace=False)].copy()
    data_aux = data_aux.reset_index(drop=True)
    if first:
        data = data_aux.copy()
        first = False
    else:
        data = pd.concat([data, data_aux], ignore_index=True)
    data_aux = None
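The allocation logic above can be checked in isolation: each month contributes a rounded-down share proportional to its size, and December absorbs the rounding remainder so the total is exactly `final_length`. A minimal sketch with made-up monthly trip counts (the real counts come from the cleaned CSVs):

```python
from __future__ import division

complete_length = 88156805
final_length = 1000000

# Hypothetical monthly trip counts that sum to complete_length.
monthly_counts = [7000000, 6500000, 7400000, 7300000, 7500000, 7200000,
                  7600000, 7456805, 7100000, 7700000, 7300000, 8100000]
assert sum(monthly_counts) == complete_length

sizes = []
current_length = 0
for month, count in enumerate(monthly_counts, start=1):
    if month != 12:
        this_length = int(final_length * (count / complete_length))
    else:
        this_length = final_length - current_length  # December absorbs the rounding
    current_length += this_length
    sizes.append(this_length)

print(sum(sizes))  # exactly final_length
```

Each `int(...)` truncation loses at most one trip per month, so December's remainder is only slightly larger than its proportional share.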
Now that we have obtained the sample from the dataset, let's create new attributes. This task would be impossible to perform on the entire dataset, but now is the perfect time: the data is clean and small.
datetime attributes
The date and time attribute that currently exists can't be fed directly to scikit-learn's algorithms, so we need to decompose it. Here is a proposed list of attributes extracted from pickup_datetime, explaining those that need it:
- pickup_month
- pickup_weekday
- pickup_day
- pickup_time_in_mins: the pickup time expressed as minutes since midnight.
- pickup_non_working_today: whether the pickup day is a weekend day or a holiday.
- pickup_non_working_tomorrow: whether the day after the pickup is a weekend day or a holiday.
To create the last two attributes, the dataset nyc_2013_holidays.csv
(mentioned in the [first notebook](1. Preparing the environment.ipynb)) is needed.
# A mini-dataset with the 2013 holidays in NYC.
annual_holidays = pd.read_csv('../data/nyc_2013_holidays.csv')
# Columns needed.
datetime_necesary_columns = ['pickup_datetime']
datetime_column_names = ['pickup_month', 'pickup_weekday', 'pickup_day', 'pickup_time_in_mins', 'pickup_non_working_today',
'pickup_non_working_tomorrow']
# It says whether a day is a holiday in NYC.
def is_in_annual_holidays(the_day):
    return annual_holidays[(annual_holidays.month == the_day.month) & (annual_holidays.day == the_day.day)].shape[0] > 0
# It calculates the attributes derived from 'pickup_datetime'.
def calculate_datetime_extra(row):
    dt = datetime.datetime.strptime(row.pickup_datetime, '%Y-%m-%d %H:%M:%S')
    pickup_month = dt.month
    pickup_weekday = dt.weekday()
    pickup_day = dt.day
    pickup_time_in_mins = (dt.hour * 60) + dt.minute
    pickup_non_working_today = int((pickup_weekday == 5) or (pickup_weekday == 6) or is_in_annual_holidays(dt))
    pickup_non_working_tomorrow = int((pickup_weekday == 4) or (pickup_weekday == 5) or
                                      is_in_annual_holidays(dt + datetime.timedelta(days=1)))
    return pd.Series({
        datetime_column_names[0]: pickup_month,
        datetime_column_names[1]: pickup_weekday,
        datetime_column_names[2]: pickup_day,
        datetime_column_names[3]: pickup_time_in_mins,
        datetime_column_names[4]: pickup_non_working_today,
        datetime_column_names[5]: pickup_non_working_tomorrow
    })
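To illustrate what this decomposition produces, here is the same logic applied to a single hypothetical timestamp, with a tiny stand-in for nyc_2013_holidays.csv (the real file holds the full holiday list, and the helper is redefined here only to keep the sketch self-contained):

```python
import datetime
import pandas as pd

# Tiny stand-in for nyc_2013_holidays.csv: just July 4th and December 25th.
annual_holidays = pd.DataFrame({'month': [7, 12], 'day': [4, 25]})

def is_in_annual_holidays(the_day):
    return annual_holidays[(annual_holidays.month == the_day.month) &
                           (annual_holidays.day == the_day.day)].shape[0] > 0

# 2013-07-03 was a Wednesday, and the next day (July 4th) is a holiday.
dt = datetime.datetime.strptime('2013-07-03 14:30:00', '%Y-%m-%d %H:%M:%S')
print(dt.month, dt.weekday(), dt.day)        # month, weekday (Mon=0), day
print((dt.hour * 60) + dt.minute)            # minutes since midnight: 870
print(int(is_in_annual_holidays(dt)))        # 0: a regular working day
print(int(is_in_annual_holidays(dt + datetime.timedelta(days=1))))  # 1: holiday eve
```

This is why pickup_non_working_tomorrow checks weekdays 4 and 5 (Friday and Saturday): those are the days whose following day is a weekend day.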
for column in datetime_column_names:
    data[column] = np.nan
data = iterate_and_apply(data, calculate_datetime_extra, datetime_necesary_columns)
Predicting the tip percentage of a trip is a regression problem. Given the human nature of the data, and the noise that we couldn't clean, performing a regression could give us disastrous results. Instead, we can turn it into a classification problem. Let's create a new attribute to predict: a total of six labels, denoting ranges of tip percentages:
$$ [0,\:10),\:[10,\:15),\:[15,\:20),\:[20,\:25),\:[25,\:30)\:and\:[30,\:+\infty) $$
tip_label_column_name = 'tip_label'
tip_labels = ['[0-10)', '[10-15)', '[15-20)', '[20-25)', '[25-30)', '[30-inf)']
tip_ranges_by_label = [[0.0, 10.0], [10.0, 15.0], [15.0, 20.0], [20.0, 25.0], [25.0, 30.0], [30.0, 51.0]]
data[tip_label_column_name] = ''
for i, tip_label in enumerate(tip_labels):
    tip_mask = ((data.tip_perc >= tip_ranges_by_label[i][0]) & (data.tip_perc < tip_ranges_by_label[i][1]))
    data.loc[tip_mask, tip_label_column_name] = tip_label
    tip_mask = None
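As a side note, the same labelling can be done in one step with pandas' `pd.cut` instead of one mask per label. A minimal sketch on made-up tip percentages, with bin edges mirroring `tip_ranges_by_label` and `float('inf')` standing in for the open-ended upper range:

```python
import pandas as pd

tip_labels = ['[0-10)', '[10-15)', '[15-20)', '[20-25)', '[25-30)', '[30-inf)']
edges = [0.0, 10.0, 15.0, 20.0, 25.0, 30.0, float('inf')]

# Hypothetical tip percentages, one per intended bin (plus the left edge).
tip_perc = pd.Series([0.0, 9.9, 12.5, 18.0, 22.0, 27.5, 45.0])

# right=False makes each bin half-open on the right: [a, b), matching the masks above.
labels = pd.cut(tip_perc, bins=edges, labels=tip_labels, right=False)
print(labels.tolist())
```

The mask-per-label loop and `pd.cut` assign identical labels here; `pd.cut` is simply more concise and avoids chained indexing.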
ax = data.groupby('tip_label').size().plot(kind='bar')
ax.set_xlabel('tip_label', fontsize=18)
ax.set_ylabel('Number of trips', fontsize=18)
ax.tick_params(labelsize=12)
These are the classes that we are going to predict!
Finally, we are going to save this dataset with a particular column order, and use it in the [next notebook](5. Learning, badly.ipynb).
tip_label_order = ['medallion', 'hack_license', 'vendor_id', 'pickup_datetime', 'pickup_month', 'pickup_weekday',
'pickup_day', 'pickup_time_in_mins', 'pickup_non_working_today', 'pickup_non_working_tomorrow',
'fare_amount', 'surcharge', 'tip_amount', 'tip_perc', 'tip_label', 'tolls_amount', 'total_amount',
'passenger_count', 'trip_time_in_secs', 'trip_distance', 'pickup_longitude', 'pickup_latitude',
'dropoff_longitude', 'dropoff_latitude']
data = data.reindex(columns=tip_label_order)
if not os.path.exists('../data/dataset/'):
    os.makedirs('../data/dataset/')
data.to_csv(save_path, index=False)