We'll be working with a dataset that can be difficult to predict accurately! We'll learn why it's important to go beyond accuracy when working on classification problems, hypothesize some approaches to working with the features we have, and discuss what we gain and lose with each.
Included in the repository today is a used car sale data set. Many features relate to cost; all of them are defined in the data dictionary here:
lines = open('../data/lemons_description.txt')
for line in lines:
    print line.strip()
Field Name                          Definition
RefID                               Unique (sequential) number assigned to vehicles
IsBadBuy                            Identifies if the kicked vehicle was an avoidable purchase
PurchDate                           The Date the vehicle was Purchased at Auction
Auction                             Auction provider at which the vehicle was purchased
VehYear                             The manufacturer's year of the vehicle
VehicleAge                          The Years elapsed since the manufacturer's year
Make                                Vehicle Manufacturer
Model                               Vehicle Model
Trim                                Vehicle Trim Level
Submodel                            Vehicle Submodel
Color                               Vehicle Color
Transmission                        Vehicles transmission type (Automatic, Manual)
WheelTypeID                         The type id of the vehicle wheel
WheelType                           The vehicle wheel type description (Alloy, Covers)
VehOdo                              The vehicles odometer reading
Nationality                         The Manufacturer's country
Size                                The size category of the vehicle (Compact, SUV, etc.)
TopThreeAmericanName                Identifies if the manufacturer is one of the top three American manufacturers
MMRAcquisitionAuctionAveragePrice   Acquisition price for this vehicle in average condition at time of purchase
MMRAcquisitionAuctionCleanPrice     Acquisition price for this vehicle in the above Average condition at time of purchase
MMRAcquisitionRetailAveragePrice    Acquisition price for this vehicle in the retail market in average condition at time of purchase
MMRAcquisitonRetailCleanPrice       Acquisition price for this vehicle in the retail market in above average condition at time of purchase
MMRCurrentAuctionAveragePrice       Acquisition price for this vehicle in average condition as of current day
MMRCurrentAuctionCleanPrice         Acquisition price for this vehicle in the above condition as of current day
MMRCurrentRetailAveragePrice        Acquisition price for this vehicle in the retail market in average condition as of current day
MMRCurrentRetailCleanPrice          Acquisition price for this vehicle in the retail market in above average condition as of current day
PRIMEUNIT                           Identifies if the vehicle would have a higher demand than a standard purchase
AcquisitionType                     Identifies how the vehicle was aquired (Auction buy, trade in, etc)
AUCGUART                            The level guarntee provided by auction for the vehicle (Green light - Guaranteed/arbitratable, Yellow Light - caution/issue, red light - sold as is)
KickDate                            Date the vehicle was kicked back to the auction
BYRNO                               Unique number assigned to the buyer that purchased the vehicle
VNZIP                               Zipcode where the car was purchased
VNST                                State where the the car was purchased
VehBCost                            Acquisition cost paid for the vehicle at time of purchase
IsOnlineSale                        Identifies if the vehicle was originally purchased online
WarrantyCost                        Warranty price (term=36month and millage=36K)
import pandas as pd
from sklearn import tree
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import cross_val_score
# Load in data and create sets. dropping all na columns on the live data set.
lemons = pd.read_csv('../data/lemons.csv')
lemons_oos = pd.read_csv('../data/lemons_oos.csv')
print lemons.dtypes
RefId                                  int64
IsBadBuy                               int64
PurchDate                             object
Auction                               object
VehYear                                int64
VehicleAge                             int64
Make                                  object
Model                                 object
Trim                                  object
SubModel                              object
Color                                 object
Transmission                          object
WheelTypeID                          float64
WheelType                             object
VehOdo                                 int64
Nationality                           object
Size                                  object
TopThreeAmericanName                  object
MMRAcquisitionAuctionAveragePrice    float64
MMRAcquisitionAuctionCleanPrice      float64
MMRAcquisitionRetailAveragePrice     float64
MMRAcquisitonRetailCleanPrice        float64
MMRCurrentAuctionAveragePrice        float64
MMRCurrentAuctionCleanPrice          float64
MMRCurrentRetailAveragePrice         float64
MMRCurrentRetailCleanPrice           float64
PRIMEUNIT                             object
AUCGUART                              object
BYRNO                                  int64
VNZIP1                                 int64
VNST                                  object
VehBCost                             float64
IsOnlineSale                           int64
WarrantyCost                           int64
dtype: object
Below is a very simple "benchmark" script. One common test for model evaluation is whether we can do better than random guessing, which we can check with sklearn's DummyClassifier.
lemons = lemons.dropna(axis=1)
# Generating a list of continuous data features from the describe dataframe.
# Then, removing the two non-features (RefId is an index, IsBadBuy is the prediction value)
features = list(lemons.describe().columns)
features.remove('RefId')
features.remove('IsBadBuy')
best_score = -1
for depth in range(1, 10):
    scores = cross_val_score(tree.DecisionTreeClassifier(max_depth=depth, random_state=1234),
                             lemons[features],
                             lemons.IsBadBuy,
                             scoring='roc_auc',
                             cv=5)
    if scores.mean() > best_score:
        best_depth = depth
        best_score = scores.mean()
# Is the best score we have better than each DummyClassifier type?
from sklearn import dummy, metrics
for strat in ['stratified', 'most_frequent', 'uniform']:
    dummyclf = dummy.DummyClassifier(strategy=strat).fit(lemons[features], lemons.IsBadBuy)
    print 'did better than %s?' % strat, metrics.roc_auc_score(lemons.IsBadBuy, dummyclf.predict(lemons[features])) < best_score
# seems so!
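# Create a classifier and prediction, using the depth that scored best in cross validation
# (depth itself is just the last value the loop tried).
clf = tree.DecisionTreeClassifier(max_depth=best_depth, random_state=1234).fit(lemons[features], lemons.IsBadBuy)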
y_pred = clf.predict(lemons_oos[features])
# Create a submission
submission = pd.DataFrame({ 'RefId' : lemons_oos.RefId, 'prediction' : y_pred })
submission.to_csv('submission.csv')
did better than stratified? True
did better than most_frequent? True
did better than uniform? True
This is a good start for us!
In order for us to work on improving this model, we'll have to continue exploring the data set available, impute missing values, create new features, and scale numerical values.
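As a minimal sketch of the imputation step: the benchmark above simply drops every column that contains a missing value, whereas one common alternative (an assumption here, not something the benchmark does) is to fill missing numeric values with the training median and keep a flag marking which rows were filled. The arrays hold made-up values and impute_with_median is a hypothetical helper name.

```python
import numpy as np

def impute_with_median(train_col, other_col):
    """Fill NaNs in both arrays with the median of the non-missing training values.

    Returns the filled training array, the filled other array, and a 0/1
    array flagging which training rows were originally missing.
    """
    train_col = np.asarray(train_col, dtype=float)
    other_col = np.asarray(other_col, dtype=float)
    median = np.nanmedian(train_col)                 # median that ignores NaNs
    was_missing = np.isnan(train_col).astype(int)    # keep the "was missing" signal
    train_filled = np.where(np.isnan(train_col), median, train_col)
    other_filled = np.where(np.isnan(other_col), median, other_col)
    return train_filled, other_filled, was_missing

# Made-up values standing in for a price column in the training and out-of-sample sets
train = [4000.0, np.nan, 6000.0, 5000.0]
oos = [np.nan, 7000.0]
train_filled, oos_filled, flag = impute_with_median(train, oos)
```

The median is computed on the training column only and reused for the out-of-sample column, so nothing about the out-of-sample data leaks into the fill value.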
Feature scaling can play a significant role in the performance of our model. The techniques that follow matter most for algorithms that learn weights or compare distances (logistic regression, SVMs, and k-nearest neighbors, for example); tree-based classifiers like the decision tree above split on thresholds and are much less sensitive to scale.
One common technique is to subtract the mean from all values in a feature. This effectively "zeroes" the feature, so every feature shares the same center and is easier for the model to compare.
$x' = x_0 - \operatorname{mean}(x)$
Another common technique takes the above one step further: not only do we center the data on 0, but we also bound the range the data resides in (-1 to 1 or 0 to 1 are typical).
Normalizing to 0 and 1 (where 0 remains 0)
$x' = \dfrac{x_0}{\max(x)}$
Normalizing to 0 and 1 (where min == 0)
$x' = \dfrac{x_0 - \min(x)}{\max(x) - \min(x)}$
Normalizing to -1 and 1 (where mean == 0)
$x' = \dfrac{x_0 - \operatorname{mean}(x)}{\max(x) - \operatorname{mean}(x)}$
Standardization using mean and standard deviation (where mean = 0)
Standardization is a slightly different way of normalizing: each value is expressed as the number of standard deviations it sits from the mean.
$x' = \dfrac{x_0 - \operatorname{mean}(x)}{\operatorname{std}(x)}$
Assuming the input of an array, how would we end up writing code to handle each of these transformations?
from __future__ import division  # __future__ imports must come first
import numpy as np

class Transformations(object):
    """since these transformations are all related, we'll nest them all under a feature norm class"""

    def mean_at_zero(self, arr):
        return np.array([i - np.mean(arr) for i in arr])

    def norm_to_min_zero(self, arr):
        """divides by the max, so 0 maintains its 0 value and the max maps to 1"""
        return np.array([i / max(arr) for i in arr])

    # the three stubs below are left to fill in; until then they return None
    def norm_to_absolute_min_zero(self, arr):
        """should be a range of 0 to 1, where the minimum maps to 0"""

    def norm_to_neg_pos(self, arr):
        """should be a range of -1 to 1, where 0 represents the mean"""

    def norm_by_std(self, arr):
        """should be a range where 0 represents the mean"""
## tests to make sure we built this correctly:
transformer = Transformations()
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print transformer.mean_at_zero(a) == np.array([-2, -1, 0, 1, 2])
print transformer.norm_to_min_zero(a) == np.array([0.2, 0.4, 0.6, 0.8, 1.0])
print transformer.norm_to_absolute_min_zero(a) == np.array([0.0, 0.25, 0.5, 0.75, 1.0])
print transformer.norm_to_neg_pos(a) == np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
print transformer.norm_by_std(a) == np.array([-1.414213562373095, -0.7071067811865475, 0.0, 0.7071067811865475, 1.414213562373095])
[ True  True  True  True  True]
[ True  True  True  True  True]
False
False
False
-c:28: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future.
-c:29: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future.
-c:30: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future.
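The last three checks print False because those methods are still empty and return None. One possible way to fill them in is sketched below as standalone functions rather than methods. This is an assumption about the intended solution, not the class answer key: it uses vectorised numpy arithmetic instead of list comprehensions, np.std with its default population standard deviation (which is what the expected values in the tests imply), and np.allclose instead of == because exact equality on floats is fragile.

```python
import numpy as np

def mean_at_zero(arr):
    """Center on 0 by subtracting the mean."""
    arr = np.asarray(arr, dtype=float)
    return arr - arr.mean()

def norm_to_min_zero(arr):
    """Divide by the max: 0 stays 0 and the max maps to 1."""
    arr = np.asarray(arr, dtype=float)
    return arr / arr.max()

def norm_to_absolute_min_zero(arr):
    """Min-max scaling: the min maps to 0 and the max maps to 1."""
    arr = np.asarray(arr, dtype=float)
    return (arr - arr.min()) / (arr.max() - arr.min())

def norm_to_neg_pos(arr):
    """Center on the mean, then divide by (max - mean) so the max maps to 1."""
    arr = np.asarray(arr, dtype=float)
    return (arr - arr.mean()) / (arr.max() - arr.mean())

def norm_by_std(arr):
    """Standardize: mean 0, population standard deviation 1."""
    arr = np.asarray(arr, dtype=float)
    return (arr - arr.mean()) / arr.std()

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
all_pass = all([
    np.allclose(mean_at_zero(a), [-2, -1, 0, 1, 2]),
    np.allclose(norm_to_min_zero(a), [0.2, 0.4, 0.6, 0.8, 1.0]),
    np.allclose(norm_to_absolute_min_zero(a), [0.0, 0.25, 0.5, 0.75, 1.0]),
    np.allclose(norm_to_neg_pos(a), [-1.0, -0.5, 0.0, 0.5, 1.0]),
    np.allclose(norm_by_std(a), [-1.414213562373095, -0.7071067811865475, 0.0,
                                 0.7071067811865475, 1.414213562373095]),
])
```

Note that norm_to_neg_pos only reaches exactly -1 at the minimum when the data is symmetric around its mean, as it is in this test array.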
Scikit-learn also has functions that handle some of this:
from sklearn import preprocessing
print a
print preprocessing.scale(a, with_mean=True, with_std=False)
print preprocessing.scale(a, with_mean=True, with_std=True)
[ 1.  2.  3.  4.  5.]
[-2. -1.  0.  1.  2.]
[-1.41421356 -0.70710678  0.          0.70710678  1.41421356]
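When a model is trained on one data set and applied to another (as with lemons and lemons_oos), the mean and standard deviation are usually learned from the training data only and then reused on the new data. Here is a minimal numpy sketch of that idea with made-up numbers; sklearn's preprocessing.StandardScaler offers the same fit-then-transform pattern.

```python
import numpy as np

train = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # made-up training feature
oos = np.array([3.0, 6.0])                     # made-up out-of-sample feature

# "Fit": learn the statistics from the training data only.
mu = train.mean()
sigma = train.std()   # population std (ddof=0), matching preprocessing.scale

# "Transform": apply the same statistics to both sets.
train_scaled = (train - mu) / sigma
oos_scaled = (oos - mu) / sigma
```

The out-of-sample values are not guaranteed to have mean 0 or unit variance after this step, and that is expected: they are expressed relative to the training distribution.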
In class, we previously learned about transformations that make a relationship linear, particularly taking logs when the data follows a straight line on log-log axes.
However, taking logs won't always give us a perfect linear fit. Instead, we'd ideally like to solve for the power law $y = a x^{k}$, which means identifying the curve's amplitude ($a$) and index ($k$). We can do this with one extra step: fitting a line to the log10-transformed data, where $\log_{10} y = \log_{10} a + k \log_{10} x$, by minimizing the error. scipy's optimize.leastsq function solves this for us.
Another option could be to experiment with the plfit library (not included in anaconda).
%matplotlib inline
import scipy as sp
import scipy.optimize  # make sure the sp.optimize submodule is loaded
import matplotlib.pyplot as plt
import seaborn as sns
class PowerLaw(object):

    def fit(self, x, y, transform=True):
        """
        returns back the amplitude and index of a powerlaw relationship.
        assumes the data is not already log10 transformed.
        return: [amp, index], also stored on the instance
        """
        if transform:
            x = np.log10(x)
            y = np.log10(y)
        # define our (line) fitting function and error function to optimize on
        fitfunc = lambda p, x: p[0] + p[1] * x
        errfunc = lambda p, x, y: (y - fitfunc(p, x))
        # defines a starting point to optimize from.
        p_init = [1.0, -1.0]
        out = sp.optimize.leastsq(errfunc, p_init, args=(x, y), full_output=1)
        result = out[0]
        self.index = result[1]
        self.amp = 10.0**result[0]
        return np.array([self.amp, self.index])

    def transform(self, x):
        """returns the x-transformed data"""
        return self.amp * (x**self.index)
xdata=np.array([ 0.00010851, 0.00021701, 0.00043403, 0.00086806, 0.00173611, 0.00347222])
ydata=np.array([ 29.56241016, 29.82245508, 25.33930469, 19.97075977, 12.61276074, 7.12695312])
powerlaw = PowerLaw()
powerlaw.fit(xdata, ydata)
print 'amp:',powerlaw.amp, 'index', powerlaw.index
sns.set_style('white')
plt.figure()
plt.subplot(2, 1, 1)
plt.plot(xdata, powerlaw.transform(xdata))
plt.plot(xdata, ydata)
plt.text(0.0020, 30, 'Ampli = %5.2f' % powerlaw.amp)
plt.text(0.0020, 25, 'Index = %5.2f' % powerlaw.index)
plt.xlabel('X')
plt.ylabel('Y')
plt.subplot(2, 1, 2)
plt.loglog(xdata, powerlaw.transform(xdata))
plt.plot(xdata, ydata)
plt.xlabel('X (log scale)')
plt.ylabel('Y (log scale)')
amp: 0.895599738306 index -0.409432655356
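Because the fit is a straight line in log-log space, ordinary least squares should give the same answer without an iterative optimizer. Here is a hedged cross-check using np.polyfit on the same data; the variable names are illustrative.

```python
import numpy as np

xdata = np.array([0.00010851, 0.00021701, 0.00043403, 0.00086806, 0.00173611, 0.00347222])
ydata = np.array([29.56241016, 29.82245508, 25.33930469, 19.97075977, 12.61276074, 7.12695312])

# Degree-1 fit of log10(y) against log10(x):
# the slope is the index and the intercept is log10(amplitude).
slope, intercept = np.polyfit(np.log10(xdata), np.log10(ydata), 1)
index = slope
amp = 10.0 ** intercept
```

The values should agree with the leastsq result printed above to several decimal places, since both minimize the same squared error.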
We have introduced several different concepts in class for understanding model performance over the last few weeks; here is one place we can refer to for both regression and classification problems.
R-Squared
Definition: On a scale that usually runs from 0 to 1, how much of the variance in our data does this regression explain?
note: can be negative if the model fits worse than simply predicting the mean of y, though this rarely occurs on training data
math: $R^2 = 1 - \dfrac{SS_{res}}{SS_{tot}}$
Root Mean Squared Error (RMSE)
Definition: The square root of the mean of the squared errors, where squared error = $(y_{true} - y_{pred})^{2}$
math: $\sqrt{\dfrac{1}{n}\sum(y_{true} - y_{pred})^{2}}$
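A small worked example of both regression metrics, using the standard definition $R^2 = 1 - SS_{res}/SS_{tot}$. The data is made up, and r_squared and rmse are hypothetical helper names; sklearn's metrics module provides r2_score and mean_squared_error for real use.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Square root of the mean squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

y_true = np.array([1.0, 2.0, 3.0, 4.0])
good = np.array([1.1, 1.9, 3.2, 3.8])   # close to the truth
bad = np.array([4.0, 3.0, 2.0, 1.0])    # worse than predicting the mean

r2_good, rmse_good = r_squared(y_true, good), rmse(y_true, good)   # 0.98, about 0.158
r2_bad, rmse_bad = r_squared(y_true, bad), rmse(y_true, bad)       # -3.0, about 2.236
```

The second set of predictions shows how $R^2$ goes negative: its squared error (20) is larger than the squared error of always predicting the mean (5).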
Confusion Matrix
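Definition: Given class labels, a table counting true vs. predicted labels across all observations: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).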
Accuracy
Definition and math: $\dfrac{TP + TN}{TP + TN + FP + FN}$
Misclassification Rate
Definition and math: $\dfrac{FP + FN}{TP + TN + FP + FN}$
false positive rate (fpr)
Definition: The percent of the negatives that were predicted as positive (how often is the predictor wrong on negatives?)
Math: $\dfrac{FP}{FP + TN}$
true positive rate/recall (tpr)
Definition: The percent of the positives that were correctly predicted as positive
Math: $\dfrac{TP}{TP + FN}$
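To see why accuracy alone can mislead on an imbalanced problem like this one, here is a sketch that computes the four rates from raw confusion-matrix counts, with fpr taken as FP / (FP + TN). The labels are made up (2 positives out of 10), and confusion_counts and rates are hypothetical helper names.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def rates(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    total = float(tp + tn + fp + fn)
    return {
        'accuracy': (tp + tn) / total,
        'misclassification': (fp + fn) / total,
        'fpr': fp / float(fp + tn),   # share of negatives flagged as positive
        'tpr': tp / float(tp + fn),   # share of positives that were caught
    }

y_true      = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
always_zero = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # never predicts a bad buy
some_model  = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]   # catches one of the two positives

baseline = rates(y_true, always_zero)   # accuracy 0.8, tpr 0.0
model = rates(y_true, some_model)       # accuracy 0.7, tpr 0.5, fpr 0.25
```

The classifier that never flags a bad buy has the higher accuracy, yet it is useless for the task; the true and false positive rates make that visible.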
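ROC Curve and AUC
Definition: The ROC curve plots tpr against fpr as the classification threshold varies; AUC is the area under that curve on the unit square. For a single set of hard predictions, this is the area under the line drawn through [0, 0], [fpr, tpr], and [1, 1].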
Note there are two routes to go with using AUC in python:
All of sklearn's model metrics are in the sklearn metrics page. Keep this around as a reference point so you know how to run the metrics you need to use!
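One reading of the two AUC routes mentioned above, assuming they are metrics.roc_auc_score (labels and scores in, AUC out) and metrics.roc_curve followed by metrics.auc (build the curve, then integrate it). The labels and scores are made up. The last lines also show what happens when hard 0/1 predictions are passed instead of scores, as the benchmark's dummy comparison does: the curve collapses to a single [fpr, tpr] point.

```python
import numpy as np
from sklearn import metrics

# Made-up labels and predicted probabilities of the positive class
y_true = np.array([0, 0, 0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.5, 0.9])

# Route 1: directly from labels and scores
auc_direct = metrics.roc_auc_score(y_true, y_score)

# Route 2: build the ROC curve, then integrate it
fpr, tpr, thresholds = metrics.roc_curve(y_true, y_score)
auc_from_curve = metrics.auc(fpr, tpr)

# Hard predictions at a 0.5 cutoff give a one-point curve:
# area = (1 + tpr - fpr) / 2 for that single point
y_pred = (y_score >= 0.5).astype(int)
auc_hard = metrics.roc_auc_score(y_true, y_pred)
```

Both routes return the same number for the same scores. Passing predict_proba output rather than predict output generally gives the more informative AUC, because every threshold contributes a point to the curve.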
Back to the data problem at hand!
We'll start by working through an exploration of each column in the dataset together as a class and come up with some ideas on how to handle each one.
print lemons.groupby('Auction').Auction.count()
print lemons.groupby('Auction').IsBadBuy.mean()
# seems like the ADESA auction is particularly worse for bad buys (about a third more than the other auctions)
# it may help to create a new column that specifically refers to "is_adesa"
lemons['auct_adesa'] = lemons.Auction.apply(lambda x: 1 if x == 'ADESA' else 0)
print lemons.groupby('auct_adesa').IsBadBuy.mean()
Auction
ADESA      10128
MANHEIM    28645
OTHER      12315
Name: Auction, dtype: int64
Auction
ADESA      0.153732
MANHEIM    0.114400
OTHER      0.118149
Name: IsBadBuy, dtype: float64
auct_adesa
0    0.115527
1    0.153732
Name: IsBadBuy, dtype: float64
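The same indicator can be built without apply, and pandas can generate one indicator per auction level in a single call. This is a sketch on a made-up four-row frame; the auct prefix is an arbitrary choice.

```python
import pandas as pd

# Made-up stand-in for the Auction column
toy = pd.DataFrame({'Auction': ['ADESA', 'MANHEIM', 'OTHER', 'ADESA']})

# Single indicator, equivalent to the apply/lambda version above
toy['auct_adesa'] = (toy.Auction == 'ADESA').astype(int)

# One 0/1 column per auction level
dummies = pd.get_dummies(toy.Auction, prefix='auct').astype(int)
```

With three levels, a linear model would normally drop one of the three dummy columns to avoid redundancy; a decision tree can use all of them as they are.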
print plt.hist(lemons.VehicleAge)
print lemons.groupby('VehicleAge').IsBadBuy.mean()
# there seems to be a stronger relationship with bad buys as vehicles are older.
# is there anything we should do here?
(array([  1.00000000e+00,   2.15900000e+03,   5.93600000e+03,
          1.11580000e+04,   1.19690000e+04,   9.06400000e+03,
          5.57700000e+03,   3.23400000e+03,   1.53200000e+03,
          4.58000000e+02]),
 array([ 0. ,  0.9,  1.8,  2.7,  3.6,  4.5,  5.4,  6.3,  7.2,  8.1,  9. ]),
 <a list of 10 Patch objects>)
VehicleAge
0    0.000000
1    0.044465
2    0.062163
3    0.083259
4    0.110118
5    0.146293
6    0.180384
7    0.219852
8    0.253916
9    0.316594
Name: IsBadBuy, dtype: float64
Continue to parse through each column and determine relationships against IsBadBuy.
Generate a model in your group. The goal should be a cross-validated model that, on average, performs better than the benchmark on the training data.
Once you've created a model your team is comfortable with, generate a "submission" csv file on the out-of-sample data. Ed, Julia, and Pooja will post "scores" against the actual values for the out-of-sample data.
Once you've done so, use the rest of the class today to work on your projects individually. Use this time to practice everything we learned today and to improve your project 2 for Monday.