Adapted from Chapter 8 of An Introduction to Statistical Learning
Let's pretend that instead of building a single model to solve a classification problem, you created five independent models, and each model was correct 70% of the time. If you combined these models into an "ensemble" and used their majority vote as a prediction, how often would the ensemble be correct?
Let's simulate it to find out!
import numpy as np
# set a seed for reproducibility
np.random.seed(1234)
# generate 1000 random numbers (between 0 and 1) for each model, representing 1000 observations
mod1 = np.random.rand(1000)
mod2 = np.random.rand(1000)
mod3 = np.random.rand(1000)
mod4 = np.random.rand(1000)
mod5 = np.random.rand(1000)
# each model independently predicts 1 (the "correct response") if its random number is greater than 0.3 (which happens with 70% probability)
preds1 = np.where(mod1 > 0.3, 1, 0)
preds2 = np.where(mod2 > 0.3, 1, 0)
preds3 = np.where(mod3 > 0.3, 1, 0)
preds4 = np.where(mod4 > 0.3, 1, 0)
preds5 = np.where(mod5 > 0.3, 1, 0)
# print the first 20 predictions from each model
print(preds1[:20])
print(preds2[:20])
print(preds3[:20])
print(preds4[:20])
print(preds5[:20])
[0 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 0 1 1]
[1 1 1 1 1 1 1 0 1 0 0 0 1 1 1 0 1 0 0 0]
[1 1 1 1 0 1 1 0 0 1 1 1 1 1 1 1 1 0 1 1]
[1 1 0 0 0 0 1 1 0 1 1 1 1 1 1 0 1 1 1 0]
[0 0 1 0 0 0 1 0 1 0 0 0 1 1 1 1 1 1 1 1]
# add the predictions together
sum_of_preds = preds1 + preds2 + preds3 + preds4 + preds5
# ensemble predicts 1 (the "correct response") if at least 3 models predict 1
ensemble_preds = np.where(sum_of_preds >= 3, 1, 0)
# print the ensemble's first 20 predictions
print(ensemble_preds[:20])
[1 1 1 1 0 0 1 0 1 1 1 1 1 1 1 1 1 0 1 1]
# how accurate was the ensemble?
ensemble_preds.mean()
0.841
Amazing, right?
Ensemble learning (or "ensembling") is simply the process of combining several models to solve a prediction problem, with the goal of producing a combined model that is more accurate than any individual model. For classification problems, the combination is often done by majority vote. For regression problems, the combination is often done by taking an average of the predictions.
For ensembling to work well, the individual models must meet two conditions:

  * Models should be accurate: each must outperform the null model.
  * Models should be independent: their predictions should be generated using different processes.
The idea, then, is that if you have a collection of individually imperfect (and independent) models, the "one-off" mistakes made by each model are probably not going to be made by the rest of the models, and thus those mistakes will be averaged out when the models are combined.
It turns out that as you add more independent models to the voting process, the probability that the majority vote is in error decreases. This is known as Condorcet's Jury Theorem, formulated in the 18th century by the Marquis de Condorcet, a French mathematician and political philosopher.
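In fact, both of those claims can be checked exactly with the binomial distribution: a majority vote of n independent models, each correct with probability p, is correct whenever more than half of the models are correct, and that probability can be computed directly. A minimal sketch using only the standard library:

from math import comb

def majority_vote_accuracy(n_models, p=0.7):
    """Exact probability that a majority of n_models independent models,
    each correct with probability p, votes for the right answer."""
    majority = n_models // 2 + 1
    return sum(comb(n_models, k) * p**k * (1 - p)**(n_models - k)
               for k in range(majority, n_models + 1))

# for 5 models this gives about 0.8369, in line with the simulated 0.841;
# the accuracy keeps climbing toward 1 as more models are added
for n in (5, 25, 101):
    print(n, round(majority_vote_accuracy(n), 4))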
Anyway, we'll see examples of ensembling below.
Some preliminary terminology: In statistics, "bootstrapping" refers to the process of using "bootstrap samples" to quantify the uncertainty of a model. Bootstrap samples are simply random samples with replacement:
# set a seed for reproducibility
np.random.seed(1)
# create an array of 0 to 9, then sample 10 times with replacement
np.random.choice(a=10, size=10, replace=True)
array([5, 8, 9, 5, 0, 0, 1, 7, 6, 9])
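A useful consequence of sampling with replacement: each bootstrap sample omits, on average, about a third of the original observations, since the probability that a given observation is never drawn in n draws is (1 - 1/n)^n, which approaches 1/e ≈ 0.368. (This fact powers the "out-of-bag" error estimate discussed below.) A quick simulation to check:

# draw many bootstrap samples and measure the fraction of observations
# that each one omits; this should be close to (1 - 1/100)**100, about 0.366
samples = [np.random.choice(a=100, size=100, replace=True) for _ in range(1000)]
print(np.mean([1 - len(set(s)) / 100 for s in samples]))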
On their own, decision trees are not competitive with the best supervised learning methods in terms of predictive accuracy. However, they can be used as the basis for more sophisticated methods that have much higher accuracy!
One of the main issues with decision trees is high variance, meaning that different splits in the training data can lead to very different trees. "Bootstrap aggregation" (aka "bagging") is a general-purpose procedure for reducing the variance of a machine learning method; it is particularly useful for decision trees.
What is the bagging process (in general)?

1. Take B bootstrapped samples from your training data.
2. Train a model on each of the B samples.
3. Combine the B predictions into a single prediction (by averaging or by majority vote).
This increases predictive accuracy by reducing the variance, similar to how cross-validation reduces the variance associated with the test set approach (for estimating out-of-sample error) by splitting many times and averaging the results.
So how exactly can bagging be used with decision trees? Here's how it applies to regression trees:

1. Grow B regression trees using B bootstrapped samples of the training data. (The trees are grown deep and left unpruned, so each individual tree has low bias but high variance.)
2. To predict the response for a new observation, average the predictions of the B trees.
It is applied in a similar fashion to classification trees, except that during the prediction stage, the overall prediction is based upon a majority vote of the trees.
What value should be used for B? Simply use a large enough value that the error seems to have stabilized. (Choosing a value of B that is "too large" will generally not lead to overfitting.)
import pandas as pd
# read in vehicle data
vehicles = pd.read_csv('https://raw.githubusercontent.com/justmarkham/DAT4/master/data/used_vehicles.csv')
# convert car to 0 and truck to 1
vehicles['type'] = vehicles.type.map({'car':0, 'truck':1})
# print out data
vehicles
|    | price | year | miles  | doors | type |
|----|-------|------|--------|-------|------|
| 0  | 22000 | 2012 | 13000  | 2     | 0    |
| 1  | 14000 | 2010 | 30000  | 2     | 0    |
| 2  | 13000 | 2010 | 73500  | 4     | 0    |
| 3  | 9500  | 2009 | 78000  | 4     | 0    |
| 4  | 9000  | 2007 | 47000  | 4     | 0    |
| 5  | 4000  | 2006 | 124000 | 2     | 0    |
| 6  | 3000  | 2004 | 177000 | 4     | 0    |
| 7  | 2000  | 2004 | 209000 | 4     | 1    |
| 8  | 3000  | 2003 | 138000 | 2     | 0    |
| 9  | 1900  | 2003 | 160000 | 4     | 0    |
| 10 | 2500  | 2003 | 190000 | 2     | 1    |
| 11 | 5000  | 2001 | 62000  | 4     | 0    |
| 12 | 1800  | 1999 | 163000 | 2     | 1    |
| 13 | 1300  | 1997 | 138000 | 4     | 0    |
# calculate the number of rows in vehicles
n_rows = vehicles.shape[0]
# set a seed for reproducibility
np.random.seed(123)
# create three bootstrap samples (will be used to select rows from the DataFrame)
sample1 = np.random.choice(a=n_rows, size=n_rows, replace=True)
sample2 = np.random.choice(a=n_rows, size=n_rows, replace=True)
sample3 = np.random.choice(a=n_rows, size=n_rows, replace=True)
# print samples
print(sample1)
print(sample2)
print(sample3)
[13  2 12  2  6  1  3 10 11  9  6  1  0  1]
[ 9  0  0  9  3 13  4  0  0  4  1  7  3  2]
[ 4  7  2  4  8 13  0  7  9  3 12 12  4  6]
# use sample1 to select rows from DataFrame
print(vehicles.iloc[sample1, :])
    price  year   miles  doors  type
13   1300  1997  138000      4     0
2   13000  2010   73500      4     0
12   1800  1999  163000      2     1
2   13000  2010   73500      4     0
6    3000  2004  177000      4     0
1   14000  2010   30000      2     0
3    9500  2009   78000      4     0
10   2500  2003  190000      2     1
11   5000  2001   62000      4     0
9    1900  2003  160000      4     0
6    3000  2004  177000      4     0
1   14000  2010   30000      2     0
0   22000  2012   13000      2     0
1   14000  2010   30000      2     0
from sklearn.tree import DecisionTreeRegressor
# grow one regression tree with each bootstrapped training set
treereg1 = DecisionTreeRegressor(random_state=123)
treereg1.fit(vehicles.iloc[sample1, 1:], vehicles.iloc[sample1, 0])
treereg2 = DecisionTreeRegressor(random_state=123)
treereg2.fit(vehicles.iloc[sample2, 1:], vehicles.iloc[sample2, 0])
treereg3 = DecisionTreeRegressor(random_state=123)
treereg3.fit(vehicles.iloc[sample3, 1:], vehicles.iloc[sample3, 0])
DecisionTreeRegressor(compute_importances=None, criterion='mse', max_depth=None,
                      max_features=None, max_leaf_nodes=None, min_density=None,
                      min_samples_leaf=1, min_samples_split=2, random_state=123,
                      splitter='best')
# read in out-of-sample data
oos = pd.read_csv('https://raw.githubusercontent.com/justmarkham/DAT4/master/data/used_vehicles_oos.csv')
# convert car to 0 and truck to 1
oos['type'] = oos.type.map({'car':0, 'truck':1})
# print data
oos
|   | price | year | miles  | doors | type |
|---|-------|------|--------|-------|------|
| 0 | 3000  | 2003 | 130000 | 4     | 1    |
| 1 | 6000  | 2005 | 82500  | 4     | 0    |
| 2 | 12000 | 2010 | 60000  | 2     | 0    |
# select feature columns (every column except for the 0th column)
feature_cols = vehicles.columns[1:]
# make predictions on out-of-sample data
preds1 = treereg1.predict(oos[feature_cols])
preds2 = treereg2.predict(oos[feature_cols])
preds3 = treereg3.predict(oos[feature_cols])
# print predictions
print(preds1)
print(preds2)
print(preds3)
[  1300.   5000.  14000.]
[  1300.   1300.  13000.]
[  2000.   9000.  13000.]
# average predictions and compare to actual values
print((preds1 + preds2 + preds3) / 3)
print(oos.price.values)
[  1533.33333333   5100.          13333.33333333]
[ 3000  6000 12000]
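In practice you wouldn't bag trees by hand. A minimal sketch of the same procedure using scikit-learn's BaggingRegressor (the choice of 100 trees here is an arbitrary illustrative one):

from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# grow 100 regression trees, each on its own bootstrap sample, and average them
bagreg = BaggingRegressor(DecisionTreeRegressor(), n_estimators=100,
                          oob_score=True, random_state=123)
bagreg.fit(vehicles.iloc[:, 1:], vehicles.price)
print(bagreg.predict(oos[feature_cols]))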
Bagged models have a very nice property: out-of-sample error can be estimated without using the test set approach or cross-validation!
Here's how the out-of-sample estimation process works with bagged trees:

1. On average, each bootstrapped sample contains about two-thirds of the observations, which means the remaining one-third are "out-of-bag" for that tree.
2. To predict the response for a given observation, use only the trees for which that observation was out-of-bag, and average their predictions (or take a majority vote).
3. Compare these "out-of-bag" predictions to the actual response values in order to compute the out-of-bag error.
When B is sufficiently large, the out-of-bag error is an accurate estimate of out-of-sample error.
# a set is a data structure used to identify unique elements
print(set(range(14)))
# only show the unique elements in sample1
print(set(sample1))
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13}
{0, 1, 2, 3, 6, 9, 10, 11, 12, 13}
# use the "set difference" to identify the out-of-bag observations for each tree
print(sorted(set(range(14)) - set(sample1)))
print(sorted(set(range(14)) - set(sample2)))
print(sorted(set(range(14)) - set(sample3)))
[4, 5, 7, 8]
[5, 6, 8, 10, 11, 12]
[1, 5, 10, 11]
Thus, we would predict the response for observation 4 by using tree 1 (because it is only out-of-bag for tree 1). We would predict the response for observation 5 by averaging the predictions from trees 1, 2, and 3 (since it is out-of-bag for all three trees). We would repeat this process for all observations, and then calculate the MSE using those predictions.
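Here is a minimal sketch of that process using our three hand-grown trees. (With only B = 3, some observations are in-bag for every tree and must be skipped, so the estimate is noisy; with a realistically large B this is rarely an issue.)

# predict each observation using only the trees for which it was out-of-bag
oob_preds, actuals = [], []
for i in range(n_rows):
    tree_preds = [tree.predict(vehicles.iloc[[i], 1:])[0]
                  for tree, sample in [(treereg1, sample1),
                                       (treereg2, sample2),
                                       (treereg3, sample3)]
                  if i not in sample]
    if tree_preds:  # skip observations that were in-bag for all three trees
        oob_preds.append(np.mean(tree_preds))
        actuals.append(vehicles.iloc[i, 0])
# compare the out-of-bag predictions to the actual prices
print(np.mean((np.array(oob_preds) - np.array(actuals)) ** 2))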
Although bagging increases predictive accuracy, it decreases model interpretability: there is no longer a single tree to visualize, so we can't simply inspect the splits to understand the importance of each variable.
However, we can still obtain an overall summary of "variable importance" from bagged models: for each feature, record the total amount by which the error criterion (MSE for regression trees, Gini index for classification trees) is reduced due to splits on that feature, averaged over all B trees.
(We'll see an example of this below.)
Random Forests is a slight variation of bagged trees that has even better performance! Here's how it works:

1. Exactly as in bagging, grow B trees using B bootstrapped samples of the training data.
2. However, each time a split is considered, only a random sample of m features is allowed as candidates for the split (rather than all p features).

Notes:

  * A new random sample of m features is chosen for every single split of every single tree.
  * For classification, m is typically chosen to be the square root of p; for regression, p/3 is a common choice.
  * If m = p, Random Forests is identical to bagging.

What's the point? Suppose there is one very strong feature in the data set. With bagged trees, most of the trees will use that feature for the top split, so the trees (and their predictions) will be highly correlated, and averaging highly correlated predictions does not reduce the variance by much. By restricting the candidate features at each split, Random Forests "decorrelates" the trees, so that the averaging process can reduce the variance of the final model more effectively.
# read in the Titanic data
titanic = pd.read_csv('https://raw.githubusercontent.com/justmarkham/DAT4/master/data/titanic.csv')
# encode sex feature
titanic['sex'] = titanic.sex.map({'female':0, 'male':1})
# fill in missing values for age
titanic.age.fillna(titanic.age.mean(), inplace=True)
# create three dummy variables, drop the first dummy variable, and store this as a DataFrame
embarked_dummies = pd.get_dummies(titanic.embarked, prefix='embarked').iloc[:, 1:]
# concatenate the two dummy variable columns onto the original DataFrame
# note: axis=0 means rows, axis=1 means columns
titanic = pd.concat([titanic, embarked_dummies], axis=1)
# create a list of feature columns
feature_cols = ['pclass', 'sex', 'age', 'embarked_Q', 'embarked_S']
# print the updated DataFrame
titanic.head(10)
|   | survived | pclass | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | embarked_Q | embarked_S |
|---|----------|--------|------|-----|-----|-------|-------|--------|------|-------|----------|------------|------------|
| 0 | 0 | 3 | Braund, Mr. Owen Harris | 1 | 22.000000 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 0 | 1 |
| 1 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 38.000000 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 0 | 0 |
| 2 | 1 | 3 | Heikkinen, Miss. Laina | 0 | 26.000000 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 0 | 1 |
| 3 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 35.000000 | 1 | 0 | 113803 | 53.1000 | C123 | S | 0 | 1 |
| 4 | 0 | 3 | Allen, Mr. William Henry | 1 | 35.000000 | 0 | 0 | 373450 | 8.0500 | NaN | S | 0 | 1 |
| 5 | 0 | 3 | Moran, Mr. James | 1 | 29.699118 | 0 | 0 | 330877 | 8.4583 | NaN | Q | 1 | 0 |
| 6 | 0 | 1 | McCarthy, Mr. Timothy J | 1 | 54.000000 | 0 | 0 | 17463 | 51.8625 | E46 | S | 0 | 1 |
| 7 | 0 | 3 | Palsson, Master. Gosta Leonard | 1 | 2.000000 | 3 | 1 | 349909 | 21.0750 | NaN | S | 0 | 1 |
| 8 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | 0 | 27.000000 | 0 | 2 | 347742 | 11.1333 | NaN | S | 0 | 1 |
| 9 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | 0 | 14.000000 | 1 | 0 | 237736 | 30.0708 | NaN | C | 0 | 0 |
# import class, instantiate estimator, fit with all data
from sklearn.ensemble import RandomForestClassifier
rfclf = RandomForestClassifier(n_estimators=100, max_features='auto', oob_score=True, random_state=1)
rfclf.fit(titanic[feature_cols], titanic.survived)
RandomForestClassifier(bootstrap=True, compute_importances=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_density=None, min_samples_leaf=1, min_samples_split=2,
                       n_estimators=100, n_jobs=1, oob_score=True, random_state=1,
                       verbose=0)
These are the most important tuning parameters for Random Forests:

  * n_estimators: the number of trees to grow (B). Use a value large enough that the error appears to have stabilized (a quick check is sketched below).
  * max_features: the number of features to consider at each split (m). This should be tuned, since the optimal value depends on the data.
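For example, here is a minimal sketch that refits the forest with an increasing number of trees and tracks the out-of-bag accuracy; the grid of values is an arbitrary illustrative choice:

# once the out-of-bag score stops changing, n_estimators is large enough
# (small forests may warn that some observations were never out-of-bag)
for n in [10, 30, 100, 300]:
    rf = RandomForestClassifier(n_estimators=n, oob_score=True, random_state=1)
    rf.fit(titanic[feature_cols], titanic.survived)
    print(n, rf.oob_score_)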
# compute the feature importances
pd.DataFrame({'feature':feature_cols, 'importance':rfclf.feature_importances_})
|   | feature    | importance |
|---|------------|------------|
| 0 | pclass     | 0.160553   |
| 1 | sex        | 0.366700   |
| 2 | age        | 0.434528   |
| 3 | embarked_Q | 0.012129   |
| 4 | embarked_S | 0.026089   |
# compute the out-of-bag classification accuracy
rfclf.oob_score_
0.80022446689113358
Ensembling is incredibly popular when the primary goal is predictive accuracy. For example, the team that eventually won the $1 million Netflix Prize used an ensemble of 107 models early on in the competition.
There was a recent paper in the Journal of Machine Learning Research titled "Do We Need Hundreds of Classifiers to Solve Real World Classification Problems?" (Spoiler alert: Random Forests did very well.) In the comments about the paper on Hacker News, Ben Hamner (Kaggle's chief scientist) said the following:
This is consistent with our experience running hundreds of Kaggle competitions: for most classification problems, some variation on ensembled decision trees (random forests, gradient boosted machines, etc.) performs the best. This is typically in conjunction with clever data processing, feature selection, and internal validation.
One key exception is where the data is richly and hierarchically structured. Text, speech, and visual data falls under this category. In many cases here, variations of neural networks (deep neural nets/CNN's/RNN's/etc.) provide very dramatic improvements.
But as you can imagine, ensembling may not often be practical in a real-time environment.
You can also build your own ensembles: just build a variety of models and average their predictions! Here are some strategies for building models that are independent of one another:

  * Use different machine learning algorithms (sketched below).
  * Use different subsets of the features.
  * Use different subsets of the observations.
  * Use different tuning parameters.
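As a sketch of the first strategy, scikit-learn's VotingClassifier handles the majority vote for you; the particular models combined here are arbitrary illustrative choices:

from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# combine three different model types with a (hard) majority vote
voter = VotingClassifier(estimators=[
    ('lr', LogisticRegression()),
    ('dt', DecisionTreeClassifier(max_depth=3, random_state=1)),
    ('rf', RandomForestClassifier(n_estimators=100, random_state=1)),
])
voter.fit(titanic[feature_cols], titanic.survived)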
Note that there is an entire class of well-known ensembling methods that we did not discuss, namely boosting. Instead of building independent models and averaging their predictions, boosting builds models sequentially, each on a repeatedly modified version of the data. More information is available in the scikit-learn documentation on Ensemble Methods, in particular the sections on AdaBoost and Gradient Tree Boosting.
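For contrast with the bagged models above, here is a minimal sketch of gradient boosting on the same data; the hyperparameter values are arbitrary illustrative choices:

from sklearn.ensemble import GradientBoostingClassifier

# unlike bagging, trees are built sequentially, with each new tree
# correcting the errors of the current ensemble
gbclf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                   max_depth=3, random_state=1)
gbclf.fit(titanic[feature_cols], titanic.survived)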