Preface

Welcome, and congratulations on taking an interest in Applied Predictive Modeling with Python! This is a collection of IPython Notebooks that provides an interactive way to reproduce the examples in the book Applied Predictive Modeling by Kuhn and Johnson.

If you experience any problems along the way or have any feedback at all, please reach out to me.

Best Regards,
Lei Gong
Email: [email protected]
Twitter: @_LeiG

Setup

In [1]:
import numpy
import scipy
import pandas
import sklearn
import matplotlib
import rpy2
import pyearth
import statsmodels
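
If one of the imports above fails, it helps to know which packages are missing and which versions are installed. The following is a minimal sketch, not part of the original notebook: the helper name `report_versions` is hypothetical, and it assumes each package exposes a `__version__` attribute, which most but not all packages do.

```python
import importlib

# Import names used in the cell above.
PACKAGES = ["numpy", "scipy", "pandas", "sklearn",
            "matplotlib", "rpy2", "pyearth", "statsmodels"]


def report_versions(names):
    """Map each import name to its version string.

    The value is None when the package cannot be imported and
    "unknown" when it imports but has no __version__ attribute.
    """
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except Exception:
            # Some packages fail with errors other than ImportError,
            # for example when a system library they wrap is missing.
            versions[name] = None
            continue
        versions[name] = getattr(module, "__version__", "unknown")
    return versions
```

Calling `report_versions(PACKAGES)` in a notebook cell returns a dictionary that can be printed or compared against the versions you expect.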

Prepare Datasets

Thanks to the authors, all datasets needed to reproduce the examples in the book are available in the .RData format from their R packages $\texttt{caret}$ and $\texttt{AppliedPredictiveModeling}$. To prepare them for our purposes, I wrote a small script, "fetch_data.py", that downloads all the datasets and converts them from .RData to .csv.

In [2]:
%run ../fetch_data.py
Using existing datasets folder:/Users/leigong/Documents/Research/DataScience/Applied-Predictive-Modeling/datasets
Downloading AppliedPredictiveModeling from http://cran.r-project.org/src/contrib/AppliedPredictiveModeling_1.1-6.tar.gz (2 MB)
Decomposing /Users/leigong/Documents/Research/DataScience/Applied-Predictive-Modeling/datasets/AppliedPredictiveModeling_1.1-6.tar.gz
Checking that the AppliedPredictiveModeling file exists...
=> Success!
Downloading Caret from http://cran.r-project.org/src/contrib/caret_6.0-37.tar.gz (2 MB)
Decomposing /Users/leigong/Documents/Research/DataScience/Applied-Predictive-Modeling/datasets/caret_6.0-37.tar.gz
Checking that the Caret file exists...
=> Success!
Extract .RData files from the package...
Convert .RData to .csv and clean up .RData files...
=> Success!
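
Once the conversion has finished, the .csv files can be read with pandas. The sketch below is an illustration under assumptions: the helper names `list_csv_files` and `load_datasets` are hypothetical, and because the exact layout that "fetch_data.py" produces inside the datasets folder is not shown here, the code simply searches the folder recursively.

```python
from pathlib import Path

import pandas as pd


def list_csv_files(datasets_dir):
    """Return every .csv file found under datasets_dir, sorted by path."""
    return sorted(Path(datasets_dir).rglob("*.csv"))


def load_datasets(datasets_dir):
    """Load each .csv file into a DataFrame keyed by the file name stem.

    If two files share the same stem, the later one in sorted order
    replaces the earlier one.
    """
    return {path.stem: pd.read_csv(path)
            for path in list_csv_files(datasets_dir)}
```

For example, `load_datasets("../datasets")` would return one DataFrame per converted file, assuming the notebook sits one level below the repository root as the `%run ../fetch_data.py` call suggests.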

1. Introduction

Predictive modeling: the process of developing a mathematical tool or model that generates an accurate prediction.

There are a number of common reasons why predictive models fail, e.g.,

  • inadequate pre-processing of the data
  • inadequate model validation
  • unjustified extrapolation
  • over-fitting the model to the existing data
  • exploring relatively few models when searching for relationships
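
Over-fitting in particular is easy to demonstrate on synthetic data. The sketch below is not taken from the book: it assumes a noisy sine curve as the true relationship and fits polynomials of increasing degree with NumPy. The error on the training data can only shrink as the degree grows, so a low training error alone says nothing about how well the model generalizes. Comparing it with the error on fresh data is what reveals the problem.

```python
import numpy as np
from numpy.polynomial import Polynomial


def fit_and_score(degree, x_train, y_train, x_test, y_test):
    """Fit a least-squares polynomial and return (train MSE, test MSE)."""
    model = Polynomial.fit(x_train, y_train, degree)
    train_mse = float(np.mean((model(x_train) - y_train) ** 2))
    test_mse = float(np.mean((model(x_test) - y_test) ** 2))
    return train_mse, test_mse


rng = np.random.default_rng(0)


def make_data(n_samples):
    """Draw noisy samples from an assumed true curve, sin(2*pi*x)."""
    x = rng.uniform(0.0, 1.0, n_samples)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, n_samples)
    return x, y


x_train, y_train = make_data(20)    # small training set
x_test, y_test = make_data(200)     # fresh data from the same process

scores = {degree: fit_and_score(degree, x_train, y_train, x_test, y_test)
          for degree in (1, 3, 9)}

for degree, (train_mse, test_mse) in scores.items():
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```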

1.1 Prediction Versus Interpretation

The trade-off between prediction and interpretation depends on the primary goal of the task. The unfortunate reality is that as we push towards higher accuracy, models become more complex and harder to interpret.

1.2 Key Ingredients of Predictive Models

The foundation of an effective predictive model is laid with intuition and deep knowledge of the problem context, both of which are vital for driving decisions about model development. The process begins with relevant data.

1.3 Terminology

  • The terms sample, data point, observation, and instance refer to a single independent unit of data.
  • The training set consists of the data used to develop models, while the test or validation set is used solely for evaluating the performance of a final set of candidate models. NOTE: in common usage, the validation set is the data used to evaluate candidate models, and the training set is divided by cross-validation into several sub-training and test sets to tune parameters during model development.
  • The predictors, independent variables, attributes, or descriptors are the data used as input for the prediction equation.
  • The terms outcome, dependent variable, target, class, and response refer to the outcome event or quantity that is being predicted.
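
The relationship between the training set, the held-out test set, and the cross-validation folds described above can be sketched with index arrays. This is a minimal illustration using only NumPy; the function names are hypothetical, and in practice scikit-learn's `train_test_split` and `KFold` cover the same ground.

```python
import numpy as np


def train_test_indices(n_samples, test_fraction, rng):
    """Shuffle sample indices and split them into training and test parts."""
    indices = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]


def k_fold_indices(train_indices, n_folds):
    """Yield (sub-training, held-out) index pairs for k-fold cross-validation."""
    folds = np.array_split(train_indices, n_folds)
    for i, held_out in enumerate(folds):
        sub_train = np.concatenate(
            [fold for j, fold in enumerate(folds) if j != i])
        yield sub_train, held_out


rng = np.random.default_rng(42)

# 100 samples: 80 for model development, 20 kept aside for the final check.
train_idx, test_idx = train_test_indices(100, 0.2, rng)

# The training set alone is divided again to tune parameters.
splits = list(k_fold_indices(train_idx, 5))
```

The test indices never appear in any fold, so parameter tuning only ever sees the training data.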

1.4 Example Data Sets and Typical Data Scenarios

1.5 Overview

  • Part I General Strategies
    • Ch.2 A short tour of the predictive modeling process
    • Ch.3 Data pre-processing
    • Ch.4 Over-fitting and model tuning
  • Part II Regression Models
    • Ch.5 Measuring performance in regression models
    • Ch.6 Linear regression and its cousins
    • Ch.7 Nonlinear regression models
    • Ch.8 Regression trees and rule-based models
    • Ch.9 A summary of solubility models
    • Ch.10 Case study: compressive strength of concrete
  • Part III Classification Models
    • Ch.11 Measuring performance in classification models
    • Ch.12 Discriminant analysis and other linear classification models
    • Ch.13 Nonlinear classification models
    • Ch.14 Classification trees and rule-based models
    • Ch.15 A summary of grant application models
    • Ch.16 Remedies for severe class imbalance
    • Ch.17 Case study: job scheduling
  • Part IV Other Considerations
    • Ch.18 Measuring predictor importance
    • Ch.19 An introduction to feature selection
    • Ch.20 Factors that can affect model performance