These are notes I wrote for myself while studying for PhD qualifying exams at Stanford.
I wanted to publish them so that others could benefit.
This draws from:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn import datasets
%matplotlib inline
Get data.
iris = datasets.load_iris()
X = iris.data
y = iris.target
l = iris.target_names
Class labels.
We want numerical class labels. We can use `LabelEncoder` to do this:
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
label_encoder = enc.fit(y)
y = label_encoder.transform(y) + 1
label_dict = {1: 'Setosa', 2: 'Versicolor', 3:'Virginica'}
Peek at the schema.
d=pd.DataFrame(X)
d.columns=iris.feature_names
d['Class_Bin']=y
d['Class_Name']=d['Class_Bin'].apply(lambda x: l[x-1])  # labels were shifted by +1 above
d.columns=['s_l','s_w','p_l','p_w','b','name']
d=d[['s_l','s_w','p_l','p_w','name']]
d.head(3)
| | s_l | s_w | p_l | p_w | name |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
Get distributions across categorical labels.
There are many nice tricks for this in seaborn:
http://stanford.edu/~mwaskom/software/seaborn/tutorial/axis_grids.html
g = sns.FacetGrid(d, row="name",hue="name",size=1.7,aspect=4)
g.map(sns.distplot,"s_l")
You can easily compare across categorical labels, as well.
iris = sns.load_dataset("iris")
g = sns.PairGrid(iris)
g.map(plt.scatter)
We can ask which features explain the most variance in the sample data.
These features may be most useful for discriminating between flower types.
d_pca=d[['s_l','s_w','p_l','p_w']]
sns.jointplot(d_pca['s_l'],d_pca['s_w'])
Normalize by z-score, giving each feature $\mu=0$, $\sigma^2=1$.
X_1=np.subtract(d_pca,d_pca.mean(axis=0))
X_2=np.divide(X_1,X_1.std(axis=0))
Now, the means are zero and the data have the same variance across features.
We can see this.
sns.jointplot(X_2['s_l'],X_2['s_w'])
Next, we apply PCA. This simply compresses the data into a smaller number of features, or axes.
The axes are the principal eigenvectors of the covariance matrix.
Variance of the data is maximized along the first principal component.
The coefficients (loadings) tell us the degree to which each feature contributes to defining this direction of greatest variance:
from sklearn.decomposition import PCA
n_comp = 2
pca = PCA(n_components=n_comp)
X_t = pca.fit(X_2).transform(X_2)
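As a quick sanity check (a sketch on synthetic data, not part of the original notebook), the principal axes can be recovered directly as the top eigenvectors of the covariance matrix:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X_toy = rng.randn(100, 4)               # toy data, already roughly standardized
Xc = X_toy - X_toy.mean(axis=0)         # center each feature

# Eigendecomposition of the covariance matrix
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]       # sort by decreasing variance
top2 = eigvecs[:, order[:2]]            # the first two principal axes

# sklearn's PCA finds the same axes (each possibly with a flipped sign)
pca_check = PCA(n_components=2).fit(Xc)
assert np.allclose(np.abs(pca_check.components_.T), np.abs(top2))
```

The eigenvalues themselves are the variances along each axis, which is why sorting them descending recovers the usual PC ordering.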
We can plot along our categories:
ix=np.array(d['name'])
l=list(set(d['name']))
colors=['r','g','b']
for i, c in zip(l, colors):
    plt.scatter(X_t[ix==i, 0], X_t[ix==i, 1], color=c, edgecolor="white", s=50, lw=1, label=i)
plt.legend(loc='upper left', scatterpoints=1, prop={'size': 8})
From the above, we can see that PCA is also a nice way to "cluster" the data.
Below, we also see that each feature contributes to both PC1 and PC2.
loadings = pca.components_
print(loadings)
[[ 0.52237162 -0.26335492  0.58125401  0.56561105]
 [-0.37231836 -0.92555649 -0.02109478 -0.06541577]]
Note that we can do a few different kinds of pre-processing on the data.
It is often useful to standardize features by removing the mean and scaling to unit variance:
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
Note that we can use PCA, as shown above, to reduce dimensionality.
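For reference, `StandardScaler` performs exactly the manual z-scoring used earlier (a tiny sketch; note that NumPy's `std` defaults to the population form, `ddof=0`, which matches the scaler, whereas pandas' `.std()` uses `ddof=1`):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

# StandardScaler: subtract the mean, divide by the (population) std
scaled = StandardScaler().fit_transform(X)
manual = (X - X.mean(axis=0)) / X.std(axis=0)  # np.std uses ddof=0 by default

assert np.allclose(scaled, manual)
```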
Here are some classification pipelines that we can use, which have different modes of classification.
In the multiclass case for Logistic Regression, the training algorithm uses a one-vs-rest (OvR) scheme by default, rather than the "true" multinomial logistic regression.
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
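To make the distinction concrete (a sketch, not from the original notes): OvR can be requested explicitly via `OneVsRestClassifier`, while a plain `LogisticRegression` with the `lbfgs` solver fits the multinomial (softmax) model in newer scikit-learn versions:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)

# Explicit one-vs-rest: one binary classifier per class
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

# Multinomial (softmax) fit; the default for multi-class lbfgs in recent versions
multi = LogisticRegression(solver='lbfgs', max_iter=1000).fit(X, y)

print(len(ovr.estimators_))  # 3 binary classifiers, one per class
print(multi.coef_.shape)     # one weight vector per class: (3, 4)
```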
from sklearn.cross_validation import cross_val_score, KFold
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from scipy.stats import sem
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
clf_nb_simple = Pipeline(steps=[
('classifier', GaussianNB())
])
clf_nb_all = Pipeline(steps=[
('scaler', StandardScaler()),
('classifier', GaussianNB())
])
clf_nb_pca = Pipeline(steps=[
('scaler', StandardScaler()),
('reduce_dim', PCA(n_components=2)),
('classifier', GaussianNB())
])
clf_lr_simple = Pipeline(steps=[
('classifier', LogisticRegression())
])
clf_lr_pca = Pipeline(steps=[
('reduce_dim', PCA(n_components=2)),
('classifier', LogisticRegression())
])
clf_svm_linear = Pipeline(steps=[
('classifier', svm.SVC(kernel='linear',C=1.0))
])
clf_svm_poly = Pipeline(steps=[
('classifier', svm.SVC(kernel='poly',C=1.0))
])
clf_rf = Pipeline(steps=[
('classifier', RandomForestClassifier(n_estimators=30))
])
Here is a function we can use to perform $k$-fold cross-validation for each pipeline.
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.KFold.html
import os
from IPython.display import Image
i = Image(filename=os.getcwd()+'/Images/train.png')
i
def evaluate_cross_validation(clf, X, y, K):
    # K-fold cross-validation iterator
    cv = KFold(len(y), K, shuffle=True, random_state=0)
    # Score used is the one returned by the estimator's score method (accuracy)
    scores = cross_val_score(clf, X, y, cv=cv)
    # Report output
    print(scores)
    print("Mean score: {0:.3f} (+/-{1:.3f})".format(np.mean(scores), sem(scores)))
Here's our data. Let's split it for testing and training.
# Data
iris = datasets.load_iris()
X = iris.data
y = iris.target
# Splits
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.40)
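An aside beyond the original notes: for classification it is often safer to stratify the split so each class keeps its proportion in both sets; `train_test_split` supports this via the `stratify` argument (shown with the newer `sklearn.model_selection` import path):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older versions

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.40, stratify=y, random_state=0)

# Each class keeps the same share in train and test
print(np.bincount(y_tr))  # 30 per class
print(np.bincount(y_te))  # 20 per class
```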
Now, let's run CV on each pipeline. (From the output below: z-score normalization doesn't change the GaussianNB results, while PCA reduces quality.)
for clf in [clf_nb_simple, clf_nb_all, clf_nb_pca, clf_lr_simple, clf_lr_pca, clf_svm_linear, clf_svm_poly, clf_rf]:
    evaluate_cross_validation(clf, X_train, y_train, 5)
    print("---")
[ 0.94444444  0.94444444  1.          1.          0.77777778]
Mean score: 0.933 (+/-0.041)
---
[ 0.94444444  0.94444444  1.          1.          0.77777778]
Mean score: 0.933 (+/-0.041)
---
[ 0.88888889  0.83333333  0.88888889  0.94444444  0.83333333]
Mean score: 0.878 (+/-0.021)
---
[ 0.94444444  1.          0.94444444  0.77777778  0.88888889]
Mean score: 0.911 (+/-0.038)
---
[ 0.83333333  0.94444444  1.          0.72222222  0.66666667]
Mean score: 0.833 (+/-0.063)
---
[ 1.          1.          1.          1.          0.88888889]
Mean score: 0.978 (+/-0.022)
---
[ 0.88888889  0.88888889  1.          1.          0.83333333]
Mean score: 0.922 (+/-0.033)
---
[ 0.94444444  0.94444444  0.94444444  1.          0.83333333]
Mean score: 0.933 (+/-0.027)
---
SVM with a linear kernel does the best.
But let's just use logistic regression; it also works well, and the feature weights can be extracted for interpretation.
We can use a confusion matrix to evaluate our performance:
http://scikit-learn.org/stable/auto_examples/plot_confusion_matrix.html
from sklearn import metrics
# Let's not use the pipeline, because we want to extract learned parameters!
clf_lr_simple=LogisticRegression()
clf_lr_simple.fit(X_train, y_train)
pred_test = clf_lr_simple.predict(X_test)
print('Prediction accuracy for the test dataset:')
print('{:.2%}'.format(metrics.accuracy_score(y_test, pred_test)))
print('\nConfusion Matrix of the Logistic regression classifier:')
print(metrics.confusion_matrix(y_test,clf_lr_simple.predict(X_test)))
Prediction accuracy for the test dataset:
98.33%

Confusion Matrix of the Logistic regression classifier:
[[15  0  0]
 [ 0 23  0]
 [ 0  1 21]]
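Per-class recall and precision can be read straight off a confusion matrix: recall is the diagonal over the row sums (rows are true classes), precision is the diagonal over the column sums. A sketch on a tiny hand-made example:

```python
import numpy as np
from sklearn import metrics

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]  # one class-1 sample mislabeled as class 2

cm = metrics.confusion_matrix(y_true, y_pred)

# Recall: fraction of each true class recovered (diagonal / row sums)
recall = np.diag(cm) / cm.sum(axis=1)
# Precision: fraction of each predicted class that is correct (diagonal / column sums)
precision = np.diag(cm) / cm.sum(axis=0)

print(recall)     # class 1 recall drops to 0.5
print(precision)  # class 2 precision drops to 2/3
```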
clf_lr_simple.coef_
array([[ 0.41221728,  1.28871111, -2.05561652, -0.89539473],
       [ 0.15094567, -1.08540973,  0.65545797, -1.07027715],
       [-1.24459794, -1.40446219,  1.90860717,  1.93783625]])
clf_lr_simple.intercept_
array([ 0.25774972, 0.33573647, -0.74802498])
Let's redo this with only two classes (setosa vs. versicolor).
# Data
iris = datasets.load_iris()
X = iris.data
y = iris.target
ix=y!=2
X=X[ix,:]
y=y[ix]
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.40)
clf_lr_simple.fit(X_train, y_train)
pred_test = clf_lr_simple.predict(X_test)
print('Prediction accuracy for the test dataset:')
print('{:.2%}'.format(metrics.accuracy_score(y_test, pred_test)))
print('\nConfusion Matrix of the Logistic regression classifier:')
print(metrics.confusion_matrix(y_test,clf_lr_simple.predict(X_test)))
Prediction accuracy for the test dataset:
100.00%

Confusion Matrix of the Logistic regression classifier:
[[21  0]
 [ 0 19]]
clf_lr_simple.coef_
array([[-0.32885884, -1.31791178, 1.97767729, 0.90105476]])
The model heavily weights the second and third features, which makes sense (below).
ix = np.array(y_train)
l = list(set(y_train))
colors = ['r', 'g']
for i, c in zip(l, colors):
    plt.scatter(X_train[ix==i, 1], X_train[ix==i, 2], color=c, edgecolor="white", s=50, lw=1, label=i)
plt.legend(loc='upper left', scatterpoints=1, prop={'size': 8})
from sklearn.datasets import load_boston
data = load_boston()
X = data.data
y = data.target
print(X.shape)
print(y.shape)
(506, 13)
(506,)
Examine PCA without scaling.
n_comp = 2
pca = PCA(n_components=n_comp)
X_t = pca.fit(X).transform(X)
plt.scatter(X_t[:,0], X_t[:,1], color='k', edgecolor="white", s=50, lw=1)
Remember: always re-scale prior to PCA!
X_1=np.subtract(X,X.mean(axis=0))
X_2=np.divide(X_1,X_1.std(axis=0))
# pipeline = Pipeline([('scaling',StandardScaler()),('pca',PCA(n_components=2))])
# X_t=pipeline.fit_transform(X)
X_t = pca.fit(X_2).transform(X_2)
plt.scatter(X_t[:,0], X_t[:,1], color='k', edgecolor="white", s=50, lw=1)
The loadings tell us each feature's contribution to the principal components; in this case, the contributions are not heavily skewed.
loads=pd.DataFrame(pca.components_)
loads.columns=list(data.feature_names[0:13])
loads=np.abs(loads)
s=loads.sum(axis=0)
s = s.sort_values(ascending=False)  # in newer pandas, in-place Series.sort() is gone
loads=loads[s.index]
loads
| | DIS | AGE | RAD | ZN | TAX | NOX | CRIM | PTRATIO | CHAS | INDUS | B | LSTAT | RM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.321735 | 0.313851 | 0.319817 | 0.256521 | 0.338539 | 0.342976 | 0.249593 | 0.205021 | 0.005099 | 0.346861 | 0.202732 | 0.309841 | 0.189437 |
| 1 | 0.349181 | 0.311748 | 0.270398 | 0.321308 | 0.238859 | 0.219857 | 0.313186 | 0.308704 | 0.456726 | 0.111816 | 0.234957 | 0.075982 | 0.153877 |
DIS: weighted distances to five Boston employment centres.
AGE: proportion of owner-occupied units built prior to 1940.
RM: average number of rooms per dwelling.
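How much total variance the two components capture is available directly from `explained_variance_ratio_` (a quick sketch; shown on standardized iris data for self-containment, but the same call applies to the Boston features above):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_iris = StandardScaler().fit_transform(load_iris().data)
pca2 = PCA(n_components=2).fit(X_iris)

print(pca2.explained_variance_ratio_)        # per-component share of total variance
print(pca2.explained_variance_ratio_.sum())  # fraction captured by the first two PCs
```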
feat=pd.DataFrame(X)
feat.columns=data.feature_names[0:13]
feat.head(3)
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296 | 15.3 | 396.90 | 4.98 |
| 1 | 0.02731 | 0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.90 | 9.14 |
| 2 | 0.02729 | 0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 |
sns.kdeplot(feat['RM'],shade=True,color="#6495ED")
# Median Value (attribute 14) is usually the target!
sns.kdeplot(data.target,shade=True,color="#6495ED")
plt.xlabel('price ($1000s)')
Get the data and parse into test/train sets.
from sklearn.cross_validation import train_test_split
data = load_boston()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.25,random_state=33)
Validation. We reuse the `evaluate_cross_validation` function defined above, and add a helper that trains a model and scores it:
def train_and_evaluate(clf, X_train, y_train):
    clf.fit(X_train, y_train)
    print("Coefficient of determination on training set:", clf.score(X_train, y_train))
    # Create a k-fold cross-validation iterator of k=5 folds
    cv = KFold(X_train.shape[0], 5, shuffle=True, random_state=33)
    scores = cross_val_score(clf, X_train, y_train, cv=cv)
    print("Average coefficient of determination using 5-fold cross-validation:", np.mean(scores))
Pipelines. A key point is to normalize the data, so that large-valued features don't dominate the final result!
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import SGDRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.ensemble import ExtraTreesRegressor
clf_lin = Pipeline(steps=[
('classifier', LinearRegression())
])
clf_lin_rescale = Pipeline(steps=[
('scaler', StandardScaler()),
('classifier', LinearRegression())
])
clf_lin_sgd = Pipeline(steps=[
('scaler', StandardScaler()),
('classifier', SGDRegressor(loss='squared_loss',penalty='l2',random_state=0)) # Use L2 norm
])
clf_tree = Pipeline(steps=[
('scaler', StandardScaler()),
('classifier', DecisionTreeRegressor(max_depth=10))
])
clf_forrest = Pipeline(steps=[
('scaler', StandardScaler()),
('classifier', ExtraTreesRegressor(n_estimators=10,random_state=42))
])
clf_svr_lin = Pipeline(steps=[
('scaler', StandardScaler()),
('classifier', SVR(kernel='linear'))
])
clf_svr_log = Pipeline(steps=[
('scaler', StandardScaler()),
('classifier', SVR(kernel='poly'))
])
Re-scale the data!
# Re-scale the data
scalerX = StandardScaler().fit(X_train)
scalery = StandardScaler().fit(y_train)
# Transform
X_train_r = scalerX.transform(X_train)
y_train_r = scalery.transform(y_train)
X_test_r = scalerX.transform(X_test)
y_test_r = scalery.transform(y_test)
# Check stats
print(np.max(X_train_r), np.min(X_train_r), np.mean(X_train_r))
print(np.max(y_train_r), np.min(y_train_r), np.mean(y_train_r))
10.2028980046 -4.66702040845 2.47038706385e-15
2.91774920367 -1.93147098641 3.58552238032e-16
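Because the target was standardized too, predictions come out in standardized units; `inverse_transform` maps them back to the original scale. A small sketch on made-up prices (the reshaping assumes a newer scikit-learn, which expects 2-D input to scalers):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

y_toy = np.array([24.0, 21.6, 34.7, 33.4])           # hypothetical prices in $1000s
scaler = StandardScaler().fit(y_toy.reshape(-1, 1))  # newer sklearn wants 2-D arrays
y_std = scaler.transform(y_toy.reshape(-1, 1)).ravel()

# A "prediction" in standardized units maps back to dollars
y_back = scaler.inverse_transform(y_std.reshape(-1, 1)).ravel()
assert np.allclose(y_back, y_toy)
```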
*Linear regression*
# Normal
train_and_evaluate(clf_lin,X_train,y_train)
Coefficient of determination on training set: 0.754922212481
Average coefficient of determination using 5-fold cross-validation: 0.71357937169
# Change the k-fold validation split
evaluate_cross_validation(clf_lin,X_train,y_train,5)
[ 0.77283295 0.62650766 0.76095746 0.69548657 0.78776521] Mean score: 0.729 (+/-0.030)
# Run on re-scaled data
train_and_evaluate(clf_lin,X_train_r,y_train_r)
Coefficient of determination on training set: 0.754922212481
Average coefficient of determination using 5-fold cross-validation: 0.71357937169
*SGDRegressor*
Note that we can include a penalty to avoid overfitting:
evaluate_cross_validation(clf_lin_sgd,X_train,y_train,5)
[ 0.75092485 0.62711972 0.77537704 0.64586213 0.76790778] Mean score: 0.713 (+/-0.032)
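To see the L2 penalty at work (a sketch beyond the original notes; it uses `Ridge` and the diabetes dataset, since `load_boston` is absent from recent scikit-learn), the coefficient norm shrinks as the regularization strength grows:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

X_d, y_d = load_diabetes(return_X_y=True)
X_d = StandardScaler().fit_transform(X_d)

norms = []
for alpha in [0.01, 1.0, 100.0]:
    coef = Ridge(alpha=alpha).fit(X_d, y_d).coef_
    norms.append(np.linalg.norm(coef))

# Stronger L2 penalty -> smaller weight vector
assert norms[0] > norms[1] > norms[2]
print(norms)
```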
*Trees*
evaluate_cross_validation(clf_tree,X_train,y_train,5)
[ 0.79676533 0.72310177 0.70000383 0.88639828 0.88403543] Mean score: 0.798 (+/-0.039)
*Random forest*
evaluate_cross_validation(clf_forrest,X_train,y_train,5)
[ 0.85341746 0.77618484 0.86206256 0.88043327 0.86252547] Mean score: 0.847 (+/-0.018)
*Support Vector Machines*
evaluate_cross_validation(clf_svr_lin,X_train,y_train,5)
print "---"
evaluate_cross_validation(clf_svr_log,X_train,y_train,5)
[ 0.7722969   0.60929476  0.77767164  0.64406362  0.71482953]
Mean score: 0.704 (+/-0.034)
---
[ 0.7032406   0.67363599  0.66652327  0.4995156   0.75848911]
Mean score: 0.660 (+/-0.043)
Let's move ahead with linear regression and random forests.
clf_linear_no_pipe=LinearRegression()
train_and_evaluate(clf_linear_no_pipe,X_train_r,y_train_r)
Coefficient of determination on training set: 0.754922212481
Average coefficient of determination using 5-fold cross-validation: 0.71357937169
Let's grab the learned weights.
linear=pd.DataFrame(clf_linear_no_pipe.coef_)
linear=linear.transpose()
linear.columns=data.feature_names[0:13]
linear.head(3)
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.112866 | 0.130689 | 0.01208 | 0.090544 | -0.178805 | 0.31822 | -0.017445 | -0.333202 | 0.267166 | -0.217379 | -0.203847 | 0.056625 | -0.407941 |
Now, let's do the same with the random forest.
clf_rf_no_pipe=ExtraTreesRegressor(n_estimators=10,random_state=42)
train_and_evaluate(clf_rf_no_pipe,X_train_r,y_train_r)
Coefficient of determination on training set: 1.0
Average coefficient of determination using 5-fold cross-validation: 0.861916487933
Let's compare the learned weights between the two methods.
trees=pd.DataFrame(clf_rf_no_pipe.feature_importances_)
trees=trees.transpose()
trees.columns=data.feature_names[0:13]
m=pd.concat([linear,trees],axis=0).transpose()
m.columns=['Linear Regression','Random Forest']
sns.heatmap(m)
Key drivers:
from sklearn import metrics
def measure_performance(X, y, clf, show_accuracy=True, show_classification_report=True,
                        show_confusion_matrix=True, show_r2_score=False):
    y_pred = clf.predict(X)
    if show_accuracy:
        print("Accuracy:{0:.3f}".format(metrics.accuracy_score(y, y_pred)), "\n")
    if show_classification_report:
        print("Classification report")
        print(metrics.classification_report(y, y_pred), "\n")
    if show_confusion_matrix:
        print("Confusion matrix")
        print(metrics.confusion_matrix(y, y_pred), "\n")
    if show_r2_score:
        print("Coefficient of determination:{0:.3f}".format(metrics.r2_score(y, y_pred)), "\n")
measure_performance(X_test_r,y_test_r,clf_rf_no_pipe,show_accuracy=False,show_classification_report=False,show_confusion_matrix=False,show_r2_score=True)
Coefficient of determination:0.828
Compare normalized housing price prediction and target.
y_pred=clf_rf_no_pipe.predict(X_test_r)
plt.scatter(y_test_r,y_pred)
measure_performance(X_test_r,y_test_r,clf_linear_no_pipe,show_accuracy=False,show_classification_report=False,show_confusion_matrix=False,show_r2_score=True)
Coefficient of determination:0.676
y_pred=clf_linear_no_pipe.predict(X_test_r)
plt.scatter(y_test_r,y_pred)