These are notes I wrote for myself while studying for PhD qualifying exams at Stanford.
I wanted to publish them so that others could benefit.
This draws from:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn import datasets
%matplotlib inline
Get data.
iris = datasets.load_iris()
X = iris.data
y = iris.target
l = iris.target_names
Class labels.
We want numerical class labels. We can use `LabelEncoder` to do this:
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
label_encoder = enc.fit(y)
y = label_encoder.transform(y) + 1
label_dict = {1: 'Setosa', 2: 'Versicolor', 3:'Virginica'}
Peek at the schema.
d=pd.DataFrame(X)
d.columns=iris.feature_names
d['Class_Bin']=y
d['Class_Name']=d['Class_Bin'].apply(lambda x: l[x-1])  # labels were shifted by +1 above
d.columns=['s_l','s_w','p_l','p_w','b','name']
d=d[['s_l','s_w','p_l','p_w','name']]
d.head(3)
| | s_l | s_w | p_l | p_w | name |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
Get distributions across categorical labels.
There are many nice tricks for this in seaborn:
http://stanford.edu/~mwaskom/software/seaborn/tutorial/axis_grids.html
g = sns.FacetGrid(d, row="name",hue="name",size=1.7,aspect=4)
g.map(sns.distplot,"s_l")
You can easily compare across categorical labels, as well.
iris = sns.load_dataset("iris")
g = sns.PairGrid(iris)
g.map(plt.scatter)
We can ask which features explain the most variance in the sample data.
These features may be most useful for discriminating between flower types.
d_pca=d[['s_l','s_w','p_l','p_w']]
sns.jointplot(d_pca['s_l'],d_pca['s_w'])
Normalize by z-score, giving each feature $\mu=0$, $\sigma^2=1$.
X_1=np.subtract(d_pca,d_pca.mean(axis=0))
X_2=np.divide(X_1,X_1.std(axis=0))
Now, the means are zero and the data have the same variance across features.
We can see this.
sns.jointplot(X_2['s_l'],X_2['s_w'])
Next, we apply PCA. This simply compresses the data into a smaller number of features, or axes.
The axes are the principal eigenvectors of the covariance matrix.
Variance of the data is maximized along the first principal component.
The coefficients (loadings) tell us the degree to which each feature contributes to defining this direction of greatest variance:
from sklearn.decomposition import PCA
n_comp = 2
pca = PCA(n_components=n_comp)
X_t = pca.fit(X_2).transform(X_2)
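As a quick sanity check (a sketch on synthetic data, not part of the original notebook), the principal axes can be recovered directly as the top eigenvectors of the covariance matrix:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X_toy = rng.randn(100, 4)               # toy data, already roughly standardized
Xc = X_toy - X_toy.mean(axis=0)         # center each feature

# Eigendecomposition of the covariance matrix
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]       # sort by decreasing variance
top2 = eigvecs[:, order[:2]]            # the first two principal axes

# sklearn's PCA finds the same axes (each possibly with a flipped sign)
pca_check = PCA(n_components=2).fit(Xc)
assert np.allclose(np.abs(pca_check.components_.T), np.abs(top2))
```

The eigenvalues themselves are the variances along each axis, which is why sorting them descending recovers the usual PC ordering.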
We can plot along our categories:
ix=np.array(d['name'])
l=list(set(d['name']))
colors=['r','g','b']
for i, c in zip(l, colors):
    plt.scatter(X_t[ix==i, 0], X_t[ix==i, 1], color=c, edgecolor="white", s=50, lw=1, label=i)
plt.legend(loc='upper left', scatterpoints=1, prop={'size': 8})
From the above, we can see that PCA is also a nice way to "cluster" the data.
Below, we also see that each feature contributes to both PC1 and PC2.
loadings = pca.components_
print(loadings)
[[ 0.52237162 -0.26335492  0.58125401  0.56561105]
 [-0.37231836 -0.92555649 -0.02109478 -0.06541577]]
Note that we can do a few different kinds of pre-processing on the data.
It is often useful to standardize features by removing the mean and scaling to unit variance:
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
Note that we can use PCA, as shown above, to reduce dimensionality.
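For reference, `StandardScaler` performs exactly the manual z-scoring used earlier (a tiny sketch; note that NumPy's `std` defaults to the population form, `ddof=0`, which matches the scaler, whereas pandas' `.std()` uses `ddof=1`):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

# StandardScaler: subtract the mean, divide by the (population) std
scaled = StandardScaler().fit_transform(X)
manual = (X - X.mean(axis=0)) / X.std(axis=0)  # np.std uses ddof=0 by default

assert np.allclose(scaled, manual)
```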
Here are some classification pipelines that we can use, which have different modes of classification.
In the multiclass case for Logistic Regression, the training algorithm uses a one-vs-rest (OvR) scheme by default, rather than the "true" multinomial logistic regression.
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
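To make the distinction concrete (a sketch, not from the original notes): OvR can be requested explicitly via `OneVsRestClassifier`, while a plain `LogisticRegression` with the `lbfgs` solver fits the multinomial (softmax) model in newer scikit-learn versions:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)

# Explicit one-vs-rest: one binary classifier per class
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

# Multinomial (softmax) fit; the default for multi-class lbfgs in recent versions
multi = LogisticRegression(solver='lbfgs', max_iter=1000).fit(X, y)

print(len(ovr.estimators_))  # 3 binary classifiers, one per class
print(multi.coef_.shape)     # one weight vector per class: (3, 4)
```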
from sklearn.cross_validation import cross_val_score, KFold
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from scipy.stats import sem
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
clf_nb_simple = Pipeline(steps=[
('classifier', GaussianNB())
])
clf_nb_all = Pipeline(steps=[
('scaler', StandardScaler()),
('classifier', GaussianNB())
])
clf_nb_pca = Pipeline(steps=[
('scaler', StandardScaler()),
('reduce_dim', PCA(n_components=2)),
('classifier', GaussianNB())
])
clf_lr_simple = Pipeline(steps=[
('classifier', LogisticRegression())
])
clf_lr_pca = Pipeline(steps=[
('reduce_dim', PCA(n_components=2)),
('classifier', LogisticRegression())
])
clf_svm_linear = Pipeline(steps=[
('classifier', svm.SVC(kernel='linear',C=1.0))
])
clf_svm_poly = Pipeline(steps=[
('classifier', svm.SVC(kernel='poly',C=1.0))
])
clf_rf = Pipeline(steps=[
('classifier', RandomForestClassifier(n_estimators=30))
])
Here is a function we can use to perform $k$-fold cross-validation for each pipeline.
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.KFold.html
import os
from IPython.display import Image
i = Image(filename=os.getcwd()+'/Images/train.png')
i
def evaluate_cross_validation(clf, X, y, K):
    # K-fold cross-validation iterator
    cv = KFold(len(y), K, shuffle=True, random_state=0)
    # Score used is the one returned by the estimator's score method (accuracy)
    scores = cross_val_score(clf, X, y, cv=cv)
    # Report output
    print(scores)
    print("Mean score: {0:.3f} (+/-{1:.3f})".format(np.mean(scores), sem(scores)))
Here's our data. Let's split it for testing and training.
# Data
iris = datasets.load_iris()
X = iris.data
y = iris.target
# Splits
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.40)
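An aside beyond the original notes: for classification it is often safer to stratify the split so each class keeps its proportion in both sets; `train_test_split` supports this via the `stratify` argument (shown with the newer `sklearn.model_selection` import path):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older versions

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.40, stratify=y, random_state=0)

# Each class keeps the same share in train and test
print(np.bincount(y_tr))  # 30 per class
print(np.bincount(y_te))  # 20 per class
```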
Now, let's run CV on each pipeline. (From the output below: z-score normalization doesn't change the GaussianNB results, while PCA reduces quality.)
for clf in [clf_nb_simple, clf_nb_all, clf_nb_pca, clf_lr_simple, clf_lr_pca, clf_svm_linear, clf_svm_poly, clf_rf]:
    evaluate_cross_validation(clf, X_train, y_train, 5)
    print("---")
[ 0.94444444  0.94444444  1.          1.          0.77777778]
Mean score: 0.933 (+/-0.041)
---
[ 0.94444444  0.94444444  1.          1.          0.77777778]
Mean score: 0.933 (+/-0.041)
---
[ 0.88888889  0.83333333  0.88888889  0.94444444  0.83333333]
Mean score: 0.878 (+/-0.021)
---
[ 0.94444444  1.          0.94444444  0.77777778  0.88888889]
Mean score: 0.911 (+/-0.038)
---
[ 0.83333333  0.94444444  1.          0.72222222  0.66666667]
Mean score: 0.833 (+/-0.063)
---
[ 1.          1.          1.          1.          0.88888889]
Mean score: 0.978 (+/-0.022)
---
[ 0.88888889  0.88888889  1.          1.          0.83333333]
Mean score: 0.922 (+/-0.033)
---
[ 0.94444444  0.94444444  0.94444444  1.          0.83333333]
Mean score: 0.933 (+/-0.027)
---
SVM with a linear kernel does the best.
But let's just use logistic regression; it also works well, and the feature weights can be extracted for interpretation.
We can use a confusion matrix to evaluate our performance:
http://scikit-learn.org/stable/auto_examples/plot_confusion_matrix.html
from sklearn import metrics
# Let's not use the pipeline, because we want to extract learned parameters!
clf_lr_simple=LogisticRegression()
clf_lr_simple.fit(X_train, y_train)
pred_test = clf_lr_simple.predict(X_test)
print('Prediction accuracy for the test dataset:')
print('{:.2%}'.format(metrics.accuracy_score(y_test, pred_test)))
print('\nConfusion Matrix of the Logistic regression classifier:')
print(metrics.confusion_matrix(y_test,clf_lr_simple.predict(X_test)))
Prediction accuracy for the test dataset:
98.33%

Confusion Matrix of the Logistic regression classifier:
[[15  0  0]
 [ 0 23  0]
 [ 0  1 21]]
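Per-class recall and precision can be read straight off a confusion matrix: recall is the diagonal over the row sums (rows are true classes), precision is the diagonal over the column sums. A sketch on a tiny hand-made example:

```python
import numpy as np
from sklearn import metrics

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]  # one class-1 sample mislabeled as class 2

cm = metrics.confusion_matrix(y_true, y_pred)

# Recall: fraction of each true class recovered (diagonal / row sums)
recall = np.diag(cm) / cm.sum(axis=1)
# Precision: fraction of each predicted class that is correct (diagonal / column sums)
precision = np.diag(cm) / cm.sum(axis=0)

print(recall)     # class 1 recall drops to 0.5
print(precision)  # class 2 precision drops to 2/3
```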
clf_lr_simple.coef_
array([[ 0.41221728,  1.28871111, -2.05561652, -0.89539473],
       [ 0.15094567, -1.08540973,  0.65545797, -1.07027715],
       [-1.24459794, -1.40446219,  1.90860717,  1.93783625]])
clf_lr_simple.intercept_
array([ 0.25774972, 0.33573647, -0.74802498])
Let's redo this with only two classes (setosa vs. versicolor).
# Data
iris = datasets.load_iris()
X = iris.data
y = iris.target
ix=y!=2
X=X[ix,:]
y=y[ix]
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.40)
clf_lr_simple.fit(X_train, y_train)
pred_test = clf_lr_simple.predict(X_test)
print('Prediction accuracy for the test dataset:')
print('{:.2%}'.format(metrics.accuracy_score(y_test, pred_test)))
print('\nConfusion Matrix of the Logistic regression classifier:')
print(metrics.confusion_matrix(y_test,clf_lr_simple.predict(X_test)))
Prediction accuracy for the test dataset:
100.00%

Confusion Matrix of the Logistic regression classifier:
[[21  0]
 [ 0 19]]
clf_lr_simple.coef_
array([[-0.32885884, -1.31791178, 1.97767729, 0.90105476]])
The model heavily weights the second and third features, which makes sense (below).
ix = np.array(y_train)
l = list(set(y_train))
colors = ['r', 'g']
for i, c in zip(l, colors):
    plt.scatter(X_train[ix==i, 1], X_train[ix==i, 2], color=c, edgecolor="white", s=50, lw=1, label=i)
plt.legend(loc='upper left', scatterpoints=1, prop={'size': 8})
from sklearn.datasets import load_boston
data = load_boston()
X = data.data
y = data.target
print(X.shape)
print(y.shape)
(506, 13)
(506,)
Examine PCA without scaling.
n_comp = 2
pca = PCA(n_components=n_comp)
X_t = pca.fit(X).transform(X)
plt.scatter(X_t[:,0], X_t[:,1], color='k', edgecolor="white", s=50, lw=1)
Remember: always re-scale prior to PCA!
X_1=np.subtract(X,X.mean(axis=0))
X_2=np.divide(X_1,X_1.std(axis=0))
# pipeline = Pipeline([('scaling',StandardScaler()),('pca',PCA(n_components=2))])
# X_t=pipeline.fit_transform(X)
X_t = pca.fit(X_2).transform(X_2)
plt.scatter(X_t[:,0], X_t[:,1], color='k', edgecolor="white", s=50, lw=1)
The loadings tell us each feature's contribution to the principal components; in this case, the contributions are not heavily skewed.
loads=pd.DataFrame(pca.components_)
loads.columns=list(data.feature_names[0:13])
loads=np.abs(loads)
s=loads.sum(axis=0)
s = s.sort_values(ascending=False)  # in newer pandas, in-place Series.sort() is gone
loads=loads[s.index]
loads
| | DIS | AGE | RAD | ZN | TAX | NOX | CRIM | PTRATIO | CHAS | INDUS | B | LSTAT | RM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.321735 | 0.313851 | 0.319817 | 0.256521 | 0.338539 | 0.342976 | 0.249593 | 0.205021 | 0.005099 | 0.346861 | 0.202732 | 0.309841 | 0.189437 |
| 1 | 0.349181 | 0.311748 | 0.270398 | 0.321308 | 0.238859 | 0.219857 | 0.313186 | 0.308704 | 0.456726 | 0.111816 | 0.234957 | 0.075982 | 0.153877 |
DIS: weighted distances to five Boston employment centres.
AGE: proportion of owner-occupied units built prior to 1940.
RM: average number of rooms per dwelling.
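How much total variance the two components capture is available directly from `explained_variance_ratio_` (a quick sketch; shown on standardized iris data for self-containment, but the same call applies to the Boston features above):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_iris = StandardScaler().fit_transform(load_iris().data)
pca2 = PCA(n_components=2).fit(X_iris)

print(pca2.explained_variance_ratio_)        # per-component share of total variance
print(pca2.explained_variance_ratio_.sum())  # fraction captured by the first two PCs
```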
feat=pd.DataFrame(X)
feat.columns=data.feature_names[0:13]
feat.head(3)
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296 | 15.3 | 396.90 | 4.98 |
| 1 | 0.02731 | 0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.90 | 9.14 |
| 2 | 0.02729 | 0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 |
sns.kdeplot(feat['RM'],shade=True,color="#6495ED")
# Median Value (attribute 14) is usually the target!
sns.kdeplot(data.target,shade=True,color="#6495ED")
plt.xlabel('price ($1000s)')
Get the data and parse into test/train sets.
from sklearn.cross_validation import train_test_split
data = load_boston()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.25,random_state=33)
Validation. We reuse the `evaluate_cross_validation` function defined above, and add a helper that trains a model and scores it:
def train_and_evaluate(clf, X_train, y_train):
    clf.fit(X_train, y_train)
    print("Coefficient of determination on training set:", clf.score(X_train, y_train))
    # Create a k-fold cross-validation iterator of k=5 folds
    cv = KFold(X_train.shape[0], 5, shuffle=True, random_state=33)
    scores = cross_val_score(clf, X_train, y_train, cv=cv)
    print("Average coefficient of determination using 5-fold cross-validation:", np.mean(scores))
Pipelines. A key point is to normalize the data, so that large-valued features don't dominate the final result!
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import SGDRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.ensemble import ExtraTreesRegressor
clf_lin = Pipeline(steps=[
('classifier', LinearRegression())
])
clf_lin_rescale = Pipeline(steps=[
('scaler', StandardScaler()),
('classifier', LinearRegression())
])
clf_lin_sgd = Pipeline(steps=[
('scaler', StandardScaler()),
('classifier', SGDRegressor(loss='squared_loss',penalty='l2',random_state=0)) # Use L2 norm
])
clf_tree = Pipeline(steps=[
('scaler', StandardScaler()),
('classifier', DecisionTreeRegressor(max_depth=10))
])
clf_forrest = Pipeline(steps=[
('scaler', StandardScaler()),
('classifier', ExtraTreesRegressor(n_estimators=10,random_state=42))
])
clf_svr_lin = Pipeline(steps=[
('scaler', StandardScaler()),
('classifier', SVR(kernel='linear'))
])
clf_svr_log = Pipeline(steps=[
('scaler', StandardScaler()),
('classifier', SVR(kernel='poly'))
])
Re-scale the data!
# Re-scale the data
scalerX = StandardScaler().fit(X_train)
scalery = StandardScaler().fit(y_train)
# Transform
X_train_r = scalerX.transform(X_train)
y_train_r = scalery.transform(y_train)
X_test_r = scalerX.transform(X_test)
y_test_r = scalery.transform(y_test)
# Check stats
print(np.max(X_train_r), np.min(X_train_r), np.mean(X_train_r))
print(np.max(y_train_r), np.min(y_train_r), np.mean(y_train_r))
10.2028980046 -4.66702040845 2.47038706385e-15
2.91774920367 -1.93147098641 3.58552238032e-16
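Because the target was standardized too, predictions come out in standardized units; `inverse_transform` maps them back to the original scale. A small sketch on made-up prices (the reshaping assumes a newer scikit-learn, which expects 2-D input to scalers):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

y_toy = np.array([24.0, 21.6, 34.7, 33.4])           # hypothetical prices in $1000s
scaler = StandardScaler().fit(y_toy.reshape(-1, 1))  # newer sklearn wants 2-D arrays
y_std = scaler.transform(y_toy.reshape(-1, 1)).ravel()

# A "prediction" in standardized units maps back to dollars
y_back = scaler.inverse_transform(y_std.reshape(-1, 1)).ravel()
assert np.allclose(y_back, y_toy)
```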
*Linear regression*
# Normal
train_and_evaluate(clf_lin,X_train,y_train)
Coefficient of determination on training set: 0.754922212481
Average coefficient of determination using 5-fold cross-validation: 0.71357937169
# Change the k-fold validation split
evaluate_cross_validation(clf_lin,X_train,y_train,5)
[ 0.77283295 0.62650766 0.76095746 0.69548657 0.78776521] Mean score: 0.729 (+/-0.030)
# Run on re-scaled data
train_and_evaluate(clf_lin,X_train_r,y_train_r)
Coefficient of determination on training set: 0.754922212481
Average coefficient of determination using 5-fold cross-validation: 0.71357937169
*SGDRegressor*
Note that we can include a penalty to avoid overfitting:
evaluate_cross_validation(clf_lin_sgd,X_train,y_train,5)
[ 0.75092485 0.62711972 0.77537704 0.64586213 0.76790778] Mean score: 0.713 (+/-0.032)
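To see the L2 penalty at work (a sketch beyond the original notes; it uses `Ridge` and the diabetes dataset, since `load_boston` is absent from recent scikit-learn), the coefficient norm shrinks as the regularization strength grows:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

X_d, y_d = load_diabetes(return_X_y=True)
X_d = StandardScaler().fit_transform(X_d)

norms = []
for alpha in [0.01, 1.0, 100.0]:
    coef = Ridge(alpha=alpha).fit(X_d, y_d).coef_
    norms.append(np.linalg.norm(coef))

# Stronger L2 penalty -> smaller weight vector
assert norms[0] > norms[1] > norms[2]
print(norms)
```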
*Trees*
evaluate_cross_validation(clf_tree,X_train,y_train,5)
[ 0.79676533 0.72310177 0.70000383 0.88639828 0.88403543] Mean score: 0.798 (+/-0.039)
*Random forest*
evaluate_cross_validation(clf_forrest,X_train,y_train,5)
[ 0.85341746 0.77618484 0.86206256 0.88043327 0.86252547] Mean score: 0.847 (+/-0.018)
*Support Vector Machines*
evaluate_cross_validation(clf_svr_lin,X_train,y_train,5)
print "---"
evaluate_cross_validation(clf_svr_log,X_train,y_train,5)
[ 0.7722969   0.60929476  0.77767164  0.64406362  0.71482953]
Mean score: 0.704 (+/-0.034)
---
[ 0.7032406   0.67363599  0.66652327  0.4995156   0.75848911]
Mean score: 0.660 (+/-0.043)
Let's move ahead with linear regression and random forests.
clf_linear_no_pipe=LinearRegression()
train_and_evaluate(clf_linear_no_pipe,X_train_r,y_train_r)
Coefficient of determination on training set: 0.754922212481
Average coefficient of determination using 5-fold cross-validation: 0.71357937169
Let's grab the learned weights.
linear=pd.DataFrame(clf_linear_no_pipe.coef_)
linear=linear.transpose()
linear.columns=data.feature_names[0:13]
linear.head(3)
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.112866 | 0.130689 | 0.01208 | 0.090544 | -0.178805 | 0.31822 | -0.017445 | -0.333202 | 0.267166 | -0.217379 | -0.203847 | 0.056625 | -0.407941 |
Now, let's do the same with the random forest.
clf_rf_no_pipe=ExtraTreesRegressor(n_estimators=10,random_state=42)
train_and_evaluate(clf_rf_no_pipe,X_train_r,y_train_r)
Coefficient of determination on training set: 1.0
Average coefficient of determination using 5-fold cross-validation: 0.861916487933
Let's compare the learned weights between the two methods.
trees=pd.DataFrame(clf_rf_no_pipe.feature_importances_)
trees=trees.transpose()
trees.columns=data.feature_names[0:13]
m=pd.concat([linear,trees],axis=0).transpose()
m.columns=['Linear Regression','Random Forest']
sns.heatmap(m)
Key drivers:
from sklearn import metrics
def measure_performance(X, y, clf, show_accuracy=True, show_classification_report=True,
                        show_confusion_matrix=True, show_r2_score=False):
    y_pred = clf.predict(X)
    if show_accuracy:
        print("Accuracy:{0:.3f}".format(metrics.accuracy_score(y, y_pred)), "\n")
    if show_classification_report:
        print("Classification report")
        print(metrics.classification_report(y, y_pred), "\n")
    if show_confusion_matrix:
        print("Confusion matrix")
        print(metrics.confusion_matrix(y, y_pred), "\n")
    if show_r2_score:
        print("Coefficient of determination:{0:.3f}".format(metrics.r2_score(y, y_pred)), "\n")
measure_performance(X_test_r,y_test_r,clf_rf_no_pipe,show_accuracy=False,show_classification_report=False,show_confusion_matrix=False,show_r2_score=True)
Coefficient of determination:0.828
Compare normalized housing price prediction and target.
y_pred=clf_rf_no_pipe.predict(X_test_r)
plt.scatter(y_test_r,y_pred)
measure_performance(X_test_r,y_test_r,clf_linear_no_pipe,show_accuracy=False,show_classification_report=False,show_confusion_matrix=False,show_r2_score=True)
Coefficient of determination:0.676
y_pred=clf_linear_no_pipe.predict(X_test_r)
plt.scatter(y_test_r,y_pred)