%pylab inline
%load_ext autoreload
%autoreload
import numpy as np
import matplotlib.pyplot as pl
import util
from scipy.stats import itemfreq
from scipy.stats import bernoulli
from scipy.stats import multivariate_normal as mvnorm
import sklearn
import scipy
import matplotlib
print("SKLEARN",sklearn.__version__)
print("SCIPY",scipy.version.full_version)
print("NUMPY",np.__version__)
print("MATPLOTLIB",matplotlib.__version__)
X, Y = util.load_data() # passenger_class, is_female, sibsp, parch, fare, embarked (categorical 0-3)
X_demean = X - np.mean(X, axis=0)
X_unitsd = X_demean/(np.std(X_demean,axis=0))
X_whiten = np.dot(X_demean, util.whitening_matrix(X_demean))
SKLEARN 0.15.2
SCIPY 0.14.1
NUMPY 1.8.2
MATPLOTLIB 1.3.1
Classification models learn a mapping $h(\boldsymbol{X})$ from a feature space $\boldsymbol{X}$ to a finite set of labels $\boldsymbol{Y}$.
In this lab we will focus for simplicity on binary classification, where the labels are assumed to be in $\{-1,1\}$ or alternatively $\{0,1\}$.
We will use simple generated datasets and a real data set on the sinking of the Titanic to explore some different classification algorithms. For a description of the variables and more information on the data see: https://www.kaggle.com/c/titanic-gettingStarted/data
One approach to learning a classification function $h(\boldsymbol{X})$ is to model $P(y|\boldsymbol{x})$ and convert that to a classification by setting: \begin{equation}h(\boldsymbol{X}) = \begin{cases} 1 & \text{if }P(y|\boldsymbol{x}) > \frac{1}{2}\\ 0 & \text{otherwise}\end{cases} \end{equation}
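The thresholding step itself is trivial to write down. A minimal sketch (the probability `p_y_given_x` here is a hypothetical input, produced by whatever model we fit):

```python
def classify(p_y_given_x, threshold=0.5):
    """Convert an estimated probability P(y=1|x) into a hard 0/1 label."""
    return 1 if p_y_given_x > threshold else 0

# e.g. an estimated survival probability of 0.89 maps to label 1
print(classify(0.89), classify(0.14))  # -> 1 0
```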
Example: Suppose we want to build a model to predict the probability of survival on the Titanic based on just two categorical features, a person's class (1, 2 or 3) and their sex (1=female, 0=male). An obvious approach would be to create a category for each combination of our features (female 1st, female 2nd ... male 3rd) and calculate the proportion who survived in each as an estimate for the survival probability $P(y|\boldsymbol{x})$. For each observation in our test data, we simply look up the survival rate in the corresponding category.
This corresponds to maximum likelihood estimation: $\hat{\theta} = argmax_{\theta'}P(data|\theta')$, where the parameters, $\theta$, we want to estimate are the true probabilities $P(y|\boldsymbol{x})$ for each combination of $\boldsymbol{x}$, and $data$ is the set of features and labels we have observed.
combinations = [(i,j) for i in [1,2,3] for j in [0,1]]
for c in combinations:
    match = np.where((X[:,0] == c[0]) * (X[:,1] == c[1]))[0]
    print(c, sum(Y[match])/float(len(match)))
(1, 0) 0.38961038961
(1, 1) 0.984615384615
(2, 0) 0.178082191781
(2, 1) 0.912280701754
(3, 0) 0.142180094787
(3, 1) 0.458333333333
Question: Why will this approach not work in general? What happens as we increase the number of features or the number of values each feature can take? What about if features are continuous?
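To make the scaling problem concrete, a quick count of the number of feature-combination cells the lookup table would need (a back-of-the-envelope sketch; `n_cells` is a helper introduced here):

```python
# with D features each taking K values, the lookup table has K**D cells
def n_cells(K, D):
    return K ** D

for D in (2, 5, 10, 20):
    print(D, n_cells(3, D))
# with 3-valued features, D=20 already gives 3**20 (about 3.5 billion) cells,
# so almost every cell contains no training examples and its proportion is undefined
```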
Following Bayes Rule, we can write:
$P(y|\boldsymbol{x}) \propto P(y)P(\boldsymbol{x}|y) = P(y)P(x_1,x_2...x_D|y) $
It is easy to estimate $P(y)$ as just the proportion of each class in the training data. We could also directly estimate $P(x_1,x_2...x_D|y)$ for each $y$ (for example with kernel density estimation), but as the number of features $D$ gets large this estimation suffers the curse of dimensionality.
Naive Bayes assumes that the data was generated by a model where all the features are independent of one-another given the class label so that we can estimate $P(x_j|y)$ separately for each feature.
\begin{equation} P(y|\boldsymbol{x}) \propto P(y)\prod_{j=1}^D P(x_j|y) \end{equation}The normalisation constant can be obtained as follows:
\begin{equation} P(y|\boldsymbol{x}) = \frac{P(y)\prod_{j=1}^D P(x_j|y)}{P(\boldsymbol{x})}, \end{equation}where, \begin{equation} P(\boldsymbol{x}) = P(y=0)\prod_{j=1}^D P(x_j|y=0) + P(y=1)\prod_{j=1}^D P(x_j|y=1), \end{equation} this operation is called marginalisation, since we marginalise (or sum/integrate out) $y$ from the joint distribution (top line) $P(y, \mathbf{x}) = P(y)P(\mathbf{x}|y)$ to obtain a distribution over $\mathbf{x}$.
Exercise: Implement a Naive Bayes model for the Titanic data set using passenger_class and is_female as features
# a function that may be useful
def proportions(array):
    """ returns a map from each unique value in the input array to the proportion of times that value occurs """
    prop = itemfreq(array)
    prop[:,1] = prop[:,1]/np.sum(prop, axis=0)[1]
    return dict(prop)
class Naive_Bayes:
    def train(self, X, Y):
        """ trains the model with features X and labels Y """
        # 1) Estimate P(Y=1)
        self.py = sum(Y)/float(len(Y))
        # 2) For each feature, x, estimate P(x|y=1) and P(x|y=0)
        survived = X[np.where(Y==1)[0],:]  # the features of those who survived
        died = X[np.where(Y==0)[0],:]      # the features of those who died
        # estimate P(gender|survived) - a map from gender value to probability
        self.gs = proportions(survived[:,1])
        # estimate P(class|survived) - a map from class to probability
        self.cs = proportions(survived[:,0])
        # estimate P(gender|died) - a map from gender value to probability
        self.gd = proportions(died[:,1])
        # estimate P(class|died) - a map from class to probability
        self.cd = proportions(died[:,0])

    def predict(self, sex, p_class):
        """ outputs the probability of survival for a given class and gender """
        # calculate unnormalised P(y=1|sex,p_class) as P(y=1)P(sex|y=1)P(p_class|y=1)
        ps = self.py*self.gs[sex]*self.cs[p_class]
        # calculate unnormalised P(y=0|sex,p_class) as P(y=0)P(sex|y=0)P(p_class|y=0)
        pd = (1-self.py)*self.gd[sex]*self.cd[p_class]
        # normalise via the ratio r = ps/pd, so that P(y=1|...) = r/(1+r)
        r = ps/pd
        psn = r/(1+r)
        # or equivalently: psn = ps / (ps + pd)
        return psn
# run the model
model = Naive_Bayes()
model.train(X,Y)
for p_class, sex in combinations:
    print((p_class, sex), model.predict(sex, p_class))
(1, 0) 0.424333952192
(1, 1) 0.889163490822
(2, 0) 0.273467443321
(2, 1) 0.803786624851
(3, 0) 0.106778873609
(3, 1) 0.565412381535
Exercise: Compare these predictions with those based purely on the per-category survival proportions. How well does the Naive Bayes independence assumption hold in this case?
Question: How does the number of parameters to be learnt scale with the number of features for Naive Bayes?
Exercise: Run Naive Bayes from scikit-learn using the same features
from sklearn.preprocessing import OneHotEncoder
from sklearn.naive_bayes import MultinomialNB
# Sklearn doesn't have a model that expects categorical data.
# We need to first encode our (p_class, sex) to (is_first,is_second,is_third,is_female,is_male)
# use preprocessing.OneHotEncoder to create a new dataset X2 that is the transformation of the first 2 columns of X
nb_enc = OneHotEncoder()
X2 = nb_enc.fit(X[:,0:2]).transform(X[:,0:2]).toarray()
# fit a Multinomial Naive Bayes model
nb = MultinomialNB()
nb.fit(X2,Y)
# transforms our combinations to the one-hot encoding
c = nb_enc.transform(np.asarray(combinations)).toarray()
# gets predictions for each combination
predictions = nb.predict_proba(c)
# prints your predictions in the same format as previous models
for i in range(len(c)):
    print(combinations[i], predictions[i][1])
(1, 0) 0.42266307097
(1, 1) 0.886389226883
(2, 0) 0.274102383273
(2, 1) 0.800962556238
(3, 0) 0.107960953587
(3, 1) 0.563280897519
Naive Bayes can also handle continuous features. The data below is generated by a Gaussian mixture model: for each class there is a separate 2-dimensional Gaussian distribution over the features x1, x2.
# Generates some data from a Gaussian Mixture Model.
mean0 = [-1,-1] # the mean of the gaussian for class 0
mean1 = [1,1] # the mean of the gaussian for class 1
cov0 = [[.5, .28], [.28, .5]] # the covariance matrix for class 0
cov1 = [[1, -.8], [-.8, 1]] # the covariance matrix for class 1
mixture = util.GaussianMixture(mean0,cov0,mean1,cov1)
mX,mY = mixture.sample(500,0.5,plot=True)
Exercise: Fit A Gaussian Naive Bayes model using Sklearn
from sklearn.naive_bayes import GaussianNB
# fit a GaussianNB model
gnb = GaussianNB()
gnb.fit(mX,mY)
# plots the probability that a point (x, y) belongs to the class Y=1 according to your model, and the decision boundary p=.5
x = np.linspace(-4,4,100)
y = np.linspace(-4,4,100)
test_points = np.dstack(np.meshgrid(x, y)).reshape(-1,2)
z = gnb.predict_proba(test_points)[:,1].reshape(len(x),len(y)) # probability Y = 1
f,ax = subplots(1,1,figsize=(5,5))
cn = ax.contourf(x,y,z)
ct = ax.contour(cn,levels=[0.5])
ax.scatter(mX[:,0],mX[:,1],s=5, c = ["black" if t < 1 else "white" for t in mY],alpha=1)
ax.clabel(ct)
show()
# this data would be fit perfectly by QDA, which doesn't make the independence assumption of Naive Bayes
# try changing the covariance matrices and refitting your model
# when does the probability distribution returned by Naive Bayes resemble the true one?
Logistic regression models $P(y|\boldsymbol{x})$ directly by assuming it is a (logistic) function of a linear combination of the features. The logistic function $\theta(s) = \frac{e^s}{e^s+1}$ maps the weighted features to $[0,1]$ to allow it to model a probability. Training logistic regression corresponds to learning the weights $\boldsymbol{w}$ to maximise the likelihood function:
\begin{equation} P(y_1...y_n|\boldsymbol{x}_1...\boldsymbol{x}_n,\boldsymbol{w}) = \prod_{i=1}^n \theta(y_i\boldsymbol{w}^T\boldsymbol{x}_i) \end{equation}Maximising the likelihood $P(y_1...y_n|\boldsymbol{x}_1...\boldsymbol{x}_n,\boldsymbol{w})$ is equivalent to minimising the negative log-likelihood: \begin{equation} \boldsymbol{w}^* = argmin_{\boldsymbol{w}}\left( -\log\left(\prod_{i=1}^n \theta(y_i\boldsymbol{w}^T\boldsymbol{x}_i)\right)\right) = argmin_{\boldsymbol{w}}\left( \sum_{i=1}^n \ln(1+e^{-y_i\boldsymbol{w}^T\boldsymbol{x}_i})\right) \end{equation}
Once we have the weights $\boldsymbol{w}^*$, we can predict the probability that a new observation belongs to each class.
Question: Suppose that we have a data set that is linearly separable. What happens to the weights $w$ when we run logistic regression?
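One way to explore this question empirically: fit on a tiny, perfectly separated dataset with progressively weaker regularisation (larger C) and watch the weight grow (a sketch on made-up data; the exact values depend on the solver):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# two perfectly separated 1-d clusters
sepX = np.array([[-2.0], [-1.5], [1.5], [2.0]])
sepY = np.array([0, 0, 1, 1])

for C in (1.0, 100.0, 10000.0):
    clf = LogisticRegression(C=C).fit(sepX, sepY)
    print(C, abs(clf.coef_[0, 0]))
# the weight keeps growing with C: on separable data the unregularised
# optimum pushes |w| towards infinity
```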
Exercise: Use Sklearn to fit a logistic regression model on the gaussian mixture data.
#Run Logistic regression on the gaussian mixture data
from sklearn.linear_model import LogisticRegression
logistic = LogisticRegression(C = 10000, fit_intercept=False)
logistic.fit(mX,mY)
LogisticRegression(C=10000, class_weight=None, dual=False, fit_intercept=False, intercept_scaling=1, penalty='l2', random_state=None, tol=0.0001)
# plot the probability y = 1 over the feature space, as for Naive Bayes
logistz = logistic.predict_proba(test_points)[:,1].reshape(len(x),len(y)) # probability Y = 1
f,ax = subplots(1,1,figsize=(5,5))
cn = ax.contourf(x,y,logistz)
ct = ax.contour(cn,levels=[0.5])
ax.scatter(mX[:,0],mX[:,1],s=5, c = ["black" if t < 1 else "white" for t in mY],alpha=1)
ax.clabel(ct)
show()
# we can model more complex decision boundaries by expanding the feature space to include combinations of features
# re-fit logistic regression adding in all quadratic combinations of features ie x1,x2,x1x2,x1^2,x2^2
from sklearn.preprocessing import PolynomialFeatures
poly_expand = PolynomialFeatures(2)
m2X = poly_expand.fit(mX).transform(mX)
logistic.fit(m2X,mY)
testpoints2 = poly_expand.transform(test_points)
logistic2z = logistic.predict_proba(testpoints2)[:,1].reshape(len(x),len(y)) # probability Y = 1
f,ax = subplots(1,1,figsize=(5,5))
cn = ax.contourf(x,y,logistic2z)
ct = ax.contour(cn,levels=[0.5])
ax.scatter(mX[:,0],mX[:,1],s=5, c = ["black" if t < 1 else "white" for t in mY],alpha=1)
ax.clabel(ct)
show()
With large numbers of features there is a risk of overfitting to the training data. We can tune a logistic regression model to reduce this risk by penalising large weights $\boldsymbol{w}$.
Exercise: Experiment with the regularisation parameters sklearn provides: penalty = "l1" or "l2", and C, the inverse of the regularisation strength
lreg = LogisticRegression(penalty = "l2",C = .0001)
lreg.fit(m2X,mY)
# plots the probability as before
logistic2z_reg = lreg.predict_proba(testpoints2)[:,1].reshape(len(x),len(y)) # probability Y = 1
f,ax = subplots(1,1,figsize=(5,5))
cn = ax.contourf(x,y,logistic2z_reg)
ct = ax.contour(cn,levels=[0.5])
ax.scatter(mX[:,0],mX[:,1],s=5, c = ["black" if t < 1 else "white" for t in mY],alpha=1)
ax.clabel(ct)
show()
# Run logistic regression on the titanic data
titanic_logist = LogisticRegression()
titanic_logist.fit(X,Y)
# Look at the coefficients (weights) in the model. Are they meaningful?
# Do you need to change the way any of the features were encoded?
print(titanic_logist.coef_)
print(titanic_logist.intercept_)
[[-0.86167767  2.46368356 -0.30054043 -0.02526696  0.0037132   0.11749166]]
[ 0.50953296]
Recall for logistic regression we are trying to find (assuming we have encoded $Y$ as $\{-1,1\}$)
\begin{equation}
\boldsymbol{w}^* = argmin_{\boldsymbol{w}}\left( \sum_{i=1}^n \ln(1+e^{-y_i\boldsymbol{w}^T\boldsymbol{x}_i})\right)
\end{equation}
This is a convex optimisation problem in $\boldsymbol{w}$.
We can solve it using gradient descent (or an optimisation library).
Exercise: Implement logistic regression using scipy's optimisation library and run it on the gaussian mixture data mX,mY.
# implement logistic regression using scipy's optimisation library and run it on the Gaussian mixture data
from scipy.optimize import minimize
# copy the inputs so our modifications don't affect the original data
dataX = mX.copy()
dataY = mY.copy()
# encode the labels as -1, 1
dataY[dataY == 0] = -1
# add a column of ones to dataX to allow us to fit an intercept
dataX = np.hstack((np.ones((dataX.shape[0], 1)), dataX))
# implement the loss function
def loss(w, X, Y):
    wx = np.dot(X, w)
    l_i = np.log(1 + np.exp(-Y*wx))
    return np.sum(l_i)
# start the optimisation with randomly guessed weights
w0 = np.random.random(dataX.shape[1])
optimal = minimize(loss, w0, args=(dataX, dataY), method="BFGS")
w = optimal.x
print(w)
# how does this compare with the coefficients you saw using sklearn?
# try refitting the sklearn logistic model with a very high value of C (like 10000).
[-2.29274861 4.30844335 4.43623391]
The optimisation method we are using (BFGS) needs the Jacobian (the gradient of the loss function with respect to $\boldsymbol{w}$). Since we didn't supply it, scipy approximates it numerically with finite differences. We can speed things up by supplying it ourselves.
\begin{equation} L(\boldsymbol{w}) = \sum_{i=1}^n \ln(1+e^{-y_i\boldsymbol{w}^T\boldsymbol{x}_i}) \longleftarrow \text{ loss function}\\ \nabla{L} = [\frac{\partial L}{\partial w_1},\frac{\partial L}{\partial w_2}, ..., \frac{\partial L}{\partial w_D}] \longleftarrow \text{ definition of gradient}\\ \frac{\partial L}{\partial w_j} = -\sum_{i=1}^n x_{ij} \frac{ y_i e^{-y_i\boldsymbol{w}^T\boldsymbol{x}_i}}{1+e^{-y_i\boldsymbol{w}^T\boldsymbol{x}_i}} \longleftarrow \text{ result of taking partial derivative of loss function with respect to weight $j$}\\ \end{equation}Exercise: repeat the previous exercise but supply the Jacobian to the minimizer
# implement the Jacobian
def grad_loss(w, X, Y):
    wx = np.dot(X, w)
    ewx = np.exp(-Y*wx)
    s = -Y*ewx/(1 + ewx)
    return np.dot(s, X)
# start the optimisation with randomly guessed weights
w0 = np.random.random(dataX.shape[1])
optimal = minimize(loss, w0, args=(dataX, dataY), jac=grad_loss, method="BFGS")
print(optimal)
w = optimal.x
print(w)
        x: array([-2.29274986,  4.3084435 ,  4.43623469])
     njev: 22
  success: True
      fun: 33.81762255740206
   status: 0
     nfev: 22
      jac: array([ -5.07304650e-06,  -4.03896784e-06,   1.06056500e-06])
 hess_inv: array([[ 0.34073774, -0.31149446, -0.34200123],
       [-0.31149446,  0.47704394,  0.42733023],
       [-0.34200123,  0.42733023,  0.5493757 ]])
  message: 'Optimization terminated successfully.'
[-2.29274986  4.3084435   4.43623469]
A Support Vector Machine (SVM) is a classifier that attempts to separate classes of data with a plane. The location of the plane in the input space is selected based on a set of training examples labelled with a binary variable. To query the label of a new point, the algorithm simply determines on which side of the plane the point sits. In two dimensions, this amounts to dividing the input space into two sections with a straight line. In higher dimensions, the division is determined by a plane (or hyperplane in 4 or more dimensions). In general, we refer to SVMs as using a separating hyperplane in any dimension.
The plane is chosen such that, if possible, it completely separates the two classes of points in the training data. This constraint is not sufficient however; many different planes may all successfully separate the training data. Therefore, SVMs use a plane which maximises the margin, or the distance parallel to the plane between the nearest data points of each category. The figure below illustrates this idea.
util.margins_and_hyperplane()
Both the above lines successfully separate the two classes of data. However, the green line has a much larger margin than the red line, increasing the likelihood that new data will also be successfully classified. It is also possible to move this argument beyond intuition and prove that large margins increase the probability of correct classification.
Suppose we are given $m$ training examples $\{\v{x}_i, y_i\}_{i=1}^{m}$ with inputs $\v{x}_i \in \mathbb{R}^k$ and labels $y_i \in \{-1,+1\}$.
We aim to draw a hyperplane $\{\v{x} : \dprod{\v{w}}{\v{x}} - b = 0\}$ with $\v{w} \in \mathbb{R}^k$ and $b \in \mathbb{R}$.
We can always scale $\v{w}$ and $b$ such that the boundary of the margin on the negative side satisfies $\dprod{\v{w}}{\v{x}} - b = -1$ and on the positive side satisfies $\dprod{\v{w}}{\v{x}} - b = +1$. Together, these two constraints are equivalent to:
$$ y_i(\dprod{\v{w}}{\v{x}_i} - b) \ge 1 \qquad \forall i \in \{1,\ldots,m\} $$Under this constraint, the size of the margin is $\frac{2}{\|\v{w}\|}$, so minimising $\|\v{w}\|$ will find the maximum margin hyperplane. For reasons of computational efficiency, we instead consider the equivalent optimisation:
$$ \argmin{\v{w},b} \frac{1}{2}\|\v{w}\|^2 \qquad \mathrm{s.t.}\,\,\,y_i(\dprod{\v{w}}{\v{x}_i} - b) \ge 1 \qquad \forall i \in \{1,\ldots,m\} $$This is known as the primal form of the SVM optimisation problem.
In order to determine this plane efficiently, the above optimisation is not usually attempted directly. Instead, we solve the Lagrangian 'dual' problem. Assume each support vector exerts a perpendicular force of size $\alpha_i$ and direction $y_i \cdot \v{w}/\|\v{w}\|$ on the (rigid) hyperplane; then the optimal solution is the mechanically stable one (i.e. the one where the sum of forces and torques on the sheet is zero).
$$ \argmax{\v{\alpha}} W(\alpha) = \sum_{i=1}^{m}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \dprod{\v{x}_i}{\v{x}_j} $$subject to $\alpha_i \ge 0, \sum_{i=1}^{m}\alpha_i y_i = 0$. As an optimisation problem, this is a convex quadratic program (for which efficient algorithms exist).
The hyperplane itself is parametrised in terms of the $\alpha$ weights as $$ f(x) = \mathrm{sgn} (\sum_{i=1}^{m} y_i \alpha_i \dprod{\v{x}}{\v{x}_i} + b). $$
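Scikit-learn exposes exactly these quantities after fitting: `dual_coef_` holds the products $y_i \alpha_i$ for the support vectors, `support_vectors_` holds the corresponding $\v{x}_i$, and `intercept_` holds $b$, so the linear decision function can be rebuilt by hand (a sketch on toy data generated here, not the lab's datasets):

```python
import numpy as np
from sklearn import svm

rng = np.random.RandomState(0)
toyX = np.vstack((rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))))
toyY = np.hstack((-np.ones(20), np.ones(20)))

clf = svm.SVC(kernel="linear").fit(toyX, toyY)

# f(x) = sgn(sum_i y_i alpha_i <x, x_i> + b); dual_coef_ stores y_i * alpha_i
by_hand = (clf.dual_coef_ @ (clf.support_vectors_ @ toyX.T)).ravel() + clf.intercept_[0]
assert np.allclose(by_hand, clf.decision_function(toyX))
print("hand-computed decision values match decision_function")
```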
Exercise: Use Scikit.learn linear svm object to find the maximum margin hyperplane separating the dataset generated below.
X2 = np.vstack((np.ones((150,2)),-1*np.ones((150,2)))) + 0.3*np.random.normal(size=(300,2))
Y2 = np.hstack((np.ones(150),-1*np.ones(150)))
from sklearn import svm
svm_instance = svm.SVC(kernel='linear')
svm_instance.fit(X2,Y2)
util.plot_svm(X2, Y2, svm_instance)
Exercise: Now come up with a new dataset whose two categories do not overlap, but for which there is no way to separate them with a plane. Bonus points for creativity here. Run a linear SVM and see what happens.
#X3 = ??
#Y3 = ??
#####
#util.plot_svm(X3, Y3, svm_instance)
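If you want a starting point, one classic construction is two concentric rings: the classes never overlap, yet no straight line can separate them (one possible answer; feel free to invent your own):

```python
import numpy as np

# two concentric rings: disjoint classes with no linear separator
theta = np.linspace(0, 2 * np.pi, 150, endpoint=False)
ring = np.column_stack((np.cos(theta), np.sin(theta)))
X3 = np.vstack((1.0 * ring, 3.0 * ring)) + 0.1 * np.random.normal(size=(300, 2))
Y3 = np.hstack((np.ones(150), -np.ones(150)))
# util.plot_svm(X3, Y3, svm_instance) should show a linear SVM failing badly here
```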
In more complicated data, it may not be possible to draw a straight line between the classes (the data are not 'linearly separable'). To ameliorate this, the data can first be transformed by some function $\Psi$ that maps them into a higher dimensional space. Depending on the choice of $\Psi$, the data may now be linearly separable and so amenable to SVM classification.
For example, the figure below (top) illustrates a dataset in 2D that is not linearly separable. However, if we perform the transformation
$\Phi : \mathbb{R}^2 \rightarrow \mathbb{R}^3$
$\Phi(x) = ([x]_1^2, [x]_2^2, \sqrt{2}[x]_1[x]_2)$,
the resulting dataset is linearly separable. The transformed data is illustrated below (bottom).
util.nonlinear_example()
When an SVM classifier is run in the transformed space, the resulting hyperplane produces the following (correct) classification in the original space:
util.nonlinear_svm()
This idea motivates a technique for increasing the complexity of classification tasks that an SVM can perform: transform the input data into a high dimensional space where it may be linearly separable, find the maximum margin hyperplane in that space, then project it back into the original input space to produce a classification rule that may be non-linear. How to perform this process efficiently is described in the next section.
Explicitly computing a data transformation $\Phi$ for very high output dimensions can quickly become intractable. However, examination of the SVM optimisation shows that we actually never need this transformation, only the inner product between two transformed vectors. We can keep the algorithm tractable by computing the inner product in the transformed space without ever explicitly computing the transformation itself. This is known as the 'kernel trick'.
Rather than defining $\Phi$, we define a kernel function that maps two elements of the input space into $\mathbb{R}$: $$ k : \mathcal{S}\otimes\mathcal{S} \rightarrow \mathbb{R}. $$
Then $k$ defines the dot product in the transform space: $$ k(x_i, x_j) \triangleq \dprod{\Phi(x_i)}{\Phi(x_j)}. $$
This definition also generalises to complex-valued kernels, but we will keep things simple for now. For $k$ to behave as a valid inner product, it must be positive definite. That is, given any set $\{x_i\}_{i=1}^{m} \subset \mathcal{S}$, $$ \sum_{i=1}^{m}\sum_{j=1}^{m} c_i c_j k(x_i,x_j) \ge 0 $$ for all $c_1,\ldots,c_m \in \mathbb{R}$.
Now, we can choose a positive definite kernel function, and it will induce a transformation of our data to a different space. This can be much more efficient than computing the transformation directly, which may be extremely high dimensional (or even infinite dimensional).
Exercise: To get a sense of how much time is saved, consider the transformation which maps a 3 dimensional vector to a point whose dimensions correspond to all possible products of the components of that vector of degree 10 or less: $\Phi_{10}(([x]_1,[x]_2,[x]_3)) = [x]_1^{10}, [x]_1^9 [x]_2, [x]_1^8 [x]_2 [x]_3, \cdots$ How many terms are involved in computing this transformation explicitly? What about the 100-dimensional case?
With the kernel trick, we can compute dot products in this space much more easily: $k(x,x^{\prime}) = \dprod{\Phi_{10}(x)}{\Phi_{10}(x^{\prime})} = \dprod{x}{x^{\prime}}^{10}$.
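To make the saving concrete: the number of monomials of degree at most $d$ in $k$ variables is $\binom{k+d}{d}$, while the kernel needs only a single dot product and a power (a quick sketch; `n_monomials` is a helper defined here):

```python
from math import comb
import numpy as np

def n_monomials(k, d):
    """Number of monomials of degree at most d in k variables: C(k + d, d)."""
    return comb(k + d, d)

print(n_monomials(3, 10))    # 286 terms for the 3-d, degree-10 map
print(n_monomials(100, 10))  # about 4.7e13 terms in 100 dimensions

# the kernel sidesteps the expansion entirely: one dot product, one power
x, xp = np.random.random(3), np.random.random(3)
k_val = np.dot(x, xp) ** 10
```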
Even with the kernel trick, it may not be possible to separate the classes of training data with a hyperplane. For example, noise may cause the two classes to overlap. In this case, we can add slack variables $\xi_i$ which, for each data point, allow it to sit a distance proportional to $\xi_i$ on the wrong side of the plane. The resulting optimisation is identical, except that the weights $\alpha_i$ are bounded by a regularisation constant $C$:
$$ \argmax{\v{\alpha}} W(\alpha) = \sum_{i=1}^{m}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \dprod{\v{x}_i}{\v{x}_j} $$subject to $0 \le \alpha_i \le C, \sum_{i=1}^{m}\alpha_i y_i = 0$.
Changing $C$ trades off between maximising the margin of the hyperplane, and minimising the degree of mis-classification.
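The trade-off is easy to observe on overlapping data: a small $C$ leaves many points inside a wide margin (many support vectors), while a large $C$ narrows the margin to fit the training data more tightly (a sketch on synthetic data generated here; exact counts depend on the random draw):

```python
import numpy as np
from sklearn import svm

rng = np.random.RandomState(1)
noisyX = np.vstack((rng.normal(-1, 1.2, (100, 2)), rng.normal(1, 1.2, (100, 2))))
noisyY = np.hstack((-np.ones(100), np.ones(100)))

for C in (0.01, 1.0, 100.0):
    clf = svm.SVC(kernel="linear", C=C).fit(noisyX, noisyY)
    print(C, len(clf.support_))
# small C leaves many points inside the (wide) margin, so many support vectors;
# large C narrows the margin and the count drops
```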
Exercise: Run an SVM classifier over the titanic dataset, as you did in the previous sections. Compare the results of a linear kernel and a radial basis function ('rbf') kernel. Note that the scikit-learn SVM classifier implements the soft margin approach described above.
from sklearn import svm
svm_instance = svm.SVC(kernel='rbf')
svm_instance.fit(X,Y)
plot_dim_1 = 0
plot_dim_2 = 1
util.plot_svm(X, Y, svm_instance, plot_dim_1, plot_dim_2, (-5,-5), (6,6))
The decision boundary for the exercise above was poor at separating the two classes of data (try visualising different pairs of dimensions to get an idea). This is, in part, because SVMs are sensitive to the statistical properties of their input. Input spaces in which different dimensions are shifted or scaled will tend to make it more difficult for the algorithm to find a separating hyperplane. One approach is to encode a more complex kernel that can account for these different scales and offsets, but a simpler approach is to pre-transform the data before passing it to the SVM.
Typically, the data are transformed to have mean zero, and either unit standard deviation in each dimension, or 'whitened' to remove correlation between dimensions and produce a dataset whose covariance matrix is the identity. These transformations are illustrated below on a 2D Gaussian dataset.
util.illustrate_preprocessing()
Exercise: re-implement the SVM classifier with RBF kernel on the titanic dataset. This time, use the whitened inputs X_whiten. Has the result improved? How we measure this improvement in a quantitative way is the subject of the next section.
Optional: Implement a function which transforms a dataset to have zero mean and an identity matrix as its covariance.
from sklearn import svm
svm_instance = svm.SVC(kernel='rbf')
svm_instance.fit(X_whiten,Y)
plot_dim_1 = 0
plot_dim_2 = 1
util.plot_svm(X_whiten, Y, svm_instance, plot_dim_1, plot_dim_2, (-5,-5), (6,6))
Note that now, the new dimensions of the input space are each weighted linear combinations of the old dimensions. When evaluating testing data, don't forget to transform it as well! (See the code at the beginning of this notebook for details).
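For the optional exercise above, a from-scratch whitening function might look like the following (a ZCA-style sketch via eigendecomposition of the covariance; `util.whitening_matrix` may differ in its details):

```python
import numpy as np

def whiten(data):
    """Transform data to have zero mean and identity covariance (ZCA whitening)."""
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # cov is symmetric, so eigh applies
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T
    return centered @ W

# demo on correlated 2-d Gaussian data
demo = np.random.multivariate_normal([3, -1], [[2.0, 1.2], [1.2, 1.0]], size=2000)
white = whiten(demo)
print(np.round(np.cov(white, rowvar=False), 6))  # the 2x2 identity, up to rounding
```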
Exercise: Run the sklearn Naive Bayes, logistic regression, and SVM models you previously trained on the titanic data against the test set and compare how they perform.
Hint: you don't want to re-train any models, only do prediction. If you are calling 'fit' on anything you are doing something wrong
# Compare the performance of the model you built on the titanic test data
testX,testY = util.load_test_data()
# Encode the test features for prediction with Naive Bayes using the encoder you built previously
testXNB = nb_enc.transform(testX[:,0:2]).toarray()
# Naive Bayes - generate predictions
nb_pred = nb.predict(testXNB)
# Logistic Regression
logist_pred = titanic_logist.predict(testX)
# SVM
#svm_pred = titanic_svm.predict(testX)
Sklearn provides many metrics for evaluating the performance of a classification algorithm; see http://scikit-learn.org/stable/modules/model_evaluation.html You can pass one of these metrics to sklearn's cross-validation functionality so as to optimise an algorithm's tunable parameters for the loss you care about. For this lab, though, we will just use them to explore how well we did on the test set.
from sklearn.metrics import confusion_matrix, roc_curve, auc,accuracy_score
# 1) Print the proportion of predictions that were correct using accuracy_score(trueY,pred)
print(accuracy_score(testY,nb_pred))
print(accuracy_score(testY,logist_pred))
# 2) print the confusion matrix using confusion_matrix(trueY, pred)
# returns a matrix [[true=0&pred=0, true=0&pred=1], [true=1&pred=0, true=1&pred=1]]
print(confusion_matrix(testY, nb_pred))
print(confusion_matrix(testY, logist_pred))
# Question: A true positive is an instance that is actually 1 and is classified as 1
# A false positive is an instance that is actually 0 and is classified as 1
# Which cells of the confusion matrix contain these values?
# 3) ROC curves
# For algorithms that estimate the probability an instance belongs to a class,
# we could vary the threshold above which we classify instances as positive to
# change the ratio of true positives to false positives.
# The ROC curve plots the proportion of true positives versus the proportion of false positives for each threshold.
# Plot the ROC curves: the function roc_curve(y_true, y_score) takes:
# y_true: true labels
# y_score: a score indicating the likelihood that each instance is positive
# returns: a tuple (false_positive_rate, true_positive_rate, thresholds)
nb_score = nb.predict_proba(testXNB)[:,1]
nb_fpr,nb_tpr,nb_thresh = roc_curve(testY,nb_score)
lr_score = titanic_logist.predict_proba(testX)[:,1]
lr_fpr,lr_tpr,lr_thresh = roc_curve(testY,lr_score)
f,ax = pl.subplots(1,1)
ax.plot(nb_fpr,nb_tpr,label="Naive Bayes")
ax.plot(lr_fpr,lr_tpr,label="Logistic Regression")
ax.set_xlabel("False Positive Rate")
ax.set_ylabel("True Positive Rate")
ax.set_title("ROC curves")
ax.legend(loc="lower right")
show()
0.810897435897
0.807692307692
[[180  23]
 [ 36  73]]
[[182  21]
 [ 39  70]]
Question: What would the ROC curve look like for a classifier that made no errors?
Question: What is the significance of an ROC curve that lies below the line false_positive = true_positive?