# Logistic Classification


This tour details the logistic classification method (for 2 classes and multi-classes).

Warning: Logistic classification is actually called "logistic regression" in the literature, but it is in fact a classification method.

We recommend that after doing this Numerical Tour, you apply it to your own data, for instance using a dataset from LibSVM.

Disclaimer: these machine learning tours are intended to be overly-simplistic implementations and applications of baseline machine learning methods. For more advanced uses and implementations, we recommend using a state-of-the-art library, the most well known being Scikit-Learn.

In [1]:
options(warn=-1) # turns off warnings, to turn on: "options(warn=0)"
library(plot3D)
library(pracma)
library(grid)

# Importing the libraries
for (f in list.files(path="nt_toolbox/toolbox_general/", pattern="*.R")) {
source(paste("nt_toolbox/toolbox_general/", f, sep=""))
}
for (f in list.files(path="nt_toolbox/toolbox_signal/", pattern="*.R")) {
source(paste("nt_toolbox/toolbox_signal/", f, sep=""))
}


We define a few helpers.

In [2]:
Xm = function(X){as.matrix(X - rep(colMeans(X), rep.int(nrow(X), ncol(X))))}
Cov = function(X){data.matrix(1. / (nrow(X) - 1) * t(Xm(X)) %*% Xm(X))} # uses the row count of X rather than a global n


## Two Classes Logistic Classification

Logistic classification is, with support vector machines (SVM), the baseline method to perform classification. Its main advantage over SVM is that it is a smooth minimization problem, and that it also outputs class probabilities, offering a probabilistic interpretation of the classification.

To understand the behavior of the method, we generate synthetic data distributed according to a mixture of Gaussians with an overlap governed by an offset $\omega$. Here class indices are set to $y_i \in \{-1,1\}$ to simplify the equations.

In [3]:
n = 1000 # number of sample
p = 2 # dimensionality
omega = 1.5 * 2.5 #offset
n1 = n/2
X = rbind(randn(n1,2), randn(n1,2) + rep(1, n1) * omega)
y = c(rep(1, n1), rep(-1, n1))


Plot the classes.

In [4]:
options(repr.plot.width=5, repr.plot.height=5)

for (i in c(-1, 1))
{
I = (y==i)
plot(X[I,1], X[I,2], col=(i + 3), xlim=c(min(X[,1]), max(X[,1])),
ylim=c(min(X[,2]), max(X[,2])), xlab="", ylab="", pch=16)
par(new=TRUE)
}

cols = c(2,4)
legend("topright", legend=c(-1, 1), col=cols, pch=16)


Logistic classification minimizes a logistic loss in place of the usual $\ell^2$ loss for regression $$\umin{w} E(w) \eqdef \frac{1}{n} \sum_{i=1}^n L(\dotp{x_i}{w},y_i)$$ where the logistic loss reads $$L( s,y ) \eqdef \log( 1+\exp(-sy) )$$ This corresponds to a smooth convex minimization. If $X$ is injective, this is also strictly convex, so the minimizer, when it exists, is unique.

Compare the binary (ideal) 0-1 loss, the logistic loss and the [hinge loss](https://en.wikipedia.org/wiki/Hinge_loss) (the one used for SVM).

In [5]:
options(repr.plot.width=7, repr.plot.height=6)
t = seq(-3, 3, length=255)
plot(t, log(1 + exp(t)), type="l", col=2)
#plot(t, t > 0)
lines(t, t > 0, col=3)
lines(t, pmax(t, 0), col=4)
legend("topleft", legend=c('Binary', 'Logistic', 'Hinge'), col=c(3,2,4), pch="-")


This can be interpreted as a [maximum likelihood estimator](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation) when one models the probability of belonging to the two classes for sample $x_i$ as $$h(x_i) \eqdef (\th(x_i),1-\th(x_i)) \qwhereq \th(s) \eqdef \frac{e^{s}}{1+e^s} = (1+e^{-s})^{-1}$$
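The link with the logistic loss can be made explicit with a short derivation. It assumes (this is our reading of the notation above) that the first entry of $h(x_i)$ is the probability of the class $y_i=+1$, evaluated at the score $s_i = \dotp{x_i}{w}$: $$\mathbb{P}(y_i=1|x_i) = \th(s_i), \qquad \mathbb{P}(y_i=-1|x_i) = 1-\th(s_i) = \th(-s_i),$$ so that both cases are summarized by $\mathbb{P}(y_i|x_i) = \th(y_i s_i)$. The normalized negative log-likelihood of independent samples is then $$-\frac{1}{n} \sum_{i=1}^n \log \th(y_i s_i) = \frac{1}{n} \sum_{i=1}^n \log( 1+e^{-y_i s_i} ) = E(w),$$ hence maximizing the likelihood is the same as minimizing $E$.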

Rewriting the energy to minimize $$E(w) = \Ll(X w,y) \qwhereq \Ll(s,y)= \frac{1}{n} \sum_i L(s_i,y_i),$$ its gradient reads $$\nabla E(w) = X^\top \nabla \Ll(X w,y) \qwhereq \nabla \Ll(s,y) = -\frac{y}{n} \odot \th(-y \odot s),$$ where $\odot$ is the pointwise multiplication operator, i.e. * in R.

Define the energies.

In [6]:
L = function(s,y){1/n * sum( log(1 + exp(-s * y)))}
E = function(w,X,y){L(X %*% w, y)}


In [7]:
theta = function(v){1 / (1 + exp(-v))}
nablaL = function(s, y){ - 1/n * y * theta(-s * y)}
nablaE = function(w,X,y){t(X) %*% nablaL(X %*% w,y)}
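
A quick way to validate a gradient formula such as the one above is to compare it with centered finite differences. The following is a minimal standalone sketch on random data; all names carrying the `_chk` suffix are illustrative and not part of the toolbox.

```r
# Standalone sketch: finite-difference check of the logistic gradient.
# Names with the _chk suffix are illustrative, not part of the tour.
set.seed(0)
n_chk = 50; p_chk = 3
X_chk = matrix(rnorm(n_chk * p_chk), n_chk, p_chk)
y_chk = ifelse(rnorm(n_chk) > 0, 1, -1)   # labels in {-1, 1}

theta_chk = function(v){1 / (1 + exp(-v))}
# E(w) = 1/n sum_i log(1 + exp(-y_i <x_i, w>))
E_chk = function(w, X, y){mean(log(1 + exp(-as.vector(X %*% w) * y)))}
# grad E(w) = X^T ( -y/n * theta(-y * Xw) )
nablaE_chk = function(w, X, y){
    s = as.vector(X %*% w)
    t(X) %*% (-y * theta_chk(-s * y) / nrow(X))
}

w_chk = rnorm(p_chk)
eps = 1e-6
# Centered finite differences along each coordinate direction.
g_fd = sapply(1:p_chk, function(j){
    e = rep(0, p_chk); e[j] = eps
    (E_chk(w_chk + e, X_chk, y_chk) - E_chk(w_chk - e, X_chk, y_chk)) / (2 * eps)
})
err_chk = max(abs(g_fd - as.vector(nablaE_chk(w_chk, X_chk, y_chk))))
print(err_chk)
```

A small error indicates that the analytic gradient, including its minus sign, matches the energy.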


Important: in order to improve performance, it is important (especially in low dimension $p$) to add a constant bias term $w_{p+1} \in \RR$, and replace $\dotp{x_i}{w}$ by $\dotp{x_i}{w} + w_{p+1}$. This is equivalently achieved by adding an extra $(p+1)^{\text{th}}$ dimension equal to 1 to each $x_i$, which we do using a convenient macro.

In [8]:
AddBias = function(X){cbind(X, rep(1, dim(X)[1]))}


With this added bias term, we first initialize $w_{\ell=0} \in \RR^{p+1}$ (for instance at $0_{p+1}$).

In [9]:
w = rep(0, p + 1)
dim(w) = c(p+1, 1)


One step of gradient descent reads $$w_{\ell+1} = w_\ell - \tau_\ell \nabla E(w_\ell).$$

In [10]:
tau = .8 # here we are using a fixed tau
w = w - tau * nablaE(w, AddBias(X), y)

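For a fixed step size, convergence is guaranteed when $$\tau < \frac{2}{L},$$ where $L$ denotes here the Lipschitz constant of $\nabla E$ (not the loss function), which satisfies $$L \leq \frac{1}{4}\norm{X}^2.$$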

If one chooses $$\tau < \tau_{\max} \eqdef \frac{2}{\frac{1}{4}\norm{X}^2},$$ then one is sure that the gradient descent converges.

In [11]:
tau_max = 2/(1/4 * base::norm(AddBias(X), "2")**2)
print(tau_max)

[1] 0.0005154158
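
To illustrate the step size condition, here is a minimal standalone sketch that runs a fixed-step gradient descent on a small synthetic problem with a step just below the bound, and records the energy at each iteration. The data, the number of iterations, the factor 0.9 and all `_gd` names are illustrative assumptions.

```r
# Standalone sketch: fixed-step gradient descent with tau below 2 / (||X||^2 / 4).
set.seed(1)
n_gd = 200
X_gd = rbind(matrix(rnorm(n_gd), n_gd/2, 2), matrix(rnorm(n_gd), n_gd/2, 2) + 3)
y_gd = c(rep(1, n_gd/2), rep(-1, n_gd/2))
Xb_gd = cbind(X_gd, 1)   # extra constant column, same role as AddBias

theta_gd = function(v){1 / (1 + exp(-v))}
E_gd = function(w, X, y){mean(log(1 + exp(-as.vector(X %*% w) * y)))}
nablaE_gd = function(w, X, y){
    s = as.vector(X %*% w)
    as.vector(t(X) %*% (-y * theta_gd(-s * y) / nrow(X)))
}

# Step size slightly below the bound; base::norm(, "2") is the spectral norm.
tau_gd = 0.9 * 2 / (1/4 * base::norm(Xb_gd, "2")^2)
w_gd = rep(0, 3)
niter = 300
E_list = numeric(niter)
for (it in 1:niter) {
    E_list[it] = E_gd(w_gd, Xb_gd, y_gd)   # energy before the update
    w_gd = w_gd - tau_gd * nablaE_gd(w_gd, Xb_gd, y_gd)
}
print(c(E_list[1], E_list[niter]))
```

With such a step the recorded energies should never increase; plotting `E_list` on a logarithmic scale is one way to inspect the decay rate.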


Exercise 1

Implement a gradient descent $$w_{\ell+1} = w_\ell - \tau_\ell \nabla E(w_\ell).$$ Monitor the energy decay. Test different step sizes, and compare with the theory (in particular plot in log domain to illustrate the linear rate).

In [12]:
source("nt_solutions/ml_3_classification/exo1.R")

In [13]:
## Insert your code here.


Generate a 2D grid of points.

In [14]:
q = 201
tx = seq(min(X[,1]), max(X[,1]), length=q)
ty = seq(min(X[,2]),max(X[,2]), length=q)
B = as.vector(meshgrid(ty, tx)$X)
A = as.vector(meshgrid(ty, tx)$Y)
G = matrix(c(A, B), nrow=length(A), ncol=2)


Evaluate class probability associated to weight vectors on this grid.

In [15]:
Theta = theta(AddBias(G) %*% w)
dim(Theta) = c(q, q)


Display the data overlaid on top of the classification probability, which highlights the separating hyperplane $\enscond{x}{\dotp{w}{x}=0}$.

In [16]:
color = function(x){rev(cm.colors(x))}
image(tx,ty, Theta, xlab="", ylab="", col=color(10), xaxt="n", yaxt="n")
par(new=TRUE)
for (i in c(-1, 1))
{
I = (y==i)
plot(X[I,1], X[I,2], col=(i + 3), xlim=c(min(X[,1]), max(X[,1])),
ylim=c(min(X[,2]), max(X[,2])), xlab="", ylab="", pch=16, xaxt="n", yaxt="n")
par(new=TRUE)
}

cols = c(2,4)
legend("topright", legend=c(-1, 1), col=cols, pch=16)


Exercise 2

Test the influence of the separation offset $\omega$ on the result.

In [17]:
source("nt_solutions/ml_3_classification/exo2.R")

In [18]:
## Insert your code here.


Exercise 3

Test logistic classification on a real-life dataset. You can look at the Numerical Tour on stochastic gradient descent for an example. Split the data into training and testing sets to evaluate the classification performance, and check the impact of regularization.

In [19]:
source("nt_solutions/ml_3_classification/exo3.R")

In [20]:
## Insert your code here.


## Kernelized Logistic Classification

Logistic classification tries to separate the classes using a linear separating hyperplane $\enscond{x}{\dotp{w}{x}=0}.$

In order to generate a non-linear decision boundary, one can replace the parametric linear model by a non-linear non-parametric model, thanks to kernelization. It is non-parametric in the sense that the number of parameters grows with the number $n$ of samples (while for the basic method, the number of parameters is $p$). This allows in particular to generate decision boundaries of arbitrary complexity.

The downside is that the numerical complexity of the method grows (at least) quadratically with $n$.

The good news however is that thanks to the theory of reproducing kernel Hilbert spaces (RKHS), one can still compute this non-linear decision function using (almost) the same numerical algorithm.

Given a kernel $\kappa(x,z) \in \RR$ defined for $(x,z) \in \RR^p \times \RR^p$, the kernelized method replaces the linear decision functional $f(x) = \dotp{x}{w}$ by a sum of kernels centered on the samples $$f_h(x) = \sum_{i=1}^n h_i \kappa(x_i,x)$$ where $h \in \RR^n$ is the unknown vector of weights to find.

When using the linear kernel $\kappa(x,y)=\dotp{x}{y}$, one retrieves the previously studied linear method.
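
The reduction to the linear method can be checked numerically: with the linear kernel, the kernel sum over the samples equals $\dotp{x}{w}$ for $w = X^\top h$. The sketch below uses small random data; the names `X_lin`, `h_lin`, `x_lin` and `kappa_lin` are illustrative.

```r
# Standalone sketch: with the linear kernel, f_h(x) = <x, X^T h>.
set.seed(2)
n_lin = 6; p_lin = 2
X_lin = matrix(rnorm(n_lin * p_lin), n_lin, p_lin)  # rows are the samples x_i
h_lin = rnorm(n_lin)                                # arbitrary kernel weights
x_lin = rnorm(p_lin)                                # a query point

kappa_lin = function(x, z){sum(x * z)}              # linear kernel <x, z>
# Kernel expansion: sum over the n samples of h_i * kappa(x_i, x)
f_kernel = sum(sapply(1:n_lin, function(i){h_lin[i] * kappa_lin(X_lin[i,], x_lin)}))
# Equivalent linear model with weights w = X^T h
w_lin = as.vector(t(X_lin) %*% h_lin)
f_linear = sum(x_lin * w_lin)
print(c(f_kernel, f_linear))
```

Both values agree up to rounding, which is why the $n$ weights $h$ carry no more information than the $p$ weights $w$ in the linear case.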

Macro to compute pairwise squared Euclidean distance matrix.

In [21]:
distmat = function(X,Z)
{
# squared norms of the rows of X and Z (avoids forming X %*% t(X))
dist1 = rowSums(X^2)
dist2 = rowSums(Z^2)
# out[i,j] = |x_i|^2 + |z_j|^2 - 2 <x_i, z_j>
out = outer(dist1, dist2, "+") - 2 * X %*% t(Z)

return(out)
}
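
As a sanity check of the expansion $\norm{x-z}^2 = \norm{x}^2 + \norm{z}^2 - 2\dotp{x}{z}$ used by this macro, the sketch below compares a vectorized evaluation with the reference distances returned by `stats::dist`. The function name `distmat_vec` and the test data are illustrative.

```r
# Standalone sketch: squared distances via the norm expansion versus stats::dist.
set.seed(3)
X_d = matrix(rnorm(5 * 2), 5, 2)
Z_d = matrix(rnorm(4 * 2), 4, 2)

# Illustrative vectorized form of |x_i|^2 + |z_j|^2 - 2 <x_i, z_j>
distmat_vec = function(X, Z){outer(rowSums(X^2), rowSums(Z^2), "+") - 2 * X %*% t(Z)}

D_fast = distmat_vec(X_d, Z_d)
# Reference: Euclidean distances on the stacked points, keeping the X-versus-Z block
D_ref = as.matrix(dist(rbind(X_d, Z_d)))[1:5, 6:9]^2
err_d = max(abs(D_fast - D_ref))
print(err_d)
```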


The gaussian kernel is the most well known and used kernel $$\kappa(x,y) \eqdef e^{-\frac{\norm{x-y}^2}{2\sigma^2}} .$$ The bandwidth parameter $\si>0$ is crucial and controls the locality of the model. It is typically tuned through cross validation.

In [22]:
kappa = function(X,Z,sigma){exp( -distmat(X,Z)/(2*sigma^2))}


We generate synthetic data in 2-D which are not separable by a hyperplane.

In [23]:
n = 1000
p = 2
t = 2 * pi * rand(n/2,1)
R = 2.5
r = R * (1 + .2 * rand(n/2,1)) # radius
X1 = cbind(cos(t) * r, sin(t) * r)
X = rbind(randn(n/2, 2), X1)
y = c(rep(1, n/2), rep(-1, n/2))


Display the classes.

In [24]:
options(repr.plot.width=5, repr.plot.height=5)

for (i in c(-1, 1))
{
I = (y==i)
plot(X[I,1], X[I,2], col=(i + 3), xlim=c(min(X[,1]), max(X[,1])),
ylim=c(min(X[,2]), max(X[,2])), xlab="", ylab="", pch=16)
par(new=TRUE)
}

cols = c(2,4)
legend("topright", legend=c(-1, 1), col=cols, pch=16)


Once evaluated on the sample points, the kernel defines a matrix $$K = (\kappa(x_i,x_j))_{i,j=1}^n \in \RR^{n \times n}.$$

In [25]:
sigma = 1
K = kappa(X, X, sigma)
image(K, col=color(10), ylim=c(1, 0))


Valid kernels are those that give rise to symmetric positive semi-definite matrices $K$. The linear and Gaussian kernels are valid kernel functions. Other popular kernels include the polynomial kernel $\dotp{x}{y}^a$ for an integer $a \geq 1$ and the Laplacian kernel $\exp( -\norm{x-y}/\si )$.
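
Validity can be checked numerically on a given point cloud by looking at the symmetry and at the smallest eigenvalue of the kernel matrix. The sketch below does so for the Gaussian kernel on random points; the data, the bandwidth and the `_psd` names are illustrative.

```r
# Standalone sketch: the Gaussian kernel matrix is symmetric with non-negative spectrum.
set.seed(4)
X_psd = matrix(rnorm(20 * 2), 20, 2)
sigma_psd = 1
D2_psd = as.matrix(dist(X_psd))^2            # pairwise squared Euclidean distances
K_psd = exp(-D2_psd / (2 * sigma_psd^2))     # Gaussian kernel matrix

sym_err = max(abs(K_psd - t(K_psd)))
ev = eigen(K_psd, symmetric=TRUE, only.values=TRUE)$values
min_ev = min(ev)
print(c(sym_err, min_ev))
```

The smallest eigenvalue may be tiny, or slightly negative at the level of machine precision, which is expected in floating point arithmetic.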

The kernelized Logistic minimization reads $$\umin{h} F(h) \eqdef \Ll(K h,y).$$

In [26]:
F = function(h,K,y){L(K %*% h, y)}
nablaF = function(h,K,y){t(K) %*% nablaL(K %*% h,y)}


This minimization can be related to an infinite dimensional optimization problem where one minimizes directly over the function $f$. This is shown to be equivalent to the above finite-dimensional optimization problem thanks to the theory of RKHS.

Exercise 4

Implement a gradient descent to minimize $F(h)$. Monitor the energy decay. Test different step sizes, and compare with the theory.

In [27]:
source("nt_solutions/ml_3_classification/exo4.R")

In [28]:
## Insert your code here.


Once this optimal $h$ has been found, class probabilities at a point $x$ are obtained as $$(\th(f_h(x)), 1-\th(f_h(x)))$$ where $f_h$ has been defined above.

We evaluate this classification probability on a grid.

In [29]:
q = 201
tmax = 3.5
t = seq(-tmax, tmax, length=q)
B = as.vector(meshgrid(t)$X)
A = as.vector(meshgrid(t)$Y)
G = matrix(c(A, B), nrow=length(A), ncol=2)
Theta = theta(kappa(G,X,sigma) %*% h)
dim(Theta) = c(q, q)


Display the classification probability.

In [30]:
image(t,t, Theta, xlab="", ylab="", col=color(10), xaxt="n", yaxt="n")
par(new=TRUE)
for (i in c(-1, 1))
{
I = (y==i)
plot(X[I,1], X[I,2], col=(i + 3), xlim=c(min(X[,1]), max(X[,1])),
ylim=c(min(X[,2]), max(X[,2])), xlab="", ylab="", pch=16, xaxt="n", yaxt="n")
par(new=TRUE)
}

cols = c(2,4)
legend("topright", legend=c(-1, 1), col=cols, pch=16)


Exercise 5

Display the evolution of the classification probability with $\sigma$.

In [31]:
source("nt_solutions/ml_3_classification/exo5.R")

In [32]:
## Insert your code here.


Exercise 6

Separate the dataset into a training set and a testing set. Evaluate the classification performance for varying $\si$. Try to introduce regularization and minimize $$\umin{h} F(h) \eqdef \Ll(K h,y) + \la R(h)$$ where for instance $R=\norm{\cdot}_2^2$ or $R=\norm{\cdot}_1$.

In [33]:
source("nt_solutions/ml_3_classification/exo6.R")

In [34]:
## Insert your code here.


## Multi-Classes Logistic Classification

The logistic classification method is extended to an arbitrary number $k$ of classes by considering a family of weight vectors $(w_\ell)_{\ell=1}^k$, which are conveniently stored as the columns of a matrix $W \in \RR^{p \times k}$.

This allows one to model probabilistically the membership of a point $x \in \RR^p$ in each of the classes using an exponential model $$h(x) = \pa{ \frac{ e^{\dotp{x}{w_\ell}} }{ \sum_m e^{\dotp{x}{w_m}} } }_\ell$$ This vector $h(x) \in [0,1]^k$ describes the probability of $x$ belonging to the different classes, and $\sum_\ell h(x)_\ell = 1$.

The matrix $W$ is computed by solving the maximum likelihood problem $$\umax{W \in \RR^{p \times k}} \frac{1}{n} \sum_{i=1}^n \log( h(x_i)_{y_i} )$$ where $y_i \in \{1,\ldots,k\}$ is the class index of point $x_i$.

This is conveniently rewritten as $$\umin{W} \sum_i \text{LSE}( XW )_i - \dotp{XW}{D}$$ where $D \in \{0,1\}^{n \times k}$ is the binary class index matrix $$D_{i,\ell} = \choice{ 1 \qifq y_i=\ell, \\ 0 \quad \text{otherwise}. }$$ and LSE is the log-sum-exp operator, applied row by row, $$\text{LSE}(S)_i = \log\pa{ \sum_\ell \exp(S_{i,\ell}) }, \qquad \text{LSE}(S) \in \RR^n.$$

In [35]:
LSE0 = function(S){log(apply(exp(S), 1, sum))}
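
To make the rewriting concrete, the sketch below builds the binary matrix $D$ from class labels and checks that $\sum_i \text{LSE}(XW)_i - \dotp{XW}{D}$ coincides with the negative log-likelihood computed directly. It assumes that the class probabilities are the row-wise soft-max of the scores $XW$; the random data and the `_mc` names are illustrative.

```r
# Standalone sketch: one-hot matrix D and the multi-class energy.
set.seed(5)
n_mc = 8; p_mc = 3; k_mc = 4
X_mc = matrix(rnorm(n_mc * p_mc), n_mc, p_mc)
y_mc = sample(1:k_mc, n_mc, replace=TRUE)     # class indices in 1..k
W_mc = matrix(rnorm(p_mc * k_mc), p_mc, k_mc)

# D[i, l] = 1 if y_i == l, and 0 otherwise
D_mc = 1 * outer(y_mc, 1:k_mc, "==")

S_mc = X_mc %*% W_mc                          # scores, one row per sample
lse_rows = log(rowSums(exp(S_mc)))            # naive LSE, fine for moderate scores
E_lse = sum(lse_rows) - sum(S_mc * D_mc)

# Direct evaluation of -sum_i log h(x_i)_{y_i}, with h the row-wise soft-max (assumption)
H_mc = exp(S_mc) / rowSums(exp(S_mc))
E_direct = -sum(log(H_mc[cbind(1:n_mc, y_mc)]))
print(c(E_lse, E_direct))
```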


The computation of LSE is unstable for large values of $S_{i,\ell}$ (numerical overflow, producing Inf or NaN), but this can be fixed by subtracting the largest element in each row, since $\text{LSE}(S+a)=\text{LSE}(S)+a$ if $a$ is constant along rows. This is the celebrated LSE trick.

In [36]:
max2 = function(S){apply(S, 1, max)}
LSE = function(S){LSE0(S - max2(S)) + max2(S)}

In [37]:
# check equality of LSE and LSE0
S = randn(4,5)
norm( LSE(S)-LSE0(S))

3.14018491736755e-16
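
The check above uses moderate values, for which both versions agree. The sketch below shows the difference on a row with large entries, where `exp` overflows in double precision. The helper definitions are repeated so that the block runs on its own, and the matrix `S_big` is an illustrative choice.

```r
# Standalone sketch: naive versus stabilized log-sum-exp on large scores.
LSE0 = function(S){log(apply(exp(S), 1, sum))}
max2 = function(S){apply(S, 1, max)}
LSE  = function(S){LSE0(S - max2(S)) + max2(S)}

S_big = matrix(c(1000, 1001, 1002,
                   -5,    0,    5), nrow=2, byrow=TRUE)
naive  = LSE0(S_big)   # exp(1000) overflows, so the first row is not finite
stable = LSE(S_big)    # the row maximum is removed before exponentiating
print(naive)
print(stable)
```

The second row has moderate values, so both versions return the same number there; only the first row is affected by the overflow.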

The gradient of the LSE operator is the [soft-max operator](https://en.wikipedia.org/wiki/Softmax_function) $$\nabla \text{LSE}(S) = \text{SM}(S) \eqdef \pa{ \frac{ e^{S_{i,\ell}} }{ \sum_m e^{S_{i,m}} } }$$

In [38]:
SM0 = function(S){exp(S) / apply(exp(S), 1, sum)}


Similarly to the LSE, it needs to be stabilized.

In [39]:
SM = function(S){SM0(S-max2(S))}

In [40]:
# Check equality of SM and SM0
norm( SM(S)-SM0(S) )

1.73950904861202e-16
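
Two properties of the soft-max are easy to verify numerically: each row sums to one, and its entries are the partial derivatives of $\sum_i \text{LSE}(S)_i$. The sketch below checks both on a random matrix, using a centered finite difference for one entry; the `_sm` names and the chosen entry are illustrative.

```r
# Standalone sketch: rows of the soft-max sum to 1, and SM is the gradient of sum(LSE).
set.seed(6)
max2_sm = function(S){apply(S, 1, max)}
LSE_sm  = function(S){log(rowSums(exp(S - max2_sm(S)))) + max2_sm(S)}
SM_sm   = function(S){exp(S - max2_sm(S)) / rowSums(exp(S - max2_sm(S)))}

S_sm = matrix(rnorm(3 * 4), 3, 4)
row_sums = rowSums(SM_sm(S_sm))

# Centered finite difference of sum_i LSE(S)_i with respect to the entry (2,3)
eps = 1e-6
dS = matrix(0, 3, 4); dS[2, 3] = eps
fd = (sum(LSE_sm(S_sm + dS)) - sum(LSE_sm(S_sm - dS))) / (2 * eps)
err_sm = abs(fd - SM_sm(S_sm)[2, 3])
print(row_sums)
print(err_sm)
```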

We load a dataset of $n$ images of size $p = 8 \times 8$, representing digits from 0 to 9 (so there are $k=10$ classes).

Load the dataset and randomly permute it. Separate the features $X$ from the labels $y$ to predict.

In [41]:
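file_name = 'nt_toolbox/data/digits.csv'
# Read the table into A; assumes a comma-separated file without a header row.
A = read.table(file_name, sep=",", header=FALSE)
#A = A[sample(dim(A)[1]),]
X = as.matrix(A[,1:(dim(A)[2] - 1)])
y = A[,dim(A)[2]]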


$n$ is the number of samples, $p$ is the dimensionality of the features, $k$ the number of classes.

In [42]:
n = dim(X)[1]
p = dim(X)[2]
CL = sort(unique(y)) # list of classes.
k = length(CL)


Display a few sample digits.

In [43]:
options(repr.plot.width=6, repr.plot.height=6)
par(mar = rep(2, 4))
par(mfrow=c(k, 5))

q = 5
for (i in 1:k)
{
I = which(y==CL[i])
for (j in 1:q)
{

f = as.numeric(X[I[j],])
f = f / max(f)
dim(f) = c(sqrt(p), sqrt(p))
image(-f[,ncol(f):1], col=gray(c(0,1)), xaxt="n", yaxt="n", bty="n")
}
}


Perform dimensionality reduction using PCA.

In [44]:
svd_decomp = svd(Xm(X))
U = svd_decomp$u
s = svd_decomp$d
V = svd_decomp$v
X0r = Xm(X) %*% V

options(repr.plot.width=4, repr.plot.height=4)
plot(s, type="o", col=4, ylab="", xlab="", pch=16)
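
As a cross-check of this SVD-based PCA, the sketch below compares the projected coordinates with those returned by `stats::prcomp` on random data. The two agree up to the sign of each principal axis, which is why absolute values are compared. The data and the `_pca` names are illustrative.

```r
# Standalone sketch: PCA through the SVD of the centered data versus prcomp.
set.seed(7)
X_pca = matrix(rnorm(30 * 5), 30, 5) %*% diag(c(5, 3, 1, 0.5, 0.1))
Xc_pca = sweep(X_pca, 2, colMeans(X_pca))   # centered data, same role as Xm(X)
sv = svd(Xc_pca)
X0r_pca = Xc_pca %*% sv$v                   # coordinates along the principal axes

pc = prcomp(X_pca, center=TRUE, scale.=FALSE)
# Principal axes are defined up to sign, so compare absolute values
err_pca = max(abs(abs(X0r_pca) - abs(pc$x)))
# Fraction of the variance carried by each axis
var_explained = sv$d^2 / sum(sv$d^2)
print(err_pca)
print(round(var_explained, 3))
```

The decay of `var_explained` conveys the same information as the plot of the singular values: it indicates how many axes are needed to capture most of the variance.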


Display in 2D.

In [45]:
options(repr.plot.width=6, repr.plot.height=6)

plot_multiclasses(X, y, 2)