These are notes I wrote for myself while studying for PhD qualifying exams at Stanford.
I wanted to publish them so that others could benefit.
They draw from two sources: Stanford's CS229 course and An Introduction to Statistical Learning (ISL).
Some of the figures are taken from ISL with permission from the authors.
$\textbf{Algorithm type} -$
$\textbf{Model} -$
$\textbf{Assumptions} -$
$\textbf{Likelihood estimate} -$
$\textbf{Strategies for training our model parameters} -$
(1) Analytical approaches to find model parameters:
$\textbf{Implementing the model} -$
Logistic regression is discriminative: it tries to learn the output given the features, $P(y \mid x)$.
But there is an alternative strategy. Generative algorithms instead try to learn the features given the label, $P(x \mid y)$.
We can model the distribution of the predictors $x$ separately in each of the response classes (i.e. given $Y$ ).
Then, we can use Bayes' theorem to flip these around into estimates of $P(y \mid x)$.
The reasons for doing this rather than logistic regression (from ISL): when the classes are well-separated, the parameter estimates for logistic regression are surprisingly unstable; when $n$ is small and the predictors are approximately normal within each class, the generative model is more stable; and the approach extends naturally to more than two response classes.
Recall Bayes' rule:
$$ P(Y=k \mid X) = \frac{P(X \mid Y=k)\,P(Y=k)}{\sum\limits_{l=1}^K P(X \mid Y=l)\,P(Y=l)} $$
So, we can break this down into the relevant parts:
From the training data, we simply need to estimate the class priors, $P(Y=k)$, and the class-conditional densities of the features, $P(X \mid Y=k)$.
We can model the features as multi-variate Gaussian.
$$ P(x \mid y=0) = \frac{1}{(2 \pi)^{n/2} \, \lvert \Sigma \rvert^{1/2}} \exp\left( -\frac{1}{2} (x-\mu_0)^T \Sigma^{-1} (x-\mu_0) \right) $$
Where $\Sigma$ is the covariance matrix, which is shared between the two classes:
$ \begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{bmatrix} $
The variances ($\sigma_{11}, \sigma_{22}$) are just special cases of covariance, which measures how two variables change together:
$$ Cov(a,b) = \frac{1}{n-1} \sum\limits_{i=1}^{n} (a_i - \mu_a) (b_i - \mu_b) $$
Increasing the variance $\sigma_{11}$ alters the width of each distribution, but of course the mean remains fixed.
Increasing the covariance skews the shape of the bi-variate Gaussian, because the data become more linearly coupled.
Also, recall that the Pearson correlation is the covariance normalized by the product of the standard deviations: $\rho_{ab} = \frac{Cov(a,b)}{\sigma_a \sigma_b}$.
In the extreme, high covariance results in strong linear dependence and a Pearson correlation close to 1.
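As a quick sanity check, here is a minimal NumPy sketch (the variables $a$ and $b$ are made up for illustration) showing that the hand-normalized covariance matches np.corrcoef:

import numpy as np
rng = np.random.RandomState(0)
a = rng.normal(size=1000)
b = 0.8 * a + 0.2 * rng.normal(size=1000)        # linearly coupled with noise
cov_ab = np.cov(a, b)[0, 1]                      # sample covariance (n-1 denominator)
pearson = cov_ab / (a.std(ddof=1) * b.std(ddof=1))
print(cov_ab, pearson, np.corrcoef(a, b)[0, 1])  # the last two values agree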
import sys, os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# Class features are bi-variate Gaussian parameterized differently based upon class
u_0 = np.array([0,0])
u_1 = np.array([-1,-1])
# Shared covariance (must be symmetric and positive semi-definite)
cov = np.array([[1, 0.5], [0.5, 1]])
for u in [u_0, u_1]:
    X = np.random.multivariate_normal(u, cov, 10000)
    sns.jointplot(x=X[:, 0], y=X[:, 1], kind="hex")
The height of the surface at any particular point represents the probability that both $X_1$ and $X_2$ fall in a small region around that point.
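To make this concrete, we can evaluate that height directly (a small sketch assuming SciPy is available; it reuses u_0 and cov from the cell above):

from scipy.stats import multivariate_normal
# Density ("surface height") of the class-0 Gaussian at a few points
rv = multivariate_normal(mean=u_0, cov=cov)
print(rv.pdf([0, 0]))  # at the mean: the peak of the surface
print(rv.pdf([2, 2]))  # much smaller far from the mean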
With a model for $P(x^i \mid y^i)$, we can define a joint log-likelihood function:
$$ \ell(\phi, \mu_0, \mu_1, \Sigma) = \sum_{i=1}^m \log\left( P(x^i \mid y^i)\, P(y^i) \right) $$
In CS229, we solved this analytically by taking the derivative of the likelihood function with respect to each parameter, setting it equal to zero, and solving.
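For reference, setting those derivatives to zero yields the familiar closed-form estimates: the prior is the class frequency, each mean is the per-class sample mean, and $\Sigma$ is the covariance pooled around the class means:
$$ \phi = \frac{1}{m} \sum_{i=1}^m 1\{y^i = 1\} \qquad \mu_k = \frac{\sum_{i=1}^m 1\{y^i = k\}\, x^i}{\sum_{i=1}^m 1\{y^i = k\}} \qquad \Sigma = \frac{1}{m} \sum_{i=1}^m (x^i - \mu_{y^i})(x^i - \mu_{y^i})^T $$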
We have an assumed Gaussian density function for our features with respect to each class.
With the training data, we then parameterize Gaussian densities that maximize the likelihood, as before.
For each test example, we can then assign it to the class that yields a higher posterior probability:
$$ P(Y=k \mid X) = \frac{P(X \mid Y=k)\,P(Y=k)}{P(X)} $$
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=10000,n_features=2,centers=2) # Generate isotropic Gaussian blobs for clustering.
sns.jointplot(x=X[:, 0], y=X[:, 1], kind="hex")
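Before reaching for scikit-learn, here is a minimal sketch of the generative recipe on these blobs (names like phi, mu_0, and Sigma are mine; the evidence $P(X)$ cancels when comparing classes):

from scipy.stats import multivariate_normal
# MLE parameters from the training data
phi = y.mean()                                     # P(y=1): class frequency
mu_0, mu_1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
resid = X - np.where(y[:, None] == 0, mu_0, mu_1)  # center each point on its class mean
Sigma = resid.T @ resid / len(X)                   # shared (pooled) covariance
# Classify by the larger posterior numerator P(x|y=k)P(y=k)
p0 = multivariate_normal(mu_0, Sigma).pdf(X) * (1 - phi)
p1 = multivariate_normal(mu_1, Sigma).pdf(X) * phi
print(((p1 > p0).astype(int) == y).mean())         # training accuracy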
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
clf = LDA()
# Raw data
from matplotlib.colors import ListedColormap
cm_bright = ListedColormap(['#FF0000', '#0000FF'])
plt.scatter(X[:, 0],X[:, 1],c=y,cmap=cm_bright)
# Train
clf.fit(X,y)
# Meshgrid on which test classification
xx1, xx2 = np.meshgrid(np.linspace(X[:, 0].min(),X[:, 0].max(), 100), np.linspace(X[:, 1].min(),X[:, 1].max(), 100))
X_pred = np.c_[xx1.ravel(), xx2.ravel()] # convert 2d grid into seq of points
pred = clf.predict(X_pred)
Z = pred.reshape((100, 100)) # reshape seq to grid
plt.contourf(xx1,xx2,Z,cmap=plt.cm.RdBu,alpha=0.4)
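Because both classes share a covariance matrix, the resulting decision boundary is linear (hence linear discriminant analysis). As a quick check of the fit:

print(clf.score(X, y))  # mean accuracy on the training data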