These are notes I wrote for myself while studying for PhD qualifying exams at Stanford.
I wanted to publish them so that others could benefit.
They draw from two sources: Stanford's CS229 course and An Introduction to Statistical Learning (ISL).
Some of the figures are taken from ISL with permission from the authors.
$\textbf{Algorithm type} -$
$\textbf{Model} -$
$\textbf{Assumptions} -$
$\textbf{Likelihood estimate} -$
$\textbf{Strategies for training our model parameters} -$
(1) Analytical approaches to find model parameters:
$\textbf{Implementing the model} -$
Logistic regression is discriminative: it tries to learn the output given the features, $P(y \mid x)$.
But there is an alternative strategy. Generative algorithms instead try to learn the features given the label, $P(x \mid y)$.
We can model the distribution of the predictors $x$ separately in each of the response classes (i.e. given $Y$ ).
Then, we can use Bayes' theorem to flip these around into estimates of $P(y \mid x)$.
The reasons for doing this rather than logistic regression (from ISL): when the classes are well-separated, the parameter estimates for logistic regression are surprisingly unstable; when $n$ is small and the predictors are approximately normal within each class, the generative model is more stable; and the approach extends naturally to more than two response classes.
Recall Bayes' rule:
$$ P(Y=k \mid X) = \frac{P(X \mid Y=k)\,P(Y=k)}{\sum\limits_{l=1}^K P(X \mid Y=l)\,P(Y=l)} $$
So, we can break this down into the relevant parts:
From the training data, we simply need to estimate the class priors, $P(Y=k)$, and the class-conditional densities of the features, $P(X \mid Y=k)$.
We can model the features as multi-variate Gaussian.
$$ P(x \mid y=0) = \frac{1}{(2 \pi)^{n/2} \, \lvert \Sigma \rvert^{1/2}} \exp\left( -\frac{1}{2} (x-\mu_0)^T \Sigma^{-1} (x-\mu_0) \right) $$
Where $\Sigma$ is the covariance matrix, which is shared between the two classes:
$ \begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{bmatrix} $
The variances ($\sigma_{11}, \sigma_{22}$) are just special cases of covariance, which measures how two variables change together:
$$ Cov(a,b) = \frac{1}{n-1} \sum\limits_{i=1}^{n} (a_i - \mu_a) (b_i - \mu_b) $$
Increasing the variance $\sigma_{11}$ alters the width of each distribution, but of course the mean remains fixed.
Increasing the covariance skews the shape of the bi-variate Gaussian, because the data become more linearly coupled.
Also, recall that the Pearson correlation is the covariance normalized by the product of the standard deviations: $\rho_{ab} = \frac{Cov(a,b)}{\sigma_a \sigma_b}$.
In the extreme, high covariance results in strong linear dependence and a Pearson correlation close to 1.
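As a quick sanity check, here is a minimal NumPy sketch (the variables $a$ and $b$ are made up for illustration) showing that the hand-normalized covariance matches np.corrcoef:

import numpy as np
rng = np.random.RandomState(0)
a = rng.normal(size=1000)
b = 0.8 * a + 0.2 * rng.normal(size=1000)        # linearly coupled with noise
cov_ab = np.cov(a, b)[0, 1]                      # sample covariance (n-1 denominator)
pearson = cov_ab / (a.std(ddof=1) * b.std(ddof=1))
print(cov_ab, pearson, np.corrcoef(a, b)[0, 1])  # the last two values agree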
import sys, os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# Class features are bi-variate Gaussian parameterized differently based upon class
u_0 = np.array([0,0])
u_1 = np.array([-1,-1])
# Shared covariance (must be symmetric and positive semi-definite)
cov = np.array([[1, 0.5], [0.5, 1]])
for u in [u_0, u_1]:
    X = np.random.multivariate_normal(u, cov, 10000)
    sns.jointplot(x=X[:, 0], y=X[:, 1], kind="hex")
The height of the surface at any particular point represents the probability that both $X_1$ and $X_2$ fall in a small region around that point.
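To make this concrete, we can evaluate that height directly (a small sketch assuming SciPy is available; it reuses u_0 and cov from the cell above):

from scipy.stats import multivariate_normal
# Density ("surface height") of the class-0 Gaussian at a few points
rv = multivariate_normal(mean=u_0, cov=cov)
print(rv.pdf([0, 0]))  # at the mean: the peak of the surface
print(rv.pdf([2, 2]))  # much smaller far from the mean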
With a model for $P(x^i \mid y^i)$, we can define a joint log-likelihood function:
$$ \ell(\phi, \mu_0, \mu_1, \Sigma) = \sum_{i=1}^m \log\left( P(x^i \mid y^i)\, P(y^i) \right) $$
In CS229, we solved this analytically by taking the derivative of the likelihood function with respect to each parameter, setting it equal to zero, and solving.
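For reference, setting those derivatives to zero yields the familiar closed-form estimates: the prior is the class frequency, each mean is the per-class sample mean, and $\Sigma$ is the covariance pooled around the class means:
$$ \phi = \frac{1}{m} \sum_{i=1}^m 1\{y^i = 1\} \qquad \mu_k = \frac{\sum_{i=1}^m 1\{y^i = k\}\, x^i}{\sum_{i=1}^m 1\{y^i = k\}} \qquad \Sigma = \frac{1}{m} \sum_{i=1}^m (x^i - \mu_{y^i})(x^i - \mu_{y^i})^T $$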
We have an assumed Gaussian density function for our features with respect to each class.
With the training data, we then parameterize Gaussian densities that maximize the likelihood, as before.
For each test example, we can then assign it to the class that yields a higher posterior probability:
$$ P(Y=k \mid X) = \frac{P(X \mid Y=k)\,P(Y=k)}{P(X)} $$
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=10000,n_features=2,centers=2) # Generate isotropic Gaussian blobs for clustering.
sns.jointplot(x=X[:, 0], y=X[:, 1], kind="hex")
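Before reaching for scikit-learn, here is a minimal sketch of the generative recipe on these blobs (names like phi, mu_0, and Sigma are mine; the evidence $P(X)$ cancels when comparing classes):

from scipy.stats import multivariate_normal
# MLE parameters from the training data
phi = y.mean()                                     # P(y=1): class frequency
mu_0, mu_1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
resid = X - np.where(y[:, None] == 0, mu_0, mu_1)  # center each point on its class mean
Sigma = resid.T @ resid / len(X)                   # shared (pooled) covariance
# Classify by the larger posterior numerator P(x|y=k)P(y=k)
p0 = multivariate_normal(mu_0, Sigma).pdf(X) * (1 - phi)
p1 = multivariate_normal(mu_1, Sigma).pdf(X) * phi
print(((p1 > p0).astype(int) == y).mean())         # training accuracy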
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
clf = LDA()
# Raw data
from matplotlib.colors import ListedColormap
cm_bright = ListedColormap(['#FF0000', '#0000FF'])
plt.scatter(X[:, 0],X[:, 1],c=y,cmap=cm_bright)
# Train
clf.fit(X,y)
# Meshgrid on which test classification
xx1, xx2 = np.meshgrid(np.linspace(X[:, 0].min(),X[:, 0].max(), 100), np.linspace(X[:, 1].min(),X[:, 1].max(), 100))
X_pred = np.c_[xx1.ravel(), xx2.ravel()] # convert 2d grid into seq of points
pred = clf.predict(X_pred)
Z = pred.reshape((100, 100)) # reshape seq to grid
plt.contourf(xx1,xx2,Z,cmap=plt.cm.RdBu,alpha=0.4)
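Because both classes share a covariance matrix, the resulting decision boundary is linear (hence linear discriminant analysis). As a quick check of the fit:

print(clf.score(X, y))  # mean accuracy on the training data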