# Discriminative Classification¶

### Problem: difficult class-conditional data distribitions¶

Our task will be the same as in the preceding class on (generative) classification. But this time, the class-conditional data distributions look very non-Gaussian, yet the linear discriminative boundary looks easy enough:

In [1]:
# Generate dataset {(x1,y1),...,(xN,yN)}
# x is a 2-d feature vector [x_1;x_2]
# y ∈ {false,true} is a binary class label
# p(x|y) is multi-modal (mixture of uniform and Gaussian distributions)
using PyPlot
include("scripts/lesson8_helpers.jl")
N = 200
X, y = genDataset(N) # Generate data set, collect in matrix X and vector y
X_c1 = X[:,findall(.!y)]'; X_c2 = X[:,findall(y)]' # Split X based on class label
X_test = [3.75; 1.0] # Features of 'new' data point
function plotDataSet()
plot(X_c1[:,1], X_c1[:,2], "bx", markersize=8)
plot(X_c2[:,1], X_c2[:,2], "r+", markersize=8, fillstyle="none")
plot(X_test[1], X_test[2], "ko")
xlabel(L"x_1"); ylabel(L"x_2"); legend([L"y=0", L"y=1",L"y=?"], loc=2)
xlim([-2;10]); ylim([-4, 8])
end
plotDataSet();


### Main Idea of Discriminative Classification¶

• Again, a data set is given by $D = \{(x_1,y_1),\dotsc,(x_N,y_N)\}$ with $x_n \in \mathbb{R}^D$ and $y_n \in \mathcal{C}_k$, with $k=1,\ldots,K$.
• Sometimes, the precise assumptions of the (multinomial-Gaussian) generative model $$p(x_n,\mathcal{C}_k|\theta) = \pi_k \cdot \mathcal{N}(x_n|\mu_k,\Sigma)$$ clearly do not match the data distribution.
• Here's an IDEA! Let's model the posterior $$p(\mathcal{C}_k|x_n)$$ directly, without any assumptions on the class densities.
• Of course, this implies also that we build direct models for the discrimination boundaries $$\log \frac{p(\mathcal{C}_k|x_n)}{p(\mathcal{C}_j|x_n)} \overset{!}{=} 0$$

### 1. Model Specification¶

• [Q.] What model should we use for $p(\mathcal{C}_k|x_n)$?
• [A.] Get inspiration from the generative approach: choose the familiar softmax structure with linear discrimination bounderies for the posterior class probability $$p(\mathcal{C}_k|x_n,\theta) = \frac{e^{\theta_k^T x_n}}{\sum_j e^{\theta_j^T x_n}}$$ but do not impose a Gaussian structure on the class features.
• $\Rightarrow$ There are two key differences between the discriminative and generative approach:
1. In the discriminative approach, the parameters $\theta_k$ are not structured into $\{\mu_k,\Sigma,\pi_k \}$. This provides discriminative approach with more flexibility.
2. ML learning for the discriminative approach by optimization of conditional likelihood $\prod_n p(y_n|x_n,\theta)$ rather than joint likelihood $\prod_n p(y_n,x_n|\theta)$.

### 2. ML Estimation for Discriminative Classification¶

• The conditional log-likelihood for discriminative classification is

$$\mathrm{L}(\theta) = \log \prod_n \prod_k {p(\mathcal{C}_k|x_n,\theta)}^{y_{nk}}$$

• Computing the gradient $\nabla_{\theta_k} \mathrm{L}(\theta)$ (NB: revised text) leads to (for proof, see next slide)
$$\nabla_{\theta_k} \mathrm{L}(\theta) = \sum_n \Big( \underbrace{y_{nk}}_{\text{target}} - \underbrace{\frac{e^{\theta_k^T x_n}}{ \sum_j e^{\theta_j^T x_n}}}_{\text{prediction}} \Big)\cdot x_n$$
• Compare this to the gradient for linear regression:
$$\nabla_\theta \mathrm{L}(\theta) = \sum_n \left(y_n - \theta^T x_n \right) x_n$$
• In both cases
$$\nabla_\theta \mathrm{L} = \sum_n \left( \text{target}_n - \text{prediction}_n \right) \cdot \text{input}_n$$
• The parameter vector $\theta$ for logistic regression can be estimated through iterative gradient-based adaptation. E.g. (with iteration index $i$),
$$\hat{\theta}^{(i+1)} = \hat{\theta}^{(i)} + \eta \cdot \left. \nabla_\theta \mathrm{L}(\theta) \right|_{\theta = \hat{\theta}^{(i)}}$$

### 2. (OPTIONAL) Proof of Derivative of Log-likelihood for Discriminative Classification¶

• The Log-likelihood is $\mathrm{L}(\theta) = \log \prod_n \prod_k {\underbrace{p(\mathcal{C}_k|x_n,\theta)}_{p_{nk}}}^{y_{nk}} = \sum_{n,k} y_{nk} \log p_{nk}$
• Use the fact that the softmax $\phi_k \equiv e^{a_k} / {\sum_j e^{a_j}}$ has analytical derivative:
\begin{align*} \frac{\partial \phi_k}{\partial a_j} &= \frac{(\sum_j e^{a_j})e^{a_k}\delta_{kj}-e^{a_j}e^{a_k}}{(\sum_j e^{a_j})^2} = \frac{e^{a_k}}{\sum_j e^{a_j}}\delta_{kj} - \frac{e^{a_j}}{\sum_j e^{a_j}} \frac{e^{a_k}}{\sum_j e^{a_j}}\\ &= \phi_k \cdot(\delta_{kj}-\phi_j) \end{align*}
• Take the derivative of $\mathrm{L}(\theta)$ (or: how to spend a hour ...) \begin{align*} \nabla_{\theta_j} \mathrm{L}(\theta) &= \sum_{n,k} \frac{\partial \mathrm{L}_{nk}}{\partial p_{nk}} \cdot\frac{\partial p_{nk}}{\partial a_{nj}}\cdot\frac{\partial a_{nj}}{\partial \theta_j} \\ &= \sum_{n,k} \frac{y_{nk}}{p_{nk}} \cdot p_{nk} (\delta_{kj}-p_{nj}) \cdot x_n \\ &= \sum_n \Big( y_{nj} (1-p_{nj}) -\sum_{k\neq j} y_{nk} p_{nj} \Big) \cdot x_n \\ &= \sum_n \left( y_{nj} - p_{nj} \right)\cdot x_n \\ &= \sum_n \Big( \underbrace{y_{nj}}_{\text{target}} - \underbrace{\frac{e^{\theta_j^T x_n}}{\sum_{j^\prime} e^{\theta_{j^\prime}^T x_n}}}_{\text{prediction}} \Big)\cdot x_n \end{align*}

### 3. Application - Classify a new input¶

• Discriminative model-based prediction for a new input $x_\bullet$ is easy, namely substitute the ML estimate in the model to get
$$p(\mathcal{C}_k |\, x_\bullet,\hat\theta) = \frac{ \mathrm{exp}\left( \hat \theta_k^T x_\bullet \right) }{ \sum_{k^\prime} \mathrm{exp}\left(\hat \theta_{k^\prime}^T x_\bullet \right)} \propto \mathrm{exp}\left(\hat \theta_k^T x_\bullet\right)$$
• The contours of equal probability (discriminant boundaries) are lines (hyperplanes) in feature space given by $$\log \frac{{p(\mathcal{C}_k|x,\hat \theta )}}{{p(\mathcal{C}_j|x,\hat \theta )}} = \left( \hat{\theta}_{k} - \hat{\theta}_j\right) ^T x = 0$$

#### CODE EXAMPLE¶

Let us perform ML estimation of $\theta$ on the data set from the introduction. To allow an offset in the discrimination boundary, we add a constant 1 to the feature vector $x$. We only have to specify the (negative) log-likelihood and the gradient w.r.t. $\theta$. Then, we use an off-the-shelf optimisation library to minimize the negative log-likelihood.

We plot the resulting maximum likelihood discrimination boundary. For comparison we also plot the ML discrimination boundary obtained from the generative Gaussian classifier from lesson 7.

In [2]:
using Optim # Optimization library

y_1 = zeros(length(y))# class 1 indicator vector
y_1[findall(y)] .= 1
X_ext = vcat(X, ones(1, length(y))) # Extend X with a row of ones to allow an offset in the discrimination boundary

# Implement negative log-likelihood function
function negative_log_likelihood(θ::Vector)
# Return negative log-likelihood: -L(θ)
p_1 = 1.0 ./ (1.0 .+ exp.(-X_ext' * θ))   # P(C1|X,θ)
return -sum(log.( (y_1 .* p_1) + ((1 .- y_1).*(1 .- p_1))) ) # negative log-likelihood
end

# Use Optim.jl optimiser to minimize the negative log-likelihood function w.r.t. θ
results = optimize(negative_log_likelihood, zeros(3), LBFGS())
θ = results.minimizer

# Plot the data set and ML discrimination boundary
plotDataSet()
p_1(x) = 1.0 ./ (1.0 .+ exp(-([x;1.]' * θ)))
boundary(x1) = -1 ./ θ[2] * (θ[1]*x1 .+ θ[3])
plot([-2.;10.], boundary([-2.; 10.]), "k-");
# # Also fit the generative Gaussian model from lesson 7 and plot the resulting discrimination boundary for comparison
generative_boundary = buildGenerativeDiscriminationBoundary(X, y)
plot([-2.;10.], generative_boundary([-2;10]), "k:");
legend([L"y=0";L"y=1";L"y=?";"Discr. boundary";"Gen. boundary"], loc=3);


Given $\hat{\theta}$, we can classify a new input $x_\bullet = [3.75, 1.0]^T$:

In [3]:
x_test = [3.75;1.0]
println("P(C1|x•,θ) = $(p_1(x_test))")  P(C1|x•,θ) = 0.5982499473948646  • The generative model gives a bad result because the feature distribution of one class is clearly non-Gaussian: the model does not fit the data well. • The discriminative approach does not suffer from this problem because it makes no assumptions about the feature distribition$p(x|y)$, it just estimates the conditional class distribution$p(y|x)$directly. ### Recap Classification¶  Generative Discriminative 1 Like density estimation, model joint prob. $$p(\mathcal{C}_k) p(x|\mathcal{C}_k) = \pi_k \mathcal{N}(\mu_k,\Sigma)$$ Like (linear) regression, model conditional $$p(\mathcal{C}_k|x,\theta)$$ 2 Leads to softmax posterior class probability $$p(\mathcal{C}_k|x,\theta ) = e^{\theta_k^T x}/Z$$ with structured$\theta$Choose also softmax posterior class probability $$p(\mathcal{C}_k|x,\theta ) = e^{\theta_k^T x}/Z$$ but now with 'free'$\theta$3 For Gaussian$p(x|\mathcal{C}_k)$and multinomial priors, $$\hat \theta_k = \left[ {\begin{array}{c} { - \frac{1}{2} \mu_k^T \sigma^{-1} \mu_k + \log \pi_k} \\ {\sigma^{-1} \mu_k } \\ \end{array}} \right]$$ in one shot. Find$\hat\theta_k\$ through gradient-based adaptation $$\nabla_{\theta_k}\mathrm{L}(\theta) = \sum_n \Big( y_{nk} - \frac{e^{\theta_k^T x_n}}{\sum_{k^\prime} e^{\theta_{k^\prime}^T x_n}} \Big)\, x_n$$
In [4]:
open("../../styles/aipstyle.html") do f