Our first model assumption is that our target output can be modeled as
$y(\phi(\mathbf{x})) = f(\mathbf{w}^T\phi(\mathbf{x}))$
where $y$ will be a vector of probabilities with $K$ elements. The elements of $y$ are $y_i = p(C_i|\phi(\mathbf{x}))$ i.e. the probability that the correct class label is $C_i$ given the input vector $\mathbf{x}$. Often we will have simply that $\mathbf{w}^T\phi(\mathbf{x}) = \mathbf{w}^T\mathbf{x}+w_0$ in which case we will simply omit $\phi$ from the notation. The function $f$ is known as the activation function and its inverse is known as the link function. Note that $f$ will often be nonlinear.
It should be noted that nearly all of the material presented fails to be a fully Bayesian treatment of the classification problem. Primarily this is because a Bayesian approach to the classification problem is mathematically intractable. However, Bayes' Theorem will appear often in the discussion, which can lead to confusion. Where appropriate, I will try to clarify the difference.
Another factor to keep in mind is that the target variable, $y$, described in our model above, does not provide a decision for a class assignment for a given input $\mathbf{x}$. In real world cases where it is necessary to make a decision as to which class $\mathbf{x}$ should be assigned, one must apply an additional modeling step based on decision theory. There are a variety of decision models, all of which leverage the class posterior probability models, $p(C_k|\mathbf{x})$, such as
The models presented here are called linear because, when the decision criteria is that of minimizing the misclassification rate, they divide the input space into $K$ regions, where the boundaries between regions are linear functions of the input vector $\mathbf{x}$. The decision boundaries will correspond to where $\mathbf{w}^T\mathbf{x}=constant$, and thus represent a linear function of $\mathbf{x}$. In the case of transformed input, $\phi(\mathbf{x})$, the decision boundaries will correspond to where $\mathbf{w}^T\phi(\mathbf{x})=constant$, and thus represent a linear function of $\phi(\mathbf{x})$.
known as the softmax function where $a_j = \ln(p(\mathbf{x}|C_k)p(C_k))$. In the case of $K=2$, the activation function will reduce to the logistic sigmoid function
$$f = p(C_1|\mathbf{x}) = \frac{1}{1+\exp(-a)} = \sigma(a)$$
where $a = \ln \frac {p(\mathbf{x}|C_1)p(C_1)} {p(\mathbf{x}|C_2)p(C_2)}$
Note this is not the only possible form for the class posterior models. For example, one might also add a noise term to account for misclassification. This particular form for the activation function is a consequence of the model we choose for $p(\mathbf{x}|C_k)$ in sections below. Showing the outcome may be a bit of "putting the cart before the horse" but it will simplify the notation as we proceed.
Note that although we have applied Bayes' Theorem, this is not a Bayesian model. Nowhere have we modeled the parameter posterior probability $p(\mathbf{w}|\mathbf{t})$. Indeed, we will see shortly that we will use a maximum likelihood approach to determine $\mathbf{w}$.
These models are known as generative models because they can be used to generate synthetic input
data by applying Inverse Transform Sampling to the marginal distribution for $\mathbf{x}$ defined by
$p(\mathbf{x}) = \sum_k p(\mathbf{x}|C_k)p(C_k)$
To move forward, it is necessary to start making model assumptions. Here we will assume that we have continuous inputs, i.e. $x\in \Re$ (see Bishop 202 & Ng Lecture 5 pdf for discrete input using Naive Bayes & Laplace),and
that the class-conditional densities,
$p(\mathbf{x}|C_k)$ are modeled by a Gaussian distribution. Under the Gaussian assumption, the class-conditional density for class $C_k$ is
$p(\mathbf{x}|C_k) = \frac{1}{\left(2 \pi \right)^{D/2}} \frac{1}{|\mathbf{\Sigma}|^{1/2}} \exp \\{ -\frac{1}{2} (\mathbf{x} - \mathbf{\mu}_k)^T \mathbf{\Sigma}^{-1} (\mathbf{x} - \mathbf{\mu}_k) \\}$
where $D$ is the dimension of the input vector $\mathbf{x}$, $\mathbf{\Sigma}$ is the covariance matrix and $\mathbf{\mu}$ is the mean vector and where we have assumed that all classes share the same covariance matrix.
In the case of two classes, this result is substituted into the logistic sigmoid function and reduces to
$p(C_1|\mathbf{x}) = \sigma \left( \mathbf{w}^T \mathbf{x} + w_0 \right)$
were we have defined
$\mathbf{w} = \mathbf{\Sigma}^{-1} \left( \mathbf{\mu}_1 - \mathbf{\mu}_2 \right)$
$w_0 = -\frac{1}{2} \mathbf{\mu}_1^T \mathbf{\Sigma}^{-1} \mathbf{\mu}_1 + \frac{1}{2} \mathbf{\mu}_2^T \mathbf{\Sigma}^{-1} \mathbf{\mu}_2 + \ln \frac{p(C_1)} {p(C_2)}$
Notice that the class prior probabilities, $p(C_k)$, effectively act as a bias term. Also note, that we have yet to specify a model for these distributions. If we are to use the result above, we will need to make another model assumption. Here we will assume that the class priors are modeled by a Bernoulli distribution with $p(C_1)=\gamma$ and $p(C_2)=1-\gamma$.
These results can be extended to the case of $K>2$ classes for which we obtain
$a_k(\mathbf{x}) = \left[\mathbf{\Sigma}^{-1} \mathbf{\mu}_k \right]^T \mathbf{x} - \frac{1}{2} \mathbf{\mu}_k^T \mathbf{\Sigma}^{-1} \mathbf{\mu}_k + \ln p(C_k)$
which is used in the first activation function provided at the begining of this section.
Note that we still have not formulated a complete model for the posterior class densities, in that we have not yet solved for the model parameters, $\mathbf{\mu}$ and $\mathbf{\Sigma}$. We do that now using a maximum likelihood approach.
Taking the derivate of this expression with respect to the various model parameters, $\pi$, $\mu_1$, $\mu_2$, and $\mathbf{\Sigma}$, and setting it equal to zero, we obtain the following
estimates
$\pi = \frac{N_1}{N_1+N_2}$
where $N_1$ is the number of training inputs in class $C_1$ and $N_2$ is the number in class $C_2$.
$\mu_1 = \frac{1}{N_1} \sum_{n=1}^N t_n \mathbf{x_n}$
$\mu_2 = \frac{1}{N_2} \sum_{n=1}^N (1-t_n) \mathbf{x_n}$
$\mathbf{\Sigma} = \frac{1}{N}\left[ \sum_{n\in C_1} (\mathbf{x_n}-\mu_1)(\mathbf{x_n}-\mu_1)^T + \sum_{n\in C_2} (\mathbf{x_n}-\mu_2)(\mathbf{x_n}-\mu_2)^T \right] $
import numpy as np
import math
from matplotlib import pyplot as plt
import scipy.integrate as sciInt
import scipy.optimize as sciOpt
#select truth data values, these will NOT be known to the training algorithm, they
#are only used in generating the sample data
tu1 =-2.0
tu2 = 2.0
tSigma = 2.0
tGamma = 0.5
#the class conditional probability p(x|Ck)
def p_xCk(x, mu, sigma):
denom = math.sqrt(2.0 * math.pi * sigma)
arg = -0.5 * (x - mu) * (x - mu) / sigma
return math.exp(arg) / denom
#the marginal probability p(x)
def p_x(x, mu1, mu2, sigma, gamma):
return gamma * p_xCk(x, mu1, sigma) + (1.0 - gamma) * p_xCk(x, mu2, sigma)
#posterior class probability vector (p(C_1|x), p(C_2|x))
def p_Ckx(x, mu1, mu2, sigma, gamma):
a = math.log(p_xCk(x, mu1, sigma)*gamma/(p_xCk(x,mu2,sigma)*(1-gamma)))
pc1 = 1.0/(1.0 + math.exp(-a))
return (pc1, 1.0 - pc1)
domain = range(0,100,1)
domain = [(-5.0 + x/10.0) for x in domain]
f, axarr = plt.subplots(2,2)
f.subplots_adjust(right=1.5)
f.subplots_adjust(top=1.5)
#plot p(x|Ck)
pxC1 = [p_xCk(x,tu1,tSigma) for x in domain]
pxC2 = [p_xCk(x,tu2,tSigma) for x in domain]
ax1 = axarr[0,0]
ax1.plot(domain, pxC1)
ax1.plot(domain, pxC2)
ax1.set_ylabel('p(x|C_k)')
#plot the marginal distribution p(x)
px = [p_x(x, tu1, tu2, tSigma, tGamma) for x in domain]
ax2 = axarr[0,1]
ax2.plot(domain,px)
ax2.set_ylabel('p(x)')
#plot the posterior distributions p(C_1|x) & p(C_2|x)
pc1x = []
pc2x = []
for x in domain:
pck = p_Ckx(x, tu1, tu2, tSigma, tGamma)
pc1x.append(pck[0])
pc2x.append(pck[1])
ax3 = axarr[1,0]
ax3.plot(domain, pc1x)
ax3.plot(domain, pc2x)
ax3.set_ylabel('p(C_k|x)')
#the cumulative distribution function P(x<X)
def cdf(x, mu1, mu2, sigma, gamma):
return sciInt.quad(func=p_x, a=-np.inf, b=x, args=(mu1, mu2, sigma, gamma))
#inverse of the CDF
def inv_cdf(y, mu1, mu2, sigma, gamma):
def f(x):
return cdf(x,mu1,mu2,sigma,gamma)[0] - y
return sciOpt.newton(f, 0)
#this code can be used to check our inv_cdf func, if its correct, domain and cdfs should be the same
#domain = [x/10.0 for x in range(12)[1:11]]
#icdfs = [inv_cdf(x, tu1, tu2, tSigma, tGamma) for x in domain]
#cdfs = [cdf(x, tu1, tu2, tSigma, tGamma)[0] for x in icdfs]
#print domain
#print icdfs
#print cdfs
seed = 123456789
np.random.seed(seed)
num_samples = 100
x = np.zeros(num_samples)
t = np.zeros(num_samples)
pcx = np.zeros(num_samples)
n1 = 0
nae = 0
assignment_epsilon = 0.5
#assign x to 1 for C1 and 0 for C2
for i in range(num_samples):
rv = np.random.uniform()
x[i] = inv_cdf(rv, tu1, tu2, tSigma, tGamma)
pcx1 = p_Ckx(x[i], tu1, tu2, tSigma, tGamma)
pcx2 = pcx1[1]
pcx1 = pcx1[0]
#we don't want a perfect dividing line for our domain, otherwise why would we need a learning algorithim?
if math.fabs(pcx2-pcx1) <= assignment_epsilon:
nae = nae + 1
if np.random.uniform() <= 0.5:
t[i] = 1
n1 = n1 + 1
else:
t[i] = 0
elif pcx1 > pcx2:
t[i] = 1
n1 = n1 + 1
else: t[i]=0
#plot the simulated data
ax4 = axarr[1,1]
ax4.scatter(x, t)
ax4.axvline(0)
ax4.set_ylabel('class')
plt.show()
#compute the paratmer estimates using the synthetic training data
n2 = num_samples - n1
eGamma = n1 / float(num_samples)
zipped = zip(x,t)
eu1 = reduce(lambda accum, Z: accum + Z[0]*Z[1], zipped, 0) / float(n1)
eu2 = reduce(lambda accum, Z: accum + Z[0]*(1-Z[1]), zipped, 0) / float(n2)
def Sn(Z, ck, mu=(eu1,eu2)):
idx = 1-int(ck)
return (Z[0]-mu[idx])*(Z[0]-mu[idx])
eSigma = reduce(lambda accum, Z: accum + Sn(Z, Z[1]), zipped, 0) / float(num_samples)
print 'The number of inputs in C1 is {0}'.format(n1)
print 'The number of inputs that were assigned based on the assignment factor is {0}'.format(nae)
print 'The true model parameters are gamma={0}, u1={1}, u2={2}, Sigma={3}'.format(tGamma, tu1, tu2, tSigma)
print 'The estimated model parameters are gamma={0}, u1={1}, u2={2}, Sigma={3}'.format(eGamma, eu1, eu2, eSigma)
The number of inputs in C1 is 49 The number of inputs that were assigned based on the assignment factor is 12 The true model parameters are gamma=0.5, u1=-2.0, u2=2.0, Sigma=2.0 The estimated model parameters are gamma=0.49, u1=-2.1627582967, u2=2.02800375873, Sigma=1.66604897528
$y(\phi(\mathbf{x})) = f(\mathbf{w}^T\phi(\mathbf{x}))$
as the probabilistic generative models. However, rather than formulate models for the class-conditional densities, $p(\phi(\mathbf{x})|C_k)$, and the class priors, $p(C_k)$, the discriminative approach explicitly models the class posterior probabilities, $p(C_k|\phi(\mathbf{x}))$ with model parameters $\mathbf{w}$. As in the probabilistic generative approach, maximum likelihood is used to estimate the model parameters given some training data set, $\mathbf{t}$. The difference is the form of the likelihood function. In the probabilistic generative case, the likelihood function is a function of the joint probability, $p(\phi(\mathbf{x}),C_k)=p(\phi(\mathbf{x})|C_k)p(C_k)$. In the probabilistic discriminative approach, the likelihood function is a function of the conditional class posteriors, $p(C_k|\phi(\mathbf{x}))$ only.
NOTE The section on probablistic generative models focused on models that used the input, $\mathbf{x}$, directly. However, as noted previously, those models, and the results, hold equally well for input that undergoes a fixed transformation using a set of basis functions, $\phi$. In this section, we will focus on the inclusion of an input transformation via the basis functions.
Superficially, this error function looks the same as that obtained for linear regression under the assumption of a Gaussian noise model which had a closed form solution. However, the nonlinearity of the sigmoid function, $\sigma(\mathbf{w}^T \phi(\mathbf{x}_n))$ prevents a closed form solution in the logistic regression problem. We therefore must apply an iterative method to obtain a numerical solution for the parameters, $\mathbf{w}$. Here we will
consider the Newton-Raphson method for which minimizing the error function takes the form
$\mathbf{w}^{\tau+1} = \mathbf{w}^{\tau} - \mathbf{H}^{-1}\bigtriangledown E(\mathbf{w})$
where $\mathbf{H}$ is the Hessian matrix composed of the second derivatives of the error function
$\mathbf{H} = \bigtriangledown \bigtriangledown E(\mathbf{w}) = \Phi^T \mathbf{R} \Phi$
where $\Phi$ is the $N \times M$ design matrix whos $n^{th}$ row is given by $\phi(\mathbf{x_n})^T$ and $\mathbf{R}$ is an $N \times N$ diagonal matrix with elements $R_{n,n} = \sigma(\mathbf{w}^T \phi(\mathbf{x}_n)) \left[1-\sigma(\mathbf{w}^T \phi(\mathbf{x}_n))\right]$. This can be reduced to a form equivalent to that of localy weighted linear regression as follows
$\mathbf{w}^{\tau+1} = \left( \Phi^T \mathbf{R} \Phi \right)^{-1} \Phi^T \mathbf{R} \mathbf{z}$
where $\mathbf{z}$ is an $N$ dimensional vector defined by
$\mathbf{z} = \Phi \mathbf{w}^{\tau} - \mathbf{R}^{-1}(\mathbf{y} - \mathbf{t})$
import numpy as np
from matplotlib import pyplot as plt
seed = 123456789
np.random.seed(seed)
N = 100 #number of data points, SEE WHAT HAPPENS AS YOU INCREASE THIS TO SAY 200
D = 2 #dimension of input vector
t = np.zeros(N) #training set classifications
X = np.zeros((N,D)) #training data in input space
sigma = .25
mu0 = 0.0
mu1 = 1.0
#func for generating 2D class scatter plot
def createScatter(X, t, ax):
C1x = []
C1y = []
C2x = []
C2y = []
for i in range(len(t)):
if t[i] > 0:
C1x.append(X[i,0])
C1y.append(X[i,1])
else:
C2x.append(X[i,0])
C2y.append(X[i,1])
ax.scatter(C1x,C1y)
ax.scatter(C2x,C2y, color='r')
#Generate test data. (NOTE THIS IS NOT BASED ON A GENERATIVE APPROACH. I AM PICKING SOMETHING THAT I KNOW (OK, THINK) WILL WORK. IN PRACTICE THIS DATA WOULD BE OBTAINED VIA OBSERVATION)
#Pick a value from a uniform distribution [0,1). If it is less than 0.5, assign class 1 and pick x1,x2 from a ND(mu0,sigma) otherwise assign class 2 and pick x1,x2 from ND(mu1,sigma)
for i in range(N):
#choose class to sample for
fac = 1
if np.random.rand() <= 0.5:
thismu = mu0
t[i] = 1
else:
thismu = mu1
t[i] = 0
if np.random.rand() < 0.5: fac = -1
X[i,0] = fac * np.random.normal(thismu, sigma)
X[i,1] = fac * np.random.normal(thismu, sigma)
f, axarr = plt.subplots(1,3)
f.subplots_adjust(right=2.5)
f.text(0.75,0.975,'Fixed Input Transformation Yields Linear Class Boundary',horizontalalignment='center',verticalalignment='top')
ax1 = axarr[0]
ax1.set_xlabel('x1')
ax1.set_ylabel('x2')
createScatter(X,t,ax1)
#The training data does not have a linear boundary in the original input space. So lets apply a tranformation, phi to try to make it linearly seperable
#NOTE: This transformation is not the only one that works. For example try switching the values of MU1 and MU2. The result will be a different mapping
#that is still linearly seperable
def phi(x,mu,sigma):
detSigma = np.linalg.det(sigma)
fac = math.pow(2*math.pi, len(mu)/2.0) * math.sqrt(detSigma)
arg = -0.5 * np.dot((x-mu).T, np.dot(np.linalg.inv(sigma), x-mu) )
return math.exp(arg) / fac
phiX = np.zeros((N,D))
MU1 = np.ones(D)*mu0
MU2 = np.ones(D)*mu1
SIGMA = np.diag(np.ones(D))*sigma
for i in range(N):
phiX[i,0] = phi(x=X[i,:], mu=MU2, sigma=SIGMA)
phiX[i,1] = phi(x=X[i,:], mu=MU1, sigma=SIGMA)
ax2 = axarr[1]
ax2.set_xlabel('phi1(x)')
ax2.set_ylabel('phi2(x)')
createScatter(phiX, t, ax2)
#Now lets apply machine learning to determine the boundary. We will assume M = 3, i.e. that there are 3 free parameters that is w = [w0, w1, w2]^T and phi_n = [1, phiX[0], phiX[1]]
M = 3
Phi = np.ones((N,M))
Phi[:,1] = phiX[:,0]
Phi[:,2] = phiX[:,1]
w = np.zeros(M)
R = np.zeros((N,N))
y = np.zeros(N)
def sigmoid(a):
return 1.0 / (1.0 + math.exp(-a))
def totalErr(y,t):
e = 0.0
for i in range(len(y)):
if t[i] > 0:
e += math.log(y[i])
else:
e += math.log(1.0 - y[i])
return -e
#start Newton-Raphson. As a stopping criteria we will use a tolerance on the change in the error function and a max number of iterations
max_its = 100
tol = 1e-2
w0 = [w[0]]
w1 = [w[1]]
w2 = [w[2]]
err = []
error_delta = 1 + tol
current_error = 0
idx = 0
while math.fabs(error_delta)>tol and idx < max_its:
#update y & R
for i in range(N):
zipped = zip(w, Phi[i,:])
y[i] = sigmoid(reduce(lambda accum, Z: accum + Z[0]*Z[1], zipped, 0))
R[i,i] = y[i] - y[i]*y[i]
#update w
z = np.dot(Phi,w) - np.dot(np.linalg.pinv(R),y-t)
temp = np.linalg.pinv(np.dot(np.dot(Phi.T,R),Phi))
temp2 = np.dot(np.dot(temp, Phi.T),R)
w = np.dot(temp2, z)
w0.append(w[0])
w1.append(w[1])
w2.append(w[2])
idx += 1
temp = totalErr(y,t)
error_delta = current_error - temp
current_error = temp
err.append(error_delta)
print 'The total number of iterations was {0}'.format(idx)
print 'The total error was {0}'.format(current_error)
print 'The final change in error was {0}'.format(error_delta)
print 'The final parameters were {0}'.format(w)
#our decision boundary is now formed by the line where sigma(a) = 0.5, i.e. where a = 0, which for this example is where phi2 = -(w1/w2)phi1, i.e. where w * phi = .5
bdryx = (0,1)
bdryy = (-(w[0]+w[1]*bdryx[0])/w[2], -(w[0]+w[1]*bdryx[1])/w[2])
ax2.plot(bdryx, bdryy)
ax3 = axarr[2]
ax3.plot(w0, color = 'blue')
ax3.plot(w1, color = 'red')
ax3.plot(w2, color = 'green')
ax3.plot(err, color = 'black')
ax3.legend(("w0","w1","w2", "error delta"), loc='upper left')
ax3.set_xlabel('Newton-Raphson Iteration')
The total number of iterations was 13 The total error was 0.0034822214269 The final change in error was 0.00577180168398 The final parameters were [ -17.84374926 -14.0620823 163.06478261]
<matplotlib.text.Text at 0x108f0ad50>