Classification with Naive Bayes

COMP4670/8600 - Introduction to Statistical Machine Learning - Assignment 1


Student ID:


Maximum marks 19
Weight 19% of final grade
Format Complete this ipython notebook. Do not forget to fill in your name and student ID above
Submission mode Use wattle
Formulas All formulas which you derive need to be explained unless you use very common mathematical facts. Picture yourself as explaining your arguments to somebody who is just learning about your assignment. With other words, do not assume that the person marking your assignment knows all the background and therefore you can just write down the formulas without any explanation. It is your task to convince the reader that you know what you are doing when you derive an argument. Typeset all formulas in $\LaTeX$.
Code quality Python code should be well structured, use meaningful identifiers for variables and subroutines, and provide sufficient comments. Please refer to the examples given in the tutorials.
Code efficiency An efficient implementation of an algorithm uses fast subroutines provided by the language or additional libraries. For the purpose of implementing Machine Learning algorithms in this course, that means using the appropriate data structures provided by Python and in numpy/scipy (e.g. Linear Algebra and random generators).
Late penalty For every day (starts at midnight) after the deadline of an assignment, the mark will be reduced by 20%. No assignments shall be accepted if it is later than 5 days.
Cooperation All assignments must be done individually. Cheating and plagiarism will be dealt with in accordance with University procedures (please see the ANU policies on Academic Honesty and Plagiarism). Hence, for example, code for programming assignments must not be developed in groups, nor should code be shared. You are encouraged to broadly discuss ideas, approaches and techniques with a few other students, but not at a level of detail where specific solutions or implementation issues are described by anyone. If you choose to consult with other students, you will include the names of your discussion partners for each solution. If you have any questions on this, please ask the lecturer before you act.
Solution To be presented in the tutorials.

$\newcommand{\dotprod}[2]{\left\langle #1, #2 \right\rangle}$ $\newcommand{\onevec}{\mathbb{1}}$ $\newcommand{\B}[1]{\mathbf{#1}}$ $\newcommand{\Bphi}{\boldsymbol{\mathsf{\phi}}}$ $\newcommand{\BPhi}{\boldsymbol{\Phi}}$ $\newcommand{\Cond}{\,|\,}$ $\newcommand{\DNorm}[3]{\mathcal{N}(#1\Cond#2, #3)}$ $\newcommand{\DUniform}[3]{\mathcal{U}(#1 \Cond #2, #3)}$ $\newcommand{\Ex}[2][]{\mathbb{E}_{#1} \left[ #2 \right]}$ $\newcommand{\var}[1]{\operatorname{var}[#1]}$ $\newcommand{\cov}[1]{\operatorname{cov}[#1]}$ $\newcommand{\Norm}[1]{\lVert#1\rVert}$ $\DeclareMathOperator*{\argmax}{arg\,max}$

Setting up the environment (Please evaluate this cell to activate the $\LaTeX$ macros.)

In [ ]:
import csv, scipy, scipy.stats, collections, itertools
import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline

The data set

The data set contains census information. Our task is to predict whether an indivual earns more than some amount.

Please download the following data:

Read and preprocess the data.

In [ ]:
raw_data = np.genfromtxt("", dtype=str, delimiter=',', autostrip=True)

# targets
target = ">50K"
target_column = raw_data[:,-1]
Y = (target_column == target).reshape(-1,1)
assert any(Y) and any(~Y), set(target_column)

features = [
    'age', 'workclass', 'fnlwgt','education', 'education_num', 'marital_status', 
    'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss', 
    'hours_per_week', 'native_country'

raw_features = raw_data[:,:-1] # drop the target
assert raw_features.shape[1] == len(features)

def preprocess_continuous(column):
    # convert continuous variables stored as strings to binary by comparing with the median
    float_column = column.astype(float)
    return (float_column > np.median(float_column)).astype(float).reshape(-1,1)

def preprocess_categorical(column):
    # convert categorical variables to indicator vectors
    values = sorted(set(column))
    if len(values) == 2:
        values = values[:1]
    return np.hstack([column.reshape(-1,1)==v for v in values]).astype(float)

preprocessor = collections.defaultdict(lambda: preprocess_categorical)

# apply appropriate preprocessor to each column of raw_features
X_list = [preprocessor[feature](raw_features[:,features.index(feature)]) for feature in features]
for feature, X in zip(features, X_list):
    assert X.shape[0] == raw_features.shape[0]
    print(X.shape[1], '\t', feature)
make_feature_names = lambda feature, dimension: ['%s_%.2i' % (feature, i) for i in range(dimension)]
binary_feature_names_list = [make_feature_names(feature, X.shape[1]) for feature, X in zip(features, X_list)]
binary_feature_names = list(itertools.chain(*binary_feature_names_list))

X = np.hstack(X_list)
assert set(X.flatten()) == {0.0, 1.0}
assert len(binary_feature_names) == X.shape[1]

Plot the data.

In [ ]:
print('X.shape', '\t', X.shape)
print('Y.shape', '\t', Y.shape)
print('mean(Y)', np.mean(Y))

figure = lambda : plt.figure(figsize=(16,5))
plt.ylabel('data index');
plt.gca().set_xticklabels(binary_feature_names+['y'], rotation='vertical')

ntrain = 1200
ishuffle = np.arange(X.shape[0])
itrain = ishuffle[slice(0,ntrain)]
itest = ishuffle[slice(ntrain,None)]
X_train, Y_train = X[itrain,:], Y[itrain,:]
X_test, Y_test = X[itest,:], Y[itest,:]

(3 points) 1A: Naive Bayes: Maximum Likelihood (m.l.)

Assume we have dataset $\mathcal{D}=\left\{(\mathbf{x}_i,y_i\right)\}_{i=1,2,\ldots,n}$ where $\mathbf{x}_i\in\{0,1\}^d$, $y_i\in\{0,1\}$, $n$ is the number of data points and $d$ the number of features.

  1. State the independence assumption of the naive Bayes classifier.
  2. Appropriately assume Bernoulli random variables. Letting $p(x_j=1|y=k) = \rho_{j,k}$ for $k=0,1$ and $p(y=1)=\mu$, derive the maximum likelihood $\rho_{j,k}$ and $\mu$ (that is, the parameters which maximise the likelihood).
  3. Implement a function which computes these maximum likelihood parameters, and call it on X_train, Y_train.
  4. Plot the $\rho_{j,k}$ (vertical) vs. $j$ (horizontal) using plt.plot with marker='.', labeling appropriately.
  5. Print the number of $\rho_{j,k}$ which are zero.
  6. Explain the problems which zero- (or one-) valued $\rho_{j,k}$ can lead to.


--- replace this with your solution: add cells as necessary but leave the blue Answer above ---

(3 points) 1B: Naive Bayes: Maximum a Posteriori (m.a.p.)

Let $\rho_{j,k}\sim\text{Beta}(\beta)$, with p.d.f. $f_{\rho_{j,k}}(\rho)=\frac{\rho^{\beta-1}(1-\rho)^{\beta-1}}{Z(\beta)}$ where $Z$ is a normalisation factor. Assume a uniform prior for $\mu$.

  1. Derive the maximum a posteriori $\rho_{j,k}$ and $\mu$ given the above prior.
  2. Implement a function which computes these maximum a posteriori parameters.
  3. Verify with assert np.allclose() that the m.a.p. solution with $\beta=1$ is identical to the m.l. solution.
  4. Call your function on X_train, Y_train with $\beta=10,100,1000$. For each case scatter plot the m.a.p. parameters vs. the m.l. parameters, all on one axis, colored and labelled appropriately.
  5. Give one example of the role of $\beta$ as evidenced by the plot.


--- replace this with your solution: add cells as necessary but leave the blue Answer above ---

(2 points) 1C: Naive Bayes: Prediction

  1. Derive the log predictive distribution $\log p(y=1|\mathbf{x},\{\rho_{j,k}\},\mu)$. Where appropriate, implement this by adding log probabilities instead of multiplying probabilities. Use np.logaddexp to add probabilities stored in log space.
  2. Call the above funcion on X_test using m.a.p. parameters computed with $\beta=2$.
  3. Plot a histogram of the probabilities (use np.exp on the log probabilities) using plt.hist with normed=True,histtype='step',label='your_label'. Do this for the predictive test probabilities corresponding to Y_test==1 and Y_test==0, puting the histograms on the same plot, labeling appropriately.


--- replace this with your solution: add cells as necessary but leave the blue Answer above ---

(2 points) 1D: Naive Bayes: Evaluation

  1. Write a function evaluate which takes data X,Y and model parameters $\rho_{j,k}$ and $\mu$, and returns a dict with keys mean_logp_true (for the mean predictive log probability of the ground truth labels Y given the data matrix X and percent_correct (for the percent correctly classified, assuming we classify each datum into the class with maximum predictive probability).
  2. Write a function cross_validate(beta, nfolds, X, Y) which performs cross validation with nfolds folds, for hyper parameter $\beta$. For each split, call your evaluate function, and return a dict with averaged results for each evaluation metric. You can make use of our xval_inds function.
  3. For $\beta$ in np.logspace(np.log10(2),2,32) compute the evaluation metrics on the test set, X_test, Y_test, as well as the cross-validated estimates. Make one plot for each metric, showing both results on each (for a total of four curves). Label appropriately.
In [ ]:
def xval_inds(ndata, nfolds):
    # return a list of (trainind, testind) pairs for ndata data points and nfolds xval folds
    assert ndata % nfolds == 0
    nchunk = int(ndata / nfolds)
    itest = [list(range(i*nchunk,(i+1)*nchunk)) for i in range(int(nfolds))]
    itrain = [sorted(set(range(ndata)).difference(set(i))) for i in itest]
    return list(zip(itrain, itest))


--- replace this with your solution: add cells as necessary but leave the blue Answer above ---

(2 points) 1E: Naive Bayes: Discussion

  1. Explain the shortcomings of our preprocessing of (a) continous variables, and (b) categorical variables. Make note of the assumptions of our model.
  2. Suggest improvements to our data processing pipeline, both in terms of the representation and the model.
  3. Consider the distribution $p(a,b,c)$ where all three variables are binary. How many parameters are needed to specify this distribution (a) in general and (b) if it factorises as $p(a|c)p(b|c)p(c)$?


--- replace this with your solution: add cells as necessary but leave the blue Answer above ---

(3 points) 2A: Representation Proof

Consider a parametric model governed by the parameter vector $\mathbf{w}$ together with a dataset of input values $\mathbf{x}_1,\ldots,\mathbf{x}_N$ and a nonlinear feature mapping $\phi(\mathbf{x})$. Suppose that the dependence of the error function on $\mathbf{w}$ takes the form $J(\mathbf{w}) = f(\mathbf{w}^\top \phi(\mathbf{x}_1), \ldots, \mathbf{w}^\top \phi(\mathbf{x}_N)) + g(\mathbf{w}^\top \mathbf{w})$ where $g(\cdot)$ is a monotonically increasing function. By writing $\mathbf{w}$ in the form $$ \mathbf{w} = \sum_{n=1}^N \alpha_n \phi(\mathbf{x}_n) + \mathbf{w}_\perp $$ show that the value of $\mathbf{w}$ that minimizes $J(\mathbf{w})$ takes the form of a linear combination of the basis functions $\phi(\mathbf{x}_n)$ for $n = 1, \ldots, N$.


--- replace this with your solution: add cells as necessary but leave the blue Answer above ---

(2 points) Maximum likelihood (ML) and Maximum A Posteriori (MAP)

We assume data samples $X_n = \{ x_1,\dots,x_n \}$ are generated i.i.d. from a uniform distribution $ \DUniform{x}{0}{\theta} $ between $ 0 $ and an unknown positive parameter $\theta$: $$ p(x \Cond \theta) = \DUniform{x}{0}{\theta} = \begin{cases} 1/\theta & 0 \leq x \leq \theta \\ 0 & \textrm{otherwise} \\ \end{cases} $$

Assume the data samples $ X_4 = \{ 5, 7, 3, 9 \}$ have been observed.

  1. Calculate $\theta_{ML} = \argmax_{\theta} p(X_4 \Cond \theta)$, the maximum likelihood estimate of $\theta$ for the observed data.

  2. Calculate $p(\theta \Cond X_4)$, the posterior distribution of $\theta$ given that the data $ X_4 $ have been observed and the initial distribution for $\theta$ is given as $p(\theta) = p(\theta \Cond X_0) = \DUniform{x}{0}{10}$.

  3. Calculate $\theta_{MAP} = \argmax_{\theta} p(\theta \Cond X_4)$, the maximum a posteriori estimate of $\theta$ given the data $ X_4 $ and the initial distribution $p(\theta)$ as in the previous question.

  4. Calculate $\theta_{ML}$, $p(\theta \Cond X_4)$, and $\theta_{MAP}$ for the case that the observed data are $ X_4 = \{ 9, 5, 7, 3 \}$ instead of the $ X_4 = \{ 5, 7, 3, 9 \}$ given above.


--- replace this with your solution: add cells as necessary but leave the blue Answer above ---

(2 points) Laplace Approximation

The function $$ f(z) = z^k e^{-z^2/2} \qquad \qquad z \in [0, \infty), \qquad k > 0 $$ can be considered as an (unnormalised) probability density.

  1. Verify that it is possible to approximate $ f(z) $ with the Laplace Approximation.

  2. Using the Laplace Approximation, find the mean and the variance of the Normal Distribution which best approximates the normalised version of $ f(z) $.

  3. The analytical form of the normalisation of $ f(x) $ is not so easy to find. Therefore, use Python to implement a numerical approximation using $ N = 100 $ identically sized intervals between $ 0 $ and $ a = 10 $ to calculate the normalisation $$ \int_0^{\infty} f(z) \mathrm{d}z \approx \int_0^{a} z^k e^{-z^2/2} \mathrm{d}z \approx \sum_{i=1}^{100} \dots $$ and report the results for the normalisation with a precision of $5$ digits after the comma for the three cases $ k = \{0.5, 3, 5 \}$.

  4. For each of the three cases $ k = \{0.5, 3, 5 \}$, plot the normalised function $ f(z) $ and the corresponding Normal Distribution with parameters resulting from the Laplace Approximation.

  5. Why is it reasonable to replace the upper limit of $ \infty$ with $ a = 10 $ ?


--- replace this with your solution: add cells as necessary but leave the blue Answer above ---