Due date: Friday 9/21 at 11:59pm
In this assignment we will use the following datasets:
When writing your notebook, you can assume the datasets are in the same directory as the notebook. Please keep the same file names as in the UCI repository.
In this assignment you will work with several variants of the perceptron algorithm:
In each case make sure that your implementation of the classifier includes a bias term (in slide set 2 and page 7 in the book you will find guidance on how to add a bias term to an algorithm that is expressed without one).
Before we get to the adatron, we will derive an alternative form of the perceptron algorithm --- the dual perceptron algorithm. All we need to look at is the weight update rule:
$$\mathbf{w} \rightarrow \mathbf{w} + \eta y_i \mathbf{x}_i.$$This is performed whenever example $i$ is misclassified by the current weight vector. The thing to notice, is that the weight vector is always a weighted combination of the training examples since it is that way to begin with, and each update maintains that property. So in fact, rather than representing $\mathbf{w}$ explicitly, all we need to do is to keep track of how much each training example is contributing to the value of the weight vector, i.e. we will express it as:
$$\mathbf{w} = \sum_{i=1}^N \alpha_i y_i \mathbf{x}_i,$$where $\alpha_i$ are positive numbers that describe the magnitude of the contribution $\mathbf{x}_i$ is making to the weight vector, and $N$ is the number of training examples.
Therefore to initialize $\mathbf{w}$ to 0, we simply initialize $\alpha_i = 0$ for $i = 1,\ldots,N$. When expressed using the variables $\alpha_i$, the perceptron update rule becomes:
$$\alpha_i = \alpha_i + \eta y_i,$$and you can always retrieve the weight vector using its expansion in terms of the $\alpha_i$.
Now we're ready for the adatron - the only difference is in the initialization and update equation.
Initialization:
$\alpha_i = 1$ for $i = 1,\ldots,N$
Like in the perceptron we run the algorithm until convergence, or until a fixed number of epochs has passed (an epoch is a loop over all the training data), and an epoch of training consists of the following procedure:
for each training example $i=1,\ldots,N$ perform the following steps:
if
$(\alpha_i + \delta\alpha < 0)$ : $\alpha_i = 0$, else :
$\alpha_i = \alpha_i + \delta\alpha$The variable $\eta$ plays the role of the learning rate $\eta$ employed in the perceptron algorithm and $\delta \alpha$ is the proposed magnitude of change in $\alpha_i$. We note that the adatron tries to maintain a sparse representation in terms of the training examples by keeping many $\alpha_i$ equal to zero. The adatron converges to a special case of the SVM algorithm that we will learn later in the semester; this algorithm tries to maximize the margin with which each example is classified, which is captured by the variable $\gamma$ in the algorithm (notice that the magnitude of change proposed for each $\alpha_i$ becomes smaller as $\gamma$ increases towards 1).
Note: if you observe an overflow issues in running the adatron, add an upper bound on the value of $\alpha_i$.
Here's what you need to do:
# Your answer here
Whenever we train a classifier it is useful to know if we have collected a sufficient amount of data for accurate classification. A good way of determining that is to construct a learning curve, which is a plot of classifier performance (i.e. its error) as a function of the number of training examples. Plot a learning curve for the perceptron algorithm (with bias) using the Gisette dataset. The x-axis for the plot (number of training examples) should be on a logarithmic scale - something like 10,20,40,80,200,400,800. Use numbers that are appropriate for the dataset at hand, choosing values that illustrate the variation that you observe. What can you conclude from the learning curve you have constructed? Make sure that you use a fixed test set to evaluate performance while varying the size of the training set.
# Your answer here
In this section we will explore the effect of normalizing the data, focusing on normalization of features. The simplest form of normalization is to scale each feature to be in the range [-1, 1]. We'll call this scaling.
Here's what you need to do:
# your answer here
Answer the questions in the cells reserved for that purpose.
Mathematical equations should be written as LaTex equations; the assignment contains multiple examples of both inline formulas (such as the one exemplifying the notation for the norm of a vector $||\mathbf{x}||$ and those that appear on separate lines, e.g.:
$$ ||\mathbf{x}|| = \sqrt{\mathbf{x}^T \mathbf{x}}. $$Submit your report as a Jupyter notebook via Canvas. Running the notebook should generate all the plots and results in your notebook.
Here is what the grade sheet will look like for this assignment. A few general guidelines for this and future assignments in the course:
Grading sheet for the assignment:
Part 1: 60 points.
(30 points): Correct implementation of the classifiers
(15 points): Good protocol for evaluating classifier accuracy; results are provided in a clear and concise way
(15 points): Discussion of the results
Part 2: 20 points.
(15 points): Learning curves are correctly generated and displayed in a clear and readable way
( 5 points): Discussion of the results
Part 3: 20 points.
( 5 points): How to perform data scaling
(10 points): Comparison of normalized/raw data results; discussion of results
( 5 points): Range of features after standardization
Grading will be based on the following criteria: