These are notes I wrote for myself while studying for PhD qualifying exams at Stanford.
I wanted to publish them so that others could benefit.
They draw from several sources:
Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed.
It does this through induction:
Learners turn a small amount of input knowledge into a large amount of output knowledge.
The input knowledge for induction is captured in a training set of data.
The goal of machine learning is to generalize beyond the examples in the training set.
To generalize beyond the training set, the learner must embody knowledge or assumptions that go beyond the data it is given.
(1) Supervised learning: learn a function $h : x \mapsto y$ so that $h(x)$ is a good predictor of $y$.
(2) Unsupervised learning: let the computer determine structure and patterns in data:
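As a minimal sketch of the supervised setting, here is a learned predictor $h$ fit by least squares on entirely synthetic data (the linear form and all numbers are illustrative):

```python
import numpy as np

# Synthetic training set: y is roughly 2x + 1 plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=50)

# "Learn" h by least squares so that h(x) is a good predictor of y.
w, b = np.polyfit(x, y, deg=1)

def h(x_new):
    return w * x_new + b
```

The fitted coefficients recover the generating relationship, so $h$ predicts well on inputs it has not seen.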
Representation: The learner must be expressed in a formal language that the computer can handle. For a classifier, this is the hypothesis space it can represent, e.g. decision trees, rule sets, or linear models.
Evaluation function: We use this to distinguish good learners from bad ones.
Optimization (learning strategy): The method we use to search for the highest-scoring learner.
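To make the three components concrete, here is a hypothetical end-to-end sketch on synthetic data: the representation is a linear function, the evaluation function is mean squared error, and the optimization is plain gradient descent (all names and settings are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=100)
y = 3.0 * x - 0.5 + rng.normal(scale=0.1, size=100)

# Representation: linear functions h(x) = w*x + b.
def predict(w, b, x):
    return w * x + b

# Evaluation function: mean squared error on the training set.
def mse(w, b):
    return np.mean((y - predict(w, b, x)) ** 2)

# Optimization: gradient descent searches for the highest-scoring
# (here: lowest-MSE) learner in the hypothesis space.
w, b = 0.0, 0.0
lr = 0.1
for _ in range(1000):
    err = predict(w, b, x) - y
    w -= lr * 2.0 * np.mean(err * x)
    b -= lr * 2.0 * np.mean(err)
```

Swapping any one component (say, trees for the representation, or log-loss for the evaluation) yields a different learning algorithm while the other two pieces stay the same.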
For example, let's consider a practical case.
We start with training data: a set of predictor values ($X^1_1$ to $X^1_n$) for each target $Y^1$.
There exists some $f$ that captures the relationship between $Y$ and $X$.
Machine learning refers to a set of approaches for estimating $f$.
$Y = f(X) + \epsilon$
$f$: the fixed but unknown function capturing the systematic relationship between $X$ and $Y$.
$Y$: the response (target) we want to predict.
$\epsilon$: a random error term, independent of $X$ with mean zero; it puts a floor on how well we can ever predict.
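A small simulation (with a made-up $f$ and synthetic data) shows why $\epsilon$ makes part of the error irreducible: even predicting with the true $f$, the expected squared error cannot fall below $\mathrm{Var}(\epsilon)$:

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x):
    # A made-up "true" relationship; in practice f is unknown.
    return np.sin(x)

x = rng.uniform(0, 2 * np.pi, size=100_000)
eps = rng.normal(scale=0.3, size=x.size)   # Var(epsilon) = 0.09
y = f(x) + eps

# Even the perfect predictor f hits the noise floor Var(epsilon).
err_true_f = np.mean((y - f(x)) ** 2)
```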
Our choice for $f$ is informed by:
Above, we mentioned that we use an evaluation function to distinguish good learners from bad ones.
For example, the expected test mean squared error decomposes as: $E\left[(y_0 - \hat{f}(x_0))^2\right] = \left[\mathrm{Bias}(\hat{f}(x_0))\right]^2 + \mathrm{Var}(\hat{f}(x_0)) + \mathrm{Var}(\epsilon)$
Note that it is composed of:
import os
from IPython.display import Image

# Bias-variance trade-off figure, from the ISL text.
Image(filename=os.path.join(os.getcwd(), 'Images', 'Bias_Variance.jpg'))
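The decomposition can also be checked empirically. This hypothetical sketch (synthetic data, made-up settings) refits an inflexible and a flexible model on many fresh training sets and estimates the bias² and variance of $\hat{f}(x_0)$ at a single test point:

```python
import numpy as np

rng = np.random.default_rng(3)

def f(x):
    return np.sin(2 * x)   # made-up "true" function

x0, sigma = 1.0, 0.3       # test point and noise level
n_sims, n_train = 500, 30

results = {}
for deg in (1, 5):         # inflexible vs flexible polynomial model
    preds = []
    for _ in range(n_sims):
        x = rng.uniform(0, 2, size=n_train)
        y = f(x) + rng.normal(scale=sigma, size=n_train)
        preds.append(np.polyval(np.polyfit(x, y, deg), x0))
    preds = np.asarray(preds)
    bias2 = (preds.mean() - f(x0)) ** 2   # squared bias at x0
    var = preds.var()                     # variance across training sets
    results[deg] = (bias2, var)
```

The simple model comes out with high bias² and low variance, the flexible one with low bias² and higher variance, matching the figure above.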
What if the knowledge and data we have are not sufficient to completely determine the correct classifier?
Then we run the risk of hallucinating a classifier: one that simply encodes random quirks in the data.
Causes for this:
Trade-offs between bias and variance:
As mentioned, there are at least two obvious failure modes.
Overfitting (High variance): the model fits noise in the training data; training error is low, but test error is high.
Underfitting (High bias): the model is too simple to capture the underlying structure; both training and test error are high.
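Both failure modes can be seen in one synthetic experiment (everything below is illustrative): fit polynomials of increasing degree to a small noisy sample and compare training error with error on a large held-out test set:

```python
import numpy as np

rng = np.random.default_rng(4)

def f(x):
    return np.sin(3 * x)   # made-up "true" function

sigma, n_train = 0.25, 20
x = rng.uniform(-1, 1, size=n_train)
y = f(x) + rng.normal(scale=sigma, size=n_train)

x_test = rng.uniform(-1, 1, size=5000)
y_test = f(x_test) + rng.normal(scale=sigma, size=x_test.size)

train_err, test_err = {}, {}
for deg in (1, 5, 14):     # underfit, reasonable, overfit
    coefs = np.polyfit(x, y, deg)
    train_err[deg] = np.mean((y - np.polyval(coefs, x)) ** 2)
    test_err[deg] = np.mean((y_test - np.polyval(coefs, x_test)) ** 2)
```

Training error always falls as flexibility grows, but the degree-1 model underfits (high test error) and the degree-14 model overfits (a large gap between training and test error).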
We can use diagnostics to understand why the algorithm is not working.
Learning curve: plot training and validation error as a function of training-set size.
The results are indicative of the error mode:
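A learning curve can be sketched as follows (synthetic data, illustrative model): refit the same model on progressively larger training sets and track both errors. In the high-variance regime the train/validation gap is large at small $n$ and narrows as data is added:

```python
import numpy as np

rng = np.random.default_rng(5)

def f(x):
    return np.sin(3 * x)   # made-up "true" function

sigma = 0.2
x_val = rng.uniform(-1, 1, size=2000)          # fixed validation set
y_val = f(x_val) + rng.normal(scale=sigma, size=x_val.size)

sizes = [10, 20, 40, 80, 160, 320]
train_err, val_err = [], []
for n in sizes:
    x = rng.uniform(-1, 1, size=n)
    y = f(x) + rng.normal(scale=sigma, size=n)
    coefs = np.polyfit(x, y, deg=5)
    train_err.append(np.mean((y - np.polyval(coefs, x)) ** 2))
    val_err.append(np.mean((y_val - np.polyval(coefs, x_val)) ** 2))
```

If the two curves converge to a high plateau instead, the model is biased and more data alone will not help.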
In summary, these are ways to combat high variance (overfitting):
(1) Start with simple learning systems:
(2) Avoid over-theorizing:
(3) As a rule of thumb, a dumb algorithm with lots and lots of data beats a clever one with modest amounts of it.