These are notes I wrote for myself while studying for PhD qualifying exams at Stanford.
I wanted to publish them so that others could benefit.
They draw from several sources:
Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed.
It does this through induction:
Learners turn a small amount of input knowledge into a large amount of output knowledge.
The input knowledge for induction is captured in a training set of data.
The goal of machine learning is to generalize beyond the examples in the training set.
To generalize beyond the training set, the learner must embody knowledge or assumptions that go beyond the data it is given.
(1) Supervised learning: learn a function $h : x \mapsto y$ so that $h(x)$ is a good predictor of $y$.
(2) Unsupervised learning: let the computer determine structure and patterns in data:
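As a minimal sketch of the supervised setting, here is a learned predictor $h$ fit by least squares on entirely synthetic data (the linear form and all numbers are illustrative):

```python
import numpy as np

# Synthetic training set: y is roughly 2x + 1 plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=50)

# "Learn" h by least squares so that h(x) is a good predictor of y.
w, b = np.polyfit(x, y, deg=1)

def h(x_new):
    return w * x_new + b
```

The fitted coefficients recover the generating relationship, so $h$ predicts well on inputs it has not seen.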
Representation: The learner must be expressed in a formal language that the computer can handle. For a classifier, this is the hypothesis space it can represent, e.g. decision trees, rule sets, or linear models.
Evaluation function: We use this to distinguish good learners from bad ones.
Optimization (learning strategy): The method we use to search for the highest-scoring learner.
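To make the three components concrete, here is a hypothetical end-to-end sketch on synthetic data: the representation is a linear function, the evaluation function is mean squared error, and the optimization is plain gradient descent (all names and settings are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=100)
y = 3.0 * x - 0.5 + rng.normal(scale=0.1, size=100)

# Representation: linear functions h(x) = w*x + b.
def predict(w, b, x):
    return w * x + b

# Evaluation function: mean squared error on the training set.
def mse(w, b):
    return np.mean((y - predict(w, b, x)) ** 2)

# Optimization: gradient descent searches for the highest-scoring
# (here: lowest-MSE) learner in the hypothesis space.
w, b = 0.0, 0.0
lr = 0.1
for _ in range(1000):
    err = predict(w, b, x) - y
    w -= lr * 2.0 * np.mean(err * x)
    b -= lr * 2.0 * np.mean(err)
```

Swapping any one component (say, trees for the representation, or log-loss for the evaluation) yields a different learning algorithm while the other two pieces stay the same.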
For example, let's consider a practical case.
We start with training data: a set of predictor values ($X^1_1$ to $X^1_n$) for each target $Y^1$.
There exists some $f$ that captures the relationship between $Y$ and $X$.
Machine learning refers to a set of approaches for estimating $f$.
$Y = f(X) + \epsilon$
$f$: the fixed but unknown function capturing the systematic relationship between $X$ and $Y$.
$Y$: the response (target) we want to predict.
$\epsilon$: a random error term, independent of $X$ with mean zero; it puts a floor on how well we can ever predict.
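A small simulation (with a made-up $f$ and synthetic data) shows why $\epsilon$ makes part of the error irreducible: even predicting with the true $f$, the expected squared error cannot fall below $\mathrm{Var}(\epsilon)$:

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x):
    # A made-up "true" relationship; in practice f is unknown.
    return np.sin(x)

x = rng.uniform(0, 2 * np.pi, size=100_000)
eps = rng.normal(scale=0.3, size=x.size)   # Var(epsilon) = 0.09
y = f(x) + eps

# Even the perfect predictor f hits the noise floor Var(epsilon).
err_true_f = np.mean((y - f(x)) ** 2)
```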
Our choice for $f$ is informed by:
Above, we mentioned that we use an evaluation function to distinguish good learners from bad ones.
For example, the expected test mean squared error decomposes as: $E\left[(y_0 - \hat{f}(x_0))^2\right] = \left[\mathrm{Bias}(\hat{f}(x_0))\right]^2 + \mathrm{Var}(\hat{f}(x_0)) + \mathrm{Var}(\epsilon)$
Note that it is composed of:
import os
from IPython.display import Image

# Bias-variance trade-off figure, from the ISL text.
Image(filename=os.path.join(os.getcwd(), 'Images', 'Bias_Variance.jpg'))
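The decomposition can also be checked empirically. This hypothetical sketch (synthetic data, made-up settings) refits an inflexible and a flexible model on many fresh training sets and estimates the bias² and variance of $\hat{f}(x_0)$ at a single test point:

```python
import numpy as np

rng = np.random.default_rng(3)

def f(x):
    return np.sin(2 * x)   # made-up "true" function

x0, sigma = 1.0, 0.3       # test point and noise level
n_sims, n_train = 500, 30

results = {}
for deg in (1, 5):         # inflexible vs flexible polynomial model
    preds = []
    for _ in range(n_sims):
        x = rng.uniform(0, 2, size=n_train)
        y = f(x) + rng.normal(scale=sigma, size=n_train)
        preds.append(np.polyval(np.polyfit(x, y, deg), x0))
    preds = np.asarray(preds)
    bias2 = (preds.mean() - f(x0)) ** 2   # squared bias at x0
    var = preds.var()                     # variance across training sets
    results[deg] = (bias2, var)
```

The simple model comes out with high bias² and low variance, the flexible one with low bias² and higher variance, matching the figure above.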
What if the knowledge and data we have are not sufficient to completely determine the correct classifier?
Then we run the risk of hallucinating a classifier: one that simply encodes random quirks in the data.
Causes for this:
Trade-offs between bias and variance:
As mentioned, there are at least two obvious failure modes.
Overfitting (High variance): the model fits noise in the training data; training error is low, but test error is high.
Underfitting (High bias): the model is too simple to capture the underlying structure; both training and test error are high.
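Both failure modes can be seen in one synthetic experiment (everything below is illustrative): fit polynomials of increasing degree to a small noisy sample and compare training error with error on a large held-out test set:

```python
import numpy as np

rng = np.random.default_rng(4)

def f(x):
    return np.sin(3 * x)   # made-up "true" function

sigma, n_train = 0.25, 20
x = rng.uniform(-1, 1, size=n_train)
y = f(x) + rng.normal(scale=sigma, size=n_train)

x_test = rng.uniform(-1, 1, size=5000)
y_test = f(x_test) + rng.normal(scale=sigma, size=x_test.size)

train_err, test_err = {}, {}
for deg in (1, 5, 14):     # underfit, reasonable, overfit
    coefs = np.polyfit(x, y, deg)
    train_err[deg] = np.mean((y - np.polyval(coefs, x)) ** 2)
    test_err[deg] = np.mean((y_test - np.polyval(coefs, x_test)) ** 2)
```

Training error always falls as flexibility grows, but the degree-1 model underfits (high test error) and the degree-14 model overfits (a large gap between training and test error).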
We can use diagnostics to understand why the algorithm is not working.
Learning curve: plot training and validation error as a function of training-set size.
The results are indicative of the error mode:
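A learning curve can be sketched as follows (synthetic data, illustrative model): refit the same model on progressively larger training sets and track both errors. In the high-variance regime the train/validation gap is large at small $n$ and narrows as data is added:

```python
import numpy as np

rng = np.random.default_rng(5)

def f(x):
    return np.sin(3 * x)   # made-up "true" function

sigma = 0.2
x_val = rng.uniform(-1, 1, size=2000)          # fixed validation set
y_val = f(x_val) + rng.normal(scale=sigma, size=x_val.size)

sizes = [10, 20, 40, 80, 160, 320]
train_err, val_err = [], []
for n in sizes:
    x = rng.uniform(-1, 1, size=n)
    y = f(x) + rng.normal(scale=sigma, size=n)
    coefs = np.polyfit(x, y, deg=5)
    train_err.append(np.mean((y - np.polyval(coefs, x)) ** 2))
    val_err.append(np.mean((y_val - np.polyval(coefs, x_val)) ** 2))
```

If the two curves converge to a high plateau instead, the model is biased and more data alone will not help.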
In summary, these are ways to combat high variance (overfitting):
(1) Start with simple learning systems:
(2) Avoid over-theorizing:
(3) As a rule of thumb, a dumb algorithm with lots and lots of data beats a clever one with modest amounts of it.