5 Machine Learning Basics¶

[손고리즘] middle learning - Fundamentals of Machine Learning Algorithms Using Python / Deep Learning part, Chapter 5 [2]
김무성

Contents¶

5.1 Learning Algorithms
5.2 Example: Linear Regression
5.3 Generalization, Capacity, Overfitting and Underfitting
5.4 Hyperparameters and Validation Sets
5.5 Estimators, Bias and Variance
5.6 Maximum Likelihood Estimation
5.7 Bayesian Statistics
5.8 Supervised Learning Algorithms
5.9 Unsupervised Learning Algorithms
5.10 Weakly Supervised Learning
5.11 Building a Machine Learning Algorithm
5.12 The Curse of Dimensionality and Statistical Limitations of Local Generalization

Deep learning is a specific kind of machine learning. In order to understand deep learning well, one must have a solid understanding of the basic principles of machine learning.

5.1 Learning Algorithms¶

5.1.1 The Task, T
5.1.2 The Performance Measure, P
5.1.3 The Experience, E

A machine learning algorithm is an algorithm that is able to learn from data. But what do we mean by learning?¶

A popular definition of learning in the context of computer programs is
- “A computer program is said to learn
  - from experience E
  - with respect to some class of tasks T
  - and performance measure P,
- if its performance at tasks in T, as measured by P, improves with experience E”

5.1.1 The Task, T¶

Classification
Classification with missing inputs
Regression
Transcription
Translation
Structured output
Anomaly detection
Synthesis and sampling
Imputation of missing values
Denoising
Density or probability function estimation

Classification¶

Classification with missing inputs¶

Classification becomes more challenging if the computer program is not guaranteed that every measurement in its input vector will always be provided.
When every input is present, the learning algorithm only has to define a single function mapping from a vector input to a categorical output.
When some of the inputs may be missing, rather than providing a single classification function, the learning algorithm must learn a set of functions, each one classifying x with a different subset of its inputs missing.

Regression¶

Transcription¶

In this type of task, the machine learning system is asked to observe a relatively unstructured representation of some kind of data and transcribe it into discrete, textual form.

Translation¶

In a translation task, the input already consists of a sequence of symbols in some language, and the computer program must convert this into a sequence of symbols in another language.

Structured output¶

Structured output tasks involve any task where the output is a vector containing important relationships between the different elements.

Anomaly detection¶

In this type of task, the computer program sifts through a set of events or objects, and flags some of them as being unusual or atypical.

Synthesis and sampling¶

In this type of task, the machine learning algorithm is asked to generate new examples that are similar to those in the training data.

Imputation of missing values¶

The machine learning algorithm is given a new example $x \in \mathbb{R}^n$, but with some entries $x_i$ of $x$ missing.

The algorithm must provide a prediction of the values of the missing entries.

Denoising¶

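The machine learning algorithm is given as input a corrupted example $\tilde{x} \in \mathbb{R}^n$, obtained by an unknown corruption process from a clean example $x \in \mathbb{R}^n$.

The learner must predict the clean example $x$ from its corrupted version $\tilde{x}$, or more generally predict the conditional probability distribution $p(x \mid \tilde{x})$.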

Density or probability function estimation¶

P_model(x) can be interpreted as a probability density function (if x is continuous) or a probability function (if x is discrete) on the space that the examples were drawn from.

If we have performed density estimation to obtain a probability distribution p(x), we can use that distribution to solve the missing value imputation task.

In practice, density estimation does not always allow us to solve all of these related tasks, because in many cases the required operations on p(x) are computationally intractable.

5.1.2 The Performance Measure, P¶

accuracy
error rate
- 0-1 loss
probability
test set & training set

In order to evaluate the abilities of a machine learning algorithm, we must design a quantitative measure of its performance. Usually this performance measure P is specific to the task T being carried out by the system.

accuracy¶

Accuracy is just the proportion of examples for which the model produces the correct output.

error rate¶

We can also obtain equivalent information by measuring the error rate, the proportion of examples for which the model produces an incorrect output.

0-1 loss¶

We often refer to the error rate as the expected 0-1 loss. The 0-1 loss on a particular example is 0 if it is correctly classified and 1 if it is not.
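The relationship between accuracy, error rate and 0-1 loss can be sketched in a few lines of plain Python. The function names and the label lists below are made up for illustration; they do not come from the original notes.

```python
def zero_one_loss(y_true, y_pred):
    """Per-example 0-1 loss: 0 if the prediction is correct, 1 otherwise."""
    return [0 if t == p else 1 for t, p in zip(y_true, y_pred)]


def error_rate(y_true, y_pred):
    """Average 0-1 loss, i.e. the proportion of incorrect outputs."""
    losses = zero_one_loss(y_true, y_pred)
    return sum(losses) / len(losses)


def accuracy(y_true, y_pred):
    """Proportion of correct outputs; always equals 1 - error rate."""
    return 1.0 - error_rate(y_true, y_pred)


# Hypothetical labels and predictions (2 of the 5 predictions are wrong).
y_true = [0, 1, 1, 0, 2]
y_pred = [0, 1, 0, 0, 1]

print(zero_one_loss(y_true, y_pred))
print(error_rate(y_true, y_pred))
print(accuracy(y_true, y_pred))
```

Accuracy and error rate carry the same information, which is why only one of them usually needs to be reported.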

probability¶

For tasks such as density estimation, we can measure the probability the model assigns to some examples.

test set & training set¶

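We usually care about how well the algorithm performs on data it has not seen before, since this determines how well it will work when deployed. We therefore evaluate these performance measures using a test set of data that is separate from the data used for training the machine learning system.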

5.1.3 The Experience, E¶

Unsupervised learning algorithms
Supervised learning algorithms
Reinforcement learning algorithms
dataset

Most of the learning algorithms in this book can be understood as being allowed to experience an entire dataset.¶

A dataset is a
- collection of many objects called examples,
  - with each example containing many features
    - that have been objectively measured.
  - Sometimes we will also call examples data points.

Unsupervised learning algorithms¶

unsupervised

supervised

Supervised learning algorithms¶

label or target

Unsupervised learning and supervised learning are not formally defined terms.¶

The lines between them are often blurred.
Many machine learning technologies can be used to perform both tasks.

For example, the chain rule of probability states that for a vector $x \in \mathbb{R}^n$, the joint distribution can be decomposed as¶

$$p(x) = \prod_{i=1}^{n} p(x_i \mid x_1, \dots, x_{i-1}).$$

This decomposition means that we can solve the ostensibly unsupervised problem of modeling $p(x)$ by splitting it into $n$ supervised learning problems.

Alternatively, we can solve the supervised learning problem of learning $p(y \mid x)$ by using traditional unsupervised learning technologies to learn the joint distribution $p(x, y)$ and inferring¶

$$p(y \mid x) = \frac{p(x, y)}{\sum_{y'} p(x, y')}.$$
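As a small illustration of inferring a conditional distribution from a joint one, the sketch below normalizes one row of a joint probability table, which is the discrete version of dividing $p(x, y)$ by $\sum_{y'} p(x, y')$. The table values and the function name are hypothetical, chosen only so that the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical joint distribution p(x, y): rows index 3 values of x,
# columns index 2 values of y. The entries sum to 1.
joint = np.array([
    [0.10, 0.20],
    [0.30, 0.05],
    [0.15, 0.20],
])


def conditional_y_given_x(joint, x):
    """Return p(y | x) = p(x, y) / sum over y' of p(x, y')."""
    row = joint[x]
    return row / row.sum()


p_y_given_x0 = conditional_y_given_x(joint, 0)
print(p_y_given_x0)
```

Once the joint table has been learned, the supervised prediction for any x is just a normalized row of the table.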

Reinforcement learning algorithms¶

Some machine learning algorithms do not just experience a ﬁxed dataset.
For example, reinforcement learning algorithms interact with an environment, so there is a feedback loop between the learning system and its experiences.
Such algorithms are beyond the scope of this book.

dataset¶

features
design matrix
heterogeneous data

features¶

Most machine learning algorithms simply experience a dataset. A dataset can be described in many ways. In all cases, a dataset is a collection of examples. Each example is a collection of observations called features, collected from a different time or place.

design matrix¶

One common way of describing a dataset is with a design matrix.
A design matrix is a matrix containing a different example in each row.
Each column of the matrix corresponds to a different feature.

design matrix example¶

For instance, the Iris dataset contains 150 examples with four features for each example.
This means we can represent the dataset with a design matrix $X \in \mathbb{R}^{150 \times 4}$, where $X_{i,1}$ is the sepal length of plant $i$, $X_{i,2}$ is the sepal width of plant $i$, and so on.
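A design matrix maps naturally onto a two-dimensional NumPy array with one row per example and one column per feature. The sketch below uses random placeholder numbers with the same shape as the Iris dataset; they are not real Iris measurements, and the variable names are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

m, n = 150, 4  # 150 examples, 4 features each, matching the shape of Iris

# Placeholder values only: NOT the real Iris measurements.
X = rng.uniform(0.1, 8.0, size=(m, n))

# NumPy indexing is 0-based, so column 0 plays the role of the first
# feature (sepal length) and column 1 the second (sepal width).
i = 2
sepal_length_i = X[i, 0]
sepal_width_i = X[i, 1]

print(X.shape)     # one row per example, one column per feature
print(X[i].shape)  # a single example is a vector of n features
```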

heterogeneous data¶

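A design matrix requires every example to be a vector of the same size, which is not always possible (for example, photographs with different widths and heights contain different numbers of pixels).
Different sections of this book describe how to handle different types of heterogeneous data.
In cases like these, rather than describing the dataset as a matrix with m rows, we will describe it as a set containing m elements, e.g. $\{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$.

This notation does not imply that any two example vectors $x^{(i)}$ and $x^{(j)}$ have the same size.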

Often when working with a dataset containing a design matrix of feature observations $X$, we also provide a vector of labels $y$, with $y_i$ providing the label for example $i$.¶

5.2 Example: Linear Regression¶

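- input: $x \in \mathbb{R}^n$
- output: $y \in \mathbb{R}$
- linear regression: $\hat{y} = w^\top x$
- parameters: $w \in \mathbb{R}^n$
- ith feature: $x_i$
- ith weight: $w_i$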

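performance measure¶

test set¶

- design matrix of inputs: $X^{(\text{test})}$
- regression target vector: $y^{(\text{test})}$
- predictions of the model on the test set: $\hat{y}^{(\text{test})}$
- mean squared error: $\text{MSE}_{\text{test}} = \frac{1}{m} \sum_i \left( \hat{y}^{(\text{test})} - y^{(\text{test})} \right)_i^2$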

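Intuitively, one can see that this error measure decreases to 0 when $\hat{y}^{(\text{test})} = y^{(\text{test})}$.

We can also see that

$$\text{MSE}_{\text{test}} = \frac{1}{m} \left\| \hat{y}^{(\text{test})} - y^{(\text{test})} \right\|_2^2,$$

so the error increases whenever the Euclidean distance between the predictions and the targets increases.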

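To make a machine learning algorithm, we need to design an algorithm that will improve the weights $w$ in a way that reduces $\text{MSE}_{\text{test}}$ when the algorithm is allowed to gain experience by observing a training set $(X^{(\text{train})}, y^{(\text{train})})$.

One intuitive way of doing this is to minimize the mean squared error on the training set, $\text{MSE}_{\text{train}}$.

To minimize $\text{MSE}_{\text{train}}$, we can simply solve for where its gradient is 0:

$$\nabla_w \text{MSE}_{\text{train}} = 0 \;\Rightarrow\; w = \left( X^{(\text{train})\top} X^{(\text{train})} \right)^{-1} X^{(\text{train})\top} y^{(\text{train})}$$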
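Setting the gradient of the training mean squared error to zero gives the normal equations, $X^\top X w = X^\top y$. A minimal NumPy sketch follows, assuming $X^\top X$ is invertible. The synthetic data, `true_w`, and the function names are made up for illustration.

```python
import numpy as np


def fit_linear_regression(X, y):
    """Solve the normal equations X^T X w = X^T y for the weights w."""
    return np.linalg.solve(X.T @ X, X.T @ y)


def mse(X, y, w):
    """Mean squared error of the predictions X @ w against the targets y."""
    residual = X @ w - y
    return float(np.mean(residual ** 2))


rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 3))       # 50 examples, 3 features
true_w = np.array([2.0, -1.0, 0.5])      # hypothetical ground-truth weights
y_train = X_train @ true_w               # noise-free targets

w = fit_linear_regression(X_train, y_train)
print(w)
print(mse(X_train, y_train, w))
```

`np.linalg.solve` is used instead of forming the matrix inverse explicitly; both give the same solution here, but solving the linear system is the more numerically careful choice.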

It’s worth noting that the term linear regression is often used to refer to a slightly more sophisticated model with one additional parameter, an intercept term $b$.

In this model

$$\hat{y} = w^\top x + b,$$

so the mapping from parameters to predictions is still a linear function but the mapping from features to predictions is now an affine function.
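One common way to fit the intercept with the same machinery is to append a constant feature equal to 1 to every example, so that its weight plays the role of the intercept. The sketch below assumes this trick; the helper name and the toy data are hypothetical.

```python
import numpy as np


def add_bias_column(X):
    """Append a column of ones so the last weight acts as the intercept b."""
    return np.hstack([X, np.ones((X.shape[0], 1))])


# Toy data generated from y = 3x + 1 (made up for illustration).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 3.0 * X[:, 0] + 1.0

X_aug = add_bias_column(X)
# Least-squares solution of X_aug @ w_aug = y.
w_aug, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
w, b = w_aug[:-1], w_aug[-1]

print(w, b)
```

The model is still linear in the augmented parameter vector, which is why the fitting procedure does not need to change.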

Linear regression is of course an extremely simple and limited learning algorithm, but it provides an example of how a learning algorithm can work.

5.3 Generalization, Capacity, Overfitting and Underfitting¶

5.3.1 The No Free Lunch Theorem
5.3.2 Regularization

The central challenge in machine learning is that we must perform well on new, previously unseen inputs, not just those on which our model was trained. The ability to do so is called generalization.

generalization error
training error

training set
test set

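In our linear regression example, we trained the model by minimizing the training error,

$$\frac{1}{m^{(\text{train})}} \left\| X^{(\text{train})} w - y^{(\text{train})} \right\|_2^2,$$

but we actually care about the test error,

$$\frac{1}{m^{(\text{test})}} \left\| X^{(\text{test})} w - y^{(\text{test})} \right\|_2^2.$$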

How can we affect performance on the test set when we only get to observe the training set?
- The field of statistical learning theory provides some answers.
- If the training and the test set are collected arbitrarily, there is indeed little we can do.
- If we are allowed to make some assumptions about how the training and test set are collected, then we can make some progress.

some assumptions
- i.i.d. assumptions.
  - independent
  - identically distributed
    - data generating distribution, or data generating process

The factors determining how well a machine learning algorithm will perform are its ability to:
1. Make the training error small.
2. Make the gap between training and test error small.

These two factors correspond to the two central challenges in machine learning:
- underfitting (the model cannot obtain a sufficiently low training error) and
- overfitting (the gap between training and test error is too large).

capacity
- We can control whether a model is more likely to overfit or underfit by altering its capacity.
  - Informally, a model’s capacity is its ability to fit a wide variety of functions.
  - Models with low capacity may struggle to fit the training set.
  - Models with high capacity can overfit, i.e., memorize properties of the training set that do not serve them well on the test set.

hypothesis space
- One way to control the capacity of a learning algorithm is by choosing its hypothesis space, the set of functions that the learning algorithm is allowed to choose as being the solution.

A polynomial of degree one gives us the linear regression model with which we are already familiar, with prediction

$$\hat{y} = b + wx.$$

By introducing $x^2$ as another feature provided to the linear regression model, we can learn a model that is quadratic as a function of $x$:

$$\hat{y} = b + w_1 x + w_2 x^2.$$

Note that this is still a linear function of the parameters, so we can still use the normal equations to train the model in closed form.
We can continue to add more powers of x as additional features, for example to obtain a polynomial of degree 9:

$$\hat{y} = b + \sum_{i=1}^{9} w_i x^i.$$
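The effect of adding powers of x as features can be sketched by fitting polynomials of degree 1, 2 and 9 to the same small synthetic dataset and comparing their training errors. The data, the noise level and the function names are assumptions made for this example. `np.linalg.lstsq` is used here in place of explicitly forming the normal equations; it solves the same least-squares problem in a more numerically stable way.

```python
import numpy as np


def polynomial_features(x, degree):
    """Design matrix with columns x^0, x^1, ..., x^degree (x^0 is the intercept)."""
    return np.vander(x, degree + 1, increasing=True)


def fit_polynomial(x, y, degree):
    """Least-squares weights for a polynomial of the given degree."""
    Phi = polynomial_features(x, degree)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w


def train_mse(x, y, degree):
    """Training mean squared error of the fitted polynomial."""
    w = fit_polynomial(x, y, degree)
    return float(np.mean((polynomial_features(x, degree) @ w - y) ** 2))


rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 10)                  # 10 training inputs
y = x ** 2 + 0.05 * rng.normal(size=x.shape)    # synthetic quadratic data plus noise

errors = {degree: train_mse(x, y, degree) for degree in (1, 2, 9)}
for degree, error in errors.items():
    print(degree, error)
```

The training error can only go down as the degree grows, because each hypothesis space contains the previous one. With 10 points, the degree-9 polynomial can pass through every training point, which says nothing about how well it behaves between them.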

Machine learning algorithms will generally perform best when their capacity is appropriate in regard to the true complexity of the task they need to perform and the amount of training data they are provided with.

Occam’s razor

We must remember that while simpler functions are more likely to generalize (to have a small gap between training and test error), we must still choose a sufficiently complex hypothesis to achieve low training error.

non-parametric model
- To reach the most extreme case of arbitrarily high capacity, we introduce the concept of non-parametric models. So far, we have seen only parametric models, such as linear regression.
- Parametric models learn a function described by a parameter vector whose size is finite and fixed before any data is observed.
- Non-parametric models have no such limitation.

k-NN

nearest neighbor regression (k-NN regression with k = 1)
- Unlike linear regression, which has a fixed-length vector of weights, the nearest neighbor regression model simply stores the X and y from the training set.
- When asked to predict the target for a test point x, the model looks up the nearest entry in the training set and returns the associated regression target.
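Nearest neighbor regression is short enough to sketch directly. The class below stores the training set and, at prediction time, returns the target of the closest stored example under Euclidean distance. The class name and the toy data are hypothetical, and other distance metrics would work as well.

```python
import numpy as np


class NearestNeighborRegressor:
    """Stores the training set; predicts the target of the closest example."""

    def fit(self, X, y):
        # "Training" only memorizes the data: there is no weight vector.
        self.X = np.asarray(X, dtype=float)
        self.y = np.asarray(y, dtype=float)
        return self

    def predict(self, x):
        # Euclidean distance from x to every stored training example.
        distances = np.linalg.norm(self.X - np.asarray(x, dtype=float), axis=1)
        return self.y[np.argmin(distances)]


# Toy training set, made up for illustration.
X_train = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
y_train = [10.0, 20.0, 30.0]

model = NearestNeighborRegressor().fit(X_train, y_train)
prediction = model.predict([0.9, 1.2])  # closest stored point is [1.0, 1.0]
print(prediction)
```

Because the model reproduces the stored target for any training input, its training error is zero whenever the training inputs are distinct, which illustrates how high the capacity of a non-parametric model can be.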

Bayes error
- The error incurred by an oracle making predictions from the true distribution p(x, y) is called the Bayes error.
- Reference - http://newsight.tistory.com/127

representational capacity & effective capacity
- It’s worth mentioning that capacity is not just determined by which model we use.
- The model specifies which family of functions the learning algorithm can choose from when varying the parameters in order to reduce a training objective.
- This is called the representational capacity of the model.
- In many cases, finding the best function within this family is a very difficult optimization problem.
- In practice, the learning algorithm does not actually find the best function, just one that significantly reduces the training error.
- These additional limitations, such as the imperfection of the optimization algorithm, mean that the model’s effective capacity may be less than its representational capacity.

5.3.1 The No Free Lunch Theorem¶

The no free lunch theorem for machine learning (Wolpert, 1996) states that, averaged over all possible data generating distributions, every classification algorithm has the same error rate when classifying previously unobserved points. In other words, in some sense, no machine learning algorithm is universally any better than any other. The most sophisticated algorithm we can conceive of has the same average performance (over all possible tasks) as merely predicting that every point belongs to the same class.

Fortunately, these results hold only when we average over all possible data generating distributions. If we make assumptions about the kinds of probability distributions we encounter in real-world applications, then we can design learning algorithms that perform well on these distributions.

5.3.2 Regularization¶

preference¶

The no free lunch theorem implies that we must design our machine learning algorithms to perform well on a specific task.
We do so by building a set of preferences into the learning algorithm.
When these preferences are aligned with the learning problems we ask the algorithm to solve, it performs better.

hypothesis space of solutions¶

So far, the only method of modifying a learning algorithm we have discussed is to increase or decrease the model’s capacity by adding or removing functions from the hypothesis space of solutions the learning algorithm is able to choose.

functions¶

The behavior of our algorithm is strongly affected not just by how large we make the set of functions allowed in its hypothesis space, but by the specific identity of those functions.
- linear functions
  - The learning algorithm we have studied so far, linear regression, has a hypothesis space consisting of the set of linear functions of its input.
- nonlinear functions

weight decay¶

For example, we can modify the training criterion for linear regression to include weight decay.
To perform linear regression with weight decay, we minimize not the mean squared error on the training set alone, but a criterion $J(w)$ that expresses a preference for the weights to have smaller squared $L^2$ norm.

Specifically,

$$J(w) = \text{MSE}_{\text{train}} + \lambda w^\top w,$$

where $\lambda$ is a value chosen ahead of time that controls the strength of our preference for smaller weights.

When $\lambda = 0$, we impose no preference, and larger $\lambda$ forces the weights to become smaller.

Minimizing $J(w)$ results in a choice of weights that make a tradeoff between fitting the training data and being small.
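The weight decay criterion also has a closed-form minimizer. Assuming the training error is written as $\frac{1}{m}\|Xw - y\|_2^2$, setting the gradient of $\frac{1}{m}\|Xw - y\|_2^2 + \lambda w^\top w$ to zero gives $(X^\top X + m\lambda I)\,w = X^\top y$. The sketch below solves that system for a few values of $\lambda$ on synthetic data; the function name, the data, and the chosen values of $\lambda$ are all made up for illustration.

```python
import numpy as np


def fit_weight_decay(X, y, lam):
    """Minimize (1/m) * ||X w - y||^2 + lam * w^T w in closed form."""
    m, n = X.shape
    # Zero gradient: (X^T X + m * lam * I) w = X^T y.
    return np.linalg.solve(X.T @ X + m * lam * np.eye(n), X.T @ y)


rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))                            # 30 examples, 5 features
generating_w = np.array([1.0, -2.0, 0.0, 0.5, 3.0])     # hypothetical weights
y = X @ generating_w + 0.1 * rng.normal(size=30)

# Squared L2 norm of the learned weights for increasing lambda.
squared_norms = {
    lam: float(np.sum(fit_weight_decay(X, y, lam) ** 2))
    for lam in (0.0, 0.1, 10.0)
}
for lam, value in squared_norms.items():
    print(lam, value)
```

With $\lambda = 0$ the solution coincides with ordinary least squares, and the squared norm of the weights shrinks as $\lambda$ grows.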

regularization¶

In our weight decay example, we expressed our preference for linear functions defined with smaller weights explicitly, via an extra term in the criterion we minimize.
There are many other ways of expressing preferences for different solutions, both implicitly and explicitly.
Together, these different approaches are known as regularization.
Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.

5.4 Hyperparameters and Validation Sets¶

5.4.1 Cross-Validation

hyperparameters¶

Most machine learning algorithms have several settings that we can use to control the behavior of the learning algorithm.
These settings are called hyperparameters.
The values of hyperparameters are not adapted by the learning algorithm itself (though we can design a nested learning procedure where one learning algorithm learns the best hyperparameters for another learning algorithm).
In the polynomial regression example we saw in Fig. 5.2,
- there is a single hyperparameter: the degree of the polynomial, which acts as a capacity hyperparameter.
- The λ value used to control the strength of weight decay is another example of a hyperparameter.

validation set¶

More frequently, we do not learn the hyperparameter because it is not appropriate to learn that hyperparameter on the training set. This applies to all hyperparameters that control model capacity.
If learned on the training set, such hyperparameters would always choose the maximum possible model capacity, resulting in overfitting (refer to Figure 5.3).
To solve this problem, we need a validation set of examples that the training algorithm does not observe.
Earlier we discussed how a held-out test set, composed of examples coming from the same distribution as the training set, can be used to estimate the generalization error of a learner, after the learning process has completed.
It is important that the test examples are not used in any way to make choices about the model, including its hyperparameters.
For this reason, no example from the test set can be used in the validation set, and we always construct the validation set from the training data.
Specifically, we split the training data into two disjoint subsets: one is used to learn the parameters, and the other is the validation set, used to estimate the generalization error so that the hyperparameters can be chosen accordingly.
Typically, one uses about 80% of the data for training and 20% for validation.
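The following sketch shows one way an 80/20 split can be used to choose a capacity hyperparameter, here the degree of a polynomial regression model. The data is synthetic and the function names, the noise level and the candidate degrees are assumptions made for this example. The test set is deliberately not involved at any point.

```python
import numpy as np


def train_validation_split(x, y, validation_fraction=0.2, seed=0):
    """Shuffle the indices, then cut them into two disjoint subsets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    n_val = int(round(validation_fraction * len(x)))
    val_idx, train_idx = order[:n_val], order[n_val:]
    return x[train_idx], y[train_idx], x[val_idx], y[val_idx]


def validation_mse(x_tr, y_tr, x_val, y_val, degree):
    """Fit a polynomial on the training subset, score it on the validation subset."""
    w, *_ = np.linalg.lstsq(np.vander(x_tr, degree + 1), y_tr, rcond=None)
    return float(np.mean((np.vander(x_val, degree + 1) @ w - y_val) ** 2))


rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=100)
y = x ** 2 + 0.1 * rng.normal(size=100)   # synthetic quadratic data plus noise

x_tr, y_tr, x_val, y_val = train_validation_split(x, y)

degrees = [1, 2, 3, 5, 9]                 # candidate hyperparameter values
val_errors = {d: validation_mse(x_tr, y_tr, x_val, y_val, d) for d in degrees}
best_degree = min(val_errors, key=val_errors.get)
print(best_degree, val_errors[best_degree])
```

Choosing the degree by training error instead would always favor the largest degree, which is exactly the failure mode the validation set is meant to prevent.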

5.4.1 Cross-Validation¶

One issue with the idea of splitting the dataset into train/test or train/validation/test subsets is that only a small fraction of examples are used to evaluate generalization. Alternative procedures use all of the examples to estimate the mean test error, at the price of increased computational cost.

train/test
- These procedures are based on the idea of repeating the training/testing computation on different randomly chosen subsets or splits of the original dataset.
- The most common is k-fold cross-validation: the dataset is partitioned into k non-overlapping subsets, and on trial i the i-th subset is used as the test set while the rest is used as the training set. The test error is estimated by averaging across the k trials.

train/validation/test
- If model selection or hyperparameter optimization is required, things get more computationally expensive: one can recurse the k-fold cross-validation idea inside the training set.
- So we can have an outer loop that estimates test error and provides a “training set” for a hyperparameter-free learner, calling it k times to “train”.
- That hyperparameter-free learner can then split its received training set by k-fold cross-validation into internal training/validation subsets (for example, splitting into k − 1 subsets is convenient, to reuse the same test blocks as the outer loop),
  - call a hyperparameter-specific learner for each choice of hyperparameter value on each of the training partitions of this inner loop,
  - and compute the validation error by averaging, across the k − 1 validation sets, the errors made by the k − 1 hyperparameter-specific learners trained on each of the internal training subsets.
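A minimal sketch of plain (non-nested) k-fold cross-validation is given below: the examples are shuffled, cut into k non-overlapping folds, and each fold serves once as the test set while the remaining folds form the training set. The helper names, the `fit`/`predict` callables and the synthetic data are assumptions made for this example. The nested version described above would call the same routine again inside each training set.

```python
import numpy as np


def k_fold_indices(m, k, seed=0):
    """Split the shuffled indices 0..m-1 into k non-overlapping folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(m), k)


def cross_validated_mse(X, y, k, fit, predict):
    """Average test MSE over k train/test splits; also return per-fold errors."""
    folds = k_fold_indices(len(X), k)
    errors = []
    for i, test_idx in enumerate(folds):
        # Every fold except the i-th one forms the training set.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = fit(X[train_idx], y[train_idx])
        residual = predict(model, X[test_idx]) - y[test_idx]
        errors.append(float(np.mean(residual ** 2)))
    return float(np.mean(errors)), errors


def fit_least_squares(X, y):
    """Linear regression weights via least squares."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w


def predict_linear(w, X):
    return X @ w


rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, 2.0, 3.0])   # noise-free synthetic targets

mean_error, fold_errors = cross_validated_mse(
    X, y, k=5, fit=fit_least_squares, predict=predict_linear
)
print(mean_error, len(fold_errors))
```

Every example is used for testing exactly once, at the cost of training the model k times.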