- field of study that gives computers the ability to learn without being explicitly programmed : Arthur Samuel, 1959
- a computer program is said to learn from experience E with respect to some task T and some performance measure P if its performance on T, as measured by P, improves with E : Tom Mitchell, 1998
- Regression - algorithms that predict continuous outputs
- Classification - algorithms that predict discrete outputs

**Unsupervised Learning** : problems where the algorithm is given a data set without any “right answers”. The objective is to find some underlying structure in the data, e.g. clustering

**Reinforcement Learning** : problems where a sequence of decisions is made, as opposed to a single decision (or prediction)

**Learning Theory** : study of how and why (mathematically) a learning algorithm works


If $X$ is a continuous RV, then the probability distribution gives way to a **probability density function**, $p_X$, and probabilities are computed by integration:

$P(X\in (a,b)) = \int_a^b p_X(x)dx$

From this definition, one sees that $P(X=a)=\int_a^a p_X(x)dx = 0$ for any single point $a$.
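As a quick numeric illustration (using a standard normal density as an assumed example, not one from the text), the interval probability can be approximated by integrating $p_X$ with a simple trapezoid rule and compared against the closed form given by the error function:

```python
import math

import numpy as np

# Assumed example: X is a standard normal RV with density p_X.
def p_X(x):
    return np.exp(-x**2 / 2) / math.sqrt(2 * math.pi)

# P(X in (a, b)) as a numerical integral of the density (trapezoid rule).
a, b = -1.0, 1.0
x = np.linspace(a, b, 10_001)
y = p_X(x)
prob = float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

# Closed form via the error function, for comparison.
exact = 0.5 * (math.erf(b / math.sqrt(2)) - math.erf(a / math.sqrt(2)))
print(prob, exact)  # both ≈ 0.6827; shrinking (a, b) to a single point gives 0
```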

where $E_{\Phi}[X]$ is the expectation of $X$ under the distribution $\Phi$.

The following definition of a **compound distribution** will also be useful. Let $t$ be an RV with distribution $F$ parameterized by $\mathbf{w}$, and let $\mathbf{w}$ be an RV distributed by $G$
parameterized by $\mathbf{t}$. Then the compound distribution $H$, parameterized by $\mathbf{t}$, for the random variable $t$ is defined by:

$p_H(t|\mathbf{t}) = \int_{\mathbf{w}} P_F(t|\mathbf{w}) P_G(\mathbf{w}|\mathbf{t})d\mathbf{w}$
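A numeric sanity check of this definition, under the assumption that both $F$ and $G$ are Gaussian (in which case $H$ is known in closed form to be Gaussian with the variances added), might look like:

```python
import math

import numpy as np

# Assumption: F = N(t | w, s_f^2) and G = N(w | m, s_g^2). The compound
# distribution H is then Gaussian in closed form: N(t | m, s_f^2 + s_g^2).
def normal_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma**2)) / (math.sqrt(2 * math.pi) * sigma)

m, s_f, s_g = 0.5, 1.0, 2.0
t = 1.3

# Numerically integrate p_F(t | w) * p_G(w | m) over w (trapezoid rule).
w = np.linspace(m - 10 * s_g, m + 10 * s_g, 20_001)
integrand = normal_pdf(t, w, s_f) * normal_pdf(w, m, s_g)
p_H = float(np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(w)))

closed_form = normal_pdf(t, m, math.sqrt(s_f**2 + s_g**2))
print(p_H, closed_form)  # the two values agree
```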

**Bayes' Theorem** : for random variables $X$ and $Y$,

$p(Y|X) = \frac{p\left(X|Y\right)p\left(Y\right)}{p\left(X\right)}$

Bayes' theorem will appear repeatedly in the discussion of machine learning. Not surprisingly, Bayes' theorem plays a fundamental role in Bayesian modeling. Assume that we model some process where the model has free parameters contained in the vector $\mathbf{w}$. Now assume that we have some notion of the probability distribution of these parameters, $p(\mathbf{w})$, called the *prior*. Given a set of observed data, $D$, Bayes' theorem yields the *posterior* distribution:

$p(\mathbf{w}|D) = \frac{p(D|\mathbf{w})p(\mathbf{w})}{p(D)}$

In order to apply a fully Bayesian approach, we must formulate models for both the *prior*, $p(\mathbf{w})$, and the *likelihood function*, $p(D|\mathbf{w})$. Given
these models and a set of data we can compute appropriate values for our free parameter vector $\mathbf{w}$ by maximizing
$p(\mathbf{w}|D) \propto p(D|\mathbf{w})p(\mathbf{w})$. How does this differ from frequentist modeling?
The frequentist approach, or *maximum likelihood* approach, ignores the formulation of a *prior*, and goes directly to maximizing the likelihood function
to find the model parameters. Thus, the frequentist approach can be described as *maximizing the probability of the data given the parameters*. Under certain
conditions the results of Bayesian and frequentist modeling will coincide, but this is not true in general.
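The contrast can be sketched with a coin-flip model, where the Bayesian point estimate has a simple closed form. All numbers and the Beta prior below are illustrative assumptions, not taken from the text:

```python
# Illustrative coin-flip model (all numbers assumed): Bernoulli likelihood with
# a Beta(a, b) prior on the heads probability theta.
heads, tails = 7, 3

def map_estimate(a, b):
    # Bayesian point estimate: mode of the Beta(heads + a, tails + b) posterior,
    # i.e. the maximizer of p(theta | D) proportional to p(D | theta) p(theta).
    return (heads + a - 1) / (heads + tails + a + b - 2)

# Frequentist (maximum likelihood) estimate: maximize p(D | theta) alone.
mle = heads / (heads + tails)

print(map_estimate(2, 2))       # prior pulls the estimate toward 0.5
print(map_estimate(1, 1), mle)  # flat prior: the two approaches coincide
```

With a flat prior the posterior is proportional to the likelihood, which is one of the "certain conditions" under which the two approaches agree.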

One could obtain a point estimate for $\mathbf{w}$ by maximizing the *posterior probability* model, but this is not typical. Instead, a *predictive distribution* of the value of the target
variable, $t$, is formed based on the compound distribution definition provided above. Taking the mean of this distribution provides a point estimate of $t$, while the distribution itself provides
a measure of the uncertainty in the estimate, say by considering the standard deviation.
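As a sketch of this idea, assume a Bernoulli model for coin flips with a Beta prior (an assumed setup chosen because the compound distribution is available in closed form): the predictive probability of the next outcome is the posterior mean, and the posterior standard deviation quantifies the uncertainty.

```python
from math import sqrt

# Assumed Beta-Bernoulli sketch: after observing h heads and t tails with a
# Beta(a, b) prior, the posterior over theta is Beta(h + a, t + b). Integrating
# the Bernoulli likelihood against this posterior (the compound-distribution
# definition above) gives the predictive probability of the next flip.
h, t, a, b = 7, 3, 2, 2
A, B = h + a, t + b

pred_heads = A / (A + B)  # predictive mean: point estimate for the next outcome
post_std = sqrt(A * B / ((A + B) ** 2 * (A + B + 1)))  # uncertainty about theta

print(pred_heads, post_std)
```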

**TODO:** Add a simple example illustrating the difference. For now, a good illustration is available here

We assume we have specified a probability density model, $p_{\mathbf{w}}(d)$ for the observed data elements, ${d \in D}$ that is parameterized by $\mathbf{w}$, i.e. $p$ is a parametric model for the distribution of $D$. As
an example, if $D$ has a normal distribution with mean $\mu$ and variance $\sigma^2$, then

$\mathbf{w} = (\mu, \sigma^2)$

and

$p_{\mathbf{w}}(d) = \frac{1}{\sqrt{2 \pi} \sigma} e^{-(d-\mu)^2/2\sigma^2}$
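This density can be coded directly; the sketch below (with assumed parameter values) checks that it integrates to one, as any probability density must:

```python
import math

import numpy as np

# The normal model p_w(d) with w = (mu, sigma^2), coded directly.
def p_w(d, mu, sigma2):
    return np.exp(-(d - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

# Sanity check with assumed parameters: the density integrates to 1.
mu, sigma2 = 1.0, 4.0
d = np.linspace(mu - 10 * math.sqrt(sigma2), mu + 10 * math.sqrt(sigma2), 20_001)
y = p_w(d, mu, sigma2)
area = float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(d)))
print(area)  # ≈ 1.0
```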

The likelihood function, regardless of our choice of model $p$, is defined by

$L(\mathbf{w}; D) = \prod_{i=1}^N p_{\mathbf{w}}(d_i)$

where $N$ is the number of elements in $D$. Thus the likelihood function is simply the product of the probabilities of the individual data points, $d_i \in D$, under the probability model, $p_{\mathbf{w}}$. Note that this
definition implicitly assumes these data points are independent events.
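A minimal sketch of this definition, assuming the normal model and a small made-up data set:

```python
import math

import numpy as np

# Assumed normal model and a small made-up data set D.
def p_w(d, mu, sigma2):
    return np.exp(-(d - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

D = np.array([0.8, 1.1, 0.95, 1.3])
mu, sigma2 = 1.0, 0.04

# L(w; D): the product of the per-point densities (independence assumed).
L = float(np.prod(p_w(D, mu, sigma2)))

# The same quantity via a sum of logs; for large N the raw product can
# underflow, which is one practical reason to prefer the log-likelihood.
log_L = float(np.sum(np.log(p_w(D, mu, sigma2))))
print(L, math.exp(log_L))  # identical up to rounding
```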

Out of mathematical convenience, we will most often work with the *log-likelihood* function (which turns the product into a sum by properties of the log function), i.e. the logarithm of $L(\mathbf{w}; D)$, defined as

$l(\mathbf{w};D) = \sum_{i=1}^N l(\mathbf{w};d_i) = \sum_{i=1}^N \log p_{\mathbf{w}}(d_i)$

where we recall that $\log(ab) = \log(a) + \log(b)$.

The method of maximum likelihood chooses the value $\mathbf{w} = \widehat{\mathbf{w}}$ that maximizes the *log-likelihood* function. We will also often work with an **error function**, $E(\mathbf{w})$, defined as the
negative of the log-likelihood function

$E(\mathbf{w}) = -l(\mathbf{w};D)$

where we note that $-\log(a) = \log(1/a)$, so maximizing the log-likelihood is equivalent to minimizing the error function.
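For the normal model, minimizing $E(\mathbf{w})$ has a well-known closed-form solution: $\widehat{\mu}$ is the sample mean and $\widehat{\sigma}^2$ the (biased) sample variance. The sketch below, with made-up data, checks the $\mu$ estimate against a simple grid search over the error function:

```python
import numpy as np

# Assumed data; for the normal model, minimizing E(w) = -l(w; D) gives the
# closed-form estimates mu_hat = sample mean, sigma2_hat = biased sample variance.
D = np.array([0.8, 1.1, 0.95, 1.3, 1.05])
mu_hat = float(D.mean())
sigma2_hat = float(((D - mu_hat) ** 2).mean())

# Error function E(w) for the normal model (negative log-likelihood).
def error(mu, sigma2):
    return float(np.sum((D - mu) ** 2 / (2 * sigma2) + 0.5 * np.log(2 * np.pi * sigma2)))

# A grid search over mu should land on (approximately) the closed-form value.
mus = np.linspace(0.5, 1.5, 1001)
best_mu = float(mus[np.argmin([error(m, sigma2_hat) for m in mus])])
print(mu_hat, best_mu)  # the grid minimizer matches the closed form
```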
