These are notes I wrote for myself while studying for PhD qualifying exams at Stanford.
I wanted to publish them so that others could benefit.
They draw from several sources:
Some of the figures are taken from ISL with permission from the authors.
Probability theory is a mathematical framework for representing uncertain statements.
Sources of uncertainty:
Types of probability:
A random variable:
A probability distribution:
Note that we can integrate the PDF to find the actual probability mass over a range of points.
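As a quick sketch (using a standard normal just for illustration), the mass over an interval can be computed either by numerically integrating the PDF or by differencing the CDF:

from scipy import stats
from scipy.integrate import quad

# Probability that a standard normal falls in [0, 1]
mass_quad, _ = quad(stats.norm.pdf, 0, 1)          # numerically integrate the PDF
mass_cdf = stats.norm.cdf(1) - stats.norm.cdf(0)   # difference of the CDF
print(mass_quad, mass_cdf)                         # both ~0.341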
Conditional probability:
The probability of some event, given that some other event has happened, $P(y \mid x)$.
Joint probability distribution:
Can be decomposed into conditional distributions over only one variable
$P(A,B,C) = P(A \mid B,C) P(B \mid C) P(C) $
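As a minimal check of this decomposition (the joint table here is random, just normalized to sum to 1):

import numpy as np

P = np.random.rand(2, 2, 2)
P /= P.sum()                          # arbitrary joint P(A,B,C), axes = (A, B, C)
P_C = P.sum(axis=(0, 1))              # P(C)
P_BC = P.sum(axis=0)                  # P(B,C)
P_B_given_C = P_BC / P_C              # P(B|C)
P_A_given_BC = P / P_BC               # P(A|B,C)
print(np.allclose(P, P_A_given_BC * P_B_given_C * P_C))   # True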
Independence:
Two random variables $X$ and $Y$ are independent if their joint probability distribution can be expressed as a product of two factors, one involving only $X$ and one involving only $Y$: $P(X, Y) = P(X)\,P(Y)$.
Expectation (expected value):
Variance:
Covariance:
Correlation (Pearson correlation):
Coefficient of determination:
Single binary random variable that takes value 1 with success probability $\phi$:
$P(x=1) = \phi $
$P(x=0) = 1 - \phi $
import numpy as np
from scipy import stats
from scipy.stats import bernoulli, poisson, binom
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
plt.subplot(2,2,1)
a = np.arange(2)   # possible outcomes: 0 (failure) and 1 (success)
p = 0.2            # success probability phi
plt.bar(a,bernoulli.pmf(a, p),label=p,alpha=0.5)
plt.xticks(a)
plt.title('P=%s'%p)
Simply, this tells us the number of heads in $N$ independent flips of a coin, each with heads probability $p$.
We compute the probability of observing $n$ successes in $N$ trials, given a success probability $p$ for each trial, without regard to order.
If $N=5$, $p=0.5$, $n=2$, then the first success can occur in any of the 5 trials.
The second success can occur in any of the remaining 4 trials, giving 5 $\times$ 4 permutations.
In short, if the first success was on trial 1, the second can be on any of the 4 remaining trials; and this holds for each of the 5 possible trials for the first success.
So, let's say success 1 happened on trial 1 and success 2 happened on trial 2.
We don't really care if the reverse happened (success 2 on trial 1).
So, we correct for order: $\frac{5 \times 4}{n!} = 10$ combinations of 2 successes.
The number of combinations that yield exactly $n$ successes is generally given by the binomial coefficient.
This states:
In short, it is used to evaluate a series of independent trials with the same success probability, such as repeated dice rolls.
Also, the expected value is simply $Np$.
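A quick sketch checking the counting above and the corresponding PMF with scipy (using the $N=5$, $p=0.5$, $n=2$ example):

from scipy.special import comb
from scipy.stats import binom

N, n, p = 5, 2, 0.5
print(comb(N, n))            # 10 combinations, as derived above
print(binom.pmf(n, N, p))    # P(2 successes) = 10 * 0.5**2 * 0.5**3 = 0.3125
print(binom.mean(N, p))      # expected value N*p = 2.5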
Let's say the expected value of a Binomial random variable is $\lambda = Np$.
Let's assume $\lambda$ is the expected number of cars observed per hour.
If we let $N=60$ (one trial per minute over the course of the hour), then we can compute the probability of success per trial:

$$p=\frac{\lambda}{60}$$

OK, so let's say we want to compute the probability of observing $k$ cars:

$$ P(n=k) = {60 \choose k} \left(\frac{\lambda}{60}\right)^k \left(1-\frac{\lambda}{60}\right)^{60-k} $$

The problem is that each trial (minute) is binary: there can only be one or zero cars!
In practice, we could have > 1 car per minute.
So, we can get more granular.
So, let's make the number of trials large, with each trial spanning far less than a minute.
And the probability of success at each is correspondingly smaller.
As $N$ approaches infinity, the Binomial becomes the Poisson distribution.
So, the Binomial was brittle: the interval (e.g., one minute) was specified and assigned an associated probability.
For the Poisson, the probability of an event in a time interval is proportional to the size of the interval!
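A short sketch of that limit: hold $\lambda$ fixed, let $N$ grow, and the Binomial($N$, $\lambda/N$) PMF approaches the Poisson($\lambda$) PMF (the values of $\lambda$ and $N$ below are arbitrary):

import numpy as np
from scipy.stats import binom, poisson

lam = 5.0
k = np.arange(0, 15)
for N in [10, 60, 1000, 100000]:
    diff = np.max(np.abs(binom.pmf(k, N, lam/N) - poisson.pmf(k, lam)))
    print(N, diff)   # the largest difference shrinks as N grows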
Poisson process for many events:
An intuitive application is sequencing:
There are a large number of trials (reads).
The probability of a read hitting a given base is low for each trial.
The expected number of reads per base is given by the read count, read length, and genome size.
So, we can work out the likelihood of observing a given number of reads per base.
n=np.arange(0,50) # Range of coverage values.
l=100 # Read length.
N=3*10**9 # Size of human genome.
reads=[10**8,10**8.5,10**9] # Sequencing reads.
coverage=[float(read_number)*l/N for read_number in reads] # Lambda: expected reads per base.
colors = matplotlib.rcParams['axes.prop_cycle'].by_key()['color']  # default color cycle
plt.subplot(2,1,1)
for u,c in zip(coverage,colors):
    y=stats.poisson.pmf(n,u)
    plt.plot(n,y,'-',linewidth=2,color=c,alpha=1,label=r'$\lambda$ =%.3f'%u)
    plt.fill_between(n,y,color=c,alpha=.25)
plt.xticks(fontsize=8)
plt.yticks(fontsize=8)
plt.title('Probability mass function',fontsize=8)
plt.xlabel('n (base)',fontsize=8)
plt.ylabel('P(n reads/base)',fontsize=8)
plt.legend(loc='upper right', scatterpoints = 1,prop={'size':8})
plt.grid()
plt.subplot(2,1,2)
for u,c in zip(coverage,colors):
    y=stats.poisson.cdf(n,u)
    plt.plot(n,y,'-',linewidth=2,color=c,alpha=1,label=r'$\lambda$ =%.3f'%u)
    plt.fill_between(n,y,color=c,alpha=.25)
plt.xticks(fontsize=8)
plt.yticks(fontsize=8)
plt.title('Cumulative distribution function',fontsize=8)
plt.xlabel('n (base)',fontsize=8)
plt.ylabel('P(< or = n reads/base)',fontsize=8)
plt.legend(loc='upper right', scatterpoints = 1,prop={'size':8})
plt.grid()
plt.tight_layout()
As coverage increases, the PMF shows that the probability of getting a higher number of reads per base increases.
Of course, the probability of a specific position having exactly 30 reads is low (< 0.1).
But the CDF shows that the probability of getting 30 or more reads spanning the base when $\lambda=30$ is reasonable at $0.524$ ($1-y[29]$).
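We can check that number directly (the survival function gives the same quantity):

print(1 - stats.poisson.cdf(29, 30))   # P(30 or more reads | lambda = 30), ~0.524
print(stats.poisson.sf(29, 30))        # same thing via the survival function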
Exponential distribution models the waiting time between events in a Poisson process:
x=np.arange(0,4,0.1)
colors = matplotlib.rcParams['axes.prop_cycle'].by_key()['color']  # default color cycle
lambda_ = [0.5,1,2]
plt.subplot(2,1,1)
for u,c in zip(lambda_,colors):
    y=u*np.exp(-u*x)   # exponential PDF: lambda * exp(-lambda * x)
    plt.plot(x,y,'-',linewidth=2,alpha=1,color=c,label=r'$\lambda$ =%.3f'%u)
    plt.fill_between(x,y,color=c,alpha=.25)
plt.legend()
plt.xticks(fontsize=5)
plt.yticks(fontsize=5)
plt.title('Probability density function',fontsize=8)
plt.xlabel('x (distance)',fontsize=8)
plt.ylabel('P(x)',fontsize=8)
plt.legend(loc='upper right', scatterpoints = 1,prop={'size':8})
plt.xlim([0,1])
plt.grid()
plt.subplot(2,1,2)
for u,c in zip(lambda_,colors):
    y=1-np.exp(-u*x)   # exponential CDF: 1 - exp(-lambda * x)
    plt.plot(x,y,'-',linewidth=2,alpha=1,color=c,label=r'$\lambda$ =%.3f'%u)
    plt.fill_between(x,y,color=c,alpha=.25)
plt.legend()
plt.xticks(fontsize=5)
plt.yticks(fontsize=5)
plt.title('Cumulative distribution function',fontsize=8)
plt.xlabel('x (distance)',fontsize=8)
plt.ylabel('P(event <= x)',fontsize=8)
plt.legend(loc='upper right', scatterpoints = 1,prop={'size':8})
plt.grid()
plt.tight_layout()
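As a small simulation sketch of the connection (the rate and sample size are arbitrary): if waiting times between events are exponential with rate $\lambda$, the number of events per unit interval should be Poisson with mean $\lambda$.

np.random.seed(0)
lam = 2.0                                                  # arbitrary event rate
waits = np.random.exponential(scale=1/lam, size=200000)    # waiting times between events
arrivals = np.cumsum(waits)                                # event times
counts = np.bincount(arrivals.astype(int))[:-1]            # events per unit-length interval
print(counts.mean(), counts.var())                         # both should be close to lambda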
Note that the standard z-score is the number of standard deviations away from the mean: $z = \frac{x - \mu}{\sigma}$.
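As a small sketch (the observation, mean, and standard deviation below are hypothetical), a z-score and its two-sided tail probability can be computed with scipy:

from scipy import stats

x, mu, sigma = 1.83, 1.70, 0.08       # hypothetical observation, population mean, population std
z = (x - mu) / sigma                  # number of standard deviations from the mean
print(z, 2 * stats.norm.sf(abs(z)))   # z-score and two-sided tail probability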
# Specify parameters for normal distribution
colors = matplotlib.rcParams['axes.prop_cycle'].by_key()['color']  # default color cycle
stdVal=3
plt.subplot(2,1,1)
mean = [5,10,20]
for meanVal,c in zip(mean,colors):
    # Generates a normal continuous random variable with specified mean (loc), std (scale)
    rv = stats.norm(loc=meanVal,scale=stdVal)
    x = np.arange(.1*meanVal,meanVal*10,1)
    plt.plot(x,rv.pdf(x),linewidth=2,alpha=1,color=c,label=r'$\mu$ =%.3f'%meanVal)
    plt.fill_between(x,rv.pdf(x),color=c,alpha=.25)
plt.legend()
plt.xticks(fontsize=5)
plt.yticks(fontsize=5)
plt.xlim([0,np.max(mean)*1.5])
plt.title('Probability density function',fontsize=7.5)
plt.xlabel('Random variable (x)',fontsize=5)
plt.ylabel('P(x)',fontsize=5)
plt.legend(loc='upper right', scatterpoints = 1,prop={'size':5})
plt.grid()
plt.subplot(2,1,2)
for meanVal,c in zip(mean,colors):
    rv = stats.norm(loc=meanVal,scale=stdVal)
    x = np.arange(.1*meanVal,meanVal*10,1)
    plt.plot(x,rv.cdf(x),linewidth=2,alpha=1,color=c,label=r'$\mu$ =%.3f'%meanVal)
    plt.fill_between(x,rv.cdf(x),color=c,alpha=.25)
plt.legend()
plt.xticks(fontsize=5)
plt.yticks(fontsize=5)
plt.xlim([0,np.max(mean)*1.5])
plt.title('Cumulative distribution function',fontsize=7.5)
plt.xlabel('Random variable (x)',fontsize=5)
plt.ylabel('P(event <= x)',fontsize=5)
plt.legend(loc='upper right', scatterpoints = 1,prop={'size':5})
plt.grid()
plt.tight_layout()
Conditional distributions give the probability distribution over $Y$ when we know that $X$ takes on a certain value $x$: $P(Y \mid X = x)$.
Frequentist probability relies on the long-run frequency of events (e.g., the probability for each roll of a die).
Bayesian probability assigns a belief, or prior probability, to hypothesis $P(H)$.
Note that Frequentist probabilities are typically assigned based upon a known frequency, whereas a Bayesian prior can be assigned based on belief.
We'll use this to estimate the probability of our hypothesis given some data, $D$.
So, we also must state the probability of:
From this we can estimate the probability that the hypothesis is true given the data, $P(H|D)$, the posterior probability.
$$ P(H|D) = \frac{P(D|H)P(H)}{P(D)} $$
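A tiny numeric sketch of the update (all of the probabilities below are made up for illustration):

p_H = 0.1                    # prior P(H), hypothetical
p_D_given_H = 0.8            # likelihood P(D|H), hypothetical
p_D_given_notH = 0.2         # likelihood P(D|not H), hypothetical
p_D = p_D_given_H * p_H + p_D_given_notH * (1 - p_H)   # total probability P(D)
print(p_D_given_H * p_H / p_D)   # posterior P(H|D) ~ 0.31: the data raised our belief from 0.1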
The basic intuition behind information theory is that learning that an unlikely event has occurred is more informative than learning that a likely event has occurred: likely events should have low information content, while unlikely events should have high information content.
$I(x) = -\log P(x)$
Shannon entropy of a distribution is the expected amount of information in an event drawn from that distribution.
$E[I(x)] = -\sum_{x} P(x) \log P(x)$
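A short sketch computing the self-information and entropy of a Bernoulli($\phi$) variable (scipy's entropy uses the natural log by default; base 2 gives bits):

import numpy as np
from scipy.stats import entropy

phi = 0.1
p = np.array([phi, 1 - phi])
info = -np.log2(p)                  # self-information of each outcome, in bits
print(info)                         # the rare outcome carries more information
print(entropy(p, base=2))           # Shannon entropy: expected information
print(np.sum(p * info))             # same value, computed by hand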
Shannon entropy:
Previously (in the statistics notebook), I went through statistical hypothesis tests.
These tests seek to assess whether groups or effects are "statistically significant."
Another approach is to build flexible models with the overarching aim of estimating quantities of interest.
In short, what we want to do is:
We can do this in two ways:
So, break it down like this.
(1) A distribution is parameterized by some values:
(2) For a distribution, there is a link between the parameters and the mean ($\mu$), variance ($\sigma$):
(3) For a given sample of data, we can compute sample statistics (the sample mean $\bar{x}$ estimates $\mu$, the sample standard deviation $s$ estimates $\sigma$).
(4) We can then use the sample statistics to compute the distribution parameters and fit it!
For example, for a gamma distribution with shape $\alpha$ and scale $\beta$, the mean and variance are $\mu = \alpha\beta$ and $\sigma^2 = \alpha\beta^2$. Solving for the parameters in terms of the sample statistics:
$ \alpha = \frac{\sigma^2}{\beta^2}$
$ \beta = \frac{\mu}{\alpha} $
$ \beta^2 = \frac{\mu^2}{\alpha^2} $
$ \alpha = \frac{\sigma^2 \alpha^2}{\mu^2}$
$ \alpha = \frac{\mu^2}{\sigma^2}$
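As a minimal sketch of step (4), assuming the distribution in question is a gamma with shape $\alpha$ and scale $\beta$ (the synthetic data and true parameter values below are just for illustration), the method-of-moments fit is:

import numpy as np
from scipy import stats

np.random.seed(0)
data = stats.gamma.rvs(a=3.0, scale=2.0, size=10000)   # synthetic sample, true shape=3, scale=2

mu, var = data.mean(), data.var()   # sample statistics
alpha = mu**2 / var                 # alpha = mu^2 / sigma^2, as derived above
beta = mu / alpha                   # beta = mu / alpha
print(alpha, beta)                  # should be close to (3.0, 2.0)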
This notebook has a nice review of both: