These are notes I wrote for myself while studying for PhD qualifying exams at Stanford.
I wanted to publish them so that others could benefit.
They draw from several sources:
Some of the figures are taken from ISL with permission from the authors.
Probability theory is a mathematical framework for representing uncertain statements.
Sources of uncertainty:
Types of probability:
A random variable:
A probability distribution:
Note that we can integrate the PDF to find the actual probability mass over a range of points.
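As a quick sketch (using a standard normal just for illustration), the mass over an interval can be computed either by numerically integrating the PDF or by differencing the CDF:

from scipy import stats
from scipy.integrate import quad

# Probability that a standard normal falls in [0, 1]
mass_quad, _ = quad(stats.norm.pdf, 0, 1)          # numerically integrate the PDF
mass_cdf = stats.norm.cdf(1) - stats.norm.cdf(0)   # difference of the CDF
print(mass_quad, mass_cdf)                         # both ~0.341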
Conditional probability:
The probability of some event, given that some other event has happened, $P(y \mid x)$.
Joint probability distribution:
Can be decomposed into conditional distributions over only one variable
$P(A,B,C) = P(A \mid B,C) P(B \mid C) P(C) $
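As a minimal check of this decomposition (the joint table here is random, just normalized to sum to 1):

import numpy as np

P = np.random.rand(2, 2, 2)
P /= P.sum()                          # arbitrary joint P(A,B,C), axes = (A, B, C)
P_C = P.sum(axis=(0, 1))              # P(C)
P_BC = P.sum(axis=0)                  # P(B,C)
P_B_given_C = P_BC / P_C              # P(B|C)
P_A_given_BC = P / P_BC               # P(A|B,C)
print(np.allclose(P, P_A_given_BC * P_B_given_C * P_C))   # True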
Independence:
Two random variables $X$ and $Y$ are independent if their joint probability distribution can be expressed as a product of two factors, one involving only $X$ and one involving only $Y$: $P(X, Y) = P(X)\,P(Y)$.
Expectation (expected value):
Variance:
Covariance:
Correlation (Pearson correlation):
Coefficient of determination:
Single binary random variable that takes value 1 with success probability $\phi$:
$P(x=1) = \phi $
$P(x=0) = 1 - \phi $
import numpy as np
from scipy import stats
from scipy.stats import bernoulli, poisson, binom
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
plt.subplot(2,2,1)
a = np.arange(2)   # possible outcomes: 0 (failure) and 1 (success)
p = 0.2            # success probability phi
plt.bar(a,bernoulli.pmf(a, p),label=p,alpha=0.5)
plt.xticks(a)
plt.title('P=%s'%p)
Simply, this tells us the number of heads in $N$ independent flips of a coin, each with heads probability $p$.
We compute the probability of observing $n$ successes in $N$ trials, given a success probability $p$ for each trial, without regard to order.
If $N=5$, $p=0.5$, $n=2$, then the first success can occur in any of the 5 trials.
The second success can occur in any of the remaining 4 trials, giving 5 $\times$ 4 permutations.
In short, if the first success was on trial 1, the second can be on any of the 4 remaining trials; and this holds for each of the 5 possible trials for the first success.
So, let's say success 1 happened on trial 1 and success 2 happened on trial 2.
We don't really care if the reverse happened (success 2 on trial 1).
So, we correct for order: $\frac{5 \times 4}{n!} = 10$ combinations of 2 successes.
The number of combinations that yield exactly $n$ successes is generally given by the binomial coefficient.
This states:
In short, it is used to evaluate a series of independent trials with the same success probability, such as repeated dice rolls.
Also, the expected value is simply $Np$.
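A quick sketch checking the counting above and the corresponding PMF with scipy (using the $N=5$, $p=0.5$, $n=2$ example):

from scipy.special import comb
from scipy.stats import binom

N, n, p = 5, 2, 0.5
print(comb(N, n))            # 10 combinations, as derived above
print(binom.pmf(n, N, p))    # P(2 successes) = 10 * 0.5**2 * 0.5**3 = 0.3125
print(binom.mean(N, p))      # expected value N*p = 2.5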
Let's say the expected value of a Binomial random variable is $\lambda = Np$.
Let's assume $\lambda$ is the expected number of cars observed per hour.
If we let $N=60$ (one trial per minute over the course of the hour), then we can compute the probability of success per trial:

$$p=\frac{\lambda}{60}$$

OK, so let's say we want to compute the probability of observing $k$ cars:

$$ P(n=k) = {60 \choose k} \left(\frac{\lambda}{60}\right)^k \left(1-\frac{\lambda}{60}\right)^{60-k} $$

The problem is that each trial (minute) is binary: there can only be one or zero cars!
In practice, we could have > 1 car per minute.
So, we can get more granular.
So, let's make the number of trials large, with each trial spanning far less than a minute.
And the probability of success at each is correspondingly smaller.
As $N$ approaches infinity, the Binomial becomes the Poisson distribution.
So, the Binomial was brittle: the interval (e.g., one minute) was specified and assigned an associated probability.
For the Poisson, the probability of an event in a time interval is proportional to the size of the interval!
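A short sketch of that limit: hold $\lambda$ fixed, let $N$ grow, and the Binomial($N$, $\lambda/N$) PMF approaches the Poisson($\lambda$) PMF (the values of $\lambda$ and $N$ below are arbitrary):

import numpy as np
from scipy.stats import binom, poisson

lam = 5.0
k = np.arange(0, 15)
for N in [10, 60, 1000, 100000]:
    diff = np.max(np.abs(binom.pmf(k, N, lam/N) - poisson.pmf(k, lam)))
    print(N, diff)   # the largest difference shrinks as N grows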
Poisson process for many events:
An intuitive application is sequencing:
There are a large number of trials (reads).
The probability of a read hitting a given base is low for each trial.
The expected number of reads per base is given by the read count, read length, and genome size.
So, we can work out the likelihood of observing a given number of reads per base.
n=np.arange(0,50) # Range of coverage values.
l=100 # Read length.
N=3*10**9 # Size of human genome.
reads=[10**8,10**8.5,10**9] # Sequencing reads.
coverage=[float(read_number)*l/N for read_number in reads] # Lambda: expected reads per base.
colors = matplotlib.rcParams['axes.prop_cycle'].by_key()['color']  # default color cycle
plt.subplot(2,1,1)
for u,c in zip(coverage,colors):
    y=stats.poisson.pmf(n,u)
    plt.plot(n,y,'-',linewidth=2,color=c,alpha=1,label=r'$\lambda$ =%.3f'%u)
    plt.fill_between(n,y,color=c,alpha=.25)
plt.xticks(fontsize=8)
plt.yticks(fontsize=8)
plt.title('Probability mass function',fontsize=8)
plt.xlabel('n (base)',fontsize=8)
plt.ylabel('P(n reads/base)',fontsize=8)
plt.legend(loc='upper right', scatterpoints = 1,prop={'size':8})
plt.grid()
plt.subplot(2,1,2)
for u,c in zip(coverage,colors):
    y=stats.poisson.cdf(n,u)
    plt.plot(n,y,'-',linewidth=2,color=c,alpha=1,label=r'$\lambda$ =%.3f'%u)
    plt.fill_between(n,y,color=c,alpha=.25)
plt.xticks(fontsize=8)
plt.yticks(fontsize=8)
plt.title('Cumulative distribution function',fontsize=8)
plt.xlabel('n (base)',fontsize=8)
plt.ylabel('P(< or = n reads/base)',fontsize=8)
plt.legend(loc='upper right', scatterpoints = 1,prop={'size':8})
plt.grid()
plt.tight_layout()
As coverage increases, the PMF shows that the probability of getting a higher number of reads per base increases.
Of course, the probability of a specific position having exactly 30 reads is low (< 0.1).
But the CDF shows that the probability of getting 30 or more reads spanning the base when $\lambda=30$ is reasonable at $0.524$ ($1-y[29]$).
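We can check that number directly (the survival function gives the same quantity):

print(1 - stats.poisson.cdf(29, 30))   # P(30 or more reads | lambda = 30), ~0.524
print(stats.poisson.sf(29, 30))        # same thing via the survival function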
Exponential distribution models the waiting time between events in a Poisson process:
x=np.arange(0,4,0.1)
colors = matplotlib.rcParams['axes.prop_cycle'].by_key()['color']  # default color cycle
lambda_ = [0.5,1,2]
plt.subplot(2,1,1)
for u,c in zip(lambda_,colors):
    y=u*np.exp(-u*x)   # exponential PDF: lambda * exp(-lambda * x)
    plt.plot(x,y,'-',linewidth=2,alpha=1,color=c,label=r'$\lambda$ =%.3f'%u)
    plt.fill_between(x,y,color=c,alpha=.25)
plt.legend()
plt.xticks(fontsize=5)
plt.yticks(fontsize=5)
plt.title('Probability density function',fontsize=8)
plt.xlabel('x (distance)',fontsize=8)
plt.ylabel('P(x)',fontsize=8)
plt.legend(loc='upper right', scatterpoints = 1,prop={'size':8})
plt.xlim([0,1])
plt.grid()
plt.subplot(2,1,2)
for u,c in zip(lambda_,colors):
    y=1-np.exp(-u*x)   # exponential CDF: 1 - exp(-lambda * x)
    plt.plot(x,y,'-',linewidth=2,alpha=1,color=c,label=r'$\lambda$ =%.3f'%u)
    plt.fill_between(x,y,color=c,alpha=.25)
plt.legend()
plt.xticks(fontsize=5)
plt.yticks(fontsize=5)
plt.title('Cumulative distribution function',fontsize=8)
plt.xlabel('x (distance)',fontsize=8)
plt.ylabel('P(event <= x)',fontsize=8)
plt.legend(loc='upper right', scatterpoints = 1,prop={'size':8})
plt.grid()
plt.tight_layout()
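As a small simulation sketch of the connection (the rate and sample size are arbitrary): if waiting times between events are exponential with rate $\lambda$, the number of events per unit interval should be Poisson with mean $\lambda$.

np.random.seed(0)
lam = 2.0                                                  # arbitrary event rate
waits = np.random.exponential(scale=1/lam, size=200000)    # waiting times between events
arrivals = np.cumsum(waits)                                # event times
counts = np.bincount(arrivals.astype(int))[:-1]            # events per unit-length interval
print(counts.mean(), counts.var())                         # both should be close to lambda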
Note that the standard z-score is the number of standard deviations away from the mean: $z = \frac{x - \mu}{\sigma}$.
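As a small sketch (the observation, mean, and standard deviation below are hypothetical), a z-score and its two-sided tail probability can be computed with scipy:

from scipy import stats

x, mu, sigma = 1.83, 1.70, 0.08       # hypothetical observation, population mean, population std
z = (x - mu) / sigma                  # number of standard deviations from the mean
print(z, 2 * stats.norm.sf(abs(z)))   # z-score and two-sided tail probability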
# Specify parameters for normal distribution
colors = matplotlib.rcParams['axes.prop_cycle'].by_key()['color']  # default color cycle
stdVal=3
plt.subplot(2,1,1)
mean = [5,10,20]
for meanVal,c in zip(mean,colors):
    # Generates a normal continuous random variable with specified mean (loc), std (scale)
    rv = stats.norm(loc=meanVal,scale=stdVal)
    x = np.arange(.1*meanVal,meanVal*10,1)
    plt.plot(x,rv.pdf(x),linewidth=2,alpha=1,color=c,label=r'$\mu$ =%.3f'%meanVal)
    plt.fill_between(x,rv.pdf(x),color=c,alpha=.25)
plt.legend()
plt.xticks(fontsize=5)
plt.yticks(fontsize=5)
plt.xlim([0,np.max(mean)*1.5])
plt.title('Probability density function',fontsize=7.5)
plt.xlabel('Random variable (x)',fontsize=5)
plt.ylabel('P(x)',fontsize=5)
plt.legend(loc='upper right', scatterpoints = 1,prop={'size':5})
plt.grid()
plt.subplot(2,1,2)
for meanVal,c in zip(mean,colors):
    rv = stats.norm(loc=meanVal,scale=stdVal)
    x = np.arange(.1*meanVal,meanVal*10,1)
    plt.plot(x,rv.cdf(x),linewidth=2,alpha=1,color=c,label=r'$\mu$ =%.3f'%meanVal)
    plt.fill_between(x,rv.cdf(x),color=c,alpha=.25)
plt.legend()
plt.xticks(fontsize=5)
plt.yticks(fontsize=5)
plt.xlim([0,np.max(mean)*1.5])
plt.title('Cumulative distribution function',fontsize=7.5)
plt.xlabel('Random variable (x)',fontsize=5)
plt.ylabel('P(event <= x)',fontsize=5)
plt.legend(loc='upper right', scatterpoints = 1,prop={'size':5})
plt.grid()
plt.tight_layout()
Conditional distributions give the probability distribution over $Y$ when we know that $X$ takes on a certain value $x$: $P(Y \mid X = x)$.
Frequentist probability relies on the long-run frequency of events (e.g., the probability for each roll of a die).
Bayesian probability assigns a belief, or prior probability, to hypothesis $P(H)$.
Note that Frequentist probabilities are typically assigned based upon a known frequency, whereas a Bayesian prior can be assigned based on belief.
We'll use this to estimate the probability of our hypothesis given some data, $D$.
So, we also must state the probability of:
From this we can estimate the probability that the hypothesis is true given the data, $P(H|D)$, the posterior probability.
$$ P(H|D) = \frac{P(D|H)P(H)}{P(D)} $$
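A tiny numeric sketch of the update (all of the probabilities below are made up for illustration):

p_H = 0.1                    # prior P(H), hypothetical
p_D_given_H = 0.8            # likelihood P(D|H), hypothetical
p_D_given_notH = 0.2         # likelihood P(D|not H), hypothetical
p_D = p_D_given_H * p_H + p_D_given_notH * (1 - p_H)   # total probability P(D)
print(p_D_given_H * p_H / p_D)   # posterior P(H|D) ~ 0.31: the data raised our belief from 0.1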
The basic intuition behind information theory is that learning that an unlikely event has occurred is more informative than learning that a likely event has occurred: likely events should have low information content, while unlikely events should have high information content.
$I(x) = -\log P(x)$
Shannon entropy of a distribution is the expected amount of information in an event drawn from that distribution.
$E[I(x)] = -\sum_{x} P(x) \log P(x)$
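A short sketch computing the self-information and entropy of a Bernoulli($\phi$) variable (scipy's entropy uses the natural log by default; base 2 gives bits):

import numpy as np
from scipy.stats import entropy

phi = 0.1
p = np.array([phi, 1 - phi])
info = -np.log2(p)                  # self-information of each outcome, in bits
print(info)                         # the rare outcome carries more information
print(entropy(p, base=2))           # Shannon entropy: expected information
print(np.sum(p * info))             # same value, computed by hand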
Shannon entropy:
Previously (in the statistics notebook), I went through statistical hypothesis tests.
These tests seek to assess whether groups or effects are "statistically significant."
Another approach is to build flexible models with the overarching aim of estimating quantities of interest.
In short, what we want to do is:
We can do this in two ways:
So, break it down like this.
(1) A distribution is parameterized by some values:
(2) For a distribution, there is a link between the parameters and the mean ($\mu$), variance ($\sigma$):
(3) For a given sample of data, we can compute sample statistics (the sample mean $\bar{x}$ estimates $\mu$, the sample standard deviation $s$ estimates $\sigma$).
(4) We can then use the sample statistics to compute the distribution parameters and fit it!
For example, for a gamma distribution with shape $\alpha$ and scale $\beta$, the mean and variance are $\mu = \alpha\beta$ and $\sigma^2 = \alpha\beta^2$. Solving for the parameters in terms of the sample statistics:
$ \alpha = \frac{\sigma^2}{\beta^2}$
$ \beta = \frac{\mu}{\alpha} $
$ \beta^2 = \frac{\mu^2}{\alpha^2} $
$ \alpha = \frac{\sigma^2 \alpha^2}{\mu^2}$
$ \alpha = \frac{\mu^2}{\sigma^2}$
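As a minimal sketch of step (4), assuming the distribution in question is a gamma with shape $\alpha$ and scale $\beta$ (the synthetic data and true parameter values below are just for illustration), the method-of-moments fit is:

import numpy as np
from scipy import stats

np.random.seed(0)
data = stats.gamma.rvs(a=3.0, scale=2.0, size=10000)   # synthetic sample, true shape=3, scale=2

mu, var = data.mean(), data.var()   # sample statistics
alpha = mu**2 / var                 # alpha = mu^2 / sigma^2, as derived above
beta = mu / alpha                   # beta = mu / alpha
print(alpha, beta)                  # should be close to (3.0, 2.0)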
This notebook has a nice review of both: