This notebook is based in part on chapter 3 of "The Elements of Statistical Learning" by Hastie, Tibshirani and Friedman (a full PDF is freely available from the authors' website).
Models are mathematical constructs used to explain measured data. In cognitive neuroscience, models are usually used to quantitatively describe specific parts of the nervous system (e.g. neurons, voxels, etc.). Typically, we seek to understand changes in the level of a signal measured from the unit (e.g. spike rate, or BOLD response), as a function of the inputs into the system (e.g. stimulus presented, cognitive state of the participant, etc.). Accurate and precise mathematical models derived from the data are useful in the following ways:
They allow you to explain the data you have: summary statistics of the data, such as the mean of the responses of a voxel, or the variance of these responses, are often not that informative on their own (especially if the data have been normalized). A model can give you a good explanation of the data you have observed, in terms of background knowledge, or theory.
They allow you to predict data you haven't collected: in many cases, assuming that the parameters don't change with changes in the independent variables, models allow us to interpolate and predict what the dependent variable would look like at other values of the independent variables, and even to extrapolate to conditions that have not been measured, for example because measuring in those conditions is impractical.
They allow you to specify your confidence in a specific theory: By performing some form of statistical testing, you can specify your confidence level in the veracity of statements about responses to a particular condition, or differential responses in different conditions.
Assume that we make a measurement $\bf{y}$. Here bold-face indicates that we are talking about a measurement vector, say of length $N$. This measurement is the response of our unit-of-interest to some inputs: $(\bf{x}_1, \bf{x}_2, ..., \bf{x}_p)$. Each one of the $\bf{x}_i$ vectors also has $N$ elements. We will make this more concrete later on, but for now, you can think of the different $\bf{x}_i$ as representing different input streams, for example, different locations on the screen.
A linear model assumes that the expected value of $\bf{y}$ given a specific input $\bf{X}$ can be expressed as a linear sum of the $\bf{x}_i$. That is: $E(\bf{Y}|\bf{X})$ is linear in the inputs.
Let's break that down a bit. All this means is that in the noise-less case $\bf{y}$ is a function of the inputs, $f(x)$, such that there exist $\beta_i$ which satisfy:
$f(x) = \beta_0 + \beta_1 \bf{x}_1 + \beta_2 \bf{x}_2 + ... + \beta_p \bf{x}_p$
You will notice that bivariate linear regression, which you are probably familiar with, is simply the special case of this with $p=1$:
import numpy as np
import matplotlib.pyplot as plt
from numpy import pi

def linear_function(x, a, b):
    """
    A linear function

    Parameters
    ----------
    x : ndarray
        Input variable
    a : float
        offset parameter
    b : float
        slope parameter
    """
    return a + b * x
Let's look at a simple case:
N = 100
x = np.linspace(-pi, pi, N)
a = np.random.randn() * 10
b = np.random.randn() * 10
y = linear_function(x, a, b)
fig, ax = plt.subplots(1)
ax.plot(x, y, '.')
ax.set_xlabel('x')
ax.set_ylabel('y')
The linear relationship holds even when the input is itself a nonlinear function of some other variable (here, $x$ is a sine function of $t$):
t = np.linspace(-pi, pi, N)
x = np.sin(t)
fig, ax = plt.subplots(1)
ax.plot(x)
y = linear_function(x, a, b)
fig, ax = plt.subplots(1)
ax.plot(y, '.')
fig, ax = plt.subplots(1)
ax.plot(x, y, '.')
ax.set_xlabel('x')
ax.set_ylabel('y')
Now, let's generalize this to $p$ inputs, where we choose $p=10$
For example, let's consider an output which is a linear combination of $p$ sine waves.
To account for $\beta_0$, we make sure that the first column is all 1's:
p = 10  # Number of inputs
# Preallocate a matrix (make it a matrix of ones to account for beta_0):
X = np.ones((N, p))
for ii in range(1, p):
    X[:, ii] = np.sin((ii + 1) * t)

fig, ax = plt.subplots(1)
ax.plot(t, X)
ax.set_ylim([-1.1, 1.1])
ax.set_xlim([-pi, pi])
ax.set_xlabel('t')
ax.set_ylabel('x')
Let's define our $f(x)$ as a function of beta coefficients:
def linear_model(X, beta):
    """
    The linear model

    Parameters
    ----------
    X : 2d array
        The design matrix : a matrix of regressors
    beta : 1d array
        Model coefficients
    """
    return np.dot(X, beta)
Where np.dot is a matrix multiplication:
$\bf{X}\beta = \beta_0 + \beta_1 \bf{x}_1 + \beta_2 \bf{x}_2 + ... + \beta_p \bf{x}_p $
The multiplication of two matrices $\bf{A}$ and $\bf{B}$ is denoted $\bf{AB}$. This is only defined if the inner dimensions match, that is, if $\bf{A}$ is $m$ by $n$ and $\bf{B}$ is $n$ by $k$. In that case, element $i,j$ of the product is defined as follows:
$(\bf{AB})_{ij} = \sum_{k}{A_{ik} B_{kj}}$
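To make this formula concrete, here is a minimal sketch that computes the product element by element with explicit loops and checks the result against np.dot (matmul_loops, A_demo and B_demo are introduced here just for illustration):

def matmul_loops(A, B):
    """
    Matrix multiplication computed element-by-element using the summation formula.
    """
    m, n = A.shape
    n2, k = B.shape
    assert n == n2, "Inner dimensions must match"
    out = np.zeros((m, k))
    for i in range(m):
        for j in range(k):
            # Element (i, j) is a sum over the shared (inner) dimension:
            out[i, j] = np.sum(A[i, :] * B[:, j])
    return out

A_demo = np.random.randn(3, 4)
B_demo = np.random.randn(4, 2)
print(np.allclose(matmul_loops(A_demo, B_demo), np.dot(A_demo, B_demo)))  # True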
A special case is the matrix that has 1's on the diagonal and 0's everywhere else. This is known as the identity matrix and is denoted by $\bf{I}$
A = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
print(A.shape)
B = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
print(B.shape)
C = np.dot(A, B)
print(C.shape)
print(C[2, 1])
D = np.dot(B, A)
print(D.shape)
print(D[2, 1])
# Note that matrix multiplication is not commutative: AB and BA differ even in shape
I = np.eye(4)
IB = np.dot(I, B)  # Multiplying by the identity leaves B unchanged
(3, 4)
(4, 3)
(3, 3)
288
(4, 4)
152
We choose a set of random $\beta$ coefficients and generate $f(x)$
beta = np.random.randn(p)
f = linear_model(X, beta)
fig, ax = plt.subplots(1)
ax.plot(t, f)
ax.set_xlim([-pi, pi])
In this case, $f$ is not going to be linear in any particular $\bf{x}_i$, only in a combination of them:
fig, ax = plt.subplots(1)
ax.plot(X[:, 1], f, '.')
But that's hard to visualize, because it's happening in 10D!
Unfortunately, the noise-less case is not a very realistic one. Let's assume that the output is additionally corrupted by noise, $\epsilon$:
$\bf{y} = f(\bf{x}) + \epsilon$
which is distributed according to a Gaussian/normal distribution with zero mean:
$\epsilon \sim \mathcal{N}(0, \sigma)$
y = f + np.random.randn(*f.shape)
fig, ax = plt.subplots(1)
ax.plot(f, label='f')
ax.plot(y, label='y')
plt.legend()
Solving the linear model means finding an estimate of the $\bf{\beta}$ coefficients that generated the data we observe.
In other words, we are looking for a procedure to estimate the coefficients (we call the estimated coefficients $\bf{\hat{\beta}}$).
A 'good' estimate of these coefficients would produce an estimate of the output, $\bf{\hat{y}}$, which would be close to the measured data points.
Where:
$\bf{\hat{y}} = \bf{ X\hat{\beta}}$
One way to define this is by saying that we want the residual sum of squares to be small, where the residuals are:
$\bf{y}-\bf{\hat{y}}$
and the residual sum of squares (or RSS) is:
$RSS = \sum_{i=1}^{N}(y_i - \hat{y_i})^2$
Since $\bf{\hat{y}}$ is a function of $\bf{\beta}$, we can think of $RSS$ as a function of $\beta$ and we want to optimize it with respect to a choice of the elements of $\beta$. That is, we want to find a minimum of $RSS$ with respect to $\beta$.
Mathematically speaking, we want to find a value of $\bf{\hat{\beta}}$ for which the first derivative of $RSS$ with respect to $\beta$ is 0
To find the derivative, we re-write $RSS$ in the following way:
$RSS(\beta) = (\bf{y} - X\beta)^T (\bf{y} - X\beta) $
The derivative of this function is (we leave the proof as an exercise to the reader):
$\frac{\partial RSS}{\partial\beta} = -2 \bf{X}^T (\bf{y} - \bf{X}\beta)$
Setting the first derivative to 0 and dividing both sides by -2, we obtain
$\bf{X}^T (y-\bf{X}\beta) = 0$
$\bf{X}^T y - \bf{X}^T \bf{X} \beta = 0$
$\Rightarrow \bf{X}^T y = \bf{X}^T \bf{X} \beta$
If $\bf{X}$ is 'full rank', then $\bf{X}^T\bf{X}$ is positive definite and it can be inverted.
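Before turning to the singular value decomposition, we can sanity-check the derivative formula above numerically, by comparing it to a central finite-difference approximation of the gradient of $RSS$ (a minimal sketch; the rss and rss_grad helpers are defined here just for this check):

def rss(beta, X, y):
    # Residual sum of squares for a given coefficient vector
    resid = y - np.dot(X, beta)
    return np.dot(resid, resid)

def rss_grad(beta, X, y):
    # The analytic derivative derived above: -2 X^T (y - X beta)
    return -2 * np.dot(X.T, y - np.dot(X, beta))

beta_test = np.random.randn(X.shape[1])
eps = 1e-5
numeric_grad = np.zeros(X.shape[1])
for j in range(X.shape[1]):
    step = eps * np.eye(X.shape[1])[j]
    numeric_grad[j] = (rss(beta_test + step, X, y) - rss(beta_test - step, X, y)) / (2 * eps)
print(np.allclose(numeric_grad, rss_grad(beta_test, X, y)))  # Should print True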
The singular value decomposition of the matrix $\bf{X}$ is defined as:
$\bf{X} = \bf{UDV}^T$
Where $\bf{U}$ and $\bf{V}$ have orthonormal columns (meaning that, for example, $\bf{U}^T\bf{U} = \bf{I}$) and $\bf{D}$ is a diagonal matrix (meaning that off-diagonal values are equal to 0).
import scipy.linalg as la
# Note: la.svd returns the singular values as a 1d array, and returns V already transposed (i.e. V^T)
U, D, V = la.svd(X, full_matrices=False)  # full_matrices=False conforms with the conventions in the text
# We convert D to a diagonal matrix:
D = np.diag(D)
fig, ax = plt.subplots(1)
im = ax.matshow(D)
fig.colorbar(im)
fig, ax = plt.subplots(1)
im = ax.matshow(np.dot(U.T, U))
fig.colorbar(im)
fig, ax = plt.subplots(1)
im = ax.matshow(np.dot(V.T, V))
fig.colorbar(im)
np.allclose(np.dot(np.dot(U, D), V), X)
True
If all the singular values of the matrix $\bf{X}$ (the values on the diagonal of $\bf{D}$) are non-zero, the matrix $\bf{X}$ is 'full rank'. In that case the eigenvalues of $\bf{X}^T\bf{X}$, which are the squares of the singular values of $\bf{X}$, are also all non-zero, so $\bf{X}^T\bf{X}$ can be inverted.
If this is the case, we can obtain the unique solution:
$ \Rightarrow \hat{\beta} = (\bf{X}^T \bf{X})^{-1}\bf{X}^T\bf{y}$,
This is also known as the Ordinary Least-Squares solution to the regression problem (or OLS)
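To see why full rank matters, consider what happens when one regressor is an exact copy of another (an artificial example, constructed here just for illustration): one singular value drops to essentially zero, the matrix is no longer full rank, and $\bf{X}^T\bf{X}$ becomes singular, so it cannot be stably inverted.

# Build a rank-deficient design matrix by duplicating a column:
X_bad = X.copy()
X_bad[:, 2] = X_bad[:, 1]
print(la.svd(X_bad, compute_uv=False))         # The smallest singular value is essentially zero
print(np.linalg.matrix_rank(X_bad))            # p - 1 rather than p: not full rank
print(np.linalg.cond(np.dot(X_bad.T, X_bad)))  # Enormous condition number: X^T X cannot be inverted in practice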
Let's make that a function, which we can reuse:
def ols(X):
    """
    The matrix which solves the OLS regression problem for full-rank X
    """
    return np.dot(la.pinv(np.dot(X.T, X)), X.T)
Let's compute $\hat{\beta}$, using this function:
beta_hat = np.dot(ols(X), y)
We display the estimate, $\hat{\beta}$ against the actual parameters, $\beta$:
plt.plot(beta, beta_hat, 'o')
plt.xlabel('Real parameter value')
plt.ylabel('Estimated parameter value')
Using $\bf{\hat{\beta}}$, we can calculate back $\bf{\hat{y}}$, the estimate of the output and compare it to the actual output:
$\bf{\hat{y}} = \bf{X} \bf{\hat{\beta}}$
y_hat = np.dot(X, beta_hat)
fig, ax = plt.subplots(1)
ax.plot(y, label='$y$')
ax.plot(y_hat, label='$\hat{y}$')
plt.legend()
The so-called "hat matrix", or $H$, is the matrix which defines the transformation from the measurements $\bf{y}$ to the estimate $\bf{\hat{y}}$, so it's the matrix that "puts the hat" on $\bf{y}$:
$H = \bf{X}(\bf{X^T}\bf{X})^{-1} \bf{X^T}$
H = np.dot(X, ols(X))
H.shape
y_hat = np.dot(H, y)  # H is symmetric, so this is equivalent to np.dot(H.T, y)
plt.plot(y, y_hat, 'o')
plt.xlabel('The original output')
plt.ylabel('Linear estimate of the output')
The OLS estimate is unbiased: if we simulate many noisy measurements from the same underlying $f$ and average the resulting estimates $\bf{\hat{y}}$ across simulations, the average recovers $f$:
y_hat = []
for ii in range(1000):
    y = f + np.random.randn(*f.shape)
    y_hat.append(np.dot(H, y))

plt.plot(f)
plt.plot(np.mean(y_hat, 0))
Bias can occur, for example, when the assumptions of the OLS are violated. Here, we draw the noise from a different distribution, a uniform distribution whose mean is not zero:
$\epsilon \sim U(0,1)$
y_hat = []
for ii in range(1000):
    y = f + np.random.rand(*f.shape)
    y_hat.append(np.dot(H, y))

plt.plot(f)
plt.plot(np.mean(y_hat, 0))
# The standard deviation of the estimates across noise samples:
plt.plot(np.std(y_hat, 0))
In general, there is a trade-off between bias and variance. Models with a lot of flexibility tend to be unbiased in the long run, but they will also tend to fit the idiosyncratic noise in the sample too well, so they will be very different depending on the particular noise in that sample.
The OLS solution is unbiased (when the noise is zero-mean, as in the Gaussian case above), but it has relatively high variance.
Shrinkage methods are methods of fitting a linear model that reduce the variance of the estimates. We will look at one of these methods: ridge regression.
Ridge regression regularizes the fit by optimizing a variant of the RSS:
$\hat{\beta}^{ridge} = \underset{\beta}{\operatorname{argmin}} \left[ \sum_{i=1}^{N} \left(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\right)^2 + \lambda \sum_{j=1}^{p}{\beta_j^2} \right]$
Where $\lambda \geq 0$ is a complexity parameter that controls the amount of "shrinkage" of the model: it penalizes the sum of squares of the parameter values, and thereby controls the bias-variance trade-off.
The solution to this problem is:
$\hat{\beta}^{ridge} = (\bf{X}^T\bf{X} + \lambda \bf{I})^{-1} \bf{X}^T y$
Let's write a function to do that:
def ridge_regression_matrix(X, L):
    """
    Calculate the matrix for ridge regression

    Parameters
    ----------
    X : 2d array
        The design matrix
    L : float
        ridge parameter for regularization
    """
    return np.dot(la.pinv(np.dot(X.T, X) + L * np.eye(X.shape[-1])), X.T)
beta_ridge = np.dot(ridge_regression_matrix(X,10), y)
fig, ax = plt.subplots(1)
ax.plot(beta_hat, beta_ridge, 'o')
ax.plot([-2, 2],[-2, 2],'k--')
ax.plot([-2, 2],[0, 0],'k--')
ax.plot([0, 0],[-2, 2],'k--')
ax.set_xlabel(r'$\beta(OLS)$')
ax.set_ylabel(r'$\beta(Ridge)$')
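To see how $\lambda$ controls the amount of shrinkage, we can sweep it over several orders of magnitude and watch the coefficient estimates shrink toward zero (an illustrative sketch using the ridge_regression_matrix function defined above; the particular $\lambda$ values are arbitrary):

lambdas = [0, 1, 10, 100, 1000]
fig, ax = plt.subplots(1)
for this_lambda in lambdas:
    # With lambda = 0 this reduces to the OLS solution; larger lambda shrinks the coefficients
    beta_l = np.dot(ridge_regression_matrix(X, this_lambda), y)
    ax.plot(beta_l, 'o-', label=r'$\lambda$ = %s' % this_lambda)
ax.set_xlabel('Coefficient index')
ax.set_ylabel(r'$\hat{\beta}^{ridge}$')
ax.legend()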
H_ridge = np.dot(X, ridge_regression_matrix(X, 10))
y_hat = []
y_hat_ridge = []
for ii in range(1000):
    y = f + np.random.randn(*f.shape)
    y_hat.append(np.dot(H, y))
    y_hat_ridge.append(np.dot(H_ridge, y))

plt.plot(f)
plt.plot(np.mean(y_hat, 0))
plt.plot(np.mean(y_hat_ridge, 0))
# The ridge estimates have lower variance across noise samples than the OLS estimates:
plt.plot(np.std(y_hat, 0))
plt.plot(np.std(y_hat_ridge, 0))
In this segment, we looked at the basics of linear regression. Linear models are very popular in neuroscience (and science in general) because, as discussed above, they allow us to explain the data we have, to predict data we haven't collected, and to quantify our confidence in specific theories.
We have shown here an approach to solving a linear model, using the 'ordinary least squares' solution. We have introduced a tool for examining and analyzing the inputs, or design matrix: the singular value decomposition. Finally, we have looked at a shrinkage method that reduces the variance of the parameter estimates: 'ridge regression'.
As an aside, using our SVD from before, we can rewrite the OLS solution:
$X\hat{\beta} = \bf{X}(\bf{X}^T\bf{X})^{-1} \bf{X}^T y = \bf{UU}^T y $
x_beta_hat = np.dot(np.dot(U, U.T), y)
plt.plot(y)
plt.plot(x_beta_hat)
Where $\bf{U}^T y$ are the coordinates of $y$ with respect to the orthonormal basis $\bf{U}$
y_coords_in_U = np.dot(U.T, y)
plt.plot(y_coords_in_U, beta_hat, '.')
plt.plot(U)
plt.matshow(np.dot(U.T, U))  # Shows that U is an orthonormal basis set
Another way to rewrite the ridge regression, using the results of the SVD:
$\bf{X} \hat{\beta} ^ {ridge} = U D (D^2 + \lambda I)^{-1} D U^T y $
D_sq = np.dot(D, D)
lamb = 10
UD = np.dot(U,D)
Dsq_LI_inv = la.pinv(D_sq + lamb*np.eye(D_sq.shape[0]))
DUt = np.dot(D, U.T)
svdRR = np.dot(np.dot(UD, Dsq_LI_inv), DUt)
x_b_ridge_svd = np.dot(svdRR, y)
plt.plot(f)
plt.plot(x_b_ridge_svd)