This notebook comprises two sections. The first is an introduction to linear regression by Ariel; the second takes an example from fMRI data and explains t and F statistics.
It builds upon the following chapter: http://www.fil.ion.ucl.ac.uk/spm/doc/books/hbf2/pdfs/Ch8.pdf in the book Human Brain Function (2nd edition), and upon the review in Neuroimage (Poline and Brett, 2012).
In the first part of this notebook, we explain the General Linear Model in a simple case, assuming that we are analyzing only one region of interest or one voxel.
In the second part, we take an example (cf. the "principle of stat" notebook) and run the analysis at each and every voxel.
This section is based in part on chapter 3 of "The Elements of Statistical Learning" by Hastie, Tibshirani and Friedman.
Models are mathematical constructs used to explain measured data. In cognitive neuroscience, models are usually used to quantitatively describe specific parts of the nervous system (e.g. neurons, voxels, etc.). Typically, we seek to understand changes in the level of a signal measured from the unit (e.g. spike rate, or BOLD response), as a function of the inputs into the system (e.g. stimulus presented, cognitive state of the participant, etc.). Accurate and precise mathematical models derived from the data are useful in the following ways:
They allow you to explain the data you have: summary statistics of the data, such as the mean of the responses of a voxel, or the variance of this variable, are often not that informative (especially if normalized...). A model can give you a good explanation of the data you have observed, in terms of background knowledge, or theory.
They allow you to predict data you haven't collected: in many cases, assuming that the parameters don't change with changes in the independent variables, models allow us to interpolate and predict what the dependent variables would have been at other values of the independent variables. They even let us extrapolate to conditions that have not been measured, for example because measuring them is impractical.
They allow you to specify your confidence in a specific theory: By performing some form of statistical testing, you can specify your confidence level in the veracity of statements about responses to a particular condition, or differential responses in different conditions.
Assume that we make a measurement $\bf{y}$. Here bold-face indicates that we are talking about a measurement vector. Say, of length $N$. This measurement is a response in our unit-of-interest to some inputs: $(\bf{x}_1, \bf{x}_2, ... \bf{x}_p)$. Each one of the $\bf{x}_i$ vectors also has $N$ elements. We will make this more concrete later on, but for now, you can think of the $\bf{X}$ as representing different input streams. For example, different locations on the screen.
A linear model assumes that the expected value of $\bf{y}$ given a specific input $\bf{X}$ can be expressed as a linear sum of the $\bf{x}_i$. That is: $E(\bf{Y}|\bf{X})$ is linear in the inputs.
Let's break that down a bit. All this means is that in the noise-less case $\bf{y}$ is a function of the inputs, $f(x)$, such that there exist $\beta_i$ which satisfy:
$f(x) = \beta_0 + \beta_1 \bf{x}_1 + \beta_2 \bf{x}_2 + ... + \beta_p \bf{x}_p$
You will notice that bivariate linear regression, which you are probably familiar with, is simply the special case with $p = 1$.
import numpy as np
import matplotlib.pyplot as plt
from numpy import pi

def linear_function(x, a, b):
    """
    A linear function

    Parameters
    ----------
    x : ndarray
        Input variable
    a : float
        offset parameter
    b : float
        slope parameter
    """
    return a + b * x
N = 100
x = np.linspace(-pi, pi, N)
a = np.random.randn() * 10
b = np.random.randn() * 10
y = linear_function(x, a, b)
fig, ax = plt.subplots(1)
ax.plot(x, y, '.')
ax.set_xlabel('x')
ax.set_ylabel('y')
t = np.linspace(-pi, pi, N)
x = np.sin(t)
fig, ax = plt.subplots(1)
ax.plot(x)
y = linear_function(x, a, b)
fig, ax = plt.subplots(1)
ax.plot(y, '.')
fig, ax = plt.subplots(1)
ax.plot(x, y, '.')
ax.set_xlabel('x')
ax.set_ylabel('y')
Now, let's generalize this to $p$ inputs, where we choose $p=10$
For example, let's consider an output which is a linear combination of $p$ sine waves.
To account for $\beta_0$, we make sure that the first column is all 1's:
p = 10  # Number of inputs
# Preallocate a matrix (make it a matrix of ones to account for beta_0):
X = np.ones((N, p))
for ii in range(1, p):
    X[:, ii] = np.sin((ii + 1) * t)
fig, ax = plt.subplots(1)
ax.plot(t,X)
ax.set_ylim([-1.1, 1.1])
ax.set_xlim([-pi, pi])
ax.set_xlabel('t')
ax.set_ylabel('x')
def linear_model(X, beta):
    """
    The linear model

    Parameters
    ----------
    X : 2d array
        The design matrix: a matrix of regressors
    beta : 1d array
        Model coefficients
    """
    return np.dot(X, beta)
where np.dot performs a matrix multiplication:
$\bf{X}\beta = \beta_0 + \beta_1 \bf{x}_1 + \beta_2 \bf{x}_2 + ... + \beta_p \bf{x}_p $
The multiplication of two matrices $\bf{A}$ and $\bf{B}$ is denoted $\bf{AB}$. This is only defined if the inner dimensions match: $\bf{A}$ must be $m$ by $n$ and $\bf{B}$ must be $n$ by $k$. In that case, element $i,j$ of the product is defined as:
$(\bf{AB})_{ij} = \sum_{l=1}^{n} A_{il} B_{lj}$
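To convince ourselves that `np.dot` implements exactly this double-indexed sum, here is a quick check (an illustration added here, not part of the original notebook):

```python
import numpy as np

# A is m by n, B is n by k: the product is defined because the
# inner dimensions (n) match.
m, n, k = 3, 4, 2
np.random.seed(0)
A = np.random.randn(m, n)
B = np.random.randn(n, k)

# Element (i, j) of AB is the sum over the shared index l:
AB = np.zeros((m, k))
for i in range(m):
    for j in range(k):
        AB[i, j] = sum(A[i, l] * B[l, j] for l in range(n))

print(np.allclose(AB, np.dot(A, B)))  # True
```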
We choose a set of random $\beta$ coefficients and generate $f(x)$
beta = np.random.randn(p)
f = linear_model(X, beta)
fig, ax = plt.subplots(1)
ax.plot(t, f)
ax.set_xlim([-pi, pi])
In this case, $\bf{y}$ is not going to be linear in any particular $\bf{x}_i$, only in a combination of them:
fig, ax = plt.subplots(1)
ax.plot(X[:, 1], f, '.')
Unfortunately, the noise-less case is not a very realistic one. Let's assume that the output is additionally corrupted by noise, $\epsilon$:
$\bf{y} = f(\bf{x}) + \epsilon$
which is distributed according to a Gaussian/normal distribution with zero mean:
$\epsilon \sim \mathcal{N}(0, \sigma^2)$
y = f + 3*np.random.randn(*f.shape)
fig, ax = plt.subplots(1)
ax.plot(f, label='f')
ax.plot(y, label='y')
plt.legend()
Solving the linear model means finding an estimate of the $\bf{\beta}$ coefficients that generated the data we observe.
In other words, we are looking for a procedure to estimate the coefficients (we call the estimated coefficients $\bf{\hat{\beta}}$).
A 'good' estimate of these coefficients would produce an estimate of the output, $\bf{\hat{y}}$, which would be close to the measured data points.
Where:
$\bf{\hat{y}} = \bf{ X\hat{\beta}}$
One way to define this is by saying that we want the residual sum of squares to be small, where the residuals are:
$\bf{y}-\bf{\hat{y}}$
and the residual sum of squares (or RSS) is:
$RSS = \sum_{i=1}^{N}(y_i - \hat{y_i})^2$
Equivalently, we seek to minimize the sum of squares of our residuals, $\sum_i \hat\epsilon_i^2 = \sum_i{(y_i - \hat{y}_i)^2}$; this is known as ordinary least squares (OLS).
Since $\bf{\hat{y}}$ is a function of $\bf{\beta}$, we can think of $RSS$ as a function of $\beta$ and we want to optimize it with respect to a choice of the elements of $\beta$. That is, we want to find a minimum of $RSS$ with respect to $\beta$.
Mathematically speaking, we want to find a value of $\bf{\hat{\beta}}$ for which the first derivative of $RSS$ with respect to $\beta$ is 0
To find the derivative, we re-write $RSS$ in the following way:
$RSS(\beta) = (\bf{y} - X\beta)^T (\bf{y} - X\beta)$
The derivative of this function is (we leave the proof as an exercise to the reader):
$\frac{\partial RSS}{\partial\beta} = -2 \bf{X}^T (\bf{y} - \bf{X}\beta)$
Setting the first derivative to 0 and dividing both sides by 2, we obtain
$\bf{X}^T (y-\bf{X}\beta) = 0$
$\bf{X}^T y - \bf{X}^T \bf{X} \beta = 0$
$\Rightarrow \bf{X}^T y = \bf{X}^T \bf{X} \beta$
If $\bf{X}$ is 'full rank', then $\bf{X}^T\bf{X}$ is positive definite and can be inverted, giving the OLS solution: $\hat{\beta} = (\bf{X}^T \bf{X})^{-1} \bf{X}^T \bf{y}$.
import scipy.linalg as lin

def ols(X):
    """
    Return the matrix which solves the OLS regression problem for full-rank X
    """
    return np.dot(lin.pinv(np.dot(X.T, X)), X.T)
beta_hat = np.dot(ols(X), y)
plt.plot(beta, beta_hat, 'o')
plt.xlabel('Real parameter value')
plt.ylabel('Estimated parameter value')
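As a sanity check (added here, not from the original notebook), the closed-form OLS estimate should agree with numpy's own least-squares solver, `np.linalg.lstsq`, on the same kind of sinusoidal design:

```python
import numpy as np

np.random.seed(0)
N, p = 100, 10
t = np.linspace(-np.pi, np.pi, N)
X = np.ones((N, p))
for ii in range(1, p):
    X[:, ii] = np.sin((ii + 1) * t)
beta = np.random.randn(p)
y = np.dot(X, beta) + 3 * np.random.randn(N)

# closed-form OLS, mirroring the ols() helper above
beta_closed = np.dot(np.dot(np.linalg.pinv(np.dot(X.T, X)), X.T), y)
# numpy's least-squares solver
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(beta_closed, beta_lstsq))  # True
```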
Using $\bf{\hat{\beta}}$, we can calculate back $\bf{\hat{y}}$, the estimate of the output and compare it to the actual output:
$\bf{\hat{y}} = \bf{X} \bf{\hat{\beta}}$
y_hat = np.dot(X, beta_hat)
fig, ax = plt.subplots(1)
ax.plot(y, label='$y$')
ax.plot(y_hat, label='$\hat{y}$')
plt.legend()
The so-called "hat matrix", or $H$, is the matrix which maps the data $\bf{y}$ to the estimate $\bf{\hat{y}}$, so it's the matrix that "puts the hat" on $\bf{y}$:
$H = \bf{X}(\bf{X^T}\bf{X})^{-1} \bf{X^T}$
H = np.dot(X, ols(X))
H.shape
y_hat = np.dot(H, y)  # H is symmetric, so H.T would give the same result
plt.plot(y, y_hat, 'o')
plt.xlabel('The original output')
plt.ylabel('Linear estimate of the output')
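A small verification (an addition, not in the original notebook): the hat matrix is symmetric and idempotent, and its trace equals the rank of the design matrix:

```python
import numpy as np

np.random.seed(1)
X = np.random.randn(20, 3)  # a full-rank toy design

# the hat matrix H = X (X^T X)^{-1} X^T
H = X.dot(np.linalg.pinv(X.T.dot(X))).dot(X.T)

print(np.allclose(H, H.T))       # symmetric: True
print(np.allclose(H.dot(H), H))  # idempotent (projecting twice changes nothing): True
print(round(np.trace(H)))        # trace equals rank(X): 3
```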
import nibabel as nib
import os
from os.path import join as pjoin

this_course = 'pna'

def load_images(course):
    if course == 'pna':
        PNA_DATA_PATH = pjoin(os.path.expanduser('~'), 'data', 'qcpna', 'bolddata')
        img_fname = pjoin(PNA_DATA_PATH, 'smallbold.nii.gz')
        if os.path.isfile(img_fname):
            img = nib.load(img_fname)
            print(img.shape)
            return img
        else:
            print("where is this image: " + img_fname + "\n")
            return None
    else:
        files_1 = [pjoin('ds107', 'sub' + "%03d" % i, 'BOLD', 'task001_run001', 'meanabold.nii') for i in range(1, 15)]
        print([(f, os.path.isfile(f)) for f in files_1])
        #files = [os.path.join(scans_dir, 'snn03055dy%i.img' % i) for i in range(1,13)]
        files = [f for f in files_1 if os.path.isfile(f)]
        return [nib.load(f) for f in files]
img = load_images('pna')
(64, 64, 35, 42)
Once we have these images, the first step is to extract the values of one voxel. Spatially normalized images have an affine transform that allows us to compute the position, in template space, of an (i,j,k) index of the image. If we have an (x,y,z) position in template space, we need to invert this transform to get the index of the closest voxel.
Now we load the affine transformation from the first image (it's the same for all), pick a voxel in the MNI space, and compute the conversion parameters from MNI space (in mm) to voxels space (indices of the 3D array).
# transformation from voxel indices to mm
try:
    M = img.get_affine()
except AttributeError:  # img is a list of images
    M = img[0].get_affine()
# mm to voxel indices
iM = np.linalg.inv(M)
# coordinates of the voxel of interest in mm (MNI space)
posmm = [-20.0, -42, 34]
(x, y, z) = tuple(posmm)
# coordinates in voxel space (in homogeneous coordinates)
posvox = np.dot(iM, posmm + [1])
# We grab the spatial part of the output. Since we want to use it as an
# index, we need to make it a tuple of ints
i, j, k = tuple(np.round(posvox[:3]).astype(int))
print(i, j, k)
vdata = img.get_data()[i, j, k + 2, :]
24 20 26
Let's display the values of this voxel - and display a slice of the 4D volume.
import matplotlib.pyplot as plt
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(14,4))
ax1.plot(vdata); ax1.set_title('the time series')
ax2.set_title('the slice at z = %d' %z)
# time t = 0
t = 0
ax2.imshow(img.get_data()[:, :, z, t], interpolation='nearest')
We will model the data using the GLM, then ask questions about the parameters of this model - a concise and fairly general framework for asking questions of our data.
$$ Y = X \beta + \epsilon $$

In a first example, we take $X$ to contain two columns. Let's assume for a moment that our data come from an experiment with 12 subjects in which odd scans are from condition 1 (or group 1, if each scan is from a different subject) and even scans are from group 2. The first column of $X$ will model this, with $x_{i1} = 1$ when $i$ is odd, and $x_{i1} = -1$ when $i$ is even. The second column will be a column of ones, modelling a constant offset. We denote by $\epsilon_i$ the noise at each measurement $y_i$ under this model.
In the equation above, we know the model $X$ (or at least we assume this model), and we need to estimate the $\beta$ and the $\epsilon$. To estimate these, the simplest approach is to first multiply the left and right hand sides of the equation by $X^T$, the transpose of $X$, to get:

$$ X^T Y = X^T X \beta + X^T \epsilon $$

In the above equation, we search for an estimate of the noise ($\hat\epsilon$) that is uncorrelated with our explanatory variables (the constant vector of ones, and the $x_{i1}$). In other words, in matrix terms, such that $X^T \hat\epsilon = 0$. In terms of the estimates $\hat\beta$ and $\hat\epsilon$, we therefore have:

$$ X^T Y = X^T X \hat\beta $$

This equation is called the "normal" equation. Now, if $X^T X$ is invertible, we have:

$$ \hat\beta = (X^T X)^{-1} X^T Y $$
If $X^TX$ is not invertible (which will happen if some of the explanatory variables can be computed as linear combinations of the others), we can always take a pseudo-inverse of $X^TX$, denoted for instance by $(X^TX)^+$, and compute $\hat\beta$ with $\hat\beta = (X^T X )^{+} X^T Y $.
A full description of the pseudo-inverse (here, we assume the Moore-Penrose pseudo-inverse) is out of scope here, but we can think of it as acting like the inverse of a matrix, such that:
$X X^+ X = X$, and $X^+ X X^+ = X^+$, and both $X X^+$ and $X^+ X$ are symmetric.
If $X^+$ is the pseudo-inverse of $X$, we have $\hat{\beta} = X^+ Y$ (and $(X^TX)^+X^T$ is also a pseudo-inverse of $X$).
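As a quick numerical check (added here, not part of the original notebook), `np.linalg.pinv` satisfies these Moore-Penrose conditions even for a rank-deficient design:

```python
import numpy as np

np.random.seed(2)
# a rank-deficient design: the third column is the sum of the first two
X = np.random.randn(12, 2)
X = np.hstack((X, X.sum(axis=1, keepdims=True)))
Xp = np.linalg.pinv(X)

# the four Moore-Penrose conditions
print(np.allclose(X.dot(Xp).dot(X), X))      # True
print(np.allclose(Xp.dot(X).dot(Xp), Xp))    # True
print(np.allclose(X.dot(Xp), X.dot(Xp).T))   # X X+ symmetric: True
print(np.allclose(Xp.dot(X), Xp.dot(X).T))   # X+ X symmetric: True
```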
In any case, we can compute our best estimates for $Y$ given the model $X$ with : $\widehat{Y} = X \hat\beta$, and our estimated noise or residuals by $\hat\epsilon = Y - \widehat{Y} = Y - X\hat\beta $.
x0 = np.asarray([1, -1, 1, -1, 1, -1, 1, -1, 1, -1, 1, -1])
x1 = np.ones(12)
X = np.vstack((x0, x1)).T
print(X)
X_firstmodel = X.copy()
[[ 1.  1.]
 [-1.  1.]
 [ 1.  1.]
 [-1.  1.]
 [ 1.  1.]
 [-1.  1.]
 [ 1.  1.]
 [-1.  1.]
 [ 1.  1.]
 [-1.  1.]
 [ 1.  1.]
 [-1.  1.]]
from scipy import linalg as lin
# take only the first 12 values of the data
# scale the voxel data for better visual display: make its mean == 3.0
Y = vdata[:12]
Y = (Y - Y.mean()) / np.std(Y) + 3.0
vdata = Y
pinvX = lin.pinv(X)
betah = pinvX.dot(Y)
Yfitted = X.dot(betah)
resid = Y - Yfitted
print("Y:\n", Y, "\nresid:\n", resid, "\nbetah:\n", betah)
print("mean of Y: ", np.mean(Y), "\t mean of resid: ", np.mean(resid))
# make this a little function as we will be reusing it:
def glm(X, Y):
    """ A simple GLM function returning the estimated parameters and residuals """
    betah = lin.pinv(X).dot(Y)
    Yfitted = X.dot(betah)
    resid = Y - Yfitted
    return betah, Yfitted, resid
Y:
[ 3.71749728  2.35802875  3.80812852  3.2643411   5.07696582  3.80812852
  2.35802875  2.26739751  2.17676628  3.62686605  2.17676628  1.36108516]
resid:
[ 0.4984718  -0.42294577  0.58910303  0.48336659  1.85794033  1.027154
 -0.86099674 -0.513577   -1.04225921  0.84589153 -1.04225921 -1.41988936]
betah:
[ 0.21902549  3.        ]
mean of Y:  3.0 	 mean of resid:  1.62832710278e-15
# Plot the results
x = range(12)
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(12,4))
ax1.plot(x, Y, 'r-', x, Yfitted, 'b-.')
ax2.plot(x, resid, 'g--')
# Again, we will want to reuse this code, let's make a tiny function:
def plot_glm(Y, Yf, r):
    x = range(Y.shape[0])
    f, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
    ax1.plot(x, Y, 'r-', x, Yf, 'b-.')
    ax1.set_title('Y (red) Y fitted (blue)')
    ax2.plot(x, r, 'g--')
    ax2.set_title('residuals')
Another way to compute the residuals is with the so-called residual-forming matrix $R$:
$$ \hat\epsilon = Y - \widehat{Y} = Y - X \hat\beta = Y - X (X^T X)^+ X^T Y = (I_n - X (X^T X)^+ X^T)Y = R Y $$

where $I_n$ is the $n$ by $n$ identity matrix.
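We can verify numerically (a check added here, with an arbitrary toy design of the same shape as above) that $RY$ reproduces the residuals and that $R$ is symmetric and idempotent:

```python
import numpy as np

np.random.seed(3)
n = 12
# a toy two-column design like the one above: alternating +1/-1 and a constant
X = np.vstack((np.tile([1., -1.], 6), np.ones(n))).T
Y = np.random.randn(n)

R = np.eye(n) - X.dot(np.linalg.pinv(X))
resid = Y - X.dot(np.linalg.pinv(X).dot(Y))

print(np.allclose(R.dot(Y), resid))  # R Y gives the residuals: True
print(np.allclose(R, R.T))           # symmetric: True
print(np.allclose(R.dot(R), R))      # idempotent (a projector): True
```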
Looking at our results, it seems that there's not much in the first regressor. We have $\hat{Y} = 0.219\, x_0 + 3.0\, x_1$. We can check that 1) the residuals have mean zero (they are uncorrelated with the constant regressor $x_1$), and 2) the mean of the $Y$s is equal to $\hat\beta_1$, which is the case because $x_0$ has mean zero. We could also check that the residuals are uncorrelated with $\bf{x}_0$.
While the first coefficient is small, especially with respect to the residuals, we still have to see whether this coefficient is "significantly" different from zero. This is the topic of our next section.
But before going into testing, let's make our example a little more general. Let's assume that our scans are also divided into another two-group split, where the first 6 scans are of one condition (or group) and the last six of another. We could model this with a vector $\bf x = [\textrm{1 1 1 1 1 1 -1 -1 -1 -1 -1 -1}]^T$, which models the difference between the first and the last 6 scans. Another solution is to include two regressors, one for the mean of the first six scans and one for the mean of the last six. Let's adopt this parametrisation of the model.
vdata = vdata[:12]
Y = vdata * 1.
print(np.mean(Y))
x0 = np.asarray([1, -1, 1, -1, 1, -1, 1, -1, 1, -1, 1, -1])
x1 = np.hstack((np.ones(6),np.zeros(6)))
x2 = np.hstack((np.zeros(6),np.ones(6)))
x3 = np.ones(12)
X = np.vstack((x0,x1,x2,x3)).T
print(X)
second_X = X
3.0
[[ 1.  1.  0.  1.]
 [-1.  1.  0.  1.]
 [ 1.  1.  0.  1.]
 [-1.  1.  0.  1.]
 [ 1.  1.  0.  1.]
 [-1.  1.  0.  1.]
 [ 1.  0.  1.  1.]
 [-1.  0.  1.  1.]
 [ 1.  0.  1.  1.]
 [-1.  0.  1.  1.]
 [ 1.  0.  1.  1.]
 [-1.  0.  1.  1.]]
# Add some signal to our voxel, for the sake of the example:
Y += 0.5 * np.mean(Y) * x1 - 0.5 * np.mean(Y) * x2
print(Y, np.mean(Y))
[ 5.21749728 3.85802875 5.30812852 4.7643411 6.57696582 5.30812852 0.85802875 0.76739751 0.67676628 2.12686605 0.67676628 -0.13891484] 3.0
Let's fit our model for the new model and data.
beta, Yfitted, resid = glm(X, Y)
print(beta, np.mean(Y))
# (beta[1]/2 + beta[2]/2 + beta[3])
[ 0.21902549 3.17218166 -1.17218166 2. ] 3.0
# Plot the results
plot_glm(Y, Yfitted, resid)
Conclusion: we see that the fitted data (blue) are fairly close to the actual data - the residuals are small compared to the size of the signal.
Let's see what results we would have had with the first model (which doesn't have the regressor modelling the "on-off" shape), but with the new data that contain this signal:
print(X_firstmodel)
[[ 1.  1.]
 [-1.  1.]
 [ 1.  1.]
 [-1.  1.]
 [ 1.  1.]
 [-1.  1.]
 [ 1.  1.]
 [-1.  1.]
 [ 1.  1.]
 [-1.  1.]
 [ 1.  1.]
 [-1.  1.]]
glm_result = glm(X_firstmodel, Y)
plot_glm(Y, glm_result[1], glm_result[2])
print(glm_result[0])  # the betas
[ 0.21902549 3. ]
First, clearly the fitted data (blue) are far from our actual data. The residuals are large, and resemble the data (up to the shift in the mean).
You may notice that the betas estimated with these new data (to which we added the "step" function) are exactly the same as the ones estimated on the original data. Why? Because we added to $Y$ something that is completely uncorrelated with the model X_firstmodel. No regressor in this first model can pick it up, therefore the betas are unchanged.
Using the previous model, let's add to the model a regressor $\bf{x_2}$ that is uncorrelated with each and every column of $X$ (i.e. $\bf{x_0}^T\bf{x_2} = 0$ and $\bf{x_1}^T\bf{x_2} = 0$). With $\bf{x_2}$ in the third position, $\hat\beta_0$ and $\hat\beta_1$ should be unchanged. Let's check:
X = X_firstmodel
x2 = np.hstack((np.ones(6), -np.ones(6)))
X_ = np.vstack((X.T, x2)).T
print(X_.T.dot(X_))
glm_result = glm(X_, Y)
plot_glm(Y, glm_result[1], glm_result[2])
print(glm_result[0])  # the betas
[[ 12.   0.   0.]
 [  0.  12.   0.]
 [  0.   0.  12.]]
[ 0.21902549  3.          2.17218166]
What if there is some correlation between the regressors already in the model and the added regressor? Then the parameters $\beta$ will change. The critical piece of information is what is or is not in the model - its "flexibility".
What happens if we have too many things in the model? Two cases:
* Those things capture some of the noise (the residual variance decreases)
* Those things cannot capture any of the noise (the residual variance doesn't decrease)
In the first case, we will be overfitting. In the second case, we simply lose some degrees of freedom.
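These two observations can be sketched numerically (a small illustration added here, with made-up data): adding columns can only decrease the RSS, but each independent column costs one degree of freedom:

```python
import numpy as np

np.random.seed(4)
n = 12
X = np.vstack((np.tile([1., -1.], 6), np.ones(n))).T
Y = X.dot(np.array([0.5, 3.0])) + np.random.randn(n)

def rss(X, Y):
    """Residual sum of squares of the OLS fit of Y on X."""
    resid = Y - X.dot(np.linalg.pinv(X).dot(Y))
    return resid.dot(resid)

# add two random regressors that have nothing to do with the signal
X_big = np.hstack((X, np.random.randn(n, 2)))

# the RSS can only go down when columns are added...
print(rss(X_big, Y) <= rss(X, Y))  # True
# ...but we pay in degrees of freedom, n - rank(X):
print(n - np.linalg.matrix_rank(X), n - np.linalg.matrix_rank(X_big))  # 10 8
```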
Let's now see how we test the parameters that we have just estimated.
Let's go back to our second model. The model and the data can be displayed as little images:
X = second_X
f, (a1, a2, a3) = plt.subplots(1, 3, figsize=(10, 3))
a1.imshow(Y[:, np.newaxis], interpolation='nearest', cmap='gray')
a2.imshow(X, interpolation='nearest', cmap='gray')
a3.imshow(resid[:, np.newaxis], interpolation='nearest', cmap='gray')
titles = ['Y', 'X', 'e']
for ix, a in enumerate([a1, a2, a3]):
    a.set_xticklabels([])
    a.set_title(titles[ix])
    #a.set_yticklabels([])
There are two main statistics to understand with the GLM: the t- and the F- tests. First, we have to assume that our data $Y$, or equivalently the residuals $\epsilon$, are normally distributed. We write $Y \sim N(X\beta, \sigma^2)$, or, $\epsilon \sim N(0, \sigma^2)$.
Most often the t-test is used to answer the question of whether a linear combination of the estimated parameters is zero. For instance, let's take $\beta_1 - \beta_2$ in the model above. This can be written as $c^T \beta$ where $c^T = [0, 1, -1, 0]$. To see if this quantity is too high (or too low) to be expected if we had only noise in the data (so that we can reject the hypothesis that it is zero), we compute the t statistic:
$$ t = \frac{c^T \hat\beta}{\sqrt{\mathrm{var}(c^T \hat\beta)}} $$

It can be shown that $t$ follows a Student distribution with degrees of freedom $df$ that we will compute below.
Let's first compute $ \mathrm{var}(c^T \hat\beta) $. We have
$ \mathrm{var}(c^T \hat\beta) = \mathrm{var}(c^T X^+ Y) = c^T X^+ \mathrm{var}(Y) X^{+T} c = \sigma^2 c^T X^+ X^{+T} c = \sigma^2 c^T (X^T X)^+ c $.
To compute the t-value, we need an estimate of $\sigma^2$. We don't have access to the true $\epsilon$, but we can get an estimate using the residuals $\hat\epsilon$. We therefore compute the residual sum of squares (RSS) with
$$ \textrm{RSS} = \sum_{i=1}^{12} (Y_i - \widehat{Y_i})^2 = \sum_{i=1}^{12} (Y_i - X_i \hat\beta)^2 = (Y - X\hat\beta)^T(Y - X\hat\beta) = (Y - X X^+ Y)^T(Y - X X^+Y) = (Y^T R^T) R Y = Y^T R Y $$

where $X_i$ is the $i$th row of the matrix $X$. To obtain the estimate of $\sigma$, we compute the expectation of RSS. Here, a couple of mathematical tricks are useful, involving the trace of a matrix (the sum of its diagonal elements). In short, playing with the trace and the expectation, we get:
$$ E(\textrm{RSS}) = E(Y^T R Y) = E(tr(Y^T R Y)) = E(tr(R Y Y^T)) = tr(R E(YY^T)) = tr(\sigma^2 R) = \sigma^2 (n - \textrm{rank}(X)) $$

It is not necessary to follow every step of this derivation, but it is worth understanding its essence. In words, if we were to compute the average value of the residual sum of squares over many experiments, this sum of squares would be less than the sum of squares of the true noise $\epsilon$, and the difference depends on the rank of $X$. The rank of $X$ is the minimal number of independent columns needed to construct $X$. Here, while we have 4 columns in $X$, $\textrm{rank}(X) = 3$ because we need only columns 1, 2, and 3 (or 1, 2, and 4) to fully reconstruct $X$. The more independent parameters we estimate, the fewer degrees of freedom remain to estimate the variance of $\epsilon$.
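The rank claim is easy to check numerically (a quick check added here, not in the original notebook) on the four-column design above:

```python
import numpy as np

x0 = np.asarray([1, -1] * 6)
x1 = np.hstack((np.ones(6), np.zeros(6)))
x2 = np.hstack((np.zeros(6), np.ones(6)))
x3 = np.ones(12)
X = np.vstack((x0, x1, x2, x3)).T

# four columns, but x1 + x2 == x3, so only three are independent
print(np.linalg.matrix_rank(X))  # 3
```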
Following the equation above, and letting $p = \textrm{rank}(X)$, we can estimate $\sigma^2$ with $\hat\sigma^2$ where:

$$\hat\sigma^2 = \textrm{RSS} / (n - p)$$

This estimate is also called the mean residual sum of squares (MRSS): $\hat\sigma^2 = \textrm{MRSS} = \textrm{RSS} / \it{df}$, where $\it{df} = n - p = 12 - 3$ is the number of observations (here, 12) minus the rank of the design matrix (here, 3). In other words, this is our estimate of the noise variance.
Coming back to our t-statistic, we can now calculate the standard error (SE) of $c^T \hat\beta$ with $\textrm{SE} = \sqrt{\hat{\sigma}^2 c^T (X^T X)^+ c}$.

The t-test is then: $t = {c^T \hat\beta}/{\textrm{SE}}$

Clearly, for $t$ to be big, we need the SE to be small and/or $c^T \hat\beta$ to be large. Notice that the SE is composed of two elements: $\hat{\sigma}^2$ on the one hand, and $c^T (X^T X)^+ c$ on the other. If the noise variance is high, we have less chance to detect an effect. The second factor reflects how much the effect we are testing is correlated with other elements in the design matrix. If, for instance, we are testing for $\beta_0$ but we have a column in $X$ which is highly correlated with $\bf{x_0}$, then $c^T (X^T X)^+ c$ will be large and we again have less chance to detect the effect. This has led to work on optimizing the design, once the questions of interest are formulated as contrasts of the estimated parameters.
Let's compute this statistic in our example.
X
array([[ 1.,  1.,  0.,  1.],
       [-1.,  1.,  0.,  1.],
       [ 1.,  1.,  0.,  1.],
       [-1.,  1.,  0.,  1.],
       [ 1.,  1.,  0.,  1.],
       [-1.,  1.,  0.,  1.],
       [ 1.,  0.,  1.,  1.],
       [-1.,  0.,  1.,  1.],
       [ 1.,  0.,  1.,  1.],
       [-1.,  0.,  1.,  1.],
       [ 1.,  0.,  1.,  1.],
       [-1.,  0.,  1.,  1.]])
betah, Yfitted, resid = glm(X, Y)  # fit the model
RSS = sum((Y - Yfitted) ** 2)
MRSS = RSS / (len(Y) - np.linalg.matrix_rank(X))
print("RSS :", RSS)
print("MRSS :", MRSS)
print(betah)
cT = np.asarray([0, 1, -1, 0])
c = cT[:, np.newaxis]
pXTX = np.linalg.pinv(X.T.dot(X))
t_num = cT.dot(betah)
SE = np.sqrt(MRSS * cT.dot(pXTX).dot(c))
t = t_num / SE
print(t)
RSS : 6.00239575609
MRSS : 0.666932861788
[ 0.21902549  3.17218166 -1.17218166  2.        ]
[ 9.21394696]
To see if this is significant, we have to ask what the probability is of observing this t-value or larger under the null hypothesis. We first import some utility stat functions from scipy.
from scipy.stats import t as tdist, norm as ndist
pvalue = 1.0 - tdist.cdf(t, 10)
print("t-value = ", t, "p-value = ", pvalue)
# what if t was 1.8?
t = 1.8
pvalue = 1.0 - tdist.cdf(t, 10)
print("t-value = ", t, "p-value = ", pvalue)
t-value =  [ 9.21394696] p-value =  [  1.67422278e-06]
t-value =  1.8 p-value =  0.0510261215673
The simplest and generally most useful way of thinking of the F-test is as a test between two models: one which contains the regressor or factor that we want to test for (referred to as the full model, with design matrix $X$), and one which doesn't (the reduced model, $X_0$). To test whether the model containing more columns is better, we compute the difference between the noise variance estimates of the two models (variance estimated with $X$ versus variance estimated with $X_0$), normalized by the estimate of the noise variance under the full model. This is:

$$ F = \frac{(\textrm{SSR}(X_0) - \textrm{SSR}(X)) / \nu_1}{\textrm{SSR}(X) / \nu_2} $$
with $\textrm{SSR}(X)$, $\textrm{SSR}(X_0)$, the sum of squares for error of models $X, X_0$ respectively, and:
$$ \begin{eqnarray} \nu_{1} & = & \textrm{tr}(P_X - P_{X_0}) = \textrm{tr}(R_0 - R_X) \\ \nu_{2} & = & \textrm{tr}(I - P_X) = \textrm{tr}(R_X) \end{eqnarray} $$

Notice that $\hat\epsilon^T \hat\epsilon = Y^T R^T R Y$, but $R$ is symmetric ($R^T = R$) and idempotent ($R R = R$). Geometrically, it is a projector.
It can be shown that under the hypothesis that our true model is $X_0$, this ratio follows an $F$ distribution with $\nu_1, \nu_2$ degrees of freedom. This is the case because $P_X - P_{X_0}$ is orthogonal to $R_X$ (since the space spanned by $X_0$ is contained in the space spanned by $X$); it can be shown that in this case the two sums of squares are independent.
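The orthogonality and trace claims can be checked numerically (a verification added here, using the same full and reduced models as below):

```python
import numpy as np

n = 12
x0 = np.asarray([1, -1] * 6)
x1 = np.hstack((np.ones(6), np.zeros(6)))
x2 = np.hstack((np.zeros(6), np.ones(6)))
X = np.vstack((x0, x1, x2, np.ones(n))).T  # full model
X0 = X[:, [0, 3]]                          # reduced model

PX = X.dot(np.linalg.pinv(X))
PX0 = X0.dot(np.linalg.pinv(X0))
RX = np.eye(n) - PX

# P_X - P_X0 projects onto a subspace orthogonal to R_X:
print(np.allclose((PX - PX0).dot(RX), 0))  # True
# and the traces give the two degrees of freedom:
print(round(np.trace(PX - PX0)), round(np.trace(RX)))  # 1 9
```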
Let's take an example with the previous model and implement this in code. Say, for instance, that we want to know whether there is an effect of the step function in the model.
print(X)
X0 = X[:, [0, 3]]
betah, Yfitted, resid = glm(X, Y)
betah0, Yfitted0, resid0 = glm(X0, Y)
PX = X.dot(lin.pinv(X))
RX = np.eye(X.shape[0]) - PX
PX0 = X0.dot(lin.pinv(X0))
F_num = (sum(resid0 ** 2) - sum(resid ** 2))
nu1 = np.trace(PX - PX0)
F_den = sum(resid ** 2)
nu2 = np.trace(RX)
F = (F_num / nu1) / (F_den / nu2)
print(F, nu1, nu2, F_num, F_den)
[[ 1.  1.  0.  1.]
 [-1.  1.  0.  1.]
 [ 1.  1.  0.  1.]
 [-1.  1.  0.  1.]
 [ 1.  1.  0.  1.]
 [-1.  1.  0.  1.]
 [ 1.  0.  1.  1.]
 [-1.  0.  1.  1.]
 [ 1.  0.  1.  1.]
 [-1.  0.  1.  1.]
 [ 1.  0.  1.  1.]
 [-1.  0.  1.  1.]]
84.8968186247 1.0 9.0 56.620478202 6.00239575609
from scipy.stats import f as Fdist
Fdist.sf(F,nu1,nu2)
7.0423196532412472e-06
For this, let's imagine that we have two regressors that we are interested in. For the sake of simplicity, let's assume that they are not correlated.
First, let's make a little function for testing the betah values one by one (we assume they are estimable):
def t_test(betah, resid, X):
    """
    Test the parameters betah one by one - this assumes they are estimable (X full rank)

    betah : (p, 1) estimated parameters
    resid : (n, 1) estimated residuals
    X : design matrix
    """
    RSS = sum(resid ** 2)
    n = resid.shape[0]
    q = np.linalg.matrix_rank(X)
    df = n - q
    MRSS = RSS / df
    XTX = np.linalg.pinv(X.T.dot(X))
    tval = np.zeros(betah.shape)
    pval = np.zeros(betah.shape)
    for (idx, beta) in enumerate(betah):
        c = np.zeros(betah.shape)
        c[idx] = 1
        t_num = c.T.dot(betah)
        SE = np.sqrt(MRSS * c.T.dot(XTX).dot(c))
        tval[idx] = t_num / SE
        pval[idx] = 1.0 - tdist.cdf(tval[idx], df)
    return tval, pval
#print t_test(betah, Yfitted, resid)
n = 20
x = np.random.randn(n,1)
X = np.hstack((x, np.ones(x.shape)))
print(X.shape)
m_y = 0
m_y = 3.14156
e = np.random.randn(n,1)
y = 1.4 * x + m_y + e
plt.plot(x,y,'o')
(20, 2)
What is the estimated beta?
betah, Yfitted, resid = glm(X, y)
#print Yfitted.shape
t, p = t_test(betah, resid, X)
print(" betah = \n", betah)
print(" t = \n", t)
print(" p = \n", p)
betah =
[[ 1.49135871]
 [ 3.2565731 ]]
t =
[[  5.22092063]
 [ 10.44492969]]
p =
[[  2.88492490e-05]
 [  2.27580943e-09]]
# do this the other way: let's have y be the explanatory variable:
X2 = np.hstack((y - m_y, np.ones(x.shape)))
y2 = x
betah2, Yfitted2, resid2 = glm(X2, y2)
print("compare the slopes : \t\t", 1. / betah2[0, 0], betah[0, 0])
print("compare the residual variance:\t", resid.T.dot(resid)[0, 0], resid2.T.dot(resid2)[0, 0])
t2, p2 = t_test(betah2, resid2, X2)
print("betah2 : \t\t\t", betah2[:, 0])
print("compare t values : \n", np.hstack((t, t2)))
print("compare p values : \n", np.hstack((p, p2)))
compare the slopes : 		2.47618681684 1.49135870604
compare the residual variance:	33.4062605805 9.04611997931
betah2 : 			[ 0.40384675 -0.13895912]
compare t values :
[[  5.22092063   5.22092063]
 [ 10.44492969  -0.87105303]]
compare p values :
[[  2.88492490e-05   2.88492490e-05]
 [  2.27580943e-09   8.02402760e-01]]
$y = x\beta_0 + \mu + \epsilon$
compared to
$x = y/\beta_0 - \mu/\beta_0 - \epsilon/\beta_0$
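A classic consequence of this asymmetry (a small demonstration added here, not in the original): the slope of the regression of $y$ on $x$ times the slope of the regression of $x$ on $y$ equals the squared correlation, which is why the two fitted slopes are not simply reciprocals of each other:

```python
import numpy as np

np.random.seed(6)
n = 200
x = np.random.randn(n)
y = 1.4 * x + np.random.randn(n)

b_yx = np.polyfit(x, y, 1)[0]  # slope of y regressed on x
b_xy = np.polyfit(y, x, 1)[0]  # slope of x regressed on y
r = np.corrcoef(x, y)[0, 1]

# the product of the two slopes is the squared correlation
print(np.allclose(b_yx * b_xy, r ** 2))  # True
```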
# Fisher z transform of correlation coefficient:
def fisher_transf(rho):
    """ Take a coefficient of correlation and z-transform it.

    See en.wikipedia.org/wiki/Fisher_transformation
    """
    return 0.5 * np.log((1. + rho) / (1. - rho))
corr = np.corrcoef(x[:,0],y[:,0])
z = fisher_transf(corr[0,1])
p_fisher = 1. - ndist.cdf(z, 0, 1. / np.sqrt(n - 3))
print(p_fisher)
9.81377091236e-06
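As a side note (an observation added here), the Fisher transform above is simply the inverse hyperbolic tangent, `np.arctanh`:

```python
import numpy as np

rho = 0.6
z = 0.5 * np.log((1. + rho) / (1. - rho))  # the fisher_transf formula above
print(np.allclose(z, np.arctanh(rho)))  # True
```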
# An example of a t-test with 20 "voxels"
# from nipy.modalities.fmri.glm import FMRILinearModel
from IPython import display as d
def F_test_reducedX(X, X0, Y):
    """ F test comparing the full model X against the reduced model X0 """
    betah, Yfitted, resid = glm(X, Y)
    betah0, Yfitted0, resid0 = glm(X0, Y)
    PX = X.dot(lin.pinv(X))
    RX = np.eye(X.shape[0]) - PX
    PX0 = X0.dot(lin.pinv(X0))
    F_num = (sum(resid0 ** 2) - sum(resid ** 2))
    nu1 = np.trace(PX - PX0)
    F_den = sum(resid ** 2)
    nu2 = np.trace(RX)
    F = (F_num / nu1) / (F_den / nu2)
    return F, Fdist.sf(F, nu1, nu2), nu1, nu2
# Set up a model with two regressors - first uncorrelated
n = 20
age = np.random.randn(n, 1)
likes_pies = np.random.randn(n, 1)
x0 = age
# make likes_pies have no correlation with age: remove the correlation
x1 = likes_pies - (likes_pies.T.dot(age) / (age.T.dot(age))) * age
# check with print(x0.T.dot(x1))
x2 = np.ones((n, 1))
#
X = np.hstack((x0, x1, x2))
print(X.shape)
m_y = np.pi
sig_e = 1.5
e = sig_e * np.random.normal(size=(n, 1))
y = X.dot(np.array([1.0, 1.0, m_y])[:, np.newaxis]) + e
betah, yfitted, resid = glm(X, y)
plot_glm(y, yfitted, resid)
t, p = t_test(betah, resid, X)
F, pF, nu1, nu2 = F_test_reducedX(X, X[:, [2]], y)
print(betah, "\n", t, "\n", p, "\n", F, nu1, nu2, pF)
(20, 3)
[[ 1.39775805]
 [ 1.00581096]
 [ 2.52847904]]
[[ 3.46320756]
 [ 3.82996098]
 [ 8.39050781]]
[[  1.48591762e-03]
 [  6.70372529e-04]
 [  9.48074406e-08]]
13.0946461759 2.0 17.0 0.000361508973007
# Set up a model with two regressors - correlated
n = 20
age = np.random.randn(n, 1)
likes_pies = np.random.randn(n, 1)
x0 = age + .1 * likes_pies
x1 = age - .1 * likes_pies
# can you tell what the expected correlation is?
print(np.corrcoef(x0.T, x1.T)[0, 1])
x2 = np.ones((n, 1))
#
X = np.hstack((x0, x1, x2))
m_y = np.pi
sig_e = 1.5
e = sig_e * np.random.normal(size=(n, 1))
y = X.dot(np.array([1.0, 1.0, m_y])[:, np.newaxis]) + e
betah, yfitted, resid = glm(X, y)
plot_glm(y, yfitted, resid)
t, p = t_test(betah, resid, X)
F, pF, nu1, nu2 = F_test_reducedX(X, x2, y)
print(betah, "\n", t, "\n", p, "\n", F, nu1, nu2, pF)
0.991989590541
[[-0.10010842]
 [ 2.6387875 ]
 [ 2.89779484]]
[[-0.04694278]
 [ 1.24034419]
 [ 7.0355948 ]]
[[  5.18447054e-01]
 [  1.15847549e-01]
 [  1.00163387e-06]]
44.6566022043 2.0 17.0 1.70932466497e-07
Here's an illustration, from J. Mumford, of the various situations one can encounter:
HOME = os.path.expanduser('~')
d.Image(filename=pjoin(HOME, 'code', 'pna-notebooks', 't_F_test.jpg'))