We know that the traditional way of looking at an F-contrast is to think of two models: a full model $X$ and a reduced model $X_0$, and to compare the sums of squares of the residuals under these two models. We also know that the numerator of an F test can be expressed with a contrast that imposes a constraint on the parameters of the model. This short notebook makes the relation between these two formulations explicit, so that unusual hypotheses can be tested more easily.
import scipy.linalg as lin
import scipy.stats as sst
import numpy as np
n = 30
# make a design for one factor with 3 levels, 10 observations per level
oneway_anov = np.kron(np.eye(3), np.ones((10, 1)))
#add a trend?
#trend = np.arange(0,n).reshape(n,1)
# add an intercept
intercept = np.ones((n,1))
X = np.hstack((oneway_anov, intercept))
print X
[[ 1.  0.  0.  1.]
 [ 1.  0.  0.  1.]
 [ 1.  0.  0.  1.]
 [ 1.  0.  0.  1.]
 [ 1.  0.  0.  1.]
 [ 1.  0.  0.  1.]
 [ 1.  0.  0.  1.]
 [ 1.  0.  0.  1.]
 [ 1.  0.  0.  1.]
 [ 1.  0.  0.  1.]
 [ 0.  1.  0.  1.]
 [ 0.  1.  0.  1.]
 [ 0.  1.  0.  1.]
 [ 0.  1.  0.  1.]
 [ 0.  1.  0.  1.]
 [ 0.  1.  0.  1.]
 [ 0.  1.  0.  1.]
 [ 0.  1.  0.  1.]
 [ 0.  1.  0.  1.]
 [ 0.  1.  0.  1.]
 [ 0.  0.  1.  1.]
 [ 0.  0.  1.  1.]
 [ 0.  0.  1.  1.]
 [ 0.  0.  1.  1.]
 [ 0.  0.  1.  1.]
 [ 0.  0.  1.  1.]
 [ 0.  0.  1.  1.]
 [ 0.  0.  1.  1.]
 [ 0.  0.  1.  1.]
 [ 0.  0.  1.  1.]]
n, p = X.shape
Y = np.random.randn(n, 1)
# true coefficients: group effects (1, -1, 0) and intercept 3
csignal = np.array([1., -1., 0., 3.]).reshape(4, 1)
signal = X.dot(csignal)
Y += signal
print np.hstack((Y, X))
[[ 2.77986271  1.  0.  0.  1.]
 [ 3.99773182  1.  0.  0.  1.]
 [ 3.7271806   1.  0.  0.  1.]
 [ 3.8632183   1.  0.  0.  1.]
 [ 4.20122298  1.  0.  0.  1.]
 [ 4.1900779   1.  0.  0.  1.]
 [ 3.23495586  1.  0.  0.  1.]
 [ 3.49377632  1.  0.  0.  1.]
 [ 3.80760708  1.  0.  0.  1.]
 [ 4.15341048  1.  0.  0.  1.]
 [ 1.10881459  0.  1.  0.  1.]
 [ 1.45459043  0.  1.  0.  1.]
 [ 1.49804018  0.  1.  0.  1.]
 [ 2.04922078  0.  1.  0.  1.]
 [ 2.81045098  0.  1.  0.  1.]
 [ 0.11316675  0.  1.  0.  1.]
 [ 2.58020441  0.  1.  0.  1.]
 [ 0.17984951  0.  1.  0.  1.]
 [ 2.83416135  0.  1.  0.  1.]
 [ 0.83353014  0.  1.  0.  1.]
 [ 2.98371249  0.  0.  1.  1.]
 [ 3.74146005  0.  0.  1.  1.]
 [ 3.39001635  0.  0.  1.  1.]
 [ 2.33916425  0.  0.  1.  1.]
 [ 3.51469752  0.  0.  1.  1.]
 [ 4.16145559  0.  0.  1.  1.]
 [ 3.68332627  0.  0.  1.  1.]
 [ 3.08862975  0.  0.  1.  1.]
 [ 2.86829704  0.  0.  1.  1.]
 [ 3.266467    0.  0.  1.  1.]]
Let's imagine that we know the reduced model $X_0$: we would like to test what is in $X$ but not in $X_0$.
We define the orthogonal projector onto the space of $X$ as $P_X = X X^-$, where $X^-$ is the Moore-Penrose pseudo-inverse of $X$.
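As a quick illustration (a minimal sketch, not part of the analysis below), we can build $P_X$ for the design above from the pseudo-inverse and check that it is indeed an orthogonal projector:
P_X = X.dot(lin.pinv(X))                # P_X = X X^-
print np.allclose(P_X.dot(P_X), P_X)    # idempotent
print np.allclose(P_X.T, P_X)           # symmetric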
The F test comparing the reduced model $X_0$ to the full model $X$ is then:
\begin{eqnarray} F_{\nu_1, \nu_2} & = & \frac{(Y^T(I-P_{X_0})Y - Y^T(I-P_{X})Y)/ \nu_{1} }{Y^T(I-P_{X})Y/\nu_{2}} \\ & = & \frac{(SSR(X_0) - SSR(X))/\nu_1}{SSR(X)/\nu_2} \end{eqnarray}with $SSR(X)$ (resp. $SSR(X_0)$) the sum of squares of the residuals of model $X$ (resp. $X_0$), and
\begin{eqnarray} \nu_{1} & = & tr(P_X - P_{X_0}) = tr(R_0 - R_X) \\ \nu_{2} & = & tr(I - P_X) = tr(R_X) \end{eqnarray}where $R_X = I - P_X$ (resp. $R_0 = I - P_{X_0}$) is the residual-forming projector of model $X$ (resp. $X_0$).
from numpy.linalg import matrix_rank
# let's do an F-test comparing the full model X with the reduced model X0 = X L0
def F_test(Y, X, L0):
    '''
    Y:  numpy array
        data
    X:  numpy array
        design matrix
    L0: numpy array
        contrast forming X0 = X L0, the "null" (reduced) model
    returns:
    F, nu1, nu2 : float
    '''
    X0 = X.dot(L0)
    n = X.shape[0]
    p0 = matrix_rank(X0)
    p = matrix_rank(X)
    nu1 = p - p0
    nu2 = n - p
    # numerator: SSR(X0) - SSR(X) = Y'P_X Y - Y'P_X0 Y
    SSR_num = (Y.T.dot(proj(X, Y)) - Y.T.dot(proj(X0, Y)))
    numF = SSR_num/nu1
    print ' ', SSR_num, p, p0
    # denominator: SSR(X) = Y'(I - P_X)Y
    SSR_den = (Y.T.dot(R_proj(X, Y)))
    denF = SSR_den/nu2
    print ' ', SSR_den, n, p
    return numF/denF, nu1, nu2
def proj(X, Y=None):
    # project onto the column space of X: return P_X, or P_X Y if Y is given
    [u, s, vt] = lin.svd(X, full_matrices=False)
    # keep only the singular vectors associated with non negligible singular values
    tol = s.max() * max(X.shape) * np.finfo(s.dtype).eps
    nz = np.where(s > tol)[0]
    if Y is None:
        return u[:, nz].dot(u[:, nz].T)
    else:
        return u[:, nz].dot(u[:, nz].T.dot(Y))
def R_proj(X, Y=None):
    # project onto the space orthogonal to X: return R_X = I - P_X, or R_X Y
    if Y is None:
        return np.eye(X.shape[0]) - proj(X, Y)
    else:
        return Y - proj(X, Y)
Understanding the numerator: Matthew's proposal, which works under some conditions (essentially, when the tested effects are orthogonal to the reduced model).
At observation (line) $i$, write $\mu$ for the reduced-model (grand mean) fit, $\beta_i$ for the fitted group effect, $\epsilon_{0i} = Y_i - \mu$ for the reduced-model residual and $\epsilon_i = Y_i - \beta_i - \mu$ for the full-model residual. Then:
$$ \begin{eqnarray} \epsilon_{0i}^2 - \epsilon_i^2 &=& (Y_i - \mu)^2 - (Y_i - \beta_{i} - \mu)^2 \\ &=& (\epsilon_{i} + \beta_{i})^2 - \epsilon_i^2 \\ &=& \beta_i^2 + 2\epsilon_{i}\beta_{i} \end{eqnarray} $$Summing over $i$, the cross term $\sum_i\epsilon_{i}\beta_{i}$ vanishes because the full-model residuals are orthogonal to every column of $X$, and hence to the fitted effects; we are left with $SSR(X_0) - SSR(X) = \sum_i \beta_i^2$.
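A quick numerical sanity check of this argument (a sketch using the proj and R_proj helpers defined above): the cross term should be numerically zero, and the sum of the squared fitted effects should match the SSR difference computed further down (27.059 in this run).
eps_full = R_proj(X, Y)            # full-model residuals (I - P_X) Y
mu_hat = proj(intercept, Y)        # reduced-model fit: the grand mean
beta_fx = proj(X, Y) - mu_hat      # fitted effects beta_i, one per observation
print eps_full.T.dot(beta_fx)      # cross term: numerically zero
print (beta_fx ** 2).sum()         # should equal SSR(X0) - SSR(X)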
A more general version of this can be written in the following way, with a small bit of matrix algebra:
$$ \begin{eqnarray} SSR(X_0) - SSR(X) & = & Y^T(I-P_{X_0})Y - Y^T(I-P_{X})Y \\ & = & Y^TY - Y^TP_{X_0}Y - Y^TY + Y^TP_{X}Y \\ & = & Y^TP_{X}Y - Y^TP_{X_0}Y \end{eqnarray} $$But $Y^TP_{X}Y$ is also $Y^TP_{X}P_{X}Y = Y^TX(X^TX)^{-1}X^T X(X^TX)^{-1}X^T Y = \beta^T X^T X \beta$
The first equality holds because $P_{X} = P_{X}P_{X}$ (a projector is idempotent), the second by replacing $P_X$ with $X(X^TX)^{-1}X^T$, with $\beta = (X^TX)^{-1}X^T Y$.
Now, we can partition $X$ in $[X_t X_0]$ where $X_0$ is the null model and $X_t$ is the tested part (with parameters $\beta_t$), and if they are orthogonal, then
$$ SSR(X_0) - SSR(X) = \beta^T X^T X \beta - \beta_0^T X_0^T X_0 \beta_0 = \beta_t^T X_t^T X_t \beta_t $$which corresponds to the sum of squares of the (fitted) means, weighted by the number of observations: to see this, take the dot product of $X_t \beta_t$ with itself, and note that $X_t \beta_t$ is the vector of "means".
# reduced model X0 = X L0: keep only the intercept column
L0 = np.eye(p)[:,3:]
(F, n1, n2) = F_test(Y, X, L0)
print (F, n1, n2)
  [[ 27.05935053]] 3 1
  [[ 13.51305476]] 30 3
(array([[ 27.0332089]]), 2, 27)
# split X in Xt and X0 :
X0 = X[:,[3]]
# orthogonalize the first three columns of X against X0: this is the part we want to test
Xt = R_proj(X0, X[:,:3])
# recreate the full design matrix
Xall = np.hstack((Xt, X0))
betah = lin.pinv(Xall).dot(Y)
betah_t = betah[:3,:]
print "betah_t : ", betah_t
tmp = Xt.dot(betah_t)
SSRXt = tmp.T.dot(tmp)
print "Using the sum of square of the beta_t : SSRXt = %f \n" % SSRXt
SSRX_SSRX0 = Y.T.dot(proj(Xall, Y)) - Y.T.dot(proj(X0, Y))
print "Using SSR X - SSR X0 : SSRX_SSRX0 = %f \n" % SSRX_SSRX0
betah_t :  [[ 0.87996109]
 [-1.3187404 ]
 [ 0.43877931]]
Using the sum of square of the beta_t : SSRXt = 27.059351 

Using SSR X - SSR X0 : SSRX_SSRX0 = 27.059351 
from scipy.stats import f as Fdist
Fdist.sf(F,n1,n2)
array([[ 3.58143893e-07]])
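As a cross-check (a sketch, not in the original analysis): testing the full one-way design against the intercept-only reduced model is exactly the classic one-way ANOVA, so scipy.stats.f_oneway on the three groups should reproduce the same F (about 27.03) and p-value.
g1, g2, g3 = Y[:10, 0], Y[10:20, 0], Y[20:30, 0]
print sst.f_oneway(g1, g2, g3)     # should match the (F, p) obtained above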
# same check, but build Xt from two contrast columns so that Xall has no rank deficiency
# recreate the original design
X = np.hstack((oneway_anov, intercept))
Xt = X.dot(np.asarray([[1, -1, 0, 0],[0, 1, -1, 0]]).T)
X0 = X[:,[3]]
Xall = np.hstack((Xt, X0))
betah = np.linalg.pinv(Xall).dot(Y)
#print Xall
#print Xall.T.dot(Xall)
betah_t = betah[:2,:]
print " betah_t ", betah_t
tmp = Xt.dot(betah_t)
SSRXt = tmp.T.dot(tmp)
print "Using the sum of square of the beta_t : SSRXt = %f \n" % SSRXt
SSRX_SSRX0 = Y.T.dot(proj(Xall, Y)) - Y.T.dot(proj(X0, Y))
print "Using SSR X - SSR X0 : SSRX_SSRX0 = %f \n" % SSRX_SSRX0
 betah_t  [[ 0.50142161]
 [ 0.02168588]]
Using the sum of square of the beta_t : SSRXt = 4.820403 

Using SSR X - SSR X0 : SSRX_SSRX0 = 4.820403 
Now let's compute the F test with a contrast. For the moment we simply assume the formula (it is derived at the end of this notebook):
$$ \begin{equation} F_{\nu_1, \nu_2} = \frac{\hat\beta^T \Lambda (\Lambda^T (X^TX)^- \Lambda)^- \Lambda^T \hat\beta / \nu_{1} }{MSE} \end{equation} $$where $MSE$ is the mean square error $SSR(X)/\nu_2$.
def F_con(X, Y, L, verbose=False):
    # note: no explicit check that the contrast is well formed is done here
    B = lin.pinv(X).dot(Y)
    # Numerator: B'L (L'(X'X)^- L)^- L'B
    XtX = X.T.dot(X)
    piXtX = lin.pinv(X.T.dot(X))
    piLtpiXtXL = lin.pinv(L.T.dot(piXtX).dot(L))
    LtB = L.T.dot(B)
    Fnum = LtB.T.dot(piLtpiXtXL).dot(LtB)
    # Denominator: SSR(X) = Y'Y - B'X'X B
    SSRX = Y.T.dot(Y) - B.T.dot(XtX).dot(B)
    v1 = matrix_rank(L)
    v2 = X.shape[0] - matrix_rank(X)
    if verbose:
        print 'Fnum: %f, df1=%d, Fdenom: %f, df2=%d, \n' % (Fnum, v1, SSRX, v2)
    return (Fnum/v1)/(SSRX/v2), v1, v2
B = lin.pinv(X).dot(Y)
print B
#print X
L = np.array([[1., -1., 0, 0],[0., 1., -1., 0]]).T
print L
F,v1,v2 = F_con(X, Y, L, verbose = True)
print F,v1,v2
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-71-e62367a14bfc> in <module>()
      4 L = np.array([[1., -1., 0, 0],[0., 1., -1., 0]]).T
      5 print L
----> 6 F,v1,v2 = F_con(X, Y, L, verbose = True)
      7 print F,v1,v2

<ipython-input-70-ecf4a53ba7b2> in F_con(X, Y, L, verbose)
      6     # Numerator
      7     piXtX = lin.pinv(X.T.dot(X))
----> 8     piLtpiXtXL = lin.pinv(L.T.dot(piXtX).dot(L))
      9     LtB = L.T.dot(B)
     10     Fnum = LtB.T.dot(piLtpiXtXL).dot(LtB)

ValueError: objects are not aligned
[[ 1.0565886 ]
 [ 0.82571892]
 [ 0.36905904]
 [ 1.90725234]
 [ 0.34411422]
 [ 2.25136656]]
[[ 1.  0.]
 [-1.  1.]
 [ 0. -1.]
 [ 0.  0.]]
print X
[[ 1.  0.  0.  1.  0.  1.]
 [ 1.  0.  0.  1.  0.  1.]
 [ 1.  0.  0.  1.  0.  1.]
 [ 1.  0.  0.  1.  0.  1.]
 [ 1.  0.  0.  1.  0.  1.]
 [ 1.  0.  0.  0.  1.  1.]
 [ 1.  0.  0.  0.  1.  1.]
 [ 1.  0.  0.  0.  1.  1.]
 [ 1.  0.  0.  0.  1.  1.]
 [ 1.  0.  0.  0.  1.  1.]
 [ 0.  1.  0.  1.  0.  1.]
 [ 0.  1.  0.  1.  0.  1.]
 [ 0.  1.  0.  1.  0.  1.]
 [ 0.  1.  0.  1.  0.  1.]
 [ 0.  1.  0.  1.  0.  1.]
 [ 0.  1.  0.  0.  1.  1.]
 [ 0.  1.  0.  0.  1.  1.]
 [ 0.  1.  0.  0.  1.  1.]
 [ 0.  1.  0.  0.  1.  1.]
 [ 0.  1.  0.  0.  1.  1.]
 [ 0.  0.  1.  1.  0.  1.]
 [ 0.  0.  1.  1.  0.  1.]
 [ 0.  0.  1.  1.  0.  1.]
 [ 0.  0.  1.  1.  0.  1.]
 [ 0.  0.  1.  1.  0.  1.]
 [ 0.  0.  1.  0.  1.  1.]
 [ 0.  0.  1.  0.  1.  1.]
 [ 0.  0.  1.  0.  1.  1.]
 [ 0.  0.  1.  0.  1.  1.]
 [ 0.  0.  1.  0.  1.  1.]]
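The ValueError above, together with the fact that $\hat\beta$ has six entries and the printed $X$ has six columns, shows what went wrong: the cells were run out of order, so the $X$ in memory is already the six-column two-way design built below, while the contrast $\Lambda$ was written for the four-column one-way design. A minimal fix (a sketch) is to rebuild the one-way design before calling F_con; if Y still holds the one-way data simulated above, this should reproduce the F_test result (F of about 27.03 with 2 and 27 degrees of freedom).
# rebuild the one-way design so that X matches the 4-row contrast
X = np.hstack((oneway_anov, intercept))
L = np.array([[1., -1., 0., 0.], [0., 1., -1., 0.]]).T
F, v1, v2 = F_con(X, Y, L, verbose=True)
print F, v1, v2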
Here we quickly take the example of a two-way ANOVA with no interaction. Say we have two groups of subjects (e.g. patients and normal controls), each divided into 3 groups, for instance according to rehabilitation therapy. Let's build the design matrix:
# simplify the notation
from numpy import kron, ones, eye, zeros
nbs1 = 10; g1 = 3; g2 = 2
n = nbs1 * g1
factor1 = kron(eye(g1), ones((nbs1, 1)))
factor2 = kron(eye(g2), ones((n/(g1*g2), 1)))
factor2 = kron(ones((g1, 1)), factor2)
intercept = ones((n,1))
X = np.hstack((factor1, factor2, intercept))
print X
[[ 1.  0.  0.  1.  0.  1.]
 [ 1.  0.  0.  1.  0.  1.]
 [ 1.  0.  0.  1.  0.  1.]
 [ 1.  0.  0.  1.  0.  1.]
 [ 1.  0.  0.  1.  0.  1.]
 [ 1.  0.  0.  0.  1.  1.]
 [ 1.  0.  0.  0.  1.  1.]
 [ 1.  0.  0.  0.  1.  1.]
 [ 1.  0.  0.  0.  1.  1.]
 [ 1.  0.  0.  0.  1.  1.]
 [ 0.  1.  0.  1.  0.  1.]
 [ 0.  1.  0.  1.  0.  1.]
 [ 0.  1.  0.  1.  0.  1.]
 [ 0.  1.  0.  1.  0.  1.]
 [ 0.  1.  0.  1.  0.  1.]
 [ 0.  1.  0.  0.  1.  1.]
 [ 0.  1.  0.  0.  1.  1.]
 [ 0.  1.  0.  0.  1.  1.]
 [ 0.  1.  0.  0.  1.  1.]
 [ 0.  1.  0.  0.  1.  1.]
 [ 0.  0.  1.  1.  0.  1.]
 [ 0.  0.  1.  1.  0.  1.]
 [ 0.  0.  1.  1.  0.  1.]
 [ 0.  0.  1.  1.  0.  1.]
 [ 0.  0.  1.  1.  0.  1.]
 [ 0.  0.  1.  0.  1.  1.]
 [ 0.  0.  1.  0.  1.  1.]
 [ 0.  0.  1.  0.  1.  1.]
 [ 0.  0.  1.  0.  1.  1.]
 [ 0.  0.  1.  0.  1.  1.]]
Simulate the data :
print csignal
[[ 0.5]
 [-0.5]
 [ 0. ]
 [ 2. ]
 [ 0. ]
 [ 3. ]]
n, p = X.shape
# simulated effect sizes: factor 1 levels (0.5, -0.5, 0), factor 2 levels (2, 0), intercept 3
f1_fx = np.asarray([1., -1., 0]) * .5
f2_fx = np.asarray([2., 0.])
offset = np.asarray([3.])
csignal = np.hstack((f1_fx, f2_fx, offset)).reshape(p,1)
Y = np.random.randn(n, 1)
signal = X.dot(csignal)
Y = Y + signal
print signal
[[ 5.5]
 [ 5.5]
 [ 5.5]
 [ 5.5]
 [ 5.5]
 [ 3.5]
 [ 3.5]
 [ 3.5]
 [ 3.5]
 [ 3.5]
 [ 4.5]
 [ 4.5]
 [ 4.5]
 [ 4.5]
 [ 4.5]
 [ 2.5]
 [ 2.5]
 [ 2.5]
 [ 2.5]
 [ 2.5]
 [ 5. ]
 [ 5. ]
 [ 5. ]
 [ 5. ]
 [ 5. ]
 [ 3. ]
 [ 3. ]
 [ 3. ]
 [ 3. ]
 [ 3. ]]
Test for factor 1:
Fc1 = np.asarray([[1, -1, 0], [0, 1, -1]])
F_fill = zeros((2, p-3))
Fc = np.hstack((Fc1, F_fill))
print Fc
print p
(f, n1, n2) = F_con(X, Y, Fc.T, verbose=True)
# compute the P value:
Fdist.sf(f,n1,n2)
[[ 1. -1.  0.  0.  0.  0.]
 [ 0.  1. -1.  0.  0.  0.]]
6
Fnum: 1.796334, df1=2, Fdenom: 25.258972, df2=26, 
array([[ 0.40937548]])
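By analogy (a sketch, not in the original notebook), factor 2 can be tested with the corresponding contrast; since its simulated effect (2 vs. 0) is much larger than the factor 1 effect, the F should be large and the p-value small.
Fc2 = np.asarray([[0., 0., 0., 1., -1., 0.]])   # factor 2: level 1 vs level 2
(f2, m1, m2) = F_con(X, Y, Fc2.T, verbose=True)
print Fdist.sf(f2, m1, m2)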
We assume from now on that the contrast of parameters we would like to test may impose several constraints on $\beta$, and we write those constraints as $\Lambda^T \beta = 0$.
Setting $\Lambda^T \beta = 0$ is also setting $H^T X \beta = 0$ for some matrix $H$ (the number of columns in $H$ is the number of columns in $\Lambda$).
If we have a (legitimate) contrast of the parameters, what does $H$ look like? How does it relate to a reduced model? Why is all this relevant for fMRI? It is interesting to understand what $H$ is. While $\Lambda$ puts constraints on the parameters of the model, $H$ is the equivalent constraint on the space of $X$ \citep{Christensen2002}. The reduced model to be tested is $Y = X \beta + \epsilon$ $\textbf{and}$ $H^T X \beta = 0$, or, equivalently, $E(Y) \in C(X)$ $\textbf{and}$ $E(Y) \in C(H)^{\perp}$, where $C(H)^{\perp}$ is the space orthogonal to $C(H)$ (the set of vectors with zero correlation with every vector of $C(H)$). The reduced model is therefore a matrix $X_0$ such that $C(X_0) \subseteq C(X)$ $\textbf{and}$ $C(X_0) \subseteq C(H)^{\perp}$.
By choosing $H = X^{T-} \Lambda$, we impose the constraint on $X$ that corresponds to $\Lambda^T \beta = 0$. Since $H^T E(Y) = H^T X \beta = \Lambda^T X^- X \beta = \Lambda^T \beta = 0$, we have $E(Y) \in C(H)^\perp$ with this particular choice of $H$. Also, the columns of $H$ lie in $C(X)$, so $H$ imposes a constraint on $X$ from within the space of $X$, and represents the part of $X$ that is being tested. The part of $X$ that is not being tested, the reduced model, is found by projecting the space orthogonal to $H$ onto $X$: $X_0 = P_X R_H = P_X(I - P_H) = P_X - P_H$. Testing $\Lambda^T \beta = 0$ is equivalent to testing the reduced model $Y = X_0 \gamma + \epsilon$ with $X_0$ as defined above. With this definition of $X_0$, we have $P_X = P_{X_0} + P_H$.
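To make this construction concrete, here is a small numerical sketch (using the proj helper, the two-way design $X$ and the factor 1 contrast Fc from above): build $H = X^{T-}\Lambda$ and check that $P_X - P_H$ behaves as the projector onto the reduced-model space.
Lam = Fc.T                                 # 6 x 2 contrast matrix for factor 1
H = lin.pinv(X.T).dot(Lam)                 # H = X^{T-} Lambda, columns lie in C(X)
P_H = proj(H)
P_X = proj(X)
P_X0 = P_X - P_H                           # candidate projector onto the reduced space
print np.allclose(P_X0.dot(P_X0), P_X0)    # P_X - P_H is itself a projector...
print np.allclose(P_X0.dot(H), 0)          # ...onto a space orthogonal to H
print np.trace(P_H), np.trace(P_X0)        # dimensions: r(Lambda) and rank(X) - r(Lambda)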
Now, the numerator of the F test can be rewritten in a much simpler way, as a function of $\Lambda$ only:
\begin{eqnarray} Y^T P_H Y & = & Y^T X (X^T X)^- X^T H(H^T H)^- H^T X (X^T X)^- X^T Y \\ & = & \hat\beta^T \Lambda (\Lambda^T (X^TX)^- \Lambda)^- \Lambda^T \hat\beta \end{eqnarray}where $\hat\beta = X^- Y$. Dividing by $\nu_1 = r(\Lambda)$, the rank of $\Lambda$, the F test can be rewritten as:
\begin{equation} F_{\nu_1, \nu_2} = \frac{\hat\beta^T \Lambda (\Lambda^T (X^TX)^- \Lambda)^- \Lambda^T \hat\beta / \nu_{1} }{MSE} \end{equation}where $MSE$ is the mean square error $SSR(X)/\nu_2$, exactly the formula implemented in F_con above.
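Continuing the sketch above, we can check numerically that the two roads meet: the F computed from the reduced-model projector $P_{X_0} = P_X - P_H$ equals the F returned by F_con with the same contrast (here, the factor 1 test).
nu1 = matrix_rank(Lam)
nu2 = X.shape[0] - matrix_rank(X)
num = (Y.T.dot(P_X.dot(Y)) - Y.T.dot(P_X0.dot(Y))) / nu1   # Y' P_H Y / nu1
den = Y.T.dot(Y - P_X.dot(Y)) / nu2                        # SSR(X) / nu2
F_contrast, _, _ = F_con(X, Y, Lam)
print num / den, F_contrast        # the two formulations should agree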