# 2x2 table stored as a vector: cells are (M,Y), (M,N), (F,Y), (F,N)
Y = c(435,147,375,134)
S = factor(c('M','M','F','F'))
B = factor(c('Y','N','Y','N'))
N = sum(Y)
# marginal proportions for S and B
piS = c((435+147)/N,(375+134)/N)
piB = c((435+375)/N,(147+134)/N)
# expected counts under independence
E = N*c(piS[1]*piB[1], piS[1]*piB[2], piS[2]*piB[1], piS[2]*piB[2])
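As a quick cross-check, the same expected counts can be computed from the margins of the 2x2 table (a sketch; Y is filled column-wise, so columns index S and rows index B):
tab = matrix(Y, 2, 2)                  # columns: S = M, F; rows: B = Y, N
outer(rowSums(tab), colSums(tab))/N    # should match E, arranged as a 2x2 table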
# Pearson's X^2
X2 = sum((Y - E)^2/E)
c(X2, 1-pchisq(X2,1))
The independence test is implemented in R as chisq.test. Depending on whether we apply the
continuity correction or not, we get either the $X^2$ above or a corrected version.
chisq.test(matrix(Y,2,2), correct=FALSE)
chisq.test(matrix(Y,2,2))
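As a quick check, the uncorrected statistic and p-value from chisq.test should agree with our hand computation:
test = chisq.test(matrix(Y,2,2), correct=FALSE)
c(test$statistic, test$p.value)
c(X2, 1 - pchisq(X2, 1))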
The goodness of fit test can also be carried out using a glm.
summary(glm(Y ~ S + B, family=poisson()))
This model has the same fitted values as we had computed by hand above.
fitted(glm(Y ~ S+B, family=poisson()))
E
Here is the deviance test statistic. It is numerically close to, but
not identical to, Pearson's $X^2$ for these data.
DEV = sum(2*(Y*log(Y/E) - (Y - E)))  # the (Y - E) term sums to zero here, since the fitted margins match the observed ones
c(X2, DEV)
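As a check, DEV should agree with the residual deviance reported by the independence glm:
glm(Y ~ S + B, family=poisson())$deviance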
Unlike in other cases, here the full model has as many parameters as observations, i.e. it is saturated.
This test is known as a goodness of fit test.
It asks: "how well does the independence model fit these data?"
Unlike other tests we've seen, the deviance itself is the test statistic, not a difference of deviances.
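As a quick sketch, the corresponding goodness of fit p-values, referring both statistics to a chi-squared distribution with $(2-1)\times(2-1)=1$ degree of freedom:
c(1 - pchisq(X2, 1), 1 - pchisq(DEV, 1))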
url = 'http://stats191.stanford.edu/data/lumber.table'
lumber.table = read.table(url, header=T)
lumber.glm = glm(Customers ~ Housing + Income + Age + Competitor + Store,
family=poisson(), data=lumber.table)
summary(lumber.glm)
par(mfrow=c(2,2))
plot(lumber.glm)
lumber.R.glm = glm(Customers ~ Housing + Income + Age,
family=poisson, data=lumber.table)
anova(lumber.R.glm, lumber.glm)
# The difference in deviances is compared to a chi-squared with 2 degrees of freedom.
# The p-value underflows to 0 on the usual scale, so we also report it on the log scale.
pchisq(263.45, 2, lower=FALSE, log=TRUE)
1 - pchisq(263.45, 2)
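Alternatively, anova can report this chi-squared p-value directly by passing its test argument:
anova(lumber.R.glm, lumber.glm, test='Chisq')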
step can also be used for model selection.
step(lumber.glm)
step(glm(Customers ~ 1, data=lumber.table, family=poisson()), scope=list(upper=lumber.glm), direction='forward')
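By default step penalizes with AIC; as a sketch, a BIC-style search can be obtained by setting the penalty per parameter to log(n) via the k argument:
step(lumber.glm, k=log(nrow(lumber.table)))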
library(glmnet)
# design matrix (dropping the intercept column) and response for glmnet
X = model.matrix(lumber.glm)[,-1]
Y = lumber.table$Customers
G = glmnet(X, Y, family='poisson')
plot(G)
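To choose the amount of regularization, a common approach is cross-validation; here is a sketch using cv.glmnet, with lambda.min as one possible rule for extracting coefficients:
CV = cv.glmnet(X, Y, family='poisson')
plot(CV)
coef(CV, s='lambda.min')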