Binary outcomes
options(repr.plot.width=5, repr.plot.height=5)
Generalized Linear Model
logit.inv = function(x) {
return(exp(x) / (1 + exp(x)))
}
x = seq(-4, 4, length=200)
plot(x, logit.inv(x), lwd=2, type='l', col='red', cex.lab=1.2)
logit
logit = function(p) {
return(log(p / (1 - p)))
}
p = seq(0.01,0.99,length=200)
plot(p, logit(p), lwd=2, type='l', col='red', cex.lab=1.2)
A binary regression model specifies $E(Y|X) = F(X\beta)$ for some function
$F$ (usually a distribution function).
The logistic model uses the function (which we called logit.inv
above) $$F(x)=\frac{e^x}{1+e^x}.$$
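As a quick sanity check, $F$ and the logit are inverses of one another; a minimal sketch (redefining both maps locally so the snippet stands alone):

```r
# self-contained redefinitions of the two maps above
logit.inv = function(x) exp(x) / (1 + exp(x))
logit = function(p) log(p / (1 - p))

x = seq(-4, 4, length=200)
# logit(F(x)) should recover x up to floating-point error
max(abs(logit(logit.inv(x)) - x))
```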
Can be fit using Maximum Likelihood / Iteratively Reweighted Least Squares.
For logistic regression, coefficients have nice interpretation in terms of odds ratios
(to be defined shortly).
What about inference?
Instead of sum of squares, logistic regression uses deviance:
$DEV(\pi| Y) = -2 \sum_{i=1}^n \left( Y_i \log(\pi_i) + (1-Y_i) \log(1-\pi_i) \right)$
Minimizing deviance $\iff$ Maximum Likelihood
For any binary regression model, $\pi=\pi(\beta)$.
The deviance is:
$$DEV(\beta) = -2 \sum_{i=1}^n \left( Y_i \log \frac{\pi_i(\beta)}{1-\pi_i(\beta)} + \log(1-\pi_i(\beta)) \right).$$
For the logit link, the first part is linear in $X\beta$; had we used
a different transformation, the first part would not be linear in $X\beta$.
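To see the deviance in action, a minimal sketch on simulated data (the seed and coefficients are arbitrary): the residual deviance reported by glm matches the formula above evaluated at the fitted probabilities.

```r
set.seed(1)  # arbitrary seed, for reproducibility
n = 100
x = rnorm(n)
y = rbinom(n, 1, plogis(0.5 * x))  # simulated binary outcomes

fit = glm(y ~ x, family=binomial())
pi.hat = fitted(fit)

# DEV(pi | Y) = -2 * sum( Y*log(pi) + (1-Y)*log(1-pi) )
dev.by.hand = -2 * sum(y * log(pi.hat) + (1 - y) * log(1 - pi.hat))
c(dev.by.hand, deviance(fit))  # the two agree
```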
with $X[,1]=1$ corresponding to $\beta_0$.

flu.table = read.table('http://stats191.stanford.edu/data/flu.table',
                       header=TRUE)
flu.glm = glm(Shot ~ Age + Health.Aware, data=flu.table,
family=binomial())
summary(flu.glm)
logodds = predict(flu.glm, list(Age=c(35,45),Health.Aware=c(50,50)),
type='link')
logodds
exp(logodds[2] - logodds[1])
The estimated probabilities are below, yielding a ratio of $0.1932/0.0254 \approx 7.61$. Not too far from 9.18.
prob = exp(logodds)/(1+exp(logodds))
prob
prob[2] / prob[1]
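The odds ratio above can also be read directly off the Age coefficient scaled by the 10-year gap. A self-contained sketch with simulated data (variable names and coefficients are arbitrary, not the flu data) makes the identity explicit:

```r
set.seed(2)  # arbitrary seed
n = 200
age = runif(n, 20, 70)
y = rbinom(n, 1, plogis(-3 + 0.05 * age))  # simulated shot indicator

fit = glm(y ~ age, family=binomial())

# odds ratio for a 10-year age difference, two equivalent ways:
lo = predict(fit, data.frame(age=c(35, 45)), type='link')
c(exp(lo[2] - lo[1]), exp(10 * coef(fit)['age']))
```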
The Newton-Raphson updates for logistic regression are $$ \hat{\beta} \mapsto \hat{\beta} - \nabla^2 DEV(\hat{\beta})^{-1} \nabla DEV(\hat{\beta}) $$
These turn out to be the same as the IRLS updates.
In earlier statistical software one might only have access to a weighted least squares estimator.
One thing the IRLS procedure hints at is the approximate limiting distribution of $\hat{\beta}$: normal, with covariance $(X^TWX)^{-1}$.
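A minimal IRLS sketch on simulated data (arbitrary seed and coefficients; a fixed number of iterations rather than a proper convergence check) reproduces glm's estimates:

```r
set.seed(3)  # arbitrary seed
n = 200
X = cbind(1, rnorm(n))                 # design matrix with intercept column
y = rbinom(n, 1, plogis(X %*% c(-0.5, 1)))

beta = rep(0, ncol(X))
for (it in 1:25) {
  eta = drop(X %*% beta)
  p = plogis(eta)
  W = p * (1 - p)                      # IRLS weights
  z = eta + (y - p) / W                # working response
  beta = solve(t(X) %*% (W * X), t(X) %*% (W * z))
}
cbind(IRLS=drop(beta), glm=coef(glm(y ~ X[,2], family=binomial())))
```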
center = coef(flu.glm)['Age']
SE = sqrt(vcov(flu.glm)['Age', 'Age'])
U = center + SE * qnorm(0.975)
L = center - SE * qnorm(0.975)
data.frame(L, center, U)
pi.hat = fitted(flu.glm)
W.hat = pi.hat * (1 - pi.hat)
X = model.matrix(flu.glm)
C = solve(t(X) %*% (W.hat * X))
c(SE, sqrt(C['Age', 'Age']))
This Wald interval is not what R will give you if you ask it for confidence intervals:
R uses so-called profile intervals.
For large samples the two methods should agree quite closely.
CI = confint(flu.glm)
CI
mean(CI[2,]) # profile intervals are not symmetric around the estimate...
data.frame(L, center, U)
What about comparing full and reduced model?
This is closely related to the $F$ statistic with large $df_F$: approximately $F_{df_R-df_F,\, df_F} \cdot (df_R-df_F)$.
For logistic regression this difference in $SSE$ is replaced with the difference in deviance, $DEV_R - DEV_F$.
anova(glm(Shot ~ 1,
data=flu.table,
family=binomial()),
flu.glm)
anova(glm(Shot ~ Health.Aware,
data=flu.table,
family=binomial()),
flu.glm)
We should compare this difference in deviance with a $\chi^2_1$ random variable.
1 - pchisq(35.61, 2) # testing ~1 vs ~1 + Health.Aware + Age
1 - pchisq(16.863, 1) # testing ~ 1 + Health.Aware vs ~1 + Health.Aware + Age
Let's compare this with the Wald test:
summary(flu.glm)
Similar to least squares regression, residuals are used for diagnostics; the ones usually used are the deviance residuals $r_i = \text{sign}(Y_i-\widehat{\pi}_i) \sqrt{DEV(\widehat{\pi}_i|Y_i)}.$
These agree with the usual residuals in least squares regression.
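A brief check of the definition against R's built-in deviance residuals, on simulated data (arbitrary seed, not the flu data):

```r
set.seed(4)  # arbitrary seed
n = 150
x = rnorm(n)
y = rbinom(n, 1, plogis(x))

fit = glm(y ~ x, family=binomial())
pi.hat = fitted(fit)

# r_i = sign(Y_i - pi_i) * sqrt(DEV(pi_i | Y_i)), per-observation deviance
dev.i = -2 * (y * log(pi.hat) + (1 - y) * log(1 - pi.hat))
r = sign(y - pi.hat) * sqrt(dev.i)
max(abs(r - residuals(fit, type='deviance')))  # essentially zero
```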
par(mfrow=c(2,2))
plot(flu.glm)
influence.measures(flu.glm)
As the model is a likelihood based model, each fitted model has an AIC. Stepwise selection can be used easily …
step(flu.glm, scope=list(upper= ~.^2), direction='both')
library(glmnet)
X = model.matrix(flu.glm)[,-1]
Y = as.numeric(flu.table$Shot)
G = glmnet(X, Y, family="binomial")
plot(G)
library(ElemStatLearn)
data(spam)
dim(spam)
X = model.matrix(spam ~ ., data=spam)[,-1]
Y = as.numeric(spam$spam == 'spam')
G = glmnet(X, Y, family='binomial')
plot(G)
CV = cv.glmnet(X, Y, family='binomial')
plot(CV)
c(CV$lambda.min, CV$lambda.1se)
glmnet
beta.hat = coef(G, s=CV$lambda.1se)
beta.hat
Probit regression model: $$\Phi^{-1}(E(Y|X))= \sum_{j=1}^{p} \beta_j X_{j}$$ where $\Phi$ is CDF of $N(0,1)$, i.e. $\Phi(t) = {\tt pnorm(t)}$, $\Phi^{-1}(q) = {\tt qnorm}(q)$.
Regression function
In logit, probit and cloglog ${\text{Var}}(Y_i)=\pi_i(1-\pi_i)$ but the model for the mean is different.
Coefficients no longer have an odds ratio interpretation.
summary(glm(Shot ~ Age + Health.Aware,
data=flu.table,
family=binomial(link='probit')))
Given a dataset $(Y_i, X_{i1}, \dots, X_{ip}), 1 \leq i \leq n$ we consider a model for the distribution of $Y|X_1, \dots, X_p$.