Count data

Afterlife example

Men and women were asked whether they believed in the afterlife (1991 General Social Survey).

        Y   N or U   Total
M     435      147     582
F     375      134     509
Total 810      281    1091

Question of interest: is belief in the afterlife independent of gender?

The four outcomes in this model are the four cell counts.

In [7]:
%%R
G = c(rep('M', 435), rep('M', 147), rep('F', 375), rep('F', 134))
B = c(rep('Y', 435), rep('N/U', 147), rep('Y', 375), rep('N/U', 134))
Y = table(G, B)
Y
   B
G   N/U   Y
  F 134 375
  M 147 435

The information in the table can be partially summarized by the marginal proportions of M vs. F and Y vs. N/U.

In [8]:
%%R
N = sum(Y)
piG = c((Y[1,1] + Y[1,2]) / N, (Y[2,1] + Y[2,2])/N)
piB = c((Y[1,1] + Y[2,1]) / N, (Y[1,2] + Y[2,2])/N)
print(piG)
print(piB)
[1] 0.4665445 0.5334555
[1] 0.2575619 0.7424381

Poisson random variable

  • A random variable $Y$ is a Poisson random variable with parameter $\lambda$ if $P(Y=j) = e^{-\lambda} \frac{\lambda^j}{j!}, \qquad \forall j \geq 0.$

  • Some simple calculations show that $E(Y)=\text{Var}(Y)=\lambda$ (see the simulation sketch after this list).

  • Poisson models for counts are analogous to Gaussian models for continuous outcomes. By this we mean that, just as sums of many independent continuous outcomes often look more and more Gaussian, sums of many independent counts often look more and more Poisson.
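
As a quick illustration (a simulation sketch, not part of the original analysis), we can check the mean-variance identity empirically:

In [ ]:
%%R
# Simulation sketch: for Poisson(lambda) draws, the sample mean and the
# sample variance should both be close to lambda (lambda = 3 here).
set.seed(1)  # arbitrary seed, for reproducibility
y = rpois(100000, lambda=3)
c(mean(y), var(y))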

2x2 contingency table

  • Model: $Y_{ij} \sim Poisson(\lambda_{ij} )$.
  • Null: $H_0 : \text{independence}, \lambda_{ij} = \lambda \alpha_i \cdot \beta_j , \sum_i \alpha_i = 1, \sum_j \beta_j = 1.$
  • Alternative: $H_a : \text{$\lambda_{ij}$ 's are unrestricted}$
  • Test statistic: Pearson’s $X^2$ : $X^2 = \sum_{ij} \frac{(Y_{ij}-E_{ij})^2}{E_{ij}} \approx \chi^2_1 \ \text{under $H_0$}$
  • Why 1 df? The independence model has 5 parameters ($\lambda, \alpha_1, \alpha_2, \beta_1, \beta_2$) and two constraints, hence $5 - 2 = 3$ free parameters; the unrestricted model has 4 free parameters, so the difference is $4 - 3 = 1$ df.
  • As this is a model for the mean of the counts, this is actually a regression model for the count data.

What if we had more categories?

  • Suppose we had $k$ categories on one axis and $m$ on the other (in the previous example, $k = m = 2$). We call this a $k \times m$ contingency table.
  • Independence model: $\log(E (Y_{ij} )) = \log \lambda_{ij} = \log \lambda + \log \alpha_i + \log \beta_j$ (a sketch fitting this model to a hypothetical table follows this list).
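
As a concrete sketch (with made-up counts, not data from these notes), here is the independence model fit to a hypothetical $3 \times 2$ table as a Poisson regression:

In [ ]:
%%R
# Hypothetical 3 x 2 table, flattened to a vector (made-up counts).
counts = c(20, 30, 25, 40, 15, 10)
row = factor(rep(c('r1', 'r2', 'r3'), 2))  # k = 3 row categories
col = factor(rep(c('c1', 'c2'), each=3))   # m = 2 column categories
# Independence model: log E(Y_ij) = log lambda + log alpha_i + log beta_j;
# residual df = km - (k + m - 1) = (k-1)(m-1) = 2 here.
summary(glm(counts ~ row + col, family=poisson()))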

Contingency table as regression model

  • Under independence $$\begin{aligned} \log(E (Y_{ij} )) &= \log \lambda_{ij} = \log \lambda + \log \alpha_i + \log \beta_j \end{aligned}$$

  • In other words, the model has $\log$ as its link function.

  • What about the variance? Because of the Poisson assumption, $\text{Var}(Y_{ij}) = E(Y_{ij})$.

  • In other words, the variance function of the model is $V (\mu) = \mu.$

  • Recall that logistic regression had link function $\text{logit}(\pi)=\log(\pi/(1-\pi))$ and variance function $V(\pi)=\pi(1-\pi)$. Both pairs can be inspected directly in R, as shown below.
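
In R, each family object carries its link and variance functions (a small check, not in the original notes):

In [ ]:
%%R
# Family objects store the link and variance functions.
poisson()$linkfun(2)      # log link: log(2)
poisson()$variance(2)     # V(mu) = mu, so returns 2
binomial()$linkfun(0.3)   # logit link: log(0.3 / 0.7)
binomial()$variance(0.3)  # V(pi) = pi * (1 - pi) = 0.21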

In the independence model, the expected cell counts are estimated by taking the marginal proportions and multiplying by the marginal total.

In [9]:
%%R
# Expected counts under independence: E[i,j] = N * piG[i] * piB[j]
E = N*matrix(c(piG[1]*piB[1], piG[2]*piB[1], piG[1]*piB[2], piG[2]*piB[2]), 2, 2)
E
        [,1]    [,2]
[1,] 131.099 377.901
[2,] 149.901 432.099

We will also compute Pearson's $X^2$.

In [10]:
%%R
X2 = sum((Y-E)^2/E)
X2
[1] 0.162084

Testing for independence in contingency tables

  • Test for independence uses Pearson’s $X^2 = \sum_{ij} \frac{(Y_{ij}-E_{ij})^2}{E_{ij}} \approx \chi^2_{(k-1)(m-1)} \ \text{under $H_0$}$
  • Alternative test statistic based on the deviance $G = 2\sum_{ij} Y_{ij} \log \left(\frac{Y_{ij}}{E_{ij}}\right)$
In [11]:
%%R
G1 = sum(2*Y*log(Y/E))            # deviance, dropping the (Y - E) terms
G2 = sum(2*(Y*log(Y/E) + Y - E))  # full form; the (Y - E) terms sum to 0
                                  # here because E matches the margins of Y
c(X2, G1, G2)
[1] 0.1620840 0.1619951 0.1619951

This test is standard, and R has a built-in function for it.

In [12]:
%%R
chisq.test(Y, correct=FALSE)

	Pearson's Chi-squared test

data:  Y
X-squared = 0.1621, df = 1, p-value = 0.6872


The correct argument controls Yates' continuity correction, which is applied by default for $2 \times 2$ tables and is intended for small counts.

In [13]:
%%R
chisq.test(Y)

	Pearson's Chi-squared test with Yates' continuity correction

data:  Y
X-squared = 0.111, df = 1, p-value = 0.739


Pearson's $X^2$ via glm

As we have a GLM, we can form the above test using glm. The $X^2$ (or, more accurately, the deviance form $G$ above) is the Residual deviance.

In [14]:
%%R
Gf = c('F','M', 'F', 'M')      # gender, in the column-major order of Y
Bf = c('N/U', 'N/U', 'Y', 'Y') # belief, in the same order
Yf = as.numeric(Y)             # flatten the table to a vector of counts
summary(glm(Yf ~ Gf + Bf, family=poisson()))

Call:
glm(formula = Yf ~ Gf + Bf, family = poisson())

Deviance Residuals: 
      1        2        3        4  
 0.2524  -0.2377  -0.1494   0.1394  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  4.87595    0.06787  71.839   <2e-16 ***
GfM          0.13402    0.06069   2.208   0.0272 *  
BfY          1.05868    0.06923  15.291   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 272.685  on 3  degrees of freedom
Residual deviance:   0.162  on 1  degrees of freedom
AIC: 35.407

Number of Fisher Scoring iterations: 3


If we really want Pearson's $X^2$, the deviance must be replaced by the $X^2$ statistic. This is what the quasi* families in glm do: the dispersion they report is Pearson's $X^2$ divided by the residual degrees of freedom.

In [15]:
%%R
summary(glm(Yf ~ Gf + Bf, family=quasipoisson()))

Call:
glm(formula = Yf ~ Gf + Bf, family = quasipoisson())

Deviance Residuals: 
      1        2        3        4  
 0.2524  -0.2377  -0.1494   0.1394  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  4.87595    0.02733 178.440  0.00357 **
GfM          0.13402    0.02443   5.485  0.11479   
BfY          1.05868    0.02787  37.982  0.01676 * 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for quasipoisson family taken to be 0.162084)

    Null deviance: 272.685  on 3  degrees of freedom
Residual deviance:   0.162  on 1  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 3
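
Since the residual degrees of freedom equal 1 here, the reported dispersion (0.162084) is exactly Pearson's $X^2$ from above. As a check (reusing the objects defined earlier):

In [ ]:
%%R
# The quasipoisson dispersion is Pearson's X^2 divided by the residual df.
pois.glm = glm(Yf ~ Gf + Bf, family=poisson())
sum(residuals(pois.glm, type='pearson')^2) / pois.glm$df.residual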


Independence test via logistic regression

You can also compute the $2 \times k$ goodness of fit test statistic with logistic regression.

In [16]:
%%R
anova(glm(B == 'Y' ~ G, family=binomial()))
Analysis of Deviance Table

Model: binomial, link: logit

Response: B == "Y"

Terms added sequentially (first to last)


     Df Deviance Resid. Df Resid. Dev
NULL                  1090     1244.8
G     1    0.162      1089     1244.7

Goodness of fit test

  • Unlike in other cases, here the full model has as many parameters as observations (i.e. it is saturated).
  • This test is known as a goodness of fit test.
  • The test tries to answer the question: "How well does the independence model fit the data?" A quick computation of the corresponding $p$-value follows.
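
For the independence model above (residual deviance 0.162 on 1 df):

In [ ]:
%%R
# Goodness-of-fit p-value: residual deviance referred to chi-squared(1).
1 - pchisq(0.162, 1)

This matches the $p$-value reported by chisq.test earlier.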

Regression models for count data

Lumber company example

A lumber company wants to model the number of customers visiting its stores across different regions. For each region, they collect features describing typical households and where the region is located relative to their stores.

Variable     Description
Customers    Number of customers visiting the store from the region.
Housing      Number of housing units in the region.
Income       Average household income.
Age          Average housing unit age in the region.
Competitor   Distance to the nearest competitor.
Store        Distance to the store in miles.
In [17]:
%%R
url = 'http://stats191.stanford.edu/data/lumber.table'
lumber.table = read.table(url, header=T)
head(lumber.table)
  Customers Housing Income Age Competitor Store
1         9     606  41393   3       3.04  6.32
2         6     641  23635  18       1.95  8.89
3        28     505  55475  27       6.54  2.05
4        11     866  64646  31       1.67  5.81
5         4     599  31972   7       0.72  8.11
6         4     520  41755  23       2.24  6.81

Poisson (log-linear) regression model

  • Given observations and covariates $Y_i , X_{ij} , 1 \leq i \leq n, 1 \leq j \leq p$.
  • Model: $$Y_{i} \sim Poisson \left(\exp\left(\sum_{j=1}^p \beta_j X_{ij} \right)\right)$$
  • Poisson assumption implies the variance function is $V (\mu) = \mu.$

  • If the link is not specified, it is assumed to be 'log'.

In [18]:
%%R
lumber.glm = glm(Customers ~ Housing + Income + Age + Competitor + Store, family=poisson(link='log'), data=lumber.table)
summary(lumber.glm)

Call:
glm(formula = Customers ~ Housing + Income + Age + Competitor + 
    Store, family = poisson(link = "log"), data = lumber.table)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.93195  -0.58868  -0.00009   0.59269   2.23441  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  2.942e+00  2.072e-01  14.198  < 2e-16 ***
Housing      6.058e-04  1.421e-04   4.262 2.02e-05 ***
Income      -1.169e-05  2.112e-06  -5.534 3.13e-08 ***
Age         -3.726e-03  1.782e-03  -2.091   0.0365 *  
Competitor   1.684e-01  2.577e-02   6.534 6.39e-11 ***
Store       -1.288e-01  1.620e-02  -7.948 1.89e-15 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 422.22  on 109  degrees of freedom
Residual deviance: 114.99  on 104  degrees of freedom
AIC: 571.02

Number of Fisher Scoring iterations: 4


In [19]:
%%R -h 800 -w 800
par(mfrow=c(2,2))
plot(lumber.glm, pch=23, bg='orange', cex=2)

Interpretation of coefficients

  • The log-linear model means the covariates have a multiplicative effect.
  • Log-linear model: $$\frac{E(Y|\dots, X_j=x_j+1, \dots)}{E(Y|\dots, X_j=x_j, \dots)} = e^{\beta_j}$$
  • So a one-unit increase in variable $j$ multiplies the expected count by $e^{\beta_j}$, all other variables being equal (computed for the lumber model below).
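
For the fitted lumber model, these multiplicative effects are the exponentiated coefficients:

In [ ]:
%%R
# Multiplicative change in the expected customer count per one-unit
# increase in each variable: exponentiate the fitted coefficients.
exp(coef(lumber.glm))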

Generalized linear models

Recall that a GLM has two ingredients: a link function and a variance function. The Poisson log-linear regression model is a GLM.

  • Logistic model: ${\text{logit}}(\pi) = \beta_0 + \sum_j \beta_j X_j \qquad V(\pi)=\pi(1-\pi)$
  • Poisson log-linear model: $\log(\mu) = \beta_0 + \sum_j \beta_j X_j, \qquad V(\mu) = \mu$

Specifying a model

  • Given $(Y, X_1, \dots, X_p)$, a GLM is specified by the (link, variance function) pair $(g, V)$.
  • Fit using IRLS (also known as Fisher scoring), as in logistic regression.
  • Inference in terms of deviance or Pearson’s $X^2$: $$X^2({\cal M}) = \sum_{i=1}^n \frac{(Y_i - \widehat{\mu}_{{\cal M},i})^2}{V(\widehat{\mu}_{{\cal M},i})}.$$
  • For a Poisson model, the $X^2$ statistic has the form $$ \sum_{i=1}^n \frac{(Y_i - E_i)^2}{E_i} $$ just as in the contingency table case (see the computation after this list).
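
For the Poisson lumber model, Pearson's $X^2$ is the sum of squared Pearson residuals (a quick check using the fit from above):

In [ ]:
%%R
# Pearson's X^2: the sum over observations of (Y_i - mu_hat_i)^2 / V(mu_hat_i).
sum(residuals(lumber.glm, type='pearson')^2)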

Deviance

  • Recall that in the GLM setting, we replace $SSE$ with the notion of deviance, defined as $$ DEV({\cal M}) = -2 \left(\log L(\widehat{\mu}({\cal M})|Y,X) - \log L(Y|Y,X) \right)$$ where the second term is the log-likelihood of the saturated model.

  • The Poisson deviance is $$DEV({\cal M}|Y) = 2 \sum_{i=1}^n \left( Y_i \log \left(\frac{Y_i}{ \widehat{\mu}_{{\cal M},i}} \right) + (Y_i - \widehat{\mu}_{{\cal M},i} ) \right)$$

  • To test $H_0:{\cal M}={\cal M}_R$ vs. $H_a: {\cal M}={\cal M}_F$, we use $$DEV({\cal M}_R) - DEV({\cal M}_F) \approx \chi^2_{df_R-df_F}$$

  • In the contingency table example, ${\cal M}_R$ is the independence model $$\log(E(Y_{ij})) = \log \lambda + \log \alpha_i + \log \beta_j$$ with ${\cal M}_F$ being the "saturated model."

In [20]:
%%R
lumber.R.glm = glm(Customers ~ Housing + Income + Age, family=poisson(link='log'), data=lumber.table)
anova(lumber.R.glm, lumber.glm)
Analysis of Deviance Table

Model 1: Customers ~ Housing + Income + Age
Model 2: Customers ~ Housing + Income + Age + Competitor + Store
  Resid. Df Resid. Dev Df Deviance
1       106     378.43            
2       104     114.99  2   263.45

The $p$-value would be based on a $\chi^2$ with 2 degrees of freedom.

In [21]:
%%R
1 - pchisq(263.45, 2)
[1] 0

Stepwise model selection works here as well.

In [22]:
%%R
step(lumber.glm, list(upper=~.^2), direction='both')
Start:  AIC=571.02
Customers ~ Housing + Income + Age + Competitor + Store

                     Df Deviance    AIC
+ Competitor:Store    1   106.92 564.96
<none>                    114.98 571.02
+ Housing:Income      1   113.18 571.21
+ Housing:Competitor  1   113.69 571.73
+ Income:Age          1   113.73 571.77
+ Age:Competitor      1   114.05 572.09
+ Income:Store        1   114.74 572.78
+ Housing:Age         1   114.77 572.80
+ Age:Store           1   114.92 572.96
+ Income:Competitor   1   114.94 572.98
+ Housing:Store       1   114.98 573.02
- Age                 1   119.36 573.40
- Housing             1   133.19 587.23
- Income              1   146.78 600.82
- Competitor          1   156.65 610.68
- Store               1   182.49 636.52

Step:  AIC=564.96
Customers ~ Housing + Income + Age + Competitor + Store + Competitor:Store

                     Df Deviance    AIC
+ Income:Store        1   104.42 564.46
<none>                    106.92 564.96
+ Housing:Income      1   105.12 565.16
+ Income:Age          1   105.13 565.17
+ Age:Competitor      1   105.21 565.25
+ Income:Competitor   1   105.29 565.33
+ Housing:Age         1   106.53 566.57
+ Housing:Competitor  1   106.65 566.69
+ Age:Store           1   106.71 566.75
+ Housing:Store       1   106.73 566.77
- Age                 1   111.37 567.41
- Competitor:Store    1   114.98 571.02
- Housing             1   124.62 580.66
- Income              1   132.00 588.04

Step:  AIC=564.46
Customers ~ Housing + Income + Age + Competitor + Store + Competitor:Store + 
    Income:Store

                     Df Deviance    AIC
<none>                    104.42 564.46
+ Age:Competitor      1   102.68 564.72
+ Housing:Competitor  1   102.72 564.76
- Income:Store        1   106.92 564.96
+ Income:Age          1   103.42 565.46
+ Age:Store           1   104.03 566.07
+ Housing:Store       1   104.05 566.09
+ Housing:Age         1   104.09 566.12
+ Housing:Income      1   104.13 566.17
+ Income:Competitor   1   104.39 566.43
- Age                 1   108.73 566.77
- Competitor:Store    1   114.74 572.78
- Housing             1   123.10 581.14

Call:  glm(formula = Customers ~ Housing + Income + Age + Competitor + 
    Store + Competitor:Store + Income:Store, family = poisson(link = "log"), 
    data = lumber.table)

Coefficients:
     (Intercept)           Housing            Income               Age  
       3.492e+00         6.169e-04        -5.079e-06        -3.687e-03  
      Competitor             Store  Competitor:Store      Income:Store  
      -2.222e-02        -2.024e-01         2.835e-02        -1.089e-06  

Degrees of Freedom: 109 Total (i.e. Null);  102 Residual
Null Deviance:	    422.2 
Residual Deviance: 104.4 	AIC: 564.5