In this chapter, we are concerned with situations in which the value to be predicted is on a metric scale, and there are several predictors, each of which is also on a metric scale.
y ∼ normal(μ, σ) and μ = β0 + β1·x1 + β2·x2.
The data block of the JAGS code then standardizes the predictors by looping through the columns of the x matrix, as follows:
data {
  ym <- mean(y)
  ysd <- sd(y)
  for ( i in 1:Ntotal ) { # Ntotal is the number of data rows
    zy[i] <- ( y[i] - ym ) / ysd
  }
  for ( j in 1:Nx ) { # Nx is the number of x predictors
    xm[j] <- mean(x[,j]) # x is a matrix, each column a predictor
    xsd[j] <- sd(x[,j])
    for ( i in 1:Ntotal ) {
      zx[i,j] <- ( x[i,j] - xm[j] ) / xsd[j]
    }
  }
}
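The same standardization can be checked outside JAGS. Below is a minimal Python sketch of the identical computation (the small arrays are made-up illustration values, not the SAT data; `ddof=1` matches R's `sd()`, which uses the n−1 denominator):

```python
import numpy as np

# Made-up illustration data: 3 rows, 2 predictors.
x = np.array([[4.4, 8.0],
              [9.0, 47.0],
              [4.8, 27.0]])
y = np.array([1029.0, 934.0, 944.0])

# Standardize y, mirroring the JAGS data block.
ym, ysd = y.mean(), y.std(ddof=1)
zy = (y - ym) / ysd

# Standardize each predictor column.
xm, xsd = x.mean(axis=0), x.std(axis=0, ddof=1)
zx = (x - xm) / xsd

# Each standardized variable now has mean 0 and (n-1)-denominator SD 1.
print(np.allclose(zy.mean(), 0), np.allclose(zx.mean(axis=0), 0))
```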
The standardized parameters are then transformed to the original scale by generalizing Equation 17.2 (p. 485) to multiple predictors: βj = ζj·SDy/SDxj and β0 = ζ0·SDy + My − Σj ζj·SDy·Mxj/SDxj, where My, SDy, Mxj, and SDxj are the means and standard deviations of y and of the jth predictor.
The JAGS model specification looks like this:
model {
  for ( i in 1:Ntotal ) {
    zy[i] ~ dt( zbeta0 + sum( zbeta[1:Nx] * zx[i,1:Nx] ) , 1/zsigma^2 , nu )
  }
  # Priors vague on standardized scale:
  zbeta0 ~ dnorm( 0 , 1/2^2 )
  for ( j in 1:Nx ) {
    zbeta[j] ~ dnorm( 0 , 1/2^2 )
  }
  zsigma ~ dunif( 1.0E-5 , 1.0E+1 )
  nu <- nuMinusOne+1
  nuMinusOne ~ dexp(1/29.0)
  # Transform to original scale:
  beta[1:Nx] <- ( zbeta[1:Nx] / xsd[1:Nx] )*ysd
  beta0 <- zbeta0*ysd + ym - sum( zbeta[1:Nx] * xm[1:Nx] / xsd[1:Nx] )*ysd
  sigma <- zsigma*ysd
}
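The transform-to-original-scale lines can be sanity-checked numerically. The Python sketch below (made-up data; least-squares point estimates stand in for posterior draws of zbeta0 and zbeta) applies exactly the JAGS back-transformation and confirms that it reproduces the coefficients of a direct fit on the raw scale:

```python
import numpy as np

# Made-up illustration data: 50 rows, 2 predictors on different scales.
rng = np.random.default_rng(0)
x = rng.normal(size=(50, 2)) * [2.0, 10.0] + [5.0, 30.0]
y = 1000.0 + 12.0 * x[:, 0] - 3.0 * x[:, 1] + rng.normal(scale=5.0, size=50)

# Standardize as in the JAGS data block (n-1 denominator, like R's sd()).
ym, ysd = y.mean(), y.std(ddof=1)
xm, xsd = x.mean(axis=0), x.std(axis=0, ddof=1)
zy = (y - ym) / ysd
zx = (x - xm) / xsd

# Least-squares estimates on the standardized scale (stand-ins for zbeta0, zbeta).
A = np.column_stack([np.ones(len(zy)), zx])
zbeta0, *zbeta = np.linalg.lstsq(A, zy, rcond=None)[0]
zbeta = np.array(zbeta)

# Transform to the original scale, exactly as in the JAGS model block.
beta = zbeta / xsd * ysd
beta0 = zbeta0 * ysd + ym - np.sum(zbeta * xm / xsd) * ysd

# Compare against a direct fit on the raw scale.
B = np.column_stack([np.ones(len(y)), x])
beta0_direct, *beta_direct = np.linalg.lstsq(B, y, rcond=None)[0]
print(np.allclose(beta0, beta0_direct), np.allclose(beta, beta_direct))
```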
# Not in the book: switch the working directory to the data folder so the scripts placed there for practice can be run.
cur_dir = getwd()
setwd(sprintf("%s/%s", cur_dir, 'data'))
# Example for Jags-Ymet-XmetMulti-Mrobust.R
#-------------------------------------------------------------------------------
# Optional generic preliminaries:
#graphics.off() # This closes all of R's graphics windows.
#rm(list=ls()) # Careful! This clears all of R's memory!
#.............................................................................
# # Two predictors:
myData = read.csv( file="Guber1999data.csv" )
myData
| | State | Spend | StuTeaRat | Salary | PrcntTake | SATV | SATM | SATT |
|---|---|---|---|---|---|---|---|---|
1 | Alabama | 4.405 | 17.2 | 31.144 | 8 | 491 | 538 | 1029 |
2 | Alaska | 8.963 | 17.6 | 47.951 | 47 | 445 | 489 | 934 |
3 | Arizona | 4.778 | 19.3 | 32.175 | 27 | 448 | 496 | 944 |
4 | Arkansas | 4.459 | 17.1 | 28.934 | 6 | 482 | 523 | 1005 |
5 | California | 4.992 | 24 | 41.078 | 45 | 417 | 485 | 902 |
6 | Colorado | 5.443 | 18.4 | 34.571 | 29 | 462 | 518 | 980 |
7 | Connecticut | 8.817 | 14.4 | 50.045 | 81 | 431 | 477 | 908 |
8 | Delaware | 7.03 | 16.6 | 39.076 | 68 | 429 | 468 | 897 |
9 | Florida | 5.718 | 19.1 | 32.588 | 48 | 420 | 469 | 889 |
10 | Georgia | 5.193 | 16.3 | 32.291 | 65 | 406 | 448 | 854 |
11 | Hawaii | 6.078 | 17.9 | 38.518 | 57 | 407 | 482 | 889 |
12 | Idaho | 4.21 | 19.1 | 29.783 | 15 | 468 | 511 | 979 |
13 | Illinois | 6.136 | 17.3 | 39.431 | 13 | 488 | 560 | 1048 |
14 | Indiana | 5.826 | 17.5 | 36.785 | 58 | 415 | 467 | 882 |
15 | Iowa | 5.483 | 15.8 | 31.511 | 5 | 516 | 583 | 1099 |
16 | Kansas | 5.817 | 15.1 | 34.652 | 9 | 503 | 557 | 1060 |
17 | Kentucky | 5.217 | 17 | 32.257 | 11 | 477 | 522 | 999 |
18 | Louisiana | 4.761 | 16.8 | 26.461 | 9 | 486 | 535 | 1021 |
19 | Maine | 6.428 | 13.8 | 31.972 | 68 | 427 | 469 | 896 |
20 | Maryland | 7.245 | 17 | 40.661 | 64 | 430 | 479 | 909 |
21 | Massachusetts | 7.287 | 14.8 | 40.795 | 80 | 430 | 477 | 907 |
22 | Michigan | 6.994 | 20.1 | 41.895 | 11 | 484 | 549 | 1033 |
23 | Minnesota | 6 | 17.5 | 35.948 | 9 | 506 | 579 | 1085 |
24 | Mississippi | 4.08 | 17.5 | 26.818 | 4 | 496 | 540 | 1036 |
25 | Missouri | 5.383 | 15.5 | 31.189 | 9 | 495 | 550 | 1045 |
26 | Montana | 5.692 | 16.3 | 28.785 | 21 | 473 | 536 | 1009 |
27 | Nebraska | 5.935 | 14.5 | 30.922 | 9 | 494 | 556 | 1050 |
28 | Nevada | 5.16 | 18.7 | 34.836 | 30 | 434 | 483 | 917 |
29 | New Hampshire | 5.859 | 15.6 | 34.72 | 70 | 444 | 491 | 935 |
30 | New Jersey | 9.774 | 13.8 | 46.087 | 70 | 420 | 478 | 898 |
31 | New Mexico | 4.586 | 17.2 | 28.493 | 11 | 485 | 530 | 1015 |
32 | New York | 9.623 | 15.2 | 47.612 | 74 | 419 | 473 | 892 |
33 | North Carolina | 5.077 | 16.2 | 30.793 | 60 | 411 | 454 | 865 |
34 | North Dakota | 4.775 | 15.3 | 26.327 | 5 | 515 | 592 | 1107 |
35 | Ohio | 6.162 | 16.6 | 36.802 | 23 | 460 | 515 | 975 |
36 | Oklahoma | 4.845 | 15.5 | 28.172 | 9 | 491 | 536 | 1027 |
37 | Oregon | 6.436 | 19.9 | 38.555 | 51 | 448 | 499 | 947 |
38 | Pennsylvania | 7.109 | 17.1 | 44.51 | 70 | 419 | 461 | 880 |
39 | Rhode Island | 7.469 | 14.7 | 40.729 | 70 | 425 | 463 | 888 |
40 | South Carolina | 4.797 | 16.4 | 30.279 | 58 | 401 | 443 | 844 |
41 | South Dakota | 4.775 | 14.4 | 25.994 | 5 | 505 | 563 | 1068 |
42 | Tennessee | 4.388 | 18.6 | 32.477 | 12 | 497 | 543 | 1040 |
43 | Texas | 5.222 | 15.7 | 31.223 | 47 | 419 | 474 | 893 |
44 | Utah | 3.656 | 24.3 | 29.082 | 4 | 513 | 563 | 1076 |
45 | Vermont | 6.75 | 13.8 | 35.406 | 68 | 429 | 472 | 901 |
46 | Virginia | 5.327 | 14.6 | 33.987 | 65 | 428 | 468 | 896 |
47 | Washington | 5.906 | 20.2 | 36.151 | 48 | 443 | 494 | 937 |
48 | West Virginia | 6.107 | 14.8 | 31.944 | 17 | 448 | 484 | 932 |
49 | Wisconsin | 6.93 | 15.9 | 37.746 | 9 | 501 | 572 | 1073 |
50 | Wyoming | 6.16 | 14.9 | 31.285 | 10 | 476 | 525 | 1001 |
summary(myData)
         State        Spend         StuTeaRat         Salary
 Alabama   : 1   Min.   :3.656   Min.   :13.80   Min.   :25.99
 Alaska    : 1   1st Qu.:4.882   1st Qu.:15.22   1st Qu.:30.98
 Arizona   : 1   Median :5.768   Median :16.60   Median :33.29
 Arkansas  : 1   Mean   :5.905   Mean   :16.86   Mean   :34.83
 California: 1   3rd Qu.:6.434   3rd Qu.:17.57   3rd Qu.:38.55
 Colorado  : 1   Max.   :9.774   Max.   :24.30   Max.   :50.05
 (Other)   :44
   PrcntTake          SATV            SATM            SATT
 Min.   : 4.00   Min.   :401.0   Min.   :443.0   Min.   : 844.0
 1st Qu.: 9.00   1st Qu.:427.2   1st Qu.:474.8   1st Qu.: 897.2
 Median :28.00   Median :448.0   Median :497.5   Median : 945.5
 Mean   :35.24   Mean   :457.1   Mean   :508.8   Mean   : 965.9
 3rd Qu.:63.00   3rd Qu.:490.2   3rd Qu.:539.5   3rd Qu.:1032.0
 Max.   :81.00   Max.   :516.0   Max.   :592.0   Max.   :1107.0
yName = "SATT" ; xName = c("Spend","PrcntTake")
fileNameRoot = "Guber1999data-Jags-"
numSavedSteps=15000 ; thinSteps=5
graphFileType = "png"
#-------------------------------------------------------------------------------
# Load the relevant model into R's working memory:
source("Jags-Ymet-XmetMulti-Mrobust.R")
*********************************************************************
Kruschke, J. K. (2015). Doing Bayesian Data Analysis, Second Edition:
A Tutorial with R, JAGS, and Stan. Academic Press / Elsevier.
*********************************************************************
Loading required package: coda
#-------------------------------------------------------------------------------
# Generate the MCMC chain:
#startTime = proc.time()
mcmcCoda = genMCMC( data=myData , xName=xName , yName=yName ,
numSavedSteps=numSavedSteps , thinSteps=thinSteps ,
saveName=fileNameRoot )
CORRELATION MATRIX OF PREDICTORS:
           Spend PrcntTake
Spend      1.000     0.593
PrcntTake  0.593     1.000

Calling 3 simulations using the parallel method...
Following the progress of chain 1 (the program will wait for all chains
to finish before continuing):
Welcome to JAGS 3.4.0 on Wed Sep 16 20:42:43 2015
JAGS is free software and comes with ABSOLUTELY NO WARRANTY
Loading module: basemod: ok
Loading module: bugs: ok
. . Reading data file data.txt
. Compiling data graph
   Resolving undeclared variables
   Allocating nodes
   Initializing
   Reading data back into data table
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
Graph Size: 535
. Reading parameter file inits1.txt
. Initializing model
. Adapting 500
-------------------------------------------------| 500
++++++++++++++++++++++++++++++++++++++++++++++++++ 100%
Adaptation successful
. Updating 1000
-------------------------------------------------| 1000
************************************************** 100%
. . . . . . . . Updating 25000
-------------------------------------------------| 25000
************************************************** 100%
. . . . Updating 0
. Deleting model
.
All chains have finished
Simulation complete. Reading coda files...
Coda files loaded successfully
Finished running the simulation
#-------------------------------------------------------------------------------
# Display diagnostics of chain, for specified parameters:
parameterNames = varnames(mcmcCoda) # get all parameter names
for ( parName in parameterNames ) {
diagMCMC( codaObject=mcmcCoda , parName=parName ,
saveName=fileNameRoot , saveType=graphFileType )
}
# Close the graphics windows that pop up.
graphics.off()
#-------------------------------------------------------------------------------
# Get summary statistics of chain:
summaryInfo = smryMCMC( mcmcCoda ,
saveName=fileNameRoot )
show(summaryInfo)
                   Mean        Median          Mode     ESS HDImass      HDIlow
CHAIN       2.000000000   2.000000000  1.997413e+00     1.5    0.95   1.0000000
beta0     991.492221267 991.336500000  9.899081e+02 13808.0    0.95 948.3840000
beta[1]    12.832703606  12.858900000  1.273112e+01 13153.9    0.95   4.5361400
beta[2]    -2.878617251  -2.879090000 -2.881414e+00 13227.0    0.95  -3.3193300
sigma      31.530577473  31.329000000  3.106441e+01 13692.3    0.95  23.9116000
zbeta0     -0.001200131  -0.001465985 -9.973416e-05 15000.0    0.95  -0.1229360
zbeta[1]    0.233739136   0.234216000  2.318889e-01 13153.9    0.95   0.0826227
zbeta[2]   -1.029646875  -1.029815000 -1.030645e+00 13227.0    0.95  -1.1872800
zsigma      0.421415965   0.418722000  4.151860e-01 13692.3    0.95   0.3195860
nu         33.397469008  24.824800000  1.103249e+01 10801.1    0.95   1.8336700
log10(nu)   1.375887230   1.394885757  1.485261e+00 12329.1    0.95   0.6664712
              HDIhigh CompVal PcntGtCompVal ROPElow ROPEhigh PcntLtROPE
CHAIN        3.000000      NA            NA      NA       NA         NA
beta0     1035.490000      NA            NA      NA       NA         NA
beta[1]     21.497200      NA            NA      NA       NA         NA
beta[2]     -2.442310      NA            NA      NA       NA         NA
sigma       39.230700      NA            NA      NA       NA         NA
zbeta0       0.121780      NA            NA      NA       NA         NA
zbeta[1]     0.391557      NA            NA      NA       NA         NA
zbeta[2]    -0.873587      NA            NA      NA       NA         NA
zsigma       0.524331      NA            NA      NA       NA         NA
nu          90.746200      NA            NA      NA       NA         NA
log10(nu)    2.072673      NA            NA      NA       NA         NA
          PcntInROPE PcntGtROPE
CHAIN             NA         NA
beta0             NA         NA
beta[1]           NA         NA
beta[2]           NA         NA
sigma             NA         NA
zbeta0            NA         NA
zbeta[1]          NA         NA
zbeta[2]          NA         NA
zsigma            NA         NA
nu                NA         NA
log10(nu)         NA         NA
# Display posterior information:
plotMCMC( mcmcCoda , data=myData , xName=xName , yName=yName ,
pairsPlot=TRUE , showCurve=FALSE ,
saveName=fileNameRoot , saveType=graphFileType )
# Close the graphics windows that pop up.
graphics.off()
can compute a surrogate. At each step in the MCMC chain, a credible value of R² is computed as R² = Σj ζj·r(y,xj), where ζj is the standardized regression coefficient for the jth predictor at that step in the MCMC chain, and r(y,xj) is the correlation of the predicted values, y, with the jth predictor values, xj.
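As a numerical check of this surrogate, the Python sketch below (made-up data; a least-squares fit stands in for one MCMC step's ζ values) computes Σj ζj·r(y,xj) and confirms that, for least squares, it matches the classical R² computed from residuals:

```python
import numpy as np

# Made-up illustration data.
rng = np.random.default_rng(1)
x = rng.normal(size=(60, 2))
y = 3.0 * x[:, 0] - 1.0 * x[:, 1] + rng.normal(size=60)

# Standardize (n-1 denominator, as in R's sd()).
zy = (y - y.mean()) / y.std(ddof=1)
zx = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)

# Standardized regression coefficients: least-squares stand-ins for one
# MCMC step's zbeta values.
A = np.column_stack([np.ones(len(zy)), zx])
coef = np.linalg.lstsq(A, zy, rcond=None)[0]
zbeta = coef[1:]

# r[j]: correlation of y with the jth predictor.
r = np.array([np.corrcoef(y, x[:, j])[0, 1] for j in range(x.shape[1])])

# Surrogate R^2 = sum_j zbeta[j] * r[j]; compare with the residual-based R^2.
R2_surrogate = float(np.sum(zbeta * r))
resid = zy - A @ coef
R2_classic = 1 - resid.var() / zy.var()
print(np.allclose(R2_surrogate, R2_classic))
```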
we conceptualize the interaction term of Equation 18.2 as an additional additive predictor, like this:
# Read in data:
myData = read.csv( file="Guber1999data.csv" )
# Append the new interaction variable:
myData = cbind( myData , SpendXPrcnt = myData[,"Spend"] * myData[,"PrcntTake"] )
# Specify names of data columns to use in the analysis:
yName = "SATT" ; xName = c("Spend","PrcntTake","SpendXPrcnt")
#### Modified version of Jags-Ymet-XmetMulti-Mrobust-Example.R
# Not in the book: switch the working directory to the data folder so the scripts placed there for practice can be run.
cur_dir = getwd()
setwd(sprintf("%s/%s", cur_dir, 'data'))
#.............................................................................
# # Two predictors:
myData = read.csv( file="Guber1999data.csv" )
# Append the new interaction variable:
myData = cbind( myData , SpendXPrcnt = myData[,"Spend"] * myData[,"PrcntTake"] )
# Specify names of data columns to use in the analysis:
yName = "SATT" ; xName = c("Spend","PrcntTake","SpendXPrcnt")
myData
| | State | Spend | StuTeaRat | Salary | PrcntTake | SATV | SATM | SATT | SpendXPrcnt |
|---|---|---|---|---|---|---|---|---|---|
1 | Alabama | 4.405 | 17.2 | 31.144 | 8 | 491 | 538 | 1029 | 35.24 |
2 | Alaska | 8.963 | 17.6 | 47.951 | 47 | 445 | 489 | 934 | 421.261 |
3 | Arizona | 4.778 | 19.3 | 32.175 | 27 | 448 | 496 | 944 | 129.006 |
4 | Arkansas | 4.459 | 17.1 | 28.934 | 6 | 482 | 523 | 1005 | 26.754 |
5 | California | 4.992 | 24 | 41.078 | 45 | 417 | 485 | 902 | 224.64 |
6 | Colorado | 5.443 | 18.4 | 34.571 | 29 | 462 | 518 | 980 | 157.847 |
7 | Connecticut | 8.817 | 14.4 | 50.045 | 81 | 431 | 477 | 908 | 714.177 |
8 | Delaware | 7.03 | 16.6 | 39.076 | 68 | 429 | 468 | 897 | 478.04 |
9 | Florida | 5.718 | 19.1 | 32.588 | 48 | 420 | 469 | 889 | 274.464 |
10 | Georgia | 5.193 | 16.3 | 32.291 | 65 | 406 | 448 | 854 | 337.545 |
11 | Hawaii | 6.078 | 17.9 | 38.518 | 57 | 407 | 482 | 889 | 346.446 |
12 | Idaho | 4.21 | 19.1 | 29.783 | 15 | 468 | 511 | 979 | 63.15 |
13 | Illinois | 6.136 | 17.3 | 39.431 | 13 | 488 | 560 | 1048 | 79.768 |
14 | Indiana | 5.826 | 17.5 | 36.785 | 58 | 415 | 467 | 882 | 337.908 |
15 | Iowa | 5.483 | 15.8 | 31.511 | 5 | 516 | 583 | 1099 | 27.415 |
16 | Kansas | 5.817 | 15.1 | 34.652 | 9 | 503 | 557 | 1060 | 52.353 |
17 | Kentucky | 5.217 | 17 | 32.257 | 11 | 477 | 522 | 999 | 57.387 |
18 | Louisiana | 4.761 | 16.8 | 26.461 | 9 | 486 | 535 | 1021 | 42.849 |
19 | Maine | 6.428 | 13.8 | 31.972 | 68 | 427 | 469 | 896 | 437.104 |
20 | Maryland | 7.245 | 17 | 40.661 | 64 | 430 | 479 | 909 | 463.68 |
21 | Massachusetts | 7.287 | 14.8 | 40.795 | 80 | 430 | 477 | 907 | 582.96 |
22 | Michigan | 6.994 | 20.1 | 41.895 | 11 | 484 | 549 | 1033 | 76.934 |
23 | Minnesota | 6 | 17.5 | 35.948 | 9 | 506 | 579 | 1085 | 54 |
24 | Mississippi | 4.08 | 17.5 | 26.818 | 4 | 496 | 540 | 1036 | 16.32 |
25 | Missouri | 5.383 | 15.5 | 31.189 | 9 | 495 | 550 | 1045 | 48.447 |
26 | Montana | 5.692 | 16.3 | 28.785 | 21 | 473 | 536 | 1009 | 119.532 |
27 | Nebraska | 5.935 | 14.5 | 30.922 | 9 | 494 | 556 | 1050 | 53.415 |
28 | Nevada | 5.16 | 18.7 | 34.836 | 30 | 434 | 483 | 917 | 154.8 |
29 | New Hampshire | 5.859 | 15.6 | 34.72 | 70 | 444 | 491 | 935 | 410.13 |
30 | New Jersey | 9.774 | 13.8 | 46.087 | 70 | 420 | 478 | 898 | 684.18 |
31 | New Mexico | 4.586 | 17.2 | 28.493 | 11 | 485 | 530 | 1015 | 50.446 |
32 | New York | 9.623 | 15.2 | 47.612 | 74 | 419 | 473 | 892 | 712.102 |
33 | North Carolina | 5.077 | 16.2 | 30.793 | 60 | 411 | 454 | 865 | 304.62 |
34 | North Dakota | 4.775 | 15.3 | 26.327 | 5 | 515 | 592 | 1107 | 23.875 |
35 | Ohio | 6.162 | 16.6 | 36.802 | 23 | 460 | 515 | 975 | 141.726 |
36 | Oklahoma | 4.845 | 15.5 | 28.172 | 9 | 491 | 536 | 1027 | 43.605 |
37 | Oregon | 6.436 | 19.9 | 38.555 | 51 | 448 | 499 | 947 | 328.236 |
38 | Pennsylvania | 7.109 | 17.1 | 44.51 | 70 | 419 | 461 | 880 | 497.63 |
39 | Rhode Island | 7.469 | 14.7 | 40.729 | 70 | 425 | 463 | 888 | 522.83 |
40 | South Carolina | 4.797 | 16.4 | 30.279 | 58 | 401 | 443 | 844 | 278.226 |
41 | South Dakota | 4.775 | 14.4 | 25.994 | 5 | 505 | 563 | 1068 | 23.875 |
42 | Tennessee | 4.388 | 18.6 | 32.477 | 12 | 497 | 543 | 1040 | 52.656 |
43 | Texas | 5.222 | 15.7 | 31.223 | 47 | 419 | 474 | 893 | 245.434 |
44 | Utah | 3.656 | 24.3 | 29.082 | 4 | 513 | 563 | 1076 | 14.624 |
45 | Vermont | 6.75 | 13.8 | 35.406 | 68 | 429 | 472 | 901 | 459 |
46 | Virginia | 5.327 | 14.6 | 33.987 | 65 | 428 | 468 | 896 | 346.255 |
47 | Washington | 5.906 | 20.2 | 36.151 | 48 | 443 | 494 | 937 | 283.488 |
48 | West Virginia | 6.107 | 14.8 | 31.944 | 17 | 448 | 484 | 932 | 103.819 |
49 | Wisconsin | 6.93 | 15.9 | 37.746 | 9 | 501 | 572 | 1073 | 62.37 |
50 | Wyoming | 6.16 | 14.9 | 31.285 | 10 | 476 | 525 | 1001 | 61.6 |
fileNameRoot = "Guber1999data-interaction-Jags-"
numSavedSteps=15000 ; thinSteps=5
graphFileType = "png"
#-------------------------------------------------------------------------------
# Load the relevant model into R's working memory:
source("Jags-Ymet-XmetMulti-Mrobust.R")
*********************************************************************
Kruschke, J. K. (2015). Doing Bayesian Data Analysis, Second Edition:
A Tutorial with R, JAGS, and Stan. Academic Press / Elsevier.
*********************************************************************
Loading required package: coda
#-------------------------------------------------------------------------------
# Generate the MCMC chain:
#startTime = proc.time()
mcmcCoda = genMCMC( data=myData , xName=xName , yName=yName ,
numSavedSteps=numSavedSteps , thinSteps=thinSteps ,
saveName=fileNameRoot )
CORRELATION MATRIX OF PREDICTORS:
            Spend PrcntTake SpendXPrcnt
Spend       1.000     0.593       0.775
PrcntTake   0.593     1.000       0.951
SpendXPrcnt 0.775     0.951       1.000

Calling 3 simulations using the parallel method...
Following the progress of chain 1 (the program will wait for all chains
to finish before continuing):
Welcome to JAGS 3.4.0 on Wed Oct 7 02:03:51 2015
JAGS is free software and comes with ABSOLUTELY NO WARRANTY
Loading module: basemod: ok
Loading module: bugs: ok
. . Reading data file data.txt
. Compiling data graph
   Resolving undeclared variables
   Allocating nodes
   Initializing
   Reading data back into data table
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
Graph Size: 638
. Reading parameter file inits1.txt
. Initializing model
. Adapting 500
-------------------------------------------------| 500
++++++++++++++++++++++++++++++++++++++++++++++++++ 100%
Adaptation successful
. Updating 1000
-------------------------------------------------| 1000
************************************************** 100%
. . . . . . . . Updating 25000
-------------------------------------------------| 25000
************************************************** 100%
. . . . Updating 0
. Deleting model
All chains have finished
Simulation complete. Reading coda files...
Coda files loaded successfully
Finished running the simulation
#-------------------------------------------------------------------------------
# Display diagnostics of chain, for specified parameters:
parameterNames = varnames(mcmcCoda) # get all parameter names
for ( parName in parameterNames ) {
diagMCMC( codaObject=mcmcCoda , parName=parName ,
saveName=fileNameRoot , saveType=graphFileType )
}
# Close the graphics windows that pop up.
graphics.off()
#-------------------------------------------------------------------------------
# Get summary statistics of chain:
summaryInfo = smryMCMC( mcmcCoda ,
saveName=fileNameRoot )
show(summaryInfo)
                   Mean        Median          Mode     ESS HDImass       HDIlow
CHAIN      2.000000e+00    2.00000000  1.997413e+00     1.5    0.95    1.0000000
beta0      1.046328e+03 1046.55500000  1.045477e+03  1192.8    0.95  964.1770000
beta[1]    2.706973e+00    2.64031500  2.081181e+00  1215.5    0.95  -13.2015000
beta[2]   -4.064498e+00   -4.06793500 -4.089410e+00   898.6    0.95   -5.6597000
beta[3]    2.037744e-01    0.20437850  2.117125e-01   813.3    0.95   -0.0561615
sigma      3.104436e+01   30.81855000  3.040303e+01 13947.4    0.95   23.9590000
zbeta0    -1.564266e-03   -0.00124368  3.764121e-03 15000.0    0.95   -0.1236310
zbeta[1]   4.930570e-02    0.04809150  3.790802e-02  1215.5    0.95   -0.2404560
zbeta[2]  -1.453822e+00   -1.45505000 -1.462733e+00   898.6    0.95   -2.0244100
zbeta[3]   5.615321e-01    0.56319650  5.834098e-01   813.3    0.95   -0.1547620
zsigma     4.149175e-01    0.41189950  4.063463e-01 13947.4    0.95    0.3202200
nu         3.490538e+01   26.42730000  1.158489e+01 10637.7    0.95    2.1464200
log10(nu)  1.400149e+00    1.42205278  1.457488e+00 12121.0    0.95    0.7047432
              HDIhigh CompVal PcntGtCompVal ROPElow ROPEhigh PcntLtROPE
CHAIN        3.000000      NA            NA      NA       NA         NA
beta0     1129.180000      NA            NA      NA       NA         NA
beta[1]     17.740100      NA            NA      NA       NA         NA
beta[2]     -2.461580      NA            NA      NA       NA         NA
beta[3]      0.468960      NA            NA      NA       NA         NA
sigma       38.820800      NA            NA      NA       NA         NA
zbeta0       0.120886      NA            NA      NA       NA         NA
zbeta[1]     0.323124      NA            NA      NA       NA         NA
zbeta[2]    -0.880479      NA            NA      NA       NA         NA
zbeta[3]     1.292290      NA            NA      NA       NA         NA
zsigma       0.518852      NA            NA      NA       NA         NA
nu          94.077300      NA            NA      NA       NA         NA
log10(nu)    2.096834      NA            NA      NA       NA         NA
          PcntInROPE PcntGtROPE
CHAIN             NA         NA
beta0             NA         NA
beta[1]           NA         NA
beta[2]           NA         NA
beta[3]           NA         NA
sigma             NA         NA
zbeta0            NA         NA
zbeta[1]          NA         NA
zbeta[2]          NA         NA
zbeta[3]          NA         NA
zsigma            NA         NA
nu                NA         NA
log10(nu)         NA         NA
# Display posterior information:
plotMCMC( mcmcCoda , data=myData , xName=xName , yName=yName ,
pairsPlot=TRUE , showCurve=FALSE ,
saveName=fileNameRoot , saveType=graphFileType )
# Close the graphics windows that pop up.
graphics.off()
When the analysis is run, the first thing it does is display the correlations of the predictors:
To properly understand the credible slopes on the two predictors, we must consider the credible slopes on each predictor as a function of the value of the other predictor.
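For instance, using the posterior means printed above (beta[1] ≈ 2.707 for Spend and beta[3] ≈ 0.2038 for the SpendXPrcnt interaction), the effective slope on Spend at a fixed value of PrcntTake is beta[1] + beta[3]·PrcntTake. A quick Python sketch (point estimates only; a full analysis would evaluate this across the entire chain):

```python
# Posterior means read from the summary above.
beta1 = 2.707   # slope on Spend
beta3 = 0.2038  # coefficient on the SpendXPrcnt interaction

def slope_on_spend(prcnt_take):
    # Effective slope on Spend when PrcntTake is held at the given value.
    return beta1 + beta3 * prcnt_take

# Evaluate at the min, near-mean, and max of PrcntTake in the data:
for p in (4, 35, 81):
    print(p, round(slope_on_spend(p), 2))
```

The slope on Spend is modest in states where few students take the test, but much steeper where most students take it, which is why a single "main effect" slope is not interpretable on its own.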
In summary,
With so many candidate predictors of noisy data, there may be some regression coefficients that are spuriously estimated to be non-zero.
coefficients.
To illustrate these ideas with a concrete example, consider again the SAT data of Figure 18.3, but now supplemented with 12 more randomly generated predictors.
vague normal priors.
The resulting posterior distribution is shown in Figure 18.11.
the same extent as Spend fell above zero. This apparent relation of xRand10 with SAT score is spurious, an accident of random sampling.
Now we repeat the analysis using the hierarchical model of Figure 18.10, with νβ fixed at one (i.e., heavy tails), and with σβ given a gamma prior that has mode 1.0 and standard deviation 1.0 (i.e., broad for the standardized data).
The resulting posterior distribution is shown in Figure 18.12.
regression coefficients are near zero. Indeed, the estimate of σβ (not displayed) has its posterior mode around 0.05, even though its prior mode was at 1.0. The shrinkage also causes the estimate of the regression coefficient on Spend to shift toward zero, but it has also suppressed what might be a real but small regression coefficient on Spend.
You can imagine that the posterior distribution has a spike at zero and long concave tails, like this:
Deciding which predictors to include is often called variable selection.
have relations small enough to be counted as insignificant.
This section introduces some basic ideas and methods of Bayesian variable selection, using some simple illustrative examples.
reference for the latest and greatest methods.
from an independent Bernoulli prior, such as delta[j] ~ dbern( 0.5 ). For comparison, first recall the model without inclusion indicators:
model {
  for ( i in 1:Ntotal ) {
    zy[i] ~ dt( zbeta0 + sum( zbeta[1:Nx] * zx[i,1:Nx] ) , 1/zsigma^2 , nu )
  }
  # Priors vague on standardized scale:
  zbeta0 ~ dnorm( 0 , 1/2^2 )
  for ( j in 1:Nx ) {
    zbeta[j] ~ dnorm( 0 , 1/2^2 )
  }
  zsigma ~ dunif( 1.0E-5 , 1.0E+1 )
  nu <- nuMinusOne+1
  nuMinusOne ~ dexp(1/29.0)
  # Transform to original scale:
  beta[1:Nx] <- ( zbeta[1:Nx] / xsd[1:Nx] )*ysd
  beta0 <- zbeta0*ysd + ym - sum( zbeta[1:Nx] * xm[1:Nx] / xsd[1:Nx] )*ysd
  sigma <- zsigma*ysd
}
# Nx : number of predictors.
model {
  for ( i in 1:Ntotal ) {
    zy[i] ~ dt( zbeta0 + sum( delta[1:Nx] * zbeta[1:Nx] * zx[i,1:Nx] ) ,
                1/zsigma^2 , nu )
  }
  # Priors vague on standardized scale:
  zbeta0 ~ dnorm( 0 , 1/2^2 )
  for ( j in 1:Nx ) {
    zbeta[j] ~ dnorm( 0 , 1/2^2 )
    delta[j] ~ dbern( 0.5 )
  }
  zsigma ~ dunif( 1.0E-5 , 1.0E+1 )
  nu <- nuMinusOne+1
  nuMinusOne ~ dexp(1/29.0)
  # Transform to original scale:
  beta[1:Nx] <- ( delta[1:Nx] * zbeta[1:Nx] / xsd[1:Nx] )*ysd
  beta0 <- zbeta0*ysd + ym - sum( delta[1:Nx] * zbeta[1:Nx] * xm[1:Nx]
                                  / xsd[1:Nx] )*ysd
  sigma <- zsigma*ysd
}
As a first example of applying the variable-selection method, recall the SAT data of Figure 18.3 and the posterior distribution shown in Figure 18.5.
With two predictors, there are four possible models involving different subsets of predictors.
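Because every step of the chain carries a pattern of delta values, the posterior probability of each of the four models can be estimated simply by tabulating those patterns. A Python sketch with a made-up sample of inclusion indicators (in a real run, delta would come from the MCMC output):

```python
import numpy as np
from collections import Counter

# Made-up posterior sample of inclusion indicators (delta[1], delta[2]);
# the marginal inclusion rates 0.7 and 0.95 are arbitrary illustration values.
rng = np.random.default_rng(2)
delta = rng.random((1000, 2)) < [0.7, 0.95]

# Each distinct pattern of delta is one of the 2^2 = 4 candidate models;
# its relative frequency estimates the posterior model probability.
counts = Counter(map(tuple, delta.astype(int)))
for pattern, n in sorted(counts.items()):
    print(pattern, n / 1000)
```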
Figure 18.13 shows the results.
We will now see that the degree of vagueness of the prior on the regression coefficient can have an enormous influence on the inclusion probability, even though the degree of vagueness has little influence on the estimate of the regression coefficient itself.
Recall that the prior in the model was specified as a generic broad distribution on the standardized regression coefficients, like this:
model {
  ...
  # Priors vague on standardized scale:
  zbeta0 ~ dnorm( 0 , 1/2^2 ) # SD=2
  for ( j in 1:Nx ) {
    zbeta[j] ~ dnorm( 0 , 1/2^2 ) # SD=2
    delta[j] ~ dbern( 0.5 )
  }
  ...
}
We now re-run the analysis using different arbitrary degrees of vagueness on the priors for the standardized regression coefficients.
We will illustrate with SD=1, like this:
model {
  ...
  # Priors vague on standardized scale:
  zbeta0 ~ dnorm( 0 , 1/1^2 ) # SD=1
  for ( j in 1:Nx ) {
    zbeta[j] ~ dnorm( 0 , 1/1^2 ) # SD=1
    delta[j] ~ dbern( 0.5 )
  }
  ...
}
and with SD=10, like this:
model {
  ...
  # Priors vague on standardized scale:
  zbeta0 ~ dnorm( 0 , 1/10^2 ) # SD=10
  for ( j in 1:Nx ) {
    zbeta[j] ~ dnorm( 0 , 1/10^2 ) # SD=10
    delta[j] ~ dbern( 0.5 )
  }
  ...
}
Figure 18.14 shows the results.
values may be minimally affected.
If you have strong previous research that can inform the prior, then it should be used. But if previous knowledge is weak, then the uncertainty should be expressed in the prior.
include a higher-level distribution to express your prior uncertainty.
braces.
In the present application, we have uncertainty, and therefore, we place a broad prior on σβ .
The code below shows a few different options in commented lines.
always have some arbitrary bounds. Therefore, the next option, not commented out of the code below, is a gamma distribution that has a mode at 1.0 but is very broad with a standard deviation of 10.0.
model {
  ...
  # Priors vague on standardized scale:
  zbeta0 ~ dnorm( 0 , 1/2^2 )
  for ( j in 1:Nx ) {
    zbeta[j] ~ dt( 0 , 1/sigmaBeta^2 , 1 ) # Notice sigmaBeta
    delta[j] ~ dbern( 0.5 )
  }
  zsigma ~ dunif( 1.0E-5 , 1.0E+1 )
  ## Uncomment one of the following specifications for sigmaBeta:
  # sigmaBeta <- 2.0
  # sigmaBeta ~ dunif( 1.0E-5 , 1.0E+2 )
  sigmaBeta ~ dgamma(1.1051,0.1051) # mode 1.0, sd 10.0
  # sigmaBeta <- 1/sqrt(tauBeta) ; tauBeta ~ dgamma(0.001,0.001)
  ...
}
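The shape and rate values 1.1051 and 0.1051 follow from the desired mode and standard deviation. For a gamma distribution, mode = (shape−1)/rate and sd = √shape/rate; solving those two equations for shape and rate gives the sketch below (a standalone Python function mirroring the reparameterization used in the book's utility code):

```python
import math

def gamma_sh_ra_from_mode_sd(mode, sd):
    # Solve mode = (shape-1)/rate and sd = sqrt(shape)/rate for shape, rate.
    rate = (mode + math.sqrt(mode**2 + 4 * sd**2)) / (2 * sd**2)
    shape = 1 + mode * rate
    return shape, rate

shape, rate = gamma_sh_ra_from_mode_sd(1.0, 10.0)
print(round(shape, 4), round(rate, 4))  # → 1.1051 0.1051, as in dgamma above
```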
When it is run, it produces results very similar to those in Figure 18.13.
# Code related to Figure 18.13
# Nx : number of predictors.
model {
  for ( i in 1:Ntotal ) {
    zy[i] ~ dt( zbeta0 + sum( delta[1:Nx] * zbeta[1:Nx] * zx[i,1:Nx] ) ,
                1/zsigma^2 , nu )
  }
  # Priors vague on standardized scale:
  zbeta0 ~ dnorm( 0 , 1/2^2 )
  for ( j in 1:Nx ) {
    zbeta[j] ~ dnorm( 0 , 1/2^2 )
    delta[j] ~ dbern( 0.5 )
  }
  zsigma ~ dunif( 1.0E-5 , 1.0E+1 )
  nu <- nuMinusOne+1
  nuMinusOne ~ dexp(1/29.0)
  # Transform to original scale:
  beta[1:Nx] <- ( delta[1:Nx] * zbeta[1:Nx] / xsd[1:Nx] )*ysd
  beta0 <- zbeta0*ysd + ym - sum( delta[1:Nx] * zbeta[1:Nx] * xm[1:Nx]
                                  / xsd[1:Nx] )*ysd
  sigma <- zsigma*ysd
}
# Example for Jags-Ymet-XmetMulti-MrobustVarSelect.R
# Not in the book: switch the working directory to the data folder so the scripts placed there for practice can be run.
cur_dir = getwd()
setwd(sprintf("%s/%s", cur_dir, 'data'))
#-------------------------------------------------------------------------------
# Load data file and specify column names of x (predictors) and y (predicted):
myData = read.csv( file="Guber1999data.csv" )
# UNCOMMENT ONE OF THE FOLLOWING SECTIONS (In RStudio, select and ctrl-shift-C)
#.............................................................................
# Two predictors:
yName = "SATT" ; xName = c("Spend","PrcntTake")
fileNameRoot = "Guber1999data-Jags-VarSelect-"
numSavedSteps=15000 ; thinSteps=25
#.............................................................................
# # Two predictors with redundant predictor:
# PropNotTake = (100-myData[,"PrcntTake"])/100
# myData = cbind( myData , PropNotTake )
# yName = "SATT" ; xName = c("Spend","PrcntTake","PropNotTake")
# fileNameRoot = "Guber1999data-Jags-Redund-VarSelect-"
# numSavedSteps=15000 ; thinSteps=30
#.............................................................................
# # Two predictors with two redundant predictors:
# PropNotTake = (100-myData[,"PrcntTake"])/100
# Partic = myData[,"PrcntTake"]/10
# myData = cbind( myData , PropNotTake , Partic )
# yName = "SATT" ; xName = c("Spend","PrcntTake","PropNotTake","Partic")
# fileNameRoot = "Guber1999data-Jags-Redund2-VarSelect-"
# numSavedSteps=15000 ; thinSteps=15
#.............................................................................
# # Four predictors:
# yName = "SATT" ; xName = c("Spend","PrcntTake","StuTeaRat","Salary")
# fileNameRoot = "Guber1999data-Jags-4X-VarSelect-"
# numSavedSteps=15000 ; thinSteps=20
#.............................................................................
# # Append columns of random predictors:
# standardize = function(v){(v-mean(v))/sd(v)}
# Ny=nrow(myData)
# NxRand = 12
# set.seed(47405)
# for ( xIdx in 1:NxRand ) {
# xRand = standardize(rnorm(Ny))
# myData = cbind( myData , xRand )
# colnames(myData)[ncol(myData)] = paste0("xRand",xIdx)
# }
# yName = "SATT" ; xName = c("Spend","PrcntTake", paste0("xRand",1:NxRand) )
# fileNameRoot = "Guber1999data-Jags-RandX-VarSelect-"
# numSavedSteps=15000 ; thinSteps=5
#.............................................................................
# # Two predictors with interaction.
# # Append named column with interaction product:
# myData = cbind( myData , SpendXPrcnt=myData[,"Spend"]*myData[,"PrcntTake"] )
# yName = "SATT" ; xName = c("Spend","PrcntTake","SpendXPrcnt")
# fileNameRoot = "Guber1999data-Jags-Inter-VarSelect-"
# numSavedSteps=15000 ; thinSteps=25
#.............................................................................
graphFileType = "png"
#-------------------------------------------------------------------------------
# Load the relevant model into R's working memory:
source("Jags-Ymet-XmetMulti-MrobustVarSelect.R")
********************************************************************* Kruschke, J. K. (2015). Doing Bayesian Data Analysis, Second Edition: A Tutorial with R, JAGS, and Stan. Academic Press / Elsevier. *********************************************************************
Installing packages into ‘/home/moosung/R/x86_64-pc-linux-gnu-library/3.2’ (as ‘lib’ is unspecified)
Error in contrib.url(repos, type): trying to use CRAN without setting a mirror
#-------------------------------------------------------------------------------
# Generate the MCMC chain:
#startTime = proc.time()
mcmcCoda = genMCMC( data=myData , xName=xName , yName=yName ,
numSavedSteps=numSavedSteps , thinSteps=thinSteps ,
saveName=fileNameRoot )
#stopTime = proc.time()
#duration = stopTime - startTime
#show(duration)
#-------------------------------------------------------------------------------
# Display diagnostics of chain, for specified parameters:
parameterNames = varnames(mcmcCoda) # get all parameter names
for ( parName in parameterNames ) {
diagMCMC( codaObject=mcmcCoda , parName=parName ,
saveName=fileNameRoot , saveType=graphFileType )
}
# Close the graphics windows that pop up.
graphics.off()
#-------------------------------------------------------------------------------
# Get summary statistics of chain:
summaryInfo = smryMCMC( mcmcCoda ,
saveName=fileNameRoot )
show(summaryInfo)
# Display posterior information:
plotMCMC( mcmcCoda , data=myData , xName=xName , yName=yName ,
pairsPlot=TRUE , showCurve=FALSE ,
saveName=fileNameRoot , saveType=graphFileType )
# Close the graphics windows that pop up.
graphics.off()
First we consider the correlations of the candidate predictors:
one.
Should only one or the other be included, or both, or neither?
Figure 18.15 shows the results.
be reported and what should be concluded?
Unfortunately, there is no singular “correct” answer.
See the book for further explanation.
MrobustVarSelect.R. The code is meant primarily for pedagogical purposes, and it does not scale well for larger applications, for reasons explained presently.
To conclude this section regarding variable selection, it is appropriate to recapitulate the considerations at the beginning of the section. Variable selection is a reasonable approach only if it is genuinely plausible and meaningful that candidate predictors have zero relation to the predicted variable. The results can be surprisingly sensitive to the seemingly innocuous choice of prior for the regression coefficients and, of course, the prior for the inclusion probability. Because of these limitations, hierarchical shrinkage priors may be a more meaningful approach.