-
Notifications
You must be signed in to change notification settings - Fork 36
/
Copy pathcollinear.Rmd
185 lines (142 loc) · 9.26 KB
/
collinear.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
---
title: "Bayesian version of Does model averaging make sense?"
author: "[Aki Vehtari](https://users.aalto.fi/~ave/)"
date: "First version 2017-08-09. Last modified `r format(Sys.Date())`."
output:
html_document:
fig_caption: yes
toc: TRUE
toc_depth: 2
number_sections: TRUE
toc_float:
smooth_scroll: FALSE
bibliography: modelsel.bib
csl: harvard-cite-them-right.csl
link-citations: yes
---
# Setup {.unnumbered}
```{r setup, include=FALSE}
knitr::opts_chunk$set(cache=FALSE, message=FALSE, error=FALSE, warning=FALSE, comment=NA, out.width='95%')
```
**Load packages**
```{r}
library(rstanarm)
options(mc.cores = parallel::detectCores())
library(loo)
library(tidyverse)
library(GGally)
library(bayesplot)
theme_set(bayesplot::theme_default())
library(projpred)
SEED=87
```
# Introduction
This notebook was inspired by Andrew Tyre's blog post [Does model averaging make sense?](https://atyre2.github.io/2017/06/16/rebutting_cade.html). Tyre discusses problems in current statistical practices in ecology, focusing in multi-collinearity, model averaging and measuring the relative importance of variables. Tyre's post is commenting a paper [Model averaging and muddled multimodel inferences](http://onlinelibrary.wiley.com/doi/10.1890/14-1639.1/full.) In his blog post he uses maximum likelihood and AIC_c. Here we provide a Bayesian approach for handling multicollinearity, model averaging and measuring relative importance of variables using packages [`rstanarm`](https://cran.r-project.org/package=rstanarm), [`bayesplot`](https://cran.r-project.org/package=bayesplot), [`loo`](https://cran.r-project.org/package=loo) and [`projpred`](https://cran.r-project.org/package=projpred). We demonstrate the benefits of Bayesian posterior analysis [@BDA3] and projection predictive approach [@Piironen+etal:projpred:2020; @Piironen+Vehtari:2017a].
# Data
We generate the data used previously to illustrate multi-collinearity problems.
```{r warning=FALSE,message=FALSE}
# all this data generation is from Cade 2015
# doesn't matter what this is -- if you use a different number your results will be different from mine.
set.seed(SEED)
df <- tibble(
pos.tot = runif(200,min=0.8,max=1.0),
urban.tot = pmin(runif(200,min=0.0,max=0.02),1.0 - pos.tot),
neg.tot = (1.0 - pmin(pos.tot + urban.tot,1)),
x1= pmax(pos.tot - runif(200,min=0.05,max=0.30),0),
x3= pmax(neg.tot - runif(200,min=0.0,max=0.10),0),
x2= pmax(pos.tot - x1 - x3/2,0),
x4= pmax(1 - x1 - x2 - x3 - urban.tot,0))
# true model and 200 Poisson observations
mean.y <- exp(-5.8 + 6.3*df$x1 + 15.2*df$x2)
df$y <- rpois(200,mean.y)
ggpairs(df,diag=list(continuous="barDiag"))
```
Tyre: "So there is a near perfect negative correlation between the things sage grouse like and the things they don’t like, although it gets less bad when considering the individual covariates."
# Bayesian inference
From this point onwards we switch to Bayesian approach. The [rstanarm package](https://cran.r-project.org/package=rstanarm) provides `stan_glm` function which accepts same arguments as `glm`, but makes full Bayesian inference using Stan ([mc-stan.org](https://mc-stan.org)). By default a weakly informative Gaussian prior is used for weights.
```{r}
fitg <- stan_glm(y ~ x1 + x2 + x3 + x4, data = df, na.action = na.fail, family=poisson(), seed=SEED)
```
Let's look at the summary:
```{r}
summary(fitg)
```
We didn't get divergences, Rhat's are less than 1.1 and n_eff's are useful (see, e.g., [RStan workflow](http://mc-stan.org/users/documentation/case-studies/rstan_workflow.html)). However, when we know that covariats are correlating we can get even better performance by using QR decomposition (see, [The QR Decomposition For Regression Models](http://mc-stan.org/users/documentation/case-studies/qr_regression.html)).
```{r}
fitg <- stan_glm(y ~ x1 + x2 + x3 + x4, data = df, na.action = na.fail, family=poisson(), QR=TRUE, seed=SEED)
```
Let's look at the summary and plot:
```{r}
summary(fitg)
```
Use of QR decomposition greatly improved sampling efficiency and we continue with this model.
```{r}
mcmc_areas(as.matrix(fitg),prob_outer = .99)
```
All 95% posterior intervals are overlapping 0 and it seems we have the same collinearity problem as with maximum likelihood estimates.
Looking at the pairwise posteriors we can see high correlations
```{r}
mcmc_pairs(as.matrix(fitg),pars = c("x1","x2","x3","x4"))
```
If look more carefully on of the subplots, we see that although marginal posterior intervals overlap 0, the joint posterior is not overlapping 0.
```{r}
mcmc_scatter(as.matrix(fitg), pars = c("x1", "x2"))+geom_vline(xintercept=0)+geom_hline(yintercept=0)
```
Based on the joint distributions all the variables would be relevant. To make predictions we don't need to make variable selection, we just integrate over the uncertainty (kind of continuous model averaging).
In case of even more variables with some being relevant and some irrelevant, it will be difficult to analyse joint posterior to see which variables are jointly relevant. We can easily test whether any of the covariates are useful by using cross-validation to compare to a null model,
```{r}
fitg0 <- stan_glm(y ~ 1, data = df, na.action = na.fail, family=poisson(), seed=SEED)
```
```{r}
(loog <- loo(fitg))
(loog0 <- loo(fitg0))
loo_compare(loog0, loog)
```
Based on cross-validation covariates together contain significant information to improve predictions.
# Variable selection
We might want to choose some variables 1) because we don't want to observe all the variables in the future (e.g. due to the measurement cost), or 2) we want to most relevant variables which we define here as a minimal set of variables which can provide similar predictions to the full model.
Tyre used AIC_c to estimate the model performance. In Bayesian setting we could use Bayesian cross-validation or WAIC, but we don't recommend thhem for variable selection as discussed by @Piironen+Vehtari:2017a. The reason for not using Bayesian CV or WAIC is that the selection process uses the data twice, and in case of large number variable combinations the selection process overfits and can produce really bad models. Using the usual posterior inference given the selected variables ignores that the selected variables are conditonal on the selection process and simply setting some variables to 0 ignores the uncertainty related to their relevance.
@Piironen+Vehtari:2017a also show that a projection predictive approach can be used to make a model reduction, that is, choosing a smaller model with some coefficients set to 0. The projection predictive approach solves the problem how to do inference after the selection. The solution is to project the full model posterior to the restricted subspace. See more by @Piironen+etal:projpred:2020.
We make the projective predictive variable selection using the previous full model. A fast leave-one-out cross-validation approach [@Vehtari+etal:PSIS-LOO:2017] is used to choose the model size.
```{r, results='hide'}
fitg_cv <- cv_varsel(fitg, method='forward', cv_method='LOO')
```
```{r}
fitg_cv$vind
```
We can now look at the estimated predictive performance of smaller models compared to the full model.
```{r}
plot(fitg_cv, stats = c('elpd', 'rmse'))
```
And we get a LOO based recommendation for the model size to choose
```{r}
(nsel <- suggest_size(fitg_cv, alpha=0.1))
(vsel <- solution_terms(fitg_cv)[1:nsel])
```
We see that `r nv` variables is enough to get the same predictive accuracy as with all 4 variables.
Next we form the projected posterior for the chosen model.
```{r}
projg <- project(fitg_cv, nv = nsel, ns = 4000)
projdraws <- as.matrix(projg)
round(colMeans(projdraws),1)
round(posterior_interval(projdraws),1)
```
This looks good as the true values are intercept=-5.8, x2=15.2, x1=6.3.
```{r}
mcmc_areas(projdraws, pars=c("Intercept",vsel))
```
Even if we started with a model which had due to a co-linearity difficult to interpret posterior, the projected posterior is able to match closely the true values. The necessary information was in the full model and with the projection we were able to form the projected posterior which we should use if x3 and x4 are set to 0.
Back to the Tyre's question "Does model averaging make sense?". If we are interested just in good predictions we can do continuous model averaging by using suitable priors and by integrating over the posterior. If we are intersted in predcitions, then we don't first average weights (ie posterior mean), but use all weight values to compute predictions and do the averaging of the predictions. All this is automatic in Bayesian framework.
Tyre also commented on the problems of measuring variable importance. The projection predictive approach above is derived using decision theory and is very helpful for measuring relevancy and choosing relevant variables. Tyre did not comment about the inference after selection although it is also known problem in variable selection. The projection predictive approach above solves that problem, too.
<br />
# References {.unnumbered}
<div id="refs"></div>
# Licenses {.unnumbered}
* Code © 2017-2018, Aki Vehtari, licensed under BSD-3.
* Text © 2017-2018, Aki Vehtari, licensed under CC-BY-NC 4.0.
* Code for generating the data by [Andrew Tyre](https://atyre2.github.io/2017/06/16/rebutting_cade.html)
# Original Computing Environment {.unnumbered}
```{r}
sessionInfo()
```
<br />