##########################
## QCBS Stan Workshop ##
##########################
# An introduction to Stan, rstan & rstanarm
# Maxwell J. Farrell
# April 24th 2017
###############
## Libraries ##
###############
require(rstan)
require(gdata)
require(ggplot2)
require(bayesplot)
require(rstanarm)
require(shinystan)
require(loo)
###############
# Remember! Bayesian model building is not always easy!
# It can require a lot of work to build up to a model that works
# well for your data and also reflects what you know about the world.
# First, with any model building get to know your data.
# Explore it, plot it, calculate those summary statistics.
# Once you have a sense of your data and what you want to learn,
# you enter the iterative process of building a Bayesian model:
# 1. Choose your model
# 2. Simulate data & make sure your model is doing what you think it's doing
# 3. Pick priors (informative? not? do you have external data you could turn into a prior?)
# 4. Inspect model convergence (traceplots, Rhats, and no divergent transitions...)
# 5. Generate posterior predictions and check how they compare to your data!
# 6. Repeat...
#####################
## Simulating data ##
#####################
# To start we will simulate data under a linear model:
# y = alpha + beta*x + error
set.seed(420)
N <- 100
alpha <- 2.5
beta <- 0.2
sigma <- 6
x <- rnorm(N, 100, 10)
y <- rnorm(N, alpha + beta*x, sigma)
# This ^ is equivalent to the following:
# y <- alpha + beta*x + rnorm(N, 0, sigma)
# Have a look
plot(y~x, pch=20)
# Plot the "true" relationship
abline(alpha, beta, col=4, lty=2, lwd=2)
# Run an "lm" model for comparison
lm1 <- lm(y ~ x)
summary(lm1)
abline(lm1, col=2, lty=2, lwd=3)
#####################
## STAN REFERENCES ##
#####################
# WEBSITE: http://mc-stan.org/
# MANUAL(v2.14): https://github.com/stan-dev/stan/releases/download/v2.14.0/stan-reference-2.14.0.pdf
# RSTAN: https://cran.r-project.org/web/packages/rstan/vignettes/rstan.html
# STANCON 2017 Intro Course Materials: https://t.co/6d3omvBkrd
# MAILING LIST: https://groups.google.com/forum/#!forum/stan-users
############################
## Our first Stan program ##
############################
# We're going to start by writing a linear model in the
# language Stan. This can be written in your R script,
# or saved separately as a .stan file and called into R.
# A stan program has three required "blocks":
# 1. "data" block: where you declare the data types,
# their dimensions, any restrictions
# (e.g. upper= or lower=, which act as checks for Stan),
# and their names.
# The names declared here are the names you must use when
# passing data to your Stan program, and in the other blocks.
# 2. "parameters" block: This is where you indicate the
# parameters you want to model, their dimensions, restrictions, and names.
# For a linear regression, we will want to model the intercept,
# any slopes, and the standard deviation of the errors
# around the regression line.
# 3. "model" block: This is where you include any sampling statements,
# including the "likelihood" (model) you are using.
# The model block is where you indicate any prior distributions
# you want to include for your parameters. If no prior is defined,
# Stan uses default priors of uniform(-infinity, +infinity).
# You can constrain parameters through restrictions when
# declaring them (e.g. <lower=0> to ensure a parameter is positive).
# Sampling is indicated by the "~" symbol, and Stan already includes
# many common distributions as vectorized functions.
# See the manual for a comprehensive list.
# There are also optional blocks:
# functions {}
# transformed data {}
# transformed parameters {}
# generated quantities {}
# Comments are indicated by "//" and are ignored by Stan.
stan_lm1_model <- "
// Stan model for simple linear regression
data {
int<lower=1> N; // Sample size
vector[N] x; // Predictor
vector[N] y; // Outcome
}
parameters {
real alpha; // Intercept
real beta; // Slope (regression coefficients)
real<lower=0> sigma; // Error SD
}
model {
y ~ normal(alpha + x * beta , sigma);
// i.e. y is modelled as alpha + x * beta plus normal(0, sigma) noise
}
generated quantities {
}
"
# Here we are implicitly using flat uniform(-infinity, +infinity) priors
# for our parameters (uniform(0, +infinity) for sigma, given its <lower=0> constraint).
# Sampling
# Stan programs are compiled to C++ before being used. For this you must have
# a C++ compiler installed (see: https://github.com/stan-dev/rstan/wiki/RStan-Getting-Started)
# If we wanted to call a .stan file, we use the argument file= instead of model_code.
# You can use your model many times per session once you compile it,
# but you must re-compile it when you start a new R session.
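# For example, if the model above were saved in a file called "lm1.stan"
# ("lm1.stan" is a hypothetical filename), we could compile it with:
# model1 <- stan_model(file="lm1.stan")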
model1 <- stan_model(model_code=stan_lm1_model)
# There are many C++ compilers, and they often differ across systems.
# If compiling prints a stream of warnings or messages, don't worry:
# as long as your model can be used with the sampling() function, it compiled correctly.
# We need to pass the data we want to use to stan as a list of named objects.
# The names given here need to match the variable names used in the data block.
stan_data <- list(N=N, x=x, y=y)
# We fit our model by using the sampling() function, and providing it with the
# model, the data, and indicating the number of iterations for warmup,
# the total number of iterations, how many chains we want to run,
# the number of cores we want to use (stan is set up for parallelization), which
# indicates how many chains are run simultaneously, and the thinning,
# which is how frequently we want to store our post-warmup iterations.
# thin=1 will keep every iteration, thin=2 will keep every second, etc...
# Stan automatically uses half of the iterations as warm-up, if
# the warmup= argument is not specified.
fit <- sampling(model1, data=stan_data, warmup=500, iter=1000, chains=4, cores=2, thin=1)
################################################
## Accessing the contents of a stanfit object ##
################################################
# https://cran.r-project.org/web/packages/rstan/vignettes/stanfit-objects.html
# Results from sampling() are saved as a "stanfit" object (S4 class)
# We can get summary statistics for parameter estimates,
# and sampler diagnostics by executing the name of the object:
fit
# From this output we can quickly assess model convergence by
# looking at the Rhat values for each parameter.
# When these are at or near 1, the chains have converged.
# There are many other diagnostics, but this is an important one for Stan.
# We can also look at the full posterior of our parameters by extracting
# them from the model object. There are many ways to access the posterior.
posterior <- extract(fit)
str(posterior)
# extract() puts the posterior draws for each parameter into a list.
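# A few other ways to access the posterior draws:
post_mat <- as.matrix(fit) # draws as a matrix, one column per parameter
post_df <- as.data.frame(fit) # the same draws as a data.frame
summary(fit)$summary # summary statistics for each parameter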
# Let's compare to our previous estimate with "lm"
abline( mean(posterior$alpha), mean(posterior$beta), col=6, lwd=2)
# The result is essentially identical to the lm output.
# This is because we are using a simple model, and have put
# uniform priors on our parameters.
# One way to visualize the variability in our estimation of the regression line
# is to plot multiple estimates from the posterior:
# Plotting the posterior distribution of regression lines
for (i in 1:500) {
abline(posterior$alpha[i], posterior$beta[i], col="gray", lty=1)
}
# Let's plot the "true" regression line on top:
abline(alpha, beta, col=4, lty=2, lwd=3)
# Let's try again, but now with "strong" priors on the relationship
# Try changing the priors yourself and see what happens!
stan_lm1_model_strongPriors <- "
// Stan model for simple linear regression
data {
int<lower=1> N; // Sample size
vector[N] x; // Predictor
vector[N] y; // Outcome
}
parameters {
real alpha; // Intercept
real beta; // Slope (regression coefficients)
real<lower=0> sigma; // Error SD
}
model {
alpha ~ normal(10, 0.1);
beta ~ normal(1, 0.1);
y ~ normal(alpha + x * beta , sigma);
// i.e. y is modelled as alpha + x * beta plus normal(0, sigma) noise
}
generated quantities {
}
"
model2 <- stan_model(model_code=stan_lm1_model_strongPriors)
fit2 <- sampling(model2, data=stan_data, warmup=500, iter=1000, chains=4, cores=2, thin=1)
posterior2 <- extract(fit2)
abline( mean(posterior2$alpha), mean(posterior2$beta), col=7, lwd=2)
# and compared to the uniform priors:
abline( mean(posterior$alpha), mean(posterior$beta), col=6, lwd=3)
#############################
## Convergence Diagnostics ##
#############################
# Before we go on we should again check the Rhat values, the
# effective sample size (n_eff), and the traceplots of our model
# parameters to make sure the model has converged and can be trusted.
# n_eff is a crude measure of the effective sample size.
# You usually only need to worry if this number is less than 1/100th
# or 1/1000th of your total number of iterations.
# 'Anything over an n_eff of 100 is usually "fine"' - Bob Carpenter
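# You can pull n_eff and Rhat for every parameter straight from the fit:
summary(fit)$summary[, c("n_eff", "Rhat")]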
# For traceplots, we can view them directly from the posterior:
par(mfrow=c(3,1))
plot(posterior$alpha, type="l")
plot(posterior$beta, type="l")
plot(posterior$sigma, type="l")
# For simpler models, convergence is usually not a problem unless you
# run your sampler for too few iterations:
# Try running a model for only 50 iterations and check the traceplots.
fit_bad <- sampling(model1, data=stan_data, warmup=25, iter=50, chains=4, cores=2, thin=1)
# This also has some "divergent transitions" after warmup, indicating a mis-specified model,
# or a sampler that has failed to fully sample the posterior.
posterior_bad <- extract(fit_bad)
plot(posterior_bad$alpha, type="l")
plot(posterior_bad$beta, type="l")
plot(posterior_bad$sigma, type="l")
# We can also get summaries of the parameters through the posterior directly.
# Let's plot the posterior densities against the true simulated values
# to make sure our model is doing what we think it is...
par(mfrow=c(1,3))
plot(density(posterior$alpha), main="Alpha")
abline(v=alpha, col=4, lty=2)
plot(density(posterior$beta), main="Beta")
abline(v=beta, col=4, lty=2)
plot(density(posterior$sigma), main="Sigma")
abline(v=sigma, col=4, lty=2)
# And from the posterior we can directly calculate the probability
# of any parameter being over or under a certain value of interest.
# Probability that beta is >0:
sum(posterior$beta>0)/length(posterior$beta)
# Probability that beta is >0.2:
sum(posterior$beta>0.2)/length(posterior$beta)
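# Equivalently, since the comparison returns a vector of TRUE/FALSE (1/0) values:
mean(posterior$beta > 0)
mean(posterior$beta > 0.2)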
# While we can work with the posterior directly,
# rstan has a lot of these functions built-in as ggplot objects
traceplot(fit)
# This is a wrapper for "stan_trace()"
# stan_trace(fit)
# This is much better than our previous plot because
# it displays the chains overlaid on one another.
# We can also look at the posterior densities & histograms
stan_dens(fit)
stan_hist(fit)
# And we can generate plots which indicate the mean parameter estimates
# and any credible intervals we may be interested in
plot(fit, show_density=FALSE, ci_level=0.5, outer_level=0.95, fill_color="salmon")
#################################
## Posterior Predictive Checks ##
#################################
# Vignettes: https://cran.r-project.org/web/packages/bayesplot/index.html
# For prediction and as another form of model diagnostic,
# Stan can use random number generators to generate predicted values
# for each data point, at each iteration.
# This way we can generate predictions that also represent
# the uncertainties in our model and our data generation process.
# We generate these using the Generated Quantities block.
# This block can be used to get any other information
# we want about the posterior, or make predictions for new data.
stan_lm1_model_GQ <- "
// Stan model for simple linear regression
data {
int<lower=1> N; // Sample size
vector[N] x; // Predictor
vector[N] y; // Outcome
}
parameters {
real alpha; // Intercept
real beta; // Slope (regression coefficients)
real<lower=0> sigma; // Error SD
}
model {
y ~ normal(x * beta + alpha, sigma);
}
generated quantities {
real y_rep[N];
for (n in 1:N) {
y_rep[n] = normal_rng(x[n] * beta + alpha, sigma);
}
}
"
# Note that vectorization is not supported in the GQ block,
# so we'll have to put it in a loop, but because this is a C++ program,
# loops are actually quite fast and the GQ block is only evaluated
# once per iteration, so it won't add too much time to your sampling.
# Typically, the data generating functions will be the distributions you
# used in the model block but with an "_rng" suffix.
# Double-check in the Stan manual to see which rng functions are included.
model3 <- stan_model(model_code=stan_lm1_model_GQ)
fit3 <- sampling(model3, data=stan_data, iter=1000, chains=4, cores=2, thin=1)
# Extracting the y_rep values from posterior
y_rep <- as.matrix(fit3, pars = "y_rep")
# dim(y_rep) # each row is an iteration (single posterior estimate) from the model.
# We can use the "bayesplot" package to make some nicer looking plots.
# This package is a wrapper for many common ggplot2 plots,
# and has a lot of built-in functions to work with posterior predictions.
# Comparing the density of y with the densities of y_rep over 200 posterior draws
ppc_dens_overlay(y, y_rep[1:200, ])
# Here we see the data (dark blue) fits well within our posterior predictions.
# We can also use this to compare estimates of summary statistics
ppc_stat(y = y, yrep = y_rep, stat="mean")
# We can change the function passed to the "stat" argument, and even write our own! For example:
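# Comparing the observed maximum of y to the posterior predictions:
ppc_stat(y = y, yrep = y_rep, stat = "max")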
# We can investigate mean posterior prediction per datapoint
# vs the observed value for each datapoint (default line is 1:1)
ppc_scatter_avg(y = y, yrep = y_rep)
# Here is a list of currently available plots (bayesplot 1.2)
available_ppc()
# You can change the colour scheme in bayesplot too:
color_scheme_view(c("blue", "gray", "green", "pink", "purple",
"red","teal","yellow"))
# And you can even mix them:
color_scheme_view("mix-blue-red")
# You can set color schemes with:
color_scheme_set("blue")
##################################
## Trying with some "real" data ##
##################################
# Data from Akinyi et al 2013 (Animal Behaviour):
# Role of grooming in reducing tick load in wild baboons (Papio cynocephalus)
# http://datadryad.org/resource/doi:10.5061/dryad.3r8n2
# Data download, load-in and clean-up
# Direct download from Dryad and opening of the .xlsx with the gdata package!
data <- read.xls("http://datadryad.org/bitstream/handle/10255/dryad.44883/Akinyi_etal_2013_AnBeh_dataset1.xlsx?sequence=1", skip=0, method="csv", header=TRUE, stringsAsFactors=FALSE)
names(data) <- tolower(names(data))
names(data)[4] <- "age.years"
names(data)[8] <- "grooming.count"
names(data) <- gsub("[.]at[.]darting","", names(data))
# Only use complete cases
data <- data[complete.cases(data),]
# Recoding sex
data$sex[data$sex=="M"] <- 0
data$sex[data$sex=="F"] <- 1
data$sex <- as.integer(data$sex)
# Getting total tick counts
data$tick_count <- with(data, adult.tick.count + tick.larvae.count)
####################
## Using rstanarm ##
####################
# Vignette: https://cran.r-project.org/web/packages/rstanarm/vignettes/rstanarm.html
# This is a special extension of rstan that has common regression models pre-compiled.
# Based on "Data Analysis using Regression and Multilevel/Hierarchical Models"
# by Gelman & Hill (2007): http://www.stat.columbia.edu/~gelman/arm/
# Check here for more vignettes:
# https://cran.r-project.org/web/packages/rstanarm/index.html
# It currently supports building regression models using similar
# syntax as aov, lm, glm, lmer, glmer, glmer.nb, gamm4, polr, and betareg
# With this you can easily run most regression models in a Bayesian framework.
# No need to code in Stan, or to compile any models.
# It also comes with wrappers for many rstan and bayesplot functions,
# functions that can easily update models, or make predictions for new input data.
require(rstanarm)
par(mfrow=c(1,1))
# Trying to re-create figure 3 from the paper
with(data, plot(tick_count ~ grooming.count, ylab="Tick load", xlab="Counts of received grooming"))
abline(lm(tick_count ~ grooming.count, data=data))
# The researchers say that if we are to try to explain variation in tick
# abundances among individuals, we should control for the amount of grooming
# individuals receive (here, amount of grooming over 6 months prior to sampling ticks)
# Model: tick_count ~ grooming.count
stan_lm1 <- stan_glm(tick_count ~ grooming.count,
data = data, family=gaussian,
chains = 4, cores = 2, seed = 15042017)
## Convergence diagnostics for rstanarm objects
color_scheme_set("blue")
# Check convergence diagnostics
plot(stan_lm1, plotfun="trace")
summary(stan_lm1)
## Checking what priors were used
prior_summary(stan_lm1)
# This should be chosen before running the model, but here we will
# just check what default priors rstanarm is using:
# Default priors are normal(0,10) for the intercept,
# normal(0,2.5) for the regression coefficient,
# and half-cauchy(0, 5) for sigma (normal error term)
# This may seem fairly informative, but rstanarm
# automatically rescales these default priors to match
# the scales of your variables before fitting, so they
# are not as informative as they would be if applied to
# data left on its raw input scale!
# You can change the priors by passing additional
# arguments to the stan_glm function
# ex: prior = normal(0,2.5), prior_intercept = normal(0,10)
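# For example (a sketch; these prior scales are illustrative, not a recommendation):
# stan_lm1_priors <- stan_glm(tick_count ~ grooming.count,
#     data = data, family = gaussian,
#     prior = normal(0, 2.5), prior_intercept = normal(0, 10),
#     chains = 4, cores = 2, seed = 15042017)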
## Posterior predictive checks
# The pre-compiled models in rstanarm already include a y_rep
# variable in the generated quantities block.
# The "ppc_check()" function is just an rstanarm wrapper for
# the ppc_* functions from the bayesplot package used above
# that automatically extracts "y_rep"
pp_check(stan_lm1, plotfun = "stat", stat = "mean")
pp_check(stan_lm1, plotfun = "dens_overlay")
# Looks pretty bad! Maybe this is because we are modelling count data
# using normal (gaussian) errors...
# Let's try a poisson regression (better for count data)
# Instead of writing the full model out again, we can use the "update()"
# function to run the same model but with the poisson family
stan_glm1 <- update(stan_lm1, family = poisson)
plot(stan_glm1, plotfun="trace")
summary(stan_glm1)
pp_check(stan_glm1, plotfun = "stat", stat = "mean")
pp_check(stan_glm1, plotfun = "dens_overlay")
# Looks like the model is underestimating instances of zero counts.
# This is a common issue when using the poisson distribution,
# whose variance is constrained to equal its mean.
# ~~~~ Picking a new model ~~~~ #
# Let's try running the model with a negative binomial distribution,
# which is also for count data, but estimates the mean and the
# dispersion (rstanarm's "reciprocal dispersion") separately.
stan_glm2 <- update(stan_glm1, family = neg_binomial_2)
## Check convergence & priors
plot(stan_glm2, plotfun="trace")
summary(stan_glm2)
prior_summary(stan_glm2)
## Posterior Predictive Checks
pp_check(stan_glm2, plotfun = "stat", stat = "mean")
pp_check(stan_glm2, plotfun = "dens_overlay")
# It looks like we are fitting the data better, but it is a bit hard to see
# because some of our posterior draws contain a large number of 0s...
# Let's write our own function to check the proportion of zeros
# in the data and in the posterior "y_rep" (extracted automatically by pp_check)
prop_zero <- function(x) mean(x == 0)
pp_check(stan_glm2, plotfun = "stat", stat = "prop_zero")
# Looks much better!
# ~~~~ Posterior Predictive Checks to inform model building ~~~~ #
# What about additional predictors? What should we include?
# The researchers indicate that in many primates, participation in
# grooming bouts differs between the sexes and with life history stage.
# Let's look at the posterior mean of tick counts grouped by sex
pp_check(stan_glm2, plotfun="stat_grouped", group=data$sex, stat=mean)
# Looks like our current model is underpredicting the tick count for males (0),
# and overpredicting the tick count for females (1)
# Model: tick_count ~ grooming.count + sex
stan_glm3 <- stan_glm(tick_count ~ grooming.count + sex,
data = data, family=neg_binomial_2,
chains = 4, cores = 2, seed = 15042017)
## Check convergence & priors
plot(stan_glm3, plotfun="trace")
summary(stan_glm3)
## Posterior Predictive Checks
pp_check(stan_glm3, plotfun="stat_grouped", group=data$sex, stat=mean)
# looks better!
# What about age?
pp_check(stan_glm3, plotfun="intervals", x=data$age.years, prob=0.95)
# Looks like observed tick count slightly increases with age.
# The observations per age fall within the 95% credible intervals,
# but maybe we could get better estimates by including age...
# Model: tick_count ~ grooming.count + sex + age.years
stan_glm4 <- stan_glm(tick_count ~ grooming.count + sex + age.years,
data = data, family=neg_binomial_2,
chains = 4, cores = 2, seed = 15042017)
## Check convergence & priors
plot(stan_glm4, plotfun="trace")
summary(stan_glm4)
# looks like age is not adding too much to the model
pp_check(stan_glm4, plotfun="intervals", x=data$age.years, prob=0.9)
# Looks like age has a weak effect, but we have reduced variation
# in our posterior predictions for younger baboons.
# We can visualize our parameter estimates with the "areas" plot function
plot(stan_glm4, plotfun="areas", prob = 0.95, prob_outer = 1.00)
#############################################
## Interactive Diagnostics with shinystan! ##
#############################################
require(shinystan)
# With the help of shiny, you can also look at your model diagnostics
# and posterior predictive checks interactively through the
# shinystan package, which offers many more MCMC diagnostics to explore.
launch_shinystan(stan_glm4)
# Looks like grooming.count still doesn't have much of an effect at all
# Let's run one more model without it, then compare...
# Model: tick_count ~ sex + age.years
stan_glm5 <- stan_glm(tick_count ~ sex + age.years,
data = data, family=neg_binomial_2,
chains = 4, cores = 2, seed = 15042017)
######################
## Model Comparison ##
######################
# There are many ways to compare models, and in the end it depends on
# what your initial goal was in building the model, what criteria you
# want to use to define the "best" model, and what kinds of models you are fitting.
# One philosophy is that you should build the model you believe represents
# your system of interest well, and there is no need to use formal tests to
# select a "best" from a variety of candidate models.
# Another philosophy, well phrased by George Box, is
# "All models are wrong but some are useful". In this frame of mind, we
# could run a series of models, and average across the outcomes of each,
# thereby incorporating uncertainty in model selection.
# Popular in ecology is the information theoretic approach, often
# employing some flavour of the Akaike Information Criterion (AIC).
# You could calculate AIC for Bayesian models, and statisticians
# have developed related criteria such as the BIC, which adds a
# penalty based on sample size, or the DIC, which
# is often used for hierarchical models.
# The Stan development team does not promote AIC, BIC, or DIC,
# because these are based on point estimates of the likelihood and
# are therefore not fully Bayesian. Instead they promote the use of
# WAIC (Widely applicable information criterion), and
# LOO (leave-one-out cross-validation).
# These use calculations of the log-likelihood across the entire posterior.
# For predictive models, a common method for picking models is to test
# the predictive accuracy of your model on new data, or some held-out portion of
# the data you already have (but not the portion you used to build the model).
# If your model does well at predicting a random datapoint excluded when
# the model was fit, it is likely not overfit. However, re-fitting your model
# many times, dropping a different data point each time, would take a lot of
# computation time and be impractical for complex models.
# The "loo" package in R employs a algorithm to approximate what the performance
# of your model would be if you performed full leave-one-out cross-validation.
# If you are interested in how the approximation is calculated,
# and how to determine if you can reasonably make this approximation
# for your model, please read the original paper by Aki Vehtari and colleagues, which
# explains WAIC and LOO: https://arxiv.org/abs/1507.04544
# and check out the example on CRAN: https://cran.r-project.org/web/packages/loo/index.html
##############################################################################
## Model selection based on approximation to leave-one-out cross-validation ##
##############################################################################
# Frequentists would test the null hypothesis that the coefficient on
# a particular term is zero (like grooming.count seems to be).
# Bayesians might ask whether such a model is expected to produce better
# out-of-sample predictions than a model without it.
# For rstanarm, the log-likelihood necessary for these calculations is already
# included in the fit object, but if using your own Stan program you
# need to include its calculation in the generated quantities block, as sketched below.
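# For the simple linear model used earlier, the generated quantities block
# would look something like this (a sketch, not run here):
# generated quantities {
#   vector[N] log_lik;
#   for (n in 1:N) {
#     log_lik[n] = normal_lpdf(y[n] | alpha + x[n] * beta, sigma);
#   }
# }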
require(loo)
loo1 <- loo(stan_glm1)
# The Pareto k diagnostic is used to determine whether it is appropriate to
# use the loo approximation, and it is calculated per observation.
# k values over 0.7 indicate the observation is an outlier and may be
# problematic, meaning it may have unexpectedly high influence
# on the fitted model.
# We can print the loo() output and visualize the k estimates:
loo1
plot(loo1)
# From Jonah Gabry and Ben Goodrich (rstanarm developers):
# "LOO has the same purpose as the AIC that is used by frequentists.
# Both are intended to estimate the expected log predicted density (ELPD)
# for a new dataset. However, the AIC ignores priors and assumes that the
# posterior distribution is multivariate normal, whereas the functions
# from the loo package used here do not assume that the posterior distribution
# is multivariate normal and integrate over uncertainty in the parameters.
# This only assumes that any one observation can be omitted without having
# a major effect on the posterior distribution, which can be judged using the plots above."
# One or two moderate outliers (k>0.5) shouldn't have too much of an effect on the model.
# Rather than ignoring this, one option is to re-fit the model without the problematic
# observations, and calculate the loo statistic directly for them:
loo1 <- loo(stan_glm1, k_threshold=0.7)
loo2 <- loo(stan_glm2, k_threshold=0.7)
loo3 <- loo(stan_glm3, k_threshold=0.7)
loo4 <- loo(stan_glm4, k_threshold=0.7)
loo5 <- loo(stan_glm5, k_threshold=0.7)
# We can then compare models:
compare(loo1, loo2, loo3, loo4, loo5)
# Like AIC, lower "looic" values indicate a preference over other models.
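# For a pairwise comparison, compare() also reports the standard error of the
# difference in expected log predictive density (elpd):
compare(loo4, loo5) # a positive elpd_diff favours the second model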
# Remember! Bayesian model building is not always easy!
# It can require a lot of work to build up to a good working model
# that also reflects what you know about the world.
# The general process is iterative:
# 1. Pick a model
# 2. Simulate data to make sure your model is doing what you think it's doing
# 3. Pick priors (informative? not? do you have external data you could turn into a prior?)
# 4. Inspect model convergence (traceplots, Rhats, and no divergent transitions...)
# 5. Generate posterior predictions and check that they look like your data!
# 6. Repeat...
# Stan is run by a small but dedicated group of developers.
# If you start using Stan, please get involved in the mailing list.
# It's a great resource for understanding and diagnosing problems with Stan,
# and by posting problems you encounter you are helping yourself,
# and giving back to the community.