mixed vs fixed model bootstrap does not use the same type #505
As far as I know, there is no real "non-parametric" bootstrapping for merMod, only "semi-parametric". Not sure if this will bring us closer to consistency, or if it is preferred to stick to the defaults?
In the docs, the options are parametric or semiparametric; there is no non-parametric option: https://www.rdocumentation.org/packages/lme4/versions/1.1-27/topics/bootMer
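(For reference, a minimal sketch of both bootMer options, using lme4's sleepstudy data; note that type = "semiparametric" requires use.u = TRUE:)

library(lme4)

m <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)

# statistic to bootstrap: the fixed-effect coefficients
fixef_fun <- function(fit) fixef(fit)

set.seed(1)
b_param <- bootMer(m, fixef_fun, nsim = 100, type = "parametric")
b_semi  <- bootMer(m, fixef_fun, nsim = 100, type = "semiparametric", use.u = TRUE)

# percentile intervals from the bootstrap replicates
apply(b_param$t, 2, quantile, probs = c(0.025, 0.975))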
Yeah, parametric is really the only stable option for bootMer. Semiparametric is pretty unstable; in my experience it gives weird results fairly often, and it is applicable to only a limited set of models. Resampling cases for mixed-effects models is tricky (it basically has to happen at the cluster level) and is pretty intractable for crossed models. There is this package, but I've not investigated it: https://cran.r-project.org/web/packages/lmeresampler/lmeresampler.pdf Do we accept and pass along the use.u option to bootMer? If not, we should. I'd also like to support other bootstrap intervals besides the percentile (e.g., BCa). Opened an issue for this earlier today.
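(For illustration, a rough sketch of what cluster-level case resampling would look like, reusing the sleepstudy model m from the sketch above; the resample_clusters helper is hypothetical, not part of any package:)

# hypothetical helper: resample whole subjects (clusters) with replacement,
# relabelling them so the refitted model treats repeats as distinct clusters
resample_clusters <- function(data, cluster) {
  ids <- sample(unique(data[[cluster]]), replace = TRUE)
  out <- lapply(seq_along(ids), function(i) {
    d <- data[data[[cluster]] == ids[i], ]
    d[[cluster]] <- paste0(ids[i], "_", i)
    d
  })
  do.call(rbind, out)
}

set.seed(1)
boot_est <- replicate(
  100,
  lme4::fixef(update(m, data = resample_clusters(lme4::sleepstudy, "Subject")))
)
apply(boot_est, 1, quantile, probs = c(0.025, 0.975))  # percentile CIs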
I might have triggered this issue with a query that I sent yesterday to a few of you (apologies for the redundancy). I feel that there should be consistency in the call of model_parameters, and perhaps much clearer wording in the help and the vignettes. For the "typical" sizes (a few dozen, < 100?) of datasets analyzed by scientists, I think it does matter whether non-parametric or parametric bootstrap is used to estimate the 95% CIs and p-values. Non-parametric bootstrapping of a small sample until doomsday will never generate values on the tails of the real distribution of the response variable, and those tails are key to the 95% CIs and p-values. Regardless of the probability distribution of the response variable of a model, values in the upper/lower 10% are by definition unlikely to be represented in a sample of moderate size. Hence, the 95% CIs and p-values obtained by non-parametric bootstrap are bound to be somewhat biased and imprecise. Provided that one has a statistical model with a reasonable goodness of fit to the data, parametric bootstrap should almost always provide better 95% CIs and p-values than non-parametric bootstrap: generating pseudo-sample values in the upper/lower 10% from a probability distribution reasonably fitted by a model depends on the number of bootstrap replications (i.e., computer brute force) and not on the size of the original data. Almost certainly, the differences in CIs and p-values between non-parametric and parametric bootstrap disappear for large (> 250?) sample sizes, but in those cases one should probably rely on the asymptotics of Wald statistics to get the CIs and p-values. Other R libraries (e.g., glmmboot) doing bootstrap on these models employ such "weasel wording" in the description of their methods that one must query the authors to know what they actually do. This is why I feel that the excellent set of libraries of your "ecosystem" deserves a better and clearer description for the benefit of its users (like me).
In which way? Which part could be improved?
If I might be bold for a minute, I would suggest three things:
hm, option "parallel" for …

model <- lm(mpg ~ wt + factor(cyl), data = mtcars)
boot_function <- function(model, data, indices) {
  d <- data[indices, ] # allows boot to select the bootstrap sample
fit <- stats::update(model, data = d)
params <- insight::get_parameters(fit)
n_params <- insight::n_parameters(model)
if (nrow(params) != n_params) {
params <- stats::setNames(rep.int(NA, n_params), params$Parameter)
} else {
params <- stats::setNames(params$Estimate, params$Parameter) # Transform to named vector
}
return(params)
}
set.seed(123)
results <- boot::boot(
data = mtcars,
statistic = boot_function,
R = 100,
sim = "ordinary",
model = model
)
as.data.frame(results$t) |> head()
#> V1 V2 V3 V4
#> 1 33.40302 -3.247039 -3.697935 -5.852254
#> 2 35.10031 -3.802603 -3.918454 -4.430873
#> 3 33.47632 -2.927751 -4.632131 -5.792161
#> 4 33.55433 -2.763675 -4.787897 -7.392401
#> 5 29.48630 -2.436482 -2.511552 -4.973958
#> 6 34.77243 -3.755184 -3.219585 -4.911665
set.seed(123)
results <- boot::boot(
data = mtcars,
statistic = boot_function,
R = 100,
sim = "parametric",
model = model
)
as.data.frame(results$t) |> head()
#> V1 V2 V3 V4
#> 1 33.99079 -3.205613 -4.255582 -6.07086
#> 2 33.99079 -3.205613 -4.255582 -6.07086
#> 3 33.99079 -3.205613 -4.255582 -6.07086
#> 4 33.99079 -3.205613 -4.255582 -6.07086
#> 5 33.99079 -3.205613 -4.255582 -6.07086
#> 6 33.99079 -3.205613 -4.255582 -6.07086 Created on 2021-05-27 by the reprex package (v2.0.0) |
library(parameters)
model <- lm(mpg ~ wt + factor(cyl), data = mtcars)
set.seed(123)
model_parameters(model, bootstrap = TRUE, iterations = 100)
#> Parameter | Coefficient | 95% CI | p
#> --------------------------------------------------
#> (Intercept) | 34.21 | [29.50, 39.25] | 0.010
#> wt | -3.09 | [-5.02, -1.98] | 0.010
#> cyl [6] | -4.33 | [-6.49, -1.61] | 0.010
#> cyl [8] | -5.99 | [-9.17, -2.71] | 0.010
set.seed(123)
model_parameters(model, bootstrap = TRUE, iterations = 100, type = "parametric")
#> Parameter | Coefficient | 95% CI | p
#> --------------------------------------------------
#> (Intercept) | 33.99 | [33.99, 33.99] | 0.010
#> wt | -3.21 | [-3.21, -3.21] | 0.010
#> cyl [6] | -4.26 | [-4.26, -4.26] | 0.010
#> cyl [8]     |       -6.07 | [-6.07, -6.07] | 0.010

Created on 2021-05-27 by the reprex package (v2.0.0)
@strengejacke I think it's a little weird that the reported point estimates change when bootstrap approaches are used. I think most users requesting bootstrap intervals would expect the maximum likelihood estimates along with the bootstrap intervals.
If model_parameters(model, bootstrap = TRUE, iterations = 100, type = "parametric") does a parametric bootstrap for all common models (lm, glm, glmer, lmer, glmmTMB, etc.), I might have provoked a storm in a teacup. In my view, it is still strange that the user has to go to "See 'Examples' in model_parameters.default" in the help of model_parameters to learn that parametric bootstrap is available, and that nothing in this regard is said in the help of bootstrap_model and bootstrap_parameters. If so, I should think that some editorial work in the help of the latter functions and in the vignette would solve my original query. I still stand by the relative merits of parametric vs. non-parametric bootstrap from before.
The more curious thing here is that the bootstrapped estimates are all the same when sim = "parametric" is used.
But for skewed data, isn't a bootstrapped estimate more useful? Otherwise, one could just request robust standard errors.
Oh, yeah, that's not right. Can you link to where you specify the parametric call to boot? You need to also give the ran.gen function (e.g., simulate) and the mle estimate.
My view is that if you're going to make a parametric assumption with a single-level model, you may as well construct intervals using an analytic method such as a pivot or profile likelihood. These methods are tractable for such models and computationally more efficient. I don't see all that much value in the parametric bootstrap in those cases; the value of the bootstrap there is to avoid parametric assumptions when the sampling distribution isn't likely to follow those assumptions.
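(For a single-level example of such analytic intervals in base R; the binomial model here is just an illustration:)

model <- glm(am ~ wt, data = mtcars, family = binomial)
confint(model)         # profile-likelihood intervals (computed via MASS)
confint.default(model) # Wald intervals, for comparison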
The code is in parameters/R/bootstrap_model.R (line 68, commit 75e5a0b).
So you would suggest we stick to the default "ordinary"?
I believe that if one sets type = "ordinary", the boot command performs a non-parametric bootstrap by default. The parametric bootstrap option in boot requires providing a ran.gen function to generate R new pseudo-values of the response variable using the parameters of the originally fitted model; the model is then refitted on the R pseudo-samples to obtain the bootstrapped sampling distribution of each model parameter, from which the p-values and CIs are calculated.
This

results <- boot::boot(
  data = data,
  statistic = boot_function,
  R = iterations,
  sim = type,
  parallel = parallel,
  ncpus = n_cpus,
  model = model
)

needs to be something like:

# ran.gen must return a simulated data set; boot calls it as ran.gen(data, mle),
# so the fitted model is passed in through the `mle` argument
f <- function(x, mle) {
  out <- insight::get_data(mle)
  resp <- simulate(mle, nsim = 1)
  out[[insight::find_response(mle)]] <- resp[[1]]
  return(out)
}

results <- boot::boot(
  data = data,
  statistic = boot_function,
  R = iterations,
  sim = type,
  parallel = parallel,
  ncpus = n_cpus,
  model = model,
  ran.gen = f,
  mle = model
)
I would argue that the primary motivation for the parametric bootstrap is the same as for MCMC or other computational methods: to conduct inference when analytic methods are intractable (e.g., due to model complexity, or convergence issues due to shallow likelihood surfaces or parameters near boundaries). Based on my reading of the literatures I'm familiar with, parametric bootstrap intervals generally have similar coverage to profile intervals. When a pivot is available (e.g., for linear regression coefficients), these usually perform better even in small samples than either profile or parametric bootstrap (this is the whole basis of fiducial inference). For a variety of problems, the parametric bootstrap can have somewhat or severely erratic performance (e.g., https://www.genetics.org/content/174/1/481; https://onlinelibrary.wiley.com/doi/pdf/10.1002/sim.2514; https://www.tandfonline.com/doi/pdf/10.1080/10705511.2011.607072). Of course, you may disagree, but in my view, the assumption that differences in deviance values are χ2-distributed becomes reasonably correct even at fairly small sample sizes (e.g., N = 20–30; some bootstrap texts jokingly suggest N = 42). In my experience in the social sciences, most analysts adopt a bootstrap approach because they wish to avoid parametric assumptions, rather than due to coverage concerns. So, I think the current default basic bootstrap approach is likely best.
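(As an aside, other interval types are already available from a fitted boot object via boot::boot.ci(); e.g., reusing the sim = "ordinary" results from the reprex above:)

# percentile, basic, and BCa intervals for the second coefficient (wt);
# BCa may warn with only R = 100 replicates
boot::boot.ci(results, index = 2, type = c("perc", "basic", "bca"))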
To summarize:
So can we close this issue? |
I would say that it is reasonable to close the issue. Many thanks for all your effort. I think I found a minor sign inconsistency in the overdispersion parameter of negative binomial models fitted with glmmTMB. The fitted parameter must be positive (0.566), yet it comes out negative when bootstrapped. It might be useful to change the label in the bootstrap table from "(Intercept).1" to "overdispersion param.". Also, when the fitted model is a zero-truncated negative binomial, the overdispersion parameter does not appear when bootstrapped. The latter may not be a great loss for most users, and I just wonder if, for overall consistency, it might be better to simply delete it from the glm.NB fitted with glmmTMB. The choice and the work are yours, of course.

library(glmmTMB)
glm.NB <- glmmTMB(count ~ mined, family=nbinom2, data=Salamanders)
summary(glm.NB)
# Family: nbinom2 ( log )
# Formula: count ~ mined
# Data: Salamanders
# AIC BIC logLik deviance df.resid
# 1762.3 1775.7 -878.2 1756.3 641
# Overdispersion parameter for nbinom2 family (): **0.566**
#
# Conditional model:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) -1.2192 0.1293 -9.427 <2e-16 ***
# minedno 2.0368 0.1527 13.343 <2e-16 ***
#
bootstrap_parameters(model=glm.NB, iterations=100)
# # Fixed Effects
#
# Parameter | Coefficient | 95% CI | p
# ----------------------------------------------------
# (Intercept) | -1.26 | [-1.54, -0.87] | 0.010
# minedno | 2.07 | [ 1.67, 2.37] | 0.010
# (Intercept).1 | **-0.58** | [-0.79, -0.28] | 0.010
@PabloInchausti Internally, the overdispersion parameter is formulated as negative by glmmTMB:

glm.NB <- glmmTMB::glmmTMB(count ~ mined, family = glmmTMB::nbinom2, data = glmmTMB::Salamanders)
#> Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
#> integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you
#> Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
#> integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you
#> Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
#> integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you
summary(glm.NB)
#> Family: nbinom2 ( log )
#> Formula: count ~ mined
#> Data: glmmTMB::Salamanders
#>
#> AIC BIC logLik deviance df.resid
#> 1762.3 1775.7 -878.2 1756.3 641
#>
#>
#> Overdispersion parameter for nbinom2 family (): 0.566
#>
#> Conditional model:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -1.2192 0.1293 -9.427 <2e-16 ***
#> minedno 2.0368 0.1527 13.343 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
glm.NB$fit
#> $par
#> beta beta betad
#> -1.2192401 2.0367623 -0.5696049
#>
#> $objective
#> [1] 878.1677
#>
#> $convergence
#> [1] 0
#>
#> $iterations
#> [1] 11
#>
#> $evaluations
#> function gradient
#> 15 12
#>
#> $message
#> [1] "relative convergence (4)"
#>
#> $parfull
#> beta beta betad
#> -1.2192401  2.0367623 -0.5696049

Created on 2021-06-07 by the reprex package (v2.0.0)

(If you post more code examples, please format them using code blocks.)
The overdispersion parameter is indeed reported as negative in $par and $parfull, but not in the summary seen and interpreted by all users. glmmTMB probably changes the sign because this parameter in the NB2 parameterization of the negative binomial distribution must be a positive real number, and perhaps your package parameters ought to do likewise.
Specifically, it's applying the inverse link function for the sigma model. The betad parameter is on the log scale, so for nbinom2 the reported value is exp(betad); for gaussian, exp(betad) is the residual variance. These transformations only apply if there is a single betad parameter, not multiple terms in the dispersion model.
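For the model above, that back-transformation can be checked directly:

# betad is stored on the log scale; exp() recovers the dispersion
# parameter shown in summary()
exp(glm.NB$fit$par[["betad"]])  # exp(-0.5696) = 0.566
sigma(glm.NB)                   # the same value via the sigma() accessor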
@strengejacke After we bootstrap glmmTMB models, we should split out the parameters into cond, zi, and disp lists separately.
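(A rough sketch of that split, assuming insight::get_parameters() labels the components for glmmTMB models; split_components is a hypothetical helper:)

# hypothetical helper: split one vector of bootstrapped glmmTMB estimates
# into conditional, zero-inflation, and dispersion components
split_components <- function(fit, estimates) {
  comp <- insight::get_parameters(fit, component = "all")$Component
  split(estimates, comp)
}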
In bootstrap_model:

- lme4::bootMer() defaults to type = "parametric".
- boot::boot() defaults to sim = "ordinary", which is non-parametric.

Should we be consistent? Or does the difference between the models somehow justify this?
@bwiernik thoughts?