
allow parallel computation during bootstrapping #436

Open
IndrajeetPatil opened this issue Mar 8, 2021 · 11 comments
Labels
3 investigators ❔❓ Need to look further into this issue Enhancement 💥 Implemented features can be improved or revised Help us 👀 Help is needed to implement something Low priority 😴 This issue does not impact package functionality much

Comments

@IndrajeetPatil (Member)

This requires adding a new parallel argument to model_parameters and then passing the value to boot calls:

For example, here we can add parallel = parallel inside the call:

results <- boot::boot(data = data, statistic = boot_function, R = iterations, model = model)

We can also default to parallel = "multicore", so that multiple cores, if available, are used by default.
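A minimal sketch of what this could look like. The wrapper name `bootstrap_model` and its signature are assumptions for illustration; only `parallel` and `ncpus` mirror `boot::boot()`'s actual interface:

```r
# Hypothetical sketch: forwarding parallelization options down to
# boot::boot(), which already supports parallel = "no"/"multicore"/"snow".
# This is not the actual parameters implementation.
bootstrap_model <- function(model, data, boot_function, iterations = 1000,
                            parallel = c("no", "multicore", "snow"),
                            ncpus = 1L) {
  parallel <- match.arg(parallel)
  boot::boot(
    data = data,
    statistic = boot_function,
    R = iterations,
    model = model,          # passed through ... to the statistic
    parallel = parallel,    # forwarded to boot's built-in parallel support
    ncpus = ncpus
  )
}
```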

@IndrajeetPatil IndrajeetPatil added the Enhancement 💥 Implemented features can be improved or revised label Mar 8, 2021
@DominiqueMakowski (Member)

Wouldn't it be better to let that be passed through the ellipsis (`...`) to avoid cluttering the API? Or to retrieve it from the options (as Stan does)?

@IndrajeetPatil (Member, Author)

Can't create a reprex because parallel doesn't seem to work with it. But passing the dots works (PR: #439).

> set.seed(123)
> library(parameters)
> 
> mod <- lm(formula = wt ~ mpg, data = mtcars)
> 
> set.seed(123)
> system.time(model_parameters(mod, bootstrap = TRUE, iterations = 1000, parallel = "no")) 
   user  system elapsed 
  1.043   0.007   1.057 
> 
> set.seed(123)
> system.time(
+   model_parameters(
+     mod,
+     bootstrap = TRUE,
+     iterations = 1000,
+     parallel = "multicore",
+     ncpus = 4L
+   )
+ ) 
   user  system elapsed 
  0.078   0.056   0.613 

@strengejacke (Member)

"multicore" doesn't work on windows.
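Right: "multicore" relies on forking, which Windows doesn't support, whereas "snow" spawns separate worker processes and works on all platforms. A possible workaround (a sketch; the helper name is hypothetical) is to pick the backend by platform:

```r
# "multicore" uses fork(), unavailable on Windows;
# "snow" works everywhere but has higher startup overhead.
default_parallel <- function() {
  if (.Platform$OS.type == "windows") "snow" else "multicore"
}

# e.g.:
# boot::boot(data, statistic, R = 1000,
#            parallel = default_parallel(), ncpus = 2L)
```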

@strengejacke (Member)

Using regular R or Microsoft R Open doesn't seem to make a difference; increasing the number of CPUs used even slows things down:

library(parameters)
#> Warning: Paket 'parameters' wurde unter R Version 4.0.4 erstellt
model <- lm(mpg ~ wt + cyl, data = mtcars)

microbenchmark::microbenchmark(
  model_parameters(model, bootstrap = TRUE, iterations = 1000, parallel = "snow", ncpus = 4),
  times = 5
)
#> Unit: seconds
#>                                                                                             expr
#>  model_parameters(model, bootstrap = TRUE, iterations = 1000,      parallel = "snow", ncpus = 4)
#>       min       lq    mean   median       uq      max neval
#>  2.146296 2.178574 2.18241 2.179772 2.200774 2.206634     5

microbenchmark::microbenchmark(
  model_parameters(model, bootstrap = TRUE, iterations = 1000, parallel = "no", ncpus = 4),
  times = 5
)
#> Unit: seconds
#>                                                                                           expr
#>  model_parameters(model, bootstrap = TRUE, iterations = 1000,      parallel = "no", ncpus = 4)
#>       min      lq     mean   median       uq      max neval
#>  1.120941 1.12849 1.132289 1.128846 1.137772 1.145394     5

microbenchmark::microbenchmark(
  model_parameters(model, bootstrap = TRUE, iterations = 1000, parallel = "multicore", ncpus = 4),
  times = 5
)
#> Unit: seconds
#>                                                                                                  expr
#>  model_parameters(model, bootstrap = TRUE, iterations = 1000,      parallel = "multicore", ncpus = 4)
#>       min      lq     mean   median      uq      max neval
#>  1.102907 1.10788 1.117547 1.114816 1.12571 1.136424     5

Created on 2021-03-09 by the reprex package (v1.0.0)

@IndrajeetPatil (Member, Author)

Yeah, I am seeing the same on my Mac: the computation time actually increases if I use parallel computing with ncpus set to some value > 1.

It's all a bit confusing. And this has nothing to do with parameters functions.

Here is an example from the boot package docs:

library(boot)
library(microbenchmark)

# usual bootstrap of the ratio of means using the city data
ratio <- function(d, w) sum(d$x * w) / sum(d$u * w)

set.seed(123)
microbenchmark::microbenchmark(
  boot(city, ratio, R = 4999, stype = "w"),
  times = 5
)
#> Unit: milliseconds
#>                                      expr      min       lq     mean   median
#>  boot(city, ratio, R = 4999, stype = "w") 30.76705 36.27656 39.59618 40.73334
#>        uq      max neval
#>  42.90163 47.30233     5

options(boot.parallel = "multicore")
set.seed(123)
microbenchmark::microbenchmark(
  boot(city, ratio, R = 4999, stype = "w", ncpus = 5),
  times = 5
)
#> Unit: milliseconds
#>                                                 expr      min       lq    mean
#>  boot(city, ratio, R = 4999, stype = "w", ncpus = 5) 44.64621 47.21875 51.9313
#>    median       uq     max neval
#>  48.56907 50.58117 68.6413     5

Created on 2021-03-10 by the reprex package (v1.0.0)

I think we should stay away from making any changes to parameters until we figure out how to successfully use boot's parallel computation functionality.

@strengejacke (Member)

Yes, sounds good.

@strengejacke strengejacke added 3 investigators ❔❓ Need to look further into this issue Help us 👀 Help is needed to implement something Low priority 😴 This issue does not impact package functionality much labels Mar 10, 2021
@IndrajeetPatil (Member, Author)

@bwiernik Do you have any ideas about how to get this to work?

@bwiernik (Contributor) commented Jul 4, 2021

Yeah, I can take a look

@vincentarelbundock (Contributor)

future is probably a better platform for cross-platform parallel computation: https://cran.r-project.org/web/packages/future/index.html

The examples in this thread are probably all too small (OLS with N = 32), so the parallel overhead outweighs the gains.

Perhaps one strategy would be for us to support extracting results from boot and other bootstrap objects. That way, users who want fancy features like parallel computation can use the existing support in the appropriate package, and we can extract and display the estimates.
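As a sketch of what the future-based route could look like (using the future.apply package on top of future; the statistic and setup here are illustrative, not parameters code):

```r
library(boot)          # only for the city example data
library(future.apply)  # parallel apply functions built on future

# multisession spawns background R sessions; portable, works on Windows too
plan(multisession, workers = 4)

ratio <- function(d, i) sum(d$x[i]) / sum(d$u[i])

# Draw bootstrap replicates in parallel; future_replicate() splits the
# work across workers, and future.seed = TRUE gives parallel-safe RNG.
boot_est <- future_replicate(
  4999,
  ratio(city, sample(nrow(city), replace = TRUE)),
  future.seed = TRUE
)

quantile(boot_est, c(0.025, 0.975))  # percentile bootstrap interval
```

Note the same caveat applies: for a statistic this cheap, worker startup and communication overhead will likely dominate any speedup.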

@bwiernik (Contributor)

One of the major benefits of parameters is that we provide a simple interface for bootstrapping, which is otherwise really difficult for new users (learning to use the boot package is a nightmare). I agree that we should use future for parallelization, but I do think we should support it.

@vincentarelbundock (Contributor)

You're right. boot is kind of a nightmare to learn.
