Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add simulate_scenarios() to easily calculate outbreak size distribution #275

Draft
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

joshwlambert
Copy link
Member

This PR adds the simulate_scenarios() function which enables the exploration of parameter space of $R$ and $k$ values for different offspring distributions and bins the resulting transmission chain statistics into user-specified intervals to enable them to be easily plotted.

Note: This is a work-in-progress feature (hence the draft pull request). Before merging we need to:

  • generalise the ability to pass different distribution parameters to simulate_chain_stats()
  • consider if more input checking is required (I think if we can align the function signatures of simulate_scenarios() with simulate_chain_stats() then most, if not all, of the input checking can be deferred to that function).

@joshwlambert joshwlambert added the enhancement New feature or request label Aug 23, 2024
@codecov-commenter
Copy link

codecov-commenter commented Aug 23, 2024

Codecov Report

Attention: Patch coverage is 98.30508% with 1 line in your changes missing coverage. Please review.

Project coverage is 99.87%. Comparing base (dc53c2c) to head (9348ed7).
Report is 7 commits behind head on main.

Files with missing lines Patch % Lines
R/simulate.R 98.30% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##              main     #275      +/-   ##
===========================================
- Coverage   100.00%   99.87%   -0.13%     
===========================================
  Files            8        8              
  Lines          755      814      +59     
===========================================
+ Hits           755      813      +58     
- Misses           0        1       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

breaks = c(0, 5, 10, 20, Inf)
)
expect_s3_class(df, class = "data.frame")
expect_identical(dim(df), c(240L, 6L))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should remove the hardcoding here and rather find the dimensions of the result of an expand.grid().

stat_threshold = 15
)
expect_s3_class(df, class = "data.frame")
expect_identical(dim(df), c(240L, 6L))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as before on removing the hardcoding to make the scenarios amenable to changes in the inputs.

for (i in seq_len(nrow(scenarios))) {
args <- list(
n_chains = 1000,
offspring_dist = match.fun(scenarios[i, "offspring_dist"]),
Copy link
Member

@jamesmbaazam jamesmbaazam Aug 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll have to consider allowing string function names so that we don't have to do match.fun as it's not ideal. I'll go back through previous considerations in #167 to remind myself why we moved away from character function names as there was a lot of discussion in the early days.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would ask the other way round, why not use functions here instead of character strings?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It relates the the expand.grid() issue stated above. We didn't spend very long trying to find work arounds, so it may be the case that we can use the same function interface in simulate_scenarios().

args <- c(args, args_)
}
args <- utils::modifyList(x = args, val = list(...))
x <- do.call(epichains::simulate_chain_stats, args = args)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is one point of generalisation where we can pass either simulate_chains() or simulate_chain_stats() and the class of result will depend on this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree.

R/simulate.R Outdated
Comment on lines 572 to 574
interval <- cut(x, breaks = breaks)
prop <- table(interval) / sum(table(interval))
df_ <- as.data.frame(prop)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This only fulfills a single use case and we may want to not do this here and leave it to the user. This function could just be about simulating the outbreak over a grid of scenarios and returning the results for downstream analyses/post-processing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree.

@@ -477,3 +477,107 @@ simulate_chain_stats <- function(n_chains,

return(out)
}

#' Calculate transmission chain statistics across a range of scenarios and
#' divide into intervals
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left a comment below on the fact that the part about dividing into intervals is just one use case and so this function should probably return the raw results.

Copy link
Member

@jamesmbaazam jamesmbaazam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for submitting this, Josh. I've left some quick comments for now as this is a living PR for the work in https://github.com/jamesmbaazam/h5n1_uk_scenario_modelling/. As discussed in person, we will have to generalise this beyond just poison and negative binomial. Looking forward to collaborating on this.

R/simulate.R Outdated
R_seq,
k_seq,
breaks,
...) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, this function simulates only one run for each scenario, so we will need to allow for multiple runs here. This is related to #41. I'm torn between having it here versus in the main simulation functions because users may not want to explore scenarios using this function but may want to run the main functions multiple times for one parameter set.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I'll follow your lead on where you think this is best to implement.

@sbfnk
Copy link
Contributor

sbfnk commented Sep 4, 2024

Thanks @joshwlambert, this looks like useful functionality!

I think there's potentially a broader question (to which I don't know the answer) of whether this kind of functionality should be offered by a package as it's essentially a loop over package function calls. In {socialmixr} we made a decision the other way round, i.e. to remove the ability to create replicates: epiforecasts/socialmixr#63

Comment on lines +485 to +489
#' The `offspring_dist` argument is not equivalent to the `offspring_dist`
#' argument in [simulate_chains()] and [simulate_chain_stats()], in those
#' functions `offspring_dist` should be a `<function>`, whereas in
#' `simulate_scenarios()` `offspring_dist` should be a vector of `character`
#' strings which match function names.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the rationale for deviating from the package architecture here? If this is to become part of the package I think we should follow a consistent design.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was due to an issue with expand.grid() and functions. I can't remember the exact issue when we were working on it, but it was something like this:

R_seq = seq(0.1, 1, 0.1)
k_seq = seq(0.1, 0.5, 0.1)
offspring_dist = rpois
statistic = c("size", "length")
scenarios <- expand.grid(
  offspring_dist = offspring_dist,
  statistic = statistic,
  R = R_seq,
  k = k_seq,
  stringsAsFactors = FALSE
)
#> Error in paste0(nmc[i], "=", if (is.numeric(x)) format(x) else x): cannot coerce type 'closure' to vector of type 'character'

Created on 2024-09-06 with reprex v2.1.0

Comment on lines +499 to +503
#' @param R_seq A `numeric` vector of reproduction number values to simulate
#' the branching process for.
#' @param k_seq A `numeric` vector of dispersion values to simulate the
#' branching process for. Only applicable for `offspring_dist = "rnbinom"`.
#' @inheritParams base::cut
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not call them R and k? They don't generally have to be a sequence here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, they can be renamed.

for (i in seq_len(nrow(scenarios))) {
args <- list(
n_chains = 1000,
offspring_dist = match.fun(scenarios[i, "offspring_dist"]),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would ask the other way round, why not use functions here instead of character strings?

R/simulate.R Outdated
Comment on lines 572 to 574
interval <- cut(x, breaks = breaks)
prop <- table(interval) / sum(table(interval))
df_ <- as.data.frame(prop)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree.

@joshwlambert
Copy link
Member Author

I think there's potentially a broader question (to which I don't know the answer) of whether this kind of functionality should be offered by a package as it's essentially a loop over package function calls. In {socialmixr} we made a decision the other way round, i.e. to remove the ability to create replicates: epiforecasts/socialmixr#63

I think before continuing development of this function it would be good to decide if we want this functionality as a function, or via documentation. I think the code in the function body could be used to write a vignette demonstrating how to simulate scenarios and then group them. This also relates to @jamesmbaazam's point:

This only fulfills a single use case and we may want to not do this here and leave it to the user. This function could just be about simulating the outbreak over a grid of scenarios and returning the results for downstream analyses/post-processing.

To me half the utility of this function comes from it wrapping cut() instead of requiring the user to know how to use it. But at the same time, a vignette is likely easier to maintain and might be more flexible to a variety of use cases we haven't foreseen.

As I'm just contributing, I'm happy to go with what you both (@jamesmbaazam & @sbfnk) prefer and then I'm happy to work on it.

@joshwlambert
Copy link
Member Author

I think this function should be converted into a vignette. @jamesmbaazam & @sbfnk please let me know if you agree and how you'd like to proceed. I'm happy to draft the vignette and then can assign you both to review, edit and then merge if deemed suitable.

@sbfnk
Copy link
Contributor

sbfnk commented Oct 10, 2024

Thanks @joshwlambert, that sounds like a good idea to me.

@jamesmbaazam
Copy link
Member

@joshwlambert I believe it's a good idea, but I would recommend pivoting the focus to highlight the application of epichains for estimating outbreak sizes based on offspring distribution scenarios. This way, the emphasis will be on the application rather than the function itself. The multi-simulation pattern is already outlined in the COVID-19 vignette, so we can build on that pattern as an application. We can in fact combine the idea here with that outlined in #77. The same simulation can be used to achieve both outcomes.

Let me know what you think.

@joshwlambert
Copy link
Member Author

I would recommend pivoting the focus to highlight the application of epichains for estimating outbreak sizes based on offspring distribution scenarios. This way, the emphasis will be on the application rather than the function itself.

Yes, I agree, I plan to remove the simulate_scenarios() functions and instead use the code from the function body as a script to explain to the reader how to explore distributions of outbreak size and length.

I will make the changes in a new branch and open a new PR and then this PR can be closed without merging.

@jamesmbaazam
Copy link
Member

I would recommend pivoting the focus to highlight the application of epichains for estimating outbreak sizes based on offspring distribution scenarios. This way, the emphasis will be on the application rather than the function itself.

Yes, I agree, I plan to remove the simulate_scenarios() functions and instead use the code from the function body as a script to explain to the reader how to explore distributions of outbreak size and length.

I will make the changes in a new branch and open a new PR and then this PR can be closed without merging.

Perfect!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants