From e3943ecc0bf16aad080f3ba8e27abf11d052c869 Mon Sep 17 00:00:00 2001 From: Hugo Gruson <10783929+Bisaloo@users.noreply.github.com> Date: Thu, 25 Apr 2024 13:54:32 +0200 Subject: [PATCH] Harmonize blog post tags --- .../posts/extend-dataframes/index/execute-results/html.json | 4 ++-- _freeze/posts/for-vs-apply/index/execute-results/html.json | 4 ++-- .../statistical-correctness/index/execute-results/html.json | 4 ++-- .../posts/system-dependencies/index/execute-results/html.json | 4 ++-- posts/extend-dataframes/index.qmd | 2 +- posts/for-vs-apply/index.qmd | 2 +- posts/lint-rcpp/index.qmd | 2 +- posts/share-cpp/index.qmd | 2 +- posts/statistical-correctness/index.qmd | 2 +- posts/system-dependencies/index.qmd | 2 +- 10 files changed, 14 insertions(+), 14 deletions(-) diff --git a/_freeze/posts/extend-dataframes/index/execute-results/html.json b/_freeze/posts/extend-dataframes/index/execute-results/html.json index 671c61b1..ec45192b 100644 --- a/_freeze/posts/extend-dataframes/index/execute-results/html.json +++ b/_freeze/posts/extend-dataframes/index/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "664c966c2583572041a2047ac2092fd8", + "hash": "4df6c10d80016a065384227af241115a", "result": { "engine": "knitr", - "markdown": "---\ntitle: \"Extending Data Frames\"\nsubtitle: \"Creating custom classes and {dplyr} compatibility\"\nauthor:\n - name: \"Joshua W. Lambert\"\n orcid: \"0000-0001-5218-3046\"\ndate: \"2023-04-12\"\ncategories: [data frame, R, R package, interoperability, S3 class, dplyr]\nformat:\n html:\n toc: true\n---\n\n\n## Extending Data Frames in R\n\nR is a commonly used language for data science and statistical computing. Foundational to this is having data structures that allow manipulation of data with minimal effort and cognitive load. One of the most commonly required data structures is tabular data. This can be represented in R in a few ways, for example a matrix or a data frame. The data frame (class `data.frame`) is a flexible tabular data structure, as it can hold different data types (e.g. numbers, character strings, etc.) across different columns. 
This is in contrast to matrices -- which are arrays with dimensions -- and thus can only hold a single data type.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# data frame can hold heterogeneous data types across different columns\ndata.frame(a = c(1, 2, 3), b = c(4, 5, 6), c = c(\"a\", \"b\", \"c\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n a b c\n1 1 4 a\n2 2 5 b\n3 3 6 c\n```\n\n\n:::\n\n```{.r .cell-code}\n# each column must be of the same type\ndf <- data.frame(a = c(1, 2, 3), b = c(\"4\", 5, 6))\n# be careful of the silent type conversion\ndf$a\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 1 2 3\n```\n\n\n:::\n\n```{.r .cell-code}\ndf$b\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"4\" \"5\" \"6\"\n```\n\n\n:::\n\n```{.r .cell-code}\nmat <- matrix(1:9, nrow = 3, ncol = 3)\nmat\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [,1] [,2] [,3]\n[1,] 1 4 7\n[2,] 2 5 8\n[3,] 3 6 9\n```\n\n\n:::\n\n```{.r .cell-code}\nmat[1, 1] <- \"1\"\n# be careful of the silent type conversion\nmat\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [,1] [,2] [,3]\n[1,] \"1\" \"4\" \"7\" \n[2,] \"2\" \"5\" \"8\" \n[3,] \"3\" \"6\" \"9\" \n```\n\n\n:::\n:::\n\n\nData frames can even be nested, cells can be data frames or lists.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf <- data.frame(a = \"w\", b = \"x\")\ndf[1, 1][[1]] <- list(c = c(\"y\", \"z\"))\ndf\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n a b\n1 y, z x\n```\n\n\n:::\n\n```{.r .cell-code}\ndf <- data.frame(a = \"w\", b = \"x\")\ndf[1, 1][[1]] <- list(data.frame(c = \"y\", d = \"z\"))\ndf\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n a b\n1 y, z x\n```\n\n\n:::\n:::\n\n\nIt is therefore clear why data frames are so prevalent. However, they are not without limitations. They have a relatively basic printing method which can fload the R console when the number of columns or rows is large. They have useful methods (e.g., `summary()` and `str()`), but these might not be appropriate for certain types of tabular data. In these cases it is useful to utilise R's inheritance mechanisms (specifically S3 inheritance) to write extensions for R's `data.frame` class. In this case the data frame is the superclass and the new subclass extends it and inherits its methods (see [the *Advanced R* book](https://adv-r.hadley.nz/s3.html#s3-inheritance) for more details on S3 inheritance).\n\nOne of the most common extension of the data frame is the `tibble` from the {tibble} R package. Outlined in [{tibble}'s vignette](https://tibble.tidyverse.org/articles/tibble.html), `tibble`s offer improvements in printing, subsetting and recycling rules. Another commonly used data frame extension is the `data.table` class from the [{data.table} R package](https://github.com/Rdatatable/data.table). In addition to the improved printing, this class is designed to improve the performance (i.e. speed and efficiency of operations and storage) of working with tabular data in R and provide a terse syntax for manipulation.\n\nIn the process of developing R software (most likely an R package), a new tabular data class that builds atop data frames can become beneficial. This blog post has two main sections:\n\n1. a brief overview of the steps required to setup a class that extends data frames\n2. 
guide to the technical aspects of class invariants (required data members of a class) and design and implementation decisions, and tidyverse compatibility\n\n### Writing a custom data class\n\nIt is useful to write a class constructor function that can be called to create an object of your new class. The functions defined below are a redacted version (for readability) of functions available in the [{ExtendDataFrames} R package](https://github.com/joshwlambert/ExtendDataFrames), which contains example functions and files discussed in this post. When assigning the class name ensure that it is a vector containing `\"data.frame\"` as the last element\nto correctly inherit properties and methods from the `data.frame` class.\n\n```r\nbirthdays <- function(x) {\n # the vector of classes is required for it to inherit from `data.frame`\n structure(x, class = c(\"birthdays\", \"data.frame\"))\n}\n```\n\nThat's all that's needed to create a subclass of a data frame. However, although we've created the class we haven't given it any functionality and thus it will be identical to a data frame due to inheritance.\n\nWe can now write as many methods as we want. Here we will show two methods, one of which does not require writing a generic (`print.birthdays`) and the second that does (`birthdays_per_month`). The `print()` generic function is provided by R, which is why we do not need to add one ourselves. See [Adv R](https://adv-r.hadley.nz/s3.html#s3-methods) and this [Epiverse blog post](https://epiverse-trace.github.io/posts/s3-generic/) to find out more about S3 generics.\n\n```r\nprint.birthdays <- function(x, ...) {\n cat(\n sprintf(\n \"A `birthdays` object with %s rows and %s cols\",\n dim(x)[1], dim(x)[2]\n )\n )\n invisible(x)\n}\n\nbirthdays_per_month <- function(x, ...) {\n UseMethod(\"birthdays_per_month\")\n}\n\nbirthdays_per_month.birthdays <- function(x, ...) {\n out <- table(lubridate::month(x$birthday))\n months <- c(\n \"Jan\", \"Feb\", \"Mar\", \"Apr\", \"May\", \"Jun\",\n \"Jul\", \"Aug\", \"Sep\", \"Oct\", \"Nov\", \"Dec\"\n )\n names(out) <- months[as.numeric(names(out))]\n return(out)\n}\n```\n\n::: {.callout-tip}\nUseful resources for the \"Writing custom data class\" section:\n[extending `tibbles` and their functionality](https://tibble.tidyverse.org/articles/extending.html)\n:::\n\n### Design decision around class invariants\n\nWe will now move on to the second section of the post, in which we discuss the design choices when creating and using S3 classes in R. ***Class invariants*** are members of your class that define it. In other words, without these elements your class does not fulfil its basic definition. It is therefore sensible to make sure that your class contains these elements at all times (or at least after operations have been applied to your class). In cases when the class object contains all the invariants normal service can be continued. However, in the case that an invariant is missing or modified to a non-conformist type (e.g. a date converted to a numeric) a decision has to be made. Either the code can error, hopefully giving the user an informative message as to why their modification broke the object; alternatively, the subclass can be revoked and the superclass can be returned. In almost all cases the superclass (i.e. 
the base class being inherited from) is more general and won't have the same class invariant restrictions.\n\nFor our example class, ``, the invariants are a column called `name` which must contain characters, and a column called `birthday` which must contain dates. The order of the rows and columns is not considered an invariant property, and having extra columns with other names and data types is also allowed. The number of rows is also not an invariant as we can have as many birthdays as we like in the data object.\n\nHere we present both cases as well as considerations and technical details of both options. We'll demonstrate both of these cases with the subset function in R (subsetting uses a single square bracket for tabular data, `[`). First the fail-on-subsetting. Before we write the subsetting function it is useful to have a function that checks that an object of our class is valid, a so-called validator function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvalidate_birthdays <- function(x) {\n stopifnot(\n \"input must contain 'name' and 'birthday' columns\" =\n all(c(\"name\", \"birthday\") %in% colnames(x)),\n \"names must be a character\" =\n is.character(x$name),\n \"birthday must be a date\" =\n lubridate::is.Date(x$birthday)\n )\n invisible(x)\n}\n```\n:::\n\n\nThis will return an error if the class is not valid (defined in terms of the class' invariants).\n\nNow we can show how to error if one of the invariants are removed during subsetting. See `?NextMethod()` for information on method dispatch.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n`[.birthdays` <- function(x) {\n validate_birthdays(NextMethod())\n}\n\nbirthdays[, -1]\n# Error in validate_birthdays(NextMethod()) :\n# input must contain 'name' and 'birthday' columns\n```\n:::\n\n\nThe second design option is the reconstruct-on-subsetting. This checks whether the class is valid, and if not downgrade the class to the superclass, in our case a data frame. This is done by not only validating the object during subsetting but to check whether it is a valid class object, and then either ensuring all of the attributes of the subclass -- in our case `` -- are maintained, or attributes are stripped and only the attributes of the base superclass -- in our case `data.frame` -- are kept.\n\n::: {.callout-note}\nImportant note: this section of the post relies heavily on .\n:::\n \nThe four functions that are required to be added to ensure our class is correctly handled when invaliding it are:\n \n- `birthdays_reconstruct()`\n- `birthdays_can_reconstruct()`\n- `df_reconstruct()`\n- `dplyr_reconstruct.birthdays()`\n\nWe'll tackle the first three first, and then move onto to the last one as this requires some extra steps.\n\n`birthdays_reconstruct()` is a function that contains an if-else statement to determine whether the returned object is a `` or `data.frame` object.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbirthdays_reconstruct <- function(x, to) {\n if (birthdays_can_reconstruct(x)) {\n df_reconstruct(x, to)\n } else {\n x <- as.data.frame(x)\n message(\"Removing crucial column in `` returning ``\")\n x\n }\n}\n```\n:::\n\n\nThe if-else evaluation is controlled by `birthdays_can_reconstruct()`. This function determines whether after subsetting the object is a valid `` class. 
It checks whether the validator fails, in which case it returns `FALSE`, otherwise the function will return `TRUE`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbirthdays_can_reconstruct <- function(x) {\n # check whether input is valid\n valid <- tryCatch(\n { validate_birthdays(x) },\n error = function(cnd) FALSE\n )\n\n # return boolean\n !isFALSE(valid)\n}\n```\n:::\n\n\nThe next function required is `df_reconstruct()`. This is called when the object is judged to be a valid `` object and simply copies the attributes over from the `` class to the object being subset.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf_reconstruct <- function(x, to) {\n attrs <- attributes(to)\n attrs$names <- names(x)\n attrs$row.names <- .row_names_info(x, type = 0L)\n attributes(x) <- attrs\n x\n}\n```\n:::\n\n\nThe three functions defined for reconstruction can be added to a package with the subsetting function in order to subset `` objects and returning either `` objects if still valid, or data frames when invalidated. This design has the benefit that when conducting data exploration a user is not faced with an error, but can continue with a data frame, while being informed by the message printed to console in `birthdays_reconstruct()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n`[.birthdays` <- function(x, ...) {\n out <- NextMethod()\n birthdays_reconstruct(out, x)\n}\n```\n:::\n\n\n### Compatibility with {dplyr}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(dplyr)\n```\n:::\n\n\nIn order to be able to operate on our `` class using functions from the\npackage {dplyr}, as would be common for data frames, we need to make our function compatible. This is where the function `dplyr_reconstruct.birthdays()` comes in. `dplyr_reconstruct()` is a generic function exported by {dplyr}. It is called in {dplyr} verbs to make sure that the objects are restored to the input class when not invalidated.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndplyr_reconstruct.birthdays <- function(data, template) { # nolint\n birthdays_reconstruct(data, template)\n}\n```\n:::\n\n\nInformation about the generic can be found through the {dplyr} help documentation.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?dplyr::dplyr_extending\n?dplyr::dplyr_reconstruct\n```\n:::\n\n\nAs explained in the help documentation, {dplyr} also uses two base R functions to perform data manipulation.\n`names<-` (i.e the names setter function) and `[` the one-dimensional subsetting function. We therefore define these methods for our custom class in order for `dplyr_reconstruct()` to work as intended.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n`[.birthdays` <- function(x, ...) {\n out <- NextMethod()\n birthdays_reconstruct(out, x)\n}\n\n`names<-.birthdays` <- function(x, value) {\n out <- NextMethod()\n birthdays_reconstruct(out, x)\n}\n```\n:::\n\n\nThis wraps up the need for adding function to perform data manipulation using the reconstruction design outlined above.\n\nHowever, there is some final housekeeping to do. In cases when {dplyr} is not a package dependency (either imported or suggested), then the S3 generic `dplyr_reconstruct()` is required to be loaded. In R versions before 3.6.0 -- this also works for R versions later than 3.6.0 -- the generic function needs to be registered. This is done by writing an `.onLoad()` function, typically in a file called `zzz.R`. 
This is included in the {ExtendDataFrames} package for illustrative purposes.\n\n\n```{.r filename=\"zzz.R\"}\n.onLoad <- function(libname, pkgname) {\n s3_register(\"dplyr::dplyr_reconstruct\", \"birthdays\")\n invisible()\n}\n```\n\nThe `s3_register()` function used in `.onLoad()` also needs to be added to the package and this function is kindly supplied by both {vctrs} and {rlang} unlicensed and thus can be copied into another package. See the [R packages book](https://r-pkgs.org/dependencies-mindset-background.html#sec-dependencies-attach-vs-load) for information about `.onLoad()` and attaching and loading in general.\n\nSince R version 3.6.0 this [S3 generic registration](https://blog.r-project.org/2019/08/19/s3-method-lookup/index.html) happens automatically with `S3Method()` in the package namespace using the {roxygen2} documentation `#' @exportS3Method dplyr::dplyr_reconstruct`.\n\nThere is one last option which prevents the hard dependency on a relatively recent R version. Since {roxygen2} version 6.1.0, there is the `@rawNamespace` tag which allows insertion of text into the NAMESPACE file. Using this tag the following code will check the local R version and register the S3 method if equal to or above 3.6.0.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n#' @rawNamespace if (getRversion() >= \"3.6.0\") {\n#' S3method(pkg::fun, class)\n#' }\n```\n:::\n\n\nEach of the three options for registering S3 methods has different benefits and downsides, so the choice depends on the specific use-case. Over time it may be best to use the most up-to-date methods as packages are usually only maintained for a handful of recent R releases[^1].\n\nThe topics discussed in this post have been implemented in the [{epiparameter} R package](https://github.com/epiverse-trace/epiparameter) within [Epiverse-TRACE](https://github.com/epiverse-trace).\n\nCompatibility with {vctrs} is also possible using the same mechanism (functions) described in this post, and if interested see for details.\n\nFor other use-cases and discussions of the designs and implementations discussed in this post see:\n \n- [{dials} R package](https://github.com/tidymodels/dials)\n- [{rsample} R package](https://github.com/tidymodels/rsample)\n- [{googledrive} R package](https://github.com/tidyverse/googledrive)\n- [Pull request on {tibble} R package](https://github.com/tidyverse/tibble/issues/890)\n\nThis blog post is a compendium of information from sources that are linked and cited throughout. Please refer to those sites for more information and as the primary source for citation in further work.\n\n[^1]: This is the working practise of tidyverse packages: [https://www.tidyverse.org/blog/2019/04/r-version-support/](https://www.tidyverse.org/blog/2019/04/r-version-support/)\n", + "markdown": "---\ntitle: \"Extending Data Frames\"\nsubtitle: \"Creating custom classes and {dplyr} compatibility\"\nauthor:\n - name: \"Joshua W. Lambert\"\n orcid: \"0000-0001-5218-3046\"\ndate: \"2023-04-12\"\ncategories: [data frame, R, R package, interoperability, S3, tidyverse, object orientation]\nformat:\n html:\n toc: true\n---\n\n\n## Extending Data Frames in R\n\nR is a commonly used language for data science and statistical computing. Foundational to this is having data structures that allow manipulation of data with minimal effort and cognitive load. One of the most commonly required data structures is tabular data. This can be represented in R in a few ways, for example a matrix or a data frame. 
The data frame (class `data.frame`) is a flexible tabular data structure, as it can hold different data types (e.g. numbers, character strings, etc.) across different columns. This is in contrast to matrices -- which are arrays with dimensions -- and thus can only hold a single data type.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# data frame can hold heterogeneous data types across different columns\ndata.frame(a = c(1, 2, 3), b = c(4, 5, 6), c = c(\"a\", \"b\", \"c\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n a b c\n1 1 4 a\n2 2 5 b\n3 3 6 c\n```\n\n\n:::\n\n```{.r .cell-code}\n# each column must be of the same type\ndf <- data.frame(a = c(1, 2, 3), b = c(\"4\", 5, 6))\n# be careful of the silent type conversion\ndf$a\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 1 2 3\n```\n\n\n:::\n\n```{.r .cell-code}\ndf$b\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"4\" \"5\" \"6\"\n```\n\n\n:::\n\n```{.r .cell-code}\nmat <- matrix(1:9, nrow = 3, ncol = 3)\nmat\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [,1] [,2] [,3]\n[1,] 1 4 7\n[2,] 2 5 8\n[3,] 3 6 9\n```\n\n\n:::\n\n```{.r .cell-code}\nmat[1, 1] <- \"1\"\n# be careful of the silent type conversion\nmat\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [,1] [,2] [,3]\n[1,] \"1\" \"4\" \"7\" \n[2,] \"2\" \"5\" \"8\" \n[3,] \"3\" \"6\" \"9\" \n```\n\n\n:::\n:::\n\n\nData frames can even be nested: cells can be data frames or lists.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf <- data.frame(a = \"w\", b = \"x\")\ndf[1, 1][[1]] <- list(c = c(\"y\", \"z\"))\ndf\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n a b\n1 y, z x\n```\n\n\n:::\n\n```{.r .cell-code}\ndf <- data.frame(a = \"w\", b = \"x\")\ndf[1, 1][[1]] <- list(data.frame(c = \"y\", d = \"z\"))\ndf\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n a b\n1 y, z x\n```\n\n\n:::\n:::\n\n\nIt is therefore clear why data frames are so prevalent. However, they are not without limitations. They have a relatively basic printing method which can flood the R console when the number of columns or rows is large. They have useful methods (e.g., `summary()` and `str()`), but these might not be appropriate for certain types of tabular data. In these cases it is useful to utilise R's inheritance mechanisms (specifically S3 inheritance) to write extensions for R's `data.frame` class. In this case the data frame is the superclass and the new subclass extends it and inherits its methods (see [the *Advanced R* book](https://adv-r.hadley.nz/s3.html#s3-inheritance) for more details on S3 inheritance).\n\nOne of the most common extensions of the data frame is the `tibble` from the {tibble} R package. Outlined in [{tibble}'s vignette](https://tibble.tidyverse.org/articles/tibble.html), `tibble`s offer improvements in printing, subsetting and recycling rules. Another commonly used data frame extension is the `data.table` class from the [{data.table} R package](https://github.com/Rdatatable/data.table). In addition to the improved printing, this class is designed to improve the performance (i.e. speed and efficiency of operations and storage) of working with tabular data in R and provide a terse syntax for manipulation.\n\nIn the process of developing R software (most likely an R package), a new tabular data class that builds atop data frames can become beneficial. This blog post has two main sections:\n\n1. a brief overview of the steps required to set up a class that extends data frames\n2. 
a guide to the technical aspects of class invariants (required data members of a class) and design and implementation decisions, and tidyverse compatibility\n\n### Writing a custom data class\n\nIt is useful to write a class constructor function that can be called to create an object of your new class. The functions defined below are a redacted version (for readability) of functions available in the [{ExtendDataFrames} R package](https://github.com/joshwlambert/ExtendDataFrames), which contains example functions and files discussed in this post. When assigning the class name, ensure that it is a vector containing `\"data.frame\"` as the last element\nto correctly inherit properties and methods from the `data.frame` class.\n\n```r\nbirthdays <- function(x) {\n # the vector of classes is required for it to inherit from `data.frame`\n structure(x, class = c(\"birthdays\", \"data.frame\"))\n}\n```\n\nThat's all that's needed to create a subclass of a data frame. However, although we've created the class, we haven't given it any functionality, and thus it will be identical to a data frame due to inheritance.\n\nWe can now write as many methods as we want. Here we will show two methods: one that does not require writing a generic (`print.birthdays`) and one that does (`birthdays_per_month`). The `print()` generic function is provided by R, which is why we do not need to add one ourselves. See [*Advanced R*](https://adv-r.hadley.nz/s3.html#s3-methods) and this [Epiverse blog post](https://epiverse-trace.github.io/posts/s3-generic/) to find out more about S3 generics.\n\n```r\nprint.birthdays <- function(x, ...) {\n cat(\n sprintf(\n \"A `birthdays` object with %s rows and %s cols\",\n dim(x)[1], dim(x)[2]\n )\n )\n invisible(x)\n}\n\nbirthdays_per_month <- function(x, ...) {\n UseMethod(\"birthdays_per_month\")\n}\n\nbirthdays_per_month.birthdays <- function(x, ...) {\n out <- table(lubridate::month(x$birthday))\n months <- c(\n \"Jan\", \"Feb\", \"Mar\", \"Apr\", \"May\", \"Jun\",\n \"Jul\", \"Aug\", \"Sep\", \"Oct\", \"Nov\", \"Dec\"\n )\n names(out) <- months[as.numeric(names(out))]\n return(out)\n}\n```\n\n::: {.callout-tip}\nUseful resources for the \"Writing a custom data class\" section:\n[extending `tibbles` and their functionality](https://tibble.tidyverse.org/articles/extending.html)\n:::\n\n### Design decisions around class invariants\n\nWe will now move on to the second section of the post, in which we discuss the design choices when creating and using S3 classes in R. ***Class invariants*** are members of your class that define it. In other words, without these elements your class does not fulfil its basic definition. It is therefore sensible to make sure that your class contains these elements at all times (or at least after operations have been applied to your class). In cases when the class object contains all the invariants, normal service can continue. However, in the case that an invariant is missing or modified to a non-conforming type (e.g. a date converted to a numeric), a decision has to be made. Either the code can error, hopefully giving the user an informative message as to why their modification broke the object; alternatively, the subclass can be revoked and the superclass can be returned. In almost all cases the superclass (i.e. the base class being inherited from) is more general and won't have the same class invariant restrictions.
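\n\nTo make this concrete, here is a quick usage sketch of the `birthdays` class constructed above. The data are made up, and it assumes {lubridate} is installed and the constructor and methods above have been defined; the outputs shown as comments are what we would expect from those definitions:\n\n```r\nbdays <- birthdays(\n data.frame(\n name = c(\"Ada\", \"Grace\"),\n birthday = as.Date(c(\"1815-12-10\", \"1906-12-09\"))\n )\n)\n\n# dispatches to our print.birthdays() method\nprint(bdays)\n#> A `birthdays` object with 2 rows and 2 cols\n\n# dispatches to our birthdays_per_month() method\nbirthdays_per_month(bdays)\n#> Dec \n#>   2\n```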
\n\nFor our example class, `birthdays`, the invariants are a column called `name` which must contain characters, and a column called `birthday` which must contain dates. The order of the rows and columns is not considered an invariant property, and having extra columns with other names and data types is also allowed. The number of rows is also not an invariant as we can have as many birthdays as we like in the data object.\n\nHere we present both options, along with the considerations and technical details of each. We'll demonstrate both of these cases with the subset function in R (subsetting uses a single square bracket for tabular data, `[`). First, the fail-on-subsetting. Before we write the subsetting function, it is useful to have a function that checks that an object of our class is valid, a so-called validator function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvalidate_birthdays <- function(x) {\n stopifnot(\n \"input must contain 'name' and 'birthday' columns\" =\n all(c(\"name\", \"birthday\") %in% colnames(x)),\n \"names must be a character\" =\n is.character(x$name),\n \"birthday must be a date\" =\n lubridate::is.Date(x$birthday)\n )\n invisible(x)\n}\n```\n:::\n\n\nThis will return an error if the class is not valid (defined in terms of the class' invariants).\n\nNow we can show how to error if one of the invariants is removed during subsetting. See `?NextMethod()` for information on method dispatch.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n`[.birthdays` <- function(x, ...) {\n validate_birthdays(NextMethod())\n}\n\nbirthdays[, -1]\n# Error in validate_birthdays(NextMethod()) :\n# input must contain 'name' and 'birthday' columns\n```\n:::\n\n\nThe second design option is the reconstruct-on-subsetting. This checks whether the class is valid and, if not, downgrades the class to the superclass, in our case a data frame. This is done by not only validating the object during subsetting, but also checking whether it is a valid class object, and then either ensuring all of the attributes of the subclass -- in our case `birthdays` -- are maintained, or stripping the attributes and keeping only those of the base superclass -- in our case `data.frame`.\n\n::: {.callout-note}\nImportant note: this section of the post relies heavily on the approach described in `?dplyr::dplyr_extending`.\n:::\n \nThe four functions that are required to be added to ensure our class is correctly handled when it is invalidated are:\n \n- `birthdays_reconstruct()`\n- `birthdays_can_reconstruct()`\n- `df_reconstruct()`\n- `dplyr_reconstruct.birthdays()`\n\nWe'll tackle the first three first, and then move on to the last one as this requires some extra steps.\n\n`birthdays_reconstruct()` is a function that contains an if-else statement to determine whether the returned object is a `birthdays` or `data.frame` object.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbirthdays_reconstruct <- function(x, to) {\n if (birthdays_can_reconstruct(x)) {\n df_reconstruct(x, to)\n } else {\n x <- as.data.frame(x)\n message(\"Removing crucial column in `birthdays`, returning a `data.frame`\")\n x\n }\n}\n```\n:::\n\n\nThe if-else evaluation is controlled by `birthdays_can_reconstruct()`. This function determines whether, after subsetting, the object is still a valid `birthdays` object. 
It checks whether the validator fails, in which case it returns `FALSE`; otherwise, it returns `TRUE`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbirthdays_can_reconstruct <- function(x) {\n # check whether input is valid\n valid <- tryCatch(\n { validate_birthdays(x) },\n error = function(cnd) FALSE\n )\n\n # return boolean\n !isFALSE(valid)\n}\n```\n:::\n\n\nThe next function required is `df_reconstruct()`. This is called when the object is judged to be a valid `birthdays` object and simply copies the attributes over from the `birthdays` class to the object being subset.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf_reconstruct <- function(x, to) {\n attrs <- attributes(to)\n attrs$names <- names(x)\n attrs$row.names <- .row_names_info(x, type = 0L)\n attributes(x) <- attrs\n x\n}\n```\n:::\n\n\nThe three functions defined for reconstruction can be added to a package with the subsetting function in order to subset `birthdays` objects and return either `birthdays` objects, if still valid, or data frames, when invalidated. This design has the benefit that, when conducting data exploration, a user is not faced with an error, but can continue with a data frame, while being informed by the message printed to the console in `birthdays_reconstruct()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n`[.birthdays` <- function(x, ...) {\n out <- NextMethod()\n birthdays_reconstruct(out, x)\n}\n```\n:::\n\n\n### Compatibility with {dplyr}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(dplyr)\n```\n:::\n\n\nIn order to be able to operate on our `birthdays` class using functions from the\npackage {dplyr}, as would be common for data frames, we need to make our class compatible. This is where the function `dplyr_reconstruct.birthdays()` comes in. `dplyr_reconstruct()` is a generic function exported by {dplyr}. It is called in {dplyr} verbs to make sure that the objects are restored to the input class when not invalidated.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndplyr_reconstruct.birthdays <- function(data, template) { # nolint\n birthdays_reconstruct(data, template)\n}\n```\n:::\n\n\nInformation about the generic can be found through the {dplyr} help documentation.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?dplyr::dplyr_extending\n?dplyr::dplyr_reconstruct\n```\n:::\n\n\nAs explained in the help documentation, {dplyr} also uses two base R functions to perform data manipulation:\n`names<-` (i.e. the names setter function) and `[` (the one-dimensional subsetting function). We therefore define these methods for our custom class in order for `dplyr_reconstruct()` to work as intended.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n`[.birthdays` <- function(x, ...) {\n out <- NextMethod()\n birthdays_reconstruct(out, x)\n}\n\n`names<-.birthdays` <- function(x, value) {\n out <- NextMethod()\n birthdays_reconstruct(out, x)\n}\n```\n:::\n\n\nThis wraps up the functions needed to perform data manipulation using the reconstruction design outlined above.
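\n\nAs a quick sanity check, here is a sketch of how {dplyr} verbs should now behave with our class. The data are made up, and the outputs shown as comments are what we would expect given the definitions above:\n\n```r\nbdays <- birthdays(\n data.frame(\n name = c(\"Ada\", \"Grace\"),\n birthday = as.Date(c(\"1815-12-10\", \"1906-12-09\"))\n )\n)\n\n# the invariants survive, so the `birthdays` class is preserved\nbdays |> filter(name == \"Ada\") |> class()\n#> [1] \"birthdays\" \"data.frame\"\n\n# an invariant column is dropped, so the object is downgraded\nbdays |> select(-birthday) |> class()\n#> Removing crucial column in `birthdays`, returning a `data.frame`\n#> [1] \"data.frame\"\n```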
\n\nHowever, there is some final housekeeping to do. In cases when {dplyr} is not a package dependency (either imported or suggested), the S3 generic `dplyr_reconstruct()` still needs to be available when your package is loaded. In R versions before 3.6.0 -- this also works for R versions later than 3.6.0 -- the generic function needs to be registered. This is done by writing an `.onLoad()` function, typically in a file called `zzz.R`. This is included in the {ExtendDataFrames} package for illustrative purposes.\n\n\n```{.r filename=\"zzz.R\"}\n.onLoad <- function(libname, pkgname) {\n s3_register(\"dplyr::dplyr_reconstruct\", \"birthdays\")\n invisible()\n}\n```\n\nThe `s3_register()` function used in `.onLoad()` also needs to be added to the package; it is kindly supplied, unlicensed, by both {vctrs} and {rlang}, and can thus be copied into another package. See the [R packages book](https://r-pkgs.org/dependencies-mindset-background.html#sec-dependencies-attach-vs-load) for information about `.onLoad()` and attaching and loading in general.\n\nSince R version 3.6.0, this [S3 generic registration](https://blog.r-project.org/2019/08/19/s3-method-lookup/index.html) happens automatically with `S3method()` in the package namespace using the {roxygen2} documentation `#' @exportS3Method dplyr::dplyr_reconstruct`.\n\nThere is one last option which prevents the hard dependency on a relatively recent R version. Since {roxygen2} version 6.1.0, there is the `@rawNamespace` tag which allows insertion of text into the NAMESPACE file. Using this tag, the following code will check the local R version and register the S3 method if it is equal to or above 3.6.0.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n#' @rawNamespace if (getRversion() >= \"3.6.0\") {\n#' S3method(pkg::fun, class)\n#' }\n```\n:::\n\n\nEach of the three options for registering S3 methods has different benefits and downsides, so the choice depends on the specific use-case. Over time it may be best to use the most up-to-date methods as packages are usually only maintained for a handful of recent R releases[^1].\n\nThe topics discussed in this post have been implemented in the [{epiparameter} R package](https://github.com/epiverse-trace/epiparameter) within [Epiverse-TRACE](https://github.com/epiverse-trace).\n\nCompatibility with {vctrs} is also possible using the same mechanism (functions) described in this post; if you are interested, see the {vctrs} documentation for details.\n\nFor other use-cases and discussions of the designs and implementations discussed in this post, see:\n \n- [{dials} R package](https://github.com/tidymodels/dials)\n- [{rsample} R package](https://github.com/tidymodels/rsample)\n- [{googledrive} R package](https://github.com/tidyverse/googledrive)\n- [Pull request on {tibble} R package](https://github.com/tidyverse/tibble/issues/890)\n\nThis blog post is a compendium of information from sources that are linked and cited throughout. 
Please refer to those sites for more information and as the primary source for citation in further work.\n\n[^1]: This is the working practise of tidyverse packages: [https://www.tidyverse.org/blog/2019/04/r-version-support/](https://www.tidyverse.org/blog/2019/04/r-version-support/)\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/posts/for-vs-apply/index/execute-results/html.json b/_freeze/posts/for-vs-apply/index/execute-results/html.json index 0300c36d..98b5d0ec 100644 --- a/_freeze/posts/for-vs-apply/index/execute-results/html.json +++ b/_freeze/posts/for-vs-apply/index/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "a4cc084fadcc225e9f6c4441a04cc94b", + "hash": "206058d5f18d118d875c11c647f25c11", "result": { "engine": "knitr", - "markdown": "---\ntitle: \"Lesser-known reasons to prefer `apply()` over for loops\"\nauthor:\n - name: \"Hugo Gruson\"\n orcid: \"0000-0002-4094-1476\"\ndate: \"2023-11-02\"\ncategories: [R, functional programming, iteration, readability, good practices]\nformat:\n html: \n toc: true\neditor: \n markdown: \n wrap: 80\n---\n\n\nThe debate regarding the use of `for` loops versus the `apply()` function family\n(`apply()`, `lapply()`, `vapply()`, etc., along with their purrr counterparts:\n`map()`, `map2()`, `map_lgl()`, `map_chr()`, etc.), has been a longstanding one\nin the R community.\n\nWhile you may occasionally hear that `for` loops are slower, this notion has\nalready been debunked [in other\nposts](https://privefl.github.io/blog/why-loops-are-slow-in-r/). When utilized\ncorrectly, a `for` loop can achieve performance on par with `apply()` functions.\n\nHowever, there are still lesser-known reasons to prefer `apply()` functions over\n`for` loops, which we will explore in this post.\n\n## Preamble: `for` loops can be used in more cases than `apply()`\n\nIt is important to understand that `for` loops and `apply()` functions are not\nalways interchangeable. Indeed, `for` loops can be used in cases where `apply`\nfunctions can't: when the next step depends on the previous one. This concept is\nknown as [*recursion*](https://en.wikipedia.org/wiki/Recursion).\n\nConversely, when each step is independent of the previous one, but you want to\nperform the same operation on each element of a vector, it is referred to as\n[*iteration*](https://en.wikipedia.org/wiki/Iteration).\n\n`for` loops are capable of both *recursion* and *iteration*, whereas `apply()`\ncan only do *iteration*.\n\n| Operator | Iteration | Recursion |\n|-----------|-----------|-----------|\n| `for` | ✔️ | ✔️ |\n| `apply()` | ✔️ | ❌ |\n\nWith this distinction in mind, we can now focus on why you should favour\n`apply()` for iteration over `for` loops.\n\n## Reason 1: clarity of intent\n\nAs mentioned earlier, `for` loops can be used for both iteration and recursion.\nBy consistently employing `apply()` for iteration[^caveat] and reserving `for` loops for\nrecursion, we enable readers to immediately discern the underlying concept in\nthe code. This leads to code that is easier to read and understand.\n\n[^caveat]: There are a handful of rare corner cases where `apply()` is not the best method for iteration. These are cases that make use of `match.call()` or `sys.call()`. 
More details are available in `lapply()` documentation and in [this GitHub comment by Tim Taylor during the review of this post](https://github.com/epiverse-trace/epiverse-trace.github.io/pull/125#issuecomment-1775929451).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nl <- list(c(1, 2, 6), c(3, 5), 6, c(0, 9, 3, 4, 8))\n\n# `for` solution -------------------\nres <- numeric(length(l))\nfor (i in seq_along(l)) {\n res[[i]] <- mean(l[[i]])\n}\nres\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 3.0 4.0 6.0 4.8\n```\n\n\n:::\n\n```{.r .cell-code}\n# `vapply()` solution ---------------\nres <- vapply(l, mean, numeric(1))\nres\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 3.0 4.0 6.0 4.8\n```\n\n\n:::\n:::\n\n\nThe simplicity of `apply()` is even more apparent in the case of multiple iterations. For example, if we want to find the median of each matrix row for a list of matrices:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nl <- replicate(5, { matrix(runif(9), nrow = 3) }, simplify = FALSE)\n\n# `for` solution -------------------\nres <- list()\nfor (i in seq_along(l)) {\n meds <- numeric(nrow(l[[i]]))\n for (j in seq_len(nrow(l[[i]]))) {\n meds[[j]] <- median(l[[i]][j, ])\n }\n res[[i]] <- meds\n}\nres\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[[1]]\n[1] 0.4278542 0.3265237 0.5433079\n\n[[2]]\n[1] 0.1559343 0.4797861 0.7725579\n\n[[3]]\n[1] 0.5113269 0.5074208 0.6371496\n\n[[4]]\n[1] 0.1825518 0.6402405 0.2021850\n\n[[5]]\n[1] 0.9030861 0.7609473 0.3594749\n```\n\n\n:::\n\n```{.r .cell-code}\n# `vapply()` solution ---------------\nlapply(l, function(e) {\n apply(e, 1, median)\n})\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[[1]]\n[1] 0.4278542 0.3265237 0.5433079\n\n[[2]]\n[1] 0.1559343 0.4797861 0.7725579\n\n[[3]]\n[1] 0.5113269 0.5074208 0.6371496\n\n[[4]]\n[1] 0.1825518 0.6402405 0.2021850\n\n[[5]]\n[1] 0.9030861 0.7609473 0.3594749\n```\n\n\n:::\n:::\n\n\nMoreover, this clarity of intent is not limited to human readers alone;\nautomated static analysis tools can also more effectively identify suboptimal\npatterns. This can be demonstrated using R most popular static analysis tool:\nthe lintr package, suggesting vectorized alternatives:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlintr::lint(text = \"vapply(l, length, numeric(1))\")\n\nlintr::lint(text = \"apply(m, 1, sum)\")\n```\n:::\n\n\n## Reason 2: code compactness and conciseness\n\nAs illustrated in the preceding example, `apply()` often leads to more compact\ncode, as much of the boilerplate code is handled behind the scenes: you don't\nhave to initialize your variables, manage indexing, etc.\n\nThis, in turn, impacts code readability since:\n\n- The boilerplate code does not offer meaningful insights into the algorithm or\n implementation and can be seen as visual noise.\n- While compactness should never take precedence over readability, a more\n compact solution allows for more code to be displayed on the screen without\n scrolling. This ultimately makes it easier to understand what the code is\n doing. With all things otherwise equal, the more compact solution should thus\n be preferred.\n\n## Reason 3: variable leak\n\nAs discussed in the previous sections, you have to manually manage the iteration\nindex in a `for` loop, whereas they are abstracted in `apply()`. 
This can\nsometimes lead to perplexing errors:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nk <- 10\n\nfor (k in c(\"Paul\", \"Pierre\", \"Jacques\")) {\n message(\"Hello \", k)\n}\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nHello Paul\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nHello Pierre\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nHello Jacques\n```\n\n\n:::\n\n```{.r .cell-code}\nrep(letters, times = k)\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning: NAs introduced by coercion\n```\n\n\n:::\n\n::: {.cell-output .cell-output-error}\n\n```\nError in rep(letters, times = k): invalid 'times' argument\n```\n\n\n:::\n:::\n\n\nThis is because the loop index variable leaks into the global environment and\ncan overwrite existing variables:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (k in 1:3) {\n # do something\n}\n\nmessage(\"The value of k is now \", k)\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nThe value of k is now 3\n```\n\n\n:::\n:::\n\n\n## Reason 4: pipelines\n\nThe final reason is that `apply()` (or more commonly in this situation\n`purrr::map()`) can be used in pipelines due to their functional nature:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nl <- list(c(1, 2, 6), c(3, 5), 6, c(0, 9, 3, 4, 8))\n\n# Without pipe\nvapply(l, mean, numeric(1))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 3.0 4.0 6.0 4.8\n```\n\n\n:::\n\n```{.r .cell-code}\n# With pipe\nl |> vapply(mean, numeric(1))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 3.0 4.0 6.0 4.8\n```\n\n\n:::\n\n```{.r .cell-code}\nl |> purrr::map_dbl(mean)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 3.0 4.0 6.0 4.8\n```\n\n\n:::\n:::\n\n\n## Conclusion\n\nThis post hopefully convinced you why it's better to use `apply()` functions\nrather than `for` loops where possible (i.e., for iteration). Contrary to common\nmisconception, the real reason is not performance, but code robustness and\nreadability.\n\n*Thanks to Jaime Pavlich-Mariscal, James Azam, Tim Taylor, and Pratik Gupte for\ntheir thoughtful comments and suggestions on earlier drafts of this post.*\n\n::: {.callout-tip title=\"Beyond R\"}\n\nThis post focused on R, but the same principles generally apply to other\nfunctional languages. 
In Python for example, you would use [list\ncomprehensions](https://www.w3schools.com/python/python_lists_comprehension.asp)\nor the [`map()` function](https://www.w3schools.com/python/ref_func_map.asp).\n\n:::\n\n::: {.callout-tip title=\"Further reading\"}\n\nIf you liked the code patterns recommended in this post and want to use functional programming in more situations, including recursion, I recommend you check out the [\"Functionals\" chapter of the *Advanced R* book by Hadley Wickham](https://adv-r.hadley.nz/functionals.html#functionals)\n\n:::\n", + "markdown": "---\ntitle: \"Lesser-known reasons to prefer `apply()` over for loops\"\nauthor:\n - name: \"Hugo Gruson\"\n orcid: \"0000-0002-4094-1476\"\ndate: \"2023-11-02\"\ncategories: [R, functional programming, iteration, readability, good practices, tidyverse]\nformat:\n html: \n toc: true\neditor: \n markdown: \n wrap: 80\n---\n\n\nThe debate regarding the use of `for` loops versus the `apply()` function family\n(`apply()`, `lapply()`, `vapply()`, etc., along with their purrr counterparts:\n`map()`, `map2()`, `map_lgl()`, `map_chr()`, etc.), has been a longstanding one\nin the R community.\n\nWhile you may occasionally hear that `for` loops are slower, this notion has\nalready been debunked [in other\nposts](https://privefl.github.io/blog/why-loops-are-slow-in-r/). When utilized\ncorrectly, a `for` loop can achieve performance on par with `apply()` functions.\n\nHowever, there are still lesser-known reasons to prefer `apply()` functions over\n`for` loops, which we will explore in this post.\n\n## Preamble: `for` loops can be used in more cases than `apply()`\n\nIt is important to understand that `for` loops and `apply()` functions are not\nalways interchangeable. Indeed, `for` loops can be used in cases where `apply`\nfunctions can't: when the next step depends on the previous one. This concept is\nknown as [*recursion*](https://en.wikipedia.org/wiki/Recursion).\n\nConversely, when each step is independent of the previous one, but you want to\nperform the same operation on each element of a vector, it is referred to as\n[*iteration*](https://en.wikipedia.org/wiki/Iteration).\n\n`for` loops are capable of both *recursion* and *iteration*, whereas `apply()`\ncan only do *iteration*.\n\n| Operator | Iteration | Recursion |\n|-----------|-----------|-----------|\n| `for` | ✔️ | ✔️ |\n| `apply()` | ✔️ | ❌ |\n\nWith this distinction in mind, we can now focus on why you should favour\n`apply()` for iteration over `for` loops.\n\n## Reason 1: clarity of intent\n\nAs mentioned earlier, `for` loops can be used for both iteration and recursion.\nBy consistently employing `apply()` for iteration[^caveat] and reserving `for` loops for\nrecursion, we enable readers to immediately discern the underlying concept in\nthe code. This leads to code that is easier to read and understand.\n\n[^caveat]: There are a handful of rare corner cases where `apply()` is not the best method for iteration. These are cases that make use of `match.call()` or `sys.call()`. 
More details are available in the `lapply()` documentation and in [this GitHub comment by Tim Taylor during the review of this post](https://github.com/epiverse-trace/epiverse-trace.github.io/pull/125#issuecomment-1775929451).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nl <- list(c(1, 2, 6), c(3, 5), 6, c(0, 9, 3, 4, 8))\n\n# `for` solution -------------------\nres <- numeric(length(l))\nfor (i in seq_along(l)) {\n res[[i]] <- mean(l[[i]])\n}\nres\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 3.0 4.0 6.0 4.8\n```\n\n\n:::\n\n```{.r .cell-code}\n# `vapply()` solution ---------------\nres <- vapply(l, mean, numeric(1))\nres\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 3.0 4.0 6.0 4.8\n```\n\n\n:::\n:::\n\n\nThe simplicity of `apply()` is even more apparent in the case of multiple iterations. For example, if we want to find the median of each matrix row for a list of matrices:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nl <- replicate(5, { matrix(runif(9), nrow = 3) }, simplify = FALSE)\n\n# `for` solution -------------------\nres <- list()\nfor (i in seq_along(l)) {\n meds <- numeric(nrow(l[[i]]))\n for (j in seq_len(nrow(l[[i]]))) {\n meds[[j]] <- median(l[[i]][j, ])\n }\n res[[i]] <- meds\n}\nres\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[[1]]\n[1] 0.7807203 0.2448108 0.7407391\n\n[[2]]\n[1] 0.3249898 0.1742138 0.3917644\n\n[[3]]\n[1] 0.2876697 0.8835354 0.5606563\n\n[[4]]\n[1] 0.08360038 0.37515714 0.43738127\n\n[[5]]\n[1] 0.4834298 0.6106350 0.6328951\n```\n\n\n:::\n\n```{.r .cell-code}\n# `lapply()` solution ---------------\nlapply(l, function(e) {\n apply(e, 1, median)\n})\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[[1]]\n[1] 0.7807203 0.2448108 0.7407391\n\n[[2]]\n[1] 0.3249898 0.1742138 0.3917644\n\n[[3]]\n[1] 0.2876697 0.8835354 0.5606563\n\n[[4]]\n[1] 0.08360038 0.37515714 0.43738127\n\n[[5]]\n[1] 0.4834298 0.6106350 0.6328951\n```\n\n\n:::\n:::\n\n\nMoreover, this clarity of intent is not limited to human readers alone;\nautomated static analysis tools can also more effectively identify suboptimal\npatterns. This can be demonstrated using R's most popular static analysis tool,\nthe {lintr} package, which suggests vectorized alternatives:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlintr::lint(text = \"vapply(l, length, numeric(1))\")\n\nlintr::lint(text = \"apply(m, 1, sum)\")\n```\n:::\n\n\n## Reason 2: code compactness and conciseness\n\nAs illustrated in the preceding example, `apply()` often leads to more compact\ncode, as much of the boilerplate code is handled behind the scenes: you don't\nhave to initialize your variables, manage indexing, etc.\n\nThis, in turn, impacts code readability since:\n\n- The boilerplate code does not offer meaningful insights into the algorithm or\n implementation and can be seen as visual noise.\n- While compactness should never take precedence over readability, a more\n compact solution allows for more code to be displayed on the screen without\n scrolling. This ultimately makes it easier to understand what the code is\n doing. With all things otherwise equal, the more compact solution should thus\n be preferred.\n\n## Reason 3: variable leak\n\nAs discussed in the previous sections, you have to manually manage the iteration\nindex in a `for` loop, whereas it is abstracted away in `apply()`. 
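\n\nWith `apply()` functions, the iteration variable only exists inside the function you pass, so it cannot clash with anything outside it. A minimal sketch, using base R only:\n\n```r\nk <- 10\n\nres <- vapply(1:3, function(k) k^2, numeric(1))\nres\n#> [1] 1 4 9\n\n# the global `k` is untouched\nk\n#> [1] 10\n```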
This manual bookkeeping can\nsometimes lead to perplexing errors:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nk <- 10\n\nfor (k in c(\"Paul\", \"Pierre\", \"Jacques\")) {\n message(\"Hello \", k)\n}\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nHello Paul\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nHello Pierre\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nHello Jacques\n```\n\n\n:::\n\n```{.r .cell-code}\nrep(letters, times = k)\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning: NAs introduced by coercion\n```\n\n\n:::\n\n::: {.cell-output .cell-output-error}\n\n```\nError in rep(letters, times = k): invalid 'times' argument\n```\n\n\n:::\n:::\n\n\nThis is because the loop index variable leaks into the global environment and\ncan overwrite existing variables:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (k in 1:3) {\n # do something\n}\n\nmessage(\"The value of k is now \", k)\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nThe value of k is now 3\n```\n\n\n:::\n:::\n\n\n## Reason 4: pipelines\n\nThe final reason is that `apply()` (or more commonly in this situation\n`purrr::map()`) can be used in pipelines due to their functional nature:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nl <- list(c(1, 2, 6), c(3, 5), 6, c(0, 9, 3, 4, 8))\n\n# Without pipe\nvapply(l, mean, numeric(1))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 3.0 4.0 6.0 4.8\n```\n\n\n:::\n\n```{.r .cell-code}\n# With pipe\nl |> vapply(mean, numeric(1))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 3.0 4.0 6.0 4.8\n```\n\n\n:::\n\n```{.r .cell-code}\nl |> purrr::map_dbl(mean)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 3.0 4.0 6.0 4.8\n```\n\n\n:::\n:::\n\n\n## Conclusion\n\nThis post has hopefully convinced you that it's better to use `apply()` functions\nrather than `for` loops where possible (i.e., for iteration). Contrary to common\nmisconception, the real reason is not performance, but code robustness and\nreadability.\n\n*Thanks to Jaime Pavlich-Mariscal, James Azam, Tim Taylor, and Pratik Gupte for\ntheir thoughtful comments and suggestions on earlier drafts of this post.*\n\n::: {.callout-tip title=\"Beyond R\"}\n\nThis post focused on R, but the same principles generally apply to other\nfunctional languages. 
In Python for example, you would use [list\ncomprehensions](https://www.w3schools.com/python/python_lists_comprehension.asp)\nor the [`map()` function](https://www.w3schools.com/python/ref_func_map.asp).\n\n:::\n\n::: {.callout-tip title=\"Further reading\"}\n\nIf you liked the code patterns recommended in this post and want to use functional programming in more situations, including recursion, I recommend you check out the [\"Functionals\" chapter of the *Advanced R* book by Hadley Wickham](https://adv-r.hadley.nz/functionals.html#functionals)\n\n:::\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/posts/statistical-correctness/index/execute-results/html.json b/_freeze/posts/statistical-correctness/index/execute-results/html.json index 983bd40e..a2a781bd 100644 --- a/_freeze/posts/statistical-correctness/index/execute-results/html.json +++ b/_freeze/posts/statistical-correctness/index/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "b324be8c4b3db6c77c8be027b0f36e19", + "hash": "d3a3e4bb0e4c7c3d6dba05971ead2cb7", "result": { "engine": "knitr", - "markdown": "---\ntitle: \"Ensuring & Showcasing the Statistical Correctness of your R Package\"\nauthor:\n - name: \"Hugo Gruson\"\n orcid: \"0000-0002-4094-1476\"\ndate: \"2023-02-13\"\ncategories: [code quality, R, R package, testing]\nimage: \"testing_error.png\"\nformat:\n html: \n toc: true\n---\n\n\nWe're evolving in an increasingly data-driven world. And since critical decisions are taken based on results produced by data scientists and data analysts, they need to be be able to trust the tools they use.\nIt is now increasingly common to add continuous integration to software packages and libraries, to ensure the code is not crashing, and that future updates don't change your code output (snapshot tests). But one type of test still remains uncommon: tests for statistical correctness. That is, tests that ensure the algorithm implemented in your package actually produce the correct results.\n\n> Does [@rstudio](https://twitter.com/rstudio) have a position in the trustworthiness / validity of any ~statistical methods~ packages in [#Rstats](https://twitter.com/hashtag/Rstats)?\n>\n> Or, is there a list of packages that [@rstudio](https://twitter.com/rstudio) considers 'approved' and thus will recommend to clients?\n>\n> --- *(deleted tweet)* February 3, 2019\n\nIt is likely that most statistical package authors run some tests on their own during development but there doesn't seem to be guidelines on how to test statistical correctness in a solid and standard way [^1].\n\n[^1]: But see the [\"testing statistical software\" post from Alex Hayes](https://www.alexpghayes.com/post/2019-06-07_testing-statistical-software/) where he presents his process to determine if he deems a statistical package trustworthy or not, and [rOpenSci Statistical Software Peer Review book](https://stats-devguide.ropensci.org/).\n\nIn this blog post, we explore various methods to ensure the statistical correctness of your software. We argue that these tests should be part of your continuous integration system, to ensure your tools remains valid throughout its life, and to let users verify how you validate your package. Finally, we show how these principles are implemented in the Epiverse TRACE tools.\n\nThe approaches presented here are non-exclusive and should ideally all be added to your tests. However, they are presented in order of stringency and priority to implement. 
We also take a example of a function computing the centroid of a list of points to demonstrate how you would integrate the recommendations from this post with the [`{testthat}` R package](https://testthat.r-lib.org/), often used from unit testing:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n#' Compute the centroid of a set of points\n#'\n#' @param coords Coordinates of the points as a list of vectors. Each element of the \n#' list is a point.\n#'\n#' @returns A vector of coordinates of the same length of each element of \n#' `coords`\n#' \n#' @examples\n#' centroid(list(c(0, 1, 5, 3), c(8, 6, 4, 3), c(10, 2, 3, 7)))\n#' \ncentroid <- function(coords) {\n\n # ...\n # Skip all the necessary input checking for the purpose of this demo\n # ...\n\n coords_mat <- do.call(rbind, coords)\n \n return(colMeans(coords_mat))\n \n}\n```\n:::\n\n\n## Compare your results to the reference implementation\n\nThe most straightforward and most solid way to ensure your implementation is valid is to compare your results to the results of the reference implementation. The reference implementation can be a package in another language, an example with toy data in the scientific article introducing the method, etc.\n\nFor example, the [`{gemma2}` R package](https://github.com/fboehm/gemma2), which re-implements the methods from [the GEMMA tool written in C++](https://github.com/genetics-statistics/GEMMA), [verifies that values produced by both tools match](https://github.com/fboehm/gemma2/blob/ea3052f8609622f17224fb8ec5fd83bd1bceb33e/tests/testthat/test_calc_sigma.R#L34-L37):\n\n``` r\ntest_that(\"Results of gemma2 equal those of GEMMA v 0.97\", {\n expect_equal(Sigma_ee, diag(c(18.559, 12.3672)), tolerance = 0.0001)\n expect_equal(Sigma_uu, diag(c(82.2973, 41.9238)), tolerance = 0.0001)\n})\n```\n\n::: {.callout-tip title=\"Example with `centroid()`\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(testthat)\n\ntest_that(\"centroid() in 1D produces the same results as mean()\", {\n\n x <- list(1, 5, 3, 10, 5)\n\n expect_identical(centroid(x), mean(unlist(x)))\n \n})\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nTest passed 🎊\n```\n\n\n:::\n:::\n\n\n:::\n\nNote that even if a **reference** implementation doesn't exist, it is still good practice to compare your implementation to competing ones. Discrepancies might reveal a bug in your implementation or theirs but in any case, finding it out is beneficial to the community.\n\nHowever, this approach cannot be used in all cases. Indeed, there may not be a reference implementation in your case. Or it might be difficult to replicate identical computations in the case of algorithm with stochasticity [^2].\n\n[^2]: Setting the random seed is not enough to compare implementations across programming languages because different languages use different kind of Random Number Generators.\n\n## Compare to a theoretical upper or lower bound\n\nAn alternative strategy is to compare your result to theoretical upper or lower bound. 
This offers a weaker guarantee that your implementation and your results are correct but it can still allow you to detect important mistakes.\n\n::: {.callout-tip title=\"Example with `centroid()`\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntest_that(\"centroid() is inside the hypercube containing the data points\", {\n \n x <- list(c(0, 1, 5, 3), c(8, 6, 4, 3), c(10, 2, 3, 7))\n\n expect_true(all(centroid(x) <= Reduce(pmax, x)))\n expect_true(all(centroid(x) >= Reduce(pmin, x)))\n \n})\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nTest passed 🎉\n```\n\n\n:::\n:::\n\n\n:::\n\nYou can see a [real-life example of this kind of test in the `{finalsize}` R package](https://github.com/epiverse-trace/finalsize/blob/a710767b38a9242f15ab4dcf18b02fb5b0bcf24f/tests/testthat/test-newton_solver_vary_r0.R#L1-L13). `{finalsize}` computes the final proportion of infected in a heterogeneous population according to an SIR model. Theory predicts that the number of infections is maximal in a well-mixed population:\n\n``` r\n# Calculates the upper limit of final size given the r0\n# The upper limit is given by a well mixed population\nupper_limit <- function(r0) {\n f <- function(par) {\n abs(1 - exp(-r0 * par[1]) - par[1])\n }\n opt <- optim(\n par = 0.5, fn = f,\n lower = 0, upper = 1,\n method = \"Brent\"\n )\n opt\n}\n```\n\n## Verify that output is changing as expected when a single parameter varies\n\nAn even looser way to test statistical correctness would be to control that output varies as expected when you update some parameters. This could be for example, checking that the values you return increase when you increase or decrease one of your input parameters.\n\n::: {.callout-tip title=\"Example with `centroid()`\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntest_that(\"centroid() increases when coordinates from one point increase\", {\n \n x <- list(c(0, 1, 5, 3), c(8, 6, 4, 3), c(10, 2, 3, 7))\n \n y <- x\n y[[1]] <- y[[1]] + 1 \n\n expect_true(all(centroid(x) < centroid(y)))\n \n})\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nTest passed 🎊\n```\n\n\n:::\n:::\n\n\n:::\n\nAn example of this test in an actual R package can again be found [in the finalsize package](https://github.com/epiverse-trace/finalsize/blob/787de9a8fa430d63d06d2bc052c7134c43d1ca69/tests/testthat/test-newton_solver.R#L76-L102):\n\n``` r\nr0_low <- 1.3\nr0_high <- 3.3\n\nepi_outcome_low <- final_size(\n r0 = r0_low,\n <...>\n)\nepi_outcome_high <- final_size(\n r0 = r0_high,\n <...>\n)\n\ntest_that(\"Higher values of R0 result in a higher number of infectious in all groups\", {\n expect_true(\n all(epi_outcome_high$p_infected > epi_outcome_low$p_infected)\n )\n})\n```\n\n## Conclusion: automated validation vs peer-review\n\nIn this post, we've presented different methods to automatically verify the statistical correctness of your statistical software. We would like to highlight one more time that it's important to run these tests are part of your regular integration system, instead of running them just once at the start of the development. This will prevent the addition of possible errors in the code and show users what specific checks you are doing. 
By doing so, you are transparently committing to the highest quality.\n\n[Multiple voices](https://notstatschat.rbind.io/2019/02/04/how-do-you-tell-what-packages-to-trust/) [in the community](https://twitter.com/hadleywickham/status/1092129977540231168) are pushing more towards peer-review as a proxy for quality and validity:\n\n\n{{< tweet hadleywickham 1092129977540231168 >}}\n\n\n\nWe would like to highlight that automated validation and peer review are not mutually exclusive and answer slightly different purposes.\n\nOn the one hand, automated validation fails to catch more obscure bugs and edge cases. For example, a bug that would be difficult to detect via automated approach is the use of [bad Random Number Generators when running in parallel](https://www.jottr.org/2020/09/22/push-for-statistical-sound-rng/).\n\nBut on the other hand, peer-review is less scalable, and journals usually have some editorial policy that might not make your package a good fit. Additionally, peer-review usually happens at one point in time while automated validation can, and should, be part of the continuous integration system.\n\nIdeally, peer-review and automated validation should work hand-in-hand, with review informing the addition of new automated validation tests.\n", + "markdown": "---\ntitle: \"Ensuring & Showcasing the Statistical Correctness of your R Package\"\nauthor:\n - name: \"Hugo Gruson\"\n orcid: \"0000-0002-4094-1476\"\ndate: \"2023-02-13\"\ncategories: [code quality, R, R package, testing, continuous integration, good practices]\nimage: \"testing_error.png\"\nformat:\n html: \n toc: true\n---\n\n\nWe're living in an increasingly data-driven world. And since critical decisions are taken based on results produced by data scientists and data analysts, they need to be able to trust the tools they use.\nIt is now increasingly common to add continuous integration to software packages and libraries, to ensure the code is not crashing, and that future updates don't change your code output (snapshot tests). But one type of test still remains uncommon: tests for statistical correctness. That is, tests that ensure the algorithm implemented in your package actually produces the correct results.\n\n> Does [@rstudio](https://twitter.com/rstudio) have a position in the trustworthiness / validity of any ~statistical methods~ packages in [#Rstats](https://twitter.com/hashtag/Rstats)?\n>\n> Or, is there a list of packages that [@rstudio](https://twitter.com/rstudio) considers 'approved' and thus will recommend to clients?\n>\n> --- *(deleted tweet)* February 3, 2019\n\nIt is likely that most statistical package authors run some tests on their own during development, but there don't seem to be any guidelines on how to test statistical correctness in a solid and standard way [^1].\n\n[^1]: But see the [\"testing statistical software\" post from Alex Hayes](https://www.alexpghayes.com/post/2019-06-07_testing-statistical-software/) where he presents his process to determine if he deems a statistical package trustworthy or not, and the [rOpenSci Statistical Software Peer Review book](https://stats-devguide.ropensci.org/).\n\nIn this blog post, we explore various methods to ensure the statistical correctness of your software. We argue that these tests should be part of your continuous integration system, to ensure your tool remains valid throughout its life, and to let users verify how you validate your package. 
Finally, we show how these principles are implemented in the Epiverse TRACE tools.\n\nThe approaches presented here are non-exclusive and should ideally all be added to your tests. However, they are presented in order of stringency and priority to implement. We also take the example of a function computing the centroid of a list of points to demonstrate how you would integrate the recommendations from this post with the [`{testthat}` R package](https://testthat.r-lib.org/), often used for unit testing:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n#' Compute the centroid of a set of points\n#'\n#' @param coords Coordinates of the points as a list of vectors. Each element of the \n#' list is a point.\n#'\n#' @returns A vector of coordinates of the same length as each element of \n#' `coords`\n#' \n#' @examples\n#' centroid(list(c(0, 1, 5, 3), c(8, 6, 4, 3), c(10, 2, 3, 7)))\n#' \ncentroid <- function(coords) {\n\n # ...\n # Skip all the necessary input checking for the purpose of this demo\n # ...\n\n coords_mat <- do.call(rbind, coords)\n \n return(colMeans(coords_mat))\n \n}\n```\n:::\n\n\n## Compare your results to the reference implementation\n\nThe most straightforward and most solid way to ensure your implementation is valid is to compare your results to the results of the reference implementation. The reference implementation can be a package in another language, an example with toy data in the scientific article introducing the method, etc.\n\nFor example, the [`{gemma2}` R package](https://github.com/fboehm/gemma2), which re-implements the methods from [the GEMMA tool written in C++](https://github.com/genetics-statistics/GEMMA), [verifies that values produced by both tools match](https://github.com/fboehm/gemma2/blob/ea3052f8609622f17224fb8ec5fd83bd1bceb33e/tests/testthat/test_calc_sigma.R#L34-L37):\n\n``` r\ntest_that(\"Results of gemma2 equal those of GEMMA v 0.97\", {\n expect_equal(Sigma_ee, diag(c(18.559, 12.3672)), tolerance = 0.0001)\n expect_equal(Sigma_uu, diag(c(82.2973, 41.9238)), tolerance = 0.0001)\n})\n```\n\n::: {.callout-tip title=\"Example with `centroid()`\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(testthat)\n\ntest_that(\"centroid() in 1D produces the same results as mean()\", {\n\n x <- list(1, 5, 3, 10, 5)\n\n expect_identical(centroid(x), mean(unlist(x)))\n \n})\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nTest passed 🎊\n```\n\n\n:::\n:::\n\n\n:::\n\nNote that even if a **reference** implementation doesn't exist, it is still good practice to compare your implementation to competing ones. Discrepancies might reveal a bug in your implementation or in theirs, but in any case, finding it out is beneficial to the community.\n\nHowever, this approach cannot be used in all cases. Indeed, there may not be a reference implementation in your case. Or it might be difficult to replicate identical computations in the case of algorithms with stochasticity [^2].\n\n[^2]: Setting the random seed is not enough to compare implementations across programming languages because different languages use different kinds of Random Number Generators.\n\n
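This caveat about random seeds is easy to demonstrate within R itself: even with the same seed, two different RNG algorithms produce different draws. Here is a minimal sketch (the exact values do not matter, only the fact that they differ):\n\n``` r\n# Same seed, two different RNG algorithms: the draws differ\n# Note: set.seed(kind = ) changes the RNG kind for the rest of the session\nset.seed(1, kind = \"Mersenne-Twister\")\nrunif(1)\n\nset.seed(1, kind = \"Wichmann-Hill\")\nrunif(1)\n```\n\n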
## Compare to a theoretical upper or lower bound\n\nAn alternative strategy is to compare your result to a theoretical upper or lower bound. This offers a weaker guarantee that your implementation and your results are correct, but it can still allow you to detect important mistakes.\n\n::: {.callout-tip title=\"Example with `centroid()`\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntest_that(\"centroid() is inside the hypercube containing the data points\", {\n \n x <- list(c(0, 1, 5, 3), c(8, 6, 4, 3), c(10, 2, 3, 7))\n\n expect_true(all(centroid(x) <= Reduce(pmax, x)))\n expect_true(all(centroid(x) >= Reduce(pmin, x)))\n \n})\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nTest passed 😀\n```\n\n\n:::\n:::\n\n\n:::\n\nYou can see a [real-life example of this kind of test in the `{finalsize}` R package](https://github.com/epiverse-trace/finalsize/blob/a710767b38a9242f15ab4dcf18b02fb5b0bcf24f/tests/testthat/test-newton_solver_vary_r0.R#L1-L13). `{finalsize}` computes the final proportion of infected in a heterogeneous population according to an SIR model. Theory predicts that the number of infections is maximal in a well-mixed population:\n\n``` r\n# Calculates the upper limit of final size given the r0\n# The upper limit is given by a well mixed population\nupper_limit <- function(r0) {\n f <- function(par) {\n abs(1 - exp(-r0 * par[1]) - par[1])\n }\n opt <- optim(\n par = 0.5, fn = f,\n lower = 0, upper = 1,\n method = \"Brent\"\n )\n opt\n}\n```\n\n## Verify that output is changing as expected when a single parameter varies\n\nAn even looser way to test statistical correctness would be to check that output varies as expected when you update some parameters. This could be, for example, checking that the values you return increase when you increase or decrease one of your input parameters.\n\n::: {.callout-tip title=\"Example with `centroid()`\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntest_that(\"centroid() increases when coordinates from one point increase\", {\n \n x <- list(c(0, 1, 5, 3), c(8, 6, 4, 3), c(10, 2, 3, 7))\n \n y <- x\n y[[1]] <- y[[1]] + 1 \n\n expect_true(all(centroid(x) < centroid(y)))\n \n})\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nTest passed 😀\n```\n\n\n:::\n:::\n\n\n:::\n\nAn example of this test in an actual R package can again be found [in the finalsize package](https://github.com/epiverse-trace/finalsize/blob/787de9a8fa430d63d06d2bc052c7134c43d1ca69/tests/testthat/test-newton_solver.R#L76-L102):\n\n``` r\nr0_low <- 1.3\nr0_high <- 3.3\n\nepi_outcome_low <- final_size(\n r0 = r0_low,\n <...>\n)\nepi_outcome_high <- final_size(\n r0 = r0_high,\n <...>\n)\n\ntest_that(\"Higher values of R0 result in a higher number of infectious in all groups\", {\n expect_true(\n all(epi_outcome_high$p_infected > epi_outcome_low$p_infected)\n )\n})\n```\n\n
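Checks of this kind can also encode simple invariances of your function. For instance, translating every point by a constant should translate the centroid by the same constant. Here is a sketch of such a test, building on the `centroid()` example used throughout this post:\n\n``` r\ntest_that(\"centroid() is translation invariant\", {\n\n x <- list(c(0, 1, 5, 3), c(8, 6, 4, 3), c(10, 2, 3, 7))\n\n # Shift every point by the same constant\n y <- lapply(x, function(point) point + 5)\n\n # The centroid should shift by exactly that constant\n expect_equal(centroid(y), centroid(x) + 5)\n\n})\n```\n\n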
## Conclusion: automated validation vs peer-review\n\nIn this post, we've presented different methods to automatically verify the statistical correctness of your statistical software. We would like to highlight one more time that it's important to run these tests as part of your continuous integration system, instead of running them just once at the start of development. This will prevent the introduction of errors in the code and show users what specific checks you are doing. By doing so, you are transparently committing to the highest quality.\n\n[Multiple voices](https://notstatschat.rbind.io/2019/02/04/how-do-you-tell-what-packages-to-trust/) [in the community](https://twitter.com/hadleywickham/status/1092129977540231168) are pushing more towards peer-review as a proxy for quality and validity:\n\n\n{{< tweet hadleywickham 1092129977540231168 >}}\n\n\n\nWe would like to highlight that automated validation and peer review are not mutually exclusive and serve slightly different purposes.\n\nOn the one hand, automated validation may fail to catch more obscure bugs and edge cases. For example, a bug that would be difficult to detect via an automated approach is the use of [bad Random Number Generators when running in parallel](https://www.jottr.org/2020/09/22/push-for-statistical-sound-rng/).\n\nBut on the other hand, peer-review is less scalable, and journals usually have editorial policies that your package might not fit. Additionally, peer-review usually happens at one point in time while automated validation can, and should, be part of the continuous integration system.\n\nIdeally, peer-review and automated validation should work hand-in-hand, with review informing the addition of new automated validation tests.\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/posts/system-dependencies/index/execute-results/html.json b/_freeze/posts/system-dependencies/index/execute-results/html.json index 191b7c35..1618bb72 100644 --- a/_freeze/posts/system-dependencies/index/execute-results/html.json +++ b/_freeze/posts/system-dependencies/index/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "d5f009c6a17e9de3220c53228a5ebfa9", + "hash": "2eef0272c9ab2c01cb29a2ea2d71e26e", "result": { "engine": "knitr", - "markdown": "---\ntitle: \"System Dependencies in R Packages & Automatic Testing\"\nauthor:\n - name: \"Hugo Gruson\"\n orcid: \"0000-0002-4094-1476\"\ndate: \"2023-09-26\"\ncategories: [package development, R, R package, continuous integration, system dependens]\nformat:\n html: \n toc: true\n---\n\n\n*This post has been [cross-posted on the R-hub blog](https://blog.r-hub.io/2023/09/26/system-dependency/), and the R-hub blog maintainers have contributed to the review and improvement of this post.*\n\nIn a [previous R-hub blog post](https://blog.r-hub.io/2022/09/12/r-dependency/), we discussed a package dependency that goes slightly beyond the normal R package ecosystem dependency: R itself.\nToday, we step even further and discuss dependencies outside of R: system dependencies.\nThis happens when packages rely on external software, such as how [R packages integrating CUDA GPU computation in R](https://github.com/search?q=org%3Acran+cuda+path%3ADESCRIPTION&type=code) require the [CUDA library](https://en.wikipedia.org/wiki/CUDA).\nIn particular, we are going to talk about system dependencies in the context of automated testing: is there anything extra to do when setting continuous integration for your package with system dependencies?\nIn particular, we will focus with the integration with [GitHub Actions](https://beamilz.com/posts/series-gha/2022-series-gha-1-what-is/en/).\nHow does it work behind the scenes?\nAnd how to work with edge cases?\n\n## Introduction: specifying system dependencies in R packages\n\nBefore jumping right into the topic of continuous integration, let's take a moment to introduce, or remind you, how system dependencies are specified in R packages.\n\nThe official 'Writing R Extensions' guide states 
[^1]:\n\n[^1]: For R history fans, this has been the case [since R 1.7.0](https://github.com/r-devel/r-svn/blob/9c46956fd784c6985867aca069b926d774602928/doc/NEWS.1#L2348-L2350), released in April 2003.\n\n> Dependencies external to the R system should be listed in the 'SystemRequirements' field, possibly amplified in a separate README file.\n\nThis was initially purely designed for humans.\nNo system within R itself makes use of it.\nOne important thing to note is that this field contains free text :scream:.\nAs such, to refer to the same piece of software, you could write either one of the following in the package `DESCRIPTION`:\n\n``` yaml\nSystemRequirements: ExternalSoftware\n```\n\n``` yaml\nSystemRequirements: ExternalSoftware 0.1\n```\n\n``` yaml\nSystemRequirements: lib-externalsoftware\n```\n\nHowever, it is probably good practice check what other R packages with similar system dependencies are writing in `SystemRequirements`, to facilitate the automated identification process we describe below.\n\n## The general case: everything works automagically\n\nIf while reading the previous section, you could already sense the problems linked to the fact `SystemRequirements` is a free-text field, fret not!\nIn the very large majority of cases, setting up continuous integration in an R package with system dependencies is exactly the same as with any other R package.\n\nUsing, as often, the supercharged usethis package, you can automatically create the relevant GitHub Actions workflow file in your project [^2]:\n\n[^2]: Alternatively, if you're not using usethis, you can manually copy-paste the relevant GitHub Actions workflow file from the [`examples` of the `r-lib/actions` project](https://github.com/r-lib/actions/tree/HEAD/examples).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nusethis::use_github_action(\"check-standard\")\n```\n:::\n\n\nThe result is:\n\n``` yaml\n# Workflow derived from https://github.com/r-lib/actions/tree/v2/examples\n# Need help debugging build failures? 
Start at https://github.com/r-lib/actions#where-to-find-help\non:\n push:\n branches: [main, master]\n pull_request:\n branches: [main, master]\n\nname: R-CMD-check\n\njobs:\n R-CMD-check:\n runs-on: ${{ matrix.config.os }}\n\n name: ${{ matrix.config.os }} (${{ matrix.config.r }})\n\n strategy:\n fail-fast: false\n matrix:\n config:\n - {os: macos-latest, r: 'release'}\n - {os: windows-latest, r: 'release'}\n - {os: ubuntu-latest, r: 'devel', http-user-agent: 'release'}\n - {os: ubuntu-latest, r: 'release'}\n - {os: ubuntu-latest, r: 'oldrel-1'}\n\n env:\n GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}\n R_KEEP_PKG_SOURCE: yes\n\n steps:\n - uses: actions/checkout@v3\n\n - uses: r-lib/actions/setup-pandoc@v2\n\n - uses: r-lib/actions/setup-r@v2\n with:\n r-version: ${{ matrix.config.r }}\n http-user-agent: ${{ matrix.config.http-user-agent }}\n use-public-rspm: true\n\n - uses: r-lib/actions/setup-r-dependencies@v2\n with:\n extra-packages: any::rcmdcheck\n needs: check\n\n - uses: r-lib/actions/check-r-package@v2\n with:\n upload-snapshots: true\n```\n\nYou may notice there is no explicit mention of system dependencies in this file.\nYet, if we use this workflow in an R package with system dependencies, everything will work out-of-the-box in most cases.\nSo, when are system dependencies installed?\nAnd how the workflow does even know which dependencies to install since the `SystemRequirements` is free text that may not correspond to the exact name of a library?\n\nThe magic happens in the `r-lib/actions/setup-r-dependencies` step.\nIf you want to learn about it, you can read the [source code of this step](https://github.com/r-lib/actions/blob/756399d909bf9c180bbdafe8025f794f51f2da02/setup-r-dependencies/action.yaml).\nIt is mostly written in R but it contains a lot of bells and whistles to handle messaging within the GitHub Actions context and as such, it would be too long to go through it line by line in this post.\nHowever, at a glance, you can notice many mentions of the [pak R package](https://pak.r-lib.org/).\n\nIf it's the first time you're hearing about the pak package, we strongly recommend we go through the [list of the most important pak features](https://pak.r-lib.org/reference/features.html).\nIt is ~~paked~~ packed with many very powerful features.\nThe specific feature we're interested in here is the automatic install of system dependencies via [`pak::pkg_sysreqs()`](https://pak.r-lib.org/reference/local_system_requirements.html), which in turn uses `pkgdepends::sysreqs_install_plan()`.\n\nWe now understand more precisely where the magic happens but it still doesn't explain how pak is able to know which precise piece of software to install from the free text `SystemRequirements` field.\nAs often when you want to increase your understanding, it is helpful to [read the source](https://blog.r-hub.io/2019/05/14/read-the-source/).\nWhile browsing pkgdepends source code, we see a call to .\n\nThis repository contains a set of [rules](https://github.com/rstudio/r-system-requirements/tree/main/rules) as json files which match unformatted software name via regular expressions to the exact libraries for each major operating system.\nLet's walk through an example together:\n\n``` json\n{\n \"patterns\": [\"\\\\bnvcc\\\\b\", \"\\\\bcuda\\\\b\"],\n \"dependencies\": [\n {\n \"packages\": [\"nvidia-cuda-dev\"],\n \"constraints\": [\n {\n \"os\": \"linux\",\n \"distribution\": \"ubuntu\"\n }\n ]\n }\n ]\n}\n```\n\nThe regular expression tells that each time a package lists something as 
`SystemRequirements` with the word \"nvcc\" or \"cuda\", the corresponding Ubuntu library to install is `nvidia-cuda-dev`.\n\nThis interaction between `r-system-requirements` and pak is also documented in pak's dev version, with extra information about how the `SystemRequirements` field is extracted in different situations: \n\n## When it's not working out-of-the-box\n\nWe are now realizing that this automagical setup we didn't pay so much attention to until now actually requires a very heavy machinery under the hood.\nAnd it happens, very rarely, that this complex machinery is not able to handle your specific use case.\nBut it doesn't mean that you cannot use continuous integration in your package.\nIt means that some extra steps might be required to do so.\nLet's review these possible solutions together in order of complexity.\n\n### Fix it for everybody by submitting a pull request\n\nOne first option might be that the regular expression used by `r-system-requirements` to convert the free text in `SystemRequirements` to a library distributed by your operating system does not recognize what is in `SystemRequirements`.\n\nTo identify if this is the case, you need to find the file containing the specific rule for the system dependency of interest in `r-system-requirements`, and test the regular expression on the contents of `SystemRequirements`.\n\nIf we re-use the cuda example from the previous section and we are wondering why it is not automatically installed for a package specifying \"cudaa\":\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstringr::str_match(\"cudaa\", c(\"\\\\bnvcc\\\\b\", \"\\\\bcuda\\\\b\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [,1]\n[1,] NA \n[2,] NA \n```\n\n\n:::\n:::\n\n\nThis test confirms that the `SystemRequirements` field contents are not recognized by the regular expression.\nDepending on the case, the best course of action might be to:\n\n- either edit the contents of `SystemRequirements` so that it's picked up by the regular expression\n- or submit a pull request to [`rstudio/r-system-requirements`](https://github.com/rstudio/r-system-requirements) [^3] if you believe the regular expression is too restrictive and should be updated ([example](https://github.com/rstudio/r-system-requirements/pull/93))\n\n[^3]: If you are wondering why we are saying to submit PR to `rstudio/r-system-requirements` when we were previously talking about `r-hub/r-system-requirements`, you can check out [this comment thread](https://github.com/r-hub/blog/pull/165#discussion_r1280644182).\n\nNote however that the first option is likely always the simplest as it doesn't impact all the rest of the ecosystem (which is why `r-system-requirements` maintainers might be reluctant to relax a regular expression) and it is often something directly in your control, rather than a third-party who might not immediately be available to review your PR.\n\n### Install system dependencies \"manually\"\n\nHowever, you might be in a case where you cannot rely on the automated approach.\nFor example, maybe the system dependency to install is not provided by package managers at all.\nTypically, if you had to compile or install it manually on your local computer, you're very likely to have to do the same operation in GitHub Actions.\nThere two different, but somewhat equivalent, ways to do so, as detailed below.\n\n#### Directly in the GitHub Actions workflow\n\nYou can insert the installation steps you used locally in the GitHub Actions workflow file.\nSo, instead of having the usual 
structure, you have an extra step \"Install extra system dependencies manually\" that may look something like this:\n\n``` diff\njobs:\n R-CMD-check:\n runs-on: ubuntu-latest\n env:\n GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}\n R_KEEP_PKG_SOURCE: yes\n steps:\n - uses: actions/checkout@v3\n\n - uses: r-lib/actions/setup-r@v2\n with:\n use-public-rspm: true\n\n+ - name: Install extra system dependencies manually\n+ run:\n+ wget ...\n+ make\n+ sudo make install\n\n - uses: r-lib/actions/setup-r-dependencies@v2\n with:\n extra-packages: any::rcmdcheck\n needs: check\n\n - uses: r-lib/actions/check-r-package@v2\n```\n\nYou can see [a real-life example in the rbi R package](https://github.com/sbfnk/rbi/blob/9b05a24ce42f7b1b53481370f3bde3dcd86bca02/.github/workflows/R-CMD-check.yaml).\n\n#### Using a Docker image in GitHub Actions\n\nAlternatively, you can do the manual installation in a Docker image and use this image in your GitHub Actions workflow.\nThis is a particularly good solution if there is already a public Docker image or you already wrote a `DOCKERFILE` for your own local development purposes.\nIf you use a public image, you can follow [the steps in the official documentation](https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#example-running-a-job-within-a-container) to integrate it to your GitHub Actions job.\nIf you use a `DOCKERFILE`, you can follow [the answers to this stackoverflow question](https://stackoverflow.com/q/61154750/4439357) (in a nutshell, use `docker compose` in your job or publish the image first and then follow the official documentation).\n\n``` diff\njobs:\n R-CMD-check:\n runs-on: ubuntu-latest\n+ container: ghcr.io/org/repo:main\n env:\n GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}\n R_KEEP_PKG_SOURCE: yes\n steps:\n - uses: actions/checkout@v3\n\n - uses: r-lib/actions/setup-r@v2\n with:\n use-public-rspm: true\n\n - uses: r-lib/actions/setup-r-dependencies@v2\n with:\n extra-packages: any::rcmdcheck\n needs: check\n\n - uses: r-lib/actions/check-r-package@v2\n```\n\nYou can again see [a real-life example in the rbi R package](https://github.com/sbfnk/rbi/pull/46/files).\n\n## Conclusion\n\nIn this post, we have provided an overview of how to specify system requirements for R package, how this seemingly innocent task requires a very complex infrastructure so that it can be understood by automated tools and that your dependencies are smoothly installed in a single command.\nWe also gave some pointers on what to do if you're in one of the rare cases where the automated tools don't or can't work.\n\nOne final note on this topic is that there might be a move from CRAN to start requiring more standardization in the `SystemRequirements` field.\nOne R package developer has reported being asked to change \"Java JRE 8 or higher\" to \"Java (\\>= 8)\".\n\n*Many thanks to [Maëlle Salmon](https://masalmon.eu/) & [Gábor Csárdi](https://github.com/gaborcsardi) for their insights into this topic and their valuable feedback on this post.*\n", + "markdown": "---\ntitle: \"System Dependencies in R Packages & Automatic Testing\"\nauthor:\n - name: \"Hugo Gruson\"\n orcid: \"0000-0002-4094-1476\"\ndate: \"2023-09-26\"\ncategories: [package development, R, R package, continuous integration, system dependencies]\nformat:\n html: \n toc: true\n---\n\n\n*This post has been [cross-posted on the R-hub blog](https://blog.r-hub.io/2023/09/26/system-dependency/), and the R-hub blog maintainers have contributed to the review and improvement of this 
post.*\n\nIn a [previous R-hub blog post](https://blog.r-hub.io/2022/09/12/r-dependency/), we discussed a package dependency that goes slightly beyond the normal R package ecosystem dependency: R itself.\nToday, we step even further and discuss dependencies outside of R: system dependencies.\nThis happens when packages rely on external software, such as how [R packages integrating CUDA GPU computation in R](https://github.com/search?q=org%3Acran+cuda+path%3ADESCRIPTION&type=code) require the [CUDA library](https://en.wikipedia.org/wiki/CUDA).\nIn particular, we are going to talk about system dependencies in the context of automated testing: is there anything extra to do when setting up continuous integration for your package with system dependencies?\nMore specifically, we will focus on the integration with [GitHub Actions](https://beamilz.com/posts/series-gha/2022-series-gha-1-what-is/en/).\nHow does it work behind the scenes?\nAnd how do you handle edge cases?\n\n## Introduction: specifying system dependencies in R packages\n\nBefore jumping right into the topic of continuous integration, let's take a moment to introduce, or remind you, how system dependencies are specified in R packages.\n\nThe official 'Writing R Extensions' guide states [^1]:\n\n[^1]: For R history fans, this has been the case [since R 1.7.0](https://github.com/r-devel/r-svn/blob/9c46956fd784c6985867aca069b926d774602928/doc/NEWS.1#L2348-L2350), released in April 2003.\n\n> Dependencies external to the R system should be listed in the 'SystemRequirements' field, possibly amplified in a separate README file.\n\nThis was initially purely designed for humans.\nNo system within R itself makes use of it.\nOne important thing to note is that this field contains free text :scream:.\nAs such, to refer to the same piece of software, you could write either one of the following in the package `DESCRIPTION`:\n\n``` yaml\nSystemRequirements: ExternalSoftware\n```\n\n``` yaml\nSystemRequirements: ExternalSoftware 0.1\n```\n\n``` yaml\nSystemRequirements: lib-externalsoftware\n```\n\nHowever, it is probably good practice to check what other R packages with similar system dependencies are writing in `SystemRequirements`, to facilitate the automated identification process we describe below.\n\n
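If you want to quickly inspect this field for your own package, you can read it directly from the `DESCRIPTION` file with base R. This is a minimal sketch, assuming it is run from the root of a package source directory:\n\n``` r\n# Read the free-text SystemRequirements field from DESCRIPTION\nread.dcf(\"DESCRIPTION\", fields = \"SystemRequirements\")\n```\n\n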
## The general case: everything works automagically\n\nIf, while reading the previous section, you could already sense the problems linked to the fact that `SystemRequirements` is a free-text field, fret not!\nIn the very large majority of cases, setting up continuous integration in an R package with system dependencies is exactly the same as with any other R package.\n\nUsing, as often, the supercharged usethis package, you can automatically create the relevant GitHub Actions workflow file in your project [^2]:\n\n[^2]: Alternatively, if you're not using usethis, you can manually copy-paste the relevant GitHub Actions workflow file from the [`examples` of the `r-lib/actions` project](https://github.com/r-lib/actions/tree/HEAD/examples).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nusethis::use_github_action(\"check-standard\")\n```\n:::\n\n\nThe result is:\n\n``` yaml\n# Workflow derived from https://github.com/r-lib/actions/tree/v2/examples\n# Need help debugging build failures? Start at https://github.com/r-lib/actions#where-to-find-help\non:\n push:\n branches: [main, master]\n pull_request:\n branches: [main, master]\n\nname: R-CMD-check\n\njobs:\n R-CMD-check:\n runs-on: ${{ matrix.config.os }}\n\n name: ${{ matrix.config.os }} (${{ matrix.config.r }})\n\n strategy:\n fail-fast: false\n matrix:\n config:\n - {os: macos-latest, r: 'release'}\n - {os: windows-latest, r: 'release'}\n - {os: ubuntu-latest, r: 'devel', http-user-agent: 'release'}\n - {os: ubuntu-latest, r: 'release'}\n - {os: ubuntu-latest, r: 'oldrel-1'}\n\n env:\n GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}\n R_KEEP_PKG_SOURCE: yes\n\n steps:\n - uses: actions/checkout@v3\n\n - uses: r-lib/actions/setup-pandoc@v2\n\n - uses: r-lib/actions/setup-r@v2\n with:\n r-version: ${{ matrix.config.r }}\n http-user-agent: ${{ matrix.config.http-user-agent }}\n use-public-rspm: true\n\n - uses: r-lib/actions/setup-r-dependencies@v2\n with:\n extra-packages: any::rcmdcheck\n needs: check\n\n - uses: r-lib/actions/check-r-package@v2\n with:\n upload-snapshots: true\n```\n\nYou may notice there is no explicit mention of system dependencies in this file.\nYet, if we use this workflow in an R package with system dependencies, everything will work out-of-the-box in most cases.\nSo, when are system dependencies installed?\nAnd how does the workflow even know which dependencies to install, since `SystemRequirements` is free text that may not correspond to the exact name of a library?\n\nThe magic happens in the `r-lib/actions/setup-r-dependencies` step.\nIf you want to learn about it, you can read the [source code of this step](https://github.com/r-lib/actions/blob/756399d909bf9c180bbdafe8025f794f51f2da02/setup-r-dependencies/action.yaml).\nIt is mostly written in R but it contains a lot of bells and whistles to handle messaging within the GitHub Actions context and as such, it would be too long to go through it line by line in this post.\nHowever, at a glance, you can notice many mentions of the [pak R package](https://pak.r-lib.org/).\n\nIf it's the first time you're hearing about the pak package, we strongly recommend you go through the [list of the most important pak features](https://pak.r-lib.org/reference/features.html).\nIt is ~~paked~~ packed with many very powerful features.\nThe specific feature we're interested in here is the automatic install of system dependencies via [`pak::pkg_sysreqs()`](https://pak.r-lib.org/reference/local_system_requirements.html), which in turn uses `pkgdepends::sysreqs_install_plan()`.\n\n
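You can also try this feature locally, by asking pak which system packages a given R package would need before installing anything. This is a sketch, assuming a platform where pak resolves system requirements (the exact output depends on your operating system and pak version):\n\n``` r\n# Preview the system requirements of the {curl} R package on this platform\npak::pkg_sysreqs(\"curl\")\n```\n\n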
We now understand more precisely where the magic happens but it still doesn't explain how pak is able to know which precise piece of software to install from the free text `SystemRequirements` field.\nAs often when you want to increase your understanding, it is helpful to [read the source](https://blog.r-hub.io/2019/05/14/read-the-source/).\nWhile browsing the pkgdepends source code, we see a call to the [`r-hub/r-system-requirements` repository](https://github.com/r-hub/r-system-requirements).\n\nThis repository contains a set of [rules](https://github.com/rstudio/r-system-requirements/tree/main/rules) as json files which match unformatted software names via regular expressions to the exact libraries for each major operating system.\nLet's walk through an example together:\n\n``` json\n{\n \"patterns\": [\"\\\\bnvcc\\\\b\", \"\\\\bcuda\\\\b\"],\n \"dependencies\": [\n {\n \"packages\": [\"nvidia-cuda-dev\"],\n \"constraints\": [\n {\n \"os\": \"linux\",\n \"distribution\": \"ubuntu\"\n }\n ]\n }\n ]\n}\n```\n\nThe regular expression tells us that each time a package lists something as `SystemRequirements` with the word \"nvcc\" or \"cuda\", the corresponding Ubuntu library to install is `nvidia-cuda-dev`.\n\nThis interaction between `r-system-requirements` and pak is also documented in pak's dev version, with extra information about how the `SystemRequirements` field is extracted in different situations.\n\n## When it's not working out-of-the-box\n\nWe are now realizing that this automagical setup we didn't pay so much attention to until now actually requires very heavy machinery under the hood.\nAnd it happens, very rarely, that this complex machinery is not able to handle your specific use case.\nBut it doesn't mean that you cannot use continuous integration in your package.\nIt means that some extra steps might be required to do so.\nLet's review these possible solutions together in order of complexity.\n\n### Fix it for everybody by submitting a pull request\n\nA first possibility is that the regular expression used by `r-system-requirements` to convert the free text in `SystemRequirements` to a library distributed by your operating system does not recognize what is in `SystemRequirements`.\n\nTo identify if this is the case, you need to find the file containing the specific rule for the system dependency of interest in `r-system-requirements`, and test the regular expression on the contents of `SystemRequirements`.\n\nIf we re-use the cuda example from the previous section and we are wondering why it is not automatically installed for a package specifying \"cudaa\":\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstringr::str_match(\"cudaa\", c(\"\\\\bnvcc\\\\b\", \"\\\\bcuda\\\\b\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [,1]\n[1,] NA \n[2,] NA \n```\n\n\n:::\n:::\n\n\nThis test confirms that the `SystemRequirements` field contents are not recognized by the regular expression.\nDepending on the case, the best course of action might be to:\n\n- either edit the contents of `SystemRequirements` so that it's picked up by the regular expression (you can verify this with the sketch below)\n- or submit a pull request to [`rstudio/r-system-requirements`](https://github.com/rstudio/r-system-requirements) [^3] if you believe the regular expression is too restrictive and should be updated ([example](https://github.com/rstudio/r-system-requirements/pull/93))\n\n[^3]: If you are wondering why we are saying to submit PR to `rstudio/r-system-requirements` when we were previously talking about `r-hub/r-system-requirements`, you can check out [this comment thread](https://github.com/r-hub/blog/pull/165#discussion_r1280644182).\n\nNote however that the first option is likely always the simplest as it doesn't impact all the rest of the ecosystem (which is why `r-system-requirements` maintainers might be reluctant to relax a regular expression) and it is often something directly in your control, rather than depending on a third party who might not be immediately available to review your PR.\n\n
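If you go for the first option, you can re-use the same technique to check that a rewritten `SystemRequirements` value would be picked up. This is a sketch, using the cuda rule shown earlier:\n\n``` r\n# \"cuda\" is matched by the second pattern of the rule,\n# so nvidia-cuda-dev would be installed on Ubuntu\nstringr::str_match(\"cuda\", c(\"\\\\bnvcc\\\\b\", \"\\\\bcuda\\\\b\"))\n```\n\n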
### Install system dependencies \"manually\"\n\nHowever, you might be in a case where you cannot rely on the automated approach.\nFor example, maybe the system dependency to install is not provided by package managers at all.\nTypically, if you had to compile or install it manually on your local computer, you're very likely to have to do the same operation in GitHub Actions.\nThere are two different, but somewhat equivalent, ways to do so, as detailed below.\n\n#### Directly in the GitHub Actions workflow\n\nYou can insert the installation steps you used locally in the GitHub Actions workflow file.\nSo, instead of having the usual structure, you have an extra step \"Install extra system dependencies manually\" that may look something like this:\n\n``` diff\njobs:\n R-CMD-check:\n runs-on: ubuntu-latest\n env:\n GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}\n R_KEEP_PKG_SOURCE: yes\n steps:\n - uses: actions/checkout@v3\n\n - uses: r-lib/actions/setup-r@v2\n with:\n use-public-rspm: true\n\n+ - name: Install extra system dependencies manually\n+ run:\n+ wget ...\n+ make\n+ sudo make install\n\n - uses: r-lib/actions/setup-r-dependencies@v2\n with:\n extra-packages: any::rcmdcheck\n needs: check\n\n - uses: r-lib/actions/check-r-package@v2\n```\n\nYou can see [a real-life example in the rbi R package](https://github.com/sbfnk/rbi/blob/9b05a24ce42f7b1b53481370f3bde3dcd86bca02/.github/workflows/R-CMD-check.yaml).\n\n#### Using a Docker image in GitHub Actions\n\nAlternatively, you can do the manual installation in a Docker image and use this image in your GitHub Actions workflow.\nThis is a particularly good solution if there is already a public Docker image or you already wrote a `Dockerfile` for your own local development purposes.\nIf you use a public image, you can follow [the steps in the official documentation](https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#example-running-a-job-within-a-container) to integrate it into your GitHub Actions job.\nIf you use a `Dockerfile`, you can follow [the answers to this stackoverflow question](https://stackoverflow.com/q/61154750/4439357) (in a nutshell, use `docker compose` in your job or publish the image first and then follow the official documentation).\n\n``` diff\njobs:\n R-CMD-check:\n runs-on: ubuntu-latest\n+ container: ghcr.io/org/repo:main\n env:\n GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}\n R_KEEP_PKG_SOURCE: yes\n steps:\n - uses: actions/checkout@v3\n\n - uses: r-lib/actions/setup-r@v2\n with:\n use-public-rspm: true\n\n - uses: r-lib/actions/setup-r-dependencies@v2\n with:\n extra-packages: any::rcmdcheck\n needs: check\n\n - uses: r-lib/actions/check-r-package@v2\n```\n\nYou can again see [a real-life example in the rbi R package](https://github.com/sbfnk/rbi/pull/46/files).\n\n## Conclusion\n\nIn this post, we have provided an overview of how to specify system requirements for R packages, and how this seemingly innocent task requires a very complex infrastructure so that it can be understood by automated tools and your dependencies can be smoothly installed in a single command.\nWe also gave some pointers on what to do if you're in one of the rare cases where the automated tools don't or can't work.\n\nOne final note on this topic is that there might be a move from CRAN to start requiring more standardization in the `SystemRequirements` field.\nOne R package developer has reported being asked to change \"Java JRE 8 or higher\" to \"Java (\\>= 8)\".\n\n*Many thanks to [Maëlle Salmon](https://masalmon.eu/) & [Gábor Csárdi](https://github.com/gaborcsardi) for their insights into this topic and their valuable feedback on this post.*\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/posts/extend-dataframes/index.qmd b/posts/extend-dataframes/index.qmd index e864dd72..905d6fba 100644 --- a/posts/extend-dataframes/index.qmd +++ b/posts/extend-dataframes/index.qmd @@ -5,7 +5,7 @@ author: - name: "Joshua W. 
Lambert" orcid: "0000-0001-5218-3046" date: "2023-04-12" -categories: [data frame, R, R package, interoperability, S3 class, dplyr] +categories: [data frame, R, R package, interoperability, S3, tidyverse, object orientation] format: html: toc: true diff --git a/posts/for-vs-apply/index.qmd b/posts/for-vs-apply/index.qmd index 3b31f233..c859eefc 100644 --- a/posts/for-vs-apply/index.qmd +++ b/posts/for-vs-apply/index.qmd @@ -4,7 +4,7 @@ author: - name: "Hugo Gruson" orcid: "0000-0002-4094-1476" date: "2023-11-02" -categories: [R, functional programming, iteration, readability, good practices] +categories: [R, functional programming, iteration, readability, good practices, tidyverse] format: html: toc: true diff --git a/posts/lint-rcpp/index.qmd b/posts/lint-rcpp/index.qmd index c64c4f39..4779f958 100644 --- a/posts/lint-rcpp/index.qmd +++ b/posts/lint-rcpp/index.qmd @@ -4,7 +4,7 @@ author: - name: "Pratik Gupte" orcid: "0000-0001-5294-7819" date: "2023-02-16" -categories: [code quality, R, R package, Rcpp] +categories: [code quality, R, R package, Rcpp, good practices, continuous integration] format: html: toc: true diff --git a/posts/share-cpp/index.qmd b/posts/share-cpp/index.qmd index f109e809..111c01ed 100644 --- a/posts/share-cpp/index.qmd +++ b/posts/share-cpp/index.qmd @@ -4,7 +4,7 @@ author: - name: "Pratik Gupte" orcid: "0000-0001-5294-7819" date: "2023-06-05" -categories: [code sharing, R, R package, Rcpp, interoperability] +categories: [code sharing, R, R package, Rcpp, interoperability, package development] format: html: toc: true diff --git a/posts/statistical-correctness/index.qmd b/posts/statistical-correctness/index.qmd index e179c223..08c23546 100644 --- a/posts/statistical-correctness/index.qmd +++ b/posts/statistical-correctness/index.qmd @@ -4,7 +4,7 @@ author: - name: "Hugo Gruson" orcid: "0000-0002-4094-1476" date: "2023-02-13" -categories: [code quality, R, R package, testing] +categories: [code quality, R, R package, testing, continuous integration, good practices] image: "testing_error.png" format: html: diff --git a/posts/system-dependencies/index.qmd b/posts/system-dependencies/index.qmd index aa67765a..98875365 100644 --- a/posts/system-dependencies/index.qmd +++ b/posts/system-dependencies/index.qmd @@ -4,7 +4,7 @@ author: - name: "Hugo Gruson" orcid: "0000-0002-4094-1476" date: "2023-09-26" -categories: [package development, R, R package, continuous integration, system dependens] +categories: [package development, R, R package, continuous integration, system dependencies] format: html: toc: true