I like to think that the big unifying theme of the new 0.12.0 release is making validation functions more dynamic. Broadly, this came in the form of improved {tidyselect} support and enabling {glue} syntax, but in the process we also unlocked some really cool but non-obvious patterns that I think may be worth documenting.
There are still some questions in my mind about the scope and structure of such a vignette (perhaps this could be part of a more general and technical vignette for intermediate-to-advanced users on "programming with pointblank" or "extending pointblank"), but I wanted to put the idea out there. I'm happy to draft something if this is of interest!
Just to sketch out a couple of examples of the new patterns:
1) glue() syntax in label allows injecting human-readable labels
The pattern is to first define a "dictionary" (a named vector mapping column names to human-readable names) and use it to substitute {.col} via indexing:
library(pointblank)
library(dplyr)

col_labels <- c(
  "lat" = "Latitude",
  "long" = "Longitude"
)

agent1a <- create_agent(storms) %>%
  col_vals_not_null(
    columns = c(lat, long),
    label = "{col_labels[.col]} information should be present"
  ) %>%
  interrogate(show_step_label = TRUE, progress = TRUE)
#>
#> ── Interrogation Started - there are 2 steps ──────────────────────────────
#> ✔ Step 1: OK. - Latitude information should be present
#> ✔ Step 2: OK. - Longitude information should be present
#>
#> ── Interrogation Completed ────────────────────────────────────────────────
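As a quick aside (not part of the validation itself), you can preview what the glue template resolves to for a given column by binding .col by hand; during interrogation, pointblank binds .col to each selected column name per step:

library(glue)

# Hypothetical standalone check of the label template; `.col` is assigned
# manually here, whereas pointblank supplies it per validation step.
.col <- "lat"
glue("{col_labels[.col]} information should be present")
#> Latitude information should be present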
Going further, we can also use all_of() from the new tidyselect integration to express both columns and labels dynamically:
agent1b <- create_agent(storms) %>%
  col_vals_not_null(
    columns = all_of(names(col_labels)),
    label = "{col_labels[.col]} information should be present"
  ) %>%
  interrogate(show_step_label = TRUE, progress = TRUE)
#>
#> ── Interrogation Started - there are 2 steps ──────────────────────────────
#> ✔ Step 1: OK. - Latitude information should be present
#> ✔ Step 2: OK. - Longitude information should be present
#>
#> ── Interrogation Completed ────────────────────────────────────────────────
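A related sketch (not from the example above): if the human-readable names already live in a small data dictionary table, tibble::deframe() can produce the same named vector, so the label dictionary stays in sync with existing documentation:

library(tibble)

# Hypothetical data dictionary table; deframe() turns a two-column tibble
# (names in the first column, values in the second) into a named vector.
data_dictionary <- tribble(
  ~column, ~description,
  "lat",   "Latitude",
  "long",  "Longitude"
)
col_labels <- deframe(data_dictionary)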
2) any_of() in columns allows smart subsetting of the columns that are present in the data from a larger pool of thematically grouped columns
For example, we might define a function that wraps col_vals_gte(), applying a rule (values must be non-negative) to a pool of columns that we know contain information about a measure of time:
validate_time_cols <- function(agent) {
  time_measures <- c("year", "month", "day", "hour", "minute", "second")

  agent %>%
    col_vals_gte(
      columns = any_of(time_measures),
      value = 0,
      label = "Measure of time `{.col}` should be positive"
    )
}
dplyr::storms records year, month, day, and hour, but not minute or second, in its current state. Only the columns that exist are picked up and validated with this single function.
time_measures <- c("year", "month", "day", "hour", "minute", "second")
intersect(colnames(storms), time_measures)
#> [1] "year"  "month" "day"   "hour"

agent2a <- create_agent(storms) %>%
  validate_time_cols() %>%
  interrogate(show_step_label = TRUE, progress = TRUE)
#>
#> ── Interrogation Started - there are 4 steps ──────────────────────────────
#> ✔ Step 1: OK. - Measure of time `year` should be positive
#> ✔ Step 2: OK. - Measure of time `month` should be positive
#> ✔ Step 3: OK. - Measure of time `day` should be positive
#> ✔ Step 4: OK. - Measure of time `hour` should be positive
#>
#> ── Interrogation Completed ────────────────────────────────────────────────
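One possible generalization (a sketch, not one of the examples above): the column pool could itself be a parameter, so the same wrapper serves other thematic groups of columns while any_of() still only validates those that exist:

# Hypothetical generalization of validate_time_cols(): the pool of candidate
# columns is passed in as a character vector.
validate_nonnegative_cols <- function(agent, candidate_cols) {
  agent %>%
    col_vals_gte(
      columns = any_of(candidate_cols),
      value = 0,
      label = "`{.col}` should be non-negative"
    )
}

# e.g. agent %>% validate_nonnegative_cols(c("year", "month", "day", "hour"))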
If the storms API is known to sometimes (but not always) include information about minute, the same function can gracefully handle that variation, thanks to any_of():
storms2 <- storms %>%
  mutate(minute = sample(0:59, n(), replace = TRUE))

agent2b <- create_agent(storms2) %>%
  validate_time_cols() %>%
  interrogate(show_step_label = TRUE, progress = TRUE)
#>
#> ── Interrogation Started - there are 5 steps ──────────────────────────────
#> ✔ Step 1: OK. - Measure of time `year` should be positive
#> ✔ Step 2: OK. - Measure of time `month` should be positive
#> ✔ Step 3: OK. - Measure of time `day` should be positive
#> ✔ Step 4: OK. - Measure of time `hour` should be positive
#> ✔ Step 5: OK. - Measure of time `minute` should be positive
#>
#> ── Interrogation Completed ────────────────────────────────────────────────
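Optionally (not shown above), pointblank's all_passed() gives a quick programmatic check on the interrogation result:

# Every step reported OK above, so this should return TRUE.
all_passed(agent2b)
#> [1] TRUE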
3) Shared tidyselect expression in columns = ... and has_columns(...) to conditionally skip a step
pointblank has always had this pattern, but it's become even simpler to reason about now that columns and has_columns() share the exact same tidyselect implementation.
For example, this function says "check for missing values in factor columns, but only if any exist in the data":
validate_factor_completeness <- function(agent) {
  agent %>%
    col_vals_not_null(
      columns = where(is.factor),
      label = "Factor `{.col}` should not have missing data",
      active = ~ . %>% has_columns(where(is.factor))
    )
}
Now we have a very general and modular function that can be applied to any dataset.
It picks out the one factor column in dplyr::storms and validates it:
agent3a <- create_agent(storms) %>%
  validate_factor_completeness() %>%
  interrogate(show_step_label = TRUE, progress = TRUE)
#>
#> ── Interrogation Started - there is a single validation step ──────────────
#> ✔ Step 1: OK. - Factor `status` should not have missing data
#>
#> ── Interrogation Completed ────────────────────────────────────────────────
It skips the step for dplyr::starwars because there are no factor columns:
agent3b <- create_agent(starwars) %>%
  validate_factor_completeness() %>%
  interrogate(show_step_label = TRUE, progress = TRUE)
#>
#> ── Interrogation Started - there is a single validation step ──────────────
#> ℹ Step 1 is not set as active. Skipping.
#>
#> ── Interrogation Completed ────────────────────────────────────────────────
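The same template generalizes to other column predicates; for instance, a hypothetical character-column variant only swaps the predicate passed to where():

# Hypothetical variation: target character columns instead of factors, with
# the same conditional `active` guard so the step is skipped when none exist.
validate_character_completeness <- function(agent) {
  agent %>%
    col_vals_not_null(
      columns = where(is.character),
      label = "Character column `{.col}` should not have missing data",
      active = ~ . %>% has_columns(where(is.character))
    )
}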