I like to think that the big unifying theme of the new 0.12.0 release is making validation functions more dynamic. Broadly, this came in the form of improved {tidyselect} support and enabling {glue} syntax, but in the process we also unlocked some really cool but non-obvious patterns that I think may be worth documenting.
There are still some questions in my mind about the scope and structure of such a vignette (perhaps this could be part of a more general and technical vignette for intermediate-to-advanced users on "programming with pointblank" or "extending pointblank"), but I wanted to put the idea out there. I'm happy to draft something if this is of interest!
Just to sketch out a couple of examples of the new patterns:
1) glue() syntax in label allows injecting human-readable labels
The pattern is to first define a "dictionary" (a named vector mapping column names to human-readable names) and use it to substitute {.col} via indexing:
library(pointblank)
library(dplyr)

col_labels <- c(
  "lat" = "Latitude",
  "long" = "Longitude"
)

agent1a <- create_agent(storms) %>%
  col_vals_not_null(
    columns = c(lat, long),
    label = "{col_labels[.col]} information should be present"
  ) %>%
  interrogate(show_step_label = TRUE, progress = TRUE)
#>
#> ── Interrogation Started - there are 2 steps ──────────────────────────────
#> ✔ Step 1: OK. - Latitude information should be present
#> ✔ Step 2: OK. - Longitude information should be present
#>
#> ── Interrogation Completed ────────────────────────────────────────────────
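As a quick aside (not part of the validation itself), you can preview what the glue template resolves to for a given column by binding .col by hand; during interrogation, pointblank binds .col to each selected column name per step:

library(glue)

# Hypothetical standalone check of the label template; `.col` is assigned
# manually here, whereas pointblank supplies it per validation step.
.col <- "lat"
glue("{col_labels[.col]} information should be present")
#> Latitude information should be present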
Going further, we can also use all_of() from the new tidyselect integration to express both columns and labels dynamically:
agent1b <- create_agent(storms) %>%
  col_vals_not_null(
    columns = all_of(names(col_labels)),
    label = "{col_labels[.col]} information should be present"
  ) %>%
  interrogate(show_step_label = TRUE, progress = TRUE)
#>
#> ── Interrogation Started - there are 2 steps ──────────────────────────────
#> ✔ Step 1: OK. - Latitude information should be present
#> ✔ Step 2: OK. - Longitude information should be present
#>
#> ── Interrogation Completed ────────────────────────────────────────────────
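A related sketch (not from the example above): if the human-readable names already live in a small data dictionary table, tibble::deframe() can produce the same named vector, so the label dictionary stays in sync with existing documentation:

library(tibble)

# Hypothetical data dictionary table; deframe() turns a two-column tibble
# (names in the first column, values in the second) into a named vector.
data_dictionary <- tribble(
  ~column, ~description,
  "lat",   "Latitude",
  "long",  "Longitude"
)
col_labels <- deframe(data_dictionary)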
2) any_of() in columns allows smart subsetting of the columns that are present in the data from a larger pool of thematically grouped columns
For example, we might define a function that wraps col_vals_gte(), applying a rule (values must be non-negative) to a pool of columns that we know contain information about a measure of time:
validate_time_cols <- function(agent) {
  time_measures <- c("year", "month", "day", "hour", "minute", "second")

  agent %>%
    col_vals_gte(
      columns = any_of(time_measures),
      value = 0,
      label = "Measure of time `{.col}` should be positive"
    )
}
dplyr::storms records year, month, day, and hour, but not minute or second, in its current state. Only the columns that exist are picked up and validated with this single function.
time_measures <- c("year", "month", "day", "hour", "minute", "second")
intersect(colnames(storms), time_measures)
#> [1] "year"  "month" "day"   "hour"

agent2a <- create_agent(storms) %>%
  validate_time_cols() %>%
  interrogate(show_step_label = TRUE, progress = TRUE)
#>
#> ── Interrogation Started - there are 4 steps ──────────────────────────────
#> ✔ Step 1: OK. - Measure of time `year` should be positive
#> ✔ Step 2: OK. - Measure of time `month` should be positive
#> ✔ Step 3: OK. - Measure of time `day` should be positive
#> ✔ Step 4: OK. - Measure of time `hour` should be positive
#>
#> ── Interrogation Completed ────────────────────────────────────────────────
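One possible generalization (a sketch, not one of the examples above): the column pool could itself be a parameter, so the same wrapper serves other thematic groups of columns while any_of() still only validates those that exist:

# Hypothetical generalization of validate_time_cols(): the pool of candidate
# columns is passed in as a character vector.
validate_nonnegative_cols <- function(agent, candidate_cols) {
  agent %>%
    col_vals_gte(
      columns = any_of(candidate_cols),
      value = 0,
      label = "`{.col}` should be non-negative"
    )
}

# e.g. agent %>% validate_nonnegative_cols(c("year", "month", "day", "hour"))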
If the storms API is known to sometimes (but not always) include information about minute, the same function can gracefully handle that variation, thanks to any_of():
storms2 <- storms %>%
  mutate(minute = sample(0:59, n(), replace = TRUE))

agent2b <- create_agent(storms2) %>%
  validate_time_cols() %>%
  interrogate(show_step_label = TRUE, progress = TRUE)
#>
#> ── Interrogation Started - there are 5 steps ──────────────────────────────
#> ✔ Step 1: OK. - Measure of time `year` should be positive
#> ✔ Step 2: OK. - Measure of time `month` should be positive
#> ✔ Step 3: OK. - Measure of time `day` should be positive
#> ✔ Step 4: OK. - Measure of time `hour` should be positive
#> ✔ Step 5: OK. - Measure of time `minute` should be positive
#>
#> ── Interrogation Completed ────────────────────────────────────────────────
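Optionally (not shown above), pointblank's all_passed() gives a quick programmatic check on the interrogation result:

# Every step reported OK above, so this should return TRUE.
all_passed(agent2b)
#> [1] TRUE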
3) Shared tidyselect expression in columns = ... and has_columns(...) to conditionally skip a step
pointblank has always had this pattern, but it's become even simpler to reason about now that columns and has_columns() share the exact same tidyselect implementation.
For example, this function says "check for missing values in factor columns, but only if any exist in the data":
validate_factor_completeness <- function(agent) {
  agent %>%
    col_vals_not_null(
      columns = where(is.factor),
      label = "Factor `{.col}` should not have missing data",
      active = ~ . %>% has_columns(where(is.factor))
    )
}
Now we have a very general and modular function that can be applied to any dataset.
It picks out the one factor column in dplyr::storms and validates it:
agent3a <- create_agent(storms) %>%
  validate_factor_completeness() %>%
  interrogate(show_step_label = TRUE, progress = TRUE)
#>
#> ── Interrogation Started - there is a single validation step ──────────────
#> ✔ Step 1: OK. - Factor `status` should not have missing data
#>
#> ── Interrogation Completed ────────────────────────────────────────────────
It skips the step for dplyr::starwars because there are no factor columns:
agent3b <- create_agent(starwars) %>%
  validate_factor_completeness() %>%
  interrogate(show_step_label = TRUE, progress = TRUE)
#>
#> ── Interrogation Started - there is a single validation step ──────────────
#> ℹ Step 1 is not set as active. Skipping.
#>
#> ── Interrogation Completed ────────────────────────────────────────────────
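The same template generalizes to other column predicates; for instance, a hypothetical character-column variant only swaps the predicate passed to where():

# Hypothetical variation: target character columns instead of factors, with
# the same conditional `active` guard so the step is skipped when none exist.
validate_character_completeness <- function(agent) {
  agent %>%
    col_vals_not_null(
      columns = where(is.character),
      label = "Character column `{.col}` should not have missing data",
      active = ~ . %>% has_columns(where(is.character))
    )
}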