diff --git a/NEWS.md b/NEWS.md index 8996565..4c3eaea 100644 --- a/NEWS.md +++ b/NEWS.md @@ -1,7 +1,41 @@ # shapviz 0.9.3 -## User-visible changes +## `sv_dependence()`: Control over automatic color feature selection + +### How is the color feature selected anyway? + +If no SHAP interaction values are available, by default, the color feature `v'` is selected by the heuristic `potential_interaction()`, which works as follows: + +1. If the feature `v` (the on the x-axis) is numeric, it is binned into `nbins` bins. +2. Per bin, the SHAP values of `v` are regressed onto `v` and the R-squared is calculated. +3. The R-squared are averaged over bins, weighted by the bin size. + +This measures how much variability in the SHAP values is explained by `v'`, after accounting for `v`. + +We have introduced four parameters to control the heuristic. Their defaults are in line with the old behaviour. + +- `nbin = NULL`: Into how many quantile bins should a numeric `v` be binned? The default `NULL` equals the smaller of $n/20$ and $\sqrt n$ (rounded up), where $n$ is the sample size. +- `color_num` Should color features be converted to numeric, even if they are factors/characters? Default is `TRUE`. +- `scale = FALSE`: Should R-squared be multiplied with the sample variance of +within-bin SHAP values? If `TRUE`, bins with stronger vertical scatter will get higher weight. The default is `FALSE`. +- `adjusted = FALSE`: Should *adjusted* R-squared be calculated? + +If SHAP interaction values are available, these parameters have no effect. In `sv_dependence()` they are called `ih_nbin` etc. + +This partly implements the ideas in [#119](https://github.com/ModelOriented/shapviz/issues/119) of Roel Verbelen, thanks a lot for your patient explanations! + +### Further plans? + +We will continue to experiment with the defaults, which might change in the future. A good alternative to the current (naive) defaults could be: + +- `nbins = 7`: Smaller than now to not overfit too strongly with factor/character color features. +- `color_num = FALSE`: To not naively integer encode factors/characters. +- `scale = TRUE`: To account for non-equal spread in bins. +- `adjusted = TRUE`: To not put too much weight on factors with many categories. + +## Other user-visible changes +- `sv_dependence()`: If `color_var = "auto"` (default) and no color feature seems to be relevant (SHAP interaction is `NULL`, or heuristic returns no positive value), there won't be any color scale. - `mshapviz()` objects can now be rowbinded via `rbind()` or `+`. Implemented by [@jmaspons](https://github.com/jmaspons) in [#110](https://github.com/ModelOriented/shapviz/pull/110). - `mshapviz()` is more strict when combining multiple "shapviz" objects. These now need to have identical column names, see [#114](https://github.com/ModelOriented/shapviz/pull/114). @@ -9,7 +43,7 @@ - `print.shapviz()` now shows top two rows of SHAP matrix. - Re-activate all unit tests. -- Setting `nthread = 1` in all calls to `xgb.DMatrix()` as suggested by [@jmaspons](https://github.com/jmaspons) in [issue #109](https://github.com/ModelOriented/shapviz/issues/109). +- Setting `nthread = 1` in all calls to `xgb.DMatrix()` as suggested by [@jmaspons](https://github.com/jmaspons) in [#109](https://github.com/ModelOriented/shapviz/issues/109). - Added "How to contribute" to README. - `permshap()` connector is now part of {kerneshap} [#122](https://github.com/ModelOriented/shapviz/pull/122). diff --git a/R/potential_interactions.R b/R/potential_interactions.R new file mode 100644 index 0000000..fec72e7 --- /dev/null +++ b/R/potential_interactions.R @@ -0,0 +1,153 @@ +#' Interaction Strength +#' +#' Returns vector of interaction strengths between variable `v` and all other variables, +#' see Details. +#' +#' If SHAP interaction values are available, the interaction strength +#' between feature `v` and another feature `v'` is measured by twice their +#' mean absolute SHAP interaction values. +#' +#' Otherwise, we use a heuristic calculated as follows to calculate interaction strength +#' between `v` and each other "color" feature `v': +#' 1. If `v` is numeric, it is binned into `nbins` bins. +#' 2. Per bin, the SHAP values of `v` are regressed onto `v`, and the R-squared +#' is calculated. +#' 3. The R-squared are averaged over bins, weighted by the bin size. +#' +#' Set `scale = TRUE` to multiply the R-squared by the within-bin variance +#' of the SHAP values. This will put higher weight to bins with larger scatter. +#' +#' Set `color_num = FALSE` to *not* turn the values of the "color" feature `v'` +#' to numeric. +#' +#' Finally, set `adjusted = TRUE` to use *adjusted* R-squared. +#' +#' @param obj An object of class "shapviz". +#' @param v Variable name to calculate potential SHAP interactions for. +#' @param nbins Into how many quantile bins should a numeric `v` be binned? +#' The default `NULL` equals the smaller of \eqn{n/20} and \eqn{\sqrt n} (rounded up), +#' where \eqn{n} is the sample size. Ignored if `obj` contains SHAP interactions. +#' @param color_num Should other ("color") features `v'` be converted to numeric, +#' even if they are factors/characters? Default is `TRUE`. +#' Ignored if `obj` contains SHAP interactions. +#' @param scale Should adjusted R-squared be multiplied with the sample variance of +#' within-bin SHAP values? If `TRUE`, bins with stronger vertical scatter will get +#' higher weight. The default is `FALSE`. Ignored if `obj` contains SHAP interactions. +#' @param adjusted Should *adjusted* R-squared be used? Default is `FALSE`. +#' @returns A named vector of decreasing interaction strengths. +#' @export +#' @seealso [sv_dependence()] +potential_interactions <- function(obj, v, nbins = NULL, color_num = TRUE, + scale = FALSE, adjusted = FALSE) { + stopifnot(is.shapviz(obj)) + S <- get_shap_values(obj) + S_inter <- get_shap_interactions(obj) + X <- get_feature_values(obj) + nms <- colnames(obj) + v_other <- setdiff(nms, v) + stopifnot(v %in% nms) + if (ncol(obj) <= 1L) { + return(NULL) + } + + # Simple case: we have SHAP interaction values + if (!is.null(S_inter)) { + return(sort(2 * colMeans(abs(S_inter[, v, ]))[v_other], decreasing = TRUE)) + } + + # Complicated case: calculate heuristic per color variable + if (is.null(nbins)) { + nbins <- ceiling(min(sqrt(nrow(X)), nrow(X) / 20)) + } + out <- vapply( + X[v_other], # data.frame is a list + FUN = heuristic, + FUN.VALUE = 1.0, + s = S[, v], + bins = .fast_bin(X[[v]], nbins = nbins), + color_num = color_num, + scale = scale, + adjusted = adjusted + ) + sort(out, decreasing = TRUE, na.last = TRUE) +} + +#' Interaction Heuristic +#' +#' Internal function used to calculate the interaction heuristics described in +#' [potential_interactions()]. It calls `heuristic_in_bin()` per bin and aggregates +#' the result. +#' +#' @noRd +#' @keywords internal +#' +#' @inheritParams potential_interactions +#' @param color Feature values of the "color" feature. +#' @param s SHAP values of `v`. +#' @returns A single number. +heuristic <- function(color, s, bins, color_num, scale, adjusted) { + if (isTRUE(color_num)) { + color <- .as_numeric(color) + } + color <- split(color, bins) + s <- split(s, bins) + M <- mapply( + heuristic_in_bin, + color = color, + s = s, + MoreArgs = list(scale = scale, adjusted = adjusted) + ) + stats::weighted.mean(M[1L, ], M[2L, ], na.rm = TRUE) +} + +#' Interaction Heuristic in Bin +#' +#' Internal function used to calculate the within-bin heuristic used in `heuristic()`. +#' See `heuristic()` for details. +#' +#' @noRd +#' @keywords internal +#' +#' @inheritParams heuristic +#' @returns +#' A (1x2) matrix with heuristic and number of observations. +heuristic_in_bin <- function(color, s, scale = FALSE, adjusted = FALSE) { + suppressWarnings( + tryCatch( + { + z <- stats::lm(s ~ color) + r <- z$residuals + n <- length(r) + var_y <- stats::var(z$fitted.values + r) + denom <- if (adjusted) z$df.residual else n - 1 + var_r <- sum(r^2) / denom + stat <- 1 - var_r / var_y + if (scale) { + stat <- stat * var_y + } + cbind(stat = stat, n = n) + }, + error = function(e) return(cbind(stat = NA, n = 0)) + ) + ) +} + +# Like as.numeric(), but can deal with factor variables +.as_numeric <- function(z) { + if (is.numeric(z)) { + return(z) + } + if (is.character(z)) { + z <- factor(z) + } + as.numeric(z) +} + +# Bins discrete z into integer valued bins +.fast_bin <- function(z, nbins) { + if (.is_discrete(z, n_unique = nbins)) { + return(z) + } + q <- stats::quantile(z, seq(0, 1, length.out = nbins + 1L), na.rm = TRUE) + findInterval(z, unique(q), rightmost.closed = TRUE) +} diff --git a/R/sv_dependence.R b/R/sv_dependence.R index 8205b50..5b6f180 100644 --- a/R/sv_dependence.R +++ b/R/sv_dependence.R @@ -3,6 +3,8 @@ #' Scatterplot of the SHAP values of a feature against its feature values. #' If SHAP interaction values are available, setting `interactions = TRUE` allows #' to focus on pure interaction effects (multiplied by two) or on pure main effects. +#' By default, the feature on the color scale is selected via SHAP interactions +#' (if available) or an interaction heuristic, see [potential_interactions()]. #' #' @importFrom rlang .data #' @@ -31,6 +33,9 @@ #' Requires SHAP interaction values. If `color_var = NULL` (or it is equal to `v`), #' the pure main effect of `v` is visualized. Otherwise, twice the SHAP interaction #' values between `v` and the `color_var` are plotted. +#' @param ih_nbins,ih_color_num,ih_scale,ih_adjusted Interaction heuristic (ih) +#' parameters used to select the color variable, see [potential_interactions()]. +#' Only used if `color_var = "auto"` and if there are no SHAP interaction values. #' @param ... Arguments passed to [ggplot2::geom_jitter()]. #' @returns An object of class "ggplot" (or "patchwork") representing a dependence plot. #' @examples @@ -71,7 +76,9 @@ sv_dependence.default <- function(object, ...) { #' @export sv_dependence.shapviz <- function(object, v, color_var = "auto", color = "#3b528b", viridis_args = getOption("shapviz.viridis_args"), - jitter_width = NULL, interactions = FALSE, ...) { + jitter_width = NULL, interactions = FALSE, + ih_nbins = NULL, ih_color_num = TRUE, + ih_scale = FALSE, ih_adjusted = FALSE, ...) { p <- length(v) if (p > 1L || length(color_var) > 1L) { if (is.null(color_var)) { @@ -90,6 +97,10 @@ sv_dependence.shapviz <- function(object, v, color_var = "auto", color = "#3b528 object = object, viridis_args = viridis_args, interactions = interactions, + ih_nbins = ih_nbins, + ih_color_num = ih_color_num, + ih_scale = ih_scale, + ih_adjusted = ih_adjusted, ... ), SIMPLIFY = FALSE @@ -116,10 +127,20 @@ sv_dependence.shapviz <- function(object, v, color_var = "auto", color = "#3b528 jitter_width <- 0.2 * .is_discrete(X[[v]], n_unique = 7L) } - # Set color value + # Set color value if "auto" if (!is.null(color_var) && color_var == "auto" && !("auto" %in% nms)) { - scores <- potential_interactions(object, v) - color_var <- names(scores)[1L] # NULL if p = 1L + scores <- potential_interactions( + object, + v, + nbins = ih_nbins, + color_num = ih_color_num, + scale = ih_scale, + adjusted = ih_adjusted + ) + # 'scores' can be NULL, or a numeric vector like c(0.1, 0, -0.01, NaN, NA) + # Thus, let's take the first positive one (or none) + scores <- scores[!is.na(scores) & scores > 0] # NULL stays NULL + color_var <- if (length(scores) >= 1L) names(scores)[1L] } if (isTRUE(interactions)) { if (is.null(color_var)) { @@ -169,7 +190,9 @@ sv_dependence.shapviz <- function(object, v, color_var = "auto", color = "#3b528 #' @export sv_dependence.mshapviz <- function(object, v, color_var = "auto", color = "#3b528b", viridis_args = getOption("shapviz.viridis_args"), - jitter_width = NULL, interactions = FALSE, ...) { + jitter_width = NULL, interactions = FALSE, + ih_nbins = NULL, ih_color_num = TRUE, + ih_scale = FALSE, ih_adjusted = FALSE, ...) { stopifnot( length(v) == 1L, length(color_var) <= 1L @@ -184,75 +207,19 @@ sv_dependence.mshapviz <- function(object, v, color_var = "auto", color = "#3b52 viridis_args = viridis_args, jitter_width = jitter_width, interactions = interactions, + ih_nbins = ih_nbins, + ih_color_num = ih_color_num, + ih_scale = ih_scale, + ih_adjusted = ih_adjusted, ... ) plot_list <- add_titles(plot_list, nms = names(object)) # see sv_waterfall() patchwork::wrap_plots(plot_list) } -#' Interaction Strength -#' -#' Returns vector of interaction strengths between variable `v` and all other variables. -#' -#' If SHAP interaction values are available, interaction strength -#' between feature `v` and another feature `v'` is measured by twice their -#' mean absolute SHAP interaction values. Otherwise, we use as heuristic the -#' squared correlation between feature values of `v'` and -#' SHAP values of `v`, averaged over (binned) values of `v`. -#' A numeric `v` with more than `n_bins` unique values is binned into quantile bins. -#' Currently `n_bins` equals the smaller of \eqn{n/20} and \eqn{\sqrt n}, where \eqn{n} -#' is the sample size. -#' The average squared correlation is weighted by the number of non-missing feature -#' values in the bin. Note that non-numeric color features are turned to numeric -#' by calling [data.matrix()], which does not necessarily make sense. -#' -#' @param obj An object of class "shapviz". -#' @param v Variable name. -#' @returns A named vector of decreasing interaction strengths. -#' @export -#' @seealso [sv_dependence()] -potential_interactions <- function(obj, v) { - stopifnot(is.shapviz(obj)) - S <- get_shap_values(obj) - S_inter <- get_shap_interactions(obj) - X <- get_feature_values(obj) - nms <- colnames(obj) - v_other <- setdiff(nms, v) - stopifnot(v %in% nms) - - if (ncol(obj) <= 1L) { - return(NULL) - } - - # Simple case: we have SHAP interaction values - if (!is.null(S_inter)) { - return(sort(2 * colMeans(abs(S_inter[, v, ]))[v_other], decreasing = TRUE)) - } - - # Complicated case: we need to rely on correlation based heuristic - r_sq <- function(s, x) { - suppressWarnings(stats::cor(s, data.matrix(x), use = "p")^2) - } - n_bins <- ceiling(min(sqrt(nrow(X)), nrow(X) / 20)) - v_bin <- .fast_bin(X[[v]], n_bins = n_bins) - s_bin <- split(S[, v], v_bin) - X_bin <- split(X[v_other], v_bin) - w <- do.call(rbind, lapply(X_bin, function(z) colSums(!is.na(z)))) - cor_squared <- do.call(rbind, mapply(r_sq, s_bin, X_bin, SIMPLIFY = FALSE)) - sort(colSums(w * cor_squared, na.rm = TRUE) / colSums(w), decreasing = TRUE) -} - # Helper functions +# Checks if z is discrete .is_discrete <- function(z, n_unique) { is.factor(z) || is.character(z) || is.logical(z) || (length(unique(z)) <= n_unique) } - -# Bins z into integer valued bins, but only if discrete -.fast_bin <- function(z, n_bins) { - if (.is_discrete(z, n_unique = n_bins)) { - return(z) - } - q <- stats::quantile(z, seq(0, 1, length.out = n_bins + 1L), na.rm = TRUE) - findInterval(z, unique(q), rightmost.closed = TRUE) -} diff --git a/man/figures/VIGNETTE-dep-ranger.png b/man/figures/VIGNETTE-dep-ranger.png index 75abf16..d083d8a 100644 Binary files a/man/figures/VIGNETTE-dep-ranger.png and b/man/figures/VIGNETTE-dep-ranger.png differ diff --git a/man/figures/VIGNETTE-dep.png b/man/figures/VIGNETTE-dep.png index 0dac117..065e6b4 100644 Binary files a/man/figures/VIGNETTE-dep.png and b/man/figures/VIGNETTE-dep.png differ diff --git a/man/potential_interactions.Rd b/man/potential_interactions.Rd index 7c938a0..5cd141c 100644 --- a/man/potential_interactions.Rd +++ b/man/potential_interactions.Rd @@ -1,34 +1,65 @@ % Generated by roxygen2: do not edit by hand -% Please edit documentation in R/sv_dependence.R +% Please edit documentation in R/potential_interactions.R \name{potential_interactions} \alias{potential_interactions} \title{Interaction Strength} \usage{ -potential_interactions(obj, v) +potential_interactions( + obj, + v, + nbins = NULL, + color_num = TRUE, + scale = FALSE, + adjusted = FALSE +) } \arguments{ \item{obj}{An object of class "shapviz".} -\item{v}{Variable name.} +\item{v}{Variable name to calculate potential SHAP interactions for.} + +\item{nbins}{Into how many quantile bins should a numeric \code{v} be binned? +The default \code{NULL} equals the smaller of \eqn{n/20} and \eqn{\sqrt n} (rounded up), +where \eqn{n} is the sample size. Ignored if \code{obj} contains SHAP interactions.} + +\item{color_num}{Should other ("color") features \verb{v'} be converted to numeric, +even if they are factors/characters? Default is \code{TRUE}. +Ignored if \code{obj} contains SHAP interactions.} + +\item{scale}{Should adjusted R-squared be multiplied with the sample variance of +within-bin SHAP values? If \code{TRUE}, bins with stronger vertical scatter will get +higher weight. The default is \code{FALSE}. Ignored if \code{obj} contains SHAP interactions.} + +\item{adjusted}{Should \emph{adjusted} R-squared be used? Default is \code{FALSE}.} } \value{ A named vector of decreasing interaction strengths. } \description{ -Returns vector of interaction strengths between variable \code{v} and all other variables. +Returns vector of interaction strengths between variable \code{v} and all other variables, +see Details. } \details{ -If SHAP interaction values are available, interaction strength +If SHAP interaction values are available, the interaction strength between feature \code{v} and another feature \verb{v'} is measured by twice their -mean absolute SHAP interaction values. Otherwise, we use as heuristic the -squared correlation between feature values of \verb{v'} and -SHAP values of \code{v}, averaged over (binned) values of \code{v}. -A numeric \code{v} with more than \code{n_bins} unique values is binned into quantile bins. -Currently \code{n_bins} equals the smaller of \eqn{n/20} and \eqn{\sqrt n}, where \eqn{n} -is the sample size. -The average squared correlation is weighted by the number of non-missing feature -values in the bin. Note that non-numeric color features are turned to numeric -by calling \code{\link[=data.matrix]{data.matrix()}}, which does not necessarily make sense. +mean absolute SHAP interaction values. + +Otherwise, we use a heuristic calculated as follows to calculate interaction strength +between \code{v} and each other "color" feature `v': +\enumerate{ +\item If \code{v} is numeric, it is binned into \code{nbins} bins. +\item Per bin, the SHAP values of \code{v} are regressed onto \code{v}, and the R-squared +is calculated. +\item The R-squared are averaged over bins, weighted by the bin size. +} + +Set \code{scale = TRUE} to multiply the R-squared by the within-bin variance +of the SHAP values. This will put higher weight to bins with larger scatter. + +Set \code{color_num = FALSE} to \emph{not} turn the values of the "color" feature \verb{v'} +to numeric. + +Finally, set \code{adjusted = TRUE} to use \emph{adjusted} R-squared. } \seealso{ \code{\link[=sv_dependence]{sv_dependence()}} diff --git a/man/sv_dependence.Rd b/man/sv_dependence.Rd index ec08909..2d2493f 100644 --- a/man/sv_dependence.Rd +++ b/man/sv_dependence.Rd @@ -19,6 +19,10 @@ sv_dependence(object, ...) viridis_args = getOption("shapviz.viridis_args"), jitter_width = NULL, interactions = FALSE, + ih_nbins = NULL, + ih_color_num = TRUE, + ih_scale = FALSE, + ih_adjusted = FALSE, ... ) @@ -30,6 +34,10 @@ sv_dependence(object, ...) viridis_args = getOption("shapviz.viridis_args"), jitter_width = NULL, interactions = FALSE, + ih_nbins = NULL, + ih_color_num = TRUE, + ih_scale = FALSE, + ih_adjusted = FALSE, ... ) } @@ -67,6 +75,10 @@ Can be a vector/list if \code{v} is a vector.} Requires SHAP interaction values. If \code{color_var = NULL} (or it is equal to \code{v}), the pure main effect of \code{v} is visualized. Otherwise, twice the SHAP interaction values between \code{v} and the \code{color_var} are plotted.} + +\item{ih_nbins, ih_color_num, ih_scale, ih_adjusted}{Interaction heuristic (ih) +parameters used to select the color variable, see \code{\link[=potential_interactions]{potential_interactions()}}. +Only used if \code{color_var = "auto"} and if there are no SHAP interaction values.} } \value{ An object of class "ggplot" (or "patchwork") representing a dependence plot. @@ -75,6 +87,8 @@ An object of class "ggplot" (or "patchwork") representing a dependence plot. Scatterplot of the SHAP values of a feature against its feature values. If SHAP interaction values are available, setting \code{interactions = TRUE} allows to focus on pure interaction effects (multiplied by two) or on pure main effects. +By default, the feature on the color scale is selected via SHAP interactions +(if available) or an interaction heuristic, see \code{\link[=potential_interactions]{potential_interactions()}}. } \section{Methods (by class)}{ \itemize{ diff --git a/tests/testthat/test-potential_interactions.R b/tests/testthat/test-potential_interactions.R new file mode 100644 index 0000000..ba22b10 --- /dev/null +++ b/tests/testthat/test-potential_interactions.R @@ -0,0 +1,106 @@ +dtrain <- xgboost::xgb.DMatrix( + data.matrix(iris[, -1L]), label = iris[, 1L], nthread = 1L +) +fit <- xgboost::xgb.train(params = list(nthread = 1L), data = dtrain, nrounds = 10L) +x <- shapviz(fit, X_pred = dtrain, X = iris[, -1L]) + +test_that("nbins has no effect for factor v", { + expect_equal( + potential_interactions(x, "Species", nbins = NULL), + potential_interactions(x, "Species", nbins = 2) + ) +}) + +test_that("nbins has an effect for numeric v", { + expect_false( + identical( + potential_interactions(x, "Sepal.Width", nbins = 2), + potential_interactions(x, "Sepal.Width", nbins = 3) + ) + ) +}) + +test_that("'adjusted' leads to smaller values", { + p1 <- potential_interactions(x, "Sepal.Width", adjusted = FALSE) + p2 <- potential_interactions(x, "Sepal.Width", adjusted = TRUE) + expect_true(all(p1 > p2)) +}) + +test_that("color_num has an effect only for non-numeric features", { + p1 <- potential_interactions(x, "Sepal.Width", color_num = TRUE) + p2 <- potential_interactions(x, "Sepal.Width", color_num = FALSE) + num <- c("Petal.Width", "Petal.Length") + expect_equal(p1[num], p2[num]) + expect_false(p1["Species"] == p2["Species"]) +}) + +test_that("potential_interactions respects true SHAP interactions", { + xi <- shapviz(fit, X_pred = dtrain, X = iris[, -1L], interactions = TRUE) + i1 <- potential_interactions(xi, "Species") + i2 <- sv_interaction(xi, kind = "no")[names(i1), "Species"] + expect_equal(i1, i2, tolerance = 1e-5) +}) + +test_that("heuristic_in_bin() returns R-squared", { + fit_lm <- lm(Sepal.Length ~ Species, data = iris) + expect_equal( + unname(heuristic_in_bin(iris$Species, iris$Sepal.Length)[1, 1]), + summary(fit_lm)[["r.squared"]] + ) + expect_equal( + unname(heuristic_in_bin(iris$Species, iris$Sepal.Length, adjusted = TRUE)[1, 1]), + summary(fit_lm)[["adj.r.squared"]] + ) + + # Scaled + expect_equal( + unname(heuristic_in_bin(iris$Species, iris$Sepal.Length, scale = TRUE)[1, 1]), + summary(fit_lm)[["r.squared"]] * var(iris$Sepal.Length) + ) + expect_equal( + unname( + heuristic_in_bin( + iris$Species, iris$Sepal.Length, scale = TRUE, adjusted = TRUE)[1, 1] + ), + summary(fit_lm)[["adj.r.squared"]] * var(iris$Sepal.Length) + ) +}) + +test_that("Failing heuristic_in_bin() returns NA", { + expect_equal(heuristic_in_bin(0, 1:2), cbind(stat = NA, n = 0)) +}) + +test_that("heuristic() returns average R-squared", { + ix <- c(rep(1, 60), rep(2, 90)) + y <- split(iris$Sepal.Length, ix) + x1 <- split(iris$Sepal.Width, ix) + x2 <- split(iris$Species, ix) + f <- function(y, x) { + summary(lm(y ~ x))[["r.squared"]] + } + + expect_equal( + heuristic( + iris$Sepal.Width, + iris$Sepal.Length, + bins = ix, + color_num = TRUE, + scale = FALSE, + adjusted = FALSE + ), + weighted.mean(mapply(f, y, x1), c(60, 90)) + ) + + expect_equal( + heuristic( + iris$Species, + iris$Sepal.Length, + bins = ix, + color_num = FALSE, + scale = FALSE, + adjusted = FALSE + ), + weighted.mean(mapply(f, y, x2), c(60, 90)) + ) +}) + diff --git a/vignettes/basic_use.Rmd b/vignettes/basic_use.Rmd index 9f99858..b993f7f 100644 --- a/vignettes/basic_use.Rmd +++ b/vignettes/basic_use.Rmd @@ -23,7 +23,7 @@ knitr::opts_chunk$set( ## Overview -SHAP (SHapley Additive exPlanations, see @lundberg2017) is an ingenious way to study black box models. SHAP values decompose - as fair as possible - predictions into additive feature contributions. Crunching SHAP values requires clever algorithms by clever people. Analyzing them, however, is super easy with the right visualizations. {shapviz} offers the latter: +SHAP (SHapley Additive exPlanations, see @lundberg2017) is an ingenious way to study black box models. SHAP values decompose - as fair as possible - predictions into additive feature contributions. Crunching SHAP values requires clever algorithms. Analyzing them, however, is super easy with the right visualizations. {shapviz} offers the latter: - `sv_importance()`: Importance plots (bar plots and/or beeswarm plots) to study variable importance. - `sv_dependence()` and `sv_dependence2D()`: Dependence plots to study feature effects and interactions. diff --git a/vignettes/multiple_output.Rmd b/vignettes/multiple_output.Rmd index b664a9c..084eb30 100644 --- a/vignettes/multiple_output.Rmd +++ b/vignettes/multiple_output.Rmd @@ -98,7 +98,7 @@ library(patchwork) library(ranger) # Model -fit <- ranger(Species ~ ., data = iris, num.trees = 100, probability = TRUE) +fit <- ranger(Species ~ ., data = iris, num.trees = 100, probability = TRUE, seed = 1) # "mshapviz" object x <- kernelshap(fit, X = iris[-5], bg_X = iris)