Harmonize batch distribution ++ (#359)
* Bugfix in `prepare_data()` related to the vector of approaches. When using several approaches, the old version only used the first approach. Verified this by adding a print in each prepare_data.approach() function and seeing that only the first approach in internal$parameters$approach was used. Can maybe remove code comments before the pull request is accepted. Maybe a better method to get the approach? Also updated the roxygen2 documentation for the function, as it seemed to reflect an old version of shapr(?) due to arguments which are no longer present. However, one then gets a warning when creating the roxygen2 documentation. Discuss some solutions as comments below. Discuss with Martin.
* Lars has added `n_combinations - 1` as a possibility, as the function `check_n_batches` threw an error for the vignette with the gaussian approach with `n_combinations = 8` and `n_batches = NULL`: the function then set `n_batches = 10`, which was too large. We subtract 1 as the `check_n_batches` function specifies that `n_batches` must be strictly less than `n_combinations`.
* Small typo.
* Fixed bug. All messages say "n_combinations is larger than or equal to 2^m", but all the tests only tested for "larger than". I.e., if the user specified n_combinations = 2^m in the call to shapr::explain, the function would not treat it as exact.
* Added script demonstrating the bug that shapr does not enter the exact mode when `n_combinations = 2^m`, before the bugfix.
* Added (tentative) test that checks that shapr enters exact mode when `n_combinations >= 2^m`. Remove the large comment after discussing it with Martin.
* Added script that demonstrates the bug before the bugfix, and added a test checking that we do not get an error when running the code after the bugfix has been applied.
* Fixed lint warnings in `approach.R`.
* Added two parameters to the `internal$parameters` list which contain the number of approaches and the number of unique approaches. This is for example useful for checking that the provided `n_batches` is a valid value (see next commits).
* Added test to check that `n_batches` must be larger than or equal to the number of unique approaches. Before, the user could, e.g., set `n_batches = 2` but use 4 approaches; shapr would then use 4 batches without updating `n_batches` and without giving a warning to the user.
* Updated `get_default_n_batches` to take into consideration the number of unique approaches that are used. This was not done before and gave an inconsistency between the number shapr would recommend and the number it would use when `n_batches` was set to `NULL` by the user (see the sketch after this list).
* Changed where the seed is set such that it applies to both regular and combined approaches. Furthermore, added an if test, because the previous version resulted in non-reproducible code, as setting the seed to `NULL` ruins the seed we set in `explain()`. Just consider this small example:

      # Setting the same seed gives the same value twice
      set.seed(123)
      rnorm(1)
      set.seed(123)
      rnorm(1)
      # If we also include NULL, the seed is removed and we do not get the same value;
      # setting the seed to NULL actually gives a new "random" number each time
      set.seed(123)
      set.seed(NULL)
      rnorm(1)

* Typo.
* Added test to check that setting the seed works for combined approaches.
* Fixed typo in test function.
* Added file to demonstrate the bugs (before the bugfix).
* Added new test.
* Updated tests by removing n_samples.
* Added a bugfix for shapr not using the correct number of batches. Maybe not the most elegant solution.
* Updated the demonstration script.
* Added last test and fixed lintr.
* Lint again.
* styler.
* Minor edits to tests.
* Simplified comment.
* comb files ok.
* Updated bug in the independence approach related to categorical features which caused shapr to crash later. Added comments from when I debugged to understand what was going on. I have added some comments about some things I did not understand/agree with. Discuss with Martin and correct this before merge.
* Lint warning.
* Lint.
* Lint.
* Updated test files after accepting new values.
* Adjustments to comments and Lars' TODO-comments.
* Updated snapshot file after weight adjustment.
* Cleaned up doc.
* Reran doc.
* Style.
* Changed to `n_batches = 10` in the combined approaches, as the previous value (`n_batches = 1`) is not allowed anymore since it is lower than the number of unique approaches used.
* Accept OK test changes.
* Additional OK test files.
* Changed batches in test files.
* Accept new files.
* Handled issue with a breaking-change update in the testthat package.
* + these.
* Removed the last (unused) input of approach.
* Updated tests.
* + update setup tests/snaps.
* Corrected unique length.
* Updated linting and vignette.
* Updated docs.
* Fixed example issue.
* Temporarily disabled tests on older R systems.
* Removed unnecessary if-else test.
* data.table style on Lars's batch adjustment suggestion.
* Deleted comment.
* Lint.

---------

Co-authored-by: Martin <[email protected]>
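The two `n_batches` fixes described in the list above can be summarized in a small R sketch. This is illustrative only, not the actual shapr internals: the function name, its signature, and the hard default of 10 are assumptions based on the commit message. The idea is that the default batch count is now capped at `n_combinations - 1` and floored at the number of unique approaches:

    # Illustrative sketch of the corrected default (not the actual shapr code)
    get_default_n_batches_sketch <- function(n_combinations, n_unique_approaches) {
      suggested <- 10 # assumed previous hard default for e.g. the "gaussian" approach
      max(n_unique_approaches, min(suggested, n_combinations - 1))
    }
    get_default_n_batches_sketch(n_combinations = 8, n_unique_approaches = 1) # 7, not 10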
Showing 49 changed files with 2,521 additions and 238 deletions.
139 changes: 139 additions & 0 deletions
inst/scripts/devel/demonstrate_combined_approaches_bugs.R
@@ -0,0 +1,139 @@
# Use the data objects from the helper-lm.R file.
# Here we want to illustrate three bugs related to combined approaches (before the bugfix).


# First we see that setting `n_batches` lower than the number of unique approaches
# produces some inconsistencies in shapr.
# After the bugfix, we force the user to choose a valid value for `n_batches`.
explanation_1 <- explain(
  model = model_lm_numeric,
  x_explain = x_explain_numeric,
  x_train = x_train_numeric,
  approach = c("independence", "empirical", "gaussian", "copula", "empirical"),
  prediction_zero = p0,
  n_batches = 3,
  timing = FALSE,
  seed = 1)

# It says shapr is using 3 batches
explanation_1$internal$parameters$n_batches

# But shapr has actually used 4.
# This is because shapr can only handle one type of approach for each batch.
# Hence, the number of batches must be at least as large as the number of unique approaches
# (excluding the last approach, which is not used, as we then condition on all features).
length(explanation_1$internal$objects$S_batch)

# Note that after the bugfix, we give an error if `n_batches` < the number of unique approaches.
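# A minimal sketch (illustrative only, not the actual shapr internals) of the new check:
# the last element of `approach` is not used, so the count of unique approaches is taken
# over the remaining elements.
approach_vec <- c("independence", "empirical", "gaussian", "copula", "empirical")
n_unique_approaches <- length(unique(head(approach_vec, -1))) # 4
n_batches_requested <- 3
n_batches_requested < n_unique_approaches # TRUE, so shapr now throws an error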

# Second, we look at another situation where the number of unique approaches is two and we set
# `n_batches = 2`, but shapr still uses three batches. This is due to how shapr decides how many
# batches each approach should get. Right now it decides based on the proportion of the number
# of coalitions each approach is responsible for. In this setting, independence is responsible
# for 5 coalitions and ctree for 25 coalitions. So, initially shapr sets that ctree should get
# the two batches while independence gets 0, but this is then changed to 1 without considering
# that it now breaks the consistency with `n_batches`.
# This is done in the function `create_S_batch_new()` in setup_computation.R.
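# A minimal sketch (illustrative only, not the actual shapr code) of the proportional
# allocation that breaks the requested total:
n_coalitions <- c(independence = 5, ctree = 25)
batches <- round(2 * n_coalitions / sum(n_coalitions)) # independence gets 0, ctree gets 2
batches[batches == 0] <- 1 # every approach needs at least one batch ...
sum(batches) # ... so the total is now 3, not the requested 2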
explanation_2 <- explain(
  model = model_lm_numeric,
  x_explain = x_explain_numeric,
  x_train = x_train_numeric,
  approach = c("independence", "ctree", "ctree", "ctree", "ctree"),
  prediction_zero = p0,
  n_batches = 2,
  timing = FALSE,
  seed = 1)

# It says shapr is using 2 batches
explanation_2$internal$parameters$n_batches

# But shapr has actually used 3
length(explanation_2$internal$objects$S_batch)

# These are equal after the bugfix


# Same type of bug, but in the opposite direction
explanation_3 <- explain(
  model = model_lm_numeric,
  x_explain = x_explain_numeric,
  x_train = x_train_numeric,
  approach = c("independence", "ctree", "ctree", "ctree", "ctree"),
  prediction_zero = p0,
  n_batches = 15,
  timing = FALSE,
  seed = 1)

# It says shapr is using 15 batches
explanation_3$internal$parameters$n_batches

# But shapr has actually used 14
length(explanation_3$internal$objects$S_batch)

# These are equal after the bugfix


# Bug number three caused shapr to not be reproducible, as setting the seed did not work
# for combined approaches. This was due to a `set.seed(NULL)` call which ruins all of the
# earlier set.seed procedures.
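# Minimal illustration of the problem (from the commit message):
set.seed(123)
rnorm(1) # reproducible draw

set.seed(123)
set.seed(NULL) # resets the RNG state and discards the seed set just above
rnorm(1) # gives a new "random" value each time the script is run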

# Check that setting the seed works for a combination of approaches.
# Here `n_batches` is set to `4`, so one batch for each method,
# i.e., no randomness.
# In the first example we get no bug, as there is no randomness in assigning the batches.
explanation_combined_1 <- explain(
  model = model_lm_numeric,
  x_explain = x_explain_numeric,
  x_train = x_train_numeric,
  approach = c("independence", "empirical", "gaussian", "copula", "empirical"),
  prediction_zero = p0,
  timing = FALSE,
  seed = 1)

explanation_combined_2 <- explain(
  model = model_lm_numeric,
  x_explain = x_explain_numeric,
  x_train = x_train_numeric,
  approach = c("independence", "empirical", "gaussian", "copula", "empirical"),
  prediction_zero = p0,
  timing = FALSE,
  seed = 1)

# Check that they are equal
all.equal(explanation_combined_1, explanation_combined_2)


# Here `n_batches` is set to `10`, so NOT one batch for each method,
# i.e., there is randomness in assigning the batches.
explanation_combined_3 <- explain(
  model = model_lm_numeric,
  x_explain = x_explain_numeric,
  x_train = x_train_numeric,
  approach = c("independence", "empirical", "gaussian", "copula", "ctree"),
  prediction_zero = p0,
  timing = FALSE,
  seed = 1)

explanation_combined_4 <- explain(
  model = model_lm_numeric,
  x_explain = x_explain_numeric,
  x_train = x_train_numeric,
  approach = c("independence", "empirical", "gaussian", "copula", "ctree"),
  prediction_zero = p0,
  timing = FALSE,
  seed = 1)

# Check that they are not equal (before the bugfix)
all.equal(explanation_combined_3, explanation_combined_4)
explanation_combined_3$internal$objects$X
explanation_combined_4$internal$objects$X

# These are equal after the bugfix
@@ -0,0 +1,54 @@
# In this script we demonstrate that (before the bugfix) the `explain()` function
# does not enter the exact mode when n_combinations is larger than or equal to 2^m.
# The mode is only changed if n_combinations is strictly larger than 2^m.
# This means that we end up using all coalitions when n_combinations is 2^m,
# but do not use the exact Shapley kernel weights.
# The bugfix replaces `>` with `>=` in the places where the code tests whether
# n_combinations is larger than or equal to 2^m. Then the text/messages printed
# by shapr and the code correspond.
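# A minimal sketch (illustrative only, not the exact shapr code) of the comparison fix:
m <- 4 # number of features
n_combinations <- 2^m
n_combinations > 2^m # FALSE: the old check, so exact mode was not entered at exactly 2^m
n_combinations >= 2^m # TRUE: the fixed check enters exact mode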

library(xgboost)
library(data.table)

data("airquality")
data <- data.table::as.data.table(airquality)
data <- data[complete.cases(data), ]

x_var <- c("Solar.R", "Wind", "Temp", "Month")
y_var <- "Ozone"

ind_x_explain <- 1:6
x_train <- data[-ind_x_explain, ..x_var]
y_train <- data[-ind_x_explain, get(y_var)]
x_explain <- data[ind_x_explain, ..x_var]

# Fitting a basic xgboost model to the training data
model <- xgboost::xgboost(
  data = as.matrix(x_train),
  label = y_train,
  nrounds = 20,
  verbose = FALSE
)

# Specifying phi_0, i.e., the expected prediction without any features
p0 <- mean(y_train)

# Shapr sets the default number of batches to 10 for this dataset for the
# "ctree", "gaussian", and "copula" approaches. Thus, setting `n_combinations`
# to any value lower than or equal to 10 causes the error.
any_number_equal_or_below_10 <- 8

# Before the bugfix, shapr:::check_n_batches() throws the error:
# Error in check_n_batches(internal) :
#   `n_batches` (10) must be smaller than the number feature combinations/`n_combinations` (8)
# The bug only occurs for "ctree", "gaussian", and "copula" as they are treated differently in
# `get_default_n_batches()`; I am not certain why. Ask Martin about the logic behind that.
explanation <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  n_samples = 2, # Low value for fast computations
  approach = "gaussian",
  prediction_zero = p0,
  n_combinations = any_number_equal_or_below_10
)