Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improving the efficiency of the Gaussian and (Gaussian) copula methods #366

Merged
merged 64 commits into from
Jan 15, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
64 commits
Select commit Hold shift + click to select a range
d637952
Comparing different versions of Gaussian method
LHBO Dec 4, 2023
4c6bead
Hashed out large `n_samples`
LHBO Dec 4, 2023
cbf53a2
Commented out shapr
LHBO Dec 4, 2023
8045102
Merge master into branch
LHBO Dec 4, 2023
87fdc54
Moved file to `inst/scripts/` as tests fails
LHBO Dec 4, 2023
a91e3ad
adding rnorm alternative for version 5
martinju Dec 8, 2023
0d67b17
Added new version of v5_rnorm without sweep. 20% faster
LHBO Dec 12, 2023
9d0ef5b
Added gaussian cpp code
LHBO Dec 18, 2023
7737587
added gaussian cpp in comparison file
LHBO Dec 18, 2023
9d493a5
update cpp supporting files
LHBO Dec 18, 2023
6d2eb56
Removed redudant parameters
LHBO Dec 18, 2023
ed5b87a
logical error used `n_rows` instead of `n_cols`
LHBO Dec 18, 2023
530c816
Prøvde å gjøre koden raskere, men blitt tregere(?)
LHBO Dec 18, 2023
1e8c593
typo
LHBO Dec 18, 2023
e4eb606
Tried to see if I could make it speed it up.
LHBO Dec 20, 2023
3f038eb
Here we add different cpp versions to find the fastest one
LHBO Jan 4, 2024
b0c2180
Arma::cube with efficient indicing is the fastest
LHBO Jan 4, 2024
75aea94
Added cpp functions here so we have them in the future (if needed)
LHBO Jan 5, 2024
b1baed1
Made cube_v2 the main cpp version and removed others
LHBO Jan 5, 2024
4162364
Updated `approach_gaussian()` to use Cpp code. Verified that we get t…
LHBO Jan 5, 2024
adf57e5
Moved function to inst/compare_gaussian.R so it is still runable.
LHBO Jan 5, 2024
74b1ae2
Added `#' @inheritParams default_doc`
LHBO Jan 5, 2024
9590605
Changed order of parameters
LHBO Jan 5, 2024
df9777d
Removed `if (!is.null(index_features))` in `approach_gaussian()`.
LHBO Jan 5, 2024
11819e8
Need to clean code. Push in case something happens with my computer.
LHBO Jan 5, 2024
92ccff3
approach_gaussian: removed not used variables + stylr + added `data.t…
LHBO Jan 6, 2024
ed41f91
Removed quotation marks in dt
LHBO Jan 6, 2024
8833bb7
Start to clean up copula.cpp
LHBO Jan 6, 2024
5a20f69
deleted not used functions in copula.cpp
LHBO Jan 6, 2024
63a911f
Roxygen gaussian
LHBO Jan 7, 2024
24ec723
Roxygen
LHBO Jan 7, 2024
b7ece67
Cleaned up Copula.cpp
LHBO Jan 7, 2024
bc3df32
Updated approach_copula.R
LHBO Jan 7, 2024
0ad95d2
Roxygen
LHBO Jan 7, 2024
ffba299
Deleted unnecessary/old R code from approach_copula.R
LHBO Jan 7, 2024
cf9a5b4
Updated RcppExports
LHBO Jan 7, 2024
3a14985
Renamed illustration script
LHBO Jan 7, 2024
da8f03c
Adding script to compare the R and C++ versions.
LHBO Jan 7, 2024
ff0b2b4
lintr + stylr
LHBO Jan 7, 2024
1de33a1
typo
LHBO Jan 7, 2024
6d6d7bc
Added a copula version that uses cpp, but r for quantile
LHBO Jan 7, 2024
8eec712
Updated comparison file
LHBO Jan 7, 2024
a0f6132
Ran code several time to get more CPU times
LHBO Jan 7, 2024
5bd7be6
stylr
LHBO Jan 7, 2024
19a92be
Updated manuals/documentation
LHBO Jan 7, 2024
cbb06ed
Updated versions with working and efficient C++ code
LHBO Jan 9, 2024
a1e1767
Updated manuals and NAMESPACE
LHBO Jan 9, 2024
f2aa827
Added comparison of shapr and rcpp compile
LHBO Jan 11, 2024
e0a784b
Added to ignore .DS_Store (mac file created when looking at folder el…
LHBO Jan 11, 2024
2613e6e
Fixed TODO tasks in approach_copula.R
LHBO Jan 11, 2024
cfd0b24
Stylr
LHBO Jan 11, 2024
ccfbd59
Removed debug funciton
LHBO Jan 11, 2024
71b8f40
Added comparisons to when we compile with Rcpp::sourceCpp in all exam…
LHBO Jan 11, 2024
33d7391
Added test that shows that cpp and R gives same values when n_samples…
LHBO Jan 12, 2024
7634b38
Go back to using R for gaussian_transform and gaussian_transform_sepa…
LHBO Jan 12, 2024
4db7aac
Typo in Copula.cpp
LHBO Jan 12, 2024
e969bd9
Added code that Shows that C++ code produce equivalent Shapley values…
LHBO Jan 12, 2024
0bd1f84
Updated RcppExports
LHBO Jan 12, 2024
8f111a6
Removed unneeded variable
LHBO Jan 12, 2024
df1c276
man
martinju Jan 15, 2024
e9c5d4f
test files
martinju Jan 15, 2024
21d0599
clear unused packages
martinju Jan 15, 2024
5b394d5
Merge remote-tracking branch 'origin/master' into Improve_Gaussian_Lars
martinju Jan 15, 2024
3ea3d9d
rerun tests after master merge
martinju Jan 15, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
14 changes: 2 additions & 12 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -5,32 +5,22 @@
Rprof.out
src/*.o
src/*.so

*.o

*.so

tmp/

\.RDataTmp

*.out

*.Rout

src-i386/

src-x64/

paper_experiments/res/single_res/

paper_experiments/res/single_res_old/
inst/doc

docs/*
doc
Meta
docs
/doc/
/Meta/
.idea
.idea
.DS_Store
2 changes: 0 additions & 2 deletions DESCRIPTION
Original file line number Diff line number Diff line change
Expand Up @@ -32,8 +32,6 @@ Imports:
stats,
data.table,
Rcpp (>= 0.12.15),
condMVNorm,
mvnfast,
Matrix,
future.apply
Suggests:
Expand Down
3 changes: 3 additions & 0 deletions NAMESPACE
Original file line number Diff line number Diff line change
Expand Up @@ -63,6 +63,8 @@ export(observation_impute_cpp)
export(plot_MSEv_eval_crit)
export(predict_model)
export(prepare_data)
export(prepare_data_copula_cpp)
export(prepare_data_gaussian_cpp)
export(rss_cpp)
export(setup)
export(setup_approach)
Expand Down Expand Up @@ -96,6 +98,7 @@ importFrom(stats,model.matrix)
importFrom(stats,predict)
importFrom(stats,pt)
importFrom(stats,qt)
importFrom(stats,rnorm)
importFrom(stats,sd)
importFrom(stats,setNames)
importFrom(utils,head)
Expand Down
82 changes: 82 additions & 0 deletions R/RcppExports.R
Original file line number Diff line number Diff line change
Expand Up @@ -80,6 +80,88 @@ aicc_full_cpp <- function(h, X_list, mcov_list, S_scale_dist, y_list, negative)
.Call(`_shapr_aicc_full_cpp`, h, X_list, mcov_list, S_scale_dist, y_list, negative)
}

#' Compute the quantiles using quantile type seven
#'
#' @details Using quantile type number seven from stats::quantile in R.
#'
#' @param x arma::vec. Numeric vector whose sample quantiles are wanted.
#' @param probs arma::vec. Numeric vector of probabilities with values between zero and one.
#'
#' @return A vector of length `length(probs)` with the quantiles is returned.
#'
#' @keywords internal
#' @author Lars Henry Berge Olsen
quantile_type7_cpp <- function(x, probs) {
.Call(`_shapr_quantile_type7_cpp`, x, probs)
}

#' Transforms new data to a standardized normal distribution
#'
#' @param z arma::mat. The data are the Gaussian Monte Carlos samples to transform.
#' @param x arma::mat. The data with the original transformation. Used to conduct the transformation of `z`.
#'
#' @return arma::mat of the same dimension as `z`
#'
#' @keywords internal
#' @author Lars Henry Berge Olsen
inv_gaussian_transform_cpp <- function(z, x) {
.Call(`_shapr_inv_gaussian_transform_cpp`, z, x)
}

#' Generate (Gaussian) Copula MC samples
#'
#' @param MC_samples_mat arma::mat. Matrix of dimension (`n_samples`, `n_features`) containing samples from the
#' univariate standard normal.
#' @param x_explain_mat arma::mat. Matrix of dimension (`n_explain`, `n_features`) containing the observations
#' to explain on the original scale.
#' @param x_explain_gaussian_mat arma::mat. Matrix of dimension (`n_explain`, `n_features`) containing the
#' observations to explain after being transformed using the Gaussian transform, i.e., the samples have been
#' transformed to a standardized normal distribution.
#' @param x_train_mat arma::mat. Matrix of dimension (`n_train`, `n_features`) containing the training observations.
#' @param S arma::mat. Matrix of dimension (`n_combinations`, `n_features`) containing binary representations of
#' the used coalitions. S cannot contain the empty or grand coalition, i.e., a row containing only zeros or ones.
#' This is not a problem internally in shapr as the empty and grand coalitions treated differently.
#' @param mu arma::vec. Vector of length `n_features` containing the mean of each feature after being transformed
#' using the Gaussian transform, i.e., the samples have been transformed to a standardized normal distribution.
#' @param cov_mat arma::mat. Matrix of dimension (`n_features`, `n_features`) containing the pairwise covariance
#' between all pairs of features after being transformed using the Gaussian transform, i.e., the samples have been
#' transformed to a standardized normal distribution.
#'
#' @return An arma::cube/3D array of dimension (`n_samples`, `n_explain` * `n_coalitions`, `n_features`), where
#' the columns (_,j,_) are matrices of dimension (`n_samples`, `n_features`) containing the conditional Gaussian
#' copula MC samples for each explicand and coalition on the original scale.
#'
#' @export
#' @keywords internal
#' @author Lars Henry Berge Olsen
prepare_data_copula_cpp <- function(MC_samples_mat, x_explain_mat, x_explain_gaussian_mat, x_train_mat, S, mu, cov_mat) {
.Call(`_shapr_prepare_data_copula_cpp`, MC_samples_mat, x_explain_mat, x_explain_gaussian_mat, x_train_mat, S, mu, cov_mat)
}

#' Generate Gaussian MC samples
#'
#' @param MC_samples_mat arma::mat. Matrix of dimension (`n_samples`, `n_features`) containing samples from the
#' univariate standard normal.
#' @param x_explain_mat arma::mat. Matrix of dimension (`n_explain`, `n_features`) containing the observations
#' to explain.
#' @param S arma::mat. Matrix of dimension (`n_combinations`, `n_features`) containing binary representations of
#' the used coalitions. S cannot contain the empty or grand coalition, i.e., a row containing only zeros or ones.
#' This is not a problem internally in shapr as the empty and grand coalitions treated differently.
#' @param mu arma::vec. Vector of length `n_features` containing the mean of each feature.
#' @param cov_mat arma::mat. Matrix of dimension (`n_features`, `n_features`) containing the pairwise covariance
#' between all pairs of features.
#'
#' @return An arma::cube/3D array of dimension (`n_samples`, `n_explain` * `n_coalitions`, `n_features`), where
#' the columns (_,j,_) are matrices of dimension (`n_samples`, `n_features`) containing the conditional Gaussian
#' MC samples for each explicand and coalition.
#'
#' @export
#' @keywords internal
#' @author Lars Henry Berge Olsen
prepare_data_gaussian_cpp <- function(MC_samples_mat, x_explain_mat, S, mu, cov_mat) {
.Call(`_shapr_prepare_data_gaussian_cpp`, MC_samples_mat, x_explain_mat, S, mu, cov_mat)
}

#' (Generalized) Mahalanobis distance
#'
#' Used to get the Euclidean distance as well by setting \code{mcov} = \code{diag(m)}.
Expand Down
190 changes: 56 additions & 134 deletions R/approach_copula.R
Original file line number Diff line number Diff line change
@@ -1,12 +1,11 @@
#' @rdname setup_approach
#'
#' @inheritParams default_doc_explain
#'
#' @export
#' @author Martin Jullum
setup_approach.copula <- function(internal, ...) {
parameters <- internal$parameters
x_train <- internal$data$x_train
x_explain <- internal$data$x_explain
x_train_mat <- as.matrix(internal$data$x_train)
x_explain_mat <- as.matrix(internal$data$x_explain)

# Checking if factor features are present
feature_specs <- internal$objects$feature_specs
Expand All @@ -23,159 +22,97 @@ setup_approach.copula <- function(internal, ...) {
}

# Prepare transformed data
parameters$copula.mu <- rep(0, ncol(x_train))
x_train0 <- apply(
X = x_train,
MARGIN = 2,
FUN = gaussian_transform
)
parameters$copula.cov_mat <- get_cov_mat(x_train0)

parameters$copula.mu <- rep(0, ncol(x_train_mat))
x_train_mat0 <- apply(X = x_train_mat, MARGIN = 2, FUN = gaussian_transform)
parameters$copula.cov_mat <- get_cov_mat(x_train_mat0)

x_explain_gaussian <- apply(
X = rbind(x_explain, x_train),
X = rbind(x_explain_mat, x_train_mat),
MARGIN = 2,
FUN = gaussian_transform_separate,
n_y = nrow(x_explain)
n_y = nrow(x_explain_mat)
)
if (is.null(dim(x_explain_gaussian))) x_explain_gaussian <- t(as.matrix(x_explain_gaussian))

if (is.null(dim(x_explain_gaussian))) {
x_explain_gaussian <- t(as.matrix(x_explain_gaussian))
}
# TODO: Change to this a data.table for consistency (not speed/memory)
internal$data$copula.x_explain_gaussian <- x_explain_gaussian
# Add objects to internal list
internal$parameters <- parameters
internal$data$copula.x_explain_gaussian <- as.data.table(x_explain_gaussian)

return(internal)
}

#' @inheritParams default_doc
#' @rdname prepare_data
#' @export
prepare_data.copula <- function(internal, index_features = NULL, ...) {
x_train <- internal$data$x_train
x_explain <- internal$data$x_explain
#' @author Lars Henry Berge Olsen
prepare_data.copula <- function(internal, index_features, ...) {
# Extract used variables
S <- internal$objects$S[index_features, , drop = FALSE]
feature_names <- internal$parameters$feature_names
n_explain <- internal$parameters$n_explain
copula.cov_mat <- internal$parameters$copula.cov_mat
n_samples <- internal$parameters$n_samples
copula.mu <- internal$parameters$copula.mu
n_features <- internal$parameters$n_features
n_combinations_now <- length(index_features)
x_train_mat <- as.matrix(internal$data$x_train)
x_explain_mat <- as.matrix(internal$data$x_explain)
copula.mu <- internal$parameters$copula.mu
copula.cov_mat <- internal$parameters$copula.cov_mat
copula.x_explain_gaussian_mat <- as.matrix(internal$data$copula.x_explain_gaussian)

# Generate the MC samples from N(0, 1)
MC_samples_mat <- matrix(rnorm(n_samples * n_features), nrow = n_samples, ncol = n_features)

# Use C++ to convert the MC samples to N(mu_{Sbar|S}, Sigma_{Sbar|S}), for all coalitions and explicands,
# and then transforming them back to the original scale using the inverse Gaussian transform in C++.
# The object `dt` is a 3D array of dimension (n_samples, n_explain * n_coalitions, n_features).
dt <- prepare_data_copula_cpp(
MC_samples_mat = MC_samples_mat,
x_explain_mat = x_explain_mat,
x_explain_gaussian_mat = copula.x_explain_gaussian_mat,
x_train_mat = x_train_mat,
S = S,
mu = copula.mu,
cov_mat = copula.cov_mat
)

copula.x_explain_gaussian <- internal$data$copula.x_explain_gaussian
X <- internal$objects$X


x_explain0 <- as.matrix(x_explain)
dt_l <- list()
if (is.null(index_features)) {
features <- X$features
} else {
features <- X$features[index_features]
}

# Reshape `dt` to a 2D array of dimension (n_samples * n_explain * n_coalitions, n_features).
dim(dt) <- c(n_combinations_now * n_explain * n_samples, n_features)

for (i in seq_len(n_explain)) {
l <- lapply(
X = features,
FUN = sample_copula,
n_samples = n_samples,
mu = copula.mu,
cov_mat = copula.cov_mat,
m = n_features,
x_explain = x_explain0[i, , drop = FALSE],
x_train = as.matrix(x_train),
x_explain_gaussian = copula.x_explain_gaussian[i, , drop = FALSE]
)
# Convert to a data.table and add extra identification columns
dt <- data.table::as.data.table(dt)
data.table::setnames(dt, feature_names)
dt[, id_combination := rep(seq_len(nrow(S)), each = n_samples * n_explain)]
dt[, id := rep(seq(n_explain), each = n_samples, times = nrow(S))]
dt[, w := 1 / n_samples]
dt[, id_combination := index_features[id_combination]]
data.table::setcolorder(dt, c("id_combination", "id", feature_names))

dt_l[[i]] <- data.table::rbindlist(l, idcol = "id_combination")
dt_l[[i]][, w := 1 / n_samples]
dt_l[[i]][, id := i]
if (!is.null(index_features)) dt_l[[i]][, id_combination := index_features[id_combination]]
}
dt <- data.table::rbindlist(dt_l, use.names = TRUE, fill = TRUE)
return(dt)
}

#' Sample conditional variables using the Gaussian copula approach
#'
#' @param index_given Integer vector. The indices of the features to condition upon. Note that
#' `min(index_given) >= 1` and `max(index_given) <= m`.
#' @param m Positive integer. The total number of features.
#' @param x_explain_gaussian Numeric matrix. Contains the observation whose predictions ought
#' to be explained (test data),
#' after quantile-transforming them to standard Gaussian variables.
#' @param x_explain Numeric matrix. Contains the features of the observation whose
#' predictions ought to be explained (test data).
#'
#' @return data.table
#'
#' @keywords internal
#'
#' @author Martin Jullum
sample_copula <- function(index_given, n_samples, mu, cov_mat, m, x_explain_gaussian, x_train, x_explain) {
# Handles the unconditional and full conditional separtely when predicting
if (length(index_given) %in% c(0, m)) {
ret <- matrix(x_explain, ncol = m, nrow = 1)
} else {
dependent_ind <- (seq_len(length(mu)))[-index_given]

tmp <- condMVNorm::condMVN(
mean = mu,
sigma = cov_mat,
dependent.ind = dependent_ind,
given.ind = index_given,
X.given = x_explain_gaussian[index_given]
)

ret0_z <- mvnfast::rmvn(n = n_samples, mu = tmp$condMean, sigma = tmp$condVar)

ret0_x <- apply(
X = rbind(ret0_z, x_train[, dependent_ind, drop = FALSE]),
MARGIN = 2,
FUN = inv_gaussian_transform,
n_z = n_samples
)

ret <- matrix(NA, ncol = m, nrow = n_samples)
ret[, index_given] <- rep(x_explain[index_given], each = n_samples)
ret[, dependent_ind] <- ret0_x
}
colnames(ret) <- colnames(x_explain)
return(as.data.table(ret))
}


#' Transforms new data to a standardized normal distribution
#' Transforms a sample to standardized normal distribution
#'
#' @param zx Numeric vector. The first `n_z` items are the Gaussian data, and the last part is
#' the data with the original transformation.
#' @param n_z Positive integer. Number of elements of `zx` that belongs to new data.
#' @param x Numeric vector.The data which should be transformed to a standard normal distribution.
#'
#' @return Numeric vector of length `n_z`
#' @return Numeric vector of length `length(x)`
#'
#' @keywords internal
#'
#' @author Martin Jullum
inv_gaussian_transform <- function(zx, n_z) {
if (n_z >= length(zx)) stop("n_z should be less than length(zx)")
ind <- 1:n_z
z <- zx[ind]
x <- zx[-ind]
u <- stats::pnorm(z)
x_new <- stats::quantile(x, probs = u)
return(as.double(x_new))
gaussian_transform <- function(x) {
u <- rank(x) / (length(x) + 1)
z <- stats::qnorm(u)
return(z)
}

#' Transforms new data to standardized normal (dimension 1) based on other data transformations
#'
#' @param yx Numeric vector. The first `n_y` items is the data that is transformed, and last
#' part is the data with the original transformation.
#' @param n_y Positive integer. Number of elements of `yx` that belongs to the gaussian data.
#' @param n_y Positive integer. Number of elements of `yx` that belongs to the Gaussian data.
#'
#' @return Vector of back-transformed Gaussian data
#'
#' @keywords internal
#'
#' @author Martin Jullum
gaussian_transform_separate <- function(yx, n_y) {
if (n_y >= length(yx)) stop("n_y should be less than length(yx)")
Expand All @@ -187,18 +124,3 @@ gaussian_transform_separate <- function(yx, n_y) {
z_y <- stats::qnorm(u_y)
return(z_y)
}

#' Transforms a sample to standardized normal distribution
#'
#' @param x Numeric vector.The data which should be transformed to a standard normal distribution.
#'
#' @return Numeric vector of length `length(x)`
#'
#' @keywords internal
#'
#' @author Martin Jullum
gaussian_transform <- function(x) {
u <- rank(x) / (length(x) + 1)
z <- stats::qnorm(u)
return(z)
}
Loading
Loading