
Improving the efficiency of the Gaussian and (Gaussian) copula methods #366

Merged: 64 commits merged into NorskRegnesentral:master from Lars/Improve_Gaussian on Jan 15, 2024

Conversation

@LHBO (Collaborator) commented Dec 4, 2023

In this pull request, we improve the efficiency of the Gaussian and (Gaussian) copula methods.

The efficiency has been improved by

  1. Computing the conditional Gaussian distributions only once and then adding the explicand-specific parts to the precomputed quantities, as outlined in "Improve efficency for Gaussian method" (#231); see the base-R sketch after this list.
  2. Sampling MC samples only once from a standard normal and then transforming them to the correct conditional normal distribution.
  3. Rewriting the computationally heavy parts of the code into C++ code to speed up the computations.
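To make point 1 concrete, here is a minimal base-R sketch (object names are illustrative, not shapr's internals): for a fixed coalition S, the coefficient matrix Sigma_{Sbar,S} Sigma_{S,S}^{-1} and the conditional covariance Sigma_{Sbar|S} do not depend on the explicand and can be precomputed once, so only the conditional mean has to be recomputed per explicand.

```r
# Minimal sketch of the explicand-independent vs explicand-dependent parts of the
# conditional Gaussian distribution (illustrative names, toy data).
set.seed(123)
p <- 5
S <- c(1, 3)                        # features we condition on (one coalition)
Sbar <- setdiff(seq_len(p), S)      # features to impute
mu <- rep(0, p)
Sigma <- crossprod(matrix(rnorm(p * p), p, p)) + diag(p)   # toy covariance matrix

# Computed once per coalition (independent of the explicand)
coef_mat   <- Sigma[Sbar, S] %*% solve(Sigma[S, S])            # Sigma_{Sbar,S} Sigma_{S,S}^{-1}
Sigma_cond <- Sigma[Sbar, Sbar] - coef_mat %*% Sigma[S, Sbar]  # Sigma_{Sbar|S}

# Cheap per-explicand part: only the conditional mean changes
conditional_mean <- function(x_S) mu[Sbar] + drop(coef_mat %*% (x_S - mu[S]))
conditional_mean(c(0.5, -1))        # example for one explicand's observed x_S
```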

In this pull request, the following C++ functions replace the equivalent R functions:

  • The C++ function prepare_data_gaussian_cpp replaces the R function prepare_data_gaussian.
  • The C++ function prepare_data_copula_cpp replaces the R function prepare_data_copula.
  • The C++ function inv_gaussian_transform_cpp replaces the R function inv_gaussian_transform.
  • The C++ function quantile_type7_cpp replaces the R function quantile(..., type = 7). (A base-R sketch of the last two transforms follows this list.)
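As a rough illustration of what the last two functions compute, here is a base-R sketch of the type-7 quantile rule (the default of stats::quantile) and the usual Gaussian-copula back-transform. The function names and the exact back-transform are my own paraphrase for illustration, not the package's code.

```r
# Type-7 quantile rule: sort the data and interpolate linearly between order statistics.
quantile_type7 <- function(x, probs) {
  x <- sort(x)
  n <- length(x)
  h <- (n - 1) * probs + 1
  lo <- floor(h)
  hi <- ceiling(h)
  x[lo] + (h - lo) * (x[hi] - x[lo])
}

# Gaussian-copula back-transform: map N(0, 1) samples z to uniforms via pnorm(), then
# to the data scale via the empirical (type-7) quantile function of the training data x.
# (Assumed to play the role of inv_gaussian_transform; shapr's implementation may differ.)
inv_gaussian_transform_sketch <- function(z, x) {
  quantile_type7(x, pnorm(z))
}

# Quick check against base R
set.seed(1)
x <- rnorm(20)
all.equal(quantile_type7(x, c(0.1, 0.5, 0.9)),
          unname(quantile(x, c(0.1, 0.5, 0.9), type = 7)))
```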

Closes #231

@LHBO marked this pull request as draft December 4, 2023 15:29
@LHBO (Collaborator, Author) commented Dec 4, 2023

Currently, I have just added an extra file where I have written seven versions of the Gaussian method. We need to decide which one to use and then replace the prepare_data.gaussian() function in approach_gaussian.R with it.

  1. In the old version, we compute and test the conditional covariance matrix for each coalition and explicand, and generate new MC samples for each explicand and coalition.
  2. Compute and test the covariance matrix only once per coalition, but mvnfast computes the Cholesky decomposition for each explicand, so it is computed n_explain times. New MC samples are generated for each explicand and coalition.
  3. Same as 1, but we use R's chol() function to compute the Cholesky decomposition once and provide that directly to rmvn().
  4. We only sample once per coalition and add the explicand-dependent mean in a secondary step. That is, we sample the n_samples MC samples from N(0, Sigma_{Sbar|S}) for each coalition and then add the mean for the different explicands. We let rmvn() compute the Cholesky decomposition, but this is now only done once.
  5. Same as above, but we use R's chol() function to compute the Cholesky decomposition once and provide that directly to rmvn().
  6. Only generate the n_samples MC samples once. We generate n_samples draws from N(0, I), use the Cholesky decomposition to transform them to N(0, Sigma_{Sbar|S}), and then add the explicand-specific means. That is, the same n_samples draws are used to create the MC samples for all n_explain explicands and n_combinations coalitions.
  7. Same as 5, but generate n_samples * n_combinations MC samples from N(0, I), such that each coalition is based on different samples. Different explicands are still based on the same generated draws from N(0, I).
In general, we see in a small 8-dimensional example that the new versions are between two and four times faster than the old version. Versions 3-6 are faster than versions 1-2, and version 5 seems to be the fastest in the setup I looked at. (A small base-R sketch of the sampling idea behind versions 5 and 6 follows below.)

One can probably write more efficient code, too.
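A rough base-R sketch of the sampling idea in versions 5 and 6, with made-up inputs: the Cholesky factor of Sigma_{Sbar|S} is computed once per coalition, one batch of standard-normal draws is transformed with it, and only the explicand-specific mean shift differs. Object names are illustrative and do not match shapr's internals.

```r
set.seed(1)
n_samples    <- 1000
Sigma_cond   <- matrix(c(1, 0.5, 0.5, 2), 2, 2)       # toy Sigma_{Sbar|S} for one coalition
mu_cond_list <- list(c(0, 0), c(1, -1), c(2, 0.5))    # toy conditional means, one per explicand

L <- chol(Sigma_cond)                                 # upper-triangular factor, computed once
Z <- matrix(rnorm(n_samples * ncol(Sigma_cond)), n_samples)  # one batch of N(0, I) draws
B <- Z %*% L                                          # now distributed as N(0, Sigma_{Sbar|S})

# Reuse the same transformed draws for every explicand; only the mean shift differs
mc_samples <- lapply(mu_cond_list, function(mu_cond) sweep(B, 2, mu_cond, "+"))
str(mc_samples[[1]])                                  # n_samples x 2 matrix of MC samples
```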

LHBO and others added 5 commits December 4, 2023 15:50
Merge remote-tracking branch 'origin/master' into Lars/Improve_Gaussian
@martinju (Member) commented Dec 8, 2023

This is great! I am most in favor of v5, to which I added a simplification: using rnorm directly instead of rmvnorm. This is slightly faster. I didn't check, but I think one can then also go directly to the transpose of B, so that may help even more.
Further, since we are improving the code here, it would be nice to see whether the whole double-lapply construct can be done with two basic for loops in C++. All of that code is essentially just matrix multiplication etc., which is much faster in C++. It is quite a lot of code to transform, but ChatGPT should be very helpful here, and you can also look at the other cpp files in shapr for how to set it up. Note that it does not matter if the C++ code is not "efficient" in its use of loops and so on; it will be fast anyway since it is compiled.

Feel free to try that out if you want. Otherwise, I can give it a go next week.

@LHBO (Collaborator, Author) commented Dec 8, 2023


I will look at it next week. I agree that using C++ will be beneficial.

@LHBO marked this pull request as ready for review January 11, 2024 17:50
src/Copula.cpp: review comment (outdated, resolved)
@martinju (Member) left a comment


@Lars: All good except the manual update mentioned above and the ties issue mentioned on Slack. I will approve the failing tests once that is fixed and you have confirmed that the test differences are due to sampling error only.

@LHBO (Collaborator, Author) commented Jan 12, 2024


  • Fixed the manual update mentioned above.
  • I fixed the ties issue by using the R rank function for that problem. We do not need C++ code there, as it only takes milliseconds to run. (See the rank-transform sketch below this list.)
  • I added a code example in shapr/inst/scripts/compare_copula_in_R_and_C++.R where I compare the Shapley values produced by the R and C++ versions. The difference tends to zero when n_samples tends to infinity. For n_samples = 1000000, the max absolute difference is in the third decimal.
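For illustration, a rank-based transform to standard-normal margins, which is the kind of step where ties matter, might look like the sketch below; the exact transform used in shapr may differ.

```r
# Rank-based transform to N(0, 1) margins; tied observations get a common mid-rank via
# ties.method = "average". Illustrative only, not shapr's exact code.
gaussian_rank_transform <- function(x) {
  u <- rank(x, ties.method = "average") / (length(x) + 1)  # pseudo-uniforms in (0, 1)
  qnorm(u)
}

gaussian_rank_transform(c(0.2, 1.5, 1.5, -0.3, 0.7))   # toy data with a tie
```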

@LHBO changed the title from "Comparing different versions of the Gaussian method (#231)" to "Improving the efficiency of the Gaussian and (Gaussian) copula methods" on Jan 12, 2024
@martinju merged commit f940886 into NorskRegnesentral:master on Jan 15, 2024
7 checks passed
@LHBO deleted the Lars/Improve_Gaussian branch April 20, 2024 14:10