
Improving the efficiency of the Gaussian and (Gaussian) copula methods #366

Merged: 64 commits merged into NorskRegnesentral:master from Lars/Improve_Gaussian on Jan 15, 2024

Conversation

@LHBO (Collaborator) commented Dec 4, 2023

In this pull request, we improve the efficiency of the Gaussian and (Gaussian) copula methods.

The efficiency has been improved by

  1. Computing the conditional Gaussian distributions only once and then adding the explicand-specific parts to the precomputed quantities, as outlined in "Improve efficency for Gaussian method" (#231); see the base-R sketch after this list.
  2. Sampling MC samples only once from a standard normal and then transforming them to the correct conditional normal distribution.
  3. Rewriting the computationally heavy parts of the code into C++ code to speed up the computations.
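To make point 1 concrete, here is a minimal base-R sketch (object names are illustrative, not shapr's internals): for a fixed coalition S, the coefficient matrix Sigma_{Sbar,S} Sigma_{S,S}^{-1} and the conditional covariance Sigma_{Sbar|S} do not depend on the explicand and can be precomputed once, so only the conditional mean has to be recomputed per explicand.

```r
# Minimal sketch of the explicand-independent vs explicand-dependent parts of the
# conditional Gaussian distribution (illustrative names, toy data).
set.seed(123)
p <- 5
S <- c(1, 3)                        # features we condition on (one coalition)
Sbar <- setdiff(seq_len(p), S)      # features to impute
mu <- rep(0, p)
Sigma <- crossprod(matrix(rnorm(p * p), p, p)) + diag(p)   # toy covariance matrix

# Computed once per coalition (independent of the explicand)
coef_mat   <- Sigma[Sbar, S] %*% solve(Sigma[S, S])            # Sigma_{Sbar,S} Sigma_{S,S}^{-1}
Sigma_cond <- Sigma[Sbar, Sbar] - coef_mat %*% Sigma[S, Sbar]  # Sigma_{Sbar|S}

# Cheap per-explicand part: only the conditional mean changes
conditional_mean <- function(x_S) mu[Sbar] + drop(coef_mat %*% (x_S - mu[S]))
conditional_mean(c(0.5, -1))        # example for one explicand's observed x_S
```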

In this pull request, the following C++ functions replace the equivalent R functions:

  • The C++ function prepare_data_gaussian_cpp replaces the R function prepare_data_gaussian.
  • The C++ function prepare_data_copula_cpp replaces the R function prepare_data_copula.
  • The C++ function inv_gaussian_transform_cpp replaces the R function inv_gaussian_transform.
  • The C++ function quantile_type7_cpp replaces the R function quantile(..., type = 7). (A base-R sketch of the last two transforms follows this list.)
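As a rough illustration of what the last two functions compute, here is a base-R sketch of the type-7 quantile rule (the default of stats::quantile) and the usual Gaussian-copula back-transform. The function names and the exact back-transform are my own paraphrase for illustration, not the package's code.

```r
# Type-7 quantile rule: sort the data and interpolate linearly between order statistics.
quantile_type7 <- function(x, probs) {
  x <- sort(x)
  n <- length(x)
  h <- (n - 1) * probs + 1
  lo <- floor(h)
  hi <- ceiling(h)
  x[lo] + (h - lo) * (x[hi] - x[lo])
}

# Gaussian-copula back-transform: map N(0, 1) samples z to uniforms via pnorm(), then
# to the data scale via the empirical (type-7) quantile function of the training data x.
# (Assumed to play the role of inv_gaussian_transform; shapr's implementation may differ.)
inv_gaussian_transform_sketch <- function(z, x) {
  quantile_type7(x, pnorm(z))
}

# Quick check against base R
set.seed(1)
x <- rnorm(20)
all.equal(quantile_type7(x, c(0.1, 0.5, 0.9)),
          unname(quantile(x, c(0.1, 0.5, 0.9), type = 7)))
```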

Closes #231

@LHBO marked this pull request as draft December 4, 2023 15:29
@LHBO (Collaborator, Author) commented Dec 4, 2023

Currently, I have just added an extra file where I have written seven versions of the Gaussian method. We need to decide which one to use and then replace the prepare_data.gaussian() function in approach_gaussian.R with it.

  1. In the old version, we compute and test the conditional covariance matrix for each coalition and explicand, and generate new MC samples for each explicand and coalition.
  2. Compute and test the covariance matrix only once per coalition, but mvnfast computes the Cholesky decomposition for each explicand, so it is computed n_explain times. New MC samples are generated for each explicand and coalition.
  3. Same as 1, but we use R's chol() function to compute the Cholesky decomposition once and provide that directly to rmvn().
  4. We only sample once per coalition and add the explicand-dependent mean in a secondary step. That is, we sample the n_samples MC samples from N(0, Sigma_{Sbar|S}) for each coalition and then add the mean for the different explicands. We let rmvn() compute the Cholesky decomposition, but this is now only done once.
  5. Same as above, but we use R's chol() function to compute the Cholesky decomposition once and provide that directly to rmvn().
  6. Only generate the n_samples MC samples once. We generate n_samples draws from N(0, I), use the Cholesky decomposition to transform them to N(0, Sigma_{Sbar|S}), and then add the explicand-specific means. That is, the same n_samples draws are used to create the MC samples for all n_explain explicands and n_combinations coalitions.
  7. Same as 5, but generate n_samples * n_combinations MC samples from N(0, I), such that each coalition is based on different samples. Different explicands are still based on the same generated draws from N(0, I).
In general, we see in a small 8-dimensional example that the new versions are between two and four times faster than the old version. Versions 3-6 are faster than versions 1-2, and version 5 seems to be the fastest in the setup I looked at. (A small base-R sketch of the sampling idea behind versions 5 and 6 follows below.)

One can probably write more efficient code, too.
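A rough base-R sketch of the sampling idea in versions 5 and 6, with made-up inputs: the Cholesky factor of Sigma_{Sbar|S} is computed once per coalition, one batch of standard-normal draws is transformed with it, and only the explicand-specific mean shift differs. Object names are illustrative and do not match shapr's internals.

```r
set.seed(1)
n_samples    <- 1000
Sigma_cond   <- matrix(c(1, 0.5, 0.5, 2), 2, 2)       # toy Sigma_{Sbar|S} for one coalition
mu_cond_list <- list(c(0, 0), c(1, -1), c(2, 0.5))    # toy conditional means, one per explicand

L <- chol(Sigma_cond)                                 # upper-triangular factor, computed once
Z <- matrix(rnorm(n_samples * ncol(Sigma_cond)), n_samples)  # one batch of N(0, I) draws
B <- Z %*% L                                          # now distributed as N(0, Sigma_{Sbar|S})

# Reuse the same transformed draws for every explicand; only the mean shift differs
mc_samples <- lapply(mu_cond_list, function(mu_cond) sweep(B, 2, mu_cond, "+"))
str(mc_samples[[1]])                                  # n_samples x 2 matrix of MC samples
```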

LHBO and others added 5 commits December 4, 2023 15:50
Merge remote-tracking branch 'origin/master' into Lars/Improve_Gaussian
@martinju (Member) commented Dec 8, 2023

This is great! I am most in favor of v5, to which I added a simplification: using rnorm directly instead of rmvnorm. This is slightly faster. I didn't check, but I think one can then also go directly to the transpose of B, so that may help even more.
Further, since we are improving the code here, it would be nice to see whether the whole double-lapply construct can be done with two basic for loops in C++. All of that code is essentially just matrix multiplication etc., which is much faster in C++. It is quite a lot of code to transform, but ChatGPT should be very helpful here, and you can also look at the other cpp files in shapr for how to set it up. Note that it does not matter if the C++ code is not "efficient" in its use of loops and so on; it will be fast anyway since it is compiled.

Feel free to try that out if you want. Otherwise, I can give it a go next week.

@LHBO (Collaborator, Author) commented Dec 8, 2023


I will look at it next week. I agree that using C++ will be beneficial.

@LHBO marked this pull request as ready for review January 11, 2024 17:50
src/Copula.cpp: review comment (outdated, resolved)
@martinju (Member) left a comment


@Lars: All good except the manual update mentioned above and the ties issue mentioned on Slack. I will approve the failing tests once that is fixed and you have confirmed that the test differences are due to sampling error only.

@LHBO (Collaborator, Author) commented Jan 12, 2024


  • Fixed the manual update mentioned above.
  • I fixed the ties issue by using the R rank function for that problem. We do not need C++ code there, as it only takes milliseconds to run. (See the rank-transform sketch below this list.)
  • I added a code example in shapr/inst/scripts/compare_copula_in_R_and_C++.R where I compare the Shapley values produced by the R and C++ versions. The difference tends to zero when n_samples tends to infinity. For n_samples = 1000000, the max absolute difference is in the third decimal.
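For illustration, a rank-based transform to standard-normal margins, which is the kind of step where ties matter, might look like the sketch below; the exact transform used in shapr may differ.

```r
# Rank-based transform to N(0, 1) margins; tied observations get a common mid-rank via
# ties.method = "average". Illustrative only, not shapr's exact code.
gaussian_rank_transform <- function(x) {
  u <- rank(x, ties.method = "average") / (length(x) + 1)  # pseudo-uniforms in (0, 1)
  qnorm(u)
}

gaussian_rank_transform(c(0.2, 1.5, 1.5, -0.3, 0.7))   # toy data with a tie
```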

@LHBO changed the title from "Comparing different versions of the Gaussian method (#231)" to "Improving the efficiency of the Gaussian and (Gaussian) copula methods" on Jan 12, 2024
@martinju merged commit f940886 into NorskRegnesentral:master on Jan 15, 2024
7 checks passed
@LHBO deleted the Lars/Improve_Gaussian branch April 20, 2024 14:10