Are there any suggestions/best practices on what value to use for n_combinations? #226
Comments
Yes, this is applicable for all approaches, and yes, it really is just about trading estimation accuracy for memory usage and runtime. Note that the cutoff that triggers the warning was changed from 12 to 13 in d3f39f3 (it will be in master by next week). So for 14 or more features we recommend specifying n_combinations (e.g. 10 000) instead of the exact computation.
Note also that n_combinations currently does not give you the specified number of unique combinations, so if you want 1000 unique combinations you would have to increase n_combinations to, say, 2000 or 3000, depending. You can check how many unique combinations are actually being used by inspecting the explainer object, see the sketch below.
I agree that it would be useful to study this trade-off between accuracy and runtime. The problem is that it will potentially vary quite a bit depending on the approach you use (empirical, gaussian, ctree) and the type of predictive model you are explaining. Thus, I think it would be hard to reach a general answer to this question.
Sorry to hear your program crashes. Ctree is indeed heavier to run. Depending on where the crash happens, it might help to reduce the number of test samples (predictions you are explaining) and then just run the explain function repeatedly. There are also settings to reduce the complexity of the trees that ctree builds, see ?simulateAllTrees
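A minimal sketch of the check (not the exact snippet from this comment), assuming the two-step shapr()/explain() API and that the fitted explainer stores one row per sampled combination in its S matrix and X table; the data and model here are just placeholders:

```r
library(xgboost)
library(shapr)

# Placeholder data and model: all 13 Boston predictors
data("Boston", package = "MASS")
x_train <- as.matrix(Boston[, -14])
y_train <- Boston[, 14]
model <- xgboost(data = x_train, label = y_train, nrounds = 20, verbose = FALSE)

# Ask for 1000 sampled feature combinations ...
explainer <- shapr(x_train, model, n_combinations = 1000)

# ... then check how many unique combinations are actually being used
# (assumption: one row per unique combination in S and X)
nrow(explainer$S)
nrow(explainer$X)
```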
That's a smart suggestion! How would one go about doing this?
I was merely thinking of just looping over batches of the predictions you want to explain, along the lines of the sketch below.
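A rough sketch of that loop, assuming the old explainer-based API, placeholder objects explainer, x_test and p0 (the prediction_zero value), and that the returned object stores the Shapley values in dt; this is not the exact code from the original comment:

```r
# Split the test observations into, say, 10 batches and explain each batch separately
n_parts <- 10
batch_id <- cut(seq_len(nrow(x_test)), breaks = n_parts, labels = FALSE)

explanation_list <- lapply(seq_len(n_parts), function(b) {
  explain(
    x = x_test[batch_id == b, , drop = FALSE],
    explainer = explainer,
    approach = "ctree",
    prediction_zero = p0
  )
})

# Stack the per-batch Shapley values back into one table
shapley_values <- do.call(rbind, lapply(explanation_list, function(e) e$dt))
```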
This will give you an estimate of the same thing, and is straightforward. A better way would be to call explain with just a subset of all the samples of combinations and store only the conditional expectations in each of the 10 repeated runs. Finally, you would glue those together and do the computations with the Shapley formula. This is not possible with the current code base, though. It would require a little bit of hacking, but should not be extremely hard. At least for the ctree and empirical approaches, reducing the number of training observations would also directly reduce memory usage.
Good call! I will try this one out!
Great! Please also report back if successful. PRs are welcome :-)
Didn't quite know how to deal with the different-sized (due to random sampling) weight matrices W :-( For now, I just loop over the whole process with different seeds, record the Shapley values, and calculate the mean.
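A minimal sketch of that seed-averaging workaround, assuming the old shapr()/explain() API, placeholder objects model, x_train, x_test and p0, and that the Shapley values sit in the dt element of the result; the number of repetitions and n_combinations are arbitrary:

```r
n_reps <- 10
shapley_runs <- vector("list", n_reps)

for (r in seq_len(n_reps)) {
  set.seed(r)  # a different seed gives a different sample of feature combinations
  explainer_r <- shapr(x_train, model, n_combinations = 1000)
  explanation_r <- explain(
    x = x_test,
    explainer = explainer_r,
    approach = "ctree",
    prediction_zero = p0
  )
  shapley_runs[[r]] <- as.matrix(explanation_r$dt)
}

# Average the Shapley values element-wise over the repeated runs
shapley_mean <- Reduce(`+`, shapley_runs) / n_reps
```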
My take on this is that you need to run the shapr function as usual with the full set of, say, 10*1000 samples, then create 10 copies of the explainer but with just 1000 rows each in the S matrix. Then you pass each of those copies to explain and keep the res_mat that is created in the last part of the predictions call (you need to modify that function in the package to do this, as res_mat is currently not saved). Finally, you merge all the res_mats and run the last few lines of code in predictions. What you are doing right now might also be OK, although I believe the approach above is more sample-efficient, as you avoid (unnecessary) repeated sampling of the same feature combinations across runs.
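This is not shapr's actual internals, but a package-agnostic illustration of what "merge the res_mats and apply the Shapley formula" amounts to: given the full binary combination matrix (one row per sampled combination, including the empty and full sets) and the stacked conditional-expectation estimates (one column per explained observation), the Shapley values come out of a kernel-weighted least-squares fit. All object and function names below are assumptions.

```r
# S_mat: n_comb x M binary matrix of feature combinations (rows: empty set ... full set)
# vS:    n_comb x n_explain matrix of estimated conditional expectations v(S, x),
#        e.g. the merged res_mat pieces from the repeated explain() runs
kernel_shapley <- function(S_mat, vS) {
  M <- ncol(S_mat)
  s <- rowSums(S_mat)

  # Shapley kernel weights; the empty and full sets get a huge weight to
  # numerically enforce v(empty) = phi_0 and sum(phi) = v(full)
  w <- ifelse(s == 0 | s == M,
              1e6,
              (M - 1) / (choose(M, s) * s * (M - s)))

  Z <- cbind(1, S_mat)                  # intercept column for phi_0
  ZtW <- t(Z * w)                       # Z^T W (row-wise scaling by the weights)
  phi <- solve(ZtW %*% Z, ZtW %*% vS)   # (Z^T W Z)^{-1} Z^T W vS

  t(phi)                                # one row per explained observation: phi_0, phi_1, ..., phi_M
}
```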
@mlds2020
Hello,
I checked the parameter "dt_mat_final":
Could you help me figure out how to remove this error? Best,
This code is outdated. I recommend using the more recent n_batches argument with the GitHub master version of shapr to get essentially the same result.
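For illustration, a hedged sketch of what that might look like with the explainer-based API on the master branch at the time; the object names are placeholders and the exact call signature may differ between versions:

```r
# remotes::install_github("NorskRegnesentral/shapr")
library(shapr)

explainer <- shapr(x_train, model, n_combinations = 10000)

explanation <- explain(
  x = x_test,
  explainer = explainer,
  approach = "ctree",
  prediction_zero = p0,
  n_batches = 10   # compute the contributions in 10 smaller batches to limit memory use
)
```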
I am asking mainly regarding the ctree method, but I guess this would be applicable to any approach?
n_combinations obviously has a big impact on memory usage, on the time it takes to run, and on the quality of the results.
Are there any rules of thumb on what value to set?
I have seen that you have implemented a warning (recommending n_combinations = 10 000) when the number of features is larger than 12.
Would that mean your recommendation is to use the exact computation while the number of features is <= 12, and to set n_combinations = 10 000 when it is higher?
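For concreteness, the two settings I mean would look roughly like this, using the two-step shapr()/explain() API with placeholder objects x_train and model:

```r
# <= 12 features: leave n_combinations unspecified, which (as I understand it)
# gives the exact computation over all 2^M feature combinations
explainer_exact <- shapr(x_train, model)

# > 12 features: sample a subset of the combinations instead
explainer_sampled <- shapr(x_train, model, n_combinations = 10000)
```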
I think a study of the trade-off between n_combinations and the accuracy of the predicted Shapley values would be an interesting undertaking, for which I unfortunately don't have the necessary hardware (my program crashes with as little as n_combinations = 1000 using the ctree approach on the Credit Card Default dataset, i.e. 23 features). Maybe the Norwegian Computing Center would be better equipped for that? :-)