-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[webgpu] Only capture validation errors in Debug build #23438
base: main
Are you sure you want to change the base?
Conversation
The CPU walltime of waiting for PopErrorScope is non-trivial, and also validation errors are not expected to happen in Release build.
The wait for
|
The document https://toji.dev/webgpu-best-practices/error-handling said it half correct for ONNX Runtime. Because the shaders are generated dynamically and it is user input based (ONNX model is considered user input), it is impossible to have 100% test coverage for all possible shaders that may run. We can never expect the code is bug free no matter how carefully we write the code. So, validation errors will happen in ORT Release build. As a library, we want to try our best to avoid unexpected crash/abort. It is OK that a validation error causing unexpected inferencing failure, but we want to avoid crash/abort or a stale state. My concern of this change is: Will this change cause higher possibility of potential crash (abort), or a stale state (ie. any inferencing after this failure will all fail)? |
Dawn/WebGPU was designed for Web where security is always a critical consideration. It would be a disaster if a malicious input from clients could easily cause a crash. So I am not too worried about that. For the concern of stale state, as far as I can imagine, this change could make the validation error not to be raised to the framework immediately, so it keeps processing the next operators, which is unnecessary, as the result of inferencing is not supposed to correct. It's no harm than just a waste. This is a trade off to be made. More validation brings more robustness, and meantime more overhead. |
It's fine to revisit this later when we are confident enough and more concerned about the performance. |
This makes sense. I think using pushErrorScope()/popErrorScope() once for an inference run is better idea - this should have much less perf impact and still avoid errors go into uncaptured handlers, also allows users to know whether the inference is success. What do you think? |
Good idea. It would make the best of both worlds. |
The CPU walltime of waiting for PopErrorScope is non-trivial, and also validation errors are not expected to happen in Release build.
Description
Motivation and Context