
[webgpu] Only capture validation errors in Debug build #23438

Open · wants to merge 1 commit into main

Conversation

jchen10 (Contributor) commented Jan 21, 2025

Description

The CPU wall time of waiting for PopErrorScope is non-trivial, and validation errors are not expected to happen in Release builds.
jchen10 (Contributor, Author) commented Jan 21, 2025

The wait for PopErrorScope can take around 7% of wall time per my profiling, which is not negligible. Meanwhile, there is no need to capture validation errors in Release builds, according to the WebGPU Error Handling best practices:

> Validation errors occur whenever invalid inputs were given to a WebGPU call. These are consistent, predictable, and should generally not be expected during normal operation of a well formed application. They will fail in the same way on every device your code runs on, so once you've fixed any errors that show up during development you probably don't need to observe them directly most of the time. An exception to that rule is if you're consuming user-supplied assets/shaders/etc, in which case watching for validation errors while loading may be helpful.
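For illustration, here is a minimal sketch of the pattern the PR title describes, written against the standard WebGPU JavaScript API rather than the actual ORT sources; the `debugBuild` flag and the `runGpuCommand`/`encode` names are hypothetical:

```ts
// Sketch only: per-call validation error capture that is skipped outside of
// debug builds, so the await on popErrorScope() (and its CPU wall time) only
// happens when debugging. Not the actual ORT implementation.
async function runGpuCommand(
  device: GPUDevice,
  encode: (device: GPUDevice) => void, // records and submits the real GPU work
  debugBuild: boolean, // hypothetical flag standing in for a Debug-build check
): Promise<void> {
  if (debugBuild) {
    device.pushErrorScope('validation');
  }

  encode(device);

  if (debugBuild) {
    const error = await device.popErrorScope(); // resolves to GPUError | null
    if (error) {
      throw new Error(`WebGPU validation error: ${error.message}`);
    }
  }
}
```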

jchen10 (Contributor, Author) commented Jan 21, 2025

@guschmue @fs-eire PTAL

fs-eire (Contributor) commented Jan 21, 2025

The document https://toji.dev/webgpu-best-practices/error-handling is only half correct for ONNX Runtime.

Because the shaders are generated dynamically based on user input (an ONNX model is considered user input), it is impossible to have 100% test coverage for all possible shaders that may run. We can never expect the code to be bug free, no matter how carefully we write it. So validation errors will happen in ORT Release builds.

As a library, we want to do our best to avoid an unexpected crash/abort. It is OK for a validation error to cause an inference failure, but we want to avoid a crash/abort or a stale state.

My concern with this change: will it increase the likelihood of a crash (abort), or of a stale state (i.e. every inference after the failure also fails)?
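As context for where validation errors go when no error scope is active, here is a hedged sketch using the standard WebGPU API (the wiring and names below are assumptions, not ORT code): uncaptured validation errors are delivered to the device's uncaptured-error event rather than crashing the process, while a genuinely fatal condition surfaces as device loss, which is the case that could leave later runs in a stale state.

```ts
// Sketch of the fallback paths when no error scope is active (assumed names,
// not ORT code).
function installFallbackHandlers(device: GPUDevice): void {
  // Validation errors not captured by any pushErrorScope()/popErrorScope()
  // pair are delivered here; the device itself remains usable.
  device.onuncapturederror = (ev: GPUUncapturedErrorEvent) => {
    console.error(`Uncaptured WebGPU error: ${ev.error.message}`);
  };

  // Fatal conditions show up as device loss; submissions to a lost device
  // will not produce valid results for any later run.
  device.lost.then((info: GPUDeviceLostInfo) => {
    console.error(`WebGPU device lost (${info.reason}): ${info.message}`);
  });
}
```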

guschmue added the ep:WebGPU (ort-web webgpu provider) label on Jan 22, 2025
jchen10 (Contributor, Author) commented Jan 22, 2025

> The document https://toji.dev/webgpu-best-practices/error-handling is only half correct for ONNX Runtime.
>
> Because the shaders are generated dynamically based on user input (an ONNX model is considered user input), it is impossible to have 100% test coverage for all possible shaders that may run. We can never expect the code to be bug free, no matter how carefully we write it. So validation errors will happen in ORT Release builds.
>
> As a library, we want to do our best to avoid an unexpected crash/abort. It is OK for a validation error to cause an inference failure, but we want to avoid a crash/abort or a stale state.
>
> My concern with this change: will it increase the likelihood of a crash (abort), or of a stale state (i.e. every inference after the failure also fails)?

Dawn/WebGPU was designed for the Web, where security is always a critical consideration. It would be a disaster if a malicious input from a client could easily cause a crash, so I am not too worried about that.

For the concern about a stale state: as far as I can tell, this change could delay raising the validation error to the framework, so it keeps processing the next operators. That work is unnecessary, since the inference result is not expected to be correct anyway, but it is a waste rather than a harm.

This is a trade-off: more validation brings more robustness, but also more overhead.

jchen10 (Contributor, Author) commented Jan 22, 2025

It's fine to revisit this later, when we are confident enough and performance becomes more of a concern.

fs-eire (Contributor) commented Jan 22, 2025

> For the concern about a stale state: as far as I can tell, this change could delay raising the validation error to the framework, so it keeps processing the next operators. That work is unnecessary, since the inference result is not expected to be correct anyway, but it is a waste rather than a harm.

This makes sense. I think using pushErrorScope()/popErrorScope() once per inference run is a better idea: it should have much less perf impact, still keep errors from reaching the uncaptured-error handler, and also let users know whether the inference succeeded. What do you think?
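A minimal sketch of this suggestion, again against the standard WebGPU API with assumed names (`runInference`, `run`) rather than the actual ORT code: the whole run is bracketed by a single error scope, so there is one popErrorScope() wait per run instead of one per GPU call, and a validation error still surfaces as a run-level failure.

```ts
// Sketch only: one validation error scope per inference run (assumed names).
async function runInference(
  device: GPUDevice,
  run: () => Promise<void>, // dispatches all operators for this inference run
): Promise<void> {
  device.pushErrorScope('validation');
  let runError: unknown;
  try {
    await run();
  } catch (e) {
    runError = e;
  }
  // A single await per run, instead of one await per GPU call.
  const gpuError = await device.popErrorScope();
  if (runError !== undefined) {
    throw runError; // the run itself failed; re-raise that first
  }
  if (gpuError) {
    throw new Error(`Inference failed with a WebGPU validation error: ${gpuError.message}`);
  }
}
```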

jchen10 (Contributor, Author) commented Jan 23, 2025

> > For the concern about a stale state: as far as I can tell, this change could delay raising the validation error to the framework, so it keeps processing the next operators. That work is unnecessary, since the inference result is not expected to be correct anyway, but it is a waste rather than a harm.
>
> This makes sense. I think using pushErrorScope()/popErrorScope() once per inference run is a better idea: it should have much less perf impact, still keep errors from reaching the uncaptured-error handler, and also let users know whether the inference succeeded. What do you think?

Good idea. That would give us the best of both worlds.
