
[webgpu] Only capture validation errors in Debug build #23438

Open · wants to merge 1 commit into main

Conversation

jchen10 (Contributor) commented Jan 21, 2025

Description

The CPU wall time of waiting for PopErrorScope is non-trivial, and validation errors are not expected to happen in Release builds.
jchen10 (Contributor, Author) commented Jan 21, 2025

The wait for PopErrorScope can take around 7% of wall time per my profiling, which is not negligible. Meanwhile, there is no need to capture validation errors in Release builds, according to the WebGPU Error Handling best practices:

> Validation errors occur whenever invalid inputs were given to a WebGPU call. These are consistent, predictable, and should generally not be expected during normal operation of a well formed application. They will fail in the same way on every device your code runs on, so once you've fixed any errors that show up during development you probably don't need to observe them directly most of the time. An exception to that rule is if you're consuming user-supplied assets/shaders/etc, in which case watching for validation errors while loading may be helpful.
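For illustration, here is a minimal sketch of the pattern the PR title describes, written against the standard WebGPU JavaScript API rather than the actual ORT sources; the `debugBuild` flag and the `runGpuCommand`/`encode` names are hypothetical:

```ts
// Sketch only: per-call validation error capture that is skipped outside of
// debug builds, so the await on popErrorScope() (and its CPU wall time) only
// happens when debugging. Not the actual ORT implementation.
async function runGpuCommand(
  device: GPUDevice,
  encode: (device: GPUDevice) => void, // records and submits the real GPU work
  debugBuild: boolean, // hypothetical flag standing in for a Debug-build check
): Promise<void> {
  if (debugBuild) {
    device.pushErrorScope('validation');
  }

  encode(device);

  if (debugBuild) {
    const error = await device.popErrorScope(); // resolves to GPUError | null
    if (error) {
      throw new Error(`WebGPU validation error: ${error.message}`);
    }
  }
}
```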

jchen10 (Contributor, Author) commented Jan 21, 2025

@guschmue @fs-eire PTAL

fs-eire (Contributor) commented Jan 21, 2025

The document https://toji.dev/webgpu-best-practices/error-handling is only half correct for ONNX Runtime.

Because the shaders are generated dynamically based on user input (an ONNX model is considered user input), it is impossible to have 100% test coverage for all possible shaders that may run. We can never expect the code to be bug free, no matter how carefully we write it. So validation errors will happen in ORT Release builds.

As a library, we want to do our best to avoid an unexpected crash/abort. It is OK for a validation error to cause an inference failure, but we want to avoid a crash/abort or a stale state.

My concern with this change: will it increase the likelihood of a crash (abort), or of a stale state (i.e. every inference after the failure also fails)?
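As context for where validation errors go when no error scope is active, here is a hedged sketch using the standard WebGPU API (the wiring and names below are assumptions, not ORT code): uncaptured validation errors are delivered to the device's uncaptured-error event rather than crashing the process, while a genuinely fatal condition surfaces as device loss, which is the case that could leave later runs in a stale state.

```ts
// Sketch of the fallback paths when no error scope is active (assumed names,
// not ORT code).
function installFallbackHandlers(device: GPUDevice): void {
  // Validation errors not captured by any pushErrorScope()/popErrorScope()
  // pair are delivered here; the device itself remains usable.
  device.onuncapturederror = (ev: GPUUncapturedErrorEvent) => {
    console.error(`Uncaptured WebGPU error: ${ev.error.message}`);
  };

  // Fatal conditions show up as device loss; submissions to a lost device
  // will not produce valid results for any later run.
  device.lost.then((info: GPUDeviceLostInfo) => {
    console.error(`WebGPU device lost (${info.reason}): ${info.message}`);
  });
}
```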

guschmue added the ep:WebGPU (ort-web webgpu provider) label on Jan 22, 2025
jchen10 (Contributor, Author) commented Jan 22, 2025

> The document https://toji.dev/webgpu-best-practices/error-handling is only half correct for ONNX Runtime.
>
> Because the shaders are generated dynamically based on user input (an ONNX model is considered user input), it is impossible to have 100% test coverage for all possible shaders that may run. We can never expect the code to be bug free, no matter how carefully we write it. So validation errors will happen in ORT Release builds.
>
> As a library, we want to do our best to avoid an unexpected crash/abort. It is OK for a validation error to cause an inference failure, but we want to avoid a crash/abort or a stale state.
>
> My concern with this change: will it increase the likelihood of a crash (abort), or of a stale state (i.e. every inference after the failure also fails)?

Dawn/WebGPU was designed for the Web, where security is always a critical consideration. It would be a disaster if a malicious input from a client could easily cause a crash, so I am not too worried about that.

For the concern about a stale state: as far as I can tell, this change could delay raising the validation error to the framework, so it keeps processing the next operators. That work is unnecessary, since the inference result is not expected to be correct anyway, but it is a waste rather than a harm.

This is a trade-off: more validation brings more robustness, but also more overhead.

jchen10 (Contributor, Author) commented Jan 22, 2025

It's fine to revisit this later, when we are confident enough and performance becomes more of a concern.

fs-eire (Contributor) commented Jan 22, 2025

> For the concern about a stale state: as far as I can tell, this change could delay raising the validation error to the framework, so it keeps processing the next operators. That work is unnecessary, since the inference result is not expected to be correct anyway, but it is a waste rather than a harm.

This makes sense. I think using pushErrorScope()/popErrorScope() once per inference run is a better idea: it should have much less perf impact, still keep errors from reaching the uncaptured-error handler, and also let users know whether the inference succeeded. What do you think?
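A minimal sketch of this suggestion, again against the standard WebGPU API with assumed names (`runInference`, `run`) rather than the actual ORT code: the whole run is bracketed by a single error scope, so there is one popErrorScope() wait per run instead of one per GPU call, and a validation error still surfaces as a run-level failure.

```ts
// Sketch only: one validation error scope per inference run (assumed names).
async function runInference(
  device: GPUDevice,
  run: () => Promise<void>, // dispatches all operators for this inference run
): Promise<void> {
  device.pushErrorScope('validation');
  let runError: unknown;
  try {
    await run();
  } catch (e) {
    runError = e;
  }
  // A single await per run, instead of one await per GPU call.
  const gpuError = await device.popErrorScope();
  if (runError !== undefined) {
    throw runError; // the run itself failed; re-raise that first
  }
  if (gpuError) {
    throw new Error(`Inference failed with a WebGPU validation error: ${gpuError.message}`);
  }
}
```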

jchen10 (Contributor, Author) commented Jan 23, 2025

> > For the concern about a stale state: as far as I can tell, this change could delay raising the validation error to the framework, so it keeps processing the next operators. That work is unnecessary, since the inference result is not expected to be correct anyway, but it is a waste rather than a harm.
>
> This makes sense. I think using pushErrorScope()/popErrorScope() once per inference run is a better idea: it should have much less perf impact, still keep errors from reaching the uncaptured-error handler, and also let users know whether the inference succeeded. What do you think?

Good idea. That would give us the best of both worlds.
