Currently, naga exclusively generates while loops when translating code for downstream compilers. This is #4499: for loops get translated into complex control-flow that confuses downstream compilers. To measure the impact on performance this is causing, I implemented my prefix sum demo in wgpu + naga and in dawn + tint, compiling the shaders from the exact same WGSL code. The results on an Apple M1 Pro (8 + 2 version):
I can't be certain that loop unrolling accounts for 100% of the performance discrepancy, but here's what I can say:
It's not occupancy related. That is to say, it doesn't seem like inefficient translation into Metal is inflating register counts. Using a hacky technique, I obtained identical occupancy estimates from all implementations.
It's not an artifact of the data collection. Other metrics gathered during the tests show that threadblocks are indeed waiting longer on wgpu+naga:
//wgpu+naga 22.0 Decoupled Fallback
Estimated Occupancy across GPU: 80
Thread Blocks Launched: 8192
Average Total Spins per Pass: 10181.346
Average Fallbacks Initiated per Pass: 1867.042
Average Successful Fallback Insertions per Pass: 2.922
//wgpu+naga 23.0 Decoupled Fallback
Estimated Occupancy across GPU: 80
Thread Blocks Launched: 8192
Average Total Spins per Pass: 5527.18
Average Fallbacks Initiated per Pass: 1026.484
Average Successful Fallback Insertions per Pass: 0.082
//dawn+tint Decoupled Fallback
Estimated Occupancy across GPU: 80
Thread Blocks Launched: 8192
Average Total Spins per Pass: 3625.62
Average Fallbacks Initiated per Pass: 362.902
Average Successful Fallback Insertions per Pass: 0.116
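For context on what the "Total Spins" counter is measuring: in a decoupled-lookback style scan, each threadblock waits on its predecessors to publish their results and increments a spin counter each time it polls and finds them not ready. The following is only a minimal sketch of that kind of wait loop, assuming a decoupled-lookback structure; the buffer name, flag encoding, and counter are illustrative and not the demo's actual code:

```wgsl
// Hypothetical lookback spin loop: poll the predecessor tile's status flag
// until it has published a value, counting every unsuccessful poll as a spin.
@group(0) @binding(0)
var<storage, read_write> tile_status: array<atomic<u32>>;

const FLAG_NOT_READY: u32 = 0u;

fn count_spins(tile_id: u32) -> u32 {
    var spins = 0u;
    var pred = tile_id;
    loop {
        if (pred == 0u) {
            break;
        }
        if (atomicLoad(&tile_status[pred - 1u]) != FLAG_NOT_READY) {
            pred = pred - 1u;   // predecessor published: move the lookback window back
        } else {
            spins = spins + 1u; // predecessor not ready: spin and poll again
        }
    }
    return spins;
}
```

The higher the spin counts in the data above, the longer threadblocks sat in loops of this shape waiting on their predecessors.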
The uptick in performance between wgpu 22.0 and 23.0 suggests that #4972 was also causing slowdowns. However, the fix may be introducing new problems that are also contributing to the performance discrepancy; see #6518.
The first thing I forgot to mention is that the data gathered here was already collected on unchecked shaders. If this were purely performance degradation caused by #6285, I would expect a performance decrease from 22.0 -> 23.0, not the increase we are seeing here.
I also did not put forward what a solution would be. I don't think all for loops need to be translated, though that would be nice. Instead, a guarantee to translate for loops that meet certain conditions (a compiler-visible constant in the conditional, no complicated control flow) would be enough to solve these issues.
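For illustration, here is the kind of loop such a guarantee would cover. This is a hedged sketch rather than code from the demo; the constant and the loop body are made up:

```wgsl
// Hypothetical example: the trip count is a compiler-visible constant and the
// body contains no break/continue or other complicated control flow. If naga
// preserved this as a genuine for loop in the generated Metal, the downstream
// compiler could unroll it; lowered to a generic while-style loop, it often
// is not unrolled.
const STEPS: u32 = 5u;

fn scan_step(value: u32) -> u32 {
    var v = value;
    for (var i = 0u; i < STEPS; i = i + 1u) {
        v = v + (v << i); // placeholder work standing in for the real scan body
    }
    return v;
}
```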