[naga]Naga does not generate for loops #6521

Open
b0nes164 opened this issue Nov 12, 2024 · 1 comment

Comments

@b0nes164

Currently, naga exclusively generates while loops when translating code for downstream compilers. This:

To measure the performance impact of this, I implemented my prefix sum demo in wgpu + naga and in dawn + tint, compiling the shaders from the exact same WGSL code. The results on an Apple M1 Pro (8 + 2 version):

[Figure: WebGPU implementation comparison]

I can't be certain that loop unrolling accounts for 100% of the performance discrepancy, but here's what I can say:

  • It's not occupancy related. That is to say, it doesn't seem like inefficient translation into Metal is eating up registers. Using a hack, I obtained identical occupancy estimates from all implementations.

  • It's not an artifact of the data collection. Other metrics gathered during the tests show that thread blocks are indeed waiting longer on wgpu+naga:

//wgpu+naga 22.0 Decoupled Fallback
Estimated Occupancy across GPU: 80
Thread Blocks Launched: 8192
Average Total Spins per Pass: 10181.346
Average Fallbacks Initiated per Pass: 1867.042
Average Successful Fallback Insertions per Pass: 2.922

//wgpu+naga 23.0 Decoupled Fallback
Estimated Occupancy across GPU: 80
Thread Blocks Launched: 8192
Average Total Spins per Pass: 5527.18
Average Fallbacks Initiated per Pass: 1026.484
Average Successful Fallback Insertions per Pass: 0.082

//dawn+tint Decoupled Fallback
Estimated Occupancy across GPU: 80
Thread Blocks Launched: 8192
Average Total Spins per Pass: 3625.62
Average Fallbacks Initiated per Pass: 362.902
Average Successful Fallback Insertions per Pass: 0.116

The uptick in performance between wgpu 22.0 and 23.0 suggests that #4972 was also causing slowdowns. However, the fix may be introducing new problems that also contribute to the performance discrepancy; see #6518.

b0nes164 commented Nov 12, 2024

One thing I forgot to mention is that the data gathered here was already from unchecked shaders. If this were purely performance degradation caused by #6285, I would expect a performance decrease from 22.0 -> 23.0, not the increase we are seeing here.

I also did not put forward a solution. I don't think all for loops need to be translated, though that would be nice. Instead, a guarantee to translate for loops that meet certain conditions (a compiler-visible constant in the loop condition, no complicated control flow) would be enough to solve these issues.
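
To make the distinction concrete, here is a minimal sketch of the two loop shapes, written in C++ as a stand-in for the downstream shader language. Everything in it is an illustrative assumption rather than actual naga or tint output: the function names, the constant `kCount`, and the `loop_init` flag are hypothetical. The first function is the kind of loop that would qualify under the conditions above (constant trip count, no extra control flow); the second expresses the same computation as a while loop with an explicit break, which is the general shape this issue is about.

```cpp
#include <cstdint>

// Hypothetical trip count that is a compile-time constant,
// i.e. "compiler visible" in the sense used above.
constexpr uint32_t kCount = 4;

// Shape A: a plain for loop. Init, condition, and increment are all
// visible in the loop header, with no other control flow in the body.
uint32_t sum_for(const uint32_t* data) {
    uint32_t acc = 0;
    for (uint32_t i = 0; i < kCount; ++i) {
        acc += data[i];
    }
    return acc;
}

// Shape B: the same computation as a while loop with an explicit break.
// The increment is guarded by a first-iteration flag (illustrative only),
// so the trip count is no longer readable from a loop header.
uint32_t sum_while(const uint32_t* data) {
    uint32_t acc = 0;
    uint32_t i = 0;
    bool loop_init = true;
    while (true) {
        if (!loop_init) {
            ++i;  // plays the role of the for loop's increment
        }
        loop_init = false;
        if (!(i < kCount)) {
            break;  // plays the role of the for loop's condition
        }
        acc += data[i];
    }
    return acc;
}
```

Both functions return the same result for the same input. The point is only that Shape A hands the downstream compiler the trip count directly, while Shape B requires it to recover that information through analysis before it can consider unrolling.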
