Currently, naga exclusively generates while loops when translating code for downstream compilers. This is #4499: for loops get translated into complex control-flow that confuses downstream compilers. To measure the impact on performance this is causing, I implemented my prefix sum demo in wgpu + naga and in dawn + tint, compiling the shaders from the exact same WGSL code. The results on an Apple M1 Pro (8 + 2 version):
I can't be certain that loop unrolling accounts for 100% of the performance discrepancy, but here's what I can say:
It's not occupancy related. That is to say, it doesn't seem like inefficient translation into Metal is inflating register counts. Using a hacky technique, I obtained identical occupancy estimates from all implementations.
It's not an artifact of the data collection. Other metrics gathered during the tests show that threadblocks are indeed waiting longer on wgpu+naga:
//wgpu+naga 22.0 Decoupled Fallback
Estimated Occupancy across GPU: 80
Thread Blocks Launched: 8192
Average Total Spins per Pass: 10181.346
Average Fallbacks Initiated per Pass: 1867.042
Average Successful Fallback Insertions per Pass: 2.922
//wgpu+naga 23.0 Decoupled Fallback
Estimated Occupancy across GPU: 80
Thread Blocks Launched: 8192
Average Total Spins per Pass: 5527.18
Average Fallbacks Initiated per Pass: 1026.484
Average Successful Fallback Insertions per Pass: 0.082
//dawn+tint Decoupled Fallback
Estimated Occupancy across GPU: 80
Thread Blocks Launched: 8192
Average Total Spins per Pass: 3625.62
Average Fallbacks Initiated per Pass: 362.902
Average Successful Fallback Insertions per Pass: 0.116
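For context on what the "Total Spins" counter is measuring: in a decoupled-lookback style scan, each threadblock waits on its predecessors to publish their results and increments a spin counter each time it polls and finds them not ready. The following is only a minimal sketch of that kind of wait loop, assuming a decoupled-lookback structure; the buffer name, flag encoding, and counter are illustrative and not the demo's actual code:

```wgsl
// Hypothetical lookback spin loop: poll the predecessor tile's status flag
// until it has published a value, counting every unsuccessful poll as a spin.
@group(0) @binding(0)
var<storage, read_write> tile_status: array<atomic<u32>>;

const FLAG_NOT_READY: u32 = 0u;

fn count_spins(tile_id: u32) -> u32 {
    var spins = 0u;
    var pred = tile_id;
    loop {
        if (pred == 0u) {
            break;
        }
        if (atomicLoad(&tile_status[pred - 1u]) != FLAG_NOT_READY) {
            pred = pred - 1u;   // predecessor published: move the lookback window back
        } else {
            spins = spins + 1u; // predecessor not ready: spin and poll again
        }
    }
    return spins;
}
```

The higher the spin counts in the data above, the longer threadblocks sat in loops of this shape waiting on their predecessors.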
The uptick in performance between wgpu 22.0 and 23.0 suggests that #4972 was also causing slowdowns. However, the fix may be introducing new problems that are also contributing to the performance discrepancy; see #6518.
The first thing I forgot to mention is that the data gathered here was already collected on unchecked shaders. If this were purely performance degradation caused by #6285, I would expect a performance decrease from 22.0 -> 23.0, not the increase we are seeing here.
I also did not put forward what a solution would be. I don't think all for loops need to be translated, though that would be nice. Instead, a guarantee to translate for loops that meet certain conditions (a compiler-visible constant in the conditional, no complicated control flow) would be enough to solve these issues.
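For illustration, here is the kind of loop such a guarantee would cover. This is a hedged sketch rather than code from the demo; the constant and the loop body are made up:

```wgsl
// Hypothetical example: the trip count is a compiler-visible constant and the
// body contains no break/continue or other complicated control flow. If naga
// preserved this as a genuine for loop in the generated Metal, the downstream
// compiler could unroll it; lowered to a generic while-style loop, it often
// is not unrolled.
const STEPS: u32 = 5u;

fn scan_step(value: u32) -> u32 {
    var v = value;
    for (var i = 0u; i < STEPS; i = i + 1u) {
        v = v + (v << i); // placeholder work standing in for the real scan body
    }
    return v;
}
```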