Reproducing Table 2 from Orion Paper #35
Comments
Hello! I have not tried this example on an H100 GPU, only on a V100. Now, the way that I recommend getting the timings is by removing the […]. For example, I adapted your snippet to:
and ran it on a V100 GPU (with CUDA 11.6) and got a 1.2-1.3x overall speedup compared to using the same stream for the two kernels.
and then checked the trace, and saw:

which means the kernels are actually scheduled together.

Now, if I try to schedule the two convolution kernels together (using the same script, but with both streams running conv), I see the following trace:

meaning that the kernels are serialized.

So I would recommend using the nsys tool to see what happens. Unfortunately, I do not have access to an H100 GPU to see exactly what happens there. I hope this helps, and please let me know if anything else is needed!
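For anyone repeating this comparison, here is a minimal sketch of the two checks mentioned in this reply: the overall speedup relative to the single-stream run, and whether two kernels overlap in a trace. The helper names `speedup` and `overlap_ms` are hypothetical, and every number below is a placeholder for illustration, not a measurement from this thread. It assumes you have already collected per-run timings and read kernel start and end times off the profiler timeline by hand.

```python
from statistics import median


def speedup(same_stream_ms, two_stream_ms):
    """Median single-stream time divided by median two-stream time.

    A value above 1.0 means the two-stream run was faster.
    """
    return median(same_stream_ms) / median(two_stream_ms)


def overlap_ms(kernel_a, kernel_b):
    """Time during which both kernels are running.

    Each kernel is a (start_ms, end_ms) pair taken from a trace.
    Returns 0.0 when the kernels run back to back (serialized).
    """
    start = max(kernel_a[0], kernel_b[0])
    end = min(kernel_a[1], kernel_b[1])
    return max(0.0, end - start)


# Placeholder timings in milliseconds (hypothetical, not from this issue).
same_stream = [10.0, 10.2, 9.8]
two_streams = [8.0, 8.1, 7.9]
print(f"speedup: {speedup(same_stream, two_streams):.2f}x")

# Placeholder kernel intervals in milliseconds (hypothetical).
conv = (0.0, 5.0)
bnorm = (1.0, 4.0)
print("conv/bnorm overlap:", overlap_ms(conv, bnorm), "ms")
print("conv/conv overlap:", overlap_ms((0.0, 5.0), (5.0, 9.0)), "ms")
```

Taking the median over several runs keeps a single slow iteration (for example a warm-up run) from skewing the ratio. A zero overlap between the two kernels is what a serialized trace looks like in numbers.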
I am trying to reproduce the numbers from the conv/bnorm toy benchmark from the Orion paper. I saw some code provided here but did not see a script to run bnorm and conv in parallel on different streams. I rewrote this benchmark in the following script and reran it on an H100. I got the following results and didn't see any significant speedup from running in parallel. Any advice?

Source: