#49: GitHub CI Benchmarking #257
Conversation
(force-pushed from fb8ce32 to 7af148d)
Heads up: the build time on Phoenix (both CPU and GPU) using the previous Slurm scripts was very slow, almost assuredly slower than it was on an actual node. I wonder if this had to do with the resource request or something else, but we should make sure the run time via this benchmark CI is not likewise much worse than on an "actual" node. For example, check this out: https://github.com/MFlowCode/MFC/actions/runs/7215702552/job/19660473029. Obviously, it doesn't take 30 minutes to build and 1.25 hours to test CPU MFC on Phoenix using 12 cores (which should be the whole node, I think). Please let me know if you find a "fix" for this.
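For reference, a minimal sketch of a full-node Slurm request; the directives, core count, and mfc.sh flags here are assumptions for illustration, not the actual scripts in this PR:

```sh
#!/bin/bash
#SBATCH --nodes=1                # one full node
#SBATCH --ntasks-per-node=12     # assumption: 12 cores is the whole Phoenix node
#SBATCH --exclusive              # do not share the node with other jobs
#SBATCH --time=01:00:00

# Build and test with all requested cores (flags illustrative).
./mfc.sh build -j 12
./mfc.sh test  -j 12
```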
(force-pushed from 9716590 to 3f2a826)
(perhaps useful for something: https://github.com/MFlowCode/MFC/commits/autobench/)
[Feature request @henryleberre] A dump of the Nsight Systems (nsys) CLI output using flags that make it readable/usable. The main goal is to make sure that a specific NVTX range has not gotten disproportionately more expensive for an unknown reason.
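As a rough sketch of what such a dump could be, assuming a report file already exists (the report name varies across nsys versions, e.g. nvtxsum vs. nvtx_sum, and the file name is illustrative):

```sh
# Print a per-NVTX-range time summary as a plain table
# from an existing Nsight Systems report file.
nsys stats --report nvtx_sum --format table report.nsys-rep
```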
I am amending this comment. Testing is indeed slow because all jobs run on one core right now (despite the …)
(force-pushed from 5c43bc9 to bd50800)
I added a linting script and a CI job for it. Although I fixed a few warnings and errors, an undergraduate student could have a field day fixing the rest.
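As a sketch, such a CI step might boil down to something like the following; the tool (pylint) and the target directory are assumptions about the toolchain, not the script added here:

```sh
# Hypothetical lint invocation; pylint and the toolchain/ path are assumed.
pip install pylint
pylint --recursive=y toolchain/
```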
@sbryngelson Perhaps this feature could be added in a later revision?
(force-pushed from f0305ac to 7055d12)
@sbryngelson I think this PR is mostly ready. Related tickets:

Caveats:
Great! So how do we keep track of performance then? Or check a PR's performance? I can tune the examples myself or add more as needed.
Sure, so long as it doesn't require a major toolchain update.
@sbryngelson, could you give me the output of these commands on Phoenix? (or grant me read permissions)

```sh
$ ls /storage/coda1/p-sbryngelson3/0/sbryngelson3/runners/actions-runner-3/_work/MFC/MFC/master/
$ ls /storage/coda1/p-sbryngelson3/0/sbryngelson3/runners/actions-runner-3/_work/MFC/MFC/pr/
$ # ... any *.out files in /storage/coda1/p-sbryngelson3/0/sbryngelson3/runners/actions-runner-3/_work/MFC/MFC or one level down
```

These runs are under your account, so I otherwise cannot debug without resorting to iteratively pushing and retriggering these workflows with aggressive logging.
(force-pushed from 104a545 to 1f57111)
For nsys profiling on some systems (like Phoenix), we need the nsys command to precede the MPI binary call, as in the sketch below. This is suggested in the forums as well.
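A sketch of that ordering; the launcher, flags, and binary name are illustrative, not MFC's exact command:

```sh
# nsys wraps the application binary (after the MPI launcher), so each
# rank is profiled; %q{VAR} expands an environment variable in the
# output name, giving one report file per rank.
srun -n 2 nsys profile --trace=mpi,cuda,nvtx \
    -o report_rank%q{SLURM_PROCID} ./simulation
```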
(force-pushed from 960006d to df48cc0)
FYI @henryleberre (to make your life easier): Summit is going offline, so not much focus (if any) needs to be placed on the LSF scheduler/"template", and I don't think we use any computers with PBS scripts anymore. Of course, this sort of secretly means that all the complexity of these computers is being offloaded into the Slurm scripts.
@henryleberre I added open issues that your PR here might touch. If it won't, remove them (or leave them until you know). |