#49: GitHub CI Benchmarking #257

henryleberre · 2023-12-15T04:42:04Z

No description provided.

sbryngelson · 2023-12-15T13:21:17Z

Heads up: The build time on Phoenix (both CPU and GPU) using the previous Slurm scripts was very slow, almost assuredly slower than it was on an actual node. I wonder if this had to do with the resource request or something else. But we should pay attention that the run-time is not likewise much worse via this benchmark CI than it is on an "actual" node.

For example, check this out: https://github.com/MFlowCode/MFC/actions/runs/7215702552/job/19660473029

Obviously, it doesn't take 30 minutes to build and 1.25 hours to test CPU MFC on Phoenix using 12 cores (which should be the whole node, I think). Please let me know if you find a "fix" for this.

sbryngelson · 2023-12-17T09:00:28Z

(perhaps useful for something https://github.com/MFlowCode/MFC/commits/autobench/)

sbryngelson · 2023-12-18T02:35:45Z

[Feature request @henryleberre] A dump of the Nsight Systems (nsys) CLI output using flags that make it... readable/usable. The main goal is to make sure that a specific NVTX range has not gotten disproportionately more expensive for an unknown reason.

sbryngelson · 2023-12-18T02:37:42Z

> Heads up: The build time on Phoenix (both CPU and GPU) using the previous Slurm scripts was very slow, almost assuredly slower than it was on an actual node. I wonder if this had to do with the resource request or something else. But we should pay attention that the run-time is not likewise much worse via this benchmark CI than it is on an "actual" node.
>
> For example, check this out: https://github.com/MFlowCode/MFC/actions/runs/7215702552/job/19660473029
>
> Obviously, it doesn't take 30 minutes to build and 1.25 hours to test CPU MFC on Phoenix using 12 cores (which should be the whole node, I think). Please let me know if you find a "fix" for this.

I am amending this comment. The test is indeed slow, for three reasons: all jobs currently run on one core (despite the -j 24 flag), it rebuilds the postprocess code and its dependencies, and it uses the -a option.
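To illustrate the one-core issue, here is a minimal Python sketch of how a launcher could clamp a requested job count to the cores the process can actually use. The function names `usable_cpu_count` and `effective_jobs` are hypothetical and are not part of the MFC toolchain; the sketch assumes a Linux node and relies on `SLURM_CPUS_PER_TASK`, which Slurm only exports when `--cpus-per-task` is requested.

```python
import os


def usable_cpu_count(env=None):
    """Return the number of CPU cores this process may actually use."""
    env = os.environ if env is None else env
    # Cores the OS allows this process to run on (Linux); otherwise fall
    # back to the total core count reported by the machine.
    if hasattr(os, "sched_getaffinity"):
        affinity = len(os.sched_getaffinity(0))
    else:
        affinity = os.cpu_count() or 1
    # Slurm exports SLURM_CPUS_PER_TASK only when --cpus-per-task is given.
    slurm = env.get("SLURM_CPUS_PER_TASK")
    if slurm and slurm.isdigit() and int(slurm) > 0:
        return min(affinity, int(slurm))
    return affinity


def effective_jobs(requested, env=None):
    """Clamp a requested -j value to the cores that are really available."""
    return max(1, min(requested, usable_cpu_count(env)))


if __name__ == "__main__":
    # With a one-core allocation, asking for 24 jobs still yields 1.
    print(effective_jobs(24))
```

If this prints 1 inside a batch job, the allocation itself (not the -j flag) is the likely bottleneck.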

henryleberre · 2023-12-18T04:15:36Z

I added a linting script and a CI job for it. Although I fixed a few warnings and errors, an undergraduate student could have a field day fixing the rest.

henryleberre · 2023-12-18T04:20:01Z

> [Feature request @henryleberre] A dump of the Nsight Systems (nsys) CLI output using flags that make it... readable/usable. The main goal is to make sure that a specific NVTX range has not gotten disproportionately more expensive for an unknown reason.

@sbryngelson Perhaps this feature could be added in a later revision?

henryleberre · 2023-12-20T05:51:17Z

@sbryngelson I think this PR is mostly ready. Related tickets:

Fixed "Run Phoenix jobs using available CPU cores" (#266). Argument parsing was a nightmare.
Fixed "Is it worth linting the toolchain/?" (#230).
Implemented "Use GitHub CI to continuously benchmark some example cases" (#49) per 🤝.

Caveats:

The current list of benchmark cases is just the 3D examples. toolchain/bench.yaml lists the benchmark cases and how they run.
No PR comments. A quirk of how GitHub Actions implements the pull_request trigger means the workflow cannot get a GITHUB_TOKEN with permission to post comments if the PR is made from a fork, which is almost always the case for us. It currently just prints the Markdown it would have posted as a comment. There are ways to get around this (sigh), but I didn't venture (too deep) into this territory where security, practicality, and ease-of-use clash.
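As a rough illustration of the "print the Markdown" approach, the Python sketch below builds a comparison table from two sets of timings and emits it. The function names, the timing dictionaries, and the case names are hypothetical and do not reflect the actual toolchain or the schema of toolchain/bench.yaml. As one possible alternative to a PR comment, the sketch appends the table to the file named by `GITHUB_STEP_SUMMARY` when that variable is set, which GitHub Actions renders on the job summary page; this is a suggestion, not what the PR does.

```python
import os


def benchmark_markdown(master, pr):
    """Build a Markdown table comparing wall times (seconds) per case."""
    lines = [
        "| Case | master (s) | PR (s) | PR / master |",
        "| --- | ---: | ---: | ---: |",
    ]
    # Only cases timed on both branches can be compared.
    for case in sorted(master.keys() & pr.keys()):
        if master[case] <= 0:
            continue  # skip invalid baselines rather than divide by zero
        ratio = pr[case] / master[case]
        lines.append(
            f"| {case} | {master[case]:.2f} | {pr[case]:.2f} | {ratio:.2f} |"
        )
    return "\n".join(lines)


def emit(markdown, env=None):
    """Append to the job summary if available, otherwise print to stdout."""
    env = os.environ if env is None else env
    summary_path = env.get("GITHUB_STEP_SUMMARY")
    if summary_path:
        with open(summary_path, "a", encoding="utf-8") as handle:
            handle.write(markdown + "\n")
    else:
        print(markdown)


if __name__ == "__main__":
    # Hypothetical timings, for illustration only.
    master_times = {"case_a": 10.0, "case_b": 4.0}
    pr_times = {"case_a": 12.5, "case_b": 3.8}
    emit(benchmark_markdown(master_times, pr_times))
```

A ratio above 1 means the PR is slower than master for that case.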

sbryngelson · 2023-12-20T08:48:56Z

> @sbryngelson I think this PR is mostly ready. Related tickets:
>
> Fixed "Run Phoenix jobs using available CPU cores" (#266). Argument parsing was a nightmare.
>
> Fixed "Is it worth linting the toolchain/?" (#230).
>
> Implemented "Use GitHub CI to continuously benchmark some example cases" (#49) per 🤝.
>
> Caveats:
>
> The current list of benchmark cases is just the 3D examples. toolchain/bench.yaml lists the benchmark cases and how they run.
>
> No PR comments. A quirk of how GitHub Actions implements the pull_request trigger means the workflow cannot get a GITHUB_TOKEN with permission to post comments if the PR is made from a fork, which is almost always the case for us. It currently just prints the Markdown it would have posted as a comment. There are ways to get around this (sigh), but I didn't venture (too deep) into this territory where security, practicality, and ease-of-use clash.

Great! So how do we keep track of performance then? Or check a PR's performance?

I can tune the examples myself or add more as needed.

sbryngelson · 2023-12-20T11:33:53Z

> > [Feature request @henryleberre] A dump of the Nsight Systems (nsys) CLI output using flags that make it... readable/usable. The main goal is to make sure that a specific NVTX range has not gotten disproportionately more expensive for an unknown reason.
>
> @sbryngelson Perhaps this feature could be added in a later revision?

Sure, so long as it doesn't require a major toolchain update.

.github/workflows/bench.yml

henryleberre · 2023-12-21T22:56:23Z

@sbryngelson, could you give me the output of these commands on Phoenix? (or grant me read permissions)

$ ls /storage/coda1/p-sbryngelson3/0/sbryngelson3/runners/actions-runner-3/_work/MFC/MFC/master/
$ ls /storage/coda1/p-sbryngelson3/0/sbryngelson3/runners/actions-runner-3/_work/MFC/MFC/pr/
$ find /storage/coda1/p-sbryngelson3/0/sbryngelson3/runners/actions-runner-3/_work/MFC/MFC -maxdepth 2 -name '*.out'   # any *.out files there or one level down

These runs are under your account, so otherwise I cannot debug them without resorting to iteratively pushing and retriggering these workflows with aggressive logging.

sbryngelson · 2023-12-28T02:32:52Z

For nsys profiling on some systems (like Phoenix), the nsys command needs to precede the MPI binary call, like so:

$ nsys profile --stats=true --trace=mpi,nvtx,openacc mpirun -np 1 ../../build/no-debug_gpu_mpi/simulation/simulation

This particular case gives an output that does indeed make sense:

 Simulating a 399x0x0 case on 1 rank(s)
 [  0%]  Time step        1 of 2 @ t_step = 0
 Final Time    0.000000000000000
Generating '/scratch/4450742/nsys-report-2eff.qdstrm'
[1/3] [========================100%] report12.nsys-rep
[2/3] [========================100%] report12.sqlite
[3/3] Executing 'nvtxsum' stats report

 Time (%)  Total Time (ns)  Instances    Avg (ns)       Med (ns)      Min (ns)     Max (ns)    StdDev (ns)   Style         Range
 --------  ---------------  ---------  -------------  -------------  -----------  -----------  -----------  -------  ------------------
     54.5      126,113,739          1  126,113,739.0  126,113,739.0  126,113,739  126,113,739          0.0  PushPop  MPI:MPI_Init
     22.4       51,874,387          1   51,874,387.0   51,874,387.0   51,874,387   51,874,387          0.0  PushPop  Time_Step
     18.6       43,032,719          1   43,032,719.0   43,032,719.0   43,032,719   43,032,719          0.0  PushPop  MPI:MPI_Finalize
      3.0        6,944,330          3    2,314,776.7       80,242.0       73,200    6,790,888  3,876,427.7  PushPop  RHS-Riemann
      0.7        1,687,441          3      562,480.3       23,976.0       19,438    1,644,027    936,649.6  PushPop  RHS-MPI
      0.5        1,130,619          3      376,873.0       82,640.0       73,695      974,284    517,392.4  PushPop  RHS-WENO
      0.2          498,715          3      166,238.3       22,361.0       19,634      456,720    251,568.2  PushPop  RHS-CONVERT
      0.0          111,616        387          288.4          190.0          178       21,550      1,085.9  PushPop  MPI:MPI_Bcast
      0.0           85,226          3       28,408.7       24,963.0       22,657       37,606      8,048.1  PushPop  RHS_Flux_Add
      0.0           36,969          2       18,484.5       18,484.5       16,421       20,548      2,918.2  PushPop  MPI:MPI_Barrier
      0.0              998          3          332.7          266.0          254          478        126.0  PushPop  Viscous
      0.0              952          3          317.3          327.0          260          365         53.2  PushPop  RHS_Hypoelasticity

Generated:
    /storage/home/hcoda1/6/sbryngelson3/MFC/examples/1D_sodshocktube/report12.nsys-rep
    /storage/home/hcoda1/6/sbryngelson3/MFC/examples/1D_sodshocktube/report12.sqlite

This is also suggested in the NVIDIA forums (https://forums.developer.nvidia.com/t/nsys-for-multi-gpu-apps/64880), but I have not tried it with srun yet.
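To check whether a specific NVTX range has become disproportionately more expensive, the nvtxsum table above could be parsed and compared between two runs. The following is a minimal Python sketch, assuming the column layout shown in the output above; `parse_nvtxsum`, `regressed_ranges`, and the 1.5x threshold are hypothetical and not part of MFC or nsys.

```python
# Two rows copied from the nvtxsum report above, used as sample input.
SAMPLE = """
     22.4       51,874,387          1   51,874,387.0   51,874,387.0   51,874,387   51,874,387          0.0  PushPop  Time_Step
      3.0        6,944,330          3    2,314,776.7       80,242.0       73,200    6,790,888  3,876,427.7  PushPop  RHS-Riemann
"""


def parse_nvtxsum(text):
    """Map each NVTX range name to (time_percent, total_ns, instances)."""
    ranges = {}
    for line in text.splitlines():
        # Ten columns; the last one (Range) keeps any embedded spaces.
        fields = line.split(None, 9)
        if len(fields) != 10:
            continue
        try:
            percent = float(fields[0])
            total_ns = int(fields[1].replace(",", ""))
            instances = int(fields[2].replace(",", ""))
        except ValueError:
            continue  # header, separator, or unrelated log line
        ranges[fields[9].strip()] = (percent, total_ns, instances)
    return ranges


def regressed_ranges(baseline, candidate, factor=1.5):
    """Return range names whose total time grew by more than `factor`."""
    flagged = []
    for name, (_, base_ns, _) in baseline.items():
        if name in candidate and base_ns > 0:
            if candidate[name][1] > factor * base_ns:
                flagged.append(name)
    return sorted(flagged)


if __name__ == "__main__":
    for name, (percent, total_ns, count) in parse_nvtxsum(SAMPLE).items():
        print(f"{name}: {total_ns} ns over {count} call(s), {percent}%")
```

Comparing total time is only one choice; comparing each range's share of Time_Step may be more robust against node-to-node noise.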

.github/workflows/phoenix/test.sh

sbryngelson · 2024-01-03T16:50:00Z

FYI @henryleberre (to make your life easier) -- Summit is going offline, so not much focus (if any) needs to be placed on the LSF scheduler/"template", and I don't think we use any computers with PBS scripts anymore. Of course, this sort of secretly means that all the complexity of these computers is being offloaded into Slurm scripts.

sbryngelson · 2024-01-03T16:51:29Z

@henryleberre I added open issues that your PR here might touch. If it won't, remove them (or leave them until you know).

henryleberre requested a review from sbryngelson as a code owner December 15, 2023 04:42

henryleberre marked this pull request as draft December 15, 2023 04:42

henryleberre force-pushed the master branch 7 times, most recently from fb8ce32 to 7af148d December 15, 2023 06:38

henryleberre force-pushed the master branch 7 times, most recently from 9716590 to 3f2a826 December 16, 2023 05:43

henryleberre force-pushed the master branch 2 times, most recently from 5c43bc9 to bd50800 December 18, 2023 04:11

This was linked to issues Dec 18, 2023

Use GitHub CI to continuously benchmark some example cases #49

Closed

Is it worth linting the toolchain/? #230

Closed

henryleberre force-pushed the master branch from bd50800 to 2a73f74 December 18, 2023 04:42

sbryngelson mentioned this pull request Dec 18, 2023

Run Phoenix jobs using available CPU cores #266

Merged

henryleberre force-pushed the master branch 2 times, most recently from f0305ac to 7055d12 December 18, 2023 21:02

sbryngelson added the enhancement label Dec 20, 2023

sbryngelson reviewed Dec 20, 2023

.github/workflows/bench.yml (outdated)

henryleberre force-pushed the master branch from e40c97d to 2d5464e December 21, 2023 06:28

henryleberre force-pushed the master branch 4 times, most recently from 104a545 to 1f57111 December 26, 2023 20:15

sbryngelson reviewed Dec 31, 2023

.github/workflows/phoenix/test.sh (outdated)

henryleberre force-pushed the master branch 2 times, most recently from 960006d to df48cc0 January 2, 2024 23:47

This was unlinked from issues Jan 4, 2024

Python min version still a problem #269

Closed

./mfc.sh not enforcing minimum Python version - what is the minimum version anyhow? #286

Closed

henryleberre force-pushed the master branch from df48cc0 to 883c4c3 January 4, 2024 20:11

henryleberre removed a link to an issue Jan 5, 2024

Is it worth linting the toolchain/? #230

Closed

henryleberre closed this Jan 9, 2024

henryleberre force-pushed the master branch from 883c4c3 to c312cd7 January 9, 2024 08:15
