Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#49: GitHub CI Benchmarking #257

Closed
wants to merge 0 commits into from

Conversation

henryleberre
Copy link
Member

No description provided.

@sbryngelson
Copy link
Member

sbryngelson commented Dec 15, 2023

Heads up: The build time on Phoenix (both CPU and GPU) using the previous Slurm scripts was very slow, almost assuredly slower than it was on an actual node. I wonder if this had to do with the resource request or something else. But we should pay attention that the run-time is not likewise much worse via this benchmark CI than it is on an "actual" node.

For example, check this out: https://github.com/MFlowCode/MFC/actions/runs/7215702552/job/19660473029

Obviously, it doesn't take 30 minutes to build and 1.25 hours to test CPU MFC on Phoenix using 12 cores (which should be the whole node, I think). Please let me know if you find a "fix" for this.

@sbryngelson
Copy link
Member

(perhaps useful for something https://github.com/MFlowCode/MFC/commits/autobench/)

@sbryngelson
Copy link
Member

[Feature request @henryleberre ] A dump of the nsight sys CLI output using flags that make it... readable/usable. Main goal is to make sure that a specific nvtx range has not gotten disproportionately more expensive for an unknown reason.

@sbryngelson
Copy link
Member

Heads up: The build time on Phoenix (both CPU and GPU) using the previous Slurm scripts was very slow, almost assuredly slower than it was on an actual node. I wonder if this had to do with the resource request or something else. But we should pay attention that the run-time is not likewise much worse via this benchmark CI than it is on an "actual" node.

For example, check this out: MFlowCode/MFC/actions/runs/7215702552/job/19660473029

Obviously, it doesn't take 30 minutes to build and 1.25 hours to test CPU MFC on Phoenix using 12 cores (which should be the whole node, I think). Please let me know if you find a "fix" for this.

I am amending this comment. Test is indeed slow because all jobs run on one core right now (despite the -j 24 flag), it's rebuilding the postprocess code and dependencies, and the -a option.

@henryleberre
Copy link
Member Author

I added a linting script and CI job therefor. Although I fixed a few warnings and errors, an undergraduate student could have a field day fixing the rest.

@henryleberre
Copy link
Member Author

[Feature request @henryleberre ] A dump of the nsight sys CLI output using flags that make it... readable/usable. Main goal is to make sure that a specific nvtx range has not gotten disproportionately more expensive for an unknown reason.

@sbryngelson Perhaps this feature could be added in a later revision?

@henryleberre
Copy link
Member Author

@sbryngelson I think this PR is mostly ready. Related tickets:

Caveats:

  • The current list of benchmark cases is just that of 3D examples. toolchain/bench.yaml lists the benchmark cases and how they run.
  • No PR comments. A quirk of how GitHub Actions implements the pull_request trigger makes it so that it cannot have a GITHUB_TOKEN with the ability to post comments, if the PR is made from a fork - which is almost always the case for us. It currently just prints the Markdown it would have posted as a comment. There are ways to get around this (sigh) but I didn't venture (too deep into) into this territory where security, practicality, and ease-of-use clash.

@sbryngelson
Copy link
Member

@sbryngelson I think this PR is mostly ready. Related tickets:

Caveats:

  • The current list of benchmark cases is just that of 3D examples. toolchain/bench.yaml lists the benchmark cases and how they run.
  • No PR comments. A quirk of how GitHub Actions implements the pull_request trigger makes it so that it cannot have a GITHUB_TOKEN with the ability to post comments, if the PR is made from a fork - which is almost always the case for us. It currently just prints the Markdown it would have posted as a comment. There are ways to get around this (sigh) but I didn't venture (too deep into) into this territory where security, practicality, and ease-of-use clash.

Great! So how do we keep track of performance then? Or check a PR's performance?

I can tune the examples myself or add more as needed.

@sbryngelson sbryngelson added the enhancement New feature or request label Dec 20, 2023
@sbryngelson
Copy link
Member

[Feature request @henryleberre ] A dump of the nsight sys CLI output using flags that make it... readable/usable. Main goal is to make sure that a specific nvtx range has not gotten disproportionately more expensive for an unknown reason.

@sbryngelson Perhaps this feature could be added in a later revision?

Sure, so long as it doesn't require a major toolchain update.

@henryleberre
Copy link
Member Author

henryleberre commented Dec 21, 2023

@sbryngelson, could you give me the output of these commands on Phoenix? (or grant me read permissions)

$ ls /storage/coda1/p-sbryngelson3/0/sbryngelson3/runners/actions-runner-3/_work/MFC/MFC/master/
$ ls /storage/coda1/p-sbryngelson3/0/sbryngelson3/runners/actions-runner-3/_work/MFC/MFC/pr/
$ .. any *.out files in /storage/coda1/p-sbryngelson3/0/sbryngelson3/runners/actions-runner-3/_work/MFC/MFC or one level down

These runs are under your account so I otherwise cannot debug without resorting to iteratively pushing and retriggering these workflows with aggressive logging.

@sbryngelson
Copy link
Member

sbryngelson commented Dec 28, 2023

For nsys profiling on some systems (like Phoenix) we need the nsys command to precede the mpi binary call like so nsys profile --stats=true --trace=mpi,nvtx,openacc mpirun -np 1 ../../build/no-debug_gpu_mpi/simulation/simulation. This particular case gives an output that does indeed make sense:

 Simulating a 399x0x0 case on 1 rank(s)
 [  0%]  Time step        1 of 2 @ t_step = 0
 Final Time    0.000000000000000
Generating '/scratch/4450742/nsys-report-2eff.qdstrm'
[1/3] [========================100%] report12.nsys-rep
[2/3] [========================100%] report12.sqlite
[3/3] Executing 'nvtxsum' stats report

 Time (%)  Total Time (ns)  Instances    Avg (ns)       Med (ns)      Min (ns)     Max (ns)    StdDev (ns)   Style         Range
 --------  ---------------  ---------  -------------  -------------  -----------  -----------  -----------  -------  ------------------
     54.5      126,113,739          1  126,113,739.0  126,113,739.0  126,113,739  126,113,739          0.0  PushPop  MPI:MPI_Init
     22.4       51,874,387          1   51,874,387.0   51,874,387.0   51,874,387   51,874,387          0.0  PushPop  Time_Step
     18.6       43,032,719          1   43,032,719.0   43,032,719.0   43,032,719   43,032,719          0.0  PushPop  MPI:MPI_Finalize
      3.0        6,944,330          3    2,314,776.7       80,242.0       73,200    6,790,888  3,876,427.7  PushPop  RHS-Riemann
      0.7        1,687,441          3      562,480.3       23,976.0       19,438    1,644,027    936,649.6  PushPop  RHS-MPI
      0.5        1,130,619          3      376,873.0       82,640.0       73,695      974,284    517,392.4  PushPop  RHS-WENO
      0.2          498,715          3      166,238.3       22,361.0       19,634      456,720    251,568.2  PushPop  RHS-CONVERT
      0.0          111,616        387          288.4          190.0          178       21,550      1,085.9  PushPop  MPI:MPI_Bcast
      0.0           85,226          3       28,408.7       24,963.0       22,657       37,606      8,048.1  PushPop  RHS_Flux_Add
      0.0           36,969          2       18,484.5       18,484.5       16,421       20,548      2,918.2  PushPop  MPI:MPI_Barrier
      0.0              998          3          332.7          266.0          254          478        126.0  PushPop  Viscous
      0.0              952          3          317.3          327.0          260          365         53.2  PushPop  RHS_Hypoelasticity

Generated:
    /storage/home/hcoda1/6/sbryngelson3/MFC/examples/1D_sodshocktube/report12.nsys-rep
    /storage/home/hcoda1/6/sbryngelson3/MFC/examples/1D_sodshocktube/report12.sqlite

This is suggested in the forums as well

@sbryngelson
Copy link
Member

FYI @henryleberre (to make your life easier) -- Summit is going offline, so not much focus (if any) needs to be placed on the lfs scheduler/"template", and I don't think we use any computers with PBS scripts anymore. Of course this sort of secretly means that all the complexity of these computers is being offloaded into SLURM scripts.

@sbryngelson
Copy link
Member

@henryleberre I added open issues that your PR here might touch. If it won't, remove them (or leave them until you know).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment