Updates docs for grammar and removes unstable links (#588)
sbryngelson authored Aug 23, 2024
1 parent 19878a3 commit 7bdf4e3
Showing 3 changed files with 36 additions and 33 deletions.
22 changes: 13 additions & 9 deletions docs/documentation/expectedPerformance.md
@@ -1,31 +1,34 @@
# Performance

MFC has been benchmarked on several CPUs and GPU devices.
-This page shows a summary of these results.
+This page is a summary of these results.

## Figure of merit: Grind time performance

-The following table outlines observed performance as nanoseconds per grid point (ns/GP) per equation (eq) per right-hand side (rhs) evaluation (lower is better), also known as the grind time.
+The following table outlines observed performance as nanoseconds per grid point (ns/gp) per equation (eq) per right-hand side (rhs) evaluation (lower is better), also known as the grind time.
We solve an example 3D, inviscid, 5-equation model problem with two advected species (8 PDEs) and 8M grid points (158-cubed uniform grid).
The numerics are WENO5 finite volume reconstruction and HLLC approximate Riemann solver.
This case is located in `examples/3D_performance_test`.
You can run it via `./mfc.sh run -n <num_processors> -j $(nproc) ./examples/3D_performance_test/case.py -t pre_process simulation --case-optimization`, which will build an optimized version of the code for this case then execute it.
If the above does not work on your machine, see the rest of this documentation for other ways to use the `./mfc.sh run` command.
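For reference, a grind time can be backed out from a run's wall-clock time as in the sketch below; every value is a placeholder, and the number of right-hand-side evaluations per step depends on the time stepper you selected.

```shell
# Sketch: back out the grind time [ns/gp/eq/rhs] from a run's wall-clock time.
# All values are placeholders -- substitute the numbers from your own run.
wall_time_s=120       # wall-clock time spent in time stepping [s]
grid_points=8000000   # total number of grid points
num_eqs=8             # number of PDEs solved
rhs_per_step=3        # RHS evaluations per time step (depends on the time stepper)
num_steps=1000        # number of time steps taken
python3 -c "print($wall_time_s * 1e9 / ($grid_points * $num_eqs * $rhs_per_step * $num_steps))"
```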

Results are for MFC v4.9.3 (July 2024 release), though numbers have not changed meaningfully since then.
Similar performance is also seen for other problem configurations, such as the Euler equations (4 PDEs).
All results are for the compiler that gave the best performance.
Note:
-* CPU results may be performed on CPUs with more cores than reported in the table; we report results for the best performance given the full processor die by checking the performance for different core counts on that device. CPU results are for the best performance we achieved using a single socket (or die).
+* CPU results may be obtained on CPUs with more cores than reported in the table; we report results for the best performance given the full processor die by checking the performance for different core counts on that device. CPU results are the best performance we achieved using a single socket (or die).
These are reported as (X/Y cores), where X is the used cores, and Y is the total on the die.
-* GPU results are for a single GPU device. For single-precision (SP) GPUs, we performed computation in double-precision via conversion in compiler/software; these numbers are _not_ for single-precision computation. AMD MI250X GPUs have two graphics compute dies (GCDs) per MI250X device; we report results for one GCD, though one can quickly estimate full MI250X runtime by halving the single GCD grind time number.
+* GPU results are for a single GPU device. For single-precision (SP) GPUs, we performed computation in double-precision via conversion in compiler/software; these numbers are _not_ for single-precision computation. AMD MI250X and MI300A GPUs have multiple graphics compute dies (GCDs) per device; we report results for one _GCD_*, though one can quickly estimate full device runtime by dividing the grind time number by the number of GCDs on the device (the MI250X has 2 GCDs). We gratefully acknowledge LLNL, HPE/Cray, and AMD for permission to release MI300A performance numbers.

-| Hardware | | Grind Time | Compiler | Computer |
+| Hardware | | Grind Time [ns] | Compiler | Computer |
| ---: | ----: | ----: | :--- | :--- |
| NVIDIA GH200 (GPU only) | 1 GPU | 0.32 | NVHPC 24.1 | GT Rogues Gallery |
| NVIDIA H100 | 1 GPU | 0.45 | NVHPC 24.5 | GT Rogues Gallery |
-| AMD MI300A | 1 __GCD__ | 0.60 | CCE 18.0.0 | LLNL Tioga |
+| AMD MI300A | 1 _GCD_* | 0.60 | CCE 18.0.0 | LLNL Tioga |
| NVIDIA A100 | 1 GPU | 0.62 | NVHPC 22.11 | GT Phoenix |
| NVIDIA V100 | 1 GPU | 0.99 | NVHPC 22.11 | GT Phoenix |
| NVIDIA A30 | 1 GPU | 1.1 | NVHPC 24.1 | GT Rogues Gallery |
-| AMD MI250X | 1 __GCD__ | 1.1 | CCE 16.0.1 | OLCF Frontier |
+| AMD MI250X | 1 _GCD_* | 1.1 | CCE 16.0.1 | OLCF Frontier |
| AMD MI100 | 1 GPU | 1.4 | CCE 16.0.1 | Cray internal system |
| NVIDIA L40S (SP GPU) | 1 GPU | 1.7 | NVHPC 24.5 | GT ICE |
| NVIDIA P100 | 1 GPU | 2.4 | NVHPC 23.5 | GT CSE Internal |
@@ -81,8 +84,9 @@ Strong scaling results are obtained by keeping the problem size constant and inc

### NVIDIA V100 GPU

-For these tests, the base case utilizes 8 GPUs with one MPI process per GPU.
-The performance is analyzed at two different problem sizes of 16M and 64M grid points, with the base case using 2M and 8M grid points per process.
+The base case utilizes 8 GPUs with one MPI process per GPU for these tests.
+The performance is analyzed at two problem sizes: 16M and 64M grid points.
+The "base case" uses 2M and 8M grid points per process.

#### 16M Grid Points

28 changes: 13 additions & 15 deletions docs/documentation/getting-started.md
@@ -35,7 +35,7 @@ sudo apt install tar wget make cmake gcc g++ \
python3-venv
```

-- **Via [Pacman](https://wiki.archlinux.org/title/pacman):**
+- **Via Pacman (Arch):**

```shell
sudo pacman -Syu
@@ -87,12 +87,12 @@ Once you have WSL installed, you can follow the instructions for *nix systems ab

Install the latest version of:
- [Microsoft Visual Studio Community](https://visualstudio.microsoft.com/)
-- [Intel® oneAPI Base Toolkit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html)
-- [Intel® oneAPI HPC Toolkit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/hpc-toolkit-download.html)
+- Intel® oneAPI Base Toolkit
+- Intel® oneAPI HPC Toolkit
- [Strawberry Perl](https://strawberryperl.com/) (Install and add `C:\strawberry\perl\bin\perl.exe` or your installation path to your [PATH](https://www.architectryan.com/2018/03/17/add-to-the-path-on-windows-10/))
Please note that Visual Studio must be installed first, and the oneAPI Toolkits need to be configured with the installed Visual Studio, even if you plan to use a different IDE.

-Then, in order to initialize your development environment, run the following command (or your installation path) in command prompt:
+Then, to initialize your development environment, run the following command (or your installation path) in the command prompt:
```shell
"C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
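From that same initialized prompt, MFC can then be built with its Windows helper script; the invocation below is a sketch that assumes `mfc.bat` mirrors the `mfc.sh` subcommands used elsewhere in this page.

```shell
.\mfc.bat build -t pre_process simulation
```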
@@ -130,8 +130,8 @@ open ~/.bash_profile

An editor should open.
Please paste the following lines into it before saving the file.
-If you wish to use a version of GNU's GCC other than 13, modify the first assignment.
-These lines ensure that LLVM's Clang, and Apple's modified version of GCC, won't be used to compile MFC.
+Modify the first assignment if you wish to use a different version of GNU's GCC.
+These lines ensure that LLVM's Clang and Apple's modified version of GCC are not used to compile MFC.
Further reading on `open-mpi` incompatibility with `clang`-based `gcc` on macOS: [here](https://stackoverflow.com/questions/27930481/how-to-build-openmpi-with-homebrew-and-gcc-4-9).
We do *not* support `clang` due to conflicts with the Silo dependency.
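The exact lines are part of the full getting-started guide and are elided from this hunk; purely as an illustration of their intent (the variable names and version below are hypothetical), they force the Homebrew GCC toolchain along these lines:

```shell
# Hypothetical illustration only -- see the full getting-started.md for the real lines.
export CC=gcc-13 CXX=g++-13 FC=gfortran-13   # point builds at Homebrew's GCC, not Apple Clang
```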

@@ -158,7 +158,7 @@ They will download the dependencies MFC requires to build itself.
Docker is a lightweight, cross-platform, and performant alternative to Virtual Machines (VMs).
We build a Docker Image that contains the packages required to build and run MFC on your local machine.

-First install Docker and Git:
+First, install Docker and Git:
- Windows: [Docker](https://docs.docker.com/get-docker/) + [Git](https://git-scm.com/downloads).
- macOS: `brew install git docker` (requires [Homebrew](https://brew.sh/)).
- Other systems:
@@ -182,13 +182,11 @@ recommended settings applied, run
.\mfc.bat docker # If on Windows
```

-We automatically mount and configure the proper permissions in order for you to
-access your local copy of MFC, available at `~/MFC`. You will be logged-in as the
-`me` user with root permissions.
+We automatically mount and configure the proper permissions for you to access your local copy of MFC, available at `~/MFC`.
+You will be logged in as the `me` user with root permissions.

:warning: The state of your container is entirely transient, except for the MFC mount.
-Thus, any modification outside of `~/MFC` should be considered as permanently lost upon
-session exit.
+Thus, any modification outside of `~/MFC` should be considered permanently lost upon session exit.
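Once inside the container, a typical first step might look like the following sketch, using the mount point and build command described in this page:

```shell
# Inside the container: the local MFC checkout is mounted at ~/MFC.
cd ~/MFC
./mfc.sh build -j "$(nproc)"   # build using all available cores
```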

</details>

@@ -207,10 +205,10 @@ MFC can be built with support for various (compile-time) features:
_⚠️ The `--gpu` option requires that your compiler supports OpenACC for Fortran for your target GPU architecture._

When these options are given to `mfc.sh`, they will be remembered when you issue future commands.
-You can enable and disable features at any time by passing any of the arguments above.
-For example, if you have previously built MFC with MPI support and no longer wish to run using MPI, you can pass `--no-mpi` once, for the change to be permanent.
+You can enable and disable features anytime by passing any of the arguments above.
+For example, if you previously built MFC with MPI support and no longer wish to run using MPI, you can pass `--no-mpi` once, making the change permanent.

-MFC is composed of three codes, each being a separate _target_.
+MFC comprises three codes, each being a separate _target_.
By default, all targets (`pre_process`, `simulation`, and `post_process`) are selected.
To only select a subset, use the `-t` (i.e., `--targets`) argument.
For a detailed list of options, arguments, and features, please refer to `./mfc.sh build --help`.
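As an illustration, two build invocations combining the flags described above (the combinations are examples, not requirements):

```shell
# Illustrative build invocations; see `./mfc.sh build --help` for the full list.
./mfc.sh build --gpu                                 # enable GPU (OpenACC) support
./mfc.sh build --no-mpi -t pre_process simulation    # drop MPI; build only two targets
```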
19 changes: 10 additions & 9 deletions docs/documentation/running.md
@@ -1,8 +1,9 @@
# Running

MFC can be run using `mfc.sh`'s `run` command.
-It supports both interactive and batch execution, the latter being designed for multi-socket systems, namely supercomputers, equipped with a scheduler such as PBS, SLURM, and LSF.
-A full (and updated) list of available arguments can be acquired with `./mfc.sh run -h`.
+It supports interactive and batch execution.
+Batch mode is designed for multi-node distributed systems (supercomputers) equipped with a scheduler such as PBS, SLURM, or LSF.
+A full (and up-to-date) list of available arguments can be acquired with `./mfc.sh run -h`.

MFC supports running simulations locally (Linux, MacOS, and Windows) as well as
several supercomputer clusters, both interactively and through batch submission.
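For instance, an interactive run might look like the sketch below (the case path and rank count are illustrative); batch submission uses the same `run` command with the scheduler-oriented arguments listed by `./mfc.sh run -h`.

```shell
# Illustrative interactive run on 4 ranks.
./mfc.sh run ./examples/3D_performance_test/case.py -n 4 -t pre_process simulation
```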
@@ -17,7 +18,7 @@ several supercomputer clusters, both interactively and through batch submission.
>
> Adding a new template file or modifying an existing one will most likely be required if:
> - You are on a cluster that does not have a template yet.
-> - Your cluster is configured with SLURM but interactive job launches fail when
+> - Your cluster is configured with SLURM, but interactive job launches fail when
> using `srun`. You might need to invoke `mpirun` instead.
> - Something in the existing default or computer template file is incompatible with
> your system or does not provide a feature you need.
@@ -88,7 +89,7 @@ MFC provides two different arguments to facilitate profiling with NVIDIA Nsight.
- Nsight Systems (Nsys): `./mfc.sh run ... -t simulation --nsys [nsys flags]` allows one to visualize MFC's system-wide performance with [NVIDIA Nsight Systems](https://developer.nvidia.com/nsight-systems).
NSys is best for understanding the order and execution times of major subroutines (WENO, Riemann, etc.) in MFC.
When used, `--nsys` will run the simulation and generate `.nsys-rep` files in the case directory for all targets.
-These files can then be imported into Nsight System's GUI, which can be downloaded [here](https://developer.nvidia.com/nsight-systems/get-started#latest-Platforms). It is best to run case files with a few timesteps to keep the report files small. Learn more about NVIDIA Nsight Systems [here](https://docs.nvidia.com/nsight-systems/UserGuide/index.html).
+These files can then be imported into Nsight System's GUI, which can be downloaded [here](https://developer.nvidia.com/nsight-systems/get-started#latest-Platforms). To keep the report files small, it is best to run case files with a few timesteps. Learn more about NVIDIA Nsight Systems [here](https://docs.nvidia.com/nsight-systems/UserGuide/index.html).
- Nsight Compute (NCU): `./mfc.sh run ... -t simulation --ncu [ncu flags]` allows one to conduct kernel-level profiling with [NVIDIA Nsight Compute](https://developer.nvidia.com/nsight-compute).
NCU provides profiling information for every subroutine called and is more detailed than NSys.
When used, `--ncu` will output profiling information for all subroutines, including elapsed clock cycles, memory used, and more after the simulation is run.
@@ -101,11 +102,11 @@ Learn more about NVIDIA Nsight Compute [here](https://docs.nvidia.com/nsight-com
When used, `--roc` will run the simulation and generate files in the case directory for all targets.
`results.json` can then be imported in [Perfetto's UI](https://ui.perfetto.dev/).
Learn more about AMD Rocprof [here](https://rocm.docs.amd.com/projects/rocprofiler/en/docs-5.5.1/rocprof.html)
+It is best to run case files with a few timesteps to keep the report files small.
-- Omniperf (OMNI): `./mfc.sh run ... -t simulation --omni [omniperf flags]`allows one to conduct kernel-level profiling with [AMD Omniperf](https://rocm.github.io/omniperf/introduction.html#what-is-omniperf).
-When used, `--omni` will output profiling information for all subroutines, including rooflines, cache usage, register usage, and more after the simulation is run.
-It is best to run case files with few timesteps to keep the report file sizes manageable.
+- Omniperf (OMNI): `./mfc.sh run ... -t simulation --omni [omniperf flags]` allows one to conduct kernel-level profiling with [AMD Omniperf](https://rocm.github.io/omniperf/introduction.html#what-is-omniperf).
+When used, `--omni` will output profiling information for all subroutines, including rooflines, cache usage, register usage, and more, after the simulation is run.
Adding this argument will moderately slow down the simulation and run the MFC executable several times.
-For this reason it should only be used with case files that have a few timesteps.
+For this reason, it should only be used with case files with few timesteps.
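As a concrete sketch of the profiler flags described above (the case path is illustrative, and short runs keep the reports small):

```shell
# Illustrative short profiling runs with the flags described above.
./mfc.sh run ./examples/3D_performance_test/case.py -t simulation --nsys   # Nsight Systems
./mfc.sh run ./examples/3D_performance_test/case.py -t simulation --ncu    # Nsight Compute
```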


### Restarting Cases
@@ -123,7 +124,7 @@ If you want to restart a simulation,
in which $t_i$ is the starting time, $t_f$ is the final time, and $SF$ is the saving frequency time.
- Run `pre_process` and `simulation` on the case.
- `./mfc.sh run case.py -t pre_process simulation `
-- As the simulation runs, it will create Lustre files for each saved timestep in `./restart_data`.
+- As the simulation runs, Lustre files will be created for each saved timestep in `./restart_data`.
- When the simulation stops, choose any Lustre file as the restarting point (lustre_ $t_s$.dat)
- Create a new duplicate input file (e.g., `restart_case.py`), which should have:

