
Add benchmarks for CPU/GPU #114

Open · vaiana opened this issue Jul 13, 2023 · 3 comments

vaiana (Contributor) commented Jul 13, 2023

Describe the new feature or enhancement

It is possible to run NDK on a GPU, but it's not clear how much speed-up we get (if any). GPU instances cost more, so it would be nice to know whether the extra cost is worthwhile.

Describe your proposed implementation

Add a benchmark directory, possibly in tests/benchmark, with a script that times the execution of the scenarios under different parameters. The script should be platform/environment agnostic. We could then compile the benchmark statistics across a few different platforms (memory, GPU, CPU count) to give users a sense of the expected run time. This would also improve the time estimate printed during the simulation.
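As a sketch of what that script could look like (nothing here is the real NDK API: `run_scenario` is a hypothetical placeholder for whatever call actually builds and runs a scenario, e.g. what docs/examples/plot_scenarios.py does; only the timing/reporting scaffolding is meant to be illustrative):

```python
"""Sketch of a scenario benchmark harness; the scenario runner is a placeholder."""
import json
import platform
import time


def run_scenario(scenario_id: str) -> None:
    """Hypothetical placeholder: replace with the real NDK call that builds
    and runs the given scenario (e.g. what docs/examples/plot_scenarios.py does)."""
    time.sleep(0.1)  # stand-in workload so the harness runs end to end


def benchmark(scenario_ids, repeats=3):
    results = []
    for scenario_id in scenario_ids:
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            run_scenario(scenario_id)
            timings.append(time.perf_counter() - start)
        results.append(
            {
                "scenario": scenario_id,
                "best_s": round(min(timings), 3),
                "mean_s": round(sum(timings) / len(timings), 3),
            }
        )
    # Record the environment so numbers from different machines can be compared.
    report = {"machine": platform.platform(), "results": results}
    print(json.dumps(report, indent=2))


if __name__ == "__main__":
    benchmark(["scenario-0-v0", "scenario-1-2d-v0", "scenario-2-2d-v0"])
```

Emitting JSON would make it easy to collect results from several machines and compare them later.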

Additional comments

In very simple testing I found no speed-up using the GPU with the default settings of scenario-1-2d-v0: both ran in about 10 s. There may be a much larger speed-up for big simulations (3-D?), but this is the type of thing that would be good to know ahead of time.

charlesbmi self-assigned this Jul 28, 2023

charlesbmi (Contributor) commented Jul 28, 2023

CPU comparisons are available in this private repo: https://github.com/agencyenterprise/ndk-research/blob/main/experiments/184354911-resources-benchmark-for-scenarios/184354911-resource_utilization_benchmark.ipynb

That data was used to estimate time/memory requirements for NDK users.

Might look into the GPU comparisons; would run into memory limits though.

charlesbmi (Contributor) commented Jul 28, 2023

As a preliminary CPU/GPU comparison, I ran docs/examples/plot_scenarios.py on an AWS EC2 p3.2xlarge instance, which has an NVIDIA V100 GPU and 8 vCPUs. A more thorough investigation is in order, but the generated logs actually provide some good intuition about what kinds of speed-ups can be expected.

Stride's computations appear to take three major steps:

  1. Generating the C++ code for the given scenario.
  2. Compiling the generated C++ code (this step can be cached).
  3. Running the compiled operator.

For a given scenario, the times were consistent (±5%) across runs. For the CPU setup, I used Devito's automatic selection of the number of CPU threads, which always came out to 4 threads.

| Scenario [grid shape] | Step | CPU (4 threads) | GPU |
| --- | --- | --- | --- |
| scenario-0-v0 [101, 81] | Generate acoustic_iso_state operator | 10.03 s | 10.39 s |
| | JIT-compile C++ file | 3.64 s | 4.3 s |
| | Run operator [11 GFlops] | 18 GFlops/s | 18 GFlops/s |
| scenario-1-2d-v0 [241, 141] | Generate acoustic_iso_state operator | 9.76 s | 10.8 s |
| | JIT-compile C++ file | 3.24 s | 4.38 s |
| | Run operator [42 GFlops] | 27 GFlops/s | 48 GFlops/s |
| scenario-2-2d-v0 [451, 351] | Generate acoustic_iso_state operator | 9.82 s | 10.8 s |
| | JIT-compile C++ file | 3.25 s | 4.27 s |
| | Run operator [202 GFlops] | 35 GFlops/s | 109 GFlops/s |
| scenario-1-3d-v0 [241, 141, 141] | Generate acoustic_iso_state operator | 23.3 s | 23.66 s |
| | JIT-compile C++ file | 6.9 s | 8.53 s |
| | Run operator [16 TFlops] | 21 GFlops/s | 610 GFlops/s |

From the table, we can see that:

  • Generating the C++ code takes a similar amount of time on the CPU and GPU setups, and across the different 2-D scenarios.
  • Compiling the C++ code for CPU with gcc was ~1 second faster than compiling for GPU with pgc++. Compilation time is similar across the different 2-D scenarios.
  • For the largest 2-D example scenario, scenario-2-2d-v0, the GPU provided a ~3x speed-up in the final step (running the operator).
  • The larger the spatial grid, the larger the GPU performance boost over the CPU; the GPU provides a huge boost for the 3-D simulation, as the back-of-the-envelope numbers below show.
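As a rough sanity check on what those throughput numbers imply for wall-clock time, dividing the total work by the sustained rate (using only figures from the table above) gives the expected run-operator times:

```python
# Back-of-the-envelope run-operator times for scenario-1-3d-v0,
# using the figures from the table above: runtime ≈ total work / sustained rate.
work_gflops = 16_000        # 16 TFlops of work, expressed in GFlops
cpu_rate_gflops_s = 21      # CPU, 4 threads
gpu_rate_gflops_s = 610     # NVIDIA V100

print(f"CPU: ~{work_gflops / cpu_rate_gflops_s:.0f} s")  # ~762 s
print(f"GPU: ~{work_gflops / gpu_rate_gflops_s:.0f} s")  # ~26 s, roughly 29x faster
```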

Raw logs:
env -u PLATFORM python docs/examples/plot_scenarios.py | tee cpu_log.txt
cpu_log.txt

PLATFORM=nvidia-acc python docs/examples/plot_scenarios.py | tee gpu_log.txt
gpu_log.txt
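For convenience, here is a small wrapper that reproduces the two passes above from Python. It runs the same script with the same PLATFORM handling as the shell commands; the only difference is that output goes straight to the log files instead of through tee:

```python
"""Run the CPU and GPU passes of docs/examples/plot_scenarios.py back to back."""
import os
import subprocess

RUNS = {
    # CPU pass: PLATFORM unset, matching `env -u PLATFORM ...`.
    "cpu_log.txt": None,
    # GPU pass: matching `PLATFORM=nvidia-acc ...`.
    "gpu_log.txt": "nvidia-acc",
}

for log_name, platform_value in RUNS.items():
    env = os.environ.copy()
    env.pop("PLATFORM", None)
    if platform_value is not None:
        env["PLATFORM"] = platform_value
    with open(log_name, "w") as log_file:
        subprocess.run(
            ["python", "docs/examples/plot_scenarios.py"],
            env=env,
            stdout=log_file,
            stderr=subprocess.STDOUT,
            check=True,
        )
```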

charlesbmi (Contributor) commented

Note: I tried running the scenarios with different points_per_period settings to change the time resolution. This changed the number of GFlops for the Run operator linearly, but did not seem to have any effect on the GFlops/s. It seems the GPU only improves parallelization over the space dimensions, so finer time resolution scales the run time proportionally on both CPU and GPU.
