Add field-last benchmark script #1950

charleskawczynski · 2024-08-22T00:59:13Z

This PR adds a benchmark to compare a dropped field dimension against moving the field dimension to the last index. This benchmark turned out to be a pretty simple modification of the offset benchmark.

cc @dennisYatunin (who was interested in this benchmark).

Let's look at the results for two kernels: Float32 for both, plus Float64 for the first:

Kernel `add3(x1, x2, x3) = x1+x2+x3` and `n_reads_writes=4` (Float32):
[ Info: ArrayType = CuArray
Problem size: (63, 4, 4, 5400, 1), float_type = Float32, device_bandwidth_GBs=2039
┌─────────────────────────────────────────────────────────────────────┬──────────────────────────────────┬─────────┬─────────────┬────────────────┬────────┐
│ funcs                                                               │ time per call                    │ bw %    │ achieved bw │ n-reads/writes │ n-reps │
├─────────────────────────────────────────────────────────────────────┼──────────────────────────────────┼─────────┼─────────────┼────────────────┼────────┤
│     FLD.aos_cart_offset!(X_aos_ref, Y_aos_ref, us; bm, nreps = 100) │ 72 microseconds, 899 nanoseconds │ 54.568  │ 1112.64     │ 4              │ 100    │
│     FLD.aos_lin_offset!(X_aos, Y_aos, us; bm, nreps = 100)          │ 56 microseconds, 259 nanoseconds │ 70.708  │ 1441.74     │ 4              │ 100    │
│     FLD.soa_linear_index!(X_soa, Y_soa, us; bm, nreps = 100)        │ 56 microseconds, 515 nanoseconds │ 70.3877 │ 1435.21     │ 4              │ 100    │
│     FLD.soa_cart_index!(X_soa, Y_soa, us; bm, nreps = 100)          │ 67 microseconds, 462 nanoseconds │ 58.9663 │ 1202.32     │ 4              │ 100    │
└─────────────────────────────────────────────────────────────────────┴──────────────────────────────────┴─────────┴─────────────┴────────────────┴────────┘
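The `achieved bw` and `bw %` columns can be cross-checked by hand. The sketch below redoes the arithmetic for the first row above. It assumes that bytes moved are `n_reads_writes * prod(problem_size) * sizeof(float_type)` and that `achieved bw` is reported in units of 2^30 bytes per second; both are inferred from the printed numbers, not taken from the benchmark script. The function name `achieved_bandwidth` is hypothetical.

```julia
# Hypothetical helper: recompute the bandwidth columns from a table row.
# Assumption: bandwidth is reported in units of 2^30 bytes per second.
function achieved_bandwidth(problem_size, float_type, n_reads_writes, time_s)
    n_elements = prod(problem_size)
    bytes_moved = n_reads_writes * n_elements * sizeof(float_type)
    return bytes_moved / time_s / 2^30
end

# Values copied from the first row of the table above.
problem_size = (63, 4, 4, 5400, 1)
time_s = 72.899e-6               # 72 microseconds, 899 nanoseconds
device_bandwidth_GBs = 2039

bw = achieved_bandwidth(problem_size, Float32, 4, time_s)
bw_percent = 100 * bw / device_bandwidth_GBs

println("achieved bw = ", round(bw; digits = 2))
println("bw %        = ", round(bw_percent; digits = 3))
```

Under these assumptions the computed values agree with the table's 1112.64 and 54.568 to the printed precision.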

Kernel `add3(x1, x2, x3) = x1+x2+x3` and `n_reads_writes=4` (Float64):
[ Info: ArrayType = CuArray
Problem size: (63, 4, 4, 5400, 1), float_type = Float64, device_bandwidth_GBs=2039
┌─────────────────────────────────────────────────────────────────────┬───────────────────────────────────┬─────────┬─────────────┬────────────────┬────────┐
│ funcs                                                               │ time per call                     │ bw %    │ achieved bw │ n-reads/writes │ n-reps │
├─────────────────────────────────────────────────────────────────────┼───────────────────────────────────┼─────────┼─────────────┼────────────────┼────────┤
│     FLD.aos_cart_offset!(X_aos_ref, Y_aos_ref, us; bm, nreps = 100) │ 106 microseconds, 783 nanoseconds │ 74.5051 │ 1519.16     │ 4              │ 100    │
│     FLD.aos_lin_offset!(X_aos, Y_aos, us; bm, nreps = 100)          │ 102 microseconds, 472 nanoseconds │ 77.6396 │ 1583.07     │ 4              │ 100    │
│     FLD.soa_linear_index!(X_soa, Y_soa, us; bm, nreps = 100)        │ 102 microseconds, 523 nanoseconds │ 77.6008 │ 1582.28     │ 4              │ 100    │
│     FLD.soa_cart_index!(X_soa, Y_soa, us; bm, nreps = 100)          │ 106 microseconds, 834 nanoseconds │ 74.4694 │ 1518.43     │ 4              │ 100    │
└─────────────────────────────────────────────────────────────────────┴───────────────────────────────────┴─────────┴─────────────┴────────────────┴────────┘

Kernel `add3(x1, x2, x3) = x1` and `n_reads_writes=2` (Float32):
[ Info: ArrayType = CuArray
Problem size: (63, 4, 4, 5400, 1), float_type = Float32, device_bandwidth_GBs=2039
┌─────────────────────────────────────────────────────────────────────┬──────────────────────────────────┬─────────┬─────────────┬────────────────┬────────┐
│ funcs                                                               │ time per call                    │ bw %    │ achieved bw │ n-reads/writes │ n-reps │
├─────────────────────────────────────────────────────────────────────┼──────────────────────────────────┼─────────┼─────────────┼────────────────┼────────┤
│     FLD.aos_cart_offset!(X_aos_ref, Y_aos_ref, us; bm, nreps = 100) │ 61 microseconds, 185 nanoseconds │ 32.5079 │ 662.837     │ 2              │ 100    │
│     FLD.aos_lin_offset!(X_aos, Y_aos, us; bm, nreps = 100)          │ 31 microseconds, 376 nanoseconds │ 63.3926 │ 1292.57     │ 2              │ 100    │
│     FLD.soa_linear_index!(X_soa, Y_soa, us; bm, nreps = 100)        │ 31 microseconds, 120 nanoseconds │ 63.9141 │ 1303.21     │ 2              │ 100    │
│     FLD.soa_cart_index!(X_soa, Y_soa, us; bm, nreps = 100)          │ 44 microseconds, 53 nanoseconds  │ 45.1499 │ 920.607     │ 2              │ 100    │
└─────────────────────────────────────────────────────────────────────┴──────────────────────────────────┴─────────┴─────────────┴────────────────┴────────┘

Note that `soa_linear_index!` is what #1929 implements, and `aos_lin_offset!` would be the best we can get by moving the field index to the last index. The two perform nearly identically in all three tables. So, if we move the field dimension to the end and avoid converting to cartesian indices altogether, we can match the fastest variant measured here, even in "low-utilization" expressions (where not all field variables are used), where the cartesian variants are roughly 1.4x to 2x slower. I'm happy to merge this as a way to document our performance analysis.

These results support the conclusion that moving the field dimension to the end of the datalayout (in addition to leveraging linear indexing) will fix #1910.
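To make the indexing schemes concrete, here is a minimal CPU-only sketch using plain `Array`s. It is not the benchmark script: the function names, the small sizes, and the use of `Array` instead of `CuArray` are all assumptions for illustration. It shows why a field-last layout can be linearly indexed: in column-major storage, field `f` occupies a contiguous slab that starts at offset `(f - 1) * N`, where `N` is the number of spatial points.

```julia
add3(x1, x2, x3) = x1 + x2 + x3

# Field-last layout, cartesian indexing over the spatial dimensions.
function field_last_cartesian!(Y, X)
    for I in CartesianIndices(size(Y)[1:4])
        Y[I, 1] = add3(X[I, 1], X[I, 2], X[I, 3])
    end
    return Y
end

# Field-last layout, linear indexing: field f starts at offset (f - 1) * N.
function field_last_linear!(Y, X)
    N = prod(size(X)[1:4])
    for i in 1:N
        Y[i] = add3(X[i], X[i + N], X[i + 2N])
    end
    return Y
end

# Dropped field dimension: one array per field, shared linear index.
function dropped_field_linear!(y, xs)
    x1, x2, x3 = xs
    for i in eachindex(y, x1, x2, x3)
        y[i] = add3(x1[i], x2[i], x3[i])
    end
    return y
end

# Hypothetical sizes, far smaller than the benchmark's problem size.
Nv, Ni, Nj, Nh, Nf = 4, 2, 2, 3, 3
X = rand(Float32, Nv, Ni, Nj, Nh, Nf)          # field dimension last
xs = ntuple(f -> X[:, :, :, :, f], Nf)         # one 4D array per field

Y_cart = field_last_cartesian!(zeros(Float32, Nv, Ni, Nj, Nh, 1), X)
Y_lin = field_last_linear!(zeros(Float32, Nv, Ni, Nj, Nh, 1), X)
y_soa = dropped_field_linear!(zeros(Float32, Nv, Ni, Nj, Nh), xs)

println(Y_cart == Y_lin && vec(Y_lin) == vec(y_soa))
```

All three variants compute the same values; they differ only in how the memory address of each element is derived, which is what the benchmark timings above compare on the GPU.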

cc @cmbengue, @tapios

benchmarks/scripts/benchmark_field_last.jl

wip wip

charleskawczynski added the Performance monitoring 🔍🚀 label Aug 22, 2024

charleskawczynski requested review from dennisYatunin, Sbozzolo and sriharshakandala August 22, 2024 00:59

charleskawczynski force-pushed the ck/field_last_bm branch 2 times, most recently from 7a72fa6 to 4638fe2 Compare August 27, 2024 12:26

sriharshakandala reviewed Aug 27, 2024

benchmarks/scripts/benchmark_field_last.jl (resolved)

sriharshakandala reviewed Aug 29, 2024

benchmarks/scripts/benchmark_field_last.jl (resolved)

sriharshakandala mentioned this pull request Aug 30, 2024

Add a benchmark script for IJFVH datalayout. #1963

Draft

4 tasks

charleskawczynski force-pushed the ck/field_last_bm branch from 4638fe2 to 5e6b4c2 Compare October 14, 2024 13:41

charleskawczynski enabled auto-merge October 14, 2024 13:42

charleskawczynski force-pushed the ck/field_last_bm branch 3 times, most recently from 48a2946 to 9a6fbfe Compare October 15, 2024 00:17

Add field-last benchmark script

179e4e9

wip wip

charleskawczynski force-pushed the ck/field_last_bm branch from 9a6fbfe to 179e4e9 Compare October 15, 2024 12:36

charleskawczynski merged commit 193ecfa into main Oct 15, 2024
17 checks passed

charleskawczynski deleted the ck/field_last_bm branch October 15, 2024 13:24
