Add field-last benchmark script #1950

Merged 1 commit into main on Oct 15, 2024

@charleskawczynski (Member) commented Aug 22, 2024

This PR adds a benchmark to compare a dropped field dimension against moving the field dimension to the last index. This benchmark turned out to be a pretty simple modification of the offset benchmark.

cc @dennisYatunin (who was interested in this benchmark).

Let's look at the results for two kernels (Float32 for both, plus Float64 for the first):

Kernel `add3(x1, x2, x3) = x1+x2+x3` and `n_reads_writes=4` (Float32):
[ Info: ArrayType = CuArray
Problem size: (63, 4, 4, 5400, 1), float_type = Float32, device_bandwidth_GBs=2039
┌─────────────────────────────────────────────────────────────────────┬──────────────────────────────────┬─────────┬─────────────┬────────────────┬────────┐
│ funcs                                                               │ time per call                    │ bw %    │ achieved bw │ n-reads/writes │ n-reps │
├─────────────────────────────────────────────────────────────────────┼──────────────────────────────────┼─────────┼─────────────┼────────────────┼────────┤
│     FLD.aos_cart_offset!(X_aos_ref, Y_aos_ref, us; bm, nreps = 100) │ 72 microseconds, 899 nanoseconds │ 54.568  │ 1112.64     │ 4              │ 100    │
│     FLD.aos_lin_offset!(X_aos, Y_aos, us; bm, nreps = 100)          │ 56 microseconds, 259 nanoseconds │ 70.708  │ 1441.74     │ 4              │ 100    │
│     FLD.soa_linear_index!(X_soa, Y_soa, us; bm, nreps = 100)        │ 56 microseconds, 515 nanoseconds │ 70.3877 │ 1435.21     │ 4              │ 100    │
│     FLD.soa_cart_index!(X_soa, Y_soa, us; bm, nreps = 100)          │ 67 microseconds, 462 nanoseconds │ 58.9663 │ 1202.32     │ 4              │ 100    │
└─────────────────────────────────────────────────────────────────────┴──────────────────────────────────┴─────────┴─────────────┴────────────────┴────────┘

Kernel `add3(x1, x2, x3) = x1+x2+x3` and `n_reads_writes=4` (Float64):
[ Info: ArrayType = CuArray
Problem size: (63, 4, 4, 5400, 1), float_type = Float64, device_bandwidth_GBs=2039
┌─────────────────────────────────────────────────────────────────────┬───────────────────────────────────┬─────────┬─────────────┬────────────────┬────────┐
│ funcs                                                               │ time per call                     │ bw %    │ achieved bw │ n-reads/writes │ n-reps │
├─────────────────────────────────────────────────────────────────────┼───────────────────────────────────┼─────────┼─────────────┼────────────────┼────────┤
│     FLD.aos_cart_offset!(X_aos_ref, Y_aos_ref, us; bm, nreps = 100) │ 106 microseconds, 783 nanoseconds │ 74.5051 │ 1519.16     │ 4              │ 100    │
│     FLD.aos_lin_offset!(X_aos, Y_aos, us; bm, nreps = 100)          │ 102 microseconds, 472 nanoseconds │ 77.6396 │ 1583.07     │ 4              │ 100    │
│     FLD.soa_linear_index!(X_soa, Y_soa, us; bm, nreps = 100)        │ 102 microseconds, 523 nanoseconds │ 77.6008 │ 1582.28     │ 4              │ 100    │
│     FLD.soa_cart_index!(X_soa, Y_soa, us; bm, nreps = 100)          │ 106 microseconds, 834 nanoseconds │ 74.4694 │ 1518.43     │ 4              │ 100    │
└─────────────────────────────────────────────────────────────────────┴───────────────────────────────────┴─────────┴─────────────┴────────────────┴────────┘

Kernel `add3(x1, x2, x3) = x1` and `n_reads_writes=2` (Float32):
[ Info: ArrayType = CuArray
Problem size: (63, 4, 4, 5400, 1), float_type = Float32, device_bandwidth_GBs=2039
┌─────────────────────────────────────────────────────────────────────┬──────────────────────────────────┬─────────┬─────────────┬────────────────┬────────┐
│ funcs                                                               │ time per call                    │ bw %    │ achieved bw │ n-reads/writes │ n-reps │
├─────────────────────────────────────────────────────────────────────┼──────────────────────────────────┼─────────┼─────────────┼────────────────┼────────┤
│     FLD.aos_cart_offset!(X_aos_ref, Y_aos_ref, us; bm, nreps = 100) │ 61 microseconds, 185 nanoseconds │ 32.5079 │ 662.837     │ 2              │ 100    │
│     FLD.aos_lin_offset!(X_aos, Y_aos, us; bm, nreps = 100)          │ 31 microseconds, 376 nanoseconds │ 63.3926 │ 1292.57     │ 2              │ 100    │
│     FLD.soa_linear_index!(X_soa, Y_soa, us; bm, nreps = 100)        │ 31 microseconds, 120 nanoseconds │ 63.9141 │ 1303.21     │ 2              │ 100    │
│     FLD.soa_cart_index!(X_soa, Y_soa, us; bm, nreps = 100)          │ 44 microseconds, 53 nanoseconds  │ 45.1499 │ 920.607     │ 2              │ 100    │
└─────────────────────────────────────────────────────────────────────┴──────────────────────────────────┴─────────┴─────────────┴────────────────┴────────┘
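
As a sanity check on the tables above, the `achieved bw` and `bw %` columns can be recomputed from the problem size, the float type, `n-reads/writes`, and the time per call. The sketch below is a hypothetical helper, not code from the benchmark script. It assumes that the achieved bandwidth is the total bytes moved divided by `1024^3` and by the time per call, which is the choice that reproduces the reported numbers, and that `bw %` is that value relative to `device_bandwidth_GBs`.

```julia
# Hypothetical helper (not part of the benchmark script): recompute the
# "achieved bw" and "bw %" columns from the other values in a table row.
function achieved_bandwidth(problem_size, float_type, n_reads_writes, time_s;
                            device_bandwidth = 2039)
    # Total bytes moved: every element is read or written n_reads_writes times.
    n_bytes = prod(problem_size) * sizeof(float_type) * n_reads_writes
    # Assumption: the reported unit is 1024^3 bytes per second.
    achieved = n_bytes / 1024^3 / time_s
    return (achieved = achieved, percent = 100 * achieved / device_bandwidth)
end

problem_size = (63, 4, 4, 5400, 1)

# First row of the first table: 72 microseconds, 899 nanoseconds.
bw = achieved_bandwidth(problem_size, Float32, 4, 72.899e-6)
println("achieved bw = ", round(bw.achieved; digits = 2),
        ", bw % = ", round(bw.percent; digits = 2))
```

Under these assumptions the first row comes out at roughly 1112.6 and 54.57 %, in line with the table. Small differences are expected because the printed times are truncated to nanoseconds.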

Note that `soa_linear_index!` is what #1929 implements, and `aos_lin_offset!` is the best we can get by moving the field index to the last index. The two are on par in all three tables. So, if we move the field dimension to the end and avoid converting to Cartesian indices altogether, we get the best performance measured here, even in "low-utilization" expressions (where not all field variables are used). I'm happy to merge this as a way to document our performance analysis.

These results support the claim that moving the field dimension to the end of the data layout, in addition to leveraging linear indexing, will fix #1910.
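
To illustrate why a field-last layout combines well with linear indexing, here is a minimal CPU-side sketch. It is not the GPU kernel code from this PR. The array sizes and the helper names `field_lin` and `field_cart` are hypothetical and chosen for illustration only. With the field dimension last, each field variable occupies one contiguous block of `N` values, so a point of field `f` can be reached with a single integer offset instead of a Cartesian index conversion.

```julia
# Illustrative sizes only (much smaller than the benchmark's problem size).
Nv, Ni, Nj, Nh, Nf = 3, 2, 2, 4, 3
N = Nv * Ni * Nj * Nh                  # number of points per field variable

# Field dimension last: field f occupies the contiguous block
# of linear indices ((f - 1) * N + 1):(f * N).
A = reshape(collect(1.0:(N * Nf)), Nv, Ni, Nj, Nh, Nf)

# Linear access: one integer offset per field, no Cartesian conversion.
field_lin(A, p, f, N) = A[p + (f - 1) * N]

# Cartesian access: convert the point index p to (v, i, j, h) first.
function field_cart(A, p, f)
    CI = CartesianIndices(size(A)[1:4])
    return A[CI[p], f]
end

# Both access patterns address the same element for every point and field.
same = all(field_lin(A, p, f, N) == field_cart(A, p, f) for p in 1:N, f in 1:Nf)
println(same)
```

Both helpers return the same values, but the linear version needs only one multiply-add per access, while the Cartesian version has to decompose the point index into four coordinates first.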

cc @cmbengue, @tapios

@charleskawczynski merged commit 193ecfa into main on Oct 15, 2024
17 checks passed
@charleskawczynski deleted the ck/field_last_bm branch on October 15, 2024 13:24

Successfully merging this pull request may close these issues.

Fix uncoalesced memory reads