
[GPU] Improve webgpu optimizer performance with a top-k parallel selection algorithm #551

Open
fribbels opened this issue Aug 31, 2024 · 0 comments

fribbels commented Aug 31, 2024

Motivation

Currently the GPU acceleration is suboptimal in its buffer usage and transfers; there's potential to improve the permutations/sec with more efficient code.

The problem

The current relevant WebGPU pipeline defaults are:

WORKGROUP_SIZE: 256
NUM_WORKGROUPS: 256
CYCLES_PER_INVOCATION: 128
RESULTS_LIMIT: 1024 (nothing special about this number; users generally don't require more than this)

What the GPU optimizer does now is assign each permutation to an index. The WebGPU pipeline parallelizes the permutation calculations so that each dispatch runs 256 * 256 = 65,536 threads, and each thread does 128 permutations of work.

We allocate a buffer of 8,388,608 Float32s to store this GPU output. The buffer has to be sent to the GPU and have results written to every index, then gets mapped back to the CPU as a Float32Array. The JS loops over all 8,388,608 results, inserts them into a heap, and control transfers back to the GPU for the next iteration. At the end of all the permutations we use this heap to produce the top 1024 (RESULTS_LIMIT) values to display.
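For concreteness, here's roughly where the 8,388,608 figure and the per-pass transfer cost come from. This is a minimal sketch of the allocation, not the actual optimizer code; it assumes WebGPU types (e.g. @webgpu/types) and an already-initialized `device`:

```ts
// Assumed to be an already-initialized WebGPU device (illustrative only).
declare const device: GPUDevice;

// Constants from the pipeline defaults above.
const WORKGROUP_SIZE = 256;
const NUM_WORKGROUPS = 256;
const CYCLES_PER_INVOCATION = 128;

// 256 workgroups * 256 invocations * 128 permutations per invocation.
const RESULTS_PER_PASS = WORKGROUP_SIZE * NUM_WORKGROUPS * CYCLES_PER_INVOCATION; // 8,388,608

// One f32 per result => 8,388,608 * 4 bytes = 32 MiB written and copied back every pass.
const resultBuffer = device.createBuffer({
  size: RESULTS_PER_PASS * Float32Array.BYTES_PER_ELEMENT,
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC,
});
```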

The inefficiency here is that the buffer is far larger than necessary: we're transferring 8,388,608 elements but only using 1024 of them in the end. We could instead transfer a buffer of size 1024 (technically x2, since we need two pieces of information per result, but that comes later) and get the top 1024 sorted results.

The solutions

We can select the top 1024 elements on the GPU itself before writing to the buffer, instead of sorting the whole thing on the CPU. There are a lot of different ways to do this:

  • Sort all the elements
    • I've prototyped radix sort and bitonic sort for this, but sorting brings its own issues
    • We're doing more work than necessary, since we only care about 1024 out of the 8,388,608 results; most of the sorting is wasted
    • I believe sorts are also effectively limited to their local workgroup due to the workgroup buffer size, so we'd also need a separate merge step at the end to get the global max results
    • This is potentially fine if we transfer the full buffer but only read the first 1024 results, though that only half solves the problem
  • Implement a min-heap data structure and atomically insert results
    • Maybe there's a parallel way to do this, but the atomic writes would probably be costly
    • The benefit is that this would always hold the top 1024 results, so the heap array could be returned right away
  • Top-k selection
    • There are a lot of algorithms to research for this: recursive median-finding, partial sorts, etc. (see the sketch after this list)
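
One cheap building block for the top-k route is sketched below, with the WGSL embedded as a TypeScript string. This is a speculative sketch, not code from my prototypes: the bindings and names (`cutoff`, `candidates`, `maybeEmit`) are illustrative, and it assumes the sort values are non-negative, since for non-negative f32s the u32 bit pattern orders the same as the float, so plain u32 atomics work:

```ts
// Hedged WGSL sketch: a global atomic cutoff lets threads skip values that
// cannot possibly reach the top 1024, shrinking the candidate set before an
// exact selection. Bounds checks on the candidate buffer are omitted.
const topKSnippet = /* wgsl */ `
@group(0) @binding(0) var<storage, read_write> cutoff: atomic<u32>;
@group(0) @binding(1) var<storage, read_write> candidateCount: atomic<u32>;
@group(0) @binding(2) var<storage, read_write> candidates: array<vec2<u32>>;

fn maybeEmit(value: f32, permutationIndex: u32) {
  let bits = bitcast<u32>(value); // ordering-preserving for value >= 0.0
  if (bits <= atomicLoad(&cutoff)) {
    return; // cannot beat the current estimate of the 1024th-best value
  }
  // Append the survivor; a second pass (or the CPU) does the exact top-1024
  // selection over only the candidates, and periodically raises the cutoff.
  let slot = atomicAdd(&candidateCount, 1u);
  candidates[slot] = vec2<u32>(bits, permutationIndex);
}
`;
```

The idea is that the vast majority of the 8,388,608 values never survive the cutoff check, so the expensive exact selection only runs over a much smaller candidate buffer.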

The problem, part 2

These all sound great, but they run into the limitations of working with large arrays on the GPU. Here are the default device limits on my machine; they might be slightly different for other GPUs:

(screenshot: table of default WebGPU device limits)

Some relevant limits are:

  • maxComputeWorkgroupStorageSize: 16384
    • 16,384 bytes means the max array size in a workgroup is 4096 Float32s (see the sketch after this list)
    • But each workgroup generates 256 * 128 = 32,768 results
  • maxComputeInvocationsPerWorkgroup: 256
  • maxComputeWorkgroupSizeX: 256
  • maxComputeWorkgroupsPerDimension: 65535
    • Ideally we could just generate more threads
    • But these three defaults effectively limit us to 256 * 256 = 65,536 threads
    • We instead have to do 128 permutations per thread
  • maxBufferSize: 268435456
    • This probably shouldn't matter since we're trying to reduce the buffer size, but it does limit how many permutations per thread we can run, which is where the 128 comes from
  • Currently only f32, i32, and u32 are supported in WebGPU. Support for f16 is coming soon, which would halve the buffer size (but general availability in browsers may be a while off)
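
To make the storage limit concrete, here's a minimal WGSL fragment (again embedded as a TypeScript string, names illustrative) declaring the largest f32 scratch array a workgroup can hold under the default limit:

```ts
// 16,384 bytes of workgroup storage / 4 bytes per f32 = 4,096 elements max,
// while one workgroup produces 256 invocations * 128 cycles = 32,768 results.
const scratchSnippet = /* wgsl */ `
const SCRATCH_SIZE: u32 = 4096u; // fills maxComputeWorkgroupStorageSize exactly
var<workgroup> scratch: array<f32, SCRATCH_SIZE>;
`;
```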

We can technically request that the adapter provide a device with higher limits than these defaults, but that isn't supported on every machine.
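
For reference, raising the limits looks roughly like this. A minimal sketch assuming the standard WebGPU API (`navigator.gpu`); whether the adapter actually offers higher limits varies by machine:

```ts
// Hedged sketch; assumes WebGPU types (e.g. @webgpu/types) are available.
async function requestDeviceWithHigherLimits(): Promise<GPUDevice> {
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) throw new Error('WebGPU not available');

  // Requesting a limit above what the adapter supports makes requestDevice()
  // reject, so clamp requests to what the adapter actually reports.
  return adapter.requestDevice({
    requiredLimits: {
      maxComputeWorkgroupStorageSize: adapter.limits.maxComputeWorkgroupStorageSize,
      maxBufferSize: adapter.limits.maxBufferSize,
    },
  });
}
```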

We also require two pieces of information per permutation: the permutation index and the sort value. This lets the CPU side of the optimizer unpack the permutation index into the 6 relic IDs, and use the sort value to sort the top 1024 results across multiple GPU passes. So effectively we need to return 2048 values, not just 1024.
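
A minimal sketch of how the CPU side could unpack those pairs, assuming the GPU interleaves (permutation index, sort value) into one mapped buffer; `unpackResults` and `TopResult` are illustrative names, not existing code:

```ts
interface TopResult {
  permutationIndex: number;
  value: number;
}

// View the same bytes as i32 (even slots) and f32 (odd slots); both are 4 bytes.
function unpackResults(mapped: ArrayBuffer, limit = 1024): TopResult[] {
  const asI32 = new Int32Array(mapped);
  const asF32 = new Float32Array(mapped);
  const results: TopResult[] = [];
  for (let i = 0; i < limit; i++) {
    results.push({
      permutationIndex: asI32[2 * i], // unpacked later into the 6 relic IDs
      value: asF32[2 * i + 1],        // used to sort top-1024 across passes
    });
  }
  return results;
}
```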

Goal

Research a path to take and implement the algorithm. The goal should be to pass a 2048-length buffer of i32s to the GPU, and have it return the top 1024 sorted results within its work bounds.

Feel free to ask me in #dev-chat for more info about implementation details; I also have a bunch of prototype code from trying things out.

Labels: None yet
Status: Lowest priority / Not planned