CodeGen: Rewrite dot product lowering using a dedicated IR instruction #1512
Instead of doing the dot-product math in scalar IR, we lift the computation into a dedicated IR instruction.
On x64, we can use VDPPS, which was more or less tailor-made for this purpose. This is better than the manual scalar lowering, which requires reloading components from memory; it's not always a strict improvement over a shuffle+add version (which we never had), but the sequence can now be tuned in the IR lowering (maybe even based on CPU vendor, although that would create issues for offline compilation).
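For illustration, the two x64 strategies correspond roughly to the following intrinsics sketch (not the actual generated code; function names are made up for the example):

```cpp
#include <immintrin.h>

// 3-component dot product the way VDPPS expresses it: _mm_dp_ps compiles to
// DPPS/VDPPS; mask 0x7f multiplies lanes 0-2 and broadcasts the sum to all lanes.
static float dot3_dpps(__m128 a, __m128 b)
{
    return _mm_cvtss_f32(_mm_dp_ps(a, b, 0x7f));
}

// The shuffle+add alternative mentioned above, for comparison.
static float dot3_shuffle(__m128 a, __m128 b)
{
    __m128 p = _mm_mul_ps(a, b);                              // {x, y, z, w}
    __m128 y = _mm_shuffle_ps(p, p, _MM_SHUFFLE(1, 1, 1, 1)); // broadcast y
    __m128 z = _mm_shuffle_ps(p, p, _MM_SHUFFLE(2, 2, 2, 2)); // broadcast z
    return _mm_cvtss_f32(_mm_add_ss(_mm_add_ss(p, y), z));    // x + y + z
}
```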
On A64, we can use either naive adds or paired adds, as there is no dedicated vector-wide horizontal instruction until SVE. Both run at about the same performance on M2, but paired adds require fewer instructions and temporaries.
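For reference, the paired-add variant corresponds roughly to this NEON intrinsics sketch (again illustrative only, not the code the JIT emits):

```cpp
#include <arm_neon.h>

// 3-component dot product via FMUL + two FADDP (paired add) steps.
// The w lane is zeroed so it does not contribute to the sum.
static float dot3_paired(float32x4_t a, float32x4_t b)
{
    float32x4_t p = vmulq_f32(a, b);   // {ax*bx, ay*by, az*bz, aw*bw}
    p = vsetq_lane_f32(0.0f, p, 3);    // clear the w lane
    float32x4_t s = vpaddq_f32(p, p);  // {x+y, z+0, x+y, z+0}
    s = vpaddq_f32(s, s);              // every lane = x+y+z
    return vgetq_lane_f32(s, 0);
}
```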
I've measured this using the mesh-normal-vector benchmark, changing the benchmark to just report the time of the second loop inside `calculate_normals`, testing master vs #1504 vs this PR, and increasing the grid size to 400 for more stable timings.

- On Zen 4 (7950X), this PR is comfortably ~8% faster vs master, while I see neutral to negative results from #1504.
- On M2 (base), this PR is ~28% faster vs master, while #1504 is only about ~10% faster.
If I measure the second loop in `calculate_tangent_space` instead, I get:

- On Zen 4 (7950X), this PR is ~12% faster vs master, while #1504 is ~3% faster.
- On M2 (base), this PR is ~24% faster vs master, while #1504 is only about ~13% faster.
Note that the loops in question are not quite optimal, as they store and reload various vectors to dictionary values due to inappropriate use of locals. The underlying gains in the individual functions are thus larger than the numbers above; for example, if I change the `calculate_normals` loop to use a local variable to store the normalized vector (while still saving the result to a dictionary value), I get a ~24% performance increase from this PR on Zen 4 vs master instead of just 8% (#1504 is ~15% slower in this setup).