CodeGen: Rewrite dot product lowering using a dedicated IR instruction #1512

This will help optimize the lowering of vector.dot and vector.normalize.

This exposes vdpps on X64 and makes it possible to compute a 3-wide dot product of two vectors, returning the result as a number.

This is useful for vector.dot, vector.magnitude and vector.normalize.
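For reference, the relationship between the three operations can be sketched in plain C++. This is only a scalar model of the math, not the IR instruction or the emitted code; `Vec3`, `dot3`, `magnitude` and `normalize` are hypothetical names chosen for illustration.

```cpp
#include <cmath>

// Hypothetical 3-component vector; any fourth SIMD lane is ignored here.
struct Vec3
{
    float x, y, z;
};

// 3-wide dot product of two vectors, returned as a single number.
inline float dot3(const Vec3& a, const Vec3& b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Magnitude is the square root of the vector dotted with itself.
inline float magnitude(const Vec3& v)
{
    return std::sqrt(dot3(v, v));
}

// Normalize scales each component by the inverse magnitude.
// No zero-length check is done in this sketch.
inline Vec3 normalize(const Vec3& v)
{
    float inv = 1.0f / std::sqrt(dot3(v, v));
    return {v.x * inv, v.y * inv, v.z * inv};
}
```

Both `magnitude` and `normalize` reduce to a single dot product plus a square root, which is why one dedicated dot product instruction can serve all three.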

This uses existing instructions and scalar adds to establish a baseline. It is still faster than the original implementation of the vector operations.

We now support the scalar version of the faddp opcode, which adds the first two floats of the vector and writes the sum into the first scalar of the destination.

This results in about the same performance as a naive version on M2, but uses fewer registers and is what clang generates for a similar source.
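A rough model of how a pairwise add can fit into a 3-wide dot product is sketched below. The exact instruction sequence emitted by this change is not shown in the description, so the ordering here is an assumption; `Float4`, `mulLanes`, `addPairScalar` and `dot3Reduced` are hypothetical names.

```cpp
#include <array>

// Hypothetical model of a 4-lane float register.
using Float4 = std::array<float, 4>;

// Lane-wise multiply, standing in for a vector multiply.
inline Float4 mulLanes(const Float4& a, const Float4& b)
{
    return {a[0] * b[0], a[1] * b[1], a[2] * b[2], a[3] * b[3]};
}

// Model of the scalar pairwise add: sums the first two lanes
// and yields the result as a scalar.
inline float addPairScalar(const Float4& v)
{
    return v[0] + v[1];
}

// Assumed reduction order: pairwise add of lanes 0 and 1,
// then a scalar add of lane 2. Lane 3 is ignored.
inline float dot3Reduced(const Float4& a, const Float4& b)
{
    Float4 products = mulLanes(a, b);
    float xy = addPairScalar(products);
    return xy + products[2];
}
```

The pairwise add folds two of the three products in one step, so only one more scalar add is needed to finish the reduction.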
