Figure out which target features are required for which SIMD size #131800

RalfJung · 2024-10-16T19:19:09Z

workingjubilee · 2024-10-17T01:54:36Z

cc @programmerjake to confirm what features are relevant re: PowerPC

workingjubilee · 2024-10-17T02:08:12Z

arm: may be able to rip something out of Revise arm platform notes regarding soft float #130987
csky: cc @Dirreke re: target-features that affect vector ABIs?
loongarch: cc @heiher re: target-features that affect vector ABIs?
riscv: cc @kito-cheng or @topperc to confirm for LLVM features that can affect the vector ABI?
s390x: this got a nice writeup recently in s390x vector facilities support #130869 (comment)

programmerjake · 2024-10-17T02:11:07Z

+altivec enables 128-bit vectors, I'm not sure if there are any wider types -- there's MMA with 512-bit accumulators, but idk if they are vector types, they're used for matrix ops.

topperc · 2024-10-17T02:29:01Z

riscv: cc @kito-cheng or @topperc to confirm for LLVM features that can affect the vector ABI?

+zve32x, +zve32f, +zve64x, +zve64f, +zve64d, +v, +zvbb, +zvkb. There are several others that start with +zv*. There is an implies relationship, so they all ultimately imply +zve32x. These change the ABI for fixed length vector arguments/returns in IR in the backend.

When compiling with clang with the default C ABI, fixed length vectors are passed coerced to a scalar integer type or passed indirectly through memory. clang will not create fixed length vector return types or arguments in IR.

There is a fixed length vector ABI being implemented via an attribute llvm/llvm-project#100346. This changes how fixed length vectors are passed.
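
Editorial sketch: the "implies" relationship described above can be modelled as a small graph walk. This is only an illustration, not how LLVM or rustc implement it. The edge table is an assumption written for this example (the authoritative list lives in LLVM's RISCVFeatures.td), and the names `IMPLIES`, `implies`, and `affects_vector_abi` are hypothetical. The crypto features such as zvbb are deliberately left out of the table.

```rust
/// Illustrative subset of "implies" edges between RISC-V vector features.
/// ASSUMPTION: these edges are written for illustration only; the
/// authoritative list lives in LLVM's RISCVFeatures.td.
const IMPLIES: &[(&str, &str)] = &[
    ("v", "zve64d"),
    ("zve64d", "zve64f"),
    ("zve64f", "zve32f"),
    ("zve64f", "zve64x"),
    ("zve64x", "zve32x"),
    ("zve32f", "zve32x"),
];

/// Returns true if enabling `from` also enables `to`, directly or transitively.
/// The table above is acyclic, so the recursion terminates.
fn implies(from: &str, to: &str) -> bool {
    from == to || IMPLIES.iter().any(|&(a, b)| a == from && implies(b, to))
}

/// Hypothetical helper: does this feature (transitively) imply `zve32x`,
/// the feature that the comment above says everything bottoms out in?
fn affects_vector_abi(feature: &str) -> bool {
    implies(feature, "zve32x")
}
```

With this table, `affects_vector_abi("v")` holds because `v` reaches `zve32x` through `zve64d`, while a feature missing from the table is reported as not implying anything.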

programmerjake · 2024-10-17T02:40:26Z

vsx just enables an additional 32 registers that overlap with the scalar floating point registers but are otherwise the same as the 32 128-bit registers from vmx (aka. altivec), so no new simd bitwidths.

workingjubilee · 2024-10-17T02:57:47Z

@programmerjake and no alterations to the calling convention?

workingjubilee · 2024-10-17T03:13:30Z

@topperc hm, it seems LLVM doesn't have the crypto implications for these features specified here? is it somewhere else? https://github.com/llvm/llvm-project/blob/d54953ef472bfd8d4b503aae7682aa76c49f8cc0/llvm/lib/Target/RISCV/RISCVFeatures.td#L734-L746

it seems to rather be the opposite, a requirement relationship, but perhaps I'm misunderstanding: https://github.com/llvm/llvm-project/blob/d54953ef472bfd8d4b503aae7682aa76c49f8cc0/llvm/lib/TargetParser/RISCVISAInfo.cpp#L754-L758

topperc · 2024-10-17T03:19:11Z

@topperc hm, it seems LLVM doesn't have the crypto implications for these features specified here? is it somewhere else? https://github.com/llvm/llvm-project/blob/d54953ef472bfd8d4b503aae7682aa76c49f8cc0/llvm/lib/Target/RISCV/RISCVFeatures.td#L734-L746

it seems to rather be the opposite, a requirement relationship, but perhaps I'm misunderstanding: https://github.com/llvm/llvm-project/blob/d54953ef472bfd8d4b503aae7682aa76c49f8cc0/llvm/lib/TargetParser/RISCVISAInfo.cpp#L754-L758

My mistake. I'm not sure why we don't have the implies. gcc does.

programmerjake · 2024-10-17T04:22:19Z

@programmerjake and no alterations to the calling convention?

enabling vsx doesn't alter the calling convention.

RalfJung · 2024-10-17T06:05:50Z

+altivec enables 128-bit vectors, I'm not sure if there are any wider types -- there's MMA with 512-bit accumulators, but idk if they are vector types, they're used for matrix ops.

Are there types which are passed via these MMA registers?

RalfJung · 2024-10-17T06:07:28Z

My mistake. I'm not sure why we don't have the implies. gcc does.

For us it's nice that there's no "implies" here; that makes it a lot easier to check the ABI consequences. ;) This way we just have to block the actual feature changing something, not other features implying them.

Though maybe this is also unnecessary if #131807 takes care of all that.

EDIT: Ah, that's just for float ABI, not for vectors, is it?

RalfJung · 2024-10-17T06:08:44Z

-Cllvm-args="--riscv-v-vector-bits-min=N"

Uh wait a second, we are exposing another flag that can change ABI? 😢 😭

EDIT: That's probably a discussion for Zulip.

heiher · 2024-10-17T07:12:17Z

LoongArch: According to the LoongArch ABI Specs, vector type parameters and return values are passed in GAR(general-purpose argument registers) or on the stack, and do not rely on vector registers or vector features.

programmerjake · 2024-10-17T07:15:59Z

+altivec enables 128-bit vectors, I'm not sure if there are any wider types -- there's MMA with 512-bit accumulators, but idk if they are vector types, they're used for matrix ops.

Are there types which are passed via these MMA registers?

after a bit more research, there are types for MMA, but you can't pass them by value in function arguments or return, so they're not ABI-breaking: https://clang.godbolt.org/z/e4sTY37Pv
they lower to <256 x i1> and <512 x i1>

workingjubilee · 2024-10-17T07:25:54Z

...concerning. these blockades are in clang's semantic checks, they don't seem to be enforced by LLVM.

workingjubilee · 2024-10-17T07:33:21Z

cc @jacobbramley re: aarch64

workingjubilee · 2024-10-17T07:35:27Z

cc @androm3da re: hexagon

RalfJung · 2024-10-17T08:13:15Z

after a bit more research, there are types for MMA, but you can't pass them by value in function arguments or return

How do we handle that in Rust? We'd need a special pass during collection rejecting them as arguments, likely as part of the simd arg check that this issue is about.

I assume we don't support these types yet, but this will need to be considered when someone decides to add them.

RalfJung · 2024-10-17T08:24:22Z

-Cllvm-args="--riscv-v-vector-bits-min=N"

Uh wait a second, we are exposing another flag that can change ABI? 😢 😭

I only just realized this only affects scalable vector types. Which anyway we don't support. So we can ignore this for now.

Dirreke · 2024-10-17T12:48:14Z

According to the CSKY Development Guide: 4.5 vdsp, the vector width is configured as follows:

For vdspv2, the width is fixed at 128 bits.
For vdspv1, the default width is 128 bits, but it can optionally be set to 64 bits using the -mvdsp-width=64 compiler flag. However, please note that the 64-bit width option is currently unsupported by LLVM.

Therefore, for both versions, you can safely set the vector width to 128 bits.

csky:
"vdspv1" | "vdspv2" => vlen(128),
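
Editorial sketch: the `feature => vlen(N)` notation used in this thread can be read as a lookup table from target features to the widest vector they allow. The following is a minimal illustration of that reading, assuming a flat table of `(max_bits, feature)` pairs. The names `CSKY_VECTOR_FEATURES` and `vector_size_supported` are hypothetical and are not taken from rustc.

```rust
/// Hypothetical table: each entry says "vectors up to `max_bits` wide are
/// available when `feature` is enabled". The CSKY values mirror the
/// `vlen(128)` entry given above; the table shape itself is an assumption.
const CSKY_VECTOR_FEATURES: &[(u64, &str)] = &[(128, "vdspv1"), (128, "vdspv2")];

/// Returns true if some enabled feature in `table` covers a vector of `bits` bits.
fn vector_size_supported(table: &[(u64, &str)], bits: u64, enabled: &[&str]) -> bool {
    table.iter().any(|&(max_bits, feature)| {
        bits <= max_bits && enabled.iter().any(|e| *e == feature)
    })
}
```

For example, a 128-bit vector is accepted when `vdspv2` is enabled, while a 256-bit vector, or any vector with no vdsp feature enabled, is rejected.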

androm3da · 2024-10-17T13:51:26Z

Hexagon HVX uses these target flags: +hvx-length64b, +hvx-length128b, +hvx, +hvxv60 through +hvxv73 (explicitly 60,62,65,66,67,68,69,71,73). hvx-length64b maps to 64 byte (512 bit) value registers, hvx-length128b is 128 byte (1024 bit) value reg. These registers can also be paired.

Oh - and there's also vector predicate registers - they're 64-bit and 128-bit wide in hvx-length64b and hvx-length128b modes, respectively. These registers can also be quad'd.

All of these vector, vector-predicate registers are caller-saved.

Link to hexagon ABI doc
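
Editorial sketch: the byte-to-bit arithmetic for the two HVX length modes can be spelled out as follows. The widths come from the comment above; the type `HvxLength` and its methods are hypothetical names for illustration, and the pair width assumes a register pair is simply two full vector registers.

```rust
/// The two HVX length modes named in the comment above.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum HvxLength {
    Bytes64,
    Bytes128,
}

impl HvxLength {
    /// Parse a target-feature name, with or without the leading `+`.
    /// Other HVX flags (`+hvx`, `+hvxv60`, ...) do not select a length.
    fn from_feature(name: &str) -> Option<Self> {
        match name.strip_prefix('+').unwrap_or(name) {
            "hvx-length64b" => Some(HvxLength::Bytes64),
            "hvx-length128b" => Some(HvxLength::Bytes128),
            _ => None,
        }
    }

    /// Width of one vector register in bits (bytes * 8).
    fn vector_bits(self) -> u32 {
        match self {
            HvxLength::Bytes64 => 64 * 8,
            HvxLength::Bytes128 => 128 * 8,
        }
    }

    /// ASSUMPTION: a register pair holds two full vector registers.
    fn vector_pair_bits(self) -> u32 {
        2 * self.vector_bits()
    }

    /// Width of one vector predicate register in bits, as stated above.
    fn predicate_bits(self) -> u32 {
        match self {
            HvxLength::Bytes64 => 64,
            HvxLength::Bytes128 => 128,
        }
    }
}
```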

jacobbramley · 2024-10-21T14:58:13Z

cc @jacobbramley re: aarch64

Neon vectors always have at least 128 bits, so I can confirm this (if I understand the notation correctly):

"neon" => vlen(128)

Neon also provides operations on 64-bit vectors, and corresponding types. Those types can be passed as arguments too, in case that matters, but anything with Neon supports both the 128-bit and 64-bit types.

SVE vectors have a size that isn't known until runtime. In general, we say that they must be passed in "Z" registers (or "P" registers for the predicate type, svbool_t). Expressing this as N in vlen(N) is verbose, but it's simpler to say that Rust will need the "sve" target feature to pass these (as well as to handle them at all), and then each vector will occupy exactly one "Z" or one "P" register, or an appropriately-sized stack slot in case a function takes many arguments.

"sve" & -Cllvm-args="--aarch64-sve-vector-bits-min={N}" => vlen(N), NOTE: only scalable vectors?

That N can be any power of two such that 128 <= N <= 2048. SVE code is usually intended to be length-agnostic. However, some compilers have options for specialising code generation in case N is actually known in advance. Clang has -msve-vector-bits, for example. I would guess that --aarch64-sve-vector-bits-min does a similar thing, setting a lower bound.

Passing something like --aarch64-sve-vector-bits-min could generate incorrect code if the user gets the vector length wrong. It doesn't actually change the procedure call part of the ABI, though, so I think this is no different than any of the other -Cllvm-args or -Ctarget-features; these can all result in incorrect code generation if the runtime environment doesn't support what is enabled.
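
Editorial sketch: the constraint on N stated above (a power of two with 128 <= N <= 2048) is easy to check mechanically. The function name `is_valid_sve_vector_bits` is hypothetical; this only validates the number and says nothing about what the hardware actually implements.

```rust
/// Returns true if `n` is a vector length in bits that satisfies the rule
/// quoted above: a power of two in the inclusive range 128..=2048.
fn is_valid_sve_vector_bits(n: u32) -> bool {
    n.is_power_of_two() && (128..=2048).contains(&n)
}
```

A front end could run a check like this before forwarding a user-supplied N to the backend, but as noted above, a well-formed N that does not match the runtime vector length would still produce incorrect code.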

jacobbramley · 2024-10-21T15:12:14Z

For AArch32 ("arm"), Neon has 128-bit vectors (like AArch64), and also supports 64-bit vectors. They are used differently for argument passing, but I doubt that it's important here.

"neon" => vlen(128)

MVE has a register layout similar to Neon, and also has 128-bit vectors, but I can't see any operations on 64-bit vectors (e.g. in ACLE). I'm not deeply familiar with MVE, but from recent experiments inspired by Zulip discussions, I think the calling convention is similar to Neon on A-class AArch32, with a few caveats such as being able to turn off FP support but still use the vector registers.

"mve" => ???

workingjubilee · 2024-10-21T22:37:06Z

I experimented with gcc and clang and it seems that these C compilers at least offer us the small mercy of resisting doing anything but stack-passing for fixed-size vectors on AArch64. Even if you have SVE above 128 bits, and even if you do your merit best to trick it into accepting such code: https://gcc.godbolt.org/z/GYGEso3Pe

Which, well, good.

It still conceivably has some ABI problems, but they at least don't try to sometimes do register passing and sometimes not, which is where we are with x86_64's featureset.

Dirreke · 2024-10-21T22:44:05Z

@RalfJung Sorry, there's a typo in my previous response. In csky, it's vdspv1 and vdspv2, not vdsp1 or vdsp2. Thanks

workingjubilee · 2024-10-21T23:17:36Z

clang says:

VDSPV1 : clang::targets::CSKYTargetInfo
VDSPV2 : clang::targets::CSKYTargetInfo

so going with that.

workingjubilee · 2024-10-22T20:32:41Z

pinging maintainers on the remaining architectures or subarchitectures:

powerpc-unknown-linux-muslspe is the PowerPC Signal Processing Engine? Isn't that a specialized DSP-oriented PowerPC-based ISA? (not to be confused with the other specialized PowerPC-based SIMD ISA of the Synergistic Processing Elements of IBM's Cell!) cc @BKPepe
sparc64: I think some have vectors? cc @glaubitz
mips*r6: I know about MSA, but are there any other vector extensions for MIPSr6, and can it even use MSA? @chenx97 @Cyanoxygen @Fearyncess @709924470
xtensa: I know it has some DSP extensions, but...??? cc @ivmarkov @MabezDev @SergioGasquez
m68k? I'm expecting a "no", but: cc @glaubitz @ricky26
nvptx64: I remember this being "it's complicated". cc @RDambrosio016 @kjetilkjeka

chenx97 · 2024-10-23T03:22:16Z

but are there any other vector extensions for MIPSr6, and can it even use MSA?

There is DSP but it cannot co-exist with MSA. The document also discourages the use of DSP in general.
MIPS® Architecture For Programmers Volume I-A: Introduction to the MIPS64® Architecture p.36
DSP is a good option when your processor only has legacy nan, and thus cannot implement MSA, but MIPS R6 is nan-2008-only, so it's always recommended to implement MSA instead.
MSA is the preferred vector extension for MIPS R6 so yes.

MabezDev · 2024-10-23T14:46:17Z

xtensa: I know it has some DSP extensions, but...??? cc @ivmarkov @MabezDev @SergioGasquez

The ESP32-S3 uses custom DSP instructions, not based on any official Xtensa vector instructions. These instructions have their own register file (prefixed with q), separate from the general purpose ones, so using these SIMD instructions won't change the ABI. The ESP32-P4 (RISCV) will also contain similar DSP instructions, but fortunately, they also have a separate block of q registers.

Both sets of instructions will operate on a maximum of 128bits.

RalfJung · 2024-10-23T15:02:32Z

And these q registers are not used for any ABI / any arguments of any type?

MabezDev · 2024-10-23T16:21:05Z

And these q registers are not used for any ABI / any arguments of any type?

Correct, they are strictly used for accumulation or parallelisation of 8/16/32 bit math/float operations in memory. There are no custom types nor any changes to the standard calling convention. As the DSP operates on memory, alignment could be a concern here, but the DSP is able to load/store from unaligned addresses (at a perf cost), meaning it stays ABI compatible. It would of course be preferable to use 128 bit alignment where possible, which I'm guessing std::simd::Simd will take care of, as it mentions "Simd<T, N> can have an alignment greater than T, for better mechanical sympathy".
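
Editorial sketch: 128-bit alignment means 16-byte alignment, which stable Rust can request with `#[repr(align(16))]`. The wrapper type `Aligned128` and the helper functions below are hypothetical and stand in for whatever `std::simd::Simd` does internally; they only show how the alignment can be requested and checked.

```rust
use std::mem::align_of;

/// Hypothetical 16-byte buffer forced to 128-bit (16-byte) alignment.
#[repr(C, align(16))]
struct Aligned128([u8; 16]);

/// Alignment of the wrapper in bytes, as reported by the compiler.
fn wrapper_alignment() -> usize {
    align_of::<Aligned128>()
}

/// Returns true if `ptr` sits on a 16-byte (128-bit) boundary.
fn is_128_bit_aligned<T>(ptr: *const T) -> bool {
    (ptr as usize) % 16 == 0
}
```

Any value of type `Aligned128` passes `is_128_bit_aligned`, whereas a plain `[u8; 16]` only has byte alignment and may or may not land on a 16-byte boundary.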

kito-cheng · 2024-10-23T16:55:55Z

The issue for the q reg from the esp32 s3 is 1) they didn't upstream their DSP extension to LLVM upstream 2) standard RISC-V ABI won't use reg from vendor extension to pass any things

MabezDev · 2024-10-23T18:15:23Z

they didn't upstream their DSP extension to LLVM upstream

The plan is for the DSP extensions to be upstreamed into LLVM.

standard RISC-V ABI won't use reg from vendor extension to pass any things

Good point, the P4 isn't finalized yet so perhaps we'll end up supporting the official vector extensions, if not we'll just expose the instructions via a separate crate instead of std::simd::*.

sayantn · 2024-10-27T14:33:28Z

AMX has a special type in LLVM called x86_amx, LLVM also supports auto-bitcast of this from/to i32x256. Rustc doesn't generate IR that references this type - so we were not able to add this type and it will probably require compiler support

kjetilkjeka · 2024-10-29T21:30:24Z

nvptx64: I remember this being "it's complicated". cc @RDambrosio016 @kjetilkjeka

I'm not sure I understand the topic well enough to give a complete answer but I will give it my best.

First of all. PTX only deals with virtual registers and exactly how values are passed is handled at a lower level (SASS). The errors encountered due to not treating SIMD types correctly might therefore be of a different nature when compiling to PTX.

Instead of having features, there are ISA versions that may slightly change how things work. When using LLVM to produce PTX, the ISA version is tied to the cpu arch, while the NVIDIA tooling strives to use new ISA versions even for old arches.

At the current newest version (PTX ISA 8.5) the documented size of vector registers is as described here

scalar registers have a width of 8-, 16-, 32-, 64-, or 128-bits, and vector registers have a width of 16-, 32-, 64-, or 128-bits

On the other hand, there are no actual vector instructions operating on the described 16-, 64-, or 128-bit registers. There are only instructions operating on 32-bit registers (4x8bit or 2x16bit).

I still believe this means the "most correct" thing to use is vlen(128), and if that ever changes in a newer ISA version we will hopefully have a better grasp of how different ISA versions should be handled than we have today.

rustbot added the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Oct 16, 2024

RalfJung mentioned this issue Oct 16, 2024

The extern "C" ABI of SIMD vector types depends on target features (tracking issue for abi_unsupported_vector_types future-incompatibility lint) #116558

Open

jieyouxu added E-needs-investigation Call for partcipation: This issues needs some investigation to determine current status C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. labels Oct 17, 2024

workingjubilee mentioned this issue Oct 17, 2024

[RISCV] Allow crypto features to imply dependents llvm/llvm-project#112659

Merged

RalfJung mentioned this issue Oct 17, 2024

Tracking issue for all the ways in which -C compiler flags can alter the ABI #131837

Open
