
New JMH benchmark method - vdot8s that implement int8 dotProduct in C… #13572

Draft
wants to merge 4 commits into base: main

Conversation

goankur
Contributor

@goankur goankur commented Jul 16, 2024

Credit: https://www.elastic.co/search-labs/blog/vector-similarity-computations-ludicrous-speed

Description (WIP - needs to be updated)

Implements a vectorized dot product for byte[] vectors in native C code using SVE and NEON intrinsics.
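
For context, here is a minimal sketch of the kind of NEON dot-product kernel described above (illustrative only: the function name, unroll factor, and details differ from the actual dotProduct.c):

    #include <arm_neon.h>
    #include <stdint.h>

    // Hypothetical sketch, not the PR's exact code. Requires the Armv8.2 dot-product
    // extension (e.g. -march=armv8.2-a+dotprod, or -march=native on a supporting CPU).
    int32_t dot8s_neon_sketch(const int8_t *a, const int8_t *b, int32_t limit) {
      int32x4_t acc = vdupq_n_s32(0);
      int32_t i = 0;
      for (; i + 16 <= limit; i += 16) {
        int8x16_t va = vld1q_s8(a + i);
        int8x16_t vb = vld1q_s8(b + i);
        acc = vdotq_s32(acc, va, vb);   // SDOT: 4 int8 products accumulated per 32-bit lane
      }
      int32_t result = vaddvq_s32(acc); // horizontal reduction of the 4 partial sums
      for (; i < limit; i++) {          // scalar tail for leftover elements
        result += a[i] * b[i];
      }
      return result;
    }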

TODOs

  • Some more plumbing is pending; marking as draft since the work is incomplete.
  • The native C code has been moved to a separate native module, and tests in core now depend on it. Ideally I'd prefer to enable this dependency only on supported platforms, but I'm not sure how to do that cleanly.
  • Native dot product is only enabled on the aarch64 architecture and applies only to byte[] vectors. Additional conditions to enable/disable it still need to be added.
  • Rudimentary correctness testing of the native code was done in int main(...) in dotProduct.c. This code should be converted to unit tests exercised via the CUnit support in Gradle.
  • JMH benchmarks on Graviton2 and Apple M3 need to be redone, as some bugs in the native code were fixed.
  • KNN benchmarks on scalar-quantized vectors need to be done to assess the indexing/querying impact.

NOTE

I had to export the environment variable CC in my shell for Gradle to pick up the gcc-10 compiler toolchain: export CC=/usr/bin/gcc10-cc

Build Lucene

$ pwd
/home/goankur/open-source/lucene

$ ./gradlew build

This generates the compiled shared library ${PWD}/lucene/native/build/libs/dotProduct/shared/libdotProduct.so
and the JMH benchmark JAR.

Run Benchmark JAR

java --enable-native-access=ALL-UNNAMED \
       --enable-preview \
       -Djava.library.path="./lucene/native/build/libs/dotProduct/shared" \
       -jar lucene/benchmark-jmh/build/benchmarks/lucene-benchmark-jmh-11.0.0-SNAPSHOT.jar \
      regexp "(.*)?(binaryDotProductVector|dot8s)"

IGNORE -- These need to be redone

Summary

  • Throughput gains relative to the Panama API based Java implementation of the int8 dot product:
    • Graviton2: 10X (NEON intrinsics)
    • Graviton3: 4.3X (SVE intrinsics)
    • Graviton4: TBD
    • Apple M3 Pro: 8X (NEON intrinsics)
  • On Graviton2, the performance of a manually unrolled loop vectorized with NEON intrinsics is equivalent to the auto-vectorized/auto-unrolled code. SVE intrinsics are not available.
  • On Graviton3, a manually unrolled loop vectorized with SVE intrinsics provides a +9.8% throughput gain (score: 45.635) over the auto-vectorized/auto-unrolled code (score: 41.729).
  • [CAUTION]: Graviton4 results look suspicious and need to be re-evaluated.
  • On Apple M3 Pro, a manually unrolled loop vectorized with NEON intrinsics is 2.38X FASTER than the auto-vectorized/auto-unrolled code. SVE intrinsics are not available.

Test Environment

  • GCC/Clang flags: --shared -O3 -march=native -funroll-loops
AWS Instance Type        Operating System     Compiler            JDK                 Gradle
Graviton2 (m6g.4xlarge)  Amazon Linux 2       GCC 10.5.0          Amazon Corretto 21  Gradle 8.8
Graviton3 (m7g.4xlarge)  Amazon Linux 2       GCC 10.5.0          Amazon Corretto 21  Gradle 8.8
Graviton4 (r8g.4xlarge)  Amazon Linux 2023    GCC 11.4.1          Amazon Corretto 21  Gradle 8.8
Apple M3                 macOS Sonoma (14.5)  Apple clang 15.0.0  OpenJDK 21.0.3      Gradle 8.8

Results

Graviton 2 (m6g.4xlarge)

Benchmark                                   (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.binaryDotProductVector     768  thrpt   15   2.857 ± 0.014  ops/us
VectorUtilBenchmark.dot8s                      768  thrpt   15  28.580 ± 0.189  ops/us
VectorUtilBenchmark.neonVdot8s                 768  thrpt   15  28.722 ± 0.454  ops/us
VectorUtilBenchmark.sveVdot8s                  768  thrpt   15  28.892 ± 0.340  ops/us

Graviton 3 (m7g.4xlarge)

Benchmark                                   (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.binaryDotProductVector     768  thrpt   15  10.563 ± 0.004  ops/us
VectorUtilBenchmark.dot8s                      768  thrpt   15  41.729 ± 1.060  ops/us
VectorUtilBenchmark.neonVdot8s                 768  thrpt   15  41.032 ± 0.215  ops/us
VectorUtilBenchmark.sveVdot8s                  768  thrpt   15  45.635 ± 0.337  ops/us

[TBD] - Graviton 4 (r8g.4xlarge)


Apple M3 Pro

Benchmark                                   (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.binaryDotProductVector     768  thrpt   15  10.399 ± 0.025  ops/us
VectorUtilBenchmark.dot8s                      768  thrpt   15  34.288 ± 0.062  ops/us
VectorUtilBenchmark.neonVdot8s                 768  thrpt   15  81.013 ± 0.569  ops/us
VectorUtilBenchmark.sveVdot8s                  768  thrpt   15  80.657 ± 0.760  ops/us

@rmuir
Member

rmuir commented Jul 16, 2024

Do we even need to use intrinsics? The function is so simple that the compiler seems to do the right thing, e.g. use the SDOT dot-product instruction, given the correct flags:

https://godbolt.org/z/KG1dPnrqn

https://developer.arm.com/documentation/102651/a/What-are-dot-product-intructions-
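
For reference, a plain C function along these lines (a sketch; the exact code in the godbolt link may differ) is all the compiler needs to emit SDOT when given, e.g., -O3 -march=armv8.2-a+dotprod:

    #include <stdint.h>

    // Dead-simple scalar loop; with the right -march flags GCC can auto-vectorize
    // this to SDOT instructions on its own.
    int32_t dot8s(const int8_t *a, const int8_t *b, int32_t limit) {
      int32_t total = 0;
      for (int32_t i = 0; i < limit; i++) {
        total += a[i] * b[i];
      }
      return total;
    }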

@rmuir
Member

rmuir commented Jul 16, 2024

I haven't benchmarked, just seems SDOT is the one to optimize for, and GCC can both recognize the code shape and autovectorize to it without hassle.

my cheap 2021 phone has asimddp feature in /proc/cpuinfo, dot product support seems widespread.

You can use it directly via intrinsic, too, no need to use add/multiply intrinsic: https://arm-software.github.io/acle/neon_intrinsics/advsimd.html#dot-product

But unless it is really faster than what GCC does with simple C, no need.

lucene/core/build.gradle (outdated review thread, resolved)
@goankur
Contributor Author

goankur commented Jul 18, 2024

Do we even need to use intrinsics? The function is so simple that the compiler seems to do the right thing, e.g. use the SDOT dot-product instruction, given the correct flags:

https://godbolt.org/z/KG1dPnrqn

https://developer.arm.com/documentation/102651/a/What-are-dot-product-intructions-

I haven't benchmarked, just seems SDOT is the one to optimize for, and GCC can both recognize the code shape and autovectorize to it without hassle.

my cheap 2021 phone has asimddp feature in /proc/cpuinfo, dot product support seems widespread.

You can use it directly via intrinsic, too, no need to use add/multiply intrinsic: https://arm-software.github.io/acle/neon_intrinsics/advsimd.html#dot-product

But unless it is really faster than what GCC does with simple C, no need.

With the updated compile flags, the performance of auto-vectorized code is slightly better than explicitly vectorized code (see results). An interesting thing to note is that both C-based implementations have 10X better throughput compared to the Panama API based Java implementation (unless I am not doing an apples-to-apples comparison).

@rmuir
Member

rmuir commented Jul 18, 2024

With the updated compile flags, the performance of auto-vectorized code is slightly better than explicitly vectorized code (see results). An interesting thing to note is that both C-based implementations have 10X better throughput compared to the Panama API based Java implementation (unless I am not doing an apples-to-apples comparison).

This seems correct to me. The Java Vector API is not performant for the integer case. Hotspot doesn't understand ARM very well and definitely doesn't have instructions such as SDOT in its vocabulary at all.

cCompiler.withArguments { args ->
    args << "--shared"
         << "-O3"
         << "-march=armv8.2-a+dotprod"
Member

For the simple case of benchmarking I would just try -march=native to compile the best it can for the specific CPU/microarchitecture. Then you could test your dot8s on ARM 256-bit SVE (Graviton3, Graviton4), which might be interesting...

Contributor Author

I tried -march=native on an m7g.4xlarge (Graviton3) instance; OS, JDK, and GCC were unchanged. dot8s and vdot8s are now ~3.5x faster than the Java implementation, compared to 10x faster on m6g.4xlarge (Graviton2):

Benchmark                                   (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.binaryDotProductVector     768  thrpt   15  10.570 ± 0.003  ops/us
VectorUtilBenchmark.dot8s                      768  thrpt   15  37.659 ± 0.422  ops/us
VectorUtilBenchmark.vdot8s                     768  thrpt   15  37.123 ± 0.237  ops/us

Member

I think this happens because with GCC 10 + SVE the implementation is not really unrolled. It should be possible to be 10x faster, too.

Member

Oh, the other likely explanation for the performance is that the integer dot product in Java is not AS HORRIBLE on 256-bit SVE as it is on 128-bit NEON. It more closely resembles how it behaves on AVX-256: two 8x8-bit integer vectors ("64-bit vectors") are multiplied into an intermediate 8x16-bit result (128-bit vector) and added to an 8x32-bit accumulator (256-bit vector). Of course, it does not use the SDOT instruction, which is sad, as that is a CPU instruction intended precisely for this purpose.

On 128-bit NEON there is no way with Java's Vector API to process 4x8-bit integers ("32-bit vectors") like the SDOT instruction does: https://developer.arm.com/documentation/102651/a/What-are-dot-product-intructions-
Nor is it even performant to take a 64-bit vector and process "part 0" then "part 1". The situation is really sad, and the performance reflects that.
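
A rough scalar model of what a single SDOT lane does (illustrative only): each 32-bit accumulator lane receives the sum of four int8*int8 products, with no intermediate widening steps.

    #include <stdint.h>

    // Scalar model of one SDOT lane: four signed 8-bit products accumulated
    // directly into a 32-bit lane.
    static int32_t sdot_lane(int32_t acc, const int8_t a[4], const int8_t b[4]) {
      for (int k = 0; k < 4; k++) {
        acc += (int32_t) a[k] * (int32_t) b[k];
      }
      return acc;
    }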

@ChrisHegarty
Contributor

With the updated compile flags, the performance of auto-vectorized code is slightly better than explicitly vectorized code (see results). An interesting thing to note is that both C-based implementations have 10X better throughput compared to the Panama API based Java implementation (unless I am not doing an apples-to-apples comparison).

This seems correct to me.

Same. The performance benefit is large here. We do similar in Elasticsearch, and have impls for both AArch64 and x64. I remember @rmuir making a similar comment before about the auto-vectorization - which is correct. I avoided it at the time given the toolchain that we were using, but it's a good option which I'll reevaluate.

https://github.com/elastic/elasticsearch/blob/main/libs/simdvec/native/src/vec/c/aarch64/vec.c

The Java Vector API is not performant for the integer case. Hotspot doesn't understand ARM very well and definitely doesn't have instructions such as SDOT in its vocabulary at all.

++ Yes. This is not a good state of affairs. I'll make sure to get an issue filed with OpenJDK for it.

@ChrisHegarty
Contributor

ChrisHegarty commented Jul 18, 2024

Could Lucene ever have this directly in one of its modules? We currently use the FlatVectorsScorer to plug in the "native code optimized" alternative when scoring Scalar Quantized vectors. But Lucene misses out on this by default, which is a pity.

Note: we currently do dot product and square distance on int7, since this gives a bit more flexibility on x64, and Lucene's scalar quantization is in the range of 0 - 127. (#13336)

@rmuir
Member

rmuir commented Jul 18, 2024

I avoided it at the time given the toolchain that we were using, but it's a good option which I'll reevaluate.

It should work well with any modern gcc (@goankur uses gcc 10 here). If using clang, I think you need to add some pragmas for it, that's all, see https://llvm.org/docs/Vectorizers.html and also there are some hints in the ARM docs.

One issue would be which targets to compile for and how to detect them (see the sketch after this list):

IMO for arm there would be two good ones:

  • armv8.2-a+dotprod (NEON)
  • armv9-a (SVE)

For x86 probably at most three?:

  • x86-64-v3 (AVX2)
  • x86-64-v4 (AVX512)
  • VNNI (should be tested but seems fit for the task? but not widespread? cascade-lake+?)
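
As a sketch of the detection side on Linux/aarch64 (an assumption on my part, not something this PR does): query the hwcaps and pick among separately compiled variants.

    #include <stdint.h>
    #include <sys/auxv.h>
    #include <asm/hwcap.h>   // HWCAP_SVE, HWCAP_ASIMDDP (Linux/aarch64 only)

    // Hypothetical variants, each compiled with its own -march flags.
    int32_t dot8s_sve(const int8_t *a, const int8_t *b, int32_t limit);
    int32_t dot8s_neon_dotprod(const int8_t *a, const int8_t *b, int32_t limit);
    int32_t dot8s_scalar(const int8_t *a, const int8_t *b, int32_t limit);

    int32_t dot8s_dispatch(const int8_t *a, const int8_t *b, int32_t limit) {
      unsigned long hwcap = getauxval(AT_HWCAP);
      if (hwcap & HWCAP_SVE) return dot8s_sve(a, b, limit);
      if (hwcap & HWCAP_ASIMDDP) return dot8s_neon_dotprod(a, b, limit);
      return dot8s_scalar(a, b, limit);
    }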

@rmuir
Member

rmuir commented Jul 18, 2024

Here is my proposal visually: https://godbolt.org/z/6fcjPWojf

As you can see, by passing -march=cascadelake it generates VNNI instructions.
IMO, no need for any intrinsics anywhere, for x86 nor ARM. Just a dead-simple C function and being smart about which targets we compile for.

(these variants should all be benchmarked of course first)

edit: I modified it to use same GCC compiler version across all variants. This newer version unrolls the SVE, too.

@rmuir
Member

rmuir commented Jul 18, 2024

And I see, from playing around with compiler versions, the advantage of the intrinsics approach (although I worry how many variants we'd maintain): it would give stability across Lucene releases without worrying about changes to the underlying C compiler, no crazy aggressive compiler flags needed, etc.

But we should at least START with the auto-vectorizer and look at/benchmark the various approaches it uses, such as the VNNI one in the example above.

@rmuir
Member

rmuir commented Jul 18, 2024

I definitely want to play around more with @goankur 's PR here and see what performance looks like across machines, but will be out of town for a bit.

There is a script to run the benchmarks across every aws instance type in parallel and gather the results: https://github.com/apache/lucene/tree/main/dev-tools/aws-jmh

I want to tweak https://github.com/apache/lucene/blob/main/dev-tools/aws-jmh/group_vars/all.yml to add the graviton4 (or any other new instance types since it was last touched) and also add c-compiler for testing this out, but we should be able to quickly iterate with it and make progress.

@ChrisHegarty
Contributor

Just to expand a little on a previous comment I made above.

Could Lucene ever have this directly in one of its modules?

An alternative option to putting this in core is to put it in, say, misc, allowing users creating a KnnVectorsFormat to hook into it through the Lucene99FlatVectorsFormat and FlatVectorsScorer. That way it would be somewhat optional in the build and distribution.

@msokolov
Contributor

An alternative option to putting this in core is to put it in, say, misc, allowing users creating a KnnVectorsFormat to hook into it through the Lucene99FlatVectorsFormat and FlatVectorsScorer. That way it would be somewhat optional in the build and distribution.

There are a few issues with distributing native-built methods that I can see. First, building becomes more complicated - you need a compiler, the build becomes more different on different platforms (can we support windows?), and when building you might have complex needs like targeting a different platform (or more platforms) than you are building on. Aside from building, there is the question of distributing the binaries and linking them into the runtime. I guess some command-line option is required in order to locate the .so (on linux) or .dll or whatever? And there can be environment-variable versions of this. So we would have to decide how much hand-holding to do. Finally we'd want to build the java code so it can fall back in a reasonable way when the native impl is not present. But none of this seems insurmountable. I believe Lucene previously had some native code libraries in its distribution and we could do so again. I'm not sure about the core vs misc distinction, I'd be OK either way, although I expect we want to keep core "clean"?

@goankur
Contributor Author

goankur commented Jul 20, 2024

With the updated compile flags, the performance of auto-vectorized code is slightly better than explicitly vectorized code (see results). An interesting thing to note is that both C-based implementations have 10X better throughput compared to the Panama API based Java implementation (unless I am not doing an apples-to-apples comparison).

This seems correct to me.

Same. The performance benefit is large here. We do similar in Elasticsearch, and have impls for both AArch64 and x64. I remember @rmuir making a similar comment before about the auto-vectorization - which is correct. I avoided it at the time given the toolchain that we were using, but it's a good option which I'll reevaluate.

https://github.com/elastic/elasticsearch/blob/main/libs/simdvec/native/src/vec/c/aarch64/vec.c

The Java Vector API is not performant for the integer case. Hotspot doesn't understand ARM very well and definitely doesn't have instructions such as SDOT in its vocabulary at all.

++ Yes. This is not a good state of affairs. I'll make sure to get an issue filed with OpenJDK for it.

Thanks Chris. Please share the link to OpenJDK issue when you get a chance.

@goankur
Contributor Author

goankur commented Jul 20, 2024

I definitely want to play around more with @goankur 's PR here and see what performance looks like across machines, but will be out of town for a bit.

There is a script to run the benchmarks across every aws instance type in parallel and gather the results: https://github.com/apache/lucene/tree/main/dev-tools/aws-jmh

I want to tweak https://github.com/apache/lucene/blob/main/dev-tools/aws-jmh/group_vars/all.yml to add the graviton4 (or any other new instance types since it was last touched) and also add c-compiler for testing this out, but we should be able to quickly iterate with it and make progress.

Let me try to squeeze some cycles out next week and see if I can make progress on this front.

@goankur goankur force-pushed the main branch 4 times, most recently from 9e9b33b to e45224b on July 30, 2024 04:31
* Looks like Apple M3 does not implement SVE and Apple's official documentation
* is not explicit about this or at least I could not find it.
*
*/
Member

I think we should remove the ifdef; this does not happen with -march=native, correct? The problem is only when you try to "force" SVE? AFAIK, M3 doesn't support it, so -march=native should automatically take the NEON path.

Contributor Author (@goankur, Aug 2, 2024)

Simply having the -march=native compiler flag and no ifdef directive makes clang (15.0) throw a compilation error: "SVE support not enabled". So I think guarding the SVE code with an ifdef is necessary. That said, there is no need for the extra check on __APPLE__, and I have removed that part from the ifdef.


int32_t vdot8s_sve(int8_t* vec1[], int8_t* vec2, int32_t limit);
int32_t vdot8s_neon(int8_t* vec1[], int8_t* vec2[], int32_t limit);
int32_t dot8s(int8_t* a, int8_t* b, int32_t limit);
Member

Can we fix these prototypes to all be the same? Can we include the header from the .c file? Maybe also adding -Wall -Werror will help keep the code tidy?

Contributor Author

Prototypes fixed and dotProduct.h is now included in dotProduct.c. I did not add -Wall -Werror yet, because after including changes from your patch the compilation of dotProduct.c fails for me on Graviton boxes. Let's continue that discussion in the separate thread.

// REDUCE: Add every vector element in target and write result to scalar
result = svaddv_s32(svptrue_b8(), acc1);

// Scalar tail. TODO: Use FMA
Member

I think you can remove the TODO: since aarch64 "mul" is really "madd", I expect it already emits a single instruction. Look at the assembler output if you are curious.

Contributor Author

This has been replaced with a vector tail, as suggested in a later comment.

acc2 = svdot_s32(acc2, va2, vb2);
acc3 = svdot_s32(acc3, va3, vb3);
acc4 = svdot_s32(acc4, va4, vb4);
}
Member

Maybe consider a "vector tail", since 4 * vector length can be significant and vector dimensions may not be an exact multiple of that. It limits the worst-case processing that the "scalar tail" must do. Example from the Java vector code: https://github.com/apache/lucene/blob/main/lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java#L153-L158
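
A sketch of what a predicated SVE vector tail could look like (assumed SVE ACLE intrinsics usage, not the PR's code):

    #include <arm_sve.h>
    #include <stdint.h>

    // Hypothetical vector tail: one predicated pass per vector-length chunk instead of
    // a scalar loop. svwhilelt builds a predicate covering lanes i..limit-1, and the
    // predicated load zeroes inactive lanes, so svdot adds nothing for them.
    static int32_t sve_vector_tail(const int8_t *a, const int8_t *b, int32_t i, int32_t limit) {
      svint32_t acc = svdup_n_s32(0);
      while (i < limit) {
        svbool_t pred = svwhilelt_b8_s32(i, limit);
        svint8_t va = svld1_s8(pred, a + i);
        svint8_t vb = svld1_s8(pred, b + i);
        acc = svdot_s32(acc, va, vb);
        i += (int32_t) svcntb();   // advance by the SVE vector length in bytes
      }
      return svaddv_s32(svptrue_b32(), acc);
    }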

Contributor Author (@goankur, Aug 2, 2024)

Converted the scalar tail to a vector tail and the throughput score on Graviton3 dropped from 45.5 to 44.5, which is a bit surprising. My guess is that the auto-vectorized and unrolled code for the scalar tail is more efficient than the vector tail. Comparing the scalar and vector tails in godbolt (https://godbolt.org/z/sjo6KMnP7), the whilelo instruction generated by auto-vectorization stands out to me, but I don't understand why that would cause a meaningful drop in performance. Let me try and dig a little deeper.

Contributor Author

Graviton 3

Benchmark                                   (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.binaryDotProductVector     768  thrpt   15  10.570 ± 0.002  ops/us
VectorUtilBenchmark.dot8s                      768  thrpt   15  44.562 ± 0.448  ops/us

// Scalar tail. TODO: Use FMA
for (; i < limit; i++) {
  result += vec1[i] * vec2[i];
}
Member

for any "tails" like this where we manually unroll and vectorize the main loop, we can add pragmas to prevent GCC/LLVM from trying to unroll and vectorize the tail. It is not strictly necessary but will lead to tighter code.

Member

And maybe it's just enough to disable auto-vectorization, as unrolling might still be a win for the scalar tail; I can't remember what Hotspot does for this in the Java vector code.
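
For example, with Clang a per-loop pragma can keep the tail scalar (a sketch; GCC spells this differently, e.g. #pragma GCC unroll 1, and only newer GCC versions have a per-loop novector pragma):

    #include <stdint.h>

    // Illustrative only: disable vectorization/unrolling for just this loop so the
    // hand-written scalar tail stays scalar (Clang pragma shown).
    static int32_t scalar_tail(const int8_t *vec1, const int8_t *vec2, int32_t i,
                               int32_t limit, int32_t result) {
    #if defined(__clang__)
    #pragma clang loop vectorize(disable) unroll(disable)
    #endif
      for (; i < limit; i++) {
        result += vec1[i] * vec2[i];
      }
      return result;
    }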

@rmuir
Member

rmuir commented Aug 1, 2024

go @goankur, awesome progress here. It is clear we gotta do something :) I left comments just to try to help. Do you mind avoiding rebase for updates? I am going to take a stab at the x86 side of the house.

for (i = 0; i + 4 * vec_length <= limit; i += 4 * vec_length) {
  // Load vectors into the Z registers, which can range from 128-bit to 2048-bit wide.
  // The predicate register P determines which bytes are active.
  // svptrue_b8() returns a predicate in which every element is true.
Member

With GCC auto-vectorization, I see use of the SVE whilelo (predicate-as-counter) instruction instead. I think it basically works on the loop counter; maybe look at the auto-generated assembly in godbolt for inspiration? Sorry, I'm not well-versed in SVE, but I am trying to help improve the performance, as it would be important for Graviton3/Graviton4/etc.

See also https://developer.arm.com/documentation/ddi0602/2024-06/SVE-Instructions/WHILELO--predicate-as-counter---While-incrementing-unsigned-scalar-lower-than-scalar--predicate-as-counter--

Contributor Author

Let me look into this tomorrow.

Contributor Author

@rmuir -- I tried using the svwhilelt_b8_u32(i, limit) intrinsic to generate the predicate in both the unrolled loop and the vector tail, but the performance was actually worse :-(.

To give you an idea of what I did, here is a link to the ARM documentation with a code sample:

https://developer.arm.com/documentation/102699/0100/Optimizing-with-intrinsics

@mikemccand
Member

Anyways: At the moment we do not want to have native code in Lucene Core.

+1, we don't put native code in Lucene's core.

But @uschindler is there maybe a way forward not using core? I.e. add native code to misc or sandbox or so, to enable iterating on direct access to SIMD for early adopters (opt in)?

Or is the src/javaXX approach core-only for some reason?

Having the likes of OpenSearch, Elasticsearch, and Solr implement their own (high performance) direct native SIMD access seems not ideal to me. It really should somehow be an option in Lucene, even if it's not the default.

This would be our "Plan C" ("Plan B" was the previous workaround of gaining early access to SIMD via Panama, but even that is bottlenecking us too much now).

@ChrisHegarty
Contributor

Anyways: At the moment we do not want to have native code in Lucene Core.
..
Having the likes of OpenSearch, Elasticsearch, and Solr implement their own (high performance) direct native SIMD access seems not ideal to me. It really should somehow be an option in Lucene, even if it's not the default.

As someone who wrote and maintains the native scorer in Elasticsearch: it gives great speed-ups for Scalar Quantised vectors on ARM, since Panama Vector is horrible for arithmetic operations on byte-sized values, both widening and accessing various parts of the vector. We have an AVX implementation too, but it gives smaller improvements. I don't see Panama Vector improving in this area any time soon :-(. For floats we don't do anything native.

In the latest Binary Quantization #13651, Panama Vector performs very nicely. It might be that some of the motivation for a native scorer may just go away.

@@ -24,7 +24,11 @@ allprojects { project ->

// Use 'release' flag instead of 'source' and 'target'
tasks.withType(JavaCompile) {
options.compilerArgs += ["--release", rootProject.minJavaVersion.toString()]
options.compilerArgs += ["--release", rootProject.minJavaVersion.toString(), "--enable-preview"]
Contributor

No need for this once everything is moved to the java21 folder.

@uschindler
Contributor

uschindler commented Sep 14, 2024

Anyways: At the moment we do not want to have native code in Lucene Core.

+1, we don't put native code in Lucene's core.

But @uschindler is there maybe a way forward not using core? I.e. add native code to misc or sandbox or so, to enable iterating on direct access to SIMD for early adopters (opt in)?

Or is the src/javaXX approach core-only for some reason?

In the current setup it is. One could generalize this, but this should be fine step by step in separate PRs. There are several problems to solve, also with regards to modularization and where the stub jars are:

  • Compilation stubs (apijars) need to move to a common folder across Gradle projects (we have no such common module yet, as core itself has no dependencies).
  • The bootstrapping of the Panama and memory-segment code is based on factories which are not yet pluggable. This may need SPI, so that one could simply drop a separate JAR onto the class path/module path and it would enable the native code.
  • I would like to prevent duplicate and hairy initialization code, so there should be one single place that loads the service providers and does the appropriate checks beforehand.

In general it's doable but I really doubt if this is healthy for this project's infrastructure.

So I am still -1 to have native code before we have refactored major parts of infrastructure. One condition is to have java 22 as minimum version, so we can at least introduce MemorySegment as top level citizen into our own APIs (e.g. remove byte arrays from our IndexInput APIs). I was arguing in the java community to have the foreign API stabilized and released with Java 21, but unfortunately it was done in 22. If we can get rid of the preview mode shit and simply tell our downstream users to accept java 22 as minimum version both at compilation and runtime, then only the Panama vector stuff needs to be mr-jarred so it would simplify a lot.

Uwe

@goankur goankur marked this pull request as ready for review October 13, 2024 23:54
@goankur goankur marked this pull request as draft October 17, 2024 01:51
@msokolov (Contributor) left a comment

I took a look. I have to say I'm still kind of confused about why we sometimes get these vector scorers via a ScorerSupplier and other times from static create() methods, and I worry about the resource usage. It's difficult to reason about by staring at the code, so I think we need to run some tests to be more confident.

MethodType.methodType(lookup.findClass(MemorySegment), byte[].class);
MethodHandle nativeMemorySegment =
lookup.findStatic(vectorUtilSupportClass, "nativeMemorySegment", methodType);
byte[] a = new byte[size];
Contributor

I'm a little confused -- is this setup code for the benchmark? We just run it once and then run dot product on the same two vectors many times? I wonder if we would see something different if we generated a large number of vectors and randomized which ones we compare on each run. Also would performance vary if the vectors are sequential in their buffer (ie vector 0 starts at 0, vector 1 starts at size...)

Contributor Author (@goankur, Oct 25, 2024)

Not the same two vectors on each iteration. We run setup once per iteration, for a total of 15 iterations across 3 forks (5 iterations per fork) for each size being tested. Each fork is preceded by 3 warm-up iterations. Before each iteration we generate random numbers in the range [0, 127] in two on-heap byte[] arrays, allocate off-heap memory segments, and populate them with the contents of the byte[] arrays. These off-heap memory segments are passed to the VectorUtil.NATIVE_DOT_PRODUCT method handle.

(Code snippet below for reference)

@Param({"1", "128", "207", "256", "300", "512", "702", "1024"})
  int size;

@Setup(Level.Iteration)
public void init() {
...
}

I wonder if we would see something different if we generated a large number of vectors and randomized which ones we compare on each run. Also would performance vary if the vectors are sequential in their buffer (ie vector 0 starts at 0, vector 1 starts at size...)

I guess the question you are hinting at is how the performance varies when the two candidate vectors are further apart in memory (L1 cache / L2 cache / L3 cache / main memory). Do the gains from the native implementation become insignificant with increasing distance? It's an interesting question and I propose that we add benchmark method(s) to answer it in a follow-up PR. Does that sound reasonable?

Contributor Author

Nonetheless I will simplify the setup code to make it a bit more readable in the next iteration.

@@ -146,6 +146,7 @@ public float getScoreCorrectionConstant(int targetOrd) throws IOException {
}
slice.seek(((long) targetOrd * byteSize) + numBytes);
slice.readFloats(scoreCorrectionConstant, 0, 1);
lastOrd = targetOrd;
Contributor

This looks like a bug: if we do this and then call vectorValue we could get the wrong result, because we don't calculate the binaryValue here.

Contributor Author

Got it! I will remove this in the next revision. I was just trying to optimize for the case when getScoreCorrectionConstant(int targetOrd) gets invoked with the same targetOrd multiple times in succession.

NativeDotProductScorer(
MemorySegmentAccessInput input, KnnVectorValues values, byte[] queryVector) {
super(input, values, queryVector);
if (offHeapQuery == null) {
Contributor

how would this ever not be null? can we assert offHeapQuery == null instead? Maybe we could make it final?

Contributor Author

This whole NativeDotProductScorer is gone in the next iteration. For the byteVector case, I am planning to generate off-heap MemorySegments when native dot product is enabled. Then, in PanamaVectorUtilSupport.dotProduct(MemorySegment a, MemorySegment b), we will delegate the computation to native code if both segments are native.

@@ -34,6 +37,8 @@ abstract sealed class Lucene99MemorySegmentByteVectorScorer
final MemorySegmentAccessInput input;
final MemorySegment query;
byte[] scratch;
MemorySegment offHeapScratch;
Contributor

I'm concerned about the cost of creating these for every scorer() we create because that happens a lot. During indexing, we create multiple scorers while adding each new document. Could we move these to the ScorerSupplier instead?

Contributor

Hmm now I see we did do that! But then I wonder why we also need to do it here.

Contributor Author (@goankur, Oct 28, 2024)

I'm concerned about the cost of creating these for every scorer() we create because that happens a lot.

The offHeapScratch is only allocated in the rare case of vector data being split across multiple MemorySegments each representing a mmap region.

Hmm now I see we did do that! But then I wonder why we also need to do it here.

When creating a RandomVectorScorer() for the given query byte[], we need to copy these bytes to off-heap space for consumption by the native dot product. Native code can only work with memory that is outside the management of the Java GC.

In the next revision, I am going to refactor some of these and add documentation to hopefully make this more clear.

@@ -0,0 +1,407 @@
/*
Contributor

can you say where this file came from? Was it mostly copied from some other file, or is it all brand new?

Contributor Author

Yep! Will do in the next revision.

Contributor

lmk

* is not explicit about this or at least I could not find it.
*
*/
int32_t vdot8s_sve(int8_t vec1[], int8_t vec2[], int32_t limit) {
Member

Earlier @rmuir had suggested just relying on gcc to compile the "obvious" C code down to an efficient vectorized implementation, but here it looks like we are still hand-rolling the NEON intrinsics? Was the gcc path less performant?

Contributor Author (@goankur, Oct 29, 2024)

Was the gcc path less performant?

Yes. The compiler-auto-vectorized dot-product implementation shows ~15% LOWER throughput compared to this manually unrolled and vectorized implementation, which uses SVE intrinsics. Here is the JMH benchmark output from the next revision, executed on an AWS Graviton3 instance. The C implementation was compiled with GCC 10 and the flags -O3 -march=native -funroll-loops.

Benchmark                        (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.dot8s           768  thrpt   15  43.233 ± 0.639  ops/us
VectorUtilBenchmark.simpleDot8s     768  thrpt   15  36.437 ± 0.572  ops/us

Command to reproduce these results with the next revision

    java --enable-native-access=ALL-UNNAMED \
            --enable-preview \
            -Djava.library.path="./lucene/native/build/libs/dotProduct/shared" \
            -jar lucene/benchmark-jmh/build/benchmarks/lucene-benchmark-jmh-11.0.0-SNAPSHOT.jar regexp \
             "(.*)?(ot8s)" \
             -p "size=768"

A benefit of the manually vectorized implementation is that it is less likely to suffer from unintended performance regressions due to auto-vectorization differences across ARM compilers.

*/
package org.apache.lucene.internal.vectorization;

/*
Member

Duplicate copyright header.

Contributor Author

Oops! Removed in the next revision.

@mikemccand
Member

@goankur -- thank you for pulling out the actual native code into a new native Lucene module. I'm not sure we need a new module -- could we use misc or sandbox maybe?

I have not looked so closely at the PR, just left a few minor comments so far, but ... how come there are so many changes necessary in core? It looks like a lot of hard forking of existing sources to specialize specific cases (memory segment, byte quantization, ???). I had expected (from the view from 10000 feet) the PR to be nearly entirely outside of core.

@yugushihuang

We have measured performance using knnPerfTest.py in luceneutil with this PR commit as the candidate branch.

cmd

'/usr/lib/jvm/java-21-amazon-corretto/bin/java', '-cp', [...], '--add-modules', 'jdk.incubator.vector', '-Djava.library.path=/home/[user_name]/lucene_candidate/lucene/native/build/libs/dotProduct/shared', 'knn.KnnGraphTester', '-quantize', '-ndoc', '1500000', '-maxConn', '32', '-beamWidthIndex', '50', '-fanout', '6', '-quantizeBits', '7', '-numMergeWorker', '12', '-numMergeThread', '4', '-encoding', 'float32', '-topK', '10', '-dim', '768', '-docs', 'enwiki-20120502-lines-1k-mpnet.vec', '-reindex', '-search-and-stats', 'enwiki-20120502-mpnet.vec', '-forceMerge', '-quiet'

Lucene_Baseline

Graph level=3 size=46, connectedness=1.00
Graph level=2 size=1405, connectedness=1.00
Graph level=1 size=46174, connectedness=1.00
Graph level=0 size=1500000, connectedness=1.00

Results:
recall  latency (ms)     nDoc  topK  fanout  maxConn  beamWidth  quantized  index s  force merge s  num segments  index size (MB)
 0.332         0.333  1500000    10       6       32         50     7 bits   432.69         271.51             1          5558.90

Lucene_Candidate

Graph level=3 size=46, connectedness=1.00
Graph level=2 size=1410, connectedness=1.00
Graph level=1 size=46205, connectedness=1.00
Graph level=0 size=1500000, connectedness=1.00

Results:
recall  latency (ms)     nDoc  topK  fanout  maxConn  beamWidth  quantized  index s  force merge s  num segments  index size (MB)
 0.337         0.260  1500000    10       6       32         50     7 bits   441.25         293.41             1          5558.91

The latency has dropped from 0.333ms to 0.26ms.

@msokolov
Contributor

Nice improvement! I do see the index time increased and wonder if it is due to creating too many heavyweight RandomVectorScorers. Maybe we can make this easier by simplifying the whole RandomVectorScorer creation pathway. Although that is out of scope here, it might be good to try and work around it if possible, as I suggested above.

* MemorySegment backing on-heap memory to native code we get
* "java.lang.UnsupportedOperationException: Not a native address"
*
* <p>Stack overflow thread:
Contributor

Thanks, this is helpful, but let's not permanently record the SO thread in the code comments.

Contributor Author

ok! removed in the next revision.

* https://stackoverflow.com/questions/69521289/jep-412-pass-a-on-heap-byte-array-to-native-code-getting-unsupportedoperatione
* explains the issue in more detail.
*
* <p>Q1. Why did we enable the native code path here if its inefficient ? A1. So that it can be
Contributor

I don't think the unit-testing use case is a good one -- unit tests should be testing the actual production code path. I see that the required test setup would be complex, but IMO it's better to have simpler production code and complex unit tests than complex unused production code and simple unit tests!

Contributor Author (@goankur, Nov 5, 2024)

Makes sense! The changes here were undone, and TestVectorUtilSupport and BaseVectorizationTestCase were modified in the next revision to exercise the native code path by obtaining and invoking the method handle for dotProduct(MemorySegment a, MemorySegment b), where 'a' and 'b' are constructed as off-heap MemorySegments.

* heap. For target-vector, copying to off-heap memory will still be needed but allocation can
* happen once per scorer.
*
* <p>Q3. Should JMH benchmarks measure the performance of this method? A3. No, because they would
Contributor

These comments are helpful for the PR, but I think in the actual code we would want to simplify and maybe say something like: do not call in production. Indeed we could possibly even add an assert false: "inefficient implementation, do not use"? And in production fall back to the non-native impl.

Contributor Author

The assert false: ... won't be necessary after undoing the changes, as the old code wraps the input byte[] into an on-heap MemorySegment, so the native implementation is not exercised in that case.

* @param numBytes Dimension of the byte vector
* @return offHeap memory segment
*/
public static MemorySegment offHeapByteVector(int numBytes) {
Contributor

maybe put "random" in the name so we can easily tell this is creating a random value

Contributor Author

I changed the implementation slightly to simply copy the input byte[] to an off-heap MemorySegment. Populating the byte[] with random bytes will happen outside this method in the next revision.

try {
  int limit = (int) a.byteSize();
  return (int) NativeMethodHandles.DOT_PRODUCT_IMPL.invokeExact(a, b, limit);
} catch (Throwable ex$) {
Contributor

what are we trying to catch here?

Contributor Author

WrongMethodTypeException and anything else (Throwable) propagated by underlying method handle. WrongMethodTypeException is a subclass of Throwable.

@@ -0,0 +1,407 @@
/*
Contributor

lmk

Ankur Goel added 3 commits November 5, 2024 02:05
…lement int8 dotProduct in C using Neon and SVE intrinsics respectively. Fallback to Neon if SVE instructions are not supported by target platform
…lement int8 dotProduct in C using Neon and SVE intrinsics respectively. Fallback to Neon if SVE instructions are not supported by target platform
…if native dot-product is enabled. Simplify JMH benchmark code that tests native dot product. Incorporate other review feedback