Introduce multiSelect for ScalarQuantizer #13919

HoustonPutman · 2024-10-15T19:52:34Z

Resolves #13918

Description

This introduces a multiSelect(from, to, k[]) method on the Selector abstract class, and gives implementations of the method for both Selector implementations, IntroSelector and RadixSelector.

This is really only used so far in the ScalarQuantizer, which uses IntroSelector, to find confidence intervals (quantiles) for lists of vectors. This code was refactored to find all quantiles at once. ScalarQuantizer.fromVectors() does 1 confidence interval (2 quantiles), and ScalarQuantizer.fromVectorsAutoInterval() does 2 confidence intervals (4 quantiles).
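To make the "one confidence interval means two quantiles" relationship concrete, here is a minimal sketch. It is not the ScalarQuantizer code: the class name, method name, and the rounding used to turn a confidence interval into an index are assumptions for illustration, and it sorts a copy of the array where the real code would use a selector.

```java
import java.util.Arrays;

// Hypothetical illustration only; names and rounding are assumptions, not Lucene's API.
final class QuantileSketch {

  /** Returns one {lower, upper} pair per confidence interval. Assumes arr is non-empty. */
  static float[][] upperAndLowerQuantiles(float[] arr, float[] confidenceIntervals) {
    float[] sorted = arr.clone();
    Arrays.sort(sorted); // baseline: a full sort; a multi-select only needs 2 positions per interval
    int n = sorted.length;
    float[][] result = new float[confidenceIntervals.length][];
    for (int i = 0; i < confidenceIntervals.length; i++) {
      // Number of values cut off at each end (assumed rounding rule).
      int tail = (int) (n * (1f - confidenceIntervals[i]) / 2f + 0.5f);
      tail = Math.min(tail, (n - 1) / 2); // keep lower index <= upper index
      result[i] = new float[] {sorted[tail], sorted[n - 1 - tail]};
    }
    return result;
  }
}
```

With two confidence intervals, the four positions `tail` and `n - 1 - tail` are exactly the k values that a single multiSelect call would be asked to place.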

multiSelect details

For both RadixSelector and IntroSelector, the multiSelect() code is similar to the select() code, except that it follows every path that could lead to one of the k values. In each path, only the relevant k values are checked. If a path contains only one k value, select() is called instead of multiSelect().
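The idea can be sketched with a plain quickselect. This is not the IntroSelector implementation: the class and method names are hypothetical, it works directly on a float[] (assumed to contain no NaN), and it has no fallback for bad pivots. It also does not switch to a separate select() when one k remains; the same loop simply handles that case.

```java
import java.util.Arrays;

// Hypothetical sketch of a quickselect-style multi-select; not Lucene's IntroSelector.
final class MultiSelectSketch {

  /** Reorders arr[from, to) so that arr[k] holds its sorted value for every k in ks. */
  static void multiSelect(float[] arr, int from, int to, int[] ks) {
    int[] sorted = ks.clone(); // copy so the caller's array keeps its order
    Arrays.sort(sorted);
    multiSelect(arr, from, to, sorted, 0, sorted.length);
  }

  // ks[kFrom, kTo) are the sorted targets that lie inside arr[from, to).
  private static void multiSelect(float[] arr, int from, int to, int[] ks, int kFrom, int kTo) {
    while (kTo > kFrom && to - from > 1) {
      // Three-way partition around a middle pivot.
      float pivot = arr[from + ((to - from) >>> 1)];
      int lt = from, i = from, gt = to;
      while (i < gt) {
        if (arr[i] < pivot) swap(arr, lt++, i++);
        else if (arr[i] > pivot) swap(arr, i, --gt);
        else i++;
      }
      // Now arr[from, lt) < pivot, arr[lt, gt) == pivot, arr[gt, to) > pivot.
      // Split the targets: only paths that still contain a k are followed.
      int leftEnd = kFrom;
      while (leftEnd < kTo && ks[leftEnd] < lt) leftEnd++;
      int rightStart = leftEnd;
      while (rightStart < kTo && ks[rightStart] < gt) rightStart++;
      multiSelect(arr, from, lt, ks, kFrom, leftEnd); // left path, with its own k values
      from = gt; // continue with the right path in this loop
      kFrom = rightStart;
    }
  }

  private static void swap(float[] arr, int a, int b) {
    float tmp = arr[a];
    arr[a] = arr[b];
    arr[b] = tmp;
  }
}
```

Targets that land inside the pivot-equal run need no further work, which is why they are dropped from both sides.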

One difference in RadixSelector is that recursive calls are not made immediately. Instead, they are kept in a list and run after all path options have been checked, so that we don't have to keep a stack of histograms. This should not be expensive, because the list of pending recursive calls is capped at the number of valid k values for that path, which will likely be small most of the time.
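A rough sketch of the deferred-call idea, under simplifying assumptions: it is not the RadixSelector code, the names are hypothetical, it handles only non-negative int values, and it distributes through a scratch buffer instead of swapping in place. The point is that one histogram is reused for every pass because pending work is queued rather than recursed into.

```java
import java.util.ArrayDeque;
import java.util.Arrays;

// Hypothetical sketch; assumes non-negative ints and every k in [0, arr.length).
final class RadixMultiSelectSketch {

  /** One deferred "recursive call": a bucket range plus the slice of sorted ks inside it. */
  private static final class Task {
    final int from, to, shift, kFrom, kTo;

    Task(int from, int to, int shift, int kFrom, int kTo) {
      this.from = from;
      this.to = to;
      this.shift = shift;
      this.kFrom = kFrom;
      this.kTo = kTo;
    }
  }

  static void multiSelect(int[] arr, int[] ks) {
    int[] sorted = ks.clone();
    Arrays.sort(sorted);
    int[] counts = new int[256]; // the single histogram, reused by every task
    int[] offsets = new int[256];
    int[] scratch = new int[arr.length];
    ArrayDeque<Task> pending = new ArrayDeque<>();
    if (sorted.length > 0 && arr.length > 1) {
      pending.add(new Task(0, arr.length, 24, 0, sorted.length));
    }
    while (!pending.isEmpty()) {
      Task t = pending.poll();
      Arrays.fill(counts, 0);
      for (int i = t.from; i < t.to; i++) counts[(arr[i] >>> t.shift) & 0xFF]++;
      int pos = t.from;
      for (int b = 0; b < 256; b++) {
        offsets[b] = pos;
        pos += counts[b];
      }
      // Distribute the range into buckets ordered by the current byte.
      for (int i = t.from; i < t.to; i++) scratch[offsets[(arr[i] >>> t.shift) & 0xFF]++] = arr[i];
      System.arraycopy(scratch, t.from, arr, t.from, t.to - t.from);
      // Walk buckets and targets together; queue only buckets that contain a k.
      int k = t.kFrom;
      int bucketStart = t.from;
      for (int b = 0; b < 256 && k < t.kTo; b++) {
        int bucketEnd = bucketStart + counts[b];
        int kStart = k;
        while (k < t.kTo && sorted[k] < bucketEnd) k++;
        if (k > kStart && counts[b] > 1 && t.shift > 0) {
          pending.add(new Task(bucketStart, bucketEnd, t.shift - 8, kStart, k));
        }
        bucketStart = bucketEnd;
      }
    }
  }
}
```

Every queued task owns a disjoint, non-empty slice of the k values, so the queue never holds more entries than there are k values.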

Also, I provided a default Selector.multiSelect() implementation that is not optimized, because Selector is a public class that people may have written custom implementations of.
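One way such an unoptimized default could look is sketched below. The class `SketchSelector` and its `over()` factory are hypothetical stand-ins, not the Lucene Selector class, and the fallback shown is an assumption about the approach rather than the code in this PR. It relies only on an existing select(from, to, k), and narrows the range after each call so that a later select cannot disturb a k that is already in place.

```java
import java.util.Arrays;

// Hypothetical stand-in for a selector base class; not Lucene's Selector.
abstract class SketchSelector {

  /** Places the k-th smallest element of [from, to) at index k, smaller or equal ones before it. */
  abstract void select(int from, int to, int k);

  /** Unoptimized fallback: one select() per distinct k, in ascending order. */
  void multiSelect(int from, int to, int[] ks) {
    int[] sorted = ks.clone(); // keep the caller's array order intact
    Arrays.sort(sorted);
    int lo = from;
    for (int k : sorted) {
      if (k < lo) continue; // duplicate of a k that is already placed
      select(lo, to, k);
      lo = k + 1; // everything up to k is final; later calls must not touch it
    }
  }

  /** Small quickselect over a float[] so the sketch can be run. */
  static SketchSelector over(float[] arr) {
    return new SketchSelector() {
      @Override
      void select(int from, int to, int k) {
        int lo = from, hi = to - 1;
        while (lo < hi) {
          float pivot = arr[hi];
          int store = lo;
          for (int i = lo; i < hi; i++) {
            if (arr[i] < pivot) swap(i, store++);
          }
          swap(store, hi); // pivot is now at its sorted position
          if (k == store) return;
          if (k < store) hi = store - 1;
          else lo = store + 1;
        }
      }

      private void swap(int a, int b) {
        float tmp = arr[a];
        arr[a] = arr[b];
        arr[b] = tmp;
      }
    };
  }
}
```

Calling select() over the full range for each k would not be safe, because a later call may move an element that an earlier call had already placed.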

Benchmarking

Real benchmarking still needs to be done, but using IntelliJ's profiler on TestScalarQuantizer.testFromVectorsAutoInterval4Bit(), I have seen a >20% speed improvement in the call to ScalarQuantizer.fromVectorsAutoInterval() when using 1028 dimensions (310 ms -> 240 ms). This speedup holds when using 1000 vectors instead of 100 (3200 ms -> 2500 ms) and when using 1000 vectors and a smaller dimension of 128 (480 ms -> 370 ms).
Oops, bad results

I'd love to make benchmarks that fit with the way Lucene does them. Are there already vector benchmarks that I could build off of, or do I need to start from scratch?

HoustonPutman · 2024-10-15T22:24:47Z

So after fixing the innocuous bugs in the implementations, it looks like there is no speedup here. The confidence interval finding can be up to 30% faster or so, but that's such a small portion of the quantization cost that it really doesn't make a difference at all. (I think the bug was creating bad quantization that made the findNearestNeighbors search much faster.)

Oh well, good to test out and see at least. Probably no reason to merge if there isn't a Lucene use case that would really benefit from this.

benwtrent · 2024-10-16T15:22:30Z

lucene/core/src/java/org/apache/lucene/util/quantization/ScalarQuantizer.java

   * @return lower and upper quantile values
   */
-  static float[] getUpperAndLowerQuantile(float[] arr, float confidenceInterval) {
+  static float[][] getUpperAndLowerQuantiles(float[] arr, float[] confidenceIntervals) {


Intuitively, this would speed things up. But as you said in other comments, this part isn't the main cost of quantization. The main cost is that we iterate over every vector and then quantize it. Calculating the quantiles or quantile candidates is a very small portion of the runtime :/

github-actions · 2024-10-31T00:22:56Z

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

HoustonPutman added 3 commits October 15, 2024 09:44

Introduce multi-select for scalar quantization

f5fa804

Implement radix multiSelect, add tests, rename, make default method

f985f75

Another method, more docs

5aa46e5

HoustonPutman requested a review from benwtrent October 15, 2024 19:52

HoustonPutman added 5 commits October 15, 2024 14:53

tidy

0819657

Undo test change

cd7bcb3

Small refactor, less checks. Copy k as to keep order of array for caller

b30e967

Fix bad api

d1e0210

Add in the sort that was removed accidentally

48c3f35

HoustonPutman removed the request for review from benwtrent October 15, 2024 22:24

benwtrent reviewed Oct 16, 2024

github-actions bot added the Stale label Oct 31, 2024
