Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce multiSelect for ScalarQuantizer #13919

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

HoustonPutman
Copy link
Contributor

@HoustonPutman HoustonPutman commented Oct 15, 2024

Resolves #13918

Description

This introduces a multiSelect(from, to, k[]) method on the Selector abstract class, and gives implementations of the method for both Selector implementations, IntroSelector and RadixSelector.

This is really only used so far in the ScalarQuantizer, which uses IntroSelector, to find confidence intervals (quantiles) for lists of vectors. This code was refactored to find all quantiles at once. ScalarQuantizer.fromVectors() does 1 confidence interval (2 quantiles), and ScalarQuantizer.fromVectorsAutoInterval() does 2 confidence intervals (4 quantiles).

multiSelect details

For both RadixSelector and IntroSelector, the multiSelect() code is not too dissimilar from the select() code, however all paths are taken that could lead to one of the k values. In each path, only the relevant k values are checked. If a path has only 1 k value, then select() is called instead of multiSelect().

One difference in RadixSelector is that recursive calls are not done in-place, rather kept in a list to do after checking all path-options, so that we don't have to keep a stack of histograms. This should not be expensive, because the list that contains these recursive-call-path-options, is capped at the number of valid k values for that path, which will likely be small most of the time.

Also I provided a default Selector.multiSelect() call that is not optimized, because Selector is a public class that people might have made custom implementations of?

Benchmarking

Real benchmarking still needs to be done, but using IntelliJ's profiler on TestScalarQuantizer.testFromVectorsAutoInterval4Bit(), I have seen a >20% speed improvement in the call to ScalarQuantizer.fromVectorsAutoInterval() when using 1028 dimensions (310 ms -> 240 ms). This speedup holds when using 1000 vectors instead of 100 (3200 ms -> 2500 ms) and when using 1000 vectors and a smaller dimension of 128 (480 ms -> 370 ms).
Oops, bad results

I'd love to make benchmarks that fit with the way Lucene does them. Are there already vector benchmarks that I could build off of, or do I need to start from scratch?

@HoustonPutman
Copy link
Contributor Author

So after fixing the innocuous bugs in the implementations, it looks like there is no speed up here. The confidence interval finding can be up to 30% faster or so, but that's such a small portion of the quantization cost, that it really doesn't make a difference at all. (I think the bug was creating bad quantization that made the findNearestNeighbors search much faster)

Oh well, good to test out and see at least. Probably no reason to merge if there isn't a Lucene use case that would really benefit from this.

@HoustonPutman HoustonPutman removed the request for review from benwtrent October 15, 2024 22:24
* @return lower and upper quantile values
*/
static float[] getUpperAndLowerQuantile(float[] arr, float confidenceInterval) {
static float[][] getUpperAndLowerQuantiles(float[] arr, float[] confidenceIntervals) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intuitively, this would speed things up. But as you said in other comments, the overall cost of quantization isn't really this part. But instead that we iterate every vector and then quantize them. The calculating of the quantiles or quantile candidates is a very small portion of the runtime :/

Copy link

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Oct 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Speed up ScalarQuantization by selecting quantiles together
2 participants