-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce multiSelect for ScalarQuantizer #13919
base: main
Are you sure you want to change the base?
Introduce multiSelect for ScalarQuantizer #13919
Conversation
So after fixing the innocuous bugs in the implementations, it looks like there is no speed up here. The confidence interval finding can be up to 30% faster or so, but that's such a small portion of the quantization cost, that it really doesn't make a difference at all. (I think the bug was creating bad quantization that made the Oh well, good to test out and see at least. Probably no reason to merge if there isn't a Lucene use case that would really benefit from this. |
* @return lower and upper quantile values | ||
*/ | ||
static float[] getUpperAndLowerQuantile(float[] arr, float confidenceInterval) { | ||
static float[][] getUpperAndLowerQuantiles(float[] arr, float[] confidenceIntervals) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Intuitively, this would speed things up. But as you said in other comments, the overall cost of quantization isn't really this part. But instead that we iterate every vector and then quantize them. The calculating of the quantiles or quantile candidates is a very small portion of the runtime :/
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution! |
Resolves #13918
Description
This introduces a
multiSelect(from, to, k[])
method on theSelector
abstract class, and gives implementations of the method for bothSelector
implementations,IntroSelector
andRadixSelector
.This is really only used so far in the
ScalarQuantizer
, which usesIntroSelector
, to find confidence intervals (quantiles) for lists of vectors. This code was refactored to find all quantiles at once.ScalarQuantizer.fromVectors()
does 1 confidence interval (2 quantiles), andScalarQuantizer.fromVectorsAutoInterval()
does 2 confidence intervals (4 quantiles).multiSelect details
For both
RadixSelector
andIntroSelector
, themultiSelect()
code is not too dissimilar from theselect()
code, however all paths are taken that could lead to one of thek
values. In each path, only the relevantk
values are checked. If a path has only 1k
value, thenselect()
is called instead ofmultiSelect()
.One difference in
RadixSelector
is that recursive calls are not done in-place, rather kept in a list to do after checking all path-options, so that we don't have to keep a stack of histograms. This should not be expensive, because the list that contains these recursive-call-path-options, is capped at the number of validk
values for that path, which will likely be small most of the time.Also I provided a default
Selector.multiSelect()
call that is not optimized, becauseSelector
is a public class that people might have made custom implementations of?Benchmarking
Real benchmarking still needs to be done, but using IntelliJ's profiler onTestScalarQuantizer.testFromVectorsAutoInterval4Bit()
, I have seen a >20% speed improvement in the call toScalarQuantizer.fromVectorsAutoInterval()
when using 1028 dimensions (310 ms -> 240 ms). This speedup holds when using 1000 vectors instead of 100 (3200 ms -> 2500 ms) and when using 1000 vectors and a smaller dimension of 128 (480 ms -> 370 ms).Oops, bad results
I'd love to make benchmarks that fit with the way Lucene does them. Are there already vector benchmarks that I could build off of, or do I need to start from scratch?