Use multi-select instead of a full sort for DynamicRange creation #13914
base: main
Conversation
I have not looked closely but this sounds very cool!!
Thank you @HoustonPutman, this is really interesting!
> The old logic would start counting again after a weight range was complete, which removes information from the overflow of previous weight-ranges

Isn't there a risk with this PR that we would have a heavily weighted item at the end of a range that would make it so the next range is empty or almost empty?
    List<DynamicRangeUtil.DynamicRangeInfo> mockResult,
    List<DynamicRangeUtil.DynamicRangeInfo> expectedResult) {
  return mockResult.size() == expectedResult.size() && mockResult.containsAll(expectedResult);
Oops, thanks for changing this!
double rangeWeightTarget = (double) totalWeight / topN;
double[] kWeights = new double[topN];
for (int i = 0; i < topN; i++) {
  kWeights[i] = (i == 0 ? 0 : kWeights[i - 1]) + rangeWeightTarget;
There could be some subtlety here I don't understand, but I'm wondering if we can make this simpler.
for (int i = 1; i < topN; i++) {
kWeights[i] = i * rangeWeightTarget;
}
The array should be initialised with zeros by default, so we can also write
for (int i = 1; i < topN; i++) {
kWeights[i] = kWeights[i - 1] + rangeWeightTarget;
}
Wow yeah, both are better (though I like the first). This is the beauty of PR reviews haha. When you are 500 lines into a change, who knows what dumb things you will write...
I thought maybe you wanted to avoid the multiplications 😄
Which would be fair; my guess is the second one is faster because we're only doing sums and referencing array values that are already cached.
    long beforeTotalValue,
    long rangeWeight,
    long beforeWeight,
    double[] kWeights) {
`kWeights` doesn't communicate to me what these are. I wonder if there's a more descriptive name we could use or otherwise if we could explain in a comment. We use this `k` prefix a lot.
That's a very fair point. I struggled naming this. Basically the `k` prefix is for choosing where to select. So `kWeights` is the weight-cutoffs that you want to select. If you have a total weight of 100 and want to group into 5, then `kWeights` would be `[20,40,60,80,100]`. Very open to better naming anywhere!
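To make that concrete, here's a tiny standalone sketch of the same arithmetic as the quoted loop above (illustrative only; the variable names just mirror the PR's):

```java
// Illustrative only: cumulative weight cutoffs for totalWeight = 100 and topN = 5.
long totalWeight = 100;
int topN = 5;
double rangeWeightTarget = (double) totalWeight / topN; // 20.0
double[] kWeights = new double[topN];
for (int i = 0; i < topN; i++) {
  kWeights[i] = (i == 0 ? 0 : kWeights[i - 1]) + rangeWeightTarget;
}
// kWeights is now [20.0, 40.0, 60.0, 80.0, 100.0]: the cumulative weight at which each range should end.
```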
Does it make sense to replace `k` with `quantile` maybe?
  this.random = new SplittableRandom();
}
SplittableRandom random = this.random;
for (int i = to - 1; i > from; i--) {
Why do we need to go in descending order?
}
SplittableRandom random = this.random;
for (int i = to - 1; i > from; i--) {
  swap(i, random.nextInt(from, i + 1));
We'll end up swapping an element with itself quite often. Is it worth checking for that case in the swap method and exiting right away?
I haven't even looked at this method. It was straight copied from `IntroSelector`. After doing some research, this seems to be the right way of doing it according to the algorithm they specified: https://en.wikipedia.org/wiki/Fisher–Yates_shuffle#The_modern_algorithm
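For reference, a minimal self-contained sketch of that modern Fisher–Yates algorithm over a `long[]` slice, including the optional self-swap check suggested above (illustrative only, not the PR's `shuffle`/`swap` code):

```java
import java.util.SplittableRandom;

final class ShuffleSketch {
  // Modern Fisher–Yates shuffle of array[from, to), iterating downward like the quoted loop.
  static void shuffle(long[] array, int from, int to, SplittableRandom random) {
    for (int i = to - 1; i > from; i--) {
      int j = random.nextInt(from, i + 1); // uniform index in [from, i]
      if (i != j) { // optional early-out for the self-swap case discussed above
        long tmp = array[i];
        array[i] = array[j];
        array[j] = tmp;
      }
    }
  }
}
```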
@@ -202,66 +208,83 @@ public SegmentOutput(int hitsLength) {
    * is used to compute the equi-weight per bin.
    */
   public static List<DynamicRangeInfo> computeDynamicNumericRanges(
-      long[] values, long[] weights, int len, long totalWeight, int topN) {
+      long[] values, long[] weights, int len, long totalValue, long totalWeight, int topN) {
Noting that this can go into 10.1 despite being an API change since this class is marked experimental. Could you add an entry to CHANGES.txt?
protected abstract long getValue(int i);

public final WeightRangeInfo[] select(
Should we add some Javadoc explaining what you get if you run this method, maybe with a small example?
Absolutely. Was going to go through and add docs, just wanted to make sure it was a good direction to go in first. Probably worth doing the benchmarking first 🥹
@@ -80,24 +84,25 @@ public void testComputeDynamicNumericRangesWithOneLargeWeight() {
   expectedRangeInfoList.add(new DynamicRangeUtil.DynamicRangeInfo(1, 52343, 14L, 14L, 14D));
   expectedRangeInfoList.add(
       new DynamicRangeUtil.DynamicRangeInfo(6, 2766, 32L, 455L, 163.16666666666666D));
-  assertDynamicNumericRangeResults(values, weights, 4, 55109, expectedRangeInfoList);
+  assertDynamicNumericRangeResults(values, weights, 4, 993, 55109, expectedRangeInfoList);
 }

 private static void assertDynamicNumericRangeResults(
Strange things can happen if many or all the weights are zero. I've dealt with that for the Amazon use-case. I wonder if we're handling those situations well in this PR. Should we add a test?
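A rough sketch of what such a test could look like, assuming it lives next to the existing tests (the expected behaviour for all-zero weights is exactly the open question here, so the assertion below is only a placeholder):

```java
public void testComputeDynamicNumericRangesWithAllZeroWeights() {
  long[] values = {100, 200, 300, 400, 500};
  long[] weights = {0, 0, 0, 0, 0};
  // totalValue = 1500, totalWeight = 0, topN = 4; what "equi-weight" means here is undefined.
  List<DynamicRangeUtil.DynamicRangeInfo> result =
      DynamicRangeUtil.computeDynamicNumericRanges(values, weights, values.length, 1500, 0, 4);
  // Placeholder assertion: at minimum we should not crash or produce more than topN ranges.
  assertTrue(result.size() <= 4);
}
```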
}

// Visible for testing.
void select(
This is really interesting, but it goes a little over my head. Really curious to see the benchmark results!
Yeah, the 3-way partitioning was also quite confusing to me until I looked it up. And even then, the code is still quite hard to understand. I copied the default implementation from `IntroSelector`, then modified it to support multi-select, and to select by cumulative weight, not by ordinal. So a lot of the complexity/confusion I can't necessarily speak to. Maybe this would be clearer if the Javadocs of the class called out `IntroSelector` as the base algorithm?
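For anyone else reading along, this is the general shape of 3-way (Dutch national flag) partitioning that the selector builds on; a generic illustration, not the PR's weighted multi-select code:

```java
final class ThreeWayPartitionSketch {
  // Partitions array[from, to) around pivotValue so that, on return,
  // [from, lt) holds values < pivot, [lt, gt) holds values == pivot, and [gt, to) holds values > pivot.
  static int[] partition3Way(long[] array, int from, int to, long pivotValue) {
    int lt = from, i = from, gt = to;
    while (i < gt) {
      if (array[i] < pivotValue) {
        swap(array, lt++, i++);
      } else if (array[i] > pivotValue) {
        swap(array, i, --gt);
      } else {
        i++;
      }
    }
    return new int[] {lt, gt}; // boundaries of the "equal to pivot" block
  }

  private static void swap(long[] array, int i, int j) {
    long tmp = array[i];
    array[i] = array[j];
    array[j] = tmp;
  }
}
```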
if ((size = to - from) > 3) {

  if (--maxDepth == -1) {
    // Max recursion depth exceeded: shuffle (only once) and continue.
Can we also say why?
This is from `IntroSelector`, but basically I think it's saying that if I've done enough recursions in QuickSelect, that means our data has a really bad distribution? So just randomize it a bit and continue. I don't have an opinion as I haven't studied it, but hopefully there was research put into the idea?
> Isn't there a risk with this PR that we would have a heavily weighted item at the end of a range that would make it so the next range is empty or almost empty?
Yes, that would be a risk. But in the existing implementation, the last range would be almost empty instead. Either way the heavily weighted item has to take space from some group. So in my mind, it's easier to understand that the groups that you are given back better represent the actual quantiles, versus leaving the small group for the end. Users might actually be interested in that last quantile the most.
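To make the trade-off concrete, here's a toy illustration (hypothetical numbers, and only an approximation of the cumulative-cutoff rule, not the PR's actual selection code). With weights `[60, 10, 10, 10, 10]` and `topN = 3`, cumulative cutoffs give per-range weights of roughly `[60, 10, 30]`, whereas restarting the count after each range would give `[60, 40]` and leave the last range empty:

```java
import java.util.Arrays;

final class CumulativeCutoffToy {
  public static void main(String[] args) {
    long[] weights = {60, 10, 10, 10, 10}; // one heavy item up front
    long totalWeight = 100;
    int topN = 3;
    double rangeWeightTarget = (double) totalWeight / topN; // ~33.3 per range

    long running = 0;
    int range = 0;
    long[] rangeWeights = new long[topN];
    for (long w : weights) {
      running += w;
      rangeWeights[range] += w;
      // Close the current range once its cumulative cutoff has been reached.
      while (range < topN - 1 && running >= rangeWeightTarget * (range + 1)) {
        range++;
      }
    }
    System.out.println(Arrays.toString(rangeWeights)); // [60, 10, 30]
  }
}
```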
Resolves #13760
Description
This is using a similar approach to how Solr used to compute multiple percentiles at a single time. Basically, utilize the quick select method, but instead of following a single path, follow a path for each of the `k`s that is requested. Multi-quickselect. (A rough illustrative sketch of this idea is included at the end of this description.)

That's what I originally made, until I realized that `DynamicRangeUtil` is weighted, so I refactored it to choose by weights instead, and also capture the running-value-total and running-weight-total, because that information is used in the `DynamicRangeInfo`.

My goal was to add this as a generic capability of the `Selector` (or `IntroSelector`) class, but because of the limitations above, it is currently a separate class to handle this. If there are any suggestions on how to make this generic enough to be put in the generic class, that would be great. But it might not be worth the effort if it wouldn't be used anywhere else.

As for the original multi-quickselect algorithm I mentioned, I looked for other multi-select use cases across Lucene, but I only found one instance (`ScalarQuantizer` does two select calls in succession). If there are more instances we can find, I would be happy to add multiSelect as an option on the `Selector` class, and implement it in all provided classes.

To-Do
Caveat
The implementation is slightly different, as it will pick the groups according to "the first value for which the running weight is <= weight-range-boundary". The old logic would start counting again after a weight range was complete, which removes information from the overflow of previous weight-ranges. I'm not sure either approach is right or wrong, but I wanted to explicitly state how the results would be different and why I had to alter a unit test to make it pass.
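As promised in the description, here is a rough, generic sketch of the ordinal-based multi-quickselect idea (partition once, then recurse only into the sub-ranges that still contain requested `k`s). This is purely illustrative; the class in this PR selects by cumulative weight and tracks running totals instead:

```java
import java.util.Arrays;

final class MultiSelectSketch {

  // Places the k-th smallest element at index k for every k in ks[ksFrom, ksTo).
  // ks must be sorted and its entries must lie within [from, to).
  static void multiSelect(long[] a, int from, int to, int[] ks, int ksFrom, int ksTo) {
    if (ksFrom >= ksTo || to - from <= 1) {
      return;
    }
    int pivotIndex = partition(a, from, to);
    // Split the requested ks around the pivot's final position.
    int split = ksFrom;
    while (split < ksTo && ks[split] < pivotIndex) {
      split++;
    }
    multiSelect(a, from, pivotIndex, ks, ksFrom, split);
    // Any k equal to pivotIndex is already in place, so skip it.
    int right = split;
    while (right < ksTo && ks[right] == pivotIndex) {
      right++;
    }
    multiSelect(a, pivotIndex + 1, to, ks, right, ksTo);
  }

  // Simple Lomuto partition around the last element; returns the pivot's final index.
  private static int partition(long[] a, int from, int to) {
    long pivot = a[to - 1];
    int i = from;
    for (int j = from; j < to - 1; j++) {
      if (a[j] < pivot) {
        swap(a, i++, j);
      }
    }
    swap(a, i, to - 1);
    return i;
  }

  private static void swap(long[] a, int i, int j) {
    long tmp = a[i];
    a[i] = a[j];
    a[j] = tmp;
  }

  public static void main(String[] args) {
    long[] a = {9, 1, 8, 2, 7, 3, 6, 4, 5, 0};
    int[] ks = {2, 5, 8}; // e.g. three quantile boundaries requested at once
    multiSelect(a, 0, a.length, ks, 0, ks.length);
    // a[2] == 2, a[5] == 5 and a[8] == 8 are guaranteed; the rest is only partitioned around them.
    System.out.println(Arrays.toString(a));
  }
}
```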