-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow reading binary doc values as a DataInput #12459
Comments
I wrote a prototype for this change here: #12460 |
Something I am not happy with is that we loose the ability of having random access to the underlaying bytes without having to clone the data input. |
Another argument in favor of this proposal that I can think of is that many call sites of this API want to decode numbers from the byte[] and end up re-inventing |
I expect that the use case of wrapping the byte[] from a binary doc values with a I think we should provide the possibility to be able to have random access so I would like to refine my original proposal by adding a DataInputDocValue which is just a DataInput which is positional aware. The API would look like:
I updated the prototype accordingly. |
Now that lucene 10 has been released and our java minimum version is 21, the RandomAccessInput API got efficient methods to read byte[] from the method #readBytes, I think this API makes more sense now to use that abstraction as in my opinion os the closest API to a BytesRef but not implying the the bytes are on heap. In this case the API can be as simple as something like:
|
Description
Binary doc values allow to store a variable number of bytes on a doc value. In order to read those bytes, we currently get a BytesRef from the API which contains the bytes on heap. In order to do that, the current implementation preallocates a byte array with the size equals to the biggest doc value. This strategy has two main drawbacks:
We are using a byte array as a middle data structure so we are copying each doc value in this byte array. In many cases this is an unnecessary overhead.
when one of the doc values is big, in the order of few megabytes, it can cause issues with small heaps (or even big heaps if big enough). This is due to the allocation of a big byte arrays upfront, that can be consider humongous allocations by the G1 garbage collector and it can cause heap issues under high load.
Therefore I would like to propose to add a new API on top of binary doc values that allows reading them using a DataInput. This data input can read directly from the underlaying IndexInput and therefore we don't need to copy data into an intermediate data structure and we don't need to preallocate a byte array.
The new API would be built on top of the existing binary doc values and it would look something like:
The text was updated successfully, but these errors were encountered: