Allow reading binary doc values as a DataInput #12460
lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90DocValuesProducer.java
Interesting idea! It looks useful. Are you planning to replace some existing uses of BinaryDocValues with DataInputDocValues? I checked the code for places where it might make more sense to use DataInputDocValues:
- MatchingFacetSetsCounts.
- TaxonomyFacetIntAssociations.
- TaxonomyFacetFloatAssociations.
- SerializedDVStrategy.
The last one falls under the ByteArrayInputStream case previously mentioned.
I am currently not planning to replace any of the usages, as I am not familiar with them. Note that some of them encode data in big endian while DataOutput/DataInput uses little endian since 8.0, so they might not be compatible. My use case is more similar to ShapeDocValues, and that would be a good candidate. I am not familiar with the implementation and it seems to require some signature changes, so I left the implementation to whoever is interested.
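To illustrate the byte-order concern, here is a minimal sketch using only the JDK (it is not Lucene code, and `EndiannessSketch` and `readInt` are hypothetical names). It decodes the same four bytes with both byte orders, which is the kind of mismatch that would occur if a big-endian encoded value were read through a little-endian reader:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class EndiannessSketch {
  /** Decodes the first four bytes as an int using the given byte order. */
  public static int readInt(byte[] bytes, ByteOrder order) {
    return ByteBuffer.wrap(bytes).order(order).getInt();
  }

  public static void main(String[] args) {
    // The value 1, written in big-endian order.
    byte[] encoded = {0x00, 0x00, 0x00, 0x01};
    System.out.println(readInt(encoded, ByteOrder.BIG_ENDIAN));    // prints 1
    System.out.println(readInt(encoded, ByteOrder.LITTLE_ENDIAN)); // prints 16777216
  }
}
```

Migrating a call site with big-endian data would therefore presumably need either a re-encoding step or an explicit byte swap on read.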
The more I think of this change, the more I like it: most of the time, you would need to read data out of binary doc values, e.g. (variable-length) integers, strings, etc. and exposing binary doc values as a data input not only makes this easier to do, but also more efficient by saving one memory copy. Thoughts:
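As a rough illustration of the saved copy, the following JDK-only sketch contrasts the two paths. It is not the PR's code: `DirectReadSketch`, `decodeWithCopy`, and `decodeDirect` are hypothetical names, and a `ByteBuffer` stands in for the underlying index storage. The variable-length int uses 7 bits per byte with the low-order group first and the high bit as a continuation flag.

```java
import java.nio.ByteBuffer;

public class DirectReadSketch {
  /** Reads a variable-length int starting at the buffer's current position. */
  public static int readVInt(ByteBuffer in) {
    int value = 0;
    int shift = 0;
    byte b;
    do {
      b = in.get();
      value |= (b & 0x7F) << shift;
      shift += 7;
    } while ((b & 0x80) != 0);
    return value;
  }

  /** Copy-based path: materialize the value as a byte[] first, then decode it. */
  public static int decodeWithCopy(ByteBuffer storage, int offset, int length) {
    byte[] copy = new byte[length];
    ByteBuffer view = storage.duplicate();
    view.position(offset);
    view.get(copy); // the extra memory copy
    return readVInt(ByteBuffer.wrap(copy));
  }

  /** Direct path: decode straight from a bounded view over the stored bytes. */
  public static int decodeDirect(ByteBuffer storage, int offset, int length) {
    ByteBuffer view = storage.duplicate();
    view.position(offset);
    view.limit(offset + length);
    return readVInt(view);
  }
}
```

Both methods return the same value; the direct path simply skips the intermediate array.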
In terms of backward compatibility, I'm contemplating not introducing a new To @stefanvodita 's point, it would be nice to migrate at least one or two call sites that would get simpler with a |
I think this approach defeats one of the main purposes of this change, which is to avoid allocating a byte array when reading doc values. I don't think we want BinaryDocValues to do that lazily:
On my own use case, getting a DataInput is not enough as I need random access via get/set position, in a similar fashion to what I am doing now via ByteArrayDataInput. |
What is the problem with allocating lazily? It wouldn't make sense to me with the current API, where binaryValue() is the only way to retrieve the data, but if it were to only remain for bw compat it would make sense to me to only incur the byte[] overhead if the legacy
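The lazy-allocation idea could look roughly like the sketch below. This is an assumption-laden illustration rather than the proposed Lucene API: `LazyCopySketch`, `dataView()`, and `copyAllocated()` are hypothetical, a `ByteBuffer` stands in for the stored value, and only the method name `binaryValue()` is borrowed from the discussion.

```java
import java.nio.ByteBuffer;

public class LazyCopySketch {
  private final ByteBuffer stored; // stands in for the bytes of one document's value
  private byte[] copy;             // allocated only on the first legacy access

  public LazyCopySketch(ByteBuffer stored) {
    this.stored = stored.asReadOnlyBuffer();
  }

  /** New-style access: a fresh view over the stored bytes, no byte[] allocated. */
  public ByteBuffer dataView() {
    return stored.duplicate();
  }

  /** Legacy-style access: materializes the byte[] on first call and reuses it afterwards. */
  public byte[] binaryValue() {
    if (copy == null) {
      copy = new byte[stored.remaining()];
      stored.duplicate().get(copy);
    }
    return copy;
  }

  /** Reports whether the legacy copy has been allocated yet. */
  public boolean copyAllocated() {
    return copy != null;
  }
}
```

Callers that only use the view never pay for the array; callers on the legacy path pay once per value.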
This has been a challenge so many times in the past, maybe it's time to add |
We have full random access (positional reads), if you optionally extend the interface As alternative in IndexInputs there is an additional method to retrieve a random access view on the input (the ByteBuffer ones return itself, but it's emulated otherwise using IndexInput seek+read pairs). But here, as it is a new API and random access should always work for binary docvalues, we can return 2 getters, both returning the applicable type of view, implemented by the same class. Adding seek support to
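A minimal sketch of the "two getters, one implementing class" idea, assuming hypothetical interfaces (`SequentialView`, `PositionalView`) and a hypothetical holder (`DocValue`) that are not part of Lucene; a `ByteBuffer` again stands in for the stored bytes:

```java
import java.nio.ByteBuffer;

public class DualViewSketch {
  /** Hypothetical sequential view: each read advances an internal cursor. */
  public interface SequentialView {
    byte readByte();
  }

  /** Hypothetical positional view: absolute reads that never move the cursor. */
  public interface PositionalView {
    byte readByte(long pos);

    long length();
  }

  /** One class backs both views over the same bytes. */
  static final class ValueView implements SequentialView, PositionalView {
    private final ByteBuffer bytes;

    ValueView(ByteBuffer bytes) {
      this.bytes = bytes.slice();
    }

    @Override
    public byte readByte() {
      return bytes.get(); // relative read, advances the cursor
    }

    @Override
    public byte readByte(long pos) {
      return bytes.get(Math.toIntExact(pos)); // absolute read, cursor untouched
    }

    @Override
    public long length() {
      return bytes.limit();
    }
  }

  /** Hypothetical per-document value exposing two getters for the same view object. */
  public static final class DocValue {
    private final ValueView view;

    public DocValue(byte[] data) {
      this.view = new ValueView(ByteBuffer.wrap(data));
    }

    public SequentialView dataInput() {
      return view;
    }

    public PositionalView randomAccess() {
      return view;
    }
  }
}
```

Because both getters return the same object, no extra instance is allocated per access style, and positional reads do not disturb the sequential cursor.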
To save more memory copies, the codec may use a slice from the underlying IndexInput directly to support both access APIs. All file pointer checks would then be performed by the low-level JVM intrinsics used by MemorySegmentIndexInput's slices. If you use those views and do not let them escape the current method, the GC pressure isn't there (be careful: profilers show them, as escape analysis no longer works when the profiler prevents inlining). At some point we must anyway move away from reusing instances and move Lucene to the short-lived immutable instances used by modern JDK APIs. Escape analysis works fine if you have simple APIs. As an example, see the vector API in Lucene. It produces up to around 15 object instances per dotProduct, but it's still 4 times faster than looping through a byte array without any new instance allocations. 😜
Thanks @jpountz and @uschindler for the input. I had a look into What I am missing is an abstraction between
(Naming is so hard, naming proposals are welcome.) This abstraction can be extended by EDIT: renamed the proposed class / method and javadocs.
I have been thinking a bit longer about this and I think this approach of What convinces me is the fact that
The first two are easy to implement and maybe we should do it regardless of this issue. The other two are much trickier.
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution! |
I opened #13948, which is clearly less invasive.
see #12459