Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow reading binary doc values as a DataInput #12460

Closed
wants to merge 12 commits into from

Conversation

iverase
Copy link
Contributor

@iverase iverase commented Jul 25, 2023

see #12459

Copy link
Contributor

@stefanvodita stefanvodita left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting idea! It looks useful. Are you planning to replace some existing uses of BinaryDocValues with DataInputDocValues? I checked the code for places where it might make more sense to use DataInputDocValues:

  1. MatchingFacetSetsCounts.
  2. TaxonomyFacetIntAssociations.
  3. TaxonomyFacetFloatAssociations.
  4. SerializedDVStrategy.

The last one falls under the ByteArrayInputStream case previously mentioned.

@iverase
Copy link
Contributor Author

iverase commented Aug 21, 2023

I am currently not planing to replace any of the usages as I am not familiar with them. Note that some of them encode data in big endian while DataOutput/DataInput uses little endian since 8.0 so there might not be compatible. The SerializedDVStrategy uses a java.io.ByteArrayInputStream so it is not a good candidate either.

My use case is more similar to ShapeDocValues and that would be a good candidate. I am not familiar with the implementation and it seems to requires some signature changes so left the implementation to whoever is interested.

@iverase iverase marked this pull request as ready for review September 12, 2023 06:31
@jpountz
Copy link
Contributor

jpountz commented Sep 12, 2023

The more I think of this change, the more I like it: most of the time, you would need to read data out of binary doc values, e.g. (variable-length) integers, strings, etc. and exposing binary doc values as a data input not only makes this easier to do, but also more efficient by saving one memory copy.

Thoughts:

  • We are losing the symmetry between the index-time API (BinaryDocValuesField), which takes a byte[] while the search-time API produces a DataInput. Not a deal breaker, but users have to be careful about endianness.
  • Today it's possible to consume values multiple times, but the data input is stateful, so if one consumer reads data from it, and then another consumer tries to read data again it will fail unless it resets the input via a call to advanceExact. Again not a deal breaker but something that could be a surprise to some callers.

In terms of backward compatibility, I'm contemplating not introducing a new DataInputDocValues class, and instead have a dataInput() method on BinaryDocValues, as well as a default implementation of binaryValue() that reads all bytes from the data input. What do you think? I'm also curious of the opinion of our sophisticated backward compatibility policeman @uschindler.

To @stefanvodita 's point, it would be nice to migrate at least one or two call sites that would get simpler with a DataInput to further validate this change.

@iverase
Copy link
Contributor Author

iverase commented Sep 12, 2023

I'm contemplating not introducing a new DataInputDocValues class, and instead have a dataInput() method on BinaryDocValues

I think this approach defeats on of the main purposes for this change, that is to avoid allocating a byte array when reading doc values. I don't think we want BinaryDocValues to do that lazily:

when one of the doc values is big, in the order of few megabytes, it can cause issues with small heaps (or even big heaps if 
big enough). This is due to the allocation of a big byte array upfront, that can be consider humongous allocations by the G1 
garbage collector and it can cause heap issues under high load.

On my own use case, getting a DataInput is not enough as I need random access via get/set position, in a similar fashion to what I am doing now via ByteArrayDataInput.

@jpountz
Copy link
Contributor

jpountz commented Sep 12, 2023

I think this approach defeats on of the main purposes for this change, that is to avoid allocating a byte array when reading doc values. I don't think we want BinaryDocValues to do that lazily

What is the problem with allocating lazily? It wouldn't make sense to me with the current API, where binaryValue() is the only way to retrieve the data, but if it were to only remain for bw compat it would make sense to me to only incur the byte[] overhead if the legacy binaryValue() API is used?

On my own use case, getting a DataInput is not enough as I need random access via get/set position, in a similar fashion to what I am doing now via ByteArrayDataInput.

This has been a challenge so many times in the past, maybe it's time to add seek() support to DataInput?

@uschindler
Copy link
Contributor

uschindler commented Sep 12, 2023

This has been a challenge so many times in the past, maybe it's time to add seek() support to DataInput?

We have full random access (positional reads), if you optionall extend the interface RandomAccessInput (see how ByteBufferDataInput does it: https://lucene.apache.org/core/9_7_0/core/org/apache/lucene/store/ByteBuffersDataInput.html)

As alternative in IndexInputs there is an additional method to retrieve a random access view on the input, the ByteBuffer ones return itself, but it's emulated otherwise using IndexInput seek+read pairs). But here as it is a new API and random access should always work for binary docvalues, we can return 2 getters, both retuning applicable type of view, implemented by same class.

Adding seek support to DataInput is not a good idea because it would prevent those DataInputs based on InputStream (e g. InputStreamDataInput, which is used at various places).

@uschindler
Copy link
Contributor

uschindler commented Sep 12, 2023

To save more memory copies, the codec may use a slice from the underlying IndexInput directly to support both access apis. All file pointer checks would then be performed by the low level JVM intrinsics used my MemorySegmentIndexInput's slices. If you use those views and not let them escape the current method, the GC pressure isn't there (be careful: Profilers show them as escape analysis no longer works when the profiler prevents inlining).

At some point we must anyways go away from the reusing of instances and move Lucene to the shortliving immutable instances used by modern JDK APIs. Escape analysis works fine, if you have simple apis. As example see the vector API in Lucene. It produces up to around 15 object instances per dotProduct, but it's still 4 times faster than looping through a byte array without any new instance allocations. 😜

@iverase
Copy link
Contributor Author

iverase commented Sep 15, 2023

Thanks @jpountz and @uschindler for the input. I had a look into RandomAccessInput and I don't think this what we need. We need an DataInput that is positional ware so it supports seek and in addition it knows its length so we can still read the full stream via readBytes(byte[] b, int offset, int len) to support the current functionality.

What I am missing is an abstraction between DataInput and IndexInput like:

/**
 * A position aware {@link DataInput}. It is possible to get the current position on this
 * stream via {@link #position()} and to return to that position via {@link #seek(long)}.
 */
public abstract class SeekableDataInput extends DataInput {

    /** The number of bytes in the stream. */
    public abstract long size();

    /**
     * Returns the current position in this stream, where the next read will occur.
     *
     * @see #seek(long)
     */
    public abstract long position();

    /**
     * Sets current position in this stream, where the next read will occur.
     *
     * @see #position()
     */
    public abstract void seek(long pos) throws IOException;

    /**
     * {@inheritDoc}
     *
     * <p>Behavior is functionally equivalent to seeking to <code>position() + numBytes</code>.
     *
     * @see #position()
     * @see #seek(long)
     */
    @Override
    public void skipBytes(long numBytes) throws IOException {
        if (numBytes < 0) {
            throw new IllegalArgumentException("numBytes must be >= 0, got " + numBytes);
        }
        final long skipTo = position() + numBytes;
        seek(skipTo);
    }
}

(Naming is so hard, naming proposals are welcome)

This abstraction can be extended by ByteBuffersDataInput and IndexInput and potentially by ByteArrayDataInput and FTS.BytesReader.

** EDIT***: rename the proposal class / method and javadocs

@iverase
Copy link
Contributor Author

iverase commented Sep 26, 2023

I have been thinking a bit longer about this and I think this approach of DataInput is not right. Instead we should try to return an API more similar to RandomAccessInput as Uwe suggested.

What convinces me is the fact that BytesRef can easily implement (and maybe should) RandomAccessInput. Unfortunately it is not straightforward to change BytesRef with RandomAccessInput as we are doing things with BytesRef that are challenging to implement generically:

  • Maybe we should add the length of the RandomAccessInput to the API? public long length()
  • We can copy the contents of a RandomAccessInput byte by byte, should we add a bulk copy method like public byte readBytes(long pos, byte[] b, int offset, int length) throws IOException;?
  • Can RandomAccessInput implement Comparable?
  • Can RandomAccessInput implementation have implementations of equals and hashCode?

The first two are easy to implement and maybe we should do it regardless this issue. The other two are much trickier.

Copy link

github-actions bot commented Jan 8, 2024

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Jan 8, 2024
@iverase
Copy link
Contributor Author

iverase commented Oct 23, 2024

I open #13948 which is clearly less invasive.

@iverase iverase closed this Oct 23, 2024
@iverase iverase deleted the DataInputDocValues branch October 23, 2024 15:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants