Allow reading binary doc values as a DataInput #12460

iverase · 2023-07-25T03:14:19Z

lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90DocValuesProducer.java

stefanvodita

Interesting idea! It looks useful. Are you planning to replace some existing uses of BinaryDocValues with DataInputDocValues? I checked the code for places where it might make more sense to use DataInputDocValues:

The last one falls under the ByteArrayInputStream case previously mentioned.

lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90DocValuesProducer.java

iverase · 2023-08-21T07:18:46Z

I am currently not planing to replace any of the usages as I am not familiar with them. Note that some of them encode data in big endian while DataOutput/DataInput uses little endian since 8.0 so there might not be compatible. The SerializedDVStrategy uses a java.io.ByteArrayInputStream so it is not a good candidate either.

My use case is more similar to ShapeDocValues and that would be a good candidate. I am not familiar with the implementation and it seems to requires some signature changes so left the implementation to whoever is interested.

jpountz · 2023-09-12T07:08:43Z

The more I think of this change, the more I like it: most of the time, you would need to read data out of binary doc values, e.g. (variable-length) integers, strings, etc. and exposing binary doc values as a data input not only makes this easier to do, but also more efficient by saving one memory copy.

Thoughts:

We are losing the symmetry between the index-time API (BinaryDocValuesField), which takes a byte[] while the search-time API produces a DataInput. Not a deal breaker, but users have to be careful about endianness.
Today it's possible to consume values multiple times, but the data input is stateful, so if one consumer reads data from it, and then another consumer tries to read data again it will fail unless it resets the input via a call to advanceExact. Again not a deal breaker but something that could be a surprise to some callers.

In terms of backward compatibility, I'm contemplating not introducing a new DataInputDocValues class, and instead have a dataInput() method on BinaryDocValues, as well as a default implementation of binaryValue() that reads all bytes from the data input. What do you think? I'm also curious of the opinion of our sophisticated backward compatibility policeman @uschindler.

To @stefanvodita 's point, it would be nice to migrate at least one or two call sites that would get simpler with a DataInput to further validate this change.

iverase · 2023-09-12T08:14:42Z

I'm contemplating not introducing a new DataInputDocValues class, and instead have a dataInput() method on BinaryDocValues

I think this approach defeats on of the main purposes for this change, that is to avoid allocating a byte array when reading doc values. I don't think we want BinaryDocValues to do that lazily:

when one of the doc values is big, in the order of few megabytes, it can cause issues with small heaps (or even big heaps if 
big enough). This is due to the allocation of a big byte array upfront, that can be consider humongous allocations by the G1 
garbage collector and it can cause heap issues under high load.

On my own use case, getting a DataInput is not enough as I need random access via get/set position, in a similar fashion to what I am doing now via ByteArrayDataInput.

jpountz · 2023-09-12T08:23:30Z

I think this approach defeats on of the main purposes for this change, that is to avoid allocating a byte array when reading doc values. I don't think we want BinaryDocValues to do that lazily

What is the problem with allocating lazily? It wouldn't make sense to me with the current API, where binaryValue() is the only way to retrieve the data, but if it were to only remain for bw compat it would make sense to me to only incur the byte[] overhead if the legacy binaryValue() API is used?

On my own use case, getting a DataInput is not enough as I need random access via get/set position, in a similar fashion to what I am doing now via ByteArrayDataInput.

This has been a challenge so many times in the past, maybe it's time to add seek() support to DataInput?

uschindler · 2023-09-12T11:05:36Z

This has been a challenge so many times in the past, maybe it's time to add seek() support to DataInput?

We have full random access (positional reads), if you optionall extend the interface RandomAccessInput (see how ByteBufferDataInput does it: https://lucene.apache.org/core/9_7_0/core/org/apache/lucene/store/ByteBuffersDataInput.html)

As alternative in IndexInputs there is an additional method to retrieve a random access view on the input, the ByteBuffer ones return itself, but it's emulated otherwise using IndexInput seek+read pairs). But here as it is a new API and random access should always work for binary docvalues, we can return 2 getters, both retuning applicable type of view, implemented by same class.

Adding seek support to DataInput is not a good idea because it would prevent those DataInputs based on InputStream (e g. InputStreamDataInput, which is used at various places).

uschindler · 2023-09-12T11:31:21Z

To save more memory copies, the codec may use a slice from the underlying IndexInput directly to support both access apis. All file pointer checks would then be performed by the low level JVM intrinsics used my MemorySegmentIndexInput's slices. If you use those views and not let them escape the current method, the GC pressure isn't there (be careful: Profilers show them as escape analysis no longer works when the profiler prevents inlining).

At some point we must anyways go away from the reusing of instances and move Lucene to the shortliving immutable instances used by modern JDK APIs. Escape analysis works fine, if you have simple apis. As example see the vector API in Lucene. It produces up to around 15 object instances per dotProduct, but it's still 4 times faster than looping through a byte array without any new instance allocations. 😜

iverase · 2023-09-15T09:13:36Z

Thanks @jpountz and @uschindler for the input. I had a look into RandomAccessInput and I don't think this what we need. We need an DataInput that is positional ware so it supports seek and in addition it knows its length so we can still read the full stream via readBytes(byte[] b, int offset, int len) to support the current functionality.

What I am missing is an abstraction between DataInput and IndexInput like:

/**
 * A position aware {@link DataInput}. It is possible to get the current position on this
 * stream via {@link #position()} and to return to that position via {@link #seek(long)}.
 */
public abstract class SeekableDataInput extends DataInput {

    /** The number of bytes in the stream. */
    public abstract long size();

    /**
     * Returns the current position in this stream, where the next read will occur.
     *
     * @see #seek(long)
     */
    public abstract long position();

    /**
     * Sets current position in this stream, where the next read will occur.
     *
     * @see #position()
     */
    public abstract void seek(long pos) throws IOException;

    /**
     * {@inheritDoc}
     *
     * <p>Behavior is functionally equivalent to seeking to <code>position() + numBytes</code>.
     *
     * @see #position()
     * @see #seek(long)
     */
    @Override
    public void skipBytes(long numBytes) throws IOException {
        if (numBytes < 0) {
            throw new IllegalArgumentException("numBytes must be >= 0, got " + numBytes);
        }
        final long skipTo = position() + numBytes;
        seek(skipTo);
    }
}

(Naming is so hard, naming proposals are welcome)

This abstraction can be extended by ByteBuffersDataInput and IndexInput and potentially by ByteArrayDataInput and FTS.BytesReader.

** EDIT***: rename the proposal class / method and javadocs

iverase · 2023-09-26T08:23:49Z

I have been thinking a bit longer about this and I think this approach of DataInput is not right. Instead we should try to return an API more similar to RandomAccessInput as Uwe suggested.

What convinces me is the fact that BytesRef can easily implement (and maybe should) RandomAccessInput. Unfortunately it is not straightforward to change BytesRef with RandomAccessInput as we are doing things with BytesRef that are challenging to implement generically:

Maybe we should add the length of the RandomAccessInput to the API? public long length()
We can copy the contents of a RandomAccessInput byte by byte, should we add a bulk copy method like public byte readBytes(long pos, byte[] b, int offset, int length) throws IOException;?
Can RandomAccessInput implement Comparable?
Can RandomAccessInput implementation have implementations of equals and hashCode?

The first two are easy to implement and maybe we should do it regardless this issue. The other two are much trickier.

github-actions · 2024-01-08T12:23:58Z

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

iverase · 2024-10-23T14:22:10Z

I open #13948 which is clearly less invasive.

Allow reading binary doc values as a DataInput

038fe7e

iverase marked this pull request as draft July 25, 2023 03:14

iverase mentioned this pull request Jul 25, 2023

Allow reading binary doc values as a DataInput #12459

Open

jpountz reviewed Jul 25, 2023

View reviewed changes

lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90DocValuesProducer.java Outdated Show resolved Hide resolved

iverase added 4 commits July 26, 2023 08:11

Introduce DataInputDocValues.DataInputDocValue

95ed882

fix javadocs failure

fb7fe2f

revert changes to ByteArrayDataInput

de2a848

tidy

949eba9

stefanvodita reviewed Aug 19, 2023

View reviewed changes

lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90DocValuesProducer.java Outdated Show resolved Hide resolved

iverase added 2 commits August 21, 2023 14:50

Merge branch 'main' into DataInputDocValues

3e3a042

simplify check

e142183

Merge branch 'main' into DataInputDocValues

339a299

iverase marked this pull request as ready for review September 12, 2023 06:31

iverase mentioned this pull request Sep 14, 2023

Use recyclable & paginated buffers for search hits elastic/elasticsearch#99590

Open

github-actions bot added the Stale label Jan 8, 2024

iverase added 4 commits October 23, 2024 10:32

iter

27dce67

Move from DataInput to RandomAccessInput

6b9540f

Move from DataInput to RandomAccessInput

b004cdd

iter

e65dddb

iverase mentioned this pull request Oct 23, 2024

Allow reading binary doc values as a RandomAccessInput #13948

Open

iverase closed this Oct 23, 2024

iverase deleted the DataInputDocValues branch October 23, 2024 15:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow reading binary doc values as a DataInput #12460

Allow reading binary doc values as a DataInput #12460

iverase commented Jul 25, 2023

stefanvodita left a comment

iverase commented Aug 21, 2023 •

edited

Loading

jpountz commented Sep 12, 2023

iverase commented Sep 12, 2023

jpountz commented Sep 12, 2023

uschindler commented Sep 12, 2023 •

edited

Loading

uschindler commented Sep 12, 2023 •

edited

Loading

iverase commented Sep 15, 2023 •

edited

Loading

iverase commented Sep 26, 2023

github-actions bot commented Jan 8, 2024

iverase commented Oct 23, 2024

Allow reading binary doc values as a DataInput #12460

Allow reading binary doc values as a DataInput #12460

Conversation

iverase commented Jul 25, 2023

stefanvodita left a comment

Choose a reason for hiding this comment

iverase commented Aug 21, 2023 • edited Loading

jpountz commented Sep 12, 2023

iverase commented Sep 12, 2023

jpountz commented Sep 12, 2023

uschindler commented Sep 12, 2023 • edited Loading

uschindler commented Sep 12, 2023 • edited Loading

iverase commented Sep 15, 2023 • edited Loading

iverase commented Sep 26, 2023

github-actions bot commented Jan 8, 2024

iverase commented Oct 23, 2024

iverase commented Aug 21, 2023 •

edited

Loading

uschindler commented Sep 12, 2023 •

edited

Loading

uschindler commented Sep 12, 2023 •

edited

Loading

iverase commented Sep 15, 2023 •

edited

Loading