Allow reading binary doc values as a DataInput #12459

iverase · 2023-07-25T02:06:51Z

Description

Binary doc values allow to store a variable number of bytes on a doc value. In order to read those bytes, we currently get a BytesRef from the API which contains the bytes on heap. In order to do that, the current implementation preallocates a byte array with the size equals to the biggest doc value. This strategy has two main drawbacks:

We are using a byte array as a middle data structure so we are copying each doc value in this byte array. In many cases this is an unnecessary overhead.
when one of the doc values is big, in the order of few megabytes, it can cause issues with small heaps (or even big heaps if big enough). This is due to the allocation of a big byte arrays upfront, that can be consider humongous allocations by the G1 garbage collector and it can cause heap issues under high load.

Therefore I would like to propose to add a new API on top of binary doc values that allows reading them using a DataInput. This data input can read directly from the underlaying IndexInput and therefore we don't need to copy data into an intermediate data structure and we don't need to preallocate a byte array.

The new API would be built on top of the existing binary doc values and it would look something like:

public abstract class DataInputDocValues extends DocValuesIterator {

  /** Sole constructor. (For invocation by subclass constructors, typically implicit.) */
  protected DataInputDocValues() {}

  /**
   * Returns the binary value wrapped as a {@link DataInput} for the current document ID. It is
   * illegal to call this method after {@link #advanceExact(int)} returned {@code false}.
   *
   * @return the binary value wrapped as a {@link DataInput}
   */
  public abstract DataInput dataInputValue() throws IOException;
}

The text was updated successfully, but these errors were encountered:

iverase · 2023-07-25T03:15:06Z

I wrote a prototype for this change here: #12460

iverase · 2023-07-25T09:32:56Z

Something I am not happy with is that we loose the ability of having random access to the underlaying bytes without having to clone the data input.

jpountz · 2023-07-25T12:03:50Z

Another argument in favor of this proposal that I can think of is that many call sites of this API want to decode numbers from the byte[] and end up re-inventing DataInput#readInt and similar methods.

iverase · 2023-07-26T00:24:48Z

I expect that the use case of wrapping the byte[] from a binary doc values with a ByteArrayDataInput to read the contents to be common, so this approach will give that in a more efficient way.

I think we should provide the possibility to be able to have random access so I would like to refine my original proposal by adding a DataInputDocValue which is just a DataInput which is positional aware. The API would look like:

public abstract class DataInputDocValues extends DocValuesIterator {

  /** Sole constructor. (For invocation by subclass constructors, typically implicit.) */
  protected DataInputDocValues() {}

  /**
   * Returns the binary value wrapped as a {@link DataInputDocValue} for the current document ID. It
   * is illegal to call this method after {@link #advanceExact(int)} returned {@code false}.
   *
   * @return the binary value wrapped as a {@link DataInputDocValue}
   */
  public abstract DataInputDocValue dataInputValue() throws IOException;

  /**
   * A {@link DataInput} view over a binary doc value which is positional aware.
   *
   * @lucene.experimental
   */
  public abstract static class DataInputDocValue extends DataInput {

    /** Sets the position in this stream. */
    public abstract void setPosition(int pos) throws IOException;

    /** Returns the current position in this stream. */
    public abstract int getPosition() throws IOException;
  }
}

I updated the prototype accordingly.

iverase · 2024-10-23T09:48:34Z

Now that lucene 10 has been released and our java minimum version is 21, the RandomAccessInput API got efficient methods to read byte[] from the method #readBytes, I think this API makes more sense now to use that abstraction as in my opinion os the closest API to a BytesRef but not implying the the bytes are on heap.

In this case the API can be as simple as something like:

/**
 * A per-document binary value.
 *
 * @lucene.experimental
 */
public abstract class RandomAccessInputDocValues extends DocValuesIterator {

  /** Sole constructor. (For invocation by subclass constructors, typically implicit.) */
  protected RandomAccessInputDocValues() {}

  /**
   * Returns the binary value as a {@link RandomAccessInput} for the current document ID. The bytes
   * start at position 0 up to {@link RandomAccessInput#length()}. It is illegal to call this method
   * after {@link #advanceExact(int)} returned {@code false}.
   *
   * @return the binary value as a {@link RandomAccessInput}
   */
  public abstract RandomAccessInput randomAccessInputValue() throws IOException;
}

iverase added the type:enhancement label Jul 25, 2023

iverase mentioned this issue Jul 25, 2023

Allow reading binary doc values as a DataInput #12460

Closed

iverase mentioned this issue Sep 15, 2023

Allow reading binary stored values as DataInput #12556

Closed

This was referenced Oct 17, 2024

ES nodes OOM due to unbounded allocation's in Lucene90DocValuesProducer elastic/elasticsearch#114941

Open

Should we avoid allocating a byte[] upfront for binary doc values #13929

Closed

iverase mentioned this issue Oct 23, 2024

Allow reading binary doc values as a RandomAccessInput #13948

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow reading binary doc values as a DataInput #12459

Allow reading binary doc values as a DataInput #12459

iverase commented Jul 25, 2023

iverase commented Jul 25, 2023

iverase commented Jul 25, 2023

jpountz commented Jul 25, 2023

iverase commented Jul 26, 2023

iverase commented Oct 23, 2024

Allow reading binary doc values as a DataInput #12459

Allow reading binary doc values as a DataInput #12459

Comments

iverase commented Jul 25, 2023

Description

iverase commented Jul 25, 2023

iverase commented Jul 25, 2023

jpountz commented Jul 25, 2023

iverase commented Jul 26, 2023

iverase commented Oct 23, 2024