Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow reading binary doc values as a DataInput #12459

Open
iverase opened this issue Jul 25, 2023 · 5 comments
Open

Allow reading binary doc values as a DataInput #12459

iverase opened this issue Jul 25, 2023 · 5 comments

Comments

@iverase
Copy link
Contributor

iverase commented Jul 25, 2023

Description

Binary doc values allow to store a variable number of bytes on a doc value. In order to read those bytes, we currently get a BytesRef from the API which contains the bytes on heap. In order to do that, the current implementation preallocates a byte array with the size equals to the biggest doc value. This strategy has two main drawbacks:

  1. We are using a byte array as a middle data structure so we are copying each doc value in this byte array. In many cases this is an unnecessary overhead.

  2. when one of the doc values is big, in the order of few megabytes, it can cause issues with small heaps (or even big heaps if big enough). This is due to the allocation of a big byte arrays upfront, that can be consider humongous allocations by the G1 garbage collector and it can cause heap issues under high load.

Therefore I would like to propose to add a new API on top of binary doc values that allows reading them using a DataInput. This data input can read directly from the underlaying IndexInput and therefore we don't need to copy data into an intermediate data structure and we don't need to preallocate a byte array.

The new API would be built on top of the existing binary doc values and it would look something like:

public abstract class DataInputDocValues extends DocValuesIterator {

  /** Sole constructor. (For invocation by subclass constructors, typically implicit.) */
  protected DataInputDocValues() {}

  /**
   * Returns the binary value wrapped as a {@link DataInput} for the current document ID. It is
   * illegal to call this method after {@link #advanceExact(int)} returned {@code false}.
   *
   * @return the binary value wrapped as a {@link DataInput}
   */
  public abstract DataInput dataInputValue() throws IOException;
}
@iverase
Copy link
Contributor Author

iverase commented Jul 25, 2023

I wrote a prototype for this change here: #12460

@iverase
Copy link
Contributor Author

iverase commented Jul 25, 2023

Something I am not happy with is that we loose the ability of having random access to the underlaying bytes without having to clone the data input.

@jpountz
Copy link
Contributor

jpountz commented Jul 25, 2023

Another argument in favor of this proposal that I can think of is that many call sites of this API want to decode numbers from the byte[] and end up re-inventing DataInput#readInt and similar methods.

@iverase
Copy link
Contributor Author

iverase commented Jul 26, 2023

I expect that the use case of wrapping the byte[] from a binary doc values with a ByteArrayDataInput to read the contents to be common, so this approach will give that in a more efficient way.

I think we should provide the possibility to be able to have random access so I would like to refine my original proposal by adding a DataInputDocValue which is just a DataInput which is positional aware. The API would look like:

public abstract class DataInputDocValues extends DocValuesIterator {

  /** Sole constructor. (For invocation by subclass constructors, typically implicit.) */
  protected DataInputDocValues() {}

  /**
   * Returns the binary value wrapped as a {@link DataInputDocValue} for the current document ID. It
   * is illegal to call this method after {@link #advanceExact(int)} returned {@code false}.
   *
   * @return the binary value wrapped as a {@link DataInputDocValue}
   */
  public abstract DataInputDocValue dataInputValue() throws IOException;

  /**
   * A {@link DataInput} view over a binary doc value which is positional aware.
   *
   * @lucene.experimental
   */
  public abstract static class DataInputDocValue extends DataInput {

    /** Sets the position in this stream. */
    public abstract void setPosition(int pos) throws IOException;

    /** Returns the current position in this stream. */
    public abstract int getPosition() throws IOException;
  }
}

I updated the prototype accordingly.

@iverase
Copy link
Contributor Author

iverase commented Oct 23, 2024

Now that lucene 10 has been released and our java minimum version is 21, the RandomAccessInput API got efficient methods to read byte[] from the method #readBytes, I think this API makes more sense now to use that abstraction as in my opinion os the closest API to a BytesRef but not implying the the bytes are on heap.

In this case the API can be as simple as something like:

/**
 * A per-document binary value.
 *
 * @lucene.experimental
 */
public abstract class RandomAccessInputDocValues extends DocValuesIterator {

  /** Sole constructor. (For invocation by subclass constructors, typically implicit.) */
  protected RandomAccessInputDocValues() {}

  /**
   * Returns the binary value as a {@link RandomAccessInput} for the current document ID. The bytes
   * start at position 0 up to {@link RandomAccessInput#length()}. It is illegal to call this method
   * after {@link #advanceExact(int)} returned {@code false}.
   *
   * @return the binary value as a {@link RandomAccessInput}
   */
  public abstract RandomAccessInput randomAccessInputValue() throws IOException;
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants