feat: add ipc RecordBatch encoding #555

Merged
merged 10 commits into from
Aug 7, 2024

@bkietz bkietz commented Jul 10, 2024

  • added ArrowIpcEncoderEncodeRecordBatch()
  • added ArrowIpcEncoderBuildContiguousBodyBuffer() to provide a default buffer encoder implementation which simply concatenates a batch's buffers into one contiguous (properly aligned and padded) body buffer
  • testing uses the decoder tests, replacing arrow C++'s encoder with the new nanoarrow encoder
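The "contiguous (properly aligned and padded) body buffer" idea from the description can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `BufferSpan`, `PadTo8`, `BodySize`, and `ConcatenateBuffers` are hypothetical names, and 8-byte alignment is an assumption made for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical view of one source buffer; not a nanoarrow type. */
struct BufferSpan {
  const uint8_t* data;
  size_t size_bytes;
};

/* Round n up to the next multiple of 8 (the alignment assumed here). */
static size_t PadTo8(size_t n) { return (n + 7) & ~(size_t)7; }

/* Total body size when every buffer is padded to an 8-byte boundary. */
static size_t BodySize(const struct BufferSpan* spans, size_t n_spans) {
  size_t total = 0;
  for (size_t i = 0; i < n_spans; i++) total += PadTo8(spans[i].size_bytes);
  return total;
}

/* Copy each buffer into `body` at an aligned offset and zero the padding.
 * offsets[i] receives the start of buffer i; returns the bytes written. */
static size_t ConcatenateBuffers(const struct BufferSpan* spans, size_t n_spans,
                                 uint8_t* body, size_t body_capacity,
                                 size_t* offsets) {
  assert(body_capacity >= BodySize(spans, n_spans));
  size_t offset = 0;
  for (size_t i = 0; i < n_spans; i++) {
    size_t size = spans[i].size_bytes;
    size_t padded = PadTo8(size);
    offsets[i] = offset;
    if (size > 0) memcpy(body + offset, spans[i].data, size);
    memset(body + offset + size, 0, padded - size);
    offset += padded;
  }
  return offset;
}
```

Because every padded length is a multiple of 8, each recorded offset is also a multiple of 8, which is what lets the offsets be stored directly as buffer positions in the message metadata.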

@bkietz bkietz requested a review from paleolimbot July 10, 2024 20:43
@bkietz bkietz self-assigned this Jul 10, 2024
@paleolimbot paleolimbot left a comment

I know you're still working on this, just a few preliminary things to think about!

In general, I do think we are going to need an ArrowIpcEncoder to avoid painful refactors or excessive numbers of arguments when we add additional features. It may also help collapse status = ArrowXXXX() into NANOARROW_RETURN_NOT_OK() (by making the current flatcc builder something that's cleaned up on ArrowIpcEncoderReset()).
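A rough sketch of the pattern being suggested: when temporary state is owned by the encoder and released in its reset function, each step can return early on error without local cleanup. Everything here is hypothetical (`SketchEncoder`, `RETURN_NOT_OK`, and the step functions); the macro is only a stand-in for what `NANOARROW_RETURN_NOT_OK()` is assumed to do.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Stand-in for an early-return macro: propagate any nonzero error code. */
#define RETURN_NOT_OK(expr)       \
  do {                            \
    const int code_ = (expr);     \
    if (code_ != 0) return code_; \
  } while (0)

/* Hypothetical encoder that owns its scratch state. */
struct SketchEncoder {
  void* scratch;
};

static int SketchEncoderInit(struct SketchEncoder* encoder) {
  assert(encoder != NULL);
  encoder->scratch = malloc(64);
  return encoder->scratch != NULL ? 0 : ENOMEM;
}

/* The single place where scratch state is released. */
static void SketchEncoderReset(struct SketchEncoder* encoder) {
  free(encoder->scratch);
  encoder->scratch = NULL;
}

static int StepThatSucceeds(struct SketchEncoder* encoder) {
  (void)encoder;
  return 0;
}

static int StepThatFails(struct SketchEncoder* encoder) {
  (void)encoder;
  return EINVAL;
}

/* No `status = ...; if (status) { cleanup; return status; }` blocks are
 * needed because the caller's Reset releases everything. */
static int SketchEncode(struct SketchEncoder* encoder) {
  RETURN_NOT_OK(StepThatSucceeds(encoder));
  RETURN_NOT_OK(StepThatFails(encoder));
  return 0;
}
```

The trade-off is that the caller must always call the reset function, including after a failed encode.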

@paleolimbot paleolimbot left a comment

A few more thoughts! Feel free to leave the RecordBatch encoding to a follow-up PR (it would be easier to review, although I'm also happy to review together if that's easier for you!).

@codecov-commenter

Codecov Report

Attention: Patch coverage is 92.27941% with 21 lines in your changes missing coverage. Please review.

Project coverage is 89.10%. Comparing base (41e9e71) to head (f0025a1).
Report is 38 commits behind head on main.

Files                          Patch %   Lines
src/nanoarrow/ipc/encoder.c    93.28%    18 Missing ⚠️
src/nanoarrow/ipc/decoder.c    25.00%    3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #555      +/-   ##
==========================================
+ Coverage   88.92%   89.10%   +0.17%     
==========================================
  Files          89       95       +6     
  Lines       16339    15421     -918     
==========================================
- Hits        14530    13741     -789     
+ Misses       1809     1680     -129     

@bkietz bkietz marked this pull request as ready for review July 15, 2024 16:15
@paleolimbot paleolimbot left a comment

Thanks!

The structure of this is great. I know I said earlier that one big PR was OK, but the scope has enlarged to the point where I would prefer to review as multiple PRs. In particular, it makes it easier to ensure full test coverage for each component (also vaguely better for the changelog). Perhaps:

  • Add array view null counter
  • Add encoder.c with the encoder infrastructure and test for create/destroy
  • Add schema writing
  • Add batch writing + unit tests
  • Add encoder to the ipc_files test (which is our current infrastructure for sending random batches of various types through our IPC library)

I know that there are some testing patterns and style differences with Arrow C++ that take some getting used to here. I'm absolutely open to changing those; however, if they do change they should change in the rest of the code base too (in dedicated PRs). I think the smaller PRs will help with that (to slowly get used to the existing conventions without having to rewrite things).

paleolimbot pushed a commit that referenced this pull request Jul 19, 2024
Adds `ArrowArrayViewComputeNullCount()` and tests. Extracted from
#555
paleolimbot pushed a commit that referenced this pull request Jul 25, 2024
Add `ArrowIpcEncoder`, init/reset, and tests. Extracted from
#555 (review)
@bkietz bkietz force-pushed the ipc-write branch 3 times, most recently from 5f36f64 to 0068806 Compare July 30, 2024 21:29
@bkietz bkietz commented Jul 30, 2024

Rebased on #568

bkietz added a commit that referenced this pull request Jul 31, 2024
- added ArrowIpcEncoderEncodeSchema
- added a parameter to ArrowIpcEncoderFinalizeBuffer which controls
whether encapsulated/padded message buffers will be produced instead of
raw
- tests reuse the decoder tests, replacing arrow C++'s encoder with
ArrowIpcEncoder

Extracted from
#555 (review)
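For context on "encapsulated/padded" versus raw message buffers: the Arrow IPC encapsulated message format prefixes the Flatbuffers metadata with a 0xFFFFFFFF continuation indicator and a little-endian int32 metadata size, then pads the metadata to an 8-byte boundary. The sketch below illustrates that framing only; `EncapsulateMessage` is a hypothetical helper, and whether ArrowIpcEncoderFinalizeBuffer works this way internally is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Writes an encapsulated message into `out` and returns the bytes written.
 * `out` must have room for 8 + raw_size rounded up to a multiple of 8. */
static size_t EncapsulateMessage(const uint8_t* raw, int32_t raw_size,
                                 uint8_t* out) {
  assert(raw_size >= 0 && raw_size <= INT32_MAX - 7);

  /* Metadata length including padding to an 8-byte boundary. */
  uint32_t padded_size = ((uint32_t)raw_size + 7u) & ~7u;

  /* Continuation indicator: four 0xFF bytes. */
  memset(out, 0xFF, 4);

  /* Metadata size as a little-endian int32, written byte by byte so the
   * result does not depend on host endianness. */
  for (int i = 0; i < 4; i++) out[4 + i] = (uint8_t)(padded_size >> (8 * i));

  /* Raw metadata followed by zeroed padding. */
  if (raw_size > 0) memcpy(out + 8, raw, (size_t)raw_size);
  memset(out + 8 + raw_size, 0, padded_size - (uint32_t)raw_size);
  return 8 + (size_t)padded_size;
}
```

A 5-byte raw message, for example, becomes 16 bytes: 8 bytes of prefix, 5 bytes of metadata, and 3 bytes of padding.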
@bkietz bkietz force-pushed the ipc-write branch 2 times, most recently from 95ea4cf to 2e3469b Compare July 31, 2024 16:38
@bkietz bkietz changed the title from "Adding IPC write support" to "feat: add ipc RecordBatch encoding" Jul 31, 2024
Comment on lines 432 to 452
int compressed_buffer_header =
encoder->codec != NANOARROW_IPC_COMPRESSION_TYPE_NONE ? sizeof(int64_t) : 0;
Member

Does it make sense to have the codec member of ArrowIpcEncoder? It seems like that is a very simple way of handling compression that will have to change at some point (either to handle more complex ways of choosing which buffers to compress or adding options unrelated to compression).

Member Author

I think it suffices for now. Even if we later extend with more advanced heuristics for determining if a buffer should be compressed, the IPC format only allows a single codec to be provided per schema/stream (which means we'll still have only one codec per encoder).

Member

I had in mind something more like how it is specified in Arrow C++ (as a member of an Options struct that includes some other options about IPC writing). I don't see any advantages to preemptively adding references to a feature that doesn't exist yet (and might never exist, although I hope that it does).

Member Author

I'll remove compression support for now. I'm not sure how an options struct would be advantageous here, but we can hash that out later

Member

Agreed (it might not be useful, or might not be useful here!)

ArrowErrorCode ArrowIpcEncoderEncodeRecordBatch(struct ArrowIpcEncoder* encoder,
const struct ArrowArrayView* array_view,
struct ArrowError* error) {
NANOARROW_DCHECK(encoder != NULL && encoder->private_data != NULL && schema != NULL);
Member

Does schema exist here?

Member Author

nope, that's a copy paste error. I found this in another PR. It seems that NANOARROW_DEBUG isn't getting defined

Member

Ah, that would be:

target_compile_definitions(nanoarrow PUBLIC "$<$<CONFIG:Debug>:NANOARROW_DEBUG>")

...that doesn't exist for nanoarrow_ipc. (No need to fix that here since it's not your fault 😬 )
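To illustrate why a check that names a nonexistent variable can pass CI: if the debug-check macro expands to nothing when its debug define is absent, the preprocessor discards the argument before the compiler sees it. The sketch below uses hypothetical stand-ins (`SKETCH_DEBUG`, `SKETCH_DCHECK`, `EncodeSketch`) and assumes, without confirming it here, that NANOARROW_DCHECK behaves similarly.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for a debug define and its check macro. */
#if defined(SKETCH_DEBUG)
#define SKETCH_DCHECK(expr) assert(expr)
#else
#define SKETCH_DCHECK(expr)
#endif

static int EncodeSketch(const int* encoder) {
  /* `schema` is not declared anywhere. Without SKETCH_DEBUG the macro
   * argument is dropped by the preprocessor, so this still compiles.
   * Building with -DSKETCH_DEBUG turns it into a compile error. */
  SKETCH_DCHECK(encoder != NULL && schema != NULL);
  return *encoder;
}
```

This is why defining the debug macro for every target in Debug builds matters: only then does the compiler actually check the expressions inside the debug assertions.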

Member

Should be fixed by #573

Comment on lines 546 to 560
FLATCC_RETURN_UNLESS_0(RecordBatch_nodes_create( //
builder, (struct ns(FieldNode)*)private->nodes.data,
private->nodes.size_bytes / sizeof(struct ns(FieldNode))));
FLATCC_RETURN_UNLESS_0(RecordBatch_buffers_create( //
builder, (struct ns(Buffer)*)private->buffers.data,
private->buffers.size_bytes / sizeof(struct ns(Buffer))));
Member

Is it worth just caching the builder for the RecordBatch message and the int64_t* or ns(FieldNode)* to the start of each (to avoid allocating a builder for every batch)? The decoder requires initializing with a schema to make some allocations that only need to happen once (here an ArrowArrayView would do).

Member Author

I'm not sure what you mean; the builder is allocated in Init and only deallocated in Reset. The buffers used to store ns(FieldNode) or ns(Buffer) are also reused for each batch until Reset.

Member

I was wondering if we could avoid creating the new root/fields and copying our cached buffer into it (updating in place instead), but flatcc is probably very good at anticipating repeated message building and avoiding allocations (and the copy is probably not expensive).

Member Author

I'm not sure what state flatcc maintains inside its builders. Some of the example code suggests we could have it maintain independent vectors which are not specifically part of a field which is currently being built. I've tried to use that a few times and gotten odd errors when I strayed from strict state machine style, so I'd prefer to experiment independent of other changes in a follow up.

Member

Agreed!

@@ -763,6 +764,132 @@ TEST_P(ArrowTypeParameterizedTestFixture, NanoarrowIpcArrowArrayRoundtrip) {
ArrowIpcDecoderReset(&decoder);
}

struct ArrowArrayViewEqualTo {
Member

This is awesome! The negative matches and messages here are also not tested and I am not sure anybody looking to see if we had a utility to help with array equality would look in decoder_test.cc to find it. Something like:

void AssertArrayViewIdentical(const struct ArrowArrayView* actual,
                              const struct ArrowArrayView* expected) {
  // Dictionaries are not compared here, so require that neither side has one
  // (assumed intent; the original sketch checked != nullptr)
  NANOARROW_DCHECK(actual->dictionary == nullptr);
  NANOARROW_DCHECK(expected->dictionary == nullptr);

  ASSERT_EQ(actual->storage_type, expected->storage_type);
  ASSERT_EQ(actual->offset, expected->offset);
  ASSERT_EQ(actual->length, expected->length);
  for (int i = 0; i < 3; i++) {
    auto a_buf = actual->buffer_views[i];
    auto e_buf = expected->buffer_views[i];
    ASSERT_EQ(a_buf.size_bytes, e_buf.size_bytes);
    if (a_buf.size_bytes != 0) {
      ASSERT_EQ(memcmp(a_buf.data.data, e_buf.data.data, a_buf.size_bytes), 0);
    }
  }

  // Recurse into children after confirming both sides have the same count
  ASSERT_EQ(actual->n_children, expected->n_children);
  for (int i = 0; i < actual->n_children; i++) {
    AssertArrayViewIdentical(actual->children[i], expected->children[i]);
  }
}

...will give terrible error messages but can be followed up with a tested version of this helper (or a version that partly lives in the C library since this comes up in C situations as well).

Member Author

can we leave extraction of an array equality helper for a follow up?

Member

Yes! But since you still need something to test the roundtrip, this suggestion was to use something more compact than the equality helper that is currently here.

@paleolimbot paleolimbot Aug 3, 2024

Thanks! In the meantime I bet we can pin the sha of the action before the change or something (I'll try Monday) (meant for another thread! 😬 )

paleolimbot added a commit that referenced this pull request Aug 5, 2024
…572)

First noticed at
#555 (comment)
, the R check action is failing because an update to r-lib actions
resulted in some quarto actions being invoked, and these have not yet
been whitelisted for use in Apache repositories. It also may be that we
don't need the quarto actions (we probably don't) but some brief
experimentation to attempt circumventing the use of the quarto action
did not result in a successful workflow. Hence, a pin to unblock PR
checks until either the v2 branch is updated or it is clear how to avoid
the failure.
paleolimbot added a commit that referenced this pull request Aug 6, 2024
In previous PRs we consolidated the extensions into the main
CMakeLists.txt; however, there were some things happening for certain
targets (like installing them or setting NANOARROW_DEBUG) but not
others.

Noticed in
#555 (comment)
where there was a DCHECK referencing a variable that didn't exist that
made it through CI 😬
@paleolimbot paleolimbot left a comment

Thank you!

@bkietz bkietz merged commit cb89444 into apache:main Aug 7, 2024
34 checks passed
@bkietz bkietz deleted the ipc-write branch August 7, 2024 00:57
@paleolimbot paleolimbot added this to the nanoarrow 0.6.0 milestone Sep 17, 2024