feat: add ipc RecordBatch encoding #555

bkietz · 2024-07-10T20:43:33Z

added ArrowIpcEncoderEncodeRecordBatch()
added ArrowIpcEncoderBuildContiguousBodyBuffer() to provide a default buffer encoder implementation which simply concatenates a batch's buffers into one contiguous (properly aligned and padded) body buffer
testing uses the decoder tests, replacing arrow C++'s encoder with the new nanoarrow encoder
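
For readers unfamiliar with the body layout, here is a minimal sketch of what "concatenate into one contiguous, aligned and padded body buffer" can look like. It is not the code from this PR: `BufferSketch`, `PadTo8()` and `BuildContiguousBody()` are hypothetical names, and 8-byte alignment is assumed.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical view of one input buffer (not a nanoarrow type). */
struct BufferSketch {
  const uint8_t* data;
  int64_t size_bytes;
};

/* Round n up to the next multiple of 8 (assumed alignment). */
static int64_t PadTo8(int64_t n) { return (n + 7) & ~((int64_t)7); }

/* Concatenate n buffers into one zero-padded body.
 * offsets[i] receives the start of buffer i within the body.
 * Returns a malloc()ed body (caller frees) or NULL on allocation failure. */
static uint8_t* BuildContiguousBody(const struct BufferSketch* buffers, int n,
                                    int64_t* offsets, int64_t* body_size) {
  int64_t total = 0;
  for (int i = 0; i < n; i++) {
    offsets[i] = total;
    total += PadTo8(buffers[i].size_bytes);
  }

  /* calloc() so that padding bytes are zero; allocate at least one byte */
  uint8_t* body = (uint8_t*)calloc(total > 0 ? (size_t)total : 1, 1);
  if (body == NULL) return NULL;

  for (int i = 0; i < n; i++) {
    if (buffers[i].size_bytes > 0) {
      memcpy(body + offsets[i], buffers[i].data, (size_t)buffers[i].size_bytes);
    }
  }

  *body_size = total;
  return body;
}
```

The recorded offsets and unpadded sizes are what a RecordBatch message would need to describe each buffer within the body.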

paleolimbot

I know you're still working on this, just a few preliminary things to think about!

In general, I do think we are going to need an ArrowIpcEncoder to avoid painful refactors or excessive numbers of arguments when we add additional features. It may also help collapse status = ArrowXXXX() into NANOARROW_RETURN_NOT_OK() (by making the current flatcc builder something that's cleaned up on ArrowIpcEncoderReset()).
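
As an illustration of the point about collapsing `status = ...` checks, the sketch below assumes the encoder owns its scratch state and releases it in a reset function, so every failing step can return immediately. `SKETCH_RETURN_NOT_OK`, `EncoderSketch` and the step functions are hypothetical stand-ins, not nanoarrow or flatcc APIs.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for NANOARROW_RETURN_NOT_OK(). */
#define SKETCH_RETURN_NOT_OK(expr)    \
  do {                                \
    const int status_ = (expr);       \
    if (status_ != 0) return status_; \
  } while (0)

struct EncoderSketch {
  void* scratch; /* stands in for the flatcc builder */
};

static int EncoderSketchInit(struct EncoderSketch* e) {
  e->scratch = malloc(64);
  return e->scratch != NULL ? 0 : ENOMEM;
}

/* All cleanup lives here, so encode functions need no cleanup paths. */
static void EncoderSketchReset(struct EncoderSketch* e) {
  free(e->scratch);
  e->scratch = NULL;
}

static int StepOk(struct EncoderSketch* e) {
  (void)e;
  return 0;
}

static int StepFails(struct EncoderSketch* e) {
  (void)e;
  return EINVAL;
}

/* Early returns are safe: the caller still owns `e` and resets it later. */
static int EncodeSketch(struct EncoderSketch* e, int fail) {
  SKETCH_RETURN_NOT_OK(StepOk(e));
  if (fail) {
    SKETCH_RETURN_NOT_OK(StepFails(e));
  }
  return 0;
}
```

Because the early return leaks nothing, there is no need to thread a `status` variable through to a shared cleanup label.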

paleolimbot

A few more thoughts! Feel free to leave the RecordBatch encoding to a follow-up PR (it would be easier to review, although I'm also happy to review together if that's easier for you!).

codecov-commenter · 2024-07-13T15:31:05Z

Codecov Report

Attention: Patch coverage is 92.27941% with 21 lines in your changes missing coverage. Please review.

Project coverage is 89.10%. Comparing base (41e9e71) to head (f0025a1).
Report is 38 commits behind head on main.

Files                          Patch %    Lines
src/nanoarrow/ipc/encoder.c    93.28%     18 Missing ⚠️
src/nanoarrow/ipc/decoder.c    25.00%     3 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #555      +/-   ##
==========================================
+ Coverage   88.92%   89.10%   +0.17%     
==========================================
  Files          89       95       +6     
  Lines       16339    15421     -918     
==========================================
- Hits        14530    13741     -789     
+ Misses       1809     1680     -129

paleolimbot

Thanks!

The structure of this is great. I know I said earlier that one big PR was OK, but the scope has enlarged to the point where I would prefer to review as multiple PRs. In particular, it makes it easier to ensure full test coverage for each component (also vaguely better for the changelog). Perhaps:

Add array view null counter
Add encoder.c with the encoder infrastructure and test for create/destroy
Add schema writing
Add batch writing + unit tests
Add encoder to the ipc_files test (which is our current infrastructure for sending random batches of various types through our IPC library)

I know that there are some testing patterns and style differences with Arrow C++ that take some getting used to here. I'm absolutely open to changing those; however, if they do change they should change in the rest of the code base too (in dedicated PRs). I think the smaller PRs will help with that (to slowly get used to the existing conventions without having to rewrite things).

bkietz · 2024-07-30T21:30:13Z

Rebased on #568

- added ArrowIpcEncoderEncodeSchema
- added a parameter to ArrowIpcEncoderFinalizeBuffer which controls whether encapsulated/padded message buffers will be produced instead of raw
- tests reuse the decoder tests, replacing arrow C++'s encoder with ArrowIpcEncoder

Extracted from #555 (review)

paleolimbot · 2024-08-02T00:18:16Z

src/nanoarrow/ipc/encoder.c

+  int compressed_buffer_header =
+      encoder->codec != NANOARROW_IPC_COMPRESSION_TYPE_NONE ? sizeof(int64_t) : 0;


Does it make sense to have the codec member of ArrowIpcEncoder? It seems like that is a very simple way of handling compression that will have to change at some point (either to handle more complex ways of choosing which buffers to compress or adding options unrelated to compression).

I think it suffices for now. Even if we later extend with more advanced heuristics for determining if a buffer should be compressed, the IPC format only allows a single codec to be provided per schema/stream (which means we'll still have only one codec per encoder).

I had in mind something more like how it is specified in Arrow C++ (as a member of an Options struct that includes some other options about IPC writing). I don't see any advantages to preemptively adding references to a feature that doesn't exist yet (and might never exist, although I hope that it does).

I'll remove compression support for now. I'm not sure how an options struct would be advantageous here, but we can hash that out later

Agreed (it might not be useful, or might not be useful here!)
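
For context on the quoted lines: the `sizeof(int64_t)` header reflects the Arrow IPC convention that a compressed buffer is preceded by its uncompressed length as a little-endian 64-bit integer. Below is a minimal sketch of writing such a prefix. `WriteLengthPrefixSketch()` is a hypothetical helper and was not part of this PR, which dropped compression support.

```c
#include <stddef.h>
#include <stdint.h>

/* Write the 8-byte little-endian uncompressed length that precedes a
 * compressed buffer. Returns the number of bytes written.
 * Byte-wise shifts keep the output independent of host endianness. */
static size_t WriteLengthPrefixSketch(uint8_t* out, int64_t uncompressed_size) {
  const uint64_t v = (uint64_t)uncompressed_size;
  for (int i = 0; i < 8; i++) {
    out[i] = (uint8_t)(v >> (8 * i));
  }
  return 8;
}
```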

paleolimbot · 2024-08-02T00:37:13Z

src/nanoarrow/ipc/encoder.c

+ArrowErrorCode ArrowIpcEncoderEncodeRecordBatch(struct ArrowIpcEncoder* encoder,
+                                                const struct ArrowArrayView* array_view,
+                                                struct ArrowError* error) {
+  NANOARROW_DCHECK(encoder != NULL && encoder->private_data != NULL && schema != NULL);


Does schema exist here?

nope, that's a copy paste error. I found this in another PR. It seems that NANOARROW_DEBUG isn't getting defined

Ah, that would be:

arrow-nanoarrow/CMakeLists.txt

Line 148 in f74d57c

target_compile_definitions(nanoarrow PUBLIC "$<$<CONFIG:Debug>:NANOARROW_DEBUG>")

...that doesn't exist for nanoarrow_ipc. (No need to fix that here since it's not your fault 😬 )

Should be fixed by #573
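
To show why the stray `schema` reference compiled, here is a minimal sketch of a debug-only check macro. `SKETCH_DCHECK` and `SKETCH_DEBUG` are hypothetical stand-ins for `NANOARROW_DCHECK` and `NANOARROW_DEBUG`; the assumption is that the real macro likewise expands to nothing when the debug define is absent.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for NANOARROW_DCHECK(): the expression is only
 * compiled when SKETCH_DEBUG is defined. */
#if defined(SKETCH_DEBUG)
#define SKETCH_DCHECK(expr) assert(expr)
#else
#define SKETCH_DCHECK(expr)
#endif

static int EncodeBatchSketch(const int* array_view) {
  /* `schema` is not declared anywhere. Without SKETCH_DEBUG the macro
   * expands to nothing, so the compiler never sees the name. */
  SKETCH_DCHECK(array_view != NULL && schema != NULL);
  return array_view[0];
}
```

Compiling the same file with `-DSKETCH_DEBUG` would be expected to fail with an undeclared-identifier error, which is the failure CI would have surfaced had the define been set for the IPC target.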

paleolimbot · 2024-08-02T01:17:59Z

src/nanoarrow/ipc/encoder.c

+  FLATCC_RETURN_UNLESS_0(RecordBatch_nodes_create(  //
+      builder, (struct ns(FieldNode)*)private->nodes.data,
+      private->nodes.size_bytes / sizeof(struct ns(FieldNode))));
+  FLATCC_RETURN_UNLESS_0(RecordBatch_buffers_create(  //
+      builder, (struct ns(Buffer)*)private->buffers.data,
+      private->buffers.size_bytes / sizeof(struct ns(Buffer))));


Is it worth just caching the builder for the RecordBatch message and the int64_t* or ns(FieldNode)* to the start of each (to avoid allocating a builder for every batch?). The decoder requires initializing with a schema to make some allocations that only need to happen once (here an ArrowArrayView would do).

I'm not sure what you mean; the builder is allocated in Init and only deallocated in Reset. The buffers used to store ns(FieldNode) or ns(Buffer) are also reused for each batch until Reset.

I was wondering if we could avoid creating the new root/fields and copying our cached buffer into it (updating in place instead), but flatcc is probably very good at anticipating repeated message building and avoiding allocations (and the copy is probably not expensive).

I'm not sure what state flatcc maintains inside its builders. Some of the example code suggests we could have it maintain independent vectors which are not specifically part of a field which is currently being built. I've tried to use that a few times and gotten odd errors when I strayed from strict state machine style, so I'd prefer to experiment independent of other changes in a follow up.

paleolimbot · 2024-08-02T02:10:57Z

src/nanoarrow/ipc/decoder_test.cc

@@ -763,6 +764,132 @@ TEST_P(ArrowTypeParameterizedTestFixture, NanoarrowIpcArrowArrayRoundtrip) {
  ArrowIpcDecoderReset(&decoder);
 }

+struct ArrowArrayViewEqualTo {


This is awesome! The negative matches and messages here are also not tested and I am not sure anybody looking to see if we had a utility to help with array equality would look in decoder_test.cc to find it. Something like:

    void AssertArrayViewIdentical(const ArrowArrayView* actual,
                                  const ArrowArrayView* expected) {
      // Dictionaries are not compared by this helper
      NANOARROW_DCHECK(actual->dictionary == nullptr);
      NANOARROW_DCHECK(expected->dictionary == nullptr);

      ASSERT_EQ(actual->storage_type, expected->storage_type);
      ASSERT_EQ(actual->offset, expected->offset);
      ASSERT_EQ(actual->length, expected->length);

      // Compare each of the (up to three) buffers byte for byte
      for (int i = 0; i < 3; i++) {
        auto a_buf = actual->buffer_views[i];
        auto e_buf = expected->buffer_views[i];
        ASSERT_EQ(a_buf.size_bytes, e_buf.size_bytes);
        if (a_buf.size_bytes != 0) {
          ASSERT_EQ(memcmp(a_buf.data.data, e_buf.data.data, a_buf.size_bytes), 0);
        }
      }

      // Recurse into children
      ASSERT_EQ(actual->n_children, expected->n_children);
      for (int64_t i = 0; i < actual->n_children; i++) {
        AssertArrayViewIdentical(actual->children[i], expected->children[i]);
      }
    }

...will give terrible error messages but can be followed up with a tested version of this helper (or a version that partly lives in the C library since this comes up in C situations as well).

can we leave extraction of an array equality helper for a follow up?

Yes! But since you still need something to test the roundtrip, this suggestion was to use something more compact than the equality helper that is currently here.

~~Thanks! In the meantime I bet we can pin the sha of the action before the change or something (I'll try Monday)~~ (meant for another thread! 😬 )

…572) First noticed at #555 (comment) , the R check action is failing because an update to r-lib actions resulted in some quarto actions being invoked, and these have not yet been whitelisted for use in Apache repositories. It also may be that we don't need the quarto actions (we probably don't) but some brief experimentation to attempt circumventing the use of the quarto action did not result in a successful workflow. Hence, a pin to unblock PR checks until either the v2 branch is updated or it is clear how to avoid the failure.

In previous PRs we consolidated the extensions into the main CMakeLists.txt; however, there were some things happening for certain targets (like installing them or setting NANOARROW_DEBUG) but not others. Noticed in #555 (comment) where there was a DCHECK referencing a variable that didn't exist that made it through CI 😬

paleolimbot

Thank you!

bkietz requested a review from paleolimbot July 10, 2024 20:43

bkietz self-assigned this Jul 10, 2024

paleolimbot reviewed Jul 11, 2024

paleolimbot reviewed Jul 13, 2024

bkietz marked this pull request as ready for review July 15, 2024 16:15

paleolimbot reviewed Jul 18, 2024

bkietz mentioned this pull request Jul 19, 2024

add ArrowArrayViewComputeNullCount #562

Merged

paleolimbot pushed a commit that referenced this pull request Jul 19, 2024

feat: Add ArrowArrayViewComputeNullCount (#562)

9a00532

Adds `ArrowArrayViewComputeNullCount()` and tests. Extracted from #555

bkietz mentioned this pull request Jul 23, 2024

feat: Add IPC writer scaffolding #564

Merged

paleolimbot pushed a commit that referenced this pull request Jul 25, 2024

feat: Add IPC writer scaffolding (#564)

2040e74

Add `ArrowIpcEncoder`, init/reset, and tests. Extracted from #555 (review)

bkietz mentioned this pull request Jul 26, 2024

feat: Add IPC schema encoding #568

Merged

bkietz force-pushed the ipc-write branch 3 times, most recently from 5f36f64 to 0068806 Compare July 30, 2024 21:29

bkietz force-pushed the ipc-write branch 2 times, most recently from 95ea4cf to 2e3469b Compare July 31, 2024 16:38

bkietz changed the title ~~Adding IPC write support~~ feat: add ipc RecordBatch encoding Jul 31, 2024

bkietz force-pushed the ipc-write branch from 2e3469b to d0ff82e Compare July 31, 2024 16:41

paleolimbot reviewed Aug 2, 2024

bkietz force-pushed the ipc-write branch from 2f0b8d0 to 9667735 Compare August 2, 2024 17:54

bkietz commented Aug 2, 2024

paleolimbot reviewed Aug 3, 2024

This was referenced Aug 5, 2024

fix(ci): Pin r-lib actions as a workaround for latest action updates #572

Merged

refactor: Consolidate per-target actions in CMakeLists.txt #573

Merged

bkietz force-pushed the ipc-write branch from 1ce9686 to 41ec1b3 Compare August 5, 2024 19:02

bkietz force-pushed the ipc-write branch from 8fe3790 to 477f01e Compare August 6, 2024 16:18

paleolimbot approved these changes Aug 6, 2024

bkietz added 10 commits August 6, 2024 15:12

Add RecordBatch encoding

37e84bb

fix sign conversion

faaea85

review comments

e3e644d

try repairing broken GH action

9c1fefb

++i -> i++

ea2cf93

bool?

c10626e

delete codec, explicit error for macros

9c2ac55

simplify array view equality checking

1868656

rassafrassa pedantic checks

bad93ec

rebase fixups

af2d913

bkietz force-pushed the ipc-write branch from 477f01e to af2d913 Compare August 6, 2024 20:13

bkietz merged commit cb89444 into apache:main Aug 7, 2024
34 checks passed

bkietz deleted the ipc-write branch August 7, 2024 00:57

paleolimbot added this to the nanoarrow 0.6.0 milestone Sep 17, 2024
