feat: Implement ArrowBitmapUnpackInt8Unsafe #276

WillAyd · 2023-08-16T16:40:32Z

No description provided.

WillAyd · 2023-08-16T16:42:19Z

src/nanoarrow/nanoarrow.h

@@ -664,6 +664,10 @@ static inline ArrowErrorCode ArrowBufferAppendBufferView(struct ArrowBuffer* buf
 /// \brief Extract a boolean value from a bitmap
 static inline int8_t ArrowBitGet(const uint8_t* bits, int64_t i);

+/// \brief Extract boolean values from a range in a bitmap
+static inline void ArrowBitsGet(const uint8_t* bits, int64_t start_offset, int64_t length,


What is perhaps a little confusing is that start_offset is a number of bits for this function, whereas the set functions refer to it as a number of bytes. I think that makes sense given the inputs/outputs, but maybe we should change the identifiers?

I'm not sure I understand...I think that offset and length are always "number of items" (bits for a uint8_t* bits; bytes for uint8_t*, etc) but perhaps there are functions for which this isn't the case?

That's a much simpler way of expressing this than I was putting out there

We should definitely update any function whose signature doesn't follow that convention! I can't spot any at a glance, but I'm happy to fix any inconsistency.
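For readers following along, the "number of items" convention from this thread might look roughly like the sketch below. Both helpers are hypothetical names made up for illustration (they are not nanoarrow functions); the only point is that the same `int64_t i` counts bits when the pointer is a packed bitmap and counts bytes when it is a plain byte buffer.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical helper: `i` counts bits because `bits` is a packed bitmap
// (least-significant bit first within each byte).
static inline int8_t sketch_bit_get(const uint8_t* bits, int64_t i) {
  assert(i >= 0);
  return (bits[i / 8] >> (i % 8)) & 1;
}

// Hypothetical helper: `i` counts bytes because `data` is a plain byte buffer.
static inline uint8_t sketch_byte_get(const uint8_t* data, int64_t i) {
  assert(i >= 0);
  return data[i];
}
```

With a two-byte buffer `{0x00, 0x02}`, `sketch_bit_get(buf, 9)` reads bit 1 of byte 1, while `sketch_byte_get(buf, 1)` reads the whole second byte.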

codecov-commenter · 2023-08-16T16:45:21Z

Codecov Report

Merging #276 (1a8ed33) into main (9ce719a) will increase coverage by 0.02%.
Report is 3 commits behind head on main.
The diff coverage is 96.66%.

@@            Coverage Diff             @@
##             main     #276      +/-   ##
==========================================
+ Coverage   87.19%   87.22%   +0.02%     
==========================================
  Files          66       66              
  Lines       10128    10158      +30     
==========================================
+ Hits         8831     8860      +29     
- Misses       1297     1298       +1

| Files Changed | Coverage Δ | |
|---|---|---|
| src/nanoarrow/nanoarrow.h | `100.00% <ø> (ø)` | |
| src/nanoarrow/buffer_inline.h | `98.55% <96.66%> (-0.23%)` | ⬇️ |

... and 1 file with indirect coverage changes

paleolimbot

This is awesome! I'm excited to follow this up with a 32-bit version for R's logical vectors. A few notes to start the discussion!

paleolimbot · 2023-08-16T17:21:04Z

src/nanoarrow/buffer_inline.h

+static inline void ArrowBitsGet(const uint8_t* bits, int64_t start_offset, int64_t length,
+                                int8_t* out) {


Maybe ArrowBitUnpackInt8()? (As a follow-up I'll add the Int32() version)

paleolimbot · 2023-08-16T17:25:26Z

src/nanoarrow/buffer_inline.h

+  const uint8_t* bits_cursor = bits;
+  int64_t n_remaining = length;
+  int8_t* out_cursor = out;


Do you mind using the same terminology/approach as ArrowBitCountSet()? ( https://github.com/apache/arrow-nanoarrow/blob/main/src/nanoarrow/buffer_inline.h#L297-L302 ). I recently had to fix a segfault resulting from index math on the first/middle/last byte and while I'm not sure the approach there is better than what you have here, it will probably help fix the next issue with either function to have the code look similar for both functions.
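As a rough illustration of the first/middle/last byte structure being discussed, here is a minimal sketch in C. It is not the code from this PR and not the body of ArrowBitCountSet(); the `sketch_` function names are hypothetical, and the local names (`i_begin`, `bytes_last_valid`, and so on) are only assumed to resemble the terminology in the linked lines.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical helper: unpack all 8 bits of one byte, LSB first.
static inline void sketch_unpack_byte(uint8_t word, int8_t* out) {
  for (int i = 0; i < 8; i++) {
    out[i] = (word >> i) & 1;
  }
}

// Hypothetical sketch: unpack `length` bits starting at bit `start_offset`.
static inline void sketch_bits_unpack_int8(const uint8_t* bits, int64_t start_offset,
                                           int64_t length, int8_t* out) {
  assert(start_offset >= 0 && length >= 0);
  if (length == 0) {
    return;
  }

  const int64_t i_begin = start_offset;
  const int64_t i_end = start_offset + length;
  const int64_t i_last_valid = i_end - 1;
  const int64_t bytes_begin = i_begin / 8;
  const int64_t bytes_last_valid = i_last_valid / 8;

  // All requested bits live in a single byte.
  if (bytes_begin == bytes_last_valid) {
    for (int64_t i = i_begin % 8; i <= i_last_valid % 8; i++) {
      *out++ = (bits[bytes_begin] >> i) & 1;
    }
    return;
  }

  // First byte: may start partway through.
  for (int64_t i = i_begin % 8; i < 8; i++) {
    *out++ = (bits[bytes_begin] >> i) & 1;
  }

  // Middle bytes: always complete.
  for (int64_t i = bytes_begin + 1; i < bytes_last_valid; i++) {
    sketch_unpack_byte(bits[i], out);
    out += 8;
  }

  // Last byte: may end partway through (or be a full byte).
  const int64_t n_last = (i_end % 8 == 0) ? 8 : (i_end % 8);
  for (int64_t i = 0; i < n_last; i++) {
    *out++ = (bits[bytes_last_valid] >> i) & 1;
  }
}
```

The single-byte branch is the case that is easiest to get wrong, since the "first byte" and "last byte" loops would otherwise both run over the same byte.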

paleolimbot · 2023-08-16T17:30:20Z

src/nanoarrow/buffer_inline.h

+  for (int i = 0; i < 8; i++) {
+    out[i] = (*bits >> i) & 1;
+  }


I don't know the details well enough to know if it matters, but Arrow C++ writes out all N lines explicitly ( https://github.com/apache/arrow/blob/main/cpp/src/arrow/util/bpacking_default.h#L35-L105 )
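For comparison, an explicitly unrolled form of the loop above might look like the following sketch. The function name is hypothetical, and this is not a copy of the Arrow C++ code linked in the comment; whether it is any faster than the loop likely depends on the compiler, since optimizing compilers often unroll a fixed 8-iteration loop on their own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: one statement per bit instead of a loop.
static inline void sketch_unpack_byte_unrolled(uint8_t word, int8_t* out) {
  assert(out != NULL);
  out[0] = (word >> 0) & 1;
  out[1] = (word >> 1) & 1;
  out[2] = (word >> 2) & 1;
  out[3] = (word >> 3) & 1;
  out[4] = (word >> 4) & 1;
  out[5] = (word >> 5) & 1;
  out[6] = (word >> 6) & 1;
  out[7] = (word >> 7) & 1;
}
```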

paleolimbot · 2023-08-16T17:43:23Z

src/nanoarrow/buffer_test.cc

+
+  bitmap[2] = 0xfd;
+  ArrowBitsGet(bitmap, 0, sizeof(result), result);
+  EXPECT_EQ(result[16 + 0], 1);


Would an expectation in the form EXPECT_EQ(std::basic_string<uint8_t>(result, n), std::basic_string<uint8_t>({0, 1, 1, 0, 0, 0})) be any more readable? It might scale better to include more cases (e.g., offset != 0, all bits within the same byte)

Not sure if the newer test design is exactly what you are looking for from a readability perspective, but should definitely give better coverage

It's perfect! Thanks! Does it make sense to remove this first bit that doesn't use the helper? (I am a little confused as to what it's testing that the tests below it don't cover).

paleolimbot · 2023-08-16T17:50:41Z

src/nanoarrow/buffer_test.cc

+  uint8_t bitmap[10];
+  int8_t result[sizeof(bitmap) * 8];


You can probably test all the cases with just 3 bytes (which might make the expectations a little less verbose)? I recently fixed a segfault in the first byte/middle byte/last byte logic in ArrowBitCountSet() and it may be worth testing that here, too (i.e., offset == 0, offset > 0 that does not align on a byte boundary, offset + length ends on a byte boundary, offset + length does not end on a byte boundary, offset + length refer to bits in the same byte).
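One way to cover the cases listed above with a 3-byte bitmap is sketched below in plain C rather than GoogleTest. Everything here is hypothetical scaffolding for illustration: `sketch_unpack_reference()` is a deliberately naive bit-at-a-time unpacker, and `sketch_count_failures()` compares any candidate implementation with the same signature against it.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef void (*sketch_unpack_fn)(const uint8_t* bits, int64_t start_offset,
                                 int64_t length, int8_t* out);

// Naive reference: one bit at a time, no first/middle/last special-casing.
static inline void sketch_unpack_reference(const uint8_t* bits, int64_t start_offset,
                                           int64_t length, int8_t* out) {
  assert(start_offset >= 0 && length >= 0);
  for (int64_t i = 0; i < length; i++) {
    const int64_t bit = start_offset + i;
    out[i] = (bits[bit / 8] >> (bit % 8)) & 1;
  }
}

struct sketch_case {
  int64_t offset;
  int64_t length;
};

// Returns how many cases the candidate `fn` gets wrong.
static inline int sketch_count_failures(sketch_unpack_fn fn) {
  static const uint8_t bitmap[3] = {0xA5, 0x3C, 0xFD};
  static const struct sketch_case cases[] = {
      {0, 24},  // offset == 0, ends on a byte boundary
      {0, 21},  // offset == 0, does not end on a byte boundary
      {3, 21},  // unaligned offset, ends on a byte boundary
      {3, 15},  // unaligned offset, does not end on a byte boundary
      {10, 4},  // offset and offset + length fall in the same byte
  };

  int failures = 0;
  for (size_t c = 0; c < sizeof(cases) / sizeof(cases[0]); c++) {
    int8_t expected[24];
    int8_t actual[24];
    // Fill with a sentinel so writes past `length` also show up as a mismatch.
    memset(expected, -1, sizeof(expected));
    memset(actual, -1, sizeof(actual));
    sketch_unpack_reference(bitmap, cases[c].offset, cases[c].length, expected);
    fn(bitmap, cases[c].offset, cases[c].length, actual);
    if (memcmp(expected, actual, sizeof(expected)) != 0) {
      failures++;
    }
  }
  return failures;
}
```

Comparing whole output buffers, rather than individual elements, keeps each case to a single expectation and makes it cheap to add more offset/length pairs.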

paleolimbot · 2023-08-17T02:00:14Z

src/nanoarrow/buffer_inline.h

+static inline void _ArrowBitmapUnpackInt8(const uint8_t* bits, int8_t* out) {
+  const uint8_t word = *bits;


Suggested change

-static inline void _ArrowBitmapUnpackInt8(const uint8_t* bits, int8_t* out) {
-  const uint8_t word = *bits;
+static inline void _ArrowBitmapUnpackInt8(uint8_t word, int8_t* out) {

...may simplify its usage below?

paleolimbot · 2023-08-17T02:06:49Z

src/nanoarrow/buffer_inline.h

+
+  // middle bytes
+  for (int64_t i = bytes_begin + 1; i < bytes_last_valid; i++) {
+    _ArrowBitmapUnpackInt8(&bits[i], out);


Suggested change

-    _ArrowBitmapUnpackInt8(&bits[i], out);
+    _ArrowBitmapUnpackInt8(bits[i], out);

paleolimbot · 2023-08-17T02:21:01Z

src/nanoarrow/buffer_test.cc

+
+  bitmap[2] = 0xfd;
+  ArrowBitsGet(bitmap, 0, sizeof(result), result);
+  EXPECT_EQ(result[16 + 0], 1);


It's perfect! Thanks! Does it make sense to remove this first bit that doesn't use the helper? (I am a little confused as to what it's testing that the tests below it don't cover).

paleolimbot

Thank you!

@WillAyd

As a follow-up to #276. The `int32` version is useful because R uses 32-bit integers to represent boolean (i.e., logical) arrays. This results in a significant speedup in boolean conversion!

@WillAyd: I updated a few things that you *just* added (Sorry! 😬):

- I changed `Bitmap` -> `Bits` and removed `Unsafe` to make it more consistent with the other functions that accept `const uint8_t* bits`
- I updated the test function so that it tests both the int32 and int8 types at once

Before this PR:

``` r
library(nanoarrow)

lgls <- nanoarrow:::vec_gen(logical(), 1e6)
bool_array <- as_nanoarrow_array(lgls)
bool_array_arrow <- arrow::as_arrow_array(bool_array)

bench::mark(
  convert_array(bool_array, logical()),
  as.vector(bool_array_arrow),
  as.logical(lgls)
)
#> # A tibble: 3 × 6
#>   expression                              min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                            <bch> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 convert_array(bool_array, logical())  556µs  749µs    1.33e3    3.82MB     156.
#> 2 as.vector(bool_array_arrow)           558µs  780µs    1.30e3    3.82MB     144.
#> 3 as.logical(lgls)                          0    1ns    2.28e8        0B       0

bench::mark(
  convert_array(bool_array, integer()),
  as.integer(lgls)
)
#> # A tibble: 2 × 6
#>   expression                              min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                            <bch> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 convert_array(bool_array, integer())  733µs  912µs     1093.    3.81MB     167.
#> 2 as.integer(lgls)                      615µs  788µs     1273.    3.81MB     182.
```

After this PR:

``` r
library(nanoarrow)

lgls <- nanoarrow:::vec_gen(logical(), 1e6)
bool_array <- as_nanoarrow_array(lgls)
bool_array_arrow <- arrow::as_arrow_array(bool_array)

bench::mark(
  convert_array(bool_array, logical()),
  as.vector(bool_array_arrow),
  as.logical(lgls)
)
#> # A tibble: 3 × 6
#>   expression                              min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                            <bch> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 convert_array(bool_array, logical())  105µs  308µs    3.21e3    3.83MB     367.
#> 2 as.vector(bool_array_arrow)           559µs  772µs    1.30e3    3.82MB     143.
#> 3 as.logical(lgls)                          0      0    5.87e8        0B       0

bench::mark(
  convert_array(bool_array, integer()),
  as.integer(lgls)
)
#> # A tibble: 2 × 6
#>   expression                              min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                            <bch> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 convert_array(bool_array, integer())  104µs  310µs     3181.    3.81MB     423.
#> 2 as.integer(lgls)                      615µs  784µs     1278.    3.81MB     142.
```

<sup>Created on 2023-08-17 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>
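To illustrate what a 32-bit variant does, here is a deliberately simple sketch in C. It is not the ArrowBitsUnpackInt32() implementation from #278; the function name is hypothetical and it processes one bit at a time, so it shows only the output layout (one `int32_t` per bit, matching how R stores logical values), not the optimized byte-wise approach.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical sketch: expand `length` bits starting at bit `start_offset`
// into one int32_t (0 or 1) per bit.
static inline void sketch_bits_unpack_int32(const uint8_t* bits, int64_t start_offset,
                                            int64_t length, int32_t* out) {
  assert(start_offset >= 0 && length >= 0);
  for (int64_t i = 0; i < length; i++) {
    const int64_t bit = start_offset + i;
    out[i] = (bits[bit / 8] >> (bit % 8)) & 1;
  }
}
```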

WillAyd added 3 commits August 16, 2023 10:26

structure and failing test

06e12c8

passing test

e65194c

comprehensive tests

981b460

WillAyd commented Aug 16, 2023

Remove TODO

c8a5a7c

paleolimbot reviewed Aug 16, 2023

WillAyd added 3 commits August 16, 2023 15:04

refactors

1fb6017

more testing

673dab1

better name

6c71f89

WillAyd changed the title ~~feat: Implement ArrowBitsGet~~ feat: Implement ArrowBitmapUnpackInt8Unsafe Aug 16, 2023

paleolimbot reviewed Aug 17, 2023

feedback

1a8ed33

paleolimbot approved these changes Aug 17, 2023

paleolimbot merged commit e21cc98 into apache:main Aug 17, 2023
27 checks passed

paleolimbot mentioned this pull request Aug 17, 2023

feat: Add ArrowBitsUnpackInt32() #278

Merged

paleolimbot modified the milestones: nanoarrow 0.4.0, nanoarrow 0.3.0 Sep 21, 2023
