-
Notifications
You must be signed in to change notification settings - Fork 38
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Implement ArrowBitmapUnpackInt8Unsafe #276
Conversation
src/nanoarrow/nanoarrow.h
Outdated
@@ -664,6 +664,10 @@ static inline ArrowErrorCode ArrowBufferAppendBufferView(struct ArrowBuffer* buf | |||
/// \brief Extract a boolean value from a bitmap | |||
static inline int8_t ArrowBitGet(const uint8_t* bits, int64_t i); | |||
|
|||
/// \brief Extract boolean values from a range in a bitmap | |||
static inline void ArrowBitsGet(const uint8_t* bits, int64_t start_offset, int64_t length, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is perhaps a little confusing is that start_offset
is number of bits for this function, whereas the set functions refer to it as number of bytes. I think that makes sense given the inputs/outputs, but maybe we should change the identifers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand...I think that offset
and length
are always "number of items" (bits for a uint8_t* bits
; bytes for uint8_t*
, etc) but perhaps there are functions for which this isn't the case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a much simpler way of expressing this than I was putting out there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should definitely update any function whose signature doesn't follow that convention! I can't spot any from a glance but I'm happy to update any inconsistency.
Codecov Report
@@ Coverage Diff @@
## main #276 +/- ##
==========================================
+ Coverage 87.19% 87.22% +0.02%
==========================================
Files 66 66
Lines 10128 10158 +30
==========================================
+ Hits 8831 8860 +29
- Misses 1297 1298 +1
... and 1 file with indirect coverage changes 📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is awesome! I'm excited to follow this up with a 32-bit version for R's logical vectors. A few notes to start the discussion!
src/nanoarrow/buffer_inline.h
Outdated
static inline void ArrowBitsGet(const uint8_t* bits, int64_t start_offset, int64_t length, | ||
int8_t* out) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe ArrowBitUnpackInt8()
? (As a follow-up I'll add the Int32()
version)
src/nanoarrow/buffer_inline.h
Outdated
const uint8_t* bits_cursor = bits; | ||
int64_t n_remaining = length; | ||
int8_t* out_cursor = out; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mind using the same terminology/approach as ArrowBitCountSet()
? ( https://github.com/apache/arrow-nanoarrow/blob/main/src/nanoarrow/buffer_inline.h#L297-L302 ). I recently had to fix a segfault resulting from index math on the first/middle/last byte and while I'm not sure the approach there is better than what you have here, it will probably help fix the next issue with either function to have the code look similar for both functions.
src/nanoarrow/buffer_inline.h
Outdated
for (int i = 0; i < 8; i++) { | ||
out[i] = (*bits >> i) & 1; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know the details well enough to know if it matters, but Arrow C++ writes out all N lines explicitly ( https://github.com/apache/arrow/blob/main/cpp/src/arrow/util/bpacking_default.h#L35-L105 )
src/nanoarrow/buffer_test.cc
Outdated
|
||
bitmap[2] = 0xfd; | ||
ArrowBitsGet(bitmap, 0, sizeof(result), result); | ||
EXPECT_EQ(result[16 + 0], 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would an expectation in the form EXPECT_EQ(std::basic_string<uint8_t>(result, n), std::basic_string<uint8_t>({0, 1, 1, 0, 0, 0}))
be any more readable? It might scale better to include more cases (e.g., offset != 0, all bits within the same byte)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if the newer test design is exactly what you are looking for from a readability perspective, but should definitely give better coverage
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's perfect! Thanks! Does it make sense to remove this first bit that doesn't use the helper? (I am a little confused as to what it's testing that the tests below it don't cover).
src/nanoarrow/buffer_test.cc
Outdated
uint8_t bitmap[10]; | ||
int8_t result[sizeof(bitmap) * 8]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can probably test all the cases with just 3 bytes (which might make the expectations a little less verbose)? I recently fixed a segfault in the first byte/middle byte/last byte logic in ArrowBitCountSet()
and it may be worth testing that here, too (i.e., offset == 0, offset > 0 that does not align on a byte boundary, offset + length ends on a byte boundary, offset + length does not end on a byte boundary, offset + length refer to bits in the same byte).
src/nanoarrow/buffer_inline.h
Outdated
static inline void _ArrowBitmapUnpackInt8(const uint8_t* bits, int8_t* out) { | ||
const uint8_t word = *bits; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
static inline void _ArrowBitmapUnpackInt8(const uint8_t* bits, int8_t* out) { | |
const uint8_t word = *bits; | |
static inline void _ArrowBitmapUnpackInt8(uint8_t word, int8_t* out) { |
...may simplify its usage below?
src/nanoarrow/buffer_inline.h
Outdated
|
||
// middle bytes | ||
for (int64_t i = bytes_begin + 1; i < bytes_last_valid; i++) { | ||
_ArrowBitmapUnpackInt8(&bits[i], out); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_ArrowBitmapUnpackInt8(&bits[i], out); | |
_ArrowBitmapUnpackInt8(bits[i], out); |
src/nanoarrow/buffer_test.cc
Outdated
|
||
bitmap[2] = 0xfd; | ||
ArrowBitsGet(bitmap, 0, sizeof(result), result); | ||
EXPECT_EQ(result[16 + 0], 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's perfect! Thanks! Does it make sense to remove this first bit that doesn't use the helper? (I am a little confused as to what it's testing that the tests below it don't cover).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
As a follow-up to #276. The `int32` version is useful because R uses 32-bit integers to represent boolean (i.e., logical) arrays. This results in a significant speedup in boolean conversion! @WillAyd: I updated a few things that you *just* added (Sorry! 😬 ): - I changed `Bitmap` -> `Bits` and removed `Unsafe` to make it more consistent with the other functions that accept `const uint8_t* bits` - I updated the test function so that it tests both the int32 and int8 types at once Before this PR: ``` r library(nanoarrow) lgls <- nanoarrow:::vec_gen(logical(), 1e6) bool_array <- as_nanoarrow_array(lgls) bool_array_arrow <- arrow::as_arrow_array(bool_array) bench::mark( convert_array(bool_array, logical()), as.vector(bool_array_arrow), as.logical(lgls) ) #> # A tibble: 3 × 6 #> expression min median `itr/sec` mem_alloc `gc/sec` #> <bch:expr> <bch> <bch:> <dbl> <bch:byt> <dbl> #> 1 convert_array(bool_array, logical()) 556µs 749µs 1.33e3 3.82MB 156. #> 2 as.vector(bool_array_arrow) 558µs 780µs 1.30e3 3.82MB 144. #> 3 as.logical(lgls) 0 1ns 2.28e8 0B 0 bench::mark( convert_array(bool_array, integer()), as.integer(lgls) ) #> # A tibble: 2 × 6 #> expression min median `itr/sec` mem_alloc `gc/sec` #> <bch:expr> <bch> <bch:> <dbl> <bch:byt> <dbl> #> 1 convert_array(bool_array, integer()) 733µs 912µs 1093. 3.81MB 167. #> 2 as.integer(lgls) 615µs 788µs 1273. 3.81MB 182. ``` After this PR: ``` r library(nanoarrow) lgls <- nanoarrow:::vec_gen(logical(), 1e6) bool_array <- as_nanoarrow_array(lgls) bool_array_arrow <- arrow::as_arrow_array(bool_array) bench::mark( convert_array(bool_array, logical()), as.vector(bool_array_arrow), as.logical(lgls) ) #> # A tibble: 3 × 6 #> expression min median `itr/sec` mem_alloc `gc/sec` #> <bch:expr> <bch> <bch:> <dbl> <bch:byt> <dbl> #> 1 convert_array(bool_array, logical()) 105µs 308µs 3.21e3 3.83MB 367. #> 2 as.vector(bool_array_arrow) 559µs 772µs 1.30e3 3.82MB 143. #> 3 as.logical(lgls) 0 0 5.87e8 0B 0 bench::mark( convert_array(bool_array, integer()), as.integer(lgls) ) #> # A tibble: 2 × 6 #> expression min median `itr/sec` mem_alloc `gc/sec` #> <bch:expr> <bch> <bch:> <dbl> <bch:byt> <dbl> #> 1 convert_array(bool_array, integer()) 104µs 310µs 3181. 3.81MB 423. #> 2 as.integer(lgls) 615µs 784µs 1278. 3.81MB 142. ``` <sup>Created on 2023-08-17 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>
No description provided.