Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Implement ArrowBitmapUnpackInt8Unsafe #276

Merged
merged 8 commits into from
Aug 17, 2023

Conversation

WillAyd
Copy link
Contributor

@WillAyd WillAyd commented Aug 16, 2023

No description provided.

@@ -664,6 +664,10 @@ static inline ArrowErrorCode ArrowBufferAppendBufferView(struct ArrowBuffer* buf
/// \brief Extract a boolean value from a bitmap
static inline int8_t ArrowBitGet(const uint8_t* bits, int64_t i);

/// \brief Extract boolean values from a range in a bitmap
static inline void ArrowBitsGet(const uint8_t* bits, int64_t start_offset, int64_t length,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is perhaps a little confusing is that start_offset is number of bits for this function, whereas the set functions refer to it as number of bytes. I think that makes sense given the inputs/outputs, but maybe we should change the identifers?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand...I think that offset and length are always "number of items" (bits for a uint8_t* bits; bytes for uint8_t*, etc) but perhaps there are functions for which this isn't the case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a much simpler way of expressing this than I was putting out there

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should definitely update any function whose signature doesn't follow that convention! I can't spot any from a glance but I'm happy to update any inconsistency.

@codecov-commenter
Copy link

codecov-commenter commented Aug 16, 2023

Codecov Report

Merging #276 (1a8ed33) into main (9ce719a) will increase coverage by 0.02%.
Report is 3 commits behind head on main.
The diff coverage is 96.66%.

@@            Coverage Diff             @@
##             main     #276      +/-   ##
==========================================
+ Coverage   87.19%   87.22%   +0.02%     
==========================================
  Files          66       66              
  Lines       10128    10158      +30     
==========================================
+ Hits         8831     8860      +29     
- Misses       1297     1298       +1     
Files Changed Coverage Δ
src/nanoarrow/nanoarrow.h 100.00% <ø> (ø)
src/nanoarrow/buffer_inline.h 98.55% <96.66%> (-0.23%) ⬇️

... and 1 file with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome! I'm excited to follow this up with a 32-bit version for R's logical vectors. A few notes to start the discussion!

Comment on lines 245 to 246
static inline void ArrowBitsGet(const uint8_t* bits, int64_t start_offset, int64_t length,
int8_t* out) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe ArrowBitUnpackInt8()? (As a follow-up I'll add the Int32() version)

Comment on lines 251 to 253
const uint8_t* bits_cursor = bits;
int64_t n_remaining = length;
int8_t* out_cursor = out;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mind using the same terminology/approach as ArrowBitCountSet()? ( https://github.com/apache/arrow-nanoarrow/blob/main/src/nanoarrow/buffer_inline.h#L297-L302 ). I recently had to fix a segfault resulting from index math on the first/middle/last byte and while I'm not sure the approach there is better than what you have here, it will probably help fix the next issue with either function to have the code look similar for both functions.

Comment on lines 226 to 228
for (int i = 0; i < 8; i++) {
out[i] = (*bits >> i) & 1;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know the details well enough to know if it matters, but Arrow C++ writes out all N lines explicitly ( https://github.com/apache/arrow/blob/main/cpp/src/arrow/util/bpacking_default.h#L35-L105 )


bitmap[2] = 0xfd;
ArrowBitsGet(bitmap, 0, sizeof(result), result);
EXPECT_EQ(result[16 + 0], 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would an expectation in the form EXPECT_EQ(std::basic_string<uint8_t>(result, n), std::basic_string<uint8_t>({0, 1, 1, 0, 0, 0})) be any more readable? It might scale better to include more cases (e.g., offset != 0, all bits within the same byte)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if the newer test design is exactly what you are looking for from a readability perspective, but should definitely give better coverage

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's perfect! Thanks! Does it make sense to remove this first bit that doesn't use the helper? (I am a little confused as to what it's testing that the tests below it don't cover).

Comment on lines 275 to 276
uint8_t bitmap[10];
int8_t result[sizeof(bitmap) * 8];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can probably test all the cases with just 3 bytes (which might make the expectations a little less verbose)? I recently fixed a segfault in the first byte/middle byte/last byte logic in ArrowBitCountSet() and it may be worth testing that here, too (i.e., offset == 0, offset > 0 that does not align on a byte boundary, offset + length ends on a byte boundary, offset + length does not end on a byte boundary, offset + length refer to bits in the same byte).

@WillAyd WillAyd changed the title feat: Implement ArrowBitsGet feat: Implement ArrowBitmapUnpackInt8Unsafe Aug 16, 2023
Comment on lines 225 to 226
static inline void _ArrowBitmapUnpackInt8(const uint8_t* bits, int8_t* out) {
const uint8_t word = *bits;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
static inline void _ArrowBitmapUnpackInt8(const uint8_t* bits, int8_t* out) {
const uint8_t word = *bits;
static inline void _ArrowBitmapUnpackInt8(uint8_t word, int8_t* out) {

...may simplify its usage below?


// middle bytes
for (int64_t i = bytes_begin + 1; i < bytes_last_valid; i++) {
_ArrowBitmapUnpackInt8(&bits[i], out);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
_ArrowBitmapUnpackInt8(&bits[i], out);
_ArrowBitmapUnpackInt8(bits[i], out);


bitmap[2] = 0xfd;
ArrowBitsGet(bitmap, 0, sizeof(result), result);
EXPECT_EQ(result[16 + 0], 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's perfect! Thanks! Does it make sense to remove this first bit that doesn't use the helper? (I am a little confused as to what it's testing that the tests below it don't cover).

Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@paleolimbot paleolimbot merged commit e21cc98 into apache:main Aug 17, 2023
27 checks passed
paleolimbot added a commit that referenced this pull request Aug 17, 2023
As a follow-up to #276. The `int32` version is useful because R uses
32-bit integers to represent boolean (i.e., logical) arrays. This
results in a significant speedup in boolean conversion!

@WillAyd: I updated a few things that you *just* added (Sorry! 😬 ):

- I changed `Bitmap` -> `Bits` and removed `Unsafe` to make it more
consistent with the other functions that accept `const uint8_t* bits`
- I updated the test function so that it tests both the int32 and int8
types at once

Before this PR:

``` r
library(nanoarrow)

lgls <- nanoarrow:::vec_gen(logical(), 1e6)
bool_array <- as_nanoarrow_array(lgls)
bool_array_arrow <- arrow::as_arrow_array(bool_array)

bench::mark(
  convert_array(bool_array, logical()),
  as.vector(bool_array_arrow),
  as.logical(lgls)
)
#> # A tibble: 3 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                           <bch> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 convert_array(bool_array, logical()) 556µs  749µs    1.33e3    3.82MB     156.
#> 2 as.vector(bool_array_arrow)          558µs  780µs    1.30e3    3.82MB     144.
#> 3 as.logical(lgls)                         0    1ns    2.28e8        0B       0

bench::mark(
  convert_array(bool_array, integer()),
  as.integer(lgls)
)
#> # A tibble: 2 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                           <bch> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 convert_array(bool_array, integer()) 733µs  912µs     1093.    3.81MB     167.
#> 2 as.integer(lgls)                     615µs  788µs     1273.    3.81MB     182.
```

After this PR:

``` r
library(nanoarrow)

lgls <- nanoarrow:::vec_gen(logical(), 1e6)
bool_array <- as_nanoarrow_array(lgls)
bool_array_arrow <- arrow::as_arrow_array(bool_array)

bench::mark(
  convert_array(bool_array, logical()),
  as.vector(bool_array_arrow),
  as.logical(lgls)
)
#> # A tibble: 3 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                           <bch> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 convert_array(bool_array, logical()) 105µs  308µs    3.21e3    3.83MB     367.
#> 2 as.vector(bool_array_arrow)          559µs  772µs    1.30e3    3.82MB     143.
#> 3 as.logical(lgls)                         0      0    5.87e8        0B       0

bench::mark(
  convert_array(bool_array, integer()),
  as.integer(lgls)
)
#> # A tibble: 2 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                           <bch> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 convert_array(bool_array, integer()) 104µs  310µs     3181.    3.81MB     423.
#> 2 as.integer(lgls)                     615µs  784µs     1278.    3.81MB     142.
```

<sup>Created on 2023-08-17 with [reprex
v2.0.2](https://reprex.tidyverse.org)</sup>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants