perf: Improved Bit (Un)packing Performance #280
Conversation
Codecov Report

```diff
@@            Coverage Diff             @@
##             main     #280       +/-  ##
==========================================
- Coverage   88.23%   87.01%     -1.22%
==========================================
  Files           3       70        +67
  Lines         357    10574     +10217
==========================================
+ Hits          315     9201      +8886
- Misses         42     1373      +1331
```
... and 66 files with indirect coverage changes
Haven't profiled the packing methods yet, but I assume we stand to gain a bit by removing the left shift operators there as well.
Cool observation! For what it's worth, I wasn't able to observe a difference on macOS M1 or x86/AMD Ryzen 5 (although I'm benchmarking via the R package, which unpacks into 32-bit integers and not 8-bit integers).

With nanoarrow 0.2.0:

```r
# install.packages("nanoarrow")
library(nanoarrow)
lgls <- nanoarrow:::vec_gen(logical(), 1e6)
bool_array <- as_nanoarrow_array(lgls)
bool_array_arrow <- arrow::as_arrow_array(bool_array)
bench::mark(
convert_array(bool_array, logical()),
as.vector(bool_array_arrow),
as.logical(lgls)
)
#> # A tibble: 3 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch> <bch:> <dbl> <bch:byt> <dbl>
#> 1 convert_array(bool_array, logical()) 560µs 735µs 1.33e3 3.82MB 154.
#> 2 as.vector(bool_array_arrow) 568µs 782µs 1.28e3 3.82MB 141.
#> 3 as.logical(lgls) 0 0 4.32e8 0B 0
```

With nanoarrow on main:

```r
# pak::pak("apache/arrow-nanoarrow/r")
library(nanoarrow)
lgls <- nanoarrow:::vec_gen(logical(), 1e6)
bool_array <- as_nanoarrow_array(lgls)
bool_array_arrow <- arrow::as_arrow_array(bool_array)
bench::mark(
convert_array(bool_array, logical()),
as.vector(bool_array_arrow),
as.logical(lgls)
)
#> # A tibble: 3 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch> <bch:> <dbl> <bch:byt> <dbl>
#> 1 convert_array(bool_array, logical()) 106µs 321µs 3.07e3 3.83MB 350.
#> 2 as.vector(bool_array_arrow) 558µs 779µs 1.29e3 3.82MB 144.
#> 3 as.logical(lgls) 0 0 3.61e8 0B 0
```

Created on 2023-08-17 with reprex v2.0.2

With this PR:

```r
# pak::pak("apache/arrow-nanoarrow/r#280")
library(nanoarrow)
lgls <- nanoarrow:::vec_gen(logical(), 1e6)
bool_array <- as_nanoarrow_array(lgls)
bool_array_arrow <- arrow::as_arrow_array(bool_array)
bench::mark(
convert_array(bool_array, logical()),
as.vector(bool_array_arrow),
as.logical(lgls)
)
#> # A tibble: 3 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch> <bch:> <dbl> <bch:byt> <dbl>
#> 1 convert_array(bool_array, logical()) 105µs 315µs 3.12e3 3.83MB 355.
#> 2 as.vector(bool_array_arrow) 558µs 777µs 1.30e3 3.82MB 144.
#> 3 as.logical(lgls) 0 1ns 1.44e8 0B 0
```

Created on 2023-08-17 with reprex v2.0.2
Hmm very strange. I also pieced together this benchmark in C:

```c
#include <stdlib.h>
#include <string.h>
#include "nanoarrow.h"
#include "buffer_inline.h"
int main() {
  int N = 1000000;
  int8_t* values = malloc(N);
  memset(values, 0, N);

  struct ArrowBitmap bitmap;
  ArrowBitmapInit(&bitmap);
  ArrowBitmapReserve(&bitmap, N);
  ArrowBitmapAppendInt8Unsafe(&bitmap, values, N);

  for (int i = 0; i < 10000; i++) {
    ArrowBitsUnpackInt8(bitmap.buffer.data, 0, N, values);
  }

  ArrowBitmapReset(&bitmap);  // release the bitmap's heap buffer
  free(values);
}
```

which I think generally showed improvement, but not to the degree or with the consistency of what I was seeing in the Python/Cython library where I originally noticed this. Will see if I can test some more.
If this change helps even a little I'm certainly in favour of merging... it may be that it helps some compiler/compiler-flag combinations do some optimizations. You could also try making
For reference, here is a "shiftless" packing implementation too, though I haven't seen as much of a speedup using this:

```c
static inline void _ArrowBitmapPackInt8(const int8_t* values, uint8_t* out) {
  *out = (values[0]
          | ((values[1] + 0x1) & 0x2)
          | ((values[2] + 0x3) & 0x4)
          | ((values[3] + 0x7) & 0x8)
          | ((values[4] + 0xf) & 0x10)
          | ((values[5] + 0x1f) & 0x20)
          | ((values[6] + 0x3f) & 0x40)
          | ((values[7] + 0x7f) & 0x80));
}
```

I am still testing these downstream, so no rush to merge this - let me see if anything becomes clearer as I work through this in pandas and I will report back.
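As a sanity check on the mask arithmetic above: for a `values[k]` that is 0 or 1, `(values[k] + (2^k - 1)) & 2^k` equals `values[k] << k`, because adding `2^k - 1` to 1 carries into bit `k` and the mask discards everything else. Here is a minimal standalone check; the shift-based reference packer is my own sketch for comparison, not necessarily nanoarrow's actual implementation:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Shift-based reference packer: eight 0/1 bytes -> one byte, LSB first. */
static void PackInt8Shifts(const int8_t* values, uint8_t* out) {
  *out = 0;
  for (int i = 0; i < 8; i++) {
    *out |= (uint8_t)(values[i] << i);
  }
}

/* The "shiftless" packer from the comment above. */
static void PackInt8NoShifts(const int8_t* values, uint8_t* out) {
  *out = (values[0]
          | ((values[1] + 0x1) & 0x2)
          | ((values[2] + 0x3) & 0x4)
          | ((values[3] + 0x7) & 0x8)
          | ((values[4] + 0xf) & 0x10)
          | ((values[5] + 0x1f) & 0x20)
          | ((values[6] + 0x3f) & 0x40)
          | ((values[7] + 0x7f) & 0x80));
}

int main(void) {
  int8_t values[8];
  /* Exhaustively check all 256 combinations of eight 0/1 values. */
  for (int word = 0; word < 256; word++) {
    for (int i = 0; i < 8; i++) values[i] = (int8_t)((word >> i) & 1);
    uint8_t a, b;
    PackInt8Shifts(values, &a);
    PackInt8NoShifts(values, &b);
    assert(a == word && b == word);
  }
  printf("shift and shiftless packers agree on all 256 inputs\n");
  return 0;
}
```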
I set up a repo to compare this from Cython: https://github.com/WillAyd/cython-nanoarrow

The benchmarks for me look something like this:

```python
In [1]: from comparisons import ComparisonManager
...: mgr = ComparisonManager()
...: %timeit mgr.unpack()
...: %timeit mgr.unpack_no_shift()
547 µs ± 21 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
40.5 µs ± 321 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
```

Not sure what makes Python here different than the R / C benchmarks we've tried before...
On the writing side of things my benchmarks show a ~10% improvement from not using shifts. Not as drastic as the read side, but still maybe worthwhile:

```python
In [2]: from comparisons import ComparisonManager
...: mgr = ComparisonManager()
...: %timeit mgr.pack()
...: %timeit mgr.pack_no_shift()
80.1 µs ± 984 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
72.8 µs ± 390 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [3]: from comparisons import ComparisonManager
...: mgr = ComparisonManager()
...: %timeit mgr.pack()
...: %timeit mgr.pack_no_shift()
78.9 µs ± 671 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
71.5 µs ± 195 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [4]: from comparisons import ComparisonManager
...: mgr = ComparisonManager()
...: %timeit mgr.pack()
...: %timeit mgr.pack_no_shift()
80.4 µs ± 324 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
72.2 µs ± 272 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each) |
FWIW I quickly cloned your repo and followed the instructions there, and I see similar results to yours (big difference between `unpack()` and `unpack_no_shift()`; for packing, though, I see 142 µs with shifts going up to 203 µs without).
Hmm strange. So for you the packing without shifts is slower? I see 142 µs going up to 203?
Oh whoops, I read that backward. Yes, apparently without the shifts the packing is much slower for me. |
FWIW I also had to compile slightly differently because I got a bunch of missing symbol errors.
Are you on an Intel or M1 chip? I'm guessing that could explain some of the differences. Interesting to see all this variation...
(I am on Ubuntu / an Intel CPU)
Yes, I'm on M1. If I change the unpacking to a macro, I get 3x faster unpacking (and no difference between shift/no shift):

```c
#define ARROW_BITS_UNPACK1(word, out) \
do { \
out[0] = (word >> 0) & 1; \
out[1] = (word >> 1) & 1; \
out[2] = (word >> 2) & 1; \
out[3] = (word >> 3) & 1; \
out[4] = (word >> 4) & 1; \
out[5] = (word >> 5) & 1; \
out[6] = (word >> 6) & 1; \
out[7] = (word >> 7) & 1; \
} while (0)
```
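For anyone curious how a per-byte kernel like that macro scales up to a whole bitmap, here is a sketch of a driver loop. The `UnpackBits` name and the bit-by-bit tail handling are my own illustration (nanoarrow's actual entry point in this thread is `ArrowBitsUnpackInt8`), but it shows why a faster 8-bit kernel pays off across a large buffer:

```c
#include <stdint.h>

/* Per-byte unpack macro from the comment above. */
#define ARROW_BITS_UNPACK1(word, out) \
  do {                                \
    (out)[0] = ((word) >> 0) & 1;     \
    (out)[1] = ((word) >> 1) & 1;     \
    (out)[2] = ((word) >> 2) & 1;     \
    (out)[3] = ((word) >> 3) & 1;     \
    (out)[4] = ((word) >> 4) & 1;     \
    (out)[5] = ((word) >> 5) & 1;     \
    (out)[6] = ((word) >> 6) & 1;     \
    (out)[7] = ((word) >> 7) & 1;     \
  } while (0)

/* Hypothetical driver: unpack n_values bits (LSB-first, Arrow's bitmap
   order) starting at bit offset 0, eight at a time. */
static void UnpackBits(const uint8_t* bits, int64_t n_values, int8_t* out) {
  int64_t n_full_bytes = n_values / 8;
  for (int64_t i = 0; i < n_full_bytes; i++) {
    ARROW_BITS_UNPACK1(bits[i], out + i * 8);
  }
  /* Trailing bits that don't fill a whole byte. */
  for (int64_t i = n_full_bytes * 8; i < n_values; i++) {
    out[i] = (bits[i / 8] >> (i % 8)) & 1;
  }
}
```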
That is an interesting find! Though since the macro makes a difference, I wonder if you compiled with optimizations enabled.
@paleolimbot does this program generate any difference for you? Regardless of whether it's compiled with clang or gcc, I get a 2x speedup at -O3 without shifts:

```c
#include <stdio.h>
#include <inttypes.h>
#include <time.h>
#define NANOSEC_PER_SEC 1000000000LL
static inline void UnpackInt8Shifts(const uint8_t word, int8_t* out) {
  out[0] = (word >> 0) & 1;
  out[1] = (word >> 1) & 1;
  out[2] = (word >> 2) & 1;
  out[3] = (word >> 3) & 1;
  out[4] = (word >> 4) & 1;
  out[5] = (word >> 5) & 1;
  out[6] = (word >> 6) & 1;
  out[7] = (word >> 7) & 1;
}

static inline void UnpackInt8NoShifts(const uint8_t word, int8_t* out) {
  out[0] = (word & 0x1) != 0;
  out[1] = (word & 0x2) != 0;
  out[2] = (word & 0x4) != 0;
  out[3] = (word & 0x8) != 0;
  out[4] = (word & 0x10) != 0;
  out[5] = (word & 0x20) != 0;
  out[6] = (word & 0x40) != 0;
  out[7] = (word & 0x80) != 0;
}
int main(void) {
  const size_t niters = 100000000;
  const uint8_t word = 0xaa;
  int8_t out[8];
  struct timespec start, end;

  clock_gettime(CLOCK_REALTIME, &start);
  for (size_t i = 0; i < niters; i++) {
    UnpackInt8Shifts(word, out);
  }
  clock_gettime(CLOCK_REALTIME, &end);
  printf("ns duration of UnpackInt8Shifts was: %lld\n",
         (end.tv_sec * NANOSEC_PER_SEC + end.tv_nsec) -
             (start.tv_sec * NANOSEC_PER_SEC + start.tv_nsec));

  clock_gettime(CLOCK_REALTIME, &start);
  for (size_t i = 0; i < niters; i++) {
    UnpackInt8NoShifts(word, out);
  }
  clock_gettime(CLOCK_REALTIME, &end);
  printf("ns duration of UnpackInt8NoShifts was: %lld\n",
         (end.tv_sec * NANOSEC_PER_SEC + end.tv_nsec) -
             (start.tv_sec * NANOSEC_PER_SEC + start.tv_nsec));

  return 0;
}
```

Still worthwhile, though I'm not sure how a 2x speedup in the base benchmark turns into a 15x speedup in the Python benchmark on my platform.
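One caveat about microbenchmarks shaped like the loop above (an editorial note, not something observed in this thread): since `out` is written but never read, an aggressive optimizer is in principle free to hoist or eliminate the unpack calls, which can make the timed loops measure less than intended. Folding each result into a checksum that gets printed is a cheap way to keep the work observable; a sketch of the no-shift loop with that guard:

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

static inline void UnpackInt8NoShifts(const uint8_t word, int8_t* out) {
  out[0] = (word & 0x1) != 0;
  out[1] = (word & 0x2) != 0;
  out[2] = (word & 0x4) != 0;
  out[3] = (word & 0x8) != 0;
  out[4] = (word & 0x10) != 0;
  out[5] = (word & 0x20) != 0;
  out[6] = (word & 0x40) != 0;
  out[7] = (word & 0x80) != 0;
}

int main(void) {
  const size_t niters = 100000000;
  int8_t out[8];
  int64_t sink = 0;
  for (size_t i = 0; i < niters; i++) {
    /* Vary the input and consume the output so the call can't be
       hoisted out of the loop or eliminated as dead code. */
    UnpackInt8NoShifts((uint8_t)i, out);
    sink += out[i % 8];
  }
  /* Printing the checksum forces the loop's results to be observed. */
  printf("checksum: %" PRId64 "\n", sink);
  return 0;
}
```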
With my (apple aarch64)

Either way I'm happy to merge this (perhaps with a rebase to clear up the CI). In the unlikely event I discover a performance regression I'm happy to revisit. I did briefly try my macro approach before 0.3.0 to see if it could be a quick win, but it seemed to decrease performance in the R package for the "unpack 32" case in my first benchmark, so I moved on to other things.
Thank you!
I was very surprised by this, but getting rid of the shifting yielded a huge performance boost for me locally. I was benchmarking some pandas code that took ~500µs to unpack 1 million boolean values; with this simple change that time fell to ~30µs.
Not an expert in assembly, but here is what godbolt produces to set the index of 1 before:

and after:

Assuming the `shr` instruction is inefficient compared to the `test`/`setne` approach taken in the latter.