
perf: Improved Bit (Un)packing Performance #280

Merged: 3 commits merged into apache:main on Oct 6, 2023

Conversation

@WillAyd (Contributor) commented Aug 17, 2023

I was very surprised by this, but getting rid of the shifting yielded a huge performance boost for me locally. I was benchmarking some pandas code that took ~500µs to unpack 1 million boolean values; with this simple change that time fell to ~30µs.

I'm not an expert in assembly, but here is what godbolt produces to set index 1 before the change:

        movzx   eax, BYTE PTR [rbp-1]
        shr     al
        mov     edx, eax
        .loc 1 6 6
        mov     rax, QWORD PTR [rbp-32]
        add     rax, 1
        .loc 1 6 24
        and     edx, 1
        .loc 1 6 10
        mov     BYTE PTR [rax], dl

and after:

        movzx   eax, BYTE PTR [rbp-1]
        and     eax, 2
        .loc 1 6 25
        test    eax, eax
        setne   dl
        .loc 1 6 6
        mov     rax, QWORD PTR [rbp-32]
        add     rax, 1
        .loc 1 6 10
        mov     BYTE PTR [rax], dl

I assume the shr instruction is less efficient than the test / setne approach taken in the latter.
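
For context, a minimal C sketch of the two approaches for a single bit, reconstructed from the assembly above (the function names are mine, not the PR's):

#include <stdint.h>

/* before: shift right, then mask off the low bit (corresponds to the shr sequence above) */
static inline int8_t BitAtIndex1Shift(uint8_t word) { return (word >> 1) & 1; }

/* after: mask the bit in place and test it (corresponds to the test/setne sequence above) */
static inline int8_t BitAtIndex1NoShift(uint8_t word) { return (word & 0x2) != 0; }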

@codecov-commenter commented Aug 17, 2023

Codecov Report

Merging #280 (e4c7ec0) into main (798a1b8) will decrease coverage by 1.22%.
Report is 1 commit behind head on main.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #280      +/-   ##
==========================================
- Coverage   88.23%   87.01%   -1.22%     
==========================================
  Files           3       70      +67     
  Lines         357    10574   +10217     
==========================================
+ Hits          315     9201    +8886     
- Misses         42     1373    +1331     
Files                            Coverage Δ
src/nanoarrow/buffer_inline.h    98.39% <100.00%> (ø)

... and 66 files with indirect coverage changes


@WillAyd (Contributor, Author) commented Aug 17, 2023

I haven't profiled the packing methods yet, but I assume we stand to gain a bit by removing the left-shift operators there as well.
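
For reference, the existing shift-based packing presumably looks something like the sketch below (my own reconstruction under that assumption, not the actual nanoarrow source; the name PackInt8Shifts is hypothetical):

#include <stdint.h>

/* shift each 0/1 value into its bit position and OR them together */
static inline void PackInt8Shifts(const int8_t* values, uint8_t* out) {
  *out = (uint8_t)(values[0] | (values[1] << 1) | (values[2] << 2) | (values[3] << 3) |
                   (values[4] << 4) | (values[5] << 5) | (values[6] << 6) | (values[7] << 7));
}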

@paleolimbot (Member) commented Aug 18, 2023

Cool observation!

For what it's worth, I wasn't able to observe a difference on MacOS M1 or x86/AMD Ryzen 5 (although I'm benchmarking via the R package, which unpacks into 32-bit integers and not 8-bit integers).

With nanoarrow 0.2.0:

# install.packages("nanoarrow")
library(nanoarrow)

lgls <- nanoarrow:::vec_gen(logical(), 1e6)
bool_array <- as_nanoarrow_array(lgls)
bool_array_arrow <- arrow::as_arrow_array(bool_array)

bench::mark(
  convert_array(bool_array, logical()),
  as.vector(bool_array_arrow),
  as.logical(lgls)
)
#> # A tibble: 3 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                           <bch> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 convert_array(bool_array, logical()) 560µs  735µs    1.33e3    3.82MB     154.
#> 2 as.vector(bool_array_arrow)          568µs  782µs    1.28e3    3.82MB     141.
#> 3 as.logical(lgls)                         0      0    4.32e8        0B       0

With nanoarrow on main:

# pak::pak("apache/arrow-nanoarrow/r")
library(nanoarrow)

lgls <- nanoarrow:::vec_gen(logical(), 1e6)
bool_array <- as_nanoarrow_array(lgls)
bool_array_arrow <- arrow::as_arrow_array(bool_array)

bench::mark(
  convert_array(bool_array, logical()),
  as.vector(bool_array_arrow),
  as.logical(lgls)
)
#> # A tibble: 3 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                           <bch> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 convert_array(bool_array, logical()) 106µs  321µs    3.07e3    3.83MB     350.
#> 2 as.vector(bool_array_arrow)          558µs  779µs    1.29e3    3.82MB     144.
#> 3 as.logical(lgls)                         0      0    3.61e8        0B       0

Created on 2023-08-17 with reprex v2.0.2

With this PR:

# pak::pak("apache/arrow-nanoarrow/r#280")
library(nanoarrow)

lgls <- nanoarrow:::vec_gen(logical(), 1e6)
bool_array <- as_nanoarrow_array(lgls)
bool_array_arrow <- arrow::as_arrow_array(bool_array)

bench::mark(
  convert_array(bool_array, logical()),
  as.vector(bool_array_arrow),
  as.logical(lgls)
)
#> # A tibble: 3 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                           <bch> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 convert_array(bool_array, logical()) 105µs  315µs    3.12e3    3.83MB     355.
#> 2 as.vector(bool_array_arrow)          558µs  777µs    1.30e3    3.82MB     144.
#> 3 as.logical(lgls)                         0    1ns    1.44e8        0B       0

Created on 2023-08-17 with reprex v2.0.2

@WillAyd (Contributor, Author) commented Aug 18, 2023

Hmm very strange. I also pieced together this benchmark in C:

#include <stdlib.h>
#include <string.h>  /* for memset */

#include "nanoarrow.h"
#include "buffer_inline.h"

int main() {
  int N = 1000000;
  int8_t* values = malloc(N);
  memset(values, 0, N);
  
  struct ArrowBitmap bitmap;
  ArrowBitmapInit(&bitmap);
  ArrowBitmapReserve(&bitmap, N);
  ArrowBitmapAppendInt8Unsafe(&bitmap, values, N);

  for (int i = 0; i < 10000; i++) {
    ArrowBitsUnpackInt8(bitmap.buffer.data, 0, N, values);
  }
  free(values);
}

which I think generally showed improvement, but not to the degree or with the consistency I was seeing in the Python/Cython library where I originally noticed this. Will see if I can test some more

@paleolimbot (Member):
If this change helps even a little I'm certainly in favour of merging...it may be that it helps some compilers/compiler flag combinations do some optimizations. You could also try making _ArrowBitsUnpackInt8() a macro (then you'd only need one for int8 and int32).
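
A rough sketch of that macro idea (my own illustration with hypothetical names, not code from this PR): write the body once and expand it for each output width.

#include <stdint.h>

/* hypothetical macro shared by the int8 and int32 unpackers */
#define BITS_UNPACK_WORD(word, out)   \
  do {                                \
    (out)[0] = ((word) & 0x1) != 0;   \
    (out)[1] = ((word) & 0x2) != 0;   \
    (out)[2] = ((word) & 0x4) != 0;   \
    (out)[3] = ((word) & 0x8) != 0;   \
    (out)[4] = ((word) & 0x10) != 0;  \
    (out)[5] = ((word) & 0x20) != 0;  \
    (out)[6] = ((word) & 0x40) != 0;  \
    (out)[7] = ((word) & 0x80) != 0;  \
  } while (0)

static inline void UnpackWordInt8(uint8_t word, int8_t* out) { BITS_UNPACK_WORD(word, out); }

static inline void UnpackWordInt32(uint8_t word, int32_t* out) { BITS_UNPACK_WORD(word, out); }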

@WillAyd changed the title from "pef: Improved Bit Unpacking Performance" to "perf: Improved Bit Unpacking Performance" on Aug 18, 2023
@WillAyd (Contributor, Author) commented Aug 18, 2023

For reference, here is a "shiftless" packing implementation too, though I haven't seen as much speedup using this:

static inline void _ArrowBitmapPackInt8(const int8_t* values, uint8_t* out) {
  /* For 0/1 inputs, values[k] + (2^k - 1) is either 2^k - 1 or 2^k, so
     masking with 2^k sets bit k exactly when values[k] is 1. */
  *out = (values[0]
          | ((values[1] + 0x1) & 0x2)
          | ((values[2] + 0x3) & 0x4)
          | ((values[3] + 0x7) & 0x8)
          | ((values[4] + 0xf) & 0x10)
          | ((values[5] + 0x1f) & 0x20)
          | ((values[6] + 0x3f) & 0x40)
          | ((values[7] + 0x7f) & 0x80)
          );
}

I am still testing these downstream, so no rush to merge this. Let me see if anything becomes clearer as I work through this in pandas, and I will report back.

@WillAyd (Contributor, Author) commented Aug 22, 2023

I set up a repo to compare this from Cython:

https://github.com/WillAyd/cython-nanoarrow

The benchmarks for me look something like this:

In [1]: from comparisons import ComparisonManager
   ...: mgr = ComparisonManager()
   ...: %timeit mgr.unpack()
   ...: %timeit mgr.unpack_no_shift()
547 µs ± 21 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
40.5 µs ± 321 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

Not sure what makes Python here different from the R / C benchmarks we've tried before...

@WillAyd (Contributor, Author) commented Aug 22, 2023

On the writing side of things my benchmarks show a ~10% improvement when not using shifts. Not as drastic as on the read side, but still maybe worthwhile.

In [2]: from comparisons import ComparisonManager
   ...: mgr = ComparisonManager()
   ...: %timeit mgr.pack()
   ...: %timeit mgr.pack_no_shift()
80.1 µs ± 984 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
72.8 µs ± 390 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [3]: from comparisons import ComparisonManager
   ...: mgr = ComparisonManager()
   ...: %timeit mgr.pack()
   ...: %timeit mgr.pack_no_shift()
78.9 µs ± 671 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
71.5 µs ± 195 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [4]: from comparisons import ComparisonManager
   ...: mgr = ComparisonManager()
   ...: %timeit mgr.pack()
   ...: %timeit mgr.pack_no_shift()
80.4 µs ± 324 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
72.2 µs ± 272 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

@WillAyd changed the title from "perf: Improved Bit Unpacking Performance" to "perf: Improved Bit (Un)packing Performance" on Aug 22, 2023
@jorisvandenbossche (Member) commented Aug 22, 2023

I set up a repo to compare this from Cython:

FWIW I quickly cloned your repo and followed the instructions there, and I see similar results as you (big difference between unpack and unpack_no_shift, 1.36ms vs 140µs, so also around 10x difference)

@paleolimbot (Member) commented Aug 22, 2023

The difference is more subtle for me on unpacking (but more pronounced for packing)...I'm game!

Can you add a comment above each hard-coded section of shifts and explain that it was done for performance reasons?

[Screenshot: local pack/unpack benchmark timings, 2023-08-22]

@WillAyd (Contributor, Author) commented Aug 22, 2023

Hmm, strange. So for you the packing without shifts is slower? I see 142µs going up to 203µs?

@paleolimbot (Member):
Oh whoops, I read that backward. Yes, apparently without the shifts the packing is much slower for me.

@paleolimbot (Member):
FWIW I also had to compile slightly differently because I got a bunch of missing symbol errors.

gcc -O3 -Wall -Werror -shared -fPIC \
    -I$(python -c "import sysconfig; print(sysconfig.get_paths()['include'])") \
    -L/Library/Frameworks/Python.framework/Versions/3.11/lib -lpython3.11 \
    comparisons.c nanoarrow.c -o comparisons.so

@WillAyd (Contributor, Author) commented Aug 22, 2023

Are you on an Intel or M1 chip? Guessing that could explain some of the differences. Interesting to see all this variation...

@jorisvandenbossche (Member):
(I am on Ubuntu / an Intel CPU)

@paleolimbot (Member):
Yes, I'm on M1.

If I change the unpacking to a macro, I get 3x faster unpacking (and no difference between shift/no shift):

#define ARROW_BITS_UNPACK1(word, out) \
  do { \
  out[0] = (word >> 0) & 1; \
  out[1] = (word >> 1) & 1; \
  out[2] = (word >> 2) & 1; \
  out[3] = (word >> 3) & 1; \
  out[4] = (word >> 4) & 1; \
  out[5] = (word >> 5) & 1; \
  out[6] = (word >> 6) & 1; \
  out[7] = (word >> 7) & 1; \
  } while(0)
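
For what it's worth, a hypothetical way this macro might be driven over a validity buffer (my own sketch, assuming the macro above is in scope; not code from the R package):

#include <stdint.h>

/* unpack n_bytes * 8 booleans from bits into out, one byte at a time */
static inline void UnpackAllInt32(const uint8_t* bits, int64_t n_bytes, int32_t* out) {
  for (int64_t i = 0; i < n_bytes; i++) {
    ARROW_BITS_UNPACK1(bits[i], (out + i * 8));
  }
}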

@WillAyd (Contributor, Author) commented Aug 23, 2023

That is an interesting find! Since the macro makes a difference, I wonder: if you compiled with -Winline, did you see any warnings about the arrow functions not getting inlined? You'd have to sift through some Cython warnings/errors in the project I provided, but that may be part of the reason why a macro helps.

@WillAyd (Contributor, Author) commented Oct 6, 2023

@paleolimbot does this program generate any difference for you? Regardless of whether it is compiled with clang or gcc, I get a 2x speedup at -O3 optimization without shifts:

#include <stdio.h>
#include <inttypes.h>
#include <time.h>

#define NANOSEC_PER_SEC 1000000000LL

static inline void UnpackInt8Shifts(const uint8_t word, int8_t* out) {
  out[0] = (word >> 0) & 1;
  out[1] = (word >> 1) & 1;
  out[2] = (word >> 2) & 1;
  out[3] = (word >> 3) & 1;
  out[4] = (word >> 4) & 1;
  out[5] = (word >> 5) & 1;
  out[6] = (word >> 6) & 1;
  out[7] = (word >> 7) & 1;  
}

static inline void UnpackInt8NoShifts(const uint8_t word, int8_t* out) {
  out[0] = (word & 0x1) != 0;
  out[1] = (word & 0x2) != 0;
  out[2] = (word & 0x4) != 0;
  out[3] = (word & 0x8) != 0;
  out[4] = (word & 0x10) != 0;
  out[5] = (word & 0x20) != 0;
  out[6] = (word & 0x40) != 0;
  out[7] = (word & 0x80) != 0;
}


int main(void) {
  const size_t niters = 100000000;
  const uint8_t word = 0xaa;
  int8_t out[8];
  struct timespec start, end;
  
  clock_gettime(CLOCK_REALTIME, &start);
  for (size_t i = 0; i < niters; i++) {
    UnpackInt8Shifts(word, out);
  }
  clock_gettime(CLOCK_REALTIME, &end);
  printf("ns duration of UnpackInt8Shifts was: %lld\n", (end.tv_sec * NANOSEC_PER_SEC + end.tv_nsec) - (start.tv_sec * NANOSEC_PER_SEC + start.tv_nsec));

  clock_gettime(CLOCK_REALTIME, &start);
  for (size_t i = 0; i < niters; i++) {
    UnpackInt8NoShifts(word, out);
  }
  clock_gettime(CLOCK_REALTIME, &end);
  printf("ns duration of UnpackInt8NoShifts was: %lld\n", (end.tv_sec * NANOSEC_PER_SEC + end.tv_nsec) - (start.tv_sec * NANOSEC_PER_SEC + start.tv_nsec));

  return 0;
}

Still worthwhile, though I'm not yet sure how the 2x speedup in the base benchmark results in a 15x speedup in the Python benchmark on my platform.

@paleolimbot (Member):
does this program generate any difference for you?

With my (apple aarch64) clang, the compiler correctly deduces that it can do all the work at compile time and gives me a time of zero for both! (Except at -O0, where it gives me slightly better timing for the 'no shifts' option).

Either way I'm happy to merge this (perhaps with a rebase to clear up the CI). In the unlikely event I discover any performance regression I'm happy to revisit.

I did briefly try my macro approach before 0.3.0 to see if that could be a quick win, but it seemed to decrease performance in the R package for the "unpack 32" case in my first benchmark, so I moved on to other things.
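
For future benchmarking, one way to keep the optimizer from folding the whole loop away (my own suggestion, not something tried in this thread) is to make the input opaque and consume the output:

#include <stddef.h>
#include <stdint.h>

/* assumes UnpackInt8NoShifts from the program above is in scope */
static int RunUnpackLoop(size_t niters) {
  volatile uint8_t word = 0xaa;  /* volatile read defeats constant propagation */
  int8_t out[8];
  int sink = 0;
  for (size_t i = 0; i < niters; i++) {
    UnpackInt8NoShifts(word, out);
    sink += out[i % 8];  /* consume the result so the stores stay live */
  }
  return sink;  /* print this so the compiler cannot discard the loop */
}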

@paleolimbot (Member) left a review comment:
Thank you!

@paleolimbot merged commit 6006ca2 into apache:main on Oct 6, 2023
27 checks passed
@WillAyd deleted the unpack-int8-perf branch on April 17, 2024