
perf: Improved Bit (Un)packing Performance #280

Merged: 3 commits merged into apache:main on Oct 6, 2023

Conversation

@WillAyd (Contributor) commented Aug 17, 2023

I was very surprised by this, but getting rid of the shifting yielded a huge performance boost for me locally. I was benchmarking some pandas code that took ~500µs to unpack 1 million boolean values; with this simple change that time fell to ~30µs.

I'm not an expert in assembly, but here is what godbolt produces to set index 1 before the change:

        movzx   eax, BYTE PTR [rbp-1]
        shr     al
        mov     edx, eax
        .loc 1 6 6
        mov     rax, QWORD PTR [rbp-32]
        add     rax, 1
        .loc 1 6 24
        and     edx, 1
        .loc 1 6 10
        mov     BYTE PTR [rax], dl

and after:

        movzx   eax, BYTE PTR [rbp-1]
        and     eax, 2
        .loc 1 6 25
        test    eax, eax
        setne   dl
        .loc 1 6 6
        mov     rax, QWORD PTR [rbp-32]
        add     rax, 1
        .loc 1 6 10
        mov     BYTE PTR [rax], dl

I assume the shr instruction is less efficient than the test / setne approach taken in the latter.
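
For context, a minimal C sketch of the two approaches for a single bit, reconstructed from the assembly above (the function names are mine, not the PR's):

#include <stdint.h>

/* before: shift right, then mask off the low bit (corresponds to the shr sequence above) */
static inline int8_t BitAtIndex1Shift(uint8_t word) { return (word >> 1) & 1; }

/* after: mask the bit in place and test it (corresponds to the test/setne sequence above) */
static inline int8_t BitAtIndex1NoShift(uint8_t word) { return (word & 0x2) != 0; }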

@codecov-commenter commented Aug 17, 2023

Codecov Report

Merging #280 (e4c7ec0) into main (798a1b8) will decrease coverage by 1.22%.
Report is 1 commit behind head on main.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #280      +/-   ##
==========================================
- Coverage   88.23%   87.01%   -1.22%     
==========================================
  Files           3       70      +67     
  Lines         357    10574   +10217     
==========================================
+ Hits          315     9201    +8886     
- Misses         42     1373    +1331     
Files                            Coverage Δ
src/nanoarrow/buffer_inline.h    98.39% <100.00%> (ø)

... and 66 files with indirect coverage changes


@WillAyd (Contributor, Author) commented Aug 17, 2023

I haven't profiled the packing methods yet, but I assume we stand to gain a bit by removing the left-shift operators there as well.
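
For reference, the existing shift-based packing presumably looks something like the sketch below (my own reconstruction under that assumption, not the actual nanoarrow source; the name PackInt8Shifts is hypothetical):

#include <stdint.h>

/* shift each 0/1 value into its bit position and OR them together */
static inline void PackInt8Shifts(const int8_t* values, uint8_t* out) {
  *out = (uint8_t)(values[0] | (values[1] << 1) | (values[2] << 2) | (values[3] << 3) |
                   (values[4] << 4) | (values[5] << 5) | (values[6] << 6) | (values[7] << 7));
}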

@paleolimbot (Member) commented Aug 18, 2023

Cool observation!

For what it's worth, I wasn't able to observe a difference on MacOS M1 or x86/AMD Ryzen 5 (although I'm benchmarking via the R package, which unpacks into 32-bit integers and not 8-bit integers).

With nanoarrow 0.2.0:

# install.packages("nanoarrow")
library(nanoarrow)

lgls <- nanoarrow:::vec_gen(logical(), 1e6)
bool_array <- as_nanoarrow_array(lgls)
bool_array_arrow <- arrow::as_arrow_array(bool_array)

bench::mark(
  convert_array(bool_array, logical()),
  as.vector(bool_array_arrow),
  as.logical(lgls)
)
#> # A tibble: 3 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                           <bch> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 convert_array(bool_array, logical()) 560µs  735µs    1.33e3    3.82MB     154.
#> 2 as.vector(bool_array_arrow)          568µs  782µs    1.28e3    3.82MB     141.
#> 3 as.logical(lgls)                         0      0    4.32e8        0B       0

With nanoarrow on main:

# pak::pak("apache/arrow-nanoarrow/r")
library(nanoarrow)

lgls <- nanoarrow:::vec_gen(logical(), 1e6)
bool_array <- as_nanoarrow_array(lgls)
bool_array_arrow <- arrow::as_arrow_array(bool_array)

bench::mark(
  convert_array(bool_array, logical()),
  as.vector(bool_array_arrow),
  as.logical(lgls)
)
#> # A tibble: 3 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                           <bch> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 convert_array(bool_array, logical()) 106µs  321µs    3.07e3    3.83MB     350.
#> 2 as.vector(bool_array_arrow)          558µs  779µs    1.29e3    3.82MB     144.
#> 3 as.logical(lgls)                         0      0    3.61e8        0B       0

Created on 2023-08-17 with reprex v2.0.2

With this PR:

# pak::pak("apache/arrow-nanoarrow/r#280")
library(nanoarrow)

lgls <- nanoarrow:::vec_gen(logical(), 1e6)
bool_array <- as_nanoarrow_array(lgls)
bool_array_arrow <- arrow::as_arrow_array(bool_array)

bench::mark(
  convert_array(bool_array, logical()),
  as.vector(bool_array_arrow),
  as.logical(lgls)
)
#> # A tibble: 3 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                           <bch> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 convert_array(bool_array, logical()) 105µs  315µs    3.12e3    3.83MB     355.
#> 2 as.vector(bool_array_arrow)          558µs  777µs    1.30e3    3.82MB     144.
#> 3 as.logical(lgls)                         0    1ns    1.44e8        0B       0

Created on 2023-08-17 with reprex v2.0.2

@WillAyd (Contributor, Author) commented Aug 18, 2023

Hmm very strange. I also pieced together this benchmark in C:

#include <stdlib.h>
#include <string.h>  /* for memset */

#include "nanoarrow.h"
#include "buffer_inline.h"

int main() {
  int N = 1000000;
  int8_t* values = malloc(N);
  memset(values, 0, N);
  
  struct ArrowBitmap bitmap;
  ArrowBitmapInit(&bitmap);
  ArrowBitmapReserve(&bitmap, N);
  ArrowBitmapAppendInt8Unsafe(&bitmap, values, N);

  for (int i = 0; i < 10000; i++) {
    ArrowBitsUnpackInt8(bitmap.buffer.data, 0, N, values);
  }
  free(values);
}

which I think generally showed improvement, but not to the degree or with the consistency I was seeing in the Python/Cython library where I originally noticed this. Will see if I can test some more

@paleolimbot (Member):
If this change helps even a little I'm certainly in favour of merging...it may be that it helps some compilers/compiler flag combinations do some optimizations. You could also try making _ArrowBitsUnpackInt8() a macro (then you'd only need one for int8 and int32).
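
A rough sketch of that macro idea (my own illustration with hypothetical names, not code from this PR): write the body once and expand it for each output width.

#include <stdint.h>

/* hypothetical macro shared by the int8 and int32 unpackers */
#define BITS_UNPACK_WORD(word, out)   \
  do {                                \
    (out)[0] = ((word) & 0x1) != 0;   \
    (out)[1] = ((word) & 0x2) != 0;   \
    (out)[2] = ((word) & 0x4) != 0;   \
    (out)[3] = ((word) & 0x8) != 0;   \
    (out)[4] = ((word) & 0x10) != 0;  \
    (out)[5] = ((word) & 0x20) != 0;  \
    (out)[6] = ((word) & 0x40) != 0;  \
    (out)[7] = ((word) & 0x80) != 0;  \
  } while (0)

static inline void UnpackWordInt8(uint8_t word, int8_t* out) { BITS_UNPACK_WORD(word, out); }

static inline void UnpackWordInt32(uint8_t word, int32_t* out) { BITS_UNPACK_WORD(word, out); }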

@WillAyd changed the title from "pef: Improved Bit Unpacking Performance" to "perf: Improved Bit Unpacking Performance" on Aug 18, 2023
@WillAyd (Contributor, Author) commented Aug 18, 2023

For reference, here is a "shiftless" packing implementation too, though I haven't seen as much speedup using this:

static inline void _ArrowBitmapPackInt8(const int8_t* values, uint8_t* out) {
  /* For 0/1 inputs, values[k] + (2^k - 1) is either 2^k - 1 or 2^k, so
     masking with 2^k sets bit k exactly when values[k] is 1. */
  *out = (values[0]
          | ((values[1] + 0x1) & 0x2)
          | ((values[2] + 0x3) & 0x4)
          | ((values[3] + 0x7) & 0x8)
          | ((values[4] + 0xf) & 0x10)
          | ((values[5] + 0x1f) & 0x20)
          | ((values[6] + 0x3f) & 0x40)
          | ((values[7] + 0x7f) & 0x80)
          );
}

I am still testing these downstream, so no rush to merge this. Let me see if anything becomes clearer as I work through this in pandas, and I will report back.

@WillAyd (Contributor, Author) commented Aug 22, 2023

I set up a repo to compare this from Cython:

https://github.com/WillAyd/cython-nanoarrow

The benchmarks for me look something like this:

In [1]: from comparisons import ComparisonManager
   ...: mgr = ComparisonManager()
   ...: %timeit mgr.unpack()
   ...: %timeit mgr.unpack_no_shift()
547 µs ± 21 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
40.5 µs ± 321 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

Not sure what makes Python here different from the R / C benchmarks we've tried before...

@WillAyd (Contributor, Author) commented Aug 22, 2023

On the writing side of things my benchmarks show a ~10% improvement when not using shifts. Not as drastic as on the read side, but still maybe worthwhile.

In [2]: from comparisons import ComparisonManager
   ...: mgr = ComparisonManager()
   ...: %timeit mgr.pack()
   ...: %timeit mgr.pack_no_shift()
80.1 µs ± 984 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
72.8 µs ± 390 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [3]: from comparisons import ComparisonManager
   ...: mgr = ComparisonManager()
   ...: %timeit mgr.pack()
   ...: %timeit mgr.pack_no_shift()
78.9 µs ± 671 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
71.5 µs ± 195 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [4]: from comparisons import ComparisonManager
   ...: mgr = ComparisonManager()
   ...: %timeit mgr.pack()
   ...: %timeit mgr.pack_no_shift()
80.4 µs ± 324 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
72.2 µs ± 272 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

@WillAyd changed the title from "perf: Improved Bit Unpacking Performance" to "perf: Improved Bit (Un)packing Performance" on Aug 22, 2023
@jorisvandenbossche (Member) commented Aug 22, 2023

I set up a repo to compare this from Cython:

FWIW I quickly cloned your repo and followed the instructions there, and I see similar results as you (big difference between unpack and unpack_no_shift, 1.36ms vs 140µs, so also around 10x difference)

@paleolimbot (Member) commented Aug 22, 2023

The difference is more subtle for me on unpacking (but more pronounced for packing)...I'm game!

Can you add a comment above each hard-coded section of shifts and explain that it was done for performance reasons?

[Screenshot: local pack/unpack benchmark timings, 2023-08-22]

@WillAyd (Contributor, Author) commented Aug 22, 2023

Hmm, strange. So for you the packing without shifts is slower? I see 142µs going up to 203µs?

@paleolimbot (Member):
Oh whoops, I read that backward. Yes, apparently without the shifts the packing is much slower for me.

@paleolimbot (Member):
FWIW I also had to compile slightly differently because I got a bunch of missing symbol errors.

gcc -O3 -Wall -Werror -shared -fPIC \
    -I$(python -c "import sysconfig; print(sysconfig.get_paths()['include'])") \
    -L/Library/Frameworks/Python.framework/Versions/3.11/lib -lpython3.11 \
    comparisons.c nanoarrow.c -o comparisons.so

@WillAyd (Contributor, Author) commented Aug 22, 2023

Are you on an Intel or M1 chip? Guessing that could explain some of the differences. Interesting to see all this variation...

@jorisvandenbossche (Member):
(I am on Ubuntu / an Intel CPU)

@paleolimbot (Member):
Yes, I'm on M1.

If I change the unpacking to a macro, I get 3x faster unpacking (and no difference between shift/no shift):

#define ARROW_BITS_UNPACK1(word, out) \
  do { \
  out[0] = (word >> 0) & 1; \
  out[1] = (word >> 1) & 1; \
  out[2] = (word >> 2) & 1; \
  out[3] = (word >> 3) & 1; \
  out[4] = (word >> 4) & 1; \
  out[5] = (word >> 5) & 1; \
  out[6] = (word >> 6) & 1; \
  out[7] = (word >> 7) & 1; \
  } while(0)
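
For what it's worth, a hypothetical way this macro might be driven over a validity buffer (my own sketch, assuming the macro above is in scope; not code from the R package):

#include <stdint.h>

/* unpack n_bytes * 8 booleans from bits into out, one byte at a time */
static inline void UnpackAllInt32(const uint8_t* bits, int64_t n_bytes, int32_t* out) {
  for (int64_t i = 0; i < n_bytes; i++) {
    ARROW_BITS_UNPACK1(bits[i], (out + i * 8));
  }
}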

@WillAyd (Contributor, Author) commented Aug 23, 2023

That is an interesting find! Since the macro makes a difference, I wonder: if you compiled with -Winline, did you see any warnings about the arrow functions not getting inlined? You'd have to sift through some Cython warnings/errors in the project I provided, but that may be part of the reason why a macro helps.

@WillAyd (Contributor, Author) commented Oct 6, 2023

@paleolimbot does this program generate any difference for you? Regardless of whether it is compiled with clang or gcc, I get a 2x speedup at -O3 optimization without shifts:

#include <stdio.h>
#include <inttypes.h>
#include <time.h>

#define NANOSEC_PER_SEC 1000000000LL

static inline void UnpackInt8Shifts(const uint8_t word, int8_t* out) {
  out[0] = (word >> 0) & 1;
  out[1] = (word >> 1) & 1;
  out[2] = (word >> 2) & 1;
  out[3] = (word >> 3) & 1;
  out[4] = (word >> 4) & 1;
  out[5] = (word >> 5) & 1;
  out[6] = (word >> 6) & 1;
  out[7] = (word >> 7) & 1;  
}

static inline void UnpackInt8NoShifts(const uint8_t word, int8_t* out) {
  out[0] = (word & 0x1) != 0;
  out[1] = (word & 0x2) != 0;
  out[2] = (word & 0x4) != 0;
  out[3] = (word & 0x8) != 0;
  out[4] = (word & 0x10) != 0;
  out[5] = (word & 0x20) != 0;
  out[6] = (word & 0x40) != 0;
  out[7] = (word & 0x80) != 0;
}


int main(void) {
  const size_t niters = 100000000;
  const uint8_t word = 0xaa;
  int8_t out[8];
  struct timespec start, end;
  
  clock_gettime(CLOCK_REALTIME, &start);
  for (size_t i = 0; i < niters; i++) {
    UnpackInt8Shifts(word, out);
  }
  clock_gettime(CLOCK_REALTIME, &end);
  printf("ns duration of UnpackInt8Shifts was: %lld\n", (end.tv_sec * NANOSEC_PER_SEC + end.tv_nsec) - (start.tv_sec * NANOSEC_PER_SEC + start.tv_nsec));

  clock_gettime(CLOCK_REALTIME, &start);
  for (size_t i = 0; i < niters; i++) {
    UnpackInt8NoShifts(word, out);
  }
  clock_gettime(CLOCK_REALTIME, &end);
  printf("ns duration of UnpackInt8NoShifts was: %lld\n", (end.tv_sec * NANOSEC_PER_SEC + end.tv_nsec) - (start.tv_sec * NANOSEC_PER_SEC + start.tv_nsec));

  return 0;
}

Still worthwhile, though I'm not yet sure how the 2x speedup in the base benchmark results in a 15x speedup in the Python benchmark on my platform.

@paleolimbot (Member):
does this program generate any difference for you?

With my (apple aarch64) clang, the compiler correctly deduces that it can do all the work at compile time and gives me a time of zero for both! (Except at -O0, where it gives me slightly better timing for the 'no shifts' option).

Either way I'm happy to merge this (perhaps with a rebase to clear up the CI). In the unlikely event I discover any performance regression I'm happy to revisit.

I did briefly try my macro approach before 0.3.0 to see if that could be a quick win, but it seemed to decrease performance in the R package for the "unpack 32" case in my first benchmark, so I moved on to other things.
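
For future benchmarking, one way to keep the optimizer from folding the whole loop away (my own suggestion, not something tried in this thread) is to make the input opaque and consume the output:

#include <stddef.h>
#include <stdint.h>

/* assumes UnpackInt8NoShifts from the program above is in scope */
static int RunUnpackLoop(size_t niters) {
  volatile uint8_t word = 0xaa;  /* volatile read defeats constant propagation */
  int8_t out[8];
  int sink = 0;
  for (size_t i = 0; i < niters; i++) {
    UnpackInt8NoShifts(word, out);
    sink += out[i % 8];  /* consume the result so the stores stay live */
  }
  return sink;  /* print this so the compiler cannot discard the loop */
}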

@paleolimbot (Member) left a review comment:
Thank you!

@paleolimbot merged commit 6006ca2 into apache:main on Oct 6, 2023
27 checks passed
@WillAyd deleted the unpack-int8-perf branch on April 17, 2024