perf: Better bit packing-unpacking algorithms #326

Closed
WillAyd wants to merge 1 commit into main from better-pack-unpack

Conversation

@WillAyd (Contributor) commented Nov 28, 2023

In #280 the algorithms seemed to make a huge difference in performance on my Intel x86 chip, but other platforms didn't see as much benefit. I asked on SO why that is, and while I haven't seen anything definitive yet, one of the commenters pointed to higher-performance algorithms that should work more generally, in particular the multiply-based approach used here.

When I test this in Cython, following the steps in https://github.com/WillAyd/cython-nanoarrow, I get the following numbers on my system:

In [1]: from comparisons import ComparisonManager
   ...: mgr = ComparisonManager()

In [2]: %timeit mgr.unpack()
   ...: %timeit mgr.unpack_no_shift()
   ...: %timeit mgr.unpack_multiply()
300 µs ± 13 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
35 µs ± 1.07 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
61.4 µs ± 615 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [3]: %timeit mgr.pack()
   ...: %timeit mgr.pack_no_shift()
   ...: %timeit mgr.pack_multiply()
80.2 µs ± 162 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
71.1 µs ± 123 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
65.5 µs ± 121 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

So the multiply technique is actually a good deal slower for unpacking on my x86 machine, but it might provide a slight performance boost when packing. I'd be curious to know whether it makes a difference for anyone on non-x86 architectures.
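
For reference, the packing direction of the magic-multiply trick looks roughly like this (a standalone sketch with made-up names; it assumes a little-endian host and that each input byte is exactly 0 or 1, and the exact bit ordering nanoarrow needs may differ):

#include <stdint.h>
#include <string.h>

// Rough sketch of the multiply-based packing counterpart, for illustration only.
static inline uint8_t PackInt8Multiply(const uint8_t* values) {
  uint64_t word;
  memcpy(&word, values, sizeof(word));  // load the 8 one-byte flags
  // The multiply gathers one flag into each bit of the product's top byte.
  // Note: values[0] lands in the most significant bit; Arrow bitmaps are
  // LSB-first, so real code would still need to fix up the ordering.
  return (uint8_t)((word * 0x8040201008040201ULL) >> 56);
}
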

@codecov-commenter

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (798a1b8) 88.23% compared to head (847fbf7) 89.10%.
Report is 13 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #326      +/-   ##
==========================================
+ Coverage   88.23%   89.10%   +0.87%     
==========================================
  Files           3        4       +1     
  Lines         357      101     -256     
==========================================
- Hits          315       90     -225     
+ Misses         42       11      -31     


// see https://stackoverflow.com/a/51750902/621736
const uint64_t magic = 0x8040201008040201ULL;
const uint64_t mask = 0x8080808080808080ULL;
const uint64_t tmp = htobe64((magic * word) & mask) >> 7;
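// each byte of tmp now holds a single unpacked bit of `word` (0 or 1)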
@WillAyd (Contributor, Author) commented on the diff:

I don't expect this htobe64 to be part of the final design. The SO post actually suggests using a byte-reversed magic value to get what we need, but that did not work in the special case of every bit in the word being set (presumably due to wrap-around). I left a comment on the SO post about that - hopefully someone out there has a smarter way of handling this.
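
For reference, the scalar trick behind that line looks roughly like this as a standalone sketch (not the PR code itself; the function name is made up, and the byte swap assumes a little-endian host where htobe64 is the glibc/BSD byte-swap helper):

#include <stdint.h>
#include <string.h>
#include <endian.h>  // htobe64 (glibc); <sys/endian.h> on the BSDs

// Sketch of the magic-multiply unpack from https://stackoverflow.com/a/51750902/621736:
// expand the 8 bits of `word` into 8 bytes, each holding 0 or 1.
static inline void UnpackByteMultiply(uint8_t word, uint8_t* out) {
  const uint64_t magic = 0x8040201008040201ULL;
  const uint64_t mask = 0x8080808080808080ULL;
  // After the multiply and mask, the top bit of byte i of the product is bit
  // (7 - i) of `word`; the byte swap reverses that so byte i corresponds to
  // bit i, and the shift drops each surviving bit into the low bit of its byte.
  const uint64_t tmp = htobe64((magic * (uint64_t)word) & mask) >> 7;
  memcpy(out, &tmp, sizeof(tmp));  // out[i] == bit i of word, 0 or 1
}
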

@paleolimbot (Member) commented

Sorry for missing the initial notification... thanks for taking a look at this! The 0.4 release (probably early January) is targeted toward reliability and testing; 0.5 will have more of a performance focus (i.e., we'd like to add benchmarks to track this kind of thing). I probably won't get to wiring up a benchmarking system until the new year, though!

@WillAyd (Contributor, Author) commented Dec 14, 2023

No problem, and thanks for that context. This is not urgent at all; just leaving it here for visibility.

@WillAyd (Contributor, Author) commented Jan 19, 2024

Closing for now; we can always reopen.

@WillAyd closed this on Jan 19, 2024
@mapleFU (Member) commented Apr 17, 2024

apache/arrow#40845

I'm investigating improving bit unpacking in Arrow; do you have any advice here?

@WillAyd (Contributor, Author) commented Apr 17, 2024

Hey @mapleFU - that's great. I didn't read through everything you posted in that issue, but the research is impressive and certainly beyond what I was able to accomplish here.

If it helps, I noticed in #280 that there was a significant performance difference on x86 if you could avoid shifts when packing/unpacking bits. For example, code like:

static inline void PackInt8Shifts(const int8_t* values, volatile uint8_t* out) {
  *out = (values[0] | values[1] << 1 | values[2] << 2 | values[3] << 3 | values[4] << 4 |
          values[5] << 5 | values[6] << 6 | values[7] << 7);
}

was more than 10x slower when used in a larger Python process than the more verbose:

static inline void PackInt8NoShifts(const int8_t* values, volatile uint8_t* out) {
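  // Assumes each values[k] is 0 or 1: ((v + (2^k - 1)) & 2^k) == (v << k) for
  // v in {0, 1}, so this builds the same packed byte with adds and constant
  // masks instead of shift instructions.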
  *out = (values[0] | ((values[1] + 0x1) & 0x2) | ((values[2] + 0x3) & 0x4) |
          ((values[3] + 0x7) & 0x8) | ((values[4] + 0xf) & 0x10) |
          ((values[5] + 0x1f) & 0x20) | ((values[6] + 0x3f) & 0x40) |
          ((values[7] + 0x7f) & 0x80));
}

Unfortunately, that performance boost seemed to only work for unpacking, and only on x86. Joris and Dewey were not able to replicate the speedup on other architectures, though it was more or less a moot point for them.

I don't feel like the SO post I created was ever really answered, but you may find some value in the comments provided there, particularly those by user Peter Cordes:

https://stackoverflow.com/questions/77550709/x86-performance-difference-between-shift-and-add-when-packing-bits

Hope that helps

@WillAyd deleted the better-pack-unpack branch on April 17, 2024, 16:13