perf: Better bit packing-unpacking algorithms #326

Closed
WillAyd wants to merge 1 commit into main from better-pack-unpack

Conversation

@WillAyd (Contributor) commented Nov 28, 2023

In #280 the algorithms seemed to make a huge difference in performance on my Intel x86 chip, but other platforms didn't see as much benefit. I asked on SO why that is, and while I haven't seen anything definitive yet, one of the commenters pointed to higher-performance algorithms that should work more generally, in particular the multiply-based approach used here.

When I test this in Cython, following the steps in https://github.com/WillAyd/cython-nanoarrow, I get the following numbers on my system:

In [1]: from comparisons import ComparisonManager
   ...: mgr = ComparisonManager()

In [2]: %timeit mgr.unpack()
   ...: %timeit mgr.unpack_no_shift()
   ...: %timeit mgr.unpack_multiply()
300 µs ± 13 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
35 µs ± 1.07 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
61.4 µs ± 615 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [3]: %timeit mgr.pack()
   ...: %timeit mgr.pack_no_shift()
   ...: %timeit mgr.pack_multiply()
80.2 µs ± 162 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
71.1 µs ± 123 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
65.5 µs ± 121 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

So the multiply technique is actually a good deal slower for unpacking on my x86 machine, but it might provide a slight performance boost when packing. I'd be curious to know whether it makes a difference for anyone on non-x86 architectures.
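
For reference, the packing direction of the magic-multiply trick looks roughly like this (a standalone sketch with made-up names; it assumes a little-endian host and that each input byte is exactly 0 or 1, and the exact bit ordering nanoarrow needs may differ):

#include <stdint.h>
#include <string.h>

// Rough sketch of the multiply-based packing counterpart, for illustration only.
static inline uint8_t PackInt8Multiply(const uint8_t* values) {
  uint64_t word;
  memcpy(&word, values, sizeof(word));  // load the 8 one-byte flags
  // The multiply gathers one flag into each bit of the product's top byte.
  // Note: values[0] lands in the most significant bit; Arrow bitmaps are
  // LSB-first, so real code would still need to fix up the ordering.
  return (uint8_t)((word * 0x8040201008040201ULL) >> 56);
}
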

@codecov-commenter

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (798a1b8) 88.23% compared to head (847fbf7) 89.10%.
Report is 13 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #326      +/-   ##
==========================================
+ Coverage   88.23%   89.10%   +0.87%     
==========================================
  Files           3        4       +1     
  Lines         357      101     -256     
==========================================
- Hits          315       90     -225     
+ Misses         42       11      -31     


// see https://stackoverflow.com/a/51750902/621736
const uint64_t magic = 0x8040201008040201ULL;
const uint64_t mask = 0x8080808080808080ULL;
const uint64_t tmp = htobe64((magic * word) & mask) >> 7;
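// each byte of tmp now holds a single unpacked bit of `word` (0 or 1)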
@WillAyd (Contributor, Author) commented on the diff:

I don't expect this htobe64 to be part of the final design. The SO post actually suggests using a byte-reversed magic value to get what we need, but that did not work in the special case of every bit in the word being set (presumably due to wrap-around). I left a comment on the SO post about that - hopefully someone out there has a smarter way of handling this.
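
For reference, the scalar trick behind that line looks roughly like this as a standalone sketch (not the PR code itself; the function name is made up, and the byte swap assumes a little-endian host where htobe64 is the glibc/BSD byte-swap helper):

#include <stdint.h>
#include <string.h>
#include <endian.h>  // htobe64 (glibc); <sys/endian.h> on the BSDs

// Sketch of the magic-multiply unpack from https://stackoverflow.com/a/51750902/621736:
// expand the 8 bits of `word` into 8 bytes, each holding 0 or 1.
static inline void UnpackByteMultiply(uint8_t word, uint8_t* out) {
  const uint64_t magic = 0x8040201008040201ULL;
  const uint64_t mask = 0x8080808080808080ULL;
  // After the multiply and mask, the top bit of byte i of the product is bit
  // (7 - i) of `word`; the byte swap reverses that so byte i corresponds to
  // bit i, and the shift drops each surviving bit into the low bit of its byte.
  const uint64_t tmp = htobe64((magic * (uint64_t)word) & mask) >> 7;
  memcpy(out, &tmp, sizeof(tmp));  // out[i] == bit i of word, 0 or 1
}
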

@paleolimbot (Member) commented

Sorry for missing the initial notification... thanks for taking a look at this! The 0.4 release (probably early January) is targeted toward reliability and testing; 0.5 will have more of a performance focus (i.e., we'd like to add benchmarks to track this kind of thing). I probably won't get to wiring up a benchmarking system until the new year, though!

@WillAyd (Contributor, Author) commented Dec 14, 2023

No problem, and thanks for that context. This is not urgent at all; just leaving it here for visibility.

@WillAyd (Contributor, Author) commented Jan 19, 2024

Closing for now; we can always reopen.

@WillAyd closed this on Jan 19, 2024
@mapleFU (Member) commented Apr 17, 2024

apache/arrow#40845

I'm investigating improving bit unpacking in Arrow; do you have any advice here?

@WillAyd (Contributor, Author) commented Apr 17, 2024

Hey @mapleFU - that's great. I didn't read through everything you posted in that issue, but the research is impressive and certainly beyond what I was able to accomplish here.

If it helps, I noticed in #280 that there was a significant performance difference on x86 if you could avoid shifts when packing/unpacking bits. For example, code like:

static inline void PackInt8Shifts(const int8_t* values, volatile uint8_t* out) {
  *out = (values[0] | values[1] << 1 | values[2] << 2 | values[3] << 3 | values[4] << 4 |
          values[5] << 5 | values[6] << 6 | values[7] << 7);
}

was more than 10x slower when used in a larger Python process than the more verbose:

static inline void PackInt8NoShifts(const int8_t* values, volatile uint8_t* out) {
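  // Assumes each values[k] is 0 or 1: ((v + (2^k - 1)) & 2^k) == (v << k) for
  // v in {0, 1}, so this builds the same packed byte with adds and constant
  // masks instead of shift instructions.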
  *out = (values[0] | ((values[1] + 0x1) & 0x2) | ((values[2] + 0x3) & 0x4) |
          ((values[3] + 0x7) & 0x8) | ((values[4] + 0xf) & 0x10) |
          ((values[5] + 0x1f) & 0x20) | ((values[6] + 0x3f) & 0x40) |
          ((values[7] + 0x7f) & 0x80));
}

Unfortunately, that performance boost seemed to only work for unpacking, and only on x86. Joris and Dewey were not able to replicate the speedup on other architectures, though it was more or less a moot point for them.

I don't feel like the SO post I created was ever really answered, but you may find some value in the comments provided there, particularly those by user Peter Cordes:

https://stackoverflow.com/questions/77550709/x86-performance-difference-between-shift-and-add-when-packing-bits

Hope that helps

@WillAyd deleted the better-pack-unpack branch on April 17, 2024, 16:13