perf: Better bit packing-unpacking algorithms #326
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

```
@@            Coverage Diff             @@
##             main     #326      +/-   ##
==========================================
+ Coverage   88.23%   89.10%   +0.87%
==========================================
  Files           3        4       +1
  Lines         357      101     -256
==========================================
- Hits          315       90     -225
+ Misses         42       11      -31
```

☔ View full report in Codecov by Sentry.
```c
// Spread the 8 bits of `word` across the 8 bytes of a 64-bit value: the
// multiply puts bit i of `word` into the top bit of byte (7 - i), the mask
// isolates those bits, and the shift reduces each byte to 0 or 1; htobe64
// then reorders the bytes so that bit i ends up in byte i of the output.
// see https://stackoverflow.com/a/51750902/621736
const uint64_t magic = 0x8040201008040201ULL;
const uint64_t mask = 0x8080808080808080ULL;
const uint64_t tmp = htobe64((magic * word) & mask) >> 7;
```
I don't expect this `htobe64` to be part of the final design. The SO post actually suggests using a byte-reversed `magic` constant to get what we need, but that did not work in the special case of every bit in the word being set (presumably due to wrap-around). I left a comment on the SO post about that - hopefully someone out there has a smarter way of handling this.
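Concretely, a minimal demonstration of that failure mode (my sketch, not code from this PR): with the byte-reversed constant, the full product `magic * 0xFF` no longer fits in 64 bits, and the wrap-around destroys most of the marker bits the mask is supposed to keep.

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
  /* Byte-reversed variant of the magic constant from the SO post. */
  const uint64_t magic = 0x0102040810204080ULL;
  const uint64_t mask = 0x8080808080808080ULL;

  /* A single set bit works with no byte swap: bit 0 lands in byte 0. */
  uint64_t ok = ((magic * 0x01u) & mask) >> 7;  /* 0x0000000000000001 */

  /* Every bit set fails: magic * 0xFF exceeds 2^64 and wraps, so only
     one of the eight expected marker bits survives the mask. */
  uint64_t bad = ((magic * 0xFFu) & mask) >> 7; /* 0x0000000000000001,
                                                   expected 0x0101010101010101 */

  printf("ok  = %016llx\nbad = %016llx\n",
         (unsigned long long)ok, (unsigned long long)bad);
  return 0;
}
```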
Sorry for missing the initial notification... thanks for taking a look at this! The 0.4 release (probably early January) is targeted toward reliability and testing; 0.5 will have more of a performance focus (i.e., we'd like to add benchmarks to track this kind of thing). I probably won't get to wiring up a benchmarking system until the new year, though!
No problem, and thanks for that context. This is not urgent at all - just leaving it here for visibility.
Closing for now - can always reopen.
I'm investigating improving the unpacking in Arrow - do you have any advice here?
Hey @mapleFU - that's great. I didn't read through everything you posted in that issue, but the research is impressive, and certainly beyond what I was able to accomplish here. If it helps, I noticed in #280 that there was a significant performance difference on x86 when you could avoid shifts while unpacking bits, i.e. code like:
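(Something along these lines - a sketch, not the exact snippet from #280:)

```c
#include <stdint.h>

/* Shift-based unpack: extract each bit of `word` with a variable shift. */
static inline void unpack_bits_shift(uint8_t word, int8_t* out) {
  for (int i = 0; i < 8; i++) {
    out[i] = (int8_t)((word >> i) & 1);
  }
}
```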
was showing as more than 10x slower when used in a larger Python process than the more verbose:
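(Again a sketch, assuming the same one-byte-to-eight-values unpack:)

```c
#include <stdint.h>

/* Verbose unpack: one fixed mask per bit, no variable shifts. */
static inline void unpack_bits_verbose(uint8_t word, int8_t* out) {
  out[0] = (word & 0x01) != 0;
  out[1] = (word & 0x02) != 0;
  out[2] = (word & 0x04) != 0;
  out[3] = (word & 0x08) != 0;
  out[4] = (word & 0x10) != 0;
  out[5] = (word & 0x20) != 0;
  out[6] = (word & 0x40) != 0;
  out[7] = (word & 0x80) != 0;
}
```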
Unfortunately, that performance boost seemed to apply only to unpacking, and only on x86. Joris and Dewey were not able to replicate the speedup on other architectures, though it was more or less a moot point for them. I don't feel the SO post I created was ever really answered, but you may find some value in the comments provided there, particularly those from user Peter Cordes. Hope that helps!
In #280 the algorithms seemed to make a huge difference in performance on my Intel x86 chip, but other platforms didn't see as much of an improvement. I asked on SO why that is the case, and while I haven't seen anything definitive yet, one of the commenters pointed to higher-performance algorithms that should work more generally - this one in particular:
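(A sketch of that multiply technique, following the SO answer referenced in the diff; `htobe64` here is the glibc byte-order helper from `<endian.h>` - other platforms spell it differently:)

```c
#include <endian.h> /* htobe64 (glibc/Linux) */
#include <stdint.h>
#include <string.h>

/* Multiply-based unpack (https://stackoverflow.com/a/51750902/621736):
   one multiply spreads the 8 bits of `word` into the top bit of each
   byte of a 64-bit value; the mask and shift reduce each byte to 0 or 1. */
static inline void unpack_bits_multiply(uint8_t word, uint8_t* out) {
  const uint64_t magic = 0x8040201008040201ULL;
  const uint64_t mask = 0x8080808080808080ULL;
  /* Without a byte swap, bit 0 would land in the highest-addressed byte
     on little-endian targets; htobe64 makes out[i] hold bit i. */
  const uint64_t tmp = htobe64((magic * word) & mask) >> 7;
  memcpy(out, &tmp, sizeof(tmp));
}
```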
When I test this in Cython following the steps in https://github.com/WillAyd/cython-nanoarrow, I get the following numbers on my system:
So this multiply technique for unpacking is actually a good deal slower on my x86 machine, but it might provide a slight performance boost when packing. On non-x86 architectures I'd be curious to know if it makes a difference for anyone.
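(For reference, a multiply-based pack in the same spirit - a sketch, not code from this PR; it assumes a little-endian target and input bytes that are strictly 0 or 1:)

```c
#include <stdint.h>
#include <string.h>

/* Multiply-based pack: collapse eight 0/1 bytes into one byte with a
   single multiply. Each masked bit lands at a distinct position in the
   top byte of the product, so no carries collide. */
static inline uint8_t pack_bits_multiply(const uint8_t* bytes) {
  uint64_t x;
  memcpy(&x, bytes, sizeof(x)); /* little-endian: bytes[0] -> bit 0 */
  x &= 0x0101010101010101ULL;   /* keep only bit 0 of each byte */
  return (uint8_t)((x * 0x0102040810204080ULL) >> 56);
}
```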