Commit
perf: Improved Bit (Un)packing Performance (#280)
I was very surprised by this, but getting rid of the shifting yielded a huge performance boost for me locally. I was benchmarking some pandas code that took ~500us to unpack 1 million boolean values; with this simple change that time fell to ~30us.

I'm not an expert in assembly, but here is what godbolt produces to set the element at index 1 before:

```asm
movzx eax, BYTE PTR [rbp-1]
shr al
mov edx, eax
.loc 1 6 6
mov rax, QWORD PTR [rbp-32]
add rax, 1
.loc 1 6 24
and edx, 1
.loc 1 6 10
mov BYTE PTR [rax], dl
```

and after:

```asm
movzx eax, BYTE PTR [rbp-1]
and eax, 2
.loc 1 6 25
test eax, eax
setne dl
.loc 1 6 6
mov rax, QWORD PTR [rbp-32]
add rax, 1
.loc 1 6 10
mov BYTE PTR [rax], dl
```

My assumption is that the `shr` instruction is inefficient compared to the `test` / `setne` approach taken in the latter.
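For readers who want to see the two forms side by side in C, here is a minimal sketch of what the assembly above appears to correspond to. The function names `unpack_bits_shift` and `unpack_bits_mask` are hypothetical and are not taken from this commit; the sketch only assumes that each byte is unpacked into eight 0/1 values, least significant bit first.

```c
#include <stdint.h>

/* Hypothetical "before" form: shift the wanted bit down to position 0,
 * then mask it. For index 1 this matches the `shr` / `and edx, 1` pair. */
static inline void unpack_bits_shift(uint8_t byte, uint8_t* out) {
  out[0] = (byte >> 0) & 0x01;
  out[1] = (byte >> 1) & 0x01;
  out[2] = (byte >> 2) & 0x01;
  out[3] = (byte >> 3) & 0x01;
  out[4] = (byte >> 4) & 0x01;
  out[5] = (byte >> 5) & 0x01;
  out[6] = (byte >> 6) & 0x01;
  out[7] = (byte >> 7) & 0x01;
}

/* Hypothetical "after" form: mask the bit in place and compare against
 * zero. For index 1 this matches `and eax, 2` / `test` / `setne`. */
static inline void unpack_bits_mask(uint8_t byte, uint8_t* out) {
  out[0] = (byte & 0x01) != 0;
  out[1] = (byte & 0x02) != 0;
  out[2] = (byte & 0x04) != 0;
  out[3] = (byte & 0x08) != 0;
  out[4] = (byte & 0x10) != 0;
  out[5] = (byte & 0x20) != 0;
  out[6] = (byte & 0x40) != 0;
  out[7] = (byte & 0x80) != 0;
}
```

Both functions produce the same output for every input byte, so any difference is purely in the generated code. Whether the masked form is faster likely depends on the compiler and optimization level, so it is worth re-running the benchmark rather than relying on the instruction listing alone.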