Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add scalar optimizations from CRoaring / arXiv:1709.07821 section 3 #127

Merged
merged 10 commits into from
Jan 12, 2022

Conversation

saik0
Copy link
Contributor

@saik0 saik0 commented Jan 4, 2022

Purpose

This PR adds some optimizations from CRoaring as outlined in arXiv:1709.07821 section 3

Overview

  • All inserts and removes are now branchless (!in arXiv:1709.0782, in CRoaring)
  • Section 3.1 was already implemented, except for BitmapIter. This is covered in Speed up bitmap iteration #125
  • Implement Array-Bitset aggregates as outlined in section 3.2
    • Also branchless 😎
  • Tracks bitmap cardinality while performing bitmap-bitmap ops
    • This is a deviation from CRoaring, and will need to be benchmarked further before this Draft PR is ready
    • Curious to hear what you think about this @lemire
  • In order to track bitmap cardinality the len field had to moved into Store::Bitmap
    • This is unfortunately a cross cutting change
  • Store was quite large (LoC) and had many responsibilities. The largest change in this draft is decomposing Store such hat it's field variants are two new container types: each responsible for maintaining their invariants and implementing ops
    • Bitmap8K keeps track of it's cardinality
    • SortedU16Vec maintains its sorting
    • Store now only delegates to these containers
    • My hope is that this will be useful when implementing run containers. 🤞
    • Unfortunately so much code was moved this PR is HUGE

Out of scope

  • Inline ASM for Array-Bitset aggregates
  • Section 4 (explicit SIMD). As noted by the paper authors: The compiler does a decent job of autovectorization, though not as good as hand-tuned

Notes

  • I attempted to emulate the inline ASM Array-Bitset aggregates by using a mix of unsafe ptr arithmetic and x86-64 intrinsics, hoping to compile to the same instructions. I was unable to get it under 13 instructions per iteration (compared to the papers 5). While it was an improvement, I abandoned the effort in favor of waiting for the asm! macro to stabilize. Tracking Issue for inline assembly (asm!) rust-lang/rust#72016

@saik0
Copy link
Contributor Author

saik0 commented Jan 8, 2022

Got a benchmark harness with ops over the sample datasets. The results are consistent and very promising!

@Kerollmops
Copy link
Member

Kerollmops commented Jan 9, 2022

That's quit good indeed 🚀 😄

However, I am not sure about the use of the new asm! macro to do inline asm. I would like to keep this crate simple and able to compile on most of the platforms.

Maybe we can make sure that using inline asm doesn't break the build, I had too much trouble compiling croaring-rs (the wrapper of the C roaring library in Rust) due to a lot of raw SIMD function calls, unavailable on the platforms. Maybe we can use the std::simd soon to be released crate?

@saik0
Copy link
Contributor Author

saik0 commented Jan 10, 2022

That's quit good indeed 🚀 😄

However, I am not sure about the use of the new asm! macro to do inline asm. I would like to keep this crate simple and able to compile on most of the platforms.

Maybe we can make sure that using inline asm doesn't break the build, I had too much trouble compiling croaring-rs (the wrapper of the C roaring library in Rust) due to a lot of raw SIMD function calls, unavailable on the platforms. Maybe we can use the std::simd soon to be released crate?

Let's have this discussion in #60. This PR does not include any inline asm or explicit SIMD. The perf gains here are in making inserts and remove branchless, and by interleaving the cardinality tracking.

@saik0
Copy link
Contributor Author

saik0 commented Jan 10, 2022

Maintaining the len in the bitmap store has the additional micro-optimization of making sizeof<Container>() smaller, so binary searching containers should be slightly cache-friendlier.

Copy link
Member

@Kerollmops Kerollmops left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is huge indeed but the work is great! It reduces the time it takes to do most of the operations and also reduces the amount of code in the bitmap/store.rs file!

Thank you very much! I don't see any reason not to mark this PR as a candidate for merge!

src/bitmap/bitmap_8k.rs Outdated Show resolved Hide resolved
src/bitmap/bitmap_8k.rs Outdated Show resolved Hide resolved
src/bitmap/bitmap_8k.rs Outdated Show resolved Hide resolved
src/bitmap/bitmap_8k.rs Outdated Show resolved Hide resolved
src/bitmap/bitmap_8k.rs Outdated Show resolved Hide resolved
src/bitmap/sorted_u16_vec.rs Outdated Show resolved Hide resolved
src/bitmap/bitmap_8k.rs Outdated Show resolved Hide resolved
@Kerollmops Kerollmops mentioned this pull request Jan 10, 2022
@saik0
Copy link
Contributor Author

saik0 commented Jan 10, 2022

@Kerollmops I'm checking invariants with debug assertion in: 62e2eed. I think it's sufficient. However: deserialize might panic now for malformed inputs! (but only in debug builds) Fixed

src/bitmap/sorted_u16_vec.rs Outdated Show resolved Hide resolved
src/bitmap/sorted_u16_vec.rs Outdated Show resolved Hide resolved
saik0 and others added 2 commits January 11, 2022 09:24
@saik0
Copy link
Contributor Author

saik0 commented Jan 11, 2022

Re: Naming

Bitmap8k and SortedU16Vec are named after what they are. However, if we're being honest with ourselves they are not general purpose types.

We could also name them after what they're for. BitmapStore ArrayStore and possibly rename Store enum to StoreKind or StoreType and move all three into a store module, or include Container and call the mod container.

I don't have strong opinions on this. I'm just trying to think of what would be least surprising to me if I were seeing the codebase for the first time.

@saik0
Copy link
Contributor Author

saik0 commented Jan 11, 2022

The checks are not free, but I think for deserialization correctness matters more than performance. If this change is controversial I can revert it for this PR and we can have a separate discussion.

Screen Shot 2022-01-11 at 11 18 52 AM

@Kerollmops
Copy link
Member

Kerollmops commented Jan 12, 2022

Re: Naming

We could also name them after what they're for. BitmapStore ArrayStore and possibly rename Store enum to StoreKind or StoreType and move all three into a store module, or include Container and call the mod container.

I like the idea, I would just like to keep the Store enum name. This is a store that can be of one of two forms. This is not just the kind or type of a store, that's a store!


Re: Unchecked Deserialization

I am not sure to understand, what are both of these methods? Have you introduced a new method to deserialize without checking some bounds or other stuff? I am not able to find it on your branch.

Ho! I understand now, you were trying to use the try_from and from_unchecked versions of the BitmapStore and ArrayStore, I got it. I don't know about keeping these checked or unchecked versions, but I like that users can choose to use the unchecked version of a deserialization method when they know that's safe. Don't you think?

Also, the timings seem quite low, 2.5ms is very low, which dataset were you using to graph that?

@saik0
Copy link
Contributor Author

saik0 commented Jan 12, 2022

Re: Unchecked Deserialization

I am not sure to understand, what are both of these methods? Have you introduced a new method to deserialize without checking some bounds or other stuff? I am not able to find it on your branch.

Ho! I understand now, you were trying to use the try_from and from_unchecked versions of the BitmapStore and ArrayStore, I got it.

Also, the timings seem quite low, 2.5ms is very low, which dataset were you using to graph that?

Each point is one of the datasets. The one with the most elements is weather_85 if my memory serves me correctly.

In any case, I don't want to hold up this PR with this discussion, as it's out of scope. I will revert deserialize to the prior behavior. We can discuss elsewhere.

@saik0 saik0 marked this pull request as ready for review January 12, 2022 12:35
Copy link
Member

@Kerollmops Kerollmops left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like what you've done! Indeed I would like you to remove the unchecked version of the deserialize method. Maybe you can create an issue for this specific issue?

@saik0
Copy link
Contributor Author

saik0 commented Jan 12, 2022

I like what you've done! Indeed I would like you to remove the unchecked version of the deserialize method. Maybe you can create an issue for this specific issue?

OK, I'll open up a followup PR to add data validation to deserialize

@Kerollmops
Copy link
Member

Do you think I can merge this PR while waiting for your follow-up PR on data validation?

@saik0
Copy link
Contributor Author

saik0 commented Jan 12, 2022

Do you think I can merge this PR while waiting for your follow-up PR on data validation?

Yes. Follow up PR will depend on the try versions implemented in this one, so it's required to merge

@Kerollmops
Copy link
Member

bors merge

@bors
Copy link
Contributor

bors bot commented Jan 12, 2022

Build succeeded:

@bors bors bot merged commit 72e1ca7 into RoaringBitmap:master Jan 12, 2022
not-jan pushed a commit to not-jan/roaring-rs that referenced this pull request Aug 31, 2022
127: Add scalar optimizations from CRoaring / arXiv:1709.07821 section 3 r=Kerollmops a=saik0

### Purpose

This PR adds some optimizations from CRoaring as outlined in  arXiv:1709.07821 section 3

### Overview 
 * All inserts and removes are now branchless (!in arXiv:1709.0782, in CRoaring)
 * Section 3.1 was already implemented, except for `BitmapIter`. This is covered in RoaringBitmap#125
 * Implement Array-Bitset aggregates as outlined in section 3.2
   * Also branchless 😎
 * Tracks bitmap cardinality while performing bitmap-bitmap ops
   * This is a deviation from CRoaring, and will need to be benchmarked further before this Draft PR is ready
   * Curious to hear what you think about this `@lemire` 
 * In order to track bitmap cardinality the len field had to moved into `Store::Bitmap`
   * This is unfortunately a cross cutting change
 * `Store` was quite large (LoC) and had many responsibilities. The largest change in this draft is decomposing `Store` such hat it's field variants are two new container types: each responsible for maintaining their invariants and implementing `ops`
   * `Bitmap8K` keeps track of it's cardinality
   * `SortedU16Vec` maintains its sorting
   * `Store` now only delegates to these containers
   * My hope is that this will be useful when implementing run containers. 🤞
   * Unfortunately so much code was  moved this PR is _HUGE_


### Out of scope
 * Inline ASM for Array-Bitset aggregates
 * Section 4 (explicit SIMD). As noted by the paper authors: The compiler does a decent job of autovectorization, though not as good as hand-tuned

### Notes
 * I attempted to emulate the inline ASM Array-Bitset aggregates by using a mix of unsafe ptr arithmetic and x86-64 intrinsics, hoping to compile to the same instructions. I was unable to get it under 13 instructions per iteration (compared to the papers 5). While it was an improvement, I abandoned the effort in favor of waiting for the `asm!` macro to stabilize. rust-lang/rust#72016

Co-authored-by: saik0 <[email protected]>
Co-authored-by: Joel Pedraza <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants