
Speed up emit_dist with precomputed length codings for static trees #264

Merged

Conversation

brian-pane

Benchmark results:

Benchmark 1 (60 runs): ./compress-baseline 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          83.2ms ± 2.00ms    79.7ms … 93.9ms          1 ( 2%)        0%
  peak_rss           26.7MB ± 63.7KB    26.6MB … 26.7MB          0 ( 0%)        0%
  cpu_cycles          292M  ±  648K      291M  …  294M           0 ( 0%)        0%
  instructions        612M  ±  324       612M  …  612M           0 ( 0%)        0%
  cache_references    401K  ± 8.39K      396K  …  463K           6 (10%)        0%
  cache_misses        302K  ± 5.51K      277K  …  312K           8 (13%)        0%
  branch_misses      3.05M  ± 4.12K     3.04M  … 3.06M           0 ( 0%)        0%
Benchmark 2 (62 runs): ./target/release/examples/compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          81.1ms ± 1.14ms    78.7ms … 84.4ms          1 ( 2%)        ⚡-  2.5% ±  0.7%
  peak_rss           26.7MB ± 77.6KB    26.5MB … 26.7MB          0 ( 0%)          -  0.0% ±  0.1%
  cpu_cycles          285M  ±  641K      283M  …  287M           1 ( 2%)        ⚡-  2.6% ±  0.1%
  instructions        590M  ±  286       590M  …  590M           0 ( 0%)        ⚡-  3.7% ±  0.0%
  cache_references    401K  ± 3.52K      397K  …  420K           3 ( 5%)          -  0.2% ±  0.6%
  cache_misses        302K  ± 7.85K      255K  …  316K           8 (13%)          +  0.0% ±  0.8%
  branch_misses      2.87M  ± 3.09K     2.86M  … 2.87M           1 ( 2%)        ⚡-  5.9% ±  0.0%

Notes:

  • The improvement is smaller for higher compression levels.
  • Possibly controversial: I changed a couple of int-packing operations from big-endian to little-endian, which further improved performance on x86 (sketched below). Most, but not all, machines are little-endian these days, and I don't know whether this change causes a regression on systems that aren't.
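
To illustrate the kind of change (a hypothetical sketch, not the actual zlib-rs helpers): the affected code builds a temporary byte array and then reassembles it into a wider integer, so only the packing direction changes, never the interpretation of bytes from the input stream.

```rust
// Hypothetical sketch of an int-packing helper before and after the change.
// Both pack the same two fields into a u32; they differ only in which half
// each field lands in. The little-endian form matches x86's native byte
// order, which is where the small speedup was observed.
fn pack_be(code: u16, extra: u16) -> u32 {
    let [a, b] = code.to_be_bytes();
    let [c, d] = extra.to_be_bytes();
    u32::from_be_bytes([a, b, c, d]) // `code` ends up in the upper 16 bits
}

fn pack_le(code: u16, extra: u16) -> u32 {
    let [a, b] = code.to_le_bytes();
    let [c, d] = extra.to_le_bytes();
    u32::from_le_bytes([a, b, c, d]) // `code` ends up in the lower 16 bits
}
```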

@folkertdev
Collaborator

It's late here, so a proper review will have to wait till tomorrow, but this is looking really good!

Benchmark 1 (64 runs): target/release/examples/compress-baseline 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          78.2ms ± 1.41ms    76.6ms … 82.2ms          0 ( 0%)        0%
  peak_rss           26.7MB ± 65.2KB    26.6MB … 26.7MB          0 ( 0%)        0%
  cpu_cycles          286M  ± 5.95M      280M  …  304M           2 ( 3%)        0%
  instructions        590M  ±  419       590M  …  590M           2 ( 3%)        0%
  cache_references   20.0M  ±  265K     19.8M  … 21.8M           1 ( 2%)        0%
  cache_misses        523K  ± 68.4K      423K  …  726K           1 ( 2%)        0%
  branch_misses      3.05M  ± 8.30K     3.04M  … 3.10M           1 ( 2%)        0%
Benchmark 2 (66 runs): target/release/examples/blogpost-compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          76.7ms ± 1.55ms    75.8ms … 88.6ms          4 ( 6%)        ⚡-  1.9% ±  0.7%
  peak_rss           26.7MB ± 63.5KB    26.6MB … 26.7MB          0 ( 0%)          +  0.0% ±  0.1%
  cpu_cycles          280M  ± 7.14M      275M  …  335M           4 ( 6%)        ⚡-  2.4% ±  0.8%
  instructions        567M  ±  264       567M  …  567M           2 ( 3%)        ⚡-  3.8% ±  0.0%
  cache_references   20.0M  ±  515K     19.7M  … 23.8M           6 ( 9%)          -  0.2% ±  0.7%
  cache_misses        459K  ± 96.8K      343K  …  904K           3 ( 5%)        ⚡- 12.2% ±  5.5%
  branch_misses      2.93M  ± 5.41K     2.92M  … 2.94M           0 ( 0%)        ⚡-  4.1% ±  0.1%

> The improvement is smaller for higher compression levels.

That makes sense; they do more other work, so the emit functions are less of a bottleneck.

> I changed a couple of int-packing operations from big-endian to little-endian

From a cursory look, you only use those operations on data from the tables, right? That's fine: changing the endianness would be problematic when reading from the input stream, but if you just store the data in the opposite order then that's all good. We also test on a big-endian target (s390x) in CI, so it would catch any regressions.

@brian-pane
Author

Right, I only changed from_be_bytes to from_le_bytes in functions that were creating integers from temporary arrays constructed specifically for that purpose. Ideally, we could choose at compile time between LE and BE implementations to match the target architecture, but I don't know how to do that in Rust.

@folkertdev
Collaborator

> Ideally, we could choose at compile time between LE and BE implementations to match the target architecture, but I don't know how to do that in Rust.

There are helpers for using the native endianness (the from_ne_bytes / to_ne_bytes methods on the integer types).

So as long as you make sure uses of these functions are "in sync", you should be good.
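
For illustration (these helper names are invented, not zlib-rs APIs), pairing `to_ne_bytes` with `from_ne_bytes` keeps the packing and unpacking in sync regardless of the target's byte order:

```rust
// Sketch: pack two u16 fields into a u32 and back using the platform's native
// byte order. Which half each field lands in is platform-dependent, which is
// exactly why both sides must agree on using the *_ne_bytes helpers.
fn pack_native(x: u16, y: u16) -> u32 {
    let [a, b] = x.to_ne_bytes();
    let [c, d] = y.to_ne_bytes();
    u32::from_ne_bytes([a, b, c, d])
}

fn unpack_native(packed: u32) -> (u16, u16) {
    let [a, b, c, d] = packed.to_ne_bytes();
    (u16::from_ne_bytes([a, b]), u16::from_ne_bytes([c, d]))
}
```

On either endianness, `unpack_native(pack_native(x, y))` returns `(x, y)`, so no target-specific code is needed.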

@folkertdev force-pushed the emit_dist_precomputed branch from 8b81773 to 0e37a75 on December 15, 2024 at 14:34
@folkertdev left a comment (Collaborator)

I pushed some changes to compute the table at compile time, so that it is clear how it gets generated. This has no impact at runtime.
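
As a rough sketch of the compile-time approach (the names and exact layout here are assumptions, not the actual zlib-rs table): a `const fn` fills a static array once during compilation, so emitting a length only costs an indexed load at runtime.

```rust
// Hedged sketch: a 256-entry table mapping `match_length - 3` to its DEFLATE
// length code (257..=285) and the number of extra bits, built at compile time.
#[derive(Copy, Clone)]
struct LenCoding {
    code: u16,  // static-tree length code
    extra: u8,  // number of extra bits that follow the code
}

const fn build_length_table() -> [LenCoding; 256] {
    let mut table = [LenCoding { code: 0, extra: 0 }; 256];
    let mut x = 0usize; // x = match_length - 3
    while x < 256 {
        table[x] = if x < 8 {
            LenCoding { code: 257 + x as u16, extra: 0 }
        } else if x == 255 {
            LenCoding { code: 285, extra: 0 } // length 258 has a dedicated code
        } else {
            let msb = 31 - (x as u32).leading_zeros(); // floor(log2(x))
            let e = (msb - 2) as u8;                   // extra bits in this group
            let code = 265 + 4 * (e as u16 - 1) + ((x >> e) as u16 - 4);
            LenCoding { code, extra: e }
        };
        x += 1;
    }
    table
}

static LENGTH_TABLE: [LenCoding; 256] = build_length_table();
```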

A few other small things:

zlib-rs/src/deflate.rs: 3 review comments (outdated, resolved)
@folkertdev force-pushed the emit_dist_precomputed branch from 12d5498 to 8347812 on December 15, 2024 at 21:57
@folkertdev
Collaborator

folkertdev commented Dec 15, 2024

I added encode_dist to mirror encode_len. I don't think we can (profitably) create a cached lookup table here, but at least this reduces duplication and data dependencies. On my machine there is no change in performance with or without this refactor, so it's mostly the deduplication that matters here.
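
As a sketch of what such a helper has to compute (the signature and names are guesses, not the actual zlib-rs code): a DEFLATE distance in 1..=32768 maps to one of 30 distance codes plus some extra bits, and all three pieces can be derived from the position of the most significant bit of `dist - 1`.

```rust
// Hedged sketch of an encode_dist-style helper: returns the distance code
// (0..=29), the number of extra bits, and the value of those extra bits.
fn encode_dist(dist: u16) -> (u8, u8, u16) {
    debug_assert!((1..=32768u32).contains(&(dist as u32)));
    let x = (dist as u32) - 1;
    if x < 4 {
        (x as u8, 0, 0) // codes 0..=3 carry no extra bits
    } else {
        let msb = 31 - x.leading_zeros();    // floor(log2(x))
        let extra = (msb - 1) as u8;         // extra bits for this code pair
        let code = (2 * msb + ((x >> extra) & 1)) as u8;
        let value = (x & ((1u32 << extra) - 1)) as u16;
        (code, extra, value)
    }
}
```

`emit_dist` would then write the static-tree symbol for `code` followed by `value` in `extra` bits.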

Btw, you can git pull --rebase to fetch changes that maintainers make onto your own local branch. That prevents a merge commit from being created (which we'd like to avoid).

I think this looks good now. We'll likely merge it tomorrow, but first I need to make a release (our releases get audited, and someone did that before the weekend, so it's simpler to cut that release first and include this PR in the next one).

@folkertdev folkertdev requested a review from bjorn3 December 16, 2024 08:41
@@ -841,15 +841,18 @@ impl Value {
self.a
}

pub(crate) fn code(self) -> u16 {
#[inline(always)]
Review comment from a collaborator on this line:

Usage of #[inline(always)] is discouraged unless you can actually prove that it improves performance, but it isn't all that important either.

@bjorn3 left a comment (Collaborator)

It took me a bit of time to be convinced of the equivalence between encode_dist + emit_dist and the old emit_dist due to the shuffling of code, but I think it is correct.

@folkertdev left a comment (Collaborator)

Yes, it's tricky; I stared at it for a while too, and it looks right.

Thanks, @brian-pane!

@folkertdev folkertdev merged commit d039173 into trifectatechfoundation:main Dec 16, 2024
20 checks passed