
Speed up emit_dist with precomputed length codings for static trees #264

Merged

Conversation

brian-pane

Benchmark results:

Benchmark 1 (60 runs): ./compress-baseline 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          83.2ms ± 2.00ms    79.7ms … 93.9ms          1 ( 2%)        0%
  peak_rss           26.7MB ± 63.7KB    26.6MB … 26.7MB          0 ( 0%)        0%
  cpu_cycles          292M  ±  648K      291M  …  294M           0 ( 0%)        0%
  instructions        612M  ±  324       612M  …  612M           0 ( 0%)        0%
  cache_references    401K  ± 8.39K      396K  …  463K           6 (10%)        0%
  cache_misses        302K  ± 5.51K      277K  …  312K           8 (13%)        0%
  branch_misses      3.05M  ± 4.12K     3.04M  … 3.06M           0 ( 0%)        0%
Benchmark 2 (62 runs): ./target/release/examples/compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          81.1ms ± 1.14ms    78.7ms … 84.4ms          1 ( 2%)        ⚡-  2.5% ±  0.7%
  peak_rss           26.7MB ± 77.6KB    26.5MB … 26.7MB          0 ( 0%)          -  0.0% ±  0.1%
  cpu_cycles          285M  ±  641K      283M  …  287M           1 ( 2%)        ⚡-  2.6% ±  0.1%
  instructions        590M  ±  286       590M  …  590M           0 ( 0%)        ⚡-  3.7% ±  0.0%
  cache_references    401K  ± 3.52K      397K  …  420K           3 ( 5%)          -  0.2% ±  0.6%
  cache_misses        302K  ± 7.85K      255K  …  316K           8 (13%)          +  0.0% ±  0.8%
  branch_misses      2.87M  ± 3.09K     2.86M  … 2.87M           1 ( 2%)        ⚡-  5.9% ±  0.0%

Notes:

  • The improvement is smaller for higher compression levels.
  • Possibly controversial: I changed a couple of int-packing operations from big-endian to little-endian, which further improved performance on x86 (sketched below). Most, but not all, machines are little-endian these days, and I don't know whether this change causes a regression on systems that aren't.
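
To illustrate the kind of change (a hypothetical sketch, not the actual zlib-rs helpers): the affected code builds a temporary byte array and then reassembles it into a wider integer, so only the packing direction changes, never the interpretation of bytes from the input stream.

```rust
// Hypothetical sketch of an int-packing helper before and after the change.
// Both pack the same two fields into a u32; they differ only in which half
// each field lands in. The little-endian form matches x86's native byte
// order, which is where the small speedup was observed.
fn pack_be(code: u16, extra: u16) -> u32 {
    let [a, b] = code.to_be_bytes();
    let [c, d] = extra.to_be_bytes();
    u32::from_be_bytes([a, b, c, d]) // `code` ends up in the upper 16 bits
}

fn pack_le(code: u16, extra: u16) -> u32 {
    let [a, b] = code.to_le_bytes();
    let [c, d] = extra.to_le_bytes();
    u32::from_le_bytes([a, b, c, d]) // `code` ends up in the lower 16 bits
}
```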

@folkertdev
Collaborator

It's late here, so a proper review will have to wait till tomorrow, but this is looking really good!

Benchmark 1 (64 runs): target/release/examples/compress-baseline 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          78.2ms ± 1.41ms    76.6ms … 82.2ms          0 ( 0%)        0%
  peak_rss           26.7MB ± 65.2KB    26.6MB … 26.7MB          0 ( 0%)        0%
  cpu_cycles          286M  ± 5.95M      280M  …  304M           2 ( 3%)        0%
  instructions        590M  ±  419       590M  …  590M           2 ( 3%)        0%
  cache_references   20.0M  ±  265K     19.8M  … 21.8M           1 ( 2%)        0%
  cache_misses        523K  ± 68.4K      423K  …  726K           1 ( 2%)        0%
  branch_misses      3.05M  ± 8.30K     3.04M  … 3.10M           1 ( 2%)        0%
Benchmark 2 (66 runs): target/release/examples/blogpost-compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          76.7ms ± 1.55ms    75.8ms … 88.6ms          4 ( 6%)        ⚡-  1.9% ±  0.7%
  peak_rss           26.7MB ± 63.5KB    26.6MB … 26.7MB          0 ( 0%)          +  0.0% ±  0.1%
  cpu_cycles          280M  ± 7.14M      275M  …  335M           4 ( 6%)        ⚡-  2.4% ±  0.8%
  instructions        567M  ±  264       567M  …  567M           2 ( 3%)        ⚡-  3.8% ±  0.0%
  cache_references   20.0M  ±  515K     19.7M  … 23.8M           6 ( 9%)          -  0.2% ±  0.7%
  cache_misses        459K  ± 96.8K      343K  …  904K           3 ( 5%)        ⚡- 12.2% ±  5.5%
  branch_misses      2.93M  ± 5.41K     2.92M  … 2.94M           0 ( 0%)        ⚡-  4.1% ±  0.1%

> The improvement is smaller for higher compression levels.

That makes sense; they do more other work, so the emit functions are less of a bottleneck.

> I changed a couple of int-packing operations from big-endian to little-endian

From a cursory look, you only use those operations on data from the tables, right? That's fine: changing the endianness would be problematic when reading from the input stream, but if you just store the data in the opposite order then that's all good. We also test on a big-endian target (s390x) in CI, so it would catch any regressions.

@brian-pane
Author

Right, I only changed from_be_bytes to from_le_bytes in functions that were creating integers from temporary arrays constructed specifically for that purpose. Ideally, we could choose at compile time between LE and BE implementations to match the target architecture, but I don't know how to do that in Rust.

@folkertdev
Collaborator

> Ideally, we could choose at compile time between LE and BE implementations to match the target architecture, but I don't know how to do that in Rust.

There are helpers for using the native endianness (the from_ne_bytes / to_ne_bytes methods on the integer types).

So as long as you make sure uses of these functions are "in sync", you should be good.
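
For illustration (these helper names are invented, not zlib-rs APIs), pairing `to_ne_bytes` with `from_ne_bytes` keeps the packing and unpacking in sync regardless of the target's byte order:

```rust
// Sketch: pack two u16 fields into a u32 and back using the platform's native
// byte order. Which half each field lands in is platform-dependent, which is
// exactly why both sides must agree on using the *_ne_bytes helpers.
fn pack_native(x: u16, y: u16) -> u32 {
    let [a, b] = x.to_ne_bytes();
    let [c, d] = y.to_ne_bytes();
    u32::from_ne_bytes([a, b, c, d])
}

fn unpack_native(packed: u32) -> (u16, u16) {
    let [a, b, c, d] = packed.to_ne_bytes();
    (u16::from_ne_bytes([a, b]), u16::from_ne_bytes([c, d]))
}
```

On either endianness, `unpack_native(pack_native(x, y))` returns `(x, y)`, so no target-specific code is needed.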

@folkertdev force-pushed the emit_dist_precomputed branch from 8b81773 to 0e37a75 on December 15, 2024 at 14:34
@folkertdev left a comment (Collaborator)

I pushed some changes to compute the table at compile time, so that it is clear how it gets generated. This has no impact at runtime.
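
As a rough sketch of the compile-time approach (the names and exact layout here are assumptions, not the actual zlib-rs table): a `const fn` fills a static array once during compilation, so emitting a length only costs an indexed load at runtime.

```rust
// Hedged sketch: a 256-entry table mapping `match_length - 3` to its DEFLATE
// length code (257..=285) and the number of extra bits, built at compile time.
#[derive(Copy, Clone)]
struct LenCoding {
    code: u16,  // static-tree length code
    extra: u8,  // number of extra bits that follow the code
}

const fn build_length_table() -> [LenCoding; 256] {
    let mut table = [LenCoding { code: 0, extra: 0 }; 256];
    let mut x = 0usize; // x = match_length - 3
    while x < 256 {
        table[x] = if x < 8 {
            LenCoding { code: 257 + x as u16, extra: 0 }
        } else if x == 255 {
            LenCoding { code: 285, extra: 0 } // length 258 has a dedicated code
        } else {
            let msb = 31 - (x as u32).leading_zeros(); // floor(log2(x))
            let e = (msb - 2) as u8;                   // extra bits in this group
            let code = 265 + 4 * (e as u16 - 1) + ((x >> e) as u16 - 4);
            LenCoding { code, extra: e }
        };
        x += 1;
    }
    table
}

static LENGTH_TABLE: [LenCoding; 256] = build_length_table();
```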

A few other small things:

zlib-rs/src/deflate.rs: 3 review comments (outdated, resolved)
@folkertdev force-pushed the emit_dist_precomputed branch from 12d5498 to 8347812 on December 15, 2024 at 21:57
@folkertdev
Collaborator

folkertdev commented Dec 15, 2024

I added encode_dist to mirror encode_len. I don't think we can (profitably) create a cached lookup table here, but at least this reduces duplication and data dependencies. On my machine there is no change in performance with or without this refactor, so it's mostly the deduplication that matters here.
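
As a sketch of what such a helper has to compute (the signature and names are guesses, not the actual zlib-rs code): a DEFLATE distance in 1..=32768 maps to one of 30 distance codes plus some extra bits, and all three pieces can be derived from the position of the most significant bit of `dist - 1`.

```rust
// Hedged sketch of an encode_dist-style helper: returns the distance code
// (0..=29), the number of extra bits, and the value of those extra bits.
fn encode_dist(dist: u16) -> (u8, u8, u16) {
    debug_assert!((1..=32768u32).contains(&(dist as u32)));
    let x = (dist as u32) - 1;
    if x < 4 {
        (x as u8, 0, 0) // codes 0..=3 carry no extra bits
    } else {
        let msb = 31 - x.leading_zeros();    // floor(log2(x))
        let extra = (msb - 1) as u8;         // extra bits for this code pair
        let code = (2 * msb + ((x >> extra) & 1)) as u8;
        let value = (x & ((1u32 << extra) - 1)) as u16;
        (code, extra, value)
    }
}
```

`emit_dist` would then write the static-tree symbol for `code` followed by `value` in `extra` bits.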

Btw, you can git pull --rebase to fetch changes that maintainers make onto your own local branch. That prevents a merge commit from being created (which we'd like to avoid).

I think this looks good now. We'll likely merge it tomorrow, but first I need to make a release (our releases get audited, and someone did that before the weekend, so it's simpler to cut that release first and include this PR in the next one).

@folkertdev folkertdev requested a review from bjorn3 December 16, 2024 08:41
@@ -841,15 +841,18 @@ impl Value {
self.a
}

pub(crate) fn code(self) -> u16 {
#[inline(always)]
Review comment from a collaborator on this line:

Usage of #[inline(always)] is discouraged unless you can actually prove that it improves performance, but it isn't all that important either.

@bjorn3 left a comment (Collaborator)

It took me a bit of time to be convinced of the equivalence between encode_dist + emit_dist and the old emit_dist due to the shuffling of code, but I think it is correct.

@folkertdev left a comment (Collaborator)

Yes, it's tricky; I stared at it for a while too, and it looks right.

Thanks, @brian-pane!

@folkertdev folkertdev merged commit d039173 into trifectatechfoundation:main Dec 16, 2024
20 checks passed