
fix deflate performance #18

Open
folkertdev opened this issue Feb 1, 2024 · 15 comments


@folkertdev
Collaborator

We are consistently ~10% slower than zlib-ng on deflate. The gap fluctuates across compression levels, but currently none of them are on par with zlib-ng.

There is so far no obvious reason for this slowdown, so it's likely a "death by a thousand papercuts" sort of thing.

@folkertdev
Collaborator Author

As a data point, at commit 8048180:

Benchmark 1 (29 runs): cargo run --release --example compress 1 ng silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           175ms ± 8.58ms     168ms …  217ms          1 ( 3%)        0%
  peak_rss           26.8MB ± 87.3KB    26.6MB … 27.0MB          0 ( 0%)        0%
  cpu_cycles          515M  ± 6.88M      505M  …  529M           0 ( 0%)        0%
  instructions        744M  ± 35.3K      744M  …  744M           1 ( 3%)        0%
  cache_references   11.3M  ± 1.18M     9.08M  … 14.1M           0 ( 0%)        0%
  cache_misses       2.25M  ± 66.4K     2.12M  … 2.40M           1 ( 3%)        0%
  branch_misses      4.13M  ± 15.1K     4.10M  … 4.17M           0 ( 0%)        0%
Benchmark 2 (27 runs): cargo run --release --example compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           191ms ± 4.14ms     187ms …  207ms          1 ( 4%)        💩+  9.2% ±  2.1%
  peak_rss           26.7MB ± 93.5KB    26.6MB … 26.9MB          0 ( 0%)          -  0.3% ±  0.2%
  cpu_cycles          580M  ± 10.7M      570M  …  619M           1 ( 4%)        💩+ 12.7% ±  0.9%
  instructions        873M  ± 31.4K      873M  …  873M           0 ( 0%)        💩+ 17.3% ±  0.0%
  cache_references   11.4M  ± 1.16M     9.31M  … 13.6M           0 ( 0%)          +  0.9% ±  5.6%
  cache_misses       2.33M  ± 74.4K     2.22M  … 2.49M           0 ( 0%)        💩+  3.6% ±  1.7%
  branch_misses      4.39M  ± 42.1K     4.34M  … 4.52M           1 ( 4%)        💩+  6.2% ±  0.4%

@bjorn3
Collaborator

bjorn3 commented Oct 14, 2024

#223 should have helped a fair bit.

@folkertdev
Collaborator Author

Yes, but there is still a significant gap. (Note that the earlier benchmark ran through cargo, which is why its numbers are higher.)

Benchmark 1 (68 runs): target/release/examples/blogpost-compress 1 ng silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          73.9ms ± 1.02ms    72.7ms … 80.2ms          3 ( 4%)        0%
  peak_rss           26.6MB ± 42.5KB    26.5MB … 26.6MB          8 (12%)        0%
  cpu_cycles          264M  ± 1.79M      261M  …  270M           2 ( 3%)        0%
  instructions        460M  ±  266       460M  …  460M           1 ( 1%)        0%
  cache_references   18.9M  ±  233K     18.5M  … 19.7M           5 ( 7%)        0%
  cache_misses        507K  ± 85.4K      408K  …  949K           2 ( 3%)        0%
  branch_misses      3.32M  ± 5.96K     3.30M  … 3.33M           0 ( 0%)        0%
Benchmark 2 (64 runs): target/release/examples/blogpost-compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          78.6ms ±  990us    77.4ms … 83.3ms          5 ( 8%)        💩+  6.5% ±  0.5%
  peak_rss           26.7MB ± 65.8KB    26.6MB … 26.7MB          0 ( 0%)          +  0.3% ±  0.1%
  cpu_cycles          287M  ± 4.11M      284M  …  310M           9 (14%)        💩+  8.8% ±  0.4%
  instructions        633M  ±  276       633M  …  633M           0 ( 0%)        💩+ 37.7% ±  0.0%
  cache_references   19.9M  ±  200K     19.7M  … 20.6M           6 ( 9%)        💩+  5.5% ±  0.4%
  cache_misses        440K  ± 95.8K      327K  …  778K           1 ( 2%)        ⚡- 13.3% ±  6.1%
  branch_misses      3.08M  ± 10.8K     3.06M  … 3.11M           9 (14%)        ⚡-  7.3% ±  0.1%

@brian-pane

In case it's helpful for anyone else, here's a CPU profile of the current main. I used https://github.com/mstange/samply to visualize it, since that tool shows the cost of inlined functions. (Sorry for sharing this in the form of a screenshot. I haven't yet found a way to export the output of the tool as plain text.)

Methodology:

perf record -F max -e instructions -o perf-rs.data target/release/examples/compress 1 rs silesia-small.tar
samply import perf-rs.data

Notes on the profile results:

  • A surprisingly large amount of CPU time is spent turning the WeakSlices into temporary slices (all the WeakSliceMut functions, plus Window::filled). I looked at the generated assembly for these functions, though, and there's nothing surprising there: it's just a pair of 64-bit reads to fetch the pointer and length fields from the appropriate structs.
  • BitWriter::emit_dist takes a lot of time, but that seems to be because of all the table lookups it needs to do. I haven't found any obvious ways to speed that up.
  • Crc32HashCalc::quick_insert_string is a hotspot. A source-line-level profile shows almost all of the "self" time of that function in these two lines, which may be tricky to improve because they're just a straightforward array read and an unavoidable conditional branch:
        let head = state.head.as_slice()[hm];
        if head != string as u16 {
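Out of curiosity, here is a sketch of what such a weak-slice wrapper plausibly looks like. Everything below is a guess at the shape for illustration, not the actual zlib-rs code; the point is that materializing the slice really is just two word-sized loads plus `slice::from_raw_parts_mut`:

```rust
use core::marker::PhantomData;

// Hypothetical minimal model of a WeakSliceMut-style wrapper (names assumed,
// not the real zlib-rs definitions).
pub struct WeakSliceMut<'a, T> {
    ptr: *mut T,
    len: usize,
    _marker: PhantomData<&'a mut [T]>,
}

impl<'a, T> WeakSliceMut<'a, T> {
    pub fn new(slice: &'a mut [T]) -> Self {
        Self {
            ptr: slice.as_mut_ptr(),
            len: slice.len(),
            _marker: PhantomData,
        }
    }

    // Two loads (pointer, length) and a from_raw_parts_mut; this is the kind
    // of conversion the profile attributes cost to.
    pub fn as_mut_slice(&mut self) -> &mut [T] {
        // SAFETY: constructed from a live &mut [T] with matching length.
        unsafe { core::slice::from_raw_parts_mut(self.ptr, self.len) }
    }
}

fn main() {
    let mut buf = [1u8, 2, 3];
    let mut weak = WeakSliceMut::new(&mut buf);
    weak.as_mut_slice()[0] = 9;
    assert_eq!(buf, [9, 2, 3]);
}
```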

[screenshot: samply profile]

@folkertdev
Collaborator Author

That corresponds well with what we've been seeing.

A surprisingly large amount of CPU time is spent turning the WeakSlices into temporary slices

You're measuring instructions, right, not CPU time? Or maybe I'm misinterpreting? Anyway, if it is just instructions, this might be a bit misleading because of instruction-level parallelism.

BitWriter::emit_dist takes a lot of time

Yeah, I've tried turning those tables into integer constants and using bit shifting/masking before, but that did not help. I think I only tried it with SIMD registers, though; maybe using general-purpose registers would work?

It really seems like something should be possible here, but I've not had much luck so far.
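For what it's worth, at least one of the tables emit_dist consults, the per-distance-code extra-bits table, has a trivial closed form (this mirrors zlib's static extra_dbits table; the Rust function name is mine). Whether computing it beats a table load is exactly the open question here:

```rust
// extra_dbits[dcode] in zlib is 0,0,0,0,1,1,2,2,...,13,13, which is
// dcode / 2 - 1 for dcode >= 2 and 0 otherwise.
pub const fn dist_extra_bits(dcode: usize) -> usize {
    if dcode < 2 {
        0
    } else {
        dcode / 2 - 1
    }
}

fn main() {
    // Check against the classic zlib extra_dbits table (D_CODES == 30).
    let extra_dbits = [
        0, 0, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6,
        7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13,
    ];
    for (d, &e) in extra_dbits.iter().enumerate() {
        assert_eq!(dist_extra_bits(d), e);
    }
}
```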

Crc32HashCalc::quick_insert_string is a hotspot

Besides what you wrote, the read is basically a hash-map lookup, so it is quite likely that the memory you want is not in (L1) cache.

@brian-pane

When I collect the profile as CPU cycles instead of instructions, the WeakSlice methods and Window::filled are a smaller percentage, but still collectively over 10% of the total.

@folkertdev
Collaborator Author

Just want to document this for the future:

I tried replacing this table

#[rustfmt::skip]
pub const LENGTH_CODE: [u8; STD_MAX_MATCH-STD_MIN_MATCH+1] = [
     0,  1,  2,  3,  4,  5,  6,  7,  8,  8,  9,  9, 10, 10, 11, 11, 12, 12, 12, 12,
    13, 13, 13, 13, 14, 14, 14, 14, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16, 16, 16,
    17, 17, 17, 17, 17, 17, 17, 17, 18, 18, 18, 18, 18, 18, 18, 18, 19, 19, 19, 19,
    19, 19, 19, 19, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20,
    21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 22, 22, 22, 22,
    22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 23, 23, 23, 23, 23, 23, 23, 23,
    23, 23, 23, 23, 23, 23, 23, 23, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
    24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
    25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25,
    25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 26, 26, 26, 26, 26, 26, 26, 26,
    26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26,
    26, 26, 26, 26, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,
    27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 28
];

with a function

#[inline(always)]
pub const fn length_code(i: u8) -> u8 {
    match i {
        0..8 => i,
        255 => 28,
        _ => {
            // let log2 = i.ilog(2) // this is `7 - i.leading_zeros()`
            // ((log2 as u8 - 1) << 2) | ((i >> (log2 - 2)) & 0b11)
            let a = (6 - i.leading_zeros() as u8) << 2;
            let b = (i >> (5u8.saturating_sub(i.leading_zeros() as u8))) & 0b11;

            a | b
        }
    }
}

Yes, that is inscrutable and cursed, and it took forever to figure out. But it actually replicates the table. Sadly it still generates too many instructions to be worth it.
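For reference, here is a spot-check of the closed form against entries copied from the table above (I use `0..=7` instead of the exclusive range pattern so the sketch compiles on older compilers; otherwise the body is the same):

```rust
#[inline(always)]
pub const fn length_code(i: u8) -> u8 {
    match i {
        0..=7 => i,
        255 => 28,
        _ => {
            // log2-based reconstruction of the LENGTH_CODE table.
            let a = (6 - i.leading_zeros() as u8) << 2;
            let b = (i >> (5u8.saturating_sub(i.leading_zeros() as u8))) & 0b11;
            a | b
        }
    }
}

fn main() {
    // Values copied from the LENGTH_CODE table in the comment above.
    assert_eq!(length_code(0), 0);
    assert_eq!(length_code(7), 7);
    assert_eq!(length_code(8), 8);
    assert_eq!(length_code(19), 12);
    assert_eq!(length_code(64), 20);
    assert_eq!(length_code(254), 27);
    assert_eq!(length_code(255), 28);
    // Like the table, the closed form is non-decreasing in the match length.
    for i in 0u8..255 {
        assert!(length_code(i) <= length_code(i + 1));
    }
}
```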

@nmoinvaz

It might be worth benchmarking the hashing functions first to make sure they are on par with zlib-ng. Also consider checking all the functions in zlib-ng that are marked static inline, and see whether missing function inlining might be a cause.

@nmoinvaz

Another thing to check is the performance of level 0 (stored) compared to zlib-ng, to see whether the gap comes from something non-compression-related.

@brian-pane

Documenting another attempt to replace a lookup table with a calculation:

I tried replacing this:

pub const BASE_DIST: [u16; D_CODES] = [
    0,     1,     2,     3,     4,     6,     8,    12,    16,    24,
   32,    48,    64,    96,   128,   192,   256,   384,   512,   768,
 1024,  1536,  2048,  3072,  4096,  6144,  8192, 12288, 16384, 24576
];

with this, which generates the equivalent mapping:

#[inline(always)]
pub fn base_dist(i: usize) -> usize {
    let base = std::cmp::min(1, i);  // a hack to handle the i==0 case
    let x = base << ((i + 1) >> 1);
    let y = base << (i >> 1);
    (x + y) / 2
}

Unfortunately, the function version created a 2% regression in instruction count at low compression levels.
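As a sanity check (mine, not part of the original comment), the function really does reproduce the full 30-entry table:

```rust
// The calculation from the comment above, unchanged.
pub fn base_dist(i: usize) -> usize {
    let base = std::cmp::min(1, i); // a hack to handle the i == 0 case
    let x = base << ((i + 1) >> 1);
    let y = base << (i >> 1);
    (x + y) / 2
}

fn main() {
    // The BASE_DIST table from the comment above (D_CODES == 30).
    const BASE_DIST: [u16; 30] = [
        0, 1, 2, 3, 4, 6, 8, 12, 16, 24,
        32, 48, 64, 96, 128, 192, 256, 384, 512, 768,
        1024, 1536, 2048, 3072, 4096, 6144, 8192, 12288, 16384, 24576,
    ];
    for (i, &expected) in BASE_DIST.iter().enumerate() {
        assert_eq!(base_dist(i), expected as usize);
    }
}
```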

@folkertdev
Collaborator Author

Should that be cmp::max? Otherwise any input besides 0 returns the same value, I think?

Anyway, I guess it just produces too many instructions, with the shifts and all, and they depend on each other too much. You could try performing the call earlier, so that it maybe gets mixed in with other instructions at a lower cost?

@brian-pane

The key to the cmp::min is how the result is used in the next two lines. If the computation didn't have to handle the i=0 case, this simple formula would work:

    let x = 1 << ((i + 1) >> 1);
    let y = 1 << (i >> 1);
    (x + y) / 2

But that produces the wrong result for i = 0, so the leftmost 1 in those shift operations is replaced with base, which is 0 for i = 0 and 1 for all other values. It's a hack to avoid adding a conditional branch for the i = 0 case. Unfortunately, the data dependency on base probably prevents instruction-level parallelism in the rest of the computation.

I also tried this variant, but it was even slower:

#[inline(always)]
pub fn base_dist(i: usize) -> usize {
    let base = std::cmp::min(1, i);
    let x = 1 << ((i + 1) >> 1);
    let y = 1 << (i >> 1);
    base * (x + y) / 2
}

@folkertdev
Collaborator Author

folkertdev commented Jan 4, 2025

Hmm, given that most lookups won't be with index 0, maybe a branch is OK here? Or something using (i == 0) as usize. Or maybe the small lookup table is just the best option.

@folkertdev
Collaborator Author

I code-golfed this to

pub const fn base_dist(i: usize) -> u16 {
    let base = (i != 0) as usize;
    let power = base << (i / 2);
    (power | (power >> (i % 2 != 0) as usize)) as u16
}

but it's still not consistently faster. Also, https://oeis.org/A029744 was interesting but not actually very helpful.

I suspect that a more compact formulation of the bigger tables would be an improvement, but I haven't cracked the code there yet.
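For the record, the golfed version above also matches the full BASE_DIST table (this check is mine, added for documentation):

```rust
// The code-golfed version from the comment above, unchanged.
pub const fn base_dist(i: usize) -> u16 {
    let base = (i != 0) as usize;
    let power = base << (i / 2);
    (power | (power >> (i % 2 != 0) as usize)) as u16
}

fn main() {
    // The BASE_DIST table from earlier in the thread (D_CODES == 30).
    const BASE_DIST: [u16; 30] = [
        0, 1, 2, 3, 4, 6, 8, 12, 16, 24,
        32, 48, 64, 96, 128, 192, 256, 384, 512, 768,
        1024, 1536, 2048, 3072, 4096, 6144, 8192, 12288, 16384, 24576,
    ];
    for (i, &expected) in BASE_DIST.iter().enumerate() {
        assert_eq!(base_dist(i), expected);
    }
}
```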

@brian-pane

The deflate performance is now much closer at compression level 1: about 3% more CPU cycles than zlib-ng on x86_64 when compiled with RUSTFLAGS="-Cllvm-args=-enable-dfa-jump-thread -Ctarget-cpu=native"

Benchmark 1 (69 runs): ./blogpost-compress-baseline-native 1 ng silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          73.0ms ± 1.51ms    71.0ms … 84.1ms          5 ( 7%)        0%
  peak_rss           26.6MB ± 15.8KB    26.5MB … 26.6MB          1 ( 1%)        0%
  cpu_cycles          275M  ±  502K      274M  …  276M           0 ( 0%)        0%
  instructions        454M  ±  224       454M  …  454M           1 ( 1%)        0%
  cache_references    265K  ± 5.94K      260K  …  297K           7 (10%)        0%
  cache_misses        231K  ± 8.54K      195K  …  242K           5 ( 7%)        0%
  branch_misses      3.23M  ± 5.97K     3.22M  … 3.24M           0 ( 0%)        0%
Benchmark 2 (68 runs): ./target/release/examples/blogpost-compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          73.6ms ±  653us    72.6ms … 76.3ms          2 ( 3%)          +  0.8% ±  0.5%
  peak_rss           26.7MB ± 68.0KB    26.5MB … 26.7MB          0 ( 0%)          +  0.2% ±  0.1%
  cpu_cycles          283M  ±  653K      282M  …  285M           5 ( 7%)        💩+  3.1% ±  0.1%
  instructions        536M  ±  247       536M  …  536M           0 ( 0%)        💩+ 17.9% ±  0.0%
  cache_references    267K  ± 14.5K      262K  …  370K           8 (12%)          +  0.9% ±  1.4%
  cache_misses        230K  ± 11.6K      197K  …  241K          10 (15%)          -  0.2% ±  1.5%
  branch_misses      3.03M  ± 9.58K     3.02M  … 3.05M           0 ( 0%)        ⚡-  6.0% ±  0.1%

The higher compression levels have a bigger performance gap:

Benchmark 1 (44 runs): ./blogpost-compress-baseline-native 2 ng silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           116ms ±  415us     115ms …  117ms          0 ( 0%)        0%
  peak_rss           25.0MB ± 19.8KB    24.9MB … 25.0MB          1 ( 2%)        0%
  cpu_cycles          459M  ± 1.04M      457M  …  461M           0 ( 0%)        0%
  instructions        891M  ±  295       891M  …  891M           1 ( 2%)        0%
  cache_references    266K  ± 3.62K      263K  …  279K           4 ( 9%)        0%
  cache_misses        229K  ± 9.20K      198K  …  237K           5 (11%)        0%
  branch_misses      6.81M  ± 17.4K     6.79M  … 6.87M           4 ( 9%)        0%
Benchmark 2 (42 runs): ./target/release/examples/blogpost-compress 2 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           122ms ±  556us     121ms …  123ms          3 ( 7%)        💩+  5.2% ±  0.2%
  peak_rss           25.0MB ± 65.1KB    24.9MB … 25.0MB          0 ( 0%)          -  0.3% ±  0.1%
  cpu_cycles          496M  ± 1.15M      494M  …  499M           0 ( 0%)        💩+  8.3% ±  0.1%
  instructions       1.06G  ±  308      1.06G  … 1.06G           1 ( 2%)        💩+ 18.4% ±  0.0%
  cache_references    284K  ± 59.5K      266K  …  650K           4 (10%)          +  6.6% ±  6.7%
  cache_misses        235K  ± 4.92K      208K  …  241K           1 ( 2%)        💩+  2.6% ±  1.4%
  branch_misses      6.83M  ± 7.78K     6.82M  … 6.85M           0 ( 0%)          +  0.4% ±  0.1%
