
fix deflate performance #18

Open
folkertdev opened this issue Feb 1, 2024 · 15 comments


@folkertdev
Collaborator

We are consistently ~10% slower than zlib-ng on deflate. The gap fluctuates across compression levels, but currently none of them are on par with zlib-ng.

There is so far no obvious reason for this slowdown, so it's likely a "death by a thousand papercuts" sort of thing.

@folkertdev
Collaborator Author

As a data point, at commit 8048180:

Benchmark 1 (29 runs): cargo run --release --example compress 1 ng silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           175ms ± 8.58ms     168ms …  217ms          1 ( 3%)        0%
  peak_rss           26.8MB ± 87.3KB    26.6MB … 27.0MB          0 ( 0%)        0%
  cpu_cycles          515M  ± 6.88M      505M  …  529M           0 ( 0%)        0%
  instructions        744M  ± 35.3K      744M  …  744M           1 ( 3%)        0%
  cache_references   11.3M  ± 1.18M     9.08M  … 14.1M           0 ( 0%)        0%
  cache_misses       2.25M  ± 66.4K     2.12M  … 2.40M           1 ( 3%)        0%
  branch_misses      4.13M  ± 15.1K     4.10M  … 4.17M           0 ( 0%)        0%
Benchmark 2 (27 runs): cargo run --release --example compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           191ms ± 4.14ms     187ms …  207ms          1 ( 4%)        💩+  9.2% ±  2.1%
  peak_rss           26.7MB ± 93.5KB    26.6MB … 26.9MB          0 ( 0%)          -  0.3% ±  0.2%
  cpu_cycles          580M  ± 10.7M      570M  …  619M           1 ( 4%)        💩+ 12.7% ±  0.9%
  instructions        873M  ± 31.4K      873M  …  873M           0 ( 0%)        💩+ 17.3% ±  0.0%
  cache_references   11.4M  ± 1.16M     9.31M  … 13.6M           0 ( 0%)          +  0.9% ±  5.6%
  cache_misses       2.33M  ± 74.4K     2.22M  … 2.49M           0 ( 0%)        💩+  3.6% ±  1.7%
  branch_misses      4.39M  ± 42.1K     4.34M  … 4.52M           1 ( 4%)        💩+  6.2% ±  0.4%

@bjorn3
Collaborator

bjorn3 commented Oct 14, 2024

#223 should have helped a fair bit.

@folkertdev
Collaborator Author

Yes, but there is still a significant gap. (Note that the earlier benchmark ran through cargo, which is why its numbers are higher.)

Benchmark 1 (68 runs): target/release/examples/blogpost-compress 1 ng silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          73.9ms ± 1.02ms    72.7ms … 80.2ms          3 ( 4%)        0%
  peak_rss           26.6MB ± 42.5KB    26.5MB … 26.6MB          8 (12%)        0%
  cpu_cycles          264M  ± 1.79M      261M  …  270M           2 ( 3%)        0%
  instructions        460M  ±  266       460M  …  460M           1 ( 1%)        0%
  cache_references   18.9M  ±  233K     18.5M  … 19.7M           5 ( 7%)        0%
  cache_misses        507K  ± 85.4K      408K  …  949K           2 ( 3%)        0%
  branch_misses      3.32M  ± 5.96K     3.30M  … 3.33M           0 ( 0%)        0%
Benchmark 2 (64 runs): target/release/examples/blogpost-compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          78.6ms ±  990us    77.4ms … 83.3ms          5 ( 8%)        💩+  6.5% ±  0.5%
  peak_rss           26.7MB ± 65.8KB    26.6MB … 26.7MB          0 ( 0%)          +  0.3% ±  0.1%
  cpu_cycles          287M  ± 4.11M      284M  …  310M           9 (14%)        💩+  8.8% ±  0.4%
  instructions        633M  ±  276       633M  …  633M           0 ( 0%)        💩+ 37.7% ±  0.0%
  cache_references   19.9M  ±  200K     19.7M  … 20.6M           6 ( 9%)        💩+  5.5% ±  0.4%
  cache_misses        440K  ± 95.8K      327K  …  778K           1 ( 2%)        ⚡- 13.3% ±  6.1%
  branch_misses      3.08M  ± 10.8K     3.06M  … 3.11M           9 (14%)        ⚡-  7.3% ±  0.1%

@brian-pane

In case it's helpful for anyone else, here's a CPU profile of the current main. I used https://github.com/mstange/samply to visualize it, since that tool shows the cost of inlined functions. (Sorry for sharing this in the form of a screenshot. I haven't yet found a way to export the output of the tool as plain text.)

Methodology:

perf record -F max -e instructions -o perf-rs.data target/release/examples/compress 1 rs silesia-small.tar
samply import perf-rs.data

Notes on the profile results:

  • A surprisingly large amount of CPU time is spent turning the WeakSlices into temporary slices (all the WeakSliceMut functions, plus Window::filled). I looked at the generated assembly for these functions, though, and there's nothing surprising there: it's just a pair of 64-bit reads to fetch the pointer and length fields from the appropriate structs.
  • BitWriter::emit_dist takes a lot of time, but that seems to be because of all the table lookups it needs to do. I haven't found any obvious ways to speed that up.
  • Crc32HashCalc::quick_insert_string is a hotspot. A source-line-level profile shows almost all of the "self" time of that function in these two lines, which may be tricky to improve because they're just a straightforward array read and an unavoidable conditional branch:
        let head = state.head.as_slice()[hm];
        if head != string as u16 {
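Out of curiosity, here is a sketch of what such a weak-slice wrapper plausibly looks like. Everything below is a guess at the shape for illustration, not the actual zlib-rs code; the point is that materializing the slice really is just two word-sized loads plus `slice::from_raw_parts_mut`:

```rust
use core::marker::PhantomData;

// Hypothetical minimal model of a WeakSliceMut-style wrapper (names assumed,
// not the real zlib-rs definitions).
pub struct WeakSliceMut<'a, T> {
    ptr: *mut T,
    len: usize,
    _marker: PhantomData<&'a mut [T]>,
}

impl<'a, T> WeakSliceMut<'a, T> {
    pub fn new(slice: &'a mut [T]) -> Self {
        Self {
            ptr: slice.as_mut_ptr(),
            len: slice.len(),
            _marker: PhantomData,
        }
    }

    // Two loads (pointer, length) and a from_raw_parts_mut; this is the kind
    // of conversion the profile attributes cost to.
    pub fn as_mut_slice(&mut self) -> &mut [T] {
        // SAFETY: constructed from a live &mut [T] with matching length.
        unsafe { core::slice::from_raw_parts_mut(self.ptr, self.len) }
    }
}

fn main() {
    let mut buf = [1u8, 2, 3];
    let mut weak = WeakSliceMut::new(&mut buf);
    weak.as_mut_slice()[0] = 9;
    assert_eq!(buf, [9, 2, 3]);
}
```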

[screenshot: samply profile]

@folkertdev
Collaborator Author

That corresponds well with what we've been seeing.

A surprisingly large amount of CPU time is spent turning the WeakSlices into temporary slices

You're measuring instructions, right, not CPU time? Or maybe I'm misinterpreting? Anyway, if it is just instructions, this might be a bit misleading because of instruction-level parallelism.

BitWriter::emit_dist takes a lot of time

Yeah, I've tried turning those tables into integer constants and using bit shifting/masking before, but that did not help. I think I only tried it with SIMD registers, though; maybe using general-purpose registers would work?

It really seems like something should be possible here, but I've not had much luck so far.
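For what it's worth, at least one of the tables emit_dist consults, the per-distance-code extra-bits table, has a trivial closed form (this mirrors zlib's static extra_dbits table; the Rust function name is mine). Whether computing it beats a table load is exactly the open question here:

```rust
// extra_dbits[dcode] in zlib is 0,0,0,0,1,1,2,2,...,13,13, which is
// dcode / 2 - 1 for dcode >= 2 and 0 otherwise.
pub const fn dist_extra_bits(dcode: usize) -> usize {
    if dcode < 2 {
        0
    } else {
        dcode / 2 - 1
    }
}

fn main() {
    // Check against the classic zlib extra_dbits table (D_CODES == 30).
    let extra_dbits = [
        0, 0, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6,
        7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13,
    ];
    for (d, &e) in extra_dbits.iter().enumerate() {
        assert_eq!(dist_extra_bits(d), e);
    }
}
```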

Crc32HashCalc::quick_insert_string is a hotspot

Besides what you wrote, the read is basically a hash-map lookup, so it is quite likely that the memory you want is not in (L1) cache.

@brian-pane

When I collect the profile as CPU cycles instead of instructions, the WeakSlice methods and Window::filled are a smaller percentage, but still collectively over 10% of the total.

@folkertdev
Collaborator Author

Just want to document this for the future:

I tried replacing this table

#[rustfmt::skip]
pub const LENGTH_CODE: [u8; STD_MAX_MATCH-STD_MIN_MATCH+1] = [
     0,  1,  2,  3,  4,  5,  6,  7,  8,  8,  9,  9, 10, 10, 11, 11, 12, 12, 12, 12,
    13, 13, 13, 13, 14, 14, 14, 14, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16, 16, 16,
    17, 17, 17, 17, 17, 17, 17, 17, 18, 18, 18, 18, 18, 18, 18, 18, 19, 19, 19, 19,
    19, 19, 19, 19, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20,
    21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 22, 22, 22, 22,
    22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 23, 23, 23, 23, 23, 23, 23, 23,
    23, 23, 23, 23, 23, 23, 23, 23, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
    24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
    25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25,
    25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 26, 26, 26, 26, 26, 26, 26, 26,
    26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26,
    26, 26, 26, 26, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,
    27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 28
];

with a function

#[inline(always)]
pub const fn length_code(i: u8) -> u8 {
    match i {
        0..8 => i,
        255 => 28,
        _ => {
            // let log2 = i.ilog(2) // this is `7 - i.leading_zeros()`
            // ((log2 as u8 - 1) << 2) | ((i >> (log2 - 2)) & 0b11)
            let a = (6 - i.leading_zeros() as u8) << 2;
            let b = (i >> (5u8.saturating_sub(i.leading_zeros() as u8))) & 0b11;

            a | b
        }
    }
}

Yes, that is inscrutable and cursed, and it took forever to figure out. But it actually replicates the table. Sadly it still generates too many instructions to be worth it.
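For reference, here is a spot-check of the closed form against entries copied from the table above (I use `0..=7` instead of the exclusive range pattern so the sketch compiles on older compilers; otherwise the body is the same):

```rust
#[inline(always)]
pub const fn length_code(i: u8) -> u8 {
    match i {
        0..=7 => i,
        255 => 28,
        _ => {
            // log2-based reconstruction of the LENGTH_CODE table.
            let a = (6 - i.leading_zeros() as u8) << 2;
            let b = (i >> (5u8.saturating_sub(i.leading_zeros() as u8))) & 0b11;
            a | b
        }
    }
}

fn main() {
    // Values copied from the LENGTH_CODE table in the comment above.
    assert_eq!(length_code(0), 0);
    assert_eq!(length_code(7), 7);
    assert_eq!(length_code(8), 8);
    assert_eq!(length_code(19), 12);
    assert_eq!(length_code(64), 20);
    assert_eq!(length_code(254), 27);
    assert_eq!(length_code(255), 28);
    // Like the table, the closed form is non-decreasing in the match length.
    for i in 0u8..255 {
        assert!(length_code(i) <= length_code(i + 1));
    }
}
```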

@nmoinvaz

It might be worth benchmarking the hashing functions first to make sure they are on par with zlib-ng. Also consider checking all the functions in zlib-ng that are marked static inline, and see whether missing function inlining might be a cause.

@nmoinvaz

Another thing to check is the performance of level 0 (stored) compared to zlib-ng, to see whether the gap comes from something non-compression-related.

@brian-pane

Documenting another attempt to replace a lookup table with a calculation:

I tried replacing this:

pub const BASE_DIST: [u16; D_CODES] = [
    0,     1,     2,     3,     4,     6,     8,    12,    16,    24,
   32,    48,    64,    96,   128,   192,   256,   384,   512,   768,
 1024,  1536,  2048,  3072,  4096,  6144,  8192, 12288, 16384, 24576
];

with this, which generates the equivalent mapping:

#[inline(always)]
pub fn base_dist(i: usize) -> usize {
    let base = std::cmp::min(1, i);  // a hack to handle the i==0 case
    let x = base << ((i + 1) >> 1);
    let y = base << (i >> 1);
    (x + y) / 2
}

Unfortunately, the function version created a 2% regression in instruction count at low compression levels.
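As a sanity check (mine, not part of the original comment), the function really does reproduce the full 30-entry table:

```rust
// The calculation from the comment above, unchanged.
pub fn base_dist(i: usize) -> usize {
    let base = std::cmp::min(1, i); // a hack to handle the i == 0 case
    let x = base << ((i + 1) >> 1);
    let y = base << (i >> 1);
    (x + y) / 2
}

fn main() {
    // The BASE_DIST table from the comment above (D_CODES == 30).
    const BASE_DIST: [u16; 30] = [
        0, 1, 2, 3, 4, 6, 8, 12, 16, 24,
        32, 48, 64, 96, 128, 192, 256, 384, 512, 768,
        1024, 1536, 2048, 3072, 4096, 6144, 8192, 12288, 16384, 24576,
    ];
    for (i, &expected) in BASE_DIST.iter().enumerate() {
        assert_eq!(base_dist(i), expected as usize);
    }
}
```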

@folkertdev
Collaborator Author

Should that be cmp::max? Otherwise any input besides 0 returns the same value, I think?

Anyway, I guess it just produces too many instructions, with the shifts and all, and they depend on each other too much. You could try performing the call earlier, so that it maybe gets mixed in with other instructions at a lower cost?

@brian-pane

The key to the cmp::min is how the result is used in the next two lines. If the computation didn't have to handle the i=0 case, this simple formula would work:

    let x = 1 << ((i + 1) >> 1);
    let y = 1 << (i >> 1);
    (x + y) / 2

But that produces the wrong result for i = 0, so the leftmost 1 in those shift operations is replaced with base, which is 0 for i = 0 and 1 for all other values. It's a hack to avoid adding a conditional branch for the i = 0 case. Unfortunately, the data dependency on base probably prevents instruction-level parallelism in the rest of the computation.

I also tried this variant, but it was even slower:

#[inline(always)]
pub fn base_dist(i: usize) -> usize {
    let base = std::cmp::min(1, i);
    let x = 1 << ((i + 1) >> 1);
    let y = 1 << (i >> 1);
    base * (x + y) / 2
}

@folkertdev
Collaborator Author

folkertdev commented Jan 4, 2025

Hmm, given that most lookups won't be with index 0, maybe a branch is OK here? Or something using (i == 0) as usize. Or maybe the small lookup table is just the best option.

@folkertdev
Collaborator Author

I code-golfed this to

pub const fn base_dist(i: usize) -> u16 {
    let base = (i != 0) as usize;
    let power = base << (i / 2);
    (power | (power >> (i % 2 != 0) as usize)) as u16
}

but it's still not consistently faster. Also, https://oeis.org/A029744 was interesting but not actually very helpful.

I suspect that a more compact formulation of the bigger tables would be an improvement, but I haven't cracked the code there yet.
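For the record, the golfed version above also matches the full BASE_DIST table (this check is mine, added for documentation):

```rust
// The code-golfed version from the comment above, unchanged.
pub const fn base_dist(i: usize) -> u16 {
    let base = (i != 0) as usize;
    let power = base << (i / 2);
    (power | (power >> (i % 2 != 0) as usize)) as u16
}

fn main() {
    // The BASE_DIST table from earlier in the thread (D_CODES == 30).
    const BASE_DIST: [u16; 30] = [
        0, 1, 2, 3, 4, 6, 8, 12, 16, 24,
        32, 48, 64, 96, 128, 192, 256, 384, 512, 768,
        1024, 1536, 2048, 3072, 4096, 6144, 8192, 12288, 16384, 24576,
    ];
    for (i, &expected) in BASE_DIST.iter().enumerate() {
        assert_eq!(base_dist(i), expected);
    }
}
```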

@brian-pane

The deflate performance is now much closer at compression level 1: about 3% more CPU cycles than zlib-ng on x86_64 when compiled with RUSTFLAGS="-Cllvm-args=-enable-dfa-jump-thread -Ctarget-cpu=native"

Benchmark 1 (69 runs): ./blogpost-compress-baseline-native 1 ng silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          73.0ms ± 1.51ms    71.0ms … 84.1ms          5 ( 7%)        0%
  peak_rss           26.6MB ± 15.8KB    26.5MB … 26.6MB          1 ( 1%)        0%
  cpu_cycles          275M  ±  502K      274M  …  276M           0 ( 0%)        0%
  instructions        454M  ±  224       454M  …  454M           1 ( 1%)        0%
  cache_references    265K  ± 5.94K      260K  …  297K           7 (10%)        0%
  cache_misses        231K  ± 8.54K      195K  …  242K           5 ( 7%)        0%
  branch_misses      3.23M  ± 5.97K     3.22M  … 3.24M           0 ( 0%)        0%
Benchmark 2 (68 runs): ./target/release/examples/blogpost-compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          73.6ms ±  653us    72.6ms … 76.3ms          2 ( 3%)          +  0.8% ±  0.5%
  peak_rss           26.7MB ± 68.0KB    26.5MB … 26.7MB          0 ( 0%)          +  0.2% ±  0.1%
  cpu_cycles          283M  ±  653K      282M  …  285M           5 ( 7%)        💩+  3.1% ±  0.1%
  instructions        536M  ±  247       536M  …  536M           0 ( 0%)        💩+ 17.9% ±  0.0%
  cache_references    267K  ± 14.5K      262K  …  370K           8 (12%)          +  0.9% ±  1.4%
  cache_misses        230K  ± 11.6K      197K  …  241K          10 (15%)          -  0.2% ±  1.5%
  branch_misses      3.03M  ± 9.58K     3.02M  … 3.05M           0 ( 0%)        ⚡-  6.0% ±  0.1%

The higher compression levels have a bigger performance gap:

Benchmark 1 (44 runs): ./blogpost-compress-baseline-native 2 ng silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           116ms ±  415us     115ms …  117ms          0 ( 0%)        0%
  peak_rss           25.0MB ± 19.8KB    24.9MB … 25.0MB          1 ( 2%)        0%
  cpu_cycles          459M  ± 1.04M      457M  …  461M           0 ( 0%)        0%
  instructions        891M  ±  295       891M  …  891M           1 ( 2%)        0%
  cache_references    266K  ± 3.62K      263K  …  279K           4 ( 9%)        0%
  cache_misses        229K  ± 9.20K      198K  …  237K           5 (11%)        0%
  branch_misses      6.81M  ± 17.4K     6.79M  … 6.87M           4 ( 9%)        0%
Benchmark 2 (42 runs): ./target/release/examples/blogpost-compress 2 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           122ms ±  556us     121ms …  123ms          3 ( 7%)        💩+  5.2% ±  0.2%
  peak_rss           25.0MB ± 65.1KB    24.9MB … 25.0MB          0 ( 0%)          -  0.3% ±  0.1%
  cpu_cycles          496M  ± 1.15M      494M  …  499M           0 ( 0%)        💩+  8.3% ±  0.1%
  instructions       1.06G  ±  308      1.06G  … 1.06G           1 ( 2%)        💩+ 18.4% ±  0.0%
  cache_references    284K  ± 59.5K      266K  …  650K           4 (10%)          +  6.6% ±  6.7%
  cache_misses        235K  ± 4.92K      208K  …  241K           1 ( 2%)        💩+  2.6% ±  1.4%
  branch_misses      6.83M  ± 7.78K     6.82M  … 6.85M           0 ( 0%)          +  0.4% ±  0.1%
