fix deflate performance #18
as a data point, commit 8048180
#223 should have helped a fair bit.
yes, but there is still a significant gap (note that that benchmark runs
In case it's helpful for anyone else, here's a CPU profile of the current code.

Methodology:
Notes on the profile results:
That corresponds well with what we've been seeing.
You're measuring instructions, right, not CPU time? Or maybe I'm misinterpreting? Anyway, if it is just instructions, this might be a bit misleading because of instruction-level parallelism.
Yeah, I've tried to turn those tables into integer constants and use bit shifting/masking before, but that did not help. I think I only tried it with SIMD registers though; maybe using standard registers would work? It really seems like something should be possible here, but I've not had much luck so far.
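To make that idea concrete, here is a minimal, purely illustrative sketch of the "table packed into an integer constant" approach: a hypothetical 8-entry table (not any actual table from zlib-rs) is packed into a `u32` at compile time and read back with a shift and a mask instead of a memory load.

```rust
/// Hypothetical 8-entry table packed into a `u32`, 4 bits per entry.
/// Illustrative only; not a table that exists in the code base.
const PACKED: u32 = {
    let table = [0u32, 1, 2, 3, 4, 4, 5, 5];
    let mut packed = 0u32;
    let mut i = 0;
    while i < table.len() {
        packed |= table[i] << (4 * i);
        i += 1;
    }
    packed
};

/// Reads entry `i` with a shift and a mask instead of a memory load.
#[inline(always)]
const fn lookup(i: usize) -> u32 {
    (PACKED >> (4 * i)) & 0xF
}

fn main() {
    assert_eq!(lookup(6), 5);
    println!("lookup(6) = {}", lookup(6));
}
```

Whether something like this beats an L1-resident lookup table in practice depends heavily on the surrounding code; as noted above, earlier attempts along these lines did not pay off.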
Besides what you wrote, the read is basically a hashmap lookup, so it is quite likely that the memory you want is not in (L1) cache.
When I collect the profile as CPU cycles instead of instructions, the
Just want to document this for the future. I tried replacing this table:

```rust
#[rustfmt::skip]
pub const LENGTH_CODE: [u8; STD_MAX_MATCH - STD_MIN_MATCH + 1] = [
    0, 1, 2, 3, 4, 5, 6, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 12, 12,
    13, 13, 13, 13, 14, 14, 14, 14, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16, 16, 16,
    17, 17, 17, 17, 17, 17, 17, 17, 18, 18, 18, 18, 18, 18, 18, 18, 19, 19, 19, 19,
    19, 19, 19, 19, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20,
    21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 22, 22, 22, 22,
    22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 23, 23, 23, 23, 23, 23, 23, 23,
    23, 23, 23, 23, 23, 23, 23, 23, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
    24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
    25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25,
    25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 26, 26, 26, 26, 26, 26, 26, 26,
    26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26,
    26, 26, 26, 26, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,
    27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 28,
];
```

with a function:

```rust
#[inline(always)]
pub const fn length_code(i: u8) -> u8 {
    match i {
        0..8 => i,
        255 => 28,
        _ => {
            // let log2 = i.ilog(2) // this is `7 - i.leading_zeros()`
            // ((log2 as u8 - 1) << 2) | ((i >> (log2 - 2)) & 0b11)
            let a = (6 - i.leading_zeros() as u8) << 2;
            let b = (i >> (5u8.saturating_sub(i.leading_zeros() as u8))) & 0b11;
            a | b
        }
    }
}
```

Yes, that is inscrutable and cursed, and took forever to figure out. But it actually replicates the table. Sadly it still generates too many instructions to be worth it.
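As a quick way to convince yourself of that claim, here is a small sanity check (not from the original thread) that exhaustively compares the function against the table; it assumes the `LENGTH_CODE` table and `length_code` function shown above are in scope:

```rust
// Exhaustively verify that the formula reproduces the lookup table.
// Assumes `LENGTH_CODE` and `length_code` from the snippets above are in scope.
fn main() {
    for i in 0..=255u8 {
        assert_eq!(
            length_code(i),
            LENGTH_CODE[i as usize],
            "mismatch at index {i}"
        );
    }
    println!("length_code matches LENGTH_CODE for all 256 inputs");
}
```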
It might be worth trying to benchmark the hashing functions first to make sure they are on par with zlib-ng. Also consider checking all functions in zlib-ng that have
Another thing to check is the performance of level 0 (stored) compared to zlib-ng, to see whether the slowdown is something non-compression-related.
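For a rough version of that comparison, something like the following sketch could work. It uses the `flate2` crate (which can select among zlib backends, including zlib-ng, via cargo features) rather than whatever harness the project actually uses, so treat the crate choice and the synthetic input as assumptions:

```rust
// Rough timing sketch: compare level 0 (stored) against other compression
// levels. The input data is synthetic and only a placeholder for a real corpus.
use std::io::Write;
use std::time::Instant;

use flate2::write::ZlibEncoder;
use flate2::Compression;

fn main() -> std::io::Result<()> {
    let input: Vec<u8> = (0..16 * 1024 * 1024).map(|i| (i % 251) as u8).collect();

    for level in [0u32, 1, 6, 9] {
        let start = Instant::now();
        let mut encoder = ZlibEncoder::new(Vec::new(), Compression::new(level));
        encoder.write_all(&input)?;
        let compressed = encoder.finish()?;
        println!(
            "level {level}: {} -> {} bytes in {:?}",
            input.len(),
            compressed.len(),
            start.elapsed()
        );
    }
    Ok(())
}
```

Running the same binary against two backends (or two builds) and comparing the level-0 times would show whether the gap exists even when no real compression work is happening.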
Documenting another attempt to replace a lookup table with a calculation: I tried replacing this:
with this, which generates the equivalent mapping:
Unfortunately, the function version created a 2% regression in instruction count at low compression levels.
Should that be

Anyway, I guess it just produces too many instructions, with the shifts and all, and they all kind of depend on each other too much. You could try performing the call earlier, so it maybe gets mixed with some other instructions at a lower cost?
The key to the
But that produces the wrong result for i=0, so the leftmost

I also tried this variant, but it was even slower:
Hmm, given that most lookups won't be with index 0, maybe a branch is ok here? Or something using
I code-golfed this to:

```rust
pub const fn base_dist(i: usize) -> u16 {
    let base = (i != 0) as usize;
    let power = base << (i / 2);
    (power | (power >> (i % 2 != 0) as usize)) as u16
}
```

but it's still not consistently faster. Also, https://oeis.org/A029744 was interesting but not actually very helpful. I suspect that a more compact formulation of the bigger tables would be an improvement, but I haven't really cracked the code yet there.
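As with the length-code experiment above, here is a small check (not from the thread) of the code-golfed formula against the start of the distance-base table. The expected values below assume the table in question matches zlib's `base_dist`, i.e. the RFC 1951 base distances minus one:

```rust
// The code-golfed formula, copied from the comment above.
pub const fn base_dist(i: usize) -> u16 {
    let base = (i != 0) as usize;
    let power = base << (i / 2);
    (power | (power >> (i % 2 != 0) as usize)) as u16
}

fn main() {
    // First entries of zlib's base_dist table (RFC 1951 base distances minus one);
    // treat this as an assumption about which table the formula is meant to replace.
    let expected: [u16; 14] = [0, 1, 2, 3, 4, 6, 8, 12, 16, 24, 32, 48, 64, 96];
    for (i, &want) in expected.iter().enumerate() {
        assert_eq!(base_dist(i), want, "mismatch at index {i}");
    }
    println!("base_dist formula matches the first {} entries", expected.len());
}
```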
The deflate performance is now much closer at compression level 1: about 3% more CPU cycles than zlib-ng on x86_64 when compiled with
The higher compression levels have a bigger performance gap:
We are consistently ~10% slower than zlib-ng on deflate. This fluctuates with the different compression levels, but currently none of them are on par with zlib-ng.
There is so far no obvious reason for this slowdown, so it's likely a "death by a thousand papercuts" sort of thing.