Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LibCompress: Speed up CanonicalCode::read_symbol() slow path #25008

Merged
merged 1 commit into from
Sep 14, 2024

Conversation

nico
Copy link
Contributor

@nico nico commented Sep 9, 2024

Symbols that need <= 8 bits hit a fast path as of #18075, but the slow path has done a full binary search over all symbols ever since this code was added in #2963. (#3405 even added a FIXME for doing this, but #18075 removed it.)

Instead of doing a binary search over all codes for every single bit read, this implements the Moffat-Turpin approach described at https://www.hanshq.net/zip.html#huffdec, which only requires a table read per bit.

hyperfine 'Build/lagom/bin/unzip ~/Downloads/enwik8.zip'
1.008 s ± 0.016 s  =>  957.7 ms ± 3.9 ms, 5% faster

Due to issue #25005, we can't peek the full 15 bits at once but have to read them one-by-one. This makes the code look a bit different than in the linked article.

I also tried not changing CanonicalCode::from_bytes() too much. It does 15 passes over all symbols. I think it could do it in a single pass instead. But that's for a future change.

No behavior change (other than slightly faster perf).

@nico nico requested a review from timschumi as a code owner September 9, 2024 22:07
@github-actions github-actions bot added the 👀 pr-needs-review PR needs review from a maintainer or community member label Sep 9, 2024
@nico
Copy link
Contributor Author

nico commented Sep 9, 2024

(I do have a local fix for #25005, but while it simplifies a whole bunch of stuff and arguably fixes a bug, it does cost some performance. If it ever lands, this is what can happen on top of this PR:

% diff -u deflate-half.cpp deflate-full.cpp                     
--- deflate-half.cpp	2024-09-09 18:07:46
+++ deflate-full.cpp	2024-09-09 18:11:02
@@ -128,7 +128,7 @@
             next_code++;
         }
 
-        u32 sentinel = next_code;
+        u32 sentinel = (uint32_t)(next_code << (15 - code_length));
         code.m_first_symbol_of_length_after.append(sentinel);
         VERIFY(code.m_first_symbol_of_length_after[code_length] == sentinel);
 
@@ -166,15 +166,16 @@
         return symbol_value;
     }
 
-    auto code_bits = TRY(stream.read_bits<u16>(m_max_prefixed_code_length + 1));
-    code_bits = fast_reverse16(code_bits, m_max_prefixed_code_length + 1);
+    auto code_bits = TRY(stream.peek_bits<u16>(15));
+    code_bits = fast_reverse16(code_bits, 15);
 
     for (size_t i = m_max_prefixed_code_length + 1; i <= 15; ++i) {
         if (code_bits < m_first_symbol_of_length_after[i]) {
+            code_bits >>= 15 - i;
             auto symbol_index = (uint16_t)(m_offset_to_first_symbol_index[i] + code_bits);
+            TRY(stream.discard_previously_peeked_bits(i));
             return m_symbol_values[symbol_index];
         }
-        code_bits = code_bits << 1 | TRY(stream.read_bit());
     }
 
     return Error::from_string_literal("Symbol exceeds maximum symbol number");

It's a bit less work in the hot loop of the slow path, but since my local fix requires adding a check to the hot loop of the fast path, it's currently a net loss.)

@nico
Copy link
Contributor Author

nico commented Sep 10, 2024

(macOS system unzip needs 682.4 ms ± 5.7 ms per hyperfine 'unzip -o ~/Downloads/enwik8.zip', so quite a ways to go after this still.)

@nico
Copy link
Contributor Author

nico commented Sep 10, 2024

(and hwzip needs 582.4 ms ± 2.7 ms.)

Symbols that need <= 8 bits hit a fast path as of SerenityOS#18075, but
the slow path has done a full binary search over all symbols
ever since this code was added in SerenityOS#2963. (SerenityOS#3405 even added a FIXME
for doing this, but SerenityOS#18075 removed it.)

Instead of doing a binary search over all codes for every single
bit read, this implements the Moffat-Turpin approach described at
https://www.hanshq.net/zip.html#huffdec, which only requires a
table read per bit.

    hyperfine 'Build/lagom/bin/unzip ~/Downloads/enwik8.zip'
    1.008 s ± 0.016 s  =>  957.7 ms ± 3.9 ms, 5% faster

Due to issue SerenityOS#25005, we can't peek the full 15 bits at once but
have to read them one-by-one. This makes the code look a bit
different than in the linked article.

I also tried not changing CanonicalCode::from_bytes() too much.
It does 15 passes over all symbols. I think it could do it in
a single pass instead. But that's for a future change.

No behavior change (other than slightly faster perf).
@nico nico force-pushed the deflate-moffat-turpin-redux branch from 31a475a to 9945266 Compare September 10, 2024 00:55
@timschumi timschumi merged commit b61e670 into SerenityOS:master Sep 14, 2024
14 checks passed
@github-actions github-actions bot removed the 👀 pr-needs-review PR needs review from a maintainer or community member label Sep 14, 2024
@nico nico deleted the deflate-moffat-turpin-redux branch September 14, 2024 11:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants