LibCompress: Speed up CanonicalCode::read_symbol() slow path #25008

nico · 2024-09-09T22:07:00Z

Symbols that need <= 8 bits hit a fast path as of #18075, but the slow path has done a full binary search over all symbols ever since this code was added in #2963. (#3405 even added a FIXME for doing this, but #18075 removed it.)

Instead of doing a binary search over all codes for every single bit read, this implements the Moffat-Turpin approach described at https://www.hanshq.net/zip.html#huffdec, which only requires a table read per bit.

hyperfine 'Build/lagom/bin/unzip ~/Downloads/enwik8.zip'
1.008 s ± 0.016 s  =>  957.7 ms ± 3.9 ms, 5% faster

Due to issue #25005, we can't peek the full 15 bits at once but have to read them one-by-one. This makes the code look a bit different than in the linked article.

I also tried not changing CanonicalCode::from_bytes() too much. It does 15 passes over all symbols. I think it could do it in a single pass instead. But that's for a future change.

No behavior change (other than slightly faster perf).

nico · 2024-09-09T22:12:08Z

(I do have a local fix for #25005, but while it simplifies a whole bunch of stuff and arguably fixes a bug, it does cost some performance. If it ever lands, this is what can happen on top of this PR:

% diff -u deflate-half.cpp deflate-full.cpp                     
--- deflate-half.cpp	2024-09-09 18:07:46
+++ deflate-full.cpp	2024-09-09 18:11:02
@@ -128,7 +128,7 @@
             next_code++;
         }
 
-        u32 sentinel = next_code;
+        u32 sentinel = (uint32_t)(next_code << (15 - code_length));
         code.m_first_symbol_of_length_after.append(sentinel);
         VERIFY(code.m_first_symbol_of_length_after[code_length] == sentinel);
 
@@ -166,15 +166,16 @@
         return symbol_value;
     }
 
-    auto code_bits = TRY(stream.read_bits<u16>(m_max_prefixed_code_length + 1));
-    code_bits = fast_reverse16(code_bits, m_max_prefixed_code_length + 1);
+    auto code_bits = TRY(stream.peek_bits<u16>(15));
+    code_bits = fast_reverse16(code_bits, 15);
 
     for (size_t i = m_max_prefixed_code_length + 1; i <= 15; ++i) {
         if (code_bits < m_first_symbol_of_length_after[i]) {
+            code_bits >>= 15 - i;
             auto symbol_index = (uint16_t)(m_offset_to_first_symbol_index[i] + code_bits);
+            TRY(stream.discard_previously_peeked_bits(i));
             return m_symbol_values[symbol_index];
         }
-        code_bits = code_bits << 1 | TRY(stream.read_bit());
     }
 
     return Error::from_string_literal("Symbol exceeds maximum symbol number");

It's a bit less work in the hot loop of the slow path, but since my local fix requires adding a check to the hot loop of the fast path, it's currently a net loss.)

nico · 2024-09-10T00:30:40Z

(macOS system unzip needs 682.4 ms ± 5.7 ms per hyperfine 'unzip -o ~/Downloads/enwik8.zip', so quite a ways to go after this still.)

nico · 2024-09-10T00:40:12Z

(and hwzip needs 582.4 ms ± 2.7 ms.)

Symbols that need <= 8 bits hit a fast path as of SerenityOS#18075, but the slow path has done a full binary search over all symbols ever since this code was added in SerenityOS#2963. (SerenityOS#3405 even added a FIXME for doing this, but SerenityOS#18075 removed it.) Instead of doing a binary search over all codes for every single bit read, this implements the Moffat-Turpin approach described at https://www.hanshq.net/zip.html#huffdec, which only requires a table read per bit. hyperfine 'Build/lagom/bin/unzip ~/Downloads/enwik8.zip' 1.008 s ± 0.016 s => 957.7 ms ± 3.9 ms, 5% faster Due to issue SerenityOS#25005, we can't peek the full 15 bits at once but have to read them one-by-one. This makes the code look a bit different than in the linked article. I also tried not changing CanonicalCode::from_bytes() too much. It does 15 passes over all symbols. I think it could do it in a single pass instead. But that's for a future change. No behavior change (other than slightly faster perf).

nico requested a review from timschumi as a code owner September 9, 2024 22:07

github-actions bot added the 👀 pr-needs-review PR needs review from a maintainer or community member label Sep 9, 2024

nico force-pushed the deflate-moffat-turpin-redux branch from 31a475a to 9945266 Compare September 10, 2024 00:55

timschumi approved these changes Sep 14, 2024

View reviewed changes

timschumi merged commit b61e670 into SerenityOS:master Sep 14, 2024
14 checks passed

github-actions bot removed the 👀 pr-needs-review PR needs review from a maintainer or community member label Sep 14, 2024

nico deleted the deflate-moffat-turpin-redux branch September 14, 2024 11:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

LibCompress: Speed up CanonicalCode::read_symbol() slow path #25008

LibCompress: Speed up CanonicalCode::read_symbol() slow path #25008

nico commented Sep 9, 2024 •

edited

Loading

nico commented Sep 9, 2024

nico commented Sep 10, 2024

nico commented Sep 10, 2024

LibCompress: Speed up CanonicalCode::read_symbol() slow path #25008

LibCompress: Speed up CanonicalCode::read_symbol() slow path #25008

Conversation

nico commented Sep 9, 2024 • edited Loading

nico commented Sep 9, 2024

nico commented Sep 10, 2024

nico commented Sep 10, 2024

nico commented Sep 9, 2024 •

edited

Loading