
Faster BoolReader #124

Merged: 3 commits, Jan 11, 2025

Conversation

SLiV9
Contributor

@SLiV9 SLiV9 commented Dec 28, 2024

  • Moved BoolReader to its own file.
  • It now reads from the buffer in chunks of 4 bytes at a time, except for the final 0-3 bytes.
  • Optimize successive calls to read_bool and read_with_tree by assuming none of them reach the end of the buffer and returning a transparent BitResult, then validating at the end.
  • Optimize each individual call to read_bool and read_with_tree by assuming each bit can be read from the 4-byte chunks (in FastReader), and retrying with the slow approach if this fails.
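The deferred-validation idea in the last two bullets can be sketched roughly as follows. This is a hedged illustration, not the PR's actual code: the field names and the `Option` return are assumptions, and the decode logic is stubbed out.

```rust
// Sketch of the "assume we never hit the end of the buffer, validate
// once at the end" pattern. BitResult is a transparent wrapper, so a
// run of reads costs no bounds checks; one branch afterwards decides
// whether the whole run was valid.
#[derive(Clone, Copy)]
struct BitResult(u8); // decoded bit; garbage if the reader ran past the end

struct BoolReader {
    consumed: usize, // optimistically consumed so far
    len: usize,      // actual buffer length
}

impl BoolReader {
    // Fast path: no bounds check, just record how much was consumed.
    fn read_bit_unchecked(&mut self) -> BitResult {
        self.consumed += 1;
        BitResult(0) // placeholder: the real decoder produces the bit here
    }

    // A single branch validates an arbitrarily long run of unchecked reads.
    fn validate(&self, result: BitResult) -> Option<u8> {
        (self.consumed <= self.len).then_some(result.0)
    }
}
```

A run of `read_bit_unchecked` calls followed by one `validate` replaces a per-bit bounds check with a single branch, which is the effect the bullet above attributes to the transparent `BitResult`.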

Final performance results are a 1.3x speedup compared to image-rs 0.2.0 (--use-reference), although it is still 1.3x slower than libwebp:

Summary
'dwebp -noasm -nofancy Puente.webp' ran
1.00 ± 0.01 times faster than 'dwebp -noasm -nofancy Puente.webp'
1.31 ± 0.01 times faster than 'target/release/image-webp-runner Puente.webp'
1.69 ± 0.01 times faster than 'target/release/image-webp-runner Puente.webp --use-reference'

(I ran dwebp as the first and the last candidate to negate any effects from my poor laptop's CPU overheating.)

This uses as_flattened_mut() which was stabilized in 1.80.0, so merging this probably requires raising the MSRV. I don't know your policy on that, but the alternative was adding unsafe or adding another dependency (that itself uses unsafe), so I left it as is.

PS:

  • I thought about extending the buffer with a few zero bytes so that everything can be read with the FastReader, but I don't think it would help much, and in the worst case it might require reallocating the buffer.
  • read_literal has some obvious optimizations, but it doesn't seem to be part of the latency-critical path.
  • There might be an optimization in read_flag's 1 + (((range - 1) * 128) >> 8), but it seems hard to measure.
  • I think further optimizations to other parts of the decoding might push us past libwebp's performance.
  • I tried my hand at coaxing the compiler to apply SIMD to src/transform.rs, but it was very dependent on preventing function inlining, and ultimately I didn't get any noticeable performance gains yet. I might try again later and create a separate PR.
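On the read_flag point above: with the probability fixed at 128, the generic split computation reduces to a shift, because multiplying by 128 and then shifting right by 8 is exactly a shift right by 1. A small sketch (the function names here are hypothetical, not the crate's API):

```rust
// Generic split used by the bool decoder for an arbitrary probability.
fn split_general(range: u32, prob: u32) -> u32 {
    1 + (((range - 1) * prob) >> 8)
}

// Specialization for prob == 128: (*128 then >>8) collapses to (>>1),
// i.e. the split is roughly half the range.
fn split_flag(range: u32) -> u32 {
    1 + ((range - 1) >> 1)
}
```

The two agree for every value of `range`, which is why a dedicated probability-128 path is a pure strength reduction rather than a behavior change.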

@Shnatsel
Contributor

transform.rs is used only for lossless images, so changing anything there won't affect lossy ones. You can create a lossless image with convert -quality 100 input.png output.webp and verify with webpinfo that the file is indeed lossless.

That said, lossless WebP is already plenty fast specifically due to optimizations to transforms. We actually beat dwebp -noasm in my tests, although dwebp when allowed to use handwritten assembly still beats us by 7% to 15% on lossless images.

@Shnatsel
Contributor

Regarding bit reading: libwebp has a dedicated codepath for reading with probability 128 that is distinct from the general-purpose one. Is that something that you've explored?

If you haven't attempted it, it doesn't have to be a part of this PR. I just wanted to know if this has been attempted or not.

I would expect this not to matter if the hot variant of read_bool gets inlined anyway - the constant propagation should probably take care of it.

@SLiV9
Contributor Author

SLiV9 commented Dec 28, 2024

transform.rs is used only for lossless images, so changing anything there won't affect lossy ones.

Huh, are you sure? I only mentioned it because idct4x4 showed up as 8% of the runtime in callgrind when running against the Puente image. I did some optimizations that involved renaming that function, and the new function was at 7.5% or something like that; either way, not enough to be measurable.

Not denying that it's already plenty fast, just that I'm certain it showed up in my call graphs inside read_coefficients().

@SLiV9
Contributor Author

SLiV9 commented Dec 28, 2024

Regarding bit reading: libwebp has a dedicated codepath for reading with probability 128 that is distinct from the general-purpose one. Is that something that you've explored?

If you haven't attempted it, it doesn't have to be a part of this PR. I just wanted to know if this has been attempted or not.

I would expect this not to matter if the hot variant of read_bool gets inlined anyway - the constant propagation should probably take care of it.

Yes, that's the read_flag optimization I mentioned in the PR description. I didn't end up doing it, and in fact the way I have the inlining set up actually prevents the compiler from doing any special optimizations for the 128 case. That's because too much inlining/specialization seemed to make everything 20% slower, which I theorize is due to instruction cache misses.
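The inlining setup described here can be sketched as a fast/slow split where only the fast path is inlined into callers. The bodies below are placeholders (a made-up condition and `count_ones` standing in for the real decode work); the attribute placement is the point:

```rust
// #[cold] + #[inline(never)] keeps the rarely-taken fallback out of
// the hot loop's machine code, one way to avoid the instruction-cache
// pressure mentioned above.
#[cold]
#[inline(never)]
fn slow_path(x: u32) -> u32 {
    x.count_ones() // stand-in for the byte-at-a-time decoder
}

#[inline(always)]
fn fast_path(x: u32) -> Option<u32> {
    // Stand-in condition for "enough bytes left for the 4-byte chunk path".
    (x < 1 << 16).then(|| x.count_ones())
}

fn read(x: u32) -> u32 {
    fast_path(x).unwrap_or_else(|| slow_path(x))
}
```

Callers see only the small inlined fast path plus a call into the out-of-line fallback, rather than both bodies duplicated at every call site.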

But indeed, that's something that can be revisited in a separate PR.

@fintelia
Contributor

transform.rs is for lossy images while lossless_transform.rs is for lossless images.

It might be worth renaming "bool reader" to "arithmetic decoder" or something to that effect, because it is doing boolean arithmetic coding rather than simply reading bits.

@Shnatsel
Contributor

FWIW there is no change on end-to-end benchmarks for the large image on my machine from the FastReader::read_flag optimization. It's possible that it helps other machines, just not mine.

@Shnatsel
Contributor

I can confirm this didn't break anything 🎉

No behavioral changes before and after on my corpus of 7,500 images scraped from the web.

@kornelski
Contributor

I've made clippy happy. Please rebase.

@Shnatsel
Contributor

Shnatsel commented Jan 5, 2025

And I'd like to get this merged before any further merge conflicts arise.

I don't think we can ship 1.80 MSRV just yet. It is very recent, and image tries to be more conservative with MSRV, currently at 1.70.

I see two viable options:

  1. Add a bytemuck dependency. We can then use cast_slice to accomplish .as_flattened_mut() on older rustc.
  2. Just copy-paste the implementation of .as_flattened_mut() from the standard library. It's only 6 lines of code. We can then replace it with the stabilized method once we can bump MSRV to 1.80.

Thoughts?
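For reference, a sketch of what option 2 amounts to for this crate's buffers: the core of std's as_flattened_mut, specialized here to `[[u8; 4]]` for illustration (the real std version is generic and lives in core::slice, and this route does require unsafe, unlike bytemuck):

```rust
// Roughly what copy-pasting std's as_flattened_mut looks like for
// 4-byte chunks; replaceable with the real method once MSRV is 1.80.
fn as_flattened_mut(chunks: &mut [[u8; 4]]) -> &mut [u8] {
    let len = chunks.len() * 4;
    // SAFETY: `[[u8; 4]]` with N elements has exactly the same layout
    // as `[u8]` with 4*N elements, and we hold an exclusive borrow.
    unsafe { std::slice::from_raw_parts_mut(chunks.as_mut_ptr().cast(), len) }
}
```

The trade-off between the two options is exactly this `unsafe` block versus a new dependency that encapsulates the same cast.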

@kornelski
Contributor

I vote for using bytemuck. It's already in the dependencies of the parent image crate.

@fintelia
Contributor

fintelia commented Jan 8, 2025

Honestly, maybe we should just bump the MSRV to 1.80. We've got other changes that have been waiting on that same version bump for a while, and it has been nearly 6 months since the 1.80 release.

image-rs does try to be conservative with MSRV bumps, but we've never fully worked out a policy on quite what that means. Mostly, we just try not to frivolously bump the version or pick anything super new. With those criteria, it turns out to be quite easy to go a long time without increasing the MSRV, and then it looks like we have a really conservative policy.

@Shnatsel
Contributor

In that case, this only needs a rebase against the latest main, and it's good to go! @SLiV9, can you handle that? I'll push the merge button immediately after so that it doesn't diverge again.

@SLiV9
Contributor Author

SLiV9 commented Jan 11, 2025

I just rebased. Let me know if you still want it to be bytemuck and I'll do it later today.

  1. Just copy-paste the implementation of .as_flattened_mut() from the standard library. It's only 6 lines of code. We can then replace it with the stabilized method once we can bump MSRV to 1.80.

I did take a look at inlining it, but as_flattened_mut uses unsafe, which is forbidden in this crate.

@kornelski kornelski merged commit 344ec6f into image-rs:main Jan 11, 2025
7 checks passed
@Shnatsel
Contributor

@SLiV9 thanks again for the PR! These are really impressive performance gains, and I don't think we would've been able to optimize this part ourselves.

@SLiV9
Contributor Author

SLiV9 commented Jan 11, 2025

Happy to help! Thanks for the challenge and the awesome library.

@Shnatsel
Contributor

With this optimization (and other optimizations that went into the 0.2.1 release), https://github.com/Shnatsel/wondermagick backed by image-webp is now faster than imagemagick at decoding and thumbnailing a WebP image!
