
Faster BoolReader #124

Merged: 3 commits, Jan 11, 2025

Conversation

SLiV9
Contributor

@SLiV9 SLiV9 commented Dec 28, 2024

  • Moved BoolReader to its own file.
  • It now reads from the buffer in chunks of 4 bytes at a time, except for the final 0-3 bytes.
  • Optimize successive calls to read_bool and read_with_tree by assuming none of them reach the end of the buffer and returning a transparent BitResult, then validating at the end.
  • Optimize each individual call to read_bool and read_with_tree by assuming each bit can be read from the 4-byte chunks (in FastReader), and retrying with the slow approach if this fails.
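The deferred-validation idea in the last two bullets can be sketched roughly as follows. This is a hedged illustration, not the PR's actual code: the field names and the `Option` return are assumptions, and the decode logic is stubbed out.

```rust
// Sketch of the "assume we never hit the end of the buffer, validate
// once at the end" pattern. BitResult is a transparent wrapper, so a
// run of reads costs no bounds checks; one branch afterwards decides
// whether the whole run was valid.
#[derive(Clone, Copy)]
struct BitResult(u8); // decoded bit; garbage if the reader ran past the end

struct BoolReader {
    consumed: usize, // optimistically consumed so far
    len: usize,      // actual buffer length
}

impl BoolReader {
    // Fast path: no bounds check, just record how much was consumed.
    fn read_bit_unchecked(&mut self) -> BitResult {
        self.consumed += 1;
        BitResult(0) // placeholder: the real decoder produces the bit here
    }

    // A single branch validates an arbitrarily long run of unchecked reads.
    fn validate(&self, result: BitResult) -> Option<u8> {
        (self.consumed <= self.len).then_some(result.0)
    }
}
```

A run of `read_bit_unchecked` calls followed by one `validate` replaces a per-bit bounds check with a single branch, which is the effect the bullet above attributes to the transparent `BitResult`.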

Final performance results are a 1.3x speedup compared to image-rs 0.2.0 (--use-reference), although it is still 1.3x slower than libwebp:

Summary
'dwebp -noasm -nofancy Puente.webp' ran
1.00 ± 0.01 times faster than 'dwebp -noasm -nofancy Puente.webp'
1.31 ± 0.01 times faster than 'target/release/image-webp-runner Puente.webp'
1.69 ± 0.01 times faster than 'target/release/image-webp-runner Puente.webp --use-reference'

(I ran dwebp as the first and the last candidate to negate any effects from my poor laptop's CPU overheating.)

This uses as_flattened_mut() which was stabilized in 1.80.0, so merging this probably requires raising the MSRV. I don't know your policy on that, but the alternative was adding unsafe or adding another dependency (that itself uses unsafe), so I left it as is.

PS:

  • I thought about extending the buffer with a few zero bytes so that everything can be read with the FastReader, but I don't think it would help much, and in the worst case it might require reallocating the buffer.
  • read_literal has some obvious optimizations, but it doesn't seem to be part of the latency-critical path.
  • There might be an optimization in read_flag's 1 + (((range - 1) * 128) >> 8), but it seems hard to measure.
  • I think further optimizations to other parts of the decoding might push us past libwebp's performance.
  • I tried my hand at coaxing the compiler to apply SIMD to src/transform.rs, but it was very dependent on preventing function inlining, and ultimately I didn't get any noticeable performance gains yet. I might try again later and create a separate PR.
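On the read_flag point above: with the probability fixed at 128, the generic split computation reduces to a shift, because multiplying by 128 and then shifting right by 8 is exactly a shift right by 1. A small sketch (the function names here are hypothetical, not the crate's API):

```rust
// Generic split used by the bool decoder for an arbitrary probability.
fn split_general(range: u32, prob: u32) -> u32 {
    1 + (((range - 1) * prob) >> 8)
}

// Specialization for prob == 128: (*128 then >>8) collapses to (>>1),
// i.e. the split is roughly half the range.
fn split_flag(range: u32) -> u32 {
    1 + ((range - 1) >> 1)
}
```

The two agree for every value of `range`, which is why a dedicated probability-128 path is a pure strength reduction rather than a behavior change.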

@Shnatsel
Contributor

transform.rs is used only for lossless images, so changing anything there won't affect lossy ones. You can create a lossless image with convert -quality 100 input.png output.webp and verify with webpinfo that the file is indeed lossless.

That said, lossless WebP is already plenty fast specifically due to optimizations to transforms. We actually beat dwebp -noasm in my tests, although dwebp when allowed to use handwritten assembly still beats us by 7% to 15% on lossless images.

@Shnatsel
Contributor

Regarding bit reading: libwebp has a dedicated codepath for reading with probability 128 that is distinct from the general-purpose one. Is that something that you've explored?

If you haven't attempted it, it doesn't have to be a part of this PR. I just wanted to know if this has been attempted or not.

I would expect this not to matter if the hot variant of read_bool gets inlined anyway - the constant propagation should probably take care of it.

@SLiV9
Contributor Author

SLiV9 commented Dec 28, 2024

transform.rs is used only for lossless images, so changing anything there won't affect lossy ones.

Huh, are you sure? I only mentioned it because idct4x4 showed up as 8% of the runtime in callgrind when running against the Puente image. I did some optimizations that involved renaming that function, and the new function was at 7.5% or something like that; either way, not enough to be measurable.

Not denying that it's already plenty fast, just that I'm certain it showed up in my call graphs inside read_coefficients().

@SLiV9
Contributor Author

SLiV9 commented Dec 28, 2024

Regarding bit reading: libwebp has a dedicated codepath for reading with probability 128 that is distinct from the general-purpose one. Is that something that you've explored?

If you haven't attempted it, it doesn't have to be a part of this PR. I just wanted to know if this has been attempted or not.

I would expect this not to matter if the hot variant of read_bool gets inlined anyway - the constant propagation should probably take care of it.

Yes, that's the read_flag optimization I mentioned in the PR description. I didn't end up doing it, and in fact the way I have the inlining set up actually prevents the compiler from doing any special optimizations for the 128 case. That's because too much inlining/specialization seemed to make everything 20% slower, which I theorize is due to instruction cache misses.
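The inlining setup described here can be sketched as a fast/slow split where only the fast path is inlined into callers. The bodies below are placeholders (a made-up condition and `count_ones` standing in for the real decode work); the attribute placement is the point:

```rust
// #[cold] + #[inline(never)] keeps the rarely-taken fallback out of
// the hot loop's machine code, one way to avoid the instruction-cache
// pressure mentioned above.
#[cold]
#[inline(never)]
fn slow_path(x: u32) -> u32 {
    x.count_ones() // stand-in for the byte-at-a-time decoder
}

#[inline(always)]
fn fast_path(x: u32) -> Option<u32> {
    // Stand-in condition for "enough bytes left for the 4-byte chunk path".
    (x < 1 << 16).then(|| x.count_ones())
}

fn read(x: u32) -> u32 {
    fast_path(x).unwrap_or_else(|| slow_path(x))
}
```

Callers see only the small inlined fast path plus a call into the out-of-line fallback, rather than both bodies duplicated at every call site.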

But indeed, that's something that can be revisited in a separate PR.

@fintelia
Contributor

transform.rs is for lossy images while lossless_transform.rs is for lossless images.

It might be worth renaming "bool reader" to "arithmetic decoder" or something to that effect, because it is doing boolean arithmetic coding rather than simply reading bits.

@Shnatsel
Contributor

FWIW there is no change on end-to-end benchmarks for the large image on my machine from the FastReader::read_flag optimization. It's possible that it helps other machines, just not mine.

@Shnatsel
Contributor

I can confirm this didn't break anything 🎉

No behavioral changes before and after on my corpus of 7,500 images scraped from the web.

@kornelski
Contributor

I've made clippy happy. Please rebase.

@Shnatsel
Contributor

Shnatsel commented Jan 5, 2025

And I'd like to get this merged before any further merge conflicts arise.

I don't think we can ship 1.80 MSRV just yet. It is very recent, and image tries to be more conservative with MSRV, currently at 1.70.

I see two viable options:

  1. Add a bytemuck dependency. We can then use cast_slice to accomplish .as_flattened_mut() on older rustc.
  2. Just copy-paste the implementation of .as_flattened_mut() from the standard library. It's only 6 lines of code. We can then replace it with the stabilized method once we can bump MSRV to 1.80.

Thoughts?
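For reference, a sketch of what option 2 amounts to for this crate's buffers: the core of std's as_flattened_mut, specialized here to `[[u8; 4]]` for illustration (the real std version is generic and lives in core::slice, and this route does require unsafe, unlike bytemuck):

```rust
// Roughly what copy-pasting std's as_flattened_mut looks like for
// 4-byte chunks; replaceable with the real method once MSRV is 1.80.
fn as_flattened_mut(chunks: &mut [[u8; 4]]) -> &mut [u8] {
    let len = chunks.len() * 4;
    // SAFETY: `[[u8; 4]]` with N elements has exactly the same layout
    // as `[u8]` with 4*N elements, and we hold an exclusive borrow.
    unsafe { std::slice::from_raw_parts_mut(chunks.as_mut_ptr().cast(), len) }
}
```

The trade-off between the two options is exactly this `unsafe` block versus a new dependency that encapsulates the same cast.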

@kornelski
Contributor

I vote for using bytemuck. It's already in the dependencies of the parent image crate.

@fintelia
Contributor

fintelia commented Jan 8, 2025

Honestly, maybe we should just bump the MSRV to 1.80. We've got other changes that have been waiting on that same version bump for a while, and it has been nearly 6 months since the 1.80 release.

image-rs does try to be conservative with MSRV bumps, but we've never fully worked out a policy on quite what that means. Mostly, we just try not to frivolously bump the version or pick anything super new. With those criteria, it turns out to be quite easy to go a long time without increasing the MSRV, and then it looks like we have a really conservative policy.

@Shnatsel
Contributor

In that case, this only needs a rebase against the latest main, and it's good to go! @SLiV9, can you handle that? I'll push the merge button immediately after so that it doesn't diverge again.

@SLiV9
Contributor Author

SLiV9 commented Jan 11, 2025

I just rebased. Let me know if you still want it to be bytemuck and I'll do it later today.

  1. Just copy-paste the implementation of .as_flattened_mut() from the standard library. It's only 6 lines of code. We can then replace it with the stabilized method once we can bump MSRV to 1.80.

I did take a look at inlining it, but as_flattened_mut uses unsafe, which is forbidden in this crate.

@kornelski kornelski merged commit 344ec6f into image-rs:main Jan 11, 2025
7 checks passed
@Shnatsel
Contributor

@SLiV9 thanks again for the PR! These are really impressive performance gains, and I don't think we would've been able to optimize this part ourselves.

@SLiV9
Contributor Author

SLiV9 commented Jan 11, 2025

Happy to help! Thanks for the challenge and the awesome library.

@Shnatsel
Contributor

With this optimization (and other optimizations that went into the 0.2.1 release), https://github.com/Shnatsel/wondermagick backed by image-webp is now faster than imagemagick at decoding and thumbnailing a WebP image!
