[arm64] Performance regression from removal of 4x loop decoding #205
Comments
Pinging @pierrec and @greatroar again for thoughts. This is quite bad IMO.
Hi all, I know this is closed, but some work over the holidays led me to dig into this more, and I wanted to get y'all's thoughts on #215, which is directly relevant to this sort of unrolling. Thanks!
1cbdd81 appears to have removed the unrolled 4x copy. As a result, once fewer than 8 bytes remain, the finishing copy may do up to 7 individual unaligned byte copies, rather than the previous maximum of 3. Profiling shows that 10% of all time spent in the lz4 library goes to the single line at https://github.com/pierrec/lz4/blob/v4/internal/lz4block/decode_arm64.s#L206
I suspect we need to put that aligned 4x unrolled access back. Do you concur, @greatroar?