Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

internal/lz4block: Speed up noasm decoder #226

Merged
merged 1 commit into from
Jan 12, 2025
Merged

Conversation

greatroar
Copy link
Contributor

When the compiler is told exactly how many bytes a copy call should copy, and that number is at most 16, it will inline the call. Also, the old code only took the short match shortcut when the short literal shortcut was also taken. But long literals with short matches are common.

Benchmark results on older Intel:

goos: linux
goarch: amd64
pkg: github.com/pierrec/lz4/v4
cpu: Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz
                   │     old      │                 new                  │
                   │     B/s      │     B/s       vs base                │
UncompressPg1661-8   327.9Mi ± 1%   549.7Mi ± 0%  +67.61% (p=0.000 n=10)
UncompressDigits-8   1.111Gi ± 1%   1.499Gi ± 1%  +34.94% (p=0.000 n=10)
UncompressTwain-8    348.3Mi ± 0%   579.4Mi ± 0%  +66.32% (p=0.000 n=10)
UncompressRand-8     3.296Gi ± 0%   3.309Gi ± 1%        ~ (p=0.739 n=10)
geomean              813.8Mi        1.108Gi       +39.40%

On newer AMD:

goos: linux
goarch: amd64
pkg: github.com/pierrec/lz4/v4
cpu: AMD Ryzen 7 PRO 7840U w/ Radeon 780M Graphics
                    │     old      │                  new                  │
                    │     B/s      │      B/s       vs base                │
UncompressPg1661-16   643.6Mi ± 2%   1076.9Mi ± 1%  +67.33% (p=0.000 n=10)
UncompressDigits-16   2.808Gi ± 1%    3.786Gi ± 0%  +34.82% (p=0.000 n=10)
UncompressTwain-16    702.8Mi ± 1%   1309.5Mi ± 7%  +86.32% (p=0.000 n=10)
UncompressRand-16     6.878Gi ± 0%    6.850Gi ± 1%   -0.42% (p=0.009 n=10)
geomean               1.699Gi         2.430Gi       +43.04%

The assembler version still has quite an edge on amd64, so it might be too early to replace it (which would solve #215):

goos: linux
goarch: amd64
pkg: github.com/pierrec/lz4/v4
cpu: Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz
                   │     new      │               amd64-v4                │
                   │     B/s      │      B/s       vs base                │
UncompressPg1661-8   549.7Mi ± 0%   1037.5Mi ± 1%  +88.75% (p=0.000 n=10)
UncompressDigits-8   1.499Gi ± 1%    1.954Gi ± 0%  +30.33% (p=0.000 n=10)
UncompressTwain-8    579.4Mi ± 0%   1072.4Mi ± 1%  +85.11% (p=0.000 n=10)
UncompressRand-8     3.309Gi ± 1%    3.306Gi ± 2%        ~ (p=1.000 n=10)
geomean              1.108Gi         1.618Gi       +46.05%

The code passes TestIssue51 without a specific check for it, as well as the unit test from #191, which I haven't included in this PR for copyright reasons.

When the compiler is told exactly how many bytes a copy call should
copy, and that number is at most 16, it will inline the call. Also, the
old code only took the short match shortcut when the short literal
shortcut was also taken. But long literals with short matches are
common.

Benchmark results on older Intel:

    goos: linux
    goarch: amd64
    pkg: github.com/pierrec/lz4/v4
    cpu: Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz
                       │     old      │                 new                  │
                       │     B/s      │     B/s       vs base                │
    UncompressPg1661-8   327.9Mi ± 1%   549.7Mi ± 0%  +67.61% (p=0.000 n=10)
    UncompressDigits-8   1.111Gi ± 1%   1.499Gi ± 1%  +34.94% (p=0.000 n=10)
    UncompressTwain-8    348.3Mi ± 0%   579.4Mi ± 0%  +66.32% (p=0.000 n=10)
    UncompressRand-8     3.296Gi ± 0%   3.309Gi ± 1%        ~ (p=0.739 n=10)
    geomean              813.8Mi        1.108Gi       +39.40%

On newer AMD:

    goos: linux
    goarch: amd64
    pkg: github.com/pierrec/lz4/v4
    cpu: AMD Ryzen 7 PRO 7840U w/ Radeon 780M Graphics
                        │     old      │                  new                  │
                        │     B/s      │      B/s       vs base                │
    UncompressPg1661-16   643.6Mi ± 2%   1076.9Mi ± 1%  +67.33% (p=0.000 n=10)
    UncompressDigits-16   2.808Gi ± 1%    3.786Gi ± 0%  +34.82% (p=0.000 n=10)
    UncompressTwain-16    702.8Mi ± 1%   1309.5Mi ± 7%  +86.32% (p=0.000 n=10)
    UncompressRand-16     6.878Gi ± 0%    6.850Gi ± 1%   -0.42% (p=0.009 n=10)
    geomean               1.699Gi         2.430Gi       +43.04%
@greatroar
Copy link
Contributor Author

The test failure is unrelated to this PR.

@pierrec pierrec merged commit 6945807 into pierrec:v4 Jan 12, 2025
0 of 9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants