
Fix threadlocal type-inferrability and add reduction functionality #134

Merged (8 commits) on Feb 26, 2024

Conversation

GianlucaFuwa
Contributor

Fixes #112 by passing the threadlocal accumulator as an argument to the batch call instead of capturing it.

I found that adding a more restrictive, OpenMP-style reduction flag was more elegant than checking whether threadlocal is isbits in order to eliminate allocations in such cases. Usage looks like:

red1 = 0
red2 = 0
red3 = 0
@batch reduction = ((+,red1), (+,red2), (+,red3)) for i = 0:9
  red1 += 1
  red2 += 1
  red3 -= 1
end
red1, red2, red3

It works by allocating the initial value (the neutral element of the operation) within each thread's function, then loading and reducing those values at the end. This PR also includes the tid fix from faab451, documentation fixes, and additional tests for the reduction functionality.
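The per-thread strategy described above is roughly equivalent to the following hand-written sketch (illustrative only: it uses Threads.@spawn rather than Polyester's task machinery, and sum_sketch is a hypothetical name, not code from this PR):

```julia
using Base.Threads

# Sketch of the reduction strategy: each task starts from the neutral
# element of its operator (here, zero for +), accumulates its chunk,
# and the partial results are combined at the end.
function sum_sketch(xs)
    nt = nthreads()
    chunks = Iterators.partition(eachindex(xs), cld(length(xs), nt))
    tasks = map(chunks) do idxs
        @spawn begin
            acc = zero(eltype(xs))  # neutral element of +
            for i in idxs
                acc += xs[i]
            end
            acc
        end
    end
    # load + reduce the per-task accumulators
    sum(fetch, tasks)
end

sum_sketch(collect(1:100))  # == 5050
```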

@Krastanov
Contributor

It might be worthwhile to update the benchmarks that involve threadlocal in one of the README subsections, to showcase that things no longer allocate.

@GianlucaFuwa
Contributor Author

It might be worthwhile to update the benchmarks that involve threadlocal in one of the README subsections, to showcase that things no longer allocate.

I might be misunderstanding, but I can't find any benchmarks using threadlocal that show allocations in the README. I could add something like:

julia> function batch_reduction()
           y1 = 0
           y2 = 1
           @batch reduction=((+, y1), (*, y2)) for i in 1:9
               y1 += i
               y2 *= i
           end
           y1, y2
       end
julia> batch_reduction()
(45, 362880)
julia> @allocated batch_reduction()
0

to show that reduction doesn't allocate, though.

I also forgot to show how this PR affects the performance of threadlocal:

using Test, BenchmarkTools, Polyester

function f()
    @batch threadlocal=rand(10:99)::Int for i in 0:9
        threadlocal += 1
    end
    sum(threadlocal)
end

try
    @inferred f()
catch e
    println("Not inferred")
end
@benchmark f()

Old:

Not inferred
BenchmarkTools.Trial: 874 samples with 170 evaluations.
 Range (min … max):  785.882 ns … 188.296 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):      27.131 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):    33.600 μs ±  25.621 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▄▃▄▆▇█▇▄▅▂▂▂▂▂  ▁  ▂     ▄   
  ████████████████▇██▇█▅▅▆███▇▇▆█▆▇▆▆▇▆▄▄▅▅▄▄▅▃▃▄▃▃▃▃▁▃▃▂▂▁▃▁▁▃ ▅
  786 ns           Histogram: frequency by time          108 μs <

 Memory estimate: 224 bytes, allocs estimate: 5.

New:

BenchmarkTools.Trial: 10000 samples with 287 evaluations.
 Range (min … max):  282.927 ns … 201.847 μs  ┊ GC (min … max): 0.00% … 2.28%
 Time  (median):     303.833 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   644.395 ns ±   4.433 μs  ┊ GC (mean ± σ):  0.33% ± 0.05%

  █▂▂▁                                                          ▁
  █████▆▆▇█▆▆▆▅▃▅▁▄▄▄▆▄▆▆▅▆▆▆▅▄▃▄▄▃▅▄▄▆▄▄▄▄▅▄▅▃▁▃▁▄▃▁▃▃▃▄▁▁▄▁▄▄ █
  283 ns        Histogram: log(frequency) by time       7.69 μs <

 Memory estimate: 96 bytes, allocs estimate: 1.

@Krastanov
Contributor

My apologies, the readme has been simplified significantly since the last time I looked. Disregard my comment.

Member

@chriselrod chriselrod left a comment


This looks good.

@chriselrod
Member

Performance seems to be worse than I'd have expected.

julia> using LoopVectorization, BenchmarkTools, Polyester, LinearAlgebra

julia> function dot_batch(x,y)
           s = zero(Base.promote_eltype(x,y))
           @batch reduction=((+,s),) for i = eachindex(x,y)
               @inbounds @fastmath s += x[i]*y[i]
           end
           s
       end
dot_batch (generic function with 1 method)

julia> function dot_tturbo(x,y)
           s = zero(Base.promote_eltype(x,y))
           @tturbo for i = eachindex(x,y)
               s += x[i]*y[i]
           end
           s
       end
dot_tturbo (generic function with 1 method)

julia> function dot_turbo(x,y)
           s = zero(Base.promote_eltype(x,y))
           @turbo for i = eachindex(x,y)
               s += x[i]*y[i]
           end
           s
       end
dot_turbo (generic function with 1 method)

julia> x = rand(16384*8);

julia> @btime dot_turbo($x,$x)
  8.710 μs (0 allocations: 0 bytes)
43620.94200326388

julia> @btime dot_tturbo($x,$x)
  3.097 μs (0 allocations: 0 bytes)
43620.94200326386

julia> @btime dot($x,$x)
  3.174 μs (0 allocations: 0 bytes)
43620.942003263866

julia> @btime dot_batch($x,$x)
  32.269 μs (0 allocations: 0 bytes)
43620.94200326387

julia> using LinuxPerf

julia> foreachf(f::F, N::Int, arg1::A, args::Vararg{Any,K}) where {F,A,K} = foreach(_ -> Base.donotdelete(f(arg1, args...)), 1:N)
foreachf (generic function with 1 method)

julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
         foreachf(dot_batch, 1_000_000, x, x)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               5.01e+11  100.0%  #  3.5 cycles per ns
└ task-clock               1.42e+11  100.0%  # 141.6 s
┌ instructions             5.06e+11  100.0%  #  1.0 insns per cycle
│ branch-instructions      1.00e+11  100.0%  # 19.8% of insns
└ branch-misses            8.47e+06  100.0%  #  0.0% of branch insns
┌ L1-dcache-load-misses    1.65e+10  100.0%  #  8.0% of dcache loads
│ L1-dcache-loads          2.06e+11  100.0%
│ cache-misses             5.86e+05  100.0%  #  2.1% of cache refs
└ cache-references         2.86e+07  100.0%
                  aggregated from 4 threads
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
         foreachf(dot_tturbo, 1_000_000, x, x)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               3.92e+10  100.0%  #  3.6 cycles per ns
└ task-clock               1.08e+10  100.0%  # 10.8 s
┌ instructions             4.57e+10  100.0%  #  1.2 insns per cycle
│ branch-instructions      4.19e+09  100.0%  #  9.2% of insns
└ branch-misses            5.73e+06  100.0%  #  0.1% of branch insns
┌ L1-dcache-load-misses    1.65e+10  100.0%  # 50.0% of dcache loads
│ L1-dcache-loads          3.29e+10  100.0%
│ cache-misses             5.42e+04  100.0%  #  0.3% of cache refs
└ cache-references         1.76e+07  100.0%
                  aggregated from 3 threads
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
         foreachf(dot, 1_000_000, x, x)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               5.04e+10  100.0%  #  3.4 cycles per ns
└ task-clock               1.48e+10  100.0%  # 14.8 s
┌ instructions             6.40e+10  100.0%  #  1.3 insns per cycle
│ branch-instructions      6.45e+09  100.0%  # 10.1% of insns
└ branch-misses            8.16e+06  100.0%  #  0.1% of branch insns
┌ L1-dcache-load-misses    1.69e+10  100.0%  # 46.5% of dcache loads
│ L1-dcache-loads          3.64e+10  100.0%
│ cache-misses             2.80e+05  100.0%  #  0.5% of cache refs
└ cache-references         6.21e+07  100.0%
                  aggregated from 4 threads
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
         foreachf(dot_turbo, 1_000_000, x, x)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               4.05e+10  100.0%  #  4.1 cycles per ns
└ task-clock               9.81e+09  100.0%  #  9.8 s
┌ instructions             3.90e+10  100.0%  #  1.0 insns per cycle
│ branch-instructions      2.06e+09  100.0%  #  5.3% of insns
└ branch-misses            1.18e+06  100.0%  #  0.1% of branch insns
┌ L1-dcache-load-misses    1.64e+10  100.0%  # 50.0% of dcache loads
│ L1-dcache-loads          3.28e+10  100.0%
│ cache-misses             3.07e+05  100.0%  #  0.0% of cache refs
└ cache-references         1.86e+09  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

It requires around 10x as many instructions as @tturbo for this example, resulting in around 10x worse performance.

I've not looked into this or the generated code yet, but could there be something interfering with SIMD here?
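For anyone wanting to investigate the SIMD question locally, one generic diagnostic (not part of this PR) is to dump the LLVM IR and look for vector types such as `<4 x double>`; a plain @simd dot product makes a convenient baseline to compare against:

```julia
using InteractiveUtils  # provides code_llvm

# Baseline serial dot product; when the loop vectorizes, its IR contains
# vector operations (e.g. `fadd <4 x double>`). Comparing this against
# the IR of dot_batch would show whether SIMD is being lost there.
function dot_simd(x, y)
    s = zero(Base.promote_eltype(x, y))
    @inbounds @simd for i in eachindex(x, y)
        s += x[i] * y[i]
    end
    s
end

io = IOBuffer()
code_llvm(io, dot_simd, (Vector{Float64}, Vector{Float64}); debuginfo=:none)
ir = String(take!(io))
occursin("x double>", ir)  # expected true on SIMD-capable hardware
```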

@chriselrod
Member

From a quick look, nothing stands out to me as wrong.

@chriselrod chriselrod enabled auto-merge February 26, 2024 16:18

codecov bot commented Feb 26, 2024

Codecov Report

Attention: Patch coverage is 97.79006%, with 4 lines in your changes missing coverage. Please review.

Project coverage is 93.37%. Comparing base (0952d11) to head (ab511e8).
Report is 2 commits behind head on master.

Files Patch % Lines
src/batch.jl 97.75% 2 Missing ⚠️
src/closure.jl 97.64% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #134      +/-   ##
==========================================
+ Coverage   91.13%   93.37%   +2.24%     
==========================================
  Files           3        3              
  Lines         451      574     +123     
==========================================
+ Hits          411      536     +125     
+ Misses         40       38       -2     


auto-merge was automatically disabled February 26, 2024 17:40

Head branch was pushed to by a user without write access

@GianlucaFuwa
Contributor Author

I've not looked into this/generated code yet, but could there be something interfering with SIMD here?

I have no clue what could be causing all the extra instructions or the interference with SIMD. I'm running on a Windows machine, so I can't run LinuxPerf to check, but is this a result of the changes I made, or does it appear when using @batch in general?

@GianlucaFuwa
Contributor Author

Nightly fails due to excessive allocations. After checking nightly myself, I found that the number and size of allocations is rather random, so I'm not sure how to handle them.
PProf tells me that the objects being allocated are ThreadStates from ThreadingUtilities.jl.
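For reference, the allocation profiler added in Julia 1.8 is one way to track such allocations down (a generic sketch; workload is a hypothetical stand-in for the real test, and viewing the results with PProf.Allocs.pprof() requires the PProf.jl package):

```julia
using Profile

# Hypothetical stand-in for the real allocating workload under investigation.
workload() = sum(sum(rand(10)) for _ in 1:10)

workload()  # run once so compilation itself isn't profiled
Profile.Allocs.clear()
# sample_rate=1 records every allocation rather than a subsample
Profile.Allocs.@profile sample_rate=1 workload()
results = Profile.Allocs.fetch()
length(results.allocs)  # number of recorded allocations
```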

@chriselrod chriselrod merged commit 89425e0 into JuliaSIMD:master Feb 26, 2024
29 of 31 checks passed
@chriselrod
Member

Thanks. I don't think we should bother working around flaws in nightly.

Successfully merging this pull request may close these issues.

threadlocal result is not inferrable
3 participants