Fix threadlocal type-inferrability and add reduction functionality #134
Conversation
It might be worthwhile to update the benchmarks that involve `threadlocal` in one of the readme subsections, to showcase that things do not allocate anymore.
I might be misunderstanding, but I can't find any benchmarks using `threadlocal` in the readme. I added

```julia
julia> function batch_reduction()
y1 = 0
y2 = 1
@batch reduction=((+, y1), (*, y2)) for i in 1:9
y1 += i
y2 *= i
end
y1, y2
end
julia> batch_reduction()
(45, 362880)
julia> @allocated batch_reduction()
0
```

to show that. I also forgot to show that this PR affects the performance of

```julia
using Test
function f()
@batch threadlocal=rand(10:99)::Int for i in 0:9
threadlocal += 1
end
sum(threadlocal)
end
try
@inferred f()
catch e
println("Not inferred")
end
@benchmark f()
```

Old:

```
Not inferred
BenchmarkTools.Trial: 874 samples with 170 evaluations.
Range (min … max): 785.882 ns … 188.296 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 27.131 μs ┊ GC (median): 0.00%
Time (mean ± σ): 33.600 μs ± 25.621 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▄▃▄▆▇█▇▄▅▂▂▂▂▂ ▁ ▂ ▄
████████████████▇██▇█▅▅▆███▇▇▆█▆▇▆▆▇▆▄▄▅▅▄▄▅▃▃▄▃▃▃▃▁▃▃▂▂▁▃▁▁▃ ▅
786 ns Histogram: frequency by time 108 μs <
Memory estimate: 224 bytes, allocs estimate: 5.
```

New:

```
BenchmarkTools.Trial: 10000 samples with 287 evaluations.
Range (min … max): 282.927 ns … 201.847 μs ┊ GC (min … max): 0.00% … 2.28%
Time (median): 303.833 ns ┊ GC (median): 0.00%
Time (mean ± σ): 644.395 ns ± 4.433 μs ┊ GC (mean ± σ): 0.33% ± 0.05%
█▂▂▁ ▁
█████▆▆▇█▆▆▆▅▃▅▁▄▄▄▆▄▆▆▅▆▆▆▅▄▃▄▄▃▅▄▄▆▄▄▄▄▅▄▅▃▁▃▁▄▃▁▃▃▃▄▁▁▄▁▄▄ █
283 ns Histogram: log(frequency) by time 7.69 μs <
Memory estimate: 96 bytes, allocs estimate: 1.
```
My apologies, the readme has been simplified significantly since the last time I looked. Disregard my comment.
This looks good.
Performance seems to be worse than I'd have expected.

```julia
julia> using LoopVectorization, BenchmarkTools, Polyester, LinearAlgebra
julia> function dot_batch(x,y)
s = zero(Base.promote_eltype(x,y))
@batch reduction=((+,s),) for i = eachindex(x,y)
@inbounds @fastmath s += x[i]*y[i]
end
s
end
dot_batch (generic function with 1 method)
julia> function dot_tturbo(x,y)
s = zero(Base.promote_eltype(x,y))
@tturbo for i = eachindex(x,y)
s += x[i]*y[i]
end
s
end
dot_tturbo (generic function with 1 method)
julia> function dot_turbo(x,y)
s = zero(Base.promote_eltype(x,y))
@turbo for i = eachindex(x,y)
s += x[i]*y[i]
end
s
end
dot_turbo (generic function with 1 method)
julia> x = rand(16384*8);
julia> @btime dot_turbo($x,$x)
8.710 μs (0 allocations: 0 bytes)
43620.94200326388
julia> @btime dot_tturbo($x,$x)
3.097 μs (0 allocations: 0 bytes)
43620.94200326386
julia> @btime dot($x,$x)
3.174 μs (0 allocations: 0 bytes)
43620.942003263866
julia> @btime dot_batch($x,$x)
32.269 μs (0 allocations: 0 bytes)
43620.94200326387
julia> using LinuxPerf
julia> foreachf(f::F, N::Int, arg1::A, args::Vararg{Any,K}) where {F,A,K} = foreach(_ -> Base.donotdelete(f(arg1, args...)), 1:N)
foreachf (generic function with 1 method)
julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
foreachf(dot_batch, 1_000_000, x, x)
end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles 5.01e+11 100.0% # 3.5 cycles per ns
└ task-clock 1.42e+11 100.0% # 141.6 s
┌ instructions 5.06e+11 100.0% # 1.0 insns per cycle
│ branch-instructions 1.00e+11 100.0% # 19.8% of insns
└ branch-misses 8.47e+06 100.0% # 0.0% of branch insns
┌ L1-dcache-load-misses 1.65e+10 100.0% # 8.0% of dcache loads
│ L1-dcache-loads 2.06e+11 100.0%
│ cache-misses 5.86e+05 100.0% # 2.1% of cache refs
└ cache-references 2.86e+07 100.0%
aggregated from 4 threads
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
foreachf(dot_tturbo, 1_000_000, x, x)
end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles 3.92e+10 100.0% # 3.6 cycles per ns
└ task-clock 1.08e+10 100.0% # 10.8 s
┌ instructions 4.57e+10 100.0% # 1.2 insns per cycle
│ branch-instructions 4.19e+09 100.0% # 9.2% of insns
└ branch-misses 5.73e+06 100.0% # 0.1% of branch insns
┌ L1-dcache-load-misses 1.65e+10 100.0% # 50.0% of dcache loads
│ L1-dcache-loads 3.29e+10 100.0%
│ cache-misses 5.42e+04 100.0% # 0.3% of cache refs
└ cache-references 1.76e+07 100.0%
aggregated from 3 threads
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
foreachf(dot, 1_000_000, x, x)
end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles 5.04e+10 100.0% # 3.4 cycles per ns
└ task-clock 1.48e+10 100.0% # 14.8 s
┌ instructions 6.40e+10 100.0% # 1.3 insns per cycle
│ branch-instructions 6.45e+09 100.0% # 10.1% of insns
└ branch-misses 8.16e+06 100.0% # 0.1% of branch insns
┌ L1-dcache-load-misses 1.69e+10 100.0% # 46.5% of dcache loads
│ L1-dcache-loads 3.64e+10 100.0%
│ cache-misses 2.80e+05 100.0% # 0.5% of cache refs
└ cache-references 6.21e+07 100.0%
aggregated from 4 threads
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
foreachf(dot_turbo, 1_000_000, x, x)
end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles 4.05e+10 100.0% # 4.1 cycles per ns
└ task-clock 9.81e+09 100.0% # 9.8 s
┌ instructions 3.90e+10 100.0% # 1.0 insns per cycle
│ branch-instructions 2.06e+09 100.0% # 5.3% of insns
└ branch-misses 1.18e+06 100.0% # 0.1% of branch insns
┌ L1-dcache-load-misses 1.64e+10 100.0% # 50.0% of dcache loads
│ L1-dcache-loads 3.28e+10 100.0%
│ cache-misses 3.07e+05 100.0% # 0.0% of cache refs
└ cache-references 1.86e+09 100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```

It requires around 10x as many instructions as the `@tturbo` version. I've not looked into this or the generated code yet, but could there be something interfering with SIMD here?
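Nothing below was run in this thread; it is only a sketch of one way to probe the SIMD question, assuming `dot_batch` and `x` as defined in the benchmark above and a hypothetical serial baseline `dot_serial`:

```julia
using BenchmarkTools, InteractiveUtils

# Hypothetical serial baseline with the same loop body as `dot_batch`:
# if this version vectorizes while the @batch version needs ~10x the
# instructions, the closure that @batch builds around the loop body is a
# likely suspect for blocking SIMD.
function dot_serial(x, y)
    s = zero(Base.promote_eltype(x, y))
    @inbounds @fastmath for i in eachindex(x, y)
        s += x[i] * y[i]
    end
    s
end

x = rand(16384 * 8);
@btime dot_serial($x, $x)
@btime dot_batch($x, $x)  # reduction-based version from the benchmark above

# Scan the serial kernel's native code for packed SIMD instructions
# (e.g. vfmadd231pd on ymm/zmm registers on x86):
@code_native dot_serial(x, x)
```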
I don't see anything that strikes me as wrong from a quick look.
Codecov Report

Attention: Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff             @@
##           master     #134      +/-   ##
==========================================
+ Coverage   91.13%   93.37%   +2.24%
==========================================
  Files           3        3
  Lines         451      574     +123
==========================================
+ Hits          411      536     +125
+ Misses         40       38       -2
```

☔ View full report in Codecov by Sentry.
I have no clue what could be causing all the extra instructions / interference with SIMD. I'm running on a Windows machine so I can't check, but is this a result of the changes I made, or do they appear when using
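The comparison target at the end of that sentence is cut off in this extract. Assuming it refers to the pre-existing `threadlocal` path, a hedged sketch of such a comparison (`dot_threadlocal` is a hypothetical name; `dot_batch` and `x` are from the review comment above):

```julia
using BenchmarkTools, Polyester

# Hypothetical pre-PR-style baseline: per-thread accumulators via `threadlocal`
# (specialized to Float64 for simplicity), summed after the loop. Benchmarking
# this against the `reduction`-based `dot_batch` would separate the cost of
# `@batch` itself from the cost of the new reduction code path.
function dot_threadlocal(x, y)
    @batch threadlocal=0.0::Float64 for i in eachindex(x, y)
        @inbounds @fastmath threadlocal += x[i] * y[i]
    end
    sum(threadlocal)
end

x = rand(16384 * 8);
@btime dot_threadlocal($x, $x)
@btime dot_batch($x, $x)
```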
Nightly fails due to excessive allocations. After checking nightly myself, I found that the amount and size of allocations are rather random, so I'm not sure how to handle them.
Thanks. I don't think we should bother working around flaws in nightly.
Fixes #112 by passing the threadlocal accumulator as an argument to the `batch` call instead of capturing it.

I found that adding a more restrictive, OpenMP-like `reduction` flag was more elegant than checking whether `threadlocal` was `isbits` to eliminate allocations in such cases. Using this looks like the example sketched below. It works by allocating the initial value (the neutral element of the operation) within each thread's function and loading and reducing those values at the end.

This PR also includes the `tid` fix from faab451, fixes in the documentation, and additional tests for the `reduction` functionality.
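The usage example that originally followed "Using this looks like" is not preserved in this extract; below is a minimal sketch based on the `batch_reduction` example shown earlier in the thread.

```julia
using Polyester

# Each (operator, variable) pair in `reduction` starts from the operator's
# neutral element on every thread; the per-thread partial results are reduced
# into the original variable after the loop.
function batch_reduction()
    y1 = 0
    y2 = 1
    @batch reduction=((+, y1), (*, y2)) for i in 1:9
        y1 += i
        y2 *= i
    end
    y1, y2
end

batch_reduction()             # (45, 362880)
@allocated batch_reduction()  # 0
```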