
Fix threadlocal type-inferrability and add reduction functionality #134

Merged (8 commits) on Feb 26, 2024

Conversation

GianlucaFuwa
Contributor

Fixes #112 by passing the threadlocal accumulator as an argument to the batch call instead of capturing it.

I found that adding a more restrictive, OpenMP-style reduction flag was more elegant than checking whether threadlocal is isbits in order to eliminate allocations in such cases. Usage looks like:

red1 = 0
red2 = 0
red3 = 0
@batch reduction = ((+,red1), (+,red2), (+,red3)) for i = 0:9
  red1 += 1
  red2 += 1
  red3 -= 1
end
red1, red2, red3

It works by allocating the initial value (the neutral element of the operation) within each thread's function, then loading and reducing those values at the end. This PR also includes the tid fix from faab451, documentation fixes, and additional tests for the reduction functionality.
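The per-thread strategy described above is roughly equivalent to the following hand-written sketch (illustrative only: it uses Threads.@spawn rather than Polyester's task machinery, and sum_sketch is a hypothetical name, not code from this PR):

```julia
using Base.Threads

# Sketch of the reduction strategy: each task starts from the neutral
# element of its operator (here, zero for +), accumulates its chunk,
# and the partial results are combined at the end.
function sum_sketch(xs)
    nt = nthreads()
    chunks = Iterators.partition(eachindex(xs), cld(length(xs), nt))
    tasks = map(chunks) do idxs
        @spawn begin
            acc = zero(eltype(xs))  # neutral element of +
            for i in idxs
                acc += xs[i]
            end
            acc
        end
    end
    # load + reduce the per-task accumulators
    sum(fetch, tasks)
end

sum_sketch(collect(1:100))  # == 5050
```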

@Krastanov
Contributor

It might be worthwhile to update the benchmarks that involve threadlocal in one of the README subsections, to showcase that things no longer allocate.

@GianlucaFuwa
Contributor Author

It might be worthwhile to update the benchmarks that involve threadlocal in one of the README subsections, to showcase that things no longer allocate.

I might be misunderstanding, but I can't find any benchmarks using threadlocal that show allocations in the README. I could add something like:

julia> function batch_reduction()
           y1 = 0
           y2 = 1
           @batch reduction=((+, y1), (*, y2)) for i in 1:9
               y1 += i
               y2 *= i
           end
           y1, y2
       end
julia> batch_reduction()
(45, 362880)
julia> @allocated batch_reduction()
0

to show that reduction doesn't allocate, though.

I also forgot to show how this PR affects the performance of threadlocal:

using Test, BenchmarkTools, Polyester

function f()
    @batch threadlocal=rand(10:99)::Int for i in 0:9
        threadlocal += 1
    end
    sum(threadlocal)
end

try
    @inferred f()
catch e
    println("Not inferred")
end
@benchmark f()

Old:

Not inferred
BenchmarkTools.Trial: 874 samples with 170 evaluations.
 Range (min … max):  785.882 ns … 188.296 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):      27.131 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):    33.600 μs ±  25.621 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▄▃▄▆▇█▇▄▅▂▂▂▂▂  ▁  ▂     ▄   
  ████████████████▇██▇█▅▅▆███▇▇▆█▆▇▆▆▇▆▄▄▅▅▄▄▅▃▃▄▃▃▃▃▁▃▃▂▂▁▃▁▁▃ ▅
  786 ns           Histogram: frequency by time          108 μs <

 Memory estimate: 224 bytes, allocs estimate: 5.

New:

BenchmarkTools.Trial: 10000 samples with 287 evaluations.
 Range (min … max):  282.927 ns … 201.847 μs  ┊ GC (min … max): 0.00% … 2.28%
 Time  (median):     303.833 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   644.395 ns ±   4.433 μs  ┊ GC (mean ± σ):  0.33% ± 0.05%

  █▂▂▁                                                          ▁
  █████▆▆▇█▆▆▆▅▃▅▁▄▄▄▆▄▆▆▅▆▆▆▅▄▃▄▄▃▅▄▄▆▄▄▄▄▅▄▅▃▁▃▁▄▃▁▃▃▃▄▁▁▄▁▄▄ █
  283 ns        Histogram: log(frequency) by time       7.69 μs <

 Memory estimate: 96 bytes, allocs estimate: 1.

@Krastanov
Contributor

My apologies, the readme has been simplified significantly since the last time I looked. Disregard my comment.

Member

@chriselrod chriselrod left a comment


This looks good.

@chriselrod
Member

Performance seems to be worse than I'd have expected.

julia> using LoopVectorization, BenchmarkTools, Polyester, LinearAlgebra

julia> function dot_batch(x,y)
           s = zero(Base.promote_eltype(x,y))
           @batch reduction=((+,s),) for i = eachindex(x,y)
               @inbounds @fastmath s += x[i]*y[i]
           end
           s
       end
dot_batch (generic function with 1 method)

julia> function dot_tturbo(x,y)
           s = zero(Base.promote_eltype(x,y))
           @tturbo for i = eachindex(x,y)
               s += x[i]*y[i]
           end
           s
       end
dot_tturbo (generic function with 1 method)

julia> function dot_turbo(x,y)
           s = zero(Base.promote_eltype(x,y))
           @turbo for i = eachindex(x,y)
               s += x[i]*y[i]
           end
           s
       end
dot_turbo (generic function with 1 method)

julia> x = rand(16384*8);

julia> @btime dot_turbo($x,$x)
  8.710 μs (0 allocations: 0 bytes)
43620.94200326388

julia> @btime dot_tturbo($x,$x)
  3.097 μs (0 allocations: 0 bytes)
43620.94200326386

julia> @btime dot($x,$x)
  3.174 μs (0 allocations: 0 bytes)
43620.942003263866

julia> @btime dot_batch($x,$x)
  32.269 μs (0 allocations: 0 bytes)
43620.94200326387

julia> using LinuxPerf

julia> foreachf(f::F, N::Int, arg1::A, args::Vararg{Any,K}) where {F,A,K} = foreach(_ -> Base.donotdelete(f(arg1, args...)), 1:N)
foreachf (generic function with 1 method)

julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
         foreachf(dot_batch, 1_000_000, x, x)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               5.01e+11  100.0%  #  3.5 cycles per ns
└ task-clock               1.42e+11  100.0%  # 141.6 s
┌ instructions             5.06e+11  100.0%  #  1.0 insns per cycle
│ branch-instructions      1.00e+11  100.0%  # 19.8% of insns
└ branch-misses            8.47e+06  100.0%  #  0.0% of branch insns
┌ L1-dcache-load-misses    1.65e+10  100.0%  #  8.0% of dcache loads
│ L1-dcache-loads          2.06e+11  100.0%
│ cache-misses             5.86e+05  100.0%  #  2.1% of cache refs
└ cache-references         2.86e+07  100.0%
                  aggregated from 4 threads
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
         foreachf(dot_tturbo, 1_000_000, x, x)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               3.92e+10  100.0%  #  3.6 cycles per ns
└ task-clock               1.08e+10  100.0%  # 10.8 s
┌ instructions             4.57e+10  100.0%  #  1.2 insns per cycle
│ branch-instructions      4.19e+09  100.0%  #  9.2% of insns
└ branch-misses            5.73e+06  100.0%  #  0.1% of branch insns
┌ L1-dcache-load-misses    1.65e+10  100.0%  # 50.0% of dcache loads
│ L1-dcache-loads          3.29e+10  100.0%
│ cache-misses             5.42e+04  100.0%  #  0.3% of cache refs
└ cache-references         1.76e+07  100.0%
                  aggregated from 3 threads
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
         foreachf(dot, 1_000_000, x, x)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               5.04e+10  100.0%  #  3.4 cycles per ns
└ task-clock               1.48e+10  100.0%  # 14.8 s
┌ instructions             6.40e+10  100.0%  #  1.3 insns per cycle
│ branch-instructions      6.45e+09  100.0%  # 10.1% of insns
└ branch-misses            8.16e+06  100.0%  #  0.1% of branch insns
┌ L1-dcache-load-misses    1.69e+10  100.0%  # 46.5% of dcache loads
│ L1-dcache-loads          3.64e+10  100.0%
│ cache-misses             2.80e+05  100.0%  #  0.5% of cache refs
└ cache-references         6.21e+07  100.0%
                  aggregated from 4 threads
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
         foreachf(dot_turbo, 1_000_000, x, x)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               4.05e+10  100.0%  #  4.1 cycles per ns
└ task-clock               9.81e+09  100.0%  #  9.8 s
┌ instructions             3.90e+10  100.0%  #  1.0 insns per cycle
│ branch-instructions      2.06e+09  100.0%  #  5.3% of insns
└ branch-misses            1.18e+06  100.0%  #  0.1% of branch insns
┌ L1-dcache-load-misses    1.64e+10  100.0%  # 50.0% of dcache loads
│ L1-dcache-loads          3.28e+10  100.0%
│ cache-misses             3.07e+05  100.0%  #  0.0% of cache refs
└ cache-references         1.86e+09  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

It requires around 10x as many instructions as @tturbo for this example, resulting in around 10x worse performance.

I've not looked into this or the generated code yet, but could there be something interfering with SIMD here?
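For anyone wanting to investigate the SIMD question locally, one generic diagnostic (not part of this PR) is to dump the LLVM IR and look for vector types such as `<4 x double>`; a plain @simd dot product makes a convenient baseline to compare against:

```julia
using InteractiveUtils  # provides code_llvm

# Baseline serial dot product; when the loop vectorizes, its IR contains
# vector operations (e.g. `fadd <4 x double>`). Comparing this against
# the IR of dot_batch would show whether SIMD is being lost there.
function dot_simd(x, y)
    s = zero(Base.promote_eltype(x, y))
    @inbounds @simd for i in eachindex(x, y)
        s += x[i] * y[i]
    end
    s
end

io = IOBuffer()
code_llvm(io, dot_simd, (Vector{Float64}, Vector{Float64}); debuginfo=:none)
ir = String(take!(io))
occursin("x double>", ir)  # expected true on SIMD-capable hardware
```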

@chriselrod
Member

From a quick look, nothing stands out to me as wrong.

@chriselrod chriselrod enabled auto-merge February 26, 2024 16:18

codecov bot commented Feb 26, 2024

Codecov Report

Attention: Patch coverage is 97.79006%, with 4 lines in your changes missing coverage. Please review.

Project coverage is 93.37%. Comparing base (0952d11) to head (ab511e8).
Report is 2 commits behind head on master.

Files Patch % Lines
src/batch.jl 97.75% 2 Missing ⚠️
src/closure.jl 97.64% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #134      +/-   ##
==========================================
+ Coverage   91.13%   93.37%   +2.24%     
==========================================
  Files           3        3              
  Lines         451      574     +123     
==========================================
+ Hits          411      536     +125     
+ Misses         40       38       -2     


auto-merge was automatically disabled February 26, 2024 17:40

Head branch was pushed to by a user without write access

@GianlucaFuwa
Contributor Author

I've not looked into this/generated code yet, but could there be something interfering with SIMD here?

I have no clue what could be causing all the extra instructions or the interference with SIMD. I'm running on a Windows machine, so I can't run LinuxPerf to check, but is this a result of the changes I made, or does it appear when using @batch in general?

@GianlucaFuwa
Contributor Author

Nightly fails due to excessive allocations. After checking nightly myself, I found that the number and size of allocations is rather random, so I'm not sure how to handle them.
PProf tells me that the objects being allocated are ThreadStates from ThreadingUtilities.jl.
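For reference, the allocation profiler added in Julia 1.8 is one way to track such allocations down (a generic sketch; workload is a hypothetical stand-in for the real test, and viewing the results with PProf.Allocs.pprof() requires the PProf.jl package):

```julia
using Profile

# Hypothetical stand-in for the real allocating workload under investigation.
workload() = sum(sum(rand(10)) for _ in 1:10)

workload()  # run once so compilation itself isn't profiled
Profile.Allocs.clear()
# sample_rate=1 records every allocation rather than a subsample
Profile.Allocs.@profile sample_rate=1 workload()
results = Profile.Allocs.fetch()
length(results.allocs)  # number of recorded allocations
```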

@chriselrod chriselrod merged commit 89425e0 into JuliaSIMD:master Feb 26, 2024
29 of 31 checks passed
@chriselrod
Member

Thanks. I don't think we should bother working around flaws in nightly.

Successfully merging this pull request may close these issues.

threadlocal result is not inferrable
3 participants