Comments (30)
I tried the ultimate solution, which is just to add another type parameter to track the buffer type, and it seems to be working (see the sketch at the end of this comment):
With the change:
julia> @time begin
           for evt in t1
               length(evt.Muon_pt)
           end
       end
  0.154825 seconds (26.07 k allocations: 177.184 MiB, 15.19% gc time, 3.46% compilation time)

julia> @time begin
           for evt in t1
               length(evt.Muon_pt)
           end
       end
  0.026082 seconds (12.45 k allocations: 1.470 MiB)
On master:
julia> @time begin
           for evt in t1
               length(evt.Muon_pt)
           end
       end
  3.676705 seconds (26.81 M allocations: 1.196 GiB, 15.58% gc time, 25.18% compilation time)

julia> @time begin
           for evt in t1
               length(evt.Muon_pt)
           end
       end
  0.021609 seconds (5.90 k allocations: 767.688 KiB)
I still can't believe how well the current master works with caching...
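For context, a minimal sketch of what "adding another type to track the buffer type" means here; the names are hypothetical, for illustration only, not the actual UnROOT implementation:

# B is the concrete buffer type (e.g. some VectorOfVectors{...}); carrying it as
# a type parameter lets the compiler infer buffer accesses instead of boxing them
mutable struct MyLazyBranch{B}
    buffer::B
    # ... file handle, basket offsets, etc. would live here too
end

# type-stable: the element type is inferable from B alone
Base.getindex(b::MyLazyBranch, i::Int) = b.buffer[i]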
from unroot.jl.
we already do ;) I believe our lazy iteration (when cached) is within 2x of a concrete Numba loop over pre-allocated arrays, which means we're probably a lot faster than chunked uproot code
from unroot.jl.
I don't think this is the issue: VectorOfVectors has to do a little more work to check bounds and create the views than getindex on a vector of separate vectors does. But it's still very little time, and without bounds checking the difference is actually only about a factor of two:
julia> using ArraysOfArrays, BenchmarkTools
julia> V = [rand(Float32, rand(0:4)) for _ = 1:10^5];
julia> Vov = VectorOfVectors(V)
julia> @benchmark begin
           @inbounds for i in eachindex($V)
               length($V[i])
           end
       end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  34.919 μs … 132.067 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     38.391 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   40.404 μs ±   7.572 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram omitted: 34.9 μs to 79.2 μs, log(frequency) by time]
 Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark begin
           @inbounds for i in eachindex($Vov)
               length($Vov[i])
           end
       end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  77.848 μs … 226.026 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     87.407 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   92.920 μs ±  16.003 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram omitted: 77.8 μs to 165 μs, log(frequency) by time]
 Memory estimate: 0 bytes, allocs estimate: 0.
So that's about 1 ns for each element access (view creation plus length()) for VectorOfVectors with @inbounds (77.8 μs over 10^5 elements is roughly 0.8 ns per access, versus roughly 0.35 ns for the plain vector of vectors), and there are no memory allocations.
from unroot.jl.
But this is just 1 ns vs. 0.5 ns for the access to the element vectors. Compared to an actual operation that does something real-life on those element vectors, that time difference should be negligible.
I don't think ArraysOfArrays causes the allocations, as long as things are type-stable around it.
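A quick way to check that (a sketch with generic names, not UnROOT's actual code):

using ArraysOfArrays
using InteractiveUtils  # for @code_warntype outside the REPL

vov = VectorOfVectors([rand(Float32, 3) for _ in 1:10])
g(v) = length(v[1])

# when vov's concrete type is known, this should show no red Any/Union annotations
@code_warntype g(vov)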
from unroot.jl.
Very nice, @Moelf!
I hope I can find some time in the future to contribute more directly. I'm happy with UpROOT.jl as a stopgap, but being able to deal with ROOT files without installing a whole Python stack is very attractive. :-) Also, we should be able to outperform uproot a bit, esp. on ROOT files with small buffer sizes.
from unroot.jl.
Using ArraysOfArrays.jl should definitely help to reduce the number of memory allocations a lot.
from unroot.jl.
Why is v_l_passIso a vector of boolean vectors, instead of a vector of NamedTuples?
from unroot.jl.
because it's used like this later:
# isolation cut, require all 4 of them to be true
if all((
    v_l_passIso[pr1[1]][1],
    v_l_passIso[pr1[2]][1],
    v_l_passIso[W_id[1]][1],
    v_l_passIso[W_id[2]][1],
))
else
    return false, wgt
end
and
if (abs(v_l_pid[fifth_l]) == 11)
    !v_l_passIso[fifth_l][4] && return false, wgt
end
if (abs(v_l_pid[fifth_l]) == 13)
    !v_l_passIso[fifth_l][3] && return false, wgt
end
In short, it needs to have the same order as all the other "lepton-related" branches, so you can index all of them with the same number that identifies a unique lepton in the event.
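To illustrate the convention (a sketch with made-up values, not the actual analysis code):

# all "lepton-related" vectors are parallel: index i always refers to the same lepton
v_l_pid     = [11, -13, 13]             # hypothetical PDG IDs
v_l_pt      = [45.0f0, 30.2f0, 28.1f0]  # hypothetical transverse momenta
v_l_passIso = [Bool[1, 1, 0, 1],        # isolation flags for lepton 1
               Bool[1, 0, 1, 1],        # ... lepton 2
               Bool[1, 1, 1, 0]]        # ... lepton 3

i = 2  # one number identifies a unique lepton in the event
v_l_pid[i], v_l_pt[i], v_l_passIso[i][1]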
from unroot.jl.
I guess one option would be to use a static array or tuple, since the length of the elements of v_l_passIso is fixed. An ArraysOfArrays.VectorOfVectors would work too, of course.
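For example, a sketch of the static-array option, assuming (for illustration only) a fixed four flags per lepton; as the next comment points out, that assumption doesn't actually hold here:

using StaticArrays

# an SVector is stored inline, so a Vector{SVector{4,Bool}} is one flat allocation
v_iso = SVector{4,Bool}[]
push!(v_iso, SVector(true, true, false, true))
v_iso[1][1]  # same indexing pattern as the nested-vector version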
from unroot.jl.
> since the length of the elements of v_l_passIso is fixed

It is not: it only contains entries for leptons that passed e_mask and m_mask, and the innermost vector is of length 16, so it's probably not worth a StaticArray.
from unroot.jl.
index 86d8909..e26ff74 100644
--- a/src/root.jl
+++ b/src/root.jl
@@ -215,11 +215,16 @@ function interped_data(rawdata, rawoffsets, ::Type{T}, ::Type{J}) where {T, J<:JaggType}
-    @views [
-        ntoh.(reinterpret(
-            T, rawdata[(rawoffsets[i]+jagg_offset+1):rawoffsets[i+1]]
-        )) for i in 1:(length(rawoffsets) - 1)
-    ]
+    _size = sizeof(eltype(T))
+    data = UInt8[]
+    offset = Int64[0] # god damn 0-based index
+    @views @inbounds for i in 1:(length(rawoffsets) - 1)
+        rg = (rawoffsets[i]+jagg_offset+1):rawoffsets[i+1]
+        append!(data, rawdata[rg])
+        push!(offset, last(offset) + length(rg))
+    end
+    real_data = ntoh.(reinterpret(T, data))
+    VectorOfVectors(real_data, offset .÷ _size .+ 1)
 end
here are the numbers AFTER applying this diff:
# 1st run
17.403734 seconds (72.69 M allocations: 6.877 GiB, 9.20% gc time, 3.31% compilation time)
# 2nd run
3.135352 seconds (34.34 M allocations: 2.249 GiB, 12.11% gc time)
# 3rd run
2.990273 seconds (34.34 M allocations: 2.249 GiB, 9.52% gc time)
BEFORE this diff:
# 1st run
32.069656 seconds (213.86 M allocations: 11.560 GiB, 21.16% gc time, 1.26% compilation time)
# 2nd run
3.976044 seconds (27.73 M allocations: 1.779 GiB, 21.24% gc time)
# 3rd run
3.725007 seconds (27.73 M allocations: 1.779 GiB, 13.36% gc time)
@oschulz definitely helpful in the asymptotic limit! I'm a little confused about why subsequent runs allocate more; maybe it's because of some weird hidden conversion somewhere...
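For reference, the two-argument VectorOfVectors constructor used in the diff takes the flat data plus 1-based element pointers: element i spans data[elem_ptr[i]:elem_ptr[i+1]-1]. A small standalone example:

using ArraysOfArrays

flat     = Float32[1, 2, 3, 4, 5, 6]
elem_ptr = [1, 3, 3, 7]  # element i is flat[elem_ptr[i]:elem_ptr[i+1]-1]
vov      = VectorOfVectors(flat, elem_ptr)

vov[1] == Float32[1, 2]        # true
isempty(vov[2])                # true: a zero-length element
vov[3] == Float32[3, 4, 5, 6]  # true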
from unroot.jl.
hmmm, I tried a Ref{}() hack and it seems to be working; I'll make a PR now and we can discuss technical details there
from unroot.jl.
Using view(rawdata, rg)
instead of rawdata[rg]
could save allocations, since views are stack-allocated now.
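A minimal illustration of the difference, assuming a recent Julia where non-escaping views stay off the heap:

function sum_copy(x, rg)
    s = zero(eltype(x))
    for v in x[rg]  # x[rg] allocates a fresh array
        s += v
    end
    return s
end

function sum_view(x, rg)
    s = zero(eltype(x))
    for v in view(x, rg)  # the view never escapes, so no heap allocation
        s += v
    end
    return s
end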
from unroot.jl.
Ah, sorry, I overlooked the @views
from unroot.jl.
Maybe try this version:
function interped_data(rawdata, rawoffsets, ::Type{T}, ::Type{J}) where {T, J<:JaggType}
    # ...
    _size = Base.SignedMultiplicativeInverse{Int}(sizeof(eltype(T)))
    data = UInt8[]
    elem_ptr = sizehint!(Vector{Int}(), length(eachindex(rawoffsets)))
    push!(elem_ptr, 0)
    # Can we sizehint! data too?
    @inbounds for i in eachindex(rawoffsets)[1:end-1]
        rg = (rawoffsets[i]+jagg_offset+1):rawoffsets[i+1]
        append!(data, view(rawdata, rg))
        push!(elem_ptr, last(elem_ptr) + div(length(rg), _size))
    end
    real_data = ntoh.(reinterpret(T, data))
    elem_ptr .+= firstindex(real_data)
    VectorOfVectors(real_data, elem_ptr)
    # ...
end
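The SignedMultiplicativeInverse above replaces repeated integer division by a fixed divisor with a precomputed multiply-and-shift. A quick standalone check (this is a Base internal, so treat the exact path as version-dependent):

# precompute the inverse once; div() then avoids a hardware integer division
mi = Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int}(4)
div(12, mi) == 3  # true, same result as div(12, 4)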
from unroot.jl.
> # Can we sizehint! data too?

Maybe something smart like last(offset) ÷ sizeof(eltype(T))?
no visible difference
from unroot.jl.
Hm, I'm not sure where these allocs are coming from, but it shouldn't be from within interped_data
, since it only does a small, fixed number of allocations. What does main_looper
do? And can you benchmark just the data reading phase?
from unroot.jl.
see #65 (comment)
for minimal reproduction of the mysterious allocation
from unroot.jl.
I'll try to have a go at it ...
from unroot.jl.
I have reduced the minimal reproduction further:
julia> using ArraysOfArrays
julia> const V = [rand(Float32, rand(0:4)) for _ = 1:10^5];
julia> const Vov = VectorOfVectors{Float32}()
0-element VectorOfVectors{Float32, Vector{Float32}, Vector{Int64}, Vector{Tuple{}}}
julia> foreach(V) do i
           push!(Vov, i)
       end

julia> @benchmark begin
           for i in eachindex($V)
               length($V[i])
           end
       end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  50.451 μs … 512.027 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     56.100 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   59.019 μs ±  14.257 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram omitted: 50.5 μs to 110 μs, log(frequency) by time]
 Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark begin
           for i in eachindex($Vov)
               length($Vov[i])
           end
       end
BenchmarkTools.Trial: 9314 samples with 1 evaluation.
 Range (min … max):  472.014 μs …  1.195 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     514.337 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   533.550 μs ± 70.951 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram omitted: 472 μs to 870 μs, log(frequency) by time]
 Memory estimate: 0 bytes, allocs estimate: 0.
julia> Vov == V
true
maybe this is an ArraysOfArrays issue, not sure.
I tested the newly added Int32 offsets for VoV; they're 30% faster than the Int64 VoV. I really hope this is not some false-sharing shenanigans.
from unroot.jl.
Some new observations:
julia> const t1 = LazyTree(ROOTFile("/home/akako/Downloads/doublemu.root"), "t", [r"^Muon_(pt|eta|phi|mass)$","MET_pt"]);
julia> length(t1.Muon_pt) ÷ 10^4
247

julia> @time begin
           for idx in 1:10^4:length(t1.Muon_pt)
               t1.Muon_pt[idx]
           end
       end
  0.000530 seconds (3.56 k allocations: 406.859 KiB)

julia> @time begin
           for idx in 1:10^3:length(t1.Muon_pt)
               t1.Muon_pt[idx]
           end
       end
  0.001023 seconds (8.01 k allocations: 546.172 KiB)

julia> @time begin
           for idx in 1:10^2:length(t1.Muon_pt)
               t1.Muon_pt[idx]
           end
       end
  0.005169 seconds (51.95 k allocations: 1.884 MiB)

julia> @time begin
           for idx in 1:10:length(t1.Muon_pt)
               t1.Muon_pt[idx]
           end
       end
  0.014718 seconds (494.29 k allocations: 15.714 MiB)

julia> @time begin
           for idx in 1:1:length(t1.Muon_pt)
               t1.Muon_pt[idx]
           end
       end
  0.194961 seconds (4.89 M allocations: 150.909 MiB, 44.43% gc time)
Notice how the allocation count is not linear: initially it seems the cache is working properly, because accessing 10x more elements doesn't cause 10x the allocations, but later it scales linearly.
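That pattern is consistent with a one-basket cache: sequential reads mostly hit the cached basket, while large strides miss on almost every access. A schematic sketch of such a cache (hypothetical, not UnROOT's actual code):

# a branch that keeps only the most recently decompressed basket around
mutable struct CachedBranch{T}
    readbasket::Function  # i -> (Vector{T}, range of event indices it covers)
    cache::Vector{T}
    cached_range::UnitRange{Int}
end

function Base.getindex(b::CachedBranch, i::Int)
    if i ∉ b.cached_range
        # cache miss: decompress a whole basket (this is where allocations happen)
        b.cache, b.cached_range = b.readbasket(i)
    end
    @inbounds return b.cache[i - first(b.cached_range) + 1]
end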
from unroot.jl.
@oschulz I think I found a "real" minimal reproducing example
julia> const Vov = VectorOfVectors{Float32}()
0-element VectorOfVectors{Float32, Vector{Float32}, Vector{Int64}, Vector{Tuple{}}}
julia> const V = [rand(Float32, rand(0:4)) for _ = 1:10^5];
julia> foreach(V) do i
           push!(Vov, i)
       end

julia> struct A
           v::VectorOfVectors{Float32}
       end

julia> struct B
           v::Vector{Vector{Float32}}
       end
julia> const _a = A(Vov);
julia> const _b = B(V);
julia> function f(stru, idx)
           @inbounds stru.v[idx]
       end
f (generic function with 1 method)
# warm run
julia> @time begin
           for i = 1:10^5
               f(_a, i)
           end
       end
  0.005362 seconds (199.49 k allocations: 6.096 MiB)

# warm run
julia> @time begin
           for i = 1:10^5
               f(_b, i)
           end
       end
  0.000216 seconds
In fact, for some reason, even a single getindex allocates under @benchmark, even with @inbounds:
julia> @benchmark f($_a, 123)
BenchmarkTools.Trial: 10000 samples with 995 evaluations.
 Range (min … max):  22.168 ns …  1.808 μs  ┊ GC (min … max): 0.00% … 96.08%
 Time  (median):     23.265 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   26.942 ns ± 48.507 ns  ┊ GC (mean ± σ):  5.52% ±  3.06%
 [histogram omitted: 22.2 ns to 50.5 ns, log(frequency) by time]
 Memory estimate: 48 bytes, allocs estimate: 1.
julia> @benchmark f($_b, 123)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  1.344 ns … 15.178 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.514 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.521 ns ±  0.242 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram omitted: 1.34 ns to 1.73 ns, frequency by time]
 Memory estimate: 0 bytes, allocs estimate: 0.
from unroot.jl.
That one's easy to explain: your struct A doesn't allow for type stability, since VectorOfVectors{Float32} isa UnionAll (because VectorOfVectors has several type parameters). Try this version:
using ArraysOfArrays, BenchmarkTools
const V = [rand(Float32, rand(0:4)) for _ = 1:10^5];
const Vov = VectorOfVectors(V)
struct Foo{VV<:AbstractVector{<:AbstractVector{<:Real}}}
    v::VV
end

const _a = Foo(Vov);
const _b = Foo(V);

function f(stru, idx)
    @inbounds stru.v[idx]
end
julia> @benchmark f($_a, 123)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  1.977 ns … 42.563 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.307 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.337 ns ±  0.692 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram omitted: 1.98 ns to 2.54 ns, frequency by time]
 Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark f($_b, 123)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  1.108 ns … 29.334 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.219 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.215 ns ±  0.492 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram omitted: 1.11 ns to 1.27 ns, frequency by time]
 Memory estimate: 0 bytes, allocs estimate: 0.
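The UnionAll issue is easy to see directly in the REPL:

julia> VectorOfVectors{Float32} isa UnionAll  # still has free type parameters
true

julia> isconcretetype(VectorOfVectors{Float32})
false

julia> isconcretetype(VectorOfVectors{Float32, Vector{Float32}, Vector{Int64}, Vector{Tuple{}}})
true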
from unroot.jl.
Well spotted @oschulz. I think JET.jl
should reveal such things quite easily.
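For instance, something like this should flag the runtime dispatch (a sketch; the exact report format depends on the JET version):

using JET

# reports runtime dispatch / type instabilities in the call graph of f
@report_opt f(_a, 123)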
from unroot.jl.
But it doesn't solve all the problems:
julia> struct C
           v::VectorOfVectors{Float32, Vector{Float32}, Vector{Int64}, Vector{Tuple{}}}
       end

julia> const _c = C(Vov);
julia> @benchmark begin
           for i = 1:10^5
               f(_c, i)
           end
       end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  89.424 μs … 547.878 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     89.486 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   94.538 μs ±  14.213 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram omitted: 89.4 μs to 153 μs, log(frequency) by time]
 Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark begin
           for i = 1:10^5
               f(_b, i)
           end
       end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  39.804 μs … 117.688 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     39.957 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   41.774 μs ±   6.077 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram omitted: 39.8 μs to 74.4 μs, log(frequency) by time]
 Memory estimate: 0 bytes, allocs estimate: 0.
This was tested to have the same speed as Foo(), and when I tried it for LazyBranch, it still allocates a lot (although 2x faster than when the type is not stable).
from unroot.jl.
I am wondering if this is somehow related to runtime dispatch deeper in the rabbit hole, but the difference is still quite "small".
from unroot.jl.
Ha! Indeed, that looks promising.
from unroot.jl.
@all-contributors please add @oschulz for ideas
from unroot.jl.
I've put up a pull request to add @oschulz! 🎉
from unroot.jl.