Comments (30)

Moelf commented on May 22, 2024

I tried the ultimate solution, which is just to add another type parameter to track the buffer type, and it seems to be working:

now:

julia> @time begin
           for evt in t1
               length(evt.Muon_pt)
           end
       end
  0.154825 seconds (26.07 k allocations: 177.184 MiB, 15.19% gc time, 3.46% compilation time)


julia> @time begin
           for evt in t1
               length(evt.Muon_pt)
           end
       end
  0.026082 seconds (12.45 k allocations: 1.470 MiB)

master:

julia> @time begin
           for evt in t1
               length(evt.Muon_pt)
           end
       end
  3.676705 seconds (26.81 M allocations: 1.196 GiB, 15.58% gc time, 25.18% compilation time)

julia> @time begin
           for evt in t1
               length(evt.Muon_pt)
           end
       end
  0.021609 seconds (5.90 k allocations: 767.688 KiB)

I still can't believe how well current master works with the cache...

from unroot.jl.

Moelf commented on May 22, 2024

We already do ;) I believe our lazy iteration (when cached) is within 2x of a concrete Numba loop over pre-allocated arrays, which means we're probably a lot faster than chunked uproot code.

oschulz commented on May 22, 2024

I don't think this is the issue. VectorOfVectors has to do a little bit more work to check bounds and create the views than getindex on a vector of separate vectors does, but it's still very little time, and without bounds checking the difference is actually just about a factor of two:

julia> using ArraysOfArrays, BenchmarkTools

julia> V = [rand(Float32, rand(0:4)) for _ = 1:10^5];

julia> Vov = VectorOfVectors(V)

julia> @benchmark begin
           @inbounds for i in eachindex($V)
               length($V[i])
           end
       end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  34.919 μs … 132.067 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     38.391 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   40.404 μs ±   7.572 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄▁▅▆█▄▃▄▃ ▄  ▂ ▂▂▃                                           ▁
  █████████▄█▄▄█▄███▅▅▆▇▇█▇▆▆▄▅▅▆▅▄▅▄▅▅▄▄▅▅▅▆▄▅▆▄▄▆▅▄▄▆▅▂▄▄▄▃▄ █
  34.9 μs       Histogram: log(frequency) by time      79.2 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark begin
           @inbounds for i in eachindex($Vov)
               length($Vov[i])
           end
       end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  77.848 μs … 226.026 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     87.407 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   92.920 μs ±  16.003 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁ ▄▅ ▇█ ▅▄ ▂  ▃▂▄▁ ▁   ▂  ▁   ▁                              ▂
  █▅██▆█████▇█▆▇████████▇██▆█▇▇████▆▇▆▆▆▆▇▇▆▅▆▆▆▆▅▅▅▅▆▅▅▅▆▆▃▅▆ █
  77.8 μs       Histogram: log(frequency) by time       165 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

So that's about 1 ns for each element access (view creation plus length()) for VectorOfVectors (with @inbounds), and there are no memory allocations.

oschulz commented on May 22, 2024

But this is just 1 ns vs. 0.5 ns for the access to the element vectors. Compared to an actual operation that does something real-life on those element vectors, that time difference should be negligible.

I don't think ArraysOfArrays causes the allocations, as long as things are type-stable around it.

oschulz commented on May 22, 2024

Very nice @Moelf !

I hope I can find some time in the future to contribute more directly. I'm happy with UpROOT.jl as a stopgap, but being able to deal with ROOT files without installing a whole Python stack is very attractive. :-) Also, we should be able to outperform uproot a bit, esp. on ROOT files with small buffer sizes.

oschulz commented on May 22, 2024

Using ArraysOfArrays.jl should definitely help to reduce the number of memory allocations a lot.

oschulz commented on May 22, 2024

Why is v_l_passIso a vector of boolean vectors instead of a vector of NamedTuples?

Moelf commented on May 22, 2024

because it's used like this later:

    #isolation cut, require all 4 of them to be true
    if all((
        v_l_passIso[pr1[1]][1],
        v_l_passIso[pr1[2]][1],
        v_l_passIso[W_id[1]][1],
        v_l_passIso[W_id[2]][1],
    ))
    else
        return false, wgt
    end

and

    if (abs(v_l_pid[fifth_l]) == 11)
        !v_l_passIso[fifth_l][4] && return false, wgt
    end
    if (abs(v_l_pid[fifth_l]) == 13)
        !v_l_passIso[fifth_l][3] && return false, wgt
    end

In short, it needs to have the same order as all the other "lepton-related" branches, so you can index all of them with the same number that identifies a unique lepton in the event.
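The parallel-branch layout can be sketched like this (a toy example with made-up values; v_l_pid and v_l_passIso are the branch names from the snippets above, everything else is hypothetical):

```julia
# Toy event: all "lepton related" vectors share one index space,
# so the same index i refers to the same lepton in every branch.
v_l_pid     = [11, -13, 13]           # PDG IDs, one entry per lepton
v_l_passIso = [Bool[1, 1, 0, 1],      # isolation flags for lepton 1
               Bool[1, 0, 1, 1],      # isolation flags for lepton 2
               Bool[1, 1, 1, 0]]      # isolation flags for lepton 3

i = 2                                 # one lepton, addressed everywhere by i
@assert abs(v_l_pid[i]) == 13         # it is a muon
@assert v_l_passIso[i][3]             # its third isolation flag passes
```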

oschulz commented on May 22, 2024

I guess one option would be to use a static array or tuple, since the length of the elements of v_l_passIso is fixed. An ArraysOfArrays.VectorOfVectors would work too, of course.

Moelf commented on May 22, 2024

length of the elements of v_l_passIso is

It is not: it only contains entries for leptons that passed e_mask and m_mask, and the innermost vector is of length 16, so it's probably not worth a StaticArray.

Moelf commented on May 22, 2024
index 86d8909..e26ff74 100644
--- a/src/root.jl
+++ b/src/root.jl
@@ -215,11 +215,16 @@ function interped_data(rawdata, rawoffsets, ::Type{T}, ::Type{J}) where {T, J<:J
-        @views [
-                ntoh.(reinterpret(
-                                  T, rawdata[ (rawoffsets[i]+jagg_offset+1):rawoffsets[i+1] ]
-                                 )) for i in 1:(length(rawoffsets) - 1)
-               ]
+        _size = sizeof(eltype(T))
+        data = UInt8[]
+        offset = Int64[0] # god damn 0-based index
+        @views @inbounds for i in 1:(length(rawoffsets) - 1)
+            rg = (rawoffsets[i]+jagg_offset+1) : rawoffsets[i+1]
+            append!(data, rawdata[rg])
+            push!(offset, last(offset) + length(rg))
+        end
+        real_data = ntoh.(reinterpret(T, data))
+        VectorOfVectors(real_data, offset .÷ _size .+ 1)
     end

here's the number AFTER applying this diff:

# 1st run
 17.403734 seconds (72.69 M allocations: 6.877 GiB, 9.20% gc time, 3.31% compilation time)
# 2nd run
3.135352 seconds (34.34 M allocations: 2.249 GiB, 12.11% gc time)
# 3rd run
2.990273 seconds (34.34 M allocations: 2.249 GiB, 9.52% gc time)

BEFORE this diff:

# 1st run
32.069656 seconds (213.86 M allocations: 11.560 GiB, 21.16% gc time, 1.26% compilation time)

# 2nd run
3.976044 seconds (27.73 M allocations: 1.779 GiB, 21.24% gc time)

# 3rd run
  3.725007 seconds (27.73 M allocations: 1.779 GiB, 13.36% gc time)

@oschulz definitely helpful in the asymptotic limit! I'm a little bit confused about why subsequent runs allocate more; maybe there's some weird hidden conversion somewhere...

Moelf commented on May 22, 2024

image

Hmmm, I tried a Ref{}() hack and it seems to be working; I'll make a PR now and we can discuss the technical details there.

oschulz commented on May 22, 2024

Using view(rawdata, rg) instead of rawdata[rg] could save allocations, since views are stack-allocated now.
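The difference can be sketched as follows (a minimal illustration, not the actual UnROOT code):

```julia
rawdata = rand(UInt8, 100)
rg = 11:20

a = rawdata[rg]        # slicing copies: allocates a fresh Vector{UInt8}
b = view(rawdata, rg)  # a view aliases the parent's memory, no copy

@assert a == b && b isa SubArray
rawdata[11] = 0xff     # mutating the parent shows through the view, not the copy
@assert b[1] == 0xff
```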

oschulz commented on May 22, 2024

Ah, sorry, overlooked the @views

oschulz commented on May 22, 2024

Maybe try this version:

function interped_data(rawdata, rawoffsets, ::Type{T}, ::Type{J}) where {T, J<:JaggType}
    # ...
        _size = Base.SignedMultiplicativeInverse{Int}(sizeof(eltype(T)))
        data = UInt8[]
        elem_ptr = sizehint!(Vector{Int}(), length(eachindex(rawoffsets)))
        push!(elem_ptr,  0)
        # Can we sizehint! data too?
        @inbounds for i in eachindex(rawoffsets)[1:end-1] 
            rg = (rawoffsets[i]+jagg_offset+1) : rawoffsets[i+1]
            append!(data, view(rawdata, rg))
            push!(elem_ptr, last(elem_ptr) + div(length(rg), _size))
        end
        real_data = ntoh.(reinterpret(T, data))
        elem_ptr .+= firstindex(real_data)
        VectorOfVectors(real_data, elem_ptr)
    #...
end

Moelf commented on May 22, 2024

Can we sizehint! data too?

maybe something smart like last(offset) ÷ sizeof(eltype(T))

Moelf commented on May 22, 2024

before:
image

your version:
image

no visible difference

oschulz commented on May 22, 2024

Hm, I'm not sure where these allocs are coming from, but it shouldn't be from within interped_data, since it only does a small, fixed number of allocations. What does main_looper do? And can you benchmark just the data reading phase?

Moelf commented on May 22, 2024

See #65 (comment) for a minimal reproduction of the mysterious allocation.

oschulz commented on May 22, 2024

I'll try to have a go at it...

Moelf commented on May 22, 2024

I have reduced it to a minimal reproduction:

julia> using ArraysOfArrays

julia> const V = [rand(Float32, rand(0:4)) for _ = 1:10^5];

julia> const Vov = VectorOfVectors{Float32}()
0-element VectorOfVectors{Float32, Vector{Float32}, Vector{Int64}, Vector{Tuple{}}}

julia> foreach(V) do i
           push!(Vov, i)
       end

julia> @benchmark begin
           for i in eachindex($V)
               length($V[i])
           end
       end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  50.451 μs … 512.027 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     56.100 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   59.019 μs ±  14.257 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁▂▂ ▇█▂▆▃▂▂▂▁     ▁▁                                         ▂
  ███▁██████████████████▇▇▇▆▆▇█▇▆▆▇▇▆▆▆▆▇▆▅▅▇▅▆▅▅▆▅▆▇▆▆▆▆▆▆▇▇▇ █
  50.5 μs       Histogram: log(frequency) by time       110 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark begin
           for i in eachindex($Vov)
               length($Vov[i])
           end
       end
BenchmarkTools.Trial: 9314 samples with 1 evaluation.
 Range (min … max):  472.014 μs …  1.195 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     514.337 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   533.550 μs ± 70.951 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▁▂▆▇▇█▆▅▅▄▃▃▂▁▂                                             ▂
  ███████████████████▇▇▆▆▆▆▅▅▄▆▄▄▄▅▄▅▃▄▄▄▅▆▆▆▆▅▅▆▅▆▇███▇█▇▇▆▄▄ █
  472 μs        Histogram: log(frequency) by time       870 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> Vov == V
true

Maybe this is an ArraysOfArrays issue, not sure.

I tested the newly added Int32-offset VectorOfVectors; it's 30% faster than the Int64 one. I really hope this is not some false-sharing shenanigans.
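A sketch of what the Int32-offset variant looks like, assuming the two-argument VectorOfVectors(data, elem_ptr) constructor used elsewhere in this thread also accepts an Int32 element-pointer vector (hypothetical values):

```julia
using ArraysOfArrays

data     = rand(Float32, 10)
elem_ptr = Int32[1, 4, 8, 11]       # 1-based boundaries into `data`, stored as Int32

vov = VectorOfVectors(data, elem_ptr)

@assert length(vov) == 3            # three element vectors
@assert length(vov[2]) == 4         # vov[2] covers data[4:7]
```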

Moelf commented on May 22, 2024

some new observation:

julia> const t1 = LazyTree(ROOTFile("/home/akako/Downloads/doublemu.root"), "t", [r"^Muon_(pt|eta|phi|mass)$","MET_pt"]);

julia> length(t1.Muon_pt) ÷ 10^4
247

julia> @time begin
           for idx in 1:10^4:length(t1.Muon_pt)
               t1.Muon_pt[idx]
           end
       end
  0.000530 seconds (3.56 k allocations: 406.859 KiB)

julia> @time begin
           for idx in 1:10^3:length(t1.Muon_pt)
               t1.Muon_pt[idx]
           end
       end
  0.001023 seconds (8.01 k allocations: 546.172 KiB)

julia> @time begin
           for idx in 1:10^2:length(t1.Muon_pt)
               t1.Muon_pt[idx]
           end
       end
  0.005169 seconds (51.95 k allocations: 1.884 MiB)


julia> @time begin
           for idx in 1:10:length(t1.Muon_pt)
               t1.Muon_pt[idx]
           end
       end
  0.014718 seconds (494.29 k allocations: 15.714 MiB)

julia> @time begin
           for idx in 1:1:length(t1.Muon_pt)
               t1.Muon_pt[idx]
           end
       end
  0.194961 seconds (4.89 M allocations: 150.909 MiB, 44.43% gc time)

Notice how the allocation count is not linear. Initially the cache seems to be working properly, because accessing 10x more elements didn't cause 10x the allocations, but later you can see it scale linearly.

Moelf commented on May 22, 2024

@oschulz I think I found a "real" minimal reproducing example

julia> const Vov = VectorOfVectors{Float32}()
0-element VectorOfVectors{Float32, Vector{Float32}, Vector{Int64}, Vector{Tuple{}}}

julia> const V = [rand(Float32, rand(0:4)) for _ = 1:10^5];

julia> foreach(V) do i
           push!(Vov, i)
       end

julia> struct A
           v::VectorOfVectors{Float32}
       end

julia> struct B
           v::Vector{Vector{Float32}}
       end

julia> const _a = A(Vov);

julia> const _b = B(V);

julia> function f(stru, idx)
           @inbounds stru.v[idx]
       end
f (generic function with 1 method)

# warm run
julia> @time begin
           for i = 1:10^5
               f(_a, i)
           end
       end
  0.005362 seconds (199.49 k allocations: 6.096 MiB)

# warm run
julia> @time begin
           for i = 1:10^5
               f(_b, i)
           end
       end
  0.000216 seconds

In fact, for some reason, getting an index allocates under @benchmark even with @inbounds:

julia> @benchmark f($_a, 123)
BenchmarkTools.Trial: 10000 samples with 995 evaluations.
 Range (min … max):  22.168 ns …  1.808 μs  ┊ GC (min … max): 0.00% … 96.08%
 Time  (median):     23.265 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   26.942 ns ± 48.507 ns  ┊ GC (mean ± σ):  5.52% ±  3.06%

  ▇██▆▅▄▃▁        ▁                ▁▁▁▁                       ▂
  ████████▇▆▆▄▆▆███▇▆▅▅▄▃▃▂▂▂▄▄▇█████████████▇███▇█▇▅▅▆▅▄▆▅▆▆ █
  22.2 ns      Histogram: log(frequency) by time      50.5 ns <

 Memory estimate: 48 bytes, allocs estimate: 1.

julia> @benchmark f($_b, 123)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  1.344 ns … 15.178 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.514 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.521 ns ±  0.242 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

        █▁
  ▂▃▂▂▁▂███▄▂▂▄▃▂▂▁▁▄▃▂▂▁▁▂▄▂▂▁▁▁▅▇▅▄▂▁▁▂█▄▄▂▂▁▁▃▇▅▄▃▂▁▁▁▂▂▂ ▂
  1.34 ns        Histogram: frequency by time        1.73 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

oschulz commented on May 22, 2024

That one's easy to explain: your struct A doesn't allow for type stability, since VectorOfVectors{Float32} isa UnionAll (because VectorOfVectors has several type parameters). Try this version:

using ArraysOfArrays, BenchmarkTools

const V = [rand(Float32, rand(0:4)) for _ = 1:10^5];

const Vov = VectorOfVectors(V)

struct Foo{VV<:AbstractVector{<:AbstractVector{<:Real}}}
    v::VV
end

const _a = Foo(Vov);

const _b = Foo(V);

function f(stru, idx)
    @inbounds stru.v[idx]
end
julia> @benchmark f($_a, 123)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  1.977 ns … 42.563 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.307 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.337 ns ±  0.692 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                      ▇    ▅▁                 █▃    █▆     ▁
  ▆▅▂▂▁▂▁▁▁▃▄▂▂▁▄▇▂▂▁▆█▄▅█▃██▃▃▁▃▇▄▅▃▂▂▂▆▇▂▂▁▃██▃▃▁▃██▃▃▁▁▆█ ▄
  1.98 ns        Histogram: frequency by time        2.54 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark f($_b, 123)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  1.108 ns … 29.334 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.219 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.215 ns ±  0.492 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                         ▅█▅▁▇▄
  ▂▅▇▆▆▄▅▃▂▂▁▁▁▁▁▁▁▁▁▃▄▃▄▄▄▃▂▂▃▆▅▃▅▅▄▂▂▂▅███████▄▃▂▂▃▂▂▂▂▂▁▁ ▃
  1.11 ns        Histogram: frequency by time        1.27 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

tamasgal commented on May 22, 2024

Well spotted @oschulz. I think JET.jl should reveal such things quite easily.
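A sketch of how JET.jl could catch this (hypothetical snippet; @report_opt reports optimization failures such as runtime dispatch):

```julia
using JET, ArraysOfArrays

# The abstractly-typed field from the earlier example:
struct A
    v::VectorOfVectors{Float32}   # a UnionAll, so the field type is not concrete
end

@assert !isconcretetype(VectorOfVectors{Float32})

const _a = A(VectorOfVectors([rand(Float32, 3) for _ in 1:10]))
g(x, i) = x.v[i]

# should flag the dynamic dispatch caused by the abstract field type
@report_opt g(_a, 1)
```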

Moelf commented on May 22, 2024

But it doesn't solve the whole problem:

julia> struct C
           v::VectorOfVectors{Float32, Vector{Float32}, Vector{Int64}, Vector{Tuple{}}}
       end

julia> const _c = C(Vov);

julia> @benchmark begin
           for i = 1:10^5
               f(_c, i)
           end
       end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  89.424 μs … 547.878 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     89.486 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   94.538 μs ±  14.213 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █   ▂ ▅▃       ▁  ▁  ▁                                       ▁
  █▆█▅█▃██▅█▇▆█▇██▇▇█▆▆█▆▆█▆▆▅▅▅▅▅▇▇▆▆▇▇▅▅▆▅▂▃▅▄▄▄▅▃▄▅▅▅▄▃▃▃▃▂ █
  89.4 μs       Histogram: log(frequency) by time       153 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark begin
           for i = 1:10^5
               f(_b, i)
           end
       end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  39.804 μs … 117.688 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     39.957 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   41.774 μs ±   6.077 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █  ▂ ▄▂          ▁  ▁                                        ▁
  ██▇█▇███▇▅▇▅█▄█▇▅█▅▅█▃▄▅▇▇▇▇▇▆▅▃▆▆▆▆▆▆▅▅▆▆▅▅▅▄▅▅▃▄▄▁▄▃▁▃▅▅▅▇ █
  39.8 μs       Histogram: log(frequency) by time      74.4 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

C is tested to have the same speed as Foo(), but when I tried this for LazyBranch it still allocates a lot (although it's 2x faster than when the type is not stable).

tamasgal commented on May 22, 2024

I am wondering if this is somehow related to runtime dispatch deeper down the rabbit hole, but the difference is still quite "small".

tamasgal commented on May 22, 2024

Ha! Indeed 😆 That looks promising.

tamasgal commented on May 22, 2024

@all-contributors please add @oschulz ideas

allcontributors commented on May 22, 2024

@tamasgal

I've put up a pull request to add @oschulz! 🎉
