Comments (30)
I tried the ultimate solution, which is just to add another type parameter to track the buffer type, and it seems to be working (see the sketch at the end of this comment):
With the change:
julia> @time begin
           for evt in t1
               length(evt.Muon_pt)
           end
       end
  0.154825 seconds (26.07 k allocations: 177.184 MiB, 15.19% gc time, 3.46% compilation time)

julia> @time begin
           for evt in t1
               length(evt.Muon_pt)
           end
       end
  0.026082 seconds (12.45 k allocations: 1.470 MiB)
On master:
julia> @time begin
           for evt in t1
               length(evt.Muon_pt)
           end
       end
  3.676705 seconds (26.81 M allocations: 1.196 GiB, 15.58% gc time, 25.18% compilation time)

julia> @time begin
           for evt in t1
               length(evt.Muon_pt)
           end
       end
  0.021609 seconds (5.90 k allocations: 767.688 KiB)
I still can't believe how well the current master works with caching...
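For context, a minimal sketch of what "adding another type to track the buffer type" means here; the names are hypothetical, for illustration only, not the actual UnROOT implementation:

# B is the concrete buffer type (e.g. some VectorOfVectors{...}); carrying it as
# a type parameter lets the compiler infer buffer accesses instead of boxing them
mutable struct MyLazyBranch{B}
    buffer::B
    # ... file handle, basket offsets, etc. would live here too
end

# type-stable: the element type is inferable from B alone
Base.getindex(b::MyLazyBranch, i::Int) = b.buffer[i]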
from unroot.jl.
we already do ;) I believe our lazy iteration (when cached) is within 2x of a concrete Numba loop over pre-allocated arrays, which means we're probably a lot faster than chunked uproot code
from unroot.jl.
I don't think this is the issue: VectorOfVectors has to do a little more work to check bounds and create the views than getindex on a vector of separate vectors does. But it's still very little time, and without bounds checking the difference is actually only about a factor of two:
julia> using ArraysOfArrays, BenchmarkTools
julia> V = [rand(Float32, rand(0:4)) for _ = 1:10^5];
julia> Vov = VectorOfVectors(V)
julia> @benchmark begin
           @inbounds for i in eachindex($V)
               length($V[i])
           end
       end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  34.919 μs … 132.067 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     38.391 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   40.404 μs ±   7.572 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram omitted: 34.9 μs to 79.2 μs, log(frequency) by time]
 Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark begin
           @inbounds for i in eachindex($Vov)
               length($Vov[i])
           end
       end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  77.848 μs … 226.026 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     87.407 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   92.920 μs ±  16.003 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram omitted: 77.8 μs to 165 μs, log(frequency) by time]
 Memory estimate: 0 bytes, allocs estimate: 0.
So that's about 1 ns for each element access (view creation plus length()) for VectorOfVectors with @inbounds (77.8 μs over 10^5 elements is roughly 0.8 ns per access, versus roughly 0.35 ns for the plain vector of vectors), and there are no memory allocations.
from unroot.jl.
But this is just 1 ns vs. 0.5 ns for the access to the element vectors. Compared to an actual operation that does something real-life on those element vectors, that time difference should be negligible.
I don't think ArraysOfArrays causes the allocations, as long as things are type-stable around it.
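A quick way to check that (a sketch with generic names, not UnROOT's actual code):

using ArraysOfArrays
using InteractiveUtils  # for @code_warntype outside the REPL

vov = VectorOfVectors([rand(Float32, 3) for _ in 1:10])
g(v) = length(v[1])

# when vov's concrete type is known, this should show no red Any/Union annotations
@code_warntype g(vov)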
from unroot.jl.
Very nice, @Moelf!
I hope I can find some time in the future to contribute more directly. I'm happy with UpROOT.jl as a stopgap, but being able to deal with ROOT files without installing a whole Python stack is very attractive. :-) Also, we should be able to outperform uproot a bit, esp. on ROOT files with small buffer sizes.
from unroot.jl.
Using ArraysOfArrays.jl should definitely help to reduce the number of memory allocations a lot.
from unroot.jl.
Why is v_l_passIso a vector of boolean vectors, instead of a vector of NamedTuples?
from unroot.jl.
because it's used like this later:
# isolation cut, require all 4 of them to be true
if all((
    v_l_passIso[pr1[1]][1],
    v_l_passIso[pr1[2]][1],
    v_l_passIso[W_id[1]][1],
    v_l_passIso[W_id[2]][1],
))
else
    return false, wgt
end
and
if (abs(v_l_pid[fifth_l]) == 11)
    !v_l_passIso[fifth_l][4] && return false, wgt
end
if (abs(v_l_pid[fifth_l]) == 13)
    !v_l_passIso[fifth_l][3] && return false, wgt
end
In short, it needs to have the same order as all the other "lepton-related" branches, so you can index all of them with the same number that identifies a unique lepton in the event.
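To illustrate the convention (a sketch with made-up values, not the actual analysis code):

# all "lepton-related" vectors are parallel: index i always refers to the same lepton
v_l_pid     = [11, -13, 13]             # hypothetical PDG IDs
v_l_pt      = [45.0f0, 30.2f0, 28.1f0]  # hypothetical transverse momenta
v_l_passIso = [Bool[1, 1, 0, 1],        # isolation flags for lepton 1
               Bool[1, 0, 1, 1],        # ... lepton 2
               Bool[1, 1, 1, 0]]        # ... lepton 3

i = 2  # one number identifies a unique lepton in the event
v_l_pid[i], v_l_pt[i], v_l_passIso[i][1]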
from unroot.jl.
I guess one option would be to use a static array or tuple, since the length of the elements of v_l_passIso is fixed. An ArraysOfArrays.VectorOfVectors would work too, of course.
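For example, a sketch of the static-array option, assuming (for illustration only) a fixed four flags per lepton; as the next comment points out, that assumption doesn't actually hold here:

using StaticArrays

# an SVector is stored inline, so a Vector{SVector{4,Bool}} is one flat allocation
v_iso = SVector{4,Bool}[]
push!(v_iso, SVector(true, true, false, true))
v_iso[1][1]  # same indexing pattern as the nested-vector version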
from unroot.jl.
> since the length of the elements of v_l_passIso is fixed

It is not: it only contains entries for leptons that passed e_mask and m_mask, and the innermost vector is of length 16, so it's probably not worth a StaticArray.
from unroot.jl.
index 86d8909..e26ff74 100644
--- a/src/root.jl
+++ b/src/root.jl
@@ -215,11 +215,16 @@ function interped_data(rawdata, rawoffsets, ::Type{T}, ::Type{J}) where {T, J<:JaggType}
-    @views [
-        ntoh.(reinterpret(
-            T, rawdata[(rawoffsets[i]+jagg_offset+1):rawoffsets[i+1]]
-        )) for i in 1:(length(rawoffsets) - 1)
-    ]
+    _size = sizeof(eltype(T))
+    data = UInt8[]
+    offset = Int64[0] # god damn 0-based index
+    @views @inbounds for i in 1:(length(rawoffsets) - 1)
+        rg = (rawoffsets[i]+jagg_offset+1):rawoffsets[i+1]
+        append!(data, rawdata[rg])
+        push!(offset, last(offset) + length(rg))
+    end
+    real_data = ntoh.(reinterpret(T, data))
+    VectorOfVectors(real_data, offset .÷ _size .+ 1)
 end
here are the numbers AFTER applying this diff:
# 1st run
17.403734 seconds (72.69 M allocations: 6.877 GiB, 9.20% gc time, 3.31% compilation time)
# 2nd run
3.135352 seconds (34.34 M allocations: 2.249 GiB, 12.11% gc time)
# 3rd run
2.990273 seconds (34.34 M allocations: 2.249 GiB, 9.52% gc time)
BEFORE this diff:
# 1st run
32.069656 seconds (213.86 M allocations: 11.560 GiB, 21.16% gc time, 1.26% compilation time)
# 2nd run
3.976044 seconds (27.73 M allocations: 1.779 GiB, 21.24% gc time)
# 3rd run
3.725007 seconds (27.73 M allocations: 1.779 GiB, 13.36% gc time)
@oschulz definitely helpful in the asymptotic limit! I'm a little confused about why subsequent runs allocate more; maybe it's because of some weird hidden conversion somewhere...
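For reference, the two-argument VectorOfVectors constructor used in the diff takes the flat data plus 1-based element pointers: element i spans data[elem_ptr[i]:elem_ptr[i+1]-1]. A small standalone example:

using ArraysOfArrays

flat     = Float32[1, 2, 3, 4, 5, 6]
elem_ptr = [1, 3, 3, 7]  # element i is flat[elem_ptr[i]:elem_ptr[i+1]-1]
vov      = VectorOfVectors(flat, elem_ptr)

vov[1] == Float32[1, 2]        # true
isempty(vov[2])                # true: a zero-length element
vov[3] == Float32[3, 4, 5, 6]  # true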
from unroot.jl.
hmmm, I tried a Ref{}() hack and it seems to be working; I'll make a PR now and we can discuss technical details there
from unroot.jl.
Using view(rawdata, rg)
instead of rawdata[rg]
could save allocations, since views are stack-allocated now.
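A minimal illustration of the difference, assuming a recent Julia where non-escaping views stay off the heap:

function sum_copy(x, rg)
    s = zero(eltype(x))
    for v in x[rg]  # x[rg] allocates a fresh array
        s += v
    end
    return s
end

function sum_view(x, rg)
    s = zero(eltype(x))
    for v in view(x, rg)  # the view never escapes, so no heap allocation
        s += v
    end
    return s
end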
from unroot.jl.
Ah, sorry, I overlooked the @views
from unroot.jl.
Maybe try this version:
function interped_data(rawdata, rawoffsets, ::Type{T}, ::Type{J}) where {T, J<:JaggType}
    # ...
    _size = Base.SignedMultiplicativeInverse{Int}(sizeof(eltype(T)))
    data = UInt8[]
    elem_ptr = sizehint!(Vector{Int}(), length(eachindex(rawoffsets)))
    push!(elem_ptr, 0)
    # Can we sizehint! data too?
    @inbounds for i in eachindex(rawoffsets)[1:end-1]
        rg = (rawoffsets[i]+jagg_offset+1):rawoffsets[i+1]
        append!(data, view(rawdata, rg))
        push!(elem_ptr, last(elem_ptr) + div(length(rg), _size))
    end
    real_data = ntoh.(reinterpret(T, data))
    elem_ptr .+= firstindex(real_data)
    VectorOfVectors(real_data, elem_ptr)
    # ...
end
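The SignedMultiplicativeInverse above replaces repeated integer division by a fixed divisor with a precomputed multiply-and-shift. A quick standalone check (this is a Base internal, so treat the exact path as version-dependent):

# precompute the inverse once; div() then avoids a hardware integer division
mi = Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int}(4)
div(12, mi) == 3  # true, same result as div(12, 4)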
from unroot.jl.
> # Can we sizehint! data too?

Maybe something smart like last(offset) ÷ sizeof(eltype(T))?
no visible difference
from unroot.jl.
Hm, I'm not sure where these allocs are coming from, but it shouldn't be from within interped_data
, since it only does a small, fixed number of allocations. What does main_looper
do? And can you benchmark just the data reading phase?
from unroot.jl.
see #65 (comment)
for minimal reproduction of the mysterious allocation
from unroot.jl.
I'll try to have a go at it ...
from unroot.jl.
I have reduced the minimal reproduction further:
julia> using ArraysOfArrays
julia> const V = [rand(Float32, rand(0:4)) for _ = 1:10^5];
julia> const Vov = VectorOfVectors{Float32}()
0-element VectorOfVectors{Float32, Vector{Float32}, Vector{Int64}, Vector{Tuple{}}}
julia> foreach(V) do i
           push!(Vov, i)
       end

julia> @benchmark begin
           for i in eachindex($V)
               length($V[i])
           end
       end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  50.451 μs … 512.027 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     56.100 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   59.019 μs ±  14.257 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram omitted: 50.5 μs to 110 μs, log(frequency) by time]
 Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark begin
           for i in eachindex($Vov)
               length($Vov[i])
           end
       end
BenchmarkTools.Trial: 9314 samples with 1 evaluation.
 Range (min … max):  472.014 μs …  1.195 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     514.337 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   533.550 μs ± 70.951 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram omitted: 472 μs to 870 μs, log(frequency) by time]
 Memory estimate: 0 bytes, allocs estimate: 0.
julia> Vov == V
true
maybe this is an ArraysOfArrays issue, not sure.
I tested the newly added Int32 offsets for VoV; they're 30% faster than the Int64 VoV. I really hope this is not some false-sharing shenanigans.
from unroot.jl.
Some new observations:
julia> const t1 = LazyTree(ROOTFile("/home/akako/Downloads/doublemu.root"), "t", [r"^Muon_(pt|eta|phi|mass)$","MET_pt"]);
julia> length(t1.Muon_pt) ÷ 10^4
247

julia> @time begin
           for idx in 1:10^4:length(t1.Muon_pt)
               t1.Muon_pt[idx]
           end
       end
  0.000530 seconds (3.56 k allocations: 406.859 KiB)

julia> @time begin
           for idx in 1:10^3:length(t1.Muon_pt)
               t1.Muon_pt[idx]
           end
       end
  0.001023 seconds (8.01 k allocations: 546.172 KiB)

julia> @time begin
           for idx in 1:10^2:length(t1.Muon_pt)
               t1.Muon_pt[idx]
           end
       end
  0.005169 seconds (51.95 k allocations: 1.884 MiB)

julia> @time begin
           for idx in 1:10:length(t1.Muon_pt)
               t1.Muon_pt[idx]
           end
       end
  0.014718 seconds (494.29 k allocations: 15.714 MiB)

julia> @time begin
           for idx in 1:1:length(t1.Muon_pt)
               t1.Muon_pt[idx]
           end
       end
  0.194961 seconds (4.89 M allocations: 150.909 MiB, 44.43% gc time)
Notice how the allocation count is not linear: initially it seems the cache is working properly, because accessing 10x more elements doesn't cause 10x the allocations, but later it scales linearly.
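That pattern is consistent with a one-basket cache: sequential reads mostly hit the cached basket, while large strides miss on almost every access. A schematic sketch of such a cache (hypothetical, not UnROOT's actual code):

# a branch that keeps only the most recently decompressed basket around
mutable struct CachedBranch{T}
    readbasket::Function  # i -> (Vector{T}, range of event indices it covers)
    cache::Vector{T}
    cached_range::UnitRange{Int}
end

function Base.getindex(b::CachedBranch, i::Int)
    if i ∉ b.cached_range
        # cache miss: decompress a whole basket (this is where allocations happen)
        b.cache, b.cached_range = b.readbasket(i)
    end
    @inbounds return b.cache[i - first(b.cached_range) + 1]
end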
from unroot.jl.
@oschulz I think I found a "real" minimal reproducing example
julia> const Vov = VectorOfVectors{Float32}()
0-element VectorOfVectors{Float32, Vector{Float32}, Vector{Int64}, Vector{Tuple{}}}
julia> const V = [rand(Float32, rand(0:4)) for _ = 1:10^5];
julia> foreach(V) do i
           push!(Vov, i)
       end

julia> struct A
           v::VectorOfVectors{Float32}
       end

julia> struct B
           v::Vector{Vector{Float32}}
       end
julia> const _a = A(Vov);
julia> const _b = B(V);
julia> function f(stru, idx)
           @inbounds stru.v[idx]
       end
f (generic function with 1 method)
# warm run
julia> @time begin
           for i = 1:10^5
               f(_a, i)
           end
       end
  0.005362 seconds (199.49 k allocations: 6.096 MiB)

# warm run
julia> @time begin
           for i = 1:10^5
               f(_b, i)
           end
       end
  0.000216 seconds
In fact, for some reason, even a single getindex allocates under @benchmark, even with @inbounds:
julia> @benchmark f($_a, 123)
BenchmarkTools.Trial: 10000 samples with 995 evaluations.
 Range (min … max):  22.168 ns …  1.808 μs  ┊ GC (min … max): 0.00% … 96.08%
 Time  (median):     23.265 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   26.942 ns ± 48.507 ns  ┊ GC (mean ± σ):  5.52% ±  3.06%
 [histogram omitted: 22.2 ns to 50.5 ns, log(frequency) by time]
 Memory estimate: 48 bytes, allocs estimate: 1.
julia> @benchmark f($_b, 123)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  1.344 ns … 15.178 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.514 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.521 ns ±  0.242 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram omitted: 1.34 ns to 1.73 ns, frequency by time]
 Memory estimate: 0 bytes, allocs estimate: 0.
from unroot.jl.
That one's easy to explain: your struct A doesn't allow for type stability, since VectorOfVectors{Float32} isa UnionAll (because VectorOfVectors has several type parameters). Try this version:
using ArraysOfArrays, BenchmarkTools
const V = [rand(Float32, rand(0:4)) for _ = 1:10^5];
const Vov = VectorOfVectors(V)
struct Foo{VV<:AbstractVector{<:AbstractVector{<:Real}}}
    v::VV
end

const _a = Foo(Vov);
const _b = Foo(V);

function f(stru, idx)
    @inbounds stru.v[idx]
end
julia> @benchmark f($_a, 123)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  1.977 ns … 42.563 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.307 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.337 ns ±  0.692 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram omitted: 1.98 ns to 2.54 ns, frequency by time]
 Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark f($_b, 123)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  1.108 ns … 29.334 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.219 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.215 ns ±  0.492 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram omitted: 1.11 ns to 1.27 ns, frequency by time]
 Memory estimate: 0 bytes, allocs estimate: 0.
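The UnionAll issue is easy to see directly in the REPL:

julia> VectorOfVectors{Float32} isa UnionAll  # still has free type parameters
true

julia> isconcretetype(VectorOfVectors{Float32})
false

julia> isconcretetype(VectorOfVectors{Float32, Vector{Float32}, Vector{Int64}, Vector{Tuple{}}})
true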
from unroot.jl.
Well spotted @oschulz. I think JET.jl
should reveal such things quite easily.
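For instance, something like this should flag the runtime dispatch (a sketch; the exact report format depends on the JET version):

using JET

# reports runtime dispatch / type instabilities in the call graph of f
@report_opt f(_a, 123)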
from unroot.jl.
But it doesn't solve all the problems:
julia> struct C
           v::VectorOfVectors{Float32, Vector{Float32}, Vector{Int64}, Vector{Tuple{}}}
       end

julia> const _c = C(Vov);
julia> @benchmark begin
           for i = 1:10^5
               f(_c, i)
           end
       end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  89.424 μs … 547.878 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     89.486 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   94.538 μs ±  14.213 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram omitted: 89.4 μs to 153 μs, log(frequency) by time]
 Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark begin
           for i = 1:10^5
               f(_b, i)
           end
       end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  39.804 μs … 117.688 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     39.957 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   41.774 μs ±   6.077 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram omitted: 39.8 μs to 74.4 μs, log(frequency) by time]
 Memory estimate: 0 bytes, allocs estimate: 0.
This was tested to have the same speed as Foo(), and when I tried it for LazyBranch, it still allocates a lot (although 2x faster than when the type is not stable).
from unroot.jl.
I am wondering if this is somehow related to runtime dispatch deeper in the rabbit hole, but the difference is still quite "small".
from unroot.jl.
Ha! Indeed, that looks promising.
from unroot.jl.
@all-contributors please add @oschulz for ideas
from unroot.jl.
I've put up a pull request to add @oschulz! 🎉
from unroot.jl.