juliahep / UnROOT.jl
Native Julia I/O package to work with CERN ROOT file objects (TTree and RNTuple)
License: MIT License
https://root.cern/doc/master/md_tree_ntuple_v7_doc_README.html
Just putting this down so we can gather information here, since we will have to deal with it one day.
The file contains two trees that look the same
f["Scaled"] # works
f["DecayTree"] # ERROR: zlib error: incorrect header check (code: -3)
Here is the head of the stack trace:
changemode!(::TranscodingStreams.TranscodingStream{CodecZlib.ZlibDecompressor,IOStream}, ::Symbol) at .julia\packages\TranscodingStreams\MsN8d\src\stream.jl:717
callprocess(::TranscodingStreams.TranscodingStream{CodecZlib.ZlibDecompressor,IOStream}, ::TranscodingStreams.Buffer, ::TranscodingStreams.Buffer) at .julia\packages\TranscodingStreams\MsN8d\src\stream.jl:649
fillbuffer(::TranscodingStreams.TranscodingStream{CodecZlib.ZlibDecompressor,IOStream}; eager::Bool) at .julia\packages\TranscodingStreams\MsN8d\src\stream.jl:577
fillbuffer at .julia\packages\TranscodingStreams\MsN8d\src\stream.jl:564
eof(::TranscodingStreams.TranscodingStream{CodecZlib.ZlibDecompressor,IOStream}) at .julia\packages\TranscodingStreams\MsN8d\src\stream.jl:188
readbytes!(::TranscodingStreams.TranscodingStream{CodecZlib.ZlibDecompressor,IOStream}, ::Array{UInt8,1}, ::Int32) at .julia\packages\TranscodingStreams\MsN8d\src\stream.jl:371
read(::TranscodingStreams.TranscodingStream{CodecZlib.ZlibDecompressor,IOStream}, ::Int32) at .\io.jl:941
datastream(::IOStream, ::UnROOT.TKey32) at .julia\packages\UnROOT\T4A6o\src\types.jl:108
UnROOT.TTree(::IOStream, ::UnROOT.TKey32, ::Dict{Int32,Any}) at .julia\packages\UnROOT\T4A6o\src\bootstrap.jl:679
getindex(::ROOTFile, ::SubString{String}) at .julia\packages\UnROOT\T4A6o\src\root.jl:98
getindex(::ROOTFile, ::String) at .julia\packages\UnROOT\T4A6o\src\root.jl:93
array(::ROOTFile, ::String; raw::Bool) at .julia\packages\UnROOT\T4A6o\src\root.jl:142
array at .julia\packages\UnROOT\T4A6o\src\root.jl:139
binlineshape(::String, ::String, ::StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}) at c:\Users\mikha.julia\dev\OmegacDecay\script\feeddown\lineshape_from_saras_files.jl:14
I can read them with uproot (Python). What could it be?
This allows a smaller memory footprint by avoiding materializing the entire array into memory, and we can use something like OnlineStats. I have been trying to implement one for a few days now and find fBasketSeek very annoying, so I want some suggestions. I'm not sure what would be the canonical way to do this, but:
struct BRANCH_ITR
    io
    seeks
    thetype
    count::Int
end
function Base.iterate(S::BRANCH_ITR, idx=1)
    if idx > S.count          # bounds check must come before indexing seeks
        return nothing
    end
    basket_seek = S.seeks[idx]
    if basket_seek == 0
        return (nothing, idx + 1)
    else
        s = datastream(S.io, basketkey)           # NB: basketkey is not defined here
        return (readtype(s, S.thetype), idx + 1)
    end
end
this doesn't work because readtype advances the cursor and we lose track of it in the next iteration. Where would you suggest keeping track of it?
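One answer is to not track the cursor at all: since the seeks are absolute byte positions, the iterator can re-seek before every read, so the position is always derived from the state. A hedged sketch (readtype here is a toy stand-in for UnROOT's reader, and BranchIter is a hypothetical name):

```julia
# Toy stand-in for UnROOT's readtype: read one big-endian value.
readtype(io::IO, ::Type{T}) where {T} = ntoh(read(io, T))

struct BranchIter
    io::IO
    seeks::Vector{Int}   # fBasketSeek-like absolute byte positions (0 = skip)
    thetype::Type
    count::Int
end

Base.length(S::BranchIter) = S.count

function Base.iterate(S::BranchIter, idx::Int=1)
    idx > S.count && return nothing
    s = S.seeks[idx]
    s == 0 && return (nothing, idx + 1)
    seek(S.io, s)                   # re-seek every time: the cursor is never trusted
    return (readtype(S.io, S.thetype), idx + 1)
end
```

If re-seeking per element is too costly, the same idea works with the position carried explicitly in the iteration state tuple instead.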
https://quinnj.home.blog/2020/11/13/partition-all-the-datas/ Leverage this as ROOT provides natural partition too.
I came across an AWS recommendation to use a fork of zlib that Cloudflare maintains, because it's faster.
Basically,
git clone https://github.com/cloudflare/zlib.git
cd zlib
./configure --prefix=`pwd`
make
make install
and then to hackily swap between the two
# copy original zlib back
cp ~/julia/julia-1.7.0-beta4/lib/julia/bck_libz.so.1.2.11 ~/julia/julia-1.7.0-beta4/lib/julia/libz.so.1.2.11
# copy cloudflare over original
cp ~/julia/cloudflare/zlib/libz.so.1.2.8 ~/julia/julia-1.7.0-beta4/lib/julia/libz.so.1.2.11
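A quick sanity check that the swap actually took effect in the running session is to list the loaded shared libraries (paths are illustrative):

```julia
using Libdl

# List every loaded shared library whose path mentions libz; after the swap,
# this should point at the Cloudflare build rather than the bundled one.
filter(contains("libz"), dllist())
```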
Comparing @btime transcode(ZlibDecompressor, a) between the two on raw compressed basket bytes, I saw a 25 to 40% speedup. We can still see a similar speedup with
julia> using UnROOT; const t = LazyTree(ROOTFile("Run2012BC_DoubleMuParked_Muons.root"), "Events");
# regular zlib
julia> @btime sum(t.nMuon) # 770.774 ms (11366 allocations: 544.33 MiB)
julia> @btime sum(length,t.Muon_pt) # 5.322 s (18391 allocations: 2.93 GiB)
# cloudflare fork
julia> @btime sum(t.nMuon) # 542.433 ms (11366 allocations: 544.33 MiB)
julia> @btime sum(length,t.Muon_pt) # 3.975 s (18391 allocations: 2.93 GiB)
This was on an x86_64 AMD EPYC 7702P, but hopefully one can reproduce it on other systems. In another thread/PR, I think we wrote down that zlib-ng was giving a 10-15% improvement over the usual zlib library, but this Cloudflare fork is even faster!
I bet that since it's a fork, not some next-gen rewrite, it'll be easier to package into binaries (@Moelf? ;) ). And I think this could benefit many workflows, since ROOT uses zlib by default.
One day someone will ask about this :) (e.g. each event has N tracks, each of which has M subdetectors, which have P hits). The doubly jagged interpretation is ~15 lines of Julia, so I don't think this will be too complicated to add.
>>> import uproot4
>>> t = uproot4.open("tree_with_triply_jagged.root")["t1"]
>>> t["bi"].array()
<Array [[[[1, 2], [2]], ... []]], [[[100]]]] type='7 * var * var * var * int64'>
>>> t["bi"].array().tolist()
[[[[1, 2], [2]], [[3, 5]]], [[[7, 9, 11], [13, 15, 16]]], [[[17, 18], [19]], []], [], [[]], [[[]]], [[[100]]]]
file generated with
#!/usr/bin/env python
import ROOT as r
f = r.TFile("tree_with_triply_jagged.root", "recreate")
t = r.TTree("t1","t1")
vvvi = r.vector("vector<vector<int>>")()
vvvf = r.vector("vector<vector<float>>")()
t.Branch("bi", vvvi)
t.Branch("bf", vvvf)
data = [
    [[[1, 2], [2]], [[3, 5]]],
    [[[7, 9, 11], [13, 15, 16]]],
    [[[17, 18], [19]], []],
    [],
    [[]],
    [[[]]],
    [[[100]]],
]
for row in data:
    vvvi.clear()
    vvvf.clear()
    for subrow in row:
        vvi = r.vector("vector<int>")()
        vvf = r.vector("vector<float>")()
        for subsubrow in subrow:
            vi = r.vector("int")()
            vf = r.vector("float")()
            for e in subsubrow:
                vi.push_back(e)
                vf.push_back(e + 0.5)
            vvi.push_back(vi)
            vvf.push_back(vf)
        vvvi.push_back(vvi)
        vvvf.push_back(vvf)
    t.Fill()
t.Write()
f.Close()
For the file linked here, I see weird behavior. When we iterate sequentially over one branch, we should be reading each basket only once. But what I see with an additional printout
diff --git a/src/iteration.jl b/src/iteration.jl
index b3579d5..2ac6b93 100644
--- a/src/iteration.jl
+++ b/src/iteration.jl
@@ -143,6 +143,7 @@ function Base.getindex(ba::LazyBranch{T,J,B}, idx::Integer) where {T,J,B}
br = ba.buffer_range
if idx ∉ br
seek_idx = findfirst(x -> x > (idx - 1), ba.fEntry) - 1 #support 1.0 syntax
+ @show idx, br, seek_idx
bb = basketarray(ba.f, ba.b, seek_idx)
if typeof(bb) !== B
error("Expected type of interpreted data: $(B), got: $(typeof(bb))")
is
julia> using UnROOT
julia> const t = LazyTree(ROOTFile("Run2012BC_DoubleMuParked_Muons.root"), "Events", ["Muon_pt"]);
julia> for i in 1:6000; t.Muon_pt[i]; end
(idx, br, seek_idx) = (1, 5477:7314, 1)
(idx, br, seek_idx) = (1810, 1:1809, 1)
(idx, br, seek_idx) = (1811, 1:1809, 2)
(idx, br, seek_idx) = (3647, 1811:3646, 2)
(idx, br, seek_idx) = (3648, 1811:3646, 3)
(idx, br, seek_idx) = (5476, 3648:5475, 3)
(idx, br, seek_idx) = (5477, 3648:5475, 4)
Each branch object has its own cache of the last-read basket and the associated index range br. If we getindex() with an index, we check if it's in br; otherwise we have to read the whole basket. seek_idx is the basket number. So we end up calling basketarray() (which does IO + decompression + interpretation) twice for each basket.
Note
julia> t.Muon_pt.b
...
fBasketEntry: Array{Int64}((524,)) [0, 1810, 3647, 5476, 7315, 9126, 10964, 12804, 14634, 16465 … 54231870, 55053565, 55875260, 56696955, 57518650, 58340345, 59162040, 59983735, 60805430, 61540413]
...
My bet is there's some zero-based and one-based indexing issue somewhere.
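For reference, the basket lookup above can also be written with searchsortedlast, which makes the 0-based/1-based conversion explicit and behaves the same as the findfirst expression in the diff (a sketch, assuming fEntry holds the 0-based first entry of each basket):

```julia
# fEntry is 0-based (first entry of each basket); idx is 1-based.
basket_for(fEntry, idx) = searchsortedlast(fEntry, idx - 1)

fEntry = [0, 1810, 3647, 5476, 7315]
basket_for(fEntry, 1)     # 1: entries 1..1810 live in basket 1
basket_for(fEntry, 1810)  # 1
basket_for(fEntry, 1811)  # 2
```

This does not by itself explain the double read; the buffer_range values in the printout suggest the cached range and the computed basket disagree by one at the boundaries.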
Just updating some dependencies (and fighting with the Julia registrator bot).
@JuliaRegistrator register
In the past, we used to be able to do
function Tables.partitions(t::LazyTree)
# ...
return (t[r] for r in ranges)
end
Arrow.write("out.arrow", t)
but now we need to use innertable to avoid a crash:
-Tables.partitions(t::LazyTree) = (t[r] for r in _clusterranges(t))
+Tables.partitions(t::LazyTree) = (innertable(t[r]) for r in _clusterranges(t))
which modifies the API if users want to get a LazyTree chunk from partitions() (maybe that's OK?). I wonder if this is related to the DataFrames decoupling.
Hi! First of all thank you for providing the package, hope to see it grow and improve in the future!
I just started trying it out and I've found (I think) a bug. I cannot read the TTree stored in the attached file test.root.txt (I had to add .txt, otherwise GitHub does not let me upload it). In particular:
julia> using UnROOT
julia> f = ROOTFile("test.root")
ROOTFile with 3 entries and 24 streamers.
mage-output.root
├─ InputMacroCommands (TObjString)
└─ NumberOfPrimaries (TParameter<long>)
julia> keys(f)
3-element Vector{String}:
"InputMacroCommands"
"NumberOfPrimaries"
"fTree"
julia> t = LazyTree(f, "fTree")
ERROR: MethodError: no method matching keys(::Nothing)
Closest candidates are:
keys(::Union{Tables.AbstractColumns, Tables.AbstractRow}) at /home/gipert/.julia/packages/Tables/i6z2B/src/Tables.jl:184
keys(::Dictionaries.AbstractIndices) at /home/gipert/.julia/packages/Dictionaries/xfzRj/src/AbstractIndices.jl:38
keys(::OrderedCollections.OrderedSet) at /home/gipert/.julia/packages/OrderedCollections/PRayh/src/ordered_set.jl:95
...
Stacktrace:
[1] LazyTree(f::ROOTFile, s::String)
@ UnROOT ~/.julia/packages/UnROOT/EoYKp/src/iteration.jl:258
[2] top-level scope
@ REPL[18]:1
Any idea?
I was trying to do some basic IO of my ROOT files into Julia; however, my branches could not be read. It turned out that the basic problem is that I store my variables in the ROOT branches as 64-bit floats (doubles), which is not supported by UnROOT yet. Therefore I get the following error message:
ERROR: UndefVarError: TLeafD not defined
I added a small ROOT file (test.root.zip) with two TTrees (TreeF and TreeD) holding 32-bit and 64-bit floats respectively. TreeF can be read into Julia without a problem, but TreeD throws the above-mentioned error. If I got this correctly, a simple fix would be just adding the TLeafD struct in bootstrap.jl, but I haven't looked thoroughly through the entire source code to verify this. I would be really grateful if this could be fixed, so I can use Julia for processing my ROOT files. Thanks in advance.
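If the layout indeed mirrors TLeafF, the missing definition might look roughly like the following. This is a hypothetical sketch only (UnROOT generates its leaf types with its own macros); the field list follows ROOT's own TLeaf/TLeafD members, with the element size switched to 8 bytes:

```julia
# Hypothetical sketch, not UnROOT's actual definition.
struct TLeafD
    # members shared by every TLeaf subclass (names follow ROOT's conventions)
    fName::String
    fTitle::String
    fLen::Int32        # number of fixed-length elements
    fLenType::Int32    # element size in bytes (8 for a double)
    fOffset::Int32
    fIsRange::Bool
    fIsUnsigned::Bool
    fLeafCount::Any    # counter leaf for variable-size arrays, or nothing
    # TLeafD-specific range members
    fMinimum::Float64
    fMaximum::Float64
end
```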
The current design of the streamer type hierarchy (including versioning) works in a way which I do not find nice at all. At some point I thought it would be a nice solution but now I see a lot of code repetitions.
A lot of streamer logic is currently implemented in bootstrap.jl
and the overall implementation of a streamer and all of its versions is done like this:
abstract type TAttLine <: ROOTStreamedObject end
struct TAttLine_1 <: TAttLine end
function readfields!(io, fields, T::Type{TAttLine_1})
fields[:fLineColor] = readtype(io, Int16)
fields[:fLineStyle] = readtype(io, Int16)
fields[:fLineWidth] = readtype(io, Int16)
end
struct TAttLine_2 <: TAttLine end
function readfields!(io, fields, T::Type{TAttLine_2})
fields[:fLineColor] = readtype(io, Int16)
fields[:fLineStyle] = readtype(io, Int16)
fields[:fLineWidth] = readtype(io, Int16)
end
As seen above, ROOTStreamedObject is the supertype (just a simple abstract type) and there is (always) one subtype (another abstract type) representing a specific streamer. Different versions are defined by appending _VERSION to the type name and creating an empty struct. The readfields! methods then define how to read its fields. Both versions of TAttLine (v1 and v2) have the same fields with the same types, so in this case it's pure code repetition. Note that these definitions should be created dynamically in the future, but the overall layout is of course the same.
During the read-in of a ROOT file, UnROOT checks whether there is a streamer with the given name and (appended) version, like TAttLine_2, and if not, it will complain (later it will define it from the streamer info).
A simplification of this system which allows a nice bootstrapping (no code repetition) and a better version management would be something like the following:
import Base: getproperty
abstract type ROOTStreamedObject end
function Base.getproperty(obj::ROOTStreamedObject, sym::Symbol)
    fields = getfield(obj, :fields)   # getfield bypasses our getproperty override
    if !haskey(fields, sym)
        error("Type $(typeof(obj)) has no field $(String(sym))")
    end
    fields[sym]
end
struct TAttLine{V} <: ROOTStreamedObject # this should be a macro
fields::Dict{Symbol, Any}
end
where the actual struct definition could of course be simplified via a macro like @rootstreamer TAttLine or so.
The readfields!() methods can then be defined using value-type dispatch:
function readfields!(io, fields, ::Type{TAttLine{1}})
    fields[:fLineColor] = readtype(io, Int16)
    fields[:fLineStyle] = readtype(io, Int16)
    fields[:fLineWidth] = readtype(io, Int16)
end
readfields!(io, fields, ::Type{TAttLine{2}}) = readfields!(io, fields, TAttLine{1})
Here is a short demo of the instantiation:
julia> tattline = TAttLine{1}(Dict(:fLineColor => 42, :fLineStyle => 5, :fLineWidth => 23))
TAttLine{1}(Dict{Symbol,Any}(:fLineWidth => 23,:fLineColor => 42,:fLineStyle => 5))
julia> tattline.fLineWidth
23
julia> tattline.fFoo
ERROR: Type TAttLine{1} has no field fFoo
This is just an idea, any feedback is highly appreciated ;)
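To make the idea concrete, the macro could be as small as this (a sketch; @rootstreamer is the hypothetical name from above, and the struct body is exactly the one in the proposal):

```julia
abstract type ROOTStreamedObject end

# Hypothetical macro: expands to the parametric struct from the proposal.
macro rootstreamer(name)
    esc(quote
        struct $(name){V} <: ROOTStreamedObject
            fields::Dict{Symbol,Any}
        end
    end)
end

@rootstreamer TAttLine

tattline = TAttLine{1}(Dict{Symbol,Any}(:fLineColor => 42))
getfield(tattline, :fields)[:fLineColor]  # 42
```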
I just discovered a serious bug in the data and offsets readout while I was working with larger ROOT files. I think it mostly affects large datasets, but I am not sure.
With the small sample file we have no issues; everything is read out correctly and I can assemble the hits. Just as a side note: a UnROOT.KM3NETDAQHit consists of 10 bytes, and per event we have a 10-byte header.
julia> f = ROOTFile("/home/tgal/Dev/UnROOT.jl/test/samples/km3net_online.root")
ROOTFile with 10 entries and 54 streamers.
/home/tgal/Dev/UnROOT.jl/test/samples/km3net_online.root
├─ E
│ ├─ "Evt"
│ └─ ""
├─ KM3NET_TIMESLICE
│ ├─ "KM3NET_TIMESLICE"
│ └─ ""
├─ KM3NET_TIMESLICE_L0
│ ├─ "km3net_timeslice_L0"
│ └─ ""
├─ KM3NET_TIMESLICE_L1
│ ├─ "km3net_timeslice_L1"
│ └─ ""
├─ KM3NET_TIMESLICE_L2
│ ├─ "km3net_timeslice_L2"
│ └─ ""
├─ KM3NET_TIMESLICE_SN
│ ├─ "km3net_timeslice_SN"
│ └─ ""
├─ KM3NET_EVENT
│ ├─ "KM3NET_EVENT"
│ └─ ""
└─ KM3NET_SUMMARYSLICE
├─ "KM3NET_SUMMARYSLICE"
└─ ""
julia> data, offsets = UnROOT.array(f, "KM3NET_EVENT/KM3NET_EVENT/snapshotHits"; raw=true)
(UInt8[0x40, 0x00, 0x03, 0xc6, 0x00, 0x09, 0x00, 0x00, 0x00, 0x60 … 0x30, 0x40, 0xa9, 0x7d, 0x0a, 0x3c, 0x21, 0xc9, 0x03, 0x17], Int32[0, 970, 2220, 3010])
julia> length(data)
3010
julia> length(offsets)
4
julia> offsets
4-element Vector{Int32}:
0
970
2220
3010
julia> sum(diff(offsets))
3010
julia> minimum(diff(offsets))
790
julia> UnROOT.splitup(data, offsets, UnROOT.KM3NETDAQHit, skipbytes=10)
3-element Vector{Vector{UnROOT.KM3NETDAQHit}}:
[UnROOT.KM3NETDAQHit(806451572, 0x0a....
Now comes the fun part. Here is a large file with 10410 events. In UnROOT, the length of the offsets indicates 10411 events, one more than expected!
julia> f = ROOTFile("/data/irods/mc/atm_muon/KM3NeT_00000044/v6.1/trigger/mcv6.1.mupage_10G.sirene.jterbr00007209.2548.root")
ROOTFile with 17 entries and 76 streamers.
/data/irods/mc/atm_muon/KM3NeT_00000044/v6.1/trigger/mcv6.1.mupage_10G.sirene.jterbr00007209.2548.root
├─ KM3NET_SUMMARYSLICE
│ ├─ "KM3NET_SUMMARYSLICE"
│ └─ ""
├─ KM3NET_SUMMARYSLICE
│ ├─ "KM3NET_SUMMARYSLICE"
│ └─ ""
├─ E
│ ├─ "Evt"
│ └─ ""
├─ E
│ ├─ "Evt"
│ └─ ""
├─ KM3NET_TIMESLICE_SN
│ ├─ "km3net_timeslice_SN"
│ └─ ""
├─ KM3NET_TIMESLICE_SN
│ ├─ "km3net_timeslice_SN"
│ └─ ""
├─ KM3NET_TIMESLICE
│ ├─ "KM3NET_TIMESLICE"
│ └─ ""
├─ KM3NET_TIMESLICE_L0
│ ├─ "km3net_timeslice_L0"
│ └─ ""
├─ KM3NET_TIMESLICE_L1
│ ├─ "km3net_timeslice_L1"
│ └─ ""
├─ KM3NET_TIMESLICE_L2
│ ├─ "km3net_timeslice_L2"
│ └─ ""
└─ KM3NET_EVENT
├─ "KM3NET_EVENT"
└─ ""
julia> data, offsets = UnROOT.array(f, "KM3NET_EVENT/KM3NET_EVENT/snapshotHits"; raw=true)
(UInt8[0x40, 0x00, 0x05, 0xec, 0x00, 0x09, 0x00, 0x00, 0x00, 0x97 … 0x30, 0x37, 0xb3, 0xec, 0x10, 0x29, 0x2f, 0xc9, 0x04, 0x06], Int32[0, 1520, 2000, 2460, 2920, 3740, 4840, 5610, 6120, 6960 … 2982630, 2983100, 2983870, 2984680, 2986230, 2986700, 2987130, 2987610, 2988370, 2989030])
julia> length(data)
7936390
julia> length(offsets)
10412
julia> sum(diff(offsets)) # this does not give 7936390 due to the jump, see below
2989030
julia> minimum(diff(offsets))
-4947360
julia> argmin(diff(offsets))
6553
julia> offsets[6550:6560]
11-element Vector{Int32}:
4944770
4945250
4945910
4947360
0
1080
1720
2330
2810
4270
5370
I first compared the lengths of events (numbers of hits) at the beginning and at the end of the data, and they all match! It's just this weird jump in the middle of the offsets.
Luckily the number of bytes in data matches the expected number when cross-checking with uproot (it's n_events * 10 bytes + n_hits * 10 bytes == 7936390).
Now I have to understand why the offset starts counting from 0 again and how to solve it. For now, UnROOT simply iterates over diff(offsets), and that worked so far, but apparently it's not the full story ;)
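Until the root cause is found, the reset can be detected and undone when stitching baskets together. A hedged sketch, assuming the counter only ever resets to 0 at a basket boundary (which matches the dump above, where 4947360 is followed by 0):

```julia
# Rebuild a monotonic offset array from per-basket offsets that restart at 0.
function stitch_offsets(offsets::Vector{Int32})
    out = similar(offsets, Int64)
    base = Int64(0)
    prev = Int64(0)
    for (i, o) in enumerate(offsets)
        if o < prev           # a reset: a new basket started counting from 0
            base += prev
        end
        out[i] = base + o
        prev = o
    end
    out
end

stitch_offsets(Int32[0, 970, 2220, 3010, 0, 1080])  # [0, 970, 2220, 3010, 3010, 4090]
```

Note this keeps the duplicated boundary entry (3010 appears twice above), which would also explain the off-by-one in the offsets length.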
Right now the most complicated logic apart from streamer parsing is:
https://github.com/tamasgal/UnROOT.jl/blob/36fa5eb0a1e19c58f1ade0e40d9b4f5e43a35244/src/iteration.jl#L109-L122
and its auto_T_JaggT dependency. Every time we try to make some change, we need to guess at runtime what the resultant type is and pre-process it.
I think a better strategy is to actually run the runtime functions and fetch one event just to see the type, and complete the constructor accordingly. This involves type computation, so maybe
https://github.com/vtjnash/ComputedFieldTypes.jl
can simplify our lives.
Right now this ends up calling:
https://github.com/tamasgal/UnROOT.jl/blob/db8938009f62f12d994433b81efe5d81eb91708f/src/iteration.jl#L307-L309
which is not very optimized (it's really not bad btw, see for yourself). There are mainly two optimizations:
- call basketarray to read the baskets and combine them.
- combine the data and offsets of the VectorOfVectors such that the entire column is one big VectorOfVectors.
This is obviously one of the more important features but I have to admit, I do not work with ROOT histograms so I first need to understand how they are stored.
With reference to #2 (comment) where @Moelf provided a sample histogram file (thanks!).
I just had a look with uproot and it seems that the usual histogram features are there (bins, edges and values), plus I also found the variances. In addition, there is a version of each of them with an all prefix. A few other parameters are also there: low, high, underflows, ...
I'll look into the implementation of the histogram parsing in existing libraries.
ROOTFile with 2 entries and 25 streamers.
./test/samples/km3net_offline.root
ERROR: unlock count must match lock count
Stacktrace:
[1] error(s::String)
@ Base ./error.jl:33
[2] unlock(rl::ReentrantLock)
...
caused by: UndefVarError: Head not defined
Stacktrace:
[1] macro expansion
@ ~/.julia/dev/UnROOT/src/root.jl:111 [inlined]
[2] (::UnROOT.var"##getter#258#116"{ROOTFile, String})()
@ UnROOT ~/.julia/packages/Memoization/f2NeI/src/Memoization.jl:96
[3] get!(default::UnROOT.var"##getter#258#116"{ROOTFile, String}, lru::LRUCache.LRU{Any, Any}, key::Tuple{Tuple{ROOTFile, String}, NamedTuple{(), Tuple{}}})
I ran across another (small) issue.
Most of the time my data in ROOT files is stored in TTrees, since these permit storing different datatypes in one collection. However, for storing some settings (like random-number-generator seeds) I prefer to store the values in TNtuples, which are similar to TTrees but only allow storing 32-bit floats. Since the TKey of a TNtuple differs from that of a TTree, array() is not able to process the entries.
I again added a small file (test.root.zip) with the same data stored once in a TTree and once in a TNtuple. It would be helpful to add support for TNtuples. In principle this should not differ much from a TTree with a TBranch containing TLeafF objects.
thanks @grabanal for reporting this
I stumbled upon https://github.com/andyferris/Dictionaries.jl and saw this:
julia> dict = Dictionary(["a", "b", "c"], [1, 2, 3])
3-element Dictionary{String,Int64}
"a" │ 1
"b" │ 2
"c" │ 3
julia> dict .+ 1
3-element Dictionary{String,Int64}
"a" │ 2
"b" │ 3
"c" │ 4
and performance is an aim of the package, so I thought this could be a solid base for jagged arrays.
Some recent discussions revealed that the memory footprint is a bit unpredictable and most probably dominated by the caching mechanism in readbasketseek():
https://github.com/tamasgal/UnROOT.jl/blob/ac38c7505655ed92cae15d9f332223a6a0ac049f/src/root.jl#L417
We should make the cache configurable (like uproot does) and also revisit the LRU size accounting, as it might not be accurate for nested objects. Both points are low-hanging fruit, so if you have any memory problems, you are welcome to help :)
Could someone look into this?
input is here: http://uaf-10.t2.ucsd.edu/~phchang//julia/trackntuple.root
Thanks,
Philip
_
_ _ _(_)_ | Documentation: https://docs.julialang.org
(_) | (_) (_) |
_ _ _| |_ __ _ | Type "?" for help, "]?" for Pkg help.
| | | | | | |/ _` | |
| | |_| | | | (_| | | Version 1.6.2 (2021-07-14)
_/ |\__'_|_|_|\__'_| | Official https://julialang.org/ release
|__/ |
julia> using UnROOT
julia> f = ROOTFile("trackntuple.root")
ROOTFile with 1 entry and 28 streamers.
/home/users/phchang/public_html/julia/trackntuple.root
└─ trackingNtuple (TDirectory)
└─ tree (TTree)
├─ "event"
├─ "lumi"
├─ "run"
├─ "⋮"
├─ "simvtx_sourceSimIdx"
├─ "simvtx_daughterSimIdx"
└─ "simpv_idx"
julia> lt = LazyTree(f, "trackingNtuple/tree", [r"pix_.*"])
┌ Warning: Can't automatically create LazyBranch for branch trackingNtuple/tree/pix_ladder. Returning a branch object
└ @ UnROOT ~/.julia/packages/UnROOT/vROjM/src/root.jl:133
ERROR: MethodError: Cannot `convert` an object of type
UnROOT.TBranchElement_10 to an object of type
LazyBranch
Closest candidates are:
convert(::Type{T}, ::LinearAlgebra.Factorization) where T<:AbstractArray at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/factorization.jl:58
convert(::Type{T}, ::T) where T<:AbstractArray at abstractarray.jl:14
convert(::Type{T}, ::T) where T at essentials.jl:205
Stacktrace:
[1] setindex!(h::Dict{Symbol, LazyBranch}, v0::UnROOT.TBranchElement_10, key::Symbol)
@ Base ./dict.jl:382
[2] LazyTree(f::ROOTFile, s::String, branches::Vector{Regex})
@ UnROOT ~/.julia/packages/UnROOT/vROjM/src/iteration.jl:252
[3] top-level scope
@ REPL[3]:1
(@v1.6) pkg> status UnROOT
Status `~/.julia/environments/v1.6/Project.toml`
[3cd96dde] UnROOT v0.6.1 `https://github.com/tamasgal/UnROOT.jl#master`
(@v1.6) pkg>
Currently, we cannot read doubly jagged branches (e.g. vector<vector<float>>). I didn't get very far trying to parse these more generally, but I'll dump my progress here to save time for whoever ends up tackling it in the future :)
Example file:
import ROOT as r
import random
f = r.TFile("doublejagged.root", "recreate")
t = r.TTree("t1","t1")
vvi = r.vector("vector<int>")()
vvf = r.vector("vector<float>")()
t.Branch("bi", vvi)
t.Branch("bf", vvf)
data = [
    [[2], [3, 5]],
    [[7, 9, 11], [13]],
    [[17], [19], []],
    [],
    [[]],
]
for row in data:
    vvi.clear()
    vvf.clear()
    for subrow in row:
        vi = r.vector("int")()
        vf = r.vector("float")()
        for e in subrow:
            vi.push_back(e)
            vf.push_back(e + 0.5)
        vvi.push_back(vi)
        vvf.push_back(vf)
    t.Fill()
t.Write()
f.Close()
using UnROOT
f = ROOTFile("doublejagged.root")
function g()
    f = ROOTFile("doublejagged.root")
    data, offsets = UnROOT.array(f, "t1/bf"; raw=true)
    T = Float32
    jagg_offset = 10
    out = Vector{Vector{Vector{T}}}()
    for i in 1:(length(offsets) - 1)
        flat = data[(offsets[i] + 1 + jagg_offset):offsets[i + 1]]
        row = Vector{Vector{T}}()
        cursor = 1
        while cursor < length(flat)
            n = ntoh(only(reinterpret(Int32, flat[cursor:cursor + sizeof(Int32) - 1])))
            cursor += sizeof(Int32)
            b = ntoh.(reinterpret(T, flat[cursor:cursor + n * sizeof(T) - 1]))
            cursor += n * sizeof(T)
            push!(row, b)
        end
        push!(out, row)
    end
    out
end
gives
5-element Vector{Vector{Vector{Float32}}}:
[[2.5], [3.5, 5.5]]
[[7.5, 9.5, 11.5], [13.5]]
[[17.5], [19.5], []]
[]
[[]]
I parsed it as if it were singly jagged first, to end up with "flattened" arrays per event. Each flattened event has an Int32 corresponding to the number of elements in a sub-array, followed by the sub-array (then it repeats).
The implementation is reasonably fast I think. For 10k events with (on average) 5 vectors of 5 elements each, the interpretation rate is ~1MHz.
I'm trying to bruteforce/pattern-match my way through reading a TH1D and am a bit stuck. My wip branch is https://github.com/tamasgal/UnROOT.jl/compare/master...aminnj:reading-th1d?expand=1 and that includes a small CMS analysis ntuple with the histogram.
julia> using Revise
julia> using UnROOT
julia> f = ROOTFile("/Users/namin/.julia/dev/UnROOT/test/samples/cms_ntuple_wjet.root");
julia> tkey = f.directory.keys[1]
UnROOT.TKey32(702, 4, 1491, 0x69cf5640, 55, 1, 266, 100, "TH1D", "AK4CategPresel_cutflow", "")
julia> UnROOT.@initparse;
julia> ds = UnROOT.datastream(f.fobj, tkey);
julia> preamble = UnROOT.Preamble(ds, Missing);
julia> UnROOT.stream!(ds, fields, UnROOT.TH1)
fields = Dict{Symbol, Any}(:fLineWidth => 1, :fMarkerColor => 1, :fLineColor => 602, :fFillColor => 0, :fMarkerSize => 1.0f0, :fNcells => 18, :fLabelOffset => 0.005f0, :fTitleSize => 0.035f0, :fXmax => 16.0, :fLabelSize => 0.035f0, :fTitleColor => 1, :fBits2 => 0x0003, :fNdivisions => 510, :fTitle => "", :fXmin => 0.0, :fLabelColor => 1, :fXbins => Any[], :fLast => 0, :fLabelFont => 42, :fMarkerStyle => 1, :fTitleFont => 42, :fName => "xaxis", :fTimeDisplay => false, :fFirst => 0, :fNbins => 16, :fFillStyle => 1001, :fTitleOffset => 1.0f0, :fTickLength => 0.03f0, :fTimeFormat => "", :fAxisColor => 1, :fLineStyle => 1)
(name, size) = ("\x7f\0", 83886336)
ERROR: Object 'UnROOT.TAxis' has 132 bytes; expected 770
...
I'm basically stuck at the THashList (evidently just a TList) part of TAxis:
julia> [(x.fName, x.fSize, x.fTypeName) for x in UnROOT.streamerfor(f, "TH1").streamer.fElements.elements]
26-element Vector{Tuple{String, Signed, String}}:
("TNamed", 0, "BASE")
("TAttLine", 0, "BASE")
("TAttFill", 0, "BASE")
("TAttMarker", 0, "BASE")
("fNcells", 4, "int")
("fXaxis", 216, "TAxis") # <----------------
("fYaxis", 216, "TAxis")
("fZaxis", 216, "TAxis")
("fBarOffset", 2, "short")
("fBarWidth", 2, "short")
("fEntries", 8, "double")
("fTsumw", 8, "double")
("fTsumw2", 8, "double")
("fTsumwx", 8, "double")
("fTsumwx2", 8, "double")
("fMaximum", 8, "double")
("fMinimum", 8, "double")
("fNormFactor", 8, "double")
("fContour", 24, "TArrayD")
("fSumw2", 24, "TArrayD")
("fOption", 24, "TString")
("fFunctions", 8, "TList*")
("fBufferSize", 4, "int")
("fBuffer", 8, "double*")
("fBinStatErrOpt", 4, "TH1::EBinErrorOpt")
("fStatOverflows", 4, "TH1::EStatOverflows")
julia> [(x.fName, x.fSize, x.fTypeName) for x in UnROOT.streamerfor(f, "TAxis").streamer.fElements.elements]
13-element Vector{Tuple{String, Signed, String}}:
("TNamed", 0, "BASE")
("TAttAxis", 0, "BASE")
("fNbins", 4, "int")
("fXmin", 8, "double")
("fXmax", 8, "double")
("fXbins", 24, "TArrayD")
("fFirst", 4, "int")
("fLast", 4, "int")
("fBits2", 2, "unsigned short")
("fTimeDisplay", 1, "bool")
("fTimeFormat", 24, "TString")
("fLabels", 8, "THashList*") # <----------------
("fModLabs", 8, "TList*")
With
preamble = Preamble(io, TList)
skiptobj(io)
name = readtype(io, String)
size = readtype(io, Int32)
@show name, size
I get
(name, size) = ("\x7f\0", 83886336)
...looks like nonsense to me. Do you see where I've gone wrong?
For reference, the histogram looks like this:
which means that I expect the fXaxis.fLabels TList to not be empty, as verified with uproot:
>>> import uproot3
>>> uproot3.open("/Users/namin/.julia/dev/UnROOT/test/samples/cms_ntuple_wjet.root")["AK4CategPresel_cutflow"]._fXaxis._fLabels
[b'Root', b'Weight', b'Preselection', b'SelectGenPart', b'GoodRunsList', b'EventFilters', b'SelectLeptons', b'SelectJets', b'Trigger', b'ObjectsSelection', b'SSPreselection', b'NjetGeq4', b'AK4CategTagHiggsJets', b'AK4CategTagVBSJets', b'AK4CategChannels', b'AK4CategPresel']
Thanks a lot in advance, and apologies if it's clear I'm not a julian ;)
what the heck is signed char...
There's a super free optimization that is especially effective for network I/O:
First:
diff --git a/src/iteration.jl b/src/iteration.jl
index 04e5d7d..7e30835 100644
--- a/src/iteration.jl
+++ b/src/iteration.jl
@@ -333,7 +333,7 @@ function Base.getindex(ba::LazyBranch{T,J,B}, range::UnitRange) where {T,J,B}
ib2 = findfirst(x -> x > (last(range) - 1), ba.fEntry) - 1
offset = ba.fEntry[ib1]
range = (first(range)-offset):(last(range)-offset)
- return vcat([basketarray(ba, i) for i in ib1:ib2]...)[range]
+ return Vcat(asyncmap(i->basketarray(ba, i), ib1:ib2)...)[range]
end
but to make this useful, we also need:
diff --git a/src/root.jl b/src/root.jl
index e4eb852..abb1c09 100644
--- a/src/root.jl
+++ b/src/root.jl
@@ -472,17 +472,17 @@ function readbasket(f::ROOTFile, branch, ith)
end
function readbasketseek(f::ROOTFile, branch::Union{TBranch, TBranchElement}, seek_pos::Int, nb)
- lock(f)
- local basketkey, compressedbytes
+ # lock(f)
+ local rawbuffer
try
seek(f.fobj, seek_pos)
rawbuffer = OffsetBuffer(IOBuffer(read(f.fobj, nb)), seek_pos)
- basketkey = unpack(rawbuffer, TBasketKey)
- compressedbytes = compressed_datastream(rawbuffer, basketkey)
catch
finally
- unlock(f)
+ # unlock(f)
end
+ basketkey = unpack(rawbuffer, TBasketKey)
+ compressedbytes = compressed_datastream(rawbuffer, basketkey)
Now, the problem of course is that removing the lock will break our multi-threaded reading, because the current cursor location is stored in the IOStream. Technically, network-based sources don't need a position(), since each read is byte-range based, but our on-disk IOStream will surely complain.
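One way to sidestep the shared cursor entirely on disk is to map the file once and make every read a pure byte-range slice, so no lock is needed for concurrent readers. A minimal sketch (the filename is a placeholder; integration with readbasketseek is left out):

```julia
using Mmap

# Map the whole file once; concurrent readers just slice byte ranges,
# so there is no shared cursor to protect with a lock.
io = open("file.root")
buf = Mmap.mmap(io, Vector{UInt8}, filesize(io))

# seek_pos is 0-based like the existing seek/read calls; @view avoids a copy.
read_range(buf, seek_pos, nb) = @view buf[seek_pos + 1:seek_pos + nb]
```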
Mmap-based reading for on-disk files?

This issue is used to trigger TagBot; feel free to unsubscribe.
If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.
If you'd like me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!
I've read more of the source code now and I understand that which "struct"s are supported currently depends on manual type-up in types.jl. So, no rush to fix this particular "issue", but just a reminder that we should cover this eventually.
Probably for a 2.0 release or something, but I ran into some very light / nice implementations, for example https://github.com/marius311/Memoization.jl.
I realized uproot4 has dropped the way-too-many kinds of caches, so we should probably learn from Jim in making one here. I think a good start would be to at least cache a whole branch / tree if it fits, in case users are doing some small exploratory work. For the larger-than-RAM, TChain-like event loop, a cache is less useful.
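For the small-exploration case, even a dict-based whole-branch cache would do as a first cut (a sketch; all names here are illustrative, not an API proposal):

```julia
# Cache eagerly-read branches keyed on (file path, branch path).
const BRANCH_CACHE = Dict{Tuple{String,String},Any}()

# load is whatever function actually reads the branch; it only runs on a miss.
cached_branch(load, filepath, branchpath) =
    get!(() -> load(filepath, branchpath), BRANCH_CACHE, (filepath, branchpath))
```

A real version would bound the total byte size, which is exactly where the LRU accounting question from the memory-footprint issue comes back in.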
@8me since you're looking at relevant code, maybe I can borrow your wisdom! I'm trying to figure out the padding for TLorentzVector: https://github.com/scikit-hep/uproot3-methods/blob/ba9a97b3dc71c7030a9ec15a9d97397b5ff8aa0d/uproot3_methods/classes/TLorentzVector.py
import ROOT as r
from array import array
f = r.TFile("LorentzVector.root", "recreate")
t = r.TTree("t1","t1")
# x y z e
p4 = r.TLorentzVector(1, 1, 0, 5)
t.Branch("LV.", "TLorentzVector", p4)
for i in range(10):
    p4.SetPxPyPzE(1, 1, 0, i)
    t.Fill()
t.Write()
f.Close()
I just ran into another error message:
julia> ROOTFile("/hosts/nashome/riederer_bernd/Data/Output/raw-su3-ym-4d-10-2.096680-0.251411-1.987610-adj-200819-085736.root")
ERROR: Unsupported compression type 'L4'
Stacktrace:
[1] error(::String) at ./error.jl:33
[2] UnROOT.Streamers(::IOStream) at /hosts/nashome/riederer_bernd/.julia/packages/UnROOT/KN5MU/src/streamers.jl:85
[3] ROOTFile(::String) at /hosts/nashome/riederer_bernd/.julia/packages/UnROOT/KN5MU/src/root.jl:39
[4] top-level scope at none:1
The ROOT file I am trying to read was created with ROOT v6.14/00, and the following compression settings were obtained from the file using the ROOT CLI:
root [0]
Attaching file raw-su3-ym-4d-4-9.927979-0.780315-0.171814-adj-200821-082121.root as _file0...
(TFile *) 0x562dada6d350
root [1] _file0->GetCompressionAlgorithm()
(int) 0
root [2] _file0->GetCompressionLevel()
(int) 9
root [3] _file0->GetCompressionSettings()
(int) 9
I guess the program is catching my case even though it should not be a problem for ZlibDecompressorStream()
to read my files. I believe this should work, since all my files with lower compression levels were read without a problem.
If that is not the case, however, it could perhaps be fixed by adding another decompression package such as CodecLz4.jl.
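For reference, ROOT packs the algorithm and level into a single integer, which explains the `(int) 9` above. A quick sketch of the decomposition (the `100 * algorithm + level` packing is ROOT's convention; the helper name is ours):

```julia
# ROOT stores the compression settings as 100 * algorithm + level, so
# settings == 9 means algorithm 0 (the global default, typically zlib)
# at level 9, while 404 would be LZ4 at level 4.
decode_compression(settings::Integer) =
    (algorithm = settings ÷ 100, level = settings % 100)
```

The two-character magic UnROOT reports ("ZL", "XZ", "L4", ...) lives in each compressed record's header on disk, which is why the file-level setting and the per-basket codec can tell different stories.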
julia> using UnROOT
julia> f = ROOTFile(joinpath(dirname(pathof(UnROOT)), "../test/samples/histograms1d2d.root"))
ROOTFile with 5 entries and 14 streamers.
/Users/namin/.julia/dev/UnROOT/src/../test/samples/histograms1d2d.root
# nothing printed out here
julia> keys(f)
5-element Vector{String}:
"myTH1F"
"myTH1D"
"myTH2F"
"myTH2D"
"myTH1D_nonuniform"
julia> f["myTH1D"] |> UnROOT.parseTH
([40.0, 2.0], (-2.0:2.0:2.0,), [800.0, 2.0])
The only hint that something is there is the "5 entries and 14 streamers". Trees are printed out, so it would be nice to see histograms there as well.
I recently came across https://github.com/SciML/RecursiveArrayTools.jl/blob/196ecd7f68bff6b2cb196241dbae3b0dbf68f1dd/src/utils.jl#L126 which reminded me that it would be useful to have something like ROOT's TChain, either with this chain function or something more tailored to UnROOT. I don't know the pitfalls of using this chain, but it seems to more or less work out of the box.
julia> using UnROOT
julia> import RecursiveArrayTools: chain
julia> fnames = files[1:4]
4-element Vector{String}:
"files/output_NANOAOD_0FZl3J8yuv.root"
"files/output_NANOAOD_0L5MLORY1C.root"
"files/output_NANOAOD_0W9htYy8qc.root"
"files/output_NANOAOD_0b7P4Oq0hh.root"
julia> trees = LazyTree.(ROOTFile.(fnames), "Events", Ref(["Jet_pt", "MET_pt"]));
julia> ch = chain(trees...);
julia> println(length(trees[1]))
250
julia> println(length(ch))
1000
julia> sum(evt.MET_pt for evt in ch)
74405.19f0
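A minimal alternative sketch that avoids RecursiveArrayTools altogether: lazily chain several per-file event collections with `Base.Iterators.flatten` (plain vectors of NamedTuples stand in for `LazyTree`s here):

```julia
# Toy stand-ins for two LazyTrees: each "tree" is a vector of events.
trees = [[(MET_pt = 1.0f0,), (MET_pt = 2.0f0,)],
         [(MET_pt = 3.0f0,), (MET_pt = 4.0f0,)]]

ch = Iterators.flatten(trees)          # no copying, just chained iteration
total = sum(evt.MET_pt for evt in ch)  # same shape as the chained sum above
```

This only gives iteration, not random access or a tabular display, which is presumably where a tailored solution would earn its keep.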
So I came across a super weird issue: I couldn't open an analysis ntuple with a pretty basic type (vector<float>). Apparently this depends on the environment 😕. Test file:
import ROOT as r
f = r.TFile("output_good.root", "recreate")
t = r.TTree("Events", "")
obj = r.vector("float")()
t.Branch("Jet_pt", obj)
rows = [[], [27.324586868286133, 24.88954734802246, 20.853023529052734], [], [20.330659866333008], [], []]
for row in rows:
    obj.clear()
    for x in row:
        obj.push_back(x)
    t.Fill()
t.Write()
f.Close()
gives me
julia> ROOTFile("output_good.root")["Events/Jet_pt"]
6-element LazyBranch{Vector{Float32}, UnROOT.Offsetjagg}:
[]
[27.324587, 24.889547, 20.853024]
[]
[20.33066]
[]
[]
if I make the ntuple on my laptop (MacOS, Python3, ROOT 6.16/00). But it fails with
julia> ROOTFile("output_bad.root")["Events/Jet_pt"]
6-element LazyBranch{Vector{Float32}, UnROOT.Nojagg}:
Error showing value of type LazyBranch{Vector{Float32}, UnROOT.Nojagg}:
ERROR: MethodError: Cannot `convert` an object of type Float32 to an object of type Vector{Float32}
if I make the ntuple on a remote machine in a CMS environment (SLC7/CMSSW_10_2_5, Python 2.7 😬, ROOT 6.12/07).
Raw bytes are the same according to array(...; raw=true). The only difference I found was that fLeaves on that branch has
fLenType: Int32 4 # "bad"/remote ntuple
fLenType: Int32 0 # "good"/local ntuple
which means that
function JaggType(leaf)
# https://github.com/scikit-hep/uproot3/blob/54f5151fb7c686c3a161fbe44b9f299e482f346b/uproot3/interp/auto.py#L144
(match(r"\[.*\]", leaf.fTitle) !== nothing) && return Nooffsetjagg
leaf isa TLeafElement && leaf.fLenType==0 && return Offsetjagg
return Nojagg
end
mis-classifies it as a Nojagg branch because it fails the leaf.fLenType==0 check. Given that uproot4 can read both ntuples just fine, is there a way around the fLenType check, or an alternative? Note that the fLenType check was introduced to deal with this case: #55 . So we'll have to revisit that...
A prototype implementation has been shown at https://gist.github.com/aminnj/79ceeb7ab9026f80e3cc1978124f8370 by @aminnj
The use case is when the analysis is highly columnar (but not single-column); users can potentially gain from SIMD. So imagine computing an invariant mass using two columns:
for table in Tables.partitions(t::LazyTree)
mass.(table.lep1, table.lep2)
# or
@simd for row in table
compute(row.blah1, row.blah2)
end
end
in this case, the columns of the inner table have to be contiguous in RAM (then we hope the compiler is smart). Actually, @oschulz, can you confirm that a TypedTable of VectorOfVectors is SIMD-able at all?
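To make the hoped-for pattern concrete, here is a toy sketch of the kind of inner loop we want to vectorize, assuming the partition's columns are plain contiguous `Vector`s (the formula is a stand-in, not a real invariant mass):

```julia
# Toy kernel over two contiguous columns; with plain Vectors the compiler
# has a fair chance to emit SIMD instructions for this loop.
function pairwise_mass(pt1::Vector{Float64}, pt2::Vector{Float64})
    out = similar(pt1)
    @inbounds @simd for i in eachindex(pt1, pt2)
        out[i] = sqrt(2.0 * pt1[i] * pt2[i])  # stand-in formula
    end
    out
end
```

The open question above is exactly whether the views handed out by a TypedTable of VectorOfVectors preserve the contiguity this relies on.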
I'm trying to read a tree (variable) that has a "ROOT::Math::LorentzVector<ROOT::Math::PtEtaPhiM4D<float> >" branch (met_p4) from a test file.
I can do
julia> f = ROOTFile("/Users/namin/.julia/dev/UnROOT/test/samples/cms_ntuple_wjet.root",
customstructs = Dict("ROOT::Math::PtEtaPhiM4D<float>" => Vector{Float32}));
julia> UnROOT.array(f, "variable/met_p4/fCoordinates/fCoordinates.fPt"; raw=true)
(UInt8[0x42, 0x8b, 0xf0, 0x6d, 0x41, 0xc9, 0x33, 0x05, 0x43, 0x03 … 0x45, 0xe5, 0x40, 0x88, 0xf6, 0x89, 0x41, 0x8d, 0x24, 0xeb], Int32[0])
but then raw=false
crashes, unless I modify
https://github.com/tamasgal/UnROOT.jl/blob/31770c20502779195ecde324f5255b3ffeecd6a4/src/utils.jl#L58-L63
by commenting out
leaf isa TLeafElement && return Offsetjagg
Then I get
julia> print(UnROOT.array(f, "variable/met_p4/fCoordinates/fCoordinates.fPt"; raw=false))
Float32[69.96958, 25.149912, 131.66693, 150.56802, 175.48634, 7.744735, 125.21542, 118.75257, 172.247, 39.266693, 43.818253, 4.641392, 64.815704, 57.03564, 75.78774, 56.980965, 40.431374, 52.25016, 59.041767, 66.45392, 71.02385, 95.63651, 4.2800946, 17.643026]
This branch is not jagged, so does this classification need to be changed a bit?
ROOT offers different ways to store arrays or even structs in TBranches. I added a small ROOT file (test_array.root.zip) which contains a TTree called 'arrays' (holding a scalar, a 6d vector and a 2x3 matrix) and another TTree called 'structs' (holding the same scalar as 'arrays' plus a struct which again resembles a 2x3 matrix but is stored as a 6d vector and is defined in the following way):
struct m23 {double el[6];};
So the file was created in the following way (may be useful when trying to retrieve data in Julia):
struct m23 {double el[6];};
m23 mat_23 = {1.0,2.0,3.0,4.0,5.0,6.0};
int n = 1;
double vec[6] = {1.0,2.0,3.0,4.0,5.0,6.0};
double mat[2][3] = {{1.0,2.0,3.0},{4.0,5.0,6.0}};
TFile *file = new TFile("test_array.root","Recreate");
TTree *tree_arr = new TTree("arrays","tree filled with arrays");
TTree *tree_str = new TTree("structs","tree filled with struct");
tree_arr -> Branch("nInt",&n);
tree_arr -> Branch("6dVec",&vec,"[6]/D");
tree_arr -> Branch("2x3Mat",&mat,"[2][3]/D");
tree_arr -> Fill();
tree_str -> Branch("nInt",&n);
tree_str -> Branch("2x3mat",&mat,"el[6]/D");
tree_str -> Fill();
file -> Write("",TObject::kOverwrite);
In the ROOT command line the TTrees look like this in the end:
root [1] arrays -> Scan();
***********************************************************
* Row * Instance * nInt * 6dVec.6dV * 2x3Mat.2x *
***********************************************************
* 0 * 0 * 1 * 1 * 1 *
* 0 * 1 * 1 * 2 * 2 *
* 0 * 2 * 1 * 3 * 3 *
* 0 * 3 * 1 * 4 * 4 *
* 0 * 4 * 1 * 5 * 5 *
* 0 * 5 * 1 * 6 * 6 *
***********************************************************
root [2] structs -> Scan();
***********************************************
* Row * Instance * nInt * 2x3mat.2x *
***********************************************
* 0 * 0 * 1 * 1 *
* 0 * 1 * 1 * 2 *
* 0 * 2 * 1 * 3 *
* 0 * 3 * 1 * 4 *
* 0 * 4 * 1 * 5 *
* 0 * 5 * 1 * 6 *
***********************************************
When I try to read this in Julia using the UnROOT package I get the following
julia> file=ROOTFile("/hosts/nashome/riederer_bernd/test_array.root")
ROOTFile("/hosts/nashome/riederer_bernd/test_array.root") with 2 entries and 17 streamers.
julia> keys(file)
2-element Array{String,1}:
"arrays"
"structs"
julia> array(file,"arrays")
ERROR: Branches with multiple leaves are not supported yet.
Stacktrace:
[1] #array#120(::Bool, ::typeof(array), ::ROOTFile, ::String) at ~/.juliapro/JuliaPro_v1.2.0-1/packages/UnROOT/KN5MU/src/root.jl:153
[2] array(::ROOTFile, ::String) at ~/.juliapro/JuliaPro_v1.2.0-1/packages/UnROOT/KN5MU/src/root.jl:139
[3] top-level scope at none:0
julia> array(file,"structs")
ERROR: Branches with multiple leaves are not supported yet.
Stacktrace:
[1] #array#120(::Bool, ::typeof(array), ::ROOTFile, ::String) at ~/.juliapro/JuliaPro_v1.2.0-1/packages/UnROOT/KN5MU/src/root.jl:153
[2] array(::ROOTFile, ::String) at ~/.juliapro/JuliaPro_v1.2.0-1/packages/UnROOT/KN5MU/src/root.jl:139
[3] top-level scope at none:0
julia> DataFrame(file,"arrays")
1×3 DataFrame
│ Row │ nInt │ 6dVec │ 2x3Mat │
│ │ Int32 │ Float64 │ Float64 │
├─────┼───────┼─────────┼─────────┤
│ 1 │ 1 │ 1.0 │ 1.0 │
julia> DataFrame(file,"structs")
1×2 DataFrame
│ Row │ nInt │ 2x3mat │
│ │ Int32 │ Float64 │
├─────┼───────┼─────────┤
│ 1 │ 1 │ 1.0 │
It seems that, while array refuses to handle multidimensional objects in a TBranch, the DataFrame function is able to get the scalar and the first entry of each multidimensional object.
I tested this with another file and DataFrame correctly returns all rows of scalar entries and, next to them, the first element of the higher-dimensional object in each case. So it looks like this:
julia> DataFrame(file,"structs")
1×5 DataFrame
│ Row │ nInt │ 2x3mat │
│ │ Int32 │ Float64 │
├─────┼───────┼───────────┤
│ 1 │ 1 │ mat.el[0] │
│ 2 │ 2 │ mat.el[0] │
│ 3 │ 3 │ mat.el[0] │
│ 4 │ 4 │ mat.el[0] │
I would like to request extending the support for multidimensional objects, at least for the array function.
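To sketch what such support could do for the fixed-size case: a `[2][3]/D` branch is just 6 big-endian doubles per entry, so the raw bytes could be reinterpreted, byte-swapped, and reshaped (`rawbytes` is synthetic here, standing in for one entry's raw branch payload):

```julia
# Build synthetic big-endian bytes for the six doubles 1.0..6.0, as ROOT
# would store them on disk, then decode them back into a 2x3 matrix.
rawbytes = collect(reinterpret(UInt8, hton.(collect(1.0:6.0))))

vals = ntoh.(reinterpret(Float64, rawbytes))  # back to native endianness
mat  = permutedims(reshape(vals, 3, 2))       # one entry's 2x3 matrix
```

This ignores the multiple-leaves bookkeeping that the `array` error above complains about; it only shows that the payload itself is straightforward to decode.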
Let me start a dedicated topic on threaded performance.
Based on PR #76, we can parallelize the event loop using a thread-local last-read-basket cache carried by each branch. With 4 threads I got only a 2.5x speedup, which indicates something is inhibiting parallelization. This feels weird to me, since we should be dominated by branch decompression and interpretation, both of which are easily threaded.
Using the dimuon CMS outreach file here, I made a Gantt chart to visualize how time is spent across the different threads.
With a few additional printouts:
# iteration.jl
rawdata, rawoffsets = readbasket(f, branch, ithbasket)
+ start = time()
T, J = auto_T_JaggT(f, branch; customstructs=f.customstructs)
- return interped_data(rawdata, rawoffsets, T, J)
+ idata = interped_data(rawdata, rawoffsets, T, J)
+ stop = time()
+ lock(stdout); println("interpretation, THREAD $(Threads.threadid()), BRANCH $(branch.fName), basket $(ithbasket), START $(start), STOP $(stop), NEVENTS $(length(idata))"); unlock(stdout)
+ return idata
end
# root.jl
function readbasketseek(
f::ROOTFile, branch::Union{TBranch, TBranchElement}, seek_pos::Int
)::Tuple{Vector{UInt8},Vector{Int32}} # just being extra careful
lock(f)
+ start = time()
seek(f.fobj, seek_pos)
basketkey = unpack(f.fobj, TBasketKey)
compressedbytes = compressed_datastream(f.fobj, basketkey)
+ stop = time()
unlock(f)
+ ithbasket = findfirst(==(seek_pos), branch.fBasketSeek)
+ lock(stdout); println("locked IO, THREAD $(Threads.threadid()), BRANCH $(branch.fName), basket $(ithbasket), START $(start), STOP $(stop), NEVENTS 0"); unlock(stdout)
+
+ start = time()
basketrawbytes = decompress_datastreambytes(compressedbytes, basketkey)
+ stop = time()
+
+ lock(stdout); println("decompression, THREAD $(Threads.threadid()), BRANCH $(branch.fName), basket $(ithbasket), START $(start), STOP $(stop), NEVENTS 0"); unlock(stdout)
and a script like
using UnROOT
const t = LazyTree(ROOTFile("Run2012BC_DoubleMuParked_Muons.root"), "Events", ["Muon_pt"])
Threads.@threads for i in 1:length(t);
length(t.Muon_pt[i])
end
one gets this kind of dump
...
locked IO, THREAD 3, BRANCH Muon_pt, basket 505, START 1.630720271814293e9, STOP 1.63072027181704e9, NEVENTS 0
decompression, THREAD 3, BRANCH Muon_pt, basket 505, START 1.630720271817063e9, STOP 1.630720271897435e9, NEVENTS 0
offset calc, THREAD 3, BRANCH Muon_pt, basket 505, START 1.630720271922676e9, STOP 1.630720271924612e9, NEVENTS 0
interpretation, THREAD 3, BRANCH Muon_pt, basket 505, START 1.6307202719262e9, STOP 1.630720271939091e9, NEVENTS 821695
...
There are some clear gaps. I'm not sure where they come from. Profiling tells me that interpretation/decompression/IO are the dominant methods.
You can zoom into the plot here: https://nbviewer.jupyter.org/gist/aminnj/982c0dd63cf32eaff20564223c81f7b6
The (Python) notebook also should be self contained in terms of data, so feel free to extract any extra insights.
Sometimes in HEP the memory access pattern is inherently bad. For example, we have jagged branches for electron- and muon-related quantities: Lorentz vectors, isolation flags, etc.
Let's say e_mask is a quality cut, and people want to extract isolation flags for electrons that passed the e_mask:
function get_Isos(e_mask, m_mask, evt)
v_l_passIso = Vector{Bool}[]
v_e_passIso_HighPtCaloOnly = evt.v_e_passIso_HighPtCaloOnly
v_e_passIso_TightTrackOnly_VarRad = evt.v_e_passIso_TightTrackOnly_VarRad
v_e_passIso_TightTrackOnly_FixedRad = evt.v_e_passIso_TightTrackOnly_FixedRad
v_e_passIso_Tight_VarRad = evt.v_e_passIso_Tight_VarRad
#.... 12 more
@inbounds for (i,f) in enumerate(e_mask)
f || continue
push!(
v_l_passIso,
Bool[
v_e_passIso_HighPtCaloOnly[i],
v_e_passIso_TightTrackOnly_VarRad[i],
v_e_passIso_TightTrackOnly_FixedRad[i],
v_e_passIso_Tight_VarRad[i],
v_e_passIso_Loose_VarRad[i],
],
)
end
# one more for loop for muons
    return v_l_passIso
end
Each of those piles is actually just the v_e_passIso_HighPtCaloOnly = evt.v_e_passIso_HighPtCaloOnly line.
This is probably horrible no matter what language you use, but I'm wondering if we should explore https://github.com/JuliaArrays/ArraysOfArrays.jl/ to see if we can cut down the allocations from making concrete vectors (or vectors of vectors), since in many cases they just get torn apart manually in the looper.
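For intuition, a minimal sketch of the flat-data-plus-offsets layout that ArraysOfArrays.jl formalizes: one contiguous buffer, with per-event views instead of per-event `Vector` allocations (all names here are illustrative):

```julia
# One contiguous buffer for all events' values, plus offsets delimiting
# each event; event i spans data[offsets[i]:offsets[i+1]-1].
data    = Float32[1, 2, 3, 4, 5]
offsets = [1, 3, 6]

event(i) = view(data, offsets[i]:offsets[i+1]-1)  # allocation-free view
```

Replacing the `push!`-of-`Bool[...]` pattern above with appends into such a flat buffer is the kind of allocation cut being proposed.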
Suppose you have a few events to keep by index
julia> keep = findall(>(45),tf.nMuon)
3-element Vector{Int64}:
35499643
48770485
53288840
julia> @time tf[keep]
0.000007 seconds (1 allocation: 256 bytes)
3-element Vector{UnROOT.LazyEvent{TypedTables.Table{NamedTuple{(:Muon_phi, :nMuon, :Muon_pt, :Muon_eta, :Muon_charge, :Muon_mass), Tuple{SubArray{Float32, 1, Vector{Float32}, Tuple{UnitRange{Int64}}, true}, UInt32, SubArray{Float32, 1, Vector{Float32}, Tuple{UnitRange{Int64}}, true}, SubArray{Float32, 1, Vector{Float32}, Tuple{UnitRange{Int64}}, true}, SubArray{Int32, 1, Vector{Int32}, Tuple{UnitRange{Int64}}, true}, SubArray{Float32, 1, Vector{Float32}, Tuple{UnitRange{Int64}}, true}}}, 1, NamedTuple{(:Muon_phi, :nMuon, :Muon_pt, :Muon_eta, :Muon_charge, :Muon_mass), Tuple{LazyBranch{SubArray{Float32, 1, Vector{Float32}, Tuple{UnitRange{Int64}}, true}, UnROOT.Nooffsetjagg, ArraysOfArrays.VectorOfVectors{Float32, Vector{Float32}, Vector{Int32}, Vector{Tuple{}}}}, LazyBranch{UInt32, UnROOT.Nojagg, Vector{UInt32}}, LazyBranch{SubArray{Float32, 1, Vector{Float32}, Tuple{UnitRange{Int64}}, true}, UnROOT.Nooffsetjagg, ArraysOfArrays.VectorOfVectors{Float32, Vector{Float32}, Vector{Int32}, Vector{Tuple{}}}}, LazyBranch{SubArray{Float32, 1, Vector{Float32}, Tuple{UnitRange{Int64}}, true}, UnROOT.Nooffsetjagg, ArraysOfArrays.VectorOfVectors{Float32, Vector{Float32}, Vector{Int32}, Vector{Tuple{}}}}, LazyBranch{SubArray{Int32, 1, Vector{Int32}, Tuple{UnitRange{Int64}}, true}, UnROOT.Nooffsetjagg, ArraysOfArrays.VectorOfVectors{Int32, Vector{Int32}, Vector{Int32}, Vector{Tuple{}}}}, LazyBranch{SubArray{Float32, 1, Vector{Float32}, Tuple{UnitRange{Int64}}, true}, UnROOT.Nooffsetjagg, ArraysOfArrays.VectorOfVectors{Float32, Vector{Float32}, Vector{Int32}, Vector{Tuple{}}}}}}}}}:
^P UnROOT.LazyEvent at index 35499643 with 6 columns:
(Muon_phi = Float32[1.8356316, 2.7067316, 0.87054247, 1.5321478, -1.8103704, -1.8802298, -2.1558394, -1.1770533, -2.0695007, -2.2399049 … 1.2497426, 2.2827926, 1.216624, 2.6214137, -0.76639545, -1.3989606, 0.6140395, 1.0926172, 0.97163, 3.1090136], nMuon = 0x0000002e, Muon_pt = Float32[3.9049673, 3.2476199, 4.5254884, 206.89212, 108.89227, 9.696999, 5.193112, 37.201935, 4.0803714, 3.9650037 … 7.354625, 4.20561, 6.915594, 55.53567, 7.694474, 47.25801, 3.351904, 5.535518, 4.2438073, 4.9639397], Muon_eta = Float32[-0.75485444, -0.5610426, -0.42048398, 0.19798185, -1.6615134, -1.0364083, -0.91027415, -0.7516932, 0.20546485, 0.17692244 … -0.8958062, -0.93142104, -1.025144, -1.3047857, -0.45616388, -0.51817393, 0.2727939, 0.8475114, 0.7797574, 0.938193], Muon_charge = Int32[-1, -1, -1, 1, -1, -1, -1, 1, -1, -1 … -1, 1, -1, 1, 1, 1, 1, -1, -1, 1], Muon_mass = Float32[0.10565837, 0.10565837, 0.10565837, 0.10565837, 0.10565837, 0.10565837, 0.10565837, 0.10565837, 0.10565837, 0.10565837 … 0.10565837, 0.10565837, 0.10565837, 0.10565837, 0.10565837, 0.10565837, 0.10565837, 0.10565837, 0.10565837, 0.10565837])
UnROOT.LazyEvent at index 48770485 with 6 columns:
(Muon_phi = Float32[2.7186887, 1.66489, -0.4609219, -0.30460685, -1.4622912, -0.29155937, -0.19675593, 1.3228732, -2.5117185, 1.1946799 … 0.12079977, -1.8443158, 0.38688487, 1.5110592, 1.0058981, 0.48077318, 2.650082, 1.0766889, 0.76323473, 2.855547], nMuon = 0x00000032, Muon_pt = Float32[4.3042383, 3.3196697, 4.02104, 3.91249, 5.2271028, 3.7576833, 12.493878, 15.357484, 42.470627, 359.8618 … 4.594613, 52.41086, 5.586868, 8.586369, 92.26022, 5.863132, 4.7038293, 42.10251, 239.57248, 15.74454], Muon_eta = Float32[-1.1377369, -0.05508444, 0.43966427, 0.33119693, 1.9196174, 0.36755624, -0.90255135, 0.8946229, 0.83761317, -0.6391948 … 0.7502582, 0.5413515, -0.95098275, 0.6888705, 0.82908434, 0.6890304, 1.2712107, 0.8093618, 0.73721737, -0.6364821], Muon_charge = Int32[-1, -1, 1, 1, 1, 1, -1, 1, 1, -1 … -1, -1, -1, 1, 1, -1, 1, -1, 1, 1], Muon_mass = Float32[0.10565837, 0.10565837, 0.10565837, 0.10565837, 0.10565837, 0.10565837, 0.10565837, 0.10565837, 0.10565837, 0.10565837 … 0.10565837, 0.10565837, 0.10565837, 0.10565837, 0.10565837, 0.10565837, 0.10565837, 0.10565837, 0.10565837, 0.10565837])
UnROOT.LazyEvent at index 53288840 with 6 columns:
(Muon_phi = Float32[1.0888511, -2.9195318, -2.5780199, 0.17057382, 0.7872117, 0.63450843, 0.75276434, 2.9759731, -2.5312407, -2.6104288 … -1.4269199, -2.2816708, 1.0712936, -0.12250012, 0.5654365, 0.43211797, -2.8930926, 0.3762097, 2.2137518, 1.8053775], nMuon = 0x0000002e, Muon_pt = Float32[3.4638908, 5.742142, 6.0685053, 3.1070886, 3.0751114, 4.3533607, 146.31197, 6.3071194, 11.217872, 5.1325 … 18.919275, 5.4368644, 48.185585, 7.5798483, 3.6502025, 3.9469419, 5.0577216, 3.1323035, 4.6181774, 8.671757], Muon_eta = Float32[0.9788583, 1.4845941, 1.0897319, 1.6785655, 2.5901272, 1.568596, 0.46518221, 0.40752897, -0.8510304, -0.7986666 … 1.1021087, 1.0143414, 0.8790746, 0.7488476, 0.02210111, -0.18400188, -0.99575096, 1.007069, -0.44294864, 1.0667362], Muon_charge = Int32[-1, 1, 1, 1, 1, 1, -1, -1, -1, -1 … 1, -1, 1, 1, -1, -1, -1, -1, 1, 1], Muon_mass = Float32[0.10565837, 0.10565837, 0.10565837, 0.10565837, 0.10565837, 0.10565837, 0.10565837, 0.10565837, 0.10565837, 0.10565837 … 0.10565837, 0.10565837, 0.10565837, 0.10565837, 0.10565837, 0.10565837, 0.10565837, 0.10565837, 0.10565837, 0.10565837])
The output looks gnarly. However, for a UnitRange, we have
julia> tf[1:3]
Row │ Muon_phi nMuon Muon_pt Muon_eta Muon_charge Muon_mass
│ Vector{Float32} UInt32 Vector{Float32} Vector{Float32} Vector{Int32} Vector{Float32}
─────┼───────────────────────────────────────────────────────────────────────────────────────────
1 │ [-0.0343, 2.54] 2 [10.8, 15.7] [1.07, -0.564] [-1, -1] [0.106, 0.106]
2 │ [-0.275, 2.54] 2 [10.5, 16.3] [-0.428, 0.349] [1, -1] [0.106, 0.106]
3 │ [-1.22] 1 [3.28] [2.21] [1] [0.106]
For consistency, I think getindex on tf with a vector should be aliased to
julia> LazyTree(UnROOT.innertable(tf)[keep])
Row │ Muon_phi nMuon Muon_pt Muon_eta Muon_charge Muon_mass ⋯
│ SubArray{Float3 UInt32 SubArray{Float3 SubArray{Float3 SubArray{Int32, SubArray{Float3 ⋯
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ [1.84, 2.71, 0.871, 1.53, -1 46 [3.9, 3.25, 4.53, 207.0, 109 [-0.755, -0.561, -0.42, 0.19 [-1, -1, -1, 1, -1, -1, -1, [0.106, 0.106, 0.106, 0 ⋯
2 │ [2.72, 1.66, -0.461, -0.305, 50 [4.3, 3.32, 4.02, 3.91, 5.23 [-1.14, -0.0551, 0.44, 0.331 [-1, -1, 1, 1, 1, 1, -1, 1, [0.106, 0.106, 0.106, 0 ⋯
3 │ [1.09, -2.92, -2.58, 0.171, 46 [3.46, 5.74, 6.07, 3.11, 3.0 [0.979, 1.48, 1.09, 1.68, 2. [-1, 1, 1, 1, 1, 1, -1, -1, [0.106, 0.106, 0.106, 0 ⋯
sample file (only 0.5KB in size): https://jiling.web.cern.ch/jiling/public/Run2012BC_DoubleMuParked_Muons_uncompress_1Mevts.root
https://github.com/tamasgal/UnROOT.jl/blob/ea2a6ba3244915fe336a13ec3d9ab8b0adb23c38/src/streamers.jl#L109
(on current master branch)
this line throws ERROR: Invalid class reference. when reading a ROOT file that is not compressed at all (even the streamers are not compressed).
If I print out all the tag > 0 values from:
https://github.com/tamasgal/UnROOT.jl/blob/ea2a6ba3244915fe336a13ec3d9ab8b0adb23c38/src/streamers.jl#L198-L207
I see that, for a compressed file:
/home/akako/Downloads/Run2012BC_DoubleMuParked_Muons.root
tag = 0xffffffff
tag = 0xffffffff
tag = 0x8000011e
tag = 0xffffffff
tag = 0x8000011e
tag = 0x80002875
tag = 0x8000011e
tag = 0x80002875
tag = 0x8000011e
tag = 0x80002875
tag = 0x8000011e
tag = 0x800001dd
tag = 0x000001d9
tag = 0x00002871
tag = 0x00005ace
tag = 0x00008d26
tag = 0x0000bf80
tag = 0x0000f1e0
├─ Events
│ ├─ "nMuon"
│ ├─ "Muon_pt"
for the problematic uncompressed file:
┌ Debug: Unompressed stream at 58521775
└ @ UnROOT ~/.julia/dev/UnROOT/src/streamers.jl:98
┌ Debug: Found 19 streamers, continue with parsing.
└ @ UnROOT ~/.julia/dev/UnROOT/src/streamers.jl:107
tag = 0xffffffff
tag = 0xffffffff
tag = 0xffffffff
tag = 0xffffffff
tag = 0x8000013b
ERROR: Invalid class reference.
Stacktrace:
[1] error(s::String)
@ Base ./error.jl:33
any idea what 0x8000013b is? It doesn't look like we have an offset bug, because when I print out the start of the stream, they look the same, at least for the first few recursions.
I'm trying to read a TBranchElement which contains subbranches from a data file with the same format as this example file. To be more specific, I want to read the track information at E/Evt/trks. The track information contains the particle properties (for simulation) or the reconstructions (for measurement) per event, so this leads to a nested array structure. Taking a look at it using uproot (loading the given example file into fobj) I get:
In [14]: fobj["E/Evt/trks"].show()
name | typename | interpretation
---------------------+--------------------------+-------------------------------
trks | vector<Trk> | AsGroup(<TBranchElement 'trks'
trks.fUniqueID | uint32_t[] | AsJagged(AsDtype('>u4'))
trks.fBits | uint32_t[] | AsJagged(AsDtype('>u4'))
trks.usr_data | std::vector<AAny>* | AsObjects(AsArray(True, Fal...
trks.usr_names | std::vector<std::stri... | AsObjects(AsArray(True, Fal...
trks.id | int32_t[] | AsJagged(AsDtype('>i4'))
trks.pos.x | double[] | AsJagged(AsDtype('>f8'))
trks.pos.y | double[] | AsJagged(AsDtype('>f8'))
trks.pos.z | double[] | AsJagged(AsDtype('>f8'))
trks.dir.x | double[] | AsJagged(AsDtype('>f8'))
trks.dir.y | double[] | AsJagged(AsDtype('>f8'))
trks.dir.z | double[] | AsJagged(AsDtype('>f8'))
trks.t | double[] | AsJagged(AsDtype('>f8'))
trks.E | double[] | AsJagged(AsDtype('>f8'))
trks.len | double[] | AsJagged(AsDtype('>f8'))
trks.lik | double[] | AsJagged(AsDtype('>f8'))
trks.type | int32_t[] | AsJagged(AsDtype('>i4'))
trks.rec_type | int32_t[] | AsJagged(AsDtype('>i4'))
trks.rec_stages | std::vector<int32_t>* | AsObjects(AsArray(True, Fal...
trks.status | int32_t[] | AsJagged(AsDtype('>i4'))
trks.mother_id | int32_t[] | AsJagged(AsDtype('>i4'))
trks.fitinf | std::vector<double>* | AsObjects(AsArray(True, Fal...
trks.hit_ids | std::vector<int32_t>* | AsObjects(AsArray(True, Fal...
trks.error_matrix | std::vector<double>* | AsObjects(AsArray(True, Fal...
trks.comment | std::string* | AsObjects(AsArray(True, Fal...
I'm able to read the individual fields with UnROOT.jl:
julia> fobj_online = UnROOT.ROOTFile(fpath_example)
ROOTFile(...) with 10 entries and 55 streamers.
julia> data, offsets = UnROOT.array(fobj, "E/Evt/trks/trks.t"; raw=true)
(UInt8[0x41, 0x90, 0xc3, 0x78, 0x59, 0xdb, 0x26, 0xbe, 0x41, 0x90 … 0x04, 0x2c, 0x41, 0x8a, 0x34, 0x8c, 0x55, 0xe6, 0xb6, 0x6d], Int32[70, 518, 958, 1406, 1854, 2302, 2750, 3198, 3646, 4078])
julia> reinterpret(Float64, reverse(data[1:8]))
1-element reinterpret(Float64, ::Vector{UInt8}):
7.031144646401498e7
and with the offsets I can get the nested array structure preserved. I get stuck when I want to read this in a more efficient way and convert it into a trk struct, preserving the nested array structure.
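One way to at least recover the jagged structure from the raw bytes and byte offsets is sketched below. The 10-byte per-entry header skip is an assumption borrowed from the custom-struct examples elsewhere in these issues; the field-by-field struct conversion would build on the same slicing:

```julia
# Turn raw big-endian Float64 bytes + byte offsets into a jagged
# Vector{Vector{Float64}}. `header` (assumed 10 bytes) is skipped per entry.
function jagged_f64(data::Vector{UInt8}, offsets::Vector{Int32}; header = 10)
    map(1:length(offsets)-1) do i
        lo = offsets[i] + header + 1              # skip the per-entry header
        hi = offsets[i+1]
        ntoh.(reinterpret(Float64, data[lo:hi]))  # big-endian on disk
    end
end
```

A trk struct could then be assembled event-by-event by zipping the decoded fields (pos, dir, t, E, ...) into a user-defined composite type.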
Symbol("STDMTruthTausAuxDyn.classifierParticleOrigin") is a crazy branch name, and it includes a dot...
f = ROOTFile(joinpath(SAMPLES_DIR, "histograms.root"))
with the histogram file added in #1
We should make a benchmark suite to make sure we don't regress...
I don't think CI would be very happy to pull in a 2 GB file for benchmarking ;) To start off, it could even be just a local script to run.
The title is a bit generic ;)
Let me rewind and show how things looked in the early days of UnROOT (v0.1.7), where getindex behaved differently:
julia> using UnROOT
julia> f = UnROOT.ROOTFile("test/samples/km3net_online.root")
ROOTFile("test/samples/km3net_online.root") with 10 entries and 54 streamers.
julia> f["KM3NET_EVENT/KM3NET_EVENT"].fClassName
"KM3NETDAQ::JDAQEvent"
julia> keys(f["KM3NET_EVENT/KM3NET_EVENT"])
4-element Vector{String}:
"KM3NETDAQ::JDAQPreamble"
"KM3NETDAQ::JDAQEventHeader"
"triggeredHits"
"snapshotHits"
julia> f["KM3NET_EVENT/KM3NET_EVENT/snapshotHits"]
UnROOT.TBranchElement_10
cursor: UnROOT.Cursor
fName: String "snapshotHits"
fTitle: String "snapshotHits"
fFillColor: Int16 0
fFillStyle: Int16 1001
...
...
...
Here you can see that keys(f["KM3NET_EVENT/KM3NET_EVENT"]) lists the "branches", which are fields of the KM3NETDAQ::JDAQEvent class entry.
Of course, there is some complicated way to read this class by deserialising all four sub-branches, but it's a tough thing and I haven't even managed to do so in uproot. However, it's possible to do it branch by branch.
With the new API we are currently working on, there is no easy way to access subbranches of such a custom branch like f["KM3NET_EVENT/KM3NET_EVENT"]:
julia> keys(f["KM3NET_EVENT/KM3NET_EVENT"])
┌ Warning: Can't automatically create LazyBranch for branch KM3NET_EVENT/KM3NET_EVENT. Returning a branch object
└ @ UnROOT ~/Dev/UnROOT.jl/src/root.jl:93
ERROR: MethodError: no method matching keys(::UnROOT.TBranchElement_10)
Closest candidates are:
keys(::Union{Tables.AbstractColumns, Tables.AbstractRow}) at /Users/tamasgal/.julia/packages/Tables/gg6Id/src/Tables.jl:181
keys(::Missings.EachFailMissing) at /Users/tamasgal/.julia/packages/Missings/sx5js/src/Missings.jl:154
keys(::Dictionaries.Dictionary) at /Users/tamasgal/.julia/packages/Dictionaries/YpAxR/src/Dictionary.jl:270
...
Stacktrace:
[1] top-level scope
@ REPL[3]:1
That's quite an easy fix, since we have the abstract type TBranchElement which we can use to implement keys(branch::TBranchElement).
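A hypothetical sketch of that fix is below. A stand-in struct mimics the relevant field here so the example is self-contained; the real method would be defined on the abstract TBranchElement and assumes subbranches are stored in an `fBranches`-like field:

```julia
# Stand-in for a branch object holding subbranches; the real type in
# UnROOT is TBranchElement, and the field name is an assumption.
struct DemoBranch
    fName::String
    fBranches::Vector{DemoBranch}
end

# The proposed keys(branch) would simply list the subbranch names.
Base.keys(b::DemoBranch) = [sub.fName for sub in b.fBranches]
```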
However, the API to introduce custom parsing currently only looks at the base class and is not able to distinguish between subbranches. This means it only looks at fClassName and, if there is a match, it will use the type in customstructs:
struct KM3NETDAQHit
dom_id::Int32
channel_id::UInt8
tdc::Int32
tot::UInt8
end
function UnROOT.interped_data(rawdata, rawoffsets, ::Type{Vector{KM3NETDAQHit}}, ::Type{J}) where J<:UnROOT.JaggType
@views map(1:length(rawoffsets)-1) do idx
idxrange = rawoffsets[idx]+10+1 : rawoffsets[idx+1]
UnROOT.interped_data(rawdata[idxrange], rawoffsets[idx], KM3NETDAQHit, UnROOT.Nojagg)
end
end
function UnROOT.interped_data(rawdata, rawoffsets, T::Type{KM3NETDAQHit}, ::Type{J}) where J <: UnROOT.JaggType
splitdata = Base.Iterators.partition(rawdata, 10)
out = KM3NETDAQHit[]
sizehint!(out, length(splitdata))
for single in splitdata
io = IOBuffer(single)
push!(out, T(UnROOT.readtype(io, Int32), read(io, UInt8), read(io, Int32), read(io, UInt8)))
end
out
end
customstructs = Dict("KM3NETDAQ::JDAQEvent" => Vector{KM3NETDAQHit})
f = UnROOT.ROOTFile("../test/samples/km3net_online.root", customstructs=customstructs)
With this code, f["KM3NET_EVENT/KM3NET_EVENT/snapshotHits"] magically becomes a jagged Vector{Vector{KM3NETDAQHit}}, which is nice, but the same type is also used for f["KM3NET_EVENT/KM3NET_EVENT/triggeredHits"], where it fails.
Furthermore, the printout of f["KM3NET_EVENT/KM3NET_EVENT"] is broken when using customstructs:
MethodError: no method matching -(::Nothing, ::Int64)
Closest candidates are:
-(::T, ::T) where T<:Union{Int128, Int16, Int32, Int64, Int8, UInt128, UInt16, UInt32, UInt64, UInt8} at int.jl:86
-(::T, ::Integer) where T<:AbstractChar at char.jl:222
-(::LinearAlgebra.UniformScaling, ::Number) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/uniformscaling.jl:147
...
Stacktrace:
[1] getindex(ba::LazyBranch{Vector{KM3NETDAQHit}, UnROOT.Offsetjagg}, idx::Int64)
@ UnROOT ~/Dev/UnROOT.jl/src/iteration.jl:111
[2] _getindex
@ ./abstractarray.jl:1209 [inlined]
[3] getindex
@ ./abstractarray.jl:1170 [inlined]
[4] isassigned(::LazyBranch{Vector{KM3NETDAQHit}, UnROOT.Offsetjagg}, ::Int64, ::Int64)
@ Base ./abstractarray.jl:513
[5] alignment(io::IOContext{IOBuffer}, X::LazyBranch{Vector{KM3NETDAQHit}, UnROOT.Offsetjagg}, rows::UnitRange{Int64}, cols::UnitRange{Int64}, cols_if_complete::Int64, cols_otherwise::Int64, sep::Int64)
@ Base ./arrayshow.jl:67
[6] _print_matrix(io::IOContext{IOBuffer}, X::AbstractVecOrMat{T} where T, pre::String, sep::String, post::String, hdots::String, vdots::String, ddots::String, hmod::Int64, vmod::Int64, rowsA::UnitRange{Int64}, colsA::UnitRange{Int64})
@ Base ./arrayshow.jl:201
[7] #invokelatest#2
@ ./essentials.jl:708 [inlined]
[8] invokelatest
@ ./essentials.jl:706 [inlined]
[9] print_matrix (repeats 2 times)
@ ./arrayshow.jl:170 [inlined]
[10] print_array
@ ./arrayshow.jl:327 [inlined]
[11] show(io::IOContext{IOBuffer}, #unused#::MIME{Symbol("text/plain")}, X::LazyBranch{Vector{KM3NETDAQHit}, UnROOT.Offsetjagg})
@ Base ./arrayshow.jl:368
[12] limitstringmime(mime::MIME{Symbol("text/plain")}, x::LazyBranch{Vector{KM3NETDAQHit}, UnROOT.Offsetjagg})
@ IJulia ~/.julia/packages/IJulia/e8kqU/src/inline.jl:43
[13] display_mimestring
@ ~/.julia/packages/IJulia/e8kqU/src/display.jl:71 [inlined]
[14] display_dict(x::LazyBranch{Vector{KM3NETDAQHit}, UnROOT.Offsetjagg})
@ IJulia ~/.julia/packages/IJulia/e8kqU/src/display.jl:102
[15] #invokelatest#2
@ ./essentials.jl:708 [inlined]
[16] invokelatest
@ ./essentials.jl:706 [inlined]
[17] execute_request(socket::ZMQ.Socket, msg::IJulia.Msg)
@ IJulia ~/.julia/packages/IJulia/e8kqU/src/execute_request.jl:112
[18] #invokelatest#2
@ ./essentials.jl:708 [inlined]
[19] invokelatest
@ ./essentials.jl:706 [inlined]
[20] eventloop(socket::ZMQ.Socket)
@ IJulia ~/.julia/packages/IJulia/e8kqU/src/eventloop.jl:8
[21] (::IJulia.var"#15#18")()
@ IJulia ./task.jl:411
EDIT:
So, all in all, what we need is the possibility to provide types for subbranches. Given the example above, it would be something like this:
customstructs = Dict(
    "KM3NETDAQ::JDAQEvent.snapshotHits" => Vector{KM3NETDAQHit},
    "KM3NETDAQ::JDAQEvent.triggeredHits" => Vector{KM3NETDAQTriggeredHit},
    "KM3NETDAQ::JDAQEvent.KM3NETDAQ::JDAQPreamble" => KM3NETDAQPreamble,
    "KM3NETDAQ::JDAQEvent.KM3NETDAQ::JDAQEventHeader" => KM3NETDAQEventHeader
)
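For reference, the call site might then look something like this. This is a sketch of the proposed API only, not working code: a `ROOTFile` constructor accepting a `customstructs` keyword is the assumption here, and the filename is made up.

```julia
# Hypothetical: pass the sub-branch => type mapping when opening the file,
# so the reader knows how to interpret those branches.
f = ROOTFile("km3net_online.root"; customstructs = customstructs)
events = f["KM3NETDAQ::JDAQEvent"]
```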
I'm new to Julia and want to see its potential for HEP computing, so I am playing with a TTree in UnROOT to compare it against the equivalent job with ROOT/RDataFrame.
I understand that a LazyTree is not a DataFrame, since in general the full TTree cannot fit in memory; however, it provides the basic functionality of an AbstractDataFrame. Naively, I thought I could start with a LazyTree and then, once it is heavily reduced, convert it to a DataFrame. But this does not seem to work. Perhaps you can give me some hints.
The code I am trying to execute is as follows:
using UnROOT
f = ROOTFile("/eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/Run2012BC_DoubleMuParked_Muons.root")
events = LazyTree(f, "Events");
df = filter([:nMuon,:Muon_charge] => (n,c) -> n == 2 && c[1] != c[2], events[1:1000000])
and the error I get is:
MethodError: no method matching getindex(::LazyTree{TypedTables.Table{NamedTuple{(:Muon_phi, :nMuon, :Muon_pt, :Muon_eta, :Muon_charge, :Muon_mass), Tuple{SubArray{Float32, 1, Vector{Float32}, Tuple{UnitRange{Int64}}, true}, UInt32, SubArray{Float32, 1, Vector{Float32}, Tuple{UnitRange{Int64}}, true}, SubArray{Float32, 1, Vector{Float32}, Tuple{UnitRange{Int64}}, true}, SubArray{Int32, 1, Vector{Int32}, Tuple{UnitRange{Int64}}, true}, SubArray{Float32, 1, Vector{Float32}, Tuple{UnitRange{Int64}}, true}}}, 1, NamedTuple{(:Muon_phi, :nMuon, :Muon_pt, :Muon_eta, :Muon_charge, :Muon_mass), Tuple{Vector{SubArray{Float32, 1, Vector{Float32}, Tuple{UnitRange{Int64}}, true}}, Vector{UInt32}, Vector{SubArray{Float32, 1, Vector{Float32}, Tuple{UnitRange{Int64}}, true}}, Vector{SubArray{Float32, 1, Vector{Float32}, Tuple{UnitRange{Int64}}, true}}, Vector{SubArray{Int32, 1, Vector{Int32}, Tuple{UnitRange{Int64}}, true}}, Vector{SubArray{Float32, 1, Vector{Float32}, Tuple{UnitRange{Int64}}, true}}}}}}, ::BitVector, ::Colon)
Closest candidates are:
getindex(::AbstractDataFrame, ::Integer, ::Colon) at /eos/user/m/mato/.julia/packages/DataFrames/vuMM8/src/dataframerow/dataframerow.jl:210
getindex(::AbstractDataFrame, ::Integer, ::Union{Colon, Regex, AbstractVector{T} where T, All, Between, Cols, InvertedIndex}) at /eos/user/m/mato/.julia/packages/DataFrames/vuMM8/src/dataframerow/dataframerow.jl:208
getindex(::AbstractDataFrame, ::CartesianIndex{2}) at /eos/user/m/mato/.julia/packages/DataFrames/vuMM8/src/other/broadcasting.jl:3
...
Stacktrace:
[1] #filter#88
@ ~.julia/packages/DataFrames/vuMM8/src/abstractdataframe/abstractdataframe.jl:1040 [inlined]
[2] filter(::Pair{Vector{Symbol}, var"#5#6"}, df::LazyTree{TypedTables.Table{NamedTuple{(:Muon_phi, :nMuon, :Muon_pt, :Muon_eta, :Muon_charge, :Muon_mass), Tuple{SubArray{Float32, 1, Vector{Float32}, Tuple{UnitRange{Int64}}, true}, UInt32, SubArray{Float32, 1, Vector{Float32}, Tuple{UnitRange{Int64}}, true}, SubArray{Float32, 1, Vector{Float32}, Tuple{UnitRange{Int64}}, true}, SubArray{Int32, 1, Vector{Int32}, Tuple{UnitRange{Int64}}, true}, SubArray{Float32, 1, Vector{Float32}, Tuple{UnitRange{Int64}}, true}}}, 1, NamedTuple{(:Muon_phi, :nMuon, :Muon_pt, :Muon_eta, :Muon_charge, :Muon_mass), Tuple{Vector{SubArray{Float32, 1, Vector{Float32}, Tuple{UnitRange{Int64}}, true}}, Vector{UInt32}, Vector{SubArray{Float32, 1, Vector{Float32}, Tuple{UnitRange{Int64}}, true}}, Vector{SubArray{Float32, 1, Vector{Float32}, Tuple{UnitRange{Int64}}, true}}, Vector{SubArray{Int32, 1, Vector{Int32}, Tuple{UnitRange{Int64}}, true}}, Vector{SubArray{Float32, 1, Vector{Float32}, Tuple{UnitRange{Int64}}, true}}}}}})
@ DataFrames ~.julia/packages/DataFrames/vuMM8/src/abstractdataframe/abstractdataframe.jl:1035
[3] macro expansion
@ In[5]:3 [inlined]
[4] top-level scope
@ timing.jl:210
[5] eval
@ ./boot.jl:360 [inlined]
[6] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
@ Base ./loading.jl:1094
If instead of passing a LazyTree I construct a DataFrame, then it works nicely. However, I think this means the full TTree has been copied into memory.
df = filter([:nMuon,:Muon_charge] => (n,c) -> n == 2 && c[1] != c[2], DataFrame(events[1:1000000]))
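A workaround in the meantime is to filter chunk by chunk, so only one slice is ever materialized at a time. The sketch below uses a plain vector of NamedTuples as a stand-in for the LazyTree, so it runs without UnROOT or a ROOT file; `chunked_filter` and the chunk size are made up for illustration.

```julia
# Toy stand-in for a LazyTree: one NamedTuple per event.
events = [(nMuon = UInt32(n), Muon_charge = rand([-1, 1], n)) for n in rand(0:4, 10_000)]

# Filter in fixed-size chunks so at most `chunksize` events are materialized at once.
function chunked_filter(pred, events; chunksize = 1_000)
    out = eltype(events)[]
    for lo in 1:chunksize:length(events)
        chunk = events[lo:min(lo + chunksize - 1, length(events))]
        append!(out, Iterators.filter(pred, chunk))
    end
    return out
end

selected = chunked_filter(e -> e.nMuon == 2 && e.Muon_charge[1] != e.Muon_charge[2], events)
```

`DataFrame(selected)` would then hold only the reduced events; the same loop shape applies to LazyTree row ranges.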
(nondeterministic) MWE:
using UnROOT
using Base.Threads
using Polyester # make sure it is >=0.5.3
function foo_batch(t)
    @batch for evt in t
    # @threads for evt in t
        evt.nMuon == 4 || continue
    end
end
foo_batch(LazyTree(ROOTFile("lz4_Run2012BC_DoubleMuParked_Muons.root"),"Events", ["nMuon"]))
but note that it only fails about 30% of the time :( with something like this:
Task (failed) @0x00007f17e9f1d150
UndefRefError: access to undefined reference
Stacktrace:
[1] load
@ ~/.julia/packages/ManualMemory/J0yTE/src/ManualMemory.jl:60 [inlined]
[2] macro expansion
@ ~/.julia/packages/ManualMemory/J0yTE/src/ManualMemory.jl:160 [inlined]
[3] load
@ ~/.julia/packages/ManualMemory/J0yTE/src/ManualMemory.jl:160 [inlined]
[4] (::Polyester.BatchClosure{var"#1#2", ManualMemory.Reference{Tuple{Int64, Static.StaticInt{1}, Polyester.NoLoop, LazyTree{TypedTables.Table{NamedTuple{(:nMuon,), Tuple{UInt32}}, 1, NamedTuple{(:nMuon,), Tuple{LazyBranch{UInt32, UnROOT.Nojagg, Vector{UInt32}}}}}}}}, false})(p::Ptr{UInt64})
@ Polyester ~/.julia/packages/Polyester/G022G/src/batch.jl:5
[5] _call
@ ~/.julia/packages/ThreadingUtilities/IkkvN/src/threadtasks.jl:11 [inlined]
[6] (::ThreadingUtilities.ThreadTask)()
@ ThreadingUtilities ~/.julia/packages/ThreadingUtilities/IkkvN/src/threadtasks.jl:29Task
next: Nothing nothing
queue: Nothing nothing
storage: Nothing nothing
donenotify: Base.GenericCondition{SpinLock}
waitq: Base.InvasiveLinkedList{Task}
head: Nothing nothing
tail: Nothing nothing
lock: SpinLock
owned: Int64 0
result: UndefRefError UndefRefError()
logstate: Nothing nothing
code: ThreadingUtilities.ThreadTask
p: Ptr{UInt64} @0x00000000025cf000
rngState0: UInt64 0xa0ac74b4dbd9be5c
rngState1: UInt64 0xca657b522fb38e15
rngState2: UInt64 0x80f16291609a20c5
rngState3: UInt64 0x00380ae50f218885
_state: UInt8 0x02
sticky: Bool true
_isexception: Bool true
@Moelf suggests it's the customization in UnROOT/src/polyester.jl. Commenting that out seems to eliminate these failures, and @batch still works because of the recent PR that made LazyTree <: AbstractVector{LazyEvent{T}}.
So, since @threads/@batch have pretty much the same overhead for our problem/batch size [1], if we can make all of the following work:
@threads for i in eachindex(t)
@threads for evt in t
@threads for (i, evt) in enumerate(t)
then we can drop all the Polyester stuff.
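The index-based form is the one that needs no Polyester customization at all, since a LazyTree is now an AbstractVector. Here is a runnable analogue with a plain vector standing in for the tree (Base.Threads only; the event layout is made up, and an atomic counter is used so the result does not depend on how iterations are scheduled across threads):

```julia
using Base.Threads

# Stand-in for a LazyTree: any AbstractVector of events works with the index form.
t = [(nMuon = UInt32(i % 5),) for i in 1:10_000]

# Atomic counter avoids per-thread mutable state entirely.
hits = Atomic{Int}(0)
@threads for i in eachindex(t)
    evt = t[i]
    evt.nMuon == 4 || continue
    atomic_add!(hits, 1)
end
```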
[1] Double-muon file H->ZZ->4mu benchmark with polyester.jl commented out.
Sometimes I accidentally type print(t) in a script and then have to kill my terminal tab as millions of rows flood into stdout. We get a nice compact representation with the default repr in the REPL:
julia> p
Row │ nMuon Muon_pt
│ UInt32 SubArray{Float3
─────┼─────────────────────────
1 │ 0 []
2 │ 2 [19.9, 15.3]
3 │ 0 []
4 │ 0 []
5 │ 0 []
6 │ 0 []
7 │ 2 [22.2, 4.43]
8 │ 0 []
9 │ 0 []
10 │ 0 []
⋮ │ ⋮ ⋮
90 rows omitted
which you can reproduce with display(p) or println(repr(p, context = :limit => true)). The former is unintuitive for a newcomer, and the latter is unintuitive unless you spend ten minutes tracing through show/display/repr in Base.
I think pandas has a sane approach here:
>>> import pandas as pd
>>> df = pd.DataFrame(range(1000))
>>> df
0
0 0
1 1
2 2
3 3
4 4
.. ...
995 995
996 996
997 997
998 998
999 999
[1000 rows x 1 columns]
>>> print(df)
0
0 0
1 1
2 2
3 3
4 4
.. ...
995 995
996 996
997 997
998 998
999 999
[1000 rows x 1 columns]
So, I propose that we match this behavior and always give compact output.
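In Julia terms, matching pandas means routing the plain 2-arg `show` (which `print` and `println` hit) through the same row-limited text/plain path. A minimal sketch on a made-up table type, not UnROOT's actual code:

```julia
struct ToyTree
    rows::Vector{Int}
end

# Compact text/plain form: cap the rows shown, then summarize, pandas-style.
function Base.show(io::IO, ::MIME"text/plain", t::ToyTree)
    shown = min(length(t.rows), 10)
    foreach(r -> println(io, r), t.rows[1:shown])
    println(io, "[$(length(t.rows)) rows total, showing first $shown]")
end

# Route print/println through the same limited form instead of dumping everything.
Base.show(io::IO, t::ToyTree) = show(io, MIME"text/plain"(), t)

s = sprint(show, ToyTree(collect(1:1000)))
```

With this, `print(t)` in a script emits eleven lines instead of a thousand.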
I'll act as a human CI for an M1 mac ;)
This multi-threading unit test fails for me with Julia's default of one thread. However, everything succeeds if I use several threads (e.g. JULIA_NUM_THREADS=4 julia runtests.jl):
$ julia runtests.jl
Test Summary: | Pass Total
Parallel and enumerate interface | 9 9
Multi threading: Test Failed at /Users/namin/.julia/dev/UnROOT/test/runtests.jl:658
Expression: count((>)(0), nmus) > 1
Evaluated: 1 > 1
Stacktrace:
[1] macro expansion
@ /Applications/Julia-1.7.app/Contents/Resources/julia/share/julia/stdlib/v1.7/Test/src/Test.jl:445 [inlined]
[2] macro expansion
@ ~/.julia/dev/UnROOT/test/runtests.jl:658 [inlined]
[3] macro expansion
@ /Applications/Julia-1.7.app/Contents/Resources/julia/share/julia/stdlib/v1.7/Test/src/Test.jl:1283 [inlined]
[4] top-level scope
@ ~/.julia/dev/UnROOT/test/runtests.jl:648
Multi threading: Test Failed at /Users/namin/.julia/dev/UnROOT/test/runtests.jl:666
Expression: count((>)(0), nmus) > 1
Evaluated: 1 > 1
Stacktrace:
[1] macro expansion
@ /Applications/Julia-1.7.app/Contents/Resources/julia/share/julia/stdlib/v1.7/Test/src/Test.jl:445 [inlined]
[2] macro expansion
@ ~/.julia/dev/UnROOT/test/runtests.jl:666 [inlined]
[3] macro expansion
@ /Applications/Julia-1.7.app/Contents/Resources/julia/share/julia/stdlib/v1.7/Test/src/Test.jl:1283 [inlined]
[4] top-level scope
@ ~/.julia/dev/UnROOT/test/runtests.jl:648
Test Summary: | Pass Fail Total
Multi threading | 7 2 9
ERROR: LoadError: Some tests did not pass: 7 passed, 2 failed, 0 errored, 0 broken.
in expression starting at /Users/namin/.julia/dev/UnROOT/test/runtests.jl:645
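The assertion `count((>)(0), nmus) > 1` counts how many per-thread slots received any events, so with a single thread it can never exceed 1 and the failure above is expected. Here is a runnable illustration of the mechanism, with a hedged fix that scales the expectation with `nthreads()` (the loop body is made up; it only mirrors the shape of the test):

```julia
using Base.Threads

# nmus[i] tallies the iterations processed on thread i, mirroring the test's setup.
nmus = zeros(Int, nthreads())
@threads for _ in 1:1_000
    nmus[threadid()] += 1
end

busy = count(>(0), nmus)         # with 1 thread this is exactly 1
ok = busy >= min(2, nthreads())  # passes for both 1 and many threads
```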