juliahep / jetreconstruction.jl Goto Github PK

View Code? Open in Web Editor NEW

13.0 13.0 3.0 13.78 MB

Jet reconstruction (reclustering) with Julia

License: MIT License

Julia 100.00%

hep jet-finding jet-reclustering jet-reconstruction julia physics

jetreconstruction.jl's People

Contributors

Stargazers

Watchers

Forkers

moelf graeme-a-stewart

jetreconstruction.jl's Issues

HepMC3 Reader Status

We have as ACSII HepMC3 reader that @grasph wrote. It's great for testing and benchmarking.

However, is there a better defined HepMC3 reader in Julia? I see that there is a HepMC3_jll, but I don't find any Julia interface package for it. Am I missing something? Does anyone know of plans to write an interface or support a native Julia reader?

Defaut strategy

We should have a default strategy (like Best) where the number of input particles is used to pick the more optimal algorithm (FastJet does this). #26 would help us to empirically determine the breakpoints as a function of the density of initial particles.

Improve tests

There is a nice benchmark of 100 Pythia 8 events that @grasph generated that we use for correctness and benchmarking ATM.

I would propose that we extend this to two additional use cases:

low density FCCee type events (where N2Plain might outperform N2Tiled)
very high density HI type events (which will just be challenging)

In both cases FastJet should be used as a reference to ensure correctness.

We should also test other clustering strategies, at least Cambridge/Achen ($p=0$) and Inclusive $k_T$ ($p=1$).

Return consistent sequence types

At the moment N2Plain and N2Tiled return different representations of the merging sequence (tiled returns a FastJet inspired one, plain returns a simpler nested structure).

We should really make these the same (honestly, I don't understand so well the use cases for this, beyond debugging).

Use Makie for 3D barplot

here's a dummy example:

using WGLMakie
begin
	N = 100
	δ = 0.02
	cs = repeat(1:5, N÷5)
	heights = rand(N) * 200
	meshscatter(
                                # this is needed because Makie would flatten to 2D if all Zs are the same
		Point3f.(rand(N), rand(N), 0.00001rand(N)); 
	  	color = cs, 
		markersize = Vec3f.(δ, δ, heights),
		colormap = :Spectral_7,
		marker = Rect3f(Vec3f(0), Vec3f(1)),
	 	axis = (type = Axis3, perspectiveness = 0.5, azimuth = 2.4,
            xlabel = "ϕ", ylabel = "η", zlabel = "kt",
		    limits = (nothing, nothing, nothing, nothing, 0, 200)
		),
	    shading=false
	)
end

It should also work with GLMakie.jl, but 3D might be a bit broken with CairoMakie...

Loop optimisation

I've been trying to optimise the code for sequential recombination algorithm in Julia and spotted an interesting fact.

Below you can see a very simple piece of code that is basically a translation of the C++ FastJet code. The profiler shows that lines that only have nndist = Δ2 in them take up as much time as really computationally heavy lines like Δ2 = _dist(i, j, _eta, _phi). I checked the stack that the profiler throws out and it's actually calling the == operation inside of the iterate() function that causes that. So this probably comes not from the assignment itself but from the end of the loop iteration. Simple loops typically get optimised but here it's something strange.

I used Profile.jl and profiled the entire algorithm in a loop. This function is called from there frequently and only exists to be compiled to avoid dynamic dispatch in the main algorithm.

# src/Algo.jl @ line 45
function _upd_nn_nocross!(i::Int, from::Int, to::Int, _eta, _phi, _R2, _nndist, _nn)
    nndist::Float64 = _R2
    nn = i
    Δ2::Float64 = 0.0
    for j in from:(i-1)
        Δ2 = _dist(i, j, _eta, _phi)
        if Δ2 <= nndist
            nn = j
            nndist = Δ2 # too long for an assignment
        end
    end
    for j in (i+1):to
        Δ2 = _dist(i, j, _eta, _phi)
        if Δ2 <= nndist
            nn = j
            nndist = Δ2 # too long for an assignment
        end
    end
    _nndist[i] = nndist
    _nn[i] = nn
    nothing
end

Optimisation needed

The current Julia implementation is significantly slower than C++ FastJet and I assume we can do better. Here you can see a plot that compares benchmarking results from an older version of this repo, which is already heavily optimised, ("Faster Julia" from about a week ago, showed in dark purple) with the currently used version of the algorithm ("New Faster Julia", in magenta). C++ FastJet is still way below. I did not include compilation time and used BenchmarkTools.jl for benchmarking. For the Julia approaches I've plotted the entire range of runtimes as well as the median (it's very close to the lower bound).

I cannot think of any other methods to optimise the code for now. The latest updates have already been inspired mostly by @Moelf 's suggestions (i.e. using @inbounds @simd for, Base.@propagate_inbounds, etc). In case anyone is able to spot other issues or knows ways to optimise the code with more or less native Julia methods, please let me know.

P.S. Profiling shows that most of the time we spend in the two following blocks of code:

...
## src/Algo.jl @ line 105 
# initialize _nn
@simd for i in 1:N
    _upd_nn_crosscheck!(i, 1, i-1, _eta, _phi, _R2, _nndist, _nn)
end
...

...
## src/Algo.jl @ line 165 
# update nearest neighbours step
@inbounds @simd for k in 1:N
    _upd_nn_step!(i, j, k, N, Nn, _kt2, _eta, _phi, _R2, _nndist, _nn, _nndij)
end
...

Move project to JuliaHEP organization?

use existing Lorentz Vector package

https://github.com/JuliaHEP/LorentzVectorHEP.jl instead of https://github.com/gojakuch/JetReconstruction.jl/blob/main/src/Particle.jl ?

Define public interface

We should define what the public interfaces are to the package. At the moment we have:

Input

N2Plain

Defaults to a collection of arbitrary objects, where px, py, pz, energy are defined for each object (in JetReconstruction, i.e., JetReconstruction.px(T)) to extract the relevant 4-vector parameters.

I like that, but the default used in Particle.jl is a little bit hokey for my taste as the index numbers are mapped to each quantity in a magic number kind of way.

N2Tiled

A collection of PseudoJet structs.

The thing I dislike currently is that the struct has an implementation specific parameter, viz. _cluster_hist_index. It's also rather close to LorentzVectorHEP, albeit that circular parameters are cached (not sure if we need that).

Proposal

Standardise to accept input particles as a collection of LorentzVectors, used via the LorentzVectorHEP package, which provides the appropriate px, py, pz, energy definitions already. This also achieves #5.

I still like the idea of documenting for type T the idea of defining JetReconstruction.px{T}, etc. that provides a clear way for users to use their own types with the package.

Internal Use

It is probably optimal that each algorithm uses an internal EDM when users pass arbitrary types, as these can be tuned for internal efficiency (e.g., the N2Tiled method can keep using its current PseudoJet).

Output

N2Plain returns the same type of jet objects that it was given as input. N2Tiled returns its own PseudoJet objects.

Also the sequences are different, with N2Tiled returning a FastJet-like clusterisation history and N2Plain a vector of ints, giving the merging index sequence for each input particle (and intermediates).

N2Tiled also filters on $p_{Tmin}$ and N2Plain doesn't.

I am really not sure what's best here - uniformly return something like LorentzVectorCyl, which gives $(p_T, \eta, \phi)$ coordinates?

Nicer strategy switching

ATM, the algorithm strategy is manual - --strategy=N2Plain in chep.jl. The string is mapped to an enum in Julia, but there is an annoying manual mapping from the option string to the enum. I could not find a way to extract a nice string from the enum names, did I miss something?

However, reading about this issue, it seems people often use types in these cases, not enums. Should see if that is a nicer way. If a concrete type was used to map to each algorithm it could contain a user friendly string for the CLI.

License Matters

We have a bunch of code that came from AntiKt,jl that is licensed under GPL. The rest of the package is under the more permissive MIT license (more popular in the Julia community IMO).

Current CERN and HSF advice is to not use viral licenses and, specifically, ATLAS and CMS disfavour them (they use Apache2).

@grasph - would you be willing to relicense the code under an MIT license?

Use `FastJet.jl` for comparison later

it looks like eventually we'd compare to FastJet's output, @jstrube has kindly wrapped the C++ library in Julia: https://github.com/jstrube/FastJet.jl . I also assume Jan is happy to help you with any missing features there.

Due to lack of docs might want to look at C++ FastJet documentation and compare to test:
https://github.com/jstrube/FastJet.jl/blob/ba5a1521430fbfecf3cfec977d450ae169063ee5/test/runtests.jl#L23

but we can imagine extending FastJet.jl to accept stuff from LorentzVectorHep.jl

Remove internal data from PseudoJet

The PseudoJet struct is extremely useful and works very well in this package. However, it has one defect when exposed publicly, viz. a data member which is completely for the internal use of the N2Tiled algorithm, _cluster_hist_index.

This should be removed to make the struct clean for export and an internal index added to be used during tiled reconstruction.