moelf / fhist.jl Goto Github PK

View Code? Open in Web Editor NEW

35.0 4.0 10.0 28.69 MB

A pure Julia 1/2/3D histogram package that focus on speed and is thread-safe.

License: MIT License

Julia 100.00%

julia julialang histogram hep

fhist.jl's Introduction

FHist.jl

Fast, error-aware, and thread-safe 1D/2D/3D histograms that are also compatible with StatsBase.Histogram

Changelog

0.11
- Breaking change to the main constructor API, now the API is sanely divided into "constructor for empty histogram" and "make histogram given data". See documentation.
0.10
- due to huge latency issue, UnicodePlots.jl (for text based terminal display) has been dropped.

Quick Start

julia> using FHist

julia> a = randn(1000);

julia> h1 = Hist1D(a; binedges = -4:1.0:4)
edges: -4.0:1.0:4.0
bin counts: [0, 23, 123, 377, 320, 134, 21, 2]
total count: 1000

julia> h2 = Hist1D(; counttype = Int, binedges = -4.0:1.0:4.0);

julia> push!.(h2, a);

julia> Threads.@threads for v in a
           atomic_push!(h2, v) # thread-safe
       end

julia> h1+h1 == h2
true

Additionally, one can specify overflow=true when creating a histogram to clamp out-of-bounds values into the edge bins.

julia> Hist1D(rand(1000); binedges = 0:0.2:0.9, overflow=true)
edges: 0.0:0.2:0.8
bin counts: [218, 183, 198, 401]
total count: 1000

Speed

Single-threaded filling happens at 400 MHz

julia> a = randn(10^6);

julia> @benchmark Hist1D(a; binedges = -3:0.01:3)
BenchmarkTools.Trial: 2578 samples with 1 evaluation.
 Range (min … max):  1.871 ms …  3.360 ms  ┊ GC (min … max): 0.00% … 0.00%

Features

1D

julia> using FHist, Statistics

julia> h1 = Hist1D(randn(10^4).+2; binedges = -5:0.5:5);

julia> h1 = (h1 + h1*2)
edges: -5.0:0.5:5.0
bin counts: [0, 0, 0, 0, 0, 3, 3, 39, 153, 540, 1356, 2640, 4551, 5757, 5727, 4494, 2652, 1326, 546, 168]
total count: 29955

julia> bincenters(h1)
-4.75:0.5:4.75

julia> println(bincounts(h1))
[0, 0, 0, 0, 0, 3, 3, 39, 153, 540, 1356, 2640, 4551, 5757, 5727, 4494, 2652, 1326, 546, 168]

julia> println(h1.sumw2);
[0.0, 0.0, 0.0, 0.0, 0.0, 5.0, 5.0, 65.0, 255.0, 900.0, 2260.0, 4400.0, 7585.0, 9595.0, 9545.0, 7490.0, 4420.0, 2210.0, 910.0, 280.0]

julia> nbins(h1), integral(h1)
(20, 29955)

julia> mean(h1), std(h1)
(1.993865798698047, 1.0126978885814524)

julia> median(h1), quantile(h1, 0.5)
(1.7445284002084418, 1.7445284002084418)

julia> lookup.(h1, [-1.5,0,2.5]) # find bin counts for given x-axis values
3-element Vector{Int64}:
   39
 1356
 4494

julia> Hist1D(sample(h1; n=100))
edges: -1.0:1.0:5.0
bin counts: [4, 15, 41, 28, 10, 2]
total count: 100

julia> h1 |> normalize |> integral
1.0

2D

julia> h2 = Hist2D((randn(10^4),randn(10^4)); binedges = (-5:5,-5:5))

edges: (-5:5, -5:5)
bin counts: [0 0 … 0 0; 0 0 … 0 0; … ; 0 0 … 0 0; 0 0 … 0 0]
total count: 10000

julia> nbins(h2)
(10, 10)

julia> project(h2, :x)
edges: -5:5
bin counts: [0, 14, 228, 1345, 3399, 3462, 1300, 237, 15, 0]
total count: 10000

julia> h2 |> profile(:x) # or pipe

edges: -5:5
bin counts: [0.0, -0.21428571428571427, 0.0, 0.03382899628252788, 0.0025007355104442485, 0.012709416522241479, 0.018461538461538463, 0.035864978902953586, 0.1, 0.0]
total count: -0.010920048606008592

julia> h2 |> rebin(10,5)

edges: (-5.0:10.0:5.0, -5.0:5.0:5.0)
bin counts: [4953 5047]
total count: 10000

3D

julia> h3 = Hist3D((randn(10^4),randn(10^4),randn(10^4)); binedges = (-5:5,-5:5,-5:5))
Hist3D{Int64}, edges=(-5:5, -5:5, -5:5), integral=10000

julia> h3 |> project(:x) |> project(:x) |> std
1.051590599996025

fhist.jl's People

Contributors

Stargazers

Watchers

Forkers

aminnj sgnoohc tamasgal rafaeljacobsen mivanovska007 grabanal bojohnson5 jjgomezcadenas sfranchel vaiton

fhist.jl's Issues

cumulative reverse order

Was the function meant to be cumsum(f( not f(cumsum( ?

FHist.jl/src/hist1d.jl

Lines 238 to 239 in 0c3dfdf

    
           h.hist.weights .= f(cumsum(h.hist.weights)) 
        
           h.sumw2 .= f(cumsum(h.sumw2))

Statbox not working on a normalized histogram

preserve number of entries and overflow bin

[Makie] `stephist` missing edges when some bins are empty in the middle of the histogram

julia> using FHist, CairoMakie

julia> h = Hist1D(;binedges = 0:3, bincounts = [2, 0, 10])

julia> begin
           stephist(h; axis=(; yticks=[1,10,100], yscale=log10))
           ylims!(1, 100)
           current_figure()
       end

[RFC] `v0.11` breaking change to main API

There are mainly three ways a user may want to construct a histogram:

Directly construct a histogram from known binedges, bincounts, sumw2s, nentries (e.g. loading histogram from ROOT)
Starting with a known histogram bin count type and binedges, then push!() into this histogram\
Starting with a bunch of data already in array (possibly with weights as well, maybe already know desired binedges)

Our current APIs are quite confusing, I want to re-design it, but there are ~~maybe binary-choice~~ some design questions we have to answer, I will say what they are and write down what I prefer for posterity:

A) `fit(Hist1D, ...)` vs. B) `Hist1D(...)`

The first obvious question is do we want to use Hist1D(...) for everything or reserve it for constructor-like use case only. I prefer option B) because even if we use fit() for the 3nd kind of use case, use case 1. and 2. will still share Hist1D(...) api,

Positional args vs. Keyword args

I would call use cases 1. and 2. "constructor-like" where 3. is "fit-like". Here we have one obvious choice (or the opposite), I have been thinking about something like this:

# constructor-like
Hist1D(; eltye::Type{T}, binedges, bincounts, sumw2s, nentries, overflow) where T

# fit-like
Hist1D(ary, weights, binedges; eltype::Type{T}, overflow) where T

we can as easily exchange these two, make fit-like use keyword argument. But somehow I feel like make constructir-like use case 100% explicit might be more natural

Pushing `-Inf` and `Inf`

I just bumped into an error when trying to push -Inf or Inf (also NaN but that's another topic) to a histogram. I wanted to create a PR but then found that there is already a commented-out codeline which fixes the problem and according to the comment it's even 20% faster.

Is there any reason not to use that line instead?

FHist.jl/src/hist1d.jl

Line 17 in b1a9b1c

    
           # return Base.unsafe_trunc(Int, round((x - first(r)) * inv(step(r)), RoundDown)) + 1

The error btw. is:

julia> using FHist

julia> h = Hist1D(;bins=1:10)
edges: 1:10
bin counts: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
total count: 0.0

julia> push!(h, -Inf)
ERROR: InexactError: trunc(Int64, -Inf)
Stacktrace:
 [1] trunc
   @ ./float.jl:893 [inlined]
 [2] floor
   @ ./float.jl:382 [inlined]
 [3] _edge_binindex
   @ ~/.julia/packages/FHist/Gj41b/src/hist1d.jl:15 [inlined]
 [4] push!
   @ ~/.julia/packages/FHist/Gj41b/src/hist1d.jl:121 [inlined]
 [5] push!(h::Hist1D{Float64, Tuple{UnitRange{Int64}}}, val::Float64)
   @ FHist ~/.julia/packages/FHist/Gj41b/src/hist1d.jl:119
 [6] top-level scope
   @ REPL[11]:1

Problem with recipes

Hello,
Wen including FHist in my Pluto notebook,

begin
	import PlutoUI
	using FHist
	using Printf
	using Plots
end

it returns an error:

LoadError: MethodError: no method matching var"@recipe"(::LineNumberNode, ::Module, ::Expr)

Closest candidates are:

var"@recipe"(::LineNumberNode, ::Module, ::Any, !Matched::Symbol, !Matched::Symbol...)

@ MakieCore ~/.julia/packages/MakieCore/zgrOY/src/recipes.jl:167

in expression starting at /Users/jjgomezcadenas/.julia/packages/FHist/N3dfh/src/plots.jl:3

in expression starting at /Users/jjgomezcadenas/.julia/packages/FHist/N3dfh/src/plots.jl:3

include(::Module, ::String)@Base.jl:457
include(::String)@FHist.jl:1
top-level [email protected]:40
[email protected]:370[inlined]
[email protected]:1[inlined]
(::FHist.var"#101#110")()@require.jl:101
macro [email protected]:393[inlined]
err(::Any, ::Module, ::String, ::String, ::Any)@require.jl:47
(::FHist.var"#100#109")()@require.jl:100
withpath(::Any, ::String)@require.jl:37
(::FHist.var"#99#108")()@require.jl:99
#invokelatest#[email protected]:819[inlined]
[email protected]:816[inlined]
foreach(::typeof(invokelatest), ::Vector{Function})@abstractarray.jl:3075
loadpkg(::Base.PkgId)@require.jl:27
#invokelatest#[email protected]:819[inlined]
[email protected]:816[inlined]
run_package_callbacks(::Base.PkgId)@loading.jl:1129
_require_prelocked(::Base.PkgId, ::String)@loading.jl:1667
macro [email protected]:1648[inlined]
macro [email protected]:267[inlined]
require(::Module, ::Symbol)@loading.jl:1611
top-level scope@[Other: 5](http://localhost:1235/edit?id=a4eb16cc-53e4-11ee-2791-87b16b34de44#)

Then I try a simple example:

a = randn(10000);
h1 = Hist1D(a, -4:0.1:4)
plot(h1)

which fails:

Cannot convert FHist.Hist1D{Int64, Tuple{StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}} to series data for plotting

error(::String)@error.jl:35
_prepare_series_data(::FHist.Hist1D{Int64, Tuple{StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}})@series.jl:8
_series_data_vector(::FHist.Hist1D{Int64, Tuple{StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}}, ::Dict{Symbol, Any})@series.jl:36
macro [email protected]:129[inlined]
apply_recipe(::AbstractDict{Symbol, Any}, ::Type{RecipesPipeline.SliceIt}, ::Any, ::Any, ::Any)@RecipesBase.jl:300
_process_userrecipes!(::Any, ::Any, ::Any)@user_recipe.jl:38
recipe_pipeline!(::Any, ::Any, ::Any)@RecipesPipeline.jl:72
_plot!(::Plots.Plot, ::Any, ::Any)@plot.jl:223
#plot#[email protected]:102[inlined]
[email protected]:93[inlined]
top-level scope@[Local: 1](http://localhost:1235/edit?id=a4eb16cc-53e4-11ee-2791-87b16b34de44#)[inlined]

I can still plot this histogram (sending to plot the weights and the centers) but the direct recipe seems to fail.

Thanks for an excellent package!

integral should take into account binwidths

this also affects normalize
so if one plots the normalized histogram one doesn't get an area under the curve of 1 as desired

Can this be Real?

FHist.jl/src/hist1d.jl

Line 129 in 2ea9ea2

h = Hist1D(Int; bins=r)

I am having issues where I try to do something like

h.hist = Histogram(edges, bc)

and the ones being replaced is Real instead of Int and Histogram complains.

ERROR: MethodError: Cannot convert an object of type
Histogram{Real,1,Tuple{StepRange{Int64{},Int64{}}}} to an object of type
Histogram{Int64,1,Tuple{StepRange{Int64{},Int64{}}}}

I wonder if assuming "Real" is better than Int.

repr in jupyter notebooks

Right now we suppress the UnicodePlots.jl visualization for histograms if there are too many bins (takes up too much vertical space):

h = Hist1D(randn(1000), -3:0.1:3)

edges: -3.0:0.1:3.0
bin counts: [1, 2, 0, 2, 1, 4, 4, 3, 4, 6  …  4, 6, 4, 3, 0, 0, 1, 0, 1, 0]
total count: 998

but in a notebook we have more freedom. Translating some code from yahist, I got

paddingy = 0.10
paddingx = 0.05
framewidth = 250
frameheight = 200
fill = "#1f77b4" # matplotlib C0
strokecolor = "black"
strokewidth = 0

c = bincounts(h)
e = binedges(h)
ys = frameheight * ((2*paddingy-1)/maximum(c) .* c .+ (1-paddingy))
xs = framewidth * (
    (1 - 2 * paddingx) / (maximum(e) - minimum(e))
    .* (e .- minimum(e))
    .+ paddingx
)
points = [(paddingx*framewidth, (1-paddingy)*frameheight)] # bottom left
for i in 1:(length(xs)-1)
    push!(points, (xs[i],ys[i])) # left bin edge
    push!(points, (xs[i+1],ys[i])) # right bin edge
end
push!(points, ((1-paddingx)*framewidth,(1-paddingy)*frameheight))
push!(points, points[1]) # close path
pathstr = join(["$(x),$(y)" for (x,y) in points],",")

display("image/svg+xml", """
    
<svg width="$(framewidth)" height="$(frameheight)" version="1.1" xmlns="http://www.w3.org/2000/svg">
    <polyline points="$(pathstr)" stroke="$(strokecolor)" fill="$(fill)" fill-opacity="1.0" stroke-width="$(strokewidth)"/>
    <polyline points="$(framewidth*paddingx),$(frameheight*(1-paddingy)),$(framewidth*(1-paddingx)),$(frameheight*(1-paddingy))" stroke="black" stroke-width="1"/>
    <text x="$(framewidth*paddingx)" y="$(frameheight*(1-0.5*paddingy))" dominant-baseline="middle" text-anchor="start" fill="black">$(minimum(e))</text>
    <text x="$(framewidth*(1-paddingx))" y="$(frameheight*(1-0.5*paddingy))" dominant-baseline="middle" text-anchor="end" fill="black">$(maximum(e))</text>
</svg>
    # black frame around drawable svg area
    # <rect width="$(framewidth)" height="$(frameheight)" fill="none" stroke="#000" stroke-width="2" />
    # red box around plottable area
    # <rect x="$(framewidth*paddingx)" y="$(frameheight*paddingy)" width="$(framewidth*(1-2*paddingx))" height="$(frameheight*(1-2*paddingy))" fill="none" stroke="#f00" stroke-width="2" />
    
""")
print("""
edges: $(binedges(h))
total count: $(integral(h))
""")

Worth it? Left some space at the bottom to label the leftmost and rightmost edges. Also, I have to admit that between display, show, the various MIME types, IO, ... I couldn't figure out which one is responsible for just jupyter printing :(

or do you like this one more

normalize, width = true

true should be the default
that would be consistent with everyone else's normalize
thanks

Make plot attributes histograms - direction

I need to plot a 1D histogram vertical with Makie. This is usually accomplished by specifying direction=:y to a plot call like hist or density, however this seems to not be propagated for a Hist1D type.

There are currently workarounds or better ways to achieve this?

Hist1D division

Division of two Hist1Ds currently crashes. It works with a small modification

--- a/src/arithmatics.jl
+++ b/src/arithmatics.jl
@@ -36,8 +36,8 @@ for T in (:Hist1D,)
         _f(counts) = any(x -> x<0, counts)
         (_f(h1.hist.weights) || _f(h2.hist.weights)) && error("Can't do / when bin count is negative")
         _hist = /(h1.hist, h2.hist)
-        _sumw2 =  h1.sumw2 / (h2.hist.weights .^ 2) +
-                (sqrt.(h2.sumw2) * h1.hist.weights / (h2.hist.weights .^ 2)) .^ 2
+        _sumw2 =  @. h1.sumw2 / (h2.hist.weights ^ 2) +
+                (sqrt(h2.sumw2) * h1.hist.weights / (h2.hist.weights ^ 2)) ^ 2

         ($T)(_hist, _sumw2)
     end

Though the output _sumw2 is a bunch of zeros if h1 and h2 are filled without weights. Maybe there needs to be a check inside here which "materializes" each of the sumw2s (gaussian/sqrt by default) before doing the error propagation?

Extract ATLAS styling and utility function in another package

It could be useful to create Makie graphs for the ATLAS collaboration. In particular:

ATLASTHEME
collabtext!

Useful functionality

Let me list some functionality from yahist that I found useful (used >=1 times) in the past year. Feel free to make a concrete todo list of whichever ones you think make sense in Julia.

+-/* operators
h.counts
h.errors
h.edges
h.bin_centers
h.nbins
h.integral
h.mean()/h.std()/h.median()
h.normalize()
h.rebin(3) to combine 3 adjacent bins into one
h.restrict(self, low=None, high=None) to subset the hist along the x-axis
h.cumulative()
h.lookup() which takes x-value(s) and returns the bin content(s). E.g.,

lookup(h::Hist1D, value) = h.hist.weights[FHist._edge_binindex(h.hist.edges[1], value)]
h1 = Hist1D(Int; bins=0:3)
push!.(h1, [0.5, 1.5])
@test lookup.(h1, [0.5,1.5,2.5]) == [1, 1, 0]

h.quantile() (h.quantile(0.5) is used for h.median() above)
h.sample()
h.to_json(), h.from_json() for de/serialization

Number of entries not properly 'merged'

If you merge! two histograms the number of entries are not accumulated.

julia> h1 = Hist1D([1.,2.,3.])
edges: 1.0:1.0:4.0
bin counts: [1, 1, 1]
total count: 3

julia> nentries(h1)
3

julia> h2 = Hist1D([1.,2.,3.])
edges: 1.0:1.0:4.0
bin counts: [1, 1, 1]
total count: 3

julia> nentries(h2)
3

julia> merge!(h1,h2)
edges: 1.0:1.0:4.0
bin counts: [2, 2, 2]
total count: 6

julia> nentries(h1)
3

Weird behavior

How should I understand this?

julia> println(list_nsim)
[233, 213, 192, 270, 185, 277, 243, 171, 202, 166, 159, 191, 165, 204, 171, 231, 174, 223, 217, 170, 217, 197, 254, 200, 251, 191, 219, 239, 263, 222, 192, 210, 218, 223, 229, 266, 244, 220, 169, 223, 210, 225, 210, 179, 188, 154, 153, 167, 202, 220]

julia> h = Hist1D(list_nsim, 0:20:300)
                  ┌                              ┐
   [  0.0,  20.0) ┤ 0
   [ 20.0,  40.0) ┤ 0
   [ 40.0,  60.0) ┤ 0
   [ 60.0,  80.0) ┤ 0
   [ 80.0, 100.0) ┤ 0
   [100.0, 120.0) ┤ 0
   [120.0, 140.0) ┤ 0
   [140.0, 160.0) ┤ 0
   [160.0, 180.0) ┤ 0
   [180.0, 200.0) ┤ 0
   [200.0, 220.0) ┤ 0
   [220.0, 240.0) ┤ 0
   [240.0, 260.0) ┤ 0
   [260.0, 280.0) ┤ 0
   [280.0, 300.0) ┤ 0
                  └                              ┘
edges: 0:20:300
bin counts: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
total count: 0

julia> g = Hist1D(list_nsim, 233:1:253)
                  ┌                              ┐
   [233.0, 234.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1
   [234.0, 235.0) ┤ 0
   [235.0, 236.0) ┤ 0
   [236.0, 237.0) ┤ 0
   [237.0, 238.0) ┤ 0
   [238.0, 239.0) ┤ 0
   [239.0, 240.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1
   [240.0, 241.0) ┤ 0
   [241.0, 242.0) ┤ 0
   [242.0, 243.0) ┤ 0
   [243.0, 244.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1
   [244.0, 245.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1
   [245.0, 246.0) ┤ 0
   [246.0, 247.0) ┤ 0
   [247.0, 248.0) ┤ 0
   [248.0, 249.0) ┤ 0
   [249.0, 250.0) ┤ 0
   [250.0, 251.0) ┤ 0
   [251.0, 252.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1
   [252.0, 253.0) ┤ 0
                  └                              ┘
edges: 233:1:253
bin counts: [1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0]
total count: 5

Naming of push/unsafe_push

Should we rename
unsafe_push -> push
push -> atomic_push
?
(and then update docs to emphasize the difference). "atomic" also makes it clear what the safe/unsafe distinction refers to--thread safety.

That way people get good performance with the default push!()ing into a histogram. And it will still work in most use cases where threads operate on different chunks of data and fill thread-local histograms which are summed/merged later.

Issue with integer bins

Ran into an issue while playing around. Also, would be nice to re-introduce the unicode printing of the histogram by copying/converting the edges to float to bypass the UnicodePlots issue (also should suppress the drawing if there's more than, say, 50 bins?).

julia> Hist1D(rand(10), 1:1:5)
ERROR: StackOverflowError:
Stacktrace:
 [1] diff(r::StepRange{Int64, Int64}; dims::Int64)
   @ Base ./multidimensional.jl:964
 [2] diff
   @ ./multidimensional.jl:965 [inlined]
 [3] _is_uniform_bins
   @ ~/.julia/dev/FHist/src/FHist.jl:20 [inlined]
 [4] Hist1D(A::Vector{Float64}, edges::StepRange{Int64, Int64})
   @ FHist ~/.julia/dev/FHist/src/hist1d.jl:114
 [5] Hist1D(A::Vector{Float64}, edges::StepRange{Int64, Int64}) (repeats 20125 times)
   @ FHist ~/.julia/dev/FHist/src/hist1d.jl:117
 [6] top-level scope
   @ REPL[2]:1

but

julia> Hist1D(rand(10), 1.0:1:5)

works.

FHist type-parameter (of eltype) is also used for the weights

Not sure if this is a bug or feature, but I was confused when doing

using FHist

h = Hist1D(Float32, bins=0:0.5:1)
push!.(h, rand(100_000_000))

which produces wrong counts:

The counts are simply h.hist.weights but that should not depend on the element type.

I can of course fix it if it's a bug, but maybe I miss some design decision ;)

Nonuniform weights?

Hi,

Thanks for sharing this package! I have a question of the option of nonuniform weights provided by the APIs. It seems like currently it is always assumed to have uniform weights 1. Is it possible to support a separate weight array input?

If so, I would really love to see the performance difference compared to, say, the 2D histogram plotting function such as

h = histogram2d(v1, v2; bins=(r1, r2), weights)

from Matplotlib.

Always compute sumw2?

Based on #7, would it be worth always computing sumw2 (either by the whole-array-at-a-time or push!() API? That is, being less lazy about sumw2 so that there's less issues with histogram arithmetic later.

I tested the difference between these two methods:

FHist.jl/src/hist1d.jl

Lines 45 to 59 in 20e6695

    
           function Base.push!(h::Hist1D{T,E}, val::Real, wgt::Real) where {T,E} 
        
               @inbounds binidx = searchsortedlast(h.hist.edges[1], val) 
        
               lock(h) 
        
               @inbounds h.hist.weights[binidx] += wgt 
        
               @inbounds h.sumw2[binidx] += wgt^2 
        
               unlock(h) 
        
               return h 
        
           end 
        
           function Base.push!(h::Hist1D{T,E}, val::Real) where {T,E} 
        
               @inbounds binidx = searchsortedlast(h.hist.edges[1], val) 
        
               lock(h) 
        
               @inbounds h.hist.weights[binidx] += one(T) 
        
               unlock(h) 
        
               return h 
        
           end

julia> h = Hist1D(Int; bins=-5:1.0:5);

julia> a = randn(10^6);

julia> f(a::Vector{Float64}) = for i in a push!(h, i) end

julia> g(a::Vector{Float64}) = for i in a push!(h, i, 1) end

julia> @benchmark f($a)
BechmarkTools.Trial: 52 samples with 1 evaluations.
 Range (min … max):  95.785 ms … 102.866 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     96.994 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   97.304 ms ±   1.294 ms  ┊ GC (mean ± σ):  0.28% ± 0.39%

julia> @benchmark g($a)
BechmarkTools.Trial: 51 samples with 1 evaluations.
 Range (min … max):  97.108 ms … 102.408 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     98.253 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   98.466 ms ± 909.434 μs  ┊ GC (mean ± σ):  0.28% ± 0.41%

and they're basically the same speed. So they could be combined into one function with a default weight of 1 and always increment sumw2. Unless I mis-benchmarked somehow.

Constructor using Dict{Symbol, Key} from UnROOT's hist

Below is an example of how one might turn a UnROOT's hist object into a Hist1D.

function make_fhist1d(th1::Dict{Symbol, Any})
    xmin = th1[:fXaxis_fXmin]
    xmax = th1[:fXaxis_fXmax]
    nbins = th1[:fXaxis_fNbins]
    bins = range(xmin, xmax, length=nbins+1)
    h = Hist1D(Float64; bins=bins)
    h.hist.weights .= th1[:fN][2:end-1]
    h.sumw2 .= th1[:fSumw2][2:end-1]
    return h
end

I am not sure what would be the best way or best place to incorporate stuff like this....

Overflow behavior

Collecting discussion here on overflow handling. In my mind there are three ways of doing this:

the ROOT way: have extra underflow and overflow bins. If we don't want to deal with hackiness of changing the StatsBase histogram edges, then this would mean underflow and overflow counters in the Hist1D struct.
how it's done in yahist: Hist1D(..., overflow=True) by default (or False). This puts the overflow and underflow into the last or first bins rather than keep them separate like ROOT does.
let the user handle it: By default, anything outside the bin range is neglected. If the user wants to include overflows, they either extend the range or they clamp() values before filling the histogram.

I think (3) is too annoying for HEP usecases.
(1) has the complication of extra infrastructure and when computing various statistics, do we include or exclude overflow?
I'm biased ;) but (2) seems like a sensible compromise which wouldn't be hard to implement (basically just clamp() inside the filling methods). And it makes the choice of including/excluding overflow at histogram construction time rather than arithmetic/statistic computation time. But it means that overflow/underflow can't be separated out after histogram construction (you'd need to reconstruct the histogram).

What are your thoughts?

unsafe_push when only 1 thread

Since julia keeps the same number of threads throughout a session, shouldn't we be able to jit the right push!() based on the number of threads (i.e., don't use lock/unlock if nthreads() == 1)?

FHist.jl/src/hist1d.jl

Lines 101 to 106 in 23ace07

    
           @inline function Base.push!(h::Hist1D{T,E}, val::Real, wgt::Real=1) where {T,E} 
        
               lock(h) 
        
               unsafe_push!(h, val, wgt) 
        
               unlock(h) 
        
               return nothing 
        
           end

High precision bin edges make overfull display

FHistCairoMakieExt not found

Error: Error during loading of extension FHistCairoMakieExt of FHist, use Base.retry_load_extensions() to retry.
│ exception =
│ 1-element ExceptionStack:
│ ArgumentError: Package FHistCairoMakieExt [563811e1-e1a2-5d73-ba9f-d6e07a357f0b] is required but does not seem to be installed:
│ - Run Pkg.instantiate() to install all recorded dependencies.

Pkg.instantiate() does not fix the issue
0 dependencies successfully precompiled in 1 seconds. 299 already precompiled.

More informations:
Status ~/.julia/environments/v1.10/Project.toml
[13f3f980] CairoMakie v0.11.4
[68837c9b] FHist v0.10.7

Julia Version 1.10.0
Commit 3120989f39b (2023-12-25 18:01 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: macOS (arm64-apple-darwin22.4.0)
CPU: 8 × Apple M3
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, apple-m1)
Threads: 2 on 4 virtual cores

Is normal that memory allocations happens when filling?

The package is great, but while chasing memory allocations in my application I noticed that it generates them.
See the following test case:

using FHist
data = randn(100);
h = Hist1D(; bins=-10:1:10)

@time for d in data
              push!(h,d)
              end
  0.000027 seconds (200 allocations: 4.688 KiB)

2 allocations per fill.

Benchmarks

Is it expected that Hist1D(a) is 10x-20x faster than the push!() interface?

julia> a = rand(10^6);
julia> h = Hist1D(Int; bins=-3:0.5:3);

julia> @benchmark Hist1D(a)
BechmarkTools.Trial: 724 samples with 1 evaluations.
 Range (min … max):  6.335 ms …   9.398 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     6.804 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.895 ms ± 335.737 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
 Memory estimate: 608 bytes, allocs estimate: 5.

julia> @benchmark for i in a push!(h2, i) end
BechmarkTools.Trial: 32 samples with 1 evaluations.
 Range (min … max):  150.007 ms … 183.342 ms  ┊ GC (min … max): 2.01% … 1.94%
 Time  (median):     155.427 ms               ┊ GC (median):    2.01%
 Time  (mean ± σ):   156.255 ms ±   5.758 ms  ┊ GC (mean ± σ):  2.08% ± 0.24%
 Memory estimate: 91.54 MiB, allocs estimate: 3999490.

julia> @benchmark push!.(Ref(h2), a)
BechmarkTools.Trial: 71 samples with 1 evaluations.
 Range (min … max):  62.397 ms … 140.778 ms  ┊ GC (min … max): 0.00% … 50.19%
 Time  (median):     68.369 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   70.491 ms ±  11.774 ms  ┊ GC (mean ± σ):  5.69% ±  8.42%
 Memory estimate: 22.89 MiB, allocs estimate: 5.

Plotting example using FHist and Plotly

Here is an example of how one might use FHist to make some real physics plots.
I created a wrapper of PlotlyJS to format things in LHC physics plot style.
(Because finding all the options in plotly takes some time.)

using PlotlyJSWrapper
using FHist

# Creating example histograms
h1 = Hist1D(randn(3000),-5:0.5:5)
h2 = Hist1D(randn(1000).+1,-5:0.5:5)
h3 = Hist1D((randn(1000).+2)./2,-5:0.5:5)
h4 = Hist1D((randn(2000).-2).*2,-5:0.5:5)
h5 = Hist1D(randn(2000).*5,-5:0.5:5)
h6 = Hist1D(randn(2000).-0.5,-5:0.5:5)
data = Hist1D(randn(9000),-5:0.5:5)
signal = Hist1D((randn(1000).+10)./3,-5:0.5:5)

# Plotting
plot_stack(
           backgrounds=[h1, h2, h3, h4, h5, h6],
           data=[data],
           signals=[signal], # TODO Not supported yet
           options=Dict{Symbol, Any}(
            :xaxistitle => "Δϕ<sub>jj</sub> [GeV]",
            :outputname => "plot.pdf",
            :backgroundlabels => ["tt̄", "Higgs", "Drell-Yan", "tt̄Z", "ZZ", "VBS WW"],
            :signallabels => ["VVH"],
           )
          )

This would produce https://github.com/sgnoohc/PlotlyJSWrapper.jl/blob/main/examples/example1/plot.png

implement Hist3D

Hist1D through Hist3D should cover almost all use cases. Beyond that, people can use StatsBase.Histogram.

Since we don't plan to go beyond 3D, we don't need to have a generalized HistND, so implementing Hist3D should be mainly copy-pasting Hist2D...

Beyond the basic constructors and getters, the important methods to implement would be lookup (e.g., for scale factors in pT/eta/phi bins), and project/profile to reduce to a Hist2D which is actually displayable.

Should there be a function return sumw2?

Just a thought.
Feel free to ignore.

Just in case in future the data structure changes, having an accessor to "sumw2" information may be better than direct access of h.sumw2.

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

Normalize broken ?

julia>   LinearAlgebra.normalize( Hist1D( randn(1000)); width=true )
ERROR: MethodError: no method matching normalize(::Hist1D{Int64, Tuple{StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}}; width=true)
Closest candidates are:
  normalize(::Hist1D) at ~/.julia/packages/FHist/tzC40/src/hist1d.jl:244 got unsupported keyword argument "width"
  normalize(::StatsBase.Histogram{T, N}; mode) where {T, N} at ~/.julia/packages/StatsBase/XgjIN/src/hist.jl:534 got unsupported keyword argument "width"
  normalize(::StatsBase.Histogram{T, N}, ::Array{T, N}...; mode) where {T, N} at ~/.julia/packages/StatsBase/XgjIN/src/hist.jl:547 got unsupported keyword argument "width"
  ...
Stacktrace:
 [1] top-level scope
   @ REPL[42]:1

	h.hist.weights .= f(cumsum(h.hist.weights))
	h.sumw2 .= f(cumsum(h.sumw2))

	function Base.push!(h::Hist1D{T,E}, val::Real, wgt::Real) where {T,E}
	@inbounds binidx = searchsortedlast(h.hist.edges[1], val)
	lock(h)
	@inbounds h.hist.weights[binidx] += wgt
	@inbounds h.sumw2[binidx] += wgt^2
	unlock(h)
	return h
	end
	function Base.push!(h::Hist1D{T,E}, val::Real) where {T,E}
	@inbounds binidx = searchsortedlast(h.hist.edges[1], val)
	lock(h)
	@inbounds h.hist.weights[binidx] += one(T)
	unlock(h)
	return h
	end

	@inline function Base.push!(h::Hist1D{T,E}, val::Real, wgt::Real=1) where {T,E}
	lock(h)
	unsafe_push!(h, val, wgt)
	unlock(h)
	return nothing
	end

moelf / fhist.jl Goto Github PK

fhist.jl's Introduction

FHist.jl

Changelog

Quick Start

Speed

Features

1D

2D

3D

fhist.jl's People

Contributors

Stargazers

Watchers

Forkers

fhist.jl's Issues

A) fit(Hist1D, ...) vs. B) Hist1D(...)

Positional args vs. Keyword args

Recommend Projects

Recommend Topics

Recommend Org

A) `fit(Hist1D, ...)` vs. B) `Hist1D(...)`