Code Monkey home page Code Monkey logo

fhist.jl's Introduction

FHist.jl

Dev CICodecov

Fast, error-aware, and thread-safe 1D/2D/3D histograms that are also compatible with StatsBase.Histogram

Changelog

  • 0.11
    • Breaking change to the main constructor API, now the API is sanely divided into "constructor for empty histogram" and "make histogram given data". See documentation.
  • 0.10
    • due to huge latency issue, UnicodePlots.jl (for text based terminal display) has been dropped.

Quick Start

julia> using FHist

julia> a = randn(1000);

julia> h1 = Hist1D(a; binedges = -4:1.0:4)
edges: -4.0:1.0:4.0
bin counts: [0, 23, 123, 377, 320, 134, 21, 2]
total count: 1000

julia> h2 = Hist1D(; counttype = Int, binedges = -4.0:1.0:4.0);

julia> push!.(h2, a);

julia> Threads.@threads for v in a
           atomic_push!(h2, v) # thread-safe
       end

julia> h1+h1 == h2
true

Additionally, one can specify overflow=true when creating a histogram to clamp out-of-bounds values into the edge bins.

julia> Hist1D(rand(1000); binedges = 0:0.2:0.9, overflow=true)
edges: 0.0:0.2:0.8
bin counts: [218, 183, 198, 401]
total count: 1000

Speed

Single-threaded filling happens at 400 MHz

julia> a = randn(10^6);

julia> @benchmark Hist1D(a; binedges = -3:0.01:3)
BenchmarkTools.Trial: 2578 samples with 1 evaluation.
 Range (min  max):  1.871 ms   3.360 ms  ┊ GC (min  max): 0.00%  0.00%

Features

1D

julia> using FHist, Statistics

julia> h1 = Hist1D(randn(10^4).+2; binedges = -5:0.5:5);

julia> h1 = (h1 + h1*2)
edges: -5.0:0.5:5.0
bin counts: [0, 0, 0, 0, 0, 3, 3, 39, 153, 540, 1356, 2640, 4551, 5757, 5727, 4494, 2652, 1326, 546, 168]
total count: 29955

julia> bincenters(h1)
-4.75:0.5:4.75

julia> println(bincounts(h1))
[0, 0, 0, 0, 0, 3, 3, 39, 153, 540, 1356, 2640, 4551, 5757, 5727, 4494, 2652, 1326, 546, 168]

julia> println(h1.sumw2);
[0.0, 0.0, 0.0, 0.0, 0.0, 5.0, 5.0, 65.0, 255.0, 900.0, 2260.0, 4400.0, 7585.0, 9595.0, 9545.0, 7490.0, 4420.0, 2210.0, 910.0, 280.0]

julia> nbins(h1), integral(h1)
(20, 29955)

julia> mean(h1), std(h1)
(1.993865798698047, 1.0126978885814524)

julia> median(h1), quantile(h1, 0.5)
(1.7445284002084418, 1.7445284002084418)

julia> lookup.(h1, [-1.5,0,2.5]) # find bin counts for given x-axis values
3-element Vector{Int64}:
   39
 1356
 4494

julia> Hist1D(sample(h1; n=100))
edges: -1.0:1.0:5.0
bin counts: [4, 15, 41, 28, 10, 2]
total count: 100

julia> h1 |> normalize |> integral
1.0

2D

julia> h2 = Hist2D((randn(10^4),randn(10^4)); binedges = (-5:5,-5:5))

edges: (-5:5, -5:5)
bin counts: [0 0  0 0; 0 0  0 0;  ; 0 0  0 0; 0 0  0 0]
total count: 10000

julia> nbins(h2)
(10, 10)

julia> project(h2, :x)
edges: -5:5
bin counts: [0, 14, 228, 1345, 3399, 3462, 1300, 237, 15, 0]
total count: 10000

julia> h2 |> profile(:x) # or pipe

edges: -5:5
bin counts: [0.0, -0.21428571428571427, 0.0, 0.03382899628252788, 0.0025007355104442485, 0.012709416522241479, 0.018461538461538463, 0.035864978902953586, 0.1, 0.0]
total count: -0.010920048606008592

julia> h2 |> rebin(10,5)

edges: (-5.0:10.0:5.0, -5.0:5.0:5.0)
bin counts: [4953 5047]
total count: 10000

3D

julia> h3 = Hist3D((randn(10^4),randn(10^4),randn(10^4)); binedges = (-5:5,-5:5,-5:5))
Hist3D{Int64}, edges=(-5:5, -5:5, -5:5), integral=10000

julia> h3 |> project(:x) |> project(:x) |> std
1.051590599996025

fhist.jl's People

Contributors

aminnj avatar bojohnson5 avatar francescoalemanno avatar github-actions[bot] avatar jjgomezcadenas avatar mivanovska007 avatar moelf avatar rafaeljacobsen avatar sfranchel avatar tamasgal avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

fhist.jl's Issues

[RFC] `v0.11` breaking change to main API

There are mainly three ways a user may want to construct a histogram:

  1. Directly construct a histogram from known binedges, bincounts, sumw2s, nentries (e.g. loading histogram from ROOT)
  2. Starting with a known histogram bin count type and binedges, then push!() into this histogram\
  3. Starting with a bunch of data already in array (possibly with weights as well, maybe already know desired binedges)

Our current APIs are quite confusing, I want to re-design it, but there are maybe binary-choice some design questions we have to answer, I will say what they are and write down what I prefer for posterity:

A) fit(Hist1D, ...) vs. B) Hist1D(...)

The first obvious question is do we want to use Hist1D(...) for everything or reserve it for constructor-like use case only. I prefer option B) because even if we use fit() for the 3nd kind of use case, use case 1. and 2. will still share Hist1D(...) api,

Positional args vs. Keyword args

I would call use cases 1. and 2. "constructor-like" where 3. is "fit-like". Here we have one obvious choice (or the opposite), I have been thinking about something like this:

# constructor-like
Hist1D(; eltye::Type{T}, binedges, bincounts, sumw2s, nentries, overflow) where T

# fit-like
Hist1D(ary, weights, binedges; eltype::Type{T}, overflow) where T

we can as easily exchange these two, make fit-like use keyword argument. But somehow I feel like make constructir-like use case 100% explicit might be more natural

Pushing `-Inf` and `Inf`

I just bumped into an error when trying to push -Inf or Inf (also NaN but that's another topic) to a histogram. I wanted to create a PR but then found that there is already a commented-out codeline which fixes the problem and according to the comment it's even 20% faster.

Is there any reason not to use that line instead?

# return Base.unsafe_trunc(Int, round((x - first(r)) * inv(step(r)), RoundDown)) + 1

The error btw. is:

julia> using FHist

julia> h = Hist1D(;bins=1:10)
edges: 1:10
bin counts: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
total count: 0.0

julia> push!(h, -Inf)
ERROR: InexactError: trunc(Int64, -Inf)
Stacktrace:
 [1] trunc
   @ ./float.jl:893 [inlined]
 [2] floor
   @ ./float.jl:382 [inlined]
 [3] _edge_binindex
   @ ~/.julia/packages/FHist/Gj41b/src/hist1d.jl:15 [inlined]
 [4] push!
   @ ~/.julia/packages/FHist/Gj41b/src/hist1d.jl:121 [inlined]
 [5] push!(h::Hist1D{Float64, Tuple{UnitRange{Int64}}}, val::Float64)
   @ FHist ~/.julia/packages/FHist/Gj41b/src/hist1d.jl:119
 [6] top-level scope
   @ REPL[11]:1

Problem with recipes

Hello,
Wen including FHist in my Pluto notebook,

begin
	import PlutoUI
	using FHist
	using Printf
	using Plots
end

it returns an error:

LoadError: MethodError: no method matching var"@recipe"(::LineNumberNode, ::Module, ::Expr)

Closest candidates are:

var"@recipe"(::LineNumberNode, ::Module, ::Any, !Matched::Symbol, !Matched::Symbol...)

@ MakieCore ~/.julia/packages/MakieCore/zgrOY/src/recipes.jl:167

in expression starting at /Users/jjgomezcadenas/.julia/packages/FHist/N3dfh/src/plots.jl:3

in expression starting at /Users/jjgomezcadenas/.julia/packages/FHist/N3dfh/src/plots.jl:3

include(::Module, ::String)@Base.jl:457
include(::String)@FHist.jl:1
top-level [email protected]:40
[email protected]:370[inlined]
[email protected]:1[inlined]
(::FHist.var"#101#110")()@require.jl:101
macro [email protected]:393[inlined]
err(::Any, ::Module, ::String, ::String, ::Any)@require.jl:47
(::FHist.var"#100#109")()@require.jl:100
withpath(::Any, ::String)@require.jl:37
(::FHist.var"#99#108")()@require.jl:99
#invokelatest#[email protected]:819[inlined]
[email protected]:816[inlined]
foreach(::typeof(invokelatest), ::Vector{Function})@abstractarray.jl:3075
loadpkg(::Base.PkgId)@require.jl:27
#invokelatest#[email protected]:819[inlined]
[email protected]:816[inlined]
run_package_callbacks(::Base.PkgId)@loading.jl:1129
_require_prelocked(::Base.PkgId, ::String)@loading.jl:1667
macro [email protected]:1648[inlined]
macro [email protected]:267[inlined]
require(::Module, ::Symbol)@loading.jl:1611
top-level scope@[Other: 5](http://localhost:1235/edit?id=a4eb16cc-53e4-11ee-2791-87b16b34de44#)

Then I try a simple example:

a = randn(10000);
h1 = Hist1D(a, -4:0.1:4)
plot(h1)

which fails:

Cannot convert FHist.Hist1D{Int64, Tuple{StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}} to series data for plotting

error(::String)@error.jl:35
_prepare_series_data(::FHist.Hist1D{Int64, Tuple{StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}})@series.jl:8
_series_data_vector(::FHist.Hist1D{Int64, Tuple{StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}}, ::Dict{Symbol, Any})@series.jl:36
macro [email protected]:129[inlined]
apply_recipe(::AbstractDict{Symbol, Any}, ::Type{RecipesPipeline.SliceIt}, ::Any, ::Any, ::Any)@RecipesBase.jl:300
_process_userrecipes!(::Any, ::Any, ::Any)@user_recipe.jl:38
recipe_pipeline!(::Any, ::Any, ::Any)@RecipesPipeline.jl:72
_plot!(::Plots.Plot, ::Any, ::Any)@plot.jl:223
#plot#[email protected]:102[inlined]
[email protected]:93[inlined]
top-level scope@[Local: 1](http://localhost:1235/edit?id=a4eb16cc-53e4-11ee-2791-87b16b34de44#)[inlined]

I can still plot this histogram (sending to plot the weights and the centers) but the direct recipe seems to fail.

Thanks for an excellent package!

Can this be Real?

h = Hist1D(Int; bins=r)

I am having issues where I try to do something like

h.hist = Histogram(edges, bc)

and the ones being replaced is Real instead of Int and Histogram complains.

ERROR: MethodError: Cannot convert an object of type
Histogram{Real,1,Tuple{StepRange{Int64{},Int64{}}}} to an object of type
Histogram{Int64,1,Tuple{StepRange{Int64{},Int64{}}}}

I wonder if assuming "Real" is better than Int.

repr in jupyter notebooks

Right now we suppress the UnicodePlots.jl visualization for histograms if there are too many bins (takes up too much vertical space):

h = Hist1D(randn(1000), -3:0.1:3) 
edges: -3.0:0.1:3.0
bin counts: [1, 2, 0, 2, 1, 4, 4, 3, 4, 6  …  4, 6, 4, 3, 0, 0, 1, 0, 1, 0]
total count: 998

but in a notebook we have more freedom. Translating some code from yahist, I got

paddingy = 0.10
paddingx = 0.05
framewidth = 250
frameheight = 200
fill = "#1f77b4" # matplotlib C0
strokecolor = "black"
strokewidth = 0

c = bincounts(h)
e = binedges(h)
ys = frameheight * ((2*paddingy-1)/maximum(c) .* c .+ (1-paddingy))
xs = framewidth * (
    (1 - 2 * paddingx) / (maximum(e) - minimum(e))
    .* (e .- minimum(e))
    .+ paddingx
)
points = [(paddingx*framewidth, (1-paddingy)*frameheight)] # bottom left
for i in 1:(length(xs)-1)
    push!(points, (xs[i],ys[i])) # left bin edge
    push!(points, (xs[i+1],ys[i])) # right bin edge
end
push!(points, ((1-paddingx)*framewidth,(1-paddingy)*frameheight))
push!(points, points[1]) # close path
pathstr = join(["$(x),$(y)" for (x,y) in points],",")

display("image/svg+xml", """
    
<svg width="$(framewidth)" height="$(frameheight)" version="1.1" xmlns="http://www.w3.org/2000/svg">
    <polyline points="$(pathstr)" stroke="$(strokecolor)" fill="$(fill)" fill-opacity="1.0" stroke-width="$(strokewidth)"/>
    <polyline points="$(framewidth*paddingx),$(frameheight*(1-paddingy)),$(framewidth*(1-paddingx)),$(frameheight*(1-paddingy))" stroke="black" stroke-width="1"/>
    <text x="$(framewidth*paddingx)" y="$(frameheight*(1-0.5*paddingy))" dominant-baseline="middle" text-anchor="start" fill="black">$(minimum(e))</text>
    <text x="$(framewidth*(1-paddingx))" y="$(frameheight*(1-0.5*paddingy))" dominant-baseline="middle" text-anchor="end" fill="black">$(maximum(e))</text>
</svg>
    # black frame around drawable svg area
    # <rect width="$(framewidth)" height="$(frameheight)" fill="none" stroke="#000" stroke-width="2" />
    # red box around plottable area
    # <rect x="$(framewidth*paddingx)" y="$(frameheight*paddingy)" width="$(framewidth*(1-2*paddingx))" height="$(frameheight*(1-2*paddingy))" fill="none" stroke="#f00" stroke-width="2" />
    
""")
print("""
edges: $(binedges(h))
total count: $(integral(h))
""")

image

Worth it? Left some space at the bottom to label the leftmost and rightmost edges. Also, I have to admit that between display, show, the various MIME types, IO, ... I couldn't figure out which one is responsible for just jupyter printing :(

or do you like this one more
image

normalize, width = true

true should be the default
that would be consistent with everyone else's normalize
thanks

Make plot attributes histograms - direction

I need to plot a 1D histogram vertical with Makie. This is usually accomplished by specifying direction=:y to a plot call like hist or density, however this seems to not be propagated for a Hist1D type.

There are currently workarounds or better ways to achieve this?

Hist1D division

Division of two Hist1Ds currently crashes. It works with a small modification

--- a/src/arithmatics.jl
+++ b/src/arithmatics.jl
@@ -36,8 +36,8 @@ for T in (:Hist1D,)
         _f(counts) = any(x -> x<0, counts)
         (_f(h1.hist.weights) || _f(h2.hist.weights)) && error("Can't do / when bin count is negative")
         _hist = /(h1.hist, h2.hist)
-        _sumw2 =  h1.sumw2 / (h2.hist.weights .^ 2) +
-                (sqrt.(h2.sumw2) * h1.hist.weights / (h2.hist.weights .^ 2)) .^ 2
+        _sumw2 =  @. h1.sumw2 / (h2.hist.weights ^ 2) +
+                (sqrt(h2.sumw2) * h1.hist.weights / (h2.hist.weights ^ 2)) ^ 2

         ($T)(_hist, _sumw2)
     end

Though the output _sumw2 is a bunch of zeros if h1 and h2 are filled without weights. Maybe there needs to be a check inside here which "materializes" each of the sumw2s (gaussian/sqrt by default) before doing the error propagation?

Useful functionality

Let me list some functionality from yahist that I found useful (used >=1 times) in the past year. Feel free to make a concrete todo list of whichever ones you think make sense in Julia.

  • +-/* operators
  • h.counts
  • h.errors
  • h.edges
  • h.bin_centers
  • h.nbins
  • h.integral
  • h.mean()/h.std()/h.median()
  • h.normalize()
  • h.rebin(3) to combine 3 adjacent bins into one
  • h.restrict(self, low=None, high=None) to subset the hist along the x-axis
  • h.cumulative()
  • h.lookup() which takes x-value(s) and returns the bin content(s). E.g.,
lookup(h::Hist1D, value) = h.hist.weights[FHist._edge_binindex(h.hist.edges[1], value)]
h1 = Hist1D(Int; bins=0:3)
push!.(h1, [0.5, 1.5])
@test lookup.(h1, [0.5,1.5,2.5]) == [1, 1, 0]
  • h.quantile() (h.quantile(0.5) is used for h.median() above)
  • h.sample()
  • h.to_json(), h.from_json() for de/serialization

Number of entries not properly 'merged'

If you merge! two histograms the number of entries are not accumulated.

julia> h1 = Hist1D([1.,2.,3.])
edges: 1.0:1.0:4.0
bin counts: [1, 1, 1]
total count: 3

julia> nentries(h1)
3

julia> h2 = Hist1D([1.,2.,3.])
edges: 1.0:1.0:4.0
bin counts: [1, 1, 1]
total count: 3

julia> nentries(h2)
3

julia> merge!(h1,h2)
edges: 1.0:1.0:4.0
bin counts: [2, 2, 2]
total count: 6

julia> nentries(h1)
3

Weird behavior

How should I understand this?

julia> println(list_nsim)
[233, 213, 192, 270, 185, 277, 243, 171, 202, 166, 159, 191, 165, 204, 171, 231, 174, 223, 217, 170, 217, 197, 254, 200, 251, 191, 219, 239, 263, 222, 192, 210, 218, 223, 229, 266, 244, 220, 169, 223, 210, 225, 210, 179, 188, 154, 153, 167, 202, 220]

julia> h = Hist1D(list_nsim, 0:20:300)
                  ┌                              ┐
   [  0.0,  20.0) ┤ 0
   [ 20.0,  40.0) ┤ 0
   [ 40.0,  60.0) ┤ 0
   [ 60.0,  80.0) ┤ 0
   [ 80.0, 100.0) ┤ 0
   [100.0, 120.0) ┤ 0
   [120.0, 140.0) ┤ 0
   [140.0, 160.0) ┤ 0
   [160.0, 180.0) ┤ 0
   [180.0, 200.0) ┤ 0
   [200.0, 220.0) ┤ 0
   [220.0, 240.0) ┤ 0
   [240.0, 260.0) ┤ 0
   [260.0, 280.0) ┤ 0
   [280.0, 300.0) ┤ 0
                  └                              ┘
edges: 0:20:300
bin counts: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
total count: 0

julia> g = Hist1D(list_nsim, 233:1:253)
                  ┌                              ┐
   [233.0, 234.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1
   [234.0, 235.0) ┤ 0
   [235.0, 236.0) ┤ 0
   [236.0, 237.0) ┤ 0
   [237.0, 238.0) ┤ 0
   [238.0, 239.0) ┤ 0
   [239.0, 240.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1
   [240.0, 241.0) ┤ 0
   [241.0, 242.0) ┤ 0
   [242.0, 243.0) ┤ 0
   [243.0, 244.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1
   [244.0, 245.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1
   [245.0, 246.0) ┤ 0
   [246.0, 247.0) ┤ 0
   [247.0, 248.0) ┤ 0
   [248.0, 249.0) ┤ 0
   [249.0, 250.0) ┤ 0
   [250.0, 251.0) ┤ 0
   [251.0, 252.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1
   [252.0, 253.0) ┤ 0
                  └                              ┘
edges: 233:1:253
bin counts: [1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0]
total count: 5

Naming of push/unsafe_push

Should we rename
unsafe_push -> push
push -> atomic_push
?
(and then update docs to emphasize the difference). "atomic" also makes it clear what the safe/unsafe distinction refers to--thread safety.

That way people get good performance with the default push!()ing into a histogram. And it will still work in most use cases where threads operate on different chunks of data and fill thread-local histograms which are summed/merged later.

Issue with integer bins

Ran into an issue while playing around. Also, would be nice to re-introduce the unicode printing of the histogram by copying/converting the edges to float to bypass the UnicodePlots issue (also should suppress the drawing if there's more than, say, 50 bins?).

julia> Hist1D(rand(10), 1:1:5)
ERROR: StackOverflowError:
Stacktrace:
 [1] diff(r::StepRange{Int64, Int64}; dims::Int64)
   @ Base ./multidimensional.jl:964
 [2] diff
   @ ./multidimensional.jl:965 [inlined]
 [3] _is_uniform_bins
   @ ~/.julia/dev/FHist/src/FHist.jl:20 [inlined]
 [4] Hist1D(A::Vector{Float64}, edges::StepRange{Int64, Int64})
   @ FHist ~/.julia/dev/FHist/src/hist1d.jl:114
 [5] Hist1D(A::Vector{Float64}, edges::StepRange{Int64, Int64}) (repeats 20125 times)
   @ FHist ~/.julia/dev/FHist/src/hist1d.jl:117
 [6] top-level scope
   @ REPL[2]:1

but

julia> Hist1D(rand(10), 1.0:1:5)

works.

FHist type-parameter (of eltype) is also used for the weights

Not sure if this is a bug or feature, but I was confused when doing

using FHist

h = Hist1D(Float32, bins=0:0.5:1)
push!.(h, rand(100_000_000))

which produces wrong counts:

Screenshot 2022-04-12 at 12 13 56

The counts are simply h.hist.weights but that should not depend on the element type.

I can of course fix it if it's a bug, but maybe I miss some design decision ;)

Nonuniform weights?

Hi,

Thanks for sharing this package! I have a question of the option of nonuniform weights provided by the APIs. It seems like currently it is always assumed to have uniform weights 1. Is it possible to support a separate weight array input?

If so, I would really love to see the performance difference compared to, say, the 2D histogram plotting function such as

h = histogram2d(v1, v2; bins=(r1, r2), weights)

from Matplotlib.

Always compute sumw2?

Based on #7, would it be worth always computing sumw2 (either by the whole-array-at-a-time or push!() API? That is, being less lazy about sumw2 so that there's less issues with histogram arithmetic later.

I tested the difference between these two methods:

FHist.jl/src/hist1d.jl

Lines 45 to 59 in 20e6695

function Base.push!(h::Hist1D{T,E}, val::Real, wgt::Real) where {T,E}
@inbounds binidx = searchsortedlast(h.hist.edges[1], val)
lock(h)
@inbounds h.hist.weights[binidx] += wgt
@inbounds h.sumw2[binidx] += wgt^2
unlock(h)
return h
end
function Base.push!(h::Hist1D{T,E}, val::Real) where {T,E}
@inbounds binidx = searchsortedlast(h.hist.edges[1], val)
lock(h)
@inbounds h.hist.weights[binidx] += one(T)
unlock(h)
return h
end

julia> h = Hist1D(Int; bins=-5:1.0:5);

julia> a = randn(10^6);

julia> f(a::Vector{Float64}) = for i in a push!(h, i) end

julia> g(a::Vector{Float64}) = for i in a push!(h, i, 1) end

julia> @benchmark f($a)
BechmarkTools.Trial: 52 samples with 1 evaluations.
 Range (min  max):  95.785 ms  102.866 ms  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     96.994 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   97.304 ms ±   1.294 ms  ┊ GC (mean ± σ):  0.28% ± 0.39%

julia> @benchmark g($a)
BechmarkTools.Trial: 51 samples with 1 evaluations.
 Range (min  max):  97.108 ms  102.408 ms  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     98.253 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   98.466 ms ± 909.434 μs  ┊ GC (mean ± σ):  0.28% ± 0.41%

and they're basically the same speed. So they could be combined into one function with a default weight of 1 and always increment sumw2. Unless I mis-benchmarked somehow.

Constructor using Dict{Symbol, Key} from UnROOT's hist

Below is an example of how one might turn a UnROOT's hist object into a Hist1D.

function make_fhist1d(th1::Dict{Symbol, Any})
    xmin = th1[:fXaxis_fXmin]
    xmax = th1[:fXaxis_fXmax]
    nbins = th1[:fXaxis_fNbins]
    bins = range(xmin, xmax, length=nbins+1)
    h = Hist1D(Float64; bins=bins)
    h.hist.weights .= th1[:fN][2:end-1]
    h.sumw2 .= th1[:fSumw2][2:end-1]
    return h
end

I am not sure what would be the best way or best place to incorporate stuff like this....

Overflow behavior

Collecting discussion here on overflow handling. In my mind there are three ways of doing this:

  1. the ROOT way: have extra underflow and overflow bins. If we don't want to deal with hackiness of changing the StatsBase histogram edges, then this would mean underflow and overflow counters in the Hist1D struct.

  2. how it's done in yahist: Hist1D(..., overflow=True) by default (or False). This puts the overflow and underflow into the last or first bins rather than keep them separate like ROOT does.

  3. let the user handle it: By default, anything outside the bin range is neglected. If the user wants to include overflows, they either extend the range or they clamp() values before filling the histogram.

  • I think (3) is too annoying for HEP usecases.
  • (1) has the complication of extra infrastructure and when computing various statistics, do we include or exclude overflow?
  • I'm biased ;) but (2) seems like a sensible compromise which wouldn't be hard to implement (basically just clamp() inside the filling methods). And it makes the choice of including/excluding overflow at histogram construction time rather than arithmetic/statistic computation time. But it means that overflow/underflow can't be separated out after histogram construction (you'd need to reconstruct the histogram).

What are your thoughts?

unsafe_push when only 1 thread

Since julia keeps the same number of threads throughout a session, shouldn't we be able to jit the right push!() based on the number of threads (i.e., don't use lock/unlock if nthreads() == 1)?

FHist.jl/src/hist1d.jl

Lines 101 to 106 in 23ace07

@inline function Base.push!(h::Hist1D{T,E}, val::Real, wgt::Real=1) where {T,E}
lock(h)
unsafe_push!(h, val, wgt)
unlock(h)
return nothing
end

FHistCairoMakieExt not found

Error: Error during loading of extension FHistCairoMakieExt of FHist, use Base.retry_load_extensions() to retry.
│ exception =
│ 1-element ExceptionStack:
│ ArgumentError: Package FHistCairoMakieExt [563811e1-e1a2-5d73-ba9f-d6e07a357f0b] is required but does not seem to be installed:
│ - Run Pkg.instantiate() to install all recorded dependencies.

Pkg.instantiate() does not fix the issue
0 dependencies successfully precompiled in 1 seconds. 299 already precompiled.

More informations:
Status ~/.julia/environments/v1.10/Project.toml
[13f3f980] CairoMakie v0.11.4
[68837c9b] FHist v0.10.7

Julia Version 1.10.0
Commit 3120989f39b (2023-12-25 18:01 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: macOS (arm64-apple-darwin22.4.0)
CPU: 8 × Apple M3
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, apple-m1)
Threads: 2 on 4 virtual cores

Is normal that memory allocations happens when filling?

The package is great, but while chasing memory allocations in my application I noticed that it generates them.
See the following test case:

using FHist
data = randn(100);
h = Hist1D(; bins=-10:1:10)

@time for d in data
              push!(h,d)
              end
  0.000027 seconds (200 allocations: 4.688 KiB)

2 allocations per fill.

Benchmarks

Is it expected that Hist1D(a) is 10x-20x faster than the push!() interface?

julia> a = rand(10^6);
julia> h = Hist1D(Int; bins=-3:0.5:3);

julia> @benchmark Hist1D(a)
BechmarkTools.Trial: 724 samples with 1 evaluations.
 Range (min  max):  6.335 ms    9.398 ms  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     6.804 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.895 ms ± 335.737 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
 Memory estimate: 608 bytes, allocs estimate: 5.

julia> @benchmark for i in a push!(h2, i) end
BechmarkTools.Trial: 32 samples with 1 evaluations.
 Range (min  max):  150.007 ms  183.342 ms  ┊ GC (min  max): 2.01%  1.94%
 Time  (median):     155.427 ms               ┊ GC (median):    2.01%
 Time  (mean ± σ):   156.255 ms ±   5.758 ms  ┊ GC (mean ± σ):  2.08% ± 0.24%
 Memory estimate: 91.54 MiB, allocs estimate: 3999490.

julia> @benchmark push!.(Ref(h2), a)
BechmarkTools.Trial: 71 samples with 1 evaluations.
 Range (min  max):  62.397 ms  140.778 ms  ┊ GC (min  max): 0.00%  50.19%
 Time  (median):     68.369 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   70.491 ms ±  11.774 ms  ┊ GC (mean ± σ):  5.69% ±  8.42%
 Memory estimate: 22.89 MiB, allocs estimate: 5.

Plotting example using FHist and Plotly

Here is an example of how one might use FHist to make some real physics plots.
I created a wrapper of PlotlyJS to format things in LHC physics plot style.
(Because finding all the options in plotly takes some time.)

using PlotlyJSWrapper
using FHist

# Creating example histograms
h1 = Hist1D(randn(3000),-5:0.5:5)
h2 = Hist1D(randn(1000).+1,-5:0.5:5)
h3 = Hist1D((randn(1000).+2)./2,-5:0.5:5)
h4 = Hist1D((randn(2000).-2).*2,-5:0.5:5)
h5 = Hist1D(randn(2000).*5,-5:0.5:5)
h6 = Hist1D(randn(2000).-0.5,-5:0.5:5)
data = Hist1D(randn(9000),-5:0.5:5)
signal = Hist1D((randn(1000).+10)./3,-5:0.5:5)

# Plotting
plot_stack(
           backgrounds=[h1, h2, h3, h4, h5, h6],
           data=[data],
           signals=[signal], # TODO Not supported yet
           options=Dict{Symbol, Any}(
            :xaxistitle => "Δϕ<sub>jj</sub> [GeV]",
            :outputname => "plot.pdf",
            :backgroundlabels => ["tt̄", "Higgs", "Drell-Yan", "tt̄Z", "ZZ", "VBS WW"],
            :signallabels => ["VVH"],
           )
          )

This would produce https://github.com/sgnoohc/PlotlyJSWrapper.jl/blob/main/examples/example1/plot.png

implement Hist3D

Hist1D through Hist3D should cover almost all use cases. Beyond that, people can use StatsBase.Histogram.

Since we don't plan to go beyond 3D, we don't need to have a generalized HistND, so implementing Hist3D should be mainly copy-pasting Hist2D...

Beyond the basic constructors and getters, the important methods to implement would be lookup (e.g., for scale factors in pT/eta/phi bins), and project/profile to reduce to a Hist2D which is actually displayable.

Should there be a function return sumw2?

Just a thought.
Feel free to ignore.

Just in case in future the data structure changes, having an accessor to "sumw2" information may be better than direct access of h.sumw2.

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

Normalize broken ?

julia>   LinearAlgebra.normalize( Hist1D( randn(1000)); width=true )
ERROR: MethodError: no method matching normalize(::Hist1D{Int64, Tuple{StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}}; width=true)
Closest candidates are:
  normalize(::Hist1D) at ~/.julia/packages/FHist/tzC40/src/hist1d.jl:244 got unsupported keyword argument "width"
  normalize(::StatsBase.Histogram{T, N}; mode) where {T, N} at ~/.julia/packages/StatsBase/XgjIN/src/hist.jl:534 got unsupported keyword argument "width"
  normalize(::StatsBase.Histogram{T, N}, ::Array{T, N}...; mode) where {T, N} at ~/.julia/packages/StatsBase/XgjIN/src/hist.jl:547 got unsupported keyword argument "width"
  ...
Stacktrace:
 [1] top-level scope
   @ REPL[42]:1

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.