Comments (17)

Datseris commented on September 26, 2024

(I've removed the 3.0 milestone, because this can be done as a non-breaking change, since we have separated the definitions of entropies from the discrete estimators.)

Datseris commented on September 26, 2024

I wouldn't name the field frequencies. I would name it counts. I always found frequency an odd word to use for integer numbers.

Datseris commented on September 26, 2024

DiscreteInfoEstimator does not depend on the probabilities estimation in any way, so there is no reason to make it one of its fields. I vote for no changes.

kahaaga commented on September 26, 2024

I was looking into implementing these now, but came across something we need to resolve:

Many of these estimators are functions of the raw counts of the outcomes, not the normalised counts (i.e. probabilities). To accommodate this, I propose that we simply add another field frequencies to Probabilities, where frequencies[i] corresponds to probs[i]. This way, both counts and probabilities are readily available.

What do you think, @Datseris?

Datseris commented on September 26, 2024

This sounds like a big change, because many internal functions already compute probabilities (mainly fasthist), so I can't even count off the top of my head how many files you would need to alter to correctly track this change "everywhere" :(

But I can't think of a different way, if you really need the true counts.

So let's go with this. This means that the internal normed argument of the Probabilities constructor must be dropped. Actually, this is a good way for you to find out which methods will fail: just remove normed, since we must now never pass pre-normed probabilities. The test suite will tell you how many methods fail.
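
A minimal sketch of what dropping normed could look like (an illustrative simplification, not the package's actual constructor):

struct Probabilities{T <: Real}
    probs::Vector{T}
    # Inner constructor: always normalize, so pre-normed input can never slip through.
    function Probabilities(x::AbstractVector{<:Real})
        p = x ./ sum(x)
        return new{eltype(p)}(Vector(p))
    end
end

Probabilities([1, 2, 5]).probs  # [0.125, 0.25, 0.625]

Any call site that previously passed pre-normed probabilities would then fail loudly in the test suite.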

Let's do this PR before anything else to keep things clean. I am really worried about this change; I fear it may break many things.

kahaaga commented on September 26, 2024

But I can't think of a different way, if you really need the true counts.

Yes, several of the estimators do require the raw counts. I can attempt a PR and see how involved it will be.

So let's go with this. This means that the internal normed argument of the Probabilities constructor must be dropped. Actually, this is a good way for you to find out which methods will fail: just remove normed, since we must now never pass pre-normed probabilities. The test suite will tell you how many methods fail.

Agreed.

Datseris commented on September 26, 2024

Given the current API, I am not sure how to estimate frequencies given arbitrary input x to Probabilities. I guess one method should dispatch to Array{<:AbstractFloat} and one method should dispatch to Array{<:Int}. The first dispatch tries to somehow magically estimate raw counts by extracting a count quantum = 1/minimum(p), multiplying everything by the quantum, and rounding to integers. The second method assumes the array already contains counts?
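
For concreteness, the float method could look roughly like this (guess_counts is a hypothetical helper name):

# Hypothetical sketch of the "count quantum" idea described above:
# assume the smallest probability corresponds to a count of 1.
function guess_counts(p::AbstractVector{<:AbstractFloat})
    quantum = 1 / minimum(p)          # scale factor turning probabilities into counts
    return round.(Int, p .* quantum)  # rescale and round to integers
end

guess_counts([0.125, 0.25, 0.625])  # returns [1, 2, 5]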

kahaaga commented on September 26, 2024

Hm. Do we actually have to estimate frequencies all the time? I think we can do:

struct Probabilities
    probs::AbstractVector                   # normalized probabilities
    counts::Union{Nothing, AbstractVector}  # raw counts, when available
end

This way, if isnothing(counts), then only MLEntropy can be used (or other estimators that operate directly on probabilities). On the other hand, if !isnothing(counts), then entropy estimators that demand raw counts can also be used.

The user doesn't even need to know that Probabilities stores counts. We just make sure to also include counts wherever possible. Trying to call entropy(WhateverFancyEstimator(Shannon()), est, x) will then error generically, because counts are not defined.
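
A minimal sketch of how that generic error could arise, once the probabilities have been computed (WhateverFancyEstimator is the hypothetical name from above; the Probabilities layout is the one proposed in this thread):

struct WhateverFancyEstimator{I}
    measure::I  # e.g. `Shannon()`
end

function entropy(est::WhateverFancyEstimator, p::Probabilities)
    # The check lives wherever the counts are first needed.
    isnothing(p.counts) &&
        throw(ArgumentError("This estimator requires raw counts, but none are stored."))
    # ... compute the entropy from `p.counts` ...
end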

Datseris commented on September 26, 2024

Well, there is nothing stopping us from estimating the counts with the method I described, so why not do it altogether. The user will never know of the existence of the counts field anyway.

kahaaga commented on September 26, 2024

Well, there is nothing stopping us from estimating the counts with the method I described, so why not do it altogether. The user will never know of the existence of the counts field anyway.

For the integer version, it is straightforward:

# If given an integer vector, its elements are assumed to be counts of the different outcomes.
Probabilities(x::AbstractVector{Int})

But for the float version, I don't see how that would work. If I input, say, x = [0.1, 0.3, 0.4], and I want to convert it to a probability vector by normalizing, I have no idea how many counts underlie those initial fractions. To get actual counts, I'd also need to specify n (the total number of observations)?

kahaaga commented on September 26, 2024

To me, it seems like there should be three constructors (see the sketch below):

  • Probabilities(::AbstractVector{<:AbstractFloat}): the current behaviour; leaves counts as nothing.
  • Probabilities(::AbstractVector{<:Int}): treats the input as counts.
  • Probabilities(::AbstractVector{<:AbstractFloat}, ::Int): like the first, but also defines counts, because the total number of observations is known.
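
A rough sketch of the three methods, assuming the two-field struct proposed above (names and exact signatures are illustrative):

# 1. Floats: normalize; the underlying counts are unknown.
Probabilities(x::AbstractVector{<:AbstractFloat}) = Probabilities(x ./ sum(x), nothing)

# 2. Integers: treat the input as raw counts of each outcome.
Probabilities(x::AbstractVector{<:Integer}) = Probabilities(x ./ sum(x), x)

# 3. Floats plus the total number of observations `n`: counts can be reconstructed.
function Probabilities(x::AbstractVector{<:AbstractFloat}, n::Integer)
    p = x ./ sum(x)
    return Probabilities(p, round.(Int, p .* n))
end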

kahaaga commented on September 26, 2024

We could of course just create imaginary counts whose ratio respects the initial input data, but I feel uneasy doing so, because we're pretending to know information we don't have.

kahaaga commented on September 26, 2024

Ah, but the input data gives n automatically, so scaling like you proposed should work (up to rounding errors).

Datseris commented on September 26, 2024

We could of course just create imaginary counts whose ratio respects the initial input data, but I feel uneasy doing so, because we're pretending to know information we don't have.

That's what we should do. It's fine. Besides, we don't expect users to directly initialize Probabilities. Instead, they should give input data to the probabilities function.
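
For reference, the intended entry point would then be something like this (CountOccurrences is used as an example estimator; exact signatures may differ from the released API):

using ComplexityMeasures

x = rand(1:4, 1000)                      # some categorical input data
p = probabilities(CountOccurrences(), x) # under this proposal, counts would be stored alongside probs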

kahaaga commented on September 26, 2024

Hey @Datseris,

Since we're moving to 3.0 due to the new infoestimator-stores-the-definition API, would it make sense to do something similar for the discrete info estimators? We have the old-style syntax:

struct FancyDiscreteEst{I} <: DiscreteInfoEstimator
    measure::I # the info measure, e.g. `Shannon()`
end

function information(est::FancyDiscreteEst{<:Shannon}, pest::ProbabilitiesEstimator, x)
    probs = probabilities(pest, x)
    # ...
end

Or, we could let the DiscreteInfoEstimator store the ProbabilitiesEstimator too, so that we get:

struct FancyDiscreteEst{I, P <: ProbabilitiesEstimator} <: DiscreteInfoEstimator
    measure::I  # the info measure, e.g. `Shannon()`
    probest::P  # e.g. `CountOccurrences()`
end

function information(est::FancyDiscreteEst{<:Shannon}, x)
    probs = probabilities(est.probest, x)
    # ...
end

Any preference?
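
For comparison, the two alternatives lead to these call styles (an illustration using the hypothetical names from above, not a final API):

# First alternative: probabilities estimator passed separately.
information(FancyDiscreteEst(Shannon()), CountOccurrences(), x)

# Second alternative: probabilities estimator stored inside the info estimator.
information(FancyDiscreteEst(Shannon(), CountOccurrences()), x)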

kahaaga commented on September 26, 2024

I don't think it's a huge problem to have two different signatures for information (we already do), so I am slightly leaning towards the first alternative. The reason is that it is more pedagogical: one needs to pick both an entropy estimator AND a probabilities/frequencies estimator to estimate an entropy from data. That gets hidden a bit in the second alternative.

kahaaga commented on September 26, 2024

DiscreteInfoEstimator does not depend on the probabilities estimation in any way, so there is no reason to make it one of its fields. I vote for no changes.

Ok, then I'll just stick with information(est::DiscreteInfoEstimator, pest::ProbabilitiesEstimator, x).
