Comments (17)

Datseris commented on September 26, 2024

(I've removed the 3.0 milestone, because this can be done as a non-breaking change, since we have separated the definitions of entropies from the discrete estimators.)

Datseris commented on September 26, 2024

I wouldn't name the field frequencies. I would name it counts. I always found frequency an odd word to use for integer numbers.

Datseris commented on September 26, 2024

DiscreteInfoEstimator does not depend on the probabilities estimation in any way, so there is no reason to make it one of its fields. I vote for no changes.

kahaaga commented on September 26, 2024

I was looking into implementing these now, but came across something we need to resolve:

Many of these estimators are functions of the raw counts of the outcomes, not the normalised counts (i.e. probabilities). To accommodate this, I propose that we simply add another field frequencies to Probabilities, where frequencies[i] corresponds to probs[i]. This way, both counts and probabilities are readily available.

What do you think, @Datseris?

Datseris commented on September 26, 2024

This sounds like a big change, because many internal functions already compute probabilities (mainly fasthist), so I can't even count off the top of my head how many files you would need to alter to correctly track this change "everywhere" :(

But I can't think of a different way, if you really need the true counts.

So let's go with this. This means that the internal normed argument of the Probabilities constructor must be dropped. Actually, this is a good way for you to find out which methods will fail: just remove normed, since we must now never pass pre-normed probabilities. The test suite will tell you how many methods fail.
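
A minimal sketch of what dropping normed could look like (an illustrative simplification, not the package's actual constructor):

struct Probabilities{T <: Real}
    probs::Vector{T}
    # Inner constructor: always normalize, so pre-normed input can never slip through.
    function Probabilities(x::AbstractVector{<:Real})
        p = x ./ sum(x)
        return new{eltype(p)}(Vector(p))
    end
end

Probabilities([1, 2, 5]).probs  # [0.125, 0.25, 0.625]

Any call site that previously passed pre-normed probabilities would then fail loudly in the test suite.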

Let's do this PR before anything else to keep things clean. I am really worried about this change; I fear it may break many things.

kahaaga commented on September 26, 2024

But I can't think of a different way, if you really need the true counts.

Yes, several of the estimators do require the raw counts. I can attempt a PR and see how involved it will be.

So let's go with this. This means that the internal normed argument of the Probabilities constructor must be dropped. Actually, this is a good way for you to find out which methods will fail: just remove normed, since we must now never pass pre-normed probabilities. The test suite will tell you how many methods fail.

Agreed.

Datseris commented on September 26, 2024

Given the current API, I am not sure how to estimate frequencies given arbitrary input x to Probabilities. I guess one method should dispatch to Array{<:AbstractFloat} and one method should dispatch to Array{<:Int}. The first dispatch tries to somehow magically estimate raw counts by extracting a count quantum = 1/minimum(p), multiplying everything by the quantum, and rounding to integers. The second method assumes the array already contains counts?
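
For concreteness, the float method could look roughly like this (guess_counts is a hypothetical helper name):

# Hypothetical sketch of the "count quantum" idea described above:
# assume the smallest probability corresponds to a count of 1.
function guess_counts(p::AbstractVector{<:AbstractFloat})
    quantum = 1 / minimum(p)          # scale factor turning probabilities into counts
    return round.(Int, p .* quantum)  # rescale and round to integers
end

guess_counts([0.125, 0.25, 0.625])  # returns [1, 2, 5]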

kahaaga commented on September 26, 2024

Hm. Do we actually have to estimate frequencies all the time? I think we can do:

struct Probabilities
    probs::AbstractVector                   # normalized probabilities
    counts::Union{Nothing, AbstractVector}  # raw counts, when available
end

This way, if isnothing(counts), then only MLEntropy can be used (or other estimators that operate directly on probabilities). On the other hand, if !isnothing(counts), then entropy estimators that demand raw counts can also be used.

The user doesn't even need to know that Probabilities stores counts. We just make sure to also include counts wherever possible. Trying to call entropy(WhateverFancyEstimator(Shannon()), est, x) will then error generically, because counts are not defined.
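
A minimal sketch of how that generic error could arise, once the probabilities have been computed (WhateverFancyEstimator is the hypothetical name from above; the Probabilities layout is the one proposed in this thread):

struct WhateverFancyEstimator{I}
    measure::I  # e.g. `Shannon()`
end

function entropy(est::WhateverFancyEstimator, p::Probabilities)
    # The check lives wherever the counts are first needed.
    isnothing(p.counts) &&
        throw(ArgumentError("This estimator requires raw counts, but none are stored."))
    # ... compute the entropy from `p.counts` ...
end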

Datseris commented on September 26, 2024

Well, there is nothing stopping us from estimating the counts with the method I described, so why not do it altogether. The user will never know of the existence of the counts field anyway.

kahaaga commented on September 26, 2024

Well, there is nothing stopping us from estimating the counts with the method I described, so why not do it altogether. The user will never know of the existence of the counts field anyway.

For the integer version, it is straightforward:

# If given an integer vector, its elements are assumed to be counts of the different outcomes.
Probabilities(x::AbstractVector{Int})

But for the float version, I don't see how that would work. If I input, say, x = [0.1, 0.3, 0.4], and I want to convert it to a probability vector by normalizing, I have no idea how many counts underlie those initial fractions. To get actual counts, I'd also need to specify n (the total number of observations)?

kahaaga commented on September 26, 2024

To me, it seems like there should be three constructors (see the sketch below):

  • Probabilities(::AbstractVector{<:AbstractFloat}): the current behaviour; leaves counts as nothing.
  • Probabilities(::AbstractVector{<:Int}): treats the input as counts.
  • Probabilities(::AbstractVector{<:AbstractFloat}, ::Int): like the first, but also defines counts, because the total number of observations is known.
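
A rough sketch of the three methods, assuming the two-field struct proposed above (names and exact signatures are illustrative):

# 1. Floats: normalize; the underlying counts are unknown.
Probabilities(x::AbstractVector{<:AbstractFloat}) = Probabilities(x ./ sum(x), nothing)

# 2. Integers: treat the input as raw counts of each outcome.
Probabilities(x::AbstractVector{<:Integer}) = Probabilities(x ./ sum(x), x)

# 3. Floats plus the total number of observations `n`: counts can be reconstructed.
function Probabilities(x::AbstractVector{<:AbstractFloat}, n::Integer)
    p = x ./ sum(x)
    return Probabilities(p, round.(Int, p .* n))
end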

kahaaga commented on September 26, 2024

We could of course just create imaginary counts whose ratio respects the initial input data, but I feel uneasy doing so, because we're pretending to know information we don't have.

kahaaga commented on September 26, 2024

Ah, but the input data gives n automatically, so scaling like you proposed should work (up to rounding errors).

Datseris commented on September 26, 2024

We could of course just create imaginary counts whose ratio respects the initial input data, but I feel uneasy doing so, because we're pretending to know information we don't have.

That's what we should do. It's fine. Besides, we don't expect users to directly initialize Probabilities. Instead, they should give input data to the probabilities function.
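
For reference, the intended entry point would then be something like this (CountOccurrences is used as an example estimator; exact signatures may differ from the released API):

using ComplexityMeasures

x = rand(1:4, 1000)                      # some categorical input data
p = probabilities(CountOccurrences(), x) # under this proposal, counts would be stored alongside probs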

kahaaga commented on September 26, 2024

Hey @Datseris,

Since we're moving to 3.0 due to the new infoestimator-stores-the-definition API, would it make sense to do something similar for the discrete info estimators? We have the old-style syntax:

struct FancyDiscreteEst{I} <: DiscreteInfoEstimator
    measure::I # the info measure, e.g. `Shannon()`
end

function information(est::FancyDiscreteEst{<:Shannon}, pest::ProbabilitiesEstimator, x)
    probs = probabilities(pest, x)
    # ...
end

Or, we could let the DiscreteInfoEstimator store the ProbabilitiesEstimator too, so that we get:

struct FancyDiscreteEst{I, P <: ProbabilitiesEstimator} <: DiscreteInfoEstimator
    measure::I  # the info measure, e.g. `Shannon()`
    probest::P  # e.g. `CountOccurrences()`
end

function information(est::FancyDiscreteEst{<:Shannon}, x)
    probs = probabilities(est.probest, x)
    # ...
end

Any preference?
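
For comparison, the two alternatives lead to these call styles (an illustration using the hypothetical names from above, not a final API):

# First alternative: probabilities estimator passed separately.
information(FancyDiscreteEst(Shannon()), CountOccurrences(), x)

# Second alternative: probabilities estimator stored inside the info estimator.
information(FancyDiscreteEst(Shannon(), CountOccurrences()), x)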

kahaaga commented on September 26, 2024

I don't think it's a huge problem to have two different signatures for information (we already do), so I am slightly leaning towards the first alternative. The reason is that it is more pedagogical: one needs to pick both an entropy estimator AND a probabilities/frequencies estimator to estimate an entropy from data. That gets hidden a bit in the second alternative.

kahaaga commented on September 26, 2024

DiscreteInfoEstimator does not depend on the probabilities estimation in any way, so there is no reason to make it one of its fields. I vote for no changes.

Ok, then I'll just stick with information(est::DiscreteInfoEstimator, pest::ProbabilitiesEstimator, x).
