
Comments (20)

devmotion avatar devmotion commented on June 1, 2024 3

Nice!

IMO both approaches try to address a different problem (both of which we should solve in some way). The approach with return (or some other similar model-based approach) tries to avoid saving auxiliary variables completely. This structure would allow samplers to marginalize out variables, which is otherwise not possible if all variables always have to be returned. Your implementation addresses the problem of rerunning the model with different (or the same) variables, which is interesting in itself and provides the tracked expressions if they are returned in the model definition (so you need the same or a similar model definition).

Additionally, I think we should just have some convenience function for running models with NamedTuple instead of VarInfo, instead of letting users rewrite the same hacks over and over again. IMO it would be good to use NamedTuples in generated_quantities as well since otherwise reordering of the columns will break the function, which would not be clear to a user.
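
A minimal sketch of such a convenience function (the name evaluate_with and the NamedTuple-accepting _setval!-style call are illustrative assumptions, not an existing API):

using Turing

@model function demo(x)
    m ~ Normal()
    x ~ Normal(m, 1.0)
    return m^2
end

# hypothetical helper: run the model with parameter values from a NamedTuple
function evaluate_with(model, values::NamedTuple)
    varinfo = Turing.DynamicPPL.VarInfo(model)
    # assumes a `_setval!`-style method that matches variables by symbol
    Turing.DynamicPPL._setval!(varinfo, values)
    return model(varinfo)
end

evaluate_with(demo(1.0), (m = 0.5,))  # returns 0.25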

torfjelde avatar torfjelde commented on June 1, 2024 3

I 100% agree that it would be nice to have a more integrated approach where we keep track of the quantities during sampling, but I also 100% disagree that this makes generated_quantities not useful.

  1. Memory-constraints: the "tracked" quantities might require so much memory that it's infeasible to keep track of them during sampling. The user could also add a kwarg full::Bool = false to the model-definition if they want to exclude certain computations until they run the post-processing step (see the sketch after this list).
  2. Different models under posterior: say you've obtained a chain from model1, thus giving you an estimate of the posterior. You might then want to compute the generated quantities (i.e. the stuff you've put in the return statement) for some other model using the chain you just obtained. This other model, model2, could
    1. have the same model-def but different inputs,
    2. be a completely different model but with the same latent parameters, or
    3. be a new definition of the same model, e.g. the user realized they wanted to also look at some other values under the posterior, so they re-define the model-def of model1 and want to execute the new instantiation on the chain.
  3. Speed: executing the model is likely not the most expensive part (in the case where one uses AD-based samplers), so all the cases listed in (2) are going to be much faster to do from an existing chain than running the entire sampling process again.
  4. Development cost: generated_quantities exists now, and this issue is something we keep getting asked about. And, because of the points just listed above, it will still be useful once we get a "better" solution for tracking variables during sampling.
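
As a sketch of the full kwarg idea from point 1 (the model and the expensive quantity are made up for illustration):

using Turing

@model function expensive_model(x, full::Bool = false)
    m ~ Normal()
    x ~ Normal(m, 1.0)
    # only compute the memory-hungry quantity when re-running the model
    # in a post-processing step with `full = true`
    extra = full ? cumsum(m .+ randn(10^6)) : missing
    return (m = m, extra = extra)
end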

Now, if anyone still disagrees; fight me! I'll be near the fountain behind the gray building at 6pm. No brass knuckles allowed.

(I'm kidding; I've just always wanted to say that. Please continue the discussion in a civil way.)

EDIT: Also, in my original comment I should have emphasized that my proposal was not meant as a replacement for tracking during sampling; my bad. Rather, it is an addition, and it would provide a temporary "solution" to the problem in most basic use-cases, which I think is important too, given the number of times people have brought this up in our Slack-channel.

EDIT 2: The above is mainly a response to:

I wonder if we can just do away entirely with generated_quantities.

devmotion avatar devmotion commented on June 1, 2024 2

Even if that might be a way to solve the problem, I'm not sure that's how we should actually do it (sorry 😛). Adding a deterministic distribution and an identity bijector for a transformation seems like a bit of a hack to me (and the AD part seems to be focused on HMC?).

It seems the main purpose of @track would be to ensure that m2 would be part of the chain? Or is there anything else that we have to take care of?

If it's (mainly) about the interesting variables ending up in the chain, maybe

@model function test(x)
    m1 ~ Normal()
    m2 = f(m1) # some differentiable transformation
    x ~ Normal(m2, 1.0)
    return m1, m2
end

and respecting the return statement would be the more natural approach.

trappmartin avatar trappmartin commented on June 1, 2024 2

What do you guys think about using the return statement (as described by @devmotion) in general to define the set of variables that should be tracked? This idea is kinda growing on me. :)

@model function gmm(x, K)
    m ~ filldist(Normal(), K) 
    v ~ filldist(InverseGamma(2, 3), K)
    s = sqrt.(v)
    w ~ Dirichlet(K, 1.0)
    z ~ filldist(Categorical(w), length(x))
    x .~ Normal.(m[z], s[z])
    return m, s, w
end

Then the resulting chain has only states for m, s and w but not for z or v and we can safely marginalise out z in the inference.
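
For concreteness, marginalising out z here amounts to evaluating the mixture likelihood with the components summed out; a sketch of the per-datum log-likelihood a collapsed sampler could use:

using Distributions
using StatsFuns: logsumexp

# log p(xᵢ | m, s, w) = log Σₖ wₖ Normal(xᵢ; mₖ, sₖ), i.e. with z summed out
logpdf_marginal(xi, m, s, w) = logsumexp(log.(w) .+ logpdf.(Normal.(m, s), xi))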

trappmartin avatar trappmartin commented on June 1, 2024 2

I think using the return statement is indeed the best idea. I'm not sure we need to store transformed variables in VarInfo at all, tbh. I'd rather have it only in the resulting trace and keep VarInfo clean.

trappmartin avatar trappmartin commented on June 1, 2024 2

Well, I think we all agree that it's very useful to have, and I think we should provide it as functionality. We should simply not have it as the only solution. I think it would be best to have both, as pointed out earlier.

devmotion avatar devmotion commented on June 1, 2024 1

You're absolutely right - I assumed _setval! just operates on vectors, I guess due to the new indexing behaviour in MCMCChains#dev (IIRC chains[i] just returns an array (with labelled axes)).

cpfiffer avatar cpfiffer commented on June 1, 2024 1

I wonder if we can just do away entirely with generated_quantities. I'd prefer to handle everything at sampling time rather than ex-post, which seems really clumsy to me. For example, couldn't the model compiler convert something like

@model function f()
    m ~ Normal(0, 1)
    z = m + 1
    return z
end

to something where the return value is a NamedTuple as in

@model function f()
    m ~ Normal(0, 1)
    z = m + 1
    return (z=z,)
end

Then, when we run the model through Turing, we just bundle up the NamedTuple we got with the Transition. We already have the machinery to do this here. All we need is a reasonable way to return the generated quantities with their symbols to the model caller in a type-stable way that doesn't distort the model internals that much.
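
A rough sketch of that bundling step (the type and field layout are illustrative, not the actual Transition type):

# carry the model's NamedTuple return value alongside the usual transition
struct TransitionWithExtras{T,N<:NamedTuple}
    transition::T  # parameters, log density, sampler state, ...
    extras::N      # the model's returned NamedTuple, e.g. (z = z,)
end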

trappmartin avatar trappmartin commented on June 1, 2024

Cool idea to have the return statement determine what is stored "additionally" in the chain.

trappmartin avatar trappmartin commented on June 1, 2024

That said, I find the use of an @track macro more "user-friendly". But that's rather subjective. Also, @track could be used to inform a sampler that performs automatic marginalisation that a certain RV should not be marginalised out because we want to track it.

torfjelde avatar torfjelde commented on June 1, 2024

So just a heads up: I've already implemented a generated_quantities for "extracting" return-values from a chain by re-running the model on the values in the chain: https://github.com/cambridge-mlg/Covid19/blob/4d967d58eeada3f3399116fc815cb17fd4163c87/src/utils.jl#L79-L89.

In this case you could then just do:

@model function gmm(x, K)
    m ~ filldist(Normal(), K) 
    v ~ filldist(InverseGamma(2, 3), K)
    s = sqrt.(v)
    w ~ Dirichlet(K, 1.0)
    z ~ filldist(Categorical(w), length(x))
    x .~ Normal.(m[z], s[z])
    return m, s, w
end

m = gmm(x, K)
chain = sample(m, sampler, n)

return_values = generated_quantities(m, chain)

This is a fairly good temporary fix for tackling this issue + it has additional use-cases:

  1. Prediction: model_train is instantiated on "train-data", but you want to generate the return-values for model_test, which is instantiated on "test-data"; you can then execute model_test using the chain (posterior) obtained from model_train (see the sketch after this list).
  2. If you for some reason wanted to compute the "generated values" for model1 using a chain obtained from model2, you could do that, assuming they have the same latent parameters.
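
For point 1, that would look something like this (variable names illustrative):

model_train = gmm(x_train, K)
chain = sample(model_train, sampler, n)

# same latent parameters, different data
model_test = gmm(x_test, K)
test_values = generated_quantities(model_test, chain)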

I'm not suggesting this as a solution to the problem, but in many cases it can be a good fix + it provides some additional functionality.

For completeness:

function generated_quantities(m::Turing.Model, c::MCMCChains.Chains)
    # if `c` is multiple chains we pool them into a single chain
    chain = length(MCMCChains.chains(c)) == 1 ? c : MCMCChains.pool_chain(c)

    varinfo = Turing.DynamicPPL.VarInfo(m)

    return map(1:length(chain)) do i
        Turing.DynamicPPL._setval!(varinfo, chain[i])
        m(varinfo)
    end
end

torfjelde avatar torfjelde commented on June 1, 2024

Additionally, I think we should just have some convenience function for running models with NamedTuple instead of VarInfo, instead of letting users rewrite the same hacks over and over again.

👍

IMO it would be good to use NamedTuples in generated_quantities as well since otherwise reordering of the columns will break the function, which would not be clear to a user.

Hooold up; what do you mean when you say "reordering of the columns"? Do you mean that generated_quantities will break if the ordering of the resulting vector in the model is wrong? If that's what you mean, _setval internally makes use of the symbols so everything should work just fine even then, right?
With that being said, I 100% support allowing use of NamedTuple instead of having to deal with VarInfo 👍

ElOceanografo avatar ElOceanografo commented on June 1, 2024

+1 for using the return statement to define which variables to track. I think it's clean and intuitive, and the ability to marginalize out variables without storing them in a chain first would be really great.

cpfiffer avatar cpfiffer commented on June 1, 2024

Something I've always wondered about is why we don't just track these generated quantities in VarInfo? It seems like it would be reasonably easy to build a VarName from the returned quantities and just stick them in some field somewhere.

torfjelde avatar torfjelde commented on June 1, 2024

Ref: TuringLang/Turing.jl#1335

Thanks to the most recent updates to DPPL (in particular @devmotion's PR with the new impl of setval!), we have a good approach to extracting "generated quantities" in a post-processing step, using (an updated version of) the generated_quantities method I mentioned above. One issue: at the moment it relies on MCMCChains.pool_chains to handle multiple chains. If we remove that, we could just add generated_quantities to DPPL. I honestly think it's super-useful, and it would partially solve a lot of the use-cases for "deterministic tracking" that people are requesting. In most cases executing the model is cheap → re-running the model on a chain as a post-processing step won't be an issue. It also comes with a couple of nice advantages:

  1. No extra memory-usage when sampling.
  2. Sample from model1 and then execute some other model, say model2 (with the only restriction being that it contains the same parameters in its parameter-space), to get the generated quantities of model2 under the posterior of model1.

Thoughts?

trappmartin avatar trappmartin commented on June 1, 2024

This is essentially a post-processing step, right? I think this is a nice approach, but a more integrated solution that uses the return statement during inference would be great too, as it could help in figuring out possible ways to collapse out parameters.

Maybe we can have your solution implemented for now and add the functionality that I proposed later on. This way one can make use of both 1) possibly collapsed sampling, and 2) obtaining transformed variables after sampling.

What do you think?

yebai avatar yebai commented on June 1, 2024

In most cases executing the model is cheap → re-running the model on a chain as a post-processing step won't be an issue.

Re-running the model on a big dataset could be expensive.

I think what @torfjelde proposes can be considered a separate feature. It is helpful to have a return statement that can track arbitrary user-provided variables during inference. On top of that, we can provide a generated_quantities function that takes a chain and a model (possibly a different model from the one the chain was obtained from) and produces some post-inference analysis.

trappmartin avatar trappmartin commented on June 1, 2024

I also thought of Tor's approach as additional functionality. It would also be nice to have a similar function to obtain the log-prob values of each datum at each iteration after inference.

cpfiffer avatar cpfiffer commented on June 1, 2024

I think those are all great points, particularly the idea about using different models.

The memory-constraints thing is not that big an issue though, since I was also thinking of a keyword argument to exclude/include generated quantities during sampling. I also think there's not really that much data that gets thrown around in most common models, so we probably shouldn't run into memory issues. We could by default just not store any return values, but give that option to people who want it.

Maybe the short-term solution here is to give a keyword generate_quantities=true to sample, which automatically does a pass to generate quantities and sticks them in the returned chain at the end. This way the user doesn't have to explicitly call generated_quantities for the most basic use case. It'd add additional sampling time, but I think it would be a very nice feature to have.
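
In other words (hypothetical keyword, not an existing API):

chain = sample(model, sampler, n; generate_quantities = true)

# which would be roughly equivalent to
chain = sample(model, sampler, n)
gq = generated_quantities(model, chain)
# ... followed by merging `gq` into the returned chain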

yebai avatar yebai commented on June 1, 2024

I like @devmotion's proposal above: a clean syntax for specifying tracked variables would be via the return statement:

@model test(x) = begin
    m1 ~ Normal()
    m2 = f(m1) # some differentiable transformation
    x ~ Normal(m2, 1.0)
    return m2 # store returned values in the `AbstractMCMC.bundle_samples` function 
end
