Comments (20)
Nice!
IMO both approaches address a different problem (both of which we should solve in some way). The approach with `return` (or some other similar model-based approach) tries to avoid saving auxiliary variables completely. This structure would allow samplers to marginalize out variables, which is otherwise not possible if all variables always have to be returned. Your implementation addresses the problem of rerunning the model with different (or the same) variables, which is interesting in itself and provides the tracked expressions if they are returned in the model definition (so you need the same/similar model definition).

Additionally, I think we should just have some convenience function for running models with a NamedTuple instead of VarInfo, instead of letting users rewrite the same hacks over and over again. IMO it would be good to use NamedTuples in `generated_quantities` as well, since otherwise reordering of the columns will break the function, which would not be clear to a user.
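A minimal sketch of what such a NamedTuple-based convenience could look like (hypothetical; it assumes `_setval!` can consume a NamedTuple, which may not match the current implementation):

```julia
# Hypothetical convenience function, not an existing DynamicPPL API:
# evaluate `model` using parameter values supplied as a NamedTuple.
function evaluate_with(model, values::NamedTuple)
    varinfo = Turing.DynamicPPL.VarInfo(model)
    # assumption: a `_setval!` method that accepts a NamedTuple
    Turing.DynamicPPL._setval!(varinfo, values)
    return model(varinfo)
end
```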
I 100% agree that it would be nice to have a more integrated approach where we keep track of the quantities during sampling, but I also 100% disagree that this means `generated_quantities` is not useful.

1. Memory constraints: the "tracked" quantities might require a lot of memory, to the point where it's infeasible to keep track of these during sampling. The user can also add a kwarg `full::Bool = false` to the model definition if they want to exclude certain computation until they run the post-processing step.
2. Different models under the posterior: say you've obtained a chain from `model1`, thus giving you an estimate of the posterior. You might then want to compute the generated quantities (i.e. the stuff you've put in the `return` statement) for some other model using the chain you just obtained. This other model, `model2`, could
   - have the same model-def but different inputs,
   - be a completely different model but with the same latent parameters, or
   - be a new definition of the same model, e.g. the user realized they wanted to also look at some other values under the posterior, so they re-define the model-def of `model1` and want to execute the new instantiation on the chain.
3. Speed: executing the model is likely not the most expensive part (in the case where one uses AD-based samplers), so all the cases listed in (2) are going to be muuuch faster to do from an existing chain rather than running the entire sampling process again.
4. Development cost: `generated_quantities` exists now, and this issue is something that we keep getting asked about. And, because of the points just listed above, it will still have a use once we get a "better" solution for tracking variables during sampling.
Now, if anyone still disagrees; fight me! I'll be near the fountain behind the gray building at 6pm. No brass knuckles allowed.
(I'm kidding; I've just always wanted to say that. Please continue the discussion in a civil way.)
EDIT: Also, in my original comment I should have emphasized that I did not mean my proposal as a replacement for tracking during sampling! My bad. Rather, it's an addition, plus it would provide a temporary "solution" to the problem in most basic use-cases, which I think is important too, given the number of times people have brought this up in our Slack channel.
EDIT 2: The above is mainly a response to the

> I wonder if we can just do away entirely with `generated_quantities`.
Even if that might be a way to solve the problem, I'm not sure that's how we should actually do it (sorry 😛). Adding a deterministic distribution and an identity bijector for a transformation seems like a bit of a hack to me (and the AD part seems to be focused on HMC?).

It seems the main purpose of `@track` would be to ensure that `m2` would be part of the chain? Or is there anything else that we have to take care of?

If it's (mainly) about the interesting variables ending up in the chain, maybe

```julia
@model function test(x)
    m1 ~ Normal()
    m2 = f(m1)  # some differentiable transformation
    x ~ Normal(m2, 1.0)
    return m1, m2
end
```

and respecting the `return` statement would be the more natural approach.
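Note that executing an instantiated model already yields the value of its `return` statement, e.g. (with a concrete `f` chosen purely for illustration):

```julia
f(x) = exp(x)      # illustrative choice for the transformation
model = test(1.0)  # instantiate the model on data x = 1.0
m1, m2 = model()   # running the model returns the `return`-values
```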
What do you guys think about using the return statement (as described by @devmotion) in general to define the set of variables that should be tracked? This idea is kinda growing on me. :)

```julia
@model function gmm(x, K)
    m ~ filldist(Normal(), K)
    v ~ filldist(InverseGamma(2, 3), K)
    s = sqrt.(v)
    w ~ Dirichlet(K, 1.0)
    z ~ filldist(Categorical(w), length(x))
    x .~ Normal.(m[z], s[z])
    return m, s, w
end
```

Then the resulting chain has only states for `m`, `s` and `w`, but not for `z` or `v`, and we can safely marginalise out `z` in the inference.
I think using the `return` statement is indeed the best idea. I'm not sure we need to store transformed variables in VarInfo at all, tbh. I'd rather have them only in the resulting trace and keep VarInfo clean.
Well, I think we all agree that it’s very useful to have, and I think we should provide it as functionality. We should simply not have this as the only solution. I think best would be both, as pointed out earlier.
You're absolutely right - I assumed `_setval!` just operates on vectors, I guess due to the new indexing behaviour in MCMCChains#dev (IIRC `chains[i]` just returns an array (with labelled axes)).
I wonder if we can just do away entirely with `generated_quantities`. I'd prefer to handle everything at sampling time rather than ex post, which seems really clumsy to me. For example, couldn't the model compiler convert something like

```julia
@model function f()
    m ~ Normal(0, 1)
    z = m + 1
    return z
end
```

to something where the return value is a `NamedTuple`, as in

```julia
@model function f()
    m ~ Normal(0, 1)
    z = m + 1
    return (z=z,)
end
```

Then, when we run the model through Turing, we just bundle up the `NamedTuple` we got with the `Transition`. We already have the machinery to do this here. All we need is a reasonable way to return the generated quantities with their symbols to the model caller in a type-stable way that doesn't distort the model internals that much.
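A rough sketch of what that bundling could look like (hypothetical names, not the actual Turing internals):

```julia
# Hypothetical transition type carrying the model's return-value
# alongside the sampled parameters and log density.
struct TransitionWithQuantities{P,Q<:NamedTuple}
    params::P
    lp::Float64
    quantities::Q  # e.g. `(z = 1.3,)` from the model's `return` statement
end

# After each sampler step (sketch): evaluate the model on the current
# state and bundle the returned NamedTuple with the transition, e.g.
# quantities = model(varinfo)
# transition = TransitionWithQuantities(params, lp, quantities)
```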
Cool idea to have the `return` statement determine what is stored "additionally" in the chain.
That said, I find the use of an `@track` macro more "user-friendly". But that's rather subjective. Also, `@track` could be used to inform a sampler that performs marginalisation automatically that a certain RV should not be marginalised out, because we want to track it.
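For concreteness, the `@track` variant could look like this (hypothetical syntax, not implemented):

```julia
@model function test(x)
    m1 ~ Normal()
    @track m2 = f(m1)  # hypothetical: mark `m2` to be stored in the chain
    x ~ Normal(m2, 1.0)
end
```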
So just a heads up: I've already implemented a `generated_quantities` for "extracting" return-values from a chain by re-running the model on the values in the chain: https://github.com/cambridge-mlg/Covid19/blob/4d967d58eeada3f3399116fc815cb17fd4163c87/src/utils.jl#L79-L89.

In this case you could then just do:

```julia
@model function gmm(x, K)
    m ~ filldist(Normal(), K)
    v ~ filldist(InverseGamma(2, 3), K)
    s = sqrt.(v)
    w ~ Dirichlet(K, 1.0)
    z ~ filldist(Categorical(w), length(x))
    x .~ Normal.(m[z], s[z])
    return m, s, w
end

m = gmm(x, K)
chain = sample(m, sampler, n)
return_values = generated_quantities(m, chain)
```
This is a fairly good temporary fix for tackling this issue + it has additional use-cases:

- Predicting: `model_train` is instantiated on "train-data" but now you want to generate the return-values for `model_test`, which is instantiated on "test-data"; you can then execute `model_test` using the chain (posterior) obtained from `model_train`.
- If you for some reason wanted to compute the "generated values" for `model1` using a chain obtained from `model2`, you could do that, assuming they have the same latent parameters.

Not suggesting this as a solution to the problem, but it can in many cases be a good fix + it provides some additional functionality.
For completeness:

```julia
function generated_quantities(m::Turing.Model, c::MCMCChains.Chains)
    # if `c` contains multiple chains, we pool them into a single chain
    chain = length(MCMCChains.chains(c)) == 1 ? c : MCMCChains.pool_chain(c)
    varinfo = Turing.DynamicPPL.VarInfo(m)
    return map(1:length(chain)) do i
        # overwrite the values in `varinfo` with the i-th sample of the
        # chain, then execute the model to obtain its return-values
        Turing.DynamicPPL._setval!(varinfo, chain[i])
        m(varinfo)
    end
end
```
> Additionally, I think we should just have some convenience function for running models with NamedTuple instead of VarInfo, instead of letting users rewrite the same hacks over and over again.
👍
> IMO it would be good to use NamedTuples in `generated_quantities` as well since otherwise reordering of the columns will break the function, which would not be clear to a user.
Hooold up; what do you mean when you say "reordering of the columns"? Do you mean that `generated_quantities` will break if the ordering of the resulting vector in the model is wrong? If that's what you mean, `_setval!` internally makes use of the symbols, so everything should work just fine even then, right?

With that being said, I 100% support allowing use of `NamedTuple` instead of having to deal with `VarInfo` 👍
+1 for using the `return` statement to define which variables to track. I think it's clean and intuitive, and the ability to marginalize out variables without storing them in a chain first would be really great.
Something I've always wondered about this is why we don't just track these generated quantities in `VarInfo`? It seems like it would be reasonably easy to build a `VarName` from the returned quantities and just stick them in some field somewhere.
Ref: TuringLang/Turing.jl#1335
Thanks to the most recent updates to DPPL (in particular @devmotion's PR with the new impl of `setval!`), we have a good approach to extracting "generated quantities" in a post-processing step. This is using (an updated version of) the `generated_quantities` method I mentioned above. One issue: at the moment it relies on `MCMCChains.pool_chains` to handle multiple chains. If we remove that, we could just add `generated_quantities` to DPPL. I honestly think it's super useful, and it would partially solve a lot of the use-cases for "deterministic tracking" that people are requesting. In most cases executing the model is cheap → re-running the model on a chain as a post-processing step won't be an issue. It also comes with a couple of nice advantages:

- No extra memory usage when sampling.
- Sample from `model1` and then execute some other model (with the only restriction being that it contains the same parameters in its parameter-space), say `model2`, to get the generated quantities of `model2` under the posterior of `model1` (see the sketch below).
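The second point might look like this in practice (a sketch, assuming `generated_quantities` as defined above and that the two models share latent parameters):

```julia
chain = sample(model1, sampler, n)        # posterior samples from `model1`
gq = generated_quantities(model2, chain)  # `model2`'s return-values under `model1`'s posterior
```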
Thoughts?
This is essentially a post-processing step, right? I think this is a nice approach, but a more integrated solution that uses the return statement during inference would be great too, as it could help in figuring out possible ways to collapse out parameters.
Maybe we can have your solution implemented for now and add the functionality that I proposed later on. This way one can make use of both: 1) possibly collapsed sampling, and 2) obtaining transformed variables after sampling.
What do you think?
> In most cases executing the model is cheap → re-running the model on a chain as a post-processing step won't be an issue.
Re-running the model on a big dataset could be expensive.
I think what @torfjelde proposes can be considered a separate feature. It is helpful to have a `return` statement that can track arbitrary user-provided variables during inference. On top of that, we can provide a `generated_quantities` function that can take a `chain` and a `model` (which can be a different model from the one `chain` was obtained from) and produce some post-inference analysis.
I also thought of Tor's approach as additional functionality. It would also be nice to have a similar function to obtain the log-prob values of each datum at each iteration after inference.
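That could be done in the same re-running style; for the gmm example above, a minimal sketch of such a helper (hypothetical, not existing API):

```julia
using Distributions

# log-likelihood of each datum `x[j]`, given one posterior draw of `m`, `s`, `z`
datum_loglikes(x, m, s, z) = [logpdf(Normal(m[z[j]], s[z[j]]), x[j]) for j in eachindex(x)]
```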
I think those are all great points, particularly the idea about using different models.

The memory-constraints thing is not that big an issue though, since I was also thinking of a keyword argument to exclude/include generated quantities during sampling. I also think there's not really that much data that gets thrown around in most common models, so we probably shouldn't be running into memory issues. We could by default just not store any `return` values but give that option for people who want it.

Maybe the short-term solution here is to give a keyword `generate_quantities=true` to `sample`, which automatically does a pass to generate quantities and sticks them in the returned chain at the end. This way the user doesn't have to explicitly call `generated_quantities` for the most basic use case. It'd add additional sampling time, but I think it would be a very nice feature to have.
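A rough sketch of what that keyword could do internally (hypothetical; `sample_with_quantities` is an illustrative name, not an actual Turing function):

```julia
function sample_with_quantities(model, sampler, n; generate_quantities=true)
    chain = sample(model, sampler, n)
    generate_quantities || return chain
    gq = generated_quantities(model, chain)  # one return-value per sample
    # a real implementation would merge `gq` into the returned chain
    return chain, gq
end
```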
I like @devmotion's proposal above: a clean syntax for specifying tracked variables would be via the `return` statement:

```julia
@model test(x) = begin
    m1 ~ Normal()
    m2 = f(m1)  # some differentiable transformation
    x ~ Normal(m2, 1.0)
    return m2   # store returned values in the `AbstractMCMC.bundle_samples` function
end
```