Functionality necessary to nest Stan models in other PPLs #169
Comments
I think all of these would require upstream changes in the Stan library and the stanc compiler. Are all of those quantities well defined for a generic Stan model? The gradient w.r.t. constrained parameters seems particularly problematic for constraints like the simplex, which has a different number of dimensions on the unconstrained scale (a K-simplex has only K-1 unconstrained dimensions, so the Jacobian of the transform is not square).
The ones with checks are available right now as functions in the Stan math library or in the generated code for a model.
As @WardBrian notes, doing the other ones would require modifying how we do code generation in the transpiler. We can pull out the log density w.r.t. constrained parameters and its gradient pretty easily. During code generation, we separately generate (a) code for the constraining transform and Jacobian determinants, followed by (b) code for the constrained log density and gradients. I'm not sure what exactly you want with jvp (Jacobian-vector product?). We can calculate a Jacobian and do vector-Jacobian products explicitly, but we don't have the constraining transforms coded that way, so we can't do this very efficiently compared to an optimal implementation. Instead, the generated code applies the transform and adds the log Jacobian determinant terms to the target density, and then we autodiff through the whole thing. Making this efficient would require recoding the transforms.
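In symbols, writing $c$ for the constraining transform and $\theta = c(q)$ for the constrained parameters, the generated code targets

$$
\log p_{\text{unc}}(q) \;=\; \log p\bigl(c(q)\bigr) \;+\; \log\bigl|\det J_c(q)\bigr|,
$$

so (a) and (b) above are exactly the two terms on the right (with the usual per-transform handling when $J_c$ is not square).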
The magic of AD is that at any given point in the program, you are computing either the partial derivative of a real scalar output with respect to a real scalar intermediate (reverse mode) or the partial derivative of a real scalar intermediate with respect to a real scalar input (forward mode). It doesn't matter whether those intermediates have constraints or not; with application of the chain rule, everything just works. The simplex is an easy one because it has a linear constraint, but let's take the harder case of a unit vector, which has a nonlinear constraint. The Jacobian of l2-normalization is dense, yet reverse-mode AD backpropagates through it like any other function.
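Concretely, for $f(x) = x / \lVert x \rVert$ the Jacobian works out to

$$
J_f(x) \;=\; \frac{1}{\lVert x \rVert}\left(I - \frac{x x^\top}{\lVert x \rVert^{2}}\right),
$$

so a vector-Jacobian product never needs the matrix explicitly:

$$
v^\top J_f(x) \;=\; \frac{1}{\lVert x \rVert}\left(v^\top - \frac{(v^\top x)\, x^\top}{\lVert x \rVert^{2}}\right).
$$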
The reason this is necessary is that if a constrained model parameter defined in the Stan model is ever used elsewhere in the model defined in the other PPL, then we need a way to "backpropagate" the gradient of that parameter back to the unconstrained parameter, which is also handled by the other PPL. The only way to do this is if such a Jacobian-vector-product primitive is available for the constraining transform; without such a primitive, this use of a Stan model is not possible (except maybe by computing full Jacobians - 🤢). Do you think the necessary changes to the transforms would be a major project?
To be clear, I'm just gathering information at this point. I don't have a use case for this, but the people I was talking with do.
We could use autodiff to calculate the Jacobian of the transform and then explicitly multiply it by a vector. It wouldn't be efficient, but we'd get the right answer. Right now, we code the (scalar) log determinant of the absolute Jacobian and its derivative. We don't need Jacobian-vector products for transforms anywhere in Stan, so we've never coded it. It probably wouldn't be too hard to code all these as Jacobian-vector products, as there aren't that many transforms and they're all well described mathematically in our reference manual.
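A rough sketch of that brute-force route through the existing BridgeStan Python interface; finite differences stand in here for the autodiff Jacobian, and the library/data paths are placeholders:

```python
import bridgestan as bs
import numpy as np

model = bs.StanModel("model_lib.so", "data.json")  # placeholder paths

def constrain_jacobian(model, q, eps=1e-7):
    """Dense Jacobian of the constraining transform at unconstrained q,
    built one column at a time (finite differences as a stand-in for AD)."""
    n, m = model.param_num(), model.param_unc_num()
    J = np.empty((n, m))
    for j in range(m):
        dq = np.zeros(m)
        dq[j] = eps
        J[:, j] = (model.param_constrain(q + dq)
                   - model.param_constrain(q - dq)) / (2 * eps)
    return J

def param_constrain_vjp_bruteforce(model, q, v):
    """v' J: correct answer, but costs a full Jacobian build per call."""
    return v @ constrain_jacobian(model, q)
```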
No problem; vector-Jacobian products are generally more useful anyway for these applications. And these are implicitly available, right, in that Stan can reverse-mode AD through the transforms? When Stan computes the gradient of the log density with respect to the unconstrained parameters, it already backpropagates through the constraining transform. So for VJPs at least, would it be possible to simply provide a function like the following (pseudocode)?
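Something along these lines (the name and signature are purely illustrative; no such function exists in BridgeStan today):

```python
import numpy as np

def param_constrain_vjp(theta_unc: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Hypothetical API: return v' J, where J is the Jacobian of the
    constraining transform c evaluated at theta_unc. Equivalently, the
    gradient of the scalar function q -> dot(v, c(q)) at q = theta_unc."""
    raise NotImplementedError("proposed, not implemented")
```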
Here `theta_unc` is the vector of unconstrained parameters and `v` is the cotangent vector being backpropagated from the rest of the model.
We don't code the transforms with either jvp or vjp, to use Seth's lingo. When Stan executes a model with a constrained parameter declaration, say a simplex, it applies the constraining transform to the unconstrained values, adds the log absolute Jacobian determinant to the target density, and then evaluates the log density on the constrained scale.
In this example, the transform and everything downstream of it are differentiated together in one reverse pass, so there is no standalone derivative function for the transform itself.

Yes, it'd be possible to implement the function you describe, at least by brute force.

Out of curiosity, why does someone want all this?
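In other words, the gradient Stan actually computes is the composite

$$
\nabla_q \log p_{\text{unc}}(q) \;=\; J_c(q)^\top\, \nabla_\theta \log p(\theta)\Big|_{\theta = c(q)} \;+\; \nabla_q \log\bigl|\det J_c(q)\bigr|,
$$

where the $J_c^\top$ product happens implicitly inside the reverse sweep rather than as a callable primitive.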
Yes, this is precisely what I mean. The transform is computed on a number, and the derivative is backpropagated by the AD. The part of the program that backpropagates is a vjp automatically constructed by the AD. I don't really understand why one would need to rewrite the transforms from scratch to allow this to be done on the transform by itself instead of on the transform composed with the downstream operations that compute the logpdf. Is it the case that Stan's reverse-mode AD API only supports computing gradients? If so, one can still implement a VJP even if the only available primitive is a gradient, by computing the gradient of the scalar function $g(q) = v^\top c(q)$, since $\nabla g(q) = J_c(q)^\top v$.
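A minimal self-contained illustration of that trick, with a finite-difference `grad` standing in for an AD gradient (all names here are illustrative):

```python
import numpy as np

def grad(g, x, eps=1e-7):
    """Stand-in for reverse-mode AD: numerical gradient of scalar g at x."""
    out = np.empty_like(x)
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        out[j] = (g(x + dx) - g(x - dx)) / (2 * eps)
    return out

def vjp(f, x, v):
    """v' J_f(x), computed as the gradient of the scalar q -> dot(v, f(q))."""
    return grad(lambda q: v @ f(q), x)

# Example: the unit-vector (l2-normalization) transform discussed above.
f = lambda x: x / np.linalg.norm(x)
x, v = np.array([3.0, 4.0]), np.array([1.0, -1.0])
print(vjp(f, x, v))  # matches v' (I - uu') / ||x|| with u = x / ||x||
```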
I haven't seen a detailed use case yet, but I think it's to support submodels, e.g. as supported by Turing and PyMC: https://discourse.pymc.io/t/nested-submodels/11725. The idea is that one's model might be composed of modular components that each have been well-characterized, are considered reliable, and have efficient and numerically stable implementations. I suspect the goal might be to support cases where such a Stan model already exists for a subset of the parameters, and one wants to augment the model with additional parameters or combine it with a separate model; that other model might require language features unavailable in Stan.
Stan's really not compatible with that use case because of the way blocks are defined. Maria Gorinova wrote a cool thesis with an alternative design that's more like Turing.jl and would accommodate the generality of being able to define submodels modularly. I haven't seen this functionality in Turing or PyMC---is there a pointer to how they do it somewhere? I'm having trouble imagining where that'd be the right thing to do from a software perspective (mixing Stan and something else), because Stan code only gets so complicated, so it's usually not too big a bottleneck just to reimplement.

Yes, it's the case that Stan's reverse-mode AD only computes gradients. We didn't template out the types so that we could autodiff through reverse mode. We have forward mode implemented for higher-order gradients, but not for our implicit functions like ODE solvers. We have the transforms and inverse transforms implemented with templated functions. So we can do all of this by brute force with autodiff by explicitly constructing the Jacobian. To evaluate a Jacobian-adjoint product more efficiently without building the Jacobian explicitly, we'd have to rewrite the transform code.
@bob-carpenter Here is an example of using a Stan model inside a Turing model, and it is very helpful in two ways: one can benefit from Stan's mature model and math-library implementations, and one can use inference algorithms that Stan itself doesn't provide.
Not that I'm aware of. I think that would get too much into the internals.
I don't think that example is really a great case for this feature, though. One can already benefit from Stan's math library and use samplers other than Stan's HMC without needing the additional features described here; e.g., the example in the readme of StanLogDensityProblems shows how to sample a Stan model with DynamicHMC.jl, and one could easily swap in Pathfinder.jl to fit a variational approximation instead. I think one only needs these features if one wants to combine a Stan model with additional parameters whose log-density depends on the constrained parameters in the Stan model, and it would be nice to see a compelling use case for that.
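For reference, that existing-API path looks roughly like this in the Python interface (file names are placeholders):

```python
import bridgestan as bs
import numpy as np

model = bs.StanModel("bernoulli_model.so", "bernoulli.data.json")  # placeholders

# Everything an HMC-family or variational method needs today:
q = np.random.default_rng(0).normal(size=model.param_unc_num())
lp, lp_grad = model.log_density_gradient(q)   # w.r.t. unconstrained q
theta = model.param_constrain(q)              # back to the constrained scale
```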
Fair point. The use cases are models involving discrete variables and non-parametric models where the model dimensionality changes during inference. These are hard to handle in Stan but are unavoidable in some applied areas.
@yebai: For algorithm development, we've been using BridgeStan, but that's still limited to Stan models. @roualdes originally developed it for use in Julia.
Those problems are too hard for us. We like to stick to examples where we can get simulation-based calibration to work. Even in simpler cases like K-means clustering where we can marginalize all the discrete parameters, the posterior is too multimodal to sample. |
Someone asked me if it would be possible to nest an existing Stan model within a model defined in another PPL using BridgeStan. Currently the major limitation of doing this is that we have no way to autodiff through the constraining transformation. We also would in general need to be able to separately compute the following:

- [x] the constraining transform
- [x] the unconstraining transform
- [ ] the log-density with respect to the constrained parameters (without the Jacobian adjustment)
- [ ] the gradient of the log-density with respect to the constrained parameters
- [ ] the log absolute Jacobian determinant of the constraining transform on its own
- [ ] jvp/vjp primitives for the constraining transform
As far as I can tell, only the two transforms are currently part of the API. The available log-density and gradient are only wrt unconstrained parameters, the Jacobian adjustment is only available as part of the density calculation, and no AD primitives are available for the transforms.
This is purely exploratory at this stage, but I wonder if it would be feasible and of interest to include the missing functionality in the API.
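To make the proposal concrete, here is one hypothetical shape the additions could take in the Python interface; none of these functions exist in BridgeStan, and the names are only illustrative:

```python
import numpy as np

# Hypothetical signatures only -- nothing below exists in BridgeStan today.

def log_density_constrained(theta: np.ndarray, propto: bool = True) -> float:
    """Log density w.r.t. the constrained parameters, no Jacobian adjustment."""
    raise NotImplementedError

def log_density_constrained_gradient(theta: np.ndarray) -> "tuple[float, np.ndarray]":
    """Log density and its gradient w.r.t. the constrained parameters."""
    raise NotImplementedError

def log_jacobian(theta_unc: np.ndarray) -> float:
    """Log absolute Jacobian determinant of the constraining transform alone."""
    raise NotImplementedError

def param_constrain_jvp(theta_unc: np.ndarray, u: np.ndarray) -> np.ndarray:
    """J u: Jacobian-vector product of the constraining transform."""
    raise NotImplementedError

def param_constrain_vjp(theta_unc: np.ndarray, v: np.ndarray) -> np.ndarray:
    """v' J: vector-Jacobian product of the constraining transform."""
    raise NotImplementedError
```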