Add ScoreELBO objective #72
Conversation
Thank you so much for the PR. It looks pretty good at a glance, but I am currently on a trip to ICML, so it will take some time for me to do a more detailed review. A few first comments:
As for mixing the reparameterization gradient with the score gradient, I think I have an idea about how to do it: we can split the variables into those that need the RP gradient and those that need the score gradient, and compute the RP objective and the score objective separately. But it will require handling
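The splitting idea could be sketched as follows (my notation, not from the PR): write $z = (z_r, z_s)$, where $z_r = T_\lambda(\epsilon)$ admits reparameterization and $z_s$ does not, and combine the two estimators:

$$
\nabla_\lambda \mathrm{ELBO} \approx \underbrace{\nabla_\lambda \log \frac{p\big(T_\lambda(\epsilon),\, z_s\big)}{q_\lambda\big(T_\lambda(\epsilon)\big)}}_{\text{RP gradient for } z_r} + \underbrace{\big(\log p(z) - \log q_\lambda(z)\big)\,\nabla_\lambda \log q_\lambda(z_s)}_{\text{score gradient for } z_s}
$$

where the first term differentiates through the sampling path of $z_r$ and the second uses the log-likelihood-ratio trick for $z_s$.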
May I ask what "STL" is?
Yep, DynamicPPL doesn't track dependencies; there were some efforts, but we just can't do it reliably. Although,
Thanks @Red-Portal for the first comments. I agree with the name change and with avoiding the use of
This is a somewhat unrelated question, but do you know what the key differences between DynamicPPL.jl and GraphPPL.jl are? The reason I ask is that GraphPPL.jl has a very nice syntax for specifying metadata for a specific sampling (see here):

```julia
@model function some_model(a, b)
    x ~ Beta(a, b) where { meta = "Hello, world!" }
end
```

This would be very convenient for specifying the estimation method.
Sorry for the delay! @arnauqb I don't have too much insight into the PPL part, so that's something that perhaps @sunxd3 could comment on. @sunxd3 STL stands for sticking-the-landing, which is a very clever control variate method by Roeder, Wu, and Duvenaud.
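For readers of the thread: sticking-the-landing modifies the reparameterized ELBO estimator by evaluating the variational density at a gradient-stopped copy of the parameters, written here as $\bar\lambda$ (my notation):

$$
\widehat{\nabla_\lambda \mathrm{ELBO}}_{\text{STL}} = \nabla_\lambda\Big[\log p\big(T_\lambda(\epsilon)\big) - \log q_{\bar\lambda}\big(T_\lambda(\epsilon)\big)\Big], \qquad \epsilon \sim \varphi .
$$

Stopping the gradient through the density's parameters removes a score term whose expectation is zero, so the estimator stays unbiased but has zero variance when $q_\lambda$ matches the target exactly.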
DynamicPPL aims to let users write generic Julia code for model definition, while graph-oriented DSLs like GraphPPL (and JuliaBUGS) are bound to have some restrictions on the syntax, although the restrictions can be a good thing. These syntax restrictions usually manifest as limited mutability, limitations on loop bounds, etc.

Other than syntax differences, Turing/DynamicPPL is also a "general purpose" PPL (probabilistic programming language). A somewhat reductionist understanding of a "general purpose" PPL vs. a graphical-model PPL is: a "general purpose" PPL can allow a stochastic number of random variables in the model. A very basic circumstance for this to happen is when the size of some array representing random variables is itself stochastic, e.g., drawn from a Categorical distribution.

The ability to specify metadata in GraphPPL is a good design, but maybe difficult for DynamicPPL to implement well (it would take work to get the design right). I also think it's very interesting how Pyro implemented TraceELBO, but I would need further investigation to determine how it works and whether it applies to DynamicPPL.

All this being said, we should probably think about how to use better dependency information to make VI better here, both in terms of algorithm implementations and interface design. We can use these designs to inform DynamicPPL design in the future and also build interfaces with DSLs like GraphPPL.
So I'm not sure how to avoid using
Sorry for the delay! I just got back from ICML and was recovering from a cold I caught on the way back. Yes, it seems like we'll have to call
Apologies for the long delay on updating this. I got sidetracked with other projects after the summer break... I have now implemented the score estimator without needing the
Hi @arnauqb, thanks for the amazing work! It is nice to see that everything seems to work out without the need for gradient-stopping operations. I have some rather minor questions and requested changes, which I think should be handled pretty quickly. I do have another comment: could you also try to add the test with Bijectors, as done here?
```julia
function compute_control_variate_baseline(history, window_size)
    if length(history) == 0
        return 1.0
    end
    min_index = max(1, length(history) - window_size)
    return mean(history[min_index:end])
end
```
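A quick sanity check of the baseline's behavior (the function is repeated here so the snippet runs standalone; note that the slice actually spans the last `window_size + 1` entries):

```julia
using Statistics  # provides `mean`

function compute_control_variate_baseline(history, window_size)
    if length(history) == 0
        return 1.0  # default baseline before any history is recorded
    end
    min_index = max(1, length(history) - window_size)
    return mean(history[min_index:end])
end

compute_control_variate_baseline(Float64[], 3)             # 1.0 (empty history)
compute_control_variate_baseline([1.0, 2.0, 3.0, 4.0], 2)  # mean([2.0, 3.0, 4.0]) == 3.0
```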
I personally think that we should build a whole new set of interfaces for control variates, so that people can easily mix and match various control variates. But for now, I think this can be included in the PR.
agreed!
Co-authored-by: Kyurae Kim <[email protected]>
Thank you for the comments @Red-Portal! I'm in the process of merging your comments and adding the additional tests. I'm currently experiencing a strange bug when computing the closed form of the entropy. In particular:

```julia
n_dims = 10
μ_x = 5.0
σ_x = 0.3
μ_y = Fill(5.0, n_dims)
σ_y = Fill(0.3, n_dims)
model = NormalLogNormal(μ_x, σ_x, μ_y, Diagonal(σ_y.^2))
d = LogDensityProblems.dimension(model)
q = variational_standard_mvnormal(Float64, d, :meanfield)
entropy(q) # works

b = Bijectors.bijector(model)
binv = inverse(b)
q_transformed = Bijectors.TransformedDistribution(q, binv)
entropy(q_transformed) # does not work
```

This should also make the RepGradELBO objective fail when computing the closed entropy, but somehow it doesn't. Any help?
@arnauqb The entropy under nonlinear transformations is generally intractable, so it's not implemented. For the
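Spelling out the intractability (my notation): for a transformation $T$, the change-of-variables formula gives

$$
\mathcal{H}\big[T_\# q\big] = \mathcal{H}[q] + \mathbb{E}_{z \sim q}\big[\log\left|\det J_T(z)\right|\big],
$$

and the expectation of the log-Jacobian term has no closed form for general nonlinear $T$, which is why `entropy` cannot be provided for an arbitrary transformed distribution.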
@Red-Portal Ok, so I had to rework this a bit more, let me know what you think. For the score estimator, we need to sample
Locally, I get segmentation faults for the Enzyme tests, even for the RepGradELBO. Not sure what's wrong there...
@arnauqb Can't you just shove
But then the entropy term would not carry a gradient, so it wouldn't optimize for that?
Ah, you're right. Okay, this is indeed a little bit tricky, and I am not too happy with dispatching based on the objective. I think the best course of action would be to split
Ok, I have reverted the change. Currently, using Bijectors with the closed-form entropy does not work for the ScoreGrad, but I guess that is OK for now. Currently, Zygote and Tapir tests also fail on the score grad; not sure why, it is a hard-to-debug Enzyme error.
Hi @arnauqb, did you push all your changes? If so, I'll take a deeper look so that we can move this PR forward.
Yes, should be ready for a deep review.
Except for the one comment, I think most things look good now. Could you also run the formatter (refer to here) so that it doesn't complain? Once you're done, re-request review so that I can immediately know that you're ready.
src/objectives/elbo/scoregradelbo.jl
Outdated
```julia
    return fv_mean + (score_grad - score_grad_stop)
end

function compute_elbo(q, q_stop, samples_stop, entropy, problem, baseline)
```
Is there a reason for this function to exist separately? The only caller is `estimate_scoregradelbo_ad_forward`, so I think it could be blended into it.
Okay now that the tests pass, I am happy. I'll take it from here. Thank you so much for the amazing work!
Thank you for your reviews and guidance! :)
Very good work @arnauqb @Red-Portal, thanks for pushing this through
This PR implements the `ScoreELBO` objective, which allows performing VI for non-differentiable densities. I have followed the idea exposed here, where we can substitute the traditional ELBO objective by a surrogate one, where the bar indicates that the term is constant (i.e., no gradient flows). The gradient of this surrogate objective is the same as the reparameterization one.

I have made the following modifications:

- Added a `stop_gradient` function, defined in `AdvancedVI.jl`, for the `ForwardDiff` and `Zygote` backends. I am not sure how to implement it for `Enzyme` or `ReverseDiff`, so it throws a not-implemented error for now.
- Modified the `value_and_gradient` method for Zygote to properly test that the gradient does not flow. Instead of returning `nothing` it returns 0. Not sure if this is necessary or adequate.
- Added `objectives/elbo/scoreelbo.jl`, which implements the new objective. I made sure to modify only the backward pass and keep the ELBO loss at the same value.
- Tests against the `RepGradELBO` objective. I had to relax some of the tolerances a bit, given that the score estimator is typically much worse.

Questions / things to consider:

- We need `log_prob(samples)` inside the rule and I am not sure how to get them easily.
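One common form of such a surrogate can be written out as follows (my notation, not necessarily the exact equation from the PR; $\overline{(\cdot)}$ denotes a gradient-stopped term):

$$
\mathcal{L}_{\text{surr}}(\lambda) = \mathbb{E}_{z \sim q_\lambda}\Big[\,\overline{\big(\log p(z) - \log q_\lambda(z)\big)}\;\log q_\lambda(z)\Big].
$$

Differentiating only through the explicit $\log q_\lambda(z)$ factor (and not through the sampling of $z$) gives

$$
\nabla_\lambda \mathcal{L}_{\text{surr}} = \mathbb{E}_{z \sim q_\lambda}\Big[\big(\log p(z) - \log q_\lambda(z)\big)\,\nabla_\lambda \log q_\lambda(z)\Big],
$$

which is the score-function (REINFORCE) estimate of the ELBO gradient: it requires only evaluations of $\log p$, never its gradient, which is what makes non-differentiable targets workable.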