
Add ScoreELBO objective #72

Merged: 15 commits into TuringLang:master on Sep 30, 2024
Conversation

@arnauqb arnauqb commented Jul 19, 2024

This PR implements the ScoreELBO objective, which enables VI for non-differentiable densities.

I have followed the idea described here, where we substitute the traditional ELBO objective with a surrogate one

$$ f_\phi(z) \rightarrow \log q_\phi(z)\, \overline{f_\phi(z)} $$

where the bar indicates that the term is held constant (i.e., no gradient flows through it). The gradient of this surrogate objective matches the reparameterization gradient in expectation.
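For intuition, the construction rests on the score-function (REINFORCE) identity $\nabla_\phi \mathbb{E}_{q_\phi}[f(z)] = \mathbb{E}_{q_\phi}\left[f(z)\, \nabla_\phi \log q_\phi(z)\right]$. Below is a minimal, self-contained sketch of the surrogate (an illustration only, not this PR's code; the toy target logp and the use of ChainRulesCore.ignore_derivatives as the stop-gradient are stand-ins):

using Distributions, Zygote
using ChainRulesCore: ignore_derivatives as stop

logp(z) = logpdf(Normal(0.0, 1.0), z)  # stand-in (possibly non-differentiable) target

function surrogate(μ)
    q = Normal(μ, 1.0)
    z = stop(rand(q))               # sample detached from the computation graph
    f = logp(z) - logpdf(q, z)      # per-sample ELBO term f_ϕ(z)
    s = logpdf(q, z) * stop(f)      # gradient of s is the score term f(z) ∇ log q_ϕ(z)
    return stop(f) + (s - stop(s))  # value equals f; gradient equals the score gradient
end

Zygote.gradient(surrogate, 0.5)     # single-sample score-function gradient estimate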

I have made the following modifications:

  • Extended the stop_gradient function defined in AdvancedVI.jl for the ForwardDiff and Zygote backends (see the sketch after the figure below). I am not sure how to implement it for Enzyme or ReverseDiff, so those throw a not-implemented error for now.
  • I also modified the value_and_gradient method for Zygote to properly test that the gradient does not flow: instead of returning nothing, it returns 0. I am not sure whether this is necessary or adequate.
  • I have added a new file objectives/elbo/scoreelbo.jl that implements the new objective. I made sure to modify only the backward pass and to keep the value of the ELBO loss unchanged.
  • I added tests mirroring the structure of the RepGradELBO tests. I had to relax some of the tolerances a bit, since the score estimator typically has much higher variance.
  • I have added a score example in the benchmarks folder. This is the comparison with the pathwise version:

[Figure: ELBO convergence comparison, score gradient vs. reparameterization (pathwise) gradient]
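For reference (regarding the first bullet above), here is a hedged sketch of what backend-specific stop_gradient methods could look like; the PR's actual extensions may differ:

using ForwardDiff
import ChainRulesCore

stop_gradient(x) = x  # fallback for plain values

# ForwardDiff propagates Dual numbers; stripping their partials cuts the derivative flow.
stop_gradient(x::ForwardDiff.Dual) = ForwardDiff.value(x)
stop_gradient(x::AbstractArray{<:ForwardDiff.Dual}) = ForwardDiff.value.(x)

# Zygote consumes ChainRules rules, so marking the call non-differentiable suffices;
# Zygote then reports `nothing` for the gradient, which is why the PR's
# value_and_gradient method maps it to 0 (second bullet above).
ChainRulesCore.@non_differentiable stop_gradient(x)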

Questions / things to consider:

  • Right now, we take the score gradient for the whole model, but ideally we would only use it for the individual densities that we can't differentiate. One way to do this would be to create custom differentiation rules for the non-differentiable densities and do the score estimation inside them. However, for this we need access to log_prob(samples) inside the rule, and I am not sure how to obtain it easily.
  • The score estimator can be greatly improved with proper variance reduction techniques. This PR does not implement any, so the estimator has relatively high variance.

@sunxd3 sunxd3 requested a review from Red-Portal July 19, 2024 13:45
Red-Portal commented Jul 20, 2024

Thank you so much for the PR. It looks pretty good at a glance, but I am currently on a trip to ICML, so it will take some time for me to do a more detailed review. A few first comments:

  • I think it would be better to change the name of ScoreELBO to ScoreGradELBO to be consistent with RepGradELBO.
  • I don't think applying STL would help in the case of the score gradient. (I also have not seen people do this.) But maybe it does? Have you run some experiments on this?
  • You can avoid relying on stop_gradient by passing the log target evaluation as an auxiliary input. This is, in fact, how STL is implemented (see q_stop here). It's not ideal, but I think that's the best we can do. In this case, it's acceptable since it wouldn't add any costs compared to using stop_gradient.
  • Additional variance reduction methods like control variates should be easy to add, but Rao-Blackwellization is probably going to be very hard to implement since DynamicPPL doesn't have a mechanism similar to Pyro plates. (I am all ears if anybody has better ideas for this.)
  • If we were to support the score gradient, we should probably add the VarGrad estimator too (see the note after this list). (I guess this will be work for me in the future unless somebody would like to take a shot first.)
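(For reference, the VarGrad objective mentioned in the last bullet, introduced by Richter et al. (2020), minimizes the variance of the log-weights rather than the ELBO itself:

$$ \mathcal{L}_{\text{VarGrad}}(\phi) = \mathrm{Var}_{z \sim q_\phi}\left[ \log \frac{q_\phi(z)}{p(x, z)} \right] $$

and its gradient estimator behaves like the score gradient equipped with a built-in, leave-one-out control variate.)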

As for mixing the reparameterization gradient with the score gradient, I think I have an idea of how to do it: we can split the variables into those that need the RP gradient and those that need the score gradient, and compute the RP objective and the score objective separately. But it will require handling NamedTuple-variate random variables, which will require changes on the DynamicPPL side. (Without NamedTuple-variate RVs, it's not going to be pretty.) Also, unfortunately, we probably can't handle things automatically as in Pyro, since DynamicPPL doesn't track RV dependencies as far as I understand. (Is that correct @sunxd3?)

sunxd3 commented Jul 20, 2024

I don't think applying STL would help in the case of the score gradient.

May I ask what is "STL"?

Also, unfortunately, we probably can't handle things automatically as in Pyro since DynamicPPL doesn't track RV dependencies as far as I understand.

Yep, DynamicPPL doesn't track dependencies. There were some efforts, but we just can't do it reliably. JuliaBUGS.jl does, though; extracting dependencies is part of the motivation for writing JuliaBUGS -- it actually constructs a directed PGM. Maybe this is a good time to explore a design and implementation that exploits the graph structure.

arnauqb commented Jul 20, 2024

Thanks @Red-Portal for the first comments. I agree with the name change and with avoiding the use of stop_gradient; I will push changes soon.

This is a somewhat unrelated question, but do you know what the key differences between DynamicPPL.jl and GraphPPL.jl are? The reason I ask is that GraphPPL.jl has a very nice syntax for specifying metadata for a specific sampling statement (see here):

@model function some_model(a, b)
    x ~ Beta(a, b) where { meta = "Hello, world!" }
end

this would be very convenient for specifying the estimation method.

Red-Portal commented Jul 21, 2024

Sorry for the delay!

@arnauqb I don't have too much insight into the PPL part so that's something that perhaps @sunxd3 could comment on.

@sunxd3 STL stands for sticking-the-landing, which is a very clever control variate method by Roeder, Wu, and Duvenaud.

sunxd3 commented Jul 22, 2024

DynamicPPL aims to let users write generic Julia code for model definitions, while graph-oriented DSLs like GraphPPL (and JuliaBUGS) are bound to have some syntax restrictions (although restrictions can be a good thing). These restrictions usually manifest as limited mutability, constraints on loop bounds, etc.

Other than syntax differences, Turing/DynamicPPL is also a "general purpose" PPL (probabilistic programming language). A somewhat reductionist way to distinguish a "general purpose" PPL from a graphical-model PPL is that a "general purpose" PPL allows a stochastic number of random variables in the model. The most basic way this arises is when the size of some array of random variables is itself stochastic, e.g., drawn from a Categorical distribution.

The ability to specify metadata in GraphPPL is a good design, but maybe difficult for DynamicPPL to implement well (it would take work to get the design right).

Now, I think it's very interesting how Pyro implemented TraceELBO, but I would need to investigate further to determine how it works and whether it applies to DynamicPPL.

All this being said, we should probably think about how to use dependency information to make VI better here, both in terms of algorithm implementations and interface design. These designs can inform future DynamicPPL development and also help build interfaces to DSLs like GraphPPL.

arnauqb commented Jul 23, 2024

So I'm not sure how to avoid using stop_gradient, since I need the (score_grad - ignore_gradient(score_grad)) trick so that the ELBO keeps the same value as in the repgrad case. The only way I can think of doing it with the auxiliary variable would be to compute logpdf twice: once outside the gradient calculation (passed through aux) and once inside. But wouldn't that add a bit of computational overhead?
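For what it's worth, a hedged sketch of the double-evaluation idea (toy names, not this PR's code): logpdf is evaluated once outside the AD call (detached, passed through aux) and once inside (attached):

using Distributions, Zygote

logp(z) = logpdf(Normal(0.0, 1.0), z)   # stand-in target density

# Outside the AD path: sample and do the first, gradient-free logpdf evaluation.
q_stop  = Normal(0.5, 1.0)
zs      = rand(q_stop, 16)
f_stop  = logp.(zs) .- logpdf.(Ref(q_stop), zs)   # detached per-sample ELBO terms
aux     = (samples = zs, f = f_stop)

# Inside the AD path: the second logpdf evaluation, attached to the graph.
function scoregrad_forward(μ, aux)
    q    = Normal(μ, 1.0)
    logq = logpdf.(Ref(q), aux.samples)           # carries the gradient
    return sum(logq .* aux.f) / length(logq)
end

Zygote.gradient(μ -> scoregrad_forward(μ, aux), 0.5)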

Red-Portal commented Jul 29, 2024

Sorry for the delay! I just got back from ICML and was recovering from a cold I caught on the way back. Yes, it seems like we'll have to call logpdf(q) twice, both in and out of the AD path. I think that should be acceptable in general. I'll do a proper review once your updates are committed.

arnauqb commented Sep 16, 2024

Apologies for the long delay in updating this. I got sidetracked with other projects after the summer break...

I have now implemented the score estimator without needing the stop_gradient trick, by using q_stop as you suggested. I have also implemented a baseline control variate, which computes the running mean of the estimates over the last n iterations and uses it as a baseline for variance reduction. It has a very noticeable effect:

[Figure: ELBO convergence with and without the running-mean baseline control variate]
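For reference, a baseline $b$ enters the score estimator as

$$ \widehat{\nabla_\phi \mathcal{L}} = \big(f_\phi(z) - b\big)\, \nabla_\phi \log q_\phi(z), $$

with $b$ here the running mean of the estimates from the last $n$ iterations. Since $\mathbb{E}_{z \sim q_\phi}\left[\nabla_\phi \log q_\phi(z)\right] = 0$, subtracting $b$ leaves the gradient unbiased while reducing its variance.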

@Red-Portal Red-Portal left a comment

Hi @arnauqb, thanks for the amazing work! It is nice to see that everything works out without the need for gradient-stopping operations. I have some rather minor questions and requested changes, which I think can be handled pretty quickly. I do have one more comment: could you also try to add the test with Bijectors, as done here?

Comment on lines +67 to +73
function compute_control_variate_baseline(history, window_size)
    # No history yet: fall back to a constant baseline of 1.0.
    if length(history) == 0
        return 1.0
    end
    # Running mean over (at most) the last `window_size` entries.
    min_index = max(1, length(history) - window_size)
    return mean(history[min_index:end])
end
Red-Portal:

I personally think that we should make a whole new set of interface for control variates so that people could easily mix and match various control variates. But for now, I think this could be included in the PR.

arnauqb:

agreed!

Further review comments (since resolved) on: src/objectives/elbo/scoregradelbo.jl, ext/AdvancedVIZygoteExt.jl, src/objectives/elbo/entropy.jl, and test/Project.toml.
arnauqb commented Sep 19, 2024

Thank you for the comments @Red-Portal! I'm in the process of addressing them and adding the additional tests.

I'm currently experiencing a strange bug when computing the closed form of the entropy. In particular,

n_dims = 10
μ_x    = 5.0
σ_x    = 0.3
μ_y    = Fill(5.0, n_dims)
σ_y    = Fill(0.3, n_dims)
model  = NormalLogNormal(μ_x, σ_x, μ_y, Diagonal(σ_y.^2))

d = LogDensityProblems.dimension(model)
q = variational_standard_mvnormal(Float64, d, :meanfield) 
entropy(q) # works

b             = Bijectors.bijector(model)
binv          = inverse(b)
q_transformed = Bijectors.TransformedDistribution(q, binv)
entropy(q_transformed) # does not work

the call entropy(q_transformed) errors with the message MethodError: no method matching iterate(::MultivariateTransformed{MvLocationScale{…}, Stacked{…}})

This should also make the RepGradELBO objective fail when computing the closed-form entropy, but somehow it doesn't. Any help?

Red-Portal commented Sep 20, 2024

@arnauqb The entropy under nonlinear transformations is generally intractable, so it's not implemented. For the repgradelbo case, for example, the Jacobian-adjustment part of the entropy lives inside the Monte Carlo estimator. Now that I think of it, to keep things as clean as possible, it would be best to call reparam_with_entropy, since it deals with the Jacobian adjustment when computing the entropy (see the AdvancedVIBijectorsExt extension).
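(For context, the adjustment in question is the change-of-variables term: if $q$ is pushed through $b^{-1}$,

$$ \mathbb{H}\left[b^{-1} \circ q\right] = \mathbb{H}[q] + \mathbb{E}_{z \sim q}\left[ \log \left| \det J_{b^{-1}}(z) \right| \right], $$

and the expectation on the right has no closed form for general nonlinear $b^{-1}$, so it is folded into the Monte Carlo estimator.)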

arnauqb commented Sep 20, 2024

@Red-Portal Ok, so I had to rework this a bit more, let me know what you think.

For the score estimator, we need to draw samples z ~ q and then evaluate the log-densities at z. The problem is that the sampling needs to be decoupled from the computation graph. In the current implementation, where the sampling and the entropy are computed together, this was not possible, so I introduced a new function sample_from_q that dispatches on the objective: RepGradELBO samples from q, and ScoreGradELBO samples from q_stop (see the sketch below).
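A hedged sketch of that dispatch (toy stand-in types; the actual PR signatures may differ):

using Distributions, Random

struct RepGrad;   n_samples::Int end   # stand-ins for the actual objective types
struct ScoreGrad; n_samples::Int end

# Attached path: the reparameterization gradient flows through these samples.
sample_from_q(rng, obj::RepGrad, q, q_stop)   = rand(rng, q, obj.n_samples)
# Detached path: q_stop is a gradient-free copy of q, so nothing flows back.
sample_from_q(rng, obj::ScoreGrad, q, q_stop) = rand(rng, q_stop, obj.n_samples)

q = Normal(0.0, 1.0)
sample_from_q(Random.default_rng(), ScoreGrad(8), q, q)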

Locally, I get segmentation faults in the Enzyme tests, even for RepGradELBO. Not sure what's wrong there...

@Red-Portal

@arnauqb Can't you just shove q_stop into reparam_with_entropy?

arnauqb commented Sep 21, 2024

But then the entropy term would not carry a gradient, so it wouldn't be optimized?

Red-Portal commented Sep 21, 2024

Ah, you're right. Okay, this is indeed a little bit tricky, and I am not too happy with dispatching based on the objective. I think the best course of action would be to split reparam_with_entropy into sampling and entropy estimation, so that one can mix and match differentiable elements. But that would require more work, including updating the docs, which I think is a bit too much for this PR. Could you go back to the version without sample_from_q and forget about testing with Bijectors? I'll handle the rest from there.

arnauqb commented Sep 23, 2024

Ok, I have reverted the change. Currently, using Bijectors with the closed-form entropy does not work for the ScoreGrad objective, but I guess that is OK for now. The Zygote and Tapir tests also currently fail on the score grad; I'm not sure why, as it is a hard-to-debug error like the Enzyme one.

Red-Portal commented Sep 27, 2024

Hi @arnauqb, did you push all your changes? If so, I'll take a deeper look so that we can move this PR forward.

arnauqb commented Sep 27, 2024

Yes, should be ready for a deep review.

@Red-Portal Red-Portal self-requested a review September 27, 2024 19:20
@Red-Portal Red-Portal left a comment

Except for the one comment, I think most things look good now. Could you also run the formatter (refer to here) so that it doesn't complain? Once you're done, re-request a review so that I immediately know you're ready.

return fv_mean + (score_grad - score_grad_stop)
end

function compute_elbo(q, q_stop, samples_stop, entropy, problem, baseline)
Red-Portal:

Is there a reason for this function to exist separately? Its only caller is estimate_scoregradelbo_ad_forward, so I think it could be folded into it.

@arnauqb arnauqb requested a review from Red-Portal September 29, 2024 12:06
@Red-Portal

Okay now that the tests pass, I am happy. I'll take it from here. Thank you so much for the amazing work!

@Red-Portal Red-Portal merged commit 4eab1ac into TuringLang:master Sep 30, 2024
9 of 10 checks passed
arnauqb commented Sep 30, 2024

Thank you for your reviews and guidance! :)

sunxd3 commented Oct 2, 2024

Very good work @arnauqb @Red-Portal, thanks for pushing this through!
