This is a work-in-progress implementation of Sampled MuZero that I've been working on. I figured I'd store the implementation here in case anyone else is interested in developing it. The agent learns slowly (if at all) and performs significantly worse than the vanilla MuZero agent. As I'm relatively new to these libraries, I'm out of ideas for how to debug it.

One interesting discrepancy between the regular agent and the sampled agent is the shape of the policy loss: for the sampled agent it initially spikes, drops back to zero, and then rises along a roughly logarithmic curve, whereas the regular agent's policy loss shows no such initial spike. I don't know how to interpret this difference. Since the policy loss does converge, automatic differentiation appears to be configured correctly; the question is then why the policy doesn't improve more than it does, which suggests that something in the tree search is misconfigured.
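In case it helps pin down where things might be going wrong, here is a rough sketch of the sampled policy loss as I understand it: a cross-entropy between the normalized visit counts over the K sampled actions and the network's log-probabilities of those same actions. The function name, shapes, and numpy framing below are simplified illustrations of the idea, not the code in this PR.

```python
import numpy as np

def sampled_policy_loss(visit_counts, policy_log_probs):
    """Cross-entropy over the K sampled actions (simplified sketch).

    visit_counts:     [K] visit counts from the tree search, one per sampled action.
    policy_log_probs: [K] network log-probabilities of those same sampled actions.
    """
    # Policy target: visit-count distribution restricted to the sampled actions.
    target = visit_counts / np.maximum(visit_counts.sum(), 1e-8)
    # Cross-entropy between the search-derived target and the network policy.
    return -np.sum(target * policy_log_probs)
```

If the search itself is misconfigured (for example, if the priors over sampled actions aren't corrected for the sampling distribution), the visit counts feeding this target would be biased, which might explain a loss that converges while the policy barely improves.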
I'm very much interested in feedback! Thanks.