This is a work-in-progress implementation of Sampled MuZero that I've been working on. I figured I'd store the implementation here in case anyone else is interested in developing it. The agent learns slowly (if at all) and performs significantly worse than the vanilla MuZero agent. As I'm relatively new to these libraries, I'm out of ideas for how to debug it.

One interesting discrepancy between the regular agent and the sampled agent is the shape of the policy loss: for the sampled agent it initially spikes, drops back to zero, and then rises along a roughly logarithmic curve, whereas the regular agent's policy loss shows no such initial spike. I don't know how to interpret this difference. Since the policy loss does converge, automatic differentiation appears to be configured correctly; the question is then why the policy doesn't improve more than it does, which suggests that something in the tree search is misconfigured.
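In case it helps pin down where things might be going wrong, here is a rough sketch of the sampled policy loss as I understand it: a cross-entropy between the normalized visit counts over the K sampled actions and the network's log-probabilities of those same actions. The function name, shapes, and numpy framing below are simplified illustrations of the idea, not the code in this PR.

```python
import numpy as np

def sampled_policy_loss(visit_counts, policy_log_probs):
    """Cross-entropy over the K sampled actions (simplified sketch).

    visit_counts:     [K] visit counts from the tree search, one per sampled action.
    policy_log_probs: [K] network log-probabilities of those same sampled actions.
    """
    # Policy target: visit-count distribution restricted to the sampled actions.
    target = visit_counts / np.maximum(visit_counts.sum(), 1e-8)
    # Cross-entropy between the search-derived target and the network policy.
    return -np.sum(target * policy_log_probs)
```

If the search itself is misconfigured (for example, if the priors over sampled actions aren't corrected for the sampling distribution), the visit counts feeding this target would be biased, which might explain a loss that converges while the policy barely improves.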
I'm very much interested in feedback! Thanks.