
Releases: pierrot-lc/eternity-rl

1.0.0: Never-ending rollouts

23 Sep 09:21

This release marks a significant milestone, capturing the project in a state I deem satisfactory. There has been considerable iteration since the project's inception, leading to the following key features:

  • Training using my own PPO implementation.
  • Implementation of never-ending episodes training, where puzzles are played until termination or random re-initialization.
  • Incorporation of an Actor/Critic model to compensate for the MCTS value estimates that are unavailable in a never-ending rollout.
  • A fully transformer-based model, complete with a pointer network utilizing nn.MultiheadAttention across tiles.
  • Implementation of a fully batched environment on the GPU, with all functions defined over the batch of observations.
  • A model capable of handling multiple puzzle sizes seamlessly, courtesy of the pointer network.
  • Definition of the reward as the simple difference between the new number of matches and the previous count.
  • Notably, any remaining bugs should be of a minor nature, or so I hope.
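
This is not the repository's actual code, but a minimal sketch of how a pointer network built on nn.MultiheadAttention can produce a distribution over a variable number of tiles; the class and tensor names here are hypothetical:

```python
import torch
import torch.nn as nn

class PointerHead(nn.Module):
    """Select a tile by attending from a decoder query over tile embeddings.

    A minimal sketch: the attention weights themselves serve as the
    probability distribution over tiles, so the same head handles any
    number of tiles (and hence any puzzle size).
    """

    def __init__(self, embed_dim: int, num_heads: int = 1):
        super().__init__()
        self.attention = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, query: torch.Tensor, tiles: torch.Tensor) -> torch.Tensor:
        # query: (batch, 1, embed_dim), tiles: (batch, n_tiles, embed_dim).
        # The attention weights sum to 1 over the tiles dimension.
        _, weights = self.attention(query, tiles, tiles, need_weights=True)
        return weights.squeeze(1)  # (batch, n_tiles)

batch, n_tiles, dim = 8, 16, 32
head = PointerHead(dim)
probs = head(torch.randn(batch, 1, dim), torch.randn(batch, n_tiles, dim))
```

Because the output size follows the number of tiles in the input, nothing in the head is tied to a single puzzle size.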

Sadly, in its current state the model isn't able to solve the 4x4 puzzle... It gets stuck at some point and can't recover from a local optimum. After careful consideration, it appears this may be attributed to the fact that, in order to gain the last bit of reward, the model must move a substantial number of tiles. This means it has to go through a lot of negative rewards to reach a new favorable state.
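
The match-difference reward described in the feature list can be sketched as below. The edge encoding and function names are hypothetical stand-ins, not the repository's actual API; the point is that everything is defined over a batch of boards, so it runs vectorized on the GPU:

```python
import torch

def count_matches(boards: torch.Tensor) -> torch.Tensor:
    """Count matching adjacent edges for a batch of boards.

    boards: (batch, height, width, 4) integer edge colors, ordered
    (north, east, south, west) -- a hypothetical encoding.
    Returns a (batch,) tensor of match counts, fully vectorized.
    """
    north, east = boards[..., 0], boards[..., 1]
    south, west = boards[..., 2], boards[..., 3]
    # East edge of each tile vs. west edge of its right neighbor.
    horizontal = (east[:, :, :-1] == west[:, :, 1:]).flatten(1).sum(-1)
    # South edge of each tile vs. north edge of the tile below.
    vertical = (south[:, :-1, :] == north[:, 1:, :]).flatten(1).sum(-1)
    return horizontal + vertical

def reward(prev_boards: torch.Tensor, next_boards: torch.Tensor) -> torch.Tensor:
    # Reward is simply the change in the number of matches.
    return count_matches(next_boards) - count_matches(prev_boards)
```

Under this reward, breaking an already-matched edge yields a negative reward immediately, which is exactly why escaping a local optimum requires absorbing a long run of penalties.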

To circumvent this, the most prudent course of action appears to be reverting to a prior idea I had, wherein the model doesn't search indefinitely over a puzzle but only for a fixed number of steps. At the end of such an episode, the model gets a reward proportional to the number of rightly placed tiles. This approach may help sidestep local minima, as the model is rewarded once, at the episode's end, for all its actions: the cost of moving numerous tiles only becomes apparent at the episode's conclusion. It is a bit as if the model were given a fixed amount of compute to find a solution. I find this less elegant, since that amount of compute has to be defined by hand and must scale with the puzzle's complexity.
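
A minimal sketch of this fixed-horizon alternative, using a hypothetical environment interface (`env.step`, `env.count_correct_tiles`, `env.n_tiles` are stand-ins, not the project's API): intermediate steps carry no reward, and a single terminal reward is proportional to the number of rightly placed tiles.

```python
import torch

def fixed_horizon_rollout(env, policy, max_steps: int) -> torch.Tensor:
    """Roll out for a fixed number of steps, rewarding only at the end.

    `env` and `policy` are hypothetical stand-ins: `env.step` applies a
    batch of actions, and `env.count_correct_tiles()` / `env.n_tiles`
    measure how many tiles are rightly placed.
    """
    observations = env.reset()
    for _ in range(max_steps):
        actions = policy(observations)
        observations = env.step(actions)  # no intermediate reward
    # Single terminal reward in [0, 1], proportional to correct tiles.
    return env.count_correct_tiles() / env.n_tiles
```

Here `max_steps` is precisely the hand-picked compute budget mentioned above, which is what makes the approach feel less elegant.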

The next milestone will be reached once the model is able to easily solve the 4x4 puzzle. That will signify it has managed to escape those local minima. Let's hope it will be for v2.0.0!