BEAR-QL. Bootstrapping Error Accumulation Reduction Q-Learning (BEAR) [1] is an actor-critic algorithm which builds on the core idea of BCQ but, instead of using a perturbation model, samples actions from a learned actor. As in BCQ, BEAR trains a generative model $G_\omega$ of the data distribution in the batch. Using the generative model $G_\omega$, the actor $\pi_\phi$ is trained using the deterministic policy gradient, while minimizing the variance over an ensemble of $K$ Q-networks $\{Q_{\theta_k}\}_{k=1}^{K}$ and constraining the maximum mean discrepancy (MMD) between $G_\omega$ and $\pi_\phi$ through dual gradient descent:

$$\phi \leftarrow \operatorname*{argmax}_{\phi} \; \mathbb{E}_{s \sim B} \left[ Q_{\theta_1}(s, a) - \tau \operatorname{var}_{k} Q_{\theta_k}(s, a) \right] \quad \text{s.t.} \quad \operatorname{MMD}\!\left(G_\omega(s), \pi_\phi(s)\right) \le \epsilon,$$
where $a \sim \pi_\phi(s)$, $\tau$ weights the variance penalty, and the MMD is computed over some choice of kernel. The update rule for the ensemble of Q-networks matches BCQ, except that the actions are sampled from the single actor network rather than sampled from a generative model and then perturbed:

$$r + \gamma \max_{a_i} \left[ \lambda \min_{k} Q_{\theta'_k}(s', a_i) + (1 - \lambda) \max_{k} Q_{\theta'_k}(s', a_i) \right], \qquad a_i \sim \pi_\phi(s').$$
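As a concrete illustration of the constrained actor step and the ensemble target above, a minimal PyTorch-style sketch is given below. The networks `actor` ($\pi_\phi$), `vae` ($G_\omega$, assumed to expose a `decode` sampler), the critic lists, the Gaussian-kernel MMD estimator, and hyperparameters such as `tau`, `eps` and `lam` are all illustrative assumptions rather than a reference implementation.

```python
# Sketch only: names, shapes and hyperparameters are assumptions, not the
# reference BEAR implementation. Each critic is assumed to map
# (states, actions) -> a (batch,) tensor of Q-values.
import torch


def gaussian_mmd(x, y, sigma=20.0):
    """MMD^2 between two sets of action samples of shape (batch, n, act_dim),
    using a Gaussian kernel as one possible choice of kernel."""
    def k(a, b):
        diff = a.unsqueeze(2) - b.unsqueeze(1)             # (batch, n, m, act_dim)
        return torch.exp(-diff.pow(2).sum(-1) / (2.0 * sigma))
    return k(x, x).mean((1, 2)) + k(y, y).mean((1, 2)) - 2.0 * k(x, y).mean((1, 2))


def actor_update(states, actor, vae, critics, log_alpha, actor_opt, alpha_opt,
                 tau=0.4, eps=0.05, n_mmd=4):
    """Dual gradient descent on max_phi E[Q_theta1 - tau * var_k Q_thetak]
    subject to MMD(G_omega(s), pi_phi(s)) <= eps."""
    # Reparameterized action samples from the actor, and samples from G_omega.
    pi_actions = torch.stack([actor(states) for _ in range(n_mmd)], dim=1)
    with torch.no_grad():
        data_actions = torch.stack([vae.decode(states) for _ in range(n_mmd)], dim=1)

    # Conservative objective: first Q-network minus a variance penalty.
    qs = torch.stack([q(states, pi_actions[:, 0]) for q in critics], dim=0)
    q_conservative = qs[0] - tau * qs.var(dim=0)

    mmd = gaussian_mmd(data_actions, pi_actions)           # constraint value per state
    alpha = log_alpha.exp().detach()                       # multiplier, fixed for the primal step

    # Primal step: ascend the Q objective while paying alpha * MMD.
    actor_loss = (-q_conservative + alpha * mmd).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Dual step: increase alpha when the MMD constraint is violated.
    alpha_loss = -(log_alpha.exp() * (mmd.detach() - eps)).mean()
    alpha_opt.zero_grad()
    alpha_loss.backward()
    alpha_opt.step()


def q_target(rewards, not_done, next_states, actor, target_critics,
             gamma=0.99, lam=0.75, n_candidates=10):
    """BCQ-style target with candidate actions drawn from the actor:
    r + gamma * max_ai [lam * min_k Q'_k(s', ai) + (1 - lam) * max_k Q'_k(s', ai)].
    `rewards` and `not_done` are assumed to be 1-D (batch,) tensors."""
    with torch.no_grad():
        rep = next_states.repeat_interleave(n_candidates, dim=0)
        candidates = actor(rep)                            # a_i ~ pi_phi(s')
        qs = torch.stack([q(rep, candidates) for q in target_critics], dim=0)
        mixed = lam * qs.min(dim=0).values + (1.0 - lam) * qs.max(dim=0).values
        best = mixed.view(-1, n_candidates).max(dim=1).values
        return rewards + gamma * not_done * best
```

Detaching the multiplier in the primal step and updating `log_alpha` on the constraint violation keeps the two halves of the dual gradient descent separate, mirroring the alternating updates described above.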
The policy used during evaluation is defined similarly to BCQ, but again samples actions directly from the actor:

$$\pi(s) = \operatorname*{argmax}_{a_i} Q_{\theta_1}(s, a_i), \qquad a_i \sim \pi_\phi(s).$$
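A corresponding sketch of this evaluation-time action selection, again with a hypothetical `actor` ($\pi_\phi$) and first Q-network `q1` and an assumed candidate count:

```python
import torch


def select_action(state, actor, q1, n_candidates=10):
    """Evaluation policy: sample candidate actions from the actor and execute
    the one ranked highest by the first Q-network, pi(s) = argmax_ai Q_theta1(s, ai).
    `state` is a 1-D tensor; `q1` is assumed to map (states, actions) -> (batch,)."""
    with torch.no_grad():
        states = state.unsqueeze(0).expand(n_candidates, -1)   # repeat s for each candidate
        candidates = actor(states)                             # a_i ~ pi_phi(s)
        return candidates[q1(states, candidates).argmax()]
```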
Example command for training BEAR on a D4RL dataset:

```bash
python bear-train.py --dataset=walker2d-random-v2 --seed=0 --gpu=0
```
- [1] Kumar A, Fu J, Soh M, et al. Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction. Advances in Neural Information Processing Systems, 2019, 32: 11784-11794.