April 2023: This repository contains experiments with different reinforcement learning algorithms applied to 3 MuJoCo environments: Walker2d, Hopper and HalfCheetah. Essentially, there are 2 models in comparison: Adaptive Behavior Cloning Regularization [1] (in short, `redq_bc`) and Supported Policy Optimization for Offline Reinforcement Learning [2] (in short, `spot`).
July-September 2023 update: there are also additional implementations of:
- Cal-QL [9]: Logs (wip)
- ReBRAC [11]: Logs
- EDAC [12]: Logs: EDAC itself, SAC-N [12] (with `eta = 0`), LB-SAC [16] (with `eta = 0` and `batch_size = 10_000`); see the sketch of the diversity term after this list
- AWAC [13]: Logs
- Decision Transformer [14]: Logs
- IQL [15]: Logs
- Robust IQL [39]
- MSG [17]: Logs (this method is built on top of the offline SAC-N algorithm; however, my implementation lacks appropriate hyperparameters for best results)
- PRDC [19]: Logs
- DOGE [20]: Logs
- BEAR [21]: Logs
- SAC-RND [10]: Logs & Implementation
- SAC-DRND [32]: Logs & Implementation
- RORL: Logs & Implementation (lacks appropriate hyperparameters)
- CNF [18]: Logs & Implementation
- offline O3F [22]: Logs (implemented for offline learning, not for the setting stated in the paper)
- XQL [23]: Logs
- TD7 [24]: Logs
- offline TQC [25]: Logs (failed on `walker2d-medium-v2`)
- InAC [26]: Logs
- FisherBRC [27]: Logs
- Diffusion Q-Learning [28]: Logs
- Sparse Q-Learning [29]: Logs
- Exponential Q-Learning [29]: Logs (differs from SQL mentioned above by slightly different updates of the value function and the actor)
- ATAC [31]: Logs (poor results so far; hyperparameters need tuning)
- TD3 BC++ [33]: Logs
- STR [34]: Logs (failed with `nan` values)
- BPPO [35]
- AsymQ [36]
- MCQ [37]
- DroQ [38]
- SR DICE [40]
- ODICE [41]
- CrossQ [42] (I haven't yet found a way to properly accelerate it in `torch`)
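For reference, here is a minimal sketch (not the repository code) of the ensemble-diversity term that distinguishes EDAC from SAC-N: it penalises alignment between per-critic action gradients, and `eta = 0` switches it off, recovering plain SAC-N. The `critics` argument is assumed to be an iterable of state-action Q-networks.

```python
import torch
import torch.nn.functional as F

def edac_diversity_term(critics, states, actions, eta: float) -> torch.Tensor:
    # eta = 0 disables the term, which is exactly the SAC-N setting used above.
    if eta == 0.0:
        return actions.new_zeros(())
    grads = []
    for critic in critics:
        a = actions.detach().clone().requires_grad_(True)
        q = critic(states, a).sum()
        # gradient of Q_i w.r.t. the action, kept in the graph so the term
        # can be backpropagated into the critic parameters
        (g,) = torch.autograd.grad(q, a, create_graph=True)
        grads.append(F.normalize(g, dim=-1))
    g = torch.stack(grads, dim=1)                 # (batch, num_critics, action_dim)
    cos = g @ g.transpose(1, 2)                   # pairwise cosine similarities
    off_diag = cos - torch.diag_embed(torch.diagonal(cos, dim1=-2, dim2=-1))
    return eta * off_diag.sum(dim=(1, 2)).mean() / (len(critics) - 1)
```

In the critic update this term is simply added to the Bellman loss of the ensemble.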
At the moment, offline training is implemented for these models. Logs are linked above (training only, unfortunately, without evaluation: installing MuJoCo on the machine was not allowed, so the models were trained with preloaded pickle and JSON datasets).
You can also check out my Logs and `seem_interface` for the SEEM [30] paper.
I've chosen these datasets from gym as they are from MuJoCo, i.e. they require learning a complex underlying structure of the given task with a trade-off between short-term and long-term strategies, and Google Colab doesn't die from them ;). I have also used the `d4rl` [3] library at https://github.com/tinkoff-ai/d4rl as a module to get the offline datasets. Datasets used from `d4rl` for the environments mentioned above: `medium` and `medium-replay`. Both models have the same base structure in architecture and training: an actor-critic model [6] combined with Double Q-learning ([7], [8]).
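For example, the offline data for one environment/quality pair can be loaded as in the minimal sketch below (it assumes a working MuJoCo + `d4rl` installation; in this repository the same data was instead consumed from the preloaded pickle/JSON files mentioned above):

```python
import gym
import d4rl  # noqa: F401 -- registers the walker2d/hopper/halfcheetah offline envs in gym

env = gym.make("walker2d-medium-replay-v2")   # or hopper-/halfcheetah-, medium / medium-replay
dataset = d4rl.qlearning_dataset(env)         # dict of numpy arrays

# transitions used for offline training
observations = dataset["observations"]
actions = dataset["actions"]
rewards = dataset["rewards"]
next_observations = dataset["next_observations"]
terminals = dataset["terminals"]
```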
Models (both `redq_bc` and `spot`) were first trained on this offline dataset using the `Adam` optimizer with `lr = 3e-4`; the same setup was used for online training. Scripts can be found in the corresponding folders (`adaptive_bc` and `spot`).
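The optimizer setup boils down to the following minimal sketch (`actor` and `critic` stand for the corresponding `torch.nn.Module` networks; the helper is illustrative, not the repository code):

```python
import torch

# one optimizer per network, both with lr = 3e-4
actor_optimizer = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_optimizer = torch.optim.Adam(critic.parameters(), lr=3e-4)

def gradient_step(optimizer: torch.optim.Optimizer, loss: torch.Tensor) -> None:
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```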
All available models can be tested in Colab by opening `inference.ipynb`. Examples of evaluation can be found in the `video` folder.
`redq_bc` adaptively weighs the L2 loss associated with the offline dataset distribution during online fine-tuning in order to stabilise training. This loss is built into the architecture to prevent a sudden distribution shift from offline to online data, with a simple regularisation that requires minimal code changes (the method is located in the `adaptive_bc` folder; there is also a `paper` folder with key points from the corresponding paper used to implement the model). Logs are available at: https://wandb.ai/zzmtsvv/adaptive_bc
example_redq_bc_walker2d.mp4
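Roughly, the actor objective follows the minimal sketch below (not the repository code): a TD3+BC-style policy loss where the weight `alpha` of the L2 term is adjusted during online fine-tuning. The multiplicative, return-driven schedule in `update_alpha` is an assumption for illustration only.

```python
import torch
import torch.nn.functional as F

def actor_loss(actor, critic, states, dataset_actions, alpha: float) -> torch.Tensor:
    pi = actor(states)
    q = critic(states, pi)
    bc_term = F.mse_loss(pi, dataset_actions)      # L2 pull towards the offline data
    lmbda = 1.0 / q.abs().mean().detach()          # TD3+BC-style Q normalisation
    return -(lmbda * q).mean() + alpha * bc_term

def update_alpha(alpha: float, episode_return: float, return_moving_avg: float,
                 step: float = 1.05, lo: float = 1e-3, hi: float = 1.0) -> float:
    # Assumed schedule: relax the constraint when online returns improve on
    # their moving average, tighten it when they degrade.
    alpha = alpha / step if episode_return >= return_moving_avg else alpha * step
    return float(min(max(alpha, lo), hi))
```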
`spot` is also implemented to mitigate the distribution-shift problem, by adding a density-based constraint to the main objective. The offline behavior density is modelled with a conditional VAE ([4], [5]) that reconstructs the action jointly with the condition (the state in this case). The VAE is trained as usual, and its loss is then used as a regularisation term in offline and online training (there is also an additional cooling component in online fine-tuning for more stable handling of the distribution shift). The method is located in the `spot` folder; there is also a `paper` folder with key points from the corresponding paper used to implement the model. Tensorboard plots can be seen in the `graphs` folder.
example_spot_halfcheetah.mp4
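A minimal sketch (not the repository code) of the two pieces described above: the conditional VAE loss used to model the behavior density, and an actor objective penalised by the negative ELBO of the policy action under that VAE. The `cvae(actions, states)` forward signature, the `negative_elbo` helper and the `lam` weight are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def cvae_loss(cvae, actions, states, beta: float = 0.5) -> torch.Tensor:
    # Standard CVAE objective: reconstruct the action conditioned on the state.
    recon, mean, log_std = cvae(actions, states)           # assumed forward signature
    recon_loss = F.mse_loss(recon, actions)
    kl = -0.5 * (1 + 2 * log_std - mean.pow(2) - (2 * log_std).exp()).sum(dim=-1).mean()
    return recon_loss + beta * kl

def spot_actor_loss(actor, critic, cvae, states, lam: float) -> torch.Tensor:
    pi = actor(states)
    q = critic(states, pi)
    # Negative ELBO of pi(s) under the behavior CVAE serves as a density
    # penalty: actions far from the offline support are discouraged. During
    # online fine-tuning `lam` is gradually decreased ("cooled down").
    neg_elbo = cvae.negative_elbo(pi, states)              # assumed helper
    return (-q + lam * neg_elbo).mean()
```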
As can be seen from the plots and the concrete examples in the videos, `spot` performs much better than `redq_bc`. Intuitively, this can be connected with the fact that both works bring an additional regularization term during training; the density-based support constraint defined in `spot` handles the offline distribution support more successfully than the L2 term in `redq_bc` due to its greater expressiveness. Furthermore, additional research on the latent space of the VAE could potentially benefit the offline-to-online field.
[1] - Yi Zhao et al. (2022). Adaptive Behavior Cloning Regularization for Stable Offline-to-Online Reinforcement Learning.
[2] - Jialong Wu et al. (2022). Supported Policy Optimization for Offline Reinforcement Learning.
[3] - Justin Fu et al. (2021). D4RL: Datasets for Deep Data-driven Reinforcement Learning.
[4] - Kingma, Welling et al. (2014). Auto-Encoding Variational Bayes.
[5] - Sohn, Lee, Yan et al. (2015). Learning Structured Output Representation using Deep Conditional Generative Models.
[6] - Lillicrap, Hunt et al. (2015). Continuous Control With Deep Reinforcement Learning.
[7] - Mnih et al. (2013). Playing Atari with Deep Reinforcement Learning.
[8] - Fujimoto et al. (2018). Addressing Function Approximation Error in Actor-Critic Methods.
[9] - Nakamoto, Zhai et al. (2023). Cal-QL: Calibrated Offline RL Pre-Training for Efficient Online Fine-Tuning.
[10] - Nikulin, Kurenkov et al. (2023). Anti-Exploration by Random Network Distillation.
[11] - Tarasov, Kurenkov et al. (2023). Revisiting the Minimalist Approach to Offline Reinforcement Learning.
[12] - An, Moon et al. (2021). Uncertainty-Based Offline Reinforcement Learning with Diversified Q-Ensemble.
[13] - Nair, Gupta et al. (2021). AWAC: Accelerating Online Reinforcement Learning with Offline Datasets.
[14] - Chen, Lu et al. (2021). Decision Transformer: Reinforcement Learning via Sequence Modeling.
[15] - Kostrikov, Nair et al. (2021). Offline Reinforcement Learning with Implicit Q-Learning.
[16] - Nikulin, Kurenkov et al. (2022). Q-Ensemble for Offline RL: Don't Scale the Ensemble, Scale the Batch Size.
[17] - Ghasemipour, Gu et al. (2022). Why So Pessimistic? Estimating Uncertainties for Offline RL through Ensembles, and Why Their Independence Matters.
[18] - Akimov, Kurenkov et al. (2023). Let Offline RL Flow: Training Conservative Agents in the Latent Space of Normalizing Flows.
[19] - Ran, Li et al. (2023). Policy Regularization with Dataset Constraint for Offline Reinforcement Learning.
[20] - Li, Zhan et al. (2023). When Data Geometry Meets Deep Function: Generalizing Offline Reinforcement Learning.
[21] - Kumar, Fu et al. (2019). Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction.
[22] - Mark, Ghadirzadeh et al. (2023). Fine-Tuning Offline Policies With Optimistic Action Selection.
[23] - Garg, Hejna et al. (2023). Extreme Q-Learning: MaxEnt RL without Entropy.
[24] - Fujimoto, Chang et al. (2023). For SALE: State-Action Representation Learning for Deep Reinforcement Learning.
[25] - Kuznetsov, Shvechikov et al. (2020). Controlling Overestimation Bias with Truncated Mixture of Continuous Distributional Quantile Critics.
[26] - Xiao, Wang et al. (2023). The In-Sample Softmax for Offline Reinforcement Learning.
[27] - Kostrikov, Tompson et al. (2021). Offline Reinforcement Learning with Fisher Divergence Critic Regularization.
[28] - Wang, Hunt, Zhou (2023). Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning.
[29] - Xu, Jiang et al. (2023). Offline RL with No OOD Actions: In-Sample Learning via Implicit Value Regularization.
[30] - Yue, Lu et al. (2023). Understanding, Predicting and Better Resolving Q-Value Divergence in Offline-RL.
[31] - Cheng, Xie et al. (2022). Adversarially Trained Actor Critic for Offline Reinforcement Learning.
[32] - Yang, Tao et al. (2024). Exploration and Anti-Exploration with Distributional Random Network Distillation.
[33] - Gao, Xu et al. (2022). Robust Offline Reinforcement Learning with Gradient Penalty and Constraint Relaxation.
[34] - Mao, Zhang et al. (2023). Supported Trust Region Optimization for Offline Reinforcement Learning.
[35] - Zhuang, Lei et al. (2023). Behavior Proximal Policy Optimization.
[36] - Zhang, Krishna et al. (2023). AsymQ: Asymmetric Q-Loss to Mitigate Over-Estimation Bias in Off-Policy Reinforcement Learning.
[37] - Lyu, Ma et al. (2022). Mildly Conservative Q-Learning for Offline Reinforcement Learning.
[38] - Hiraoka et al. (2022). Dropout Q-Functions for Doubly Efficient Reinforcement Learning.
[39] - Yang et al. (2023). Towards Robust Offline Reinforcement Learning under Diverse Data Corruption.
[40] - Fujimoto et al. (2023). A Deep Reinforcement Learning Approach to Marginalized Importance Sampling with the Successor Representation.
[41] - Mao, Xu et al. (2024). ODICE: Revealing the Mystery of Distribution Correction Estimation via Orthogonal-gradient Update.
[42] - Bhatt, Palenicek et al. (2023). CrossQ: Batch Normalization in Deep Reinforcement Learning for Greater Sample Efficiency and Simplicity.