Offline & Offline2Online Reinforcement Learning

April 2023: This repository contains experiments with different reinforcement learning algorithms applied to three MuJoCo environments: Walker2d, Hopper and HalfCheetah. Essentially, two models are compared: Adaptive Behavior Cloning Regularization [1] (redq_bc for short) and Supported Policy Optimization for Offline Reinforcement Learning [2] (spot for short).

July-September 2023 update: There are also additional implementations of:

Cal-QL [9]: Logs (wip)
ReBRAC[11]: Logs
EDAC[12]: Logs: EDAC itself, SAC-N[12] (with eta = 0), LB-SAC[16] (with eta = 0 and batch_size = 10_000)
AWAC[13]: Logs
Decision Transformer[14]: Logs
IQL[15]: Logs
Robust IQL [39]
MSG[17]: Logs (This method is built on top of the offline SAC-N algorithm. However, my implementation lacks the hyperparameters needed for the best results.)
PRDC[19]: Logs
DOGE[20]: Logs
BEAR[21]: Logs
SAC-RND[10]: Logs & Implementation
SAC-DRND[32]: Logs & Implementation
RORL: Logs & Implementation (lacks appropriate hyperparameters)
CNF[18]: Logs & Implementation
offline O3F[22]: Logs (implemented for offline learning, which differs from the setting stated in the paper)
XQL[23]: Logs
TD7[24]: Logs
offline TQC[25]: Logs (failed on walker2d-medium-v2)
InAC[26]: Logs
FisherBRC[27]: Logs
Diffusion Q-Learning[28]: Logs
Sparse Q-Learning[29]: Logs
Exponential Q-Learning[29]: Logs (differs from the Sparse Q-Learning entry above by a slightly different update of the value function and actor)
ATAC [31]: Logs (poor results so far; the hyperparameters need tuning)
TD3 BC++ [33]: Logs
STR [34]: Logs (training failed with NaN values)
BPPO [35]
AsymQ [36]
MCQ [37]
DroQ [38]
SR DICE [40]
ODICE [41]
CrossQ [42] (I have not yet found a way to really accelerate it in torch)

At the moment, offline training is implemented for these models. Logs are linked in the list above. Unfortunately, they cover training only, without evaluation: installing MuJoCo was not allowed on the machine, so I trained the models on preloaded pickle and json datasets.

You can also check out my Logs and seem_interface for the SEEM [30] paper.

General setup (April 2023)

I chose these environments from gym because they are MuJoCo tasks, i.e. they require learning the complex underlying structure of the given task with a trade-off between short-term and long-term strategies, and Google Colab doesn't die from them ;). I also used the d4rl [3] library from https://github.com/tinkoff-ai/d4rl as a module to get the offline datasets. The d4rl datasets used for the environments mentioned above are medium and medium-replay. Both models share the same base architecture and training scheme: an actor-critic model [6] combined with Double Q-learning ([7], [8]).
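To illustrate the actor-critic plus Double Q-learning combination mentioned above, here is a minimal sketch of a clipped double-Q TD target. It assumes the common variant that takes the minimum of two target-critic estimates. The function name, arguments and the default discount are illustrative and are not taken from the repository scripts.

```python
def clipped_double_q_target(reward, done, next_q1, next_q2, gamma=0.99):
    """TD target that uses the smaller of two target-critic estimates.

    Assumption: the 'clipped' double-Q form; the actual scripts may
    combine the two critics differently.
    """
    next_q = min(next_q1, next_q2)
    # No bootstrapping from the next state when the episode has ended.
    not_done = 0.0 if done else 1.0
    return reward + gamma * not_done * next_q


# One transition with two critic estimates for the next state-action pair.
target = clipped_double_q_target(reward=1.0, done=False,
                                 next_q1=10.0, next_q2=8.0, gamma=0.5)
print(target)
```

Taking the minimum is a simple way to counter the overestimation that a single critic tends to accumulate when it bootstraps from its own predictions.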

Both models (redq_bc and spot) were first trained on the offline dataset using the Adam optimizer with lr = 3e-4. The same applies to online training. Scripts can be found in the corresponding folders (adaptive_bc and spot).

Models (April 2023)

All available models can be tested in Colab by opening inference.ipynb. Examples of evaluation can be found in the video folder.

Adaptive Behavior Cloning Regularization for Stable Offline-to-Online Reinforcement Learning

redq_bc adaptively weighs the L2 loss associated with the offline dataset distribution during online fine-tuning in order to stabilise training. This loss is built into the architecture to prevent a sudden distribution shift from offline to online data; it is a simple regularisation that requires minimal code changes. The method is located in the adaptive_bc folder; there is also a paper folder with the key points of the paper needed to implement the model. Logs are available at: https://wandb.ai/zzmtsvv/adaptive_bc
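As a rough illustration of the idea, the sketch below computes an actor loss made of a Q-maximisation term plus a weighted L2 behaviour-cloning term. It is a simplified sketch, not the code from the adaptive_bc folder: the function name and arguments are hypothetical, and the rule that adapts the weight alpha during fine-tuning is deliberately left out, so alpha is simply passed in.

```python
def bc_regularised_actor_loss(q_values, policy_actions, dataset_actions, alpha):
    """Mean of -Q(s, pi(s)) plus alpha times the mean squared L2 distance
    between policy actions and dataset actions.

    Assumption: alpha is supplied by the caller; how it is adapted
    online is not shown here.
    """
    n = len(q_values)
    q_term = -sum(q_values) / n
    bc_term = sum(
        sum((p - a) ** 2 for p, a in zip(pi_a, data_a))
        for pi_a, data_a in zip(policy_actions, dataset_actions)
    ) / n
    return q_term + alpha * bc_term


# Toy batch of two transitions with 2-dimensional actions.
loss = bc_regularised_actor_loss(
    q_values=[1.0, 3.0],
    policy_actions=[[0.0, 0.0], [1.0, 1.0]],
    dataset_actions=[[0.0, 1.0], [1.0, 1.0]],
    alpha=0.5,
)
print(loss)
```

A larger alpha keeps the policy closer to the dataset actions, while alpha = 0 reduces the loss to plain Q-maximisation.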

example_redq_bc_walker2d.mp4

Supported Policy Optimization for Offline Reinforcement Learning

spot also mitigates the distribution shift problem, in this case by adding a density-based constraint to the main objective. The offline behavior density is modelled with a Conditional VAE ([4], [5]) that reconstructs the action given a condition (the state in this case). The VAE is trained as usual, and its loss is then used as a regularisation term in offline and online training (online fine-tuning also has an additional cooling component for more stable handling of the distribution shift). The method is located in the spot folder; there is also a paper folder with the key points of the paper needed to implement the model. Tensorboard plots can be seen in the graphs folder.
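The sketch below shows one way the VAE loss can act as a density penalty: the negative ELBO of a Conditional VAE with a diagonal Gaussian latent is added to the actor objective with a weight lam. This is an illustration under assumptions, not the code from the spot folder: the function names, the squared-error reconstruction term, the kl_weight default and the weight lam are all hypothetical, and the cooling schedule is not modelled.

```python
import math


def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) for a diagonal Gaussian."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def negative_elbo(action, reconstructed_action, mu, log_var, kl_weight=0.5):
    """Squared reconstruction error plus weighted KL term.

    Assumption: kl_weight = 0.5 is an illustrative default only.
    """
    recon = sum((a - r) ** 2 for a, r in zip(action, reconstructed_action))
    return recon + kl_weight * gaussian_kl(mu, log_var)


def density_regularised_actor_loss(q_value, neg_elbo, lam):
    """-Q(s, pi(s)) plus lam times the VAE loss evaluated at the policy action."""
    return -q_value + lam * neg_elbo


# Toy example: 2-dimensional action, 1-dimensional latent.
penalty = negative_elbo(action=[1.0, 0.0], reconstructed_action=[0.0, 0.0],
                        mu=[0.0], log_var=[0.0])
print(density_regularised_actor_loss(q_value=2.0, neg_elbo=penalty, lam=0.5))
```

Actions that the VAE reconstructs poorly get a large penalty, which pushes the policy towards actions that are well supported by the offline data.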

example_spot_halfcheetah.mp4

Results (April 2023)

As can be seen from the plots and the example videos, spot performs much better than redq_bc. Intuitively, this may be explained as follows: both works add a regularization term during training, but the density-based support constraint in spot can capture the support of the offline distribution more successfully than the L2 term in redq_bc because of its greater complexity. Furthermore, additional research on the latent space of the VAE could potentially have an impact on the offline2online field.

References
