Code for the proposed Reinforcement Learning with an Assistant Reward Agent (ReLara) algorithm.
The framework of the ReLara algorithm is illustrated below.

ReLara involves two agents: a policy agent (PA) and a reward agent (RA). The RA proposes auxiliary reward signals that are combined with the sparse environment rewards, and the PA learns a policy that maximizes the resulting augmented rewards.
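The core interaction can be sketched as follows. This is a minimal illustrative sketch, not the repository's implementation: the function and attribute names are hypothetical, and the exact form of the reward combination (here a simple weighted sum controlled by `--beta` and `--proposed-reward-scale`) should be checked against the paper and the code.

```python
# Minimal sketch of one ReLara interaction step (hypothetical names, illustration only).
def relara_step(env, policy_agent, reward_agent, state, beta=0.2, reward_scale=1.0):
    # The policy agent (PA) selects an action for the current state.
    action = policy_agent.select_action(state)

    # The reward agent (RA) proposes an auxiliary reward for this state-action pair.
    proposed_reward = reward_scale * reward_agent.propose_reward(state, action)

    # The environment returns the (typically sparse) extrinsic reward.
    next_state, env_reward, terminated, truncated, info = env.step(action)

    # Assumed combination: the PA is trained on an augmented reward that mixes
    # the environment reward with the RA's proposed reward.
    augmented_reward = env_reward + beta * proposed_reward

    return next_state, augmented_reward, terminated, truncated
```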
- This code has been tested with `pytorch==2.0.1+cu117`.
- Install all required packages:

  ```bash
  pip3 install -r requirements.txt
  ```
Run the following command to train ReLara on the task specified by `<Task ID>`:

```bash
python run-ReLara.py --env-id <Task ID>
```
All available environments with sparse rewards evaluated in our paper are listed below; a minimal example of instantiating one of them follows the list:
- Mujoco-Sparse:
  - `MyMujoco/Ant-Height-Sparse`: the AntStand task.
  - `MyMujoco/Ant-Speed-Sparse`: the AntSpeed task.
  - `MyMujoco/Ant-Far-Sparse`: the AntFar task.
  - `MyMujoco/Ant-Very-Far-Sparse`: the AntVeryFar task.
  - `MyMujoco/Walker2d-Keep-Sparse`: the WalkerKeep task.
  - `MyMujoco/Humanoid-Keep-Sparse`: the HumanKeep task.
  - `MyMujoco/HumanoidStandup-Sparse`: the HumanStand task.
- Robotics-Sparse:
  - `MyFetchRobot/Reach-Jnt-Sparse-v0`: the RobotReach task.
  - `MyFetchRobot/Push-Jnt-Sparse-v0`: the RobotPush task.
- Classic control:
  - `MountainCarContinuous-v0`: the MountainCar task.
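The environment IDs above follow the standard Gymnasium naming convention and can be instantiated through the Gymnasium API. The snippet below is a hedged example using the built-in MountainCar task; the custom `MyMujoco/...` and `MyFetchRobot/...` IDs are only available after this repository's environment-registration code has been imported (the exact module to import is not shown here).

```python
import gymnasium as gym

# Built-in task from the list above; the custom sparse-reward tasks additionally
# require the repository's own registration code to be imported first.
env = gym.make("MountainCarContinuous-v0")

obs, info = env.reset(seed=0)
action = env.action_space.sample()
obs, reward, terminated, truncated, info = env.step(action)
env.close()
```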
All hyperparameters are set to default values in the code. You can override them with command-line arguments; all available arguments are listed below, followed by an example command:
- `--exp-name`: the name of the experiment, used for TensorBoard logging and for saving the model.
- `--env-id`: the task ID.
- `--seed`: the random seed.
- `--cuda`: the CUDA device, default is 0; the code automatically falls back to "cpu" if CUDA is not available.
- `--gamma`: the discount factor.
- `--total-timesteps`: the total number of timesteps to train the agent.
- `--pa-learning-starts`: the number of burn-in steps of the policy agent.
- `--ra-learning-starts`: the number of burn-in steps of the reward agent.
- `--proposed-reward-scale`: the scale of the proposed reward, default is 1.
- `--beta`: the weight of the proposed reward, default is 0.2.
- `--pa-buffer-size`: the replay buffer size of the policy agent.
- `--pa-rb-optimize-memory`: whether to use the memory-optimized replay buffer for the policy agent.
- `--pa-batch-size`: the batch size of the policy agent.
- `--ra-buffer-size`: the replay buffer size of the reward agent.
- `--ra-rb-optimize-memory`: whether to use the memory-optimized replay buffer for the reward agent.
- `--ra-batch-size`: the batch size of the reward agent.
- `--pa-actor-lr`: the learning rate of the policy agent's actor.
- `--pa-critic-lr`: the learning rate of the policy agent's critic.
- `--pa-alpha-lr`: the learning rate of the policy agent's entropy coefficient (alpha).
- `--ra-actor-lr`: the learning rate of the reward agent's actor.
- `--ra-critic-lr`: the learning rate of the reward agent's critic.
- `--ra-alpha-lr`: the learning rate of the reward agent's entropy coefficient (alpha).
- `--pa-policy-frequency`: the policy update frequency of the policy agent.
- `--pa-target-frequency`: the target network update frequency of the policy agent.
- `--pa-tau`: the soft-update coefficient (tau) for the policy agent's target networks.
- `--ra-policy-frequency`: the policy update frequency of the reward agent.
- `--ra-target-frequency`: the target network update frequency of the reward agent.
- `--ra-tau`: the soft-update coefficient (tau) for the reward agent's target networks.
- `--pa-alpha`: the entropy coefficient (alpha) of the policy agent.
- `--pa-alpha-autotune`: whether to automatically tune the policy agent's alpha.
- `--ra-alpha`: the entropy coefficient (alpha) of the reward agent.
- `--ra-alpha-autotune`: whether to automatically tune the reward agent's alpha.
- `--write-frequency`: the frequency of writing to TensorBoard.
- `--save-frequency`: the frequency of saving checkpoints.
- `--save-folder`: the folder in which to save the model.
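For example, a run that overrides a few of these defaults might look like `python run-ReLara.py --env-id MountainCarContinuous-v0 --seed 1 --total-timesteps 1000000 --beta 0.1` (the specific values here are illustrative, not recommended settings).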
The comparison of ReLara with several baselines, including ROSA (Mguni et al., 2023), ExploRS (Devidze et al., 2022), SAC (Haarnoja et al., 2018), TD3 (Fujimoto et al., 2018), RND (Burda et al., 2018), and PPO (Schulman et al., 2017), is shown below: