Deadlock when running model.learn on a SubprocVecEnv #1814
Labels
- check the checklist: You have checked the required items in the checklist but you didn't do what is written...
- custom gym env: Issue related to Custom Gym Env
🐛 Bug
When running model.learn on a SubprocVecEnv as follows:
```python
from stable_baselines3 import SAC
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv

# ENV_ID, cpus, kwargs, N_TIMESTEPS and eval_callback are defined elsewhere in the script
env = make_vec_env(ENV_ID, n_envs=cpus, vec_env_cls=SubprocVecEnv, vec_env_kwargs=dict(start_method="spawn"))
model = SAC(**kwargs, env=env)
model.learn(N_TIMESTEPS, callback=eval_callback)
```
the program ends up in a deadlock. This is likely because I am using a custom environment that runs Julia code via juliacall. A workaround is to replace CloudpickleWrapper with a DillWrapper, defined as follows in stable_baselines3/common/vec_env/base_vec_env.py:
```python
from typing import Any

import dill


class DillWrapper:
    def __init__(self, var: Any):
        self.var = var
```
Then use DillWrapper instead of CloudpickleWrapper in stable_baselines3/common/vec_env/subproc_vec_env.py (see the sketch below).
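As written above, the wrapper only stores its variable; for it to actually serialize with dill it also needs `__getstate__`/`__setstate__` hooks mirroring those of `CloudpickleWrapper`. A minimal sketch of the complete wrapper, assuming the same interface as `CloudpickleWrapper`:

```python
from typing import Any

import dill


class DillWrapper:
    """Like CloudpickleWrapper, but serializes its contents with dill."""

    def __init__(self, var: Any):
        self.var = var

    def __getstate__(self) -> Any:
        return dill.dumps(self.var)

    def __setstate__(self, var: Any) -> None:
        self.var = dill.loads(var)
```

In stable_baselines3/common/vec_env/subproc_vec_env.py, the env factories are wrapped in `SubprocVecEnv.__init__` before being handed to the worker processes; the swap would look roughly like this (paraphrased, not the exact SB3 source):

```python
for work_remote, remote, env_fn in zip(self.work_remotes, self.remotes, env_fns):
    args = (work_remote, remote, DillWrapper(env_fn))  # previously CloudpickleWrapper(env_fn)
    process = ctx.Process(target=_worker, args=args, daemon=True)
    process.start()
```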
By doing this, the deadlock seems to occur less often, but the problem is not completely solved.
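For context on why the serialization wrapper matters here: with the "spawn" start method, each wrapped env factory is pickled and shipped to its worker process, so any Julia state reachable from that factory has to survive serialization. A purely hypothetical, minimal sketch (not the actual KiteEnv.py) of a Gymnasium env that calls into Julia via juliacall:

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class JuliaToyEnv(gym.Env):
    """Hypothetical toy env whose dynamics are evaluated in Julia via juliacall."""

    def __init__(self):
        # The Julia runtime is started when the env is constructed, i.e. inside
        # the worker process after the factory has been unpickled there.
        from juliacall import Main as jl  # requires a working Julia installation
        self._jl = jl
        self.observation_space = spaces.Box(-1.0, 1.0, shape=(1,), dtype=np.float32)
        self.action_space = spaces.Box(-1.0, 1.0, shape=(1,), dtype=np.float32)
        self._state = np.zeros(1, dtype=np.float32)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self._state[:] = 0.0
        return self._state.copy(), {}

    def step(self, action):
        # Toy dynamics: push the action through Julia's sin.
        self._state[0] = float(self._jl.sin(float(action[0])))
        reward = -abs(float(self._state[0]))
        return self._state.copy(), reward, False, False, {}
```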
I am running the code with the following command:
```bash
apptainer exec \
    --env PYTHON_JULIACALL_SYSIMAGE=/cluster/home/bartva/BOAT/Simulation/Kite/trainer/.JlEnv.so \
    --bind /cluster/work/bartva \
    --nv \
    .kite-app.sif python -u BOAT/Simulation/Kite/trainer/hyperparam-tuning.py \
        --trials 500 \
        --startup_trials 5 \
        --evaluations 5 \
        --steps 200000 \
        --episodes 10 \
        --cpus 0 \
        --verbose 2
```
Code example
KiteEnv.py:
hyperparam-tuning.py:
Environment.jl (Has to be built as a custom package!)
Relevant log output / Error message
After 1-2 trials, there is a deadlock. When using the DillWrapper, the deadlock only occurs after around 7 trials. When using both the "spawn" start method and the DillWrapper, the problem seems to be solved.
System Info
I am running the code in an Ubuntu Apptainer container on a high-performance cluster:
https://www.hpc.ntnu.no/idun/
Checklist