Parallel collection and evaluation #143

Closed
gliese876b opened this issue Nov 18, 2024 · 29 comments · Fixed by #152

@gliese876b
Contributor

Can _evaluation_loop use SyncDataCollector for non-vectorized envs so that the evaluation is also parallel?

While running on Melting Pot envs, increasing n_envs_per_worker definitely improves execution time, but the evaluation steps take almost 3 times longer than a regular iteration (I have evaluation_episodes: 10) since the evaluation is sequential.

Making test_env a SerialEnv could solve the issue.

@matteobettini
Collaborator

matteobettini commented Nov 24, 2024

Hello!

Thanks for opening this and sorry for the delay in answering.
It gives me the chance to talk about something I have been wanting to address.

In vectorized environments, both collection and evaluation are done using a batch of vectorized environments.

In other environments, right now, both collection and evaluation are done sequentially in the number of environments.

Collection:

SerialEnv(self.config.n_envs_per_worker(self.on_policy), env_func),

Evaluation:
for eval_episode in range(self.config.evaluation_episodes):

Allowing to change both of these to Parallel has long been on the TODO list: #94

This could be as simple as changing SerialEnv to ParallelEnv, but it also has certain implications that have to be checked.
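For illustration, a minimal sketch of the swap being discussed, assuming a single-env constructor env_func (a Gym env is used here purely as a stand-in; BenchMARL builds this per task):

from torchrl.envs import SerialEnv, ParallelEnv
from torchrl.envs.libs.gym import GymEnv

n_envs = 4  # stands in for n_envs_per_worker

def env_func():
    # placeholder single-environment constructor
    return GymEnv("CartPole-v1")

# Steps the n_envs environments one after another in the main process.
serial_env = SerialEnv(n_envs, env_func)

# Same interface, but each environment runs in its own worker process.
parallel_env = ParallelEnv(n_envs, env_func)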

This is at the top of the todo list, so I think I will get to it when I have time.

Regarding your specific case: in Melting Pot, changing n_envs_per_worker should not change much, as it will collect sequentially anyway. Maybe the reason evaluation is so much longer is rendering? Try testing with rendering disabled (it is in the experiment config).

@matteobettini matteobettini pinned this issue Nov 24, 2024
@matteobettini matteobettini changed the title Parallel Evaluation Parallel collection and evaluation Nov 24, 2024
@gliese876b
Contributor Author

Hello!

Thanks for the response. It is good to know that the issue is at the top of the todo list.

You are right that the collection and evaluation are done sequentially.

Just to follow your suggestion, I changed SerialEnv to ParallelEnv, but it led to many errors, so I stopped.

Also, I definitely see execution time improvements when I increase n_envs_per_worker from 2 to 20. But I guess that has something to do with the reset method of the Melting Pot envs.

Here is an example run of IQL on the Harvest env with 10 agents, with off_policy_n_envs_per_worker: 20, evaluation_interval: 50_000, evaluation_episodes: 10, off_policy_collected_frames_per_batch: 2000.

0%|          | 0/2500 [00:00<?, ?it/s].../logger.py:100: UserWarning: No episode terminated this iteration and thus the episode rewards will be NaN, this is normal if your horizon is longer then one iteration. Learning is proceeding fine.The episodes will probably terminate in a future iteration.
  warnings.warn(

mean return = nan:   0%|          | 1/2500 [05:48<241:38:32, 348.10s/it].../logger.py:100: UserWarning: No episode terminated this iteration and thus the episode rewards will be NaN, this is normal if your horizon is longer then one iteration. Learning is proceeding fine.The episodes will probably terminate in a future iteration.
  warnings.warn(

mean return = nan:   0%|          | 2/2500 [06:38<120:12:07, 173.23s/it]
mean return = nan:   0%|          | 3/2500 [07:29<81:18:35, 117.23s/it] 
mean return = nan:   0%|          | 4/2500 [08:20<63:04:10, 90.97s/it] 
mean return = nan:   0%|          | 5/2500 [09:11<53:00:30, 76.49s/it]
mean return = nan:   0%|          | 6/2500 [10:02<46:59:21, 67.83s/it]
mean return = nan:   0%|          | 7/2500 [10:52<43:04:41, 62.21s/it]
mean return = nan:   0%|          | 8/2500 [11:43<40:39:06, 58.73s/it]
mean return = nan:   0%|          | 9/2500 [12:36<39:21:00, 56.87s/it]
mean return = -88.84220123291016:   0%|          | 10/2500 [13:58<44:37:06, 64.51s/it]
mean return = nan:   0%|          | 11/2500 [14:49<41:44:54, 60.38s/it]               
mean return = nan:   0%|          | 12/2500 [15:41<39:56:15, 57.79s/it]
mean return = nan:   1%|          | 13/2500 [16:32<38:38:37, 55.94s/it]
mean return = nan:   1%|          | 14/2500 [17:24<37:44:07, 54.65s/it]
mean return = nan:   1%|          | 15/2500 [18:16<37:08:39, 53.81s/it]
mean return = nan:   1%|          | 16/2500 [19:08<36:42:25, 53.20s/it]
mean return = nan:   1%|          | 17/2500 [20:00<36:24:12, 52.78s/it]
mean return = nan:   1%|          | 18/2500 [20:52<36:17:42, 52.64s/it]
mean return = nan:   1%|          | 19/2500 [21:45<36:22:00, 52.77s/it]
mean return = -108.3295669555664:   1%|          | 20/2500 [23:08<42:33:57, 61.79s/it]
mean return = nan:   1%|          | 21/2500 [23:59<40:23:30, 58.66s/it]               
mean return = nan:   1%|          | 22/2500 [24:51<39:01:50, 56.70s/it]
mean return = nan:   1%|          | 23/2500 [25:43<38:02:08, 55.28s/it]
mean return = nan:   1%|          | 24/2500 [26:35<37:23:44, 54.37s/it]
mean return = nan:   1%|          | 25/2500 [32:32<99:44:40, 145.08s/it]  ------------------> evaluation
mean return = nan:   1%|          | 26/2500 [33:23<80:13:32, 116.74s/it]
mean return = nan:   1%|          | 27/2500 [34:13<66:28:06, 96.76s/it] 
mean return = nan:   1%|          | 28/2500 [35:04<57:02:45, 83.08s/it]
mean return = nan:   1%|          | 29/2500 [35:55<50:19:59, 73.33s/it]
mean return = -130.4561309814453:   1%|          | 30/2500 [37:15<51:42:54, 75.37s/it]

There is also an increase in execution time when episodes end. I guess, in the end, that cancels out the improvement on regular iterations.

@matteobettini
Collaborator

Just to follow your suggestion, I changed SerialEnv to ParallelEnv, but it led to many errors, so I stopped.

Ok, that is what I was afraid of. In theory they should be interchangeable, but in practice they are my first cause of migraines (hence why we only have serial for now). But when I gather some courage I'll look into it.

Regarding the other part of the message: was anything unexpected, or is there something I can help with?

@gliese876b
Contributor Author

Nope. Thanks for the quick responses.
Good luck!

@matteobettini
Collaborator

I'll just keep this open until the feature lands.

@matteobettini matteobettini reopened this Nov 25, 2024
@gliese876b
Contributor Author

I revisited the issue and you were right that switching from SerialEnv to ParallelEnv works!

Apparently, the problem was how I pass some env config params to the env creator function. I guess ParallelEnv does not copy the task config the way SerialEnv does. I changed the way I pass the args and removed the hydra option, and now it works.
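As an aside, a rough sketch of the kind of arg-binding described above, i.e. binding the env config into the creator function as plain picklable values so each ParallelEnv worker can rebuild the env on its own (the env and arguments here are illustrative, not the actual BenchMARL/Melting Pot setup):

from functools import partial

from torchrl.envs import ParallelEnv, TransformedEnv
from torchrl.envs.libs.gym import GymEnv
from torchrl.envs.transforms import StepCounter

def make_task_env(env_name: str, max_steps: int):
    # Each worker process calls this and builds its own env from the bound args.
    return TransformedEnv(GymEnv(env_name), StepCounter(max_steps=max_steps))

env_func = partial(make_task_env, "CartPole-v1", 1000)  # picklable, closure-free creator
env = ParallelEnv(2, env_func)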

@matteobettini
Collaborator

Nice! Would you be able to share your solution in a PR? Also, maybe you could open an issue in torchrl outlining where the serial and parallel envs differ in ways you did not expect.

@gliese876b
Contributor Author

gliese876b commented Dec 1, 2024

Well, in terms of collection time, ParallelEnv improves things a lot.

However, after checking the results, I can see that there is a big change in learning performance. I ran some more tests with the config below (on IQL), only changing SerialEnv to ParallelEnv, and somehow the learning is very poor when I use ParallelEnv.

off_policy_collected_frames_per_batch: 2000
off_policy_n_envs_per_worker: 20
off_policy_n_optimizer_steps: 20
off_policy_train_batch_size: 128
off_policy_memory_size: 20000
off_policy_init_random_frames: 0

I thought the only difference was that SerialEnv steps the 20 envs in sequence whereas ParallelEnv steps them in separate processes. Note that an episode ends only after 1000 steps are taken.

I am not sure whether this originates from MeltingPot and is due to async collection from the envs.

@matteobettini
Collaborator

matteobettini commented Dec 1, 2024

Oh no, that does not sound good. I feared something like this. I'll need to take a look.

We need to identify where this deviation first occurs.

Maybe the first approach would be to test with a non-learned deterministic policy and see if the result differs between the two env types.
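A possible shape for that check, assuming an env_func that builds one non-vectorized env (key layout may differ per task, e.g. group-nested reward keys in multi-agent envs):

import torch
from torchrl.envs import SerialEnv, ParallelEnv

def compare_batched_rollouts(env_func, n_envs=2, max_steps=100, seed=0):
    for cls in (SerialEnv, ParallelEnv):
        env = cls(n_envs, env_func)
        env.set_seed(seed)
        # rollout() with no policy takes random actions; a fixed deterministic
        # policy could instead be passed via the policy= argument
        td = env.rollout(max_steps=max_steps)
        done = td.get(("next", "done"))
        print(f"{cls.__name__}: batch shape {tuple(td.batch_size)}, "
              f"done count {done.sum().item()}")
        env.close()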

@gliese876b
Contributor Author

I think it is not about learning but logging.

When I run with 2 envs, I notice that the episode_reward metric for an agent gets either integer values or values ending in .5 with SerialEnv, whereas it has values like 23.2175006866455 with ParallelEnv.

I checked the batches in either case and saw a strange difference. For episodes of 1000 steps,

  • In the SerialEnv case, the ('next', 'done') key returns a tensor of 999 False values and 1 True at the end.
  • In the ParallelEnv case, the same key returns a tensor of 1000 True values.

I guess the batch is split by these done values, so the episode reward is averaged over 1000 values in ParallelEnv while it is averaged over 2 in SerialEnv.

I pinpointed the problem here: if I remove TransformedEnv, the done keys return as expected.

Any ideas on the reason?
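To illustrate the logging effect being described, a purely synthetic sketch, assuming episode returns are accumulated between done flags:

import torch

def episode_returns(rewards: torch.Tensor, done: torch.Tensor):
    # Split a flat reward stream into per-episode returns at each done=True step.
    returns, acc = [], 0.0
    for r, d in zip(rewards.tolist(), done.tolist()):
        acc += r
        if d:
            returns.append(acc)
            acc = 0.0
    return returns

rewards = torch.ones(1000)  # one reward unit per step, hypothetical

serial_done = torch.zeros(1000, dtype=torch.bool)
serial_done[-1] = True      # 999 False + 1 True -> a single 1000-step episode

parallel_done = torch.ones(1000, dtype=torch.bool)  # 1000 True -> 1000 "episodes"

print(len(episode_returns(rewards, serial_done)))    # 1
print(len(episode_returns(rewards, parallel_done)))  # 1000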

@matteobettini
Collaborator

Do you know what specific transform causes this to happen? ccing @vmoens

@gliese876b
Contributor Author

I tried passing an empty list of Transforms but it led to other errors.

MeltingpotTask has additional transforms but I am not sure they are the reason.

Maybe it is about reset keys? Or done_specs?

@matteobettini
Collaborator

Mhh, I would need a bit more systematic report. If you could pinpoint with more granularity what component causes the deviation, I can take a look. A repro script would also help, although I know it is time consuming. Maybe you can tell me what minimal changes you made to benchmarl main to reach your current state.

@gliese876b
Contributor Author

I tried a fresh clone in a fresh conda env with the following command:

python3 benchmarl/run.py algorithm=iql task=meltingpot/coins model=layers/mp_cnn model@critic_model=layers/mp_cnn

where mp_cnn contains

name: cnn

mlp_num_cells: [128]
mlp_layer_class: torch.nn.Linear
mlp_activation_class: torch.nn.Tanh
mlp_activation_kwargs: null
mlp_norm_class: null
mlp_norm_kwargs: null

cnn_num_cells: [32, 128]
cnn_kernel_sizes: [6, 11]
cnn_strides: [8, 1]
cnn_paddings: [1, 0]
cnn_activation_class: torch.nn.ReLU
cnn_activation_kwargs: null
cnn_norm_class: null
cnn_norm_kwargs: null

However, now this gives the following error:

Traceback (most recent call last):
  File "/home/singh/Desktop/tmp/BenchMARL/benchmarl/run.py", line 37, in hydra_experiment
    experiment = load_experiment_from_hydra(cfg, task_name=task_name)
  File "/home/singh/Desktop/tmp/BenchMARL/benchmarl/hydra_config.py", line 43, in load_experiment_from_hydra
    return Experiment(
  File "/home/singh/Desktop/tmp/BenchMARL/benchmarl/experiment/experiment.py", line 339, in __init__
    self._setup()
  File "/home/singh/Desktop/tmp/BenchMARL/benchmarl/experiment/experiment.py", line 359, in _setup
    self._setup_task()
  File "/home/singh/Desktop/tmp/BenchMARL/benchmarl/experiment/experiment.py", line 416, in _setup_task
    test_env = self.task.get_env_fun(
  File "/home/singh/Desktop/tmp/BenchMARL/benchmarl/environments/meltingpot/common.py", line 87, in <lambda>
    return lambda: MeltingpotEnv(
  File "/home/singh/.local/lib/python3.10/site-packages/torchrl/envs/common.py", line 171, in __call__
    instance: EnvBase = super().__call__(*args, **kwargs)
  File "/home/singh/.local/lib/python3.10/site-packages/torchrl/envs/libs/meltingpot.py", line 574, in __init__
    super().__init__(
  File "/home/singh/.local/lib/python3.10/site-packages/torchrl/envs/libs/meltingpot.py", line 197, in __init__
    super().__init__(**kwargs)
  File "/home/singh/.local/lib/python3.10/site-packages/torchrl/envs/common.py", line 3126, in __init__
    self._env = self._build_env(**kwargs)  # writes the self._env attribute
  File "/home/singh/.local/lib/python3.10/site-packages/torchrl/envs/libs/meltingpot.py", line 590, in _build_env
    from meltingpot import substrate as mp_substrate
  File "/home/singh/.local/lib/python3.10/site-packages/meltingpot/__init__.py", line 18, in <module>
    from meltingpot import bot
  File "/home/singh/.local/lib/python3.10/site-packages/meltingpot/bot.py", line 18, in <module>
    from meltingpot import substrate
  File "/home/singh/.local/lib/python3.10/site-packages/meltingpot/substrate.py", line 21, in <module>
    from meltingpot.utils.substrates import substrate
  File "/home/singh/.local/lib/python3.10/site-packages/meltingpot/utils/substrates/substrate.py", line 19, in <module>
    import chex
  File "/home/singh/.local/lib/python3.10/site-packages/chex/__init__.py", line 17, in <module>
    from chex._src.asserts import assert_axis_dimension
  File "/home/singh/.local/lib/python3.10/site-packages/chex/_src/asserts.py", line 26, in <module>
    from chex._src import asserts_internal as _ai
  File "/home/singh/.local/lib/python3.10/site-packages/chex/_src/asserts_internal.py", line 34, in <module>
    from chex._src import pytypes
  File "/home/singh/.local/lib/python3.10/site-packages/chex/_src/pytypes.py", line 53, in <module>
    Shape = jax.core.Shape
  File "/home/singh/.local/lib/python3.10/site-packages/jax/_src/deprecations.py", line 52, in getattr
    raise AttributeError(message)
AttributeError: jax.core.Shape is deprecated. Use Shape = Sequence[int | Any].

I'll continue debugging on my own branch to make ParallelEnv work for now.

@gliese876b
Contributor Author

Here is an odd observation: when I change the sampling_device from 'cpu' to 'cuda', the done keys are correct as in the SerialEnv setting.

In fact, when I set sampling_device='cpu' and collect_with_grad=True and add the prints below at the end of _setup_collector

print("------1--------")
print(self.env_func().to(self.config.sampling_device).rollout(max_steps=1000).get(('next', 'done')))
print("------2--------")
print(self.env_func().rollout(max_steps=1000).get(('next', 'done')))

It prints

------1--------

tensor([[[True],
         [True],
         [True],
         ...,
         [True],
         [True],
         [True]],

        [[True],
         [True],
         [True],
         ...,
         [True],
         [True],
         [True]]])

------2--------

tensor([[[False],
         [False],
         [False],
         ...,
         [False],
         [False],
         [ True]],

        [[False],
         [False],
         [False],
         ...,
         [False],
         [False],
         [ True]]])

I guess casting the env to the sampling device applies a transformation that messes up the done keys.

@gliese876b
Contributor Author

gliese876b commented Dec 13, 2024

Ok. I got the problem solved.

I was using a custom wrapper for the MeltingPot envs, and I realized that I did not pass the device parameter to the constructor. Once I did, the problem vanished. I guess the device mismatch was the reason.

In the end, just switching from SerialEnv to ParallelEnv works.
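For anyone hitting the same thing, a rough sketch of the kind of fix described (the wrapper name is hypothetical, and this assumes the TorchRL MeltingpotEnv accepts a device keyword like other TorchRL envs):

from torchrl.envs.libs.meltingpot import MeltingpotEnv

class MyMeltingpotWrapper(MeltingpotEnv):  # hypothetical custom wrapper
    def __init__(self, substrate, *, device=None, **kwargs):
        # Previously `device` was silently dropped here, leaving the wrapped env
        # on its default device and causing the done-key mismatch described above.
        super().__init__(substrate, device=device, **kwargs)

# e.g. env_func = lambda: MyMeltingpotWrapper("coins", device="cpu")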

Sorry for the trouble. Thanks for the help!

@matteobettini
Collaborator

Really nice to hear! Feel free to make a PR to enable users to choose parallel or serial if this is easy

@gliese876b
Contributor Author

Sorry that I couldn't respond before.

Thanks for the PR!

I wanted parallel collection since I have a model with an LSTM layer. Now, with ParallelEnv, it gives an error:

  File "/home/adem/miniconda3/envs/bench_env/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/adem/miniconda3/envs/bench_env/lib/python3.11/site-packages/torchrl/_utils.py", line 669, in run
    return mp.Process.run(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/adem/miniconda3/envs/bench_env/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/adem/miniconda3/envs/bench_env/lib/python3.11/site-packages/torchrl/envs/batched_envs.py", line 2215, in _run_worker_pipe_shared_mem
    "next", next_shared_tensordict.select(*next_td_passthrough_keys)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/adem/miniconda3/envs/bench_env/lib/python3.11/site-packages/tensordict/base.py", line 9275, in select
    result = self._select(*keys, inplace=inplace, strict=strict)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/adem/miniconda3/envs/bench_env/lib/python3.11/site-packages/tensordict/_td.py", line 3089, in _select
    source[key] = source[key]._select(
                  ^^^^^^^^^^^^^^^^^^^^
  File "/home/adem/miniconda3/envs/bench_env/lib/python3.11/site-packages/tensordict/_td.py", line 3077, in _select
    val = self._get_str(key, default=None if not strict else NO_DEFAULT)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/adem/miniconda3/envs/bench_env/lib/python3.11/site-packages/tensordict/_td.py", line 2489, in _get_str
    return self._default_get(first_key, default)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/adem/miniconda3/envs/bench_env/lib/python3.11/site-packages/tensordict/base.py", line 4973, in _default_get
    raise KeyError(
KeyError: 'key "_hidden_lstm_h_1" not found in TensorDict with keys [\'observation\', \'reward\']'

I checked the arguments of ParallelEnv but couldn't find a solution yet.

@matteobettini
Collaborator

Got it! I'll try to investigate

@gliese876b
Contributor Author

gliese876b commented Dec 16, 2024

Here is further info:

BatchedEnv from TorchRL collects the keys from the next field of the tensordict as the keys to be passed between the processes.

If I make the data variable empty in the line below,

https://github.com/pytorch/rl/blob/91064bc27d018fc8c489b5d91fa5d39cdcc18000/torchrl/envs/batched_envs.py#L1747

the error disappears and learning is fine with an LSTM model.

Keys like "_hidden_lstm_h_1" are expected in the tensordict from the envs, since the LSTM model puts them there, but somehow the values are missing when the actual transfer is supposed to occur. If I comment out the lines below,

if not training:

the error again disappears and the learning is fine.

Can you check if the implementation of the LSTM layer really requires these?

@matteobettini
Collaborator

What exactly did you comment out? Just line 501?

The LSTM needs to write those values during execution, as they have to be read in the next step; during training they are not needed, as we process a time batch rather than a single step.
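A PyTorch-only sketch of that distinction (not the BenchMARL LSTM module itself):

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

# Execution: one step at a time; the hidden state must be written out and read
# back at the next step (the role played by the "_hidden_lstm_h/c" keys).
h = torch.zeros(1, 1, 16)
c = torch.zeros(1, 1, 16)
for _ in range(5):
    x_t = torch.randn(1, 1, 8)        # (batch, time=1, features)
    out, (h, c) = lstm(x_t, (h, c))

# Training: the whole time batch is processed in one call, so no per-step
# hidden-state carry is needed.
x_seq = torch.randn(1, 5, 8)          # (batch, time=5, features)
out_seq, _ = lstm(x_seq)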

@gliese876b
Contributor Author

gliese876b commented Dec 17, 2024

I understand that the LSTM needs the values to work properly.
If I comment out the if clause with its body at line 501 in lstm.py, or make data empty at line 1747 in batched_envs.py, the error disappears. I did these just to identify the origin of the problem.

So, these values need to be passed between processes (copies of the env), but they don't exist in the tensordict that is passed to the _step() method in BatchedEnv.

The update_() here
https://github.com/pytorch/rl/blob/91064bc27d018fc8c489b5d91fa5d39cdcc18000/torchrl/envs/batched_envs.py#L1751
does not work since the keys "_hidden_lstm_h_1" and "_hidden_lstm_c_1" do not exist in shared_tensordict_parent.

This passthrough section does not exist in SerialEnv. I guess that is why a model with LSTM does not create an issue with SerialEnv.

Thanks for the help :)

@matteobettini
Collaborator

@gliese876b try now; it should work.

@gliese876b
Contributor Author

@matteobettini, thanks!

I ran a test with the same model configs on the branch after your commits.

It gets the error below:

Traceback (most recent call last):
  File "/home/singh/Desktop/tmp/BenchMARL/benchmarl/run.py", line 37, in hydra_experiment
    experiment = load_experiment_from_hydra(cfg, task_name=task_name)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/singh/Desktop/tmp/BenchMARL/benchmarl/hydra_config.py", line 43, in load_experiment_from_hydra
    return Experiment(
           ^^^^^^^^^^^
  File "/home/singh/Desktop/tmp/BenchMARL/benchmarl/experiment/experiment.py", line 350, in __init__
    self._setup()
  File "/home/singh/Desktop/tmp/BenchMARL/benchmarl/experiment/experiment.py", line 372, in _setup
    self._setup_collector()
  File "/home/singh/Desktop/tmp/BenchMARL/benchmarl/experiment/experiment.py", line 514, in _setup_collector
    self.collector = SyncDataCollector(
                     ^^^^^^^^^^^^^^^^^^
  File "/home/singh/miniconda3/envs/tmp_env/lib/python3.11/site-packages/torchrl/collectors/collectors.py", line 739, in __init__
    self._make_shuttle()
  File "/home/singh/miniconda3/envs/tmp_env/lib/python3.11/site-packages/torchrl/collectors/collectors.py", line 763, in _make_shuttle
    self._shuttle = self.env.reset()
                    ^^^^^^^^^^^^^^^^
  File "/home/singh/miniconda3/envs/tmp_env/lib/python3.11/site-packages/torchrl/envs/common.py", line 2165, in reset
    tensordict_reset = self._reset(tensordict, **kwargs)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/singh/miniconda3/envs/tmp_env/lib/python3.11/site-packages/torchrl/envs/transforms/transforms.py", line 814, in _reset
    tensordict_reset = self.base_env._reset(tensordict, **kwargs)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/singh/miniconda3/envs/tmp_env/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/singh/miniconda3/envs/tmp_env/lib/python3.11/site-packages/torchrl/envs/batched_envs.py", line 63, in decorated_fun
    self._start_workers()
  File "/home/singh/miniconda3/envs/tmp_env/lib/python3.11/site-packages/torchrl/envs/batched_envs.py", line 1436, in _start_workers
    process.start()
  File "/home/singh/miniconda3/envs/tmp_env/lib/python3.11/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
                  ^^^^^^^^^^^^^^^^^
  File "/home/singh/miniconda3/envs/tmp_env/lib/python3.11/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/singh/miniconda3/envs/tmp_env/lib/python3.11/multiprocessing/context.py", line 288, in _Popen
    return Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^
  File "/home/singh/miniconda3/envs/tmp_env/lib/python3.11/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/home/singh/miniconda3/envs/tmp_env/lib/python3.11/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/singh/miniconda3/envs/tmp_env/lib/python3.11/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/home/singh/miniconda3/envs/tmp_env/lib/python3.11/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
  File "/home/singh/miniconda3/envs/tmp_env/lib/python3.11/site-packages/torchrl/data/utils.py", line 244, in __getstate__
    return cloudpickle.dumps((self.fn, self.kwargs))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/singh/miniconda3/envs/tmp_env/lib/python3.11/site-packages/cloudpickle/cloudpickle.py", line 1529, in dumps
    cp.dump(obj)
  File "/home/singh/miniconda3/envs/tmp_env/lib/python3.11/site-packages/cloudpickle/cloudpickle.py", line 1295, in dump
    return super().dump(obj)
           ^^^^^^^^^^^^^^^^^
TypeError: cannot pickle 'dmlab2d.dmlab2d_pybind.Lab2d' object

This is related to meltingpot using dmlab2d as its backend.
Following a suggestion online, I used

import multiprocessing
multiprocessing.set_start_method("fork", force=True)

and this solves the issue, although it is restricted to Unix-like systems. I am running on Ubuntu at the moment.
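One caveat worth noting: the start method matters at process-creation time, so the call has to run before any ParallelEnv workers are spawned. A sketch, assuming a run.py-style entry point:

import multiprocessing

if __name__ == "__main__":
    # "fork" is only available on Unix-like systems; force=True overrides a
    # previously set start method.
    multiprocessing.set_start_method("fork", force=True)
    # ... then build and run the experiment as usual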

@matteobettini
Collaborator

Mmmh, I don't get this error. Can you confirm you are using just the code from my PR?

@gliese876b
Contributor Author

gliese876b commented Dec 17, 2024

I created a fresh conda env and pulled the branch named parallel.

I ran the following command:

python benchmarl/run.py algorithm=iql task=meltingpot/coins model=sequence model@critic_model=sequence

where the contents of the sequence, mp_cnn, and lstm files are (in that order):

defaults:
  # Here is a list of layers for this model
  # You can use configs from the "layers" folder
  - [email protected]: mp_cnn
  - [email protected]: lstm 
  - _self_

# A list of ints for the intermediate sizes between layers
# Should be of length = num_layers - 1
intermediate_sizes: [64] # Example -> [256]

# You can override your layers for example like this
# layers:
#  l1:
#    num_cells: [4]

name: cnn

mlp_num_cells: [128]
mlp_layer_class: torch.nn.Linear
mlp_activation_class: torch.nn.Tanh
mlp_activation_kwargs: null
mlp_norm_class: null
mlp_norm_kwargs: null

cnn_num_cells: [32, 128]
cnn_kernel_sizes: [6, 11]
cnn_strides: [8, 1]
cnn_paddings: [1, 0]
cnn_activation_class: torch.nn.ReLU
cnn_activation_kwargs: null
cnn_norm_class: null
cnn_norm_kwargs: null

name: lstm

hidden_size: 128
n_layers: 1
bias: True
dropout: 0
compile: False

mlp_num_cells: []
mlp_layer_class: torch.nn.Linear
mlp_activation_class: torch.nn.Tanh
mlp_activation_kwargs: null
mlp_norm_class: null
mlp_norm_kwargs: null

and the base_experiment contains

defaults:
  - experiment_config
  - _self_

# The device for collection (e.g. cuda)
sampling_device: "cpu"
# The device for training (e.g. cuda)
train_device: "cpu"
# The device for the replay buffer of off-policy algorithms (e.g. cuda)
buffer_device: "cpu"

# Whether to share the parameters of the policy within agent groups
share_policy_params: True
# If an algorithm and an env support both continuous and discrete actions, what should be preferred
prefer_continuous_actions: True
# If False collection is done using a collector (under no grad). If True, collection is done with gradients.
collect_with_grad: False
# In case of non-vectorized environments, whether to run collection over multiple processes
# If this is used, there will be n_envs_per_worker processes, collecting frames_per_batch/n_envs_per_worker frames each
parallel_collection: True

# Discount factor
gamma: 0.9
# Learning rate
lr: 0.00005
# The epsilon parameter of the adam optimizer
adam_eps: 0.000001
# Clips grad norm if true and clips grad value if false
clip_grad_norm: True
# The value for the clipping, if null no clipping
clip_grad_val: 5

# Whether to use soft or hard target updates
soft_target_update: True
# If soft_target_update is True, this is its polyak_tau
polyak_tau: 0.005
# If soft_target_update is False, this is the frequency of the hard target updates in terms of n_optimizer_steps
hard_target_update_frequency: 5

# When an exploration wrapper is used. This is its initial epsilon for annealing
exploration_eps_init: 0.8
# When an exploration wrapper is used. This is its final epsilon after annealing
exploration_eps_end: 0.01
# Number of frames for annealing of exploration strategy in deterministic policy algorithms
# If null it will default to max_n_frames / 3
exploration_anneal_frames: null

# The maximum number of experiment iterations before the experiment terminates, exclusive with max_n_frames
max_n_iters: null
# Number of collected frames before ending, exclusive with max_n_iters
max_n_frames: 2_500_000

# Number of frames collected at each experiment iteration
on_policy_collected_frames_per_batch: 6000
# Number of environments used for collection
# If the environment is vectorized, this will be the number of batched environments.
# Otherwise batching will be simulated and each env will be run sequentially or in parallel depending on parallel_collection.
on_policy_n_envs_per_worker: 10
# This is the number of times collected_frames_per_batch will be split into minibatches and trained
on_policy_n_minibatch_iters: 45
# In on-policy algorithms the train_batch_size will be equal to the on_policy_collected_frames_per_batch
# and it will be split into minibatches with this number of frames for training
on_policy_minibatch_size: 400

# Number of frames collected at each experiment iteration
off_policy_collected_frames_per_batch: 2000
# Number of environments used for collection
# If the environment is vectorized, this will be the number of batched environments.
# Otherwise batching will be simulated and each env will be run sequentially or in parallel depending on parallel_collection.
off_policy_n_envs_per_worker: 8
# This is the number of times off_policy_train_batch_size will be sampled from the buffer and trained over.
off_policy_n_optimizer_steps: 20
# Number of frames used for each off_policy_n_optimizer_steps when training off-policy algorithms
off_policy_train_batch_size: 128
# Maximum number of frames to keep in replay buffer memory for off-policy algorithms
off_policy_memory_size: 20_000
# Number of random action frames to prefill the replay buffer with
off_policy_init_random_frames: 0


evaluation: True
# Whether to render the evaluation (if rendering is available)
render: False
# Frequency of evaluation in terms of collected frames (this should be a multiple of on/off_policy_collected_frames_per_batch)
evaluation_interval: 120_000
# Number of episodes that evaluation is run on
evaluation_episodes: 10
# If True, when stochastic policies are evaluated, their deterministic value is taken, otherwise, if False, they are sampled
evaluation_deterministic_actions: True

# List of loggers to use, options are: wandb, csv, tensorboard, mlflow
loggers: [csv]
# Wandb project name
project_name: "benchmarl"
# Create a json folder as part of the output in the format of marl-eval
create_json: True

# Absolute path to the folder where the experiment will log.
# If null, this will default to the hydra output dir (if using hydra) or to the current folder when the script is run (if not).
# If you are reloading an experiment with "restore_file", this will default to the reloaded experiment folder.
save_folder: null
# Absolute path to a checkpoint file where the experiment was saved. If null the experiment is started fresh.
restore_file: null
# Map location given to `torch.load()` when reloading.
# If you are reloading a gpu experiment on a cpu-only machine, you can use `restore_map_location: {"cuda:0":"cpu"}`
# to map gpu tensors to the cpu
restore_map_location: null
# Interval for experiment saving in terms of collected frames (this should be a multiple of on/off_policy_collected_frames_per_batch).
# Set it to 0 to disable checkpointing
checkpoint_interval: 0
# Whether to checkpoint when the experiment is done
checkpoint_at_end: False
# How many checkpoints to keep. As new checkpoints are taken, temporally older checkpoints are deleted to keep this number of
# checkpoints. The checkpoint at the end is included in this number. Set to `null` to keep all checkpoints.
keep_checkpoints_num: 3

@matteobettini
Collaborator

matteobettini commented Dec 17, 2024

This is strange... it only happens if there is an RNN in the sequence.

@matteobettini
Collaborator

matteobettini commented Dec 17, 2024

I think I got it. Some weird lambda thing. Try now, fingers crossed 🤞

@gliese876b
Contributor Author

Yes, no errors; it works fine with an LSTM model on ParallelEnv.

Great work!

Thank you so much. Now that the collection is parallel, this will save a lot of time.
