
Exploding targets for A2C and task Tag when add_value_last_step: True #43

gsavarela opened this issue Feb 18, 2023 · 1 comment

I would like to report an issue that has a significant impact on the critic-based
algorithms for the MPE PredatorPrey (Tag) task. The target_value variable, built
for training the critic network, is about three orders of magnitude larger
than return_mean, the long-term sample return. This behavior is abnormal for
three reasons: (i) return_mean is undiscounted whereas target_value
is discounted; and the rewards used to estimate the value are (ii) centered,
since we subtract a non-negative mean, and (iii) rescaled, since we divide by
a standard deviation larger than one. Figure 1 depicts the behavior:

Figure 1
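For context, here is a back-of-the-envelope bound (my own sketch, not project code): with standardised rewards, any gamma-discounted sum over a 25-step episode is capped by a geometric series, so targets hundreds of times larger than return_mean cannot come from the rewards alone.

```python
import numpy as np

# Hedged sanity check (my own sketch, not EPyMARL code): with standardised
# rewards (roughly zero mean, unit variance) and gamma = 0.99 over a
# 25-step episode, a discounted return stays on the order of the bound below.
gamma, T = 0.99, 25
rng = np.random.default_rng(0)
rewards = rng.standard_normal(T)  # stand-in for standardised rewards

# Geometric-series cap on |sum gamma^t * r_t| when |r_t| <= 1:
bound = (1 - gamma**T) / (1 - gamma)  # about 22.2

discounted_return = sum(gamma**t * r for t, r in enumerate(rewards))
# discounted_return stays well within the bound above, nowhere near
# three orders of magnitude beyond the per-step reward scale.
```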

Reproduce:

Config:

{   
    "action_selector": "soft_policies",
    "add_value_last_step": true,
    "agent": "rnn_ns",
    "agent_output_type": "pi_logits",
    "batch_size": 10,
    "batch_size_run": 10,
    "buffer_cpu_only": true,
    "buffer_size": 10,
    "checkpoint_path": "",
    "critic_type": "cv_critic_ns",
    "entropy_coef": 0.01,
    "env": "gymma",
    "env_args": {   "key": "mpe:SimpleTag-v0",
                    "pretrained_wrapper": "PretrainedTag",
                    "seed": 853609918,
                    "state_last_action": false,
                    "time_limit": 25},
    "evaluate": false,
    "gamma": 0.99,
    "grad_norm_clip": 10,
    "hidden_dim": 128,
    "hypergroup": null,
    "label": "default_label",
    "learner": "actor_critic_learner",
    "learner_log_interval": 10000,
    "load_step": 0,
    "local_results_path": "results",
    "log_interval": 250000,
    "lr": 0.0003,
    "mac": "non_shared_mac",
    "mask_before_softmax": true,
    "name": "maa2c_ns",
    "obs_agent_id": false,
    "obs_individual_obs": false,
    "obs_last_action": false,
    "optim_alpha": 0.99,
    "optim_eps": 1e-05,
    "q_nstep": 5,
    "repeat_id": 1,
    "runner": "parallel",
    "runner_log_interval": 10000,
    "save_model": false,
    "save_model_interval": 500000,
    "save_replay": false,
    "seed": 853609918,
    "standardise_returns": false,
    "standardise_rewards": true,
    "t_max": 20050000,
    "target_update_interval_or_tau": 0.01,
    "test_greedy": true,
    "test_interval": 500000,
    "test_nepisode": 100,
    "use_cuda": false,
    "use_rnn": true,
    "use_tensorboard": true
}

Correction

Training is better behaved when the option add_value_last_step is set to false, as shown in Figure 2.

Figure 2
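My reading of the mechanism, as a minimal sketch under my own assumptions (this is not the repository's actual learner code): with add_value_last_step enabled, the n-step target at the end of an episode bootstraps from the critic's own estimate at the final observation, so an inflated estimate feeds back into the targets.

```python
import numpy as np

def nstep_targets(rewards, values, gamma, n, add_value_last_step):
    """Sketch of n-step critic targets (my assumption about the mechanism).

    rewards: length-T array; values: length-(T+1) array of critic estimates,
    where values[T] is the estimate at the final observation.
    """
    T = len(rewards)
    targets = np.zeros(T)
    for t in range(T):
        end = min(t + n, T)
        G = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
        # Bootstrap with the critic value at the cut-off step; at episode
        # end this only happens when add_value_last_step is enabled.
        if end < T or add_value_last_step:
            G += gamma ** (end - t) * values[end]
        targets[t] = G
    return targets

# An inflated last-step value estimate leaks into every target within
# n steps of the episode end:
rewards = np.ones(5)
values = np.zeros(6)
values[-1] = 1000.0  # runaway critic estimate at the final observation
with_boot = nstep_targets(rewards, values, 0.99, 5, True)
without = nstep_targets(rewards, values, 0.99, 5, False)
```

Here with_boot[-1] is 1 + 0.99 · 1000 = 991 while without[-1] is 1, so disabling the flag removes the feedback path at episode end and the targets stay on the scale of the rewards.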

Config

{    
    "action_selector":  "soft_policies",
    "add_value_last_step": false,
    "agent": "rnn_ns",
    "agent_output_type": "pi_logits",
    "batch_size": 10,
    "batch_size_run": 10,
    "buffer_cpu_only": true,
    "buffer_size": 10,
    "checkpoint_path": "",
    "critic_type": "cv_critic_ns",
    "entropy_coef": 0.01,
    "env": "gymma",
    "env_args": {   "key": "mpe:SimpleTag-v0",
                    "pretrained_wrapper": "PretrainedTag",
                    "seed": 932101488,
                    "state_last_action": false,
                    "time_limit": 25},
    "evaluate": false,
    "gamma": 0.99,
    "grad_norm_clip": 10,
    "hidden_dim": 128,
    "hypergroup": null,
    "label": "default_label",
    "learner": "actor_critic_learner",
    "learner_log_interval": 10000,
    "load_step": 0,
    "local_results_path": "results",
    "log_interval": 250000,
    "lr": 0.0003,
    "mac": "non_shared_mac",
    "mask_before_softmax": true,
    "name": "maa2c_ns",
    "obs_agent_id": false,
    "obs_individual_obs": false,
    "obs_last_action": false,
    "optim_alpha": 0.99,
    "optim_eps": 1e-05,
    "q_nstep": 5,
    "repeat_id": 1,
    "runner": "parallel",
    "runner_log_interval": 10000,
    "save_model": false,
    "save_model_interval": 500000,
    "save_replay": false,
    "seed": 932101488,
    "standardise_returns": false,
    "standardise_rewards": true,
    "t_max": 20050000,
    "target_update_interval_or_tau": 0.01,
    "test_greedy": true,
    "test_interval": 500000,
    "test_nepisode": 100,
    "use_cuda": false,
    "use_rnn": true,
    "use_tensorboard": true
}

This is the output from the diff command between the two configuration files.

1,3c1,3
< {   
<     "action_selector": "soft_policies",
<     "add_value_last_step": true,
---
> {    
>     "action_selector":  "soft_policies",
>     "add_value_last_step": false,
16c16
<                     "seed": 853609918,
---
>                     "seed": 932101488,
46c46
<     "seed": 853609918,
---
>     "seed": 932101488,

Unfortunately, I wasn't able to reproduce the published numbers even after correcting this flag. Could you point me in the right direction?

Thanks in advance.

@gsavarela gsavarela changed the title Exploding targets for MAA2C_NS and task Tag when add_value_last_step: True Exploding targets for A2C and task Tag when add_value_last_step: True Mar 2, 2023
@gsavarela (Author)

Hi, I have noticed that this issue also affects the MAA2C and IA2C algorithms on the Tag task. I was able to verify the maximum returns according to Table 3 by running the simulations as reported in the article (add_value_last_step=True).

In one of the simulations, the return_mean metric shows a very strange pattern:

image

The target_mean is also far above expectation:

image

The config:

{   'action_selector': 'soft_policies',
    'add_value_last_step': True,
    'agent': 'rnn',
    'agent_output_type': 'pi_logits',
    'batch_size': 10,
    'batch_size_run': 10,
    'buffer_cpu_only': True,
    'buffer_size': 10,
    'checkpoint_path': '',
    'critic_type': 'cv_critic',
    'entropy_coef': 0.01,
    'env': 'gymma',
    'env_args': {   'key': 'mpe:SimpleTag-v0',
                    'pretrained_wrapper': 'PretrainedTag',
                    'seed': 94177189,
                    'state_last_action': False,
                    'time_limit': 25},
    'evaluate': False,
    'gamma': 0.99,
    'grad_norm_clip': 10,
    'hidden_dim': 128,
    'hypergroup': None,
    'label': 'default_label',
    'learner': 'actor_critic_learner',
    'learner_log_interval': 10000,
    'load_step': 0,
    'local_results_path': 'results',
    'log_interval': 250000,
    'lr': 0.0005,
    'mac': 'basic_mac',
    'mask_before_softmax': True,
    'name': 'maa2c',
    'obs_agent_id': True,
    'obs_individual_obs': False,
    'obs_last_action': False,
    'optim_alpha': 0.99,
    'optim_eps': 1e-05,
    'q_nstep': 5,
    'repeat_id': 1,
    'runner': 'parallel',
    'runner_log_interval': 10000,
    'save_model': False,
    'save_model_interval': 500000,
    'save_replay': False,
    'seed': 94177189,
    'standardise_returns': False,
    'standardise_rewards': True,
    't_max': 20050000,
    'target_update_interval_or_tau': 0.01,
    'test_greedy': True,
    'test_interval': 500000,
    'test_nepisode': 100,
    'use_cuda': False,
    'use_rnn': True,
    'use_tensorboard': True}
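To illustrate why the target scale looks wrong, here is a hedged sketch of what I take "standardise_rewards": True to mean (shift and rescale rewards before computing targets), using hypothetical stand-in rewards rather than the environment's actual ones:

```python
import numpy as np

# Hedged sketch of the "standardise_rewards": True option as I understand
# it: shift and rescale the rewards before computing critic targets.
def standardise(rewards, eps=1e-8):
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical sparse catch bonuses over a 25-step Tag episode; the real
# rewards come from the environment, but the scale argument is the same.
raw = np.array([0.0] * 20 + [10.0] * 5)
std_r = standardise(raw)

# Largest-magnitude undiscounted suffix sum of the standardised rewards,
# an upper bound on any discounted target built from them:
target_scale = np.abs(np.cumsum(std_r[::-1])[::-1]).max()
```

With these numbers target_scale comes out to 10, so a target_mean orders of magnitude above return_mean has to come from somewhere other than the standardised rewards, which again points at the bootstrapped last-step value.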
