
Exploding targets for A2C and task Tag when add_value_last_step: True #43

gsavarela opened this issue Feb 18, 2023 · 1 comment

I would like to report an issue that has a significant impact on the critic-based
algorithms for the MPE PredatorPrey (Tag) task. The target_value variable, built
for training the critic network, is about three orders of magnitude larger
than return_mean, the long-term sample return. This behavior is abnormal for
three reasons: (i) return_mean is undiscounted whereas target_value
is discounted; and the rewards used to estimate the value are (ii) centered,
since we subtract a non-negative mean, and (iii) rescaled, since we divide by
a standard deviation larger than one. Figure 1 depicts the behavior:

Figure 1
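For context, here is a back-of-the-envelope bound (my own sketch, not project code): with standardised rewards, any gamma-discounted sum over a 25-step episode is capped by a geometric series, so targets hundreds of times larger than return_mean cannot come from the rewards alone.

```python
import numpy as np

# Hedged sanity check (my own sketch, not EPyMARL code): with standardised
# rewards (roughly zero mean, unit variance) and gamma = 0.99 over a
# 25-step episode, a discounted return stays on the order of the bound below.
gamma, T = 0.99, 25
rng = np.random.default_rng(0)
rewards = rng.standard_normal(T)  # stand-in for standardised rewards

# Geometric-series cap on |sum gamma^t * r_t| when |r_t| <= 1:
bound = (1 - gamma**T) / (1 - gamma)  # about 22.2

discounted_return = sum(gamma**t * r for t, r in enumerate(rewards))
# discounted_return stays well within the bound above, nowhere near
# three orders of magnitude beyond the per-step reward scale.
```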

Reproduce:

Config:

{   
    "action_selector": "soft_policies",
    "add_value_last_step": true,
    "agent": "rnn_ns",
    "agent_output_type": "pi_logits",
    "batch_size": 10,
    "batch_size_run": 10,
    "buffer_cpu_only": true,
    "buffer_size": 10,
    "checkpoint_path": "",
    "critic_type": "cv_critic_ns",
    "entropy_coef": 0.01,
    "env": "gymma",
    "env_args": {   "key": "mpe:SimpleTag-v0",
                    "pretrained_wrapper": "PretrainedTag",
                    "seed": 853609918,
                    "state_last_action": false,
                    "time_limit": 25},
    "evaluate": false,
    "gamma": 0.99,
    "grad_norm_clip": 10,
    "hidden_dim": 128,
    "hypergroup": null,
    "label": "default_label",
    "learner": "actor_critic_learner",
    "learner_log_interval": 10000,
    "load_step": 0,
    "local_results_path": "results",
    "log_interval": 250000,
    "lr": 0.0003,
    "mac": "non_shared_mac",
    "mask_before_softmax": true,
    "name": "maa2c_ns",
    "obs_agent_id": false,
    "obs_individual_obs": false,
    "obs_last_action": false,
    "optim_alpha": 0.99,
    "optim_eps": 1e-05,
    "q_nstep": 5,
    "repeat_id": 1,
    "runner": "parallel",
    "runner_log_interval": 10000,
    "save_model": false,
    "save_model_interval": 500000,
    "save_replay": false,
    "seed": 853609918,
    "standardise_returns": false,
    "standardise_rewards": true,
    "t_max": 20050000,
    "target_update_interval_or_tau": 0.01,
    "test_greedy": true,
    "test_interval": 500000,
    "test_nepisode": 100,
    "use_cuda": false,
    "use_rnn": true,
    "use_tensorboard": true
}

Correction

Training is better behaved when the option add_value_last_step is set to false, as shown in Figure 2.

Figure 2
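My reading of the mechanism, as a minimal sketch under my own assumptions (this is not the repository's actual learner code): with add_value_last_step enabled, the n-step target at the end of an episode bootstraps from the critic's own estimate at the final observation, so an inflated estimate feeds back into the targets.

```python
import numpy as np

def nstep_targets(rewards, values, gamma, n, add_value_last_step):
    """Sketch of n-step critic targets (my assumption about the mechanism).

    rewards: length-T array; values: length-(T+1) array of critic estimates,
    where values[T] is the estimate at the final observation.
    """
    T = len(rewards)
    targets = np.zeros(T)
    for t in range(T):
        end = min(t + n, T)
        G = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
        # Bootstrap with the critic value at the cut-off step; at episode
        # end this only happens when add_value_last_step is enabled.
        if end < T or add_value_last_step:
            G += gamma ** (end - t) * values[end]
        targets[t] = G
    return targets

# An inflated last-step value estimate leaks into every target within
# n steps of the episode end:
rewards = np.ones(5)
values = np.zeros(6)
values[-1] = 1000.0  # runaway critic estimate at the final observation
with_boot = nstep_targets(rewards, values, 0.99, 5, True)
without = nstep_targets(rewards, values, 0.99, 5, False)
```

Here with_boot[-1] is 1 + 0.99 · 1000 = 991 while without[-1] is 1, so disabling the flag removes the feedback path at episode end and the targets stay on the scale of the rewards.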

Config

{    
    "action_selector":  "soft_policies",
    "add_value_last_step": false,
    "agent": "rnn_ns",
    "agent_output_type": "pi_logits",
    "batch_size": 10,
    "batch_size_run": 10,
    "buffer_cpu_only": true,
    "buffer_size": 10,
    "checkpoint_path": "",
    "critic_type": "cv_critic_ns",
    "entropy_coef": 0.01,
    "env": "gymma",
    "env_args": {   "key": "mpe:SimpleTag-v0",
                    "pretrained_wrapper": "PretrainedTag",
                    "seed": 932101488,
                    "state_last_action": false,
                    "time_limit": 25},
    "evaluate": false,
    "gamma": 0.99,
    "grad_norm_clip": 10,
    "hidden_dim": 128,
    "hypergroup": null,
    "label": "default_label",
    "learner": "actor_critic_learner",
    "learner_log_interval": 10000,
    "load_step": 0,
    "local_results_path": "results",
    "log_interval": 250000,
    "lr": 0.0003,
    "mac": "non_shared_mac",
    "mask_before_softmax": true,
    "name": "maa2c_ns",
    "obs_agent_id": false,
    "obs_individual_obs": false,
    "obs_last_action": false,
    "optim_alpha": 0.99,
    "optim_eps": 1e-05,
    "q_nstep": 5,
    "repeat_id": 1,
    "runner": "parallel",
    "runner_log_interval": 10000,
    "save_model": false,
    "save_model_interval": 500000,
    "save_replay": false,
    "seed": 932101488,
    "standardise_returns": false,
    "standardise_rewards": true,
    "t_max": 20050000,
    "target_update_interval_or_tau": 0.01,
    "test_greedy": true,
    "test_interval": 500000,
    "test_nepisode": 100,
    "use_cuda": false,
    "use_rnn": true,
    "use_tensorboard": true
}

This is the output from the diff command between the two configuration files.

1,3c1,3
< {   
<     "action_selector": "soft_policies",
<     "add_value_last_step": true,
---
> {    
>     "action_selector":  "soft_policies",
>     "add_value_last_step": false,
16c16
<                     "seed": 853609918,
---
>                     "seed": 932101488,
46c46
<     "seed": 853609918,
---
>     "seed": 932101488,

Unfortunately, I wasn't able to reproduce the published numbers even after correcting this flag. Could you point me in the right direction?

Thanks in advance.

@gsavarela gsavarela changed the title Exploding targets for MAA2C_NS and task Tag when add_value_last_step: True Exploding targets for A2C and task Tag when add_value_last_step: True Mar 2, 2023
@gsavarela (Author)

Hi, I have noticed that this issue also affects the MAA2C and IA2C algorithms on the Tag task. I was able to verify the maximum returns according to Table 3 by running the simulations as reported in the article (add_value_last_step=True).

In one of the simulations, the return_mean metric shows a very strange pattern:

image

The target_mean is also far above expectation:

image

The config:

{   'action_selector': 'soft_policies',
    'add_value_last_step': True,
    'agent': 'rnn',
    'agent_output_type': 'pi_logits',
    'batch_size': 10,
    'batch_size_run': 10,
    'buffer_cpu_only': True,
    'buffer_size': 10,
    'checkpoint_path': '',
    'critic_type': 'cv_critic',
    'entropy_coef': 0.01,
    'env': 'gymma',
    'env_args': {   'key': 'mpe:SimpleTag-v0',
                    'pretrained_wrapper': 'PretrainedTag',
                    'seed': 94177189,
                    'state_last_action': False,
                    'time_limit': 25},
    'evaluate': False,
    'gamma': 0.99,
    'grad_norm_clip': 10,
    'hidden_dim': 128,
    'hypergroup': None,
    'label': 'default_label',
    'learner': 'actor_critic_learner',
    'learner_log_interval': 10000,
    'load_step': 0,
    'local_results_path': 'results',
    'log_interval': 250000,
    'lr': 0.0005,
    'mac': 'basic_mac',
    'mask_before_softmax': True,
    'name': 'maa2c',
    'obs_agent_id': True,
    'obs_individual_obs': False,
    'obs_last_action': False,
    'optim_alpha': 0.99,
    'optim_eps': 1e-05,
    'q_nstep': 5,
    'repeat_id': 1,
    'runner': 'parallel',
    'runner_log_interval': 10000,
    'save_model': False,
    'save_model_interval': 500000,
    'save_replay': False,
    'seed': 94177189,
    'standardise_returns': False,
    'standardise_rewards': True,
    't_max': 20050000,
    'target_update_interval_or_tau': 0.01,
    'test_greedy': True,
    'test_interval': 500000,
    'test_nepisode': 100,
    'use_cuda': False,
    'use_rnn': True,
    'use_tensorboard': True}
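To illustrate why the target scale looks wrong, here is a hedged sketch of what I take "standardise_rewards": True to mean (shift and rescale rewards before computing targets), using hypothetical stand-in rewards rather than the environment's actual ones:

```python
import numpy as np

# Hedged sketch of the "standardise_rewards": True option as I understand
# it: shift and rescale the rewards before computing critic targets.
def standardise(rewards, eps=1e-8):
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical sparse catch bonuses over a 25-step Tag episode; the real
# rewards come from the environment, but the scale argument is the same.
raw = np.array([0.0] * 20 + [10.0] * 5)
std_r = standardise(raw)

# Largest-magnitude undiscounted suffix sum of the standardised rewards,
# an upper bound on any discounted target built from them:
target_scale = np.abs(np.cumsum(std_r[::-1])[::-1]).max()
```

With these numbers target_scale comes out to 10, so a target_mean orders of magnitude above return_mean has to come from somewhere other than the standardised rewards, which again points at the bootstrapped last-step value.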
