I would like to report an issue that has a decisive impact on the critic-based
algorithms for the MPE PredatorPrey task. The target_value variable, built
for training the critic network, is about three orders of magnitude larger than
return_mean, the long-term sample return. This behavior is abnormal for
three reasons: (i) return_mean is not discounted, whereas target_value
is discounted; (ii) the rewards used to estimate the value are shifted by
subtracting a non-negative value; and (iii) they are re-scaled by dividing by
a standard deviation larger than one. Figure 1 depicts the behavior:
Reproduce: (i) Configure maa2c_ns.yaml according to Section C.1, subsection
MPE PredatorPrey, and Table 23 from the Supplemental. (ii) Set time_limit=25
in gymma.yaml.
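The three reasons above can be checked numerically. The sketch below is only an illustration of that argument, not the EPyMARL implementation: the helper names `standardise` and `discounted_targets`, the reward values, the statistics used for standardisation, and the critic output `last_value=1000.0` are all hypothetical. It compares the undiscounted return of a 25-step episode with discounted targets built from standardised rewards, with and without bootstrapping from a value estimate at the last step (the behavior I assume add_value_last_step controls).

```python
import numpy as np


def standardise(rewards, mean, std):
    """Shift and re-scale rewards (hypothetical helper, not EPyMARL code)."""
    return (np.asarray(rewards, dtype=float) - mean) / std


def discounted_targets(rewards, last_value, gamma=0.99, add_value_last_step=True):
    """Discounted return-to-go per step, optionally bootstrapped at the end.

    Assumption: with add_value_last_step=True the critic's estimate for the
    state after the final step is added (discounted) to every target.
    """
    running = float(last_value) if add_value_last_step else 0.0
    targets = np.zeros(len(rewards))
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        targets[t] = running
    return targets


# Hypothetical episode: 25 steps (time_limit=25) with a reward of 1 each.
raw = np.ones(25)
scaled = standardise(raw, mean=0.5, std=2.0)  # non-negative shift, std > 1

without_bootstrap = discounted_targets(scaled, last_value=1000.0,
                                       add_value_last_step=False)
with_bootstrap = discounted_targets(scaled, last_value=1000.0,
                                    add_value_last_step=True)

print("undiscounted return:", raw.sum())
print("max target without bootstrap:", without_bootstrap.max())
print("max target with bootstrap:", with_bootstrap.max())
```

In this toy setting the targets without bootstrapping stay below the undiscounted return, while a large value estimate at the last step dominates the bootstrapped targets. Whether this is the actual mechanism behind the exploding targets is something I have not confirmed.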
gsavarela changed the title from "Exploding targets for MAA2C_NS and task Tag when add_value_last_step: True" to "Exploding targets for A2C and task Tag when add_value_last_step: True" on Mar 2, 2023
Hi, I have noticed that this issue also affects the MAA2C and IA2C algorithms for the Tag task. I was able to verify the maximum returns according to Table 3 by running the simulations as reported in the article (add_value_last_step=True).
In one of the simulations, the return_mean metric shows a very strange pattern:
Correction
The process is better controlled when the option add_value_last_step is set to false, as shown in Figure 2.
Config
This is the output from the diff command between the two configuration files.
Unfortunately, I wasn't able to verify the published numbers even after correcting this flag. Could you point me in the right direction?
Thanks in advance.