Unless I'm mistaken, there is something odd about the main training loop (Listing 8.13) for the Super Mario game in Chapter 8. The way that the current x-position is checked against the min_progress parameter makes no sense to me.
More precisely: in line 23 of the main training loop, the environment step is taken (6 times) and last_x_pos is set to the current x-position:
state2, e_reward_, done, info = env.step(action)
last_x_pos = info['x_pos']
In the following lines of code, neither last_x_pos nor info['x_pos'] is changed. Then, in line 33, the two are compared with one another:
if episode_length > params['max_episode_len']:
    if (info['x_pos'] - last_x_pos) < params['min_progress']:
        done = True
    else:
        last_x_pos = info['x_pos']
Isn't info['x_pos'] - last_x_pos always going to be zero here? If so, for any positive min_progress the environment would always be reset as soon as episode_length > params['max_episode_len'].
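To make the concern concrete, here is a minimal sketch that replays the quoted logic over a list of x-positions instead of a real environment. It is not the book's code: the function name, the list-based replay, and the placement of the episode_length increment are my own assumptions, made only to isolate the comparison.

```python
def progress_check_as_written(x_positions, max_episode_len, min_progress):
    """Replay the quoted check over a list of x-positions (one per loop iteration).

    Hypothetical helper for illustration only. Returns the index of the
    iteration at which `done` is set to True, or None if that never happens.
    """
    episode_length = 0
    for i, x in enumerate(x_positions):
        episode_length += 1
        info = {'x_pos': x}           # stands in for the info dict from env.step
        last_x_pos = info['x_pos']    # overwritten right after the step, as quoted
        done = False
        if episode_length > max_episode_len:
            # Both operands hold the same value, so the difference is 0.
            if (info['x_pos'] - last_x_pos) < min_progress:
                done = True
            else:
                last_x_pos = info['x_pos']
        if done:
            return i
    return None


# Mario advances 10 units on every iteration, i.e. steady progress.
steady = [10 * i for i in range(100)]
first_reset = progress_check_as_written(steady, max_episode_len=20, min_progress=15)
```

In this replay, the reset happens on the first iteration where episode_length exceeds max_episode_len, no matter how far Mario has moved.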
What is the min_progress parameter meant to be, intuitively? The progress from the beginning to the end of one episode? The progress from time 0 until max_episode_len? Or the progress measured against a certain checkpoint within a certain amount of time? If so, how are these checkpoints chosen?
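For comparison, here is a sketch of the checkpoint reading, which is purely my guess at the intent and not something the book or the repository confirms. In this version last_x_pos is only updated when the check runs, and the step counter restarts after each check, so min_progress means "units gained since the last checkpoint". The function name and the counter reset are assumptions of mine.

```python
def checkpoint_progress_check(x_positions, max_episode_len, min_progress):
    """Hypothetical checkpoint-based variant of the progress check.

    Every time more than `max_episode_len` iterations have passed since the
    last checkpoint, compare the current x-position with the checkpoint.
    Returns the index at which the episode would be ended, or None.
    """
    last_x_pos = x_positions[0]   # first checkpoint: the starting position
    episode_length = 0
    for i, x in enumerate(x_positions):
        episode_length += 1
        if episode_length > max_episode_len:
            if (x - last_x_pos) < min_progress:
                return i          # too little progress since the checkpoint
            last_x_pos = x        # move the checkpoint forward
            episode_length = 0    # assumption: start a new measuring window
    return None


# Steady progress: never reset.
steady = [10 * i for i in range(100)]
# Progress for 30 iterations, then Mario gets stuck at x = 290.
stuck = [10 * i for i in range(30)] + [290] * 30

steady_result = checkpoint_progress_check(steady, max_episode_len=10, min_progress=15)
stuck_result = checkpoint_progress_check(stuck, max_episode_len=10, min_progress=15)
```

Under this reading, the episode only ends once Mario actually stalls, which is what I would have expected the parameter to do.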
This has not become clear to me yet, neither from the book nor from the code.
Addon:
This would also explain why, in Figure 8.19 of the book, the training time for each episode is always exactly the same (i.e. the horizontal distance between consecutive peaks is always identical): the training loop always runs for params['max_episode_len'] steps and then resets.