
[Question] Please, state clearly in the documentation and dataset definition if in a time step "r_0" is consequence of "a_0" #74

Open
jamartinh opened this issue May 22, 2023 · 3 comments


@jamartinh
Contributor

Question

Hi, please state clearly in the documentation and dataset definition whether, within a time step, "r_0" is a consequence of "a_0".

With previous offline RL libraries, there has been some confusion in this respect.

With the standard in RL being (s, a, r, s'), one assumes that r is a consequence of applying action a in state s.

If r is not, please state it clearly, because then r(s, a) should be r_1 and not r_0.
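
To make the two possible conventions concrete, here is a minimal sketch (plain Python, with hypothetical array names, not Minari's API) of how transitions would be reconstructed under each reading:

# Convention A: r_t is the reward for taking a_t in s_t.
# Flat arrays: obs has T+1 entries, act and rew have T entries.
transitions_a = [
    (obs[t], act[t], rew[t], obs[t + 1])
    for t in range(len(act))
]

# Convention B: the reward for (s_t, a_t) is stored one index later, as r_{t+1}.
# Here rew would need T+1 entries, with rew[0] unused (or defined as 0).
transitions_b = [
    (obs[t], act[t], rew[t + 1], obs[t + 1])
    for t in range(len(act))
]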

Thanks!

@balisujohn
Collaborator

I agree that's good to mention.

It is implied in this code block from the dataset standards documentation:

https://minari.farama.org/main/content/dataset_standards/

But I think further clarification wouldn't hurt, so I'll make a PR.

@jamartinh
Contributor Author

Thanks @balisujohn, now I am even more confused.

For me, this is déjà vu from working on the d3rlpy offline RL library.
The code:

obs, rew, terminated, truncated, info = self.env.step(action)
# add/edit data from step and convert to dictionary step data
step_data = self._step_data_callback(
    env=self,
    obs=obs,
    info=info,
    action=action,
    rew=rew,
    terminated=terminated,
    truncated=truncated,
)

With this data collector, the action and its reward (as its consequence) are on the same "row"; however, the state in which the action was taken is not in the same "row".

Also, in previous discussions about these kinds of datasets, we concluded that the "original" D4RL datasets were in the format that is actually used in the replay buffers implemented in almost all RL libraries:

$$(s, a, r, s', \text{terminated}, \text{truncated}, \text{info})$$

E.g., a "full iteration" not just an env.step

So in just one "row" we have the state, the action taken in that state, the corresponding reward for taking that action in the current state, the subsequent state (required for bootstrapping), the terminal flags, info, etc.

This is basically the format of a replay buffer, and the format one expects from a dataset that describes a control task intended for use with RL. A minimal sketch of such a "row" follows below.
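
Here is that sketch (hypothetical field names, not Minari's actual schema):

from dataclasses import dataclass
from typing import Any


@dataclass
class Transition:
    obs: Any          # s  : state the action was taken in
    action: Any       # a  : action taken in obs
    reward: float     # r  : reward received for taking action in obs
    next_obs: Any     # s' : resulting state, used for bootstrapping
    terminated: bool  # episode ended in a terminal state
    truncated: bool   # episode was cut off (e.g. by a time limit)
    info: dict        # extra diagnostic data from the environment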

In the documentation: does $t$ start from 0 or from 1?

What is the value of $r_0$?

As said, thanks again, but please take great care with this issue, since it is confusing for people and also a "hidden" source of bad training. People can make severe mistakes when using the data for training by assuming something that is not the case.

@rodrigodelazcano
Member

Hi @jamartinh. The datasets have the structure you are looking for:
The data collector stores the action and its reward as a consequence on the same "row", and the state in which the action was taken is ALSO in the same "row". The first state that is recorded is the one returned by env.reset(), so the first timestep is t=0. You can also see that the last state of an episode is stored, since each episode's observations array has one extra element compared to the rewards or actions arrays.
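
A minimal sketch of that invariant check (the dataset id is a placeholder, and the exact loading/iteration API may differ between Minari versions):

import minari

dataset = minari.load_dataset("dataset-id-v0")  # placeholder id

for episode in dataset.iterate_episodes():
    num_steps = len(episode.actions)                    # number of env.step calls
    assert len(episode.rewards) == num_steps            # one reward per action
    assert len(episode.observations) == num_steps + 1   # reset obs plus the final s'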

@balisujohn shared the code to convert an episode's data to the (s, a, r, s') format. However, I agree that we should update the documentation to make this clearer, or add an extra array in each episode for s'. Sorry for the confusion.
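
For reference, a minimal sketch of such a conversion (not the exact snippet shared above; field names follow the dataset standards page, so treat them as assumptions):

def episode_to_transitions(episode):
    """Rebuild (s, a, r, s', terminated, truncated) tuples from one episode."""
    return [
        (
            episode.observations[t],      # s  : state the action was taken in
            episode.actions[t],           # a  : action taken at step t
            episode.rewards[t],           # r  : reward for taking a in s
            episode.observations[t + 1],  # s' : resulting state
            episode.terminations[t],
            episode.truncations[t],
        )
        for t in range(len(episode.actions))
    ]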
