
Removes h5py import in minari_dataset.py, adds tests for minari_dataset.py #101

Merged: 13 commits into Farama-Foundation:main on Jul 5, 2023

Conversation

@balisujohn (Collaborator) commented Jun 28, 2023

Description

This PR removes the h5py import in minari_dataset.py and also adds some tests for minari_dataset.py.

This PR makes one breaking change to the API and a separate slight relaxation of allowed types; both are noted in the review comments below.

Checklist:

  • I have run the pre-commit checks with pre-commit run --all-files (see CONTRIBUTING.md instructions to set it up)
  • I have run pytest -v and no errors are present.
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I solved any possible warnings that pytest -v has generated that are related to my code to the best of my knowledge.
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

…me tests for minari_dataset, still intermittently failing a test
@balisujohn marked this pull request as draft June 28, 2023 08:48
@@ -253,7 +201,7 @@ def filter_episodes(self, condition: Callable[[h5py.Group], bool]) -> MinariData
```

Args:
condition (Callable[[h5py.Group], bool]): callable that accepts an episode group and returns True if certain condition is met.
Collaborator Author (@balisujohn)

This relaxes the expected input type of the filter function applied to the underlying episode dataset to any type, since MinariDataset is no longer supposed to assume the types used in MinariStorage.

Member (@younik)

How about EpisodeData?

Collaborator Author (@balisujohn, Jun 29, 2023)

Right now MinariStorage.apply() calls the filter function with h5py.Group representations of episodes. We'd need to make apply pass EpisodeData into the filter function instead of h5py.Group. Curious to hear your thoughts on this.

Member (@younik, Jun 29, 2023)

From the user perspective, it makes much more sense to have EpisodeData; however, this means we should convert the h5py.Group into EpisodeData inside apply.
This adds a dependency MinariStorage -> EpisodeData, and then it also makes sense to change the return type of get_episodes to EpisodeData.

The other alternative is to use dict; I think this is stranger for the user, because it is not immediately clear which keys you can use. On the other hand, it gives more flexibility for possible future MinariStorage backends.

A solution in the middle is to use EpisodeData in MinariDataset and dict in MinariStorage; then you need to wrap the function in filter_episodes to convert the dict to EpisodeData. It complicates the code a bit more, but it has the advantages of both.

I lean towards the third option, but I don't have a strong opinion on this; feel free to choose one of the three, or something else if you see a better approach.

Collaborator Author (@balisujohn)

apply actually gives the condition function an h5py.Group rather than a dict. Do you think we should change apply to give dicts?

Collaborator Author (@balisujohn)

I think I understand. I'm tentatively changing apply to give the condition function dicts and wrapping the condition function appropriately in filter_episodes.

Collaborator Author (@balisujohn)

I went with the solution in the middle; should be ready for review.
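For context, a minimal sketch of what this middle solution might look like, assuming EpisodeData can be constructed directly from the episode dict that MinariStorage.apply passes to the callable (the import path, the EpisodeData constructor usage, and the MinariDataset(..., episode_indices=...) signature are assumptions based on this discussion, not the exact merged code):

```python
from typing import Callable, Dict

import numpy as np

# Assumed import location; adjust to wherever EpisodeData/MinariDataset actually live.
from minari.dataset.minari_dataset import EpisodeData, MinariDataset


def filter_episodes(
    self, condition: Callable[[EpisodeData], bool]
) -> MinariDataset:
    """Method of MinariDataset, shown out of its class for brevity."""

    def dict_condition(episode: Dict) -> bool:
        # MinariStorage.apply hands over a plain dict; wrap it so the
        # user-facing callable only ever sees EpisodeData.
        return condition(EpisodeData(**episode))

    mask = self._data.apply(dict_condition, episode_indices=self._episode_indices)
    filtered_indices = self._episode_indices[np.asarray(mask, dtype=bool)]
    return MinariDataset(self._data, episode_indices=filtered_indices)
```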

@@ -43,56 +41,6 @@ def parse_dataset_id(dataset_id: str) -> tuple[str | None, str, int | None]:
return env_name, dataset_name, version


def clear_episode_buffer(episode_buffer: Dict, episode_group: h5py.Group) -> h5py.Group:
Collaborator Author (@balisujohn)

This non-private function moved to minari_storage.py, so this is a breaking change.

@balisujohn requested a review from younik June 28, 2023 22:57
@balisujohn marked this pull request as ready for review June 28, 2023 23:00
Member (@younik) left a comment

Looks good, just a couple of comments


Comment on lines 264 to 265
self._episode_indices = np.arange(self._data.total_episodes)

Member (@younik)

It breaks filtered MinariDataset instances; you should append the new indices instead.

Collaborator Author (@balisujohn)

done

)
self._data.update_from_buffer(buffer, self.spec.data_path)

self._episode_indices = np.arange(self._data.total_episodes)
Member (@younik)

same as above

Collaborator Author (@balisujohn)

done
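For clarity, the requested behaviour is to append only the indices of the newly written episodes rather than resetting to the full range, so that an earlier filter survives. A tiny self-contained illustration (the numbers are made up):

```python
import numpy as np

episode_indices = np.array([0, 2, 5])          # indices kept by an earlier filter
old_total_episodes, new_total_episodes = 6, 8  # two episodes were just added to storage

# np.arange(new_total_episodes) would silently undo the filter;
# appending only the new indices preserves it.
episode_indices = np.append(
    episode_indices, np.arange(old_total_episodes, new_total_episodes)
)
print(episode_indices)  # [0 2 5 6 7]
```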

tests/utils/test_dataset_combine.py (outdated comment, resolved)
@balisujohn (Collaborator Author) commented Jun 29, 2023

One comment: if env.spec.total_episodes is supposed to reflect the total number of episodes in the underlying MinariStorage (ignoring the applied filter), we should either make it a computed property that reads self._data.total_episodes, or actually move spec to the self._data MinariStorage instance. Right now, if you create a filtered dataset that points to the same MinariStorage as another dataset and then add episodes to the filtered one, the spec doesn't get updated on the other. Curious to hear your thoughts on this @younik

@younik (Member) commented Jun 29, 2023


It is hard for me to comment on this, because the use case of spec is not clear to me. From the little I understand about it, we want to keep it in MinariDataset, so if we want total_episodes to reflect the unfiltered total number, we should make it a computed property.
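A minimal sketch of the computed-property alternative being discussed, assuming spec keeps a reference to the underlying storage (this is illustrative only and, per the follow-up below, not the route that was ultimately taken):

```python
class MinariDatasetSpec:
    """Sketch only; fields trimmed to the attribute under discussion."""

    def __init__(self, storage):
        self._storage = storage  # hypothetical back-reference to a MinariStorage

    @property
    def total_episodes(self) -> int:
        # Computed on access, so it always reflects the underlying MinariStorage,
        # even if another MinariDataset view has added episodes since creation.
        return int(self._storage.total_episodes)
```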

@balisujohn (Collaborator Author) commented Jun 30, 2023

I talked to Rodrigo, and it seems that spec is supposed to reflect the MinariDataset and not the underlying MinariStorage, so it's fine that adding episodes to one MinariDataset that points to a MinariStorage doesn't update total_episodes in the spec of another dataset pointing to the same storage.

With this in mind, I will add a commit that changes the spec behavior when adding episodes to MinariDataset instances.

@balisujohn requested a review from younik June 30, 2023 06:15
Member (@younik) left a comment

minor changes, then looks good

Comment on lines 176 to 182
"""Total episodes steps in the Minari dataset."""
if self._total_steps is None:
t_steps = self._data.apply(
lambda episode: episode["total_steps"],
lambda episode: episode.total_steps,
episode_indices=self._episode_indices,
)
self._total_steps = sum(t_steps)
Member (@younik)

Now that total_steps is computed at init, you can delete this function.
It was intended for lazy initialization in the case of large datasets.

Comment on lines 146 to 153
assert self._episode_indices is not None

total_steps = sum(
[
episode["total_timesteps"]
for episode in self._data.get_episodes(self._episode_indices.tolist())
]
)
Member (@younik)

I think it is better to use apply (as done in the total_steps property), because in the future apply may exploit parallelism.

Member (@younik)

(and assign the value to self.total_steps)

Collaborator Author (@balisujohn)

done
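For reference, a sketch of the suggested pattern, assuming MinariStorage.apply accepts a per-episode callable plus episode_indices and that each episode dict carries a "total_timesteps" entry (the key name follows the snippets quoted in this PR):

```python
# Sketch only: this fragment would live in MinariDataset.__init__, where
# self._data is the MinariStorage backend and self._episode_indices holds
# the currently active episode indices.
self.total_steps = sum(
    self._data.apply(
        lambda episode: episode["total_timesteps"],
        episode_indices=self._episode_indices,
    )
)
```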

@@ -255,9 +213,15 @@ def filter_episodes(self, condition: Callable[[h5py.Group], bool]) -> MinariData
```

Args:
condition (Callable[[h5py.Group], bool]): callable that accepts an episode group and returns True if certain condition is met.
condition (Callable[[Any], bool]): callable that accepts any type(For our current backend, an h5py episode group) and returns True if certain condition is met.
Member (@younik)

Why not Callable[[EpisodeData], bool]?

And then the comment should be updated accordingly (it should not mention h5py): callable that accepts EpisodeData and returns True if a certain condition is met, or boolean function on EpisodeData.

Collaborator Author (@balisujohn)

done

def filter_episodes(self, condition: Callable[[h5py.Group], bool]) -> MinariDataset:
def filter_episodes(
self, condition: Callable[[EpisodeData], bool]
) -> MinariDataset:
"""Filter the dataset episodes with a condition.

The condition must be a callable with a single argument, the episode HDF5 group.
Member (@younik)

Remove the mention of HDF5; it should now say EpisodeData.

Collaborator Author (@balisujohn)

done
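With that change, a user-facing call might look like the following (the dataset id and the EpisodeData field used in the lambda are illustrative; any locally available Minari dataset with those fields would behave the same way):

```python
import minari

dataset = minari.load_dataset("door-human-v1")  # example id, assumed to be available locally

# The condition now receives an EpisodeData instance, so attribute access
# replaces h5py group indexing.
short_episodes = dataset.filter_episodes(lambda ep: ep.total_timesteps <= 100)
print(short_episodes.total_episodes)
```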

Comment on lines 291 to 296
self.spec.total_steps = sum(
[
episode["total_timesteps"]
for episode in self._data.get_episodes(self._episode_indices)
]
)
Member (@younik)

Better to use apply, as above

Collaborator Author (@balisujohn)

done

self._episode_indices, np.arange(old_total_episodes, new_total_episodes)
) # ~= np.append(self._episode_indices,np.arange(self._data.total_episodes))

self.spec.total_episodes = len(self._episode_indices)
Member (@younik)

(A nitpick): be consistent with __init__ (there you used .size).

Collaborator Author (@balisujohn)

done

@@ -85,7 +87,21 @@ def apply(
for ep_idx in episode_indices:
ep_group = file[f"episode_{ep_idx}"]
assert isinstance(ep_group, h5py.Group)
out.append(function(ep_group))
ep_dict = {
Member (@younik)

On the apply signature, change the type annotation of function to accept a dictionary.

Collaborator Author (@balisujohn)

done
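A self-contained sketch of the dict-based apply, written as a standalone helper; the episode_{idx} group naming comes from the diff above, while the attrs and dataset names used to build the dict are assumptions based on the fields referenced elsewhere in this PR:

```python
from typing import Any, Callable, Dict, Iterable, List

import h5py


def apply_to_episodes(
    data_path: str,
    function: Callable[[Dict[str, Any]], Any],
    episode_indices: Iterable[int],
) -> List[Any]:
    """Convert each h5py episode group to a plain dict before calling function."""
    out = []
    with h5py.File(data_path, "r") as file:
        for ep_idx in episode_indices:
            ep_group = file[f"episode_{ep_idx}"]
            assert isinstance(ep_group, h5py.Group)
            ep_dict = {
                "id": ep_group.attrs.get("id"),
                "seed": ep_group.attrs.get("seed"),
                "total_timesteps": ep_group.attrs.get("total_steps"),
                "observations": ep_group["observations"][()],
                "actions": ep_group["actions"][()],
                "rewards": ep_group["rewards"][()],
                "terminations": ep_group["terminations"][()],
                "truncations": ep_group["truncations"][()],
            }
            out.append(function(ep_dict))
    return out
```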

@younik mentioned this pull request Jul 4, 2023
Member (@younik) left a comment

Two small comments, then it's ready to merge for me.

@@ -213,7 +207,7 @@ def filter_episodes(
```

Args:
condition (Callable[[Any], bool]): callable that accepts any type(For our current backend, an h5py episode group) and returns True if certain condition is met.
condition (Callable[[EpisodeData], bool]): callable that accepts any type(For our current backend, an h5py episode group) and returns True if certain condition is met.
Member (@younik)

Update "any type" to EpisodeData and remove the parenthetical "(For our current backend, an h5py episode group)".

minari/dataset/minari_dataset.py (outdated comment, resolved)
@younik merged commit 59683a1 into Farama-Foundation:main Jul 5, 2023
12 checks passed
2 participants