Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Direct corresponding helpers for RETURNN Datasets #231

Closed
JackTemaki opened this issue Oct 20, 2022 · 14 comments
Closed

Direct corresponding helpers for RETURNN Datasets #231

JackTemaki opened this issue Oct 20, 2022 · 14 comments
Assignees

Comments

@JackTemaki
Copy link
Contributor

Currently returnn_common is lacking any direct interface for defining RETURNN datasets, we just have some generic interface for a task or some very hard-coded settings for librispeech, which do not fit with the current from-scratch Sisyphus pipelines.

Originally it was my plan to move everything from https://github.com/rwth-i6/i6_experiments/tree/main/users/rossenbach/common_setups/returnn/datasets to i6_experiments/common, but I understand that this better belongs here.

Before I start pushing code (I also understand PR-driven development is too slow at this stage), I would like to discuss where to put the code, and use this thread as a discussion starting point.

There is currently the module datasets, which contains the DatasetConfig in interface. There is unfortunately the dual usage of the word "dataset", once as dataset in the sense of "corpus/task", and once as dataset in the sense of "returnn dataset/dataset type". Not sure how to handle this optimally.

So @albertz, where would you put this and how would you name it?

@JackTemaki JackTemaki changed the title Direct corresponding helpers for RETURNN Dataset Direct corresponding helpers for RETURNN Datasets Oct 20, 2022
@albertz
Copy link
Member

albertz commented Oct 20, 2022

Actually datasets currently is similar to what you have. I wonder if your code was even derived from that at some point?

But you are right, then it goes further and also mixes this up a bit with actual dataset instances (corpora such as Librispeech). It also combines train/dev/eval already right in the base class.

I think its use is somewhat rare. I have some demos which still use the RETURNN import_ mechanism which use this, but the import_ mechanism actually always has the Git commit hash, so we can safely change anything without breaking those demos.

Otherwise, I currently started to use the base class (DatasetConfig) for my current i6-experiments RETURNN-common pipeline, but all of this is very much just a proof-of-concept anyway, and I can change things there.

I'm not sure who else uses it actively. I assume no-one. If this is the case, we can safely move and rename this code to sth like datasets_deprecated_2022_10 or so, or completely remove it, or merge it somehow with your code.

In case someone uses the current code, and we don't want to break it, we can also put your code as datasets_v2 or so. Putting a version prefix is how I plan to handle future versions in general, to not break old setups. This becomes more important now when we don't use import_ anymore but instead somehow have returnn-common directly imported, either in Sisyphus recipes or inside RETURNN configs or both. But as said, if no-one uses it, we can safely change it in this case now.

@albertz
Copy link
Member

albertz commented Oct 20, 2022

For specific corpora, or instances of the datasets, or helper functions to get such instances, I would still put those under i6_experiments/common, as they might depend on some Sis pipeline. Here in returnn_common, I just would put those base classes.

@albertz
Copy link
Member

albertz commented Oct 20, 2022

For discussions on the code, such as review comments, we can keep using this issue here. I agree, I think we don't need PR-based development at the moment while this is new and no-one else has really started to use it. Only later when some people started to use it, then we can switch over to PR-based development.

@albertz
Copy link
Member

albertz commented Oct 20, 2022

If we go with your GenericDataset, I think we additionally need sth which combines those into train/dev/eval, and then further which defines sth like my Task base class in i6_experiments. But I think this can be based on this, so we can probably discuss these separately. I will maybe see later how I can change the Task class to use your interface, or if there are any problems.

@JackTemaki
Copy link
Contributor Author

For specific corpora, or instances of the datasets, or helper functions to get such instances, I would still put those under i6_experiments/common, as they might depend on some Sis pipeline. Here in returnn_common, I just would put those base classes.

Yes this is what I imagined.

I think no one is using the current dataset code except you. Just move/rename it how you want it and then I start moving my code to where you want to have it.

@JackTemaki
Copy link
Contributor Author

And then afterwards we can check how to adapt the rest of the code. I will probably not use DatasetConfig as it does not match my style of dataset definitions (and it is not really suited for what I do with TTS), but this is not a problem at all to keep/adapt it.

albertz added a commit that referenced this issue Oct 24, 2022
@albertz
Copy link
Member

albertz commented Oct 24, 2022

Ok, I moved it to datasets_old_2022_10.

albertz added a commit to rwth-i6/i6_experiments that referenced this issue Oct 24, 2022
@albertz
Copy link
Member

albertz commented Nov 3, 2022

@JackTemaki What's the state here?

@JackTemaki
Copy link
Contributor Author

There were other tasks with a higher priority, so I postponed this.

How should I deal with the Sisyphus dependency for paths. Like this?

try:
  from sisyphus import tk
  FilePathType = Union[tk.Path, str]
except:
  FilePathType = str

Then I still have the problem that I want to disallow `str' when working in a Sisyphus environment. An assert that figures out if the code is run within a Sisyphus manager would be nice, but I am not sure how to determine this.

@albertz
Copy link
Member

albertz commented Nov 10, 2022

Obviously never except: but except ImportError: here.

An assert that figures out if the code is run within a Sisyphus manager

We can easily introduce such thing on Sisyphus. E.g. in Sisyphus ConfigManager such that you can check whether the code is run from within a config.

@JackTemaki
Copy link
Contributor Author

in_sisyphus_config = False
try:
    from sisyphus import tk
    from sisyphus.loader import config_manager
    FilePathType = Union[tk.Path, str]
    if config_manager.current_config is not None:
        in_sisyphus_config = True
except ImportError:
    FilePathType = str

Then in_sisyphus_config (name is not good maybe) could be used for asserts.

@albertz
Copy link
Member

albertz commented Nov 10, 2022

This should not be in the module scope. You should put this into a helper function, and call that.

@albertz
Copy link
Member

albertz commented Dec 2, 2022

Can you describe the state here?
Related is also the issue #248.

@JackTemaki
Copy link
Contributor Author

The LmDataset is still missing, and we still have to deal with the Datastreams, but I guess this is #248 and not this issue here. If you want we can close this.

@albertz albertz closed this as completed Dec 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants