Direct corresponding helpers for RETURNN Datasets #231

JackTemaki · 2022-10-20T17:06:12Z

Currently returnn_common is lacking any direct interface for defining RETURNN datasets, we just have some generic interface for a task or some very hard-coded settings for librispeech, which do not fit with the current from-scratch Sisyphus pipelines.

Originally it was my plan to move everything from https://github.com/rwth-i6/i6_experiments/tree/main/users/rossenbach/common_setups/returnn/datasets to i6_experiments/common, but I understand that this better belongs here.

Before I start pushing code (I also understand PR-driven development is too slow at this stage), I would like to discuss where to put the code, and use this thread as a discussion starting point.

There is currently the module datasets, which contains the DatasetConfig in interface. There is unfortunately the dual usage of the word "dataset", once as dataset in the sense of "corpus/task", and once as dataset in the sense of "returnn dataset/dataset type". Not sure how to handle this optimally.

So @albertz, where would you put this and how would you name it?

The text was updated successfully, but these errors were encountered:

albertz · 2022-10-20T18:26:20Z

Actually datasets currently is similar to what you have. I wonder if your code was even derived from that at some point?

But you are right, then it goes further and also mixes this up a bit with actual dataset instances (corpora such as Librispeech). It also combines train/dev/eval already right in the base class.

I think its use is somewhat rare. I have some demos which still use the RETURNN import_ mechanism which use this, but the import_ mechanism actually always has the Git commit hash, so we can safely change anything without breaking those demos.

Otherwise, I currently started to use the base class (DatasetConfig) for my current i6-experiments RETURNN-common pipeline, but all of this is very much just a proof-of-concept anyway, and I can change things there.

I'm not sure who else uses it actively. I assume no-one. If this is the case, we can safely move and rename this code to sth like datasets_deprecated_2022_10 or so, or completely remove it, or merge it somehow with your code.

In case someone uses the current code, and we don't want to break it, we can also put your code as datasets_v2 or so. Putting a version prefix is how I plan to handle future versions in general, to not break old setups. This becomes more important now when we don't use import_ anymore but instead somehow have returnn-common directly imported, either in Sisyphus recipes or inside RETURNN configs or both. But as said, if no-one uses it, we can safely change it in this case now.

albertz · 2022-10-20T18:27:58Z

For specific corpora, or instances of the datasets, or helper functions to get such instances, I would still put those under i6_experiments/common, as they might depend on some Sis pipeline. Here in returnn_common, I just would put those base classes.

albertz · 2022-10-20T18:28:10Z

For discussions on the code, such as review comments, we can keep using this issue here. I agree, I think we don't need PR-based development at the moment while this is new and no-one else has really started to use it. Only later when some people started to use it, then we can switch over to PR-based development.

albertz · 2022-10-20T18:32:22Z

If we go with your GenericDataset, I think we additionally need sth which combines those into train/dev/eval, and then further which defines sth like my Task base class in i6_experiments. But I think this can be based on this, so we can probably discuss these separately. I will maybe see later how I can change the Task class to use your interface, or if there are any problems.

JackTemaki · 2022-10-21T09:25:04Z

For specific corpora, or instances of the datasets, or helper functions to get such instances, I would still put those under i6_experiments/common, as they might depend on some Sis pipeline. Here in returnn_common, I just would put those base classes.

Yes this is what I imagined.

I think no one is using the current dataset code except you. Just move/rename it how you want it and then I start moving my code to where you want to have it.

JackTemaki · 2022-10-21T09:29:52Z

And then afterwards we can check how to adapt the rest of the code. I will probably not use DatasetConfig as it does not match my style of dataset definitions (and it is not really suited for what I do with TTS), but this is not a problem at all to keep/adapt it.

#231

albertz · 2022-10-24T08:38:20Z

Ok, I moved it to datasets_old_2022_10.

rwth-i6/returnn_common#231

albertz · 2022-11-03T13:15:54Z

@JackTemaki What's the state here?

JackTemaki · 2022-11-10T12:44:58Z

There were other tasks with a higher priority, so I postponed this.

How should I deal with the Sisyphus dependency for paths. Like this?

try:
  from sisyphus import tk
  FilePathType = Union[tk.Path, str]
except:
  FilePathType = str

Then I still have the problem that I want to disallow `str' when working in a Sisyphus environment. An assert that figures out if the code is run within a Sisyphus manager would be nice, but I am not sure how to determine this.

albertz · 2022-11-10T13:04:42Z

Obviously never except: but except ImportError: here.

An assert that figures out if the code is run within a Sisyphus manager

We can easily introduce such thing on Sisyphus. E.g. in Sisyphus ConfigManager such that you can check whether the code is run from within a config.

JackTemaki · 2022-11-10T13:23:41Z

in_sisyphus_config = False
try:
    from sisyphus import tk
    from sisyphus.loader import config_manager
    FilePathType = Union[tk.Path, str]
    if config_manager.current_config is not None:
        in_sisyphus_config = True
except ImportError:
    FilePathType = str

Then in_sisyphus_config (name is not good maybe) could be used for asserts.

albertz · 2022-11-10T13:26:35Z

This should not be in the module scope. You should put this into a helper function, and call that.

albertz · 2022-12-02T15:17:24Z

Can you describe the state here?
Related is also the issue #248.

JackTemaki · 2022-12-06T16:31:11Z

The LmDataset is still missing, and we still have to deal with the Datastreams, but I guess this is #248 and not this issue here. If you want we can close this.

JackTemaki changed the title ~~Direct corresponding helpers for RETURNN Dataset~~ Direct corresponding helpers for RETURNN Datasets Oct 20, 2022

albertz added a commit that referenced this issue Oct 24, 2022

Rename old datasets interface

000ee8d

#231

albertz added a commit to rwth-i6/i6_experiments that referenced this issue Oct 24, 2022

returnn_common.datasets was renamed

6b02d23

rwth-i6/returnn_common#231

This was referenced Dec 2, 2022

Datasets miss extern data handling and other things #248

Open

How to integrate returnn_common.nn into Sisyphus pipelines rwth-i6/i6_experiments#63

Closed

albertz assigned mmz33, JackTemaki and albertz Dec 2, 2022

albertz closed this as completed Dec 6, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Direct corresponding helpers for RETURNN Datasets #231

Direct corresponding helpers for RETURNN Datasets #231

JackTemaki commented Oct 20, 2022

albertz commented Oct 20, 2022

albertz commented Oct 20, 2022

albertz commented Oct 20, 2022 •

edited

Loading

albertz commented Oct 20, 2022

JackTemaki commented Oct 21, 2022

JackTemaki commented Oct 21, 2022

albertz commented Oct 24, 2022

albertz commented Nov 3, 2022

JackTemaki commented Nov 10, 2022

albertz commented Nov 10, 2022

JackTemaki commented Nov 10, 2022

albertz commented Nov 10, 2022

albertz commented Dec 2, 2022 •

edited

Loading

JackTemaki commented Dec 6, 2022

Direct corresponding helpers for RETURNN Datasets #231

Direct corresponding helpers for RETURNN Datasets #231

Comments

JackTemaki commented Oct 20, 2022

albertz commented Oct 20, 2022

albertz commented Oct 20, 2022

albertz commented Oct 20, 2022 • edited Loading

albertz commented Oct 20, 2022

JackTemaki commented Oct 21, 2022

JackTemaki commented Oct 21, 2022

albertz commented Oct 24, 2022

albertz commented Nov 3, 2022

JackTemaki commented Nov 10, 2022

albertz commented Nov 10, 2022

JackTemaki commented Nov 10, 2022

albertz commented Nov 10, 2022

albertz commented Dec 2, 2022 • edited Loading

JackTemaki commented Dec 6, 2022

albertz commented Oct 20, 2022 •

edited

Loading

albertz commented Dec 2, 2022 •

edited

Loading