Loading large HDFDatasets inside MetaDataset is slow #1669

Open
dorian-K opened this issue Dec 13, 2024 · 2 comments
Comments

dorian-K (Contributor) commented Dec 13, 2024

I am loading two HDFDatasets (each ~8 GB, 40,310,479 seqs) inside a MetaDataset. This process takes about 30 minutes and uses ~36 GB of RAM, which I find excessive.
Profiling the program with py-spy suggests that _update_tag_idx in returnn/datasets/cached.py:148 is the bottleneck here.

(cc @patrick-wilken @NeoLegends @JackTemaki)
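For context, a minimal sketch of what building such a tag index amounts to (illustrative names, not the actual cached.py code):

```python
def build_tag_idx(get_seq_tag, num_seqs):
    # One get_seq_tag(i) call per sequence. For an HDF-backed dataset,
    # each call is a separate file access, and the resulting dict holds
    # one Python string per sequence (~40M entries per dataset here).
    return {get_seq_tag(i): i for i in range(num_seqs)}

# Toy stand-in for the per-sequence tag lookup:
tags = ["seq-%i" % i for i in range(1000)]
tag_idx = build_tag_idx(lambda i: tags[i], len(tags))
assert tag_idx["seq-42"] == 42
```

With 2 × 40M sequences, both the per-call HDF access overhead and the dict of tag strings scale linearly, which matches the observed runtime and memory usage.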

patrick-wilken (Contributor)

For CombinedDataset there used to be a similar issue; that's why, already years ago, I changed its implementation so that the sequence order is passed to the sub-datasets via indices instead of sequence tags.
MetaDataset does it via tags, and as you noticed, this triggers the creation of the CachedDataset._tag_idx mapping, which in your case contains 2 × 40M strings. It also accesses the HDF file 2 × 40M times to get the tags one by one, which is slow: not only because each is a file access, but also because the h5py library is not optimized for this access pattern and has some per-call overhead.
Also, the attributes MetaDataset.seq_list_original and MetaDataset.tag_idx probably just get large in your case.
I never had to use MetaDataset myself with so many sequences, so I don't have an existing solution, but for CombinedDataset the important points for a setup that scales to pretty much arbitrary dataset sizes were:

  • don't use sequence tags at all for initialization / sequence order
  • (try to) remove all variables* of size O(total_num_sequences)
  • minimize number of HDF accesses
  • use sub-epochs and apply (non-default) sequence ordering only on sub-epoch level (especially length-based methods)

*We even use a custom variant of the HDFDataset that avoids file_seq_start by storing those indices in the HDF instead of the sequence lengths.
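The first point (index-based instead of tag-based ordering) can be sketched roughly like this; illustrative code only, not RETURNN's actual implementation:

```python
import random

def seq_order_by_indices(num_seqs, epoch):
    # Permute the indices 0..N-1 directly and hand them to the
    # sub-datasets. No sequence-tag strings and no tag->index dict
    # of size O(total_num_sequences) are needed.
    rng = random.Random(epoch)  # deterministic per epoch
    order = list(range(num_seqs))
    rng.shuffle(order)
    return order

order = seq_order_by_indices(5, epoch=1)
assert sorted(order) == [0, 1, 2, 3, 4]
```

The design point is that a permutation of integers is cheap and needs no per-sequence HDF access, whereas tag-based ordering forces each sub-dataset to resolve every tag back to an index first.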

albertz (Member) commented Dec 13, 2024

> don't use sequence tags at all for initialization / sequence order

This doesn't really work for MetaDataset because we sometimes (even often?) have datasets where the underlying seq order (and thus the seq indices) is different.

> (try to) remove all variables of size O(total_num_sequences)

I thought you still have the seq indices as order?

> apply (non-default) sequence ordering only on sub-epoch level (especially length-based methods)

How do you do that?

> We even use a custom variant of the HDFDataset that avoids file_seq_start by storing those indices in the HDF instead of the sequence lengths.

But this is not in master? Why did you not make a PR for this?
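For reference, the file_seq_start idea quoted above can be illustrated with a small stdlib-only sketch; this is conceptual, not the actual HDFDataset code:

```python
from itertools import accumulate

# Stored per-sequence lengths, as a plain HDFDataset would have them:
seq_lens = [3, 5, 2, 4]

# Deriving the start offsets from the lengths at load time is an
# O(N) pass that also allocates an O(N) array in memory:
seq_start = [0] + list(accumulate(seq_lens))
assert seq_start == [0, 3, 8, 10, 14]  # sequence 2 starts at frame 8

# The custom variant instead stores this cumulative array in the HDF
# file itself, so opening the file skips the O(N) pass entirely.
```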

Labels: None yet
Projects: None yet
Development: No branches or pull requests
4 participants