I am loading two HDFDatasets (each ~8 GB, 40,310,479 seqs) inside a MetaDataset. This process takes about 30 minutes and uses up ~36 GB of RAM, which I find excessive.
Profiling the program with py-spy suggests that _update_tag_idx in returnn/datasets/cached.py:148 is the bottleneck here. (cc @patrick-wilken @NeoLegends @JackTemaki)
For CombinedDataset there used to be a similar issue; that's why I changed its implementation (already years ago) so that the sequence order is passed to the sub-datasets via indices instead of sequence tags.
MetaDataset does it via tags, and as you noticed, this triggers the creation of the CachedDataset._tag_idx mapping, which in your case contains 2 * 40M strings. It also accesses the HDF file 2 * 40M times to get the tags one by one, which is slow: not only because each access hits the file, but also because the h5py library is not optimized for this access pattern and adds overhead to every call.
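The per-call overhead is easy to see in isolation (a minimal sketch, not RETURNN code; the file and dataset names here are hypothetical):

```python
import h5py

with h5py.File("data.hdf", "r") as f:  # hypothetical file
    tags = f["seqTags"]  # hypothetical dataset holding the sequence tags

    # Slow: one h5py call per sequence; each call pays Python/HDF5
    # dispatch overhead, so tens of millions of reads dominate load time.
    tags_one_by_one = [tags[i] for i in range(len(tags))]

    # Much faster: a single bulk read into a NumPy array,
    # then iterate in memory.
    tags_bulk = tags[...]
```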
Also, the attributes MetaDataset.seq_list_original and MetaDataset.tag_idx probably get large in your case as well.
I never had to use MetaDataset myself for so many sequences, so I don't have an existing solution, but for CombinedDataset the important points to get a setup that scales to pretty much arbitrary dataset sizes were (a conceptual sketch follows this list):

- don't use sequence tags at all for initialization / sequence order
- (try to) remove all variables* of size O(total_num_sequences)
- minimize the number of HDF accesses
- use sub-epochs and apply (non-default) sequence ordering only on the sub-epoch level (especially length-based methods)

(*) We even use a custom variant of the HDFDataset that avoids file_seq_start by storing those indices in the HDF instead of the sequence lengths.
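How the last point can work in principle (a minimal, hedged sketch in plain Python to illustrate the idea; this is not the actual CombinedDataset or RETURNN API): compute only a cheap global order, take one sub-epoch's slice, and apply the expensive length-based ordering to that slice alone.

```python
import random

def sub_epoch_order(num_seqs, num_parts, part, seq_len):
    """Sketch: length-based ordering applied per sub-epoch only.

    Only a cheap shuffle touches the full index range; the costly
    length-based sort (whose seq_len lookups would mean HDF reads)
    runs on one sub-epoch of size num_seqs / num_parts.
    """
    order = list(range(num_seqs))
    random.Random(42).shuffle(order)  # cheap, no per-seq metadata needed
    part_size = (num_seqs + num_parts - 1) // num_parts
    chunk = order[part * part_size:(part + 1) * part_size]
    chunk.sort(key=seq_len)  # expensive lookups limited to this slice
    return chunk

# Toy usage: 10 seqs, 2 sub-epochs, fake length function.
print(sub_epoch_order(10, 2, part=0, seq_len=lambda i: i % 5))
```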
> don't use sequence tags at all for initialization / sequence order

This doesn't really work for MetaDataset, because we sometimes (even often?) have datasets where the underlying seq order (and thus the seq indices) is different between the sub-datasets, as in the toy sketch below.
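A toy illustration (names invented for the example): if the sub-datasets store the same corpus in different orders, a shared integer index points to different sequences, so tags are the only common identifier, and each sub-dataset then needs a tag-to-index mapping like CachedDataset._tag_idx, which is O(total_num_sequences) in memory.

```python
# Order of the same three sequences in two hypothetical sub-datasets:
order_a = ["seq-2", "seq-0", "seq-1"]
order_b = ["seq-0", "seq-1", "seq-2"]

# Index 0 means "seq-2" in A but "seq-0" in B, so plain indices cannot
# be shared. Aligning B to A's order requires a tag -> index mapping:
tag_idx_b = {tag: i for i, tag in enumerate(order_b)}
indices_into_b = [tag_idx_b[tag] for tag in order_a]
print(indices_into_b)  # -> [2, 0, 1]
```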
> (try to) remove all variables of size O(total_num_sequences)

I thought you still keep the seq indices as the order?

> apply (non-default) sequence ordering only on the sub-epoch level (especially length-based methods)

How do you do that?

> We even use a custom variant of the HDFDataset that avoids file_seq_start by storing those indices in the HDF instead of the sequence lengths.

But this is not in master? Why did you not make a PR for this?