Datasets miss extern data handling and other things #248
What @Atticus1806 and I are using is e.g. the following code:

```python
@dataclass(frozen=True)
class TrainingDatasets:
    train: Dataset
    cv: Dataset
    devtrain: Dataset
    extern_data: Dict[str, Dict[str, Any]]
```

Or even directly with the `Datastream` objects:

```python
@dataclass(frozen=True)
class TTSTrainingDatasets:
    """
    Dataclass for TTS Datasets
    """
    train: MetaDataset
    cv: MetaDataset
    datastreams: Dict[str, Datastream]
```

So no dev-train because data-aug is currently not relevant for TTS, and the type is MetaDataset because we always have multiple inputs/outputs.

For the dimension tags, maybe consider something like this (slightly adapted from my ASR setup):

```python
train_bpe_datastream = get_bpe_datastream(bpe_size=bpe_size, is_recog=False)

if use_raw_features:
    audio_datastream = get_audio_raw_datastream()
else:
    audio_datastream = get_audio_datastream([...])

datastreams = {
    'audio_features': audio_datastream,
    'bpe_labels': train_bpe_datastream
}

# [.... do dataset stuff using the existing helpers...]

return TrainingDatasets(
    train=train_dataset,
    cv=cv_dataset,
    devtrain=devtrain_dataset,
    datastreams=datastreams,
)
```

This pipeline is used in my case for both "traditional" and "returnn_common" setups. For the traditional setups:

```python
extern_data = {
    key: datastream.as_returnn_extern_data_opts()
    for key, datastream in training_datasets.datastreams.items()
}
```

and for RC setups:

```python
rc_extern_data = ExternData([
    datastream.as_nnet_constructor_data(key)
    for key, datastream in training_datasets.datastreams.items()
])
```

What is input and what is output is not relevant for me, because this is decided in the network construction. Especially for TTS this can switch often (e.g. duration labels as target in training but as input during speed-controlled generation). Thus we also stopped using […]
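For illustration, a rough sketch of what this looks like in a RETURNN network dict (hypothetical key and layer names, "encoder" is assumed to exist elsewhere in the net): whether a key like "duration_data" acts as a target or as an input is decided only by how the network refers to it.

```python
# Rough sketch with made-up names, not from the actual setup.

# Training: durations act as a target, used only by the loss.
net_train = {
    "duration_pred": {"class": "linear", "from": "encoder", "n_out": 1,
                      "loss": "mse", "target": "duration_data"},
}

# Speed-controlled generation: the very same key is read as an input instead.
net_generate = {
    "durations": {"class": "copy", "from": "data:duration_data"},
}
```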
---

Regarding dim tags, your code seems wrong to me. It seems like you can always only get different dim tags but never share dim tags, or at least I don't see how. E.g. in the case of framewise training, the "classes" must have the same time-dim-tag as "data". But as I see from your code, it looks like I would get two separate (different) time dim tags, which is wrong.

I did not really get your point on input/output. At some point, it is relevant what is the input/output, to define what to forward through the net, and what to use for the loss. Currently I define this also in my […]
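To make the dim tag concern concrete, a minimal sketch (using the same dim tag API as in the code further below; dims and sizes are made up): for framewise training, "data" and "classes" have to reference one and the same time dim tag object.

```python
from returnn.tf.util.data import FeatureDim, SpatialDim, batch_dim

time_dim = SpatialDim("time")               # one shared tag object
feature_dim = FeatureDim("audio", 40)
classes_dim = FeatureDim("classes", 9001)   # made-up vocab size

extern_data = {
    "data": {"dim_tags": [batch_dim, time_dim, feature_dim]},
    # Framewise labels: must reuse the *same* time_dim as "data".
    # If each datastream constructed its own SpatialDim("time"), these would be
    # two unrelated tags, which is exactly the problem described above.
    "classes": {"dim_tags": [batch_dim, time_dim], "sparse_dim": classes_dim},
}
```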
---

Yes and no, I am inferring the options for the dataset from the datastream.

This is correct, I did not add that possibility yet. So far this was also not necessary, but I understand this is not optimal.

What is output and what is input is not always clear in my setups, so I do not make that distinction explicitly anywhere. And yes, I set specific key names that have to match.
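As a minimal sketch of that idea (this is not the actual Datastream implementation; the method bodies here are made up, only `as_returnn_extern_data_opts` is a name used above): one datastream object provides both the extern_data options and the content-related dataset options.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class AudioFeatureDatastream:
    """Hypothetical minimal datastream; the real class has more to it."""
    num_features: int

    def as_returnn_extern_data_opts(self) -> Dict[str, Any]:
        # Content description for extern_data (cf. the dict comprehension above).
        return {"shape": (None, self.num_features), "dim": self.num_features}

    def as_returnn_audio_opts(self) -> Dict[str, Any]:
        # Made-up helper: the "audio" options for e.g. an OggZipDataset are
        # inferred from the datastream, not written into the dataset by hand.
        return {"features": "mfcc", "num_feature_filters": self.num_features}
```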
---

But this would not work automatically this way for all datasets. Actually, for many datasets this will not work. E.g. how do you handle the `HDFDataset`? And why don't you derive it automatically for […]?
But I fear that this is not something which you can add easily to the way you designed the whole thing. This is a very fundamental property and I think it requires a different design. I think this requires that the dataset really specifies the datastreams, and not the other way around. It's necessary for any framewise training (hybrid HMM, transducer), so this is quite an important aspect. In my old `DatasetConfig`-based setup, e.g.:

```python
class SwitchboardExternSprint(DatasetConfig):
    ...

    def get_extern_data(self) -> Dict[str, Dict[str, Any]]:
        """
        Get extern data
        """
        from returnn.tf.util.data import FeatureDim, SpatialDim, batch_dim

        time_dim = SpatialDim("time")
        feature_dim = FeatureDim("audio", 40)  # Gammatone 40-dim
        out_spatial_dim = SpatialDim("out-spatial")
        classes_dim = FeatureDim("vocab", dimension=self.vocab.get_num_classes())
        d = {
            "data": {"dim_tags": [batch_dim, time_dim, feature_dim]},
        }
        if self.vocab:
            target = "orth_classes"
            d[target] = {
                "dim_tags": [batch_dim, out_spatial_dim],
                "sparse_dim": classes_dim,
                "vocab": self.vocab.get_opts()
            }
        return d
```

Via such construction, it is easy to share dim tags.
For the moment, and for my current applications, I'm specifically aiming to define a generic supervised training setting for the beginning, where you have exactly one input and one target. Other cases would be handled differently, could be extensions of that, or whatever. But such a supervised training setting covers a lot of what we do. It covers all ASR (without speaker adaptation) and MT.

I'm not exactly sure how to handle alignments, actually. Should this replace the targets? But would this make the scoring somehow complicated? Although my current setup is Switchboard, where the scoring is anyway via the official scoring script and I don't use the targets from the datasets. Not sure about other cases. Alternatively, the dataset could maybe provide all three keys (inputs, alignment frames, and normal targets), and then you could just ignore the normal targets for training with chunking.

In any case, for a given kind of task, I want to define models, training, and recognition. E.g. think of an attention-based encoder-decoder model. I want to implement it in such a way that I can easily plug in some ASR or MT task, or any other supervised task where I have an input and a target. But it must be well defined what is the input and what are the targets. And I'm not sure if it is a good idea to have this just via implicit assumptions on specific key names. I remember that you always argued that having such implicit assumptions on key names is bad ("data" and "classes").
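Purely as a sketch of what I mean (hypothetical, not an existing class; only the name `SupervisedTrainingDatasets` is the one floated in the issue): the task could declare the input and target keys explicitly instead of relying on key-name conventions.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class SupervisedTrainingDatasets:
    """Hypothetical variant of TrainingDatasets with an explicit input/target declaration."""
    train: Any       # Dataset
    cv: Any          # Dataset
    devtrain: Any    # Dataset
    extern_data: Dict[str, Dict[str, Any]]
    input_key: str   # e.g. "data" or "audio_features"
    target_key: str  # e.g. "orth_classes" or "bpe_labels"
```

A generic encoder-decoder training recipe could then read `input_key` and `target_key` from the task instead of assuming specific key names.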
---

The HDF Dataset has no options related to the content, so there is no handling needed. It is actually the best example of why a Datastream is somewhat independent of the Dataset, and should not be created as part of it.
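For illustration (made-up path, but real `HDFDataset` options): the dataset dict itself only points to files, so all content information has to come from somewhere else, e.g. the datastreams.

```python
# The dataset config carries no shape/content information:
train_hdf_dataset = {
    "class": "HDFDataset",
    "files": ["/path/to/train.hdf"],  # placeholder path
}

# Shapes, sparseness, vocab etc. come from the datastreams instead, e.g.:
# extern_data = {key: ds.as_returnn_extern_data_opts() for key, ds in datastreams.items()}
```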
It is not strictly necessary: I have a running Hybrid setup, and also our TTS model has 2 Datastreams which share a time axis. It would be better and more consistent though; I will think about it.
Correct, and this is why I set "explicit" keys, and have no automatism or defaults. I understand that you do not like that there is then some coupling needed between task and model (I do this in my […]).
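A small sketch of what that coupling looks like in practice (key names taken from the pipeline code above, the network fragment is made up): the key is defined once and has to match on both the datastream side and in the network construction.

```python
# Placeholders standing in for the real datastreams from the pipeline above:
audio_datastream = object()
train_bpe_datastream = object()

# Keys defined once, used on both sides:
AUDIO_KEY = "audio_features"
LABEL_KEY = "bpe_labels"

datastreams = {
    AUDIO_KEY: audio_datastream,
    LABEL_KEY: train_bpe_datastream,
}

# Made-up network fragment: the model must reference the same keys.
network = {
    "encoder": {"class": "copy", "from": f"data:{AUDIO_KEY}"},
    "output": {"class": "softmax", "from": "encoder", "loss": "ce", "target": LABEL_KEY},
}
```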
---

I'm looking into how to convert my old `DatasetConfig`-based datasets to the new `Dataset` interface (#231). What I'm missing:

- Extern data handling. Should the extern data (dim tags etc.) be part of `TrainingDatasets`?
- Should this be part of the `Dataset` interface? Otherwise you must do this manually, and somehow infer it from the dataset? Or I would need some other extended structure, `DatasetWithExternData` or so.
- What is the input and what are the targets? Should that be part of `TrainingDatasets`, or maybe there would be a more special variant `SupervisedTrainingDatasets`?