Add TransducerFullSumAndFramewiseTrainingPipeline #64
base: master
Conversation
It was not possible. I could only comment; I couldn't recover the branch of the PR, no matter what I tried. Right now I am looking into what has to be considered so that the alignments and the input features correspond to each other when loading. Is it clear which options influence this (ordering, sorting, batch size, ...)? If yes, one could automatically read those options from the default dataset. Also: what extra information has to be saved together with the alignments for each label topology, and how?
It should always be possible by just force-pushing to your branch (the one you used for the PR). Btw, I see that you also added the dataset there. Please separate this (always separate things when they are logically separate). And I'm anyway not sure about this. I don't like that we now need to recreate wrappers for all datasets. That's bad. That should be avoided (automated, or be part of RETURNN itself). But anyway, that's off-topic here.
I don't quite understand this comment. What do you mean by "correspond to each other"? Why do you think you need any extra logic there? Every sequence is already identified by the seq-tag.
Like what?
You mean more like some extra logic? Or what extra information? -> Logic
Or you mean the extra chunking logic? We anyway need to think about how the chunking would be generalized. There is an initial implementation here, but this needs changes. Anyway, this is all off-topic here, or not?
Yes. It should have been put in the main config.
Seems like it is already taken care of on the side of both.
Logic
I am talking here about the stuff happening at the stage level: we either dump the alignments, or load them and do CE training.
Yes, including that.
Not really, it is some work towards that: the goal is to separate the logic of full sum, Viterbi realignment and CE from the model itself. I think that multi-stage training should be a plug-in. Once you have a model, one could easily choose the pipeline.
So can you clean up this PR and separate this?
I'm not sure what you mean by "plug and play"? Obviously the normal chunking cannot work.
I don't understand what's the difference between making a path and making an alignment. But making (and dumping) the alignment is independent from the label topology? I'm not really sure whether the multi stager should need to handle any of this. This looks very unclean to me, like you mix up different things (multi staging + alignment creation + alignment dumping + alignment loading). Can't we separate all of this? Making things coupled together is always bad.
I thought the multi stager (this PR here) is about a multi stager, where you combine several different training steps (any, doesn't matter what they do).
But we already have that? We have some functions which build the model, other (separate) functions which define the training loss, and yet separate functions which define pretraining and the training pipeline. Unless you never intended the multi stager to be generic (then I misunderstood), but very specific for this transducer model and transducer training pipeline. But then I would also name it something more specific. If it is supposed to be generic, I don't think it should have any extra logic for things like alignments etc. It might have very generic support for storing and loading (any!) auxiliary data (storing via HDFDumpLayer, and loading via MetaDataset/HDFDataset).
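To make that last point a bit more concrete, here is a rough sketch (not part of this PR; the file name, the "fullsum_alignment" layer name and the placeholder dataset dict are made up) of how such generic auxiliary-data support could look in a RETURNN config: dump via an HDFDumpLayer in one stage, and load via MetaDataset/HDFDataset in a later stage, matched through the seq tags.
# Dumping stage: write some layer output (e.g. an alignment) to HDF.
network = {
    # ... model layers ...
    "dump_aux": {
        "class": "hdf_dump", "from": "fullsum_alignment",  # placeholder source layer
        "filename": "aux_data.hdf", "is_output_layer": True},
}

# Later stage: combine the original dataset with the dumped HDF file.
# Sequences are matched via their seq tags; the original dataset controls the order.
original_train_dataset = {}  # placeholder: whatever train dataset the setup already uses
train = {
    "class": "MetaDataset",
    "datasets": {
        "features": original_train_dataset,
        "aux": {"class": "HDFDataset", "files": ["aux_data.hdf"]},
    },
    "data_map": {
        "data": ("features", "data"),
        "alignment": ("aux", "data"),
    },
    "seq_order_control_dataset": "features",
}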
Naming is bad.
Yes, if you try to change the chunking to make up for the topology. For RNNT one could instead dump index sequences of the blank labels as an extra dataset and chunk along that. I don't know if this is doable, as it would require to return
They are separated, just not in different files.
Yes, but the in-between steps of switching between FS and CE are missing. That is what I am intending to add. My plan was to add the logic in returnn-experiments only. As you say, it is rather transducer specific. I will rename it as you suggest.
I don't really understand. The making/dumping is in any case independent. I think you refer to the framewise training and chunking. For chunking, yes, it's specific to the topology. I don't understand what you describe: no matter how you dump it, the chunking needs custom logic. I also don't understand why the multi stager needs to handle any of this.
What exactly is this PR about? I thought it's about the multi stager (and only about that)? We should not mix up things. And I still don't see why the alignment stuff (dumping, loading), framewise training etc. should matter for that. The logic of multi staging would be totally independent of any of that?
This PR still has stuff about the dummy dataset. Can you remove this here?
This PR still looks very much work-in-progress. Can you mark it as a draft until you consider it ready?
"extend_existing_file": extend, # TODO: extend only the first time | ||
# dataset_name: comes from **opts of the lambda in filename | ||
"filename": | ||
(lambda **opts: "{align_dir}/align.{name}_{dataset_name}.hdf".format(align_dir=align_dir, |
Do not use str.format. Use f-strings.
I don't think it is easy to do in this case. How would you do it? The dataset_name comes from **opts.
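For what it's worth, one option (just a sketch, assuming align_dir and name are in scope where the lambda is defined, as in the current code) would be to move the f-string into the lambda body, since dataset_name is available there via opts:
# The f-string is evaluated when the lambda is called, so opts is in scope here.
"filename": (lambda **opts: f"{align_dir}/align.{name}_{opts['dataset_name']}.hdf"),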
class Context:
Why do you duplicate this here? We already have such a class.
Here the context is a little broader. Should I give the rest as separate params to the functions?
This is no excuse. Then take the base context as an argument here and extend it. But do not duplicate code & logic when not really needed.
But what is really the extension here? Just the alignment_topology? For that, you don't need any new Context type at all; just pass it as an extra argument wherever needed.
Or maybe extend the base Context class. Probably we anyway need it also there?
Or maybe extend the base Context class. Probably we anyway need it also there?
That is true. Should best add it there.
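To make the two options concrete, a rough sketch (purely illustrative; the field names are taken from the Context(...) call further down in this PR, and LabelTopology is just a stand-in for whatever topology type is actually used):
import dataclasses

# Option A: no new Context type at all; pass the topology as an extra argument.
def make_fixed_path_net(ctx, alignment_topology, **kwargs):  # hypothetical helper
    ...

# Option B: extend the existing base Context class with the new field.
@dataclasses.dataclass
class Context:
    task: str
    target: "TargetConfig"
    model: str
    name: str
    alignment_topology: "LabelTopology" = None  # new optional field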
In transducer_fullsum.py you only provide make_net(). But that isn't very flexible, especially when you are calling it through Pretrain. Wouldn't it be more meaningful to wrap it in a class that holds the parameters not related to the Pretrain? We can then still define make_net with the same default params so it doesn't break anything. If so, let me know and I will open a PR. Or how have you thought about it?
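A possible shape for this (only a sketch; the class name and the parameters are made up and would have to match the actual make_net() signature in transducer_fullsum.py):
class TransducerFullSumNet:  # hypothetical name
    """Holds the parameters that are not related to Pretrain."""

    def __init__(self, num_enc_layers=6, enc_dim=512):  # made-up example params
        self.num_enc_layers = num_enc_layers
        self.enc_dim = enc_dim

    def make_net(self, pretrain_idx=None):
        """Builds the network dict; only the Pretrain-related args go here."""
        net = {}  # ... construct the network dict as transducer_fullsum.make_net does now ...
        return net


def make_net(pretrain_idx=None, num_enc_layers=6, enc_dim=512):
    """Module-level wrapper with the same defaults, so existing configs keep working."""
    return TransducerFullSumNet(num_enc_layers=num_enc_layers,
                                enc_dim=enc_dim).make_net(pretrain_idx=pretrain_idx)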
task = get_global_config().value("task", "train")
target = TargetConfig.global_from_config()
model = get_global_config().value("model", "net-model/network")
self.ctx = Context(task=task, target=target, model=model, name=self.stage.name,
This is very bad. You should never introduce attributes outside of __init__. (Basic Python rules; I think PyCharm would also warn you about this, or not?)
Also, you should not access the global config in other functions. In the optimal case, it would never be accessed at all, and all relevant context information is passed in __init__. If needed, it might be used for default arguments in __init__. See also other code.
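For illustration, a minimal sketch of the suggested pattern (the class name is illustrative; get_global_config, TargetConfig and Context as already used in the code above): all attributes are introduced in __init__, and the global config is only consulted for defaults.
class MultiStager:  # illustrative name
    def __init__(self, stage, task=None, target=None, model=None):
        # All attributes are introduced here; the global config only fills in defaults.
        config = get_global_config()
        self.stage = stage
        self.ctx = Context(
            task=task if task is not None else config.value("task", "train"),
            target=target if target is not None else TargetConfig.global_from_config(),
            model=model if model is not None else config.value("model", "net-model/network"),
            name=stage.name)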
This reverts commit dcafc88.
For FixedPath training, different datasets are to be handled differently, for example for Switchboard.
No, those mean different things. In any case, the
Why do you mention this? This is totally independent from this PR here, or not?
If this really needs to be its own directory (not sure about this), then the last filename can be shorter, just like
No, there should be no specific code for specific datasets. It should be generic such that it always works.
net["existing_alignment"] = {"class": "reinterpret_data", | ||
"from": "data:alignment", | ||
"set_sparse_dim": target.get_num_classes(), | ||
"size_base": "encoder", # TODO: for RNA only... |
Any idea how it can be set for RNNT? I.e. how to give it the sum of the sizes of encoder and decoder? @albertz
I thought we have an example for this somewhere?
For RNNT, Andre has it commented out. He skips the reinterpret_data layer and just continues with:
"1_targetb_base": {
"class": "copy",
# "from": "existing_alignment",
"from": "data:alignment",
"register_as_extern_data": "targetb" if task == "train" else None},
Otherwise I couldn't find another example. So the question is whether we really need the reinterpret layer?
That's not what I meant. I meant for the size_base. But maybe that's not needed (you need it when you want to tell RETURNN that it is the same as some other dim tag).
I also see that it sets the sparse dim, although that looks a bit incorrect anyway: it would include the blank label at this point, and I think target here is without blank. Only once you remove the blank frames/labels does this make sense. But maybe this is also not relevant (depending on how it is used).
Note that RNNT training with a fixed alignment is anyway not fully supported yet, because chunking doesn't fully work. (See here.)
So the question is, do we need to tell RETURNN about the shape of the targets if we are training with framewise CE? Or when should RETURNN know about the shape of the targets?
Although that looks a bit incorrect anyway. It would include Blank Labels at this point, and I think target here is without Blank.
Nice that you caught that one. It should have had the dim including the blank.
I have looked into the issue with chunking. I will first commit it like this and integrate it later. It can either be solved via #376 or through Andre's workaround. I see that he used it in some of his configs, but I am not sure how he worked around the following problem (or whether he did at all):
But even that is hacky and ugly, and will break in some cases, e.g. when you define custom_iterate_seqs in some epochs, and later not anymore. Then it would not correctly reset this.
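As a follow-up on the sparse-dim point, here is a sketch of what the corrected layer could look like (assuming target.get_num_classes() does not include blank, as concluded above; the size_base question for RNNT remains open):
net["existing_alignment"] = {
    "class": "reinterpret_data",
    "from": "data:alignment",
    # The alignment contains blank, so the sparse dim must include it.
    "set_sparse_dim": target.get_num_classes() + 1,
    "size_base": "encoder",  # valid only for time-synchronous topologies like RNA
}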
For more information about the motivation and the idea, read #60. This PR is the same as #60.
For testing, the following config can be used.