Add TransducerFullSumAndFramewiseTrainingPipeline #64

Draft · wants to merge 7 commits into master

Conversation

@jotix16 (Contributor) commented Apr 16, 2021

For more information about the motivation and the idea, read #60. This PR is the same as #60.

For testing, the following config can be used:

#!crnn/rnn.py
# kate: syntax python;
# vim: ft=python sw=2:
# based on Andre Merboldt rnnt-fs.bpe1k.readout.zoneout.lm-embed256.lr1e_3.no-curric.bs12k.mgpu.retrain1.config
from __future__ import annotations
import copy
from returnn.import_ import import_
import_("github.com/jotix16/returnn-experiments", "common", None)
from returnn_import.github_com.jotix16.returnn_experiments.dev.common.datasets.asr.librispeech import oggzip
from returnn_import.github_com.jotix16.returnn_experiments.dev.common.common_config import *

from returnn_import.github_com.jotix16.returnn_experiments.dev.common.models.transducer.transducer_fullsum import make_net
from returnn_import.github_com.jotix16.returnn_experiments.dev.common.training.pretrain import Pretrain
from returnn_import.github_com.jotix16.returnn_experiments.dev.common.models.transducer.transducer_training_pipeline.pipeline import TransducerFullSumAndFramewiseTrainingPipeline, Stage
from returnn_import.github_com.jotix16.returnn_experiments.dev.common.models.transducer.topology import rna_topology, rnnt_topology


from typing import Dict, Any
from returnn_import.github_com.jotix16.returnn_experiments.dev.common.datasets.asr.librispeech.vocabs import bpe1k, bpe10k
from returnn_import.github_com.jotix16.returnn_experiments.dev.common.datasets.interface import DatasetConfig, VocabConfig


class DummyDataset(DatasetConfig):
  def __init__(self, vocab: VocabConfig = bpe1k, audio_dim=50, seq_len=88, output_seq_len=8, num_seqs=32, debug_mode=None):
    """
    Dummy dataset which generates synthetic data on the fly,
    so no data preparation is needed.
    This is useful for demos/tests.
    """
    super(DummyDataset, self).__init__()
    self.audio_dim = audio_dim
    self.seq_len = seq_len
    self.output_seq_len = output_seq_len
    self.num_seqs = num_seqs
    self.vocab = vocab
    self.output_dim = vocab.get_num_classes()

  def get_extern_data(self) -> Dict[str, Dict[str, Any]]:
    return {
      "data": {"dim": self.audio_dim},
      "classes": {"sparse": True,
                  "dim": self.output_dim,
                  "vocab": self.vocab.get_opts()},
    }

  def get_train_dataset(self) -> Dict[str, Any]:
    return self.get_dataset("train")

  def get_eval_datasets(self) -> Dict[str, Dict[str, Any]]:
    return {
      "dev": self.get_dataset("dev"),
      "devtrain": self.get_dataset("devtrain")}

  def get_dataset(self, key, subset=None):
    assert key in {"train", "devtrain", "dev"}
    print(f"Using {key} dataset!")
    return {
      "class": "DummyDatasetMultipleSequenceLength",
      "input_dim": self.audio_dim,
      "output_dim": self.output_dim,
      "seq_len": {
        'data': self.seq_len,
        'classes': self.output_seq_len
      },
      "num_seqs": self.num_seqs,
    }


# DummyDataset
globals().update(DummyDataset().get_config_opts())

# LibriSpeech Dataset
# globals().update(
#   oggzip.Librispeech(train_random_permute={
#     "rnd_scale_lower": 1., "rnd_scale_upper": 1.,
#     "rnd_pitch_switch": 0.05,
#     "rnd_stretch_switch": 0.05,
#     "rnd_zoom_switch": 0.5,
#     "rnd_zoom_order": 0,
#   }).get_config_opts())


st1 = Stage(
  make_net=Pretrain(make_net, {"enc_lstm_dim": (512, 1024), "enc_num_layers": (3, 6)}, num_epochs=5).get_network,
  num_epochs=2,
  fixed_path=False,
  alignment_topology=rna_topology,
)

st2 = Stage(
  make_net=Pretrain(make_net, {"enc_lstm_dim": (512, 1024), "enc_num_layers": (3, 6)}, num_epochs=3).get_network,
  num_epochs=5,
  fixed_path=True,
  stage_num_align=0,
  alignment_topology=rna_topology,
)

# Multi stage training with pretraining
get_network = TransducerFullSumAndFramewiseTrainingPipeline([st1,
                                                             st2,
                                                             st1.st(fixed_path=True, stage_num_align=1),
                                                             st1.st(fixed_path=True, stage_num_align=2),
                                                             st2]).get_network

# trainer
debug_mode = False
batching = "random"
batch_size = 1000 if debug_mode else 12000
max_seqs = 10 if debug_mode else 200
max_seq_length = {"classes": 75}

device = "cpu"
num_epochs = 100
model = "net-model/network"
cleanup_old_models = True

adam = True
optimizer_epsilon = 1e-8
# debug_add_check_numerics_ops = True
# debug_add_check_numerics_on_output = True
stop_on_nonfinite_train_score = False
gradient_noise = 0.0
gradient_clip = 0
# gradient_clip_global_norm = 1.0

learning_rate = 0.001
learning_rate_control = "newbob_multi_epoch"
# learning_rate_control_error_measure = "dev_score_output"
learning_rate_control_relative_error_relative_lr = True
learning_rate_control_min_num_epochs_per_new_lr = 3
use_learning_rate_control_always = True
newbob_multi_num_epochs = globals().get("train", {}).get("partition_epoch", 1)
newbob_multi_update_interval = 1
newbob_learning_rate_decay = 0.9
learning_rate_file = "newbob.data"

# log
# log = "| /u/zeyer/dotfiles/system-tools/bin/mt-cat.py >> log/crnn.seq-train.%s.log" % task
# model_name = os.path.splitext(os.path.basename(__file__))[0]
# log = "/var/tmp/am540506/log/%s/crnn.%s.log" % (model_name, task)
# os.makedirs(os.path.dirname(log), exist_ok=True)
log = "log/crnn.%s.log" % task
log_verbosity = 2

@albertz (this comment has been minimized)

@jotix16 (Contributor, Author) commented Apr 16, 2021

> Why open a separate new PR then? You could have just updated #60. Please do that in the future. But anyway, leave it like this now.

It was not possible. I could only comment. I couldn't recover the branch of the PR, no matter what I tried.

Right now I am looking into what has to be considered so that the alignments and the input features correspond to each other when loading. Is it clear which options influence this (ordering, sorting, batch size, ...)? If yes, one could automatically read the options from the default dataset to create the HdfDataset for loading the alignments.

What extra information has to be saved together with the alignments for each label topology, and how?
For example, chunking requires extra information if used with rnnt label topology.
As mentioned in Andre's thesis:

> For the time-synchronous fixed-path transducer this is straight-forward, both the alignment and input has to be chunked accordingly.

> However once we move to alignment-synchronous models with the “allow vertical” topology, this becomes more difficult due to the non-uniform input and output sizes. To implement this regardless, a similar technique can be used which still chunks the encoder frames as before, now the targets are collected dynamically to match the input frames. This procedure is as follows: For each sequence in the batch, we split the encoder-level alignment into segments such that in each segment there are exactly C blanks, except for the last segment.
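For illustration only, here is a minimal pure-Python sketch of that segmentation (the helper name and the blank_idx / C arguments are assumptions, not code from this PR):

def split_alignment_by_blanks(alignment, blank_idx, C):
  """Split an encoder-level alignment into segments such that each segment
  contains exactly C blanks, except possibly the last one."""
  segments, current, num_blanks = [], [], 0
  for label in alignment:
    current.append(label)
    if label == blank_idx:
      num_blanks += 1
      if num_blanks == C:
        segments.append(current)
        current, num_blanks = [], 0
  if current:
    segments.append(current)
  return segments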

make_align() and make_fixed_path() callbacks could be saved in the Topology class that we used for the loss and alignments, if there are many differences that have to be considered separately.

@albertz (Member) commented Apr 16, 2021

> Why open a separate new PR then? You could have just updated #60. Please do that in the future. But anyway, leave it like this now.

> It was not possible. I could only comment. I couldn't recover the branch of the PR, no matter what I tried.

It should always be possible by just force-pushing to your branch (which you used for the PR, that was multi_stager).

Btw, I see that you also added the dataset there. Please separate this (always separate things when they are logically separate). And I'm anyway not sure about this. I don't like that we now need to recreate wrappers for all datasets. That's bad. That should be avoided (automated, or be part of RETURNN itself). But anyway, that's off-topic here.

> Right now I am looking into what has to be considered so that the alignments and the input features correspond to each other when loading. Is it clear which options influence this (ordering, sorting, batch size, ...)? If yes, one could automatically read the options from the default dataset to create the HdfDataset for loading the alignments.

I don't quite understand this comment. What do you mean by "correspond to each other"? Why do you think you need any extra logic there? Every sequence is already identified by the seq-tag.

> What extra information has to be saved together with the alignments for each label topology, and how?

Like what?

> For example, chunking requires extra information if used with rnnt label topology.

You mean more like some extra logic. Or what extra information?

> make_align() and make_fixed_path() callbacks could be saved in the Topology class that we used for the loss and alignments, if there are many differences that have to be considered separately.

But alignment in that class is already exactly that?

Or you mean the extra chunking logic?

We anyway need to think about how the chunking would be generalized. There is an initial implementation here but this needs changes.

Anyway, this is all off-topic here, or not?

@jotix16 (Contributor, Author) commented Apr 19, 2021

> Btw, I see that you also added the dataset there. Please separate this (always separate things when they are logically separate). And I'm anyway not sure about this. I don't like that we now need to recreate wrappers for all datasets. That's bad. That should be avoided (automated, or be part of RETURNN itself). But anyway, that's off-topic here.

Yes. It should have been put in the main config.

> I don't quite understand this comment. What do you mean by "correspond to each other"? Why do you think you need any extra logic there? Every sequence is already identified by the seq-tag.

Seems like it is already taken care of on the side of both HdfDump and MetaDataset. I didn't know that. It is more or less plug and play. I am only unsure about non-time-synchronous topologies, as the alignments have different seq_lens compared to the features. Is it still plug and play for framewise CE training?

> You mean more like some extra logic. Or what extra information?

Logic

> make_align() and make_fixed_path() callbacks could be saved in the Topology class that we used for the loss and alignments, if there are many differences that have to be considered separately.

> But alignment in that class is already exactly that?

I am talking here about the stuff happening at the Stage level. We either dump the alignments or load them and do CE training. For that, make_align() and make_fixed_path() add the required logic, i.e. the HdfDump and MetaDataset, respectively.
My point was that if make_align() or make_fixed_path() depend on the label topology, we could maybe make them part of the Topology instead of the MultiStager.
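Roughly, the dumping side could look like the following sketch (layer and key names are illustrative, not necessarily what this PR uses):

def update_net_for_alignment_dumping_sketch(net, align_dir, stage_name):
  """Add an HDFDumpLayer that writes the computed alignment to an HDF file."""
  net["alignment_dump"] = {
    "class": "hdf_dump",
    "from": "alignment",  # assumed: name of the layer holding the alignment
    "filename": f"{align_dir}/align.{stage_name}.hdf",
    "is_output_layer": True,  # make sure the layer is evaluated during the forward pass
  }
  return net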

> Or you mean the extra chunking logic?
> We anyway need to think about how the chunking would be generalized. There is an initial implementation here but this needs changes.

Yes, that included.
Ah, I see, you mean the solution should be at the chunking level. I will check that out and see if I come up with any generalization.

> Anyway, this is all off-topic here, or not?

Not really, it is some work towards:

> Find good pipeline: How long full sum? How often Viterbi realignment? Alternate between both?

The goal is to separate the logic of full sum, Viterbi realignment and CE from the model itself. I think that multi-stage training should be a plug-in. Once you have a model, one could easily choose the pipeline.

@albertz (Member) commented Apr 19, 2021

> Btw, I see that you also added the dataset there. Please separate this ...

> Yes. It should have been put in the main config.

So can you clean up this PR and separate this?

> Every sequence is already identified by the seq-tag.

> I am only unsure about non-time-synchronous topologies, as the alignments have different seq_lens compared to the features. Is it still plug and play for framewise CE training?

I'm not sure what you mean by "plug and play"?

Obviously the normal chunking cannot work.

> make_align() and make_fixed_path() callbacks could be saved in the Topology class that we used for the loss and alignments, if there are many differences that have to be considered separately.

> But alignment in that class is already exactly that?

> I am talking here about the stuff happening at the Stage level. We either dump the alignments or load them and do CE training. For that, make_align() and make_fixed_path() add the required logic, i.e. the HdfDump and MetaDataset, respectively.
> My point was that if make_align() or make_fixed_path() depend on the label topology, we could maybe make them part of the Topology instead of the MultiStager.

(I don't understand what's the difference between making a path or making an alignment. -> make_fixed_path is to create the config for framewise CE training)

But making (and dumping) the alignment is independent from the label topology?

I'm not really sure whether the multi stager should need to handle any of this? This looks very unclean to me. Like you mix up different things (multi staging + alignment creation + alignment dumping + alignment loading). Can't we separate all of this? Making things coupled together is always bad.

> Anyway, this is all off-topic here, or not?

> Not really, it is some work towards:
> Find good pipeline: How long full sum? How often Viterbi realignment? Alternate between both?

I thought the multi stager (this PR here) is about a multi stager, where you combine several different training steps (any, doesn't matter what they do).

> The goal is to separate the logic of full sum, Viterbi realignment and CE from the model itself.

But we already have that? We have some functions which build the model, and other (separate) functions which define the training loss, and yet separate functions which define pretraining and the training pipeline.

Unless you never intended the multi-stager to be generic (then I misunderstood), but very specific for this transducer model, and transducer training pipeline. But then I would also call it more specific, like FullsumTransducerTrainingPipeline, and not just MultiStager.

If it is supposed to be generic, I don't think it should have any extra logic for things like alignments etc. It might have very generic support for storing and loading (any!) auxiliary data (storing via HDFDumpLayer, and loading via MetaDataset/HDFDataset).
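As a sketch of that generic loading path (dataset keys and data-map names here are assumptions): the original dataset stays unchanged, and the dumped alignments are attached through a MetaDataset/HDFDataset pair, matched by seq_tag.

def make_fixed_path_train_dataset(original_train_dataset, align_hdf_files):
  """Combine the unchanged training dataset with dumped alignments."""
  return {
    "class": "MetaDataset",
    "datasets": {
      "default": original_train_dataset,
      "align": {"class": "HDFDataset", "files": align_hdf_files},
    },
    "data_map": {
      "data": ("default", "data"),
      "classes": ("default", "classes"),
      "alignment": ("align", "data"),
    },
    "seq_order_control_dataset": "default",
  }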

@jotix16 (Contributor, Author) commented Apr 19, 2021

> (I don't understand what's the difference between making a path or making an alignment)

Naming is bad. update_for_alignment_dumping and update_for_fixed_path_training would be more exact.

> But making (and dumping) the alignment is independent from the label topology?

Yes, if you try to change the chunking to make up for the topology. For RNNT, one could instead dump index sequences of the blank labels as an extra dataset and chunk along them. I don't know if this is doable, as it would require returning (ix, blank_idxs[start], blank_idxs[end]) instead of (ix, start, end). But then you don't have to change the chunking itself. You have the HDF dataset with the index sequences that make up for the differences.
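As pure illustration of that idea (hypothetical names, not RETURNN API): chunk boundaries would be chosen over blank positions and mapped back to encoder frames, so the frame-level chunking itself stays unchanged.

def chunk_range_in_frames(blank_idxs, start, end):
  """blank_idxs: frame positions of the blank labels of one sequence.
  (start, end) index into blank_idxs; returns the corresponding frame range."""
  return blank_idxs[start], blank_idxs[end]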

> Like you mix up different things (multi staging + alignment creation + alignment dumping + alignment loading). Can't we separate all of this? Making things coupled together is always bad.

They are separated, only not in different files.

> But we already have that? We have some functions which build the model, and other (separate) functions which define the training loss, and yet separate functions which define pretraining and the training pipeline.

Yes, but the in-between steps of switching between FS and CE are missing. That is what I intend to add. My plan was to add the logic in returnn-experiments only.

As you say, it is rather transducer-specific. I will rename it as you suggest, to TransducerTrainingPipeline.

@albertz (Member) commented Apr 19, 2021

> But making (and dumping) the alignment is independent from the label topology?

> Yes, if you try to change the chunking to make up for the topology. For RNNT, one could instead dump index sequences of the blank labels as an extra dataset and chunk along them. I don't know if this is doable, as it would require returning (ix, blank_idxs[start], blank_idxs[end]) instead of (ix, start, end). But then you don't have to change the chunking itself. You have the HDF dataset with the index sequences that make up for the differences.

I don't really understand. The making/dumping is in any case independent. I think you refer to the framewise training and chunking.

For chunking, yes it's specific for the topo. I don't understand what you describe. No matter how you dump it, the chunking needs custom logic.

I also don't understand why the multi stager needs to handle any of this.

> Like you mix up different things (multi staging + alignment creation + alignment dumping + alignment loading). Can't we separate all of this? Making things coupled together is always bad.

> They are separated, only not in different files.

What exactly is this PR about? I thought it's about the multi stager (and only about that)? We should not mix up things. And I still don't see why alignment stuff (dumping, loading) and framewise training etc should matter for that. The logic of multi staging would be totally independent of any of that?

jotix16 changed the title from "Add multi stager" to "Add TransducerFullSumAndFramewiseTrainingPipeline" on Apr 19, 2021
@albertz (Member) commented Apr 21, 2021

This PR still has stuff about the dummy dataset. Can you remove this here?

@albertz (Member) commented Apr 21, 2021

This PR still looks very much work-in-progress. Can you mark it as draft, until you consider it ready?
Also, can you comment what the state is now?

"extend_existing_file": extend, # TODO: extend only the first time
# dataset_name: comes from **opts of the lambda in filename
"filename":
(lambda **opts: "{align_dir}/align.{name}_{dataset_name}.hdf".format(align_dir=align_dir,
albertz (Member):

Do not use str.format. Use f-strings.

jotix16 (Contributor, Author):

I don't think it is easy to do in this case. How would you do it? The dataset_name comes from **opts.
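One possible way (just a sketch): evaluate the f-string inside the lambda, where **opts (and hence dataset_name) is available; align_dir and name are captured from the enclosing scope as before.

"filename": (lambda **opts:
             f"{align_dir}/align.{name}_{opts['dataset_name']}.hdf"),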




class Context:
albertz (Member):

Why do you duplicate this here? We already have such a class.

jotix16 (Contributor, Author):

Here the context is a little broader. Should I pass the rest as separate params to the functions?

albertz (Member):

This is no excuse. Then take the base context as an argument here and extend it. But do not duplicate code & logic when not really needed.

But what is really the extension here? Just the alignment_topology? For that, you don't need any new Context type at all. Just pass it as an extra argument wherever needed.

Or maybe extend the base Context class. Probably we anyway need it also there?

jotix16 (Contributor, Author):

> Or maybe extend the base Context class. Probably we anyway need it also there?

That is true. It would be best to add it there.

In transducer_fullsum.py you only provide make_net(). But that isn't very flexible, especially when you are calling it through Pretrain. Wouldn't it be more meaningful to wrap it in a class that holds the parameters not related to Pretrain? We can then still define make_net with the same default params so it doesn't break anything.
If so, let me know and I will open a PR.

Or how did you intend it?

task = get_global_config().value("task", "train")
target = TargetConfig.global_from_config()
model = get_global_config().value("model", "net-model/network")
self.ctx = Context(task=task, target=target, model=model, name=self.stage.name,
albertz (Member):

This is very bad. You should never introduce attribs in non-init functions. (Basic Python rules. I think PyCharm would also warn you about this, or not?)

Also, you should not access the global config in other functions. In the optimal case, it would never be accessed at all, and all relevant context information is passed in __init__. If needed, it might be used for default arguments in __init__. See also other code.
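A hedged sketch of the suggested pattern (class and attribute names are illustrative; Context and TargetConfig refer to the existing classes in this PR's common code): read the global config only for default arguments in __init__, and set all attributes there.

class StageRunner:
  def __init__(self, stage, *, task=None, model=None, target=None):
    from returnn.config import get_global_config  # only used for __init__ defaults
    config = get_global_config()
    self.stage = stage
    self.task = task if task is not None else config.value("task", "train")
    self.model = model if model is not None else config.value("model", "net-model/network")
    self.target = target if target is not None else TargetConfig.global_from_config()
    self.ctx = Context(task=self.task, target=self.target, model=self.model,
                       name=stage.name)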

jotix16 changed the title from "Add TransducerFullSumAndFramewiseTrainingPipeline" to "WIP: Add TransducerFullSumAndFramewiseTrainingPipeline" on Apr 23, 2021
This reverts commit dcafc88.
@jotix16 (Contributor, Author) commented Apr 23, 2021

> This PR still looks very much work-in-progress. Can you mark it as draft, until you consider it ready?

For FixedPath training, different datasets have to be handled differently. For example, for Switchboard, seq_order_seq_lens_file has to be provided, whereas LibriSpeech has seq_tags.
Dumping seems to be independent of the dataset.

> Also, can you comment what the state is now?

I have made some progress with the dummy dataset (similar to LibriSpeech).
Right now I don't know how to organize the files. You want them separated. Should I create a folder called transducer_training_pipeline and split the parts into files there?
Something like:

transducer_training_pipeline
├── alignment_dumping.py
├── fixed_path_training.py
└── transducer_fullsum_framewise_training_pipeline.py

In fixed_path_training.py we then put the functions

  • libri_update_net_for_fixed_path_training()
  • switchboard_update_net_for_fixed_path_training()

In alignment_dumping.py there is update_net_for_alignment_dumping()

@albertz (this comment has been minimized)

jotix16 marked this pull request as draft on April 23, 2021, 21:35
@albertz (Member) commented Apr 23, 2021

> For FixedPath training, different datasets have to be handled differently. For example, for Switchboard, seq_order_seq_lens_file has to be provided, whereas LibriSpeech has seq_tags.

No, those mean different things.

In any case, the MetaDataset would handle this, or not?

> I have made some progress with the dummy dataset (similar to LibriSpeech).

Why do you mention this? This is totally independent from this PR here, or not?

> Right now I don't know how to organize the files. You want them separated. Should I create a folder called transducer_training_pipeline and split the parts into files there?
> Something like:
>
> transducer_training_pipeline
> ├── alignment_dumping.py
> ├── fixed_path_training.py
> └── transducer_fullsum_framewise_training_pipeline.py

If this really needs to be an own directory (not sure about this), then the last filename can be shorter, just like pipeline.py or so.

> In fixed_path_training.py we then put the functions
>
>   • libri_update_net_for_fixed_path_training()
>   • switchboard_update_net_for_fixed_path_training()

No, there should be no specific code for specific dataset. It should be generic such that it works always.

net["existing_alignment"] = {"class": "reinterpret_data",
"from": "data:alignment",
"set_sparse_dim": target.get_num_classes(),
"size_base": "encoder", # TODO: for RNA only...
jotix16 (Contributor, Author):

Any idea how it can be set for RNNT? I.e., how to give it the sum of the sizes of encoder and decoder?
@albertz

albertz (Member):

I thought we have an example for this somewhere?

jotix16 (Contributor, Author):

For the RNNT, Andre has it commented out. He skips the reinterpret_data layer and just continues with:

    "1_targetb_base": {
        "class": "copy",
        # "from": "existing_alignment",
        "from": "data:alignment",
        "register_as_extern_data": "targetb" if task == "train" else None},

Otherwise I couldn't find another example.

So the question is whether we really need the reinterpret layer.

albertz (Member):

That's not what I meant. I meant for the size_base. But maybe that's not needed (you need it when you want to tell RETURNN that it is the same to some other dim tag).

I also see that it sets the sparse dim. Although that looks a bit incorrect anyway. It would include Blank Labels at this point, and I think target here is without Blank. Only once you remove the Blank Frames/Labels, this makes sense. But maybe this is also not relevant (depending on how it is used).

Note that RNNT Training with fixed alignment is anyway not fully supported yet, because chunking doesn't fully work. (See here.)

jotix16 (Contributor, Author):

So the question is, do we need to tell RETURNN about the shape of the targets if we are training with framewise CE? Or when should RETURNN know about the shape of the targets?

> Although that looks a bit incorrect anyway. It would include Blank Labels at this point, and I think target here is without Blank.

Nice that you caught that one. It should have the dim including the blank.
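I.e., something like this (sketch, assuming the target vocab here excludes the blank label):

"set_sparse_dim": target.get_num_classes() + 1,  # + 1 to include the blank label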

I have looked into the issue with chunking. I will first commit it like this and integrate it later. The possibilities are: it can either be solved via #376 or through Andre's workaround.
I see that he used it in some of his configs, but I am not sure how he worked around the following problem (or whether he did at all):

> But even that is hacky and ugly, and will break in some cases, e.g. when you define custom_iterate_seqs in some epochs, and later not anymore. Then it would not correctly reset this.

@albertz (this comment has been minimized)

jotix16 changed the title from "WIP: Add TransducerFullSumAndFramewiseTrainingPipeline" to "Add TransducerFullSumAndFramewiseTrainingPipeline" on Apr 26, 2021