Commit
* Fixed small bug with NoisePerturbationWithNormalization (NVIDIA#7118)
* Fix import guard checks (NVIDIA#7124)
* Revert "Fix import guard checks (NVIDIA#7124)" (NVIDIA#7125); this reverts commit a46e325.
* Fix import guard checks (NVIDIA#7126) (a minimal sketch of this guard pattern follows this list)
  * Fix import guard checks
  * [pre-commit.ci] auto fixes from pre-commit.com hooks (see https://pre-commit.ci)
* Add updated fc ctc and rnnt xxl models (NVIDIA#7128) (NVIDIA#7130)
* [TTS] Create EnCodec training recipe (NVIDIA#6852)
  * [TTS] Create EnCodec training recipe
  * [TTS] Update encodec recipe
  * [TTS] Rename EnCodec to AudioCodec
  * [TTS] Add EnCodec unit tests
  * [TTS] Add copyright header to distributed.py
* Fix rank where torch.distributed may not be initialized yet and would not wait for tokenizer file caching (NVIDIA#7061)
* fix default attention size (NVIDIA#7141) (NVIDIA#7143)
* fix evaluator.py for various exceptions by ast (NVIDIA#7150)
* [TTS][ZH] add Chinese TTS recipes based on IPA symbol sets (NVIDIA#6893)
  * [TTS] add Chinese TTS recipe based on IPA
  * add new pinyin and ipa dictionaries with 36 finals
  * add yaml configs for 24-final pinyin and ipa
  * add copyright header
  * add a directory level 24finals to discriminate from 36 finals
  * unify configs into a single one and add detailed comments providing supported candidates
  * choose 36-final IPA as the default phoneme dict
* [TTS] Add output audio format to preprocessing (NVIDIA#6889)
  * [TTS] Add output audio format to preprocessing
  * [TTS] Add format validation
  * [TTS] Fix data tutorial
* freeze (NVIDIA#7152)
* make sure any empty segments are removed (NVIDIA#7155)
* Update RIR generation scripts (NVIDIA#6547)
  - fix: reduce room size if evaluation of params fails
  - added randomized mic placement
  - added diffuse noise generation
  - added an option to specify the format and subtype for saved audio
* A quickstart speech enhancement tutorial (NVIDIA#6492): a simple example of training a model for the speech enhancement task
* NFA subtitle file config - specify colors and vertical alignment (NVIDIA#7160)
  * allow specifying colors of text in ASS subtitle file
  * specify vertical_alignment instead of marginv in ass_file_config
  * add documentation of CTMFileConfig and ASSFileConfig to NFA README
* Eagerly accumulate embedding grads into fp32 buffer (NVIDIA#6958) (NVIDIA#7153)
* TE bug fix (NVIDIA#7027) (NVIDIA#7036)
* [TTS] Remove nested TTS configs (NVIDIA#7154)
  * [TTS] Remove nested TTS configs
  * [TTS] Modify tutorial to support multiple sampling rates
  * [TTS] Clarify min_duration unit
  * [TTS] Default 22.05kHz highfreq to null
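The "Fix import guard checks" entries above adjust NeMo's guards around optional imports. As a point of reference only, here is a minimal sketch of that general pattern, assuming a hypothetical optional dependency; the module name, flag name, and function are placeholders, not the code touched in NVIDIA#7124/#7126.

```python
# Sketch of an optional-import guard: import once, remember availability,
# and fail with a clear message only when the feature is actually used.
try:
    import nlp_extras  # hypothetical optional dependency

    HAVE_NLP_EXTRAS = True
except (ImportError, ModuleNotFoundError):
    nlp_extras = None
    HAVE_NLP_EXTRAS = False


def build_tokenizer(name: str):
    """Illustrative helper that depends on the optional package."""
    if not HAVE_NLP_EXTRAS:
        raise ModuleNotFoundError(
            "nlp_extras is required for build_tokenizer; install the optional extras."
        )
    return nlp_extras.Tokenizer(name)
```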
* Merge release r1.20.0 to main (NVIDIA#7167)
  * update package info
* Add ASR with TTS Tutorial. Fix enhancer usage. (NVIDIA#6955)
  * Add ASR with TTS Tutorial
  * Fix enhancer usage
* install_bs (NVIDIA#7019)
* Fix typo and branch in tutorial (NVIDIA#7048)
* fix syntax error introduced in PR-7079 (NVIDIA#7102)
  * fix syntax error introduced in PR-7079
  * fixes for PR review
* fix links for TN (NVIDIA#7117)
* update branch (NVIDIA#7135)
* Fixed main and merging this to r1.20 (NVIDIA#7127)
  * Fixed main and merging this to r1.20
  * Update vad_utils.py
* update branch
* fix version
* resolve conflict the other way
* keep both
* revert keep both
* Upgrade to pytorch lightning 2.0 (NVIDIA#6433) (a minimal sketch of the epoch-end hook migration appears after this list)
  * Upgrade pytorch lightning version in requirements
  * Initial fixes for PTL 2.0
  * Add further fixes to support lightning 2.0
  * Add replacements for replace_sampler_ddp, resume_from_checkpoint_fit_path and a few occurrences of validation_epoch_end
  * Replace all occurrences of validation_epoch_end with on_validation_epoch_end
  * Replace training_epoch_end, test_epoch_end with on_train_epoch_end and on_test_epoch_end respectively
  * Change logger=None to logger=False in Trainer object
  * Remove PTL 2.0 deprecated Trainer args from TrainerConfig dataclass
  * Modify trainer.precision check and other small edits
  * Replace logger=None with logger=False in test_ptl_stateless_timer.py Trainer
  * Add default values for args to fix AttributeError
  * Add the following modifications: 1) remove outputs arg from on_validation_epoch_end, on_test_epoch_end and make it an arg of the class; 2) replace resume_from_checkpoint with ckpt_path as needed; 3) explicitly add accelerator as 'CPU' in UTs being run on CPU
  * Remove outputs arg from on_validation_epoch_end, on_test_epoch_end
  * Remove outputs arg in on_validation_epoch_end in MultiBinaryAccuracy docstrings
  * Add val, test outputs as instance vars in PunctuationCapitalizationModel and TokenClassificationModel
  * Replace trainer.fit_loop.max_steps with trainer.fit_loop.epoch_loop.max_steps in test_optimizers_schedulers.py
  * Revert an extra space that was mistakenly added
  * Use self.validation_step_outputs and self.test_step_outputs in test_ema.py for uniformity
  * Use self.validation_step_outputs and self.test_step_outputs in test_ptl_stateless_timer.py and check_for_ranks.py for uniformity
  * Add self.validation_step_outputs.clear() and self.test_step_outputs.clear() wherever missing
  * Remove outputs arg from on_train_epoch_end
  * Remove outputs from on_validation_epoch_end in multi_binary_acc.py
  * Remove output args from on_validation_epoch_end in the docstrings of some ASR files
  * Remove output args from on_validation_epoch_end and clear memory from validation_step_outputs
  * Add on_validation_epoch_end and remove outputs args for NLP models
  * Append output of validation_step to validation_step_outputs in EncDecClassificationModel
  * Add the following changes: 1) index self.validation_step_outputs and self.test_step_outputs with dataloader_idx wherever needed; 2) initialize self.validation_step_outputs and self.test_step_outputs as empty lists and add support for multiple dataloaders if they exist; 3) remove self.pre_configure_ddp from the NLPDDPStrategy class as it is removed in PTL 2.0
  * Add default value dataloader_idx=0 for on_validation_batch_end() in megatron_base_model.py
  * Typecast precision to str in attention.py and utils_funcs.py to avoid TypeError
  * Add if-condition check for multiple dataloaders when appending to validation outputs
  * Separate validation pass to be used with both validation_step and test_step
  * Add if-condition check for multiple dataloaders while appending to test_step_outputs in punctuation_capitalization_model.py
  * Add condition check for multiple dataloaders based on type of trainer.val/test_dataloaders or self._validation/test_dl instead of len
  * Comment Megatron T5 IA3 PP=2 in CI pipeline due to dataloader_iter issue with PTL 2.0
  * Modify precision checks to account for 16-mixed and bf16-mixed
  * Append output of validation/test_step to self.validation/test_step_outputs in CTCG2PModel
  * Modify find_unused_parameters=True in g2p heteronym model: 1) add find_unused_parameters=True for DDP strategy in g2p_heteronym_classification_train_and_evaluate.py; 2) remove args output in validation/test_step and add instance variables instead for heteronym_classification.py
  * Remove outputs from on_test_epoch_end in DialogueGPTClassificationModel
  * Add validation/test outputs in sgdqa_model and modify dialogue_config.yaml
  * Add split arg self.test_step_outputs to TextClassificationModel
  * Add test_step_outputs to dialogue and text classification models
  * Change condition check for multiple dataloaders: 1) replace ds_item as list in dialogue_config.yaml; 2) check for len of val/test_dataloaders or validation/test_dl along with type check of list in sgdqa_model.py while appending outputs of validation/test_step; 3) check for len of _validation/test_dl for creating self.validation/test_step_outputs in ModelPT and punctuation_capitalization_model.py
  * Add additional condition for multiple dataloaders: check len(self.trainer.val/test_dataloaders) > 1 along with type(self.trainer.val/test_dataloaders) == list in validation/test_step
  * Add val step outputs and default val for dataloader_idx: 1) append validation_step output to self.validation_step_outputs in MultiLabelIntentSlotClassificationModel; 2) add default val for dataloader_idx for on_test_batch_start/end in TimingCallback; 3) add self.validation/test_step_outputs in BERTQAModel and remove outputs arg
  * Add val/test_step_outputs to S2SQAModel and GPTQAModel
  * Edit Jenkinsfile for bert_pretraining.py to disable validation as a workaround for the trainer.val_dataloader None error
  * Modify precision to support 16-mixed, bf16-mixed in megatron_gpt_pretraining.py
  * Add ddp_find_unused_parameters_true and remove output args: 1) add ddp_find_unused_parameters_true for trainer.strategy in self_alignment_pretraining.py as it has unused parameters; 2) remove output args and add self.validation/test_step_outputs to validation/test_step in mt_enc_dec_model.py; 3) comment tests in Jenkinsfile that need to be fixed
  * Precision fix in megatron_nmt_training.py for 16-mixed, bf16-mixed
  * Precision fix for megatron_bert_pretraining.py and megatron_bert_model.py
  * Precision fix and validation/test_step_outputs: 1) add fix to account for 16-mixed and bf16-mixed in megatron_retro_mutransfer_pretrain.py, megatron_retro_pretraining.py; 2) reset ckpt_path for test in enc_dec_nmt.py; 3) remove outputs args and add validation/test_step_outputs in megatron_retrieval_model.py; 4) comment Megatron Bert Pretraining and Resume Training with Pipeline Parallelism and add back NMT Training Post-LN
  * Precision fix and skip a few failing tests
  * Add missing comment lines in Jenkinsfile
  * Comment Jenkins tests and super().on_validation_epoch_end() in megatron_gpt_sft_model.py
  * Minor edit Jenkinsfile
  * Minor edit in Jenkins file
  * Edit in Jenkins file
  * Comment missed lines in Jenkins file
  * Fix precision and validation/test outputs: 1) add precision fix to account for 16-mixed and bf16-mixed in megatron_t5_pretraining.py; 2) remove outputs args and append loss to self.validation/test_step_outputs in megatron_lm_encoder_decoder_model.py; 3) add back resume_from_checkpoint in megatron_t5_config.yaml; 4) comment out certain tests in Jenkins file
  * Fix precision and validation/test/predict errors in megatron_t5_prompt_learning.py
  * Precision fix and edit precision typo in all files: 1) account for 16-mixed and bf16-mixed in megatron_bart_pretraining.py and megatron_t5_seq2seq_finetune.py; 2) fix precision typo in all files
  * Fix all CI TTS tests and comment a few Jenkins tests
  * Combine xx_epoch_end and on_xx_epoch_end: add on_inference_epoch_end to the inference_epoch_end function and have a single on_validation/test_epoch_end in megatron_finetune_model.py and megatron_gpt_sft_model.py
  * Add a missing comment in Jenkinsfile
  * Add try/except StopIteration in validation_step for models with dataloader_iter
  * Remove pyyaml from requirements
  * Add try/except for inference_step in megatron_finetune_model.py
  * Remove limit_val_batches for mockGPTDataset test
  * Add new self.validation_step_outputs for MegatronGPTSFTModel
  * Minor edit Jenkinsfile
  * Initialize self.validation/test_step_outputs in setup of MegatronGPTSFTModel to take care of cases when dataloaders are not set up in ModelPT, for example while restoring the model
  * Remove resume_from_checkpoint if trainer arg in conf yaml files
  * Remove resume_from_checkpoint as trainer arg in GPT, T5 configs
  * Remove resume_from_checkpoint in duplex_tn_config.yaml
  * Fix typos, unused imports and refactor code to remove redundant funcs
  * Remove commented code in megatron_nmt_model.py
  * Fix overridden functions to match parent class functions
  * Prefetch dataloader_iter to prevent hang for PP>1
  * Override setup() in NLPDDPStrategy to avoid hang during predict with PP>1
  * Uncomment tests in Jenkinsfile
  * Add '16' to precision checks and other minor fixes
  * Clear validation/test_step_outputs with dataloader_idx for multiple dataloaders
  * Minor edits
  * Modify precision checks to avoid indexing
  * Remove self.validation_step_outputs_sft and add dataloader_idx to clear outputs
  * Reference checkpoint with trainer.ckpt_path
  * Add _prefetch to NLPModel and minor fixes
  * [pre-commit.ci] auto fixes from pre-commit.com hooks (see https://pre-commit.ci)
  * Add limit_val_batches in Jenkinsfile for NMT: 1) add trainer.limit_val_batches in Megatron NMT Training TP=2; 2) remove unused import in ModelPT
* Include the scripts for preprocessing OAST and unit tests for chat sft datasets (NVIDIA#7112)
  * scripts for sft
  * fix style
  * added special token only for huggingface model
  * change default name
  * print out error datapoint content
  * show error id
  * annotation script working
  * try to be compatible with huggingface tokenizer
  * added examples
  * added lang
  * added lang
  * text to value special case
  * configure the slider
  * annotation handles lang
  * added the unit test for chat sft dataset
  * used the file in the test dir
  * fix json error
  * load local tokenizer
  * remove mask count check
  * added HF dataset backend
  * [pre-commit.ci] auto fixes from pre-commit.com hooks (see https://pre-commit.ci)
* add paths to labeler. (NVIDIA#7087)
  * [pre-commit.ci] auto fixes from pre-commit.com hooks (see https://pre-commit.ci)
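Most of the repetitive work in the "Upgrade to pytorch lightning 2.0 (NVIDIA#6433)" entry follows one pattern: Lightning 2.0 removed the `validation_epoch_end(outputs)` / `test_epoch_end(outputs)` hooks, so each module now keeps its own buffer of per-step outputs and consumes it in `on_*_epoch_end`. A minimal sketch of that migration, using a toy module rather than any actual NeMo class:

```python
import torch
import pytorch_lightning as pl


class MyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(10, 1)
        # PTL 2.0 no longer passes `outputs` to the epoch-end hook,
        # so the module buffers its own step outputs.
        self.validation_step_outputs = []

    def validation_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self.layer(x), y)
        self.validation_step_outputs.append(loss)
        return loss

    def on_validation_epoch_end(self):
        # Replaces validation_epoch_end(self, outputs) from PTL 1.x.
        avg_loss = torch.stack(self.validation_step_outputs).mean()
        self.log("val_loss_epoch", avg_loss)
        self.validation_step_outputs.clear()  # free memory between epochs


# Related 2.0 changes named in the entry above:
# - Trainer(logger=False, ...) instead of logger=None
# - trainer.fit(model, ckpt_path=...) instead of Trainer(resume_from_checkpoint=...)
# - trainer.precision may now be "16-mixed" / "bf16-mixed", so string checks need updating
```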
* T5 metrics fix (NVIDIA#7037)
* Fix race condition when executing with multi-node where some ranks do not wait for setup (NVIDIA#7016)
* Added bool types to neural_types export (NVIDIA#7032)
* rnnt and char utils (NVIDIA#6971)
  * rnnt_ngram_merge
  * char level bug
  * [pre-commit.ci] auto fixes from pre-commit.com hooks (see https://pre-commit.ci)
* fix tab text gen (NVIDIA#7022) (NVIDIA#7031)
  * Fixed kwargs for metric instance init
  * removed kwargs
  * Updated config desc
* ASR Confidence update and tutorial (NVIDIA#6810)
  * small fixes and tests
  * various fixes for the tutorial
  * tutorial added
  * fix for a little oops after rebasement
  * fix tests
  * unused import removed
  * fix review comments
  * deprecated parameters for greedy configs
  * move re-assigning to configs
  * fix comments 2
  * fix config tests
  * fix ece test (my env was bugged apparently)
  * renamings for confidence ensemble
  * fix comments 3
  * return dropped tutorial
  * CI flips back and forth, increasing tolerance
* install_bs (NVIDIA#7019) (NVIDIA#7028)
* fixes for spellmapper (NVIDIA#6994) (NVIDIA#7000)
* added back the retro documents (NVIDIA#7033)
* Remove pyyaml (NVIDIA#7052) (NVIDIA#7054)
* st standalone model (NVIDIA#6969)
  * st standalone model
  * style fix
  * sacrebleu import fix, unused imports removed
  * import guard for nlp inside asr transformer bpe model
  * codeql fixes
  * comments answered
  * import ordering fix
  * yttm for asr removed
  * logging added
  * added inference and translate method
  * [pre-commit.ci] auto fixes from pre-commit.com hooks (see https://pre-commit.ci)
* remove pos emb from state dict for old models (NVIDIA#7068) (see the sketch after this list)
  * remove pos emb from state dict
  * move to nlp_model
  * update comment
  * fix nmt test
  * fix nmt test
* Fix typo in ASR-TTS tutorial (NVIDIA#7049)
* Fixed tutorial's name (NVIDIA#7047)
* Fix documentation for Numba (NVIDIA#7065) (NVIDIA#7077)
  * Fix documentation for Numba
  * Update force float32 flag dynamically
  * Fix nemo version
* Update Frame-VAD doc and fix onnx export (NVIDIA#7076)
  * update fvad doc and example
  * fix typo
  * fix onnx export
  * update test
  * refactor
  * update doc
* memmap worker arg (NVIDIA#7062)
  * memmap worker arg
  * updates
  * [pre-commit.ci] auto fixes from pre-commit.com hooks (see https://pre-commit.ci)
* Fix caching bug in causal convolutions for cache-aware ASR models (NVIDIA#7034) (NVIDIA#7082)
* Fast Conformer global token fix (NVIDIA#7085)
  * old way
  * assorted fix and clean commits
  * [pre-commit.ci] auto fixes from pre-commit.com hooks (see https://pre-commit.ci)
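For the "remove pos emb from state dict for old models (NVIDIA#7068)" entry, a hedged sketch of the general technique of dropping a stale key before `load_state_dict`; the key name and helper function are illustrative, not the code actually added to nlp_model.py.

```python
import torch


def load_checkpoint_without_pos_emb(model: torch.nn.Module, ckpt_path: str):
    """Drop stale position-embedding weights from an old checkpoint before loading.

    The "position_embeddings" substring below is a placeholder for whatever key
    an older checkpoint carried that the current model no longer expects.
    """
    state_dict = torch.load(ckpt_path, map_location="cpu")
    state_dict = {
        k: v
        for k, v in state_dict.items()
        if "position_embeddings" not in k  # skip the stale buffer
    }
    # strict=False lets the model keep its freshly initialized position embeddings
    # and reports what was skipped instead of raising.
    missing, unexpected = model.load_state_dict(state_dict, strict=False)
    return missing, unexpected
```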
* Refined export_config (NVIDIA#7053) (NVIDIA#7066)
  * Refined export_config
  * Rolling back hierarchy change
* small Bugfix (NVIDIA#7081)
* small Bugfix (NVIDIA#7079)
  * fix branch
  * fix typo
  * fix link
* Update tutorials/nlp/SpellMapper_English_ASR_Customization.ipynb
* Update tutorials/nlp/SpellMapper_English_ASR_Customization.ipynb
* Added script to extract ASR CTC and RNNT models from ASR hybrid models (NVIDIA#7092)
  * Added script to extract ctc and rnnt models from hybrid models
  * Updated hybrid extraction script for review request 1
  * Updated hybrid convert script to remove --cuda flag
  * [pre-commit.ci] auto fixes from pre-commit.com hooks (see https://pre-commit.ci)
* Adding docs and models for multiple lookahead cache-aware ASR (NVIDIA#7067) (NVIDIA#7094)
* update TTS readme (NVIDIA#7088)
* Fix absolute path in path join call (NVIDIA#7099) (see the note after this list)
* Disable distopt contiguous param buffer by default (NVIDIA#7095)
* microphone demo (NVIDIA#7110)
* [Fix] load_state_dict in nlp_model.py (NVIDIA#7086)
  * Fix load_state_dict in nlp_model.py
  * [pre-commit.ci] auto fixes from pre-commit.com hooks (see https://pre-commit.ci)
* Fix plot function in vad_utils.py (NVIDIA#7113)
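The "Fix absolute path in path join call (NVIDIA#7099)" entry refers to the standard `os.path.join` gotcha that such fixes usually address; a short reminder with made-up paths:

```python
import os

# os.path.join discards everything before a component that is already absolute,
# which is the usual cause of "absolute path in path join" bugs.
base = "/data/experiments"
name = "/run1"  # accidentally absolute

print(os.path.join(base, name))                  # -> "/run1" (base silently dropped)
print(os.path.join(base, name.lstrip(os.sep)))   # -> "/data/experiments/run1"
```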
Co-authored-by: Adi Renduchintala <adithyar… Signed-off-by: Daniel Egert <[email protected]> Signed-off-by: smajumdar <[email protected]> Signed-off-by: Ryan <[email protected]> Signed-off-by: Kim Ngo <[email protected]> Signed-off-by: He Huang (Steve) <[email protected]> Signed-off-by: Xuesong Yang <[email protected]> Signed-off-by: arendu <[email protected]> Signed-off-by: Elena Rastorgueva <[email protected]> Signed-off-by: Ante Jukić <[email protected]> Signed-off-by: Tim Moon <[email protected]> Signed-off-by: Dmytro Pykhtar <[email protected]> Signed-off-by: ericharper <[email protected]> Signed-off-by: Vladimir Bataev <[email protected]> Signed-off-by: Nikolay Karpov <[email protected]> Signed-off-by: Alexandra Antonova <[email protected]> Signed-off-by: Evelina <[email protected]> Signed-off-by: Taejin Park <[email protected]> Signed-off-by: Abhishree <[email protected]> Signed-off-by: Yi Dong <[email protected]> Signed-off-by: jubick1337 <[email protected]> Signed-off-by: tbartley94 <[email protected]> Signed-off-by: Aleksandr Laptev <[email protected]> Signed-off-by: AlexGrinch <[email protected]> Signed-off-by: Vitaly Lavrukhin <[email protected]> Signed-off-by: stevehuang52 <[email protected]> Signed-off-by: sam1373 <[email protected]> Signed-off-by: Boris Fomitchev <[email protected]> Signed-off-by: fayejf <[email protected]> Signed-off-by: Somshubra Majumdar <[email protected]> Signed-off-by: Jan Beckmann <[email protected]> Signed-off-by: Linnea Pari Leaver <[email protected]> Signed-off-by: Xin Yao <[email protected]> Signed-off-by: fayejf <[email protected]> Signed-off-by: Cheng-Ping Hsieh <[email protected]> Signed-off-by: hsiehjackson <[email protected]> Signed-off-by: Cheng-Ping Hsieh <[email protected]> Signed-off-by: Oleksii Kuchaiev <[email protected]> Signed-off-by: Jocelyn Huang <[email protected]> Signed-off-by: smajumdar <[email protected]> Signed-off-by: Alexandra Antonova <[email protected]> Signed-off-by: Virginia Adams <[email protected]> Signed-off-by: Vahid <[email protected]> Signed-off-by: David Mosallanezhad <[email protected]> Signed-off-by: MaximumEntropy <[email protected]> Signed-off-by: ekmb <[email protected]> Signed-off-by: Yang Zhang <[email protected]> Signed-off-by: Micha Livne <[email protected]> Signed-off-by: Abhinav Khattar <[email protected]> Signed-off-by: Micha Livne <[email protected]> Signed-off-by: Dima Rekesh <[email protected]> Signed-off-by: Jim O’Regan <[email protected]> Signed-off-by: Mostafa Ghorbandoost <[email protected]> Signed-off-by: Dmytro Pykhtar <[email protected]> Signed-off-by: Kunal Dhawan <[email protected]> Signed-off-by: andrusenkoau <[email protected]> Signed-off-by: Andrei Andrusenko <[email protected]> Signed-off-by: KunalDhawan <[email protected]> Signed-off-by: Greg Clark <[email protected]> Signed-off-by: Eric Harper <[email protected]> Signed-off-by: Jan Baczek <[email protected]> Signed-off-by: yaoyu-33 <[email protected]> Signed-off-by: Olivier Delalleau <[email protected]> Signed-off-by: eharper <[email protected]> Signed-off-by: jasonwan <[email protected]> Signed-off-by: Maanu Grover <[email protected]> Signed-off-by: Guyue Huang <[email protected]> Signed-off-by: Mariana Graterol Fuenmayor <[email protected]> Signed-off-by: Igor Gitman <[email protected]> Signed-off-by: Siddharth Tyagi <[email protected]> Signed-off-by: Abhishree Thittenamane <[email protected]> Signed-off-by:
Jason Wang <[email protected]> Signed-off-by: arendu <[email protected]> Signed-off-by: Alireza Morsali <[email protected]> Signed-off-by: Siddharth Tyagi <[email protected]> Signed-off-by: dorotat <[email protected]> Signed-off-by: mburchi <[email protected]> Signed-off-by: Maxime Burchi <[email protected]> Signed-off-by: Adi Renduchintala <[email protected]> Signed-off-by: Nithin Rao Koluguri <nithinraok> Signed-off-by: Xin Yao <[email protected]> Signed-off-by: Hongbin Liu <[email protected]> Signed-off-by: Alexander Jipa <[email protected]> Signed-off-by: omahs <[email protected]> Signed-off-by: lhb8125 <[email protected]> Signed-off-by: Robin Dong <[email protected]> Signed-off-by: Jimmy Zhang <[email protected]> Signed-off-by: Sangkug Lym <[email protected]> Signed-off-by: George Zelenfroynd <[email protected]> Signed-off-by: Anton Peganov <[email protected]> Signed-off-by: Samuele Cornell <[email protected]> Signed-off-by: Jason <[email protected]> Signed-off-by: Jan Lasek <[email protected]> Signed-off-by: Tamerlan Tabolov <[email protected]> Signed-off-by: zhehuaichen <[email protected]> Co-authored-by: trias702 <[email protected]> Co-authored-by: Somshubra Majumdar <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Ryan Langman <[email protected]> Co-authored-by: Kim Ngo <[email protected]> Co-authored-by: David <[email protected]> Co-authored-by: He Huang (Steve) <[email protected]> Co-authored-by: Xuesong Yang <[email protected]> Co-authored-by: Adi Renduchintala <[email protected]> Co-authored-by: Elena Rastorgueva <[email protected]> Co-authored-by: anteju <[email protected]> Co-authored-by: Tim Moon <[email protected]> Co-authored-by: Dmytro Pykhtar <[email protected]> Co-authored-by: Eric Harper <[email protected]> Co-authored-by: Vladimir Bataev <[email protected]> Co-authored-by: Nikolay Karpov <[email protected]> Co-authored-by: bene-ges <[email protected]> Co-authored-by: Evelina <[email protected]> Co-authored-by: Taejin Park <[email protected]> Co-authored-by: Abhishree Thittenamane <[email protected]> Co-authored-by: Yi Dong <[email protected]> Co-authored-by: Matvei Novikov <[email protected]> Co-authored-by: tbartley94 <[email protected]> Co-authored-by: Aleksandr Laptev <[email protected]> Co-authored-by: Aleksey Grinchuk (Oleksii Hrinchuk) <[email protected]> Co-authored-by: Vitaly Lavrukhin <[email protected]> Co-authored-by: fayejf <[email protected]> Co-authored-by: Vahid Noroozi <[email protected]> Co-authored-by: Samuel Kriman <[email protected]> Co-authored-by: Boris Fomitchev <[email protected]> Co-authored-by: Jan Beckmann <[email protected]> Co-authored-by: lleaver <[email protected]> Co-authored-by: Linnea Pari Leaver <[email protected]> Co-authored-by: Xin Yao <[email protected]> Co-authored-by: anmolgupt <[email protected]> Co-authored-by: ANMOL GUPTA <[email protected]> Co-authored-by: Cheng-Ping Hsieh <[email protected]> Co-authored-by: Micha Livne <[email protected]> Co-authored-by: Oleksii Kuchaiev <[email protected]> Co-authored-by: Jocelyn <[email protected]> Co-authored-by: bene-ges <[email protected]> Co-authored-by: Alexandra Antonova <[email protected]> Co-authored-by: Virginia Adams <[email protected]> Co-authored-by: Zhilin Wang <[email protected]> Co-authored-by: Nithin Rao <[email protected]> Co-authored-by: Ante Jukić <[email protected]> Co-authored-by: David Mosallanezhad 
<[email protected]> Co-authored-by: Sandeep Subramanian <[email protected]> Co-authored-by: Sean Naren <[email protected]> Co-authored-by: Yang Zhang <[email protected]> Co-authored-by: Sean Naren <[email protected]> Co-authored-by: Neha Tadimeti <[email protected]> Co-authored-by: Abhinav Khattar <[email protected]> Co-authored-by: Dima Rekesh <[email protected]> Co-authored-by: Jim O’Regan <[email protected]> Co-authored-by: Mostafa Ghorbandoost <[email protected]> Co-authored-by: Dmytro Pykhtar <[email protected]> Co-authored-by: Kunal Dhawan <[email protected]> Co-authored-by: Andrei Andrusenko <[email protected]> Co-authored-by: Greg Clark <[email protected]> Co-authored-by: jbaczek <[email protected]> Co-authored-by: yaoyu-33 <[email protected]> Co-authored-by: Olivier Delalleau <[email protected]> Co-authored-by: Jason Wang <[email protected]> Co-authored-by: Maanu Grover <[email protected]> Co-authored-by: guyueh1 <[email protected]> Co-authored-by: Mariana <[email protected]> Co-authored-by: Igor Gitman <[email protected]> Co-authored-by: styagi130 <[email protected]> Co-authored-by: Siddharth Tyagi <[email protected]> Co-authored-by: Cheng-Ping Hsieh <[email protected]> Co-authored-by: Alireza Morsali <[email protected]> Co-authored-by: styagi130 <[email protected]> Co-authored-by: dorotat-nv <[email protected]> Co-authored-by: Maxime Burchi <[email protected]> Co-authored-by: mikolajblaz <[email protected]> Co-authored-by: eharper <[email protected]> Co-authored-by: Hongbin Liu <[email protected]> Co-authored-by: Kelvin Liu <[email protected]> Co-authored-by: Oleksii Kuchaiev <[email protected]> Co-authored-by: Alexander Jipa <[email protected]> Co-authored-by: Alexander Jipa <[email protected]> Co-authored-by: omahs <[email protected]> Co-authored-by: Robin Dong <[email protected]> Co-authored-by: JimmyZhang12 <[email protected]> Co-authored-by: Jimmy Zhang <[email protected]> Co-authored-by: Sangkug Lym <[email protected]> Co-authored-by: George <[email protected]> Co-authored-by: PeganovAnton <[email protected]> Co-authored-by: Samuele Cornell <[email protected]> Co-authored-by: Jason <[email protected]> Co-authored-by: Igor Gitman <[email protected]> Co-authored-by: Jan Lasek <[email protected]> Co-authored-by: Tamerlan Tabolov <[email protected]>