
Librispeech Deepspeech pytorch training diverges #513

Closed
priyakasimbeg opened this issue Sep 18, 2023 · 3 comments
Labels
🚀 Launch Blocker Issues that are blocking launch of benchmark P1 Launch 2023 High priority issues for October 2023 AlgoPerf Launch 🔥 PyTorch Issue that mainly deals with the PyTorch version of the code

Comments

@priyakasimbeg
Contributor

Librispeech Deepspeech training loss and WER increase after ~6000 steps.

Description

[Plot: training loss and WER curves increasing after ~6000 steps]

Steps to Reproduce

Git commit: 4c38ffb

Ran the following command with the target-setting configuration:

torchrun --redirects 1:0,2:0,3:0,4:0,5:0,6:0,7:0 --standalone --nnodes=1 --nproc_per_node=8 \
  submission_runner.py \
  --framework=pytorch \
  --workload=librispeech_deepspeech \
  --submission_path=reference_algorithms/target_setting_algorithms/pytorch_nadamw.py \
  --tuning_search_space=reference_algorithms/target_setting_algorithms/librispeech_deepspeech/tuning_search_space.json \
  --data_dir=/data/librispeech \
  --num_tuning_trials=1 \
  --experiment_dir=/experiment_runs \
  --experiment_name=targets_check_pytorch/nadamw_run_0 \
  --overwrite=true \
  --save_checkpoints=false \
  --max_global_steps=36000 \
  --librispeech_tokenizer_vocab_path=/data/librispeech/spm_model.vocab \
  --torch_compile=true \
  2>&1 | tee -a /logs/librispeech_deepspeech_pytorch_<date_time>.log
@priyakasimbeg priyakasimbeg added 🔥 PyTorch Issue that mainly deals with the PyTorch version of the code 🚀 Launch Blocker Issues that are blocking launch of benchmark labels Sep 18, 2023
@priyakasimbeg priyakasimbeg changed the title Librispeech Conformer training diverges Librispeech Deepspeech pytorch training diverges Sep 18, 2023
@priyakasimbeg
Contributor Author

From an offline conversation with @sourabh2k15: both Conformer and Deepspeech can do that [...]; even the internal 20-run experiments have trials that diverge.

I'll rerun with a different seed.

@priyakasimbeg
Contributor Author

priyakasimbeg commented Sep 20, 2023

Update: I reran this and it did not diverge. However, it does not hit the new targets within the reduced budget.
One possible cause is that we did not update warmup_steps in the target-setting algorithm hparams. I'll make sure the right hparams are checked in and run this again.
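For context, a minimal sketch (not the actual pytorch_nadamw.py code) of how warmup_steps typically enters a linear-warmup-plus-cosine-decay schedule; if the step budget is reduced but warmup_steps is left at its old value, warmup occupies a disproportionate share of training:

```python
import math

def lr_at_step(step, base_lr, warmup_steps, total_steps):
    """Illustrative LR schedule: linear warmup followed by cosine decay."""
    if step < warmup_steps:
        # Linear warmup from ~0 to base_lr over warmup_steps.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay over the remaining steps of the budget.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))

# Example: a warmup_steps value tuned for a longer budget eats a much larger
# fraction of a 36000-step run than intended.
print(lr_at_step(6000, base_lr=1e-3, warmup_steps=10000, total_steps=36000))
```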

@priyakasimbeg
Contributor Author

priyakasimbeg commented Sep 29, 2023

The root cause of Deepspeech not hitting its targets is that the DeepspeechJax and DeepspeechPytorch workload classes inherit from the Conformer workload class, but the step_hint and target properties are only overridden in a generic Deepspeech class, so the Conformer values are used instead. Will send out a PR shortly.
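As an illustrative sketch of the class structure described above (class names and values are hypothetical, not the actual workload code):

```python
class ConformerWorkload:
    @property
    def step_hint(self):
        return 80_000  # Conformer's step budget (illustrative value).


class DeepspeechWorkload(ConformerWorkload):
    # Generic Deepspeech class: step_hint/target overrides live here.
    @property
    def step_hint(self):
        return 48_000  # Deepspeech's reduced budget (illustrative value).


class DeepspeechPytorchWorkload(ConformerWorkload):
    # Inherits directly from ConformerWorkload, so it picks up Conformer's
    # step_hint and targets and never sees the Deepspeech overrides above.
    pass


print(DeepspeechPytorchWorkload().step_hint)  # 80000, not 48000
```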

@priyakasimbeg priyakasimbeg added the P1 Launch 2023 High priority issues for October 2023 AlgoPerf Launch label Oct 3, 2023