
Librispeech Deepspeech pytorch training diverges #513

Closed
priyakasimbeg opened this issue Sep 18, 2023 · 3 comments
Labels
🚀 Launch Blocker Issues that are blocking launch of benchmark P1 Launch 2023 High priority issues for October 2023 AlgoPerf Launch 🔥 PyTorch Issue that mainly deals with the PyTorch version of the code

Comments

@priyakasimbeg
Contributor

Librispeech Deepspeech training loss and WER increase after ~6000 steps.

Description

[Plot: training loss and WER curves increasing after ~6000 steps]

Steps to Reproduce

Git commit: 4c38ffb

Ran the following command with the target-setting configuration:

torchrun --redirects 1:0,2:0,3:0,4:0,5:0,6:0,7:0 --standalone --nnodes=1 --nproc_per_node=8 \
  submission_runner.py \
  --framework=pytorch \
  --workload=librispeech_deepspeech \
  --submission_path=reference_algorithms/target_setting_algorithms/pytorch_nadamw.py \
  --tuning_search_space=reference_algorithms/target_setting_algorithms/librispeech_deepspeech/tuning_search_space.json \
  --data_dir=/data/librispeech \
  --num_tuning_trials=1 \
  --experiment_dir=/experiment_runs \
  --experiment_name=targets_check_pytorch/nadamw_run_0 \
  --overwrite=true \
  --save_checkpoints=false \
  --max_global_steps=36000 \
  --librispeech_tokenizer_vocab_path=/data/librispeech/spm_model.vocab \
  --torch_compile=true \
  2>&1 | tee -a /logs/librispeech_deepspeech_pytorch_<date_time>.log
@priyakasimbeg priyakasimbeg added 🔥 PyTorch Issue that mainly deals with the PyTorch version of the code 🚀 Launch Blocker Issues that are blocking launch of benchmark labels Sep 18, 2023
@priyakasimbeg priyakasimbeg changed the title Librispeech Conformer training diverges Librispeech Deepspeech pytorch training diverges Sep 18, 2023
@priyakasimbeg
Contributor Author

From an offline conversation with @sourabh2k15: both Conformer and Deepspeech can do that [...]; even the internal 20-run experiments have trials that diverge.

I'll rerun with a different seed.

@priyakasimbeg
Contributor Author

priyakasimbeg commented Sep 20, 2023

Update: I reran this and it did not diverge. However, it does not hit the new targets within the reduced budget.
One possible cause is that we did not update warmup_steps in the target-setting algorithm hparams. I'll make sure the right hparams are checked in and run this again.
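For context, a minimal sketch (not the actual pytorch_nadamw.py code) of how warmup_steps typically enters a linear-warmup-plus-cosine-decay schedule; if the step budget is reduced but warmup_steps is left at its old value, warmup occupies a disproportionate share of training:

```python
import math

def lr_at_step(step, base_lr, warmup_steps, total_steps):
    """Illustrative LR schedule: linear warmup followed by cosine decay."""
    if step < warmup_steps:
        # Linear warmup from ~0 to base_lr over warmup_steps.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay over the remaining steps of the budget.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))

# Example: a warmup_steps value tuned for a longer budget eats a much larger
# fraction of a 36000-step run than intended.
print(lr_at_step(6000, base_lr=1e-3, warmup_steps=10000, total_steps=36000))
```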

@priyakasimbeg
Contributor Author

priyakasimbeg commented Sep 29, 2023

The root cause of Deepspeech not hitting its targets is that the DeepspeechJax and DeepspeechPytorch workload classes inherit from the Conformer workload class, but the step_hint and target properties are only overridden in a generic Deepspeech class, so the Conformer values are used instead. Will send out a PR shortly.
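As an illustrative sketch of the class structure described above (class names and values are hypothetical, not the actual workload code):

```python
class ConformerWorkload:
    @property
    def step_hint(self):
        return 80_000  # Conformer's step budget (illustrative value).


class DeepspeechWorkload(ConformerWorkload):
    # Generic Deepspeech class: step_hint/target overrides live here.
    @property
    def step_hint(self):
        return 48_000  # Deepspeech's reduced budget (illustrative value).


class DeepspeechPytorchWorkload(ConformerWorkload):
    # Inherits directly from ConformerWorkload, so it picks up Conformer's
    # step_hint and targets and never sees the Deepspeech overrides above.
    pass


print(DeepspeechPytorchWorkload().step_hint)  # 80000, not 48000
```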

@priyakasimbeg priyakasimbeg added the P1 Launch 2023 High priority issues for October 2023 AlgoPerf Launch label Oct 3, 2023