Trained model can generate correct text but incorrect speech #13
Comments
Have you tried testing directly with the model we provide, and does the problem still occur? If not, the problem may lie in the training scripts. Could you share them?
Thank you for your kind reply!
config_gcmvn.yaml
config_mtl_asr_st_ctcst.yaml
I have also attached the model we trained here (https://drive.google.com/file/d/1rdOEt1NSt8oxUBHL0WfM_CCtKczt6TzO/view?usp=share_link).
There seems to be no problem with the training scripts. Problems with generating short speech are often caused by the non-autoregressive text-to-unit generation module. Have you modified this part of the code?
Yes, I think the problem is in the non-autoregressive text-to-unit generation module. I did not change any part of the training code or the model. Do you have any idea what is happening?
Sorry, I haven't encountered this problem before, so I don't yet know how to solve it. Maybe you can retrain with the latest code and record the final loss. Then we can see whether the loss after convergence is within the normal range.
Dear authors,
Do you know what the potential problem is?
I have the same issue. Is there any update?
Hi Emre,
I am training on my own data. I applied the loss bug fix. ASR and translation seem okay (WER decreases to ~30%). However, I still cannot get meaningful audio outputs after the loss bug fix. They are very short and sound like noise.
I use another HuBERT model to extract source units. Does this affect the situation?
That was the problem :). I misunderstood some part of the model. When I changed back to the original HuBERT, the problem was solved.
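For anyone who suspects the same cause, here is a minimal sketch of two sanity checks on extracted units. It assumes units are plain integer cluster ids and that you can extract units for the same audio with both the original HuBERT setup and the replacement. The function names, the toy unit values, and `num_clusters=1000` are hypothetical and are not taken from this repository.

```python
from typing import List, Sequence


def out_of_range_units(units: Sequence[int], num_clusters: int) -> List[int]:
    """Return the sorted unit ids that fall outside [0, num_clusters)."""
    return sorted({u for u in units if not 0 <= u < num_clusters})


def unit_agreement(reference: Sequence[int], candidate: Sequence[int]) -> float:
    """Fraction of frames with identical unit ids, over the shorter sequence."""
    n = min(len(reference), len(candidate))
    if n == 0:
        return 0.0
    # zip stops at the shorter sequence, so exactly n pairs are compared.
    matches = sum(r == c for r, c in zip(reference, candidate))
    return matches / n


# Toy data, made up for illustration only.
# reference_units: units from the original extractor (assumed).
# candidate_units: units for the same audio from a different extractor (assumed).
reference_units = [63, 63, 991, 162, 73, 73, 338]
candidate_units = [12, 12, 405, 1020, 7, 7, 250]

# num_clusters is an assumption; use the codebook size your vocoder expects.
print(out_of_range_units(candidate_units, num_clusters=1000))
print(unit_agreement(reference_units, candidate_units))
```

A range check alone cannot detect a different HuBERT or k-means model that happens to use the same number of clusters, which is why the frame-level agreement on identical audio is the more telling number: it should be close to 1.0 when the extractor matches the one the unit vocabulary was built with.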
Could you please provide the link to the data?
I tried to reproduce the training of the fr-en simultaneous model. I followed the instructions to prepare the dataset and ran the script train.simul-s2st.sh.
Training seems to go fine, but during evaluation of our trained model (using ./simuleval.simul-s2st.sh), the model behaves strangely.
Here is the training log:
During inference, when I ran the eval scripts on the example you provided, the model output the correct text translation but incorrect speech (the output speech is almost silent). I printed the text output and the speech unit output as follows:
Do you know what the problem may be?
Thank you
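A quick way to quantify the "correct text, almost silent speech" symptom is to compare the number of generated units with the length of the generated text. The sketch below assumes the units are printed as a space-separated string of integers and that each unit covers about 20 ms of audio (the usual HuBERT frame rate of 50 Hz); if the units are de-duplicated before the vocoder, the estimate is only a lower bound. The helper names and the threshold of 5 units per word are hypothetical and chosen for illustration only.

```python
from typing import List, Sequence


def parse_units(unit_str: str) -> List[int]:
    """Parse a space-separated string of unit ids into a list of ints."""
    return [int(tok) for tok in unit_str.split()]


def estimated_duration_s(units: Sequence[int], frame_shift_s: float = 0.02) -> float:
    """Rough speech duration implied by the unit count (assumes 20 ms per unit)."""
    return len(units) * frame_shift_s


def looks_too_short(text: str, units: Sequence[int],
                    min_units_per_word: float = 5.0) -> bool:
    """Flag outputs whose unit count is implausibly small for the text length.

    min_units_per_word is an arbitrary illustrative threshold, not a tuned value.
    """
    words = text.split()
    if not words:
        return False
    return len(units) / len(words) < min_units_per_word


# Toy example, made up for illustration only.
text_out = "hello everyone welcome"
units_out = parse_units("63 991 162")

print(estimated_duration_s(units_out))
print(looks_too_short(text_out, units_out))
```

If the check fires on most utterances while the text output is correct, that points at the text-to-unit stage or at the target units used for training rather than at the translation itself.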