
Latency Optimization for Speech-to-Speech Pipeline #107

yatharthk2 opened this issue Sep 18, 2024 · 3 comments

@yatharthk2

Hi,

I am currently running the speech-to-speech pipeline on an AWS EC2 instance (Ubuntu 20.04) with an NVIDIA A10G GPU. The pipeline works well, but end-to-end latency is around 1 second, and I would like to reduce the latency of the whole speech-to-speech pipeline, especially the Text-to-Speech (TTS) part.

Current Setup:
EC2 Instance: NVIDIA A10G GPU, 24 GB GPU RAM
OS: Ubuntu 20.04
GPU Driver: 470.141.03 (reported by nvidia-smi)
CUDA Version: 12.2
Pipeline: Using the standard setup from your repo
STT Model: Whisper large-v2
TTS Model: Parler-TTS (default)

Problem:
I’m currently facing around 1 second of latency for the entire pipeline from speech input to speech output. While the STT part works fairly well, the TTS step seems to contribute most to the latency. I would greatly appreciate any suggestions or guidance on reducing the overall latency, particularly for TTS.
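
To confirm where the time actually goes before optimizing, it can help to time each stage of a single turn. A minimal sketch, assuming hypothetical `stt`, `llm`, and `tts` callables standing in for the real pipeline stages (for GPU stages you may also want to call `torch.cuda.synchronize()` before reading the timer):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(name, results):
    # Record wall-clock time for one pipeline stage.
    start = time.perf_counter()
    yield
    results[name] = time.perf_counter() - start

def profile_turn(audio_chunk, stt, llm, tts):
    # stt, llm, tts are placeholders for the actual pipeline stages.
    results = {}
    with timed("stt", results):
        text = stt(audio_chunk)
    with timed("llm", results):
        reply = llm(text)
    with timed("tts", results):
        audio_out = tts(reply)
    total = sum(results.values())
    for stage, seconds in results.items():
        print(f"{stage}: {seconds * 1000:.0f} ms ({seconds / total:.0%})")
    return audio_out
```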

Thanks!

@sandorkonya

Take a look at this.
The proposed method speeds up the TTS part.
It also mentions that 500 ms is appended after the last chunk, which means there is a 500 ms delay before the LLM --> TTS steps begin.
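
If that trailing padding comes from end-of-speech detection, part of the delay is simply the VAD waiting for enough silence before it closes the utterance. As a rough illustration (not necessarily how this repo wires it up), Silero VAD exposes a `min_silence_duration_ms` knob that trades response delay against the risk of cutting the speaker off mid-sentence:

```python
import torch

# Load Silero VAD from torch.hub (downloads the model on first use).
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, VADIterator, _ = utils

# Requiring 500 ms of trailing silence means the pipeline waits at least
# 500 ms after the user stops speaking before the LLM --> TTS steps start.
vad_iterator = VADIterator(
    model,
    threshold=0.5,
    sampling_rate=16000,
    min_silence_duration_ms=500,  # lowering this (e.g. to 300) responds sooner,
                                  # at the cost of more premature cut-offs
)
```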

@yatharthk2
Author

Thank you for sharing the link and your discussion with the author. I understand the role of the Whisper streamer in accelerating the text-to-speech process. I am also aware of the 500 ms latency figure for Parler-TTS, but I don't believe I am achieving it. Is there any way I can optimize the Parler-TTS setup to reach the 500 ms target?
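
For what it's worth, the usual levers for Parler-TTS generation speed are half precision, a warm-up pass, and (optionally) compiling the model. A minimal sketch under those assumptions; the checkpoint name, dtype, and compile mode below are assumptions and may need adjusting to whatever the pipeline actually loads:

```python
import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer

device = "cuda:0"
dtype = torch.float16  # half precision roughly halves compute and memory traffic

# Assumed checkpoint; substitute whatever the pipeline actually loads.
checkpoint = "parler-tts/parler_tts_mini_v0.1"
model = ParlerTTSForConditionalGeneration.from_pretrained(
    checkpoint, torch_dtype=dtype
).to(device)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Optional: torch.compile trades a slow first call for faster steady-state
# generation; whether it helps depends on the model and PyTorch version.
model.forward = torch.compile(model.forward, mode="reduce-overhead")

description = "A female speaker delivers her words expressively."
prompt = "Hey, how are you doing today?"
input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

with torch.inference_mode():
    # Warm-up call so compilation / CUDA graph capture cost is paid up front.
    _ = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
    generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)

audio_arr = generation.cpu().to(torch.float32).numpy().squeeze()
```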

@yatharthk2
Author

I am trying to make this pipeline really fast. I tried integrating StyleTTS, but it seems streaming is not compatible with StyleTTS as of now. How would you approach the latency optimization?
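
One engine-agnostic way to attack the latency is to stop waiting for the complete LLM reply: split the streamed text at sentence boundaries and hand each sentence to TTS while the next one is still being generated, so playback can start after the first sentence. A rough sketch of that pattern; `llm_token_stream`, `tts_synthesize`, and `play_audio` are placeholders, not this repo's API:

```python
import re
import threading
import queue

SENTENCE_END = re.compile(r"[.!?]\s")

def sentence_chunks(token_stream):
    # Accumulate streamed tokens and yield complete sentences as soon as they close.
    buffer = ""
    for token in token_stream:
        buffer += token
        match = SENTENCE_END.search(buffer)
        while match:
            yield buffer[: match.end()].strip()
            buffer = buffer[match.end():]
            match = SENTENCE_END.search(buffer)
    if buffer.strip():
        yield buffer.strip()

def speak_streaming(llm_token_stream, tts_synthesize, play_audio):
    # Overlap TTS with LLM generation: synthesize sentence i while sentence i+1
    # is still being produced, and play audio from a separate thread.
    audio_queue = queue.Queue()

    def playback_worker():
        while True:
            audio = audio_queue.get()
            if audio is None:  # sentinel marks the end of the reply
                break
            play_audio(audio)

    player = threading.Thread(target=playback_worker, daemon=True)
    player.start()
    for sentence in sentence_chunks(llm_token_stream):
        audio_queue.put(tts_synthesize(sentence))
    audio_queue.put(None)
    player.join()
```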
