Setting up a voice conversion pipeline #117

Open
holmbuar opened this issue Sep 27, 2024 · 2 comments

@holmbuar

holmbuar commented Sep 27, 2024

I successfully made your pipeline example run on my Mac. I did not expect to meet an assistant, but understand a bit more now about the intention of this project.

I would like to build a pipeline for voice conversion, similar to the product that ElevenLabs are offering. In their app you can upload a sound file up to 50 MB, and get a configurable voice conversion of the original speech sample. Microsoft SpeechT5 also offers voice conversion, but one would have to build a custom framework around that model.

Is speech-to-speech a relevant tool for such a task, or should I look at other s2s models or frameworks?

EDIT: After writing this, I realized that GPT-4o is an AI voice-controlled assistant. My bad. It would still be nice to know if this pipeline can easily be modified to accept sound files and convert voices.

EDIT2: I found this HuggingFace audio course, which I guess pretty much covers the basics. However, the ElevenLabs voice conversion outputs an audio file where the converted words are synced to the spoken words on a timeline, in practical terms mimicking the pace and style of the speaker. Unless I am missing something obvious, it seems my best option is to build a custom framework around the SpeechT5 VC model.

EDIT3: I think this problem is solved, for example by WhisperX. If one wishes to build a framework from scratch, it would involve:

  1. An STT model like whisper-distil-large for speech transcription
  2. An aligner like PyTorch Audio (torchaudio) forced alignment
  3. A TTS model like parler-tts
  4. Finally, a custom framework for syncing the converted speech chunks to the original voice recording
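
As a rough illustration of step 4 only, here is a minimal sketch of how the syncing could be planned. It assumes the aligner has already produced start/end timestamps for each chunk of the original recording and that the duration of each converted (TTS) chunk is known. The names `Segment`, `Placement`, and `plan_placements`, as well as the rate limits, are hypothetical and not part of any of the libraries mentioned above. The actual time-stretching and mixing of audio is left to an audio library of your choice.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """A span of the original recording (hypothetical aligner output)."""
    text: str
    start: float  # seconds from the start of the recording
    end: float    # seconds from the start of the recording


@dataclass
class Placement:
    """Where and how fast to play one converted chunk."""
    text: str
    start: float  # position of the chunk on the output timeline, in seconds
    rate: float   # playback-rate factor: >1 speeds the chunk up, <1 slows it down


def plan_placements(
    segments: List[Segment],
    synth_durations: List[float],
    min_rate: float = 0.8,   # assumed limit to keep stretched speech natural
    max_rate: float = 1.25,  # assumed limit to keep compressed speech natural
) -> List[Placement]:
    """Fit each converted chunk into the time slot of its original segment."""
    if len(segments) != len(synth_durations):
        raise ValueError("need exactly one synthesized duration per segment")

    placements = []
    for seg, duration in zip(segments, synth_durations):
        slot = seg.end - seg.start
        if slot <= 0 or duration <= 0:
            raise ValueError(f"non-positive duration for chunk {seg.text!r}")
        # Rate needed to make the converted chunk last exactly as long as
        # the original slot, clamped so the speech is not distorted too much.
        rate = max(min_rate, min(max_rate, duration / slot))
        placements.append(Placement(seg.text, seg.start, rate))
    return placements


if __name__ == "__main__":
    original = [Segment("hello", 0.0, 1.0), Segment("world", 1.5, 2.5)]
    for p in plan_placements(original, synth_durations=[1.2, 0.5]):
        print(p)
```

Clamping the rate means a chunk may end slightly before or after its original slot. That is a deliberate trade-off in this sketch: each chunk still starts at the original timestamp, so the pacing follows the speaker, while extreme speed changes are avoided.
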
@andimarafioti
Member

Hi! I think this could be relevant for this project. So far we have focused mostly on chatting with LLMs, but voice conversion is just around the corner, and I don't see a reason why we wouldn't support it here.

@PaParaZz1

For a voice conversion (VC) example, you could refer to our released project, CleanS2S.
