Setting up a voice conversion pipeline #117

holmbuar · 2024-09-27T09:12:27Z

I successfully made your pipeline example run on my Mac. I did not expect to meet an assistant, but understand a bit more now about the intention of this project.

I would like to build a pipeline for voice conversion, similar to the product that ElevenLabs offers. In their app you can upload a sound file of up to 50 MB and get a configurable voice conversion of the original speech sample. Microsoft SpeechT5 also offers voice conversion, but one would have to build a custom framework around that model.

Is speech-to-speech a relevant tool for such a task, or should I look at other s2s models or frameworks?

EDIT: After writing this, I realized that GPT-4o is an AI voice-controlled assistant. My bad. It would still be nice to know if this pipeline can easily be modified to accept sound files and convert voices.

EDIT2: I found the Hugging Face audio course, which I guess pretty much covers the basics. However, the ElevenLabs voice conversion outputs an audio file where the converted words are synced to the spoken words on a timeline, in practical terms mimicking the pace and style of the speaker. Unless I am missing something obvious, it seems my best option is to build a custom framework around the SpeechT5 VC model.

EDIT3: I think this problem is solved, for example by WhisperX. If one wishes to build a framework from scratch, it would involve:

An STT model like Distil-Whisper large for speech transcription
An aligner like torchaudio's forced alignment
A TTS model like Parler-TTS
Finally, a custom framework for syncing the converted speech chunks to the original voice recording
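
To make the last step concrete, here is a minimal sketch of the syncing idea in plain Python. It assumes the aligner has already produced start and end times for each chunk and that the TTS model has already produced mono samples for it. The names `Segment` and `place_on_timeline` are hypothetical and not part of any of the libraries mentioned above. Chunks that are too long for their slot are simply truncated here; a real pipeline would more likely time-stretch or regenerate them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """One aligned chunk (hypothetical structure, for illustration only)."""
    start_s: float        # start time in the original recording, from the aligner
    end_s: float          # end time in the original recording, from the aligner
    samples: List[float]  # synthesized mono audio for this chunk, from the TTS model


def place_on_timeline(segments: List[Segment], sample_rate: int, total_s: float) -> List[float]:
    """Lay synthesized chunks onto a silent track at their original start times.

    Assumes non-negative start times. Chunks longer than their original slot
    are truncated; chunks shorter than their slot are followed by silence.
    """
    timeline = [0.0] * round(total_s * sample_rate)
    for seg in segments:
        start = round(seg.start_s * sample_rate)
        slot = round(seg.end_s * sample_rate) - start
        chunk = seg.samples[:max(slot, 0)]            # cut to the original slot length
        end = min(start + len(chunk), len(timeline))  # never write past the track end
        if start < end:
            timeline[start:end] = chunk[:end - start]
    return timeline


# Toy example at 4 samples per second so the result is easy to read.
demo = place_on_timeline(
    [Segment(0.0, 0.5, [1.0] * 3), Segment(1.0, 1.5, [2.0] * 10)],
    sample_rate=4,
    total_s=2.0,
)
```

The output track has the same length as the original recording, so it can be compared against it or mixed with it sample by sample.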


andimarafioti · 2024-10-14T16:16:12Z

Hi! I think this could be relevant for this project. So far we have focused mostly on chatting with LLMs, but voice conversion is just around the corner from that, and I don't see a reason why we wouldn't support it here.

PaParaZz1 · 2024-11-02T03:51:50Z

For a voice conversion (VC) example, you could refer to our released project, CleanS2S.
