vc-lm is a project that can transform anyone's voice into thousands of different voices.
- [2023/06/09] Added support for the Any-to-One voice conversion model.
This project is based on the paper VALL-E. It uses EnCodec to discretize audio into tokens and builds a Transformer language model over those tokens. The pipeline consists of two stages: an autoregressive (AR) model and a non-autoregressive (NAR) model.
Input: a 3-second voice prompt + the audio to be converted
Output: the converted audio
Training is self-supervised: the source audio and the target audio are the same utterance.
AR model
Input: prompt audio + source audio
Output: level-0 tokens of the target audio

NAR model
Input: token levels 0 through k of the target audio
Output: token level k+1 of the target audio
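To make the two-stage pipeline concrete, here is a minimal decoding sketch. The `ar_model.generate` / `nar_model.predict` calls and tensor shapes are illustrative placeholders, not the project's actual API:

```python
import torch

def two_stage_decode(ar_model, nar_model, prompt_tokens, source_features,
                     num_levels: int = 8):
    """Illustrative VALL-E-style decoding: the AR model generates the
    level-0 EnCodec token stream, then the NAR model fills in the
    remaining levels one level at a time."""
    # Stage 1 (AR): autoregressively generate level-0 tokens of the target,
    # conditioned on the voice prompt and the source-audio features.
    level0 = ar_model.generate(prompt=prompt_tokens, source=source_features)

    # Stage 2 (NAR): pass k consumes levels 0..k and predicts level k+1
    # for all time steps in parallel.
    levels = [level0]
    for k in range(num_levels - 1):
        next_level = nar_model.predict(
            known_levels=torch.stack(levels),  # (k+1, T) tokens so far
            prompt=prompt_tokens,
            level=k + 1,
        )
        levels.append(next_level)

    # An EnCodec decoder turns the (num_levels, T) token matrix back
    # into a waveform.
    return torch.stack(levels)
```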
```bash
# All WAV files are first processed into clips of 10 to 24 seconds; see tools/construct_wavs_file.py.
python tools/construct_dataset.py
```

Extract the Whisper encoder:

```bash
python tools/extract_whisper_encoder_model.py --input_model=../whisper/medium.pt --output_model=../whisper-encoder/medium-encoder.pt
```
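For reference, extracting the encoder usually amounts to keeping only the `encoder.*` weights. A minimal sketch, assuming the openai-whisper checkpoint layout (`{'dims': ..., 'model_state_dict': ...}`); the actual tools/extract_whisper_encoder_model.py may differ:

```python
import torch

# Assumed layout of an openai-whisper checkpoint: a dict with model
# hyperparameters under 'dims' and weights under 'model_state_dict'.
ckpt = torch.load('../whisper/medium.pt', map_location='cpu')

# Keep only the encoder weights; drop the text decoder.
encoder_state = {k: v for k, v in ckpt['model_state_dict'].items()
                 if k.startswith('encoder.')}

torch.save({'dims': ckpt['dims'], 'model_state_dict': encoder_state},
           '../whisper-encoder/medium-encoder.pt')
```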
Train the AR and NAR models:

```bash
bash ./sh/train_ar_model.sh
bash ./sh/train_nar_model.sh
```
```python
from vc_lm.vc_engine import VCEngine

engine = VCEngine('/root/autodl-tmp/vc-models/ar.ckpt',
                  '/root/autodl-tmp/vc-models/nar.ckpt',
                  '/root/project/vc-lm/configs/ar_model.json',
                  '/root/project/vc-lm/configs/nar_model.json')
output_wav = engine.process_audio(content_wav,
                                  style_wav,
                                  max_style_len=3,
                                  use_ar=True)
```
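`content_wav` and `style_wav` must be loaded beforehand. One plausible way, assuming 24 kHz mono input (EnCodec's sample rate; check the repo for the rate the engine actually expects):

```python
import librosa

# content_wav: the speech whose content will be converted.
# style_wav: a ~3-second prompt in the desired voice.
# sr=24000 is an assumption (EnCodec operates at 24 kHz).
content_wav, _ = librosa.load('content.wav', sr=24000, mono=True)
style_wav, _ = librosa.load('style.wav', sr=24000, mono=True)
```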
Both the AR and NAR models were trained on the WenetSpeech dataset, which contains thousands of hours of audio.
Model download link: https://pan.baidu.com/s/1bJUXrSH7tJ1QLPTv3tZzRQ (extraction code: 4kao)
With the models in this project, a large amount of one-to-any parallel data can be generated (equivalently, any-to-one parallel data). This data can be used to train an any-to-one voice conversion model, and excellent results can be achieved with as little as 10 minutes of the target speaker's audio.
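One plausible way to construct such pairs with the `VCEngine` API shown above is to render each target-speaker utterance with many different style prompts and keep each (converted audio, original audio) pair. The paths, prompt pool, and sample rate below are assumptions, not part of the project:

```python
import os
import librosa
import soundfile as sf
from vc_lm.vc_engine import VCEngine

# Reuse the pretrained checkpoints/configs from the inference example.
engine = VCEngine('/root/autodl-tmp/vc-models/ar.ckpt',
                  '/root/autodl-tmp/vc-models/nar.ckpt',
                  '/root/project/vc-lm/configs/ar_model.json',
                  '/root/project/vc-lm/configs/nar_model.json')

SR = 24000  # assumed sample rate (EnCodec operates at 24 kHz)
prompts = [os.path.join('prompt_pool', f) for f in os.listdir('prompt_pool')]

os.makedirs('parallel/source', exist_ok=True)
for utt in os.listdir('target_speaker_wavs'):
    target_wav, _ = librosa.load(os.path.join('target_speaker_wavs', utt),
                                 sr=SR, mono=True)
    for i, p in enumerate(prompts):
        style_wav, _ = librosa.load(p, sr=SR, mono=True)
        # Render the target speaker's utterance in another voice; the pair
        # (converted audio -> original target audio) is one any-to-one example.
        converted = engine.process_audio(target_wav, style_wav,
                                         max_style_len=3, use_ar=True)
        sf.write(os.path.join('parallel/source', f'{i}_{utt}'), converted, SR)
```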
```bash
# All WAV files are first processed into clips of 10 to 24 seconds; see tools/construct_wavs_file.py.
python tools/construct_dataset.py
# Construct the train/val/test parallel data.
python tools/construct_parallel_dataset.py
```
Load the pretrained models from above and fine-tune them on the target speaker's data.
```bash
bash ./sh/train_finetune_ar_model.sh
bash ./sh/train_finetune_nar_model.sh
```
```python
from vc_lm.vc_engine import VCEngine

engine = VCEngine('/root/autodl-tmp/vc-models/jr-ar.ckpt',
                  '/root/autodl-tmp/vc-models/jr-nar.ckpt',
                  '/root/project/vc-lm/configs/ar_model.json',
                  '/root/project/vc-lm/configs/nar_model.json')
output_wav = engine.process_audio(content_wav,
                                  style_wav,
                                  max_style_len=3,
                                  use_ar=True)
```