This code is for our paper, Visual Speech Enhancement Without A Real Visual Stream, published at WACV 2021.
Authors: Sindhu Hegde*, K R Prajwal*, Rudrabha Mukhopadhyay*, Vinay Namboodiri, C.V. Jawahar
[Paper] | [Project Page] | [Demo Video] (coming soon) | [Real-World Test Set] (coming soon)
- Denoise any real-world audio/video and obtain the clean speech.
- Works in unconstrained settings for any speaker in any language.
- Takes only audio as input, but exploits the benefits of lip movements by generating a synthetic visual stream.
- Complete training and inference code is available.
- Python 3.7+
- ffmpeg: `sudo apt-get install ffmpeg`
- Install the necessary packages: `pip install -r requirements.txt` (a quick environment-check sketch follows this list)
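As a quick check that the prerequisites above are in place, the following optional sketch (not part of the repository) verifies the Python version and that ffmpeg is on the PATH:

```python
# Optional helper (not part of the repository): verify the prerequisites above.
import shutil
import sys

assert sys.version_info >= (3, 7), "Python 3.7+ is required"
assert shutil.which("ffmpeg") is not None, \
    "ffmpeg not found; install it, e.g. with `sudo apt-get install ffmpeg`"
print("Prerequisites look fine.")
```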
- The pre-trained face detection model should be downloaded to `face_detection/detection/sfd/s3fd.pth` (an optional check of all the weight paths is sketched after the table below).
Model | Description | Link to the model |
---|---|---|
Denoising model | Weights of the denoising model (needed for inference) | Link |
Lipsync student | Weights of the student lipsync model to generate the visual stream for noisy audio inputs (needed for inference) | Link |
Wav2Lip teacher | Weights of the teacher lipsync model (only needed if you want to train the network from scratch) | Link |
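Before running inference, it can help to confirm that the downloaded weights are where you expect them. The sketch below is an optional helper; the first two paths are placeholders for wherever you saved the denoising and lipsync-student weights, while the face-detection path is the fixed location given above:

```python
# Optional helper: check that the downloaded weights exist before running inference.
from pathlib import Path

expected_files = [
    "checkpoints/denoising_model.pth",        # placeholder: denoising model weights
    "checkpoints/lipsync_student.pth",        # placeholder: student lipsync weights
    "face_detection/detection/sfd/s3fd.pth",  # face detection model (fixed path)
]

for path in expected_files:
    status = "found" if Path(path).is_file() else "MISSING"
    print(f"{status:>7}: {path}")
```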
You can denoise any noisy audio/video and obtain the clean speech of the target speaker using:
python inference.py --lipsync_student_model_path=<trained-student-model-ckpt-path> --checkpoint_path=<trained-denoising-model-ckpt-path> --input=<noisy-audio/video-file>
The result is saved (by default) in `results/result.mp4`. The result directory, along with several other options, can be specified in the arguments. The input can be any audio file (`*.wav`, `*.mp3`) or even a video file, from which the code automatically extracts the audio and generates the clean speech. Note that the noise should not be human speech, as this work tackles only denoising, not speaker separation.
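To denoise a whole folder of noisy files, the inference script can simply be called in a loop. The sketch below is one way to do this; the checkpoint paths and the `noisy_inputs` folder are placeholders, and the default output `results/result.mp4` is renamed after each run so it is not overwritten:

```python
# Sketch: batch-denoise a folder of noisy files by calling inference.py repeatedly.
# The checkpoint paths and the input folder are placeholders.
import subprocess
from pathlib import Path

STUDENT_CKPT = "checkpoints/lipsync_student.pth"   # placeholder
DENOISE_CKPT = "checkpoints/denoising_model.pth"   # placeholder
NOISY_DIR = Path("noisy_inputs")                   # folder of *.wav / *.mp3 / video files

for noisy_file in sorted(NOISY_DIR.glob("*")):
    if not noisy_file.is_file():
        continue
    print(f"Denoising {noisy_file} ...")
    subprocess.run(
        [
            "python", "inference.py",
            f"--lipsync_student_model_path={STUDENT_CKPT}",
            f"--checkpoint_path={DENOISE_CKPT}",
            f"--input={noisy_file}",
        ],
        check=True,
    )
    # inference.py writes results/result.mp4 by default; move it aside so the
    # next run does not overwrite it.
    Path("results/result.mp4").rename(f"results/{noisy_file.stem}_denoised.mp4")
```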
The synthetic visual stream (lip movements) can be generated for any noisy audio/video using:
cd lipsync
python inference.py --checkpoint_path=<trained-lipsync-student-model-ckpt-path> --audio=<path-of-noisy-audio/video>
The result is saved (by default) in `results/result_voice.mp4`. The result directory, along with several other options, can be specified in the arguments. The input can be any audio file (`*.wav`, `*.mp3`) or even a video file, from which the code automatically extracts the audio and generates the visual stream.
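If you prefer to drive this step from the repository root instead of `cd`-ing into `lipsync`, the sketch below uses `subprocess` with the `cwd` argument; the checkpoint and audio paths are placeholders:

```python
# Sketch: run the lipsync-student inference from the repository root.
# The cwd argument plays the role of the `cd lipsync` step above.
import subprocess

subprocess.run(
    [
        "python", "inference.py",
        "--checkpoint_path=../checkpoints/lipsync_student.pth",  # placeholder
        "--audio=../noisy_inputs/sample.wav",                    # placeholder
    ],
    cwd="lipsync",  # run inside the lipsync folder
    check=True,
)
# The result is saved (by default) as results/result_voice.mp4, resolved
# relative to the lipsync working directory.
```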
We illustrate the training process using the LRS3 and VGGSound datasets. Adapting to other datasets requires only small modifications to the code.
data_root (we use both the train-val and pre-train sets of the LRS3 dataset in this work)
├── list of folders
│   ├── five-digit numbered video IDs ending with .mp4
python preprocess.py --data_root=<dataset-path> --preprocessed_root=<path-to-save-the-preprocessed-data>
Additional options such as `batch_size` and the number of GPUs to use in parallel can also be set. A sanity-check sketch for the preprocessed output is given after the folder layout below.
preprocessed_root (lrs3_preprocessed)
├── list of folders
│   ├── folders with five-digit numbered video IDs
│   │   ├── *.jpg (extracted face crops from each frame)
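After preprocessing, a quick sanity check of the output can save time later. The sketch below only assumes the folder layout shown above; `lrs3_preprocessed` is a placeholder for your `preprocessed_root`:

```python
# Sketch: sanity-check the preprocessed folder layout shown above by counting
# video folders and extracted face crops.
from pathlib import Path

preprocessed_root = Path("lrs3_preprocessed")  # placeholder for preprocessed_root

num_videos, num_crops = 0, 0
for folder in preprocessed_root.iterdir():
    if not folder.is_dir():
        continue
    for video_dir in folder.iterdir():
        if not video_dir.is_dir():
            continue
        jpgs = list(video_dir.glob("*.jpg"))
        num_videos += 1
        num_crops += len(jpgs)
        if not jpgs:
            print(f"Warning: no face crops found in {video_dir}")

print(f"{num_videos} preprocessed videos, {num_crops} face crops in total")
```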
We use the VGGSound dataset as the noise data, which is mixed with the clean speech from the LRS3 dataset. We download the audio files (`*.wav`) from here. A toy sketch of such mixing is given after the folder layout below.
data_root (vgg_sound)
├── *.wav (audio files)
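For intuition only, noisy training samples are created by additively mixing a VGGSound noise clip with clean LRS3 speech. The sketch below shows one common way of doing such mixing at a chosen SNR; it is an illustration, not the repository's training pipeline, and it assumes `librosa` is available and a 16 kHz sampling rate:

```python
# Illustration only (not the repository's training code): additively mix a
# VGGSound noise clip with clean LRS3 speech at a target SNR.
import numpy as np
import librosa

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the mixture has the requested SNR, then add it to `clean`."""
    # Loop/trim the noise to match the clean-speech length.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[: len(clean)]

    clean_power = np.mean(clean ** 2) + 1e-8
    noise_power = np.mean(noise ** 2) + 1e-8
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

clean, _ = librosa.load("lrs3_sample.wav", sr=16000)       # placeholder paths
noise, _ = librosa.load("vggsound_sample.wav", sr=16000)
noisy = mix_at_snr(clean, noise, snr_db=5)
```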
There are two major steps: (i) train the student lipsync model, and (ii) train the denoising model.
Navigate to the lipsync folder: cd lipsync
The lipsync model can be trained using:
python train_student.py --data_root_lrs3_pretrain=<path-of-preprocessed-LRS3-pretrain-set> --data_root_lrs3_train=<path-of-preprocessed-LRS3-train-set> --noise_data_root=<path-of-VGGSound-dataset-to-mix-with-clean-speech> --wav2lip_checkpoint_path=<pretrained-wav2lip-teacher-model-ckpt-path> --checkpoint_dir=<path-to-save-the-trained-student-lipsync-model>
Note: The pre-trained Wav2Lip teacher model must be downloaded (wav2lip weights) before training the student model.
Navigate to the main directory: cd ..
The denoising model can be trained using:
python train.py --data_root_lrs3_pretrain=<path-of-preprocessed-LRS3-pretrain-set> --data_root_lrs3_train=<path-of-preprocessed-LRS3-train-set> --noise_data_root=<path-of-VGGSound-dataset-to-mix-with-clean-speech> --lipsync_student_model_path=<trained-student-lipsync-model-ckpt-path> --checkpoint_dir=<path-to-save-the-trained-denoising-model>
Training can also be resumed from a saved checkpoint. Run `python train.py --help` for more details. Additionally, less commonly used hyper-parameters can be set at the bottom of the `audio/hparams.py` file.
To be updated soon!
The software is licensed under the MIT License. Please cite the following paper if you use this code:
@misc{hegde2020visual,
title={Visual Speech Enhancement Without A Real Visual Stream},
author={Sindhu B Hegde and K R Prajwal and Rudrabha Mukhopadhyay and Vinay Namboodiri and C. V. Jawahar},
year={2020},
eprint={2012.10852},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
Parts of the lipsync code have been adapted from our Wav2Lip repository. The audio functions and parameters are taken from this TTS repository; we thank the author for this wonderful code. The code for face detection has been taken from the face_alignment repository; we thank the authors for releasing their code and models.