AlignVSR

"This repository is directly related to the AlignVSR paper. We will continue to maintain and improve the code in the future."

Data Structure

LRS2-BBC
├── lrs2
│   ├── lrs2_video_seg24s
│   │   ├── main
│   │   │   ├── 5535415699068794046
│   │   │   │   ├── 0001.mp4
│   │   │   │   ├── 0001.wav
│   │   │   │   ├── ...
│   │   │   ├── ...
│   │   ├── pretrain
│   │   │   ├── 5535415699068794046
│   │   │   │   ├── 00001.mp4
│   │   │   │   ├── 00001.wav
│   │   │   │   ├── ...
│   │   │   ├── ...
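
For reference, the paired video/audio files in this layout can be collected with a short Python sketch like the one below (the root path is a placeholder):

# Collect (video, audio) pairs from the directory layout above.
# The root path is a placeholder; point it at your own copy of the data.
from pathlib import Path

root = Path("/path/to/LRS2-BBC/lrs2/lrs2_video_seg24s")

pairs = []
for split in ("main", "pretrain"):
    for mp4 in sorted((root / split).glob("*/*.mp4")):
        wav = mp4.with_suffix(".wav")
        if wav.exists():
            pairs.append((mp4, wav))

print(f"found {len(pairs)} video/audio pairs")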

1. Environment Setup and Preprocessing

We adopt the same preprocessing approach as the AUTO-AVSR repository for the LRS2 and CNVSRC.Single datasets. Following the AUTO-AVSR preparation steps, we process both datasets to generate the corresponding train.csv and test.csv files.
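
For orientation only, the manifest is a per-clip CSV. The sketch below writes one row; its column layout (dataset name, relative video path, frame count, token IDs) is an assumption modeled on AUTO-AVSR-style preparation scripts, so follow the AUTO-AVSR preparation instructions for the authoritative schema:

# Illustrative only: writing a train.csv manifest.
# The column layout here is an assumption, not the authoritative AUTO-AVSR schema.
import csv

rows = [
    # (dataset, relative video path, number of frames, space-separated token ids)
    ("lrs2", "lrs2_video_seg24s/main/5535415699068794046/0001.mp4", 75, "12 48 3 91"),
]

with open("train.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)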

1.1 AlignVSR Environment Setup

This guide will walk you through the process of setting up the AlignVSR environment, installing necessary dependencies, and preparing for preprocessing.

git clone git@github.com:liu12366262626/AlignVSR.git
cd AlignVSR
conda env create -f alignvsr_env.yaml
conda activate alignvsr
cd tools/face_alignment
pip install --editable .
cd ../face_detection
pip install --editable .

1.2 Preprocessing

Preprocess Dataset

cd preprocess_data
python preprocess.py --root_dir /[path-to-origin_LRS2_data] --dst_path /[path-to-save-preprocess_data]
  • --root_dir: path to the original LRS2 dataset.
  • --dst_path: path where the preprocessed data will be stored.

After preprocessing all the video files, you need to generate the corresponding audio files:

cd preprocess_data
python generate_audio.py --root_dir /[path-to-origin_LRS2_data]  --dst_path /[path-to-save-preprocess_data]/data
  • --root_dir: path to the original LRS2 dataset.
  • --dst_path: path where the extracted audio will be stored.

Finally, the videos are processed to a size of 96x96 at 25 fps, and the audio is converted to mono at a 16 kHz sample rate.
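
As an optional sanity check (a sketch, not part of the released pipeline; the paths are placeholders), the expected output format can be verified with OpenCV and soundfile:

# Verify that a preprocessed clip is 96x96 at 25 fps with mono 16 kHz audio.
# Paths are placeholders.
import cv2
import soundfile as sf

video_path = "/path/to/preprocess_data/data/main/5535415699068794046/0001.mp4"
audio_path = video_path.replace(".mp4", ".wav")

cap = cv2.VideoCapture(video_path)
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
fps = cap.get(cv2.CAP_PROP_FPS)
cap.release()

info = sf.info(audio_path)

assert (width, height) == (96, 96), (width, height)
assert round(fps) == 25, fps
assert info.channels == 1 and info.samplerate == 16000, (info.channels, info.samplerate)
print("preprocessed clip matches the expected format")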

2. Phase1-K-means

For the LRS2 and CNVSRC.Single datasets, we randomly sample a portion of the audio data from the training set to train a k-means model with a total of 200 clusters. For specific steps, please refer to this link. After completing this step, we will obtain the k-means model for the next phase of training.
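
A minimal sketch of this step is shown below, assuming frame-level audio features (e.g. HuBERT hidden states) have already been dumped to .npy files; the file paths and sampling strategy are assumptions, and the linked instructions remain authoritative:

# Fit a 200-cluster k-means model on a random subset of training-set audio features.
# Feature files and their layout (one (T, D) array per utterance) are assumptions.
import glob
import random

import joblib
import numpy as np
from sklearn.cluster import MiniBatchKMeans

feature_files = glob.glob("/path/to/hubert_features/train/*.npy")
random.seed(0)
subset = random.sample(feature_files, k=min(2000, len(feature_files)))
features = np.concatenate([np.load(f) for f in subset], axis=0)  # (T_total, D)

kmeans = MiniBatchKMeans(n_clusters=200, batch_size=10000, random_state=0)
kmeans.fit(features)

joblib.dump(kmeans, "kmeans_200.pkl")  # reused in later phases to quantize audio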

3. Phase2-ASR-Training

We use the pre-trained HuBERT model together with the trained k-means model to quantize the audio data. The quantized audio is then fed to a Conformer encoder and trained in an ASR paradigm with a hybrid CTC/Attention loss. For detailed steps, please refer to this link.
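
As a rough sketch of the quantization step only (the HuBERT checkpoint, the layer used for features, and the file names are assumptions; the linked instructions are authoritative), each audio frame can be mapped to one of the 200 k-means units like this:

# Quantize an utterance into discrete audio units with HuBERT + k-means.
# The torchaudio HuBERT bundle and the chosen feature layer are assumptions.
import joblib
import torch
import torchaudio

bundle = torchaudio.pipelines.HUBERT_BASE
hubert = bundle.get_model().eval()
kmeans = joblib.load("kmeans_200.pkl")

wav, sr = torchaudio.load("/path/to/preprocess_data/data/main/5535415699068794046/0001.wav")
if sr != bundle.sample_rate:
    wav = torchaudio.functional.resample(wav, sr, bundle.sample_rate)

with torch.inference_mode():
    layer_outputs, _ = hubert.extract_features(wav)  # one tensor per transformer layer

hidden = layer_outputs[5].squeeze(0)    # (T, D); the layer index is an assumption
units = kmeans.predict(hidden.numpy())  # (T,) cluster ids in [0, 200)
print(units[:20])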

4. Phase3-AlignVSR

After completing Phase2, we use the obtained quantized audio units as the K (Key) and V (Value) in a cross-attention mechanism, with the video features as the Q (Query). Additionally, we introduce the Local Align Loss to align the audio and video features at the frame level. For detailed steps, please refer to this link.
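
The following is a conceptual sketch of that fusion (dimensions, module names, and how the attention weights would feed the alignment loss are assumptions, not the paper's exact implementation):

# Conceptual sketch: quantized audio units as K/V, video features as Q.
# Hyperparameters and module names are assumptions for illustration.
import torch
import torch.nn as nn

class AudioVisualCrossAttention(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_units=200):
        super().__init__()
        self.unit_embed = nn.Embedding(n_units, d_model)  # audio units -> K, V
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, video_feats, audio_units):
        # video_feats: (B, T_v, d_model) visual encoder output, used as Q
        # audio_units: (B, T_a) k-means cluster ids, embedded and used as K and V
        kv = self.unit_embed(audio_units)
        fused, attn = self.cross_attn(query=video_feats, key=kv, value=kv)
        return fused, attn  # attn weights can feed a frame-level alignment loss

model = AudioVisualCrossAttention()
video = torch.randn(2, 50, 256)          # dummy visual features
units = torch.randint(0, 200, (2, 100))  # dummy quantized audio units
fused, attn = model(video, units)
print(fused.shape, attn.shape)  # (2, 50, 256) and (2, 50, 100)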

About

Visual Speech Recognition
