VC-S2E (Ours): Our proposed model, which leverages both audio and visual modalities to improve speech quality and intelligibility.
VC-S2E (w/o Ea, Ev): An audio-only version of the model. Here, Ea and Ev denote the Scenario-Aware Audio Embedding and the Visual Embedding, respectively, and "w/o" means "without".
Conformer: (TO ADD)
Noisy video: The video showing the noise source, used as the model's second input modality.
Grad-CAM: The middle frame of the video overlaid with the corresponding Grad-CAM heatmap, highlighting the key noise regions identified by the video encoder.
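
For readers unfamiliar with Grad-CAM, the following is a minimal sketch of how such a heatmap is typically computed. It is not the code used for these demos. It assumes that the feature maps of one convolutional layer of the video encoder, and the gradients of some target score with respect to them, have already been extracted for the middle frame. The function names `grad_cam` and `middle_frame_index`, the nested-list input layout, and the choice of layer and target score are all assumptions made for illustration.

```python
def middle_frame_index(num_frames):
    """Index of the middle frame of a clip (hypothetical helper)."""
    return num_frames // 2


def grad_cam(activations, gradients):
    """Compute a normalized Grad-CAM heatmap.

    activations, gradients: nested lists shaped [K][H][W], i.e. K feature
    maps of one convolutional layer and the gradients of a target score
    with respect to them (assumed to be extracted beforehand).
    Returns an [H][W] heatmap scaled to the range [0, 1].
    """
    height = len(activations[0])
    width = len(activations[0][0])

    # Channel weights: global average pooling of the gradients.
    weights = [
        sum(sum(row) for row in grad_map) / (height * width)
        for grad_map in gradients
    ]

    # Weighted sum of feature maps followed by ReLU.
    cam = [
        [
            max(0.0, sum(w * act[i][j] for w, act in zip(weights, activations)))
            for j in range(width)
        ]
        for i in range(height)
    ]

    # Scale to [0, 1] so the map can be overlaid on the frame.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam
```

In practice the resulting map would be upsampled to the frame resolution and blended with the middle frame as a color overlay.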