
Commit: debug math formulate
devilran6 committed Aug 25, 2024
1 parent 593bb91 commit f51d85a
Showing 1 changed file with 18 additions and 17 deletions: index.html

<h2 class="title is-2 has-text-centered">Abstract</h2>
</div>
<div class="markdown has-text-centered">
<div align="left" style="font-size:20px; text-align:justify">
<p>Speech enhancement plays an essential role in various applications, and the integration of visual
information has been demonstrated to bring substantial advantages.
However, existing works mainly focus on the analysis of facial and lip movements, whereas contextual visual
cues from the surrounding environment have been overlooked:
for example, when we see a dog bark, our brain has the innate ability to discern and filter out the barking
noise.
To this end, in this paper we introduce a novel task, i.e., Scene-aware Audio-Visual Speech Enhancement.
To the best of our knowledge, this is the first proposal to use rich contextual information from
synchronized video as auxiliary cues to indicate the type of noise, which ultimately improves speech
enhancement performance.
Specifically, we propose the \(\text{VC-S}^2\text{E}\) method, which incorporates the Conformer and Mamba
modules for their complementary strengths.
Extensive experiments are conducted on the public MUSIC, AVSpeech and AudioSet datasets, where the results
demonstrate the superiority of \(\text{VC-S}^2\text{E}\) over other competitive methods.
We will make the source code publicly available.
Project demo page: https://AVSEPage.github.io/</p>
</div>
</div>
</div>
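The commit message ("debug math formulate") suggests the edit was about getting the method name to render. In MathJax 3, `\[ ... \]` produces display math on its own line, while `\( ... \)` is the inline form suited to a name embedded in a sentence. A minimal sketch of such a setup, assuming the page loads MathJax 3 from a CDN (this snippet is illustrative, not taken from the repository):

```html
<!-- Assumed MathJax 3 setup: \( ... \) is inline math, \[ ... \] is display math,
     so an expression inside a sentence should use the \( ... \) delimiters. -->
<script>
  window.MathJax = {
    tex: { inlineMath: [['\\(', '\\)'], ['$', '$']] }
  };
</script>
<script async src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-chtml.js"></script>

<p>We propose the \(\text{VC-S}^2\text{E}\) method.</p>
```

Wrapping the letters in `\text{...}` also keeps the hyphen from rendering as a minus sign, which a bare `VC-S^2E` in math mode would do.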