GaussianSpeech: Audio-Driven Gaussian Avatars

We introduce GaussianSpeech, a novel approach that synthesizes high-fidelity animation sequences of photo-realistic, personalized 3D human head avatars from spoken audio. To capture the expressive, detailed nature of human heads, including skin furrowing and finer-scale facial movements, we propose coupling the speech signal with 3D Gaussian splatting to create realistic, temporally coherent motion sequences. We propose a compact and efficient 3DGS-based avatar representation that generates expression-dependent color and leverages wrinkle- and perceptually-based losses to synthesize facial details, including wrinkles that occur with different expressions. To enable sequence modeling of 3D Gaussian splats with audio, we devise an audio-conditioned transformer model that extracts lip and expression features directly from the audio input. Since high-quality datasets of talking humans paired with audio are lacking, we captured a new large-scale multi-view dataset of audio-visual sequences of talking humans with native English accents and diverse facial geometry. GaussianSpeech consistently achieves state-of-the-art performance with visually natural motion at real-time rendering rates, while encompassing diverse facial expressions and styles.
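The pipeline the abstract describes can be sketched in outline: audio features condition a transformer that regresses per-Gaussian attributes, including expression-dependent color and motion. The following is a minimal, hypothetical NumPy sketch (not the authors' code or architecture) in which per-Gaussian query embeddings cross-attend to a sequence of audio features; the dimensions, weight names, and single-head attention are illustrative assumptions only.

```python
# Hypothetical sketch of audio-conditioned 3D Gaussian animation:
# each Gaussian's learned query attends to the audio feature sequence,
# and small heads predict a position offset and an expression-dependent
# color. All sizes and weights below are illustrative, not from the paper.
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, audio_feats, Wq, Wk, Wv):
    """Gaussian queries attend over the audio frames (keys/values)."""
    Q = queries @ Wq                                 # (G, d)
    K = audio_feats @ Wk                             # (T, d)
    V = audio_feats @ Wv                             # (T, d)
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # (G, T)
    return attn @ V                                  # (G, d)

G, T, d_a, d = 1024, 30, 64, 32          # Gaussians, audio frames, dims (assumed)
gauss_embed = rng.normal(size=(G, d))    # learned per-Gaussian embeddings
audio_feats = rng.normal(size=(T, d_a))  # e.g. features from a speech encoder

Wq = rng.normal(size=(d, d)) * 0.1
Wk = rng.normal(size=(d_a, d)) * 0.1
Wv = rng.normal(size=(d_a, d)) * 0.1
W_pos = rng.normal(size=(d, 3)) * 0.1    # head for per-Gaussian motion offsets
W_col = rng.normal(size=(d, 3)) * 0.1    # head for expression-dependent color

feat = cross_attention(gauss_embed, audio_feats, Wq, Wk, Wv)
delta_pos = feat @ W_pos                     # (G, 3) position offsets
color = 1.0 / (1.0 + np.exp(-(feat @ W_col)))  # (G, 3) colors in [0, 1]

print(delta_pos.shape, color.shape)
```

In a real system these predictions would drive the positions and colors of a 3D Gaussian splatting renderer per frame; here the sketch only demonstrates the shapes of the conditioning path from audio to per-Gaussian attributes.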