We introduce GaussianSpeech, a novel approach that synthesizes high-fidelity animation sequences of photo-realistic, personalized 3D human head avatars from spoken audio. To capture the expressive, detailed nature of human heads, including skin furrowing and finer-scale facial movements, we couple the speech signal with 3D Gaussian splatting (3DGS) to create realistic, temporally coherent motion sequences. We propose a compact and efficient 3DGS-based avatar representation that generates expression-dependent color and leverages wrinkle- and perceptually-based losses to synthesize facial details, including wrinkles that occur with different expressions. To enable sequence modeling of 3D Gaussian splats with audio, we devise an audio-conditioned transformer model capable of extracting lip and expression features directly from audio input. Because high-quality datasets of talking humans with corresponding audio are lacking, we captured a new large-scale multi-view dataset of audio-visual sequences of talking humans with native English accents and diverse facial geometry. GaussianSpeech consistently achieves state-of-the-art performance with visually natural motion at real-time rendering rates, while encompassing diverse facial expressions and styles.