We propose GAF, a novel approach for reconstructing animatable 3D Gaussian avatars from monocular videos captured by commodity devices such as smartphones. Photorealistic 3D head avatar reconstruction from such recordings is challenging due to limited observations, which leave unobserved regions under-constrained and can lead to artifacts in novel views. To address this problem, we introduce a multi-view head diffusion model, leveraging its priors to fill in missing regions and ensure view consistency in Gaussian splatting renderings. To enable precise viewpoint control, we condition the diffusion model on normal maps rendered from a FLAME-based head reconstruction, which provide pixel-aligned inductive biases. We additionally condition it on VAE features extracted from the input image to preserve details of facial identity and appearance. For Gaussian avatar reconstruction, we distill the multi-view diffusion priors by using iteratively denoised images as pseudo-ground truths, effectively mitigating over-saturation issues. To further improve photorealism, we apply latent upsampling to refine the denoised latents before decoding them into images. We evaluate our method on the NeRSemble dataset, showing that GAF outperforms previous state-of-the-art methods in novel-view synthesis with a 5.34% higher SSIM score. Furthermore, we demonstrate higher-fidelity avatar reconstructions from monocular videos captured on commodity devices.
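To make the distillation step concrete, the following is a minimal sketch of how iteratively denoised images could serve as pseudo-ground truths for avatar optimization. All interfaces here (`renderer`, `diffusion`, `vae`, `predict_noise`, `upsample_latent`, and the hyperparameters) are hypothetical placeholders chosen for illustration under our reading of the abstract, not the paper's actual implementation.

```python
# A minimal sketch of the pseudo-ground-truth distillation loop described
# above, assuming hypothetical interfaces for the Gaussian renderer, the
# multi-view head diffusion model, and the VAE; none of these names or
# defaults come from the paper.
import torch

@torch.no_grad()
def make_pseudo_gt(renderer, diffusion, vae, cameras, normal_maps, id_feats,
                   noise_level=0.6, num_steps=25):
    """Render the current avatar from novel viewpoints, partially noise the
    latents, and iteratively denoise them with the multi-view diffusion
    prior; the decoded images act as pseudo-ground truths."""
    renders = renderer(cameras)                # Gaussian splatting renders, (V, 3, H, W)
    z = vae.encode(renders)                    # encode renders into latent space
    t0 = int(noise_level * diffusion.num_train_steps)
    z_t = diffusion.add_noise(z, t=t0)         # partial forward diffusion
    for t in diffusion.timesteps(start=t0, steps=num_steps):
        # Conditioning: FLAME normal maps (precise viewpoint control) and
        # VAE features of the input frame (identity/appearance preservation).
        eps = diffusion.predict_noise(z_t, t,
                                      cond_normals=normal_maps,
                                      cond_identity=id_feats)
        z_t = diffusion.step(eps, t, z_t)      # one reverse-diffusion update
    z_hat = diffusion.upsample_latent(z_t)     # latent upsampling before decoding
    return vae.decode(z_hat)                   # pseudo-ground-truth images
```

Under this sketch, the avatar would then be optimized with an ordinary photometric loss (e.g., L1 plus SSIM) between its renders and the pseudo-ground truths, rather than with a score-distillation gradient, which is what mitigates over-saturation.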