Creating realistic avatars from a single RGB image is an attractive yet challenging problem. Because the problem is ill-posed, recent works leverage powerful priors from 2D diffusion models pretrained on large datasets. Although 2D diffusion models demonstrate strong generalization capability, they cannot provide multi-view shape priors with guaranteed 3D consistency. We propose Human 3Diffusion: Realistic Avatar Creation via Explicit 3D Consistent Diffusion. Our key insight is that 2D multi-view diffusion models and 3D reconstruction models provide complementary information to each other, and that coupling them tightly lets us fully leverage the potential of both. We introduce a novel image-conditioned generative 3D Gaussian Splats reconstruction model that leverages the priors from 2D multi-view diffusion models and provides an explicit 3D representation, which in turn guides the 2D reverse sampling process toward better 3D consistency. Experiments show that our proposed framework outperforms state-of-the-art methods and enables the creation of realistic avatars from a single RGB image, achieving high fidelity in both geometry and appearance. Extensive ablations also validate the efficacy of our design: (1) conditioning the generative 3D reconstruction on multi-view 2D priors and (2) refining the consistency of the sampling trajectory via the explicit 3D representation.
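The coupling described above can be illustrated with a toy sketch. It is not the authors' implementation: the learned 2D multi-view diffusion model and the 3D Gaussian Splats reconstructor are replaced by hypothetical NumPy stand-ins (`toy_denoiser_2d`, `toy_reconstruct_3d`, `render`), images are 1D vectors, a "camera" is a circular shift, and the DDIM-style deterministic update is an assumption, since the abstract does not name a sampler. The sketch only shows the control flow: at every reverse step the per-view 2D estimate is fused into one explicit 3D representation, and the views rendered from it replace the 2D estimate in the sampling update.

```python
import numpy as np

def render(scene, view):
    # Toy "camera" (assumption): view v sees the scene circularly shifted by v.
    return np.roll(scene, view)

def toy_denoiser_2d(x_t, cond, num_views):
    # Hypothetical stand-in for a 2D multi-view diffusion prior. It returns one
    # clean-image estimate per view; the estimates need not agree across views.
    return np.stack([0.8 * render(cond, v) + 0.2 * np.tanh(x_t[v])
                     for v in range(num_views)])

def toy_reconstruct_3d(x0_views, num_views):
    # Hypothetical stand-in for the generative 3D reconstructor: undo each
    # view's shift and average, giving one explicit shared "scene".
    return np.mean([np.roll(x0_views[v], -v) for v in range(num_views)], axis=0)

def sample(cond, num_views=4, num_steps=50, seed=0):
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, num_steps)   # linear noise schedule
    alpha_bar = np.cumprod(1.0 - betas)
    x_t = rng.standard_normal((num_views, cond.shape[0]))
    scene = None
    for t in range(num_steps - 1, -1, -1):
        ab_t = alpha_bar[t]
        ab_prev = alpha_bar[t - 1] if t > 0 else 1.0
        # 1) 2D prior: per-view estimate of the clean images.
        x0_2d = toy_denoiser_2d(x_t, cond, num_views)
        # 2) 3D model: fuse the views into one explicit representation.
        scene = toy_reconstruct_3d(x0_2d, num_views)
        # 3) Re-render: these views are consistent by construction.
        x0_3d = np.stack([render(scene, v) for v in range(num_views)])
        # 4) DDIM-style step that uses the rendered views as the x0 estimate.
        eps = (x_t - np.sqrt(ab_t) * x0_3d) / np.sqrt(1.0 - ab_t)
        x_t = np.sqrt(ab_prev) * x0_3d + np.sqrt(1.0 - ab_prev) * eps
    return scene, x_t

cond = np.linspace(-1.0, 1.0, 8)   # toy "input image"
scene, views = sample(cond)
```

Because the final step sets the noise weight to zero, the returned views are exactly the renderings of the final scene, so undoing each view's shift recovers the same vector. In a real system the consistency would come from rendering a learned 3D representation rather than from this averaging shortcut.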