We introduce NovelGS, a diffusion model for Gaussian Splatting (GS) given sparse-view images. Recent works leverage feed-forward networks to generate pixel-aligned Gaussians, which can be rendered quickly. Unfortunately, these methods cannot produce satisfactory results for regions not covered by the input images, a limitation inherent to their formulation. In contrast, we generate 3D Gaussians by denoising novel views with a transformer-based network. Specifically, given both conditional views and noisy target views, the network predicts pixel-aligned Gaussians for each view. During training, the Gaussians are supervised by rendering the target views together with additional novel views. During inference, the target views are iteratively rendered and denoised starting from pure noise. Our approach achieves state-of-the-art performance on multi-view image reconstruction. By generatively modeling unseen regions, NovelGS effectively reconstructs 3D objects with consistent and sharp textures. Experimental results on publicly available datasets show that NovelGS substantially surpasses existing image-to-3D frameworks, both qualitatively and quantitatively. We also demonstrate the potential of NovelGS in generative tasks, such as text-to-3D and image-to-3D, by integrating it with existing multi-view diffusion models. We will make the code publicly available.
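The abstract describes the inference procedure only at a high level. The following PyTorch-style sketch illustrates one plausible reading of it: target views start as pure noise, a transformer conditioned on the input views predicts pixel-aligned Gaussians, and the targets are re-rendered from those Gaussians and re-noised at each step. The function signatures `model(...)` and `render(...)`, the noise schedule, and all tensor shapes are assumptions for illustration, not the authors' implementation.

```python
import torch


def novelgs_inference(model, render, cond_views, cond_poses, target_poses,
                      num_steps=50, image_size=256):
    """Hypothetical sketch of NovelGS-style inference (not the official code).

    Assumed interfaces:
      model(cond_views, cond_poses, noisy_targets, target_poses, t)
          -> per-view pixel-aligned 3D Gaussians
      render(gaussians, poses) -> images rendered at the given poses
    """
    B = cond_views.shape[0]
    V = target_poses.shape[1]                      # number of target views
    device = cond_views.device

    # Target views are initialized as pure Gaussian noise.
    noisy_targets = torch.randn(B, V, 3, image_size, image_size, device=device)

    # Simple illustrative alpha-bar schedule: noise level decreases over steps.
    alphas_bar = torch.linspace(1e-3, 1.0, num_steps, device=device)

    gaussians, clean_targets = None, None
    for i in range(num_steps):
        t = torch.full((B,), num_steps - 1 - i, device=device)

        # Predict pixel-aligned Gaussians from conditional + noisy target views.
        gaussians = model(cond_views, cond_poses, noisy_targets, target_poses, t)

        # Render the current Gaussians at the target poses (clean estimate).
        clean_targets = render(gaussians, target_poses)

        if i < num_steps - 1:
            # Re-noise the rendered targets to the next (lower) noise level.
            ab = alphas_bar[i + 1]
            noise = torch.randn_like(clean_targets)
            noisy_targets = ab.sqrt() * clean_targets + (1.0 - ab).sqrt() * noise

    # The final Gaussians are the reconstructed 3D object; clean_targets are
    # the denoised target-view renderings.
    return gaussians, clean_targets
```

In this reading, each denoising step is anchored to an explicit 3D representation: rather than denoising images directly, the network's Gaussian prediction is rendered and re-noised, which is one way the procedure described in the abstract could keep the generated novel views multi-view consistent.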