This paper addresses a challenging question: How can we efficiently create high-quality, wide-scope 3D scenes from a single arbitrary image? Existing methods face several constraints, such as the need for multi-view data, time-consuming per-scene optimization, low visual quality in backgrounds, and distorted reconstructions in unseen regions. We propose a novel pipeline to overcome these limitations. Specifically, we introduce a large-scale reconstruction model that uses latents from a video diffusion model to predict 3D Gaussian Splatting representations of scenes in a feed-forward manner. The video diffusion model is designed to create videos that precisely follow specified camera trajectories, allowing it to generate compressed video latents that contain multi-view information while maintaining 3D consistency. We train the 3D reconstruction model to operate on the video latent space with a progressive training strategy, enabling the efficient generation of high-quality, wide-scope, and generic 3D scenes. Extensive evaluations across various datasets demonstrate that our model significantly outperforms existing methods for single-view 3D scene generation, particularly with out-of-domain images. For the first time, we demonstrate that a 3D reconstruction model can be effectively built upon the latent space of a diffusion model to realize efficient 3D scene generation.
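To make the feed-forward design described above concrete, the following is a minimal, hypothetical PyTorch sketch of the core idea: a camera-conditioned video diffusion model produces compressed video latents, and a feed-forward head maps those latents directly to 3D Gaussian parameters. This is not the authors' implementation; the module name `LatentToGaussians`, the latent dimensionality, and the per-token Gaussian count are all illustrative assumptions.

```python
# Hypothetical sketch (not the authors' code): decode video-diffusion latents
# into per-token 3D Gaussian parameters in a single feed-forward pass.
import torch
import torch.nn as nn


class LatentToGaussians(nn.Module):
    """Feed-forward head mapping video latents to 3D Gaussian parameters
    (3 position + 3 scale + 4 rotation quaternion + 1 opacity + 3 color = 14)."""

    def __init__(self, latent_dim: int = 16, hidden_dim: int = 256,
                 gaussians_per_token: int = 2):
        super().__init__()
        self.params_per_gaussian = 14
        self.gaussians_per_token = gaussians_per_token
        self.backbone = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, gaussians_per_token * self.params_per_gaussian),
        )

    def forward(self, video_latents: torch.Tensor) -> torch.Tensor:
        # video_latents: (batch, frames, height, width, latent_dim), i.e. the
        # compressed multi-view latents produced by the video diffusion model.
        b, f, h, w, c = video_latents.shape
        out = self.backbone(video_latents.reshape(b, -1, c))
        # Each latent token yields a small set of Gaussians covering the scene.
        return out.reshape(b, f * h * w * self.gaussians_per_token,
                           self.params_per_gaussian)


if __name__ == "__main__":
    # Toy example: latents for 8 frames on a 32x32 latent grid.
    latents = torch.randn(1, 8, 32, 32, 16)
    gaussians = LatentToGaussians()(latents)
    print(gaussians.shape)  # torch.Size([1, 16384, 14])
```

In this sketch the reconstruction happens entirely in the diffusion model's latent space, which is the property the abstract highlights: no per-scene optimization and no decoding back to pixel space before predicting the Gaussians.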