3D Gaussian Splatting has recently gained traction for its efficient training and real-time rendering. While the vanilla Gaussian Splatting representation is designed mainly for view synthesis, more recent works have investigated how to extend it with scene understanding and language features. However, existing methods lack a detailed comprehension of scenes, which limits their ability to segment and interpret complex structures. To this end, we introduce SuperGSeg, a novel approach that fosters cohesive, context-aware scene representation by disentangling segmentation and language field distillation. SuperGSeg first employs neural Gaussians to learn instance and hierarchical segmentation features from multi-view images with the aid of off-the-shelf 2D masks. These features are then leveraged to create a sparse set of what we call Super-Gaussians, which facilitate the distillation of 2D language features into 3D space. Through Super-Gaussians, our method enables high-dimensional language feature rendering without an extreme increase in GPU memory. Extensive experiments demonstrate that SuperGSeg outperforms prior works on both open-vocabulary object localization and semantic segmentation tasks.
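As a rough illustration of why attaching language features to a sparse set of Super-Gaussians can limit memory growth, consider the back-of-the-envelope sketch below. It is not the SuperGSeg implementation: it assumes a hard one-to-one assignment of each Gaussian to a single Super-Gaussian, 32-bit storage, and hypothetical sizes (one million Gaussians, one thousand Super-Gaussians, 512-dimensional features). The names `feature_memory_bytes` and `gather_features` are made up for this example.

```python
def feature_memory_bytes(num_items, feature_dim, bytes_per_value=4):
    """Bytes needed to store one feature vector per item (float32 by default)."""
    return num_items * feature_dim * bytes_per_value


def gather_features(assignment, super_features):
    """Look up a feature for every Gaussian from its assigned Super-Gaussian.

    assignment[i] is the index of the Super-Gaussian that Gaussian i belongs to
    (hard assignment is an assumption of this sketch).
    """
    return [super_features[k] for k in assignment]


# Hypothetical sizes, chosen only for illustration.
NUM_GAUSSIANS = 1_000_000
NUM_SUPER_GAUSSIANS = 1_000
FEATURE_DIM = 512

# Option A: every Gaussian stores its own high-dimensional language feature.
dense_bytes = feature_memory_bytes(NUM_GAUSSIANS, FEATURE_DIM)

# Option B: only Super-Gaussians store features; each Gaussian keeps
# a single 4-byte index pointing at its Super-Gaussian.
sparse_bytes = (
    feature_memory_bytes(NUM_SUPER_GAUSSIANS, FEATURE_DIM)
    + NUM_GAUSSIANS * 4
)

if __name__ == "__main__":
    print(f"per-Gaussian features:       {dense_bytes / 1e6:.1f} MB")
    print(f"per-Super-Gaussian features: {sparse_bytes / 1e6:.1f} MB")

    # Tiny lookup example: three Gaussians sharing two Super-Gaussian features.
    toy_features = [[0.0, 1.0], [2.0, 3.0]]
    print(gather_features([1, 0, 1], toy_features))
```

Under these assumed sizes, the shared-feature layout stores far fewer values because the feature dimension multiplies the number of Super-Gaussians instead of the number of Gaussians. The actual savings depend on how many Super-Gaussians a scene needs and on how the assignment is represented.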