Humans live in a 3D world and commonly use natural language to interact with 3D scenes. Modeling a 3D language field that supports open-ended language queries in 3D has gained increasing attention recently. This paper introduces LangSplat, which constructs a 3D language field that enables precise and efficient open-vocabulary querying within 3D spaces. Unlike existing methods that ground CLIP language embeddings in a NeRF model, LangSplat represents the language field with a collection of 3D Gaussians, each encoding language features distilled from CLIP. By rendering language features with a tile-based splatting technique, we circumvent the costly rendering process inherent in NeRF. Instead of directly learning CLIP embeddings, LangSplat first trains a scene-wise language autoencoder and then learns language features in the scene-specific latent space, thereby alleviating the substantial memory demands imposed by explicit modeling. Existing methods suffer from imprecise and vague 3D language fields that fail to discern clear boundaries between objects. We examine this issue and propose learning hierarchical semantics using SAM, which eliminates the need both to query the language field extensively across multiple scales and to regularize with DINO features. Extensive experiments on open-vocabulary 3D object localization and semantic segmentation demonstrate that LangSplat outperforms the previous state-of-the-art method, LERF, by a large margin. Notably, LangSplat is extremely efficient, achieving a {\speed} × speedup over LERF at a resolution of 1440 × 1080.
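The memory argument behind the scene-wise language autoencoder can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the 512-dimensional feature size, the 3-dimensional latent space, the synthetic features standing in for CLIP embeddings, and the use of a linear (PCA) autoencoder in place of a learned one. The function names are hypothetical. The idea shown is that each Gaussian stores only a short latent code, which is decoded back to the full feature space before being compared with a query embedding by cosine similarity.

```python
import numpy as np


def fit_linear_autoencoder(features, latent_dim):
    """Fit a linear encoder/decoder pair (PCA) on one scene's features.

    Stand-in for a learned scene-wise autoencoder; not the paper's model.
    """
    mean = features.mean(axis=0)
    # Rows of vt are the principal directions of the centered features.
    _, _, vt = np.linalg.svd(features - mean, full_matrices=False)
    basis = vt[:latent_dim]  # shape: (latent_dim, feature_dim)
    return mean, basis


def encode(features, mean, basis):
    """Project full features onto the low-dimensional latent space."""
    return (features - mean) @ basis.T


def decode(latents, mean, basis):
    """Map latent codes back to the full feature space."""
    return latents @ basis + mean


def cosine_relevancy(decoded, query):
    """Cosine similarity between each decoded feature and one query vector."""
    decoded = decoded / np.linalg.norm(decoded, axis=-1, keepdims=True)
    query = query / np.linalg.norm(query)
    return decoded @ query


# Synthetic stand-in for per-scene language features (assumed sizes):
# 200 vectors of dimension 512 that lie in a 3-dimensional subspace.
rng = np.random.default_rng(0)
feature_dim, latent_dim, num_features = 512, 3, 200
subspace = rng.normal(size=(latent_dim, feature_dim))
scene_features = rng.normal(size=(num_features, latent_dim)) @ subspace

mean, basis = fit_linear_autoencoder(scene_features, latent_dim)
latents = encode(scene_features, mean, basis)   # what each Gaussian would store
reconstructed = decode(latents, mean, basis)    # recovered at query time

# Use one of the scene features as a stand-in for a text query embedding.
query = scene_features[17]
scores = cosine_relevancy(reconstructed, query)
best_match = int(np.argmax(scores))
```

Under these assumed sizes, each element stores 3 values instead of 512. The synthetic features are constructed to lie exactly in a 3-dimensional subspace, so the linear reconstruction is lossless here; real embeddings would not compress this cleanly, which is one reason a learned, scene-specific autoencoder is the more natural choice.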