Tracking the 6DoF pose of unknown objects in monocular RGB video sequences is crucial for robotic manipulation. However, existing approaches typically rely on accurate depth information, which is non-trivial to obtain in real-world scenarios. Although depth estimation algorithms can be employed, their geometric inaccuracies can cause RGBD-based pose tracking methods to fail. To address this challenge, we introduce GSGTrack, a novel RGB-based pose tracking framework that jointly optimizes geometry and pose. Specifically, we adopt 3D Gaussian Splatting to build an optimizable 3D representation, which is learned jointly with a graph-based geometry optimization to capture the object's appearance and refine its geometry. However, the joint optimization is susceptible to perturbations from noisy poses and geometry. We therefore propose an object silhouette loss to address the oversensitivity of pixel-wise losses to pose noise during tracking. To mitigate the geometric ambiguities caused by inaccurate depth, we further propose a geometry-consistent image pair selection strategy that filters out low-confidence pairs and ensures robust geometric optimization. Extensive experiments on the OnePose and HO3D datasets demonstrate the effectiveness of GSGTrack in both 6DoF pose tracking and object reconstruction, substantially improving robustness and accuracy and providing an effective solution for RGB-based object pose tracking.
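The abstract does not specify the exact form of the object silhouette loss. As a hedged illustration only, the sketch below assumes a soft-IoU formulation between the opacity map accumulated by the Gaussian renderer and the object's segmentation mask; the function name, tensor shapes, and `eps` constant are all hypothetical, not the paper's definitions.

```python
import torch

def silhouette_loss(rendered_alpha: torch.Tensor,
                    object_mask: torch.Tensor,
                    eps: float = 1e-6) -> torch.Tensor:
    """Soft-IoU silhouette loss (illustrative sketch, not the paper's exact loss).

    Unlike a pixel-wise photometric loss, an overlap-based term only penalizes
    disagreement in the object's outline, so small pose perturbations do not
    dominate the gradient.

    rendered_alpha: (H, W) accumulated opacity from the renderer, in [0, 1].
    object_mask:    (H, W) binary segmentation mask of the tracked object.
    """
    intersection = (rendered_alpha * object_mask).sum()
    union = (rendered_alpha + object_mask - rendered_alpha * object_mask).sum()
    return 1.0 - intersection / (union + eps)
```

A soft (differentiable) overlap measure is used here rather than a hard IoU so the loss can propagate gradients back to both the Gaussian parameters and the pose estimate.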
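The geometry-consistent image pair selection strategy is likewise only named, not specified. One plausible instantiation, sketched below under assumptions of our own, scores each candidate frame pair by how well the current depth and pose estimates agree under mutual reprojection, and drops pairs below a confidence threshold. All function names, the relative-error test, and the thresholds (`rel_thresh`, `min_ratio`) are illustrative.

```python
import itertools
import numpy as np

def backproject(depth, K):
    """Lift a depth map (H, W) to camera-space 3D points (H*W, 3)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    return (pix @ np.linalg.inv(K).T) * depth.reshape(-1, 1)

def pair_inlier_ratio(depth_i, depth_j, T_i, T_j, K, rel_thresh=0.05):
    """Fraction of pixels in frame i whose depth, reprojected into frame j
    under the current camera-to-world pose estimates, matches depth map j."""
    H, W = depth_j.shape
    pts_i = backproject(depth_i, K)               # points in camera i
    T_ji = np.linalg.inv(T_j) @ T_i               # camera i -> camera j
    pts_j = pts_i @ T_ji[:3, :3].T + T_ji[:3, 3]  # points in camera j
    z = pts_j[:, 2]
    z_safe = np.clip(z, 1e-6, None)               # avoid division by zero
    uv = (pts_j @ K.T)[:, :2] / z_safe[:, None]
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    valid = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    if valid.sum() == 0:
        return 0.0
    observed = depth_j[v[valid], u[valid]]        # pixels with observed == 0
    consistent = np.abs(observed - z[valid]) < rel_thresh * observed
    return consistent.mean()                      # (missing depth) count as outliers

def select_consistent_pairs(depths, poses, K, min_ratio=0.6):
    """Keep only frame pairs whose mutual reprojection agreement exceeds
    `min_ratio`; low-confidence pairs are filtered out."""
    pairs = []
    for i, j in itertools.combinations(range(len(depths)), 2):
        r = min(pair_inlier_ratio(depths[i], depths[j], poses[i], poses[j], K),
                pair_inlier_ratio(depths[j], depths[i], poses[j], poses[i], K))
        if r >= min_ratio:
            pairs.append((i, j))
    return pairs
```

Taking the minimum of the two reprojection directions makes the check symmetric, so a pair is kept only when both frames corroborate each other's geometry.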