This paper introduces VisionPAD, a novel self-supervised pre-training paradigm designed for vision-centric algorithms in autonomous driving. In contrast to previous approaches that employ neural rendering with explicit depth supervision, VisionPAD utilizes more efficient 3D Gaussian Splatting (3DGS) to reconstruct multi-view representations using only images as supervision. Specifically, we introduce a self-supervised method for voxel velocity estimation. By warping voxels to adjacent frames and supervising the rendered outputs, the model effectively learns motion cues in the sequential data. Furthermore, we adopt a multi-frame photometric consistency approach to enhance geometric perception. This approach projects adjacent frames to the current frame based on rendered depths and relative poses, boosting the 3D geometric representation through pure image supervision. Extensive experiments on autonomous driving datasets demonstrate that VisionPAD significantly improves performance in 3D object detection, occupancy prediction, and map segmentation, surpassing state-of-the-art pre-training strategies by a considerable margin.
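The voxel warping step can be illustrated with a minimal sketch. It assumes a constant-velocity motion model over the interval between frames, which the abstract does not state, and it omits ego-motion compensation and the rendering step entirely. The function name `warp_voxel_centers` and its arguments are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def warp_voxel_centers(
    centers: List[Vec3], velocities: List[Vec3], dt: float
) -> List[Vec3]:
    """Move each voxel center along its predicted velocity for dt seconds.

    Assumption: constant velocity over dt (not stated in the abstract).
    A negative dt warps toward the previous frame.
    """
    if len(centers) != len(velocities):
        raise ValueError("centers and velocities must have the same length")
    warped = []
    for center, velocity in zip(centers, velocities):
        # p' = p + v * dt, applied per axis
        warped.append(tuple(c + v * dt for c, v in zip(center, velocity)))
    return warped


if __name__ == "__main__":
    centers = [(0.0, 0.0, 0.0), (1.0, 2.0, 3.0)]
    velocities = [(2.0, 0.0, 0.0), (0.0, -1.0, 0.5)]
    print(warp_voxel_centers(centers, velocities, dt=0.5))
```

In a pre-training setup of the kind described, the warped voxels would then be rendered into the adjacent frame's views and compared against that frame's images, so that the velocity prediction receives a gradient without any motion labels.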
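The multi-frame photometric consistency idea can be sketched as follows, under several assumptions that go beyond the abstract: a pinhole camera with intrinsics `(fx, fy, cx, cy)`, a relative pose given as a rotation matrix and translation vector mapping current-camera coordinates to adjacent-camera coordinates, grayscale images stored as nested lists, nearest-neighbor sampling, and a plain L1 error. The names `reproject_pixel` and `photometric_l1` are hypothetical. A real training pipeline would need differentiable (for example bilinear) sampling, and the abstract does not specify which loss is used.

```python
from typing import List, Optional, Sequence, Tuple

Image = List[List[float]]


def reproject_pixel(
    u: float, v: float, depth: float,
    intrinsics: Tuple[float, float, float, float],
    rotation: Sequence[Sequence[float]],
    translation: Sequence[float],
) -> Optional[Tuple[float, float]]:
    """Map a current-frame pixel with known depth into the adjacent frame."""
    fx, fy, cx, cy = intrinsics
    # Back-project the pixel to a 3D point in the current camera frame.
    point = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)
    # Apply the relative pose (current camera -> adjacent camera).
    moved = [
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]
    if moved[2] <= 0:  # point is behind the adjacent camera
        return None
    # Project into the adjacent image plane.
    return (fx * moved[0] / moved[2] + cx, fy * moved[1] / moved[2] + cy)


def photometric_l1(
    current: Image, adjacent: Image, depth_map: Image,
    intrinsics: Tuple[float, float, float, float],
    rotation: Sequence[Sequence[float]],
    translation: Sequence[float],
) -> float:
    """Mean absolute intensity difference over pixels that reproject in-bounds."""
    total, count = 0.0, 0
    for v, row in enumerate(current):
        for u, intensity in enumerate(row):
            uv = reproject_pixel(u, v, depth_map[v][u], intrinsics,
                                 rotation, translation)
            if uv is None:
                continue
            ua, va = round(uv[0]), round(uv[1])  # nearest-neighbor sample
            if 0 <= va < len(adjacent) and 0 <= ua < len(adjacent[0]):
                total += abs(intensity - adjacent[va][ua])
                count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    image = [[0.1, 0.5], [0.9, 0.3]]
    depth = [[1.0, 1.0], [1.0, 1.0]]
    # Identical frames with an identity pose give zero photometric error.
    print(photometric_l1(image, image, depth, (1.0, 1.0, 0.0, 0.0),
                         identity, [0.0, 0.0, 0.0]))
```

The error is small only when the depth and the relative pose together send each pixel to the matching location in the adjacent frame, which is why this kind of objective can supervise geometry from images alone.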