3D Gaussian Splatting (3DGS) is attracting growing attention in both academia and industry owing to its superior visual quality and rendering speed. However, training a 3DGS model remains time-consuming, especially in load-imbalanced scenarios, where uneven workloads across pixels and Gaussian spheres degrade the performance of the renderCUDA kernel. We introduce Balanced 3DGS, a Gaussian-wise parallel rendering approach with fine-grained tiling for the 3DGS training process that addresses these load-imbalance issues. First, we introduce an inter-block dynamic workload distribution technique that dynamically maps workloads to Streaming Multiprocessor (SM) resources within a single GPU, which forms the foundation of load balancing. Second, we propose, to our knowledge for the first time, a Gaussian-wise parallel rendering technique that significantly reduces workload divergence inside a warp, a critical component in addressing load imbalance. Building on these two methods, we further propose a fine-grained combined load balancing technique that distributes workload uniformly across all SMs and speeds up the forward renderCUDA kernel by up to 7.52x. In addition, we present a self-adaptive render kernel selection strategy that chooses a kernel during 3DGS training according to the current load-balance situation, which improves training efficiency.
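To make the idea of inter-block dynamic workload distribution more concrete, the following toy simulation contrasts a fixed tile-to-SM assignment with a scheme in which whichever SM becomes idle first takes the next tile. This is only a sketch under stated assumptions: the abstract does not describe the actual mechanism, the function names and per-tile costs are hypothetical, and the model is written in Python rather than as a CUDA kernel, so it captures scheduling behavior only.

```python
import heapq


def static_makespan(tile_costs, num_sms):
    """Fixed mapping: tile i is bound to SM (i % num_sms) before execution."""
    loads = [0] * num_sms
    for i, cost in enumerate(tile_costs):
        loads[i % num_sms] += cost
    # The kernel finishes only when the most heavily loaded SM finishes.
    return max(loads)


def dynamic_makespan(tile_costs, num_sms):
    """Dynamic mapping: the SM that becomes idle first takes the next tile."""
    # Min-heap holding the time at which each SM becomes free.
    free_at = [0] * num_sms
    heapq.heapify(free_at)
    for cost in tile_costs:
        earliest = heapq.heappop(free_at)
        heapq.heappush(free_at, earliest + cost)
    return max(free_at)


# Hypothetical per-tile costs (for example, the number of Gaussians
# overlapping each tile): a few heavy tiles among many light ones.
tile_costs = [100, 1, 1, 1, 100, 1, 1, 1, 100, 1, 1, 1]
num_sms = 4

print("static :", static_makespan(tile_costs, num_sms))
print("dynamic:", dynamic_makespan(tile_costs, num_sms))
```

With these illustrative costs, the fixed mapping places all three heavy tiles on the same SM and finishes at time 300, while the dynamic mapping finishes at time 102. The figures describe this toy model only and say nothing about the measured speedups reported for Balanced 3DGS.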