We introduce an approach that creates animatable human avatars from monocular videos using 3D Gaussian Splatting (3DGS). Existing methods based on neural radiance fields (NeRFs) achieve high-quality novel-view/novel-pose image synthesis but often require days of training and are extremely slow at inference time. Recently, the community has explored fast grid structures for efficient training of clothed avatars. Although extremely fast to train, these methods can barely achieve interactive rendering frame rates of around 15 FPS. In this paper, we use 3D Gaussian Splatting and learn a non-rigid deformation network to reconstruct animatable clothed human avatars that can be trained within 30 minutes and rendered at real-time frame rates (50+ FPS). Given the explicit nature of our representation, we further introduce as-isometric-as-possible regularizations on both the Gaussian mean vectors and the covariance matrices, enhancing the generalization of our model to highly articulated unseen poses. Experimental results show that our method achieves performance comparable to, and in some cases better than, state-of-the-art approaches for animatable avatar creation from monocular input, while being 400x faster in training and 250x faster in inference.
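To make the as-isometric-as-possible idea concrete, below is a minimal PyTorch sketch of one plausible form of such a regularization on the Gaussian mean vectors: it penalizes changes in distances between each Gaussian and its nearest neighbors when mapping from canonical to deformed space. The helper name `aiap_loss`, the neighbor count, and the L1 penalty are illustrative assumptions, not the paper's exact formulation, and the analogous term on covariance matrices is omitted.

```python
import torch


def aiap_loss(x_canonical: torch.Tensor, x_deformed: torch.Tensor, k: int = 5) -> torch.Tensor:
    """As-isometric-as-possible regularization (hypothetical sketch).

    Penalizes deviations in pairwise distances between each Gaussian mean
    and its k nearest neighbors under the non-rigid deformation.

    Args:
        x_canonical: (N, 3) Gaussian means in canonical space.
        x_deformed:  (N, 3) corresponding means after deformation.
        k: number of nearest neighbors (assumed value, not from the paper).
    """
    # k-nearest neighbors in canonical space; index 0 is the point itself.
    pairwise = torch.cdist(x_canonical, x_canonical)               # (N, N)
    knn_idx = pairwise.topk(k + 1, largest=False).indices[:, 1:]   # (N, k)

    # Distances to the same neighbors before and after deformation.
    d_can = torch.linalg.norm(x_canonical.unsqueeze(1) - x_canonical[knn_idx], dim=-1)
    d_def = torch.linalg.norm(x_deformed.unsqueeze(1) - x_deformed[knn_idx], dim=-1)

    # Isometry means these distances match; penalize the discrepancy.
    return torch.abs(d_can - d_def).mean()
```

In a training loop, this term would be added to the photometric loss with a weighting coefficient, nudging the deformation network toward locally rigid, distance-preserving motion that generalizes better to unseen poses.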