- Authors: Jun Cen, Yizheng Wu, Kewei Wang, Xingyi Li, Jingkang Yang, Yixuan Pei, Lingdong Kong
- Institutes: Visual Intelligence Lab@HKUST, HUST, MMLab@NTU, Smiles Lab@XJTU, NUS
Welcome to the Segment Any RGBD GitHub repository!
New! We have released a technical report! [arxiv]
Segment Any RGBD is a toolbox to segment rendered depth images based on SAM! Don't forget to star this repo if you find it interesting!
Input to SAM (RGB or Rendered Depth Image) | SAM Masks with Class and Semantic Masks | 3D Visualization for SAM Masks with Class and Semantic Masks |
---|---|---|
We find that humans can naturally identify objects from the visualization of a depth map, so we first map the depth map ([H, W]) to RGB space ([H, W, 3]) with a colormap function and then feed the rendered depth image into SAM. Compared with the RGB image, the rendered depth image discards texture information and focuses on geometry information. SAM-based projects such as SSA, Anything-3D, and SAM 3D all feed RGB images to SAM. We are the first to use SAM to extract geometry information directly. The following figures show that depth maps rendered with different colormap functions yield different SAM results.
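The depth-to-RGB rendering step described above can be sketched as follows. This is a minimal illustration, not the code used in this repo: the function name `render_depth`, the blue-to-red color ramp, and the choice to treat zero or non-finite depth as missing are all assumptions made for the example.

```python
import numpy as np

# Anchor points of a simple blue -> cyan -> yellow -> red ramp.
# Illustrative only; the repo's actual colormap functions may differ.
ANCHOR_POS = np.array([0.0, 1.0 / 3.0, 2.0 / 3.0, 1.0])
ANCHOR_RGB = np.array(
    [[0, 0, 255], [0, 255, 255], [255, 255, 0], [255, 0, 0]],
    dtype=np.float64,
)


def render_depth(depth):
    """Map a [H, W] depth map to a [H, W, 3] uint8 image."""
    depth = np.asarray(depth, dtype=np.float64)
    # Assumption: zero or non-finite values mean "no depth measured".
    valid = np.isfinite(depth) & (depth > 0)
    rendered = np.zeros(depth.shape + (3,), dtype=np.uint8)
    if not valid.any():
        return rendered

    # Normalize valid depths to [0, 1].
    d_min = depth[valid].min()
    d_max = depth[valid].max()
    scale = d_max - d_min if d_max > d_min else 1.0
    norm = (depth[valid] - d_min) / scale

    # Interpolate each color channel along the ramp.
    colors = np.stack(
        [np.interp(norm, ANCHOR_POS, ANCHOR_RGB[:, c]) for c in range(3)],
        axis=-1,
    )
    rendered[valid] = colors.round().astype(np.uint8)
    return rendered
```

The resulting [H, W, 3] array has the same layout as an ordinary RGB image, which is why it can be passed to SAM unchanged. Swapping the anchor colors changes the rendering and, as noted above, the SAM results.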
In this repo, we provide two modes: feeding either the RGB images or the rendered depth images to SAM. In each mode, the user obtains the semantic masks (one color per class) and the SAM masks with class labels. The overall structure is shown in the following figure. We use OVSeg for zero-shot semantic segmentation.
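One plausible way to attach a class to each SAM mask is a majority vote over the semantic map inside the mask. The sketch below assumes that rule; the text above does not state how the repo combines the two outputs, and the function name `label_masks` and the `ignore_label` parameter are hypothetical.

```python
import numpy as np


def label_masks(masks, semantic_map, ignore_label=-1):
    """Return the majority class id of `semantic_map` inside each binary mask.

    masks: iterable of [H, W] boolean (or 0/1) arrays, e.g. one per SAM mask.
    semantic_map: [H, W] integer array of per-pixel class ids.
    """
    labels = []
    for mask in masks:
        classes = semantic_map[np.asarray(mask, dtype=bool)]
        classes = classes[classes != ignore_label]
        if classes.size == 0:
            # Empty mask, or mask covering only ignored pixels.
            labels.append(ignore_label)
            continue
        ids, counts = np.unique(classes, return_counts=True)
        labels.append(int(ids[np.argmax(counts)]))
    return labels
```

Under this rule every SAM mask receives a single class, so a region that the semantic map splits into several classes is resolved in favor of the class covering the most pixels.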
- RGB images mainly represent texture information and depth images contain geometry information, so the RGB images are more colorful than the rendered depth images. As a result, SAM produces many more masks for RGB inputs than for depth inputs, as shown in the following figure.
- The rendered depth image alleviates SAM's over-segmentation. For example, the table is segmented into four parts on the RGB image, and one of them is classified as a chair in the semantic results (yellow circles in the following figure). In contrast, the table is treated as a single object on the depth image and is correctly classified. Part of a person's head is classified as wall on the RGB image (blue circles in the following figure), but it is correctly classified on the depth image.
- Two objects that are very close to each other may be segmented as one object on the depth image, such as the chair in the red circle. In this case, the texture information in the RGB image is essential for separating the objects.
Please see installation guide.
We provide the UI (ui.py) and example inputs (/UI/) to reproduce the above demos. We use the OVSeg checkpoint ovseg_swinbase_vitL14_ft_mpt.pth for zero-shot semantic segmentation and the SAM checkpoint sam_vit_h_4b8939.pth. Put both under this repo, then try the UI on your own computer:
python ui.py
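Before running the command above, it can help to confirm that both checkpoints are in place. The helper below is a hypothetical convenience script, not part of this repo, and it assumes the checkpoint files sit directly in the repo root.

```python
from pathlib import Path

# Checkpoint file names as given in this README.
REQUIRED_CHECKPOINTS = (
    "ovseg_swinbase_vitL14_ft_mpt.pth",
    "sam_vit_h_4b8939.pth",
)


def missing_checkpoints(repo_root="."):
    """Return the names of required checkpoints not found in `repo_root`."""
    root = Path(repo_root)
    return [name for name in REQUIRED_CHECKPOINTS if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing checkpoints:", ", ".join(missing))
    else:
        print("All checkpoints found; run `python ui.py`.")
```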
Click one of the Examples at the bottom and the example inputs will be filled in automatically. Then click 'Send' to generate and visualize the results. Inference takes around 2 minutes for ScanNet and around 3 minutes for SAIL-VOS 3D.
Please download SAIL-VOS 3D and ScanNet to try more demos.
This repo is developed based on OVSeg, which is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
However, portions of the project are under separate license terms: CLIP and ZSSEG are licensed under the MIT license; MaskFormer is licensed under CC-BY-NC; openclip is licensed under the license at its repo; SAM is licensed under the Apache License.