An Embodied Generalist Agent in 3D World

ICML 2024

   

We introduce LEO, an embodied multi-modal generalist agent capable of grounding, reasoning, chatting, planning, and acting in the 3D world. LEO is trained in a two-stage scheme: (i) 3D vision-language (VL) alignment and (ii) 3D vision-language-action (VLA) instruction tuning.

We collect extensive and diverse data for training LEO; several tasks include data that we generated ourselves. See Task and Data for details. The data statistics are shown below:

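| Dataset | Task | 2D required? | 3D assets | #data |
| --- | --- | --- | --- | --- |
| LEO-align | object captioning | | Objaverse | 660k |
| LEO-align | object referring | | ScanNet + 3RScan | 354k |
| LEO-align | scene captioning | | 3RScan | 20k |
| LEO-instruct | 3D captioning | | ScanNet | 37k |
| LEO-instruct | 3D QA | | ScanNet + 3RScan | 83k |
| LEO-instruct | 3D dialogue | | 3RScan | 11k |
| LEO-instruct | task planning | | 3RScan | 14k |
| LEO-instruct | navigation | | MP3D | 60k |
| LEO-instruct | manipulation | | CLIPort | 300k |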
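
As a quick cross-check of the statistics above, the per-task counts can be summed per dataset. This is only a small sketch: the numbers are copied from the table (rounded to thousands), and the variable names are illustrative.

```python
# Dataset sizes in thousands of samples, copied from the statistics table.
LEO_ALIGN = {"object captioning": 660, "object referring": 354, "scene captioning": 20}
LEO_INSTRUCT = {
    "3D captioning": 37, "3D QA": 83, "3D dialogue": 11,
    "task planning": 14, "navigation": 60, "manipulation": 300,
}


def total_k(sizes):
    """Sum per-task sizes (in thousands of samples)."""
    return sum(sizes.values())


align_total = total_k(LEO_ALIGN)
instruct_total = total_k(LEO_INSTRUCT)
print(f"LEO-align: ~{align_total}k, LEO-instruct: ~{instruct_total}k")
```

Because the table entries are rounded, the totals are approximate.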

News

[2024.07] We release a few EAI data examples for demonstration purposes.

[2024.05] LEO is accepted by ICML 2024.

[2024.04] We release the scripts for inference and scaling law analysis, model weights, and training code of EAI tasks.

[2024.03] We release the code and data. The embodied AI (EAI) tasks (navigation and manipulation) need further organization and will be released soon.

[2024.01] We release a Huggingface interactive demo. Chat with LEO and enjoy yourself.

Get Started

  1. Clone the GitHub repo.
git clone git@github.com:embodied-generalist/embodied-generalist.git
cd embodied-generalist
  2. Create a conda environment and install dependencies.
conda create -n leo python=3.9
conda activate leo

# install PyTorch, take our version for example
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch

# install other dependencies with pip
pip install -r requirements.txt

# install peft separately to escape its install_requires
pip install peft==0.5.0 --no-deps
  3. Install third-party libraries (for point cloud backbones). If the installation of PointNext fails, you can either 1) comment out the line that imports PointNext in model/pcd_backbone.py, or 2) download the compiled file and place it at model/pointnext/cpp/pointnet2_batch/, which may help.
cd model

# default PointNet++
cd pointnetpp
python setup.py install
cd ..

# optional: PointNext (if you want to substitute the default PointNet++)
cd pointnext/cpp/pointnet2_batch
python setup.py build_ext --inplace
cd ../../../

cd ..
# sanity check
python -c 'from model.pointnetpp.pointnetpp import PointNetPP'
# for PointNext, run 'from model.pointnext.pointnext import PointNext'
  4. Go through Task and Data and Model Weights, and you are ready to run.
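
The import sanity check above can also be scripted so that both backbones are reported in one pass. The following is a minimal sketch, not part of the repo: `check_backbones` is a hypothetical helper, and only the two module paths are taken from the commands above. Run it from the repository root.

```python
import importlib


def check_backbones(module_names):
    """Return {module_name: bool} telling whether each module can be imported."""
    status = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            status[name] = True
        except Exception:  # compiled extensions may raise more than ImportError
            status[name] = False
    return status


if __name__ == "__main__":
    # Module paths taken from the sanity-check commands in the install steps.
    modules = ["model.pointnetpp.pointnetpp", "model.pointnext.pointnext"]
    for name, ok in check_backbones(modules).items():
        print(f"{name}: {'ok' if ok else 'not available'}")
```

A failed PointNext import is only a problem if you plan to replace the default PointNet++.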

Task and Data

Data preparation. The data includes two components: scan data and language annotations.

  • Scan data. To simplify the preparation and save storage, we streamline the scan data (point clouds and instance segments), which is less than 10G yet already sufficient for experiments on LEO. You can download the compressed files from the links below and arrange the data according to the illustration of scan data structure.
# scan data structure

├── ${scannet_base}
    ├── scan_data
    │   └── pcd_with_global_alignment
    │       ├── ${scan_id}.pth
    └── mask
        ├── ${scan_id}.mask.npz

├── ${rscan_base}
    └── 3RScan-ours-align
        ├── ${scan_id}
            ├── pcds.pth
            ├── pcd-align.pth
            └── inst_to_label.pth

├── ${cap3d_root}
    ├── Cap3D_pcs_pt
    │   ├── ${obj_id}.pt
    └── Cap3D_automated_Objaverse_no3Dword.csv   # included in annotations
  • Language annotations. The annotations are categorized into two parts according to the training stage. We provide a compressed file that wraps up all the annotations, which should be organized in the following structure:
# annotations structure

├── ${alignment_base}
    ├── obj_caption -> ${cap3d_root}
    │   ├── Cap3D_pcs_pt
    │   │   ├── ${obj_id}.pt
    │   └── Cap3D_automated_Objaverse_no3Dword.csv
    ├── obj_scene_caption
    │   ├── 3rscan_prompted.json
    │   ├── 3rscan_scanscribe.json
    │   ├── scannet_referit3d_nr3d_train.json
    │   └── scannet_referit3d_sr3d+_train.json
    └── scene_caption
        ├── 3rscan_scenecap_train.json
        └── 3rscan_scenecap_val.json

├── ${instruction_base}
    ├── scan2cap
    │   ├── scanrefer_train.json
    │   ├── scanrefer_val.json
    │   └── scanrefer_corpus.json
    ├── scanqa
    │   ├── ScanQA_v1.0_train.json
    │   └── ScanQA_v1.0_val.json
    ├── sqa3d
    │   ├── v1_balanced_questions_train_scannetv2.json
    │   ├── v1_balanced_questions_val_scannetv2.json
    │   ├── v1_balanced_questions_test_scannetv2.json
    │   ├── v1_balanced_sqa_annotations_train_scannetv2.json
    │   ├── v1_balanced_sqa_annotations_val_scannetv2.json
    │   ├── v1_balanced_sqa_annotations_test_scannetv2.json
    │   └── axisAlignment.pth
    ├── 3rscanqa
    │   ├── 3rscan_qa_train.json
    │   └── 3rscan_qa_val.json
    ├── dialogue
    │   ├── 3rscan_dialog_train.json
    │   └── 3rscan_dialog_val.json
    └── planning
        ├── 3rscan_plan_train.json
        └── 3rscan_plan_val.json
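
Before training, it can help to confirm that the extracted annotations match the structure above. The sketch below assumes nothing about the repo's code: `missing_files` and `EXPECTED_INSTRUCTION_FILES` are hypothetical names, and the list covers only a subset of the files shown in the tree.

```python
from pathlib import Path

# A subset of the files from the annotation tree, relative to ${instruction_base}.
EXPECTED_INSTRUCTION_FILES = [
    "scan2cap/scanrefer_train.json",
    "scanqa/ScanQA_v1.0_train.json",
    "sqa3d/axisAlignment.pth",
    "3rscanqa/3rscan_qa_train.json",
    "dialogue/3rscan_dialog_train.json",
    "planning/3rscan_plan_train.json",
]


def missing_files(base_dir, relative_paths):
    """Return the relative paths that do not exist as files under base_dir."""
    base = Path(base_dir)
    return [rel for rel in relative_paths if not (base / rel).is_file()]
```

For example, `missing_files("/path/to/instruction_base", EXPECTED_INSTRUCTION_FILES)` returns an empty list when every listed file is in place.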

Data configurations. After data preparation, check configs/data/default.yaml to update the paths, including scan_family_base, rscan_base, alignment_base and instruction_base.

Dataloaders. Each task's dataset is implemented in data/datasets.py, where LeoMix aggregates the individual datasets into the training dataset.
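
To illustrate the general idea of aggregating several datasets behind one index space, here is a pure-Python sketch. It is not the repo's LeoMix implementation: `MixedDataset` is a hypothetical stand-in that simply concatenates indexable datasets, and the sample lists are placeholders.

```python
from bisect import bisect_right
from itertools import accumulate


class MixedDataset:
    """Concatenate several indexable datasets into one (hypothetical stand-in)."""

    def __init__(self, datasets):
        self.datasets = list(datasets)
        # Cumulative sizes, e.g. dataset sizes [3, 2] give offsets [3, 5].
        self.offsets = list(accumulate(len(d) for d in self.datasets))

    def __len__(self):
        return self.offsets[-1] if self.offsets else 0

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        # Locate the dataset that holds this global index.
        ds_id = bisect_right(self.offsets, index)
        start = self.offsets[ds_id - 1] if ds_id > 0 else 0
        return self.datasets[ds_id][index - start]


captions = ["cap0", "cap1", "cap2"]  # placeholder samples
qa = ["qa0", "qa1"]
mix = MixedDataset([captions, qa])
print(len(mix), mix[0], mix[3])
```

A real mixture may also weight or subsample the individual datasets; this sketch only shows the index bookkeeping.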

EAI. We release a small subset of the EAI tasks with a few data examples for demonstration purposes. You can download here. We recommend putting the extracted folders (mp3d_objnav and cliport) directly inside the instruction_base path. Testing in the simulator is not incorporated yet, but the data is ready for training and validation of EAI tasks.

Model Weights

Pretrained weights to load.

  • LLM: Vicuna-7B. We use Vicuna v1.1 from FastChat; refer to FastChat for access to Vicuna-13B or more advanced versions. Remember to update cfg_path in configs/llm/*.yaml.
  • Point cloud backbone: PointNet++, PointBERT. We have not tried PointNext, but everything is in place for it except the pretrained weights. Remember to update path in configs/vision3d/backbone/*.yaml.

Trained LEO weights. We release two checkpoints here:

  • align.pth: the checkpoint after the alignment stage, trained with LoRA.
  • sft_noact.pth: the checkpoint after the instruction tuning stage, based on align.pth and tuned without embodied acting tasks.

Running

Training. The training pipeline is elaborated in trainer/leo_trainer.py. Make sure the config file configs/default.yaml is properly set up before running.

  • General setup. We use wandb as the default experiment logger. Remember to set logger.entity to your account and initialize wandb. Modify name, note, and base_dir to control where experiment outputs are written.
  • Model. The components of LeoAgent can be configured in configs/llm, configs/vision2d and configs/vision3d.
  • Task. You can configure the tasks by specifying a yaml in configs/task. You can also run new tasks by creating similar configs.
  • GPU usage. We run the experiments on NVIDIA A100-80GB and A800-80GB. Modify dataloader arguments for your GPU if necessary.

We prepare some running scripts in scripts/, covering two-stage training and evaluation. The core is to run launch.py with proper arguments. There are three launch modes:

# python launch
python launch.py --mode python --config configs/default.yaml <HYDRA_CONFIG>

# accelerate launch
python launch.py --mode accelerate --config configs/default.yaml <HYDRA_CONFIG>

# SLURM submitit launch, default
python launch.py --mode submitit --config configs/default.yaml <HYDRA_CONFIG>

# for example, run alignment with submitit
#   --name: job name; --qos: QoS; --time: job execution duration (hours)
#   --partition: node type; --mem_per_gpu: memory per GPU
#   task: hydra cfg.task, selects the task; note: hydra cfg.note, used for exp_dir
python launch.py --mode submitit \
                 --config configs/default.yaml \
                 --name leo_tuning \
                 --qos lv0b \
                 --time 48 \
                 --num_nodes 1 \
                 --partition HGX \
                 --gpu_per_node 4 \
                 --mem_per_gpu 100 \
                 --port 2050 \
                 task=align \
                 note=align_lora
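
When launching many runs, it may be convenient to assemble the command programmatically. The sketch below only builds the argument list from flags that appear in the commands above; `build_launch_cmd` is a hypothetical helper, not something provided by the repo.

```python
import shlex


def build_launch_cmd(mode, config="configs/default.yaml",
                     launch_args=None, hydra_overrides=None):
    """Assemble the argument list for launch.py.

    launch_args: launcher flags without the leading "--", e.g. {"num_nodes": 1}.
    hydra_overrides: Hydra key=value overrides, e.g. {"task": "align"}.
    """
    cmd = ["python", "launch.py", "--mode", mode, "--config", config]
    for flag, value in (launch_args or {}).items():
        cmd += [f"--{flag}", str(value)]
    for key, value in (hydra_overrides or {}).items():
        cmd.append(f"{key}={value}")
    return cmd


cmd = build_launch_cmd(
    "submitit",
    launch_args={"name": "leo_tuning", "num_nodes": 1, "gpu_per_node": 4},
    hydra_overrides={"task": "align", "note": "align_lora"},
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root.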

Inference. We provide an inference script, scripts/inference.sh, which by default runs a separate Python script, inference.py, in python mode:

# single-GPU python-mode launch
python launch.py --mode python \
                 --run_file inference.py \
                 --config configs/default.yaml \
                 note=tuning_noact \
                 pretrained_ckpt_path=null

Modify the probe arguments in configs/default.yaml to customize the inputs for inference. You can select a checkpoint by specifying either note or pretrained_ckpt_path. For the former, note should match the note used for the training exp_dir. For the latter, assign the path of a checkpoint folder that contains pytorch_model.bin.
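
To find folders that qualify as a value for pretrained_ckpt_path, you can search an output directory for pytorch_model.bin. This is a small sketch under the assumption that checkpoints live somewhere below a directory you choose; `find_checkpoint_folders` is a hypothetical helper, and the layout of the experiment directory is not specified here.

```python
from pathlib import Path


def find_checkpoint_folders(root, weight_name="pytorch_model.bin"):
    """Return all folders under root that directly contain the weight file."""
    return sorted(p.parent for p in Path(root).rglob(weight_name) if p.is_file())


if __name__ == "__main__":
    # "." is a placeholder; point this at your experiment output directory.
    for folder in find_checkpoint_folders("."):
        print(folder)
```

Any folder it prints can be passed as pretrained_ckpt_path.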

Launch mode. For explanation of the launch arguments, use python launch.py --help. Refer to SLURM submitit or Accelerate for more information.

Notes

We manually modify some methods of accelerate.Accelerator in common/misc.py, including gather_for_metrics (fix gathering non-tensor objects), get_state_dict (for saving only learnable parameters when calling save_state), and prepare_scheduler (fix behavior with gradient accumulation).

BibTeX

@inproceedings{huang2023embodied,
  title={An Embodied Generalist Agent in 3D World},
  author={Huang, Jiangyong and Yong, Silong and Ma, Xiaojian and Linghu, Xiongkun and Li, Puhao and Wang, Yan and Li, Qing and Zhu, Song-Chun and Jia, Baoxiong and Huang, Siyuan},
  booktitle={Proceedings of the International Conference on Machine Learning (ICML)},
  year={2024}
}