Allen Z. Ren, Jaden Clark, Anushri Dixit, Masha Itkina, Anirudha Majumdar, Dorsa Sadigh
Princeton University, Stanford University, Toyota Research Institute
Set up the conda environment (Linux, Python 3.9):
conda env create -f environment.yml
conda activate explore-eqa
pip install -e .
Install the latest version of Habitat-Sim (headless, without Bullet physics) with:
conda install habitat-sim headless -c conda-forge -c aihabitat
Set up Prismatic VLM with the submodule:
cd prismatic-vlms && pip install -e .
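As a quick sanity check of the install, you can try loading a Prismatic VLM directly. The snippet below is only a minimal sketch following the prismatic-vlms README; the model id, image path, and token value are placeholders, and this is not the exact way the exploration scripts load the model.

```python
# Sanity-check sketch: load a Prismatic VLM and answer one question about an image.
# Assumes a GPU with enough VRAM; model id, image path, and token are placeholders.
import torch
from PIL import Image
from prismatic import load

hf_token = "hf_xxxxxxxxxxxxxxxx"  # your Hugging Face access token
vlm = load("prism-dinosiglip+7b", hf_token=hf_token)
vlm.to("cuda", dtype=torch.bfloat16)

image = Image.open("example_view.png").convert("RGB")  # placeholder image
prompt_builder = vlm.get_prompt_builder()
prompt_builder.add_turn(role="human", message="Is there a fireplace in this room?")
print(vlm.generate(image, prompt_builder.get_prompt(), max_new_tokens=32))
```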
Download the train split (hm3d-train-habitat-v0.2.tar) of the HM3D dataset here. You will be asked to request access first.
(Optional) For running CLIP-based exploration:
cd CLIP && pip install -e .
We release the HM-EQA dataset, which includes 500 questions about 267 scenes from the HM3D dataset. They are available in data/.
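Since the exact file layout is not spelled out above, the short sketch below simply lists what ships in data/ and previews the header of any CSV files it finds:

```python
# Sketch: inspect the released HM-EQA files under data/ (no assumptions about filenames).
import csv
from pathlib import Path

for path in sorted(Path("data").iterdir()):
    print(path.name)
    if path.suffix == ".csv":
        with path.open() as f:
            print("  columns:", next(csv.reader(f)))
```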
First, specify scene_data_path in the config files with the path to the downloaded HM3D train split, and set hf_token to your Hugging Face user access token. Running the script below for the first time will download the VLM model; running it assumes access to a GPU with sufficient VRAM for the chosen VLM.
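For example, the relevant entries in cfg/vlm_exp.yaml (and similarly in the other config files) would look roughly like this; the values are placeholders:

```yaml
# Placeholders -- point these at your own data path and token.
scene_data_path: /path/to/hm3d/train   # extracted hm3d-train-habitat-v0.2
hf_token: hf_xxxxxxxxxxxxxxxx          # Hugging Face user access token
```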
Run our method (VLM-semantic exploration) in Habitat-Sim:
python run_vlm_exp.py -cf cfg/vlm_exp.yaml
Run CLIP-based exploration in Habitat-Sim:
python run_clip_exp.py -cf cfg/clip_exp.yaml
Load a scene (with the question from our dataset) in Habitat-Sim:
python test_scene.py -cf cfg/test_scene.yaml
We also share a few scripts that might be helpful:
- script/sample_views_from_scene.py: for sampling random views in a scene in Habitat-Sim. We used these images to generate EQA questions with GPT-4V (see the sketch after this list for the general idea).
- script/get_floor_height.py: for getting the heights of the floors in each scene of the HM3D dataset, which are not available from the original dataset.
- script/get_questions_gpt4v.py: for generating EQA questions with GPT-4V from random views of the scene and few-shot examples.
- script/sample_init_pose.py: for sampling valid initial poses for the robot in each scene.
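To give a rough idea of what the view/pose sampling scripts do, here is a minimal Habitat-Sim sketch. It is not the scripts themselves; the scene path, camera height, and resolution are placeholders.

```python
# Sketch: sample random navigable viewpoints in a Habitat-Sim scene and render RGB views.
# Scene path, camera height, and resolution below are placeholders.
import numpy as np
import habitat_sim
from habitat_sim.utils.common import quat_from_angle_axis

def make_sim(scene_path):
    sim_cfg = habitat_sim.SimulatorConfiguration()
    sim_cfg.scene_id = scene_path

    rgb_spec = habitat_sim.CameraSensorSpec()
    rgb_spec.uuid = "color"
    rgb_spec.sensor_type = habitat_sim.SensorType.COLOR
    rgb_spec.resolution = [480, 640]
    rgb_spec.position = [0.0, 1.5, 0.0]  # camera height above the agent base

    agent_cfg = habitat_sim.agent.AgentConfiguration()
    agent_cfg.sensor_specifications = [rgb_spec]
    return habitat_sim.Simulator(habitat_sim.Configuration(sim_cfg, [agent_cfg]))

sim = make_sim("/path/to/hm3d/train/scene.basis.glb")  # placeholder scene path
agent = sim.get_agent(0)

for _ in range(5):
    # Sample a random navigable position and a random yaw, then render a view.
    point = sim.pathfinder.get_random_navigable_point()
    yaw = np.random.uniform(0.0, 2.0 * np.pi)
    state = habitat_sim.AgentState()
    state.position = point
    state.rotation = quat_from_angle_axis(yaw, np.array([0.0, 1.0, 0.0]))
    agent.set_state(state)
    rgb = sim.get_sensor_observations()["color"]  # H x W x 4 uint8 image
    print(point, rgb.shape)

sim.close()
```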
The CLIP-based exploration uses the CLIP multi-scale relevancy extractor from Semantic Abstraction.
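For reference, the snippet below is a minimal single-image, single-scale CLIP relevancy sketch using the CLIP package installed above; the multi-scale relevancy extractor from Semantic Abstraction that the exploration code actually uses produces dense relevancy maps and is considerably more involved.

```python
# Sketch: score one image against a text query with CLIP (single scale only).
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("example_view.png")).unsqueeze(0).to(device)  # placeholder image
text = clip.tokenize(["a fireplace in a living room"]).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(text)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    relevancy = (image_feat @ text_feat.T).item()  # cosine similarity

print(f"CLIP relevancy score: {relevancy:.3f}")
```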