This repository contains the PyTorch implementation of our paper and the associated datasets:
Chat2Map: Efficient Scene Mapping from Multi-Ego Conversations
Sagnik Majumder, Hao Jiang, Pierre Moulon, Ethan Henderson, Paul Calamia, Kristen Grauman*, Vamsi Krishna Ithapu*
UT Austin, Reality Labs Research at Meta, FAIR at Meta
CVPR 2023
* denotes equal contribution
Project website: https://vision.cs.utexas.edu/projects/chat2map/
Can conversational videos captured from multiple egocentric viewpoints reveal the map of a scene in a cost-efficient way? We seek to answer this question by proposing a new problem: efficiently building the map of a previously unseen 3D environment by exploiting shared information in the egocentric audio-visual observations of participants in a natural conversation. Our hypothesis is that as multiple people ("egos") move in a scene and talk among themselves, they receive rich audio-visual cues that can help uncover the unseen areas of the scene. Given the high cost of continuously processing egocentric visual streams, we further explore how to actively coordinate the sampling of visual information, so as to minimize redundancy and reduce power use. To that end, we present an audio-visual deep reinforcement learning approach that works with our shared scene mapper to selectively turn on the camera to efficiently chart out the space. We evaluate the approach using a state-of-the-art audio-visual simulator for 3D scenes as well as real-world video. Our model outperforms previous state-of-the-art mapping methods, and achieves an excellent cost-accuracy tradeoff.
This code has been tested with `python 3.6.13`, `habitat-api 0.1.4`, `habitat-sim 0.1.4`, and `torch 1.4.0`. Additional Python package requirements are available in `requirements.txt`.
First, install the required versions of habitat-api, habitat-sim, and torch inside a conda environment. Next, install the remaining dependencies either with `pip3 install -r requirements.txt`, or by parsing `requirements.txt` for the names and versions of the individual dependencies and installing them one by one.
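For reference, a minimal setup sketch is shown below. The environment name `chat2map` and the exact install order are assumptions; follow the official habitat-sim/habitat-api 0.1.4 installation instructions for your platform.

```bash
# Hypothetical setup sketch -- the environment name and install order are assumptions.
conda create -n chat2map python=3.6.13
conda activate chat2map

# Install habitat-sim 0.1.4 and habitat-api 0.1.4 here (see their repositories for the exact steps),
# then install the tested torch version and the remaining dependencies.
pip3 install torch==1.4.0
pip3 install -r requirements.txt
```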
Download the additional data needed for running the code from this link, and extract its contents into a directory named `data` under the project root (a rough extraction sketch follows the list below). The `data` directory should have 7 subdirectories:
- `datasets`: conversation episodes for training and evaluating the model for MP3D scenes
- `gt_topdown_maps`: pre-processed topdown occupancy maps used as inputs and targets for the model for MP3D scenes
- `metadata`: empty directory for storing metadata describing the discrete grid of MP3D scenes
- `noise`: sensor noise data for evaluating with sensor noise for MP3D scenes
- `scene_observations`: empty directory for storing cached RGB and depth images from MP3D scenes
- `sounds`: anechoic speech and distractor (ambient environment) sounds for training and evaluation for MP3D scenes
- `ambisonic_rirs`: empty directory for storing ambisonic RIRs for MP3D scenes
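As a rough sketch of the expected layout (the archive name `chat2map_data.zip` below is a placeholder for whatever the link provides):

```bash
# Rough sketch only: "chat2map_data.zip" is a placeholder name for the downloaded archive.
cd <project_root>                # the repository root
unzip chat2map_data.zip -d .     # should create ./data with the 7 subdirectories listed above
ls data                          # datasets  gt_topdown_maps  metadata  noise  scene_observations  sounds  ambisonic_rirs
```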
Download the SoundSpaces Matterport3D ambisonic RIRs and metadata, and extract them into directories named `data/ambisonic_rirs/mp3d` and `data/metadata/mp3d`, respectively.
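A hedged sketch of that step is shown below; the archive names are placeholders, not the actual SoundSpaces file names, so verify the resulting layout after extraction.

```bash
# Placeholder sketch: the archive names below are assumptions.
mkdir -p data/ambisonic_rirs/mp3d data/metadata/mp3d
tar -xzf <mp3d_ambisonic_rirs_archive>.tar.gz -C data/ambisonic_rirs/mp3d
tar -xzf <mp3d_metadata_archive>.tar.gz -C data/metadata/mp3d
# Verify that the extracted contents end up directly under data/ambisonic_rirs/mp3d and data/metadata/mp3d.
```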
Download the Matterport3D dataset, and cache the observations relevant for the SoundSpaces simulator using this script from the SoundSpaces repository. Use a resolution of 128 x 128 for both the RGB and depth sensors. Place the cached observations for all scenes (`.pkl` files) in `data/scene_observations/mp3d`. Also, copy or symlink the scene directories (which contain the `.glb`, `.house`, `.navmesh`, and `.ply` files) from `habitat_data/v1/tasks/mp3d_with_semantics` of the downloaded MP3D dataset folder into `data/scene_datasets/mp3d`.
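For the copy-or-symlink step, a minimal sketch is given below, assuming the Matterport3D download was extracted at a location referred to here as `<mp3d_root>`:

```bash
# Placeholder sketch: <mp3d_root> stands for wherever the Matterport3D download was extracted.
mkdir -p data/scene_datasets/mp3d
for scene in <mp3d_root>/habitat_data/v1/tasks/mp3d_with_semantics/*/; do
    # symlink each scene directory (containing .glb, .house, .navmesh, .ply) by its scene id
    ln -s "$(realpath "$scene")" data/scene_datasets/mp3d/
done
```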
Next, `cd` into `data/noise/gaussian_rgb_sensor_noise` and run `ln -s ../redwood_depth_sensor_noise/seed0_h128w128.pkl ./seed0_h128w128_mu0.0sigma1.0.pkl` and `ln -s ../redwood_depth_sensor_noise/seed0_h128w128_1000eps.pkl ./seed0_h128w128_mu0.0sigma1.0_1000eps.pkl`.
For further information about the structure of the associated datasets, refer to `chat2map/config/default.py` or the task configs.
4 GPU DataParallel training:
CUDA_VISIBLE_DEVICES=0,1,2,3 python3 main.py --exp-config chat2map/config/passive_mapping/train.yaml --model-dir runs/passive_mapping --run-type train NUM_PROCESSES 1
1 GPU DataParallel testing on simulated data:
CUDA_VISIBLE_DEVICES=0 python3 main.py --exp-config chat2map/config/passive_mapping/test.yaml --model-dir runs_eval/passive_mapping/ --run-type eval EVAL_CKPT_PATH runs_eval/passive_mapping/data/best_ckpt_val.1.pth NUM_PROCESSES 1
Compute average map quality metrics, like F1 score and IoU, using `scripts/map_quality/compute_avgMetricVals_passiveMapping.ipynb`.
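If you prefer running the notebook non-interactively, one option (assuming `jupyter` and `nbconvert` are installed in the environment) is:

```bash
# Runs the metrics notebook headlessly and saves an executed copy alongside it.
jupyter nbconvert --to notebook --execute \
    scripts/map_quality/compute_avgMetricVals_passiveMapping.ipynb \
    --output compute_avgMetricVals_passiveMapping_executed.ipynb
```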
Note: active mapping training with 8-GPU DDPPO and 8 processes per GPU requires ~490 GB of CPU memory.
8 GPU + 8 process DDPPO training of random agent:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python3 -u -m torch.distributed.launch --use_env --nproc_per_node 8 main.py --exp-config chat2map/config/active_mapping/train_random.yaml --model-dir runs/random --run-type train NUM_PROCESSES 8 RL.PPO.use_ddppo True
8 GPU + 8 process DDPPO training of Chat2Map active mapping agent:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python3 -u -m torch.distributed.launch --use_env --nproc_per_node 8 main.py --exp-config chat2map/config/active_mapping/train_chat2map_activeMapper.yaml --model-dir runs/chat2map_activeMapper --run-type train NUM_PROCESSES 8 RL.PPO.use_ddppo True
Note: before testing, symlink `ckpt.best_reward_lastStep.pth` from `runs/chat2map_activeMapper` into `runs_eval/chat2map_activeMapper`.
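For example, a minimal sketch, assuming the eval config looks for the checkpoint directly under `runs_eval/chat2map_activeMapper`:

```bash
# Assumption: the eval config expects the checkpoint directly under runs_eval/chat2map_activeMapper.
mkdir -p runs_eval/chat2map_activeMapper
ln -s "$(realpath runs/chat2map_activeMapper/ckpt.best_reward_lastStep.pth)" \
      runs_eval/chat2map_activeMapper/ckpt.best_reward_lastStep.pth
```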
1 GPU + 1 process testing of Chat2Map active mapping on simulated data:
CUDA_VISIBLE_DEVICES=0 python3 main.py --exp-config chat2map/config/active_mapping/test_chat2map_activeMapper.yaml --model-dir runs_eval/chat2map_activeMapper/ --run-type eval NUM_PROCESSES 1
Compute average map quality metrics, like F1 score and IoU, over all episode steps or for the last step using `scripts/map_quality/compute_avg_allSteps_n_lastStep_metricVals_activeMapping.ipynb`.
Coming soon!!
Please open an issue in this repository (preferred for better visibility), or reach out to [email protected].
See the CONTRIBUTING file for how to help out.
If you use the code or the paper, please cite the following paper:
@inproceedings{majumder2023chat2map,
title={Chat2Map: Efficient Scene Mapping from Multi-Ego Conversations},
author={Majumder, Sagnik and Jiang, Hao and Moulon, Pierre and Henderson, Ethan and Calamia, Paul and Grauman, Kristen and Ithapu, Vamsi Krishna},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={10554--10564},
year={2023}
}
This codebase is licensed under the CC-BY-NC license, and the checkpoints are released under the same license in accordance with the Matterport3D data terms. The Matterport3D data are distributed under their own terms; see this license.