VQASynth 🎹

Spatial reasoning is fundamental to interacting with and navigating physical environments for embodied AI applications like robotics. However, data samples suitable for learning these capabilities are rare in AI pretraining datasets. Don't be limited by what your model can do out-of-the-box: curate any image dataset from the Hugging Face Hub for spatial VQA with tools for scene understanding.

VLMs trained using VQASynth 🎹 can:

  • estimate 3D distances between objects in an image
  • describe distances colloquially, convert between common units of measurement
  • answer queries about the orientation and spatial relationships between objects
  • base responses on consistent references like floors and surfaces

Description

By fusing semantic and metric data into templated VQA chat data, Vision Language Models can be instruction-tuned with low-rank adapters to enhance their baseline spatial reasoning capabilities. VQASynth 🎹 provides an open-source reproduction of SpatialVLM, which describes a 3D scene reconstruction pipeline and prompt templates for enhancing the spatial reasoning abilities of VLMs, including:

  • Semantic filtering with CLIP to normalize the image distribution and attributes
  • Metric depth estimation with ZoeDepth to lift the 2D image to a 3D point cloud
  • Object-level captioning with FlexCap for precise 2D region proposal
  • Plane-fitting with RANSAC for consistent 3D reference coordinates
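
As a concrete illustration of the plane-fitting step above, here is a minimal sketch using Open3D's RANSAC-based plane segmentation. Open3D, the file path, and the thresholds are illustrative assumptions, not the pipeline's actual implementation:

import open3d as o3d

# Placeholder input; the pipeline builds point clouds from RGB + estimated depth.
pcd = o3d.io.read_point_cloud("scene_pointcloud.ply")

# segment_plane returns (a, b, c, d) for ax + by + cz + d = 0 plus inlier indices.
plane_model, inliers = pcd.segment_plane(
    distance_threshold=0.02,  # max point-to-plane distance (meters) to count as an inlier
    ransac_n=3,               # points sampled per RANSAC hypothesis
    num_iterations=1000,
)
a, b, c, d = plane_model
print(f"Reference plane: {a:.3f}x + {b:.3f}y + {c:.3f}z + {d:.3f} = 0")

# Points off the plane can be treated as objects measured against the floor reference.
objects = pcd.select_by_index(inliers, invert=True)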

Initial VQASynth 🎹 pipelines prompted LLaVA for JSON-formatted, object-level detailed captions or extracted tags with RAM. Accordingly, we evaluated caption/tag-based region proposal with publicly available models like CLIPSeg and groundingDINO.
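
For reference, caption/tag-based region proposal with CLIPSeg can be sketched with the Hugging Face transformers checkpoint below; the image path, prompts, and threshold are placeholders rather than the values used in the test scripts:

import torch
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

image = Image.open("warehouse_rgb.jpg")             # test image used in the demo scripts
prompts = ["forklift", "cardboard boxes", "floor"]  # captions/tags, e.g. from LLaVA or RAM

inputs = processor(text=prompts, images=[image] * len(prompts), return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One low-resolution heatmap per prompt; threshold to get coarse region masks.
masks = torch.sigmoid(outputs.logits) > 0.5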

Figure: VQASynth pipeline diagram

What's New 👀 in VQASynth 🎹

🪶 Faster & lighter using Florence-2 for detailed image captions and region proposal grounded on text captions (sketched below).

๐Ÿ“ Improves metric depth estimation speed & accuracy by replacing ZoeDepth with DepthPro.

🎓 SAM2 replaces SAM in the localization refinement stage.
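
Here is a minimal sketch of the Florence-2 captioning and grounding flow, following the model card's documented task prompts; the checkpoint id, generation settings, and test image are assumptions, not VQASynth's exact code:

from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-base"  # assumption: any Florence-2 checkpoint follows the same interface
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("warehouse_rgb.jpg")

def run_task(task_prompt, text=""):
    # Florence-2 is steered by special task tokens prepended to the text input.
    inputs = processor(text=task_prompt + text, images=image, return_tensors="pt")
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=512,
    )
    raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    return processor.post_process_generation(raw, task=task_prompt, image_size=image.size)

caption = run_task("<DETAILED_CAPTION>")["<DETAILED_CAPTION>"]
# Ground the caption back onto the image to get candidate object boxes and labels.
regions = run_task("<CAPTION_TO_PHRASE_GROUNDING>", caption)["<CAPTION_TO_PHRASE_GROUNDING>"]
print(regions["bboxes"], regions["labels"])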

Environment

Before running the demo scripts, ensure you have the following installed:

CLIPSeg-based SpatialVLM data processing (recommended):

cd tests/data_processing/
docker build -f clipseg_data_processing.dockerfile -t vqasynth:clipseg-dataproc-test .
docker run --gpus all -v /path/to/output/:/path/to/output vqasynth:clipseg-dataproc-test --input_image="warehouse_rgb.jpg" --output_dir "/path/to/output" 

GroundingDINO-based SpatialVLM data processing:

cd tests/data_processing/
docker build -f groundingDino_data_processing.dockerfile -t vqasynth:dino-dataproc-test .
docker run --gpus all -v /path/to/output/:/path/to/output vqasynth:dino-dataproc-test --input_image="warehouse_rgb.jpg" --output_dir "/path/to/output" 

The scripts will produce 3D point clouds, segmented images, labels, and prompt examples for a test image.
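
The point clouds come from lifting each RGB image with monocular metric depth. A rough sketch, assuming the transformers depth-estimation pipeline, a public ZoeDepth checkpoint, and placeholder pinhole intrinsics (the current pipeline uses DepthPro and proper camera parameters):

import numpy as np
from PIL import Image
from transformers import pipeline

# Assumption: a public ZoeDepth checkpoint; VQASynth has since moved to DepthPro.
depth_estimator = pipeline("depth-estimation", model="Intel/zoedepth-nyu-kitti")

image = Image.open("warehouse_rgb.jpg")
depth = np.array(depth_estimator(image)["predicted_depth"]).squeeze()  # metric depth map

# Back-project with a pinhole camera model; fx, fy, cx, cy are placeholder intrinsics.
h, w = depth.shape
fx = fy = 500.0
cx, cy = w / 2.0, h / 2.0
u, v = np.meshgrid(np.arange(w), np.arange(h))
points = np.stack(
    [(u - cx) * depth / fx, (v - cy) * depth / fy, depth], axis=-1
).reshape(-1, 3)
print(points.shape)  # (h * w, 3) point cloud in camera coordinates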

Run a Pipeline on Your Images

The main pipeline uses Docker Compose to process a Hugging Face dataset into a VQA dataset including spatial relations between objects. The dataset follows conventions for training models like LLaVA. We recommend using an A10 GPU or larger for processing.

Make sure to update the config.yaml file with the following details: an output directory path, the repository ID of the dataset to process, and a dataset name under which to push the results to the Hub. You can optionally add include_tags and/or exclude_tags as comma-separated lists in the config file to filter the dataset by tags; if no tags are provided, no filtering is applied.
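
For illustration only, a starter config.yaml could be generated as below; the field names are hypothetical stand-ins for the details listed above, so check the repository's sample config for the exact schema:

import yaml  # pip install pyyaml

# Hypothetical field names mirroring the prose above -- not the verified schema.
config = {
    "output_dir": "/path/to/output",                       # where intermediate and final files land
    "source_repo_id": "your-username/your-image-dataset",  # Hugging Face dataset to process
    "target_repo_name": "your-username/your-spatial-vqa",  # dataset name pushed to the Hub
    "include_tags": "warehouse,forklift",                  # optional comma-separated tag filter
    "exclude_tags": "",                                    # optional; leave empty to skip filtering
}

with open("config.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)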

Then launch the pipeline with:

# Authenticate to push to hub
huggingface-cli login

# Run the pipeline
cd /path/to/VQASynth
bash run.sh

In your designated output directory, you'll find a JSON file, processed_dataset.json, containing the formatted dataset.
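
To sanity-check the output, you can peek at a few records. This assumes the common LLaVA-style layout with image and conversations fields, which is a convention implied above rather than a guaranteed schema:

import json

with open("/path/to/output/processed_dataset.json") as f:
    dataset = json.load(f)

print(f"{len(dataset)} samples")
sample = dataset[0]
print("image:", sample.get("image"))
for turn in sample.get("conversations", []):  # e.g. alternating "human" / "gpt" turns
    print(f'{turn["from"]}: {turn["value"]}')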

Here are some examples:

Sample 1
Q: Does the red forklift in warehouse appear on the left side of the brown cardboard boxes stacked?
A: Incorrect, the red forklift in warehouse is not on the left side of the brown cardboard boxes stacked.

Sample 2
Q: How close is the man in red hat walking from the wooden pallet with boxes?
A: The man in red hat walking is 60.13 centimeters from the wooden pallet with boxes.

Sample 3
Q: Does the man in blue shirt working have a greater height compared to the wooden pallet with boxes on floor?
A: Indeed, the man in blue shirt working is taller compared to the wooden pallet with boxes on floor.

Here's a sample of warehouse images captioned with spatial relationships similar to the examples above.

wget https://remyx.ai/assets/vqasynth/vqasynth_warehouse_spaces.zip

# Data is formatted for LLaVA fine-tuning
unzip vqasynth_warehouse_spaces.zip 

Once completed, you can follow this resource on fine-tuning LLaVA.

Datasets from VQASynth 🎹

Models tuned on VQASynth 🎹

Try SpaceMantis in the HF Space or SpaceLLaVA in Discord

Notebooks

We've hosted some notebooks visualizing and experimenting with the techniques included in this repo.

Notebook: Spatial Reasoning with Point Clouds
Description: Visualize point clouds and evaluate spatial relationships
Launch: Open In Colab

References

This project was inspired by or utilizes concepts discussed in the following research paper(s):

@article{chen2024spatialvlm,
  title = {SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities},
  author = {Chen, Boyuan and Xu, Zhuo and Kirmani, Sean and Ichter, Brian and Driess, Danny and Florence, Pete and Sadigh, Dorsa and Guibas, Leonidas and Xia, Fei},
  journal = {arXiv preprint arXiv:2401.12168},
  year = {2024},
  url = {https://arxiv.org/abs/2401.12168},
}