This is the official implementation of Unifying (Machine) Vision via Counterfactual World Modeling.
See Setup below to install. Please reference our work as Bear, D.M. et al. (2023).
Counterfactual World Models (CWMs) can be prompted with "counterfactual" visual inputs: "What if?" questions about slightly perturbed versions of real scenes.
Beyond generating new, simulated scenes, properly prompting CWMs can reveal the underlying physical structure of a scene. For instance, asking which points would also move along with a selected point is a way of segmenting a scene into independently movable "Spelke" objects.
The provided notebook demos are a subset of the use cases described in our paper.
Run the Jupyter notebook CounterfactualWorldModels/demo/FactualAndCounterfactual.ipynb
Given all of one frame and a few patches of a subsequent frame from a real video, a CWM makes predictions about the rest of the second frame. The ability to prompt the CWM with a small number of tokens relies on training with a very small number of patches revealed in the second frame.
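The masking pattern described above (the whole first frame visible, only a few patches of the second frame revealed) can be sketched as a boolean mask over patch positions. This is a minimal illustration rather than the repository's API: the function name `temporally_factored_mask`, its parameters, and the 28x28 patch grid are assumptions made for the example.

```python
import numpy as np

def temporally_factored_mask(grid_h, grid_w, num_visible, seed=0):
    """Return a (2, grid_h, grid_w) boolean mask where True means "hidden".

    Hypothetical helper: frame 0 is fully visible, frame 1 is hidden except
    for `num_visible` randomly chosen patches.
    """
    rng = np.random.default_rng(seed)
    mask = np.zeros((2, grid_h, grid_w), dtype=bool)  # frame 0: nothing hidden
    mask[1] = True                                     # frame 1: hide everything...
    visible = rng.choice(grid_h * grid_w, size=num_visible, replace=False)
    rows, cols = np.unravel_index(visible, (grid_h, grid_w))
    mask[1, rows, cols] = False                        # ...then reveal a few patches
    return mask

mask = temporally_factored_mask(grid_h=28, grid_w=28, num_visible=5)
print(mask[0].sum(), (~mask[1]).sum())  # hidden patches in frame 0, revealed in frame 1
```

Keeping `num_visible` small mirrors the point made above: a model trained with very few revealed second-frame patches can later be prompted with very few tokens.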
A small number of patches (colored) in a single image can be selected to counterfactually move in a chosen direction, while other patches (black) are static. This produces object movement in the intended directions.
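One plausible way to assemble such a prompt is sketched below, assuming a motion counterfactual is expressed by copying selected patches of the input image into a second-frame prompt at shifted positions, and copying static patches in place. The helper `build_motion_counterfactual` and its arguments are hypothetical and are not taken from the `cwm` package; the demo notebook shows the actual interface.

```python
import numpy as np

def build_motion_counterfactual(image, patch_size, moving, static=()):
    """Build a hypothetical second-frame prompt from a single image.

    image:  (H, W, C) array with H and W divisible by patch_size.
    moving: dict mapping a patch (row, col) to its shift (d_row, d_col), in patches.
    static: iterable of patch (row, col) positions revealed in place.
    Returns the prompt frame and a boolean grid of revealed patch positions.
    """
    p = patch_size
    grid = (image.shape[0] // p, image.shape[1] // p)
    prompt = np.zeros_like(image)
    revealed = np.zeros(grid, dtype=bool)

    def copy_patch(src, dst):
        if not (0 <= dst[0] < grid[0] and 0 <= dst[1] < grid[1]):
            raise ValueError(f"patch {src} would move outside the image")
        (sr, sc), (dr, dc) = src, dst
        prompt[dr * p:(dr + 1) * p, dc * p:(dc + 1) * p] = \
            image[sr * p:(sr + 1) * p, sc * p:(sc + 1) * p]
        revealed[dr, dc] = True

    for row, col in static:                      # static patches stay where they are
        copy_patch((row, col), (row, col))
    for (row, col), (d_row, d_col) in moving.items():  # moving patches are shifted
        copy_patch((row, col), (row + d_row, col + d_col))
    return prompt, revealed

image = np.random.default_rng(0).random((8, 8, 3))
# Move patch (1, 0) up by one patch and hold patch (1, 1) in place.
prompt, revealed = build_motion_counterfactual(image, 4, {(1, 0): (-1, 0)}, static=[(1, 1)])
```

The returned `revealed` grid marks which second-frame patches a model would see; everything else is left for the model to predict.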
Run the Jupyter notebook CounterfactualWorldModels/demo/SpelkeObjectSegmentation.ipynb
Users can upload their own images on which to run counterfactuals.
In each row, one patch is selected to move "upward" (green square) and in the last two rows, one patch is selected to remain static (red square). The optical flow resulting from the simulation represents the CWM's implicit segmentation of the moved object. In the last row, the implied segment includes both the robot arm and the object it is grasping, as the CWM predicts they will move as a unit.
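Turning the simulated optical flow into a segment can be sketched as a simple threshold on flow magnitude. This is an illustrative assumption, not necessarily how the notebook does it: the function `segment_from_flow`, the `(H, W, 2)` flow layout, and the threshold value are all hypothetical.

```python
import numpy as np

def segment_from_flow(flow, threshold=0.5):
    """Return a boolean (H, W) mask of pixels whose flow magnitude exceeds `threshold`.

    flow: (H, W, 2) array of per-pixel displacements between the input image
    and the counterfactual prediction (layout assumed for this sketch).
    """
    magnitude = np.linalg.norm(flow, axis=-1)
    return magnitude > threshold

# Synthetic example: only the top-left 2x2 block moves "upward".
flow = np.zeros((4, 4, 2))
flow[:2, :2] = (-3.0, 0.0)
segment = segment_from_flow(flow)
```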
Run the Jupyter notebook CounterfactualWorldModels/demo/MovabilityAndMotionCovariance.ipynb
Many motion counterfactuals are randomly sampled (i.e., patches are placed throughout the input image and moved). This produces a "movability" heatmap of which parts of a scene tend to move and which tend to remain static. Spelke objects are inferred to be most movable, while the background rarely moves.
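A movability heatmap of this kind can be approximated by counting, for each pixel, how often it moves across the sampled counterfactuals. The sketch below assumes the flows from all counterfactuals are already stacked into one array; the function name `movability_heatmap`, the `(N, H, W, 2)` layout, and the threshold are assumptions for illustration only.

```python
import numpy as np

def movability_heatmap(flows, threshold=0.5):
    """Fraction of counterfactuals in which each pixel moves.

    flows: (N, H, W, 2) array, one predicted flow field per counterfactual
    (layout assumed for this sketch). Returns an (H, W) array in [0, 1].
    """
    moved = np.linalg.norm(flows, axis=-1) > threshold  # (N, H, W) booleans
    return moved.mean(axis=0)

# Synthetic example: the left half moves in 3 of 4 counterfactuals, the right half never.
flows = np.zeros((4, 2, 4, 2))
flows[:3, :, :2] = (0.0, 2.0)
heatmap = movability_heatmap(flows)
```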
By inspecting the pixel-pixel covariance across many motion counterfactuals, we can estimate which parts of a scene move together on average. Shown are maps of what tends to move along with a selected point (cyan). Objects adjacent to one another tend to move together, as some motion counterfactuals include collisions between them; however, motion counterfactuals in the appropriate direction can isolate single Spelke objects (see above).
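The covariance map for a selected point can be sketched as the sample covariance, across counterfactuals, between the flow magnitude at that point and the flow magnitude at every other pixel. Using flow magnitude as the per-pixel quantity is an assumption of this sketch, as are the function name `motion_covariance_map` and the `(N, H, W, 2)` layout.

```python
import numpy as np

def motion_covariance_map(flows, point):
    """Covariance of each pixel's flow magnitude with that of `point`.

    flows: (N, H, W, 2) array with N >= 2 counterfactuals (layout assumed).
    point: (row, col) of the selected pixel. Returns an (H, W) array.
    """
    magnitude = np.linalg.norm(flows, axis=-1)         # (N, H, W)
    centered = magnitude - magnitude.mean(axis=0)      # remove per-pixel mean
    reference = centered[:, point[0], point[1]]        # (N,) values at the point
    # Sum over counterfactuals, then apply the unbiased (N - 1) normalization.
    return np.tensordot(reference, centered, axes=(0, 0)) / (len(flows) - 1)

# Synthetic example: the left half moves in two samples, the right half in the other two.
flows = np.zeros((4, 2, 4, 2))
flows[:2, :, :2] = (1.0, 0.0)
flows[2:, :, 2:] = (1.0, 0.0)
cov = motion_covariance_map(flows, point=(0, 0))
```

In this toy setup, pixels that move in the same samples as the selected point get positive values and pixels that move only in the other samples get negative values.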
We recommend installing required packages in a virtual environment, e.g. with venv or conda.
- clone the repo:
git clone https://github.com/neuroailab/CounterfactualWorldModels.git
- install requirements and the cwm package:
cd CounterfactualWorldModels && pip install -e .
Note: If you want to run models on a CUDA backend with Flash Attention (recommended), it needs to be installed separately via these instructions.
Weights are currently available for three VMAEs trained with the temporally-factored masking policy:
- A ViT-base VMAE with 8x8 patches, trained 3200 epochs on Kinetics400
- A ViT-large VMAE with 4x4 patches, trained 100 epochs on Kinetics700 + Moments + (20% of Ego4D)
- A ViT-base VMAE with 4x4 patches, conditioned on both IMU and RGB video data (otherwise same as above)
See demo jupyter notebooks for urls to download these weights and load them into VMAEs.
These notebooks also download weights for other models required for some computations:
- A ViT that predicts IMU from a 2-frame RGB movie (required for running the IMU-conditioned VMAE)
- A pretrained RAFT optical flow model
- A pretrained RAFT architecture optimized to predict keypoints in a single image. (See paper for definition.)
- Fine control over counterfactuals (multiple patches moving in different directions)
- Iterative algorithms for segmenting Spelke objects
- Using counterfactuals to estimate other scene properties
- Model training code
If you found this work interesting or useful in your own research, please cite the following:
@misc{bear2023unifying,
title={Unifying (Machine) Vision via Counterfactual World Modeling},
author={Daniel M. Bear and Kevin Feigelis and Honglin Chen and Wanhee Lee and Rahul Venkatesh and Klemen Kotar and Alex Durango and Daniel L. K. Yamins},
year={2023},
eprint={2306.01828},
archivePrefix={arXiv},
primaryClass={cs.CV}
}