The Second Perception Test Challenge is being organised as an ECCV 2024 workshop! Please see the workshop website for more details and links to the eval.ai challenge pages: ptchallenge-workshop.github.io.
Resource | Links |
---|---|
Quickstart visualisation notebook | |
Dataset Explorer | Dataset Explorer |
Download data | Download section here |
Evaluation scripts (including data loader, dummy baseline, evaluation metrics) | multiple-choice video QA, object tracking, action localisation, point tracking, sound localisation, grounded video QA |
Challenges and evaluation servers | multiple-choice video QA, object tracking, action localisation, point tracking, sound localisation, grounded video QA |
Perception Test: A Diagnostic Benchmark for Multimodal Video Models is a multimodal benchmark designed to comprehensively evaluate the perception and reasoning skills of multimodal video models. The Perception Test dataset introduces real-world videos designed to show perceptually interesting situations and defines multiple tasks (object and point tracking, action and sound localisation, multiple-choice and grounded video question-answering) that require understanding of memory, abstract patterns, physics, and semantics, across visual, audio, and text modalities.
In this repository, you will find:
- A summary of the Perception Test and the associated challenge
- A detailed description of the data and annotations in the Perception Test (interactive demo notebook here)
- Details about how to download the data and annotations in the Perception Test (download section here)
- Metrics for evaluating the performance on the different tasks (metrics section here)
- Dummy baselines showcasing how to evaluate models on each of the tasks (baselines section here)
Try the Perception Test for yourself by accessing this quiz.
For more example videos in the Perception Test, check out this playlist.
The Perception Test dataset can be downloaded as zip files containing:
- annotations in JSON format
- videos (including audio) as MP4 files
- audio-only files in WAV format
- pre-computed features for the action localisation and sound localisation tasks.
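For scripted setups, the following minimal sketch unpacks and inspects the sample annotations using only the Python standard library. The file names inside the zip and the top-level JSON structure (a dict keyed by video ID) are assumptions; adjust them to the archives you actually download.

```python
import json
import zipfile

# Open the downloaded archive and list the JSON annotation files inside it.
with zipfile.ZipFile("sample_annotations.zip") as zf:
    json_names = [n for n in zf.namelist() if n.endswith(".json")]
    print("Annotation files:", json_names)
    # Load the first annotation file directly from the zip.
    with zf.open(json_names[0]) as f:
        annotations = json.load(f)

# Assumption: the annotations are a dict keyed by video ID.
print(len(annotations), "annotated videos")
```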
Task | Split | Videos | Audio | Labels |
---|---|---|---|---|
Sample | All | sample_videos.zip (214.9MB) | sample_audios.zip (83.9MB) | sample_annotations.zip (3MB) |
All Tasks | Train | train_videos.zip (26.5GB) | train_audios.zip (12.3GB) | train_annotations.zip (30.6MB) |
All Tasks | Valid | valid_videos.zip (70.2GB) | valid_audios.zip (33.1GB) | valid_annotations.zip (81.5MB) |
All Tasks | Test | test_videos.zip (41.8GB) | test_audios.zip (19.3GB) | test_annotations.zip (633.9kB) |
*Note: in test videos where the end of the video gives away the answer to some questions (e.g. in cup-games videos, where the hidden object is revealed at the end), we cut the end of the video. For the validation split, we instead provide the frame ID where the cut should be made: cut_frame_mapping_valid.json.
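A minimal sketch of consuming this mapping, assuming it is a JSON dictionary from video ID to cut frame index (verify this against the downloaded file):

```python
import json

with open("cut_frame_mapping_valid.json") as f:
    cut_frames = json.load(f)  # assumed: {video_id: cut_frame_index}

def keep_frame(video_id: str, frame_idx: int) -> bool:
    """Return True if this frame may be used for the given validation video."""
    cut = cut_frames.get(video_id)
    return cut is None or frame_idx < cut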
Video IDs
Since some of the challenges use subsets of the benchmark, we provide below the lists of video IDs for each challenge. These should be used to filter the videos/audios/annotations from the full splits above; a minimal filtering sketch follows the table.
For single object tracking, single point tracking, and grounded video QA, we provide separate zip files, since these subsets are much smaller than the full dataset.
Computational Task | Challenge Train Video IDs | Challenge Valid Video IDs | Challenge Test Video IDs |
---|---|---|---|
Single Object Tracking | object_tracking_train_id_list.csv | object_tracking_valid_subset_id_list.csv | object_tracking_test_subset_id_list.csv |
Single Point Tracking | point_tracking_train_id_list.csv | point_tracking_valid_id_list.csv | point_tracking_test_id_list.csv |
Temporal Action Localisation | action_localisation_train_id_list.csv | localisation_challenge_valid_id_list.csv | localisation_challenge_test_id_list.csv |
Temporal Sound Localisation | sound_localisation_train_id_list.csv | localisation_challenge_valid_id_list.csv | localisation_challenge_test_id_list.csv |
Multiple-Choice Video QA | mc_question_train_id_list.csv | mc_question_valid_id_list.csv | mc_question_test_id_list.csv |
Grounded Video QA | grounded_question_train_id_list.csv | grounded_question_valid_id_list.csv | grounded_question_test_id_list.csv |
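As referenced above, a minimal filtering sketch. The annotation file name is hypothetical, and the CSV is assumed to hold one video ID per row; adapt both to your local layout.

```python
import csv
import json

# Read the challenge subset of video IDs (assumed: one ID per CSV row).
with open("mc_question_valid_id_list.csv") as f:
    subset_ids = {row[0] for row in csv.reader(f) if row}

# Load the full-split annotations (hypothetical file name).
with open("valid_annotations.json") as f:
    annotations = json.load(f)

# Keep only the videos that belong to the challenge subset.
subset = {vid: ann for vid, ann in annotations.items() if vid in subset_ids}
print(f"Kept {len(subset)} of {len(annotations)} videos")
```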
Single Object Tracking
Challenge link: https://eval.ai/web/challenges/challenge-page/2094/overview
Task | Split | Videos | Audio | Labels |
---|---|---|---|---|
Single Object Tracking | Train | Use full split download above | N/A | Use full split download above |
Single Object Tracking | Valid | sot_valid_videos_challenge2023.zip (11.6GB) | N/A | sot_valid_annotations_challenge2023.zip (9MB) |
Single Object Tracking | Test | sot_test_videos_challenge2023.zip (12.1GB) | N/A | sot_test_annotations_challenge2023.zip (613kB) |
Single Point Tracking
Challenge link: https://eval.ai/web/challenges/challenge-page/2108/overview
Task | Split | Videos | Audio | Labels |
---|---|---|---|---|
Single Point Tracking | Train | point_tracking_train_videos.zip (398.4MB) | N/A | point_tracking_train_annotations.zip (4.7MB) |
Single Point Tracking | Valid | point_tracking_valid_videos.zip (1.1GB) | N/A | point_tracking_valid_annotations.zip (11.1MB) |
Single Point Tracking | Test | point_tracking_test_videos.zip (691MB) | N/A | point_tracking_test_annotations.zip (42.2kB) |
Temporal Action Localisation
Challenge link: https://eval.ai/web/challenges/challenge-page/2101/overview
Task | Split | Videos | Audio | Labels | Video Features (TSP) |
---|---|---|---|---|---|
Temporal Action Localisation | Train | Use full split download above | Use full split download above | action_localisation_train_annotations.zip (217kB) | action_localisation_train_video_features.zip (81.7MB) |
Temporal Action Localisation | Valid | Use full split download above | Use full split download above | challenge_action_localisation_valid_annotations.zip (558kB) | action_localisation_valid_video_features.zip (219.2MB) |
Temporal Action Localisation | Test | Use full split download above | Use full split download above | N/A | action_localisation_test_video_features.zip (131.7MB) |
Temporal Sound Localisation
Challenge link: https://eval.ai/web/challenges/challenge-page/2109/overview
Task | Split | Videos | Audio | Labels | Audio Features (MMV) |
---|---|---|---|---|---|
Temporal Sound Localisation | Train | Use full split download above | Use full split download above | sound_localisation_train_annotations.zip (363kB) | sound_localisation_train_audio_features.zip (109.1MB) |
Temporal Sound Localisation | Valid | Use full split download above | Use full split download above | challenge_sound_localisation_valid_annotations.zip (552kB) | sound_localisation_valid_audio_features.zip (291.4MB) |
Temporal Sound Localisation | Test | Use full split download above | Use full split download above | N/A | sound_localisation_test_video_features.zip (177.2MB) |
Multiple-Choice Video QA
Challenge link: https://eval.ai/web/challenges/challenge-page/2091/overview
Task | Split | Videos | Audio | Labels |
---|---|---|---|---|
Multiple-Choice Video QA | Train | Use full split download above | Use full split download above | mc_question_train_annotations.zip (85kB) |
Multiple-Choice Video QA | Valid | Use full split download above | Use full split download above | mc_question_valid_annotations.zip (200kB) |
Multiple-Choice Video QA | Test | Use full split download above | Use full split download above | mc_question_test_annotations.zip (200kB) |
Grounded Video QA
Challenge link: https://eval.ai/web/challenges/challenge-page/2110/overview
Task | Split | Videos | Audio | Labels |
---|---|---|---|---|
Grounded Video QA | Train | grounded_question_train_videos.zip (7.3GB) | grounded_question_train_audios.zip (3.4GB) | grounded_question_train_annotations.zip (6.1MB) |
Grounded Video QA | Valid | grounded_question_valid_videos.zip (19.3GB) | grounded_question_valid_audios.zip (9.1GB) | grounded_question_valid_annotations.zip (16.8MB) |
Grounded Video QA | Test | grounded_question_test_videos.zip (11.3GB) | | grounded_question_test_annotations.zip (17.5kB) |
In this repo we provide dummy baselines that demonstrate how to load the data, evaluate models, and recreate some of the baseline results from the paper. For the other results in the baselines section of the paper, we will be adding a separate external repo. A sketch of the static-object baseline follows the table below.
Computational task | Baseline | Description |
---|---|---|
Single Object Tracking | Static | Static object baseline. |
Single Point Tracking | Static | Static point baseline. |
Temporal Action Localisation | ActionFormer | ActionFormer model fine-tuned on Perception Test data. |
Temporal Sound Localisation | ActionFormer | ActionFormer model fine-tuned on Perception Test data. |
Multiple-Choice Video QA | Frequency | Frequency baseline using training question/answer pairs. More details are provided in the paper. |
Grounded Video QA | MDETR + static | MDETR open-vocabulary object detections kept static throughout the video. |
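As an illustration of how simple these baselines are, here is a minimal sketch of the static-object baseline, using the object track field names described in the annotation schema further below; the exact submission format expected by the evaluation servers is not shown here.

```python
def static_track(object_track: dict, num_frames: int) -> list:
    """Predict the starting box, unchanged, for every frame of the video."""
    # initial_tracking_box is a one-hot vector marking the starting annotation.
    start = object_track["initial_tracking_box"].index(1)
    box = object_track["bounding_boxes"][start]  # [x1, y1, x2, y2]
    return [box] * num_frames
```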
Computational task | Metric |
---|---|
Single Object Tracking | Average IoU |
Single Point Tracking | Average Jaccard |
Temporal Action Localisation | Mean Average Precision |
Temporal Sound Localisation | Mean Average Precision |
Multiple-Choice Video QA | Top-1 Accuracy |
Grounded Video QA | HOTA |
Code for evaluating performance on the different tasks using these metrics is coming soon.
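Until then, the following minimal sketch illustrates the average-IoU metric for object tracking under the [x1, y1, x2, y2] box convention used in the annotations; it is an illustration, not the official evaluation code.

```python
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def average_iou(pred_boxes, gt_boxes):
    """Mean IoU over corresponding predicted and ground-truth boxes."""
    assert len(pred_boxes) == len(gt_boxes)
    return sum(iou(p, g) for p, g in zip(pred_boxes, gt_boxes)) / len(gt_boxes)
```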
Explore the annotations: data_visualisation.ipynb
Summary
Annotation type | Number of videos | Number of annotations |
---|---|---|
Object tracks | 11,609 | 189,940 |
Point tracks | 145 | 8,647 |
Action segments | 11,353 | 73,503 |
Sound segments | 11,433 | 137,128 |
Multiple-choice Questions | 10,361 | 38,060 |
Grounded video Questions | 3,063 | 6,086 |
Video metadata
Field Name | Description |
---|---|
split | The data split the video belongs to, one of ['train','valid','test']. |
video_id | The ID of the video, in the format 'video_xxxx'. |
frame_rate | The frame rate of the video in frames per second. |
num_frames | The total number of frames in the video. |
resolution | The height and width of the video in pixels. |
audio_samples | The total number of audio samples in the video. |
audio_sample_rate | The sample rate of the audio in the video in Hz. |
is_cup_game | Whether the video shows a cups-game or not, see paper for details. |
is_camera_moving | Whether the camera used to film the video is moving or not. |
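A minimal usage sketch, with annotations loaded as in the download sketch above; the nesting of 'metadata' inside each video entry is an assumption, so verify it against the JSON.

```python
def describe_video(annotations: dict, video_id: str) -> str:
    """Summarise one video using the metadata fields above."""
    meta = annotations[video_id]["metadata"]  # 'metadata' nesting assumed
    duration_s = meta["num_frames"] / meta["frame_rate"]
    return (f"{meta['video_id']} ({meta['split']}): {duration_s:.1f}s, "
            f"resolution {meta['resolution']}")
```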
Object tracks
Field Name | Description |
---|---|
id | A unique annotation ID for each object track.
label | The name of the object; can also contain object attributes, e.g. red box.
is_occluder | Whether the object occludes other objects in the video (valid only for the cups-games videos).
bounding_boxes | The coordinates of the object's bounding boxes in the format [x1, y1, x2, y2]; shape [n, 4], where n is the number of annotated frames.
initial_tracking_box | One-hot vector indicating which box annotation should be used to start the tracking for this object; shape [n].
frame_ids | The IDs of the frames that are annotated, normally 1 per second, e.g. 0, 30, 60, etc.; shape [n].
timestamps | The timestamps of the annotated frames in μs; shape [n].
is_masked | Whether the object is masked in the annotated frame; corresponds to the bounding boxes; shape [n] (valid only for the cups-games videos).
Point tracks
Field Name | Description |
---|---|
id | A unique annotation ID for each point track. |
label | The label of the point track. |
parent_objects | The ID of the object that the point belongs to.
frame_ids | The IDs of the frames that are annotated, normally 0, 1, 2, etc.; shape [N], where N is the total number of points in the track.
points | The coordinates of the points in [y, x] format; shape [N, 2].
Action segments
Field Name | Description |
---|---|
id | A unique annotation ID for each action segment. |
label | The templated class of the action segment, e.g. Putting something into something. |
parent_objects | The IDs of the objects involved in the action; can be empty, contain one or multiple IDs, or -1 for an object that is not annotated.
timestamps | The start and end timestamps of the action segment in μs [start time, end time]. |
frame_ids | The start and end frame IDs of the action segment [start frame, end frame]. |
label_id | A unique class ID for each label in the dataset. |
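Since timestamps are in microseconds while frame_ids are frame indices, converting between the two uses the video's frame_rate from the metadata. A minimal sketch, where segment and meta are assumed to follow the schemas above:

```python
def timestamp_us_to_frame(timestamp_us: float, frame_rate: float) -> int:
    """Map a microsecond timestamp to the nearest frame index."""
    return int(round(timestamp_us * frame_rate / 1_000_000))

def segment_frame_span(segment: dict, frame_rate: float) -> tuple:
    """Frame-index span of an action/sound segment following the schema above."""
    start_us, end_us = segment["timestamps"]
    return (timestamp_us_to_frame(start_us, frame_rate),
            timestamp_us_to_frame(end_us, frame_rate))
```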
Sound segments
Field Name | Description |
---|---|
id | A unique annotation ID for each sound segment. |
label | The name or class of the sound segment. |
parent_objects | The IDs of the objects related to this sound segment; can be empty, contain one or multiple IDs, or -1 for an object that is not annotated.
timestamps | The start and end timestamps of the sound segment in μs [start time, end time]. |
frame_ids | The start and end frame IDs of the sound segment [start frame, end frame]. |
is_visible | Whether the objects causing the sound in this segment are visible or not, e.g. if an object falls off the table and the impact point with the floor is occluded, then is_visible=False. |
label_id | A unique class ID for each label in the dataset. |
Multiple-choice video question-answers
Field Name | Description |
---|---|
id | A unique annotation ID for each question. |
question | The text of the question. |
options | The possible options for the question. There are 3 possible options, and only one is correct. |
answer_id | The ID of the correct option for the question. |
area | The skill area the question pertains to. Can be Memory, Abstraction, Physics, or Semantics.
reasoning | The type of reasoning required to answer the question. Can be Descriptive, Explanatory, Predictive, or Counterfactual. |
tag | Different skills involved in answering the given question. A question can have multiple skill tags. |
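For reference, a minimal sketch of a frequency-style baseline over this schema: it remembers the most frequently correct option index per training question text and falls back to option 0 for unseen questions. This is an illustration only, not the exact procedure from the paper.

```python
from collections import Counter, defaultdict

def build_frequency_baseline(train_questions):
    """train_questions: list of dicts following the schema above."""
    option_counts = defaultdict(Counter)
    for q in train_questions:
        option_counts[q["question"]][q["answer_id"]] += 1

    def predict(question_text: str) -> int:
        counts = option_counts.get(question_text)
        # Fall back to option 0 for question texts unseen in training.
        return counts.most_common(1)[0][0] if counts else 0

    return predict
```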
Grounded video question-answers
Field Name | Description |
---|---|
id | A unique annotation ID for each question. |
question | The text of the question. |
answers | The answer to the question, given as a list of IDs. These correspond to single object tracking annotation IDs, specifically the 'id' field for a given object in the same video.
area | The skill area the question pertains to. Can be Memory, Abstraction, Physics, or Semantics.
reasoning | The type of reasoning required to answer the question. Can be Descriptive, Explanatory, Predictive, or Counterfactual. |
If you have any questions, feedback, or require support regarding the Perception Test dataset or challenge, please contact us at [email protected].
@inproceedings{patraucean2023perception,
title={Perception Test: A Diagnostic Benchmark for Multimodal Video Models},
author={Viorica Pătrăucean and Lucas Smaira and Ankush Gupta and Adrià Recasens Continente and Larisa Markeeva and Dylan Banarse and Skanda Koppula and Joseph Heyward and Mateusz Malinowski and Yi Yang and Carl Doersch and Tatiana Matejovicova and Yury Sulsky and Antoine Miech and Alex Frechette and Hanna Klimczak and Raphael Koster and Junlin Zhang and Stephanie Winkler and Yusuf Aytar and Simon Osindero and Dima Damen and Andrew Zisserman and João Carreira},
booktitle={Advances in Neural Information Processing Systems},
year={2023},
url={https://openreview.net/forum?id=HYEGXFnPoq}
}
Copyright 2022 DeepMind Technologies Limited
All software is licensed under the Apache License, Version 2.0 (Apache 2.0); you may not use this file except in compliance with the Apache 2.0 license. You may obtain a copy of the Apache 2.0 license at: https://www.apache.org/licenses/LICENSE-2.0
All other materials are licensed under the Creative Commons Attribution 4.0 International License (CC-BY). You may obtain a copy of the CC-BY license at: https://creativecommons.org/licenses/by/4.0/legalcode
Unless required by applicable law or agreed to in writing, all software and materials distributed here under the Apache 2.0 or CC-BY licenses are distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the licenses for the specific language governing permissions and limitations under those licenses.
This is not an official Google product.