
better view here -> https://egocentricvision.github.io/EgocentricVision/

Surveys

Papers

Action / Activity Recognition

Action Recognition

Hand-Object Interactions

Unsupervised Domain Adaptation

Domain Generalization

Source Free Domain Adaptation

Test Time Training (Adaptation)

Zero-Shot Learning

Action Anticipation

Short-Term Action Anticipation

Long-Term Action Anticipation

Future Gaze Prediction

Trajectory prediction

Region prediction

Multi-Modalities

Audio-Visual

Depth

Thermal

Event

IMU

Temporal Segmentation (Action Detection)

Retrieval

Segmentation

Video-Language

Few-Shot Action Recognition

Gaze

From Third-Person to First-Person

NeRF

User Data from an Egocentric Point of View

Localization

Privacy protection

Tracking

Social Interactions

Multiple Egocentric Tasks

  • A Backpack Full of Skills: Egocentric Video Understanding with Diverse Task Perspectives - CVPR 2024. [project page]

  • EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models - Sijie Cheng, Zhicheng Guo, Jingwen Wu, Kechen Fang, Peng Li, Huaping Liu, Yang Liu, CVPR 2024. [code] [project page]

  • Multi-Task Learning of Object States and State-Modifying Actions from Web Videos - Tomáš Souček, Jean-Baptiste Alayrac, Antoine Miech, Ivan Laptev, Josef Sivic, TPAMI 2023. [code]

  • Ego4D: Around the World in 3,000 Hours of Egocentric Video - Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Cartillier, Sean Crane, Tien Do, Morrie Doulaty, Akshay Erapalli, Christoph Feichtenhofer, Adriano Fragomeni, Qichen Fu, Christian Fuegen, Abrham Gebreselasie, Cristina Gonzalez, James Hillis, Xuhua Huang, Yifei Huang, Wenqi Jia, Weslie Khoo, Jachym Kolar, Satwik Kottur, Anurag Kumar, Federico Landini, Chao Li, Yanghao Li, Zhenqiang Li, Karttikeya Mangalam, Raghava Modhugu, Jonathan Munro, Tullie Murrell, Takumi Nishiyasu, Will Price, Paola Ruiz Puentes, Merey Ramazanova, Leda Sari, Kiran Somasundaram, Audrey Southerland, Yusuke Sugano, Ruijie Tao, Minh Vo, Yuchen Wang, Xindi Wu, Takuma Yagi, Yunyi Zhu, Pablo Arbelaez, David Crandall, Dima Damen, Giovanni Maria Farinella, Bernard Ghanem, Vamsi Krishna Ithapu, C. V. Jawahar, Hanbyul Joo, Kris Kitani, Haizhou Li, Richard Newcombe, Aude Oliva, Hyun Soo Park, James M. Rehg, Yoichi Sato, Jianbo Shi, Mike Zheng Shou, Antonio Torralba, Lorenzo Torresani, Mingfei Yan, Jitendra Malik, CVPR 2022. [video]

  • Egocentric Video Task Translation - Zihui Xue, Yale Song, Kristen Grauman, Lorenzo Torresani, CVPR 2023. [project page]

Activity-context

Diffusion models

Video summarization

Applications

Human to Robot

Assistive Egocentric Vision

Popular Architectures

2D

  • GSM - Swathikiran Sudhakaran, Sergio Escalera, Oswald Lanz, CVPR 2020. [code]

  • TSM - Ji Lin, Chuang Gan, Song Han, ICCV 2019.

  • TBN - Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman, Dima Damen, ICCV 2019. [code]

  • TRN - Bolei Zhou, Alex Andonian, Aude Oliva, Antonio Torralba, ECCV 2018.

  • R(2+1)D - Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, Manohar Paluri, CVPR 2018.

  • TSN - Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, Luc Van Gool, ECCV 2016.
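
The 2D models above score a handful of sparsely sampled frames with an image backbone and then fuse the per-frame predictions (TSN's "segmental consensus" is a simple average). Below is a minimal PyTorch sketch of that idea only, using torchvision's ResNet-18 as a stand-in backbone and a placeholder class count; the papers above use BN-Inception/ResNet variants and dataset-specific heads.

```python
import torch
import torchvision.models as models

# Hedged sketch of TSN-style inference: sample one frame per segment, score each
# frame independently with a 2D CNN, then average the logits over segments.
# ResNet-18 and the 97-class head are illustrative placeholders, not the papers' setups.
backbone = models.resnet18()
backbone.fc = torch.nn.Linear(backbone.fc.in_features, 97)
backbone.eval()

num_segments = 8
frames = torch.randn(num_segments, 3, 224, 224)  # one RGB frame per segment

with torch.no_grad():
    per_frame_logits = backbone(frames)          # (num_segments, 97)
    video_logits = per_frame_logits.mean(dim=0)  # segmental consensus -> (97,)
print(video_logits.shape)
```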

3D

  • SlowFast - Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, Kaiming He, ICCV 2019.

  • I3D - Joao Carreira, Andrew Zisserman, CVPR 2017.
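
The 3D family instead consumes short clips as a single (batch, channels, frames, height, width) tensor. A rough sketch with the R(2+1)D-18 variant that ships with torchvision; I3D and SlowFast live in other codebases (e.g., pytorchvideo), and SlowFast additionally expects two clips sampled at different frame rates, but the clip layout is the same.

```python
import torch
from torchvision.models.video import r2plus1d_18

# Minimal sketch of clip-level inference with a 3D-style backbone.
# The model is randomly initialised here; pretrained weights are optional.
model = r2plus1d_18()
model.eval()

clip = torch.randn(1, 3, 16, 112, 112)  # batch, RGB, 16 frames, 112x112 crops

with torch.no_grad():
    logits = model(clip)  # (1, 400): torchvision's default Kinetics-400 head
print(logits.shape)
```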

RNN

  • RULSTM - Antonino Furnari, Giovanni Maria Farinella, ICCV 2019. [code]

  • LSTA - Swathikiran Sudhakaran, Sergio Escalera, Oswald Lanz, CVPR 2019. [code]
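
The recurrent models above summarize a sequence of per-frame features with LSTMs (RULSTM additionally "unrolls" into the future for anticipation, and LSTA adds spatial attention). The sketch below shows only that shared idea, with placeholder feature and class sizes; it is not either paper's architecture.

```python
import torch
import torch.nn as nn

# Placeholder sizes: 1024-d precomputed frame features, 106 action classes.
feat_dim, hidden, num_classes = 1024, 512, 106

lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
classifier = nn.Linear(hidden, num_classes)

frame_feats = torch.randn(2, 14, feat_dim)  # batch of 2 videos, 14 observed frames each
outputs, (h_n, _) = lstm(frame_feats)       # h_n[-1]: final summary state per video
logits = classifier(h_n[-1])                # (2, num_classes) clip-level prediction
print(logits.shape)
```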

Transformer

  • Ego-STAN - Jinman Park, Kimathi Kaai, Saad Hossain, Norikatsu Sumi, Sirisha Rambhatla, Paul Fieguth, WCVPR 2022.

  • XViT - Adrian Bulat, Juan-Manuel Perez-Rua, Swathikiran Sudhakaran, Brais Martinez, Georgios Tzimiropoulos, NeurIPS 2021.

  • TimeSformer - Gedas Bertasius, Heng Wang, Lorenzo Torresani, ICML 2021.

  • ViViT - Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid, ICCV 2021.
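
Video transformers such as TimeSformer tokenize each frame into patches and factorize self-attention into a temporal step followed by a spatial step ("divided space-time attention"). The snippet below is an illustrative re-implementation of just that attention split, with made-up tensor sizes; it is not the official code.

```python
import torch
import torch.nn as nn

# Token grid: batch B, T frames, N patches per frame, embedding dim D (all placeholders).
B, T, N, D = 2, 8, 196, 768
tokens = torch.randn(B, T, N, D)

temporal_attn = nn.MultiheadAttention(D, num_heads=8, batch_first=True)
spatial_attn = nn.MultiheadAttention(D, num_heads=8, batch_first=True)

# Temporal attention: each patch position attends across the T frames.
x = tokens.permute(0, 2, 1, 3).reshape(B * N, T, D)
x, _ = temporal_attn(x, x, x)
x = x.reshape(B, N, T, D).permute(0, 2, 1, 3)

# Spatial attention: each frame's patches attend to one another.
y = x.reshape(B * T, N, D)
y, _ = spatial_attn(y, y, y)
out = y.reshape(B, T, N, D)
print(out.shape)  # torch.Size([2, 8, 196, 768])
```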

Other EGO-Context

Datasets

  • [IndustReal] - The IndustReal dataset contains 84 videos, demonstrating how 27 participants perform maintenance and assembly procedures on a construction-toy assembly set. WACV 2024. [paper] [code]

  • IKEA Ego 3D Dataset - A novel dataset for ego-view 3D point cloud action recognition. The dataset consists of approximately 493k frames and 56 classes of intricate furniture assembly actions of four different furniture types. WACV 2024. [paper]

  • [EvIs-Kitchen] - The EvIs-Kitchen dataset is the first Video-Sensor-Sensor (V-S-S) interaction-focused dataset for ego-HAR tasks, capturing sequences of everyday kitchen activities. This dataset uses two inertial sensors on both wrists to better capture subject-object interactions. IEEE Sensors Journal 2024. [paper]

  • Ego-Exo4D - Ego-Exo4D, a vast multimodal multiview video dataset capturing skilled human activities in both egocentric and exocentric perspectives (e.g., sports, music, dance). With 800+ participants in 13 cities, it offers 1,422 hours of combined footage, featuring diverse activities in 131 natural scene contexts, ranging from 1 to 42 minutes per video. CVPR 2024. [paper]

  • [EgoExoLearn] - EgoExoLearn, a large-scale dataset that emulates the human demonstration following process, in which individuals record egocentric videos as they execute tasks guided by demonstration videos. EgoExoLearn contains egocentric and demonstration video data spanning 120 hours captured in daily life scenarios and specialized laboratories. CVPR 2024. [paper] [code]

  • OAKINK2 - A dataset of bimanual object manipulation tasks for complex daily activities. OAKINK2 introduces three levels of abstraction to organize the manipulation tasks: Affordance, Primitive Task, and Complex Task. The OAKINK2 dataset provides multi-view image streams and precise pose annotations for the human body, hands, and various interacting objects. This extensive collection supports applications such as interaction reconstruction and motion synthesis. CVPR 2024. [paper]

  • UnrealEgo2-UnrealEgo-RW - UnrealEgo2 Dataset: An expanded dataset capturing over 15,200 motions of realistic 3D human models with a glasses-based device, offering 1.25 million stereo views and comprehensive joint annotations. UnrealEgo-RW Dataset: A real-world dataset utilizing a compact mobile device with fisheye cameras, designed for versatile egocentric image capture in various environments. CVPR 2024. [paper] [code]

  • [TF2023] - A novel dataset featuring synchronized first-person and third-person views, including masks of camera wearers linked to their respective views. It consists of 208,794 training and 87,449 testing image pairs, with no actor overlap between sets. Each scene averages 4.29 actors, focusing on complex interactions like puzzle games, enhancing its value for cross-view matching in egocentric vision. CVPR 2024. [paper] [code]

  • TACO - A large-scale dataset of real-world bimanual tool-object interactions, featuring 131 tool-action-object triplets across 2.5K motion sequences and 5.2M frames with egocentric and 3rd-person views. TACO enables benchmarks in action recognition, hand-object motion forecasting, and grasp synthesis, advancing generalization research in human-object interactions. CVPR 2024. [paper]

  • [BioVL-QR] - A biochemical vision-and-language dataset consisting of 24 egocentric experiment videos, corresponding protocols, and video-and-language alignments. The accompanying study uses Micro QR Codes to detect objects automatically, but reports that detection from Micro QR Codes alone remains difficult because the researchers manipulate objects, frequently causing blur and occlusion. 2024. [paper]

  • HOI-Ref - Consists of 3.9M question-answer pairs for training and evaluating VLMs. HOI-QA includes questions about locating hands, objects, and, critically, their interactions (e.g., referring to the object being manipulated by the hand). 2024. [paper]

  • HOT3D - HOT3D is a benchmark dataset for egocentric vision-based understanding of 3D hand-object interactions. The dataset offers over 833 minutes (more than 3.7M images) of multi-view RGB/monochrome image streams showing 19 subjects interacting with 33 diverse rigid objects, multi-modal signals such as eye gaze or scene point clouds, as well as comprehensive ground-truth annotations including 3D poses of objects, hands, and cameras, and 3D models of hands and objects. 2024. [paper] [code]

  • [ADL4D] - ADL4D dataset offers a novel perspective on human-object interactions, providing video sequences of everyday activities involving multiple people and objects interacting simultaneously. 2024. [paper]

  • ENIGMA-51 - ENIGMA-51 is a new egocentric dataset acquired in an industrial scenario by 19 subjects who followed instructions to complete the repair of electrical boards using industrial tools (e.g., electric screwdriver) and equipment (e.g., oscilloscope). The 51 egocentric video sequences are densely annotated with a rich set of labels that enable the systematic study of human behavior in the industrial domain. WACV 2023. [paper]

  • VidChapters-7M - VidChapters-7M, a dataset of 817K user-chaptered videos including 7M chapters in total. VidChapters-7M is automatically created from videos online in a scalable manner by scraping user-annotated chapters and hence without any additional manual annotation. NeurIPS 2023. [paper]

  • POV-Surgery - POV-Surgery, a large-scale, synthetic, egocentric dataset focusing on pose estimation for hands with different surgical gloves and three orthopedic surgical instruments, namely scalpel, friem, and diskplacer. Our dataset consists of 53 sequences and 88,329 frames, featuring high-resolution RGB-D video streams with activity annotations, accurate 3D and 2D annotations for hand-object pose, and 2D hand-object segmentation masks. MICCAI 2023. [paper]

  • CaptainCook4D - CaptainCook4D, comprising 384 recordings (94.5 hours) of people performing recipes in real kitchen environments. This dataset consists of two distinct types of activity: one in which participants adhere to the provided recipe instructions and another in which they deviate and induce errors. We provide 5.3K step annotations and 10K fine-grained action annotations and benchmark the dataset for the following tasks: supervised error recognition, multistep localization, and procedure learning. ICMLW 2023. [paper]

  • ARGO1M - Action Recognition Generalisation dataset (ARGO1M) from videos and narrations from Ego4D. ARGO1M is the first to test action generalisation across both scenario and location shifts, and is the largest domain generalisation dataset across images and video. ICCV 2023. [paper]

  • [EgoObjects] - EgoObjects, a large-scale egocentric dataset for fine-grained object understanding. It contains over 9K videos collected by 250 participants from 50+ countries using 4 wearable devices, and over 650K object annotations from 368 object categories. ICCV 2023. [paper] [code]

  • HoloAssist - HoloAssist: a large-scale egocentric human interaction dataset that spans 166 hours of data captured by 350 unique instructor-performer pairs, wearing mixed-reality headsets during collaborative tasks. ICCV 2023. [paper]

  • AssemblyHands - AssemblyHands, a large-scale benchmark dataset with accurate 3D hand pose annotations, to facilitate the study of egocentric activities with challenging hand-object interactions. CVPR 2023. [paper]

  • EpicSoundingObject - Epic Sounding Object dataset with sounding object annotations to benchmark the localization performance in egocentric videos. CVPR 2023. [paper] [code]

  • VOST - Video Object Segmentation under Transformations (VOST). It consists of more than 700 high-resolution videos, captured in diverse environments, which are 20 seconds long on average and densely labeled with instance masks. CVPR 2023. [paper]

  • ARCTIC - A dataset of 2.1 million video frames showing two hands skillfully manipulating objects. It includes precise 3D models of the hands and objects, as well as detailed, dynamic contact information. The dataset features two-handed actions with objects like scissors and laptops, capturing the changing hand positions and object states over time. CVPR 2023. [paper]

  • Aria Digital Twin - Aria Digital Twin (ADT) - an egocentric dataset captured using Aria glasses with extensive object, environment, and human-level ground truth. This ADT release contains 200 sequences of real-world activities conducted by Aria wearers. ADT supports very challenging research problems such as 3D object detection and tracking, scene reconstruction and understanding, sim-to-real learning, and human pose prediction, while also inspiring new machine perception tasks for augmented reality (AR) applications. 2023. [paper] [code]

  • WEAR - The dataset comprises data from 18 participants performing a total of 18 different workout activities with untrimmed inertial (acceleration) and camera (egocentric video) data recorded at 10 different outside locations. 2023. [paper]

  • EPIC Fields - EPIC Fields, an augmentation of EPIC-KITCHENS with 3D camera information. Like other datasets for neural rendering, EPIC Fields removes the complex and expensive step of reconstructing cameras using photogrammetry, and allows researchers to focus on modelling problems. 2023. [paper]

  • [EGOFALLS] - The dataset comprises 10,948 video samples from 14 subjects, focusing on falls among the elderly. Multimodal descriptors are extracted from the egocentric camera videos. 2023. [paper]

  • [Exo2EgoDVC] - EgoYC2, a novel egocentric dataset, adapts procedural captions from YouCook2 to cooking videos re-recorded with head-mounted cameras. Unique in its weakly-paired approach, it aligns caption content with exocentric videos, distinguishing itself from other datasets focused on action labels. 2023. [paper]

  • EgoWholeBody - EgoWholeBody, a large synthetic dataset comprising 840,000 high-quality egocentric images captured across a diverse range of whole-body motion sequences. The accompanying paper reports quantitative and qualitative evaluations showing high-quality whole-body motion estimates from a single egocentric camera. 2023. [paper]

  • [IT3DEgo] - IT3DEgo dataset: Addresses 3D instance tracking using egocentric sensors (AR/VR). Recorded in diverse indoor scenes with HoloLens2, it comprises 50 recordings (5+ minutes each). Evaluates tracking performance in 3D coordinates, leveraging camera pose and allocentric representation. 2023. [paper] [code]

  • Touch and Go - A dataset in which human data collectors walk through a variety of environments, probing objects with tactile sensors and simultaneously recording their actions on video. NeurIPS 2022. [paper] [code]

  • EPIC-Visor - VISOR, a new dataset of pixel annotations and a benchmark suite for segmenting hands and active objects in egocentric video. NeurIPS 2022. [paper]

  • AssistQ - A new dataset comprising 529 question-answer samples derived from 100 newly filmed first-person videos. Each question must be answered with multi-step guidance inferred from visual details (e.g., button positions) and textual details (e.g., actions like press/turn). ECCV 2022. [paper]

  • EgoProceL - EgoProceL dataset focuses on the key-steps required to perform a task instead of every action in the video. EgoProceL consists of 62 hours of videos captured by 130 subjects performing 16 tasks. ECCV 2022. [paper]

  • EgoHOS - EgoHOS, a labeled dataset consisting of 11,243 egocentric images with per-pixel segmentation labels of hands and objects being interacted with during a diverse array of daily activities. Our dataset is the first to label detailed hand-object contact boundaries. ECCV 2022. [paper] [code]

  • UnrealEgo - UnrealEgo, a new large-scale naturalistic dataset for egocentric 3D human pose estimation. It is the first dataset to provide in-the-wild stereo images with the largest variety of motions among existing egocentric datasets. ECCV 2022. [paper]

  • Assembly101 - Procedural activity dataset featuring 4321 videos of people assembling and disassembling 101 “take-apart” toy vehicles. CVPR 2022. [paper]

  • EgoPAT3D - A large multimodality dataset of more than 1 million frames of RGB-D and IMU streams, with evaluation metrics based on high-quality 2D and 3D labels from semi-automatic annotation. CVPR 2022. [paper]

  • AGD20K - Affordance dataset constructed by collecting and labeling over 20K images from 36 affordance categories. CVPR 2022. [paper]

  • HOI4D - A large-scale 4D egocentric dataset with rich annotations, to catalyze the research of category-level human-object interaction. HOI4D consists of 2.4M RGB-D egocentric video frames over 4000 sequences collected by 4 participants interacting with 800 different object instances from 16 categories over 610 different indoor rooms. Frame-wise annotations for panoptic segmentation, motion segmentation, 3D hand pose, category-level object pose and hand action have also been provided, together with reconstructed object meshes and scene point clouds. CVPR 2022. [paper]

  • EgoPW - A dataset captured by a head-mounted fisheye camera and an auxiliary external camera, which provides an additional observation of the human body from a third-person perspective. CVPR 2022. [paper]

  • Ego4D - 3,025 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 855 unique camera wearers from 74 worldwide locations and 9 different countries. CVPR 2022. [paper]

  • N-EPIC-Kitchens - N-EPIC-Kitchens, the first event-based camera extension of the large-scale EPIC-Kitchens dataset. CVPR 2022. [paper]

  • EasyCom-Clustering - The first large-scale egocentric video face clustering dataset. 2022. [paper]

  • First2Third-Pose - A new paired synchronized dataset of nearly 2,000 videos depicting human activities captured from both first- and third-view perspectives. 2022. [paper]

  • TREK-100 - Object tracking in first person vision. WICCV 2021. [paper]

  • [BioVL] - A novel biochemical video-and-language (BioVL) dataset, which consists of experimental videos, corresponding protocols, and annotations of alignment between events in the video and instructions in the protocol. 16 videos from four protocols with a total length of 1.6 hours. WICCV 2021. [paper]

  • MECCANO - 20 subjects assembling a toy motorbike. WACV 2021. [paper]

  • EPIC-Kitchens 2020 - Subjects performing unscripted actions in their native environments. IJCV 2021. [paper]

  • H2O - H2O (2 Hands and Objects), provides synchronized multi-view RGB-D images, interaction labels, object classes, ground-truth 3D poses for left & right hands, 6D object poses, ground-truth camera poses, object meshes and scene point clouds. ICCV 2021. [paper]

  • HOMAGE - Home Action Genome (HOMAGE): a multi-view action dataset with multiple modalities and view-points supplemented with hierarchical activity and atomic action labels together with dense scene composition labels. CVPR 2021. [paper]

  • EgoCom - A natural conversations dataset containing multi-modal human communication data captured simultaneously from the participants' egocentric perspectives. TPAMI 2020. [paper]

  • EGO-CH - 70 subjects visiting two cultural sites in Sicily, Italy. Pattern Recognition Letters 2020. [paper]

  • EPIC-Tent - 29 participants assembling a tent while wearing two head-mounted cameras. ICCV 2019. [paper]

  • EPIC-Kitchens 2018 - 32 subjects performing unscripted actions in their native environments. ECCV 2018. [paper]

  • Charades-Ego - Paired first-third person videos.

  • EGTEA Gaze+ - 32 subjects, 86 cooking sessions, 28 hours.

  • ADL - 20 subjects performing daily activities in their native environments.

  • CMU kitchen - Multimodal, 18 subjects cooking 5 different recipes: brownies, eggs, pizza, salad, sandwich.

  • EgoSeg - Long term actions (walking, running, driving, etc.).

  • First-Person Social Interactions - 8 subjects at Disney World.

  • UEC Dataset - Two choreographed datasets with different ego-actions (walk, jump, climb, etc.) + 6 YouTube sports videos.

  • JPL - Interaction with a robot.

  • FPPA - Five subjects performing 5 daily actions.

  • UT Egocentric - 3-5 hours long videos capturing a person's day.

  • VINST/ Visual Diaries - 31 videos capturing the visual experience of a subject walking from metro station to work.

  • Bristol Egocentric Object Interaction (BEOID) - 8 subjects, six locations. Interaction with objects and environment.

  • Object Search Dataset - 57 sequences of 55 subjects on search and retrieval tasks.

  • UNICT-VEDI - Different subjects visiting a museum.

  • UNICT-VEDI-POI - Different subjects visiting a museum.

  • Simulated Egocentric Navigations - Simulated navigations of a virtual agent within a large building.

  • EgoCart - Egocentric images collected by a shopping cart in a retail store.

  • Unsupervised Segmentation of Daily Living Activities - Egocentric videos of daily activities.

  • Visual Market Basket Analysis - Egocentric images collected by a shopping cart in a retail store.

  • Location Based Segmentation of Egocentric Videos - Egocentric videos of daily activities.

  • Recognition of Personal Locations from Egocentric Videos - Egocentric video clips of daily activities.

  • EgoGesture - 2k videos from 50 subjects performing 83 gestures.

  • EgoHands - 48 videos of interactions between two people.

  • DoMSEV - 80 hours/different activities.

  • DR(eye)VE - 74 videos of people driving.

  • THU-READ - 8 subjects performing 40 actions with a head-mounted RGBD camera.

  • EgoDexter - 4 sequences with 4 actors (2 female), and varying interactions with various objects and cluttered backgrounds. [paper]

  • First-Person Hand Action (FPHA) - 3D hand-object interaction. Includes 1175 videos belonging to 45 different activity categories performed by 6 actors. [paper]

  • UTokyo Paired Ego-Video (PEV) - 1,226 pairs of first-person clips extracted from videos recorded synchronously during dyadic conversations.

  • UTokyo Ego-Surf - Contains 8 diverse groups of first-person videos recorded synchronously during face-to-face conversations.

  • TEgO: Teachable Egocentric Objects Dataset - Contains egocentric images of 19 distinct objects taken by two people for training a teachable object recognizer.

  • Multimodal Focused Interaction Dataset - Contains 377 minutes of continuous multimodal recording captured during 19 sessions, with 17 conversational partners in 18 different indoor/outdoor locations.

Not Yet Explored Tasks

Challenges

  • Ego4D - Episodic Memory, Hand-Object Interactions, AV Diarization, Social, Forecasting.

  • EPIC-Kitchens Challenge - Action Recognition, Action Detection, Action Anticipation, Unsupervised Domain Adaptation for Action Recognition, Multi-Instance Retrieval.

  • MECCANO - Multimodal Action Recognition (RGB-Depth-Gaze).

Devices

This is a work in progress...
