Visual Understanding
Detection and Tracking Datasets:
- ImageNet DET: object detection in still images across 200 classes, a superset of the ImageNet VID classes
- ImageNet VID: video object detection with instance IDs
- MOT16: 14 video sequences filmed in unconstrained environments for multi-object tracking (see the parsing sketch after this list)
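A quick way to inspect MOT16 annotations is to parse the MOTChallenge ground-truth CSV, where each row is `frame, id, bb_left, bb_top, bb_width, bb_height, conf, class, visibility`. A minimal sketch; the sequence path is an assumption about where the dataset was unpacked:

```python
# Minimal sketch: parse a MOT16 ground-truth file (MOTChallenge CSV format).
# Each row: frame, id, bb_left, bb_top, bb_width, bb_height, conf, class, visibility.
import csv
from collections import defaultdict

def load_mot16_gt(path):
    """Map frame number -> list of (track_id, (x, y, w, h)) boxes."""
    tracks = defaultdict(list)
    with open(path, newline='') as f:
        for row in csv.reader(f):
            if int(row[6]):  # conf flag 0 marks boxes excluded from evaluation
                tracks[int(row[0])].append((int(row[1]), tuple(map(float, row[2:6]))))
    return tracks

# Assumed local layout: MOT16/train/<sequence>/gt/gt.txt
boxes = load_mot16_gt('MOT16/train/MOT16-02/gt/gt.txt')
print(len(boxes), 'annotated frames')
```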
Code:
- Detect to Track and Track to Detect
- Object detection from video tubelets with convolutional neural networks
Action Recognition Datasets:
- AVA: fine-grained audiovisual annotations of long human action videos
- ActivityNet: large-scale video benchmark for human activity understanding
- Kinetics: large-scale, high-quality action videos from YouTube, with up to 700 classes (a loading sketch follows this list)
- THUMOS: action recognition in temporally untrimmed videos
- Jester: humans performing hand gestures in front of a webcam
- Something-Something: humans performing basic actions with everyday objects
- Charades: people recording videos at home, acting out casual everyday activities
- Charades-Ego: indoor activities collected through AMT, recorded from both third- and first-person viewpoints
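For a sense of how these clip datasets are consumed, here is a minimal loading sketch using torchvision's `Kinetics400` dataset class; the root path, clip length, and file extension are assumptions about a local copy (videos must already be downloaded into one folder per class):

```python
# Minimal sketch: sample fixed-length clips from a local Kinetics copy with torchvision.
import torchvision

dataset = torchvision.datasets.Kinetics400(
    root='datasets/kinetics400/train',  # assumed layout: <root>/<class>/<video>.mp4
    frames_per_clip=16,                 # clip length in frames (a common choice)
    step_between_clips=16,              # sample non-overlapping clips
    extensions=('mp4',),
)
video, audio, label = dataset[0]        # video: uint8 tensor of shape (T, H, W, C)
print(video.shape, label)
```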
Combo:
- mmdetection (inference sketch below)
  - SOTA model zoo including Faster/Mask R-CNN, RetinaNet, and DCN-v2, but not YOLO
  - up-to-date optimizations including Group Normalization (GN) and Weight Standardization (WS)
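A minimal inference sketch against the mmdetection model zoo (the 2.x Python API; the config and checkpoint paths are placeholders for any zoo entry):

```python
# Minimal sketch: run a model-zoo detector with mmdetection's high-level API.
from mmdet.apis import init_detector, inference_detector

# Placeholder paths: pair any config from configs/ with its zoo checkpoint.
config = 'configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py'
checkpoint = 'checkpoints/faster_rcnn_r50_fpn_1x_coco.pth'

model = init_detector(config, checkpoint, device='cuda:0')
result = inference_detector(model, 'demo.jpg')  # list of per-class (N, 5) box arrays
print(sum(len(cls_boxes) for cls_boxes in result), 'detections')
```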
One Stage:
- YOLOv3: the fastest of the detectors listed here
- RetinaMask: a one-stage counterpart of Mask R-CNN (RetinaNet with an added mask-prediction head)
Two Stage: