Visual Understanding
Detection and Tracking Datasets:
- ImageNet DET: object detection in still images across 200 classes, a superset of the ImageNet VID classes
- ImageNet VID: video object detection with instance IDs
- MOT16: 14 video sequences filmed in unconstrained environments for multi-object tracking (see the parsing sketch after this list)
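A quick way to inspect MOT16 annotations is to parse the MOTChallenge ground-truth CSV, where each row is `frame, id, bb_left, bb_top, bb_width, bb_height, conf, class, visibility`. A minimal sketch; the sequence path is an assumption about where the dataset was unpacked:

```python
# Minimal sketch: parse a MOT16 ground-truth file (MOTChallenge CSV format).
# Each row: frame, id, bb_left, bb_top, bb_width, bb_height, conf, class, visibility.
import csv
from collections import defaultdict

def load_mot16_gt(path):
    """Map frame number -> list of (track_id, (x, y, w, h)) boxes."""
    tracks = defaultdict(list)
    with open(path, newline='') as f:
        for row in csv.reader(f):
            if int(row[6]):  # conf flag 0 marks boxes excluded from evaluation
                tracks[int(row[0])].append((int(row[1]), tuple(map(float, row[2:6]))))
    return tracks

# Assumed local layout: MOT16/train/<sequence>/gt/gt.txt
boxes = load_mot16_gt('MOT16/train/MOT16-02/gt/gt.txt')
print(len(boxes), 'annotated frames')
```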
Code:
- Detect to Track and Track to Detect
- Object detection from video tubelets with convolutional neural networks
Action Recognition Datasets:
- AVA: fine-grained audiovisual annotations of long human action videos
- ActivityNet: large-scale video benchmark for human activity understanding
- Kinetics: large-scale, high-quality action videos from YouTube, with up to 700 classes (a loading sketch follows this list)
- THUMOS: action recognition in temporally untrimmed videos
- Jester: humans performing hand gestures in front of a webcam
- Something-Something: humans performing basic actions with everyday objects
- Charades: people recording videos at home, acting out casual everyday activities
- Charades-Ego: indoor activities collected through AMT, recorded from both third- and first-person viewpoints
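For a sense of how these clip datasets are consumed, here is a minimal loading sketch using torchvision's `Kinetics400` dataset class; the root path, clip length, and file extension are assumptions about a local copy (videos must already be downloaded into one folder per class):

```python
# Minimal sketch: sample fixed-length clips from a local Kinetics copy with torchvision.
import torchvision

dataset = torchvision.datasets.Kinetics400(
    root='datasets/kinetics400/train',  # assumed layout: <root>/<class>/<video>.mp4
    frames_per_clip=16,                 # clip length in frames (a common choice)
    step_between_clips=16,              # sample non-overlapping clips
    extensions=('mp4',),
)
video, audio, label = dataset[0]        # video: uint8 tensor of shape (T, H, W, C)
print(video.shape, label)
```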
Combo:
- mmdetection (inference sketch below)
  - SOTA model zoo including Faster/Mask R-CNN, RetinaNet, and DCN-v2, but not YOLO
  - up-to-date optimizations including Group Normalization (GN) and Weight Standardization (WS)
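A minimal inference sketch against the mmdetection model zoo (the 2.x Python API; the config and checkpoint paths are placeholders for any zoo entry):

```python
# Minimal sketch: run a model-zoo detector with mmdetection's high-level API.
from mmdet.apis import init_detector, inference_detector

# Placeholder paths: pair any config from configs/ with its zoo checkpoint.
config = 'configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py'
checkpoint = 'checkpoints/faster_rcnn_r50_fpn_1x_coco.pth'

model = init_detector(config, checkpoint, device='cuda:0')
result = inference_detector(model, 'demo.jpg')  # list of per-class (N, 5) box arrays
print(sum(len(cls_boxes) for cls_boxes in result), 'detections')
```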
One Stage:
- YOLOv3: the fastest of the detectors listed here
- RetinaMask: a one-stage counterpart of Mask R-CNN (RetinaNet with an added mask-prediction head)
Two Stage: