Visual Understanding

Farley Lai edited this page Apr 30, 2021 · 1 revision

Image Understanding

Video Understanding

Video Object Detection and Tracking

Datasets:

  • ImageNet DET: object detection in still images; its classes include those used in VID
  • ImageNet VID: video object detection with instance IDs
  • MOT16: 14 video sequences filmed in unconstrained environments for multiple-object tracking

Code:

Action Recognition

Datasets:

  • AVA: fine-grained spatiotemporal annotations of atomic actions in long human action videos
  • ActivityNet: large-scale video benchmark for human activity understanding
  • Kinetics: large-scale, high-quality action videos from YouTube with up to 700 classes
  • THUMOS: action recognition in temporally untrimmed video
  • Jester: humans performing hand gestures in front of a webcam
  • Something-Something: humans performing basic actions with everyday objects
  • Charades: people recording videos at home, acting out casual everyday activities
  • Charades-Ego: indoor activities recorded via AMT from both third-person and first-person views

Object Detectors

Combo:

  • mmdetection
    • SOTA model zoo including Faster/Mask R-CNN, RetinaNet, DCN-v2, etc., but not YOLO
    • up-to-date optimizations including Group Normalization (GN) and Weight Standardization (WS)
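The two optimizations named above are simple to illustrate. Below is a minimal NumPy sketch of Group Normalization (normalize activations per sample over channel groups) and Weight Standardization (zero-mean, unit-std conv filters); shapes, group counts, and function names are illustrative assumptions, not mmdetection's actual implementation:

```python
import numpy as np

def group_norm(x, num_groups, eps=1e-5):
    """Group Normalization over an NCHW activation tensor.

    Channels are split into `num_groups` groups; mean and variance are
    computed per sample and per group over (channels-in-group, H, W).
    """
    n, c, h, w = x.shape
    assert c % num_groups == 0, "channels must divide evenly into groups"
    xg = x.reshape(n, num_groups, c // num_groups, h, w)
    mean = xg.mean(axis=(2, 3, 4), keepdims=True)
    var = xg.var(axis=(2, 3, 4), keepdims=True)
    xg = (xg - mean) / np.sqrt(var + eps)
    return xg.reshape(n, c, h, w)

def weight_standardize(w, eps=1e-5):
    """Weight Standardization for a conv kernel of shape
    (out_channels, in_channels, kh, kw): standardize each output filter."""
    mean = w.mean(axis=(1, 2, 3), keepdims=True)
    std = w.std(axis=(1, 2, 3), keepdims=True)
    return (w - mean) / (std + eps)

# Example: batch of 2, 8 channels split into 2 groups, 4x4 feature maps.
x = np.random.randn(2, 8, 4, 4)
y = group_norm(x, num_groups=2)
```

Unlike Batch Normalization, the statistics here are per-sample, which is why GN holds up at small batch sizes — the regime where high-resolution detection models usually train.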

One Stage:

Two Stage:

Compilations
