1603.07763.md
Seeing Invisible Poses: Estimating 3D Body Pose from Egocentric Video, CVPR'17 {paper} {project page} {code.gz} {dataset.zip}
Hao Jiang, Kristen Grauman
Goes beyond previous work that reconstructs only visible first-person poses (visible arms).
Learns a prior over full-body motion given visible cues from the environment.
- Collected with both 3rd-person (Kinect) and 1st-person (chest-mounted GoPro) views; both provide RGB streams
- Ground-truth human poses captured with a Kinect V2 sensor
- 3D positions of the 25 body joints defined in the MS Kinect SDK
- Chest-mounted camera provides the egocentric view
- 18 ground-truth videos; 3 videos for training, the rest for testing
- 10 subjects performing normal daily activities
- Handles pose estimation as a per-frame classification task
- k-means with the L2 norm on all ground-truth poses in the training set to obtain K=300 pose clusters, which serve as the per-frame class labels (see the clustering sketch after this list)
- Dynamic features
  - use optical flow to compute point correspondences, which are used to estimate a homography (underlying assumption that the scene is planar? Only one homography for the full scene is computed, as far as I understand; I am missing something here)
  - use homographies between consecutive frames to estimate camera rotation, assuming rotation dominates over translation and that camera intrinsics are known (see the flow/homography sketch after this list)
- Static features
  - collect a dataset of standing vs. sitting frames and train a classifier to distinguish the two postures
- Additional temporal model over 1-3 minute sequences that
  - constrains transitions between pose clusters to those observed in the training set (see the smoothing sketch after this list)
  - encourages consistency between predictions from static and motion features
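A minimal sketch of the pose-quantization step, assuming ground-truth poses come as (N, 25, 3) arrays of Kinect joint positions; the function and variable names are illustrative, not from the released code:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_pose_clusters(train_poses, k=300, seed=0):
    """Cluster flattened 3D poses with k-means (L2 distance) into k pose classes."""
    flat = train_poses.reshape(len(train_poses), -1)   # (N, 75)
    return KMeans(n_clusters=k, random_state=seed, n_init=10).fit(flat)

def pose_to_class(km, pose):
    """Map a single 25x3 pose to its nearest cluster id (the classification label)."""
    return int(km.predict(pose.reshape(1, -1))[0])

# Example: quantize a (synthetic) training set and label one pose.
train_poses = np.random.rand(5000, 25, 3)
km = build_pose_clusters(train_poses, k=300)
label = pose_to_class(km, train_poses[0])
```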
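A sketch of the motion-feature step, assuming OpenCV and a known intrinsic matrix K; the paper's exact feature extraction may differ, this only illustrates flow -> correspondences -> single homography -> rotation under a rotation-dominant model:

```python
import cv2
import numpy as np

def camera_rotation(prev_gray, curr_gray, K):
    """Estimate inter-frame camera rotation from tracked points."""
    # Sparse points in the previous frame, tracked with pyramidal Lucas-Kanade flow.
    p0 = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500, qualityLevel=0.01, minDistance=8)
    p1, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, p0, None)
    good0 = p0[status.ravel() == 1].reshape(-1, 2)
    good1 = p1[status.ravel() == 1].reshape(-1, 2)

    # One homography for the whole frame, robust to outliers via RANSAC.
    H, _ = cv2.findHomography(good0, good1, cv2.RANSAC, 3.0)

    # For a purely rotating camera, H = K R K^-1, so R ~ K^-1 H K (up to scale).
    R = np.linalg.inv(K) @ H @ K
    # Project onto the closest true rotation matrix.
    U, _, Vt = np.linalg.svd(R)
    R = U @ Vt
    if np.linalg.det(R) < 0:
        R = -R
    return R
```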
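The paper's temporal model is a more elaborate optimization; the following is only a Viterbi-style sketch of how transitions between pose clusters can be restricted to those seen in training, given per-frame class scores (e.g. from the static/motion classifiers):

```python
import numpy as np

def smooth_with_transitions(frame_scores, allowed):
    """frame_scores: (T, K) per-frame log-scores; allowed: (K, K) boolean matrix
    where allowed[i, j] is True if transition cluster i -> j occurs in training."""
    T, K = frame_scores.shape
    trans_penalty = np.where(allowed, 0.0, -np.inf)   # forbid unseen transitions
    dp = frame_scores[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = dp[:, None] + trans_penalty          # (K, K): prev i -> current j
        back[t] = np.argmax(scores, axis=0)           # best predecessor per cluster
        dp = scores[back[t], np.arange(K)] + frame_scores[t]
    # Backtrack the highest-scoring pose-cluster path.
    path = [int(np.argmax(dp))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```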
Runtime: 0.5 s per frame
Reports mean errors in cm for the different joints (metric sketched below).
Compares to several 3rd-person baselines on their dataset. For upper-body joints, results are slightly better than an always-standing baseline (which predicts a fixed standing pose); the improvement on lower-body joints is clearer.
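A small sketch of the reported metric, assuming predicted and ground-truth poses are (T, 25, 3) arrays in centimeters (array names are illustrative):

```python
import numpy as np

def mean_joint_error_cm(pred, gt):
    """Mean Euclidean distance per joint over all frames, in cm -> shape (25,)."""
    return np.linalg.norm(pred - gt, axis=2).mean(axis=0)
```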