Review: DeepCut & DeeperCut — Multi Person Pose Estimation (Human Pose Estimation)

Deep Learning Based CNN for Part Labeling and Part Clustering

Sik-Ho Tsang
Mar 14, 2020
(a) initial detections (= part candidates) and pairwise terms (graph) between all detections, (b) detections jointly clustered as belonging to one person, (c) the predicted pose sticks

In this story, DeepCut & DeeperCut are briefly reviewed. First, human part labeling and part clustering are obtained through a Convolutional Neural Network (CNN). Then, an Integer Linear Program (ILP) is set up and solved to estimate the poses of multiple persons. DeepCut & DeeperCut are 2016 CVPR and 2016 ECCV papers respectively, each with more than 400 citations. (Sik-Ho Tsang @ Medium)

Outline

  1. DeepCut Architecture
  2. DeeperCut Architecture
  3. Integer Linear Program (ILP)
  4. DeepCut & DeeperCut Results

1. DeepCut Architecture

DeepCut Using VGGNet as Backbone

1.1. Adapted Fast R-CNN (AFR-CNN)

  • A modified Fast R-CNN, called Adapted Fast R-CNN (AFR-CNN), is used. The modified parts are proposal generation and detection region size. (Please refer to Fast R-CNN for more details.)
  • For proposal generation, DPM-based part detectors are used instead of selective search (SS), since this is a human pose estimation task. (DPM: Deformable Part Model)
  • The K top-scoring detections from each part detector are collected into a common pool of N part-independent proposals, and these proposals are used as input to AFR-CNN. (N=2,000 for single person and N=20,000 for multiple persons.)
  • Detection region size is increased to capture more context around each part.
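
The proposal pooling and region enlargement described above can be sketched as follows. This is only an illustrative sketch, not the authors' code: the data layout, the function names `pool_proposals` and `enlarge_box`, and the enlargement factor of 2.0 are all assumptions made for the example.

```python
def pool_proposals(detections_per_part, k):
    """Merge the k top-scoring detections of every part detector into one pool.

    detections_per_part maps a part name to a list of (score, box) tuples,
    with box = (x1, y1, x2, y2). This layout is an assumption for the sketch.
    """
    pool = []
    for _part, dets in detections_per_part.items():
        top_k = sorted(dets, key=lambda d: d[0], reverse=True)[:k]
        # Drop the part label and score: pooled proposals are part-independent.
        pool.extend(box for _score, box in top_k)
    return pool


def enlarge_box(box, scale):
    """Grow a box around its centre so that it covers more context."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half_w = (x2 - x1) * scale / 2.0
    half_h = (y2 - y1) * scale / 2.0
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


# Toy detections (hypothetical scores and boxes).
detections = {
    "head": [(0.9, (10, 10, 30, 30)), (0.2, (50, 50, 70, 70)), (0.6, (12, 8, 32, 28))],
    "wrist": [(0.7, (40, 80, 50, 90)), (0.8, (42, 82, 52, 92))],
}
proposals = pool_proposals(detections, k=2)
enlarged = [enlarge_box(b, scale=2.0) for b in proposals]  # scale is an assumed value
```

The pooled boxes would then be classified by AFR-CNN into part classes, regardless of which part detector originally produced them.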

1.2. Dense-CNN

  • Then, a fully convolutional VGGNet is developed for computing part probability scoremaps.
  • Since the original stride of 32 is too coarse for precise part localization, the hole algorithm (dilated convolution) is used to reduce the stride to 8.
  • (Hole algorithm or dilated convolution is commonly used in segmentation task such as DeepLab and DilatedNet series, i.e. DeepLabv1 & DeepLabv2, DeepLabv3, DeepLabv3+, DilatedNet and DRN.)
  • Multi-label classification task: Sigmoid activation function on the output neurons and cross entropy loss are used.
  • Location Refinement: A location refinement FC layer is added after FC7, using the relative offsets (Δx,Δy) from a scoremap location to the ground truth as regression targets.
  • Regression to other parts: Similar to location refinement, an extra term is added to the objective function for regressing each part onto all other part locations.
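
To make the multi-label loss and the location refinement targets above more concrete, here is a minimal pure-Python sketch. It assumes that each scoremap cell maps to the centre of a stride-sized patch in the image; that mapping, and the function names, are assumptions for illustration rather than details taken from the paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def multilabel_cross_entropy(logits, labels):
    """Sum of independent binary cross-entropy terms, one per part class.

    Unlike softmax, each class gets its own sigmoid, so several part classes
    can be active at the same scoremap location.
    """
    loss = 0.0
    for z, t in zip(logits, labels):
        p = sigmoid(z)
        loss -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return loss


def offset_target(cell, joint_xy, stride=8):
    """Relative offset (dx, dy) from a scoremap cell to the ground-truth joint.

    Assumption: cell (i, j) corresponds to the image position
    (i * stride + stride / 2, j * stride + stride / 2).
    """
    cx = cell[0] * stride + stride / 2.0
    cy = cell[1] * stride + stride / 2.0
    return (joint_xy[0] - cx, joint_xy[1] - cy)


loss = multilabel_cross_entropy(logits=[0.0, 0.0], labels=[1, 0])
offset = offset_target(cell=(5, 3), joint_xy=(47, 25), stride=8)
```

With a stride of 8, a scoremap cell can be up to a few pixels away from the true joint, which is the gap the regressed offsets are meant to close.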

2. DeeperCut Architecture

DeeperCut Using ResNet as Backbone
  • Deeper Model: Similar to DeepCut, but the backbone is ResNet, a deeper and stronger network than VGGNet.
  • Speed-up inference: The optimization is run incrementally in three stages: (1) solve for head and shoulder locations; (2) add elbows and wrists to the stage 1 solution and re-optimize; (3) add the rest of the body parts to the stage 2 solution and re-optimize.
  • Image-conditioned pairwise terms using CNN regression: A CNN is trained to regress body part locations, and the regressed offsets and angles are used as features to train a logistic regression that outputs the pairwise probability.
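
The image-conditioned pairwise term can be illustrated with a small sketch. It assumes two features, the distance and the angle between the CNN-regressed offset and the actual displacement between two candidates, and it uses hand-picked logistic regression weights. The exact feature set, the function names, and the weight values are hypothetical; in practice the weights would be learned from training data.

```python
import math


def pairwise_features(pos_d, pos_d2, predicted_offset):
    """Compare the regressed offset from d towards d' with the actual displacement."""
    actual = (pos_d2[0] - pos_d[0], pos_d2[1] - pos_d[1])
    distance = math.hypot(actual[0] - predicted_offset[0],
                          actual[1] - predicted_offset[1])
    angle = abs(math.atan2(actual[1], actual[0])
                - math.atan2(predicted_offset[1], predicted_offset[0]))
    angle = min(angle, 2 * math.pi - angle)  # wrap the difference into [0, pi]
    return [distance, angle]


def pairwise_probability(features, weights, bias):
    """Logistic regression: probability that d and d' belong to the same person."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights: larger disagreement lowers the probability.
weights, bias = [-0.1, -1.0], 2.0

# Candidate d' lies exactly where the regression from d predicts it.
features_good = pairwise_features((10, 10), (10, 40), predicted_offset=(0, 30))
# Candidate d' lies far from the predicted location.
features_bad = pairwise_features((10, 10), (60, 10), predicted_offset=(0, 30))

p_good = pairwise_probability(features_good, weights, bias)
p_bad = pairwise_probability(features_bad, weights, bias)
```

Here the agreeing pair receives a higher same-person probability than the disagreeing pair, which is the behaviour the pairwise term is meant to capture.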

3. Integer Linear Program (ILP)

  • Consider two body part candidates d and d’ from the set of body part candidates D and classes c and c’ from the set of classes C. The body part candidates were obtained through the CNN. Now, the following set of statements is developed.
  • If x(d,c)=1, then it means that body part candidate d belongs to class c.
  • Also, y(d,d’)=1 indicates that body part candidates d and d’ belong to the same person.
  • By substituting z(d,d’,c,c’)=x(d,c)x(d’,c’)y(d,d’), the objective is converted to an Integer Linear Program (ILP), which is solved by branch-and-cut.
  • If the value of z(d,d’,c,c’) is 1, then it means that body part candidate d belongs to class c, body part candidate d’ belongs to class c’, and finally body part candidates d and d’ belong to the same person.
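
A product of binary variables is not linear by itself, so the substitution above needs linear constraints that force z to equal the product. The sketch below checks one standard linearization (z ≤ each factor, and z ≥ the sum of the factors minus 2) by brute force over all binary assignments. The variable names are illustrative, and this is offered as a generic linearization rather than a transcription of the paper's exact constraint set.

```python
from itertools import product


def feasible_z(x_dc, x_d2c2, y_dd2):
    """Return the binary z values allowed by the linear constraints.

    Constraints (standard linearization of z = x_dc * x_d2c2 * y_dd2):
        z <= x_dc,  z <= x_d2c2,  z <= y_dd2,
        z >= x_dc + x_d2c2 + y_dd2 - 2
    """
    return [
        z for z in (0, 1)
        if z <= x_dc and z <= x_d2c2 and z <= y_dd2
        and z >= x_dc + x_d2c2 + y_dd2 - 2
    ]


# Enumerate all 8 binary assignments of (x(d,c), x(d',c'), y(d,d')).
results = {}
for x_dc, x_d2c2, y_dd2 in product((0, 1), repeat=3):
    results[(x_dc, x_d2c2, y_dd2)] = feasible_z(x_dc, x_d2c2, y_dd2)

# For every assignment exactly one z is feasible, and it equals the product.
linearization_ok = all(
    zs == [a * b * c] for (a, b, c), zs in results.items()
)
```

Because the constraints pin z to the product for every binary assignment, the pairwise cost can be attached to z and the whole objective stays linear.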

4. DeepCut & DeeperCut Results

4.1. Single Person Pose Estimation

MPII Single Person dataset Using Percentage of Correct Keypoints (PCK) Metric
Leeds Sports Poses (LSP) Using PCK Metric
  • Newell ECCV’16 has higher PCK than DeepCut and DeeperCut on the MPII dataset.
  • DeeperCut has higher PCK than Bulat&Tzimir ECCV’16 on the LSP dataset.
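
For readers unfamiliar with the PCK metric used above, here is a minimal sketch. A keypoint counts as correct when its distance to the ground truth is within a fraction alpha of a reference length. Which body length is used as the reference and the value of alpha differ between benchmarks, so both are plain parameters here, and the numbers in the example are made up.

```python
import math


def pck(predictions, ground_truth, reference_length, alpha=0.5):
    """Percentage of keypoints within alpha * reference_length of the ground truth."""
    threshold = alpha * reference_length
    correct = 0
    for (px, py), (gx, gy) in zip(predictions, ground_truth):
        if math.hypot(px - gx, py - gy) <= threshold:
            correct += 1
    return 100.0 * correct / len(ground_truth)


# Toy example with four keypoints (hypothetical coordinates).
predicted = [(10, 10), (20, 20), (30, 30), (40, 40)]
annotated = [(10, 12), (20, 30), (31, 30), (40, 40)]
score = pck(predicted, annotated, reference_length=10, alpha=0.5)
```

In this toy case three of the four keypoints fall within the 5-pixel threshold, so the score is 75.0.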

4.2. Multi-Person Pose Estimation

MPII Multi-Person dataset Using mean Average Precision (mAP) Metric
  • DeeperCut improves substantially over DeepCut in mAP.
We Are Family (WAF) dataset Using Percentage of Correct Parts (PCP) Metric
  • DeeperCut has the highest PCP results.
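
The PCP metric can be sketched in a similar way. In one common formulation, a limb is counted as correct when both of its predicted endpoints lie within half the ground-truth limb length of the corresponding annotated endpoints. Several variants of PCP exist, so the definition below and its `fraction` parameter should be read as an assumption, not as the exact protocol used for the WAF results.

```python
import math


def limb_is_correct(pred_a, pred_b, gt_a, gt_b, fraction=0.5):
    """True if both predicted endpoints are close enough to the annotated ones."""
    limb_length = math.hypot(gt_b[0] - gt_a[0], gt_b[1] - gt_a[1])
    threshold = fraction * limb_length
    err_a = math.hypot(pred_a[0] - gt_a[0], pred_a[1] - gt_a[1])
    err_b = math.hypot(pred_b[0] - gt_b[0], pred_b[1] - gt_b[1])
    return err_a <= threshold and err_b <= threshold


def pcp(predicted_limbs, annotated_limbs, fraction=0.5):
    """Percentage of limbs (pairs of endpoints) that are correctly localized."""
    correct = sum(
        limb_is_correct(pa, pb, ga, gb, fraction)
        for (pa, pb), (ga, gb) in zip(predicted_limbs, annotated_limbs)
    )
    return 100.0 * correct / len(annotated_limbs)


# Two toy limbs of length 10 (hypothetical coordinates).
annotated_limbs = [((0, 0), (0, 10)), ((20, 0), (20, 10))]
predicted_limbs = [((1, 1), (0, 12)), ((20, 6), (20, 10))]
pcp_score = pcp(predicted_limbs, annotated_limbs)
```

The first limb has both endpoints within 5 pixels of the annotation, while the second has one endpoint 6 pixels off, so the toy score is 50.0.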

4.3. Qualitative Results for DeepCut

Successful Cases
Failure Cases

4.4. Qualitative Results for DeeperCut

Successful Cases
Failure Cases

References

My Previous Reviews

Image Classification [LeNet] [AlexNet] [Maxout] [NIN] [ZFNet] [VGGNet] [Highway] [SPPNet] [PReLU-Net] [STN] [DeepImage] [SqueezeNet] [GoogLeNet / Inception-v1] [BN-Inception / Inception-v2] [Inception-v3] [Inception-v4] [Xception] [MobileNetV1] [ResNet] [Pre-Activation ResNet] [RiR] [RoR] [Stochastic Depth] [WRN] [ResNet-38] [Shake-Shake] [FractalNet] [Trimps-Soushen] [PolyNet] [ResNeXt] [DenseNet] [PyramidNet] [DRN] [DPN] [Residual Attention Network] [DMRNet / DFN-MR] [IGCNet / IGCV1] [MSDNet] [ShuffleNet V1] [SENet] [NASNet] [MobileNetV2]

Object Detection [OverFeat] [R-CNN] [Fast R-CNN] [Faster R-CNN] [MR-CNN & S-CNN] [DeepID-Net] [CRAFT] [R-FCN] [ION] [MultiPathNet] [NoC] [Hikvision] [GBD-Net / GBD-v1 & GBD-v2] [G-RMI] [TDM] [SSD] [DSSD] [YOLOv1] [YOLOv2 / YOLO9000] [YOLOv3] [FPN] [RetinaNet] [DCN]

Semantic Segmentation [FCN] [DeconvNet] [DeepLabv1 & DeepLabv2] [CRF-RNN] [SegNet] [ParseNet] [DilatedNet] [DRN] [RefineNet] [GCN] [PSPNet] [DeepLabv3] [ResNet-38] [ResNet-DUC-HDC] [LC] [FC-DenseNet] [IDW-CNN] [DIS] [SDN] [DeepLabv3+]

Biomedical Image Segmentation [CUMedVision1] [CUMedVision2 / DCAN] [U-Net] [CFS-FCN] [U-Net+ResNet] [MultiChannel] [V-Net] [3D U-Net] [M²FCN] [SA] [QSA+QNT] [3D U-Net+ResNet] [Cascaded 3D U-Net] [Attention U-Net] [RU-Net & R2U-Net] [VoxResNet] [DenseVoxNet] [UNet++] [H-DenseUNet] [DUNet]

Instance Segmentation [SDS] [Hypercolumn] [DeepMask] [SharpMask] [MultiPathNet] [MNC] [InstanceFCN] [FCIS]

Super Resolution [SRCNN] [FSRCNN] [VDSR] [ESPCN] [RED-Net] [DRCN] [DRRN] [LapSRN & MS-LapSRN] [SRDenseNet] [SR+STN]

Human Pose Estimation [DeepPose] [Tompson NIPS’14] [Tompson CVPR’15] [CPM] [FCGN] [IEF] [DeepCut & DeeperCut] [Newell ECCV’16 & Newell POCV’16]

Codec Post-Processing [ARCNN] [Lin DCC’16] [IFCNN] [Li ICME’17] [VRCNN] [DCAD] [DS-CNN] [Lu CVPRW’19] [Wang APSIPA ASC’19]

Generative Adversarial Network [GAN]

