Review: DeepCut & DeeperCut — Multi Person Pose Estimation (Human Pose Estimation)

Deep Learning Based CNN for Part Labeling and Part Clustering

Mar 14, 2020

**(a)** **initial detections** (= part candidates) and pairwise terms (graph) between all detections, **(b) detections jointly clustered** as belonging to one person, **(c)** **the predicted pose sticks**

In this story, DeepCut & DeeperCut are briefly reviewed. First, human part labeling and part clustering are obtained through a Convolutional Neural Network (CNN). Then, an Integer Linear Program (ILP) is set up and solved to estimate the poses of multiple persons. DeepCut & DeeperCut are 2016 CVPR and 2016 ECCV papers respectively, each with more than 400 citations. (Sik-Ho Tsang @ Medium)

Outline

DeepCut Architecture
DeeperCut Architecture
Integer Linear Program (ILP)
DeepCut & DeeperCut Results

1. DeepCut Architecture

**DeepCut Using VGGNet as Backbone**

1.1. Adapted Fast R-CNN (AFR-CNN)

A modified Fast R-CNN, called Adapted Fast R-CNN (AFR-CNN), is used. The modified parts are the proposal generation and the detection region size. (Please refer to Fast R-CNN for more details.)
For the proposal generation, DPM-based part detectors are used instead of selective search (SS), since this is a human pose estimation task. (DPM: Deformable Part Model)
The K top-scoring detections from each part detector are collected into a common pool of N part-independent proposals, and these proposals are used as input to AFR-CNN. (N=2,000 for single person and N=20,000 for multiple persons.)
The detection region size is increased to capture more context around each part.
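The proposal pooling described above can be sketched roughly as follows. This is only an illustration, not the authors' code: the `pool_proposals` helper, the `(score, box)` tuple layout, and the toy value of K are all assumptions made for the example.

```python
def pool_proposals(detections_per_part, k):
    """Keep the k top-scoring detections of every part detector and merge
    them into one part-independent pool (the part label is dropped)."""
    pool = []
    for _part, detections in detections_per_part.items():
        # Sort by score, highest first, and keep the best k of this detector.
        top_k = sorted(detections, key=lambda d: d[0], reverse=True)[:k]
        pool.extend(box for _score, box in top_k)
    return pool


# Toy input: hypothetical (score, (x, y, w, h)) detections for two parts.
detections = {
    "head": [(0.9, (10, 10, 20, 20)), (0.2, (50, 50, 20, 20)), (0.7, (12, 11, 20, 20))],
    "wrist": [(0.4, (30, 80, 10, 10)), (0.8, (32, 78, 10, 10))],
}
proposals = pool_proposals(detections, k=2)
```

Because the pool is part-independent, the network that consumes it has to decide the part label of each proposal itself.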

1.2. Dense-CNN

Then, a fully convolutional VGGNet is developed for computing part probability scoremaps.
Since the original stride of 32 is too coarse for precise part localization, the hole algorithm (dilated convolution) is used to reduce the stride to 8.
(The hole algorithm, or dilated convolution, is commonly used in segmentation tasks such as the DeepLab and DilatedNet series, i.e. DeepLabv1 & DeepLabv2, DeepLabv3, DeepLabv3+, DilatedNet and DRN.)
Multi-label classification task: Sigmoid activation function on the output neurons and cross entropy loss are used.
Location Refinement: A location refinement FC layer is added after FC7, using the relative offsets (Δx,Δy) from a scoremap location to the ground truth as targets.
Regression to other parts: Similar to location refinement, an extra term is added to the objective function for regressing from each part onto all other part locations.
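To make the training targets above more concrete, here is a minimal sketch in plain Python. It assumes the scoremap stride of 8 stated above; the function names, the cell-centre convention, and the toy numbers are illustrative assumptions rather than details taken from the papers.

```python
import math

STRIDE = 8  # scoremap stride after applying the hole algorithm


def cell_to_image(row, col, stride=STRIDE):
    """Image-space (x, y) of a scoremap cell, assuming cell centres."""
    return (col * stride + stride / 2, row * stride + stride / 2)


def refinement_target(row, col, gt_xy):
    """Relative offset (dx, dy) from a scoremap location to the ground truth."""
    x, y = cell_to_image(row, col)
    return (gt_xy[0] - x, gt_xy[1] - y)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def multilabel_cross_entropy(logits, labels):
    """Sum of independent binary cross-entropies, one per body part.
    Each part has its own sigmoid, so several parts can be active at one cell."""
    loss = 0.0
    for z, t in zip(logits, labels):
        p = sigmoid(z)
        loss -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return loss


# Toy example: one scoremap cell, a nearby ground-truth joint, two part classes.
dx, dy = refinement_target(row=3, col=5, gt_xy=(47.0, 30.0))
loss = multilabel_cross_entropy(logits=[0.0, 0.0], labels=[1, 0])
```

With a stride of 8, a cell can be several pixels away from the true joint, which is why a regressed (Δx,Δy) correction is useful on top of the scoremap.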

2. DeeperCut Architecture

**DeeperCut Using ResNet as Backbone**

Deeper Model: Similar to DeepCut, but the backbone is replaced by ResNet, a deeper network than the VGGNet used in DeepCut.

Speed-up inference: The optimization is solved incrementally in three stages: (1) solve for head and shoulder locations; (2) add elbows and wrists to the stage 1 solution and re-optimize; (3) add the rest of the body parts to the stage 2 solution and re-optimize.
Image-conditioned pairwise terms using CNN regression: A CNN is trained to regress body part locations, and the regressed offsets and angles are used as features to train a logistic regression that outputs the pairwise probability.
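A rough sketch of how regressed offsets could feed a logistic regression is given below. The exact feature definition, the `pairwise_features` and `pairwise_probability` helpers, and the weights are hypothetical choices for illustration; the papers' actual features and learned parameters may differ.

```python
import math


def pairwise_features(loc_d, loc_d2, predicted_offset):
    """Compare where candidate d 'expects' d2 to be with where d2 actually is."""
    actual = (loc_d2[0] - loc_d[0], loc_d2[1] - loc_d[1])
    # Distance between the regressed position and the actual position of d2.
    dist = math.hypot(predicted_offset[0] - actual[0],
                      predicted_offset[1] - actual[1])
    # Absolute angle between the regressed and the actual displacement vectors.
    angle = abs(math.atan2(actual[1], actual[0])
                - math.atan2(predicted_offset[1], predicted_offset[0]))
    angle = min(angle, 2 * math.pi - angle)
    return [dist, angle]


def pairwise_probability(features, weights, bias):
    """Logistic regression: probability that d and d2 belong to one person."""
    score = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical weights: larger distance or angle lowers the probability.
WEIGHTS, BIAS = [-0.5, -1.0], 2.0

# d2 lies exactly where d's regression predicts it, so the pair is consistent.
p_good = pairwise_probability(
    pairwise_features((0, 0), (10, 0), predicted_offset=(10, 0)), WEIGHTS, BIAS)
# d2 lies far from the predicted position, so the pair is inconsistent.
p_bad = pairwise_probability(
    pairwise_features((0, 0), (10, 0), predicted_offset=(0, 10)), WEIGHTS, BIAS)
```

Because the features depend on offsets regressed from the image, the resulting pairwise probability is conditioned on image content rather than on geometry alone.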

3. Integer Linear Program (ILP)

Consider two body part candidates d and d’ from the set of body part candidates D, and classes c and c’ from the set of classes C. The body part candidates are obtained through the CNN. The following binary variables are then defined.

If x(d,c)=1, then it means that body part candidate d belongs to class c.
Also, y(d,d’)=1 indicates that body part candidates d and d’ belong to the same person.
The objective contains products of these variables, which are not linear. By substituting z(d,d’,c,c’)=x(d,c)x(d’,c’)y(d,d’), the objective becomes linear in z, giving an Integer Linear Program (ILP) that is solved by branch-and-cut.
If the value of z(d,d’,c,c’) is 1, then it means that body part candidate d belongs to class c, body part candidate d’ belongs to class c’, and finally body part candidates d,d’ belong to the same person.
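For the substitution to stay linear, z has to be tied to x and y through linear inequalities rather than through the product itself. The sketch below uses the standard linearization of a product of binary variables and checks by brute force that it reproduces z = x·x’·y for every 0/1 assignment. The helper name and variable names are assumptions for illustration, and this is a check of the idea, not a solver.

```python
from itertools import product


def linear_constraints_hold(x_dc, x_d2c2, y, z):
    """Linear inequalities that force z to equal x_dc * x_d2c2 * y
    when all four variables are binary."""
    return (z <= x_dc
            and z <= x_d2c2
            and z <= y
            and z >= x_dc + x_d2c2 + y - 2)


# For every binary assignment of (x(d,c), x(d',c'), y(d,d')),
# list the values of z that satisfy the linear constraints.
feasible_z = {}
for x_dc, x_d2c2, y in product((0, 1), repeat=3):
    feasible_z[(x_dc, x_d2c2, y)] = [
        z for z in (0, 1) if linear_constraints_hold(x_dc, x_d2c2, y, z)
    ]

# Exactly one z is feasible in each case, and it equals the product.
all_match = all(zs == [a * b * c] for (a, b, c), zs in feasible_z.items())
```

Only the assignment (1, 1, 1) allows z=1, which matches the interpretation above: both candidates carry their labels and belong to the same person.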