Review: CPM — Convolutional Pose Machines (Human Pose Estimation)
Outperforms Tompson NIPS’14 and Tompson CVPR’15
In this story, Convolutional Pose Machines (CPM), by Carnegie Mellon University, is briefly reviewed.
- CPM proposes a sequential architecture composed of convolutional networks that directly operate on the belief maps from previous stages, producing increasingly refined estimates for part locations.
- CPM addresses the vanishing gradient problem during training by providing a natural learning objective function that enforces intermediate supervision.
This is a paper in 2016 CVPR with over 700 citations. (Sik-Ho Tsang @ Medium)
Outline
- Convolutional Pose Machines (CPM)
- Loss Function
- Evaluation
1. Convolutional Pose Machines (CPM)
- (a): In the first stage, t = 1, the classifier g_1 operates on the image evidence x_z at each location z (in the paper's notation):

$$ g_1(\mathbf{x}_z) \rightarrow \{ b_1^p(Y_p = z) \}_{p \in \{0,\dots,P\}} $$

- where Y_p is the pixel location of the p-th anatomical landmark (part), and Z is the set of all locations in the image.
- b_1^p(Y_p = z) is the score predicted by the classifier g_1 for assigning the p-th part to location z in the first stage.
- At the output, there are P+1 belief maps, where P is the total number of parts and the +1 is for the background.
- (b): In the subsequent stages, t > 1, the classifier takes as input both the image evidence and the contextual information from the belief maps of the preceding stage:

$$ g_t\left(\mathbf{x}'_z,\; \psi_t(z, \mathbf{b}_{t-1})\right) \rightarrow \{ b_t^p(Y_p = z) \}_{p \in \{0,\dots,P\}} $$

- where ψ_t(·) maps the belief maps b_{t−1} of stage t−1 to context features. Therefore, the part location estimates are refined by the later stages.
- (c): Within the first stage, there are five convolutional layers followed by two 1×1 convolutional layers, so the network is fully convolutional.
- (d): In the subsequent stages, the input image goes through four convolutional layers before being concatenated with the belief maps obtained from the preceding stage. Then three more convolutional layers followed by two 1×1 convolutional layers are applied (a minimal code sketch follows this list).
- (e): For the later stages, the effective receptive field is increased, which helps to improve the accuracy, as shown below.
- The accuracy improves as the effective receptive field increases, and starts to saturate around 250 pixels.
- In practice, to achieve a certain precision, the cropped input images are normalized to size 368×368.
- The receptive field of the first stage shown above is 160×160 pixels.
- The receptive field of the second-stage output on the belief maps of the first stage is set to 31×31, which is equivalent to 400×400 pixels on the original image.
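To make (c) and (d) concrete, below is a minimal PyTorch-style sketch of one first stage and one subsequent stage. The kernel sizes, channel widths, and the number of parts P are illustrative assumptions loosely following the description above, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

P = 14  # assumed number of parts; the network outputs P+1 belief maps (+1 for background)

class Stage1(nn.Module):
    """Stage t = 1: five conv layers followed by two 1x1 convs (fully convolutional)."""
    def __init__(self, num_maps=P + 1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 128, 9, padding=4), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 128, 9, padding=4), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 128, 9, padding=4), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 32, 5, padding=2), nn.ReLU(),
            nn.Conv2d(32, 512, 9, padding=4), nn.ReLU(),
            nn.Conv2d(512, 512, 1), nn.ReLU(),   # first 1x1 conv
            nn.Conv2d(512, num_maps, 1),         # second 1x1 conv -> P+1 belief maps
        )

    def forward(self, x):
        return self.net(x)

class StageT(nn.Module):
    """Stage t > 1: image features are concatenated with the previous stage's
    belief maps, then refined by three more convs and two 1x1 convs."""
    def __init__(self, num_maps=P + 1, feat_ch=32):
        super().__init__()
        # four conv layers on the input image (downsampled to the belief-map resolution)
        self.features = nn.Sequential(
            nn.Conv2d(3, 128, 9, padding=4), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 128, 9, padding=4), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 128, 9, padding=4), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, feat_ch, 5, padding=2), nn.ReLU(),
        )
        # three more conv layers plus two 1x1 convs; the large 11x11 kernels
        # enlarge the effective receptive field on the belief maps
        self.refine = nn.Sequential(
            nn.Conv2d(feat_ch + num_maps, 128, 11, padding=5), nn.ReLU(),
            nn.Conv2d(128, 128, 11, padding=5), nn.ReLU(),
            nn.Conv2d(128, 128, 11, padding=5), nn.ReLU(),
            nn.Conv2d(128, 128, 1), nn.ReLU(),
            nn.Conv2d(128, num_maps, 1),
        )

    def forward(self, x, prev_beliefs):
        f = self.features(x)
        return self.refine(torch.cat([f, prev_beliefs], dim=1))

x = torch.randn(1, 3, 368, 368)   # normalized crop size from the paper
b1 = Stage1()(x)                  # (1, P+1, 46, 46) belief maps
b2 = StageT()(x, b1)              # refined belief maps at the same resolution
part_locations = b2.flatten(2).argmax(dim=2)  # per-part argmax over locations z
```

In the paper itself, the image-feature layers are shared across the later stages and an additional center map is concatenated as well; both details are omitted here for brevity.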
2. Loss Function
- The cost function to be minimized at the output of each stage is the L2 distance between the predicted and ideal belief maps:

$$ f_t = \sum_{p=1}^{P+1} \sum_{z \in Z} \left\| b_t^p(z) - b_*^p(z) \right\|_2^2 $$

- where b_*^p(z) is the ideal belief map for part p, created by putting a Gaussian peak at the part's ground-truth location.
- The overall objective is the summation of the costs of all stages:

$$ F = \sum_{t=1}^{T} f_t $$
- By enforcing supervision at intermediate stages throughout the network, the vanishing gradient problem is addressed: the intermediate loss functions replenish the gradients at each stage (see the sketch after this list).
- In early epochs, as we move from the output layer to the input layer, we observe that in the model without intermediate supervision the gradient distribution is tightly peaked around zero because of vanishing gradients.
- The model with intermediate supervision has a much larger variance across all layers, suggesting that learning is indeed occurring in all the layers thanks to intermediate supervision.
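As a rough illustration of this objective, here is a minimal sketch in PyTorch. The Gaussian construction of the ideal belief maps (sigma, belief-map resolution) and the tensor shapes are assumptions for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def gaussian_belief_maps(keypoints, size=46, sigma=1.5):
    """Ideal belief maps b_*: one Gaussian peak per part at its ground-truth
    location (given in belief-map coordinates), plus a background map."""
    ys, xs = torch.meshgrid(torch.arange(size), torch.arange(size), indexing="ij")
    maps = [torch.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
            for x, y in keypoints]
    maps.append(1.0 - torch.stack(maps).max(dim=0).values)  # background map
    return torch.stack(maps).unsqueeze(0)  # shape (1, P+1, size, size)

def cpm_loss(stage_outputs, target):
    """Overall objective F = sum over stages of f_t: every stage is supervised
    against the same ideal belief maps, replenishing gradients at each stage."""
    return sum(F.mse_loss(b, target, reduction="sum") for b in stage_outputs)

# hypothetical predictions from T = 2 stages, each with P+1 = 15 belief maps
stage_outputs = [torch.randn(1, 15, 46, 46, requires_grad=True) for _ in range(2)]
target = gaussian_belief_maps(torch.randint(0, 46, (14, 2)).tolist())
loss = cpm_loss(stage_outputs, target)
loss.backward()
```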
3. Evaluation
3.1. Ablation Study
- PCK (Percentage of Correct Keypoints) is measured, where a predicted keypoint counts as correct if its error falls within a tolerance normalized with respect to the subject's scale (e.g., torso or head size); a sketch of the metric follows this list.
- Left: End-to-end training performs better than both stagewise training and training without intermediate supervision.
- Right: The performance increases monotonically up to 5 stages, with diminishing returns at the 6th stage.
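A minimal sketch of the PCK computation under the assumptions above (a per-image reference length such as torso size, and a tolerance fraction alpha; both names are hypothetical):

```python
import numpy as np

def pck(pred, gt, ref_length, alpha=0.2):
    """Percentage of Correct Keypoints: a prediction is correct if it lies
    within alpha * ref_length (e.g., torso size) of the ground truth."""
    dists = np.linalg.norm(np.asarray(pred) - np.asarray(gt), axis=-1)
    return float((dists <= alpha * ref_length).mean())

# hypothetical example: 14 predicted vs. ground-truth part locations
rng = np.random.default_rng(0)
gt = rng.uniform(0, 368, size=(14, 2))
pred = gt + rng.normal(0, 5, size=(14, 2))
print(pck(pred, gt, ref_length=150.0))  # fraction of parts within 0.2 * 150 px
```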
3.2. MPII Human Pose Dataset
- 28000 training samples.
- As shown above, CPM outperforms Tompson NIPS’14 and Tompson CVPR’15.
3.3. Leeds Sports Pose (LSP) Dataset
- 11000 training images and 1000 testing images.
- As shown above, CPM outperforms Tompson NIPS’14.
3.4. FLIC Dataset
- 3987 training images and 1016 testing images.
- Again, CPM outperforms Tompson NIPS’14 and Tompson CVPR’15.
Reference
[2016 CVPR] [CPM]
Convolutional Pose Machines
My Previous Reviews
Human Pose Estimation
[DeepPose] [Tompson NIPS’14] [Tompson CVPR’15]