Brief Review — CPN: Cascaded Pyramid Network for Multi-Person Pose Estimation

CPN, 2-Stage Network Instead of Stacking Multiple Hourglass, Outperforms Mask R-CNN, CMUPose, and G-RMI

Aug 11, 2023

Cascaded Pyramid Network for Multi-Person Pose Estimation,
CPN, by Tsinghua University, HuaZhong University of Science and Technology, and Megvii Inc. (Face++)
2018 CVPR, Over 1300 Citations (Sik-Ho Tsang @ Medium)
Human Pose Estimation
2014 … 2018 [PersonLab] 2019 [OpenPose] [HRNet / HRNetV1] 2020 [A-HRNet] 2021 [HRNetV2, HRNetV2p] [Lite-HRNet]

Cascaded Pyramid Network (CPN) is proposed to relieve the problem of “hard” keypoints, such as occluded or invisible ones. More specifically, CPN includes two stages: GlobalNet and RefineNet.
GlobalNet is a feature pyramid network (FPN) that can successfully localize “simple” keypoints like eyes and hands, but may fail to precisely recognize occluded or invisible keypoints.
RefineNet explicitly handles the “hard” keypoints by integrating all levels of feature representations from GlobalNet, together with an online hard keypoint mining loss.
In general, to address the multi-person pose estimation problem, a top-down pipeline is adopted to first generate a set of human bounding boxes based on a detector, followed by CPN for keypoint localization.

Outline

1. Cascaded Pyramid Network (CPN)
2. Results

1. Cascaded Pyramid Network (CPN)

1.1. Human Detector

A modified FPN is used where ROIAlign from Mask R-CNN is adopted to replace the ROIPooling in FPN.
SoftNMS is used.
To train the object detector, all 80 categories from the COCO dataset are used, but only the boxes of the human category are used for the multi-person skeleton task.
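As a rough illustration of the Soft-NMS step, here is a minimal pure-Python sketch of the Gaussian variant. The function names, the `sigma=0.5` decay, and the `score_thresh` cutoff are assumptions for illustration; the review does not state which Soft-NMS variant or hyperparameters CPN's detector uses.

```python
import math

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS (sigma and score_thresh are assumed values).

    Instead of discarding boxes that overlap the current best box,
    their scores are decayed according to the overlap.
    Returns a list of (box, score) pairs in selection order.
    """
    candidates = [(tuple(b), float(s)) for b, s in zip(boxes, scores)]
    kept = []
    while candidates:
        # Pick the highest-scoring remaining box.
        best_idx = max(range(len(candidates)), key=lambda i: candidates[i][1])
        best_box, best_score = candidates.pop(best_idx)
        kept.append((best_box, best_score))
        # Decay the scores of the remaining boxes by their overlap with it.
        decayed = []
        for box, score in candidates:
            score *= math.exp(-(iou(best_box, box) ** 2) / sigma)
            if score >= score_thresh:
                decayed.append((box, score))
        candidates = decayed
    return kept

person_boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
person_scores = [0.9, 0.8, 0.7]
result = soft_nms(person_boxes, person_scores)
```

In this toy input the second box heavily overlaps the first, so it is kept with a reduced score rather than removed, which is the behavior that helps in crowded scenes.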

1.2. GlobalNet

ResNet is used as the backbone. The outputs of the last residual blocks of stages conv2 to conv5 are denoted C2, C3, C4, and C5, respectively.
3 × 3 convolution filters are applied on C2, …, C5 to generate the heatmaps for keypoints.
Slightly different from FPN, a 1 × 1 convolutional kernel is applied before each element-wise sum in the upsampling process.
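The top-down merge described above can be sketched on toy feature maps. This is a simplified pure-Python illustration, not the authors' implementation: the helper names are hypothetical, features are nested lists of shape [C][H][W], nearest-neighbour upsampling is an assumption (the review does not say which interpolation is used), and `lateral` stands for the already-prepared feature of the finer level.

```python
def conv1x1(feat, weight):
    """1x1 convolution: mix channels independently at every pixel.

    feat has shape [C_in][H][W]; weight has shape [C_out][C_in].
    """
    h, w = len(feat[0]), len(feat[0][0])
    return [
        [[sum(wk * feat[k][y][x] for k, wk in enumerate(row)) for x in range(w)]
         for y in range(h)]
        for row in weight
    ]

def upsample2x(feat):
    """Nearest-neighbour 2x upsampling of a [C][H][W] feature map."""
    out = []
    for channel in feat:
        rows = []
        for row in channel:
            wide = [v for v in row for _ in range(2)]  # repeat each column
            rows.append(wide)
            rows.append(list(wide))                    # repeat each row
        out.append(rows)
    return out

def add(a, b):
    """Element-wise sum of two feature maps with identical shapes."""
    return [[[va + vb for va, vb in zip(ra, rb)]
             for ra, rb in zip(ca, cb)]
            for ca, cb in zip(a, b)]

def merge_top_down(coarse, lateral, weight):
    """Upsample the coarser level, apply a 1x1 conv, then sum with the finer level."""
    return add(conv1x1(upsample2x(coarse), weight), lateral)

coarse = [[[1.0]]]                      # 1 channel, 1x1 (e.g. a C5-like level)
lateral = [[[1.0, 1.0], [1.0, 1.0]]]    # 1 channel, 2x2 (e.g. a C4-like level)
merged = merge_top_down(coarse, lateral, weight=[[2.0]])
```

With a single channel the 1 × 1 convolution reduces to a scalar multiply, so the example only shows the order of operations: upsample, 1 × 1 conv, then element-wise sum.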

1.3. RefineNet

RefineNet transmits information across different levels and finally integrates the information of all levels via upsampling and concatenation, as in HyperNet [21].
In addition to concatenating all the pyramid features, more bottleneck blocks are stacked on the deeper layers, whose smaller spatial size gives a good trade-off between effectiveness and efficiency.
Hard keypoints are explicitly selected online based on the training loss, and only the losses from the selected keypoints are backpropagated.
Only the top-M (M < N) keypoint losses out of N are used, where N is the number of annotated keypoints per person (17 in COCO). M=8 is used.
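The online hard keypoint mining idea can be sketched in a few lines. This is a minimal pure-Python sketch under stated assumptions: heatmaps are flattened lists of floats, the per-keypoint loss is a mean squared error, and the selected losses are averaged. The function names are hypothetical, and the exact loss form and reduction are not given in this review.

```python
def keypoint_losses(pred, target):
    """Mean squared error per keypoint.

    pred and target are lists with one flattened heatmap per keypoint.
    """
    return [
        sum((p - t) ** 2 for p, t in zip(hp, ht)) / len(hp)
        for hp, ht in zip(pred, target)
    ]

def ohkm_loss(pred, target, top_m=8):
    """Keep only the top-M largest per-keypoint losses and average them.

    Only these "hard" keypoints would contribute gradients during training.
    """
    losses = keypoint_losses(pred, target)
    hardest = sorted(losses, reverse=True)[:top_m]
    return sum(hardest) / len(hardest)

# Toy example with N=3 keypoints and M=2.
pred = [[0.0, 0.0], [1.0, 1.0], [0.5, 0.5]]
target = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
loss = ohkm_loss(pred, target, top_m=2)
```

Here the first keypoint is already perfect (loss 0), so it is dropped, and the loss is the average over the two harder keypoints.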

1.4. Some Details

Training the ResNet-50-based model takes about 1.5 days on eight NVIDIA Titan X Pascal GPUs.
In order to minimize the variance of prediction, a 2D Gaussian filter is applied on the predicted heatmaps.
Following the same techniques used in Stack-Hourglass, the pose of the corresponding flipped image is also predicted and the heatmaps are averaged to get the final prediction.
A quarter offset in the direction from the highest response to the second highest response is used to obtain the final location of the keypoints.
A rescoring strategy is also used. Different from the rescoring strategy used in G-RMI, the product of the box score and the average score of all keypoints is taken as the final pose score of a person instance.
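Two of the post-processing steps above, the quarter offset and the rescoring, can be sketched as follows. This is one plausible reading rather than the authors' code: the function names are hypothetical, the heatmap is a plain 2D list with at least two cells, the second-highest response is searched over the whole heatmap, and the offset is applied along the unit vector pointing from the peak to that second response.

```python
import math

def top_two(heatmap):
    """Return the (y, x) positions of the highest and second-highest responses."""
    cells = [(v, y, x) for y, row in enumerate(heatmap) for x, v in enumerate(row)]
    cells.sort(key=lambda c: c[0], reverse=True)
    (_, y1, x1), (_, y2, x2) = cells[0], cells[1]
    return (y1, x1), (y2, x2)

def refine_location(heatmap, offset=0.25):
    """Shift the peak by a quarter pixel towards the second-highest response.

    Returns the refined (x, y) location in heatmap coordinates.
    """
    (y1, x1), (y2, x2) = top_two(heatmap)
    dy, dx = y2 - y1, x2 - x1
    norm = math.hypot(dx, dy)  # never zero: the two cells are distinct
    return x1 + offset * dx / norm, y1 + offset * dy / norm

def pose_score(box_score, keypoint_scores):
    """Final pose score: box score times the average keypoint score."""
    return box_score * sum(keypoint_scores) / len(keypoint_scores)

heatmap = [
    [0.1, 0.2, 0.1],
    [0.3, 0.9, 0.6],
    [0.1, 0.2, 0.1],
]
location = refine_location(heatmap)      # peak at (x=1, y=1), pulled towards x=2
score = pose_score(0.8, [0.5, 1.0])
```

The refined location is still in heatmap coordinates; mapping it back to the original image would additionally need the crop box and the heatmap stride, which are not covered here.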

**Illustration of Easy and Hard Keypoints**

The above figure shows the examples.
The left eye, shown on the left, can be localized easily using GlobalNet.
The left hip needs RefineNet to refine its location.

2. Results

2.1. MS COCO

**MS COCO Test-Challenge2017** (“+” indicates ensembled model.)

CPN+ obtains 72.1 AP on the COCO test-challenge2017 dataset, achieving state-of-the-art performance and outperforming methods such as Mask R-CNN and G-RMI.

On Test-Dev, without extra data involved in training, a single CPN model achieves 72.1 AP, and an ensemble of CPN models trained with different ground truth heatmaps achieves 73.0 AP, outperforming methods such as Mask R-CNN, CMUPose, and G-RMI.

2.2. Visualizations

Some illustrative examples are shown above.

(There are numerous ablation experiments. Please feel free to read the paper directly.)

Written by Sik-Ho Tsang