Brief Review — CPN: Cascaded Pyramid Network for Multi-Person Pose Estimation

CPN, 2-Stage Network Instead of Stacking Multiple Hourglass, Outperforms Mask R-CNN, CMUPose, and G-RMI

Sik-Ho Tsang
4 min read · Aug 11


Multi-Person Pose Estimation

Cascaded Pyramid Network for Multi-Person Pose Estimation,
CPN, by Tsinghua University, HuaZhong University of Science and Technology, and Megvii Inc. (Face++)
2018 CVPR, Over 1300 Citations (Sik-Ho Tsang @ Medium)

Human Pose Estimation
2014 … 2018 [PersonLab] 2019 [OpenPose] [HRNet / HRNetV1] 2020 [A-HRNet] 2021 [HRNetV2, HRNetV2p] [Lite-HRNet]

  • Cascaded Pyramid Network (CPN) is proposed to address the problem of “hard” keypoints, such as occluded or invisible ones. More specifically, CPN includes two stages: GlobalNet and RefineNet.
  • GlobalNet is a feature pyramid network (FPN) which can successfully localize the “simple” keypoints like eyes and hands but may fail to precisely recognize the occluded or invisible keypoints.
  • RefineNet explicitly handles the “hard” keypoints by integrating all levels of feature representations from GlobalNet, together with an online hard keypoint mining loss.
  • In general, to address the multi-person pose estimation problem, a top-down pipeline is adopted to first generate a set of human bounding boxes based on a detector, followed by CPN for keypoint localization.


  1. Cascaded Pyramid Network (CPN)
  2. Results

1. Cascaded Pyramid Network (CPN)

Cascaded Pyramid Network (CPN)

1.1. Human Detector

  • A modified FPN is used where ROIAlign from Mask R-CNN is adopted to replace the ROIPooling in FPN.
  • SoftNMS is used.
  • The object detector is trained on all 80 categories of the COCO dataset, but only the boxes of the human category are used for the multi-person skeleton task.
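
To make the SoftNMS step more concrete, here is a minimal sketch of Gaussian Soft-NMS in plain Python. Which Soft-NMS variant and which hyperparameters the detector actually uses are not stated in this review, so the Gaussian decay, `sigma`, `score_thresh`, and the function names `iou` and `soft_nms` are all assumptions for illustration.

```python
import math

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS (assumed variant): decay the scores of overlapping
    boxes instead of discarding them outright.

    Returns a list of (box_index, rescored_score) in selection order.
    """
    remaining = list(enumerate(scores))
    kept = []
    while remaining:
        # Select the box with the highest current score.
        best_pos = max(range(len(remaining)), key=lambda k: remaining[k][1])
        best_idx, best_score = remaining.pop(best_pos)
        kept.append((best_idx, best_score))
        # Decay every remaining score by its overlap with the selected box.
        decayed = []
        for i, s in remaining:
            overlap = iou(boxes[best_idx], boxes[i])
            s = s * math.exp(-(overlap ** 2) / sigma)
            if s >= score_thresh:  # drop boxes whose score became negligible
                decayed.append((i, s))
        remaining = decayed
    return kept
```

In a top-down pipeline, the boxes kept here (restricted to the human category) would then be cropped and passed to the keypoint network.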

1.2. GlobalNet

  • ResNet is used as the backbone, where the outputs of the last residual blocks of stages conv2~5 are denoted as C2, C3, …, C5 respectively.
  • 3 × 3 convolution filters are applied on C2, …, C5 to generate the heatmaps for keypoints.
  • Slightly different from FPN, a 1 × 1 convolutional kernel is applied before each element-wise sum procedure in the upsampling process.
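
The top-down merge described above can be sketched with a toy, pure-Python version. This is only an illustration under assumptions: feature maps are nested lists shaped [C][H][W], upsampling is nearest-neighbour 2x, and the helper names (`conv1x1`, `upsample2x`, `add`, `top_down_merge`) and weight layouts are hypothetical. A real model would use a deep learning framework, and the 3 × 3 heatmap heads are omitted here.

```python
def conv1x1(feat, weight):
    """1x1 convolution: mix channels independently at every pixel.

    feat: [C_in][H][W] nested lists; weight: [C_out][C_in].
    """
    h, w = len(feat[0]), len(feat[0][0])
    return [[[sum(wo[c] * feat[c][y][x] for c in range(len(feat)))
              for x in range(w)] for y in range(h)] for wo in weight]

def upsample2x(feat):
    """Nearest-neighbour 2x upsampling of a [C][H][W] feature map."""
    return [[[row[x // 2] for x in range(2 * len(row))]
             for row in ch for _ in range(2)] for ch in feat]

def add(a, b):
    """Element-wise sum of two feature maps with identical shapes."""
    return [[[va + vb for va, vb in zip(ra, rb)]
             for ra, rb in zip(ca, cb)] for ca, cb in zip(a, b)]

def top_down_merge(c_feats, lateral_w, up_w):
    """Merge backbone features ordered fine to coarse (e.g. [C2, C3, C4, C5]).

    lateral_w[i]: 1x1 weights for the lateral path of level i.
    up_w[i]: 1x1 weights applied to the upsampled coarser map before the sum
             at level i (one entry fewer than the number of levels).
    Returns the merged maps in the same fine-to-coarse order.
    """
    outs = [conv1x1(c_feats[-1], lateral_w[-1])]  # coarsest level: lateral only
    for level in range(len(c_feats) - 2, -1, -1):
        lateral = conv1x1(c_feats[level], lateral_w[level])
        # The 1x1 convolution on the upsampled path comes before the sum.
        up = conv1x1(upsample2x(outs[0]), up_w[level])
        outs.insert(0, add(lateral, up))
    return outs
```

The extra 1 × 1 convolution on the upsampled path is the part that differs from a plain FPN merge, where the upsampled map is added directly.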

1.3. RefineNet

  • RefineNet transmits the information across different levels and finally integrates the information of different levels via upsampling and concatenating, as in HyperNet [21].
  • Besides concatenating all the pyramid features, more bottleneck blocks are stacked into deeper layers, whose smaller spatial size achieves a good trade-off between effectiveness and efficiency.
  • Hard keypoints are explicitly selected online based on the training loss, and only the losses from the selected keypoints are backpropagated.
  • Only the top-M (M < N) keypoint losses out of N are used. M=8.
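
The online hard keypoint mining step can be sketched as follows. This is a minimal illustration, not the paper's code: the per-keypoint loss is assumed to be a mean squared error over heatmap pixels, the selected losses are assumed to be averaged rather than summed, and the name `ohkm_loss` is hypothetical.

```python
def ohkm_loss(pred, target, top_m=8):
    """Online hard keypoint mining for one person instance.

    pred, target: lists of N heatmaps, each flattened to a list of floats.
    Returns (loss over the top-M hardest keypoints, all per-keypoint losses).
    """
    per_keypoint = []
    for p, t in zip(pred, target):
        # Assumed per-keypoint loss: mean squared error over the heatmap.
        mse = sum((pi - ti) ** 2 for pi, ti in zip(p, t)) / len(p)
        per_keypoint.append(mse)
    # Keep only the M largest losses; the remaining keypoints contribute
    # nothing to the loss and therefore receive no gradient.
    hardest = sorted(per_keypoint, reverse=True)[:top_m]
    return sum(hardest) / len(hardest), per_keypoint
```

With M=8, roughly the harder half of the keypoints of each instance drives the RefineNet update, assuming the 17 COCO keypoints.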

1.4. Some Details

  • Training the ResNet-50-based model takes about 1.5 days on eight NVIDIA Titan X Pascal GPUs.
  • In order to minimize the variance of prediction, a 2D Gaussian filter is applied on the predicted heatmaps.
  • Following the same techniques used in Stacked Hourglass, the pose of the corresponding flipped image is also predicted and the heatmaps are averaged to get the final prediction.
  • A quarter offset in the direction from the highest response to the second highest response is used to obtain the final location of the keypoints.
  • A rescoring strategy is also used. Different from the rescoring strategy used in G-RMI, the product of the box score and the average score of all keypoints is considered as the final pose score of a person instance.
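
The quarter offset and the rescoring rule can be sketched in a few lines. This is a sketch under assumptions: the "second highest response" is taken as the second-largest value anywhere in the heatmap, the offset is a quarter of a pixel along the unit direction between the two responses, and the function names `refine_keypoint` and `pose_score` are hypothetical.

```python
def refine_keypoint(heatmap):
    """Return (x, y) of the peak, shifted a quarter pixel toward the
    second highest response.

    heatmap: [H][W] nested list of floats with at least two cells.
    """
    cells = [(v, x, y) for y, row in enumerate(heatmap)
             for x, v in enumerate(row)]
    cells.sort(reverse=True)  # highest response first
    (_, x1, y1), (_, x2, y2) = cells[0], cells[1]
    dx, dy = x2 - x1, y2 - y1
    norm = (dx * dx + dy * dy) ** 0.5  # non-zero: the two cells are distinct
    return x1 + 0.25 * dx / norm, y1 + 0.25 * dy / norm

def pose_score(box_score, keypoint_scores):
    """Final pose score: box score times the average keypoint score."""
    return box_score * sum(keypoint_scores) / len(keypoint_scores)
```

For example, a peak at (1, 1) whose second highest response sits directly to its right at (2, 1) is refined to (1.25, 1.0).
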
Illustration of Easy and Hard Keypoints
  • The above figure shows the examples.
  • Left eye can be localized easily using GlobalNet at the left.
  • Left hip needs RefineNet to refine the location.

2. Results

2.1. MS COCO

MS COCO Test-Challenge2017 (“+” indicates ensembled model.)

CPN+ obtains 72.1 AP on the COCO test-challenge2017 dataset, achieving state-of-the-art performance and outperforming approaches such as Mask R-CNN and G-RMI.

MS COCO Test-Dev

On Test-Dev, without extra data involved in training, a single CPN model achieves 72.1 AP, and an ensemble of CPN models trained with different ground-truth heatmaps achieves 73.0 AP, outperforming approaches such as Mask R-CNN, CMUPose, and G-RMI.

2.2. Visualizations


Some illustrative examples are shown above.

  • (There are numerous ablation experiments. Please feel free to read the paper directly.)


