Brief Review — CPN: Cascaded Pyramid Network for Multi-Person Pose Estimation
CPN, 2-Stage Network Instead of Stacking Multiple Hourglass, Outperforms Mask R-CNN, CMUPose, and G-RMI
Cascaded Pyramid Network for Multi-Person Pose Estimation,
CPN, by Tsinghua University, HuaZhong University of Science and Technology, and Megvii Inc. (Face++)
2018 CVPR, Over 1300 Citations (Sik-Ho Tsang @ Medium)
- Cascaded Pyramid Network (CPN) is proposed to relieve the problem of “hard” keypoints, such as occluded or invisible ones. CPN has two stages: GlobalNet and RefineNet.
- GlobalNet is a feature pyramid network (FPN) that can successfully localize “simple” keypoints like eyes and hands but may fail to precisely recognize occluded or invisible keypoints.
- RefineNet explicitly handles the “hard” keypoints by integrating all levels of feature representations from GlobalNet, together with an online hard keypoint mining loss.
- In general, to address the multi-person pose estimation problem, a top-down pipeline is adopted to first generate a set of human bounding boxes based on a detector, followed by CPN for keypoint localization.
1. Cascaded Pyramid Network (CPN)
1.1. Human Detector
- A modified FPN is used where ROIAlign from Mask R-CNN is adopted to replace the ROIPooling in FPN.
- SoftNMS is used.
- To train the object detector, all 80 categories from the COCO dataset are used, but only the boxes of the human category are used for the multi-person skeleton task.
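To make the Soft-NMS step more concrete, here is a minimal pure-Python sketch. It assumes the Gaussian variant with `sigma=0.5` and a small score cutoff; the review does not say which variant or settings the authors used, and the names `iou` and `soft_nms` are only illustrative.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS (assumed variant): decay overlapping boxes' scores
    instead of discarding them outright.

    Returns a list of (original_index, rescored_value) in selection order.
    """
    remaining = list(enumerate(scores))
    kept = []
    while remaining:
        # Select the box with the highest current score.
        best_pos = max(range(len(remaining)), key=lambda k: remaining[k][1])
        best_idx, best_score = remaining.pop(best_pos)
        kept.append((best_idx, best_score))
        # Decay every other box by exp(-IoU^2 / sigma).
        decayed = []
        for idx, score in remaining:
            overlap = iou(boxes[best_idx], boxes[idx])
            score *= math.exp(-(overlap ** 2) / sigma)
            if score >= score_thresh:  # drop boxes whose score became negligible
                decayed.append((idx, score))
        remaining = decayed
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
result = soft_nms(boxes, scores)
```

In this toy input, the second box overlaps the first heavily, so its score is decayed and it ends up ranked below the third, non-overlapping box rather than being removed.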
1.2. GlobalNet
- ResNet is used as the backbone; the outputs of the last residual blocks of conv2 to conv5 are denoted C2, C3, …, C5, respectively.
- 3 × 3 convolution filters are applied on C2, …, C5 to generate the heatmaps for keypoints.
- Slightly different from FPN, a 1 × 1 convolutional kernel is applied before each element-wise sum in the upsampling process.
1.3. RefineNet
- RefineNet transmits information across different levels and finally integrates the information from all levels via upsampling and concatenation, as in HyperNet.
- Besides concatenating all the pyramid features, more bottleneck blocks are stacked into deeper layers, whose smaller spatial size achieves a good trade-off between effectiveness and efficiency.
- Hard keypoints are selected online, explicitly based on the training loss, and only the losses from the selected keypoints are backpropagated.
- Of the N keypoint losses, only the top M (M < N) are used, with M = 8.
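The online hard keypoint mining step can be sketched in a few lines of plain Python. This is only an illustration: it assumes a per-keypoint mean squared error between predicted and ground-truth heatmaps (the loss type is not stated in this review), and the function name `ohkm_loss` is hypothetical.

```python
def ohkm_loss(pred, target, top_m=8):
    """Average the top_m largest per-keypoint losses.

    pred, target: lists of N heatmaps, each flattened to a list of floats.
    Returns (mined_loss, per_keypoint_losses).
    """
    per_keypoint = []
    for p, t in zip(pred, target):
        # Assumed loss: mean squared error over the pixels of one heatmap.
        per_keypoint.append(sum((a - b) ** 2 for a, b in zip(p, t)) / len(p))
    # Keep only the hardest keypoints; the others contribute nothing.
    hardest = sorted(per_keypoint, reverse=True)[:top_m]
    return sum(hardest) / len(hardest), per_keypoint


# Toy example with N = 3 keypoints and M = 2.
pred = [[1.0, 1.0], [2.0, 2.0], [0.0, 0.0]]
target = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
mined, losses = ohkm_loss(pred, target, top_m=2)
```

In a deep learning framework the same idea would typically be a top-k selection over the per-keypoint loss tensor, so that gradients flow only through the selected keypoints.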
1.4. Some Details
- Training the ResNet-50-based model takes about 1.5 days on eight NVIDIA Titan X Pascal GPUs.
- In order to minimize the variance of prediction, a 2D Gaussian filter is applied on the predicted heatmaps.
- Following the same technique used in Stacked Hourglass, the pose of the corresponding flipped image is also predicted, and the two sets of heatmaps are averaged to get the final prediction.
- A quarter offset in the direction from the highest response to the second highest response is used to obtain the final location of the keypoints.
- A rescoring strategy is also used. Unlike the rescoring strategy in G-RMI, the final pose score of a person instance is the product of the box score and the average score of all keypoints.
- The figure above shows examples.
- The left eye can be localized easily using GlobalNet (left).
- The left hip needs RefineNet to refine its location.
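Two of the post-processing steps listed under Some Details, the quarter offset and the rescoring, are easy to sketch in plain Python. This is a rough illustration under stated assumptions: the offset is taken as 0.25 pixel along the unit direction from the highest to the second-highest response, heatmaps are 2D lists indexed as `heatmap[y][x]`, and the names `quarter_offset_peak` and `pose_score` are made up for this example.

```python
import math


def quarter_offset_peak(heatmap):
    """Return (x, y) of the peak, shifted 0.25 px toward the second-highest response.

    heatmap: 2D list of floats indexed as heatmap[y][x], with at least two cells.
    """
    cells = [(v, x, y) for y, row in enumerate(heatmap) for x, v in enumerate(row)]
    cells.sort(key=lambda c: c[0], reverse=True)
    (_, x1, y1), (_, x2, y2) = cells[0], cells[1]
    dx, dy = x2 - x1, y2 - y1
    norm = math.hypot(dx, dy)  # non-zero, since the two cells are distinct
    return x1 + 0.25 * dx / norm, y1 + 0.25 * dy / norm


def pose_score(box_score, keypoint_scores):
    """Rescoring: box score multiplied by the average keypoint score."""
    return box_score * sum(keypoint_scores) / len(keypoint_scores)


heatmap = [
    [0.0, 0.0, 0.0],
    [0.0, 1.0, 0.6],
    [0.0, 0.2, 0.0],
]
peak = quarter_offset_peak(heatmap)   # peak at (1, 1), nudged toward (2, 1)
score = pose_score(0.9, [0.8, 0.6])
```

Here the peak at (1, 1) is moved a quarter pixel to the right, toward its strongest neighbor, which compensates a little for the coarse heatmap resolution.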
2.1. MS COCO
On test-dev, without extra data involved in training, a single CPN model achieves 72.1 AP, and ensembled CPN models with different ground-truth heatmaps achieve 73.0 AP, outperforming methods such as Mask R-CNN, CMUPose, and G-RMI.
Some illustrative examples are shown above.
- (There are numerous ablation experiments. Please feel free to read the paper directly.)