Brief Review — RMPE: Regional Multi-Person Pose Estimation

RMPE for Multi-Person Pose Estimation

Sik-Ho Tsang
4 min read · Jul 30, 2023

RMPE: Regional Multi-Person Pose Estimation
RMPE, by Shanghai Jiao Tong University and Tencent YouTu,
2017 ICCV, Over 1600 Citations (Sik-Ho Tsang @ Medium)

Human Pose Estimation
2014 … 2018 [PersonLab] 2019 [OpenPose] [HRNet / HRNetV1] 2020 [A-HRNet] 2021 [HRNetV2, HRNetV2p] [Lite-HRNet]

  • At the time, researchers focused mainly on single-person pose estimation. Multi-person pose estimation in the wild was a difficult problem.
  • A novel regional multi-person pose estimation (RMPE) framework is proposed to facilitate pose estimation in the presence of inaccurate human bounding boxes. It consists of three components: Symmetric Spatial Transformer Network (SSTN), Parametric Pose Non-Maximum-Suppression (NMS), and Pose-Guided Proposals Generator (PGPG).


  1. Regional Multi-person Pose Estimation (RMPE)
  2. Results

1. Regional Multi-person Pose Estimation (RMPE)

Regional Multi-person Pose Estimation (RMPE)
  • The human bounding boxes obtained by the human detector are fed into the “Symmetric STN + SPPE” module, and the pose proposals are generated automatically. The generated pose proposals are refined by parametric Pose NMS to obtain the estimated human poses.
  • During training, a “Parallel SPPE” is introduced in order to avoid local minima and further leverage the power of the SSTN. To augment the existing training samples, a pose-guided proposals generator (PGPG) is designed.

1.1. Symmetric STN and Parallel SPPE

  • An STN is used to extract high-quality dominant human proposals. Mathematically, the STN performs a 2D affine transformation on the detected human proposal.
  • After SPPE, the resulting pose is mapped into the original human proposal image.
  • A spatial de-transformer network (SDTN) is required to remap the estimated human pose back to the original image coordinates.
  • After extracting high-quality dominant human proposal regions, an off-the-shelf SPPE based on the stacked hourglass network is used for accurate pose estimation.
  • A Parallel SPPE is additionally used. The output of this SPPE branch is directly compared to labels of center-located ground truth poses.
  • All the layers of this parallel SPPE are frozen during the training phase. The weights of this branch are fixed and its purpose is to back-propagate center-located pose errors to the STN module.

This Parallel SPPE can help the STN focus on the correct area and extract high quality human-dominant regions.
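The relationship between the STN and the SDTN can be sketched in a few lines of plain Python. This is only a minimal illustration, assuming the STN parameters form a 2×3 matrix [θ1 θ2 θ3] and the SDTN applies the inverse affine map with [γ1 γ2] = [θ1 θ2]⁻¹ and γ3 = −[γ1 γ2]θ3. The function names and the example θ are hypothetical, and in the actual framework θ is predicted by the localization network rather than fixed by hand.

```python
def apply_affine(theta, point):
    """Apply a 2x3 affine matrix to a 2D point (x, y)."""
    (a, b, tx), (c, d, ty) = theta
    x, y = point
    return (a * x + b * y + tx, c * x + d * y + ty)


def invert_affine(theta):
    """Return the 2x3 matrix gamma that undoes theta (the SDTN side)."""
    (a, b, tx), (c, d, ty) = theta
    det = a * d - b * c
    if det == 0:
        raise ValueError("affine transform is not invertible")
    # Inverse of the 2x2 linear part [theta1 theta2].
    ia, ib = d / det, -b / det
    ic, id_ = -c / det, a / det
    # gamma3 = -[gamma1 gamma2] @ theta3
    gx = -(ia * tx + ib * ty)
    gy = -(ic * tx + id_ * ty)
    return ((ia, ib, gx), (ic, id_, gy))


# Hypothetical STN output: scale by 0.5 and shift by (10, 20).
theta = ((0.5, 0.0, 10.0), (0.0, 0.5, 20.0))
gamma = invert_affine(theta)

joint = (4.0, 6.0)
mapped = apply_affine(theta, joint)      # joint after the STN transform
restored = apply_affine(gamma, mapped)   # joint after the SDTN remapping
```

Because γ is derived from θ, the SDTN needs no parameters of its own: applying it to `mapped` recovers the original `joint`, which is what makes the pair "symmetric".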

1.2. Parametric Pose NMS

  • A pose distance metric d(Pi, Pj | Λ) is defined to measure pose similarity, and a threshold η is used as the elimination criterion, where Λ is the parameter set of the function d(·).
  • The distance function is the sum of the pose distance, K, and the spatial distance between parts, H.
  • (Please feel free to read the paper directly for more details.)
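The elimination scheme can be sketched as a greedy loop: take the most confident pose as the reference, drop every remaining pose whose distance to it is at most η, and repeat on what is left. The sketch below is an assumption-laden illustration, not the paper's implementation: `pose_distance` is a hypothetical stand-in (mean Euclidean distance between matching joints) rather than the parametric K + H metric, and the dictionary layout of a pose is made up for the example.

```python
import math


def pose_distance(pose_a, pose_b):
    """Stand-in distance: mean Euclidean distance between matching joints."""
    dists = [math.dist(p, q) for p, q in zip(pose_a["joints"], pose_b["joints"])]
    return sum(dists) / len(dists)


def pose_nms(poses, eta, distance=pose_distance):
    """Greedy NMS: keep the most confident pose, drop poses within eta of it."""
    remaining = sorted(poses, key=lambda p: p["score"], reverse=True)
    kept = []
    while remaining:
        reference = remaining.pop(0)
        kept.append(reference)
        # Poses at distance <= eta from the reference count as redundant.
        remaining = [p for p in remaining if distance(reference, p) > eta]
    return kept


poses = [
    {"score": 0.9, "joints": [(10, 10), (20, 20)]},
    {"score": 0.6, "joints": [(11, 10), (20, 21)]},  # near-duplicate of the first
    {"score": 0.8, "joints": [(100, 100), (110, 110)]},
]
kept = pose_nms(poses, eta=5.0)
```

Passing the metric in as the `distance` argument keeps the loop independent of how similarity is measured, so a learned, parametric metric could be swapped in without touching the elimination logic.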

1.3. Pose-Guided Proposals Generator (PGPG)

  • For each annotated pose in the training sample, the corresponding atomic pose a is first looked up.
  • Then additional offsets are generated by dense sampling according to P(δB|a) to produce augmented training proposals.
  • (Please feel free to read the paper directly for more details.)
  • (I keep 1.2 and 1.3 brief since the paper is from 2017, which makes it somewhat dated, and these two components do not seem to be widely used in later work.)
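The two PGPG steps above (look up the atomic pose, then sample box offsets from P(δB|a)) can be illustrated roughly as follows. Everything specific here is an assumption for illustration: the atomic pose names, the `OFFSET_STATS` table, and the choice of an independent Gaussian per box edge are hypothetical simplifications, not the distribution learned in the paper.

```python
import random

# Hypothetical offset statistics per atomic pose: (mean, std) for each of the
# four box edges (x_min, y_min, x_max, y_max), normalized by box side length.
OFFSET_STATS = {
    "standing": [(-0.02, 0.05), (-0.01, 0.04), (0.02, 0.05), (0.01, 0.04)],
    "sitting": [(-0.03, 0.08), (-0.02, 0.06), (0.03, 0.08), (0.02, 0.06)],
}


def generate_proposals(gt_box, atomic_pose, count, rng):
    """Sample `count` jittered boxes around a ground-truth box."""
    x_min, y_min, x_max, y_max = gt_box
    width, height = x_max - x_min, y_max - y_min
    # x offsets scale with the box width, y offsets with the box height.
    scales = (width, height, width, height)
    proposals = []
    for _ in range(count):
        offsets = [
            rng.gauss(mean, std) * scale
            for (mean, std), scale in zip(OFFSET_STATS[atomic_pose], scales)
        ]
        proposals.append(tuple(c + o for c, o in zip(gt_box, offsets)))
    return proposals


rng = random.Random(0)  # fixed seed so the sketch is reproducible
proposals = generate_proposals((50.0, 40.0, 150.0, 240.0), "standing", count=5, rng=rng)
```

Each sampled box would then be used as an extra training proposal for the same annotated pose, so the SSTN + SPPE module sees the kind of imperfect boxes a real detector produces.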

1.4. Some Details

  • VGG-based SSD-512 is used as the human detector, as it performs object detection effectively and efficiently.
  • In order to guarantee that the entire person region will be extracted, detected human proposals are extended by 30% along both the height and width directions.
  • The stacked hourglass model is used as the single-person pose estimator because of its superior performance. For memory efficiency, a smaller 4-stack hourglass network is used as the parallel SPPE.
  • For the STN network, the ResNet-18 is adopted as the localization network.
  • For the “++” version, the human detector is replaced with a ResNet-152-based Faster R-CNN and the pose estimator is replaced with PyraNet [45].
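The 30% box extension mentioned above is simple arithmetic, sketched below. The exact convention is an assumption: here the box grows by 30% of its width and height in total, split evenly between the two sides, and is clipped to the image. The function name `extend_box` and the clipping step are hypothetical and not taken from the paper.

```python
def extend_box(box, image_size, ratio=0.3):
    """Grow a box by `ratio` of its width and height, centered, clipped to the image."""
    x_min, y_min, x_max, y_max = box
    img_w, img_h = image_size
    # Half of the extra size goes to each side of the box.
    pad_x = (x_max - x_min) * ratio / 2
    pad_y = (y_max - y_min) * ratio / 2
    return (
        max(0.0, x_min - pad_x),
        max(0.0, y_min - pad_y),
        min(float(img_w), x_max + pad_x),
        min(float(img_h), y_max + pad_y),
    )


# A 100 x 200 detection inside a 640 x 480 image.
extended = extend_box((100.0, 50.0, 200.0, 250.0), image_size=(640, 480))
```

For this example the box grows to roughly 130 × 260 pixels, giving the STN some margin to work with when the detector crops the person too tightly.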

2. Results

2.1. MPII


RMPE achieves an average accuracy of 72 mAP on identifying difficult joints such as wrists, elbows, ankles, and knees, which is 3.3 mAP higher than the previous state-of-the-art result.

2.2. MS COCO Keypoint Challenge

MS COCO Keypoint Challenge

Again, RMPE achieves state-of-the-art performance.

2.3. Visualizations

  • (Please feel free to read the paper directly for ablation experiments.)


