Review: CPM — Convolutional Pose Machines (Human Pose Estimation)
Outperforms Tompson NIPS’14 and Tompson CVPR’15
In this story, Convolutional Pose Machines (CPM), by Carnegie Mellon University, is briefly reviewed.
- CPM proposes a sequential architecture composed of convolutional networks that directly operate on the belief maps from previous stages, producing increasingly refined estimates for part locations.
- CPM addresses the vanishing gradient problem during training by providing a natural learning objective function that enforces intermediate supervision.
This is a paper in 2016 CVPR with over 700 citations. (Sik-Ho Tsang @ Medium)
Outline
- Convolutional Pose Machines (CPM)
- Loss Function
- Evaluation
1. Convolutional Pose Machines (CPM)
- (a): In the first stage, t = 1, the classifier g_1 operates on the image evidence x_z at each location z (in the paper's notation):

$$ g_1(\mathbf{x}_z) \rightarrow \{ b_1^p(Y_p = z) \}_{p \in \{0,\dots,P\}} $$

- where Y_p is the pixel location of the p-th anatomical landmark (part), and Z is the set of all locations in the image.
- b_1^p(Y_p = z) is the score predicted by the classifier g_1 for assigning the p-th part to location z in the first stage.
- At the output, there are P+1 belief maps, where P is the total number of parts and the +1 is for the background.
- (b): In the subsequent stages, t > 1, the classifier takes as input both the image evidence and the contextual information from the belief maps of the preceding stage:

$$ g_t\left(\mathbf{x}'_z,\; \psi_t(z, \mathbf{b}_{t-1})\right) \rightarrow \{ b_t^p(Y_p = z) \}_{p \in \{0,\dots,P\}} $$

- where ψ_t(·) maps the belief maps b_{t−1} of stage t−1 to context features. Therefore, the part location estimates are refined by the later stages.
- (c): Within the first stage, there are five convolutional layers followed by two 1×1 convolutional layers, so the network is fully convolutional.
- (d): In the subsequent stages, the input image goes through four convolutional layers before being concatenated with the belief maps obtained from the preceding stage. Then three more convolutional layers followed by two 1×1 convolutional layers are applied (a minimal code sketch follows this list).
- (e): For the later stages, the effective receptive field is increased, which helps to improve the accuracy, as shown below.
- The accuracy improves as the effective receptive field increases, and starts to saturate around 250 pixels.
- In practice, to achieve a certain precision, the cropped input images are normalized to size 368×368.
- The receptive field of the first stage shown above is 160×160 pixels.
- The receptive field of the second-stage output on the belief maps of the first stage is set to 31×31, which is equivalent to 400×400 pixels on the original image.
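To make (c) and (d) concrete, below is a minimal PyTorch-style sketch of one first stage and one subsequent stage. The kernel sizes, channel widths, and the number of parts P are illustrative assumptions loosely following the description above, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

P = 14  # assumed number of parts; the network outputs P+1 belief maps (+1 for background)

class Stage1(nn.Module):
    """Stage t = 1: five conv layers followed by two 1x1 convs (fully convolutional)."""
    def __init__(self, num_maps=P + 1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 128, 9, padding=4), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 128, 9, padding=4), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 128, 9, padding=4), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 32, 5, padding=2), nn.ReLU(),
            nn.Conv2d(32, 512, 9, padding=4), nn.ReLU(),
            nn.Conv2d(512, 512, 1), nn.ReLU(),   # first 1x1 conv
            nn.Conv2d(512, num_maps, 1),         # second 1x1 conv -> P+1 belief maps
        )

    def forward(self, x):
        return self.net(x)

class StageT(nn.Module):
    """Stage t > 1: image features are concatenated with the previous stage's
    belief maps, then refined by three more convs and two 1x1 convs."""
    def __init__(self, num_maps=P + 1, feat_ch=32):
        super().__init__()
        # four conv layers on the input image (downsampled to the belief-map resolution)
        self.features = nn.Sequential(
            nn.Conv2d(3, 128, 9, padding=4), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 128, 9, padding=4), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 128, 9, padding=4), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, feat_ch, 5, padding=2), nn.ReLU(),
        )
        # three more conv layers plus two 1x1 convs; the large 11x11 kernels
        # enlarge the effective receptive field on the belief maps
        self.refine = nn.Sequential(
            nn.Conv2d(feat_ch + num_maps, 128, 11, padding=5), nn.ReLU(),
            nn.Conv2d(128, 128, 11, padding=5), nn.ReLU(),
            nn.Conv2d(128, 128, 11, padding=5), nn.ReLU(),
            nn.Conv2d(128, 128, 1), nn.ReLU(),
            nn.Conv2d(128, num_maps, 1),
        )

    def forward(self, x, prev_beliefs):
        f = self.features(x)
        return self.refine(torch.cat([f, prev_beliefs], dim=1))

x = torch.randn(1, 3, 368, 368)   # normalized crop size from the paper
b1 = Stage1()(x)                  # (1, P+1, 46, 46) belief maps
b2 = StageT()(x, b1)              # refined belief maps at the same resolution
part_locations = b2.flatten(2).argmax(dim=2)  # per-part argmax over locations z
```

In the paper itself, the image-feature layers are shared across the later stages and an additional center map is concatenated as well; both details are omitted here for brevity.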
2. Loss Function
- The cost function to be minimized at the output of each stage is the L2 distance between the predicted and ideal belief maps:

$$ f_t = \sum_{p=1}^{P+1} \sum_{z \in Z} \left\| b_t^p(z) - b_*^p(z) \right\|_2^2 $$

- where b_*^p(z) is the ideal belief map for part p, created by putting a Gaussian peak at the part's ground-truth location.
- The overall objective is the summation of the costs of all stages:

$$ F = \sum_{t=1}^{T} f_t $$
- By enforcing supervision at intermediate stages throughout the network, the vanishing gradient problem is addressed: the intermediate loss functions replenish the gradients at each stage (see the sketch after this list).
- In early epochs, as we move from the output layer to the input layer, we observe that in the model without intermediate supervision the gradient distribution is tightly peaked around zero because of vanishing gradients.
- The model with intermediate supervision has a much larger variance across all layers, suggesting that learning is indeed occurring in all the layers thanks to intermediate supervision.
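As a rough illustration of this objective, here is a minimal sketch in PyTorch. The Gaussian construction of the ideal belief maps (sigma, belief-map resolution) and the tensor shapes are assumptions for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def gaussian_belief_maps(keypoints, size=46, sigma=1.5):
    """Ideal belief maps b_*: one Gaussian peak per part at its ground-truth
    location (given in belief-map coordinates), plus a background map."""
    ys, xs = torch.meshgrid(torch.arange(size), torch.arange(size), indexing="ij")
    maps = [torch.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
            for x, y in keypoints]
    maps.append(1.0 - torch.stack(maps).max(dim=0).values)  # background map
    return torch.stack(maps).unsqueeze(0)  # shape (1, P+1, size, size)

def cpm_loss(stage_outputs, target):
    """Overall objective F = sum over stages of f_t: every stage is supervised
    against the same ideal belief maps, replenishing gradients at each stage."""
    return sum(F.mse_loss(b, target, reduction="sum") for b in stage_outputs)

# hypothetical predictions from T = 2 stages, each with P+1 = 15 belief maps
stage_outputs = [torch.randn(1, 15, 46, 46, requires_grad=True) for _ in range(2)]
target = gaussian_belief_maps(torch.randint(0, 46, (14, 2)).tolist())
loss = cpm_loss(stage_outputs, target)
loss.backward()
```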
3. Evaluation
3.1. Ablation Study
- PCK (Percentage of Correct Keypoints) is measured, where a predicted keypoint counts as correct if its error falls within a tolerance normalized with respect to the subject's scale (e.g., torso or head size); a sketch of the metric follows this list.
- Left: End-to-end training performs better than both stagewise training and training without intermediate supervision.
- Right: The performance increases monotonically up to 5 stages, with diminishing returns at the 6th stage.
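A minimal sketch of the PCK computation under the assumptions above (a per-image reference length such as torso size, and a tolerance fraction alpha; both names are hypothetical):

```python
import numpy as np

def pck(pred, gt, ref_length, alpha=0.2):
    """Percentage of Correct Keypoints: a prediction is correct if it lies
    within alpha * ref_length (e.g., torso size) of the ground truth."""
    dists = np.linalg.norm(np.asarray(pred) - np.asarray(gt), axis=-1)
    return float((dists <= alpha * ref_length).mean())

# hypothetical example: 14 predicted vs. ground-truth part locations
rng = np.random.default_rng(0)
gt = rng.uniform(0, 368, size=(14, 2))
pred = gt + rng.normal(0, 5, size=(14, 2))
print(pck(pred, gt, ref_length=150.0))  # fraction of parts within 0.2 * 150 px
```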
3.2. MPII Human Pose Dataset
- 28000 training samples.
- As shown above, CPM outperforms Tompson NIPS’14 and Tompson CVPR’15.
3.3. Leeds Sports Pose (LSP) Dataset
- 11000 training images and 1000 testing images.
- As shown above, CPM outperforms Tompson NIPS’14.
3.4. FLIC Dataset
- 3987 training images and 1016 testing images.
- Again, CPM outperforms Tompson NIPS’14 and Tompson CVPR’15.
Reference
[2016 CVPR] [CPM]
Convolutional Pose Machines
My Previous Reviews
Human Pose Estimation
[DeepPose] [Tompson NIPS’14] [Tompson CVPR’15]