Review — HRNetV2, HRNetV2p: Deep High-Resolution Representation Learning for Visual Recognition

HRNetV2 and HRNetV2p Improve Human Pose Estimation, Semantic Segmentation, and Object Detection

Sik-Ho Tsang
4 min read · May 8, 2023


HRNetV2, HRNetV2p, by Microsoft Research, University of Science and Technology of China, Huazhong University of Science and Technology, Peking University, South China University of Technology, Griffith University, and Microsoft, Redmond
2021 TPAMI, Over 1900 Citations (Sik-Ho Tsang @ Medium)
2019 arXiv, Over 600 Citations


  • High-Resolution Network (HRNet), maintains high-resolution representations through the whole process. There are two key characteristics: (i) Connect the high-to-low resolution convolution streams in parallel and (ii) repeatedly exchange the information across resolutions.
  • HRNet supports a wide range of applications, including human pose estimation, semantic segmentation, and object detection.
  • The authors also have a 2019 arXiv paper (over 600 citations), in which they additionally cover ImageNet classification results by modifying HRNet.

Outline

  1. HRNet
  2. Human Pose Estimation Results
  3. Semantic Segmentation Results
  4. Object Detection Results
  5. Image Classification Results

1. HRNet


1.1. Backbone

HRNet Backbone

As mentioned, the HRNet backbone (i) connects the high-to-low resolution convolution streams in parallel and (ii) repeatedly exchanges the information across resolutions.

Fusion Module

To exchange information across resolutions, fusion modules are designed as shown above.
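The exchange can be sketched in PyTorch. This is a hypothetical simplification for two streams only (the paper fuses up to four), with illustrative channel widths: the low-resolution stream is upsampled with a 1×1 convolution plus bilinear interpolation, and the high-resolution stream is downsampled with a strided 3×3 convolution, so each output stream sums contributions from every input stream.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionModule(nn.Module):
    """Sketch of an HRNet fusion (exchange) unit for two streams:
    stream 0 is high resolution (C channels), stream 1 is half
    resolution (2C channels). Widths are illustrative."""
    def __init__(self, c=32):
        super().__init__()
        # low -> high: 1x1 conv to match channels, then bilinear upsampling
        self.low_to_high = nn.Conv2d(2 * c, c, kernel_size=1)
        # high -> low: strided 3x3 conv to match resolution and channels
        self.high_to_low = nn.Conv2d(c, 2 * c, kernel_size=3, stride=2, padding=1)

    def forward(self, x_high, x_low):
        up = F.interpolate(self.low_to_high(x_low),
                           size=x_high.shape[-2:], mode="bilinear",
                           align_corners=False)
        down = self.high_to_low(x_high)
        # each output stream aggregates information from both inputs
        return x_high + up, x_low + down

# usage: two streams at 64x64 (C=32) and 32x32 (2C=64)
fuse = FusionModule(c=32)
h, l = fuse(torch.randn(1, 32, 64, 64), torch.randn(1, 64, 32, 32))
print(h.shape, l.shape)  # torch.Size([1, 32, 64, 64]) torch.Size([1, 64, 32, 32])
```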

1.2. Representation Head

HRNet Variants

(a) HRNetV1: The output is the representation from the high-resolution stream only. The other three representations are ignored.

(b) HRNetV2: Low-resolution representations are rescaled through bilinear upsampling without changing the number of channels to the high resolution, and the four representations are concatenated, followed by a 1×1 convolution to mix the four representations.

(c) HRNetV2p: Multi-level representations are constructed by downsampling the high-resolution representation output from HRNetV2 to multiple levels.

  • In this paper, HRNetV1 is used for human pose estimation, HRNetV2 is used for semantic segmentation, and HRNetV2p is used for object detection.
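The two newer heads, (b) and (c), can be sketched as follows. This is an illustrative shape-level sketch, not the official implementation: the HRNetV2 head bilinearly upsamples all streams to the highest resolution, concatenates them into a 15C-channel map, and mixes them with a 1×1 convolution; the HRNetV2p head then downsamples that output (average pooling here, as one plausible choice) into a multi-level pyramid.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def hrnetv2_head(feats, mix):
    """Sketch of the HRNetV2 head: upsample every stream to the highest
    resolution, concatenate, then mix with a 1x1 convolution.
    `feats` is a list of four tensors with C, 2C, 4C, 8C channels."""
    size = feats[0].shape[-2:]
    up = [feats[0]] + [F.interpolate(f, size=size, mode="bilinear",
                                     align_corners=False) for f in feats[1:]]
    return mix(torch.cat(up, dim=1))  # (N, 15C, H, W)

def hrnetv2p_head(rep, levels=4):
    """Sketch of HRNetV2p: pool the HRNetV2 output down to multiple
    levels, forming a feature pyramid (pooling choice is illustrative)."""
    return [F.avg_pool2d(rep, kernel_size=2 ** i) if i else rep
            for i in range(levels)]

# usage with illustrative width C = 32 on a 64x64 high-resolution map
c = 32
feats = [torch.randn(1, c * m, 64 // s, 64 // s)
         for m, s in [(1, 1), (2, 2), (4, 4), (8, 8)]]
mix = nn.Conv2d(15 * c, 15 * c, kernel_size=1)  # 15C = C + 2C + 4C + 8C
rep = hrnetv2_head(feats, mix)
pyramid = hrnetv2p_head(rep)
print(rep.shape, [p.shape[-1] for p in pyramid])
# torch.Size([1, 480, 64, 64]) [64, 32, 16, 8]
```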

2. Human Pose Estimation Results

Visual Results
COCO Val
COCO Test-Dev
  • MSE is used as the loss function.
  • HRNetV1 is used rather than HRNetV2, as its computation complexity is a little lower while the performance is similar.

HRNetV1 obtains the best results.
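The heatmap-based training mentioned above can be sketched in a few lines. All values here are illustrative: the network predicts one heatmap per keypoint, and the MSE loss compares it against a ground-truth Gaussian heatmap.

```python
import torch
import torch.nn.functional as F

# Sketch of heatmap regression for pose estimation: the model outputs
# one heatmap per keypoint; MSE is computed against ground-truth
# Gaussian heatmaps. Shapes are illustrative (17 COCO keypoints).
pred = torch.rand(1, 17, 64, 48)    # predicted keypoint heatmaps
target = torch.rand(1, 17, 64, 48)  # ground-truth Gaussian heatmaps
loss = F.mse_loss(pred, target)     # scalar training loss
print(loss.item() >= 0)             # True
```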

3. Semantic Segmentation Results

Visual Results
Semantic Segmentation
  • HRNetV2 is used, and the resulting 15C-dimensional representation (the concatenation of the C-, 2C-, 4C-, and 8C-channel streams) at each position is passed to a linear classifier with the softmax loss to predict the segmentation maps.
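A minimal sketch of this per-position linear classifier, with illustrative widths and class count: a 1×1 convolution maps the 15C-dimensional HRNetV2 representation to per-class logits, trained with softmax cross-entropy.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of the segmentation classifier: a 1x1 conv acts as a linear
# classifier at each position over the 15C-dim representation.
# Widths and class count are illustrative (e.g. W40, 19 classes).
c, num_classes = 40, 19
classifier = nn.Conv2d(15 * c, num_classes, kernel_size=1)
rep = torch.randn(1, 15 * c, 128, 256)   # HRNetV2 output representation
logits = classifier(rep)                 # (1, 19, 128, 256) per-pixel logits
# training applies softmax cross-entropy over the class dimension
labels = torch.zeros(1, 128, 256, dtype=torch.long)  # dummy ground truth
loss = F.cross_entropy(logits, labels)
print(logits.shape)  # torch.Size([1, 19, 128, 256])
```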

In Table 3, HRNetV2-W40 (40 indicates the width of the high-resolution convolution), with similar model size and much lower computation complexity than competing models, achieves better performance: gains of 4.7 points, 1.7 points, and about 0.5 points over them.

HRNetV2-W48, with similar model size and much lower computation complexity, achieves an even more significant improvement: gains of 5.6 points, 2.6 points, and about 1.4 points over competing models.

4. Object Detection Results

In the Faster R-CNN framework, the proposed networks perform better than ResNet and ResNeXt backbones of similar parameter and computation complexity: HRNetV2p-W32 vs. ResNet-101-FPN, HRNetV2p-W40 vs. ResNet-152-FPN, and HRNetV2p-W48 vs. X-101-64×4d-FPN.

In the other detection frameworks evaluated, HRNet also performs better.

In the cascade-style frameworks, HRNet gets the overall better performance.

5. Image Classification Results (2019 arXiv)

  • After obtaining the multi-resolution representations using the HRNet backbone, the head above is used for ImageNet classification.
ImageNet

In Table 14, using HRNet as the backbone is better than using ResNet. In Table 15, the proposed manner is superior to the two alternatives.
