Review — Lite-HRNet: A Lightweight High-Resolution Network

Proposes Lightweight HRNet Using Conditional Channel Weighting (CCW)

Sik-Ho Tsang
4 min read · May 9, 2023

Lite-HRNet: A Lightweight High-Resolution Network,
Lite-HRNet, by Huazhong University of Science and Technology, and Microsoft,
2021 CVPR, Over 130 Citations (Sik-Ho Tsang @ Medium)

Human Pose Estimation
2014 … 2018
[PersonLab] 2019 [HRNet / HRNetV1] 2021 [HRNetV2, HRNetV2p]
Semantic Segmentation
2014 … 2022
[PVTv2] [YOLACT++]
==== My Other Paper Readings Are Also Over Here ====

  • The heavily-used pointwise (1×1) convolutions in shuffle blocks become the computational bottleneck.
  • A lightweight unit, conditional channel weighting (CCW), is introduced to replace costly pointwise (1×1) convolutions in shuffle blocks.
  • The proposed solution learns the weights from all the channels and over the multiple resolutions that are readily available in the parallel branches of HRNet. It uses the weights as a bridge to exchange information across channels and resolutions, compensating for the role played by the pointwise (1×1) convolution.

Outline

  1. Lite-HRNet
  2. Results

1. Lite-HRNet

1.1. Backbone

Illustration of the Small HRNet architecture
  • The Small HRNet design, which has fewer layers and a smaller width, is used to form the proposed network.
Shuffle Block vs Proposed Conditional Channel Weighting Block

1.2. (a) Shuffle Block in ShuffleNet V2

  • The shuffle block first splits the channels into two partitions. One partition passes through a sequence of 1×1 convolution, 3×3 depthwise convolution, and 1×1 convolution, and the output is concatenated with the other partition. Finally, the concatenated channels are shuffled, as in the sketch below.

However, the 1×1 convolutions are costly: their complexity is quadratic in the number of channels, while the depthwise convolution is only linear.
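
Below is a minimal PyTorch sketch of the stride-1 shuffle block (my own illustrative module and names, not the authors' code):

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups=2):
    # Interleave channels from the two partitions.
    n, c, h, w = x.size()
    x = x.view(n, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

class ShuffleBlock(nn.Module):
    """Shuffle block of ShuffleNet V2 (stride-1 case)."""
    def __init__(self, channels):
        super().__init__()
        c = channels // 2
        self.branch = nn.Sequential(
            nn.Conv2d(c, c, 1, bias=False),                       # 1x1 conv (costly)
            nn.BatchNorm2d(c), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, padding=1, groups=c, bias=False),  # 3x3 depthwise
            nn.BatchNorm2d(c),
            nn.Conv2d(c, c, 1, bias=False),                       # 1x1 conv (costly)
            nn.BatchNorm2d(c), nn.ReLU(inplace=True),
        )
    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)                 # split channels into two partitions
        out = torch.cat((x1, self.branch(x2)), dim=1)
        return channel_shuffle(out)                # shuffle the concatenated channels
```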

1.3. (b) Conditional Channel Weighting (CCW) in Lite-HRNet

CCW replaces the costly 1×1 convolutions in the naive Lite-HRNet. As shown above, the module consists of two functions: the cross-resolution weighting function H and the spatial weighting function F.

1.3.1. H Function

  • The element-wise weighting operation for the s-th resolution branch is written as: Ys = Ws ⊙ Xs,
  • where Ws is the weight map for the s-th branch and ⊙ denotes element-wise multiplication. The weights are computed over all the parallel branches: (W1, W2, …, Ws) = Hs(X1, X2, …, Xs).
  • Adaptive average pooling (AAP) is applied to each higher-resolution input, X′i = AAP(Xi), pooling it down to the smallest resolution.
  • Then, {X′1, X′2, …, X′s−1} and Xs are concatenated together, followed by a sequence of 1×1 convolution, ReLU, 1×1 convolution, and sigmoid, producing the weight maps; each Wi is then upsampled back to the resolution of Xi.

Here, the weights at each position of each resolution depend on the channel features at the same position in the average-pooled multi-resolution channel maps. This is why it is called cross-resolution weight computation.

The weight maps serve as a bridge for information exchange across channels and resolutions.
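
A minimal PyTorch sketch of the cross-resolution weighting idea, assuming a placeholder reduction ratio and my own module names (not the authors' code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossResolutionWeighting(nn.Module):
    """H function: weights computed jointly over all parallel resolutions."""

    def __init__(self, branch_channels, reduction=8):
        super().__init__()
        total = sum(branch_channels)
        self.branch_channels = branch_channels
        self.conv1 = nn.Conv2d(total, total // reduction, kernel_size=1)
        self.conv2 = nn.Conv2d(total // reduction, total, kernel_size=1)

    def forward(self, xs):
        # Pool every branch down to the smallest (last) resolution.
        smallest = xs[-1].shape[-2:]
        pooled = [F.adaptive_avg_pool2d(x, smallest) for x in xs[:-1]] + [xs[-1]]
        # 1x1 conv -> ReLU -> 1x1 conv -> sigmoid over the concatenated maps.
        w = torch.cat(pooled, dim=1)
        w = torch.sigmoid(self.conv2(F.relu(self.conv1(w))))
        # Split per branch, upsample back, and weight element-wise: Ys = Ws * Xs.
        ws = torch.split(w, self.branch_channels, dim=1)
        return [x * F.interpolate(wi, size=x.shape[-2:], mode='nearest')
                for x, wi in zip(xs, ws)]
```

For example, with two branches of 40 and 80 channels, CrossResolutionWeighting([40, 80]) takes a list of two feature maps and returns the reweighted list.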

1.3.2. F Function

  • The weights depend on all the pixels of the input channels within a single resolution: ws = Fs(Xs).
  • The function Fs(·) is implemented as: Xs → GAP → FC → ReLU → FC → sigmoid → ws, where GAP is global average pooling.
  • Indeed, we can see that this is the squeeze-and-excitation (SE) module from SENet; a sketch follows below.
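
A minimal PyTorch sketch of the F function as an SE-style module, with an assumed reduction ratio (the paper's hyperparameters may differ):

```python
import torch.nn as nn

class SpatialWeighting(nn.Module):
    """F function: Xs -> GAP -> FC -> ReLU -> FC -> sigmoid -> ws (SE-style)."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                        # GAP
            nn.Conv2d(channels, channels // reduction, 1),  # FC as 1x1 conv
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),  # FC as 1x1 conv
            nn.Sigmoid(),
        )

    def forward(self, x):
        # ws broadcasts over the spatial dimensions: Ys = ws * Xs.
        return x * self.fc(x)
```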

1.4. Model Architecture & Complexity

Summary of Model Architecture & Complexity
  • Two variants are constructed: Lite-HRNet-18 and Lite-HRNet-30.
  • CCW is very lightweight, costing only 0.51M FLOPs.
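
As a rough back-of-the-envelope comparison (my own arithmetic, not from the paper): a 1×1 convolution over a C-channel H×W map costs about H·W·C² multiply-accumulates, while element-wise weighting costs only H·W·C, so replacing the 1×1 convolutions makes the cost linear rather than quadratic in the channel count:

```python
def conv1x1_macs(h, w, c_in, c_out):
    # Each of the h*w*c_out outputs sums over c_in inputs.
    return h * w * c_out * c_in

def elementwise_weighting_macs(h, w, c):
    # One multiply per feature-map element (the small convolutions that
    # produce the weights act on pooled maps, so their cost is minor).
    return h * w * c

# Illustrative numbers only, not the paper's configuration:
print(conv1x1_macs(64, 64, 40, 40))            # 6553600
print(elementwise_weighting_macs(64, 64, 40))  # 163840
```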

2. Results

2.1. Human Pose Estimation

COCO Val

For example, compared with ShuffleNet V2, Lite-HRNet-18 and Lite-HRNet-30 achieve gains of 4.9 and 7.3 points, respectively.

COCO Test-Dev, COCO Val, and MPII Val
  • On COCO Test-Dev, Lite-HRNet-30 outperforms Mask R-CNN, G-RMI, and Integral Pose Regression [38]. Although there is a performance gap with some large networks, the proposed networks require far fewer GFLOPs and parameters.

As in the figure, Lite-HRNet achieves a better balance between accuracy and computational complexity.

2.2. Semantic Segmentation

Cityscapes

Lite-HRNet-18 achieves 72.8% mIoU with only 1.95 GFLOPs, and Lite-HRNet-30 achieves 75.3% mIoU with 3.02 GFLOPs, outperforming hand-crafted and NAS-based methods and remaining comparable with SwiftNetRN-18 [32], which is far more computationally intensive (104 GFLOPs).
