Brief Review — Dynamic ReLU
DY-ReLU, ReLU But Dynamically Determined Based on Input
Dynamic ReLU
DY-ReLU, by Microsoft Corporation,
2020 ECCV, Over 110 Citations (Sik-Ho Tsang @ Medium)
- Dynamic ReLU (DY-ReLU) is proposed, which encodes the global context into a hyper function and adapts the piecewise linear activation function accordingly.
- Compared to its static counterpart, DY-ReLU has negligible extra computational cost but significantly more representation capability.
Outline
- Dynamic ReLU (DY-ReLU)
- Results
1. Dynamic ReLU (DY-ReLU)
- For a given input vector (or tensor) x, the dynamic activation is defined as a function fθ(x)(x) with learnable parameters θ(x), which adapt to the input x.
- As shown above, it includes two functions:
- Hyper function θ(x): computes the parameters of the activation function from the input x.
- Activation function fθ(x)(x): uses the parameters θ(x) to generate the activation for all channels.
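As a conceptual sketch (my own, with hypothetical hyper_fn and activation_fn placeholders rather than anything from the paper), the two pieces compose as follows:

```python
# Conceptual composition of a dynamic activation: the hyper function computes
# input-dependent parameters, and the activation function applies them to the
# same input. `hyper_fn` and `activation_fn` are illustrative placeholders.
def dynamic_activation(x, hyper_fn, activation_fn):
    theta = hyper_fn(x)             # theta(x): parameters depend on the input x
    return activation_fn(x, theta)  # f_theta(x)(x): activation uses those parameters
```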
1.1. Definitions
- Denote the traditional, static ReLU as y = max(x, 0).
- ReLU can be generalized to a parametric piecewise linear function for each channel c: y_c = max_{1≤k≤K}(a_c^k(x)·x_c + b_c^k(x)),
- where the coefficients (a_c^k, b_c^k) are the output of the hyper function θ(x),
- where K is the number of functions, and C is the number of channels.
- K=2 in this paper.
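As a toy illustration (not from the paper), fixing the K = 2 coefficients to (a^1, b^1) = (1, 0) and (a^2, b^2) = (0.2, 0) recovers a leaky-ReLU-like shape:

```python
import torch

# y_c = max_k(a_c^k * x_c + b_c^k) with K = 2 and fixed, input-independent coefficients.
x = torch.tensor([-2.0, -0.5, 0.0, 1.0, 3.0])
a = torch.tensor([1.0, 0.2])   # slopes of the two linear segments
b = torch.tensor([0.0, 0.0])   # intercepts of the two linear segments
y = (x.unsqueeze(-1) * a + b).max(dim=-1).values
print(y)  # tensor([-0.4000, -0.1000,  0.0000,  1.0000,  3.0000])
```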
1.2. Implementation of hyper function θ(x)
A lightweight network, similar to the SE module in SENet, is used to model the hyper function θ(x): global average pooling followed by two fully connected layers with a ReLU in between.
- The output has 2KC elements, corresponding to the residuals Δa_c^k and Δb_c^k.
- 2σ(x)-1 is used to normalize each residual to the range [-1, 1], where σ(x) denotes the sigmoid function. The final output is computed as the sum of the initialization and the residual: a_c^k(x) = α^k + λ_a·Δa_c^k(x) and b_c^k(x) = β^k + λ_b·Δb_c^k(x),
- where λ_a and λ_b are scalars controlling the range of the residuals (λ_a = 1.0 and λ_b = 0.5 are used), and α^k, β^k are the initial values (α^1 = 1, the rest 0).
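Putting the pieces together, below is a minimal PyTorch sketch of the channel-wise variant (DY-ReLU-B), assuming K = 2, an SE-style reduction ratio of 8, initialization α^1 = 1, α^2 = β^1 = β^2 = 0, and λ_a = 1.0, λ_b = 0.5; the class name DyReLUB and the exact layer layout are illustrative, not the authors' released code.

```python
import torch
import torch.nn as nn

class DyReLUB(nn.Module):
    """Sketch of channel-wise DY-ReLU (variant B) for NCHW feature maps."""

    def __init__(self, channels: int, reduction: int = 8, k: int = 2,
                 lambda_a: float = 1.0, lambda_b: float = 0.5):
        super().__init__()
        self.channels, self.k = channels, k
        self.lambda_a, self.lambda_b = lambda_a, lambda_b
        # SE-like hyper function: global average pooling -> FC -> ReLU -> FC
        self.hyper = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, 2 * k * channels),
        )
        # Initialization: a^1 = 1, remaining slopes and all intercepts = 0
        self.register_buffer("init_a", torch.tensor([1.0] + [0.0] * (k - 1)))
        self.register_buffer("init_b", torch.zeros(k))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, _, _ = x.shape
        context = x.mean(dim=(2, 3))                            # global average pooling
        residual = 2 * torch.sigmoid(self.hyper(context)) - 1   # normalize to [-1, 1]
        residual = residual.view(n, c, 2 * self.k)
        a = self.init_a + self.lambda_a * residual[..., :self.k]   # (N, C, K) slopes
        b = self.init_b + self.lambda_b * residual[..., self.k:]   # (N, C, K) intercepts
        # Piecewise linear activation: y = max_k(a_k * x + b_k), per channel
        y = x.unsqueeze(-1) * a.view(n, c, 1, 1, self.k) + b.view(n, c, 1, 1, self.k)
        return y.max(dim=-1).values
```

For example, DyReLUB(64) could be dropped in wherever a ReLU follows a 64-channel convolution.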
1.3. Relation to Prior Works
- Indeed, ReLU, Leaky ReLU, and PReLU are special cases of DY-ReLU, obtained by fixing the coefficients (a_c^k, b_c^k); a quick check is given below.
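For a quick sanity check (my own, with fixed coefficients and K = 2): the right choices of (a^1, b^1, a^2, b^2) reproduce ReLU and Leaky ReLU exactly, and PReLU is the same form with the second slope made learnable.

```python
import torch
import torch.nn.functional as F

def static_dy_relu(x, a1, b1, a2, b2):
    # DY-ReLU form with fixed (input-independent) coefficients and K = 2
    return torch.maximum(a1 * x + b1, a2 * x + b2)

x = torch.linspace(-3, 3, 7)
assert torch.allclose(static_dy_relu(x, 1.0, 0.0, 0.0, 0.0), torch.relu(x))           # ReLU
assert torch.allclose(static_dy_relu(x, 1.0, 0.0, 0.01, 0.0), F.leaky_relu(x, 0.01))  # Leaky ReLU
```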
1.4. DY-ReLU Variants
- DY-ReLU-A: the activation function is spatial- and channel-shared (one set of coefficients for the whole feature map).
- DY-ReLU-B: the activation function is spatial-shared and channel-wise (one set of coefficients per channel).
- DY-ReLU-C: the activation function is spatial-wise and channel-wise (per-channel coefficients, further modulated by a spatial attention map).
All three variants use the concept of the SE module in SENet.
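For comparison, a channel-shared variant (in the spirit of DY-ReLU-A) can be sketched by shrinking the hyper-function output from 2KC to 2K coefficients. This reuses the hypothetical DyReLUB class from the sketch above and is likewise only an illustration:

```python
import torch
import torch.nn as nn

class DyReLUA(DyReLUB):
    """Sketch of channel-shared DY-ReLU (variant A): one set of 2K coefficients."""

    def __init__(self, channels: int, reduction: int = 8, k: int = 2, **kwargs):
        super().__init__(channels, reduction, k, **kwargs)
        # Replace the last FC layer so the hyper function outputs 2K values in total
        self.hyper[-1] = nn.Linear(channels // reduction, 2 * k)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, _, _ = x.shape
        context = x.mean(dim=(2, 3))
        residual = 2 * torch.sigmoid(self.hyper(context)) - 1      # (N, 2K) in [-1, 1]
        a = self.init_a + self.lambda_a * residual[:, :self.k]     # (N, K) shared slopes
        b = self.init_b + self.lambda_b * residual[:, self.k:]     # (N, K) shared intercepts
        y = x.unsqueeze(-1) * a.view(n, 1, 1, 1, self.k) + b.view(n, 1, 1, 1, self.k)
        return y.max(dim=-1).values
```

DY-ReLU-C would additionally modulate the coefficients with a spatial attention map, which is omitted here.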
2. Results
2.1. Ablations
- Although all three variants achieve improvement over the baseline, the channel-wise DY-ReLUs (variants B and C) are clearly better than the channel-shared DY-ReLU (variant A).
Based on these ablations, DY-ReLU-B is used for ImageNet classification and DY-ReLU-C is used for COCO keypoint detection.
2.2. ImageNet Classification
- MobileNetV2 (×0.35 and ×1.0) is used, and ReLU is replaced with different activation functions from prior works.
The proposed method outperforms all prior works by a clear margin, including Maxout, which has significantly higher computational cost. This demonstrates that DY-ReLU not only has more representation capability but is also computationally efficient.
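As a rough sketch of this kind of drop-in replacement (my own, using torchvision's MobileNetV2 and the hypothetical DyReLUB class sketched earlier; channel counts are read from the BatchNorm2d that precedes each ReLU6, which matches MobileNetV2's Conv-BN-ReLU6 layout):

```python
import torch.nn as nn
from torchvision.models import mobilenet_v2

def swap_relu6_for_dyrelu(parent: nn.Module) -> None:
    """Recursively replace ReLU6 activations with DyReLUB (sketched earlier)."""
    last_channels = None
    for name, child in parent.named_children():
        if isinstance(child, nn.BatchNorm2d):
            last_channels = child.num_features        # channels feeding the next activation
        if isinstance(child, nn.ReLU6) and last_channels is not None:
            setattr(parent, name, DyReLUB(last_channels))
        else:
            swap_relu6_for_dyrelu(child)              # recurse into submodules

model = mobilenet_v2(width_mult=1.0)   # for the x0.35 model: mobilenet_v2(width_mult=0.35)
swap_relu6_for_dyrelu(model)
```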
Top: the input and output values of DY-ReLU are plotted at different blocks (from low level to high level) for the 50,000 validation images. Clearly, the learned DY-ReLU is dynamic over features, as the activation values (y) vary over a range (covered by the blue dots) for a given input x.
Bottom: the angle between the two segments of DY-ReLU (i.e., the slope difference |a_c^1 - a_c^2|) is analyzed. The activation functions tend to bend less at higher levels.
2.3. COCO Keypoint Estimation
- When using MobileNetV3 as the backbone, the Squeeze-and-Excitation (SE) module is removed, and either ReLU or h-swish is replaced by DY-ReLU.
DY-ReLU outperforms baselines by a clear margin.