Brief Review — Dynamic ReLU

DY-ReLU, ReLU But Dynamically Determined Based on Input

Aug 9, 2023

Dynamic ReLU
DY-ReLU, by Microsoft Corporation,
2020 ECCV, Over 110 Citations (Sik-Ho Tsang @ Medium)
Image Classification
1989 … 2023 [Vision Permutator (ViP)] [ConvMixer] [CrossFormer++]
Human Pose Estimation
2014 … 2018 [PersonLab] 2019 [OpenPose] [HRNet / HRNetV1] 2020 [A-HRNet] 2021 [HRNetV2, HRNetV2p] [Lite-HRNet]

Dynamic ReLU (DY-ReLU) is proposed, which encodes the global context into a hyper function and adapts the piecewise linear activation function accordingly.
Compared to its static counterpart, DY-ReLU has negligible extra computational cost, but significantly more representation capability.

Outline

Dynamic ReLU (DY-ReLU)
Results

1. Dynamic ReLU (DY-ReLU)

**Dynamic ReLU: ReLU, but the piecewise linear function is determined by the input x.**

For a given input vector (or tensor) x, the dynamic activation is defined as a function f_{θ(x)}(x) with learnable parameters θ(x), which adapt to the input x.
It consists of two functions:

Hyper function θ(x): computes the parameters for the activation function.
Activation function f_{θ(x)}(x): uses the parameters θ(x) to generate the activation for all channels.

1.1. Definitions

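Let the traditional (static) ReLU be y = max(x, 0).
ReLU can be generalized to a parametric piecewise linear function for each channel c:

$$y_c = \max_{1 \le k \le K} \{ a_c^k x_c + b_c^k \}$$

where the coefficients (a_c^k, b_c^k) are the output of a hyper function θ(x):

$$[a_1^1, \dots, a_C^1, \dots, a_1^K, \dots, a_C^K, b_1^1, \dots, b_C^1, \dots, b_1^K, \dots, b_C^K]^T = \theta(x)$$

where K is the number of linear functions, and C is the number of channels.
K=2 in this paper.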
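To make the definition concrete, here is a minimal NumPy sketch of the channel-wise maximum over K linear functions. The function name `piecewise_linear` and the toy values are my own choices for illustration, not from the paper. The coefficients are fixed here; in DY-ReLU they would come from the hyper function.

```python
import numpy as np

def piecewise_linear(x, a, b):
    """Channel-wise max over K linear functions.

    x: (C,) one value per channel
    a, b: (K, C) slopes and intercepts
    """
    # a * x + b broadcasts x over the K axis; max keeps the largest line per channel
    return np.max(a * x + b, axis=0)

C = 3
x = np.array([-2.0, 0.5, 3.0])
# K = 2 with slopes (1, 0) and intercepts (0, 0) reduces to max(x, 0), the static ReLU
a = np.stack([np.ones(C), np.zeros(C)])
b = np.zeros((2, C))
y = piecewise_linear(x, a, b)
```

With these particular coefficients the output matches the static ReLU, which is a handy sanity check before making the coefficients input-dependent.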

1.2. Implementation of hyper function θ(x)

A light-weight network, similar to the SE module in SENet, is used to model the hyper function.

The output has 2KC elements, corresponding to the residuals of a and b.
2σ(x)-1 is used to normalize each residual to between -1 and 1, where σ(x) denotes the sigmoid function. The final output is computed as the sum of the initialization and the residual:

$$a_c^k(x) = \alpha^k + \lambda_a \Delta a_c^k(x), \quad b_c^k(x) = \beta^k + \lambda_b \Delta b_c^k(x)$$

where α^k and β^k are the initialization values of a_c^k and b_c^k, and λ_a and λ_b are scalars that control the range of the residuals.
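The pipeline described here can be sketched in NumPy as follows. This is only a rough illustration under several assumptions of mine: a single (C, H, W) feature map, an SE-style squeeze (global average pooling followed by two fully connected layers with a ReLU in between), random untrained weights, a reduction ratio `R`, initialization values that correspond to plain ReLU, and scalar values `lambda_a=1.0`, `lambda_b=0.5`. All names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dy_relu_b(x, w1, b1, w2, b2, K=2, lambda_a=1.0, lambda_b=0.5):
    """x: (C, H, W) feature map; returns an activated map of the same shape."""
    C = x.shape[0]
    # Squeeze: global average pooling summarizes the global context as a (C,) vector
    context = x.mean(axis=(1, 2))
    # Two fully connected layers with a ReLU in between (SE-style, assumed)
    hidden = np.maximum(w1 @ context + b1, 0.0)
    out = w2 @ hidden + b2                     # (2*K*C,)
    # Normalize the residuals to the range (-1, 1)
    residual = 2.0 * sigmoid(out) - 1.0
    delta_a = residual[: K * C].reshape(K, C)
    delta_b = residual[K * C:].reshape(K, C)
    # Assumed initialization: first slope 1, everything else 0, i.e. plain ReLU
    alpha = np.zeros((K, 1))
    alpha[0] = 1.0
    beta = np.zeros((K, 1))
    a = alpha + lambda_a * delta_a             # (K, C)
    b = beta + lambda_b * delta_b              # (K, C)
    # Max over the K linear functions, coefficients shared across spatial positions
    return np.max(a[:, :, None, None] * x[None] + b[:, :, None, None], axis=0)

rng = np.random.default_rng(0)
C, H, W, R, K = 8, 4, 4, 4, 2
x = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // R, C)) * 0.1
b1 = np.zeros(C // R)
w2 = rng.standard_normal((2 * K * C, C // R)) * 0.1
b2 = np.zeros(2 * K * C)
y = dy_relu_b(x, w1, b1, w2, b2)
```

One property worth noting in this sketch: if the last layer outputs zeros, the residual 2σ(0)-1 is 0, so the activation falls back to its initialization, which here is the ordinary ReLU.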

1.3. Relation to Prior Works

Three special cases of DY-ReLU are equivalent to ReLU, Leaky ReLU and PReLU.
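A small sketch can show how these special cases fall out of the K=2 piecewise linear form with fixed coefficients. The function names are mine, and the Leaky ReLU and PReLU equivalence assumes the negative slope is at most 1, so that the maximum picks the right line on each side of zero.

```python
def dy_relu_scalar(x, a1, b1, a2, b2):
    # K = 2 piecewise linear function: the max of two lines
    return max(a1 * x + b1, a2 * x + b2)

def relu(x):
    # Slopes (1, 0), intercepts (0, 0)
    return dy_relu_scalar(x, 1.0, 0.0, 0.0, 0.0)

def leaky_relu(x, slope=0.01):
    # Second slope is a small fixed constant
    return dy_relu_scalar(x, 1.0, 0.0, slope, 0.0)

def prelu(x, a):
    # Second slope a would be a learned parameter (assumed a <= 1 here)
    return dy_relu_scalar(x, 1.0, 0.0, a, 0.0)
```

In all three cases the coefficients do not depend on the input. DY-ReLU differs in that the hyper function produces them from the input itself.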

1.4. DY-ReLU Variants

DY-ReLU-A: the activation function is spatial and channel-shared.
DY-ReLU-B: the activation function is spatial-shared and channel-wise.
DY-ReLU-C: the activation function is spatial and channel-wise.

All three variants use the concept of the SE module in SENet.
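The difference between the three variants comes down to how the coefficients are shared, which can be sketched through broadcasting shapes. This is only an illustration of the sharing pattern: the coefficients below are random stand-ins, whereas in DY-ReLU they would be produced by the hyper function, and the names are hypothetical.

```python
import numpy as np

def apply_piecewise(x, a, b):
    # x: (C, H, W); a, b: broadcastable to (K, C, H, W)
    return np.max(a * x[None] + b, axis=0)

K, C, H, W = 2, 3, 4, 4
rng = np.random.default_rng(0)
x = rng.standard_normal((C, H, W))

# Coefficient shapes implied by each sharing pattern
shapes = {
    "A": (K, 1, 1, 1),  # shared over channels and spatial positions
    "B": (K, C, 1, 1),  # one set per channel, shared over spatial positions
    "C": (K, C, H, W),  # one set per channel and per spatial position
}
outputs = {}
for name, shape in shapes.items():
    a = rng.standard_normal(shape)  # random stand-in for hyper function output
    b = rng.standard_normal(shape)
    outputs[name] = apply_piecewise(x, a, b)
```

Each variant returns a map with the same shape as the input. Only the number of distinct activation functions applied across channels and positions changes.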

2. Results

2.1. Ablations

Although all three variants improve over the baseline, the channel-wise DY-ReLUs (variants B and C) are clearly better than the channel-shared DY-ReLU (variant A).

Based on these ablations, DY-ReLU-B is used for ImageNet classification and DY-ReLU-C is used for COCO keypoint detection.

2.2. ImageNet Classification

MobileNetV2 (×0.35 and ×1.0) is used, and ReLU is replaced with different activation functions from prior work.

The proposed method outperforms all prior work by a clear margin, including Maxout, which has significantly more computational cost. This demonstrates that DY-ReLU not only has more representation capability, but is also computationally efficient.

Top: the input and output values of DY-ReLU at different blocks (from low level to high level) are plotted for 50,000 validation images. Clearly, the learnt DY-ReLU is dynamic over features, as the activation values (y) vary over a range (covered by the blue dots) for a given input x.
Bottom: the angle between the two segments in DY-ReLU (i.e. the slope difference |a_c^1 - a_c^2|) is analyzed. The activation functions tend to bend less at higher levels.

2.3. COCO Keypoint Estimation

When using MobileNetV3 as the backbone, the Squeeze-and-Excitation (SE) module from SENet is removed, and either ReLU or h-Swish is replaced by DY-ReLU.

DY-ReLU outperforms baselines by a clear margin.


Written by Sik-Ho Tsang
