Brief Review — Dynamic ReLU

DY-ReLU, ReLU But Dynamically Determined Based on Input

Sik-Ho Tsang
4 min read · Aug 9, 2023

Dynamic ReLU, by Microsoft Corporation
2020 ECCV, Over 110 Citations (Sik-Ho Tsang @ Medium)


  • Dynamic ReLU (DY-ReLU) is proposed, which encodes the global context into a hyper function and adapts the piecewise linear activation function accordingly.
  • Compared to its static counterpart, DY-ReLU has negligible extra computational cost, but significantly more representation capability.


  1. Dynamic ReLU (DY-ReLU)
  2. Results

1. Dynamic ReLU (DY-ReLU)

Dynamic ReLU: ReLU, but with the piecewise linear function determined by the input x.
  • For a given input vector (or tensor) x, the dynamic activation is defined as a function f_θ(x)(x) with learnable parameters θ(x), which adapt to the input x.
  • As shown above, it consists of two functions:
  1. Hyper function θ(x): computes the parameters of the activation function from the input.
  2. Activation function f_θ(x)(x): uses the parameters θ(x) to generate the activation for all channels.

1.1. Definitions

  • Let the traditional (static) ReLU be y = max(x, 0).
  • ReLU can be generalized to a parametric piecewise linear function for each channel c:
  • y_c = max_{1 ≤ k ≤ K} (a^k_c · x_c + b^k_c),
  • where the coefficients (a^k_c, b^k_c) are the output of the hyper function θ(x):
  • (a^k_c, b^k_c) = θ(x),
  • where K is the number of linear functions and C is the number of channels.
  • K = 2 in this paper.
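As a sanity check on the definition above, here is a minimal pure-Python sketch (my own, not the authors' code) of the piecewise-max activation for a flat list of channel values; with slopes (1, 0) and zero intercepts it reduces to the static ReLU:

```python
# Minimal sketch of the DY-ReLU activation: y_c = max_k (a_c^k * x_c + b_c^k).
# The input here is a plain list of C channel values rather than a feature map.

def dy_relu(x, a, b):
    """x: list of C channel values; a, b: K x C lists of slopes / intercepts."""
    K, C = len(a), len(x)
    return [max(a[k][c] * x[c] + b[k][c] for k in range(K)) for c in range(C)]

# With slopes (1, 0) and zero intercepts, DY-ReLU is exactly max(x, 0):
x = [-2.0, 3.0]
a = [[1.0, 1.0], [0.0, 0.0]]  # k=1: identity slope, k=2: zero slope
b = [[0.0, 0.0], [0.0, 0.0]]
print(dy_relu(x, a, b))  # [0.0, 3.0]
```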

1.2. Implementation of hyper function θ(x)

A lightweight network, similar to the SE module in SENet, is used to model the hyper function θ(x).

  • The output has 2KC elements, corresponding to the residuals of a and b.
  • 2σ(x) − 1 is used to normalize each residual to the range [−1, 1], where σ(x) denotes the sigmoid function. The final coefficients are computed as the sum of an initialization and the scaled residual:
  • a^k_c(x) = α^k + λ_a Δa^k_c(x), b^k_c(x) = β^k + λ_b Δb^k_c(x),
  • where λ_a and λ_b are scalars controlling the range of the residuals.
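A small numeric sketch of this output mapping (my own; the SE-style pooling and FC layers are omitted, and only the 2σ(z) − 1 normalization plus the initialization-and-residual sum are shown, assuming the paper's defaults α¹ = 1, β = 0, λ_a = 1.0, λ_b = 0.5):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def final_coefficient(raw_residual, init, lam):
    """alpha/beta + lambda * (2*sigma(z) - 1): residual squashed into [-lam, lam]."""
    return init + lam * (2.0 * sigmoid(raw_residual) - 1.0)

# A raw residual of 0 leaves each coefficient at its initialization,
# so an untrained hyper function starts out close to a static activation:
a1 = final_coefficient(0.0, init=1.0, lam=1.0)  # first-segment slope stays 1.0
b1 = final_coefficient(0.0, init=0.0, lam=0.5)  # intercept stays 0.0
print(a1, b1)  # 1.0 0.0
```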

1.3. Relation to Prior Works

The paper's comparison table shows that prior activation functions such as ReLU, Leaky ReLU, PReLU, SE, and Maxout can be viewed as special or related cases of DY-ReLU.

1.4. DY-ReLU Variants

  • DY-ReLU-A: the activation function is spatial and channel-shared.
  • DY-ReLU-B: the activation function is spatial-shared and channel-wise.
  • DY-ReLU-C: the activation function is spatial and channel-wise.

All three variants use the SE-module concept from SENet.
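To make the distinction between the variants concrete, here is a rough shape-bookkeeping sketch (my own, assuming K = 2 and a C-channel, H×W feature map; note that for variant C the paper predicts channel-wise coefficients plus a spatial attention branch, rather than the naive element-wise count shown here):

```python
# Rough bookkeeping (not the paper's code) of how many coefficients the hyper
# function would output for a C-channel, H x W feature map under each variant.
def theta_output_size(variant, K, C, H, W):
    if variant == "A":              # spatial- and channel-shared: one coefficient set
        return 2 * K
    if variant == "B":              # spatial-shared, channel-wise
        return 2 * K * C
    if variant == "C":              # spatial- and channel-wise (naive element-wise
        return 2 * K * C * H * W    # count; the paper uses spatial attention instead)
    raise ValueError(f"unknown variant: {variant}")

print(theta_output_size("B", K=2, C=64, H=7, W=7))  # 256
```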

2. Results

2.1. Ablations

  • Although all three variations achieve improvement from the baseline, channel-wise DY-ReLUs (variation B and C) are clearly better than the channel-shared DY-ReLU (variation A).

Based on these ablations, DY-ReLU-B is used for ImageNet classification and DY-ReLU-C for COCO keypoint detection.

2.2. ImageNet Classification

  • MobileNetV2 (×0.35 and ×1.0) is used as the backbone, with ReLU replaced by different activation functions from prior works.

The proposed method outperforms all prior works by a clear margin, including Maxout, which has significantly higher computational cost. This demonstrates that DY-ReLU not only has greater representation capability but is also computationally efficient.

Inspecting DY-ReLU: Is It Dynamic?

Top: plots the input and output values of DY-ReLU at different blocks (from low level to high level) over the 50,000 validation images. Clearly, the learnt DY-ReLU is dynamic over features, as the activation values y vary in a range (covered by the blue dots) for a given input x.

Bottom: analyzes the angle between the two segments of DY-ReLU (i.e., the slope difference |a¹_c − a²_c|). The activation functions tend to bend less at higher levels.

2.3. COCO Keypoint Estimation

  • When MobileNetV3 is used as the backbone, its Squeeze-and-Excitation (SE) modules are removed, and ReLU or h-swish is replaced by DY-ReLU.

DY-ReLU outperforms baselines by a clear margin.


