# Brief Review — Dynamic ReLU

## DY-ReLU, ReLU But Dynamically Determined Based on Input

Dynamic ReLU (DY-ReLU), by Microsoft Corporation
2020 ECCV, Over 110 Citations (Sik-Ho Tsang @ Medium)


• Dynamic ReLU (DY-ReLU) is proposed, which encodes the global context into a hyper function and adapts the piecewise linear activation function accordingly.
• Compared to its static counterpart, DY-ReLU has negligible extra computational cost, but significantly more representation capability.

# Outline

1. Dynamic ReLU (DY-ReLU)
2. Results

# 1. Dynamic ReLU (DY-ReLU)

• For a given input vector (or tensor) x, the dynamic activation is defined as a function fθ(x)(x) with learnable parameters θ(x), which adapt to the input x.
• As shown above, it includes two functions:
1. Hyper function θ(x): computes the parameters for the activation function.
2. Activation function fθ(x)(x): uses the parameters θ(x) to generate the activation for all channels.

## 1.1. Definitions

• Denote the traditional, static ReLU as y = max(x, 0).
• ReLU can be generalized to a parametric piecewise linear function for each channel c:
  y_c = max_{1≤k≤K} (a^k_c·x_c + b^k_c)
• where the coefficients (a^k_c, b^k_c) are the output of the hyper function θ(x), K is the number of linear functions, and C is the number of channels.
• K = 2 in this paper.
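Concretely, the per-channel maximum over K linear branches can be sketched as below. This is a minimal NumPy sketch; the function name `dy_relu` and the toy coefficients are mine, not from the paper's code.

```python
import numpy as np

def dy_relu(x, a, b):
    """Piecewise linear activation y_c = max_k (a_c^k * x_c + b_c^k).

    x: (C,) per-channel input; a, b: (K, C) input-dependent coefficients.
    """
    # Broadcast (K, C) * (C,) + (K, C) -> (K, C), then take the max over
    # the K linear branches for each channel.
    return np.max(a * x + b, axis=0)

# With K=2, a^1 = 1, a^2 = 0 and b^1 = b^2 = 0, this reduces to plain ReLU.
x = np.array([-2.0, 3.0])
a = np.array([[1.0, 1.0], [0.0, 0.0]])
b = np.zeros((2, 2))
dy_relu(x, a, b)  # -> [0., 3.]
```

Note that the coefficients here are per-sample: in DY-ReLU they are produced by the hyper function from the input itself, which is what makes the activation dynamic.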

## 1.2. Implementation of hyper function θ(x)

A light-weight network, similar to the SE module in SENet, is used to model the hyper function θ(x).

• The output has 2KC elements, corresponding to the residuals of a and b.
• 2σ(x) − 1 is used to normalize each residual to the range [−1, 1], where σ denotes the sigmoid function. The final output is computed as the sum of the initialization and the residual:
  a^k_c(x) = α^k + λ_a·Δa^k_c(x),  b^k_c(x) = β^k + λ_b·Δb^k_c(x)
• where λ_a and λ_b are scalars, and α^k, β^k are the initialization values (α^1 = 1, and the remaining α^k and β^k are 0).
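A minimal sketch of this SE-style hyper function, assuming two fully-connected layers with hypothetical weight matrices `W1`, `W2`, default scalars `lam_a`, `lam_b`, and the initialization α¹ = 1 with all other α, β set to 0 (a sketch of my reading, not the authors' implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hyper_fn(x, W1, W2, K=2, lam_a=1.0, lam_b=0.5):
    """SE-style hyper function sketch (W1, W2 are hypothetical weights).

    x: (C, H, W) feature map. Returns coefficients a, b of shape (K, C).
    """
    C = x.shape[0]
    ctx = x.mean(axis=(1, 2))        # global average pooling -> (C,)
    h = np.maximum(W1 @ ctx, 0.0)    # FC + ReLU (channel-reduction layer)
    out = W2 @ h                     # FC -> 2KC raw outputs
    res = 2.0 * sigmoid(out) - 1.0   # normalize residuals to [-1, 1]
    da = res[:K * C].reshape(K, C)
    db = res[K * C:].reshape(K, C)
    # Initialization: alpha^1 = 1, alpha^k = 0 for k > 1, beta^k = 0,
    # so at zero residual the activation starts as plain ReLU.
    alpha = np.zeros((K, C))
    alpha[0] = 1.0
    a = alpha + lam_a * da
    b = 0.0 + lam_b * db
    return a, b
```

Because the residuals are bounded in [−1, 1] and scaled by small λ, the learnt activation stays a controlled perturbation around standard ReLU.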

## 1.3. DY-ReLU Variants

• DY-ReLU-A: the activation function is spatial- and channel-shared.
• DY-ReLU-B: the activation function is spatial-shared and channel-wise.
• DY-ReLU-C: the activation function is spatial- and channel-wise.

All three variants use the concept of the SE module in SENet.
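With K = 2, one way to read the three variants is by the shape of the coefficient tensor each one produces for a C×H×W feature map; the shapes below are my interpretation, not taken from the paper's code.

```python
# Coefficient-tensor shapes per variant for a C x H x W feature map,
# with K linear branches and 2K coefficients (a and b) per activation.
K, C, H, W = 2, 64, 56, 56
coef_shapes = {
    "DY-ReLU-A": (2 * K,),           # one activation shared by all positions/channels
    "DY-ReLU-B": (2 * K, C),         # per-channel, shared across spatial positions
    "DY-ReLU-C": (2 * K, C, H, W),   # per-channel and per-spatial-position
}
```

This also shows why variant A is the cheapest and variant C the most expressive: the hyper function must output 2K, 2KC, or 2KCHW values respectively.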

# 2. Results

## 2.1. Ablations

• Although all three variations achieve improvement from the baseline, channel-wise DY-ReLUs (variation B and C) are clearly better than the channel-shared DY-ReLU (variation A).

Based on these ablations, DY-ReLU-B is used for ImageNet classification and DY-ReLU-C is used for COCO keypoint detection.

## 2.2. ImageNet Classification

• MobileNetV2 (×0.35 and ×1.0) is used, and ReLU is replaced with the different activation functions from prior work.

The proposed method outperforms all prior work with a clear margin, including Maxout that has significantly more computational cost. This demonstrates that DY-ReLU not only has more representation capability, but also is computationally efficient.

Top: the input and output values of DY-ReLU at different blocks (from low level to high level) are plotted for 50,000 validation images. Clearly, the learnt DY-ReLU is dynamic over features: for a given input x, the activation values y vary within a range (covered by the blue dots).

Bottom: the angle between the two segments of DY-ReLU (i.e., the slope difference |a¹_c − a²_c|) is analyzed. The activation functions tend to bend less at higher levels.

## 2.3. COCO Keypoint Estimation

• When using MobileNetV3 as the backbone, the Squeeze-and-Excitation (SE) module is removed, and either ReLU or h-Swish is replaced by DY-ReLU.

DY-ReLU outperforms baselines by a clear margin.
