# Brief Review — Dynamic ReLU

## DY-ReLU, ReLU But Dynamically Determined Based on Input

Dynamic ReLU, by Microsoft Corporation

DY-ReLU, 2020 ECCV, Over 110 Citations (Sik-Ho Tsang @ Medium)


**Dynamic ReLU (DY-ReLU)** is proposed, which **encodes the global context into the hyper function** and **adapts the piecewise linear activation function accordingly.**

- Compared to its static counterpart, DY-ReLU has **negligible extra computational cost** but **significantly more representation capability.**

# Outline

1. **Dynamic ReLU (DY-ReLU)**
2. **Results**

# 1. Dynamic ReLU (DY-ReLU)

- For a given input vector (or tensor) *x*, the dynamic activation is defined as a function *fθ(x)*(*x*) with learnable parameters *θ*(*x*), which adapt to the input *x*.
- As shown above, it includes two functions:
  - **Hyper function** *θ*(*x*): computes the parameters for the activation function.
  - **Activation function** *fθ(x)*(*x*): uses the parameters *θ*(*x*) to generate activations for all channels.

## 1.1. Definitions

- Let the traditional or static **ReLU** be *y* = max(*x*, 0). ReLU can be **generalized to a parametric piecewise linear function for each channel** *c*:

`y_c = max_{1≤k≤K} (a_k^c(x) · x_c + b_k^c(x))`

- where the **coefficients** (*a_k^c*, *b_k^c*) are **the output of a hyper function** *θ*(*x*):

`(a_k^c(x), b_k^c(x)) = θ(x)`

- where *K* is the **number of functions** and *C* is the **number of channels**. *K* = 2 in this paper.

## 1.2. Implementation of Hyper Function *θ*(*x*)

A light-weight network is used to model the hyper function *θ*(*x*), similar to the SE module in SENet (shown below later).

- The output has 2*KC* elements, corresponding to the residuals of *a* and *b*. 2*σ*(*x*) − 1 is used to **normalize** each residual to between −1 and 1, where *σ*(*x*) denotes the sigmoid function. The **final output** is computed as the sum of initialization and residual as follows:

`a_k^c(x) = α_k + λ_a · Δa_k^c(x)`, `b_k^c(x) = β_k + λ_b · Δb_k^c(x)`

- where *λa* and *λb* are scalars.
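A minimal sketch of such an SE-style hyper function under these definitions (global average pooling followed by two dense layers; the layer sizes, random weights, and λ values here are illustrative assumptions, not the paper's exact configuration):

```python
import numpy as np

rng = np.random.default_rng(0)
C, K, reduction = 8, 2, 4
lam_a, lam_b = 1.0, 0.5            # scaling scalars (lambda_a, lambda_b) -- assumed values
alpha = np.array([1.0, 0.0])       # slope initialization (ReLU-like: a_1 = 1, a_2 = 0)
beta = np.array([0.0, 0.0])        # intercept initialization

W1 = rng.normal(scale=0.1, size=(C, C // reduction))
W2 = rng.normal(scale=0.1, size=(C // reduction, 2 * K * C))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hyper_theta(x):
    """x: feature map of shape (C, H, W) -> coefficients a, b of shape (K, C)."""
    s = x.mean(axis=(1, 2))              # global average pooling
    h = np.maximum(s @ W1, 0.0)          # FC -> ReLU (squeeze)
    r = 2.0 * sigmoid(h @ W2) - 1.0      # 2*sigma(.) - 1: residuals in [-1, 1]
    da = r[: K * C].reshape(K, C)        # residual of slopes
    db = r[K * C:].reshape(K, C)         # residual of intercepts
    a = alpha[:, None] + lam_a * da      # initialization + scaled residual
    b = beta[:, None] + lam_b * db
    return a, b

x = rng.normal(size=(C, 5, 5))
a, b = hyper_theta(x)
print(a.shape, b.shape)  # (2, 8) (2, 8)
```

Because the residuals are squashed into [−1, 1], each slope stays within *λa* of its initialization, which keeps the dynamic activation close to ReLU early in training.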

## 1.3. Relation to Prior Works

- Indeed, the three special cases of DY-ReLU are equivalent to ReLU, Leaky ReLU and PReLU.
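Concretely, with *K* = 2 the coefficients can be fixed to recover each static activation (a small sketch; the 0.01 slope for Leaky ReLU is just the common default, not a value from the paper):

```python
import numpy as np

def piecewise_max(x, a1, b1, a2, b2):
    # y = max(a1*x + b1, a2*x + b2), applied element-wise
    return np.maximum(a1 * x + b1, a2 * x + b2)

x = np.array([-4.0, 2.0])

# a1=1, b1=0, a2=b2=0        -> ReLU: max(x, 0)
print(piecewise_max(x, 1.0, 0.0, 0.0, 0.0))   # [0. 2.]

# a1=1, a2=0.01, b1=b2=0     -> Leaky ReLU: max(x, 0.01*x)
print(piecewise_max(x, 1.0, 0.0, 0.01, 0.0))  # [-0.04  2.  ]

# a1=1, a2 learned per channel, b1=b2=0 -> PReLU (a2 is a trainable parameter)
```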

## 1.4. DY-ReLU Variants

- **DY-ReLU-A**: the activation function is spatial- and channel-shared.
- **DY-ReLU-B**: the activation function is spatial-shared and channel-wise.
- **DY-ReLU-C**: the activation function is spatial- and channel-wise.

All three variants use the concept of the SE module in SENet.
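The difference between the variants is only where the coefficients are shared. A small sketch of the resulting coefficient counts for a (C, H, W) feature map (the shapes are my own summary of the variant definitions above):

```python
# Coefficient counts for a (C, H, W) feature map with K linear pieces:
#   DY-ReLU-A: spatial- and channel-shared  -> 2K        one (a, b) set for everything
#   DY-ReLU-B: spatial-shared, channel-wise -> 2KC       one set per channel
#   DY-ReLU-C: spatial- and channel-wise    -> 2KCHW     one set per spatial element
C, H, W, K = 16, 7, 7, 2
print(2 * K)              # 4     coefficients for variant A
print(2 * K * C)          # 64    coefficients for variant B
print(2 * K * C * H * W)  # 3136  coefficients for variant C
```

Variant C therefore needs far more hyper-function outputs, which is why the paper attaches a spatial attention mechanism to keep it tractable.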

# 2. Results

## 2.1. Ablations

- Although all three variants improve over the baseline, **channel-wise DY-ReLUs (variants B and C) are clearly better than the channel-shared DY-ReLU (variant A).**

Based on these ablations, DY-ReLU-B is used for ImageNet classification and DY-ReLU-C is used for COCO keypoint detection.

## 2.2. ImageNet Classification

**MobileNetV2** (×0.35 and ×1.0) **is used**, and ReLU is replaced with different activation functions from prior work.

The proposed method outperforms all prior work by a clear margin, including Maxout, which has significantly more computational cost. This demonstrates that DY-ReLU not only has more representation capability but is also computationally efficient.

- Top: plots the input and output values of DY-ReLU at different blocks (from low level to high level) for 50,000 validation images. Clearly, the learnt DY-ReLU is dynamic over features, as the activation values (*y*) vary in a range (covered by the blue dots) for a given input *x*.
- Bottom: analyzes the angle between the two segments in DY-ReLU (i.e., the slope difference |*a1c* − *a2c*|). The activation functions tend to have lower bending at higher levels.

## 2.3. COCO Keypoint Estimation

- When using MobileNetV3 as the backbone, the Squeeze-and-Excitation (SE) module is removed, and either ReLU or h-Swish is replaced by DY-ReLU.

DY-ReLU outperforms the baselines by a clear margin.