Review — Mish: A Self Regularized Non-Monotonic Activation Function

Mish Outperformed Leaky ReLU on MS-COCO Object Detection With a CSP-DarkNet-53 Backbone

Sik-Ho Tsang
5 min read · Apr 26, 2022

(a) Graph of Mish, ReLU, SoftPlus, and Swish activation functions; (b) The 1st and 2nd derivatives of Mish and Swish activation functions

Mish: A Self Regularized Non-Monotonic Activation Function
Mish, by Landskape, KIIT, Bhubaneswar
2020 BMVC, Over 500 Citations (Sik-Ho Tsang @ Medium)
Image Classification, Object Detection, Activation Function

  • Mish, a novel self-regularized non-monotonic activation function, is proposed, which can be mathematically defined as (a quick implementation sketch follows this list):
  • f(x) = x·tanh(softplus(x)), where softplus(x) = ln(1+e^x)
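For concreteness, here is a minimal sketch of Mish written directly from this definition. PyTorch is my assumption here, not the paper's reference code; recent PyTorch releases also ship a built-in torch.nn.Mish.

```python
# Minimal sketch of Mish, directly following f(x) = x * tanh(softplus(x)).
# PyTorch is assumed for illustration; this is not the paper's reference implementation.
import torch
import torch.nn.functional as F

class Mish(torch.nn.Module):
    def forward(self, x):
        # softplus(x) = ln(1 + e^x); tanh of it acts as a smooth gate on x
        return x * torch.tanh(F.softplus(x))

x = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0])
print(Mish()(x))  # small negative outputs for negative inputs; Mish(0) = 0
```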

Outline

  1. Conventional Activation Functions
  2. Mish
  3. Experimental Results

1. Conventional Activation Functions

1.1. Sigmoid and Tanh

  • Sigmoid and Tanh activation functions were extensively used, but they subsequently became ineffective in deep neural networks because their gradients vanish at both tails (a small numeric sketch follows).
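A tiny numeric check of this saturation effect (my own illustration, not from the paper):

```python
# Sigmoid saturates: its derivative sigma(x) * (1 - sigma(x)) is ~0 far from the origin,
# so almost no gradient flows back through saturated units (the same holds for tanh).
import torch

x = torch.tensor([-10.0, -5.0, 0.0, 5.0, 10.0], requires_grad=True)
torch.sigmoid(x).sum().backward()
print(x.grad)  # roughly [4.5e-05, 6.6e-03, 2.5e-01, 6.6e-03, 4.5e-05]
```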

1.2. ReLU

  • A less probability inspired, unsaturated piece-wise linear activation known as Rectified Linear Unit (ReLU) became more relevant and showed better generalization and improved speed of convergence compared to Sigmoid and Tanh.
  • But ReLU has a well-known problem, Dying ReLU, in which gradient information is lost because negative inputs are collapsed to zero (demonstrated in the sketch below).
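The sketch below (again my own, assuming PyTorch) shows the mechanism: every negative pre-activation contributes exactly zero gradient.

```python
# Dying ReLU in one line: gradients are exactly zero for all negative inputs,
# so a unit that only receives negative pre-activations stops learning entirely.
import torch

x = torch.tensor([-3.0, -0.5, 0.5, 3.0], requires_grad=True)
torch.relu(x).sum().backward()
print(x.grad)  # tensor([0., 0., 1., 1.])
```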

1.3. Leaky ReLU, ELU, SELU, Swish

  • Over the years, many activation functions, such as Leaky ReLU, ELU, SELU, and Swish, have been proposed, which improve performance and address the shortcomings of ReLU.
  • The smooth, continuous profile of Swish proved essential for better information propagation compared to ReLU.

In this paper, inspired by the self-gating property of Swish, Mish is proposed.
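Self-gating means the input is multiplied by a gate computed from the input itself. A small sketch of the idea (mine, for illustration, assuming PyTorch):

```python
# Swish gates x with sigmoid(x); Mish keeps the same x * gate(x) structure
# but uses tanh(softplus(x)) as the gate instead.
import torch
import torch.nn.functional as F

def swish(x):
    return x * torch.sigmoid(x)

def mish(x):
    return x * torch.tanh(F.softplus(x))

x = torch.linspace(-4.0, 4.0, 9)
print(swish(x))
print(mish(x))  # the two curves stay close over this range
```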

2. Mish

(a) Graph of Mish, Swish, and similar validated experimental functions; (b) Training curve of a six-layered CNN on CIFAR-10
  • While Swish was found by Neural Architecture Search (NAS), the design of Mish, although influenced by Swish, was found by systematic analysis and experimentation over the characteristics that made Swish so effective.
  • (a): Similar gated functions are studied, namely arctan(x)·softplus(x), tanh(x)·softplus(x), x·log(1+arctan(e^x)) and x·log(1+tanh(e^x)), where softplus(x) = ln(1+e^x) (compared in the sketch after this list).
  • (b): Mish performed better than the other activation functions.
  • It is found that although x·log(1+tanh(e^x)) performed on par with Mish, its training is often unstable.
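A sketch (my own, assuming PyTorch) that evaluates these experimental variants next to Mish; all are built from the same gated-softplus template:

```python
# Evaluate the candidate functions from the ablation alongside Mish.
# All of them combine x (or a squashing of x) with softplus(x) = ln(1 + e^x).
import torch
from torch.nn.functional import softplus

candidates = {
    "mish: x*tanh(softplus(x))": lambda x: x * torch.tanh(softplus(x)),
    "arctan(x)*softplus(x)":     lambda x: torch.atan(x) * softplus(x),
    "tanh(x)*softplus(x)":       lambda x: torch.tanh(x) * softplus(x),
    "x*log(1+arctan(e^x))":      lambda x: x * torch.log1p(torch.atan(torch.exp(x))),
    "x*log(1+tanh(e^x))":        lambda x: x * torch.log1p(torch.tanh(torch.exp(x))),  # reported as unstable in training
}

x = torch.linspace(-3.0, 3.0, 7)
for name, f in candidates.items():
    print(f"{name:<28} {f(x)}")
```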

Mish is a smooth, continuous, self regularized, non-monotonic activation function:

  • Mish is bounded below at around -0.31 and unbounded above (checked numerically in the sketch below).

Due to the preservation of a small amount of negative information, Mish eliminates by design the preconditions necessary for the Dying ReLU phenomenon.

  • Being unbounded above, Mish avoids saturation, which generally causes training to slow down.
  • Also, unlike ReLU, Mish is continuously differentiable, a property that is preferable because it avoids singularities.
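These properties are easy to verify numerically; a quick sanity check (my own, assuming PyTorch):

```python
# Numerically verify two of the stated properties of Mish:
# (1) it is bounded below at roughly -0.31, and (2) its derivative is continuous at 0
# (unlike ReLU, whose derivative jumps from 0 to 1).
import torch
import torch.nn.functional as F

def mish(x):
    return x * torch.tanh(F.softplus(x))

x = torch.linspace(-6.0, 6.0, 100001, requires_grad=True)
y = mish(x)
print(y.min().item())        # ~ -0.3088, attained near x ~ -1.19

grad, = torch.autograd.grad(y.sum(), x)
print(grad[49998:50003])     # smoothly varying values around x = 0 (Mish'(0) = 0.6)
```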

3. Experimental Results

3.1. Ablation Study on CIFAR-10 and MNIST

(a) Increasing depth of the neural network on the MNIST dataset; (b) increasing input Gaussian noise; (c) different weight initializations
  • (a): Post fifteen layers, there was a sharp decrease in accuracy for both Swish and ReLU, while Mish maintained a significantly higher accuracy in large models where optimization becomes difficult.
  • (b): Consistently better loss is observed with varying intensity of input Gaussian noise with Mish as compared to ReLU and Swish.
  • (c): A consistent positive difference is observed in the performance of Mish compared to Swish while using different weight initializers.

3.2. CIFAR-10

Comparison between Mish, Swish, and ReLU activation functions based on test accuracy on image classification of CIFAR-10 across various network architectures

The Mish activation function consistently outperforms the ReLU and Swish activation functions across all the standard architectures used in the experiment, with Mish often providing a 1% to 3% performance improvement over the baseline ReLU-enabled network architectures.

3.3. ImageNet

Comparison between Mish, Swish, ReLU and Leaky ReLU activation functions on image classification on the ImageNet-1k dataset

Mish consistently outperforms the default Leaky ReLU/ReLU on all four network architectures, with a 1% increase in Top-1 accuracy over Leaky ReLU on the CSP-ResNet-50 architecture, although Swish provides a marginally stronger result than Mish on one of the architectures.

3.4. MS-COCO Object Detection

Comparison between ReLU and Mish activation functions on object detection on MS-COCO dataset

Simply replacing ReLU with Mish in the backbone improved the mAP@0.5 for CSP-DarkNet-53 and CSP-DarkNet-53++ by 0.4%.

Comparison between Leaky ReLU and Mish activation functions on object detection on MS-COCO 2017 dataset with a test image size of 736 × 736

Using Mish, a consistent 0.9% to 2.1% improvement is observed in AP50 (val) at a test size of 736.

3.5. Time Complexity

Comparison between the runtime for the forward and backward passes
  • In practical implementation, a threshold of 20 is enforced on Softplus, which makes training more stable and prevents gradient overflow (see the sketch after this list).
  • All runs were performed on an NVIDIA GeForce RTX-2070 GPU using standard benchmarking practices over 100 runs.
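The thresholding mirrors what PyTorch's softplus already exposes: above the threshold, softplus(x) ≈ x, so the implementation switches to the identity rather than computing log(1+e^x). A minimal sketch (mine, for illustration):

```python
# Thresholded Softplus: for x > threshold, softplus(x) ~ x, so the linear identity is used
# instead of log(1 + e^x). F.softplus exposes this via its `threshold` argument (default 20).
import torch
import torch.nn.functional as F

def mish(x):
    return x * torch.tanh(F.softplus(x, beta=1.0, threshold=20.0))

x = torch.tensor([-100.0, 0.0, 100.0])
print(mish(x))  # ~ [0., 0., 100.]; exp(100) alone overflows float32, so the untruncated
                # form can push inf/NaN into the backward pass
```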

A significant reduction in the computational overhead of Mish is obtained by using the optimized Mish-CUDA implementation.

It is surprising that Mish outperforms Swish even though the two functions seem very similar. Mish has since been adopted in later detectors such as YOLOv4.
