Reading: ERFNet — Efficient Residual Factorized ConvNet for Real-time Semantic Segmentation

Outperforms DilatedNet, DPN, FCN, DeepLabv1, ENet & SegNet; Similar Accuracy to SOTA (RefineNet & DeepLabv2) While Taking Only 24 ms Per Image on a Single GPU

Sik-Ho Tsang
4 min read · Oct 10, 2020
From Authors: https://www.youtube.com/watch?v=AbXzU9ZzqF4

In this story, “ERFNet: Efficient Residual Factorized ConvNet for Real-time Semantic Segmentation” (ERFNet), by University of Alcalá (UAH) and CSIRO-Data61, is briefly presented. In this paper:

  • A novel layer that uses residual connections and factorized convolutions is proposed, in order to remain efficient while retaining remarkable accuracy.
  • ERFNet is able to run at over 83 FPS on a single Titan X, and over 7 FPS on a Jetson TX1 (embedded GPU).

This is a paper in 2017 TITS (IEEE Transactions on Intelligent Transportation Systems) with over 300 citations and a high impact factor of 6.319. (Sik-Ho Tsang @ Medium)

Outline

  1. Non-Bottleneck-1D Block
  2. ERFNet: Network Architecture
  3. Experimental Results

1. Non-Bottleneck-1D Block

  • (a) & (b): These are the non-bottleneck and bottleneck residual blocks originally used in ResNet. (If interested, please feel free to read ResNet.)
  • (c): Factorization, which originated in Inception-v3, is applied to decompose the 3×3 convolutions of the original residual modules into 3×1 and 1×3 (1D) convolutions. While larger filters would benefit more from this decomposition, applying it to 3×3 convolutions already yields a 33% reduction in parameters (the 9 weights per channel pair of a 3×3 kernel become 3 + 3 = 6) and further increases computational efficiency. The proposed non-bottleneck-1D (non-bt-1D) block applies this factorization to the non-bottleneck residual block; a sketch is given after this list.
#ext-fm: external feature map, #int-fm: internal feature map, #weight: number of weights
  • The above table summarizes the total number of weights in the convolutions of each residual block.
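
As a concrete illustration, below is a minimal PyTorch sketch of the non-bt-1D block. The class name, the BatchNorm positions and the dropout placement are assumptions based on the paper's description, not the authors' released code.

```python
import torch
import torch.nn as nn

class NonBottleneck1D(nn.Module):
    """Non-bottleneck-1D residual block (sketch): each 3x3 convolution is
    factorized into a 3x1 followed by a 1x3 convolution."""
    def __init__(self, channels, dilation=1, dropout=0.0):
        super().__init__()
        # first 1D pair (non-dilated)
        self.conv3x1_1 = nn.Conv2d(channels, channels, (3, 1), padding=(1, 0))
        self.conv1x3_1 = nn.Conv2d(channels, channels, (1, 3), padding=(0, 1))
        self.bn1 = nn.BatchNorm2d(channels)
        # second 1D pair, optionally dilated to enlarge the receptive field
        self.conv3x1_2 = nn.Conv2d(channels, channels, (3, 1),
                                   padding=(dilation, 0), dilation=(dilation, 1))
        self.conv1x3_2 = nn.Conv2d(channels, channels, (1, 3),
                                   padding=(0, dilation), dilation=(1, dilation))
        self.bn2 = nn.BatchNorm2d(channels)
        self.dropout = nn.Dropout2d(dropout)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.conv3x1_1(x))
        out = self.relu(self.bn1(self.conv1x3_1(out)))
        out = self.relu(self.conv3x1_2(out))
        out = self.bn2(self.conv1x3_2(out))
        out = self.dropout(out)
        return self.relu(out + x)  # residual (identity) connection
```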

2. ERFNet: Network Architecture

ERFNet: Network Architecture
Details of ERFNet: Network Architecture
  • ERFNet is designed by sequentially stacking the proposed non-bt-1D layers in a way that best leverages their learning performance and efficiency.

2.1. Encoder

  • Layers 1 to 16 of the architecture form the encoder, composed of residual blocks and downsampling blocks.
  • Three downsamplings are performed at layers 1, 2 and 8.
Downsampler block, inspired by the ENet initial block
  • The downsampler block, inspired by the initial block of ENet, performs downsampling by concatenating the parallel outputs of a single 3×3 convolution with stride 2 and a max-pooling module (sketched after this list).
  • Dilated convolutions, which originated in DeepLab and DilatedNet, are also inserted in certain non-bt-1D layers to gather more context, which leads to an improvement in accuracy. In these layers, the second pair of 3×1 and 1×3 convolutions is dilated.
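
Under the same assumptions, here is a minimal sketch of the downsampler block: the convolution branch outputs (out_ch - in_ch) maps so that, after concatenation with the in_ch pooled maps, the block outputs out_ch maps.

```python
class DownsamplerBlock(nn.Module):
    """Downsampler block (sketch): a strided 3x3 convolution and a max-pooling
    run in parallel and their outputs are concatenated along the channels."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch - in_ch, 3, stride=2, padding=1)
        self.pool = nn.MaxPool2d(2, stride=2)
        self.bn = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # both branches halve the spatial resolution
        return self.relu(self.bn(torch.cat([self.conv(x), self.pool(x)], dim=1)))
```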

2.2. Decoder

  • The decoder is composed of layers 17 to 23.
  • ERFNet follows a similar strategy to ENet in having a small decoder.
  • Deconvolution (transposed convolution) layers with stride 2 are used for upsampling; a sketch assembling the full encoder and decoder is given below.
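
Putting the pieces together, a hypothetical assembly that follows the layer table of the paper (channel widths 16, 64 and 128; encoder dilation rates 2, 4, 8, 16 repeated twice; the dropout values are assumptions):

```python
class ERFNetSketch(nn.Module):
    """Sketch of the full network: layers 1-16 encode, layers 17-23 decode."""
    def __init__(self, num_classes=20):
        super().__init__()
        enc = [DownsamplerBlock(3, 16),                      # layer 1
               DownsamplerBlock(16, 64)]                     # layer 2
        enc += [NonBottleneck1D(64, dropout=0.03)            # layers 3-7
                for _ in range(5)]
        enc += [DownsamplerBlock(64, 128)]                   # layer 8
        for d in (2, 4, 8, 16) * 2:                          # layers 9-16 (dilated)
            enc.append(NonBottleneck1D(128, dilation=d, dropout=0.3))
        self.encoder = nn.Sequential(*enc)
        self.decoder = nn.Sequential(                        # layers 17-23
            nn.ConvTranspose2d(128, 64, 3, stride=2, padding=1, output_padding=1),
            NonBottleneck1D(64), NonBottleneck1D(64),
            nn.ConvTranspose2d(64, 16, 3, stride=2, padding=1, output_padding=1),
            NonBottleneck1D(16), NonBottleneck1D(16),
            nn.ConvTranspose2d(16, num_classes, 2, stride=2),
        )

    def forward(self, x):
        # encoder reduces to 1/8 resolution; decoder restores full resolution
        return self.decoder(self.encoder(x))
```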

3. Experimental Results

3.1. Ablation Study for non-bt-1D Module

IoU on Cityscapes Validation (Val-IoU) & Training (Tr-IoU) Sets
  • Cityscapes Dataset, which contains a train set of 2975 images, a validation set of 500 images and a test set of 1525 images, is used for evaluation.
  • For a fair comparison in terms of total weights, the non-bottleneck designs receive four times fewer feature maps at the layer’s input.
  • The higher accuracy of the architectures with wider layers (those in the bottom half of the table) demonstrates that computing more feature maps per layer allows networks to better approximate the functions that they are trying to learn.
  • The proposed non-bottleneck-1D design is the most effective choice for increasing the capacity of a network with the lowest impact on efficiency.

3.2. SOTA Comparison

Cityscapes Test Set
  • ERFNet achieves 69.7% Class IoU and 87.3% Category IoU on the Cityscapes test set, accuracy similar to the state of the art (e.g. RefineNet, DeepLabv2), while taking only 24 ms per image on a single GPU, which makes it one of the fastest networks available.
  • ERFNet also outperforms DilatedNet, DPN, FCN, DeepLabv1, ENet and SegNet.

3.3. Inference Time

Inference Time on TX1 and Titan X
  • At 640×360, a resolution sufficient to recognize urban scenes accurately, ERFNet achieves over 83 FPS on a single Titan X and over 7 FPS on a Tegra TX1, an embedded GPU that uses less than 10 W at full load.
  • At 1024×512 (the aspect ratio used in the Cityscapes tests), ERFNet takes 24 ms (41 FPS) on a Titan X.
  • In summary, ERFNet is about as fast as the fastest networks (ENet and SQ) while being significantly more accurate. A sketch of how such timings might be measured is given after this list.
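
For reference, a minimal sketch of per-image latency/FPS measurement in PyTorch (ERFNetSketch is the hypothetical model sketched above; warm-up iterations and CUDA synchronization are needed for meaningful numbers):

```python
import time
import torch

model = ERFNetSketch(num_classes=20).cuda().eval()
x = torch.randn(1, 3, 512, 1024, device="cuda")  # Cityscapes-like aspect ratio

with torch.no_grad():
    for _ in range(10):                 # warm-up
        model(x)
    torch.cuda.synchronize()            # wait for queued GPU work to finish
    start = time.perf_counter()
    for _ in range(100):
        model(x)
    torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / 100

print(f"{elapsed * 1000:.1f} ms per image ({1 / elapsed:.1f} FPS)")
```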

3.4. Qualitative Examples of the Segmentation

Qualitative examples of the segmentation

Reference

[2017 TITS] [ERFNet]
ERFNet: Efficient Residual Factorized ConvNet for Real-time Semantic Segmentation

Semantic Segmentation

[FCN] [DeconvNet] [DeepLabv1 & DeepLabv2] [CRF-RNN] [SegNet] [DPN] [ENet] [ParseNet] [DilatedNet] [DRN] [RefineNet] [ERFNet] [GCN] [PSPNet] [DeepLabv3] [ResNet-38] [ResNet-DUC-HDC] [LC] [FC-DenseNet] [IDW-CNN] [DIS] [SDN] [DeepLabv3+] [DRRN Zhang JNCA’20]
