Reading: C3 — Concentrated-Comprehensive Convolution (Semantic Segmentation)

Similar or Improved mIOU Compared to ESPNet, ERFNet, DRN & ENet, with Smaller Model Sizes and Fewer FLOPs

Sik-Ho Tsang
3 min read · Oct 11, 2020

In this story, Concentrated-Comprehensive Convolution (C3), by Seoul National University and CLOVA AI Research, Naver Corp., is briefly reviewed. In this paper:

  • A new block called Concentrated-Comprehensive Convolution (C3) is proposed. It applies asymmetric convolutions before the depth-wise separable dilated convolution to compensate for the information loss caused by dilated convolution.
  • C3 is applied to ESPNet and achieves about 2% better performance while halving the number of parameters and reducing the number of FLOPs by 35% compared with the original ESPNet.

This is a 2019 arXiv paper. (Sik-Ho Tsang @ Medium)

Outline

  1. Concentrated-Comprehensive Convolution (C3)
  2. C3 Module
  3. Experimental Results

1. Concentrated-Comprehensive Convolution (C3)

Upper: Conventional, Bottom: Concentrated-Comprehensive Convolution (C3)
  • The complexity is further reduced by using two depth-wise asymmetric convolutions instead of a regular depth-wise convolution.
  • Also, PReLU and batch normalization are inserted between the asymmetric filters to add non-linearity.
  • After that, the cross-channel operation is executed with a 1×1 point-wise convolution.

In summary, the C3 block combines the advantages of both the depth-wise separable convolution and the dilated convolution.
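
To make the complexity argument concrete, here is a rough weight-count sketch in Python. It is not taken from the paper: the function names are hypothetical, biases and the PReLU / batch normalization parameters are ignored, and the length of the asymmetric kernels (k_conc) is an assumed parameter rather than the authors' setting.

```python
def standard_conv_params(c_in, c_out, k=3):
    """Weights of a regular k x k convolution (dilation does not change the count)."""
    return k * k * c_in * c_out


def ds_dilated_conv_params(c_in, c_out, k=3):
    """Depth-wise k x k dilated conv followed by a 1 x 1 point-wise conv."""
    return k * k * c_in + c_in * c_out


def c3_block_params(c_in, c_out, k=3, k_conc=3):
    """C3-style block: two depth-wise asymmetric convs (k_conc x 1 and 1 x k_conc),
    a depth-wise k x k dilated conv, then a 1 x 1 point-wise conv.
    k_conc is an assumption made for this sketch."""
    concentration = 2 * k_conc * c_in   # two 1-D depth-wise filters per channel
    comprehensive = k * k * c_in        # depth-wise dilated conv
    pointwise = c_in * c_out            # cross-channel 1 x 1 conv
    return concentration + comprehensive + pointwise


if __name__ == "__main__":
    for fn in (standard_conv_params, ds_dilated_conv_params, c3_block_params):
        print(fn.__name__, fn(64, 64))
```

With 64 input and output channels and 3×3 kernels, this gives 36,864 weights for the regular convolution, 4,672 for the depth-wise separable one and 5,056 for the C3-style block, so under these assumptions the added concentration stage is cheap relative to the point-wise convolution.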

2. C3 Module

Network structure of C3 and ESP module
  • In the ESP module, the feature maps are added one by one in a hierarchical way, i.e. hierarchical feature fusion (HFF), before concatenation.
  • In the C3 module, the feature maps are simply concatenated directly.
  • Also, a dilation rate of 1 is excluded in the C3 module.
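
The difference between the two fusion schemes can be sketched with toy data. This is a simplified illustration, not the authors' code: each branch output is a flat list of numbers standing in for a feature map, the function names are hypothetical, and the real ESP module may differ in which branches take part in the running sum.

```python
from itertools import accumulate


def hff_concat(branches):
    """ESP-style fusion (simplified): running element-wise sums of the branch
    outputs, followed by concatenation."""
    fused = accumulate(branches, lambda a, b: [x + y for x, y in zip(a, b)])
    return [v for feature_map in fused for v in feature_map]


def plain_concat(branches):
    """C3-style fusion: concatenate the branch outputs directly."""
    return [v for feature_map in branches for v in feature_map]


if __name__ == "__main__":
    toy_branches = [[1, 1], [2, 2], [3, 3]]  # one toy output per dilation rate
    print(hff_concat(toy_branches))
    print(plain_concat(toy_branches))
```

For the three toy branches above, hff_concat returns [1, 1, 3, 3, 6, 6] while plain_concat returns [1, 1, 2, 2, 3, 3].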

3. Experimental Results

3.1. Ablation Study

Ablation Study on Cityscapes Test Set
  • (2)-(5): A naive use of the depthwise separable architecture significantly degrades performance (by about 3 to 5%), and even the HFF module cannot fully resolve the degradation in (2).
  • (3)-(5): It can be concluded that the concentration stage is critical for resolving the accuracy drop caused by depthwise separable dilated convolution.
  • (4): With more layers, mIOU increases.
  • (5): With a wider network (more channels) as well, mIOU improves further.
  • (6): Using C3 but with RC3, mIOU improves considerably.
  • (7): Using C3, the highest mIOU is obtained.

3.2. SOTA Comparison

SOTA Comparison on Cityscapes
  • The C3 module is easily applied to DRN, ENet, ERFNet and ESPNet.
  • With the C3 module, smaller model sizes and fewer FLOPs are obtained while achieving similar or improved mIOU.
  • Both C3Net1 and C3Net2 use ESPNet as the baseline but with different dilation rates d in the C3 module: d = {2, 4, 8, 16} and d = {2, 3, 7, 13}, respectively.
  • C3Net2 outperforms C3Net1 by about 1% with fewer parameters, which suggests that the dilation rates should be coprime.
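
The coprimality of the two dilation sets is easy to check with Python's standard library. The helper name below is hypothetical, and the check says nothing about accuracy by itself. It only confirms which set is pairwise coprime.

```python
from itertools import combinations
from math import gcd


def pairwise_coprime(rates):
    """True if every pair of dilation rates has a greatest common divisor of 1."""
    return all(gcd(a, b) == 1 for a, b in combinations(rates, 2))


if __name__ == "__main__":
    print(pairwise_coprime([2, 4, 8, 16]))   # C3Net1 rates
    print(pairwise_coprime([2, 3, 7, 13]))   # C3Net2 rates
```

The C3Net1 rates all share the factor 2, so the check returns False for them and True for the C3Net2 rates.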

3.3. Visualization

Visualization
  • DS-ESPNet shows the gridding effect, while C3Net1 removes it.
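
The gridding effect comes from the sparse sampling pattern of dilated kernels: a k×k kernel with dilation d reads only a lattice of k² pixels inside its footprint. The sketch below, with hypothetical helper names, lists those sampling offsets and the fraction of the footprint that is actually read. It illustrates the general mechanism only and does not reproduce the paper's visualization.

```python
def dilated_taps(d, k=3):
    """Offsets (dy, dx), relative to the centre, read by a k x k kernel with dilation d."""
    half = k // 2
    return [(i * d, j * d)
            for i in range(-half, half + 1)
            for j in range(-half, half + 1)]


def coverage(d, k=3):
    """Fraction of the pixels inside the kernel footprint that are actually read."""
    span = (k - 1) * d + 1  # side length of the footprint in pixels
    return (k * k) / (span * span)


if __name__ == "__main__":
    for d in (1, 2, 4, 8, 16):
        print(d, len(dilated_taps(d)), round(coverage(d), 4))
```

A 3×3 kernel with d = 1 reads every pixel in its footprint, whereas with d = 2 it reads only 9 of 25. The pixels skipped in between are the information that the concentration stage is meant to gather before the dilated convolution is applied.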

Sik-Ho Tsang

PhD, Researcher. I share what I learn. :) Linktree: https://linktr.ee/shtsang for Twitter, LinkedIn, etc.