Review — Toward Achieving Robust Low-Level and High-Level Scene Parsing

EFCN, Aggregate Contexts Using Convolutional Context Network (CCN)

Sik-Ho Tsang
4 min read · Jan 26, 2023
Semantic segmentation demands robust high-level as well as low-level parsing. EFCN outperforms FCN at both: high-level smoothing/recognition and low-level boundary localization.

Toward Achieving Robust Low-Level and High-Level Scene Parsing,
EFCN, by Nanyang Technological University, and Alibaba AI Labs,
2019 TIP, Over 25 Citations (Sik-Ho Tsang @ Medium)
Semantic Segmentation, Fully Convolutional Network, FCN

  • It is found that the parsing performance of a “skip” network can be noticeably improved by modifying the parameterization of its skip layers.
  • Thus, a “dense skip” architecture is introduced to retain a rich set of low-level information. A Convolutional Context Network (CCN), also built on “dense skip” layers, is proposed to aggregate contexts for high-level feature maps.
  • Finally, the two are combined to form the Enhanced Fully Convolutional Network (EFCN).

Outline

  1. Parameterization of Skip Layers, Dense Skip Layers, & CCN
  2. Enhanced Fully Convolutional Network (EFCN)
  3. Results

1. Parameterization of Skip Layers, Dense Skip Layers, & CCN

1.1. Skip Layers

Left Figure: Skip layers of segmentation networks. Right Figure: Parameterizations of skip layers. Table: Comparisons of “Dilation” and “Skip” Using VGG-16 on ADE20K.
  • Both “Dilation” and “Skip” approaches retain useful information that improves parsing performance. However, the “Dilation” network is significantly slower than the “Skip” network.
  • “Our Skip” is proposed, in which each skip layer is parameterized by two convolutional layers with batch normalization (BN), i.e., Conv+BN+ReLU+Conv.

The performance of the same “skip” networks is significantly boosted (> 1% IoU) once the newly parameterized skip layers are used.
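To make the Conv+BN+ReLU+Conv parameterization concrete, here is a minimal sketch in plain Python. It assumes 1×1 kernels, so that each convolution acts on one pixel's channel vector independently; the review does not state the kernel size. The function names, channel counts and all weights are made up for illustration and are not taken from the paper.

```python
import math

def conv1x1(x, weight, bias):
    """1x1 convolution at a single pixel: a matrix-vector product over channels."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def batch_norm_inference(x, mean, var, gamma, beta, eps=1e-5):
    """Inference-time BN: normalise each channel with stored statistics, then scale and shift."""
    return [g * (v - m) / math.sqrt(s + eps) + b
            for v, m, s, g, b in zip(x, mean, var, gamma, beta)]

def relu(x):
    return [max(0.0, v) for v in x]

def skip_layer(x, p):
    """Two-layer skip parameterization: Conv + BN + ReLU + Conv."""
    h = conv1x1(x, p["w1"], p["b1"])
    h = batch_norm_inference(h, p["mean"], p["var"], p["gamma"], p["beta"])
    h = relu(h)
    return conv1x1(h, p["w2"], p["b2"])

# Toy parameters (assumed): 3 input channels -> 2 hidden channels -> 2 class scores.
params = {
    "w1": [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]],
    "b1": [0.0, 0.0],
    "mean": [0.0, 0.0], "var": [1.0, 1.0],
    "gamma": [1.0, 1.0], "beta": [0.0, 0.0],
    "w2": [[1.0, 0.0], [0.0, 1.0]],
    "b2": [0.0, 0.0],
}

# Class scores contributed by this skip layer for one pixel's feature vector.
scores = skip_layer([1.0, 2.0, 3.0], params)
```

A conventional FCN skip layer would be the single `conv1x1` call on its own; the sketch only shows what the extra BN, ReLU and second convolution add on top of it.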

1.2. Dense Skip Layers

Comparison Between “Sparse Skip” and “Dense Skip” Using VGG-16 on ADE20K.
  • The “dense skip” network adds a skip layer for each intermediate feature map after POOL3, which helps to aggregate multi-scale contexts. (More details in 1.3.)

The “dense skip” architecture outperforms the conventional “sparse skip” counterpart by a noticeable margin, and its inference speed remains competitively fast.
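The difference between the two layouts can be sketched by listing which VGG-16 feature maps feed a skip layer. The layer names below follow the common Caffe-style VGG-16 naming. Whether POOL3 itself and POOL5 are tapped, and the helper names, are assumptions for illustration; the review only says "each intermediate feature map after POOL3".

```python
# VGG-16 convolutional trunk in order, as (layer name, cumulative stride).
VGG16_TRUNK = [
    ("conv1_1", 1), ("conv1_2", 1), ("pool1", 2),
    ("conv2_1", 2), ("conv2_2", 2), ("pool2", 4),
    ("conv3_1", 4), ("conv3_2", 4), ("conv3_3", 4), ("pool3", 8),
    ("conv4_1", 8), ("conv4_2", 8), ("conv4_3", 8), ("pool4", 16),
    ("conv5_1", 16), ("conv5_2", 16), ("conv5_3", 16), ("pool5", 32),
]

def layers_from(trunk, start):
    """Names of all trunk layers from `start` (inclusive) to the end."""
    names = [name for name, _ in trunk]
    return names[names.index(start):]

def sparse_skip_sources(trunk, start="pool3"):
    """Sparse skip: only pooling outputs feed skip layers (assumed layout)."""
    return [name for name in layers_from(trunk, start) if name.startswith("pool")]

def dense_skip_sources(trunk, start="pool3"):
    """Dense skip: every intermediate feature map from `start` onward feeds a skip layer."""
    return layers_from(trunk, start)

sparse = sparse_skip_sources(VGG16_TRUNK)
dense = dense_skip_sources(VGG16_TRUNK)
```

Under these assumptions the dense variant taps three times as many feature maps as the sparse one, while all of them sit at stride 8 or coarser, which is consistent with the inference speed staying close to that of the sparse network.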

1.3. Convolutional Context Network (CCN)

Convolutional context network (CCN) is a convolutional network with dense skip layers.
  • As shown above, several conv blocks are chained to progressively expand the contextual view of feature maps.

Dense skip layers are also used in CCN to aggregate multi-scale contexts.
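Why chaining conv blocks expands the contextual view can be seen from the standard receptive-field recurrence for stacked convolutions: r_out = r_in + (k − 1) · j_in and j_out = j_in · s, where k is the kernel size, s the stride and j the spacing between neighbouring outputs. The sketch below applies it to a hypothetical chain of four stride-1 blocks with 5×5 kernels; the block count and kernel size are assumptions, not values from the paper.

```python
def receptive_fields(layers):
    """Receptive field (in input pixels) after each (kernel, stride) layer.

    Uses the recurrence r_out = r_in + (k - 1) * j_in and j_out = j_in * s.
    """
    r, j = 1, 1
    sizes = []
    for kernel, stride in layers:
        r += (kernel - 1) * j
        j *= stride
        sizes.append(r)
    return sizes

# Hypothetical CCN: four chained conv blocks, each one 5x5 conv with stride 1.
ccn_blocks = [(5, 1)] * 4
fields = receptive_fields(ccn_blocks)
```

Each entry of `fields` is measured on the feature map that enters the CCN. A dense skip layer attached to every block output therefore sees a different context size, which is one way to read the "multi-scale contexts" aggregation described above.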

Applying Different Context Aggregation (CA) Modules into “Our Skip” Architecture.
  • Segmentation networks with CA modules outperform the baseline FCN.

Meanwhile, CCN outperforms CRF, DAG-RNN and ASPP (DeepLabv2) by a significant margin.

2. Enhanced Fully Convolutional Network (EFCN)

Network architecture of EFCN-xs.
  • EFCN-xs is shown above.
  • First, the “dense skip” architecture is used in EFCN to retain and incorporate low-level information from the pre-trained CNN, which enhances low-level visual understanding (e.g., boundary localization).
  • Moreover, CCN is introduced to aggregate context for high-level feature maps, which benefits high-level visual parsing.
  • EFCN-4s and EFCN-2s can be trivially inferred from the architecture illustrated above.
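As a rough guide to the "-xs" suffix, the sketch below assumes EFCN follows the FCN naming convention, where "-8s" means the finest score map has stride 8 and is upsampled 8× to the input resolution. The mapping from stride to the first tapped pooling stage, the ceil-based size estimate and the function name are assumptions for illustration.

```python
import math

# Cumulative stride after each VGG-16 pooling stage.
POOL_STRIDES = {"pool1": 2, "pool2": 4, "pool3": 8, "pool4": 16, "pool5": 32}

def describe_variant(output_stride, height, width):
    """Summarise a hypothetical EFCN-<output_stride>s variant for a given input size."""
    matches = [name for name, s in POOL_STRIDES.items() if s == output_stride]
    if not matches:
        raise ValueError(f"unsupported output stride: {output_stride}")
    return {
        # Shallowest pooling stage whose feature maps are tapped by skip layers.
        "skips_start_at": matches[0],
        # Approximate size of the finest score map (exact size depends on padding).
        "score_map": (math.ceil(height / output_stride), math.ceil(width / output_stride)),
        # Factor by which the score map is upsampled back to the input resolution.
        "upsample_factor": output_stride,
    }

efcn_8s = describe_variant(8, 384, 384)
efcn_4s = describe_variant(4, 384, 384)
efcn_2s = describe_variant(2, 384, 384)
```

Under this reading, a smaller stride starts the skip layers at a shallower pooling stage, so EFCN-4s and EFCN-2s carry more skip layers and predict on larger score maps than EFCN-8s.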
Comparisons of Different Networks on ADE20K.

EFCN outperforms all other networks by a significant margin.

3. Results

3.1. Ablation Studies

Different EFCNs on ADE20K.

EFCN-8s performs the best among all “dense skip” architectures. Though EFCN-4s entails more skip layers, it achieves inferior segmentation results.

EFCN-8s Ablation

Each proposed contribution improves on the baseline segmentation network FCN, and the gains add up when they are combined.

3.2. ADE20K

SOTA Comparisons on ADE20K.

EFCN consistently and significantly outperforms its FCN counterpart.

  • Although EFCN slightly lags behind PSPNet in accuracy, it is faster and requires much less memory.
Qualitative ablation analysis of the proposed segmentation network — EFCN. Images are from ADE20K dataset.
  • The proposed “dense skip” architecture helps retain detailed spatial information.

In the meantime, CCN aggregates high-level contexts for feature maps, which is essential to achieve smooth and robust semantic interpretations for visually inconsistent image regions.

3.3. Other Segmentation Benchmarks

SOTA Comparisons on Pascal Context
SOTA Comparisons on SUN RGB-D
SOTA Comparisons on Pascal VOC 2012

EFCN demonstrates significantly better quantitative parsing performance than FCN on all segmentation benchmarks.
