Review — Toward Achieving Robust Low-Level and High-Level Scene Parsing
EFCN, Aggregating Contexts Using a Convolutional Context Network (CCN)
Toward Achieving Robust Low-Level and High-Level Scene Parsing,
EFCN, by Nanyang Technological University and Alibaba AI Labs,
2019 TIP, Over 25 Citations (Sik-Ho Tsang @ Medium)
Semantic Segmentation, Fully Convolutional Network, FCN
- It is found that the parsing performance of the “skip” network can be noticeably improved by modifying the parameterization of the skip layers.
- Thus, a “dense skip” architecture is introduced to retain a rich set of low-level information, and a Convolutional Context Network (CCN), built on “dense skip” layers, is proposed to aggregate contexts for high-level feature maps.
- Finally, combining both yields the Enhanced Fully Convolutional Network (EFCN).
Outline
- Parameterization of Skip Layers, Dense Skip Layers, & CCN
- Enhanced Fully Convolutional Network (EFCN)
- Results
1. Parameterization of Skip Layers, Dense Skip Layers, & CCN
1.1. Skip Layers
- Both the “Dilation” and “Skip” approaches can retain useful information and improve parsing performance. However, the “Dilation” network is significantly slower than the “Skip” network.
- “Our Skip” is proposed, in which each skip layer is parameterized by two convolutional layers with batch normalization (BN), i.e. Conv+BN+ReLU+Conv.
Performance of the same “skip” networks is significantly boosted (> 1% IoU) after the new parameterized skip layers are used.
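A minimal PyTorch sketch of this parameterization is given below; the channel widths and 3×3 kernel sizes are illustrative assumptions, not the paper’s exact settings:

```python
import torch.nn as nn

class ParameterizedSkip(nn.Module):
    """2-layer parameterized skip layer: Conv+BN+ReLU+Conv.
    Channel widths and kernel sizes are illustrative assumptions."""
    def __init__(self, in_channels, mid_channels, num_classes):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(mid_channels),
            nn.ReLU(inplace=True),
            # the second conv maps to per-class score maps
            nn.Conv2d(mid_channels, num_classes, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return self.block(x)
```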
1.2. Dense Skip Layers
- The “dense skip” network adds a skip layer for each intermediate feature map after POOL3, which helps aggregate multi-scale contexts. (More details in Section 1.3.)
The “dense skip” architecture outperforms the conventional “sparse skip” counterpart by a noticeable margin, while its inference speed remains competitive.
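Reusing the ParameterizedSkip module above, a hedged sketch of a “dense skip” head; the stage list, the upsampling target, and fusion by summation are assumptions for illustration:

```python
import torch.nn as nn
import torch.nn.functional as F

class DenseSkipHead(nn.Module):
    """Attach a parameterized skip layer to every intermediate
    feature map after POOL3, then upsample and fuse the score maps.
    Fusion by summation is an assumption for illustration."""
    def __init__(self, stage_channels, num_classes, mid_channels=256):
        super().__init__()
        self.skips = nn.ModuleList(
            ParameterizedSkip(c, mid_channels, num_classes)
            for c in stage_channels
        )

    def forward(self, features):
        # 'features' holds the intermediate maps after POOL3,
        # ordered from highest to lowest spatial resolution
        target = features[0].shape[2:]
        fused = 0
        for skip, feat in zip(self.skips, features):
            score = skip(feat)
            fused = fused + F.interpolate(
                score, size=target, mode="bilinear", align_corners=False
            )
        return fused
```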
1.3. Convolutional Context Network (CCN)
- As shown above, several conv blocks are chained to progressively expand the contextual view of feature maps.
Dense skip layers are also used in CCN to aggregate multi-scale contexts.
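A hedged sketch of such a CCN, again with dense skips over the chained blocks; the block count, 3×3 kernels, and summation-based aggregation are assumptions, not the authors’ exact design:

```python
import torch
import torch.nn as nn

class CCN(nn.Module):
    """Chain several conv blocks so each successive block sees a
    progressively larger context, and tap every block's output with
    a dense skip. Depth and kernel sizes are illustrative assumptions."""
    def __init__(self, channels, num_blocks=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3, padding=1),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
            for _ in range(num_blocks)
        )

    def forward(self, x):
        taps = []
        for block in self.blocks:
            x = block(x)           # receptive field grows with each block
            taps.append(x)         # dense skip: keep every context scale
        # aggregate the multi-scale contexts (summation is an assumption)
        return torch.stack(taps).sum(dim=0)
```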
- Segmentation networks with context aggregation (CA) modules outperform the baseline FCN.
Meanwhile, CCN outperforms CRF, DAG-RNN, and ASPP (DeepLabv2) by a significant margin.
2. Enhanced Fully Convolutional Network (EFCN)
- EFCN-xs is shown above.
- First, “dense skip” architecture is used in EFCN to retain and incorporate low-level information from pre-trained CNN, which enhances low-level visual understanding (e.g., boundary localization).
- Moreover, CCN is introduced to aggregate context for high-level feature maps, which brings benefits to high-level visual parsing.
- EFCN-4s and EFCN-2s can be trivially inferred from the architecture shown above; a composite sketch follows below.
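Putting the pieces together, a hedged end-to-end sketch of an EFCN-style network; the backbone interface, channel sizes, and final upsampling are assumptions rather than the authors’ exact configuration:

```python
import torch.nn as nn
import torch.nn.functional as F

class EFCNSketch(nn.Module):
    """Compose the sketches above: a pre-trained backbone yields the
    intermediate feature maps, CCN enriches the deepest map with
    context, and the dense-skip head fuses everything into per-class
    scores. The backbone is assumed to return a list of feature maps
    ordered from the map after POOL3 to the deepest one."""
    def __init__(self, backbone, stage_channels, num_classes):
        super().__init__()
        self.backbone = backbone
        self.ccn = CCN(stage_channels[-1])
        self.head = DenseSkipHead(stage_channels, num_classes)

    def forward(self, x):
        feats = list(self.backbone(x))
        feats[-1] = self.ccn(feats[-1])   # context-aggregated top features
        scores = self.head(feats)         # dense-skip fusion
        # upsample to the input resolution for per-pixel prediction
        return F.interpolate(scores, size=x.shape[2:],
                             mode="bilinear", align_corners=False)
```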
EFCN outperforms all other networks by a significant margin.
3. Results
3.1. Ablation Studies
EFCN-8s performs the best among all “dense skip” architectures. Though EFCN-4s entails more skip layers, it achieves inferior segmentation results.
The proposed contributions collectively improve the baseline segmentation network, FCN.
3.2. ADE20K
EFCN consistently and significantly outperforms its FCN counterpart.
- Although EFCN slightly lags behind PSPNet in accuracy, it is faster and requires much less memory than PSPNet.
- The proposed “dense skip” architecture helps retain detailed spatial information.
Meanwhile, CCN aggregates high-level contexts for feature maps, which is essential for smooth and robust semantic interpretations of visually inconsistent image regions.
3.3. Other Segmentation Benchmarks
EFCN demonstrates significantly better quantitative parsing performance than FCN on all segmentation benchmarks.
Reference
[2019 TIP] [EFCN]
Toward Achieving Robust Low-Level and High-Level Scene Parsing