Review — Toward Achieving Robust Low-Level and High-Level Scene Parsing
EFCN, Aggregating Contexts Using a Convolutional Context Network (CCN)
Toward Achieving Robust Low-Level and High-Level Scene Parsing,
EFCN, by Nanyang Technological University and Alibaba AI Labs,
2019 TIP, Over 25 Citations (Sik-Ho Tsang @ Medium)
Semantic Segmentation, Fully Convolutional Network, FCN
- It is found that the parsing performance of the “skip” network can be noticeably improved by modifying the parameterization of the skip layers.
- Thus, a “dense skip” architecture is introduced to retain a rich set of low-level information, and a Convolutional Context Network (CCN), built on “dense skip” layers, is proposed to aggregate contexts for high-level feature maps.
- Finally, combining both yields the Enhanced Fully Convolutional Network (EFCN).
Outline
- Parameterization of Skip Layers, Dense Skip Layers, & CCN
- Enhanced Fully Convolutional Network (EFCN)
- Results
1. Parameterization of Skip Layers, Dense Skip Layers, & CCN
1.1. Skip Layers
- Both the “Dilation” and “Skip” approaches can retain useful information and improve parsing performance. However, the “Dilation” network is significantly slower than the “Skip” network.
- “Our Skip” is proposed, in which each skip layer is parameterized by two convolutional layers with batch normalization (BN), i.e. Conv+BN+ReLU+Conv.
Performance of the same “skip” networks is significantly boosted (> 1% IoU) after the new parameterized skip layers are used.
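A minimal PyTorch sketch of this parameterization is given below; the channel widths and 3×3 kernel sizes are illustrative assumptions, not the paper’s exact settings:

```python
import torch.nn as nn

class ParameterizedSkip(nn.Module):
    """2-layer parameterized skip layer: Conv+BN+ReLU+Conv.
    Channel widths and kernel sizes are illustrative assumptions."""
    def __init__(self, in_channels, mid_channels, num_classes):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(mid_channels),
            nn.ReLU(inplace=True),
            # the second conv maps to per-class score maps
            nn.Conv2d(mid_channels, num_classes, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return self.block(x)
```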
1.2. Dense Skip Layers
- The “dense skip” network adds a skip layer for each intermediate feature map after POOL3, which helps aggregate multi-scale contexts. (More details in Section 1.3.)
The “dense skip” architecture outperforms the conventional “sparse skip” counterpart by a noticeable margin, while its inference speed remains competitive.
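Reusing the ParameterizedSkip module above, a hedged sketch of a “dense skip” head; the stage list, the upsampling target, and fusion by summation are assumptions for illustration:

```python
import torch.nn as nn
import torch.nn.functional as F

class DenseSkipHead(nn.Module):
    """Attach a parameterized skip layer to every intermediate
    feature map after POOL3, then upsample and fuse the score maps.
    Fusion by summation is an assumption for illustration."""
    def __init__(self, stage_channels, num_classes, mid_channels=256):
        super().__init__()
        self.skips = nn.ModuleList(
            ParameterizedSkip(c, mid_channels, num_classes)
            for c in stage_channels
        )

    def forward(self, features):
        # 'features' holds the intermediate maps after POOL3,
        # ordered from highest to lowest spatial resolution
        target = features[0].shape[2:]
        fused = 0
        for skip, feat in zip(self.skips, features):
            score = skip(feat)
            fused = fused + F.interpolate(
                score, size=target, mode="bilinear", align_corners=False
            )
        return fused
```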
1.3. Convolutional Context Network (CCN)
- As shown above, several conv blocks are chained to progressively expand the contextual view of feature maps.
Dense skip layers are also used in CCN to aggregate multi-scale contexts.
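A hedged sketch of such a CCN, again with dense skips over the chained blocks; the block count, 3×3 kernels, and summation-based aggregation are assumptions, not the authors’ exact design:

```python
import torch
import torch.nn as nn

class CCN(nn.Module):
    """Chain several conv blocks so each successive block sees a
    progressively larger context, and tap every block's output with
    a dense skip. Depth and kernel sizes are illustrative assumptions."""
    def __init__(self, channels, num_blocks=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3, padding=1),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
            for _ in range(num_blocks)
        )

    def forward(self, x):
        taps = []
        for block in self.blocks:
            x = block(x)           # receptive field grows with each block
            taps.append(x)         # dense skip: keep every context scale
        # aggregate the multi-scale contexts (summation is an assumption)
        return torch.stack(taps).sum(dim=0)
```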
- Segmentation networks with context aggregation (CA) modules outperform the baseline FCN.
Meanwhile, CCN outperforms CRF, DAG-RNN, and ASPP (DeepLabv2) by a significant margin.
2. Enhanced Fully Convolutional Network (EFCN)
- EFCN-xs is shown above.
- First, “dense skip” architecture is used in EFCN to retain and incorporate low-level information from pre-trained CNN, which enhances low-level visual understanding (e.g., boundary localization).
- Moreover, CCN is introduced to aggregate context for high-level feature maps, which brings benefits to high-level visual parsing.
- EFCN-4s and EFCN-2s can be trivially inferred from the architecture shown above; a composite sketch follows below.
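Putting the pieces together, a hedged end-to-end sketch of an EFCN-style network; the backbone interface, channel sizes, and final upsampling are assumptions rather than the authors’ exact configuration:

```python
import torch.nn as nn
import torch.nn.functional as F

class EFCNSketch(nn.Module):
    """Compose the sketches above: a pre-trained backbone yields the
    intermediate feature maps, CCN enriches the deepest map with
    context, and the dense-skip head fuses everything into per-class
    scores. The backbone is assumed to return a list of feature maps
    ordered from the map after POOL3 to the deepest one."""
    def __init__(self, backbone, stage_channels, num_classes):
        super().__init__()
        self.backbone = backbone
        self.ccn = CCN(stage_channels[-1])
        self.head = DenseSkipHead(stage_channels, num_classes)

    def forward(self, x):
        feats = list(self.backbone(x))
        feats[-1] = self.ccn(feats[-1])   # context-aggregated top features
        scores = self.head(feats)         # dense-skip fusion
        # upsample to the input resolution for per-pixel prediction
        return F.interpolate(scores, size=x.shape[2:],
                             mode="bilinear", align_corners=False)
```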
EFCN outperforms all other networks by a significant margin.
3. Results
3.1. Ablation Studies
EFCN-8s performs the best among all “dense skip” architectures. Though EFCN-4s entails more skip layers, it achieves inferior segmentation results.
The proposed contributions collectively improve the baseline segmentation network, FCN.
3.2. ADE20K
EFCN consistently and significantly outperforms its FCN counterpart.
- Although EFCN slightly lags behind PSPNet in accuracy, it is faster and requires much less memory than PSPNet.
- The proposed “dense skip” architecture helps retain detailed spatial information.
Meanwhile, CCN aggregates high-level contexts for feature maps, which is essential for smooth and robust semantic interpretations of visually inconsistent image regions.
3.3. Other Segmentation Benchmarks
EFCN demonstrates significantly better quantitative parsing performance than FCN on all segmentation benchmarks.
Reference
[2019 TIP] [EFCN]
Toward Achieving Robust Low-Level and High-Level Scene Parsing