Review — Gated-SCNN: Gated Shape CNNs for Semantic Segmentation

Gated-SCNN, Improves Segmentation Using Boundary Information

Sik-Ho Tsang
Jan 9, 2023
Gated-SCNN (GSCNN) with Proposed Shape Stream Using Gated Convolutional Layer (GCL)

Gated-SCNN: Gated Shape CNNs for Semantic Segmentation,
Gated-SCNN (GSCNN), by NVIDIA, University of Waterloo, University of Toronto, Vector Institute,
2019 ICCV, Over 480 Citations (Sik-Ho Tsang @ Medium)
Semantic Segmentation

  • A new two-stream CNN architecture is proposed for semantic segmentation: a shape stream processes shape information in parallel to the classical (regular) stream.
  • A new type of gate is introduced, which uses the higher-level activations in the classical stream to gate the lower-level activations in the shape stream. This effectively removes noise and helps the shape stream focus only on the relevant boundary-related information.

Outline

  1. Gated-SCNN (GSCNN) Model Architecture
  2. Gated Convolutional Layer (GCL)
  3. Dual Task Regularizer
  4. Experimental Results

1. Gated-SCNN (GSCNN)

Gated-SCNN (GSCNN)
  • The regular stream can be any standard backbone, such as ResNet-101 or WideResNet. It takes the image as input and outputs a dense feature representation r.
  • In the shape stream, image gradients retrieved with a Canny edge detector are used as additional input features, and the stream outputs a boundary map s.
  • Then, the fusion module takes as input the dense feature representation r coming from the regular branch, fuses it with the boundary map s output by the shape branch, and outputs a refined semantic segmentation.
  • Specifically, the boundary map s is merged with r using an Atrous Spatial Pyramid Pooling (ASPP) module, which originated in DeepLab. This preserves the multi-scale contextual information.

2. Gated Convolutional Layer (GCL)

  • The GCL facilitates the flow of information from the regular stream to the shape stream.
  • An attention map αt is first obtained by concatenating rt and st, followed by a normalized 1×1 convolutional layer C1×1, which in turn is followed by a sigmoid function σ:
      αt = σ(C1×1(st || rt))
  • where || denotes concatenation.
  • GCL is applied on st as an element-wise product ⊙ with the attention map αt, followed by a residual connection and channel-wise weighting with kernel wt. Thus, at each pixel (i, j), GCL ∗ is computed as:
      ˆst(i,j) = (st ∗ wt)(i,j) = ((st(i,j) ⊙ αt(i,j)) + st(i,j))ᵀ wt
  • where ˆst is then passed on to the next layer in the shape stream for further processing.
  • As shown above, three GCLs are used. They are connected to the third, fourth and last layer of the regular stream. Bilinear interpolation, if needed, is used to upsample the feature maps coming from the regular stream.
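
The gating described in this section can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the function name gated_conv_layer, the parameter names w_gate, b_gate and w_out, the single-channel attention map, and the omission of the normalization step are all assumptions made for brevity. It also assumes r has already been resized to the spatial size of s.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_conv_layer(s, r, w_gate, b_gate, w_out):
    """Hypothetical single-image GCL sketch.

    s      : (Cs, H, W) shape-stream features
    r      : (Cr, H, W) regular-stream features, same H and W as s
    w_gate : (Cs + Cr,) weights of a 1x1 conv producing one attention channel
    b_gate : scalar bias of that 1x1 conv
    w_out  : (Cout, Cs) channel-wise weighting kernel (another 1x1 conv)
    """
    x = np.concatenate([s, r], axis=0)                       # s || r
    # A 1x1 convolution is a per-pixel dot product over the channel axis.
    alpha = sigmoid(np.tensordot(w_gate, x, axes=([0], [0])) + b_gate)  # (H, W)
    gated = s * alpha[None, :, :] + s                        # gate, then residual
    s_hat = np.tensordot(w_out, gated, axes=([1], [0]))      # (Cout, H, W)
    return s_hat, alpha
```

With all-zero gate weights the attention map is 0.5 everywhere, so the output is simply 1.5 times the input features when w_out is the identity. That makes the role of the residual connection easy to see: the gate can amplify features but never fully erase them.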

Standard binary cross-entropy (BCE) loss is used on the predicted boundary maps s and standard cross-entropy (CE) loss is used on the predicted semantic segmentation f:

      L = λ1 · LBCE(s, ˆs) + λ2 · LCE(ˆy, f)

  • where ˆs and ˆy are the GT boundary map and GT semantic labels respectively. λ1 = 20, λ2 = 1.
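
As a rough sketch of how the two terms combine, the NumPy snippet below weights a boundary BCE term and a segmentation CE term with λ1 = 20 and λ2 = 1. The function names, the mean reduction over pixels, and the clipping constant eps are assumptions for illustration and are not taken from the paper.

```python
import numpy as np


def bce_loss(s_pred, s_gt, eps=1e-7):
    """Binary cross-entropy between a predicted boundary map in [0, 1] and a 0/1 GT map."""
    p = np.clip(s_pred, eps, 1.0 - eps)
    return float(-np.mean(s_gt * np.log(p) + (1.0 - s_gt) * np.log(1.0 - p)))


def ce_loss(probs, labels, eps=1e-7):
    """Cross-entropy for class probabilities (K, H, W) and integer GT labels (H, W)."""
    h, w = labels.shape
    rows = np.arange(h)[:, None]
    cols = np.arange(w)[None, :]
    picked = probs[labels, rows, cols]        # probability of the GT class per pixel
    return float(-np.mean(np.log(np.clip(picked, eps, 1.0))))


def joint_loss(s_pred, s_gt, probs, labels, lam1=20.0, lam2=1.0):
    """Weighted sum of the boundary loss and the segmentation loss."""
    return lam1 * bce_loss(s_pred, s_gt) + lam2 * ce_loss(probs, labels)
```

The large λ1 reflects how few pixels lie on a boundary: without the extra weight, the boundary term would contribute little next to the segmentation term.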

3. Dual Task Regularizer

  • Let p(y|r, s) denote the categorical distribution output by the fusion module.
  • Let ζ be a potential that represents whether a particular pixel belongs to a semantic boundary in the input image I. It is computed by taking a spatial derivative of the segmentation output as follows:
      ζ = (1/√2) ‖∇(G ∗ argmax_k p(y^k | r, s))‖
  • where G denotes a Gaussian filter.
  • If we assume ˆζ is a GT binary mask computed in the same way from the GT semantic labels ˆf, the loss function is written as:
      Lreg→ = λ3 Σ_{p+} |ζ(p+) − ˆζ(p+)|
  • where p+ contains the set of all non-zero pixel coordinates in both ζ and ˆζ. Intuitively, this ensures that boundary pixels are penalized when they mismatch the GT boundaries, while preventing non-boundary pixels from dominating the loss function.
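  • Similarly, the boundary prediction from the shape stream s can be used to ensure consistency between the binary boundary prediction s and the predicted semantics p(y|r, s), using a cross-entropy term restricted to confident boundary pixels:
      Lreg← = −λ4 Σ_{k,p} 1sp [ˆy_p^k log p(y_p^k | r, s)]
  • where p and k run over all image pixels and semantic classes, 1sp is the indicator function that equals 1 where s exceeds the confidence threshold thrs = 0.8, and λ3 = λ4 = 1.
  • Finally, the total dual task regularizer loss function is:
      L = Lreg→ + Lreg←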
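  • Since argmax is not a differentiable function, the Gumbel softmax trick [24] is used to approximate its gradient with respect to a parameter η in the backward pass:
      ∂ argmax_k p(y^k) / ∂η = ∇η [ exp((log p(y^k) + gk)/τ) / Σ_j exp((log p(y^j) + gj)/τ) ]
  • where gj∼Gumbel(0, I) and τ=1 is a hyper-parameter. The operator ∇G∗ (smoothing followed by a spatial derivative) can be computed by filtering with a Sobel kernel.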
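
The boundary potential and the first regularizer term can be sketched in NumPy as follows. This is an illustration under several assumptions rather than the authors' code: the gradient magnitude is taken per class channel of a one-hot (or Gumbel-softmax) map, p+ is taken as the pixels that are non-zero in either map, the 1/√2 scale is an assumed normalization, and all function names are hypothetical.

```python
import numpy as np

SOBEL_X = np.array([[-1.0, 0.0, 1.0],
                    [-2.0, 0.0, 2.0],
                    [-1.0, 0.0, 1.0]])
SOBEL_Y = SOBEL_X.T


def gumbel_softmax(log_probs, tau=1.0, rng=None):
    """Soft, sampled stand-in for argmax over the class axis of a (K, H, W) array."""
    rng = np.random.default_rng() if rng is None else rng
    g = rng.gumbel(0.0, 1.0, size=log_probs.shape)      # g ~ Gumbel(0, 1)
    z = (log_probs + g) / tau
    z = z - z.max(axis=0, keepdims=True)                # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)


def filter2d(img, kernel):
    """Same-size 2D cross-correlation with edge padding."""
    kh, kw = kernel.shape
    padded = np.pad(img, ((kh // 2, kh // 2), (kw // 2, kw // 2)), mode="edge")
    out = np.zeros(img.shape, dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * padded[i:i + img.shape[0], j:j + img.shape[1]]
    return out


def boundary_potential(class_map):
    """Per-class Sobel gradient magnitude of a (K, H, W) map, scaled by 1/sqrt(2)."""
    zeta = np.empty(class_map.shape, dtype=float)
    for k, channel in enumerate(class_map):
        gx = filter2d(channel, SOBEL_X)
        gy = filter2d(channel, SOBEL_Y)
        zeta[k] = np.sqrt(gx ** 2 + gy ** 2) / np.sqrt(2.0)
    return zeta


def boundary_l1_loss(zeta, zeta_gt, lam3=1.0, eps=1e-8):
    """L1 mismatch summed over pixels that are non-zero in either potential."""
    mask = (zeta > eps) | (zeta_gt > eps)
    if not mask.any():
        return 0.0
    return lam3 * float(np.abs(zeta - zeta_gt)[mask].sum())


def one_hot(labels, num_classes):
    """Integer label map (H, W) to one-hot (K, H, W)."""
    return np.eye(num_classes)[labels].transpose(2, 0, 1)
```

In training, the predicted map would come from gumbel_softmax applied to the log-probabilities of the fusion module, while the GT potential would come from one_hot applied to the GT labels. Restricting the sum to boundary pixels is what keeps the large number of non-boundary pixels from dominating the loss.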

4. Experimental Results

4.1. Model, Training, Dataset & Metrics

  • DeepLabv3+ is used as the main baseline, with ResNet-50, ResNet-101 and WideResNet as the backbone architectures.
  • The training resolution is 800×800 and synchronized batch norm is used. Training runs on an NVIDIA DGX Station using 8 GPUs with a total batch size of 16.
  • The Cityscapes dataset is used.
  • Standard IoU and the F-score along the boundary are used as metrics.
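
For reference, mean IoU is commonly computed from a confusion matrix, as in the NumPy sketch below. The function names are illustrative, and the sketch does not handle ignore labels such as those used in Cityscapes.

```python
import numpy as np


def confusion_matrix(pred, gt, num_classes):
    """Rows index the GT class, columns index the predicted class."""
    idx = gt.ravel() * num_classes + pred.ravel()
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)


def mean_iou(pred, gt, num_classes):
    """Average of per-class intersection over union, skipping absent classes."""
    cm = confusion_matrix(pred, gt, num_classes)
    tp = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - tp
    valid = union > 0
    return float((tp[valid] / union[valid]).mean())
```

Because every pixel counts equally, a large class such as road can dominate the per-class intersection and union counts, which is why a separate boundary metric is reported alongside IoU.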
Distance-Based Evaluation
  • Distance-based evaluation is also used, since high accuracy matters for small (distant) objects as well, and the global IoU metric does not reflect this well. IoU is therefore measured on multiple crops.

4.2. SOTA Comparisons

Comparison in terms of IoU vs state-of-the-art baselines on the Cityscapes val set.

A 2% mean IoU improvement is obtained compared with the baseline. Significant improvements are obtained for small objects: motorcycles, traffic signs, traffic lights, and poles.

Comparison vs baselines at different thresholds in terms of boundary F-score on the Cityscapes val set.

In terms of boundary accuracy (measured by F score), GSCNN performs considerably better, outperforming the baseline by close to 4% in the strictest regime.

Comparison vs state-of-the-art methods (with/without coarse training) on the Cityscapes test set.
  • The proposed GSCNN model consistently outperforms very strong baselines, some of which also use extra coarse training data.

At the time of writing, the GSCNN approach is also ranked first among the published methods that do not use coarse data.

4.3. Ablation Study

Comparison of the shape stream, GCL, and additional image gradient features (Canny) for different regular streams.

GSCNN achieves between 1% and 2% improvement in mIoU, and around 3% in boundary alignment.

Effect of the Dual Task Loss at different thresholds in terms of boundary quality (F-score).

The dual task loss significantly improves the boundary accuracy of the model, by up to 3% in the strictest regime.

Performance improvements and the percentage increase in the number of parameters due to the shape stream on different base networks.

There is only a small increase in model size when using GSCNN.

4.4. Qualitative Results

Qualitative results of our method on the Cityscapes test set.

Some typical cases are shown above.

Qualitative comparison in terms of errors in predictions.

The prediction errors are also shown for both methods. For example, DeepLabv3+ fails to capture the poles and naively classifies them as humans, whereas GSCNN classifies the poles properly.

Visualization of the alpha channels from the GCLs.
  • Alpha channels are shown above.

For example, the first gate emphasizes very low-level edges, while the second and third focus on object-level boundaries.

Example output of shape stream fed into the fusion module.

The shape stream learns to produce high-quality class-agnostic boundaries, which are then fed to the fusion module.

Qualitative results on the Cityscapes test set showing the high-quality boundaries of our predicted segmentation masks.
  • The boundaries obtained from the final segmentation masks are shown above. Notice their accuracy on the thinner and smaller objects.

Reference

[2019 ICCV] [Gated-SCNN]
Gated-SCNN: Gated Shape CNNs for Semantic Segmentation

