Review — CCNet: Criss-Cross Attention for Semantic Segmentation

CCNet, Recurrent Cross-Shaped Self-Attention, More Efficient Than Non-Local Neural Network

Sik-Ho Tsang
6 min read · Dec 20, 2022
Left: Non-Local Neural Network, Right: Proposed CCNet

CCNet: Criss-Cross Attention for Semantic Segmentation,
CCNet, by Huazhong University of Science and Technology, Horizon Robotics, ReLER, and University of Illinois at Urbana-Champaign
2019 ICCV, Over 1500 Citations, 2020 TPAMI, Over 1600 Citations (Sik-Ho Tsang @ Medium)
Semantic Segmentation, Instance Segmentation, Transformer

  • Criss-Cross Network (CCNet) is proposed such that, for each pixel, a novel criss-cross attention module in CCNet harvests the contextual information of all the pixels on its criss-cross path.
  • By taking a further recurrent operation, each pixel can finally capture the full-image dependencies from all pixels.
  • Compared with the Non-Local Neural Network, CCNet uses 11× less GPU memory and reduces the FLOPs of the attention module by about 85%.
  • In the TPAMI version, CCNet is further enhanced with a better loss function and extended to the 3D case.

Outline

  1. CCNet (2019 ICCV)
  2. CCNet (2020 TPAMI)
  3. Experimental Results

1. CCNet (2019 ICCV)

CCNet Framework
  • This whole framework is also kept in CCNet (2020 TPAMI).

1.1. CNN Backbone

  • A deep convolutional neural network (DCNN), which is designed in a fully convolutional fashion, as in DeepLabv2, is used to produce feature maps X with the spatial size of H×W.
  • The last two downsampling operations are removed and dilated convolutions are employed in the subsequent convolutional layers, so that the width/height of the output feature maps X is 1/8 of that of the input image (a minimal backbone sketch follows).
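As an illustration of this output-stride-8 setup, here is a minimal PyTorch sketch using torchvision's ResNet-101 as a stand-in backbone (an assumption for illustration, not necessarily the authors' exact network):

import torch
import torchvision

# Stand-in dilated backbone: the last two stages trade stride for dilation,
# so the feature maps keep 1/8 of the input resolution (output stride 8).
resnet = torchvision.models.resnet101(
    weights=None,
    replace_stride_with_dilation=[False, True, True],  # dilate conv4_x and conv5_x
)
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])  # drop avgpool and fc

image = torch.randn(1, 3, 512, 512)
X = backbone(image)
print(X.shape)  # torch.Size([1, 2048, 64, 64]) -> 512 / 8 = 64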

1.2. Criss-Cross Attention Module

Left: Criss-Cross Attention Module, and Right: its Information Propagation.
  • Given X, a convolutional layer is applied to obtain the feature maps H with reduced channel dimension; then, the feature maps H are fed into the criss-cross attention module to generate new feature maps H′.

The feature maps H′ only aggregate the contextual information in horizontal and vertical directions.

To obtain richer and denser context information, the feature maps H′ are fed into the criss-cross attention module again to obtain feature maps H′′. Thus, each position in the feature maps H′′ actually gathers the information from all pixels.

  • The two criss-cross attention modules share the same parameters to avoid adding too many extra parameters; this is named the recurrent criss-cross attention (RCCA) module (see the sketch below).
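To make the mechanism concrete, below is a minimal pure-PyTorch sketch of a criss-cross attention layer and of running it twice with shared weights (R = 2). It follows the commonly seen open-source formulation; the official CCNet code instead uses a dedicated CUDA kernel, and names such as query_conv and the channel reduction factor of 8 are illustrative assumptions.

import torch
import torch.nn as nn

class CrissCrossAttention(nn.Module):
    """Each position attends to the H + W - 1 positions on its own row and column."""
    def __init__(self, in_dim):
        super().__init__()
        self.query_conv = nn.Conv2d(in_dim, in_dim // 8, 1)  # channel reduction (assumed factor 8)
        self.key_conv = nn.Conv2d(in_dim, in_dim // 8, 1)
        self.value_conv = nn.Conv2d(in_dim, in_dim, 1)
        self.gamma = nn.Parameter(torch.zeros(1))            # learnable residual weight

    def forward(self, x):
        B, _, H, W = x.shape
        q, k, v = self.query_conv(x), self.key_conv(x), self.value_conv(x)
        # Column (vertical) affinities: (B*W, H, H); mask the diagonal so the centre
        # pixel is not counted twice (it appears again in the row branch).
        q_h = q.permute(0, 3, 1, 2).reshape(B * W, -1, H).permute(0, 2, 1)
        k_h = k.permute(0, 3, 1, 2).reshape(B * W, -1, H)
        mask = torch.diag(torch.full((H,), float("-inf"), device=x.device))
        e_h = (torch.bmm(q_h, k_h) + mask).reshape(B, W, H, H).permute(0, 2, 1, 3)  # (B, H, W, H)
        # Row (horizontal) affinities: (B*H, W, W)
        q_w = q.permute(0, 2, 1, 3).reshape(B * H, -1, W).permute(0, 2, 1)
        k_w = k.permute(0, 2, 1, 3).reshape(B * H, -1, W)
        e_w = torch.bmm(q_w, k_w).reshape(B, H, W, W)                               # (B, H, W, W)
        # Joint softmax over the H + W candidates of each position.
        att = torch.softmax(torch.cat([e_h, e_w], dim=3), dim=3)
        att_h = att[:, :, :, :H].permute(0, 2, 1, 3).reshape(B * W, H, H)
        att_w = att[:, :, :, H:].reshape(B * H, W, W)
        # Aggregate values along the column and the row, then add the residual input.
        v_h = v.permute(0, 3, 1, 2).reshape(B * W, -1, H)
        v_w = v.permute(0, 2, 1, 3).reshape(B * H, -1, W)
        out_h = torch.bmm(v_h, att_h.permute(0, 2, 1)).reshape(B, W, -1, H).permute(0, 2, 3, 1)
        out_w = torch.bmm(v_w, att_w.permute(0, 2, 1)).reshape(B, H, -1, W).permute(0, 2, 1, 3)
        return self.gamma * (out_h + out_w) + x

cca = CrissCrossAttention(64)
h = torch.randn(2, 64, 24, 24)
h1 = cca(h)   # H′: context from each pixel's criss-cross path only
h2 = cca(h1)  # H′′: second pass with the same weights (R = 2) gives full-image context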

1.3. Output

  • Then, the dense contextual feature H′′ is concatenated with the local representation feature X. It is followed by one or several convolutional layers with batch normalization and activation for feature fusion.
  • Finally, the fused features are fed into the segmentation layer to predict the final segmentation result (a hedged sketch of this fusion head follows).
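A sketch of this fusion step (the 512-channel width and the single 3×3 convolution are assumptions for illustration, not the paper's exact configuration):

import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Concatenate local features X with dense context H'' and predict classes."""
    def __init__(self, x_channels, h_channels, num_classes):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(x_channels + h_channels, 512, 3, padding=1, bias=False),
            nn.BatchNorm2d(512),
            nn.ReLU(inplace=True),
        )
        self.classifier = nn.Conv2d(512, num_classes, 1)  # segmentation layer

    def forward(self, x, h2):
        fused = self.fuse(torch.cat([x, h2], dim=1))  # feature fusion of X and H''
        return self.classifier(fused)                 # per-pixel logits, upsampled to input size afterwards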

2. CCNet (2020 TPAMI)

2.1. Learning Category Consistent Features

  • In TPAMI, besides the cross-entropy segmentation loss lseg, there is also a category consistent loss that drives the RCCA module to learn category consistent features directly.
  • In particular, three terms, denoted as lvar, ldis, and lreg, are adopted to
  1. penalize large distances between features with the same label for each instance,
  2. penalize small distances between the mean features of different labels, and
  3. draw the mean features of all categories towards the origin, respectively.
  • Let C be the set of classes and Nc the number of valid elements belonging to category c. hi is the feature vector at spatial position i, μc is the mean feature of category c ∈ C (the cluster center), φ is a piece-wise distance function, and δv and δd are the corresponding margins.
  • To reduce the computation load, a convolutional layer with 1×1 filters is first applied on the output of the RCCA module for dimension reduction, and the three losses are then applied on the feature map with fewer channels. The final loss l is a weighted sum of all the losses, as sketched after this list:
  • where δv = 0.5, δd = 1.5, α = β = 1, γ = 0.001, and the number of channels after dimension reduction is 16.
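For reference, here is a sketch of the objective in LaTeX, assuming the standard discriminative-loss structure for the three terms (the paper uses a piece-wise φ rather than a plain quadratic, so the exact inner form differs slightly):

\ell = \ell_{seg} + \alpha\,\ell_{var} + \beta\,\ell_{dis} + \gamma\,\ell_{reg}

\ell_{var} = \frac{1}{|C|} \sum_{c \in C} \frac{1}{N_c} \sum_{i=1}^{N_c} \varphi\big(\lVert \mu_c - h_i \rVert\big)

\ell_{dis} = \frac{1}{|C|\,(|C|-1)} \sum_{c_a \in C} \sum_{c_b \in C,\ c_b \neq c_a} \varphi\big(\lVert \mu_{c_a} - \mu_{c_b} \rVert\big)

\ell_{reg} = \frac{1}{|C|} \sum_{c \in C} \lVert \mu_c \rVert

Here lvar only penalizes within-class distances larger than the margin δv, ldis only penalizes between-class center distances smaller than the margin δd, and lreg pulls all cluster centers μc towards the origin.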

2.2. 3D Criss-Cross Attention

3D Criss-Cross Attention
  • In general, the architecture of 3D Criss-Cross Attention is an extension of the 2D version that additionally collects contextual information from the temporal dimension.

3. Experimental Results

3.1. Cityscapes

Comparison with state-of-the-arts on Cityscapes (test).

The proposed CCNet with single-scale testing still achieves comparable performance without bells and whistles.

When trained on both the train and val sets, CCNet substantially outperforms all previous state-of-the-art methods on the test set.

Visualization results of RCCA with different loops on Cityscapes validation set.
Performance on Cityscapes (val) for different number of loops in RCCA.
  • Adding a criss-cross attention module to the baseline, denoted as R = 1, improves the performance by 2.9%.

Increasing the number of loops from 1 to 2 further improves the performance by 1.8%, demonstrating the effectiveness of dense contextual information.

  • Finally, increasing loops from 2 to 3 slightly improves the performance by 0.4%.
Performance on Cityscapes (val) for different kinds of category consistent loss.

Using the piece-wise function in the loss achieves slightly better performance than a single quadratic function.

Comparison of context aggregation approaches on Cityscapes (val).

As seen, “+RCCA” takes two steps to form dense contextual information, so the second step can learn a better attention map by benefiting from the feature map produced by the first step, in which some long-range dependencies have already been embedded.

Comparison of the Non-local module and RCCA in terms of FLOPs and memory.

Compared with the “+NL” method as in Non-Local Neural Network, the proposed “+RCCA” requires 11× less GPU memory and reduces FLOPs by about 85% relative to the non-local block in computing full-image dependencies.

Visualization of attention module on Cityscapes validation set.

Long-range dependencies can be learnt when R = 2, but not when R = 1.

3.2. ADE20K

Comparison with state-of-the-arts on ADE20K (val).

CCNet with CCL achieves state-of-the-art performance of 45.76%, outperforming the previous state-of-the-art methods by more than 1.1% and the conference version of CCNet by 0.5%.

Visualized examples on ADE20K val set with/without category consistent loss (CCL).

Adding CCL produces much better results.

3.3. LIP

Comparison with state-of-the-arts on LIP (val).

CCNet achieves state-of-the-art performance of 55.47%, outperforming the previous state-of-the-art methods by more than 2.3%.

Visualized examples for human parsing result on LIP val set.

The top two rows show some successful segmentation results, showing that CCNet can produce accurate segmentation even for complicated poses. The third row shows a failure case where the “skirt” is misclassified as “pants”.

3.4. COCO

Comparison on COCO (val).

CCNet substantially outperforms the baseline in all metrics.

Visualized examples for instance segmentation result on COCO val set.

3.5. CamVid

  • The 3D version of CCNet, CCNet3D, is used on CamVid.

CCNet3D achieves an mIoU of 79.1%, outperforming all other methods by a large margin.
