Review — CCNet: Criss-Cross Attention for Semantic Segmentation
CCNet, Recurrent Cross-Shaped Self-Attention, More Efficient Than Non-Local Neural Network
CCNet: Criss-Cross Attention for Semantic Segmentation,
CCNet, by Huazhong University of Science and Technology, Horizon Robotics, ReLER, and University of Illinois at Urbana-Champaign
2019 ICCV, Over 1500 Citations, 2020 TPAMI, Over 1600 Citations (Sik-Ho Tsang @ Medium)
Semantic Segmentation, Instance Segmentation, Transformer
- Criss-Cross Network (CCNet) is proposed such that, for each pixel, a novel criss-cross attention module in CCNet harvests the contextual information of all the pixels on its criss-cross path.
- By taking a further recurrent operation, each pixel can finally capture the full-image dependencies from all pixels.
- Compared with Non-Local Neural Network, CCNet uses about 11× less GPU memory and reduces FLOPs by about 85%.
- In the TPAMI version, CCNet is further enhanced with a better loss function and extended to the 3D case.
- CCNet (2019 ICCV)
- CCNet (2020 TPAMI)
- Experimental Results
1. CCNet (2019 ICCV)
- This whole part is also kept in CCNet (2020 TPAMI).
1.1. CNN Backbone
- A deep convolutional neural network (DCNN), which is designed in a fully convolutional fashion, as in DeepLabv2, is used to produce feature maps X with the spatial size of H×W.
- The last two downsampling operations are removed and dilated convolutions are employed in the subsequent convolutional layers, which enlarges the width/height of the output feature maps X to 1/8 of the input image.
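The 1/8 figure follows from simple stride arithmetic. The sketch below assumes a ResNet-style backbone with five stride-2 steps (an overall stride of 32); the list of strides and the helper name `output_stride` are illustrative and not taken from the paper.

```python
def output_stride(stage_strides, num_dilated):
    """Overall downsampling factor when the last `num_dilated` strided
    steps are replaced by dilated (stride-1) convolutions."""
    kept = stage_strides[:len(stage_strides) - num_dilated]
    stride = 1
    for s in kept:
        stride *= s
    return stride

# Assumption: five stride-2 steps, as in a typical ResNet (1/32 overall).
RESNET_STRIDES = [2, 2, 2, 2, 2]

plain = output_stride(RESNET_STRIDES, 0)    # no dilation
dilated = output_stride(RESNET_STRIDES, 2)  # last two downsamplings removed
print(f"plain backbone: 1/{plain}, dilated backbone: 1/{dilated}")
```

Removing two stride-2 steps divides the overall stride by 4, which is how a 1/32 backbone ends up at 1/8.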
1.2. Criss-Cross Attention Module
- Given X, a convolutional layer is applied to obtain the dimension-reduced feature maps H. The feature maps H are then fed into the criss-cross attention module to generate new feature maps H′.
The feature maps H′ only aggregate the contextual information in horizontal and vertical directions.
To obtain richer and denser context information, the feature maps H′ are fed into the criss-cross attention module again to obtain feature maps H′′. Thus, each position in H′′ actually gathers information from all pixels.
- The two successive criss-cross attention modules share the same parameters to avoid adding too many extra parameters. Together they are named the recurrent criss-cross attention (RCCA) module.
- Then, the dense contextual feature H′′ is concatenated with the local representation feature X. It is followed by one or several convolutional layers with batch normalization and activation for feature fusion.
- Finally, the fused features are fed into the segmentation layer to predict the final segmentation result.
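The steps above can be sketched in NumPy. This is a minimal illustration rather than the authors' implementation: the 1×1 convolutions are written as plain matrix products, the weight names `wq`, `wk`, `wv` and the plain residual connection are assumptions, and the fusion and segmentation layers are omitted.

```python
import numpy as np

def criss_cross_attention(x, wq, wk, wv):
    """One criss-cross attention pass.

    x:      (C, H, W) feature map.
    wq, wk: (C_r, C) weights of the 1x1 query/key convolutions (C_r < C).
    wv:     (C, C) weights of the 1x1 value convolution.
    """
    _, h, _ = x.shape
    q = np.einsum("rc,chw->rhw", wq, x)
    k = np.einsum("rc,chw->rhw", wk, x)
    v = np.einsum("dc,chw->dhw", wv, x)

    # Affinities between (i, j) and every position (m, j) in its column.
    e_col = np.einsum("rij,rmj->ijm", q, k)      # (H, W, H)
    # Affinities between (i, j) and every position (i, n) in its row.
    e_row = np.einsum("rij,rin->ijn", q, k)      # (H, W, W)
    # (i, j) itself appears in both sets; mask the column copy so the
    # path holds H + W - 1 distinct positions.
    rows = np.arange(h)
    e_col[rows, :, rows] = -np.inf

    # Softmax over the whole criss-cross path of each position.
    e = np.concatenate([e_col, e_row], axis=-1)  # (H, W, H + W)
    e = e - e.max(axis=-1, keepdims=True)
    a = np.exp(e)
    a = a / a.sum(axis=-1, keepdims=True)
    a_col, a_row = a[..., :h], a[..., h:]

    # Weighted sum of values along the path, plus the input (residual).
    out = (np.einsum("ijm,dmj->dij", a_col, v)
           + np.einsum("ijn,din->dij", a_row, v))
    return out + x

def rcca(x, wq, wk, wv, loops=2):
    """Recurrent criss-cross attention: the same weights are reused per loop."""
    for _ in range(loops):
        x = criss_cross_attention(x, wq, wk, wv)
    return x

rng = np.random.default_rng(0)
c, c_r, h, w = 8, 2, 4, 5
feat = rng.standard_normal((c, h, w))
wq, wk = 0.1 * rng.standard_normal((2, c_r, c))
wv = 0.5 * rng.standard_normal((c, c))

# Perturb the top-left pixel and watch the bottom-right output.
bumped = feat.copy()
bumped[:, 0, 0] += 1.0
for loops in (1, 2):
    delta = rcca(bumped, wq, wk, wv, loops) - rcca(feat, wq, wk, wv, loops)
    print(loops, np.abs(delta[:, -1, -1]).max())
```

The perturbation check at the end illustrates the point of the recurrence: the top-left and bottom-right pixels share neither a row nor a column, so the bottom-right output only reacts to the change once a second pass is applied.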
2. CCNet (2020 TPAMI)
2.1. Learning Category Consistent Features
- In the TPAMI version, besides the cross-entropy segmentation loss lseg, a category consistent loss is added to drive the RCCA module to learn category consistent features directly.
- In particular, three terms, denoted as lvar, ldis, and lreg, are adopted to, respectively:
- penalize large distances between features with the same label for each instance,
- penalize small distances between the mean features of different labels, and
- draw the mean features of all categories towards the origin.
- Here, C is the set of classes, Nc is the number of valid elements belonging to category c, hi is the feature vector at spatial position i, and μc is the mean feature of category c ∈ C (the cluster center). φ is a piece-wise distance function, and δv and δd are the corresponding margins.
- To reduce the computation load, a convolutional layer with 1×1 filters is first applied to the output of the RCCA module for dimension reduction, and the three losses are then applied to the feature map with fewer channels. The final loss l is a weighted sum of all losses:
l = lseg + α·lvar + β·ldis + γ·lreg
- where δv=0.5, δd=1.5, α=β=1, γ=0.001, and the number of channels after dimension reduction is 16.
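To make the three terms concrete, here is a rough NumPy sketch. It is not the paper's exact formulation: the piece-wise function φ is not reproduced in this review, so a plain hinged quadratic stands in for it, and the 2·δd threshold between class means is likewise an assumption. The function names are hypothetical.

```python
import numpy as np

def category_consistent_terms(feats, labels, delta_v=0.5, delta_d=1.5):
    """feats: (N, D) feature vectors, labels: (N,) integer class ids.
    Returns (l_var, l_dis, l_reg)."""
    classes = np.unique(labels)
    means = np.stack([feats[labels == c].mean(axis=0) for c in classes])

    # l_var: pull features towards their class mean once they are
    # farther away than delta_v (stand-in hinge, not the paper's phi).
    l_var = 0.0
    for c, mu in zip(classes, means):
        dist = np.linalg.norm(feats[labels == c] - mu, axis=1)
        l_var += np.mean(np.maximum(dist - delta_v, 0.0) ** 2)
    l_var /= len(classes)

    # l_dis: push class means apart while they are closer than 2 * delta_d
    # (assumed threshold).
    n = len(classes)
    l_dis = 0.0
    if n > 1:
        for a in range(n):
            for b in range(n):
                if a != b:
                    d = np.linalg.norm(means[a] - means[b])
                    l_dis += np.maximum(2 * delta_d - d, 0.0) ** 2
        l_dis /= n * (n - 1)

    # l_reg: keep the class means close to the origin.
    l_reg = np.mean(np.linalg.norm(means, axis=1))
    return l_var, l_dis, l_reg

def total_loss(l_seg, l_var, l_dis, l_reg, alpha=1.0, beta=1.0, gamma=0.001):
    """Weighted sum with the weights quoted above."""
    return l_seg + alpha * l_var + beta * l_dis + gamma * l_reg

# Two tight, well-separated clusters: only the regulariser is non-zero.
feats = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 0.0], [10.1, 0.0]])
labels = np.array([0, 0, 1, 1])
print(category_consistent_terms(feats, labels))
```

In practice these terms would be computed on the 16-channel feature map mentioned above, with one feature vector per labelled pixel.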
2.2. 3D Criss-Cross Attention
- In general, the architecture of 3D criss-cross attention is an extension of the 2D version that additionally collects contextual information from the temporal dimension.
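One way to see what the extra dimension changes is to treat the criss-cross path as "all positions that differ from the current one in at most one coordinate" and count how many passes are needed before every position can reach every other. The toy check below does this for small grids; the helper names are hypothetical, and it only models connectivity, not the attention weights.

```python
import numpy as np
from itertools import product

def criss_cross_mask(shape):
    """Boolean (N, N) matrix: entry [u, v] is True when positions u and v
    differ in at most one coordinate, i.e. v lies on u's criss-cross path."""
    coords = np.array(list(product(*[range(s) for s in shape])))  # (N, ndim)
    differing = (coords[:, None, :] != coords[None, :, :]).sum(axis=-1)
    return differing <= 1

def loops_for_full_coverage(shape):
    """Number of passes until every position has received information
    from every other position."""
    mask = criss_cross_mask(shape).astype(int)
    reach = np.eye(mask.shape[0], dtype=int)
    loops = 0
    while not reach.all():
        reach = ((reach @ mask) > 0).astype(int)
        loops += 1
    return loops

for shape in [(4, 5), (3, 4, 5)]:
    path = int(criss_cross_mask(shape).sum(axis=1)[0])
    print(shape, "path size:", path, "loops:", loops_for_full_coverage(shape))
```

In this toy model a 2D grid is fully connected after two passes, while a T×H×W grid needs a third pass, since each pass can only change one coordinate at a time.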
3. Experimental Results
The proposed CCNet with single-scale testing still achieves comparable performance without bells and whistles.
With both the train and val sets used for training, CCNet substantially outperforms all previous state-of-the-art methods on the test set.
- Adding a criss-cross attention module to the baseline, denoted as R=1, improves the performance by 2.9%.
Furthermore, increasing the number of loops from 1 to 2 can further improve the performance by 1.8%, demonstrating the effectiveness of dense contextual information.
- Finally, increasing loops from 2 to 3 slightly improves the performance by 0.4%.
Using the piece-wise function in the loss achieves slightly better performance than a single quadratic function.
As seen, “+RCCA” takes two steps to form dense contextual information, so the latter step can learn a better attention map by benefiting from the feature map produced by the first step, in which some long-range dependencies have already been embedded.
Compared with the “+NL” method, as in Non-Local Neural Network, the proposed “+RCCA” requires 11× less GPU memory and reduces FLOPs by about 85% relative to the non-local block when computing full-image dependencies.
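A back-of-the-envelope count shows where the saving comes from. The sketch below only counts attention-map entries, (H×W)² for a non-local block versus (H×W)×(H+W−1) per criss-cross pass; it is not a reproduction of the measured 11× and 85% figures, which cover the whole block. The 97×97 feature map is a hypothetical example.

```python
def attention_map_entries(h, w, loops=2):
    """Number of attention weights needed for an h x w feature map.

    Returns (non_local, criss_cross): a non-local block relates every
    position to all h*w positions once; criss-cross attention relates each
    position to the h + w - 1 positions on its path, once per loop.
    """
    n = h * w
    non_local = n * n
    criss_cross = loops * n * (h + w - 1)
    return non_local, criss_cross

# Hypothetical example: a 97 x 97 feature map (e.g. a 769 x 769 crop at 1/8).
nl, cc = attention_map_entries(97, 97)
print(f"non-local: {nl:,}  criss-cross (R=2): {cc:,}  ratio: {nl / cc:.1f}x")
```

The gap widens as the feature map grows, because the non-local count scales with the square of the number of positions while the criss-cross count scales with the number of positions times the side lengths.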
Long-range dependency can be learnt when R=2, but not when R=1.
CCNet with CCL (category consistent loss) achieves the state-of-the-art performance of 45.76%, outperforming the previous state-of-the-art methods by more than 1.1% and the conference version of CCNet by 0.5%.
Adding CCL obtains much better results.
CCNet achieves the state-of-the-art performance of 55.47%, outperforming the previous state-of-the-art methods by more than 2.3%.
The top two rows show some successful segmentation results, indicating that CCNet can produce accurate segmentation even for complicated poses. The third row shows a failure case where the “skirt” is misclassified as “pants”.
CCNet substantially outperforms the baseline in all metrics.
- The 3D version of CCNet, CCNet3D, is used on CamVid.
CCNet3D achieves an mIoU of 79.1%, outperforming all other methods by a large margin.
[2019 ICCV] [CCNet]
CCNet: Criss-Cross Attention for Semantic Segmentation
[2020 TPAMI] [CCNet]
CCNet: Criss-Cross Attention for Semantic Segmentation