Brief Review — Learning Transferrable Knowledge for Semantic Segmentation with Deep Convolutional Neural Network
Learning Transferrable Knowledge for Semantic Segmentation with Deep Convolutional Neural Network
TransferNet, by POSTECH and University of Michigan,
2016 CVPR, Over 190 Citations (Sik-Ho Tsang @ Medium)
- A decoupled encoder-decoder architecture with an attention model is designed. The model first generates spatial highlights of each category present in the image using the attention model, and then performs binary segmentation of each highlighted region using the decoder.
- Combined with the attention model, a decoder trained with segmentation annotations from different categories boosts the accuracy of weakly-supervised semantic segmentation.
- The network is composed of four parts: encoder, attention model, classifier and decoder.
1.1. Encoder f_enc
The encoder f_enc is a frozen, ImageNet-pretrained VGGNet.
- Given an input image x, the network first extracts a feature descriptor A = f_enc(x).
1.2. Attention Model
The objective of the attention model is to learn a set of positive weight vectors α over a 2D space.
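In the notation of this section, the attention step can be written roughly as below. This is a sketch reconstructed from the surrounding definitions: the softmax normalization follows the remark that α is obtained after Softmax, and M (the number of spatial locations in A) is an assumed symbol.

```latex
\[
\begin{aligned}
v^{l} &= f_{att}(A, y^{l}; \theta_{\alpha}), \\
\alpha^{l}_{i} &= \frac{\exp(v^{l}_{i})}{\sum_{j=1}^{M} \exp(v^{l}_{j})}, \qquad i = 1, \dots, M .
\end{aligned}
\]
```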
- where y^l is a one-hot label vector for the l-th category, θ_α denotes the parameters of the attention model, and v^l represents the unnormalized attention weights.
- Specifically, f_att is computed by 3 weight vectors:
- z^l is the category-specific feature obtained by aggregating the features over the spatial region with the attention weights: z^l = A^T α^l.
- z^l is fed to f_cls, which consists of two fully-connected layers for classification.
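The normalization and aggregation steps above can be illustrated with a minimal sketch in plain Python. It assumes A is stored as M rows (spatial locations) of D channels and that the scores v are already given; the function names `softmax` and `attend` and the toy numbers are made up for illustration, and f_att itself is not modeled.

```python
import math

def softmax(scores):
    """Normalize a list of scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(A, v):
    """Return (alpha, z) for features A (M rows of D channels) and scores v.

    alpha: attention weights over the M locations.
    z:     attention-weighted sum of the M feature vectors (length D).
    """
    alpha = softmax(v)
    D = len(A[0])
    z = [sum(alpha[i] * A[i][d] for i in range(len(A))) for d in range(D)]
    return alpha, z

# Toy example: M = 3 spatial locations, D = 2 channels (made-up numbers).
A = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
v = [0.0, 0.0, 0.0]  # equal scores give uniform attention
alpha, z = attend(A, v)
print(alpha, z)
```

With equal scores every location gets the same weight, so z is simply the mean feature vector; a peaked v would make z resemble the features at the highlighted locations.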
Using images with weak annotations from both the target and source domains, the attention model and classifier are jointly trained to minimize the classification loss:
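One way to write this objective in the notation of this section is sketched below. The symbols T and S (target and source image sets), e_c (the classification loss), and θ_c (the classifier parameters) are assumed names, and the inner sum is taken over the categories l present in image i.

```latex
\[
\min_{\theta_{\alpha},\, \theta_{c}} \;
\sum_{i \in T \cup S} \; \sum_{l} \;
e_{c}\!\left( y_{i}^{l},\; f_{cls}(z_{i}^{l}; \theta_{c}) \right)
\]
```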
- Since α^l is the output of a softmax, it tends to be sparse, which loses information needed for semantic segmentation.
- The intermediate representation z^l is exploited instead. Using z^l as coefficients, the spatial activations in each channel of the feature are aggregated, which gives s^l = A z^l,
- where s^l is the densified attention, which has the same size as α^l and serves as the input to the decoder.
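The densification step described above amounts to one weighted sum per spatial location. A minimal sketch in plain Python, assuming A is stored as M rows of D channels and z has length D (the function name `densify` and the numbers are hypothetical):

```python
def densify(A, z):
    """Compute s_i = sum_d A[i][d] * z[d], one score per spatial location."""
    return [sum(a_d * z_d for a_d, z_d in zip(row, z)) for row in A]

# Toy example: M = 3 spatial locations, D = 2 channels (made-up numbers).
A = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
z = [0.5, 2.0]  # channel coefficients from the attention-weighted feature
s = densify(A, z)
print(s)
```

Unlike the softmax output, s is not forced to sum to 1, so several locations can keep large values at once, which is what makes it a denser input for the decoder.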
Given the densified attention s^l as input, the attention model and decoder are jointly trained to minimize the segmentation loss:
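In the notation of this section, this objective can be sketched as follows. The symbols S (the set of source images with segmentation annotations), e_s (the segmentation loss), f_dec (the decoder), and θ_s (the decoder parameters) are assumed names.

```latex
\[
\min_{\theta_{\alpha},\, \theta_{s}} \;
\sum_{i \in S} \; \sum_{l} \;
e_{s}\!\left( d_{i}^{l},\; f_{dec}(s_{i}^{l}; \theta_{s}) \right)
\]
```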
- where d_i^l is the binary segmentation map of the i-th image for the l-th category.
- The overall objective function:
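A compact way to express the combined objective is sketched below, writing L_cls for the classification loss and L_seg for the segmentation loss described above. The weighting factor λ and the shorthand names are assumptions made for illustration.

```latex
\[
\min_{\theta_{\alpha},\, \theta_{c},\, \theta_{s}} \;
\mathcal{L}_{cls}(\theta_{\alpha}, \theta_{c})
\;+\; \lambda \, \mathcal{L}_{seg}(\theta_{\alpha}, \theta_{s})
\]
```

Because the attention parameters θ_α appear in both terms, the attention model is shaped by the classification and segmentation signals at the same time.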
The proposed algorithm TransferNet outperforms all weakly-supervised semantic segmentation techniques by substantial margins.
- Subsets of the training data are randomly constructed at different size ratios (50%, 25%, 10%, 5% and 1%), and the performance at each size is averaged over 3 subsets.
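The subset protocol above can be sketched in plain Python as follows. This is only an illustration: `sample_subsets`, `average_score`, and `dummy_evaluate` are hypothetical names, the integer IDs stand in for annotated training examples, and the real evaluation (training and measuring segmentation accuracy) is replaced by a placeholder.

```python
import random

def sample_subsets(items, ratio, n_subsets=3, seed=0):
    """Draw n_subsets random subsets, each with about ratio * len(items) items."""
    rng = random.Random(seed)
    size = max(1, round(len(items) * ratio))
    return [rng.sample(items, size) for _ in range(n_subsets)]

def average_score(items, ratio, evaluate, n_subsets=3, seed=0):
    """Average a user-supplied evaluate(subset) -> float over the random subsets."""
    subsets = sample_subsets(items, ratio, n_subsets, seed)
    return sum(evaluate(s) for s in subsets) / len(subsets)

# Stand-in IDs for annotated training examples (made-up data).
annotation_ids = list(range(1000))

def dummy_evaluate(subset):
    """Placeholder for 'train on this subset and measure accuracy'."""
    return len(subset) / len(annotation_ids)

for ratio in (0.50, 0.25, 0.10, 0.05, 0.01):
    print(ratio, average_score(annotation_ids, ratio, dummy_evaluate))
```

Averaging over several random subsets reduces the chance that a result at a small ratio, such as 1%, is driven by one lucky or unlucky draw.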
The performance of the proposed algorithm is still better than that of other weakly-supervised methods even with a very small fraction of annotations. This suggests that exploiting even a small number of segmentations from other categories can effectively reduce the gap between approaches based on strong and weak supervision.
The proposed algorithm often produces accurate segmentations in the target domain by transferring the decoder trained with source domain examples.