Brief Review — Weakly- and Semi-Supervised Learning of a Deep Convolutional Network for Semantic Image Segmentation
Weakly-Supervised Expectation-Maximization (EM) Methods
Weakly- and Semi-Supervised Learning of a Deep Convolutional Network for Semantic Image Segmentation, Weakly-Supervised EM, by Google and UCLA
2015 ICCV, Over 1300 Citations (Sik-Ho Tsang @ Medium)
- (Reading Segment Anything Model (SAM) reminded me of a paper I read a long time ago, on semantic segmentation with only image-level labels or bounding box labels as input, for weakly/semi-supervised learning.)
- In this paper, Expectation-Maximization (EM) methods are developed for semantic segmentation model training under these weakly/semi-supervised settings.
Outline
- Preliminaries
- Weakly-Supervised Methods (Image-Level Annotations)
- Weakly-Supervised Methods (Bounding Box Annotations)
- Semi-Supervised Method (Mixed Strong and Weak Annotations)
- Results
1. Preliminaries
1.1. Symbol Definition
- x is the image. y is the segmentation map.
- In particular, ym ∈ {0, … , L} is the pixel label at position m ∈ {1, … , M}, assuming that we have the background as well as L possible foreground labels and M is the number of pixels.
1.2. Fully-Supervised Setting
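- In the fully supervised case, the objective function is the log-likelihood of the annotated segmentation: J(θ) = log P(y|x; θ) = Σm log P(ym|x; θ), with the sum running over m = 1, … , M.
- where θ is the vector of model parameters. The per-pixel label distributions are computed by a softmax over the model outputs: P(ym|x; θ) ∝ exp(fm(ym|x; θ)).
- where fm(ym|x; θ) is the model output at pixel m.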
- J(θ) is optimized by mini-batch SGD.
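To make the fully supervised objective concrete, here is a minimal sketch in plain Python. It assumes the per-pixel distribution is a softmax over the model outputs fm, and it treats the image as a flat list of M pixels. The function names and the toy scores are illustrative assumptions, not taken from the paper.

```python
import math

def log_softmax(scores):
    """Log of the per-pixel label distribution P(y_m | x; theta) ∝ exp(f_m(y_m))."""
    peak = max(scores)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return [s - log_norm for s in scores]

def objective(pixel_scores, labels):
    """J(theta): sum over pixels of log P(y_m | x; theta) at the annotated labels."""
    return sum(log_softmax(f_m)[y_m] for f_m, y_m in zip(pixel_scores, labels))

# Toy image with M = 3 pixels and labels {0: background, 1, 2} (made-up scores).
pixel_scores = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 0.0]]
labels = [0, 1, 2]
J = objective(pixel_scores, labels)
```

J is always non-positive, and it grows toward 0 as the scores of the annotated labels dominate, which is what SGD pushes for.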
2. Weakly-Supervised Methods (Image-Level Annotations)
- When only image-level annotation is available, we can observe the image values x and the image-level labels z, but the pixel-level segmentations y are latent variables. Then we have the following probabilistic graphical model: P(x, y, z; θ) = P(x) (Πm P(ym|x; θ)) P(z|y).
- The expected complete-data log-likelihood given the previous parameter estimate θ′ is: Q(θ; θ′) = Σy P(y|x, z; θ′) log P(y|x; θ) ≈ log P(ŷ|x; θ).
- where a hard-EM approximation is adopted, estimating the latent segmentation in the E-step of the algorithm by: ŷ = argmax over y of [log P(y|x; θ′) + log P(z|y)] = argmax over y of [Σm fm(ym|x; θ′) + log P(z|y)].
2.1. EM-Fixed
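- In this variant, it is assumed that log P(z|y) factorizes over pixel positions as: log P(z|y) = Σm φ(ym, z) + const.
- This allows the E-step segmentation to be estimated at each pixel separately: ŷm = argmax over ym of [fm(ym|x; θ′) + φ(ym, z)].
- The potential is φ(ym = l, z) = bl if zl = 1, and 0 if zl = 0. The parameters are bl = bfg if l > 0 and b0 = bbg, with bfg > bbg > 0.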
- Intuitively, this potential encourages a pixel to be assigned to one of the image-level labels z. bfg > bbg boosts present foreground classes more than the background, to encourage full object coverage and avoid a degenerate solution.
- bfg = 5 and bbg = 3.
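The EM-Fixed E-step described above can be sketched as a per-pixel biased argmax. This is a minimal illustration, assuming that labels present in the image get the bias bfg (foreground) or bbg (background), that absent labels get no bias, and that background is always counted as present. The function name and the toy scores are hypothetical.

```python
def em_fixed_e_step(pixel_scores, present, b_fg=5.0, b_bg=3.0):
    """Return the hard-EM label estimate for every pixel.

    pixel_scores: M lists of L+1 scores f_m(l); index 0 is background.
    present: set of labels l with z_l = 1 (assumed here to include 0).
    """
    estimate = []
    for f_m in pixel_scores:
        biased = []
        for l, score in enumerate(f_m):
            bias = 0.0  # absent labels (z_l = 0) are not boosted
            if l in present:
                bias = b_bg if l == 0 else b_fg
            biased.append(score + bias)
        estimate.append(max(range(len(biased)), key=biased.__getitem__))
    return estimate

# Toy scores for 3 pixels and labels {0, 1, 2}; the image-level labels say
# that only background and class 1 are present.
scores = [[4.0, 3.0, 3.5], [1.0, 0.5, 6.5], [5.0, 1.0, 0.0]]
estimate = em_fixed_e_step(scores, present={0, 1})
```

In this toy case the first pixel flips from background to class 1 because bfg > bbg, while the second pixel still goes to the absent class 2: the bias only encourages the image-level labels, it is not a hard constraint.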
2.2. EM-Adapt
- Instead of the fixed biases of EM-Fixed, EM-Adapt encourages at least a ρl portion of the image area to be assigned to class l if zl = 1, and enforces that no pixel is assigned to class l if zl = 0. In effect, EM-Adapt adaptively sets the image- and class-dependent biases bl.
- ρfg = 20% and ρbg = 40%.
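One way to picture adaptive biases is the sketch below. It is an illustrative heuristic, not the paper's exact procedure: for each present label it picks the smallest bias that lets the label match the best present label on a ρl fraction of the pixels, and it rules out absent labels with a bias of minus infinity. The biases are computed independently per label, so the coverage target is only approximate when several labels are boosted. All names and the toy scores are assumptions.

```python
import math

def em_adapt_e_step(pixel_scores, present, rho_fg=0.2, rho_bg=0.4):
    """Hard E-step with per-image adaptive biases (illustrative sketch).

    pixel_scores: M lists of L+1 scores f_m(l); index 0 is background.
    present: set of labels l with z_l = 1 (assumed to include 0).
    """
    num_pixels = len(pixel_scores)
    num_labels = len(pixel_scores[0])
    biases = []
    for l in range(num_labels):
        if l not in present:
            biases.append(-math.inf)  # z_l = 0: label l is ruled out everywhere
            continue
        rho = rho_bg if l == 0 else rho_fg
        # Gap between label l and the best present label at every pixel.
        gaps = sorted(max(f_m[k] for k in present) - f_m[l] for f_m in pixel_scores)
        target = max(1, math.ceil(rho * num_pixels))  # pixels label l should win
        biases.append(gaps[target - 1])  # smallest bias that closes `target` gaps
    estimate = []
    for f_m in pixel_scores:
        biased = [score + bias for score, bias in zip(f_m, biases)]
        # Ties go to the higher label index so a boosted foreground class wins.
        estimate.append(max(range(num_labels), key=lambda l: (biased[l], l)))
    return estimate, biases

# Toy scores for 5 pixels; only background and class 1 are present.
pixel_scores = [[5.0, 1.0, 9.0], [5.0, 2.0, 0.0], [5.0, 4.0, 0.0],
                [3.0, 1.0, 0.0], [6.0, 0.0, 0.0]]
estimate, biases = em_adapt_e_step(pixel_scores, present={0, 1})
```

Here class 1 ends up on one of the five pixels (20%), and class 2 is never assigned even though it had the highest raw score at the first pixel.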
3. Weakly-Supervised Methods (Bounding Box Annotations)
3.1. Bbox-Rect
- The Bbox-Rect method simply treats each pixel within a bounding box as a positive example for the respective object class. Ambiguities are resolved by assigning pixels that belong to multiple bounding boxes to the box with the smallest area.
- However, bounding boxes fully surround objects and therefore also contain background pixels, which contaminate the training set with false positive examples.
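The Bbox-Rect labeling rule can be sketched in a few lines. The box format (label, top, left, bottom, right) with exclusive bottom/right edges is an assumption made for this illustration, as is the function name.

```python
def bbox_rect_labels(height, width, boxes):
    """Fill each box with its class label; the smallest box wins on overlap.

    boxes: list of (label, top, left, bottom, right), bottom/right exclusive.
    """
    def area(box):
        _, top, left, bottom, right = box
        return (bottom - top) * (right - left)

    label_map = [[0] * width for _ in range(height)]  # 0 = background
    # Paint large boxes first so that smaller boxes overwrite them on overlap.
    for label, top, left, bottom, right in sorted(boxes, key=area, reverse=True):
        for row in range(top, bottom):
            for col in range(left, right):
                label_map[row][col] = label
    return label_map

# A 4x6 image with a large box of class 1 and a smaller overlapping box of class 2.
label_map = bbox_rect_labels(4, 6, [(1, 0, 0, 4, 4), (2, 1, 2, 3, 5)])
```

Every pixel inside a box becomes foreground, including the background pixels near the box corners, which is the contamination problem noted above.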
3.2. Bbox-Seg
- To filter out these background pixels, the CRF used in DeepLab is applied here as well.
- More specifically, the center area of the bounding box (a fixed percentage of the pixels within the box) is constrained to be foreground.
- The CRF parameters are estimated on a held-out set.
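As a rough illustration of the constraints handed to the CRF, the sketch below builds a seed map for a single box: the central region is fixed to the object class, pixels inside the box but outside the center are left undecided, and pixels outside the box are set to background. The background rule, the `center_fraction` parameter, its value, and the box format are all assumptions made for this example; the CRF inference itself is not shown.

```python
import math

UNKNOWN = -1  # pixels left for the CRF to decide

def bbox_seg_seeds(height, width, box, center_fraction):
    """Seed map for one box: center = object label, rest of box = UNKNOWN.

    box: (label, top, left, bottom, right), bottom/right exclusive.
    center_fraction: hypothetical share of the box area fixed to foreground.
    """
    label, top, left, bottom, right = box
    seeds = [[0] * width for _ in range(height)]  # outside the box: background
    for row in range(top, bottom):
        for col in range(left, right):
            seeds[row][col] = UNKNOWN
    # Shrink both sides by sqrt(fraction) so the area shrinks by the fraction.
    scale = math.sqrt(center_fraction)
    inner_h = max(1, round((bottom - top) * scale))
    inner_w = max(1, round((right - left) * scale))
    inner_top = top + (bottom - top - inner_h) // 2
    inner_left = left + (right - left - inner_w) // 2
    for row in range(inner_top, inner_top + inner_h):
        for col in range(inner_left, inner_left + inner_w):
            seeds[row][col] = label
    return seeds

# A 6x6 image with one 4x4 box of class 1; a quarter of the box is fixed.
seeds = bbox_seg_seeds(6, 6, (1, 1, 1, 5, 5), center_fraction=0.25)
```

A CRF would then assign the UNKNOWN pixels to foreground or background based on image appearance.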
3.3. Bbox-EM-Fixed
- The method is a variant of the EM-Fixed algorithm mentioned previously, in which only the present foreground object scores within the bounding box area are boosted.
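A minimal sketch of this box-restricted boosting follows. It assumes the same style of additive bias as EM-Fixed, applied to an object class only at pixels inside one of its boxes, with background left unbiased. The bias value, the box format, and the function name are assumptions for illustration.

```python
def bbox_em_fixed_e_step(score_map, boxes, b_fg=5.0):
    """Biased argmax where a class is boosted only inside its bounding boxes.

    score_map[row][col]: list of L+1 scores; index 0 is background.
    boxes: list of (label, top, left, bottom, right), bottom/right exclusive.
    """
    estimate = []
    for row, score_row in enumerate(score_map):
        out_row = []
        for col, scores in enumerate(score_row):
            biased = list(scores)
            for label, top, left, bottom, right in boxes:
                if top <= row < bottom and left <= col < right:
                    biased[label] = scores[label] + b_fg  # boost once per class
            out_row.append(max(range(len(biased)), key=biased.__getitem__))
        estimate.append(out_row)
    return estimate

# A 1x3 image with labels {0, 1}; the box for class 1 covers the last two pixels.
score_map = [[[2.0, 0.0], [2.0, 0.0], [9.0, 0.0]]]
estimate = bbox_em_fixed_e_step(score_map, [(1, 0, 1, 1, 3)])
```

The middle pixel flips to class 1, while the last pixel stays background because its background score is strong enough to survive the boost. This is how the E-step can carve an object mask out of a box.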
4. Semi-Supervised Method (Mixed Strong and Weak Annotations)
- Mixing strong (pixel-level) and weak annotations gives a semi-supervised setting.
- In SGD training of deep CNN models, each mini-batch has a fixed proportion of strongly/weakly annotated images, and the proposed EM algorithm is used for estimating at each iteration the latent semantic segmentations for the weakly annotated images.
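The fixed-proportion mini-batch can be sketched as below. The batch size, the strong fraction, and the image ids are made-up values for illustration, since this review does not state them.

```python
import random

def sample_mixed_batch(strong_ids, weak_ids, batch_size, strong_fraction, rng):
    """Draw a mini-batch with a fixed share of strongly annotated images."""
    num_strong = round(batch_size * strong_fraction)
    num_weak = batch_size - num_strong
    batch = [(i, "strong") for i in rng.sample(strong_ids, num_strong)]
    batch += [(i, "weak") for i in rng.sample(weak_ids, num_weak)]
    return batch

rng = random.Random(0)
strong_ids = list(range(100))      # hypothetical ids of pixel-level annotated images
weak_ids = list(range(100, 1000))  # hypothetical ids of weakly annotated images
batch = sample_mixed_batch(strong_ids, weak_ids, batch_size=20,
                           strong_fraction=0.25, rng=rng)
```

During training, images tagged "strong" would use their ground-truth segmentation directly, while images tagged "weak" would first go through the E-step to get an estimated segmentation for the current iteration.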
5. Results
5.1. Weakly-Supervised (Image-Level Annotations)
Table 1: Using 1,464 pixel-level and 9,118 image-level annotations in the EM-Fixed semi-supervised setting significantly improves performance, yielding 64.6% and approaching the fully supervised result of 67.6%.
Table 2: Using 2.9k pixel-level annotations along with 9k image-level annotations in the semi-supervised setting yields 68.5%, approaching the fully supervised result of 70.3%.
5.2. Weakly-Supervised (Bounding Box Annotations)
Table 3: Bbox-Seg improves over Bbox-Rect by 8.1%, and gets within 7.0% of the strong pixel-level annotation result. Combining 1,464 strong pixel-level annotations with weak bounding box annotations yields 65.1%, only 2.5% below the strong pixel-level annotation result.
Table 4: Bbox-EM-Fixed improves over Bbox-Seg as more strong annotations are added, and it performs 1.0% better (69.0% vs. 68.0%) with 2.9k strong annotations.
- This shows that the E-step of the EM algorithm can estimate the object masks better than the foreground-background segmentation pre-processing step.
5.3. Cross-Dataset Pretraining/Training
- Cross-Pretrain and Cross-Joint training across datasets are also tried.
- (Please feel free to read the paper directly for the explanation of this part.)