Review — Learning to Segment Every Thing
Mask X R-CNN, Weakly Supervised Segmentation for Unseen-Class Objects
Learning to Segment Every Thing,
Mask X R-CNN, by BAIR, UC Berkeley, and Facebook AI Research (FAIR)
2018 CVPR, Over 290 Citations (Sik-Ho Tsang @ Medium)
Instance Segmentation, Weakly Supervised, Faster R-CNN, Mask R-CNN
- A new partially supervised training paradigm is proposed, with a novel weight transfer function that enables training instance segmentation models on a large set of categories, all of which have box annotations but only a small fraction of which have mask annotations.
- Mask X R-CNN
1. Mask X R-CNN
- Let C be the set of object categories, with C = A ∪ B, where examples from the categories in A have masks, while those in B have only bounding boxes.
- Training on this combination of strong and weak labels is referred to as a partially supervised learning problem.
1.1. Mask Prediction Using Weight Transfer
- The method is built on Mask R-CNN.
- During training, the mask branch is trained jointly and in parallel with the standard bounding box head found in Faster R-CNN.
- Instead of learning the category-specific bounding box parameters and mask parameters independently, a category’s mask parameters are predicted from its bounding box parameters using a generic, category-agnostic weight transfer function.
- Specifically, let wcdet be the class-specific object detection weights in the last layer of the bounding box head.
- Instead of treating wcseg as model parameters, wcseg is parameterized using a generic weight prediction function T(·):

wcseg = T(wcdet; θ)

- where θ are the class-agnostic, learned parameters of T.
- T() can be implemented as a small fully connected neural network.
- The bounding box head contains two types of detection weights: the RoI classification weights wccls and the bounding box regression weights wcbox.
- wcdet can be the classification weights wccls alone, the box regression weights wcbox alone, or their concatenation [wccls, wcbox] (denoted “cls+box”).
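A minimal sketch of the transfer function T, assuming hypothetical weight dimensions (not from the paper): a 2-layer MLP with LeakyReLU applied to the concatenated “cls+box” detection weights, which is the best setting found in the ablation.

```python
import numpy as np

rng = np.random.default_rng(0)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

# Hypothetical dimensions (for illustration only): per-class classification
# weights, box regression weights, MLP hidden size, and mask-head weights.
D_CLS, D_BOX, D_HID, D_SEG = 1024, 4096, 256, 256

# theta: class-agnostic parameters of T, shared across all classes
theta = {
    "W1": rng.standard_normal((D_CLS + D_BOX, D_HID)) * 0.01,
    "W2": rng.standard_normal((D_HID, D_SEG)) * 0.01,
}

def transfer_T(w_cls, w_box, theta):
    """Predict class-specific mask weights w_seg = T(w_det; theta),
    where w_det = [w_cls, w_box] ("cls+box"), via a 2-layer MLP."""
    w_det = np.concatenate([w_cls, w_box])
    return leaky_relu(w_det @ theta["W1"]) @ theta["W2"]

# Example: detection weights of one class c -> predicted mask weights
w_cls = rng.standard_normal(D_CLS)
w_box = rng.standard_normal(D_BOX)
w_seg = transfer_T(w_cls, w_box, theta)
```

Because θ is shared, a class in B that only ever receives box supervision still gets mask weights, predicted from its detection weights.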
1.2. Training Procedures
1.2.1. Stage-Wise Training
- In the first stage, Faster R-CNN is trained using only the bounding box annotations of the classes in A∪B, and then in the second stage the additional mask head is trained while keeping the convolutional features and the bounding box head fixed.
- However, separate training may result in inferior performance.
1.2.2. End-to-End Training
- To train both heads jointly while resolving the discrepancy that the mask loss is computed only on classes in A, stop_grad is used to block the gradient of the mask loss from flowing through T into the detection weights:
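The stop-gradient trick can be sketched numerically with toy scalars (not the real heads): the mask loss uses the value of wdet through T, but its gradient with respect to wdet is defined to be zero, so wdet is driven only by the box/classification losses over A ∪ B.

```python
# Toy scalar sketch of stop_grad (illustrative values, not the real model).
w_det, theta, y = 2.0, 0.5, 0.0   # detection weight, T's parameter, target

# Forward: mask weight via T uses the *value* of w_det
w_seg = theta * w_det             # = 1.0
L_mask = (w_seg - y) ** 2         # toy mask loss

# Backward: the mask loss updates theta ...
grad_theta = 2 * (w_seg - y) * w_det        # = 4.0
# ... but stop_grad blocks the path back into w_det,
# so the mask loss contributes no gradient there.
grad_w_det_from_mask = 0.0
```

In an autograd framework this corresponds to feeding T a detached copy of the detection weights (e.g. `w_det.detach()` in PyTorch).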
1.3. Extension: Fused FCN+MLP Mask Heads
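The extension combines the class-specific FCN mask branch with a class-agnostic fully connected (MLP) branch that predicts a single mask over the whole RoI, fusing the two by element-wise addition of logits. A minimal sketch, assuming a 28×28 mask resolution (the usual Mask R-CNN mask-head output; the random logits are placeholders):

```python
import numpy as np

rng = np.random.default_rng(1)
M = 28  # assumed mask resolution of the mask head output

# Hypothetical per-RoI predictions:
# the FCN branch gives a class-specific M x M logit map,
# the class-agnostic MLP branch gives one M*M vector for the whole object.
fcn_logits = rng.standard_normal((M, M))
mlp_logits = rng.standard_normal(M * M).reshape(M, M)

# Fusion: element-wise addition of logits, then per-pixel sigmoid
fused = fcn_logits + mlp_logits
mask_prob = 1.0 / (1.0 + np.exp(-fused))
```

The intuition is complementary error modes: the FCN captures local detail, while the MLP captures the gist of the whole object shape.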
2. Results
- Either ImageNet-pretrained ResNet-50-FPN or ResNet-101-FPN is used as the backbone architecture for Mask R-CNN.
2.1. Ablation Study
- (a) Input to T: Input “cls+box” gives the best results.
- (b) Structure of T: A 2-layer MLP with LeakyReLU gives the best mask AP on set B. Given this, the “cls+box, 2-layer, LeakyReLU” implementation of T is used for all subsequent experiments.
- (c) MLP Mask Branch: A class-agnostic MLP mask branch can be fused with either the baseline or the proposed transfer approach.
- (d) Training Strategy: End-to-end training brings improved results, but only when backpropagation from T to wcdet is disabled.
- There are 80 classes in COCO.
- 20, 30, 40, 50 or 60 classes are randomly included in set A (the complement forms set B), and 5 trials are performed for each split size.
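The splitting procedure above can be sketched as follows (class names are illustrative placeholders):

```python
import random

# 80 COCO categories (placeholder names for illustration)
coco_classes = [f"class_{i}" for i in range(80)]

def make_split(n_strong, trial_seed):
    """Randomly pick n_strong classes for set A (boxes + masks);
    the complement B keeps only box annotations."""
    rng = random.Random(trial_seed)
    A = set(rng.sample(coco_classes, n_strong))
    B = set(coco_classes) - A
    return A, B

# 5 random trials for each split size, as in the ablation
splits = {n: [make_split(n, trial) for trial in range(5)]
          for n in (20, 30, 40, 50, 60)}
```

Mask AP on set B is then averaged over the 5 trials for each split size.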
- The proposed method yields a relative increase in mask AP of over 40% in the best case.
2.2. Full Method
- Mask X R-CNN outperforms these approaches by a large margin (over a 20% relative increase in mask AP).
- The above figure shows example mask predictions from the class-agnostic baseline and the proposed approach.
2.3. Visual Genome
- The Visual Genome (VG) dataset is used; the 3000 most frequent categories are selected.
- The green boxes are the 80 classes that overlap with COCO (set A with mask training data) while the red boxes are the remaining 2920 classes not in COCO (set B without mask training data).
- The above results illustrate the exciting potential of systems that can recognize and segment thousands of concepts.