Brief Review — MaskLab: Instance Segmentation by Refining Object Detection with Semantic and Direction Features
MaskLab, Instance Segmentation, With the Assistance of Semantic Segmentation & Direction Prediction
MaskLab: Instance Segmentation by Refining Object Detection with Semantic and Direction Features,
MaskLab, by Google Inc., RWTH Aachen University, and UCLA
2018 CVPR, Over 300 Citations (Sik-Ho Tsang @ Medium)
Instance Segmentation, Semantic Segmentation
- MaskLab is proposed, which produces three outputs: box detection, semantic segmentation, and direction prediction.
- Semantic segmentation assists the model in distinguishing between objects of different semantic classes including background, while the direction prediction, estimating each pixel’s direction towards its corresponding center, allows separating instances of the same semantic class.
1.1. Overall Framework
- MaskLab employs ResNet-101 as the feature extractor and is built on top of the Faster R-CNN framework.
- It consists of three components, with all features shared up to the conv4 (or res4×) block; one extra duplicate conv5 (or res5×) block is used for the box classifier in Faster R-CNN. The three outputs are:
- box prediction (in particular, refined boxes after the box classifier),
- semantic segmentation logits (logits for pixel-wise classification), and
- direction prediction logits (logits for predicting each pixel’s direction towards its corresponding instance center).
- Semantic segmentation logits and direction prediction logits are computed by another 1×1 convolution added after the last feature map in the conv5 block of ResNet-101.
- Given each predicted box (or region of interest), foreground-background segmentation is performed by exploiting these two sets of logits. Specifically, a class-agnostic (i.e., with weights shared across all classes) 1×1 convolution is applied on the concatenation of (1) cropped semantic logits from the semantic channel of the class predicted by Faster R-CNN and (2) cropped direction logits after direction pooling.
As seen, the segmentation logits for ‘person’ clearly separate the person class from background and the tie class, and the direction logits are able to predict the pixel’s direction towards its instance center. After assembling the direction logits, the model is able to further separate the two persons within the specified box region.
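The class-agnostic 1×1 convolution described above can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the function name `coarse_mask_logits`, the crop size of 41, and the number of pooled direction channels (`D = 4` here) are placeholders chosen for the example, and the weights are random rather than learned.

```python
import numpy as np

def coarse_mask_logits(semantic_crop, direction_crop, weights, bias=0.0):
    """Class-agnostic 1x1 convolution over concatenated per-box features.

    semantic_crop:  (H, W, 1) logits cropped from the predicted class's channel
    direction_crop: (H, W, D) cropped direction logits after direction pooling
    weights:        (1 + D,) one weight vector shared across all classes
    """
    features = np.concatenate([semantic_crop, direction_crop], axis=-1)
    # A 1x1 convolution is a per-pixel dot product over the channel axis.
    return features @ weights + bias

# Hypothetical shapes: a 41x41 crop and D = 4 pooled direction channels.
rng = np.random.default_rng(0)
sem = rng.normal(size=(41, 41, 1))
dirs = rng.normal(size=(41, 41, 4))
w = rng.normal(size=(5,))
logits = coarse_mask_logits(sem, dirs, w)  # (41, 41) coarse mask logits
```

Because the same weight vector is used no matter which semantic channel was cropped, the layer does not grow with the number of classes.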
1.2. Mask Refinement
- The generated coarse mask logits (by only exploiting semantic and direction features) are concatenated with features from lower layers of ResNet-101, which are then processed by three extra convolutional layers in order to predict the final mask.
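A rough NumPy sketch of this refinement step follows. The review only says that three extra convolutional layers are used, so the 3×3 kernel size, the ReLU between layers, the channel widths, and the names `conv2d_same` and `refine_mask` are all assumptions made for illustration. The low-level features are also assumed to be already cropped and resized to the same spatial size as the coarse mask.

```python
import numpy as np

def conv2d_same(x, w, b):
    """Zero-padded convolution. x: (H, W, Cin), w: (k, k, Cin, Cout), k odd."""
    k = w.shape[0]
    p = k // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    H, W, _ = x.shape
    out = np.zeros((H, W, w.shape[-1]))
    for dy in range(k):
        for dx in range(k):
            # Each kernel tap contributes a shifted copy of the input.
            out += xp[dy:dy + H, dx:dx + W, :] @ w[dy, dx]
    return out + b

def refine_mask(coarse_logits, low_level, params):
    """Concatenate coarse logits with low-level features, then apply conv layers."""
    x = np.concatenate([coarse_logits[..., None], low_level], axis=-1)
    for i, (w, b) in enumerate(params):
        x = conv2d_same(x, w, b)
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)  # ReLU between layers (assumed)
    return x[..., 0]  # refined mask logits

rng = np.random.default_rng(0)
coarse = rng.normal(size=(16, 16))
low = rng.normal(size=(16, 16, 8))
params = [(rng.normal(size=(3, 3, 9, 4)), 0.0),
          (rng.normal(size=(3, 3, 4, 4)), 0.0),
          (rng.normal(size=(3, 3, 4, 1)), 0.0)]
refined = refine_mask(coarse, low, params)
```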
1.3. Deformable Crop and Resize
- Inspired by DCN, deformable crop and resize extends the usual “crop and resize” operation, which first crops a specified bounding box region from the feature maps and then bilinearly resizes it to a specified size (e.g., 4×4). The region is then further divided into several sub-boxes (e.g., 4 sub-boxes, each of size 2×2), and another small network is employed to learn the offsets for each sub-box.
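To make the two stages concrete, here is a minimal NumPy sketch, assuming a single-channel feature map and a box given as `(y0, x0, y1, x1)` in pixel coordinates. In MaskLab the offsets come from a small learned network; here they are simply passed in as numbers, and the function names and sampling convention are assumptions for illustration.

```python
import numpy as np

def crop_and_resize(feat, box, out_size):
    """Bilinearly sample an out_size x out_size grid spanning box."""
    y0, x0, y1, x1 = box
    h, w = feat.shape
    ys = np.clip(np.linspace(y0, y1, out_size), 0, h - 1)
    xs = np.clip(np.linspace(x0, x1, out_size), 0, w - 1)
    yf = np.floor(ys).astype(int)
    xf = np.floor(xs).astype(int)
    yc = np.minimum(yf + 1, h - 1)
    xc = np.minimum(xf + 1, w - 1)
    wy = (ys - yf)[:, None]
    wx = (xs - xf)[None, :]
    top = feat[yf][:, xf] * (1 - wx) + feat[yf][:, xc] * wx
    bot = feat[yc][:, xf] * (1 - wx) + feat[yc][:, xc] * wx
    return top * (1 - wy) + bot * wy

def deformable_crop_and_resize(feat, box, offsets, grid=2, sub_size=2):
    """Split box into grid x grid sub-boxes, shift each by (dy, dx), resample."""
    y0, x0, y1, x1 = box
    sh, sw = (y1 - y0) / grid, (x1 - x0) / grid
    rows = []
    for i in range(grid):
        row = []
        for j in range(grid):
            dy, dx = offsets[i][j]  # would be predicted by a small network
            sub = (y0 + i * sh + dy, x0 + j * sw + dx,
                   y0 + (i + 1) * sh + dy, x0 + (j + 1) * sw + dx)
            row.append(crop_and_resize(feat, sub, sub_size))
        rows.append(np.hstack(row))
    return np.vstack(rows)  # (grid * sub_size, grid * sub_size)

feat = np.arange(64, dtype=float).reshape(8, 8)
zero = [[(0, 0), (0, 0)], [(0, 0), (0, 0)]]
out = deformable_crop_and_resize(feat, (0, 0, 4, 4), zero)
```

With all offsets at zero the sub-boxes tile the original box; non-zero offsets let each sub-box look at a different part of the feature map.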
1.4. Atrous Convolution
- Atrous convolution, as used in DeepLab, is applied to extract features with output stride = 8.
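For readers unfamiliar with the operation, a one-dimensional toy version shows the idea: the kernel taps are spaced `rate` samples apart, which enlarges the receptive field without adding weights or downsampling. This is a generic illustration, not MaskLab's implementation, and the helper name `atrous_conv1d` is made up for the example.

```python
import numpy as np

def atrous_conv1d(signal, kernel, rate):
    """'Valid' 1-D cross-correlation whose taps are `rate` samples apart."""
    span = (len(kernel) - 1) * rate + 1  # effective receptive field
    out_len = len(signal) - span + 1
    return np.array([
        sum(kernel[k] * signal[i + k * rate] for k in range(len(kernel)))
        for i in range(out_len)
    ])

signal = np.arange(10, dtype=float)
kernel = [1.0, 0.0, -1.0]
dense = atrous_conv1d(signal, kernel, rate=1)   # taps at i, i+1, i+2
atrous = atrous_conv1d(signal, kernel, rate=2)  # taps at i, i+2, i+4
```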
1.5. Loss Function
- Only ground-truth boxes are used to train the branches that predict semantic segmentation logits and direction logits.
- The sigmoid function is applied to estimate both the coarse and refined mask results.
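The review does not spell out the mask loss itself. A common pairing with a sigmoid output is per-pixel binary cross-entropy, so the sketch below assumes that choice; the function names are illustrative, and the loss is written in a numerically stable form that works directly on logits.

```python
import numpy as np

def sigmoid(x):
    """Map logits to foreground probabilities."""
    return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=float)))

def sigmoid_bce(logits, targets):
    """Mean per-pixel binary cross-entropy computed from logits.

    Uses max(x, 0) - x*z + log(1 + exp(-|x|)), which avoids overflow
    for large-magnitude logits.
    """
    x = np.asarray(logits, dtype=float)
    z = np.asarray(targets, dtype=float)
    return np.mean(np.maximum(x, 0) - x * z + np.log1p(np.exp(-np.abs(x))))

# Assumed usage: the same loss applied to coarse and refined mask logits.
mask_logits = np.array([[2.0, -1.0], [0.5, -3.0]])
mask_target = np.array([[1.0, 0.0], [1.0, 0.0]])
loss = sigmoid_bce(mask_logits, mask_target)
```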
2.1. Ablation Studies
- Table 1: Using a crop size larger than 41 does not change the performance significantly.
- Table 2: Employing both semantic and direction features improves performance. Using 4 distance quantization bins (Figure 6) further improves performance to 30.57%.
- Table 3: Using 8 directions is sufficient to deliver good performance, when adopting 4 bins for distance quantization. The proposed model thus uses 32 = 8×4 channels for direction pooling.
- Table 4: Using both conv1 and conv2 (i.e., the last feature map in the res2× block) obtains the best performance of 33.89%. No further improvement is observed when adding more lower-level features.
- Table 5: A hierarchy of different atrous rates (4, 8, 4) gives improved performance.
- Table 6: Pretraining on COCO, and further pretraining on JFT, improves the mAP.
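The 32 = 8×4 channel count from Table 3 can be illustrated by mapping each pixel to a (direction, distance) channel relative to its instance center. The quantization below is an assumption made for illustration: uniform angular sectors, uniform distance bins up to a hypothetical `max_dist`, and an arbitrary channel ordering. The review does not give the paper's exact scheme.

```python
import math

def direction_channel(px, py, cx, cy, max_dist, num_dirs=8, num_bins=4):
    """Map a pixel to one of num_dirs * num_bins (direction, distance) channels."""
    dx, dy = cx - px, cy - py  # vector from the pixel towards the center
    angle = math.atan2(dy, dx) % (2 * math.pi)
    # Uniform angular sectors; the final modulo guards against rounding to 2*pi.
    direction = int(angle / (2 * math.pi / num_dirs)) % num_dirs
    dist = math.hypot(dx, dy)
    # Uniform distance bins, clamped so far-away pixels fall in the last bin.
    dist_bin = min(int(dist / max_dist * num_bins), num_bins - 1)
    return dist_bin * num_dirs + direction

num_channels = 8 * 4
near = direction_channel(0, 0, 2, 1, max_dist=10)   # close to the center
far = direction_channel(0, 0, -1, 6, max_dist=10)   # farther away, other sector
```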
2.2. SOTA Comparisons
- Pretraining MaskLab+ on the JFT dataset further improves performance to 38.1% mAP.
- In the ‘Person’ channel, there can be some high activations in non-person regions (e.g., regions near the elephant’s legs and the kite).
However, this is handled by the box detection branch, which filters out wrong box predictions.
Sub-boxes are deformed in a circle-shaped arrangement, attempting to capture longer context for box classification.
Failure modes are shown in the last row, mainly resulting from detection failures (e.g., missed detections and wrong class predictions) and segmentation failures (e.g., coarse boundary results).