Brief Review — MaskLab: Instance Segmentation by Refining Object Detection with Semantic and Direction Features

MaskLab, Instance Segmentation, With the Assistance of Semantic Segmentation & Direction Prediction

Sik-Ho Tsang
Jan 13, 2023
Instance segmentation aims to solve detection and segmentation jointly.

MaskLab: Instance Segmentation by Refining Object Detection with Semantic and Direction Features,
MaskLab, by Google Inc., RWTH Aachen University, and UCLA
2018 CVPR, Over 300 Citations (Sik-Ho Tsang @ Medium)
Instance Segmentation, Semantic Segmentation

  • MaskLab is proposed, which produces three outputs: box detection, semantic segmentation, and direction prediction.
  • Semantic segmentation assists the model in distinguishing between objects of different semantic classes including background, while the direction prediction, estimating each pixel’s direction towards its corresponding center, allows separating instances of the same semantic class.


  1. MaskLab
  2. Results

1. MaskLab

1.1. Overall Framework

MaskLab generates three outputs, including refined box predictions (from Faster R-CNN), semantic segmentation logits (logits for pixel-wise classification), and direction prediction logits (logits for predicting each pixel’s direction toward its corresponding instance center).
  • MaskLab employs ResNet-101 as the feature extractor and is built on top of the Faster R-CNN framework.
  • It consists of three components that share all features up to the conv4 (or res4×) block; one extra duplicate conv5 (or res5×) block is used for the box classifier in Faster R-CNN. Together they produce:
  1. box prediction (in particular, refined boxes after the box classifier),
  2. semantic segmentation logits (logits for pixel-wise classification), and
  3. direction prediction logits (logits for predicting each pixel’s direction towards its corresponding instance center).
Semantic segmentation logits and direction prediction logits are used to perform foreground/background segmentation within each predicted box.
  • Semantic segmentation logits and direction prediction logits are computed by another 1×1 convolution added after the last feature map in the conv5 block of ResNet-101.
  • Given each predicted box (or region of interest), foreground/background segmentation is performed by exploiting those two logits. Specifically, a class-agnostic (i.e., with weights shared across all classes) 1×1 convolution is applied on the concatenation of (1) cropped semantic logits from the semantic channel that corresponds to the class predicted by Faster R-CNN and (2) cropped direction logits after direction pooling.
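The last step above can be sketched in a few lines of plain Python. This is only an illustration, not the authors' implementation: the function name `coarse_mask`, the tiny 2×2 crops, and the weight values are all hypothetical, and in a real model the 1×1 convolution weights would be learned.

```python
import math

def coarse_mask(semantic_crop, direction_crop, weights, bias=0.0):
    """Class-agnostic 1x1 convolution over concatenated per-pixel features.

    semantic_crop:  H x W nested list (cropped semantic logit of the predicted class)
    direction_crop: H x W x C nested list (cropped, pooled direction logits)
    weights:        1 + C floats shared across all classes (hypothetical values)
    Returns an H x W nested list of foreground probabilities.
    """
    probs = []
    for sem_row, dir_row in zip(semantic_crop, direction_crop):
        out_row = []
        for sem, dirs in zip(sem_row, dir_row):
            features = [sem] + list(dirs)  # channel-wise concatenation
            logit = bias + sum(w * f for w, f in zip(weights, features))
            out_row.append(1.0 / (1.0 + math.exp(-logit)))  # sigmoid
        probs.append(out_row)
    return probs

# Toy 2x2 crop with one semantic channel and two direction channels.
semantic = [[2.0, -1.0], [0.0, 0.0]]
direction = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, -3.0]]]
mask = coarse_mask(semantic, direction, weights=[1.0, 0.5, 0.5])
```

A 1×1 convolution is just the same weighted sum applied independently at every pixel, which is why the sketch needs no spatial neighbourhood.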

As seen, the segmentation logits for ‘person’ clearly separate the person class from background and the tie class, and the direction logits are able to predict the pixel’s direction towards its instance center. After assembling the direction logits, the model is able to further separate the two persons within the specified box region.

1.2. Mask Refinement

  • The generated coarse mask logits (by only exploiting semantic and direction features) are concatenated with features from lower layers of ResNet-101, which are then processed by three extra convolutional layers in order to predict the final mask.

1.3. Deformable Crop and Resize

Deformable crop and resize.
  • “Crop and resize” first crops a specified bounding box region from the feature maps and then bilinearly resizes it to a specified size (e.g., 4×4). In the deformable variant, inspired by DCN, the region is further divided into several sub-boxes (e.g., 4 sub-boxes, each of size 2×2), and another small network is employed to learn the offsets for each sub-box.
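A rough sketch of the idea on a single-channel feature map is given below. It rests on several assumptions: the function names are hypothetical, the offsets are passed in as fixed numbers (in MaskLab they are predicted by a small network), and each shifted sub-box is simply cropped and tiled back into the output, which may differ from how the paper assembles the deformed regions.

```python
def bilinear(feature, y, x):
    """Bilinearly sample a 2-D feature map (nested list) at fractional (y, x)."""
    h, w = len(feature), len(feature[0])
    y = min(max(y, 0.0), h - 1.0)  # clamp to the feature map
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

def sample_coords(lo, hi, n):
    """Return n evenly spaced coordinates from lo to hi (inclusive)."""
    if n == 1:
        return [(lo + hi) / 2]
    return [lo + (hi - lo) * i / (n - 1) for i in range(n)]

def crop_and_resize(feature, box, size):
    """Crop box = (y_min, x_min, y_max, x_max) and resize it to size x size."""
    y_min, x_min, y_max, x_max = box
    ys = sample_coords(y_min, y_max, size)
    xs = sample_coords(x_min, x_max, size)
    return [[bilinear(feature, y, x) for x in xs] for y in ys]

def deformable_crop_and_resize(feature, box, offsets, grid=2, sub_size=2):
    """Split the box into grid x grid sub-boxes, shift each by its (dy, dx)
    offset, crop each to sub_size x sub_size, and tile the patches."""
    y_min, x_min, y_max, x_max = box
    sub_h = (y_max - y_min) / grid
    sub_w = (x_max - x_min) / grid
    out = [[0.0] * (grid * sub_size) for _ in range(grid * sub_size)]
    for gy in range(grid):
        for gx in range(grid):
            dy, dx = offsets[gy][gx]
            sub_box = (y_min + gy * sub_h + dy, x_min + gx * sub_w + dx,
                       y_min + (gy + 1) * sub_h + dy, x_min + (gx + 1) * sub_w + dx)
            patch = crop_and_resize(feature, sub_box, sub_size)
            for i in range(sub_size):
                for j in range(sub_size):
                    out[gy * sub_size + i][gx * sub_size + j] = patch[i][j]
    return out

# Toy 8x8 feature map whose value encodes its position: f(y, x) = 10*y + x.
feature = [[10.0 * r + c for c in range(8)] for r in range(8)]
plain = crop_and_resize(feature, (0, 0, 6, 6), 4)
no_shift = [[(0, 0), (0, 0)], [(0, 0), (0, 0)]]
shifted = [[(1, 1), (0, 0)], [(0, 0), (0, 0)]]
deformed_a = deformable_crop_and_resize(feature, (0, 0, 6, 6), no_shift)
deformed_b = deformable_crop_and_resize(feature, (0, 0, 6, 6), shifted)
```

With all offsets at zero, each sub-box is sampled where it already sits; non-zero offsets let a sub-box look at features outside its original position.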

1.4. Atrous Convolution

  • Atrous convolution, as used in DeepLab, is applied to extract features with output stride = 8.
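For readers unfamiliar with the operation, a 1-D toy version shows what the atrous rate does. This is a generic sketch (the function name `atrous_conv1d` is made up for this post) and not MaskLab's 2-D implementation: the kernel taps are spaced `rate` samples apart, so the receptive field grows without adding weights or reducing resolution.

```python
def atrous_conv1d(signal, kernel, rate):
    """1-D atrous (dilated) convolution with zero padding and same-length output.

    Implemented as cross-correlation, following the deep learning convention.
    """
    center = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        total = 0.0
        for j, w in enumerate(kernel):
            idx = i + (j - center) * rate  # taps are `rate` samples apart
            if 0 <= idx < len(signal):     # positions outside the signal count as zero
                total += w * signal[idx]
        out.append(total)
    return out

signal = [1, 2, 3, 4, 5]
dense = atrous_conv1d(signal, [1, 1, 1], rate=1)   # → [3.0, 6.0, 9.0, 12.0, 9.0]
atrous = atrous_conv1d(signal, [1, 1, 1], rate=2)  # → [4.0, 6.0, 9.0, 6.0, 8.0]
```

A kernel of size k with rate r covers k + (k − 1)(r − 1) input samples, so the 3-tap kernel above spans 5 samples at rate 2.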

1.5. Loss Function

  • Only ground-truth boxes are used to train the branches that predict semantic segmentation logits and direction logits.
  • A sigmoid function is applied to estimate both the coarse and refined mask results.
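The review above does not spell out the exact mask loss. A common pairing with a sigmoid output is per-pixel binary cross-entropy, which is assumed here purely for illustration; the function names are hypothetical.

```python
import math

def sigmoid(x):
    """Numerically stable sigmoid."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def mask_loss(logits, targets):
    """Mean per-pixel binary cross-entropy on mask logits (an assumed loss).

    logits:  H x W nested list of raw mask logits
    targets: H x W nested list of 0/1 ground-truth labels
    Uses max(x, 0) - x*t + log(1 + exp(-|x|)) to avoid overflow.
    """
    total, count = 0.0, 0
    for logit_row, target_row in zip(logits, targets):
        for x, t in zip(logit_row, target_row):
            total += max(x, 0.0) - x * t + math.log1p(math.exp(-abs(x)))
            count += 1
    return total / count

logits = [[3.0, -2.0], [0.0, 1.5]]
targets = [[1, 0], [1, 0]]
probabilities = [[sigmoid(x) for x in row] for row in logits]
loss = mask_loss(logits, targets)
```

The same function could be applied to both the coarse and the refined mask logits, since both are passed through a sigmoid.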

2. Results

2.1. Ablation Studies

  • Table 1: Using a crop size larger than 41 does not change the performance significantly.
  • Table 2: Employing both semantic and direction features improves performance. Using 4 bins (Figure 6) further improves performance to 30.57%.
  • Table 3: Using 8 directions is sufficient to deliver good performance, when adopting 4 bins for distance quantization. The proposed model thus uses 32 = 8×4 channels for direction pooling.
  • Table 4: Using both conv1 and conv2 (i.e., the last feature map in the res2× block) obtains the best performance of 33.89%. No further improvement is observed when adding more lower-level features.
  • Table 5: A hierarchy of different atrous rates (4, 8, 4) gives improved performance.
  • Table 6: Pretraining on COCO, and further pretraining on JFT, improves the mAP.
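The 8 directions × 4 distance bins from Table 3 can be made concrete with a small label-encoding sketch. The details are assumptions rather than the paper's exact scheme: the function name, where the angular sector boundaries fall, the normalization by a `max_distance` argument, and the channel ordering are all illustrative.

```python
import math

def direction_distance_label(y, x, center_y, center_x, max_distance,
                             num_directions=8, num_bins=4):
    """Map a pixel to one of num_directions * num_bins channels (32 by default).

    The direction is the quantized angle of the vector pointing from the pixel
    towards its instance center; the distance bin is the quantized length of
    that vector, normalized by max_distance (an assumed normalization).
    """
    dy, dx = center_y - y, center_x - x
    angle = math.atan2(dy, dx) % (2 * math.pi)
    sector = 2 * math.pi / num_directions
    direction = int(angle / sector) % num_directions
    distance = math.hypot(dy, dx)
    bin_index = min(int(distance / max_distance * num_bins), num_bins - 1)
    return bin_index * num_directions + direction

# Two pixels on opposite sides of an instance center at (5, 5).
left_pixel = direction_distance_label(5, 1, 5, 5, max_distance=8)   # center lies at +x
right_pixel = direction_distance_label(5, 9, 5, 5, max_distance=8)  # center lies at -x
```

Pixels on opposite sides of a center receive different direction labels, which is the property that lets the model separate touching instances of the same class.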

2.2. SOTA Comparisons

SOTA comparisons on COCO test-dev

MaskLab model outperforms FCIS+++, although FCIS+++ employs scale augmentation and on-line hard example mining.

  • Furthermore, pretraining MaskLab+ on the JFT dataset achieves performance of 38.1% mAP.

2.3. Visualizations

‘Person’ channel in the predicted semantic segmentation logits.
  • In the ‘Person’ channel, there can be some high activations in non-person regions (e.g., regions near the elephant’s legs and the kite).

However, this is handled by the box detection branch, which filters out wrong box predictions.

Visualization of learned deformed sub-boxes.

Sub-boxes are deformed in a circle-shaped arrangement, attempting to capture longer context for box classification.

Visualization results on the minival set.

Failure modes are shown in the last row, mainly resulting from detection failures (e.g., missed detections and wrong class predictions) and segmentation failures (e.g., coarse boundaries).


