Brief Review — Mask Scoring R-CNN

Mask Scoring R-CNN (MS R-CNN), Rescore the Mask of Mask R-CNN

Sik-Ho Tsang
5 min read · Jan 15


The left four images show good detection results with high classification scores but low mask quality, which is the mismatch Mask Scoring R-CNN is proposed to address. The rightmost image shows the case of a good mask with a high classification score, where the high score is retained.

Mask Scoring R-CNN,
MS R-CNN, by Huazhong University of Science and Technology, and Horizon Robotics Inc.
2019 CVPR, Over 700 Citations (Sik-Ho Tsang @ Medium)
Instance Segmentation, Mask R-CNN

  • Mask Scoring R-CNN (MS R-CNN) is proposed, which contains a network block to learn the quality of the predicted instance masks.
  • The proposed network block takes the instance feature and the corresponding predicted mask together to regress the mask IoU. This calibrates the misalignment between mask quality and mask score, and improves instance segmentation performance.


  1. Brief Description of Mask R-CNN
  2. Mask Scoring R-CNN (MS R-CNN)
  3. Results

1. Brief Description of Mask R-CNN

1.1. Framework

  • The first stage is the Region Proposal Network (RPN). It proposes candidate object bounding boxes regardless of object categories.
  • The second stage is termed as the R-CNN stage, which extracts features using RoIAlign for each proposal and performs proposal classification, bounding box regression and mask prediction.

1.2. Mask Scoring

  • smask is defined as the score of the predicted mask.

The ideal smask is equal to the pixel-level IoU between the predicted mask and its matched ground truth mask, which is termed as MaskIoU.
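As a concrete illustration of the MaskIoU definition above, here is a minimal NumPy sketch of a pixel-level IoU between two binary masks. The function name `mask_iou` and the convention of returning 0 when both masks are empty are assumptions made for this example, not something taken from the paper.

```python
import numpy as np

def mask_iou(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """Pixel-level IoU between two binary masks of the same shape."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 0.0  # assumption: two empty masks are given an IoU of 0
    inter = np.logical_and(pred, gt).sum()
    return float(inter) / float(union)

# Toy 4x4 example: the predicted mask covers the top two rows,
# the ground truth covers the two middle rows.
pred = np.zeros((4, 4), dtype=np.uint8)
gt = np.zeros((4, 4), dtype=np.uint8)
pred[0:2, :] = 1
gt[1:3, :] = 1
iou = mask_iou(pred, gt)  # intersection = 4 pixels, union = 12 pixels
```

In this toy case the two masks share one row of four pixels out of twelve covered pixels, so the MaskIoU is one third.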

  • The ideal smask should also be positive only for the ground truth category, and zero for all other classes, since a mask belongs to only one class.

This requires the mask score to work well on two tasks: classifying the mask into the right category, and regressing the proposal’s MaskIoU for the foreground object category.

  • It is hard to train the two tasks only using a single objective function.


2. Mask Scoring R-CNN (MS R-CNN)

2.1. Motivations

For simplicity, the mask score learning task is decomposed into mask classification and IoU regression, denoted as smask = scls · siou for all object categories.

  • scls focuses on classifying which class the proposal belongs to.
  • siou focuses on regressing the MaskIoU.
  • The goal of scls is already covered by the classification task in the R-CNN stage, so the corresponding classification score can be taken directly.

Regressing siou is the target of this paper.

2.2. MaskIoU Head

Mask Scoring R-CNN (MS R-CNN)
  • The MaskIoU head aims to regress the IoU between the predicted mask and its ground truth mask.
  • The feature from the RoIAlign layer is concatenated with the predicted mask to form the input of the MaskIoU head.
  • Before concatenation, a max pooling layer with a kernel size of 2 and a stride of 2 is applied to the predicted mask so that it has the same spatial size as the RoI feature.
  • It only regresses the MaskIoU for the ground truth class.
  • The MaskIoU head consists of 4 convolution layers, with the kernel size and filter number set to 3 and 256 respectively, and 3 fully connected (FC) layers, with the first 2 FC layers of size 1024.
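The input construction described above can be sketched in NumPy as follows. The concrete sizes (a 256-channel 14×14 RoI feature and a 28×28 predicted mask) are assumptions chosen for illustration since this review does not state them, and `max_pool_2x2` is a hypothetical helper, not code from the paper.

```python
import numpy as np

def max_pool_2x2(mask: np.ndarray) -> np.ndarray:
    """Max pooling with kernel size 2 and stride 2 on an (H, W) map (H, W even)."""
    h, w = mask.shape
    # Split each axis into (blocks, 2) and take the max inside every 2x2 block.
    return mask.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

rng = np.random.default_rng(0)
roi_feature = rng.random((256, 14, 14), dtype=np.float32)  # assumed RoI feature size
pred_mask = rng.random((28, 28), dtype=np.float32)         # assumed mask size

pooled_mask = max_pool_2x2(pred_mask)                      # shape (14, 14)
# Stack the pooled mask as one extra channel on top of the RoI feature.
head_input = np.concatenate([roi_feature, pooled_mask[None]], axis=0)
print(head_input.shape)  # (257, 14, 14)
```

Under these assumed sizes, the pooled mask matches the RoI feature spatially, so the MaskIoU head would see a single tensor with one more channel than the RoI feature.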

2.3. Training

  • The predicted mask of the target class is binarized using a threshold of 0.5.
  • Then, the MaskIoU between the binary mask and its matched ground truth is used as the MaskIoU target.
  • ℓ2 loss is used for regressing MaskIoU, and the loss weight is set to 1.
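The training target and loss above can be sketched for a single RoI as below. This is a minimal illustration under stated assumptions: the strict `>` comparison at the threshold and the plain squared error without a 1/2 factor are choices made here, and `maskiou_target` and `l2_loss` are hypothetical names.

```python
import numpy as np

def maskiou_target(mask_prob: np.ndarray, gt_mask: np.ndarray,
                   threshold: float = 0.5) -> float:
    """Binarize the predicted mask, then compute its IoU with the ground truth."""
    binary = mask_prob > threshold          # assumption: strict comparison
    gt = gt_mask.astype(bool)
    union = np.logical_or(binary, gt).sum()
    if union == 0:
        return 0.0                          # assumption: empty union gives target 0
    return float(np.logical_and(binary, gt).sum()) / float(union)

def l2_loss(predicted_iou: float, target_iou: float, weight: float = 1.0) -> float:
    """Squared error between predicted and target MaskIoU, scaled by the loss weight."""
    return weight * (predicted_iou - target_iou) ** 2

mask_prob = np.array([[0.9, 0.8],
                      [0.4, 0.1]])          # predicted mask for the target class
gt_mask = np.array([[1, 0],
                    [1, 0]])
target = maskiou_target(mask_prob, gt_mask) # binary mask is the top row: IoU = 1/3
loss = l2_loss(0.5, target)                 # (0.5 - 1/3)^2 = 1/36
```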

2.4. Inference

  • The R-CNN stage of Mask R-CNN outputs N bounding boxes, and among them, the top-k (i.e. k = 100) scoring boxes after SoftNMS are kept.
  • These top-k boxes are fed into the Mask head to generate multi-class masks.
  • The top-k target masks are fed into the MaskIoU head to predict the MaskIoU. The predicted MaskIoU is multiplied with the classification score to get the new calibrated mask score, which is used as the final mask confidence.
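The inference steps above amount to keeping the top-k boxes and multiplying two scores, smask = scls · siou. The sketch below assumes SoftNMS has already been applied, represents each detection as a plain dict, and takes the MaskIoU predictor as a callable; the names `rescore`, `cls_score`, `pred_iou` and `mask_score` are hypothetical and only serve this example.

```python
def rescore(detections, predict_maskiou, k=100):
    """Keep the k highest-scoring detections and attach a calibrated mask score.

    detections: list of dicts, each with a 'cls_score' entry (after SoftNMS).
    predict_maskiou: callable mapping a detection to its predicted MaskIoU.
    """
    top_k = sorted(detections, key=lambda d: d["cls_score"], reverse=True)[:k]
    rescored = []
    for det in top_k:
        iou = predict_maskiou(det)
        # Calibrated mask score: classification score times predicted MaskIoU.
        rescored.append({**det, "mask_score": det["cls_score"] * iou})
    return rescored

# Toy detections; 'pred_iou' stands in for the MaskIoU head output.
dets = [
    {"id": "a", "cls_score": 0.9, "pred_iou": 0.5},
    {"id": "b", "cls_score": 0.8, "pred_iou": 0.9},
    {"id": "c", "cls_score": 0.3, "pred_iou": 0.9},
]
out = rescore(dets, predict_maskiou=lambda d: d["pred_iou"], k=2)
```

With these toy numbers, detection "a" has the higher classification score but ends up with a lower mask score (about 0.45) than detection "b" (about 0.72), which is the kind of reordering the calibrated score is meant to produce.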

3. Results

3.1. Effectiveness of MaskIoU Head

COCO 2017 validation results.

MS R-CNN is not sensitive to the backbone network and achieves a stable improvement of about 1.5 AP on all backbone networks. The gain is largest for AP@0.75, where MS R-CNN improves the baseline by about 2 points.

COCO 2017 validation results.

MS R-CNN is robust to different frameworks, including Faster R-CNN / FPN / DCN+FPN.

  • Besides, MS R-CNN does not harm bounding box detection performance; in fact, it improves bounding box detection performance slightly.
COCO 2017 test-dev.

Similar trend on COCO 2017 test-dev.

3.2. Ablation Studies

Different design choices of the input of MaskIoU head.
Results of the different design choices for the input of MaskIoU head.
  • MaskIoU head is robust to different ways of fusing mask prediction and RoI feature.

Concatenating the target score map and RoI feature obtains the best results.

3.3. Correlation Between MaskIoU and Score

Visualizations of MaskIoU predictions and their ground truth. (a) Results with ResNet-18 FPN backbone and (b) results with ResNet-101 DCN+FPN backbone
Comparisons of Mask R-CNN and our proposed MS R-CNN
  • (a) shows the results of Mask R-CNN, where the mask score is only weakly related to MaskIoU.
  • (b) shows the results of MS R-CNN: every detection with a high score and a low MaskIoU is penalized, so the mask score correlates better with MaskIoU.
  • (c) shows the quantitative results, where the scores within each MaskIoU interval are averaged.

MS R-CNN yields a better correspondence between score and MaskIoU.
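The per-interval averaging described for panel (c) can be sketched as below. The number of intervals (10 equal-width bins over [0, 1]) and the function name `mean_score_per_iou_bin` are assumptions for illustration, since the interval width is not given in this review.

```python
import numpy as np

def mean_score_per_iou_bin(scores, ious, num_bins=10):
    """Average the scores of detections falling into each MaskIoU interval.

    Returns the bin edges and an array of per-bin means (NaN for empty bins).
    """
    scores = np.asarray(scores, dtype=float)
    ious = np.asarray(ious, dtype=float)
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    # Bin index per detection; an IoU of exactly 1.0 is placed in the last bin.
    idx = np.clip(np.digitize(ious, edges) - 1, 0, num_bins - 1)
    means = np.full(num_bins, np.nan)
    for b in range(num_bins):
        in_bin = idx == b
        if in_bin.any():
            means[b] = scores[in_bin].mean()
    return edges, means

# Toy data: two low-IoU detections and one high-IoU detection.
edges, means = mean_score_per_iou_bin(
    scores=[0.2, 0.4, 0.9], ious=[0.05, 0.07, 0.95]
)
```

A well-calibrated score would give per-bin means that increase with the MaskIoU interval, which is the trend the comparison in (c) examines.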


[2019 CVPR] [MS R-CNN]
Mask Scoring R-CNN

