Brief Review — Mask Scoring R-CNN
- Mask Scoring R-CNN (MS R-CNN) is proposed, which contains a network block to learn the quality of the predicted instance masks.
- The proposed network block takes the instance feature and the corresponding predicted mask together to regress the mask IoU. This corrects the misalignment between mask quality and mask score, and improves instance segmentation performance.
1. Brief Description of Mask R-CNN
- The first stage is the Region Proposal Network (RPN). It proposes candidate object bounding boxes regardless of object categories.
- The second stage is termed the R-CNN stage. It extracts features using RoIAlign for each proposal and performs proposal classification, bounding box regression and mask prediction.
1.2. Mask Scoring
- smask is defined as the score of the predicted mask.
The ideal smask is equal to the pixel-level IoU between the predicted mask and its matched ground truth mask, which is termed MaskIoU.
- The ideal smask should also be positive only for the ground truth category and zero for all other classes, since a mask belongs to only one class.
This requires the mask score to work well on two tasks: classifying the mask into the right category and regressing the proposal’s MaskIoU for the foreground object category.
- It is hard to train both tasks with a single objective function.
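As a concrete reading of the MaskIoU definition above, here is a minimal sketch in plain Python. The function name `mask_iou`, the nested-list mask representation and the toy masks are assumptions made for illustration; they are not taken from the paper's code.

```python
def mask_iou(pred_mask, gt_mask):
    """Pixel-level IoU between two binary masks given as nested lists of 0/1."""
    intersection = 0
    union = 0
    for pred_row, gt_row in zip(pred_mask, gt_mask):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                intersection += 1
            if p or g:
                union += 1
    # Assumption: two empty masks get IoU 0.0 to avoid dividing by zero.
    return intersection / union if union else 0.0


# Hypothetical 3x3 masks: 2 pixels overlap, 6 pixels are covered in total.
pred = [[1, 1, 0],
        [1, 1, 0],
        [0, 0, 0]]
gt = [[0, 1, 1],
      [0, 1, 1],
      [0, 0, 0]]

iou = mask_iou(pred, gt)  # 2 / 6
```

An ideal smask for this detection would equal this value for the ground truth category and zero for every other class.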
2. MS R-CNN
For simplicity, the mask score learning task is decomposed into mask classification and IoU regression, denoted as smask = scls · siou for all object categories.
- scls focuses on classifying which class the proposal belongs to, and
- siou focuses on regressing the MaskIoU.
- The goal of scls, classifying which class the proposal belongs to, is already achieved by the classification task in the R-CNN stage, so the corresponding classification score can be taken directly.
Regressing siou is the target of this paper.
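The decomposition smask = scls · siou can be illustrated with a small sketch. The function name and the numeric scores below are hypothetical values chosen only to show the effect of the product.

```python
def mask_score(cls_score, predicted_iou):
    """Combine the classification score (scls) and the regressed MaskIoU (siou)."""
    return cls_score * predicted_iou


# Two hypothetical detections with the same classification confidence
# but different predicted mask quality.
well_segmented = mask_score(0.95, 0.90)
poorly_segmented = mask_score(0.95, 0.40)

# The classification score alone would rank both detections equally;
# the product ranks the better mask higher.
ranked_higher = well_segmented > poorly_segmented
```

Under this reading, a confident classification no longer guarantees a high mask score when the regressed MaskIoU is low.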
2.2. MaskIoU Head
- The MaskIoU head aims to regress the IoU between the predicted mask and its ground truth mask.
- The concatenation of the feature from the RoIAlign layer and the predicted mask is used as the input of the MaskIoU head.
- When concatenating, a max pooling layer with a kernel size of 2 and stride of 2 is used so that the predicted mask has the same spatial size as the RoI feature.
- It only regresses the MaskIoU for the ground truth class.
- The MaskIoU head consists of 4 convolution layers, each with kernel size 3 and 256 filters, and 3 fully connected (FC) layers, the first 2 of which have size 1024.
- The predicted mask of the target class is binarized using a threshold of 0.5.
- Then, the MaskIoU between the binary mask and its matched ground truth is used as the MaskIoU target.
- ℓ2 loss is used for regressing MaskIoU, and the loss weight is set to 1.
- At inference, the R-CNN stage of Mask R-CNN outputs N bounding boxes, and among them the top-k (i.e. k = 100) scoring boxes after SoftNMS are kept.
- These top-k boxes are fed into the Mask head to generate multi-class masks.
- The top-k target masks are fed into the MaskIoU head to predict the MaskIoU. The predicted MaskIoU is multiplied by the classification score to obtain the new calibrated mask score, which is used as the final mask confidence.
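The input preparation and regression target described in the bullets above (2×2 max pooling with stride 2, binarization at 0.5, MaskIoU target, ℓ2 loss with weight 1) can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names, the toy 4×4 mask, the strict `>` comparison at the threshold, and the plain squared-error form of the ℓ2 loss (no 1/2 factor) are choices made here, not details confirmed by the paper.

```python
def max_pool_2x2(mask):
    """Max pooling, kernel size 2 and stride 2, over a 2D list (even height/width assumed)."""
    return [
        [
            max(mask[r][c], mask[r][c + 1], mask[r + 1][c], mask[r + 1][c + 1])
            for c in range(0, len(mask[0]), 2)
        ]
        for r in range(0, len(mask), 2)
    ]


def binarize(prob_mask, threshold=0.5):
    """Turn a mask of probabilities into a 0/1 mask (assumption: strictly above threshold)."""
    return [[1 if p > threshold else 0 for p in row] for row in prob_mask]


def mask_iou(a, b):
    """Pixel-level IoU between two binary masks."""
    pairs = [(p, g) for row_a, row_b in zip(a, b) for p, g in zip(row_a, row_b)]
    intersection = sum(1 for p, g in pairs if p and g)
    union = sum(1 for p, g in pairs if p or g)
    return intersection / union if union else 0.0


def l2_loss(predicted_iou, target_iou, weight=1.0):
    """Squared error between the regressed and the target MaskIoU."""
    return weight * (predicted_iou - target_iou) ** 2


# Hypothetical predicted mask (probabilities) for the target class and its ground truth.
prob_mask = [[0.9, 0.8, 0.2, 0.1],
             [0.7, 0.6, 0.3, 0.1],
             [0.2, 0.4, 0.1, 0.0],
             [0.1, 0.1, 0.0, 0.0]]
gt_mask = [[1, 1, 1, 0],
           [1, 1, 1, 0],
           [0, 0, 0, 0],
           [0, 0, 0, 0]]

# Input side: pool the mask down to the (here 2x2) spatial size of the RoI feature.
pooled = max_pool_2x2(prob_mask)

# Target side: binarize, then measure IoU against the matched ground truth.
target_iou = mask_iou(binarize(prob_mask), gt_mask)

# Loss for a hypothetical head output of 0.5.
loss = l2_loss(0.5, target_iou)
```

In a real network the pooled mask would be concatenated with the RoI feature as an extra channel before the convolution layers; that step is omitted here to keep the sketch dependency-free.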
3.1. Effectiveness of MaskIoU Head
MS R-CNN is not sensitive to the backbone network and achieves a stable improvement of about 1.5 AP on all backbone networks. For AP@0.75 in particular, MS R-CNN improves the baseline by about 2 points.
- Besides, MS R-CNN does not harm bounding box detection performance; in fact, it improves bounding box detection performance slightly.
A similar trend is observed on COCO 2017 test-dev.
3.2. Ablation Studies
- MaskIoU head is robust to different ways of fusing mask prediction and RoI feature.
Concatenating the target score map and RoI feature obtains the best results.
3.3. Correlation Between MaskIoU and Score
- (a): shows the results of Mask R-CNN, where the mask score is only weakly related to MaskIoU.
- (b): shows the results of MS R-CNN; every detection with a high score and a low MaskIoU is penalized, so the mask score correlates better with MaskIoU.
- (c): shows the quantitative results, where the scores within each MaskIoU interval are averaged.
MS R-CNN achieves a better correspondence between score and MaskIoU.
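The per-interval averaging used for the quantitative comparison can be sketched as follows. The function name, the number of intervals and the sample detections are all hypothetical; the review does not state the interval width, so 5 equal-width bins are assumed here.

```python
def average_score_per_iou_interval(detections, num_bins=5):
    """Average the scores of detections falling in each MaskIoU interval.

    detections: iterable of (mask_iou, score) pairs with mask_iou in [0, 1].
    Returns one mean score per interval, or None for an empty interval.
    """
    sums = [0.0] * num_bins
    counts = [0] * num_bins
    for iou, score in detections:
        # A MaskIoU of exactly 1.0 is placed in the last interval.
        index = min(int(iou * num_bins), num_bins - 1)
        sums[index] += score
        counts[index] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


# Hypothetical (MaskIoU, score) pairs.
detections = [(0.15, 0.9), (0.10, 0.7), (0.55, 0.8), (0.95, 0.9), (0.85, 0.7)]
averages = average_score_per_iou_interval(detections)
```

If the score tracks mask quality well, the averaged score in each interval should lie close to the MaskIoU values of that interval.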