Brief Review — YOLACT++: Better Real-Time Instance Segmentation
YOLACT++ Extends YOLACT (You Only Look At CoefficienTs)
YOLACT++: Better Real-Time Instance Segmentation
YOLACT++, by Georgia Institute of Technology and University of California, Davis
2022 TPAMI, Over 270 Citations (Sik-Ho Tsang @ Medium)
Instance Segmentation, Semantic Segmentation, YOLACT
- YOLACT++ extends YOLACT by incorporating deformable convolutions into the backbone network, optimizing the prediction head with better anchor scales and aspect ratios, and adding a novel fast mask re-scoring branch.
- (Here, only YOLACT++ is described. For YOLACT, please feel free to read YOLACT.)
Outline
- Brief Review of You Only Look At CoefficienTs (YOLACT)
- YOLACT++
- YOLACT++ Results
1. Brief Review of YOLACT
- You Only Look At CoefficienTs (YOLACT) is a simple, fully-convolutional model for real-time instance segmentation that is trained using only one GPU.
- It has two parallel subtasks: (1) generating a set of prototype masks and (2) predicting per-instance mask coefficients.
- Then, instance masks are produced by linearly combining the prototypes with the mask coefficients (see the short sketch at the end of this section).
- Fast NMS is also proposed: a drop-in replacement for standard NMS that is 12 ms faster, with only a marginal performance penalty.
YOLACT only consumes 1500 MB of VRAM, which makes it an attractive model that could be deployed in low-capacity embedded systems.
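Below is a minimal sketch of the prototype/coefficient combination described above, assuming YOLACT's typical configuration (k = 32 prototypes at 138×138 resolution, a sigmoid over the linear combination); the tensor names and shapes are illustrative, not the authors' code.

```python
import torch

def assemble_masks(prototypes, coefficients):
    """Combine prototype masks with per-instance coefficients (YOLACT-style).

    prototypes:   (H, W, k) prototype masks for the whole image.
    coefficients: (n, k) mask coefficients for n detected instances.
    Returns:      (n, H, W) instance masks after a sigmoid.
    """
    # Each instance mask is a linear combination of the k prototypes.
    masks = torch.einsum('hwk,nk->nhw', prototypes, coefficients)
    return torch.sigmoid(masks)

# Illustrative shapes: 138x138 prototypes with k = 32, and 5 detections.
protos = torch.randn(138, 138, 32)
coeffs = torch.randn(5, 32)
masks = assemble_masks(protos, coeffs)  # -> (5, 138, 138)
```

In the full pipeline, the combined masks are additionally cropped with the predicted boxes and thresholded before evaluation.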
2. YOLACT++
2.1. Fast Mask Re-Scoring Network
- Inspired by MS R-CNN, a fast mask re-scoring branch is introduced, which rescores the predicted masks based on their mask IoU with ground-truth.
- The Fast Mask Re-Scoring Network is a 6-layer FCN with ReLU non-linearity per conv layer and a final global pooling layer.
- It takes as input YOLACT’s cropped mask prediction (before thresholding) and outputs the mask IoU for each object category.
- Each mask is rescored by taking the product between the predicted mask IoU for the category predicted by the classification head and the corresponding classification confidence (see the sketch below).
Since there is neither feature concatenation nor any fc layers, the speed overhead of adding the Fast Mask Re-Scoring branch to YOLACT is only 1.2 ms, which changes the fps from 34.4 to 33.
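A minimal sketch of such a re-scoring branch is given below: a 6-layer FCN with a ReLU after each conv layer, a final global pooling layer, and the product-based re-scoring. The channel width, kernel sizes, and the choice of average pooling are assumptions made for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class FastMaskRescoring(nn.Module):
    """Sketch of a fast mask re-scoring branch: a 6-layer FCN over the cropped,
    un-thresholded mask prediction, followed by global pooling, predicting a
    mask IoU per object category."""

    def __init__(self, num_classes=80, width=16):
        super().__init__()
        layers, in_ch = [], 1  # input is a single-channel mask prediction
        for _ in range(5):
            layers += [nn.Conv2d(in_ch, width, 3, padding=1), nn.ReLU(inplace=True)]
            in_ch = width
        layers += [nn.Conv2d(in_ch, num_classes, 3, padding=1), nn.ReLU(inplace=True)]
        self.fcn = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool2d(1)  # global pooling -> one IoU per class

    def forward(self, mask):             # mask: (n, 1, H, W), before thresholding
        iou = self.pool(self.fcn(mask))  # (n, num_classes, 1, 1)
        return iou.flatten(1)            # (n, num_classes)

def rescore(conf, cls_id, pred_iou):
    """Multiply the classification confidence by the IoU predicted for the
    class chosen by the classification head."""
    return conf * pred_iou.gather(1, cls_id.unsqueeze(1)).squeeze(1)
```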
2.2. Deformable Convolution With Intervals
- Based on DCNv2, the 3×3 convolution layer in each ResNet block is replaced with a 3×3 deformable convolution layer for C3 to C5.
- This leads to a +1.8 mask mAP gain with a speed overhead of 8 ms.
- DCN can strengthen the network’s capability of handling instances with different scales, rotations, and aspect ratios by aligning to the target instances.
- Specifically, deformable convolutions are tested in four different configurations: (1) in the last 10 ResNet blocks, (2) in the last 13 ResNet blocks, (3) in the last 3 ResNet stages with an interval of 3 (i.e., skipping two ResNet blocks in between; 11 deformable layers in total), and (4) in the last 3 ResNet stages with an interval of 4. (A sketch of the interval idea follows below.)
The DCN (interval=3) setting is chosen as the final configuration in YOLACT++, which cuts the speed overhead by 5.2 ms (from 8 ms down to 2.8 ms) with only a 0.2 mAP drop compared to not having an interval.
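The sketch below shows one way to apply deformable convolutions with an interval, using torchvision's DeformConv2d with the DCNv2-style modulation mask (available in recent torchvision versions). The offset-predicting side conv, the conv2 attribute name, and the block-selection logic are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableBlockConv(nn.Module):
    """Drop-in replacement for a ResNet block's 3x3 conv: a deformable 3x3 conv
    whose offsets and modulation masks come from a small side conv."""

    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # 3x3 kernel -> 2*9 offset channels + 9 modulation channels = 27
        self.offset_mask = nn.Conv2d(in_ch, 27, 3, stride=stride, padding=1)
        self.deform = DeformConv2d(in_ch, out_ch, 3, stride=stride,
                                   padding=1, bias=False)

    def forward(self, x):
        om = self.offset_mask(x)
        offset, mask = om[:, :18], torch.sigmoid(om[:, 18:])
        return self.deform(x, offset, mask)

def replace_with_interval(blocks, interval=3):
    """Swap the 3x3 conv ('conv2' in torchvision ResNet bottlenecks) of every
    `interval`-th block with a deformable conv, counting back from the last
    block. The new layers would be fine-tuned; weights here are untrained."""
    for i, block in enumerate(reversed(list(blocks))):
        if i % interval == 0:
            c = block.conv2
            block.conv2 = DeformableBlockConv(c.in_channels, c.out_channels,
                                              stride=c.stride[0])
```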
2.3. Optimized Prediction Head
- Anchor choice is revisited with 2 configs: (1) keeping the scales unchanged while increasing the anchor aspect ratios from [1, 1/2, 2] to [1, 1/2, 2, 1/3, 3], and (2) keeping the aspect ratios unchanged while increasing the scales per FPN level threefold ([1×, 2^(1/3)×, 2^(2/3)×]); see the sketch below.
- Using multi-scale anchors per FPN level (config 2) produces the best speed-versus-performance trade-off.
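A small sketch of config (2) is shown below: 3 aspect ratios × 3 scales per FPN level. The per-level base sizes and the exact width/height parameterization are illustrative assumptions, not the paper's exact anchor code.

```python
# Anchor shapes for config (2): 3 aspect ratios x 3 scales per FPN level.
def anchor_shapes(base_size, ratios=(1.0, 0.5, 2.0),
                  scale_factors=(1.0, 2 ** (1 / 3), 2 ** (2 / 3))):
    shapes = []
    for s in scale_factors:
        size = base_size * s
        for r in ratios:
            # Keep the anchor area roughly constant while varying aspect ratio.
            w = size * (r ** 0.5)
            h = size / (r ** 0.5)
            shapes.append((w, h))
    return shapes

# Example: one assumed base size per FPN level (P3..P7).
for base in (24, 48, 96, 192, 384):
    print(base, anchor_shapes(base))
```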
3. YOLACT++ Results
3.1. Ablation Studies
- Left: The optimized anchor choice directly improves the recall of box prediction and boosts the backbone detector.
- The deformable convolutions help with better feature sampling by aligning the sampling positions with the instances of interest, and they better handle changes in scale, rotation, and aspect ratio.
- The mask re-scoring method is also fast. Compared to incorporating MS R-CNN into YOLACT, it is 26.8 ms faster yet can still improve YOLACT by 1 mAP.
- Right: With the exploration of using fewer deformable convolution layers, the speed overhead is significantly cut down (from 8 to 2.8 ms) while keeping the performance almost the same (only a 0.2 mAP drop) compared to the original configuration proposed in DCNv2.
3.2. SOTA Comparisons
- The bottom two rows in the above table show the results of the YOLACT++ model with ResNet-50 and ResNet-101 backbones.
YOLACT++ obtains a huge performance boost over YOLACT (5.9 mAP for the ResNet-50 model and 4.8 mAP for the ResNet-101 model) while maintaining high speed.
3.3. Qualitative Results
- Both the box prediction and instance segmentation mask by YOLACT++ are more precise.
- YOLACT++ obtains increased detection recall and improves class confidence scores.
3.4. Timing Breakdown
- The time for each part of the method is measured with asynchronous GPU execution disabled. With parallelism turned off, the total time is much higher than that of the original model.
- The model is 3 times faster with parallelism turned on, which demonstrates how effective the method is at exploiting parallel computation.
Reference
[2022 TPAMI] [YOLACT++]
YOLACT++: Better Real-Time Instance Segmentation
1.6. Instance Segmentation
2014–2020 … [Open Images] 2021 [PVT, PVTv1] [Copy-Paste] 2022 [PVTv2] [YOLACT++]