Brief Review — YOLACT++ Better Real-Time Instance Segmentation

YOLACT++, Extends YOLACT (You Only Look At CoefficienTs)

5 min read · Jan 18, 2023

**Speed-performance trade-off for various instance segmentation methods on COCO.**

YOLACT++ Better Real-Time Instance Segmentation,
YOLACT++, by Georgia Institute of Technology, and University of California, Davis,
2022 TPAMI, Over 270 Citations (Sik-Ho Tsang @ Medium)
Instance Segmentation, Semantic Segmentation, YOLACT

YOLACT++ extends YOLACT by incorporating deformable convolutions into the backbone network, optimizing the prediction head with better anchor scales and aspect ratios, and adding a novel fast mask re-scoring branch.
(Here, only YOLACT++ is described. For YOLACT, please feel free to read YOLACT.)

Outline

Brief Review of You Only Look At CoefficienTs (YOLACT)
YOLACT++
YOLACT++ Results

1. Brief Review of YOLACT

**You Only Look At CoefficienTs (YOLACT) Architecture** Blue/yellow indicates low/high values in the prototypes, gray nodes indicate functions that are not trained, and k=4 in this example. This architecture is based on ResNet-101 + FPN.

You Only Look At CoefficienTs (YOLACT) is a simple, fully-convolutional model for real-time instance segmentation that is trained using only one GPU.
It has two parallel subtasks: (1) generating a set of prototype masks and (2) predicting per-instance mask coefficients.
Then, instance masks are produced by linearly combining the prototypes with the mask coefficients.
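As a rough illustration of this assembly step, the sketch below combines k prototype maps with one coefficient vector per instance and passes the result through a sigmoid, as described in the YOLACT paper. The function name `assemble_masks` and the nested-list layout are illustrative choices made here, not the authors' code; a real implementation would do this as a single matrix product on the GPU.

```python
import math


def assemble_masks(prototypes, coefficients):
    """Linearly combine prototype maps with per-instance coefficients.

    prototypes:   list of k maps, each an h x w nested list of floats
    coefficients: list of n vectors, each with k floats (one per prototype)
    returns:      list of n masks, each h x w, with values in (0, 1)
    """
    height, width = len(prototypes[0]), len(prototypes[0][0])
    masks = []
    for coeffs in coefficients:
        mask = []
        for y in range(height):
            row = []
            for x in range(width):
                # Weighted sum over the k prototypes at this pixel.
                logit = sum(c * p[y][x] for c, p in zip(coeffs, prototypes))
                # Sigmoid keeps the mask value between 0 and 1.
                row.append(1.0 / (1.0 + math.exp(-logit)))
            mask.append(row)
        masks.append(mask)
    return masks


# Toy example (made-up values): k=2 prototypes on a 1x2 map, one instance.
prototypes = [[[1.0, 0.0]], [[0.0, 1.0]]]
masks = assemble_masks(prototypes, [[10.0, -10.0]])
```

In the toy example, a positive coefficient switches the first prototype on and a negative coefficient switches the second one off, so the left pixel ends up close to 1 and the right pixel close to 0.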
Fast NMS is also proposed as a drop-in replacement for standard NMS that is 12 ms faster, with only a marginal performance penalty.
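To give a feel for where the speed and the small penalty come from, here is a minimal sketch of the Fast NMS idea from the YOLACT paper: every box is compared against all higher-scoring boxes at once, so a box that was itself suppressed can still suppress others. The helper names (`box_iou`, `fast_nms`) and the plain-Python loops are assumptions made for readability; the original runs this as batched matrix operations on the GPU.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def fast_nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = []
    for rank, idx in enumerate(order):
        # Max IoU against ALL higher-scoring boxes, even ones already removed.
        # Standard NMS would only compare against boxes that survived.
        max_iou = max(
            (box_iou(boxes[order[r]], boxes[idx]) for r in range(rank)),
            default=0.0,
        )
        if max_iou <= iou_threshold:
            keep.append(idx)
    return keep


# Three overlapping boxes in a row (made-up values).
boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (2, 0, 12, 10)]
scores = [0.9, 0.8, 0.7]
kept = fast_nms(boxes, scores, iou_threshold=0.7)
```

With a threshold of 0.7, standard NMS would keep the first and third boxes here, because the second box is removed before it can suppress the third. Fast NMS keeps only the first box, which is the kind of slight over-suppression behind the marginal performance penalty.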

YOLACT only consumes 1500 MB of VRAM, which makes it an attractive model that could be deployed in low-capacity embedded systems.

2. YOLACT++

2.1. Fast Mask Re-Scoring Network

**Fast Mask Re-scoring Network Architecture**

Inspired by MS R-CNN, a fast mask re-scoring branch is introduced, which rescores the predicted masks based on their mask IoU with ground-truth.
The Fast Mask Re-Scoring Network is a 6-layer FCN with ReLU non-linearity per conv layer and a final global pooling layer.
It takes as input YOLACT’s cropped mask prediction (before thresholding) and outputs the mask IoU for each object category.
Each mask is rescored by taking the product between the predicted mask IoU for the category predicted by the classification head and the corresponding classification confidence.
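The rescoring rule itself is just a product. A minimal sketch, with a hypothetical function name and made-up numbers, could look like this:

```python
def rescore_mask(class_confidence, predicted_class, mask_iou_per_class):
    """Rescore one detection.

    class_confidence:   confidence from the classification head
    predicted_class:    index of the class chosen by the classification head
    mask_iou_per_class: predicted mask IoU for every category, as output
                        by the re-scoring branch
    """
    # Pick the IoU predicted for the class the detector actually chose,
    # then multiply it with the classification confidence.
    return class_confidence * mask_iou_per_class[predicted_class]


# Made-up example: confident class prediction, mediocre mask quality.
new_score = rescore_mask(0.9, 2, [0.1, 0.2, 0.5])
```

A detection with a confident class but a poor predicted mask IoU is pushed down the ranking, so the final score correlates better with mask quality.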

Since there is neither feature concatenation nor any fc layer, the speed overhead of adding the Fast Mask Re-Scoring branch to YOLACT is only 1.2 ms, which changes the fps from 34.4 to 33.
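The fps figures follow directly from the per-frame time. As a quick sanity check (the function name is only for illustration):

```python
def fps_after_overhead(base_fps, overhead_ms):
    """Convert fps to ms per frame, add the overhead, convert back."""
    base_ms = 1000.0 / base_fps
    return 1000.0 / (base_ms + overhead_ms)


# 34.4 fps is about 29.07 ms per frame; adding 1.2 ms gives about 30.27 ms.
rescoring_fps = fps_after_overhead(34.4, 1.2)
```

This lands at roughly 33 fps, matching the number quoted above.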

2.2. Deformable Convolution With Intervals

Based on DCNv2, the 3×3 convolution layer in each ResNet block is replaced with a 3×3 deformable convolution layer for C3 to C5.
This leads to a +1.8 mask mAP gain with a speed overhead of 8 ms.
DCN can strengthen the network’s capability of handling instances with different scales, rotations, and aspect ratios by aligning to the target instances.
Specifically, deformable convolutions are tested in four different configurations: (1) in the last 10 ResNet blocks, (2) in the last 13 ResNet blocks, (3) in the last 3 ResNet stages with an interval of 3 (i.e., skipping two ResNet blocks in between; total 11 deformable layers), and (4) in the last 3 ResNet stages with an interval of 4.
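To make the interval idea concrete, the sketch below lists which blocks would get a deformable convolution. It assumes the standard ResNet-101 layout (4, 23, and 3 blocks in C3, C4, and C5) and assumes the interval is counted from the first block of each stage; that indexing is a guess made here, chosen because it reproduces the 11 layers mentioned above. The names are illustrative, not from the YOLACT++ code.

```python
# Standard ResNet-101 block counts for the last three stages.
RESNET101_BLOCKS = {"C3": 4, "C4": 23, "C5": 3}


def dcn_block_indices(blocks_per_stage, interval):
    """Per stage, pick every `interval`-th block, starting at block 0."""
    return {
        stage: [i for i in range(num_blocks) if i % interval == 0]
        for stage, num_blocks in blocks_per_stage.items()
    }


def count_dcn_layers(blocks_per_stage, interval):
    """Total number of blocks that use a deformable convolution."""
    selected = dcn_block_indices(blocks_per_stage, interval)
    return sum(len(indices) for indices in selected.values())


interval_3 = dcn_block_indices(RESNET101_BLOCKS, interval=3)
total_interval_3 = count_dcn_layers(RESNET101_BLOCKS, interval=3)
total_no_interval = count_dcn_layers(RESNET101_BLOCKS, interval=1)
```

Under these assumptions, an interval of 3 selects 2 blocks in C3, 8 in C4, and 1 in C5, versus all 30 blocks when no interval is used.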

The DCN (interval=3) setting is chosen as the final configuration in YOLACT++, which cuts the speed overhead by 5.2 ms, down to 2.8 ms, with only a 0.2 mAP drop compared to not having an interval.

2.3. Optimized Prediction Head

Anchor choice is revisited with 2 configs: (1) keeping the scales unchanged while increasing the anchor aspect ratios from [1, 1/2, 2] to [1, 1/2, 2, 1/3, 3], and (2) keeping the aspect ratios unchanged while increasing the scales per FPN level by threefold ([1×, 2^(1/3)×, 2^(2/3)×]).
Using multi-scale anchors per FPN level (config 2) produces the best speed versus performance trade-off.
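The two configs differ in how many anchors each location gets. The sketch below enumerates anchor shapes for the baseline and both configs. It assumes aspect ratio means width/height and that each anchor keeps the area of its square counterpart; the base scale of 24 pixels is only a placeholder, and the function name is hypothetical.

```python
import math


def anchors_per_location(base_scale, aspect_ratios, scale_multipliers):
    """Return (width, height) for every anchor at one feature-map location."""
    anchors = []
    for multiplier in scale_multipliers:
        scale = base_scale * multiplier
        for ratio in aspect_ratios:
            # Assumed convention: ratio = width / height, area = scale ** 2.
            width = scale * math.sqrt(ratio)
            height = scale / math.sqrt(ratio)
            anchors.append((width, height))
    return anchors


BASE_SCALE = 24  # placeholder value for one FPN level

baseline = anchors_per_location(BASE_SCALE, [1, 1 / 2, 2], [1])
config_1 = anchors_per_location(BASE_SCALE, [1, 1 / 2, 2, 1 / 3, 3], [1])
config_2 = anchors_per_location(
    BASE_SCALE, [1, 1 / 2, 2], [1, 2 ** (1 / 3), 2 ** (2 / 3)]
)
```

The baseline has 3 anchors per location, config 1 has 5, and config 2 has 9, so the chosen config triples the number of anchors per FPN level.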

3. YOLACT++ Results

3.1. Ablation Studies

**Left: YOLACT++ Improvements Contribution, Right: Different Choices of Using Deformable Convolution Layers**

Left: The optimized anchor choice directly improves the recall of box prediction and boosts the backbone detector.
The deformable convolutions help with better feature sampling by aligning the sampling positions with the instances of interest, and they better handle changes in scale, rotation, and aspect ratio.
The mask re-scoring method is also fast. Compared to incorporating MS R-CNN into YOLACT, it is 26.8 ms faster yet can still improve YOLACT by 1 mAP.
Right: By using fewer deformable convolution layers, the speed overhead is significantly cut down (from 8 to 2.8 ms) while keeping the performance almost the same (only a 0.2 mAP drop) compared to the original configuration proposed in DCNv2.

3.2. SOTA Comparisons

The bottom two rows in the above table show the results of YOLACT++ model with ResNet-50 and ResNet-101 backbones.

YOLACT++ obtains a huge performance boost over YOLACT (5.9 mAP for the ResNet-50 model and 4.8 mAP for the ResNet-101 model) while maintaining high speed.

3.3. Qualitative Results

Both the box predictions and the instance segmentation masks produced by YOLACT++ are more precise than those of YOLACT.
YOLACT++ also obtains increased detection recall and improved class confidence scores.

3.4. Timing Breakdown

**Timing Breakdown: The Time Taken for Each Stage of the Model**

The time for each part of the method is measured with asynchronous GPU execution disabled. With parallelism turned off, the total time is much higher than that of the original model.
The model is 3 times faster with parallelism turned on, which demonstrates how effective the method is at exploiting parallel computation.

Reference

[2022 TPAMI] [YOLACT++]
YOLACT++ Better Real-Time Instance Segmentation

Written by Sik-Ho Tsang