Brief Review — YOLACT++ Better Real-Time Instance Segmentation

YOLACT++, Extends YOLACT (You Only Look At CoefficienTs)

Speed-performance trade-off for various instance segmentation methods on COCO.
  • YOLACT++ extends YOLACT by incorporating deformable convolutions into the backbone network, optimizing the prediction head with better anchor scales and aspect ratios, and adding a novel fast mask re-scoring branch.
  • (Here, only YOLACT++ is described. For YOLACT, please feel free to read YOLACT.)


  1. Brief Review of You Only Look At CoefficienTs (YOLACT)
  2. YOLACT++
  3. YOLACT++ Results

1. Brief Review of YOLACT

You Only Look At CoefficienTs (YOLACT) Architecture: Blue/yellow indicates low/high values in the prototypes, gray nodes indicate functions that are not trained, and k=4 in this example. This architecture is based on ResNet-101 + FPN.
  • You Only Look At CoefficienTs (YOLACT) is a simple, fully-convolutional model for real-time instance segmentation that is trained using only one GPU.
  • It has two parallel subtasks: (1) generating a set of prototype masks and (2) predicting per-instance mask coefficients.
  • Then, instance masks are produced by linearly combining the prototypes with the mask coefficients.
  • Fast NMS is also proposed: a drop-in replacement for standard NMS that is 12 ms faster, with only a marginal performance penalty.
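
The mask assembly step can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the function name `assemble_mask` and the toy values are made up, the sigmoid follows the YOLACT formulation M = σ(PCᵀ), and cropping to the predicted box is left out.

```python
import math

def assemble_mask(prototypes, coefficients):
    """Combine k prototype masks into one instance mask.

    prototypes: list of k 2-D grids (each h x w) of floats.
    coefficients: list of k floats predicted for one instance.
    Returns an h x w grid of values in (0, 1).
    """
    h, w = len(prototypes[0]), len(prototypes[0][0])
    mask = []
    for y in range(h):
        row = []
        for x in range(w):
            # Linear combination of the k prototype activations at (y, x).
            logit = sum(c * p[y][x] for c, p in zip(coefficients, prototypes))
            row.append(1.0 / (1.0 + math.exp(-logit)))  # sigmoid
        mask.append(row)
    return mask

# Toy example with k = 4 prototypes on a 2 x 2 grid (values are made up).
prototypes = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 1.0], [0.0, 0.0]],
    [[0.0, 0.0], [1.0, 0.0]],
    [[0.0, 0.0], [0.0, 1.0]],
]
coefficients = [4.0, -4.0, 0.0, 2.0]

mask = assemble_mask(prototypes, coefficients)
# Threshold at 0.5 to get a binary instance mask.
binary = [[int(v > 0.5) for v in row] for row in mask]
```

Positive coefficients switch a prototype "on" for the instance and negative ones subtract it, which is how a small shared set of prototypes can represent many different instances.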


2. YOLACT++

2.1. Fast Mask Re-Scoring Network

Fast Mask Re-scoring Network Architecture
  • Inspired by MS R-CNN, a fast mask re-scoring branch is introduced, which rescores the predicted masks based on their mask IoU with the ground truth.
  • The Fast Mask Re-Scoring Network is a 6-layer FCN with ReLU non-linearity per conv layer and a final global pooling layer.
  • It takes as input YOLACT’s cropped mask prediction (before thresholding) and outputs the mask IoU for each object category.
  • Each mask is rescored by taking the product between the predicted mask IoU for the category predicted by the classification head and the corresponding classification confidence.
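
The rescoring rule in the last bullet is a single multiplication. Here is a minimal sketch, assuming the classification head and the re-scoring branch each return one value per category; the function name `rescore_mask` and the numbers are hypothetical.

```python
def rescore_mask(class_confidences, predicted_mask_ious):
    """Return (predicted_class, rescored_confidence) for one detection.

    class_confidences: per-category scores from the classification head.
    predicted_mask_ious: per-category mask IoU from the re-scoring branch.
    """
    # Category chosen by the classification head.
    cls = max(range(len(class_confidences)), key=class_confidences.__getitem__)
    # Product of the classification confidence and the predicted mask IoU
    # for that same category.
    return cls, class_confidences[cls] * predicted_mask_ious[cls]

# Made-up scores for a 3-category example.
cls, score = rescore_mask([0.1, 0.8, 0.1], [0.3, 0.5, 0.9])
```

In this example the detection is confidently classified (0.8), but its mask is predicted to overlap the ground truth only moderately (0.5), so its final score drops to 0.4 and it ranks lower among the detections.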

2.2. Deformable Convolution With Intervals

  • Based on DCNv2, the 3×3 convolution layer in each ResNet block is replaced with a 3×3 deformable convolution layer for C3 to C5.
  • This leads to a +1.8 mask mAP gain with a speed overhead of 8 ms.
  • DCN can strengthen the network’s capability of handling instances with different scales, rotations, and aspect ratios by aligning to the target instances.
  • Specifically, deformable convolutions are tested in four different configurations: (1) in the last 10 ResNet blocks, (2) in the last 13 ResNet blocks, (3) in the last 3 ResNet stages with an interval of 3 (i.e., skipping two ResNet blocks in between; total 11 deformable layers), and (4) in the last 3 ResNet stages with an interval of 4.
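
To make the "interval" idea concrete, the sketch below lists which blocks would receive a deformable convolution. It assumes a ResNet-101 backbone (4, 23, and 3 residual blocks in C3, C4, and C5) and assumes the counting restarts at the first block of every stage; the helper `dcn_block_indices` is hypothetical and not taken from the authors' code.

```python
def dcn_block_indices(blocks_per_stage, interval):
    """Per stage, return the block indices that would use a deformable conv."""
    # Assumption: every stage starts with a deformable block, then every
    # `interval`-th block after it is deformable as well.
    return [[i for i in range(n) if i % interval == 0] for n in blocks_per_stage]

# ResNet-101 has 4, 23, and 3 residual blocks in C3, C4, and C5.
RESNET101_C3_TO_C5 = [4, 23, 3]

placement = dcn_block_indices(RESNET101_C3_TO_C5, interval=3)
total = sum(len(stage) for stage in placement)  # 11 deformable layers
```

Under these assumptions, an interval of 3 gives 2 + 8 + 1 = 11 deformable layers, which matches the total quoted for configuration (3).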

2.3. Optimized Prediction Head

  • Anchor choice is revisited with two configurations: (1) keeping the scales unchanged while increasing the anchor aspect ratios from [1, 1/2, 2] to [1, 1/2, 2, 1/3, 3], and (2) keeping the aspect ratios unchanged while increasing the scales per FPN level threefold ([1×, 2^(1/3)×, 2^(2/3)×]).
  • Using multi-scale anchors per FPN level (configuration 2) produces the best speed versus performance trade-off.
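
The two anchor configurations differ only in which list grows. The following is a rough sketch of how the anchor shapes could be enumerated. It assumes the common convention that an aspect ratio changes width and height while preserving area; the base size of 24 pixels and the helper name `anchor_shapes` are illustrative choices, not values confirmed by this review.

```python
import math

def anchor_shapes(base_size, scales, aspect_ratios):
    """Return (width, height) pairs for every scale / aspect-ratio combination."""
    shapes = []
    for s in scales:
        for ar in aspect_ratios:
            # Keep the area at (base_size * s)^2 while setting width / height = ar.
            w = base_size * s * math.sqrt(ar)
            h = base_size * s / math.sqrt(ar)
            shapes.append((w, h))
    return shapes

# Configuration 1: one scale, five aspect ratios.
config1 = anchor_shapes(24, [1.0], [1.0, 0.5, 2.0, 1 / 3, 3.0])

# Configuration 2: three scales per FPN level, original three aspect ratios.
SCALES = [1.0, 2 ** (1 / 3), 2 ** (2 / 3)]
ASPECT_RATIOS = [1.0, 0.5, 2.0]
config2 = anchor_shapes(24, SCALES, ASPECT_RATIOS)
```

Configuration 1 yields 5 anchors per location and configuration 2 yields 9. The three scales in configuration 2 are evenly spaced in log-space between one FPN level and the next, which has twice the base size.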

3. YOLACT++ Results

3.1. Ablation Studies

Left: YOLACT++ Improvements Contribution, Right: Different Choices of Using Deformable Convolution Layers
  • Left: The optimized anchor choice directly improves the recall of box prediction and boosts the backbone detector.
  • The deformable convolutions enable better feature sampling by aligning the sampling positions with the instances of interest, and they better handle changes in scale, rotation, and aspect ratio.
  • The mask re-scoring method is also fast. Compared to incorporating MS R-CNN into YOLACT, it is 26.8 ms faster yet can still improve YOLACT by 1 mAP.
  • Right: By using fewer deformable convolution layers, the speed overhead is cut down significantly (from 8 ms to 2.8 ms) while the performance stays almost the same (only a 0.2 mAP drop) compared to the original configuration proposed in DCNv2.

3.2. SOTA Comparisons

MS COCO Results
  • The bottom two rows in the above table show the results of YOLACT++ model with ResNet-50 and ResNet-101 backbones.

3.3. Qualitative Results

  • Both the box predictions and the instance segmentation masks produced by YOLACT++ are more precise than those of YOLACT.
  • YOLACT++ obtains increased detection recall and improves class confidence scores.

3.4. Timing Breakdown

Timing Breakdown: The Time Taken for Each Stage of the Model
  • The time for each part of the method is measured with asynchronous GPU execution disabled. With parallelism turned off, the total time is much higher than that of the original model.
  • The model is 3 times faster with parallelism turned on, which demonstrates how effectively the method exploits parallel computation.

