Review — CenterMask : Real-Time Anchor-Free Instance Segmentation

Proposed SAG-Mask, VoVNetV2, eSE; Outperforms YOLACT

Sik-Ho Tsang
6 min read · Aug 21


Results of CenterMask with VoVNetV2–99 on COCO test-dev2017.

CenterMask : Real-Time Anchor-Free Instance Segmentation
CenterMask, by Electronics and Telecommunications Research Institute (ETRI)
2020 CVPR, Over 450 Citations (Sik-Ho Tsang @ Medium)

CenterMask consists of three parts: (1) a backbone for feature extraction, (2) an FCOS detection head, and (3) a mask head.
  • An adaptive RoI assignment function is proposed to map each RoI to a suitable feature level.
  • A novel spatial attention-guided mask (SAG-Mask) branch is added to the anchor-free one-stage object detector FCOS.
  • An improved VoVNet backbone, VoVNetV2, is used, in which an effective Squeeze-Excitation (eSE) module is proposed.


  1. CenterMask: Adaptive RoI Assignment Function
  2. CenterMask: Spatial Attention-Guided Mask (SAG-Mask)
  3. CenterMask: VoVNetV2 Backbone
  4. Some Implementation Details
  5. Results

1. CenterMask: Adaptive RoI Assignment Function

1.1. RoI Align in Mask R-CNN

  • With RoI Align in Mask R-CNN, each RoI should be assigned to a feature map of a different scale according to the RoI's own scale. Specifically, a large RoI is assigned to a higher (coarser) feature level, and vice versa.
  • Mask R-CNN-based two-stage detectors use the FPN equation below to determine the feature map (Pk) to which an RoI is assigned:
      k = ⌊k0 + log2(√(w·h) / 224)⌋
  • where k0 is 4 and w, h are the width and height of each RoI.
  • However, this function is tuned to two-stage detectors. Two-stage detectors use feature levels P2 (stride of 2²) to P5 (2⁵), while one-stage detectors use P3 (2³) to P7 (2⁷), which have larger receptive fields and lower resolution.
  • The canonical ImageNet pretraining size of 224 is hard-coded, so the function does not adapt to feature scale variation.

For example, when the input is 1024×1024 and the area of an RoI is 224², the RoI is assigned to the relatively high feature level P4 even though it is small relative to the input, which reduces small-object AP.
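The FPN rule above can be sketched in a few lines of Python. This is only an illustration: the function name `fpn_roi_level` and the clamping bounds `k_min`/`k_max` are assumptions made for this sketch (P2 to P5, as used by two-stage detectors), not code from the paper.

```python
import math

def fpn_roi_level(w, h, k0=4, canonical_size=224, k_min=2, k_max=5):
    """FPN-style level: k = floor(k0 + log2(sqrt(w*h) / canonical_size))."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / canonical_size))
    # Clamp to the levels assumed to exist (P2..P5 here).
    return max(k_min, min(k_max, k))

# The result depends only on the RoI size, never on the input image size.
print(fpn_roi_level(224, 224))  # 4
print(fpn_roi_level(112, 112))  # 3
print(fpn_roi_level(448, 448))  # 5
```

Because the input size never enters the formula, a 224² RoI lands on P4 whether the image is 512×512 or 1024×1024, which is the limitation noted above.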

1.2. Adaptive RoI Assignment Function

  • A new RoI assignment function suited to CenterMask-style one-stage detectors is proposed:
      k = ⌈kmax − log2(Ainput / ARoI)⌉
  • where kmax is the last level (e.g., 7) of the feature maps in the backbone, and Ainput and ARoI are the areas of the input image and the RoI, respectively.
  • k is clipped to the valid range of feature levels when it falls outside it.

The proposed RoI assignment method improves the small object AP.
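As a rough sketch, the adaptive rule k = ⌈kmax − log2(Ainput / ARoI)⌉ could be written as below. The function name `adaptive_roi_level` and the default bounds (P3 to P7) are assumptions for illustration only.

```python
import math

def adaptive_roi_level(roi_area, input_area, k_max=7, k_min=3):
    """Adaptive level: k = ceil(k_max - log2(input_area / roi_area)), then clipped."""
    k = math.ceil(k_max - math.log2(input_area / roi_area))
    # Clip to the assumed range of available feature levels.
    return max(k_min, min(k_max, k))

input_area = 1024 * 1024
# A 224x224 RoI is small relative to a 1024x1024 input, so it gets a low level.
print(adaptive_roi_level(224 * 224, input_area))
# An RoI covering more than half of the input goes to the highest level.
print(adaptive_roi_level(800 * 800, input_area))
```

Unlike the FPN rule, the level here is driven by the ratio between the RoI area and the input area, so the same 224² RoI maps to a lower level as the input grows.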

2. CenterMask: Spatial Attention-Guided Mask (SAG-Mask)

Spatial Attention-Guided Mask (SAG-Mask)
  • After 4 conv layers, average pooling and max pooling are both performed along the channel axis, giving Pavg and Pmax. The two pooled maps are aggregated via concatenation.
  • The result is then passed through a 3×3 conv layer and normalized by the sigmoid function:
      Asag(Xi) = σ(F3×3(Pmax ∘ Pavg))
  • Finally, the attention-guided feature map Xsag is:
      Xsag = Asag(Xi) ⊗ Xi
  • where ∘ denotes concatenation and ⊗ denotes element-wise multiplication.
  • After that, a 2×2 deconv upsamples the spatially attended feature map to 28×28 resolution. Lastly, a 1×1 conv is applied to predict class-specific masks.
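The spatial attention step (channel-wise pooling, concatenation, 3×3 conv, sigmoid, element-wise multiplication) can be illustrated with plain Python lists. This is a minimal sketch, not the authors' implementation: the function name `spatial_attention`, the nested-list tensor layout, and zero padding in the 3×3 conv are all assumptions.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def spatial_attention(x, weight, bias=0.0):
    """x: C x H x W nested lists; weight: 2 x 3 x 3 kernel over the [max, avg] maps."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    # Pool along the channel axis: each result is a single H x W map.
    p_max = [[max(x[c][i][j] for c in range(C)) for j in range(W)] for i in range(H)]
    p_avg = [[sum(x[c][i][j] for c in range(C)) / C for j in range(W)] for i in range(H)]
    pooled = [p_max, p_avg]  # concatenation -> 2 x H x W

    attn = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            acc = bias
            for m in range(2):
                for di in range(3):
                    for dj in range(3):
                        ii, jj = i + di - 1, j + dj - 1
                        if 0 <= ii < H and 0 <= jj < W:  # zero padding (assumed)
                            acc += weight[m][di][dj] * pooled[m][ii][jj]
            attn[i][j] = sigmoid(acc)
    # Broadcast the H x W attention map over every channel of x.
    x_sag = [[[attn[i][j] * x[c][i][j] for j in range(W)] for i in range(H)]
             for c in range(C)]
    return attn, x_sag
```

With an all-zero kernel the attention map is 0.5 everywhere, so every feature is simply halved; a trained kernel would instead push the map toward 1 on informative pixels and toward 0 on noisy ones.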

3. CenterMask: VoVNetV2 Backbone

Enhanced OSA Module

3.1. Residual Connection

  • Without a residual connection, the accuracy of the deeper VoVNetV1–99 is lower than that of VoVNetV1–57.

VoVNetV2 improves on VoVNet by adding a residual connection to each OSA module.

3.2. Effective Squeeze-Excitation (eSE)

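  • The SE module in SENet squeezes the spatial dependency by global average pooling, followed by two fully-connected (FC) layers and a sigmoid function:
      Ach(Xdiv) = σ(WC(δ(WC/r(Fgap(Xdiv)))))
      Fgap(X) = (1 / WH) Σi Σj Xi,j
  • where Xdiv is the diversified feature map produced by the 1×1 conv of the OSA module, Fgap is channel-wise global average pooling, WC/r and WC are the weights of the two FC layers, δ is ReLU, and σ is the sigmoid function.
  • The first FC reduces the channel dimension and the second FC expands it back. The SE module therefore has a limitation: channel information is lost due to the dimension reduction.
  • To address this, effective SE (eSE) replaces the two FCs with a single FC layer with C channels, so there is no channel dimension reduction. This preserves channel information and in turn improves performance:
      AeSE(Xdiv) = σ(WC(Fgap(Xdiv)))
      Xrefine = AeSE(Xdiv) ⊗ Xdiv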
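To make the eSE idea concrete, here is a small sketch using plain Python lists: global average pooling per channel, a single C-to-C FC layer, a sigmoid, and channel-wise rescaling. The function name `ese` and the nested-list layout are assumptions for illustration, and the plain sigmoid follows the description above rather than any particular released implementation.

```python
import math

def ese(x, weight, bias):
    """x: C x H x W nested lists; weight: C x C FC weights; bias: length-C list."""
    C = len(x)
    # Squeeze: channel-wise global average pooling -> vector of length C.
    gap = [sum(sum(row) for row in x[c]) / (len(x[c]) * len(x[c][0]))
           for c in range(C)]
    # Excite: one FC layer that keeps all C channels (no C/r bottleneck), then sigmoid.
    attn = []
    for o in range(C):
        z = sum(weight[o][c] * gap[c] for c in range(C)) + bias[o]
        attn.append(1.0 / (1.0 + math.exp(-z)))
    # Rescale every channel of the input by its attention weight.
    refined = [[[attn[c] * v for v in row] for row in x[c]] for c in range(C)]
    return attn, refined

x = [[[1.0, 3.0], [5.0, 7.0]], [[2.0, 2.0], [2.0, 2.0]]]
identity = [[1.0, 0.0], [0.0, 1.0]]
attn, refined = ese(x, identity, [0.0, 0.0])
```

The original SE block would insert a C/r-wide FC layer and a ReLU between the pooling and the final FC; dropping that bottleneck is the whole difference.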

4. Some Implementation Details

  • For the lightweight CenterMask-Lite, the backbone is the more lightweight VoVNetV2–19, which has 4 OSA modules on each stage comprised of 3 conv layers instead of 5 as in VoVNetV2–39/57.
  • In the box head, there are four 3×3 conv layers with 256 channels on each of the classification and box branches, where the centerness branch is shared with the box branch. For CenterMask-Lite, the number of conv layers is reduced from 4 to 2, with 128 channels.
  • Lastly, in the mask head, the number of conv layers and channels in the feature extractor and mask scoring part is also reduced from (4, 256) to (2, 128).
  • The number of detection boxes from FCOS is limited to 100, and these highest-scoring boxes are fed into the SAG-Mask branch to train the mask branch.
  • The final objective:
      L = Lcls + Lcenter + Lbox + Lmask
  • where the classification loss Lcls, centerness loss Lcenter, and box regression loss Lbox are the same as those in FCOS, and Lmask is the average binary cross-entropy loss as in Mask R-CNN.
  • ImageNet pre-trained weights are used.
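The final objective is the sum of four terms, with the mask term being an average binary cross-entropy. The sketch below shows only that arithmetic. The helper names `average_bce` and `centermask_loss`, the flattened-list inputs, and the `eps` clamp are assumptions for illustration.

```python
import math

def average_bce(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over all mask pixels (flattened lists)."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probs)

def centermask_loss(l_cls, l_center, l_box, l_mask):
    """Unweighted sum of the four terms: L = Lcls + Lcenter + Lbox + Lmask."""
    return l_cls + l_center + l_box + l_mask

# Predicted foreground probabilities vs. ground-truth mask pixels (made-up values).
l_mask = average_bce([0.9, 0.2, 0.7, 0.1], [1, 0, 1, 0])
total = centermask_loss(0.5, 0.6, 0.4, l_mask)
```

The values passed for the classification, centerness, and box terms here are placeholders; in practice they would come from the FCOS head.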

5. Results

5.1. Ablation Studies

Spatial Attention-Guided Mask (SAG-Mask)

On top of the aforementioned scale-adaptive RoI mapping strategy, the proposed spatial attention module (SAM) further improves mask performance, because it helps the mask predictor focus on informative pixels while suppressing noise.

The P3~P5 feature range achieves the best result, which means that higher-resolution feature maps are advantageous for mask prediction.

Residual connection consistently improves VoVNet-39/57/99.

  • SE worsens the performance of VoVNet or has no effect.

eSE, which maintains channel information by using only 1 FC layer, boosts both APmask and APbox over VoVNetV1 with only slight extra computation.

5.2. Comparisons with Other Backbones

Comparisons with Other Backbones

VoVNetV2–39 outperforms ResNet-50/HRNet-W18 by large margins of 1.2%/2.6% APmask, respectively, at faster speeds. The APbox gains are even larger: 1.5%/3.3%, respectively.

For large models, VoVNetV2–99 runs much faster (1.5×) and achieves competitive APmask and higher APbox than ResNeXt-101–32x8d, despite having fewer model parameters.

For small models, VoVNetV2–19 outperforms MobileNetV2 by a large margin of 1.7% APmask / 3.3% APbox, with comparable speed.

5.3. Comparisons with SOTA

Comparisons with SOTA

Under the same ResNet-101 backbone, CenterMask outperforms all other counterparts in terms of both accuracy (APmask, APbox) and speed.

To the best of the authors' knowledge, CenterMask with VoVNetV2–99 is the first method to achieve 40% APmask at over 10 fps.

  • CenterMask*, a re-implementation on top of Detectron2, obtains a further performance gain.

CenterMask-Lite is consistently superior to YOLACT in terms of both accuracy and speed. CenterMask-Lite models run at over 30 fps while leading by large margins in both APmask and APbox.


