Review — Learning to Segment Every Thing

Mask X R-CNN, Weakly Supervised Segmentation for Unseen-Class Objects

Sik-Ho Tsang
5 min read · Jan 5, 2023
Instance segmentation models with partial supervision: a subset of classes (green boxes) have instance mask annotations during training; the remaining classes (red boxes) have only bounding box annotations.

Learning to Segment Every Thing,
Mask X R-CNN, by BAIR, UC Berkeley, and Facebook AI Research (FAIR)
2018 CVPR, Over 290 Citations (Sik-Ho Tsang @ Medium)
Instance Segmentation, Weakly Supervised, Faster R-CNN, Mask R-CNN

  • A new partially supervised training paradigm is proposed, with a novel weight transfer function that enables training instance segmentation models on a large set of categories, all of which have box annotations but only a small fraction of which have mask annotations.


  1. Mask X R-CNN
  2. Results

1. Mask X R-CNN

Mask X R-CNN
  • Let C be the set of object categories, where C = A ∪ B: examples from the categories in A have instance mask annotations, while those in B have only bounding boxes.
  • Training on this combination of strong and weak labels is referred to as a partially supervised learning problem.

1.1. Mask Prediction Using Weight Transfer

  • The method is built on Mask R-CNN.
  • During training, the mask branch is trained jointly and in parallel with the standard bounding box head found in Faster R-CNN.

Instead of learning the category-specific bounding box parameters and mask parameters independently, a category’s mask parameters are predicted from its bounding box parameters using a generic, category-agnostic weight transfer function.

  • Specifically, let w^c_det be the class-specific object detection weights in the last layer of the bounding box head for category c.
  • Instead of treating the mask weights w^c_seg as independent model parameters, w^c_seg is parameterized using a generic weight prediction function T:

w^c_seg = T(w^c_det; θ),

  • where θ denotes class-agnostic, learned parameters.
  • T can be implemented as a small fully connected neural network.
  • The bounding box head contains two types of detection weights: the RoI classification weights w^c_cls and the bounding box regression weights w^c_box.
  • w^c_det can be w^c_cls alone, w^c_box alone, or their concatenation [w^c_cls, w^c_box].
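The weight transfer idea above can be sketched in PyTorch. All layer sizes, dimensions, and names here are illustrative assumptions, not the paper's exact configuration; the "cls+box" input and the 2-layer LeakyReLU MLP follow the best-performing variant reported in the ablation.

```python
import torch
import torch.nn as nn

class WeightTransfer(nn.Module):
    """Sketch of the weight transfer function T: predicts per-class mask
    weights w_seg from the detection weights w_det. Dimensions are
    illustrative, not the paper's exact configuration."""
    def __init__(self, det_dim, seg_dim, hidden_dim=512):
        super().__init__()
        # 2-layer MLP with LeakyReLU, the best-performing variant in the ablation
        self.transfer = nn.Sequential(
            nn.Linear(det_dim, hidden_dim),
            nn.LeakyReLU(),
            nn.Linear(hidden_dim, seg_dim),
        )

    def forward(self, w_cls, w_box):
        # "cls+box" input: concatenate RoI classification and box regression
        # weights to form w_det, one row per category
        w_det = torch.cat([w_cls, w_box], dim=1)  # (num_classes, det_dim)
        return self.transfer(w_det)               # (num_classes, seg_dim)
```

Because T is shared across all categories, it can be trained on the mask-annotated classes in A and then applied to predict mask weights for the box-only classes in B.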

1.2. Training Procedures

1.2.1. Stage-Wise Training

  • In the first stage, Faster R-CNN is trained using only the bounding box annotations of the classes in A ∪ B; then, in the second stage, the additional mask head is trained while keeping the convolutional features and the bounding box head fixed.
  • However, separate training may result in inferior performance.

1.2.2. End-to-End Training

  • To train both heads jointly while handling the discrepancy caused by the mask loss being computed only on the classes in A, stop_grad is used to stop the gradient flow from the mask branch back into the detection weights:
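A toy illustration of the stop_grad trick (all tensors and shapes here are made-up stand-ins): the mask loss should update T's parameters θ, but must not push gradients back into the detection weights w_det, since w_det is shared with the box-only classes in B.

```python
import torch

# Illustrative stand-ins: 80 classes, tiny weight dimensions
w_det = torch.randn(80, 8, requires_grad=True)   # detection weights (shared with B)
theta = torch.randn(8, 16, requires_grad=True)   # parameters of T

# stop_grad: detach w_det before applying T, so the mask loss
# (computed only on set A) cannot backpropagate into w_det
w_seg = w_det.detach() @ theta
mask_loss = w_seg.pow(2).sum()                   # stand-in for the mask loss
mask_loss.backward()

assert w_det.grad is None                        # no gradient reached w_det via T
assert theta.grad is not None                    # T's parameters still learn
```

This matches the ablation's finding that end-to-end training helps only when backpropagation from T into the detection weights is disabled.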

1.3. Extension: Fused FCN+MLP Mask Heads

  • Two types of mask heads are considered: an FCN head, similar to Mask R-CNN's, and an MLP head, similar to DeepMask's.
  • The two can also be fused to combine the strengths of both heads.
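One way to sketch such a fusion (channel sizes and the single-layer branches are illustrative assumptions; the actual heads are deeper): the FCN branch predicts per-class spatial masks, the MLP branch predicts one class-agnostic mask capturing the global shape, and their logits are summed.

```python
import torch
import torch.nn as nn

class FusedMaskHead(nn.Module):
    """Sketch of fusing an FCN mask head with a class-agnostic MLP head.
    Channel sizes and single-layer branches are illustrative only."""
    def __init__(self, in_ch=256, num_classes=80, mask_size=28):
        super().__init__()
        self.mask_size = mask_size
        # FCN branch: per-class spatial masks, better at fine details
        self.fcn = nn.Conv2d(in_ch, num_classes, kernel_size=1)
        # MLP branch: one class-agnostic mask, better at the overall shape
        self.mlp = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_ch * mask_size * mask_size, mask_size * mask_size),
        )

    def forward(self, x):                  # x: (N, in_ch, M, M) RoI features
        fcn_logits = self.fcn(x)           # (N, K, M, M)
        mlp_logits = self.mlp(x).view(-1, 1, self.mask_size, self.mask_size)
        return fcn_logits + mlp_logits     # broadcast-add the agnostic mask
```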

2. Results

  • Either ImageNet-pretrained ResNet-50-FPN or ResNet-101-FPN is used as the backbone architecture for Mask R-CNN.

2.1. Ablation Study

Ablation study with ResNet-50-FPN as backbone
  • (a) Input to T: The "cls+box" input gives the best results.
  • (b) Structure of T: A 2-layer MLP with LeakyReLU gives the best mask AP on set B. Given this, the "cls+box", 2-layer, LeakyReLU implementation of T is used for all subsequent experiments.
  • (c) MLP Mask Branch: A class-agnostic MLP mask branch can be fused with either the baseline or the proposed transfer approach.
  • (d) Training Strategy: End-to-end training brings improved results, but only when backpropagation from T into the detection weights w^c_det is disabled.
Each point corresponds to a random A/B split of COCO classes
  • There are 80 classes in COCO.
  • 20, 30, 40, 50 or 60 classes are randomly included in set A (the complement forms set B), and 5 trials are performed for each split size.
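The split protocol above can be reproduced in a few lines (the class IDs here are stand-ins for the 80 COCO categories; the seeding scheme is an illustrative assumption):

```python
import random

coco_classes = list(range(80))            # stand-ins for the 80 COCO categories
for size_a in (20, 30, 40, 50, 60):
    for trial in range(5):                # 5 random trials per split size
        rng = random.Random(trial)
        set_a = set(rng.sample(coco_classes, size_a))  # masks + boxes
        set_b = set(coco_classes) - set_a              # boxes only
        assert len(set_a) + len(set_b) == 80
```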

The proposed method yields up to more than a 40% relative increase in mask AP.

2.2. Full Method

End-to-end training of Mask X R-CNN

Mask X R-CNN outperforms these approaches by a large margin (over 20% relative increase in mask AP).

Mask predictions from the class-agnostic baseline (top row) vs. our Mask X R-CNN approach (bottom row).

The above figure shows example mask predictions from the class-agnostic baseline and the proposed approach.

2.3. Visual Genome

Example mask predictions from our Mask X R-CNN on 3000 classes in Visual Genome.
  • Visual Genome dataset is used. 3000 classes are selected.
  • The green boxes are the 80 classes that overlap with COCO (set A with mask training data) while the red boxes are the remaining 2920 classes not in COCO (set B without mask training data).
  • The above results illustrate the exciting potential of systems that can recognize and segment thousands of concepts.


[2018 CVPR] [Mask X R-CNN]
Learning to Segment Every Thing

