Brief Review — HTC: Hybrid Task Cascade for Instance Segmentation

HTC, Better Information Guiding Between Mask Branches & Bounding Box Branches

7 min read · Dec 6, 2022

Hybrid Task Cascade for Instance Segmentation,
HTC, by The Chinese University of Hong Kong, SenseTime Research, Zhejiang University, The University of Sydney, and Nanyang Technological University,
2019 CVPR, Over 800 Citations (Sik-Ho Tsang @ Medium)
Instance Segmentation, Mask R-CNN, Cascade R-CNN, Multi-Task Learning

A simple combination of Cascade R-CNN and Mask R-CNN only brings limited gain. In this paper, Hybrid Task Cascade (HTC) is proposed:

It interweaves the detection and segmentation branches for joint multi-stage processing, and
it adopts a fully convolutional branch to provide spatial context, which helps distinguish hard foreground from cluttered background.

Outline

From Cascade Mask R-CNN to Hybrid Task Cascade (HTC)
Loss Functions
Results

1. From Cascade Mask R-CNN to Hybrid Task Cascade (HTC)

(You may skip the pipeline equations for a quick read.)

1.1. Cascade Mask R-CNN

A direct combination of Mask R-CNN and Cascade R-CNN, denoted as Cascade Mask R-CNN, is shown above.

Specifically, a mask branch (Green) following the architecture of Mask R-CNN is added to each stage of Cascade R-CNN:
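With the symbols defined just below, the parallel pipeline can be written as follows. This is a reconstruction from those definitions, with r_{t-1} assumed to denote the boxes produced by the previous stage:

```latex
\begin{aligned}
x_t^{box}  &= \mathcal{P}(x, r_{t-1}), & r_t &= B_t(x_t^{box}),\\
x_t^{mask} &= \mathcal{P}(x, r_{t-1}), & m_t &= M_t(x_t^{mask}).
\end{aligned}
```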

where x indicates the CNN features of the backbone network, and x^box_t and x^mask_t indicate the box and mask features derived from x and the input RoIs. P(·) is a pooling operator, e.g., RoIAlign or RoI pooling; B_t and M_t denote the box and mask heads at the t-th stage; r_t and m_t represent the corresponding box predictions and mask predictions.

With this design, AP is improved.
However, a drawback is that the two branches at each stage are executed in parallel during training, so they do not interact directly within a stage.

1.2. Interleaved Execution

An improved design, which interleaves the box and mask branches, is shown above. The interleaved execution is expressed as:
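Using the notation of Section 1.1 (P for the pooling operator, B_t and M_t for the box and mask heads), a reconstruction of this pipeline is given below. The only change from the parallel design is that the mask features are pooled with the updated boxes r_t rather than r_{t-1}:

```latex
\begin{aligned}
x_t^{box}  &= \mathcal{P}(x, r_{t-1}), & r_t &= B_t(x_t^{box}),\\
x_t^{mask} &= \mathcal{P}(x, r_t),     & m_t &= M_t(x_t^{mask}).
\end{aligned}
```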

In this way, the mask branch can take advantage of the updated bounding box predictions. It is found that this yields improved performance.
However, there is no direct information flow between mask branches at different stages.
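The difference between parallel and interleaved execution comes down to which boxes each mask head is pooled with. The toy sketch below makes that visible. All functions are hypothetical stand-ins (a set of RoIs is just a label such as "r1"), not real RoIAlign or real heads:

```python
def pool(x, rois):
    # Stand-in for P(x, r): just remember which RoI version was pooled.
    return {"features": x, "rois": rois}

def box_head(t, x_box):
    # Stand-in for B_t: refining r_{t-1} yields r_t.
    assert x_box["rois"] == f"r{t - 1}"
    return f"r{t}"

def mask_head(t, x_mask):
    # Stand-in for M_t: report which RoIs the mask was predicted on.
    return f"m{t} on {x_mask['rois']}"

def cascade_parallel(x, num_stages=3):
    """Cascade Mask R-CNN: both branches of stage t see r_{t-1}."""
    rois, masks = "r0", []
    for t in range(1, num_stages + 1):
        x_box = pool(x, rois)
        x_mask = pool(x, rois)            # same (old) RoIs as the box branch
        masks.append(mask_head(t, x_mask))
        rois = box_head(t, x_box)
    return masks

def cascade_interleaved(x, num_stages=3):
    """Interleaved execution: the mask branch of stage t sees r_t."""
    rois, masks = "r0", []
    for t in range(1, num_stages + 1):
        rois = box_head(t, pool(x, rois))          # r_t from r_{t-1}
        masks.append(mask_head(t, pool(x, rois)))  # mask pooled with r_t
    return masks

print(cascade_parallel("x"))     # ['m1 on r0', 'm2 on r1', 'm3 on r2']
print(cascade_interleaved("x"))  # ['m1 on r1', 'm2 on r2', 'm3 on r3']
```

In the interleaved version every mask is predicted on boxes that are one refinement step ahead of those used in the parallel version.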

1.3. Mask Information Flow

An information flow is introduced between mask branches (red arrow) by feeding the mask features of the preceding stage to the current stage.
With the direct path between mask branches, the pipeline can be written as:
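A reconstruction of this pipeline, in the notation of the earlier sections, is shown below. Here F is assumed to denote a function that combines the mask features of the current stage with those of the preceding one:

```latex
\begin{aligned}
x_t^{box}  &= \mathcal{P}(x, r_{t-1}), & r_t &= B_t(x_t^{box}),\\
x_t^{mask} &= \mathcal{P}(x, r_t),     & m_t &= M_t\big(\mathcal{F}(x_t^{mask}, \bar{m}_{t-1})\big).
\end{aligned}
```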

where m̄_{t-1} denotes the intermediate feature of M_{t-1}.
This information flow enables progressive refinement of the masks themselves, instead of merely predicting masks on progressively refined bounding boxes.
A simple implementation is as below:

The RoI feature before the deconvolutional layer is adopted as the mask representation m̄_{t-1}, whose spatial size is 14×14.
At stage t, all preceding mask heads need to be forwarded with the RoIs of the current stage to compute m̄_{t-1}:
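Unrolled, this amounts to the chain below. It is a reconstruction from the surrounding definitions: F is taken to be the fusion of the current-stage features with the preceding mask feature, and M̄_k the feature transformation part of the k-th mask head:

```latex
\begin{aligned}
\bar{m}_1     &= \bar{M}_1(x_t^{mask}),\\
\bar{m}_2     &= \bar{M}_2\big(\mathcal{F}(x_t^{mask}, \bar{m}_1)\big),\\
              &\;\;\vdots\\
\bar{m}_{t-1} &= \bar{M}_{t-1}\big(\mathcal{F}(x_t^{mask}, \bar{m}_{t-2})\big).
\end{aligned}
```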

**Architecture of multi-stage mask branches.**

Here, M̄_t denotes the feature transformation component of the mask head M_t, which comprises 4 consecutive 3×3 convolutional layers.
The transformed features m̄_{t-1} are then embedded with a 1×1 convolutional layer G_t in order to be aligned with the pooled backbone features x^mask_t.
Finally, G_t(m̄_{t-1}) is added to x^mask_t through element-wise sum.

Mask features in different stages are no longer isolated and all get supervised through backpropagation.
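The forwarding chain and the element-wise fusion can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the channel count is arbitrary, the weights are random, and each transformation M̄_t is replaced by a single 1×1 convolution plus ReLU instead of four 3×3 convolutions, which is enough to show the data flow and the 14×14 shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 8, 14, 14      # channel count is illustrative; 14x14 matches the text
NUM_STAGES = 3

def conv1x1(weight, feat):
    """1x1 convolution: mixes channels independently at every pixel."""
    return np.einsum("oc,chw->ohw", weight, feat)

def make_transform():
    """Stand-in for M_bar_t (really four 3x3 convs): 1x1 conv + ReLU."""
    w = rng.normal(scale=0.1, size=(C, C))
    return lambda feat: np.maximum(conv1x1(w, feat), 0.0)

# transforms[k-1] plays the role of M_bar_k; embeds[k-1] plays the role of G_k.
transforms = [make_transform() for _ in range(NUM_STAGES)]
embeds = [None] + [rng.normal(scale=0.1, size=(C, C))
                   for _ in range(NUM_STAGES - 1)]   # stage 1 has no G

def intermediate_mask_feature(t, x_mask):
    """Forward mask heads 1..t on the RoI features of stage t; return m_bar_t."""
    if not 1 <= t <= NUM_STAGES:
        raise ValueError("stage index out of range")
    m_prev = None
    for k in range(1, t + 1):
        if m_prev is None:
            fused = x_mask                                    # first head: no flow
        else:
            fused = x_mask + conv1x1(embeds[k - 1], m_prev)   # x + G_k(m_bar_{k-1})
        m_prev = transforms[k - 1](fused)
    return m_prev

x_mask = rng.normal(size=(C, H, W))   # pooled RoI features of the current stage
m3 = intermediate_mask_feature(3, x_mask)
print(m3.shape)  # (8, 14, 14)
```

Because the 1×1 embedding keeps the spatial size, the fused tensor always has the same shape as the pooled backbone features, which is what makes the element-wise sum possible.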

1.4. Hybrid Task Cascade (HTC)

An additional branch (Light Orange) is added to predict per-pixel semantic segmentation for the whole image, which adopts the fully convolutional architecture and is jointly trained with other branches.
The pipeline is formulated as below:
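A reconstruction of the full pipeline in the notation used so far is given below. The semantic features S(x) are pooled with the same RoIs and added to both the box and the mask features, and F is again assumed to denote the fusion with the preceding mask feature:

```latex
\begin{aligned}
x_t^{box}  &= \mathcal{P}(x, r_{t-1}) + \mathcal{P}(S(x), r_{t-1}), & r_t &= B_t(x_t^{box}),\\
x_t^{mask} &= \mathcal{P}(x, r_t) + \mathcal{P}(S(x), r_t),         & m_t &= M_t\big(\mathcal{F}(x_t^{mask}, \bar{m}_{t-1})\big).
\end{aligned}
```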

where S indicates the semantic segmentation head.

This semantic segmentation feature is a strong complement to existing box and mask features, which can be more discriminative on cluttered background.

Specifically, S is constructed based on the output of the Feature Pyramid Network (FPN).
Each level of the feature pyramid is first aligned to a common representation space via a 1×1 convolutional layer. Then all levels are resized to the same spatial scale with a stride of 8: maps coarser than this are upsampled and finer ones are downsampled.
Feature maps from different levels are subsequently fused by element-wise sum.
Four convolutional layers are added to further bridge the semantic gap.
At the end, a convolutional layer is simply adopted to predict the pixel-wise segmentation map.
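To make the resizing step concrete, the sketch below works out how each pyramid level would have to be rescaled to reach stride 8. The list of FPN strides (4 to 64) is an assumption based on a common FPN configuration; the text above only fixes the target stride:

```python
# Assumed FPN strides (a common P2..P6 setup); only the target stride of 8
# comes from the text.
FPN_STRIDES = [4, 8, 16, 32, 64]
TARGET_STRIDE = 8

def resize_plan(strides, target):
    """For each level, say how it must be resized to reach the target stride."""
    plan = []
    for s in strides:
        if s > target:
            plan.append((s, "upsample", s // target))    # coarser than target
        elif s < target:
            plan.append((s, "downsample", target // s))  # finer than target
        else:
            plan.append((s, "keep", 1))
    return plan

def fused_size(image_h, image_w, target=TARGET_STRIDE):
    """Spatial size of the fused map for an image whose sides divide evenly."""
    return image_h // target, image_w // target

for stride, action, factor in resize_plan(FPN_STRIDES, TARGET_STRIDE):
    print(f"stride {stride:>2}: {action} x{factor}")
print(fused_size(800, 1216))  # (100, 152)
```

Once every level has the same spatial size, the element-wise sum across levels is well defined.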

2. Loss Functions

At each stage t, the box head predicts the classification score c_t and regression offset r_t for all sampled RoIs.
The mask head predicts pixel-wise masks m_t for positive RoIs.
The semantic branch predicts a full-image semantic segmentation map s. The overall loss function takes the form of multi-task learning:
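Written out with the symbols explained below, the objective can be reconstructed as follows. Hatted symbols are assumed to denote the corresponding ground truths:

```latex
\begin{aligned}
L &= \sum_{t=1}^{T} \alpha_t \left( L_{bbox}^{t} + L_{mask}^{t} \right) + \beta L_{seg},\\
L_{bbox}^{t}(c_t, r_t, \hat{c}_t, \hat{r}_t) &= L_{cls}(c_t, \hat{c}_t) + L_{reg}(r_t, \hat{r}_t),\\
L_{mask}^{t}(m_t, \hat{m}_t) &= \mathrm{BCE}(m_t, \hat{m}_t),\\
L_{seg} &= \mathrm{CE}(s, \hat{s}).
\end{aligned}
```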

Here, L^t_bbox is the loss of the bounding box predictions at stage t, which follows the same definition as in Cascade R-CNN and combines two terms, L_cls and L_reg, for classification and bounding box regression respectively.
L^t_mask is the loss of mask prediction at stage t, which adopts the binary cross entropy form as in Mask R-CNN.
L_seg is the semantic segmentation loss in the form of cross entropy.
The coefficients α_t and β are used to balance the contributions of different stages and tasks. By default, α = [1, 0.5, 0.25], T = 3 and β = 1.
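As a small worked example of the weighting, the sketch below combines per-stage losses with the default coefficients. The function name and the loss values are made up for illustration; in practice the individual terms would come from the box, mask and semantic heads:

```python
STAGE_WEIGHTS = [1.0, 0.5, 0.25]   # alpha_t for T = 3 stages
SEG_WEIGHT = 1.0                   # beta

def htc_total_loss(bbox_losses, mask_losses, seg_loss,
                   stage_weights=STAGE_WEIGHTS, seg_weight=SEG_WEIGHT):
    """Weighted sum of per-stage (bbox + mask) losses plus the semantic loss."""
    if not (len(bbox_losses) == len(mask_losses) == len(stage_weights)):
        raise ValueError("need one bbox loss and one mask loss per stage")
    staged = sum(a * (lb + lm)
                 for a, lb, lm in zip(stage_weights, bbox_losses, mask_losses))
    return staged + seg_weight * seg_loss

# Made-up values: every stage has bbox loss 1.0 and mask loss 1.0.
total = htc_total_loss([1.0, 1.0, 1.0], [1.0, 1.0, 1.0], seg_loss=2.0)
print(total)  # 1*2 + 0.5*2 + 0.25*2 + 1*2 = 5.5
```

With identical per-stage losses, the first stage contributes four times as much as the last, reflecting the decreasing stage weights.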

3. Results

3.1. SOTA Comparisons

**Comparison with state-of-the-art methods on COCO test-dev dataset.**

It is noted that the baseline is already higher than PANet.

HTC achieves consistent improvements across different backbones, demonstrating its effectiveness. It achieves gains of 1.5%, 1.3% and 1.1% for ResNet-50, ResNet-101 and ResNeXt-101, respectively.

3.2. Ablation Study

**Effects of each component in our design. Results are reported on COCO 2017 val.**

The interleaved execution slightly improves the mask AP by 0.2%. The mask information flow contributes to a further 0.6% improvement, and the semantic segmentation branch leads to a gain of 0.6%.

(There are other ablation experiments; please feel free to read the paper directly.)

3.3. Extensions on HTC

**Results (mask AP) with better backbones and bells and whistles on COCO test-dev dataset.**

HTC Baseline: The ResNet-50 baseline achieves 38.2 mask AP.
DCN: Deformable convolution is adopted in the last stage (res5) of the backbone.
SyncBN: Synchronized Batch Normalization, as in MegDet, is used in the backbone and heads.
Multi-scale Training: In each iteration, the scale of the short edge is randomly sampled from [400, 1400], and the scale of the long edge is fixed at 1600.
SENet-154: Different backbones are tried besides ResNet-50, and SENet-154 achieves the best single-model performance among them.
GA-RPN: The trained detectors are finetuned with the proposals generated by GA-RPN [41], which achieves nearly 10% higher recall than RPN.
Multi-scale Testing: 5 scales as well as horizontal flip are used at test time to ensemble the results. Testing scales are (600, 900), (800, 1200), (1000, 1500), (1200, 1800), (1400, 2100).
Ensemble: An ensemble of five networks: SENet-154, ResNeXt-101 64×4d, ResNeXt-101 32×8d, DPN-107, and FishNet.
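The multi-scale training and testing settings above can be sketched as a small configuration helper. The names are hypothetical, and two details are assumptions rather than facts from the text: the short edge is drawn uniformly over integers, and every test scale is evaluated both with and without a horizontal flip:

```python
import random

TRAIN_SHORT_RANGE = (400, 1400)
TRAIN_LONG_EDGE = 1600
TEST_SCALES = [(600, 900), (800, 1200), (1000, 1500), (1200, 1800), (1400, 2100)]

def sample_train_scale(rng):
    """Pick (short_edge, long_edge) for one training iteration."""
    short = rng.randint(*TRAIN_SHORT_RANGE)   # inclusive on both ends
    return short, TRAIN_LONG_EDGE

def tta_views():
    """All (scale, flipped) combinations whose predictions get ensembled."""
    return [(scale, flip) for scale in TEST_SCALES for flip in (False, True)]

rng = random.Random(0)             # seeded only to make the sketch repeatable
print(sample_train_scale(rng))
print(len(tta_views()))            # 10 views: 5 scales x {original, flipped}
```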