Brief Review — HTC: Hybrid Task Cascade for Instance Segmentation

HTC, Better Information Guiding Between Mask Branches & Bounding Box Branches

  • A simple combination of Cascade R-CNN and Mask R-CNN brings only limited gain. In this paper, Hybrid Task Cascade (HTC) is proposed, which:
  1. interweaves detection and segmentation for joint multi-stage processing, and
  2. adopts a fully convolutional branch to provide spatial context, which helps distinguish hard foreground from cluttered background.


  1. From Cascade Mask R-CNN to Hybrid Task Cascade (HTC)
  2. Loss Functions
  3. Results

1. From Cascade Mask R-CNN to Hybrid Task Cascade (HTC)

  • (You may skip the pipeline equations for a quick read.)

1.1. Cascade Mask R-CNN

Cascade Mask R-CNN
  • Specifically, a mask branch (green) following the architecture of Mask R-CNN is added to each stage of Cascade R-CNN.
  • In the notation of this section, x denotes the CNN features of the backbone network, and x_t^box and x_t^mask denote the box and mask features derived from x and the input RoIs. P(·) is a pooling operator, e.g., RoIAlign or RoI pooling; B_t and M_t denote the box and mask heads at the t-th stage; and r_t and m_t are the corresponding box and mask predictions.
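Written out with these symbols, stage t of Cascade Mask R-CNN takes roughly the following form. This is a reconstruction from the definitions above rather than a verbatim copy of the paper's equation; r_{t-1} is assumed to denote the boxes produced by the previous stage (the initial proposals for t = 1).

```latex
\begin{aligned}
x_t^{\mathrm{box}}  &= \mathcal{P}(x, r_{t-1}), & r_t &= B_t\left(x_t^{\mathrm{box}}\right), \\
x_t^{\mathrm{mask}} &= \mathcal{P}(x, r_{t-1}), & m_t &= M_t\left(x_t^{\mathrm{mask}}\right).
\end{aligned}
```

Both branches pool with the same boxes r_{t-1}, so within a stage the mask branch does not benefit from the box refinement of that stage.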

1.2. Interleaved Execution

Interleaved Execution
  • An improved design, shown above, interleaves the box and mask branches: at each stage the box branch runs first, and the mask branch then predicts masks on the refined boxes r_t instead of on r_{t-1}.
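The difference between the parallel and the interleaved design comes down to which boxes the mask branch pools with. The sketch below is only a toy illustration of that control flow: `run_cascade`, `pool`, and the lambda heads are hypothetical stand-ins (integers play the role of RoIs), not the authors' implementation.

```python
from typing import Any, Callable, List, Sequence, Tuple


def run_cascade(
    x: Any,
    rois: Any,
    box_heads: Sequence[Callable[[Any], Any]],
    mask_heads: Sequence[Callable[[Any], Any]],
    pool: Callable[[Any, Any], Any],
    interleaved: bool = True,
) -> Tuple[List[Any], List[Any]]:
    """Run all stages and return (box predictions, mask predictions)."""
    boxes, masks = [], []
    prev = rois  # r_0: the initial proposals
    for box_head, mask_head in zip(box_heads, mask_heads):
        refined = box_head(pool(x, prev))  # r_t = B_t(P(x, r_{t-1}))
        # Parallel design pools mask features with r_{t-1};
        # interleaved design pools them with the refined r_t.
        mask_rois = refined if interleaved else prev
        masks.append(mask_head(pool(x, mask_rois)))  # m_t = M_t(P(x, .))
        boxes.append(refined)
        prev = refined
    return boxes, masks


# Toy stand-ins: pooling just returns the RoIs, the box head "refines" them
# by adding 1, and the mask head echoes the RoIs it was given.
pool = lambda x, r: r
box_heads = [lambda f: f + 1] * 3
mask_heads = [lambda f: f] * 3

parallel = run_cascade(None, 0, box_heads, mask_heads, pool, interleaved=False)
interleaved = run_cascade(None, 0, box_heads, mask_heads, pool, interleaved=True)
```

In the toy run, the parallel variant's mask heads see RoIs 0, 1, 2 (one refinement behind), while the interleaved variant's mask heads see 1, 2, 3, the same boxes the box heads just produced.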

1.3. Mask Information Flow

Mask Information Flow
  • An information flow is introduced between mask branches (red arrow) by feeding the mask features of the preceding stage to the current stage.
  • With the direct path between mask branches, the mask head at stage t takes two inputs: the pooled features x_t^mask and m_{t-1}^-, the intermediate feature of the preceding mask head M_{t-1}.
  • This information flow enables progressive refinement of masks, instead of merely predicting masks on progressively refined bounding boxes.
  • In a simple implementation, the RoI feature before the deconvolutional layer is adopted as the mask representation m_{t-1}^-, whose spatial size is 14×14.
  • At stage t, all preceding mask heads need to be forwarded with the RoIs of the current stage to compute m_{t-1}^-.
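The expressions behind this list can be reconstructed from the surrounding definitions as follows. Treat this as a hedged reconstruction, not a verbatim copy of the paper: the combining function is written F, the superscript minus marks intermediate features and the feature transformation part of a mask head, and G_t is the 1×1 convolution described after the figure.

```latex
% Pipeline with mask information flow
\begin{aligned}
x_t^{\mathrm{box}}  &= \mathcal{P}(x, r_{t-1}), & r_t &= B_t\left(x_t^{\mathrm{box}}\right), \\
x_t^{\mathrm{mask}} &= \mathcal{P}(x, r_t),     & m_t &= M_t\left(\mathcal{F}\left(x_t^{\mathrm{mask}}, m_{t-1}^{-}\right)\right).
\end{aligned}

% Simple implementation of the combining function
\mathcal{F}\left(x_t^{\mathrm{mask}}, m_{t-1}^{-}\right) = x_t^{\mathrm{mask}} + G_t\left(m_{t-1}^{-}\right)

% Forwarding the preceding mask heads with the RoIs of stage t
\begin{aligned}
m_1^{-}     &= M_1^{-}\left(x_t^{\mathrm{mask}}\right), \\
m_2^{-}     &= M_2^{-}\left(\mathcal{F}\left(x_t^{\mathrm{mask}}, m_1^{-}\right)\right), \\
            &\;\;\vdots \\
m_{t-1}^{-} &= M_{t-1}^{-}\left(\mathcal{F}\left(x_t^{\mathrm{mask}}, m_{t-2}^{-}\right)\right).
\end{aligned}
```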
Architecture of multi-stage mask branches.
  • Here, M_t^- denotes the feature transformation component of the mask head M_t, which comprises 4 consecutive 3×3 convolutional layers.
  • The transformed features m_{t-1}^- are then embedded with a 1×1 convolutional layer G_t in order to be aligned with the pooled backbone features x_t^mask.
  • Finally, G_t(m_{t-1}^-) is added to x_t^mask through element-wise sum.
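The chain of mask heads described above can be sketched in a few lines of plain Python. Everything here is illustrative: `fused_mask_input`, the dictionaries `transforms` (standing in for the 3×3 convolution stacks M_i^-) and `embeds` (standing in for the 1×1 convolutions G_i), and the use of flat lists as features are assumptions made for this sketch.

```python
from typing import Callable, Dict, List

Feature = List[float]
Layer = Callable[[Feature], Feature]


def fused_mask_input(
    x_mask: Feature,
    transforms: Dict[int, Layer],  # transforms[i] plays the role of M_i^-
    embeds: Dict[int, Layer],      # embeds[i] plays the role of G_i (i >= 2)
    t: int,
) -> Feature:
    """Return x_t^mask + G_t(m_{t-1}^-), the input to the stage-t mask head.

    All preceding heads are forwarded with the current stage's RoI features.
    """
    fused = x_mask  # stage 1 has no preceding mask head
    for i in range(2, t + 1):
        m_prev = transforms[i - 1](fused)  # m_{i-1}^-
        embedded = embeds[i](m_prev)       # G_i(m_{i-1}^-)
        fused = [a + b for a, b in zip(x_mask, embedded)]  # element-wise sum
    return fused


# Toy layers: "transform" doubles every value, "embed" adds 1.
double = lambda f: [2 * v for v in f]
plus_one = lambda f: [v + 1 for v in f]
transforms = {1: double, 2: double, 3: double}
embeds = {2: plus_one, 3: plus_one}

x = [1.0, 2.0]
stage1 = fused_mask_input(x, transforms, embeds, t=1)
stage2 = fused_mask_input(x, transforms, embeds, t=2)
stage3 = fused_mask_input(x, transforms, embeds, t=3)
```

Note that the cost of the chain grows with the stage index: stage 3 has to run the transformation parts of both preceding heads on its own RoI features before its own head is applied.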

1.4. Hybrid Task Cascade (HTC)

Hybrid Task Cascade (HTC)
  • An additional branch (Light Orange) is added to predict per-pixel semantic segmentation for the whole image, which adopts the fully convolutional architecture and is jointly trained with other branches.
  • In the resulting pipeline, S denotes the semantic segmentation head; its output S(x) is pooled with the same RoIs and fused with the box and mask features at every stage.
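With the semantic head S included, the full HTC pipeline can be written roughly as below. As with the earlier expressions, this is a reconstruction that reuses the symbols already defined (P, B_t, M_t, and the combining function F with the intermediate feature of the previous mask head), not a verbatim copy.

```latex
\begin{aligned}
x_t^{\mathrm{box}}  &= \mathcal{P}(x, r_{t-1}) + \mathcal{P}\left(S(x), r_{t-1}\right), & r_t &= B_t\left(x_t^{\mathrm{box}}\right), \\
x_t^{\mathrm{mask}} &= \mathcal{P}(x, r_t) + \mathcal{P}\left(S(x), r_t\right),         & m_t &= M_t\left(\mathcal{F}\left(x_t^{\mathrm{mask}}, m_{t-1}^{-}\right)\right).
\end{aligned}
```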
Semantic Segmentation Branch.
  • Specifically, S is constructed based on the output of the Feature Pyramid Network (FPN).
  • Each level of the feature pyramid is first aligned to a common representation space via a 1×1 convolutional layer. All levels are then resized to the same spatial scale, with the stride set to 8: feature maps coarser than stride 8 are upsampled, and finer ones are downsampled.
  • Feature maps from different levels are subsequently fused by element-wise sum.
  • Four convolutional layers are added to further bridge the semantic gap.
  • Finally, a single convolutional layer predicts the pixel-wise segmentation map.
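The resize-and-sum step of the semantic branch can be illustrated with a one-dimensional toy. The sketch makes several simplifying assumptions that are not from the paper: features are 1-D rows instead of 2-D maps, only three pyramid levels with strides 4, 8, and 16 are used, resizing is nearest-neighbour, and the 1×1 alignment convolutions and the later convolutional layers are left out. The function names are hypothetical.

```python
from typing import List, Sequence


def resize_nearest(row: Sequence[float], out_len: int) -> List[float]:
    """Resize a 1-D feature row to out_len samples by nearest-neighbour lookup."""
    n = len(row)
    return [row[i * n // out_len] for i in range(out_len)]


def fuse_levels(
    levels: Sequence[Sequence[float]],
    strides: Sequence[int],
    target_stride: int = 8,
) -> List[float]:
    """Bring every level to the resolution of the target stride and sum them."""
    reference = levels[list(strides).index(target_stride)]
    fused = [0.0] * len(reference)
    for row in levels:
        resized = resize_nearest(row, len(reference))
        fused = [a + b for a, b in zip(fused, resized)]  # element-wise sum
    return fused


# A 32-pixel-wide toy input gives 8, 4, and 2 samples at strides 4, 8, 16.
levels = [
    [0, 1, 2, 3, 4, 5, 6, 7],  # stride 4: finer than the target, downsampled
    [1, 1, 1, 1],              # stride 8: already at the target resolution
    [10, 20],                  # stride 16: coarser than the target, upsampled
]
fused = fuse_levels(levels, strides=[4, 8, 16])
```

The fused row has the length of the stride-8 level, which is the common scale described above.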

2. Loss Functions

  • At each stage t, the box head predicts the classification score c_t and regression offset r_t for all sampled RoIs.
  • The mask head predicts pixel-wise masks m_t for positive RoIs.
  • The semantic branch predicts a full-image semantic segmentation map s. The overall loss is a multi-task objective: the per-stage box and mask losses are summed with weights α_t, and the semantic segmentation loss is added with weight β.
  • Here, L_bbox^t is the loss of the bounding box predictions at stage t, which follows the same definition as in Cascade R-CNN and combines two terms, L_cls and L_reg, for classification and bounding box regression respectively.
  • L_mask^t is the loss of mask prediction at stage t, which adopts the binary cross-entropy form as in Mask R-CNN.
  • L_seg is the semantic segmentation loss in the form of cross entropy.
  • The coefficients α_t and β are used to balance the contributions of different stages and tasks. The settings are T = 3, α = [1, 0.5, 0.25], and β = 1.
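Putting the terms above together, the total loss reads L = Σ_{t=1..T} α_t (L_bbox^t + L_mask^t) + β L_seg; this formula is reconstructed from the term definitions in this section. The small helper below shows the weighting with the stated coefficients. The function name and the per-stage loss values in the example are made up for illustration.

```python
from typing import Sequence


def htc_total_loss(
    bbox_losses: Sequence[float],  # L_bbox^t for t = 1..T
    mask_losses: Sequence[float],  # L_mask^t for t = 1..T
    seg_loss: float,               # L_seg
    alphas: Sequence[float] = (1.0, 0.5, 0.25),
    beta: float = 1.0,
) -> float:
    """Weighted sum of per-stage box/mask losses plus the segmentation loss."""
    if not (len(bbox_losses) == len(mask_losses) == len(alphas)):
        raise ValueError("expected one bbox loss, mask loss, and alpha per stage")
    stage_term = sum(
        a * (l_box + l_mask)
        for a, l_box, l_mask in zip(alphas, bbox_losses, mask_losses)
    )
    return stage_term + beta * seg_loss


# Hypothetical loss values: every individual loss equals 1.0.
total = htc_total_loss([1.0, 1.0, 1.0], [1.0, 1.0, 1.0], seg_loss=1.0)
# 1.0 * 2 + 0.5 * 2 + 0.25 * 2 + 1.0 * 1 = 4.5
```

With these weights, the first stage contributes four times as much as the third for the same raw loss value.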

3. Results

3.1. SOTA Comparisons

Comparison with state-of-the-art methods on COCO test-dev dataset.
  • Note that the baseline already scores higher than PANet.

3.2. Ablation Study

Effects of each component in our design. Results are reported on COCO 2017 val.
  • (There are other ablation experiments; please feel free to read the paper directly.)

3.3. Extensions on HTC

Results (mask AP) with better backbones and bells and whistles on COCO test-dev dataset.
  • HTC Baseline: The ResNet-50 baseline achieves 38.2 mask AP.
  • DCN: Deformable convolution is adopted in the last stage (res5) of the backbone.
  • SyncBN: Synchronized Batch Normalization, as in MegDet, is used in the backbone and heads.
  • Multi-scale Training: In each iteration, the short edge is randomly sampled from [400, 1400], and the long edge is fixed at 1600.
  • SENet-154: Different backbones are tried besides ResNet-50, and SENet-154 achieves the best single-model performance among them.
  • GA-RPN: The trained detectors are fine-tuned with the proposals generated by GA-RPN [41], which achieves nearly 10% higher recall than RPN.
  • Multi-scale Testing: 5 scales as well as horizontal flip are used at test time to ensemble the results. Testing scales are (600, 900), (800, 1200), (1000, 1500), (1200, 1800), (1400, 2100).
  • Ensemble: An ensemble of five networks: SENet-154, ResNeXt-101 64×4d, ResNeXt-101 32×8d, DPN-107, and FishNet.
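The multi-scale training and testing settings from the list above can be sketched as follows. This is a minimal illustration under stated assumptions: the short edge is drawn as a uniform integer, each test scale is read as (short edge, long edge), and every test scale is evaluated both with and without horizontal flip, which gives 10 views. The function names are hypothetical.

```python
import random
from typing import List, Tuple

Scale = Tuple[int, int]  # (short edge, long edge)

TEST_SCALES: List[Scale] = [
    (600, 900), (800, 1200), (1000, 1500), (1200, 1800), (1400, 2100),
]


def sample_train_scale(
    rng: random.Random,
    short_range: Tuple[int, int] = (400, 1400),
    long_edge: int = 1600,
) -> Scale:
    """Draw a training scale: random short edge, fixed long edge."""
    low, high = short_range
    return (rng.randint(low, high), long_edge)  # randint is inclusive on both ends


def test_time_views(scales: List[Scale] = TEST_SCALES) -> List[Tuple[Scale, bool]]:
    """Enumerate (scale, horizontal_flip) pairs used for test-time augmentation."""
    return [(scale, flip) for scale in scales for flip in (False, True)]


rng = random.Random(0)  # seeded only to make the example repeatable
train_scale = sample_train_scale(rng)
views = test_time_views()
```

The predictions from all views would then be merged into a single result; how exactly they are merged is not covered by this sketch.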
Extensive study on related modules on COCO 2017 val.
  • Many other components were also tried to see whether they bring any improvement, such as ASPP from DeepLabv3, GCN, and SoftNMS, as shown above.

3.4. Qualitative Results

Examples of segmentation results on COCO dataset.


