Review: DIS — Dual Image Segmentation (Semantic Segmentation)

Fewer Training Images, Outperforms DeepLabv2, CRF-RNN, SegNet, FCN

In this story, DIS (Dual Image Segmentation), by Sun Yat-Sen University, The Chinese University of Hong Kong, and SenseTime Group (Limited), is reviewed. By using image-level classification tags and a reconstructed image, iterative inference can be performed during both training and testing to improve segmentation accuracy. DIS was published in 2017 ICCV and has more than 30 citations. (Sik-Ho Tsang @ Medium)


  1. Different Kinds of Semi-Supervised Learning Settings
  2. VOC & IDW Datasets
  3. Network Overview
  4. Iterative Inference
  5. Training
  6. Experimental Results

1. Different Kinds of Semi-Supervised Learning Settings

Different kinds of semi-supervised learning settings, where the labelmap L can be missing
  • As shown above, I is the image, L is the pixel-wise segmentation labelmap, and T is the image-level tag.
  • The labelmap L can be missing during training.
  • Thus, during training, some samples come with L and others with T, and both are used to train the segmentation CNN. There are 3 settings:
  • (a) treats L as a missing label in multitask learning.
  • (b) regards L as a latent variable that can be inferred from the tags T.
  • (c), the proposed approach in this paper, infers the missing label L not only by recovering clean tags T, but also by reconstructing the image to capture accurate object shape and boundary. Why the image is reconstructed is explained later.
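To make the three symbols concrete, here is a minimal sketch of a training sample in which L or T may be missing. The `Sample` class, the `available_supervision` helper, and the file names are hypothetical and not from the paper. The sketch also assumes that image reconstruction needs only the image I itself, so it is available for every sample.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    image: str                        # I: placeholder for the image (here a file name)
    labelmap: Optional[str] = None    # L: pixel-wise labelmap, may be missing
    tags: Optional[List[str]] = None  # T: image-level tags, may be missing


def available_supervision(sample: Sample) -> List[str]:
    """List which training signals a sample can contribute."""
    # Assumption: reconstruction compares the output with I, so it never needs L or T.
    signals = ["image reconstruction"]
    if sample.labelmap is not None:
        signals.append("labelmap prediction")
    if sample.tags is not None:
        signals.append("tag classification")
    return signals


voc_like = Sample(image="voc.jpg", labelmap="voc.png")        # has L, no T
idw_like = Sample(image="idw.jpg", tags=["person", "horse"])  # has T, no L

print(available_supervision(voc_like))
print(available_supervision(idw_like))
```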

2. VOC & IDW Datasets

  • The PASCAL VOC 2012 dataset provides only the labelmap L, not the image-level tag T.
  • The IDW (Image Description in the Wild) dataset is downloaded from the internet and provides only T, not L.
  • Different from IDW-CNN, object interaction is not utilized here.
  • (If interested, please read my review about IDW-CNN for IDW dataset.)

3. Network Overview

Network Overview, three subnets marked as ‘1’, ‘2’, and ‘3’ for labelmap prediction (blue), image reconstruction (green), and tag classification (pink)
  • There are 3 subnets: labelmap prediction (blue), image reconstruction (green), and tag classification (pink).
  • Given an image I, ResNet-101 produces a feature map of 2048×45×45 and a feature vector of 2048×1, denoted as u1 and v1 respectively.

3.1. Subnet-1

  • u1 and the upsampled v1 are concatenated to produce u2, so that the pixel-level features u1 can borrow information from the image-level features v1 to improve segmentation.
  • A 3×3 convolution is applied to u2 to produce u3, which represents the response maps of the 21 categories of VOC12.
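The tensor shapes in Subnet-1 can be traced with a small amount of arithmetic. This is only a sketch under stated assumptions: the concatenation is taken to be along the channel axis, and the 3×3 convolution is assumed to use stride 1 and padding 1 so that the 45×45 resolution is preserved. The function names are hypothetical.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Standard output-size formula for a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def subnet1_shapes(channels=2048, size=45, num_classes=21):
    """Trace (channels, height, width) through Subnet-1."""
    u1 = (channels, size, size)           # pixel-level feature map from ResNet-101
    v1 = (channels, 1, 1)                 # image-level feature vector
    v1_up = (v1[0], size, size)           # v1 repeated at every spatial position
    u2 = (u1[0] + v1_up[0], size, size)   # assumed channel-wise concatenation
    out = conv_output_size(size, kernel=3, stride=1, padding=1)  # assumed stride/padding
    u3 = (num_classes, out, out)          # one response map per category
    return {"u1": u1, "v1_up": v1_up, "u2": u2, "u3": u3}


shapes = subnet1_shapes()
for name, shape in shapes.items():
    print(name, shape)
```

Under these assumptions u2 has 4096 channels and u3 keeps one 45×45 response map for each of the 21 categories.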

3.2. Subnet-2

  • Taking u3 as input, Subnet-2 reconstructs the image, denoted as z3, using three stacked convolutional layers.

3.3. Subnet-3

  • First, u1 is average pooled.
  • Then, a feature vector v2 of length 2048 is produced by fusing v1 and the pooled u1. In this way, the image-level features are improved by the pixel-level features to facilitate tag classification.
  • v2 is projected into a response vector v3 of 21×1, where each entry indicates how likely a category is to be present in the image.
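The three steps of Subnet-3 can be sketched on a toy example with 2 channels and 3 categories instead of 2048 and 21. The review only says that v1 and u1 are "fused", so the element-wise sum used here is an assumption, as is the sigmoid that turns each response into an independent per-tag score. All function names and numbers are illustrative.

```python
import math


def global_average_pool(u):
    """Average each channel of a C x H x W nested list down to one number."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in u]


def fuse(v1, pooled_u1):
    # Assumption: element-wise sum; the actual fusion may differ.
    return [a + b for a, b in zip(v1, pooled_u1)]


def project(v2, weights):
    # One weight row per category; sigmoid gives an independent score per tag (assumed).
    return [1.0 / (1.0 + math.exp(-sum(w * x for w, x in zip(row, v2))))
            for row in weights]


u1 = [[[1.0, 3.0], [5.0, 7.0]],   # channel 0, mean 4.0
      [[0.0, 0.0], [2.0, 2.0]]]   # channel 1, mean 1.0
v1 = [0.5, -1.0]

v2 = fuse(v1, global_average_pool(u1))
v3 = project(v2, weights=[[0.0, 0.0], [1.0, 0.0], [-1.0, 0.0]])
print(v2)
print(v3)
```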

4. Iterative Inference

  • Because the image can be reconstructed in Subnet-2, iterative inference can be used to gradually improve the accuracy of the predicted labelmap. This is an important contribution of DIS.
  • It is achieved by minimizing the image reconstruction loss (in Subnet-2) with respect to the pixel-level and image-level features u1 and v1, while keeping the learned network parameters fixed.
  • More accurate results are obtained as the number of iterations t increases.
  • This iterative inference can be applied during training and testing.
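The core idea, updating the features rather than the weights so that the reconstruction gets closer to the input image, can be illustrated with a toy example. This is not the DIS network: the "decoder" here is a fixed linear map `W`, the loss is a plain squared error, and the step size and all numbers are assumptions chosen for illustration.

```python
def reconstruct(weights, u):
    """Fixed linear 'decoder': z = W u."""
    return [sum(w * x for w, x in zip(row, u)) for row in weights]


def reconstruction_loss(weights, u, image):
    return sum((z - i) ** 2 for z, i in zip(reconstruct(weights, u), image))


def refine_features(weights, u, image, steps, lr=0.05):
    """Gradient descent on the features u only; the weights are never updated."""
    u = list(u)
    losses = [reconstruction_loss(weights, u, image)]
    for _ in range(steps):
        residual = [z - i for z, i in zip(reconstruct(weights, u), image)]
        # d(loss)/du_j = 2 * sum_k W[k][j] * residual[k]
        grad = [2 * sum(weights[k][j] * residual[k] for k in range(len(weights)))
                for j in range(len(u))]
        u = [x - lr * g for x, g in zip(u, grad)]
        losses.append(reconstruction_loss(weights, u, image))
    return u, losses


W = [[1.0, 0.5], [0.0, 1.0], [0.5, 0.5]]   # stands in for the frozen parameters
image = [1.0, 2.0, 1.5]                    # stands in for the input image I
u0 = [0.0, 0.0]                            # stands in for the initial features

u_t, losses = refine_features(W, u0, image, steps=30)
print(losses[0], losses[-1])
```

In this toy setting the loss shrinks with every step, which mirrors the observation that results improve as t increases.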

5. Training

  • First, three components, ResNet-101, Subnet-1, and Subnet-3, are trained to predict labelmaps and tags.
  • Second, Subnet-2 is trained to reconstruct images while the parameters of the above components are frozen.
  • Finally, all four components are jointly updated.
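The three-stage schedule can be written down as plain data. This sketch only records which components are updated in each stage. The stage names and the `not_updated` helper are hypothetical, and no actual training is performed.

```python
COMPONENTS = ("ResNet-101", "Subnet-1", "Subnet-2", "Subnet-3")

TRAINING_STAGES = [
    # Stage 1: learn to predict labelmaps and tags.
    {"name": "stage 1", "trainable": {"ResNet-101", "Subnet-1", "Subnet-3"}},
    # Stage 2: learn to reconstruct images with everything else frozen.
    {"name": "stage 2", "trainable": {"Subnet-2"}},
    # Stage 3: update all four components jointly.
    {"name": "stage 3", "trainable": set(COMPONENTS)},
]


def not_updated(stage):
    """Components whose parameters stay fixed during the given stage."""
    return [c for c in COMPONENTS if c not in stage["trainable"]]


for stage in TRAINING_STAGES:
    print(stage["name"], "fixed:", not_updated(stage))
```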

6. Experimental Results

6.1. PASCAL VOC 2012

PASCAL VOC 2012 Test Set
  • In the bottom part of the above table, different numbers of inference iterations during training (ttr) and testing (tts) are tried.
  • With ttr=30 and tts=30, 86.8% mIoU is achieved, which is the best setting of DIS.
  • It also outperforms SegNet, FCN, CRF-RNN, and DeepLabv2.
  • ResNet-101 is the baseline model without any help from semi-supervised learning.
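For readers unfamiliar with the metric, mIoU (mean intersection over union) can be computed from a confusion matrix as sketched below. This follows the common definition of the metric and is not the official evaluation code. The 2-class confusion matrix is made up for illustration.

```python
def mean_iou(confusion):
    """confusion[i][j] = number of pixels of true class i predicted as class j."""
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]                               # correctly labeled pixels
        fn = sum(confusion[c]) - tp                        # class c pixels missed
        fp = sum(confusion[r][c] for r in range(n)) - tp   # pixels wrongly labeled c
        denom = tp + fp + fn
        if denom > 0:   # skip classes absent from both prediction and ground truth
            ious.append(tp / denom)
    return sum(ious) / len(ious)


confusion = [[8, 2],
             [1, 9]]
print(mean_iou(confusion))
```

Here class 0 has IoU 8/11 and class 1 has IoU 9/12, and the mIoU is the average of the two.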

6.2. Model Size and Complexity

Model Size and Complexity
  • From the above table, DIS has only 45.5M parameters and takes 140 ms for inference.
  • For DeepLabv2, the post-processing CRF step has not been counted.

6.3. IDW Test Set

IDW Test Set
  • The IDW test set has labelmaps, so it is also used for evaluation.
  • DIS achieves the best result, 59.8% mIoU.

6.4. Visualizations

Segmentation examples on VOC12 test set
  • In general, the predicted labelmaps capture object classes and boundaries better when more inference iterations are performed.
  • For example, the regions of ‘sofa’, ‘plant’, and ‘cat’ are correctly identified in the bottom-right labelmap.


[2017 ICCV] [DIS]
Deep Dual Learning for Semantic Image Segmentation

My Previous Reviews

Image Classification [LeNet] [AlexNet] [Maxout] [NIN] [ZFNet] [VGGNet] [Highway] [SPPNet] [PReLU-Net] [STN] [DeepImage] [SqueezeNet] [GoogLeNet / Inception-v1] [BN-Inception / Inception-v2] [Inception-v3] [Inception-v4] [Xception] [MobileNetV1] [ResNet] [Pre-Activation ResNet] [RiR] [RoR] [Stochastic Depth] [WRN] [Shake-Shake] [FractalNet] [Trimps-Soushen] [PolyNet] [ResNeXt] [DenseNet] [PyramidNet] [DRN] [DPN] [Residual Attention Network] [DMRNet / DFN-MR] [IGCNet / IGCV1] [MSDNet] [ShuffleNet V1] [SENet] [NASNet] [MobileNetV2]

Object Detection [OverFeat] [R-CNN] [Fast R-CNN] [Faster R-CNN] [MR-CNN & S-CNN] [DeepID-Net] [CRAFT] [R-FCN] [ION] [MultiPathNet] [NoC] [Hikvision] [GBD-Net / GBD-v1 & GBD-v2] [G-RMI] [TDM] [SSD] [DSSD] [YOLOv1] [YOLOv2 / YOLO9000] [YOLOv3] [FPN] [RetinaNet] [DCN]

Semantic Segmentation [FCN] [DeconvNet] [DeepLabv1 & DeepLabv2] [CRF-RNN] [SegNet] [ParseNet] [DilatedNet] [DRN] [RefineNet] [GCN] [PSPNet] [DeepLabv3] [LC] [FC-DenseNet] [IDW-CNN] [DIS] [SDN]

Biomedical Image Segmentation [CUMedVision1] [CUMedVision2 / DCAN] [U-Net] [CFS-FCN] [U-Net+ResNet] [MultiChannel] [V-Net] [3D U-Net] [M²FCN] [SA] [QSA+QNT] [3D U-Net+ResNet]

Instance Segmentation [SDS] [Hypercolumn] [DeepMask] [SharpMask] [MultiPathNet] [MNC] [InstanceFCN] [FCIS]

Super Resolution [SRCNN] [FSRCNN] [VDSR] [ESPCN] [RED-Net] [DRCN] [DRRN] [LapSRN & MS-LapSRN] [SRDenseNet]

Human Pose Estimation [DeepPose] [Tompson NIPS’14] [Tompson CVPR’15] [CPM]

Codec Post-Processing [ARCNN] [Lin DCC’16] [IFCNN] [Li ICME’17] [VRCNN] [DCAD] [DS-CNN]


