Review — Learning to Segment Images with Classification Labels

Use Image-Level Labels to Help Biomedical Image Segmentation

Sik-Ho Tsang
6 min read · Sep 27, 2022

Learning to Segment Images with Classification Labels
Ciga JMEDIA’21, by University of Toronto and Sunnybrook Research Institute
2021 JMEDIA
Image Classification, Medical Image Segmentation, Weakly Supervised Learning

  • An architecture is proposed that can alleviate the requirements for segmentation-level ground truth by making use of image-level labels to reduce the amount of time spent on data curation.


  1. Architecture
  2. Training Strategy
  3. Datasets & Metrics
  4. Experimental Results

1. Architecture

Proposed architecture based on ResNet-18
  • Blue arrows indicate the residual connections of ResNet-18, the red squares are the max pooling operations, and the blue square is the unpooling (spatial up-sampling) operation.
  • After each convolutional layer (block) of the ResNet architecture, ReLU activation and batch normalization are applied.

The proposal is a simple modification of the existing ResNet architecture that performs segmentation and classification simultaneously, leveraging easier-to-label classification patches to improve segmentation performance when only a small amount of labeled segmentation data is available.

2. Training Strategy

Overview of the training procedure for each batch
  • Two alternating steps are performed on each batch, using pixel-level data (images with segmentation masks) and image-level data (images with only classification labels):
  • Step 1: Pixel-level images are used to train the segmentation network (encoder+decoder) on input images and segmentation masks with the standard backpropagation algorithm, without passing through the classification layer.
  • Step 2: The data with only image-level labels are passed through the segmentation network to obtain a segmentation mask output, which is then transformed by the classification layer to obtain the classification output vector in R^C, where C is the number of classes.
  • This vector is used for backpropagation with cross entropy loss as an error signal to update segmentation network weights to correct the segmentation mask for the given image.

Cross entropy loss is applied pixel-wise for segmentation, and per item for classification.

  • The network is simultaneously optimized by the classification loss Lcls and the segmentation loss Lseg.
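The two loss paths described above can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the review does not say how the classification layer turns a mask into a class vector, so global average pooling of the per-pixel logits is used here purely as an assumption, and all function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Cross entropy of one logit vector against an integer class index."""
    return -math.log(softmax(logits)[target])

def segmentation_loss(pixel_logits, mask):
    """Step 1: mean pixel-wise cross entropy (Lseg).
    pixel_logits[y][x] holds C logits; mask[y][x] is a class index."""
    losses = [cross_entropy(pixel_logits[y][x], mask[y][x])
              for y in range(len(mask)) for x in range(len(mask[0]))]
    return sum(losses) / len(losses)

def classification_layer(pixel_logits):
    """ASSUMPTION: global average pooling of per-pixel logits into R^C."""
    pixels = [p for row in pixel_logits for p in row]
    num_classes = len(pixels[0])
    return [sum(p[c] for p in pixels) / len(pixels) for c in range(num_classes)]

def classification_loss(pixel_logits, label):
    """Step 2: per-item cross entropy on the pooled class vector (Lcls)."""
    return cross_entropy(classification_layer(pixel_logits), label)

# Toy 2x2 segmentation output with C = 2 classes.
pixel_logits = [[[2.0, 0.0], [2.0, 0.0]],
                [[0.0, 2.0], [2.0, 0.0]]]
mask = [[0, 0], [1, 0]]          # pixel-level ground truth (Step 1)
l_seg = segmentation_loss(pixel_logits, mask)
l_cls = classification_loss(pixel_logits, 0)  # image-level label 0 (Step 2)
```

In both steps the gradient would flow back into the segmentation network; only the source of the error signal differs (a mask in Step 1, a single label in Step 2).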

3. Datasets & Metrics

Sample images from each dataset

3.1. ICIAR BACH 2018

  • The BreAst Cancer Histology images (BACH) dataset covers breast cancer histology.
  • The challenge is split into two parts, A and B.
  • For part A, the aim is to classify each microscopy image into four classes: normal tissue, benign tissue, ductal carcinoma in situ (DCIS), and invasive carcinoma.
  • For part B, the task is to predict the pixel-wise labeling of whole-slide images (WSIs) into the same four classes, i.e., the segmentation of the WSI.
  • The dataset consists of 400 training and 100 test microscopy images, where each image has a single label, and 20 labeled WSIs with segmentation masks (split into 10 training and 10 testing images).

3.2. Gleason2019

  • The task is grading prostate cancer by Gleason grade, which ranges from 1 (healthy) to 5 (abnormal).
  • The Gleason2019 challenge consists of 244 tissue micro-array (TMA) images and their corresponding pixel-level annotations detailing the Gleason grade of each region on an image.

3.3. DigestPath2019

  • The dataset consists of 660 image patches with binary pixel-level masks (benign and malignant) from 324 WSIs scanned at 20× resolution. The average size of each image patch is 5000×5000 pixels; patches are resized to 1024×1024 for the experiments in this paper.

3.4. Metrics

  • Two variants of F1 scores, called the macro and micro F1, are used.
  • Both metrics are calculated class-wise. The macro F1 weighs each class-wise score equally, whereas the micro F1 accounts for class imbalance by weighing each class-wise score by its ground-truth ratio on the WSI.
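The two averages can be sketched as follows. This is a minimal sketch that follows the wording of this review (class-wise F1, then an equal or ground-truth-weighted average); the function names are hypothetical and the paper's evaluation code may differ. Note that "micro F1" more commonly refers to an F1 computed from counts pooled across classes, so the weighted form below is an interpretation of the description given here.

```python
def classwise_f1(y_true, y_pred, classes):
    """F1 score of each class from flat lists of pixel labels."""
    scores = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores

def macro_f1(y_true, y_pred, classes):
    """Every class-wise score counts equally."""
    scores = classwise_f1(y_true, y_pred, classes)
    return sum(scores.values()) / len(classes)

def micro_f1(y_true, y_pred, classes):
    """Each class-wise score is weighted by its ground-truth ratio."""
    scores = classwise_f1(y_true, y_pred, classes)
    n = len(y_true)
    return sum(scores[c] * y_true.count(c) / n for c in classes)

# Imbalanced toy example: 8 pixels of class 0, 2 pixels of class 1.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
macro = macro_f1(y_true, y_pred, [0, 1])
micro = micro_f1(y_true, y_pred, [0, 1])
```

On this example the weighted score is higher than the macro score, because the well-predicted majority class dominates the weighting.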

3.5. Settings

  • For ICIAR BACH 2018, 3 WSIs from part B and all of part A are used for training.
  • For Gleason2019 and DigestPath2019, 50% of the dataset is used as training set, 25% as validation, and the remaining 25% as the test set.
Tile extraction for DigestPath2019 and Gleason2019
  • Each segmentation mask is split into tiles of size 128×128 pixels.
  • If a dominant class covers ≥90% of a tile, the tile is treated as a classification patch (dashed green boxes).
  • A tile that contains only two classes is treated as a segmentation patch (solid green boxes).
  • Any other tile is ignored.
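The tile rules above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, the dominant-class rule is checked before the two-class rule (an assumption, since the review does not state the order), and partial tiles at the border are dropped (also an assumption).

```python
from collections import Counter

TILE = 128  # tile side length in pixels, as stated above

def label_tile(tile, dominant_ratio=0.9):
    """Return ('classification', cls), ('segmentation', None) or ('ignored', None)."""
    counts = Counter(v for row in tile for v in row)
    total = sum(counts.values())
    cls, n = counts.most_common(1)[0]
    if n / total >= dominant_ratio:   # one class covers >= 90% of the tile
        return ("classification", cls)
    if len(counts) == 2:              # exactly two classes present
        return ("segmentation", None)
    return ("ignored", None)

def split_into_tiles(mask, size=TILE):
    """Yield non-overlapping size x size tiles of a 2D mask (list of rows)."""
    h, w = len(mask), len(mask[0])
    for y in range(0, h - size + 1, size):
        for x in range(0, w - size + 1, size):
            yield [row[x:x + size] for row in mask[y:y + size]]

# Toy 4x8 mask cut into two 4x4 tiles (a small size keeps the example readable).
mask = [[0, 0, 0, 0, 0, 0, 1, 1] for _ in range(4)]
labels = [label_tile(t) for t in split_into_tiles(mask, size=4)]
```

Here the left tile contains a single class and becomes a classification patch, while the right tile is split evenly between two classes and becomes a segmentation patch.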

S (only segmentation): s% of the segmentation patches are used.

S+C (segmentation + classification): s% of the segmentation patches and (100−s)% of the classification patches are used, where s ∈ {0, 1, 2.5, 5, 7.5, 10, 15, 20, 25, 30, 40, 50, 75, 100}.

S+C*: s% of the segmentation patches and 100% of the classification patches are used.

  • For s=0%, only classification patches are used, hence for the S setting, the network predicts random outputs. For S+C and S+C∗ settings, s=0% reduces to a classifier which predicts one value per patch.
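As a quick check of the three settings, the share of each patch type used for a given s can be written as a small helper. The function name `patch_budget` is hypothetical, and the S+C split assumes s% segmentation patches alongside (100−s)% classification patches.

```python
def patch_budget(setting, s):
    """Return (segmentation %, classification %) used for training."""
    if not 0 <= s <= 100:
        raise ValueError("s must lie in [0, 100]")
    if setting == "S":        # segmentation patches only
        return (s, 0)
    if setting == "S+C":      # classification share shrinks as s grows
        return (s, 100 - s)
    if setting == "S+C*":     # always the full classification set
        return (s, 100)
    raise ValueError(f"unknown setting: {setting}")

budgets = {name: patch_budget(name, 10) for name in ("S", "S+C", "S+C*")}
```

With s=0, the S setting has no training data at all, which matches the remark above that the network then predicts random outputs.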

4. Experimental Results

4.1. Classification

  • Though classification accuracy is not the focus, the authors also run classification experiments.
Classification results using the classification head on 50% of the classification patches.
Accuracy results for the classification task for the three datasets (These results are normalized to c=50%)
  • S2+C2 (Blue): (100−2c)% of segmentation patches and c% of classification patches are used.
  • S2+C∗2 (Purple): (100−2c)% of segmentation patches and 50% of classification patches are used.
  • S∗2+C2 (Black): 100% of segmentation patches are used, and the share of classification patches is varied from 0 to 50%.

The segmentation task does not improve the classification task’s performance on any dataset. The best performance is obtained when 0% of the segmentation patches are used.

  • With 50% of the classification patches, any addition of segmentation patches decreases the performance.

Therefore, it is concluded that features obtained by training a network for segmentation are not useful for classification.

4.2. Segmentation

Comparison of training performance between using only segmentation (S) patches, both segmentation and classification (S+C) images, and varying the amount of segmentation patches while using the complete set of classification patches (S+C∗) (These results are normalized to s=100%)
  • When ≤10% of segmentation patches are used, there is a significant performance gap (≥15% for both F1 metrics) between the proposed method (S+C or S+C∗) and the S setting.

The method can work in either low or high data settings. In other words, the method works in the low data setting, whereas the S setting does not.

4.3. SOTA Comparison

  • For SOTA approaches, Ciga et al. (2019) achieve a challenge-specific score of 68% on the ICIAR BACH 2018, whereas the proposed method obtains a score of 54%.
  • Li et al. (2020) achieve 67.9% Dice (F1) score with a U-Net on DigestPath2019, whereas the proposed method only achieves 42%.
  • Finally, Zhang et al. (2020) obtain a 75% Dice score, compared with 41% for the proposed method.

The authors argue that the large performance gap is due to over-engineering for the specific dataset. In addition, the authors use only a small network and do not perform dataset-specific fine-tuning.

By making use of classification labels, segmentation mask labeling efforts can be reduced.


[2021 JMEDIA] [Ciga JMEDIA’21]
Learning to Segment Images with Classification Labels
