Brief Review — A Probabilistic U-Net for Segmentation of Ambiguous Images

Probabilistic U-Net, Using Conditional Variational Autoencoder (CVAE)

Sik-Ho Tsang
Nov 28, 2022

A Probabilistic U-Net for Segmentation of Ambiguous Images,
Probabilistic U-Net, by DeepMind, and German Cancer Research Center,
2018 NeurIPS, Over 300 Citations (Sik-Ho Tsang @ Medium)
Semantic Segmentation, Image Segmentation, Medical Image Analysis, Medical Imaging

  • A generative segmentation model based on a combination of a U-Net with a conditional variational autoencoder (CVAE) that is capable of efficiently producing an unlimited number of plausible hypotheses.


  1. Probabilistic U-Net
  2. Results

1. Probabilistic U-Net

The Probabilistic U-Net. (a) Sampling process (b) Training process.
  • The proposed network architecture is a combination of a conditional variational autoencoder (CVAE) with a U-Net.

1.1. (a) Sampling

  • The central component of the architecture is a low-dimensional latent space of size N (N=6 works best in the experiments). Each position in this space encodes a segmentation variant.
  • The ‘prior net’, parametrized by weights ω, estimates the probability of these variants for a given input image X. This prior probability distribution (called P in the following) is modelled as an axis-aligned Gaussian with mean μprior(X; ω) of size N and variance σprior(X; ω) of size N.
  • To predict a set of m segmentations, the network is applied m times to the same input image (only a small part of the network needs to be re-evaluated in each iteration). In each iteration i (from 1 to m), a random sample zi is drawn from P:

zi ∼ P(·|X) = N(μprior(X; ω), diag(σprior(X; ω)))
  • Then, zi is broadcast to an N-channel feature map with the same spatial shape as the segmentation map, and this feature map is concatenated to the last activation map of a U-Net (the U-Net is parametrized by weights θ). A function fcomb., composed of three successive 1×1 convolutions (ψ being the set of their weights), combines the information and maps it to the desired number of classes.
  • The output, Si, is the segmentation map corresponding to point zi in the latent space:

Si = fcomb.(fU-Net(X; θ), zi; ψ)
  • When drawing m samples for the same input image, the output of the prior net and the feature activations of the U-Net are reused. Only the function fcomb. needs to be re-evaluated m times.
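The sampling path above can be sketched in PyTorch. This is a minimal illustration only; the class name, channel sizes and shapes are assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class FComb(nn.Module):
    """fcomb: combines the U-Net's last feature map with a latent sample z
    via three 1x1 convolutions (channel sizes here are assumptions)."""
    def __init__(self, unet_channels=32, latent_dim=6, num_classes=2):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(unet_channels + latent_dim, unet_channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(unet_channels, unet_channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(unet_channels, num_classes, kernel_size=1),
        )

    def forward(self, unet_features, z):
        # Broadcast z (B, N) to an N-channel map matching the spatial size,
        # then concatenate with the U-Net features along the channel axis.
        b, n = z.shape
        h, w = unet_features.shape[-2:]
        z_map = z.view(b, n, 1, 1).expand(b, n, h, w)
        return self.convs(torch.cat([unet_features, z_map], dim=1))

# Drawing m samples: the prior (mu, sigma) and the U-Net features are computed
# once; only fcomb is re-evaluated per sample.
mu, sigma = torch.zeros(1, 6), torch.ones(1, 6)   # from the prior net (placeholder)
features = torch.randn(1, 32, 64, 64)             # last U-Net activation (placeholder)
f_comb = FComb()
segmentations = []
for _ in range(4):  # m = 4 samples
    z = mu + sigma * torch.randn_like(sigma)      # z_i ~ N(mu_prior, sigma_prior)
    segmentations.append(f_comb(features, z))
```

Note how the loop only re-runs the three cheap 1×1 convolutions, which is what makes drawing many hypotheses inexpensive.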

1.2. (b) Training

  • A ‘posterior net’, parametrized by weights ν, is introduced to learn to recognize a segmentation variant (given the raw image X and the ground truth segmentation Y) and to map it to a position μpost(X, Y; ν) with some uncertainty σpost(X, Y; ν) in the latent space. The output is denoted as the posterior distribution Q.
  • A sample z is drawn from this distribution:

z ∼ Q(·|X, Y) = N(μpost(X, Y; ν), diag(σpost(X, Y; ν)))
  • The networks are trained with the standard training procedure for conditional VAEs, by minimizing the variational lower bound:

L(Y, X) = Ez∼Q(·|Y, X)[−log Pc(Y | S(X, z))] + β · KL(Q(z|Y, X) ‖ P(z|X))

where a cross-entropy loss is used to penalize differences between the predicted segmentation S and the ground truth segmentation Y.

And there is a Kullback-Leibler divergence which penalizes differences between the posterior distribution Q and the prior distribution P. During training, this KL loss “pulls” the posterior distribution (which encodes a segmentation variant) and the prior distribution towards each other.
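The training objective (cross-entropy on a posterior sample's segmentation, plus the KL term pulling Q towards P) can be sketched as follows. The function and argument names are assumptions for illustration, not the authors' code:

```python
import torch
import torch.nn.functional as F
import torch.distributions as dist

def probabilistic_unet_loss(logits, target,
                            mu_post, sigma_post,
                            mu_prior, sigma_prior, beta=1.0):
    """Variational lower bound: cross-entropy between the predicted
    segmentation S (from a posterior sample z) and the ground truth Y,
    plus beta * KL(Q || P) between the axis-aligned Gaussians."""
    # Reconstruction term: pixel-wise cross-entropy.
    ce = F.cross_entropy(logits, target)
    # KL term between posterior Q and prior P (diagonal Gaussians).
    q = dist.Independent(dist.Normal(mu_post, sigma_post), 1)
    p = dist.Independent(dist.Normal(mu_prior, sigma_prior), 1)
    kl = dist.kl_divergence(q, p).mean()
    return ce + beta * kl
```

During training the segmentation logits are produced from a sample z ∼ Q, so the posterior net learns to encode the variant present in the ground truth, while the KL term keeps P close enough to Q that sampling from the prior at test time yields plausible variants.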

2. Results

2.1. Metric

  • The generalized energy distance, which leverages distances between observations, is used:

D²GED(Pgt, Pout) = 2 E[d(S, Y)] − E[d(S, S′)] − E[d(Y, Y′)]
  • where d is a distance measure, Y and Y′ are independent samples from the ground truth distribution Pgt, and similarly, S and S′ are independent samples from the predicted distribution Pout.
  • d(x, y) = 1 − IoU(x, y) is used as the distance measure.
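The metric above can be estimated from finite sample sets by averaging over all pairs. A NumPy sketch (the function names are mine, and IoU of two empty masks is taken as 1 by convention, an assumption the paper does not spell out):

```python
import numpy as np

def iou(a, b):
    """IoU between two binary masks; defined as 1 when both are empty."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 1.0

def generalized_energy_distance(gt_samples, pred_samples):
    """D^2_GED = 2 E[d(S, Y)] - E[d(S, S')] - E[d(Y, Y')] with
    d(x, y) = 1 - IoU(x, y), estimated over all sample pairs."""
    d = lambda x, y: 1.0 - iou(x, y)
    cross = np.mean([d(s, y) for s in pred_samples for y in gt_samples])
    within_pred = np.mean([d(s, s2) for s in pred_samples for s2 in pred_samples])
    within_gt = np.mean([d(y, y2) for y in gt_samples for y2 in gt_samples])
    return 2 * cross - within_pred - within_gt
```

The distance is zero when the predicted set of segmentations matches the ground truth set, and grows when the model's samples fail to cover the annotators' variants.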

2.2. Baseline

Baseline Architecture: (a) Dropout U-Net. (b) U-Net Ensemble. (c) M-Heads. (d) Image2Image VAE.
  • (a) Dropout U-Net: Dropout with probability p=0.5 is applied to the incoming layers of the three inner-most encoder and decoder blocks.
  • (b) U-Net Ensemble: Model ensemble using U-Net.
  • (c) M-Heads: M heads are branched off after the last layer of a deep net.
  • (d) Image2Image VAE: employs a prior that is not conditioned on the input image (a fixed normal distribution) and a posterior net that is not conditioned on the input either.

2.3. Qualitative Results

Qualitative results. (a) Lung CT Scans (b) Cityscapes.

2.4. Quantitative Results

Comparison of approaches using the squared energy distance. (Lower, Better)
  • Left: The energy distance on the lung abnormalities test set (1992 images) decreases for all models as more samples are drawn.
  • The Probabilistic U-Net outperforms all baselines when sampling 4, 8 and 16 times. Its performance at 16 samples is significantly better than that of the baselines.
  • Right: The Probabilistic U-Net on the Cityscapes task outperforms the baseline methods when sampling 4, 8 and 16 times in terms of the energy distance.


[2018 NeurIPS] [Probabilistic U-Net]
A Probabilistic U-Net for Segmentation of Ambiguous Images

1.6. Semantic Segmentation / Scene Parsing

2015 … 2018 [Probabilistic U-Net] … 2021 [PVT, PVTv1] [SETR] 2022 [PVTv2]

4.2. Biomedical Image Segmentation

2015 … 2018 [Probabilistic U-Net] … 2020 [MultiResUNet] [UNet 3+] [Dense-Gated U-Net (DGNet)]

My Other Previous Paper Readings


