Brief Review — A Probabilistic U-Net for Segmentation of Ambiguous Images
Probabilistic U-Net, Using Conditional Variational Autoencoder (CVAE)
A Probabilistic U-Net for Segmentation of Ambiguous Images,
Probabilistic U-Net, by DeepMind and German Cancer Research Center,
2018 NeurIPS, Over 300 Citations (Sik-Ho Tsang @ Medium)
Semantic Segmentation, Image Segmentation, Medical Image Analysis, Medical Imaging
- A generative segmentation model based on a combination of a U-Net with a conditional variational autoencoder (CVAE) that is capable of efficiently producing an unlimited number of plausible hypotheses.
Outline
- Probabilistic U-Net
- Results
1. Probabilistic U-Net
- The proposed network architecture is a combination of a conditional variational autoencoder (CVAE) with a U-Net.
1.1. (a) Sampling
- The central component of the architecture is a low-dimensional latent space of size N (e.g., N=6 works best). Each position in this space encodes a segmentation variant.
- The ‘prior net’, parametrized by weights ω, estimates the probability of these variants for a given input image X. This prior probability distribution (called P in the following) is modelled as an axis-aligned Gaussian with mean μprior(X; ω) and variance σprior(X; ω), each of size N.
- To predict a set of m segmentations, the network is applied m times to the same input image (only a small part of the network needs to be re-evaluated in each iteration). In each iteration i (from 1 to m), a random sample zi is drawn from P:
zi ~ P(·|X) = N(μprior(X; ω), diag(σprior(X; ω)))
- Then, zi is broadcast to an N-channel feature map with the same spatial shape as the segmentation map, and this feature map is concatenated to the last activation map of a U-Net (the U-Net is parameterized by weights θ). A function fcomb., composed of three subsequent 1×1 convolutions (ψ being the set of their weights), combines the information and maps it to the desired number of classes.
- The output, Si, is the segmentation map corresponding to point zi in the latent space:
Si = fcomb.(fU-Net(X; θ), zi; ψ)
- When drawing m samples for the same input image, the output of the prior net and the feature activations of the U-Net are reused; only the function fcomb. needs to be re-evaluated m times (a minimal sketch of this sampling path follows below).
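Below is a minimal PyTorch sketch of this sampling path. The names (f_comb, sample_segmentations, C_UNET) and sizes are illustrative assumptions, and dummy tensors stand in for the real U-Net and prior-net backbones; it only shows how a latent sample is broadcast, concatenated to the last U-Net feature map, and mapped to class logits by three 1×1 convolutions.

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration only.
N = 6            # latent space dimension
C_UNET = 32      # channels of the last U-Net activation map (assumed)
NUM_CLASSES = 2  # desired number of output classes

# f_comb: three subsequent 1x1 convolutions that fuse the broadcast latent
# sample with the last U-Net feature map and map it to class logits.
f_comb = nn.Sequential(
    nn.Conv2d(C_UNET + N, C_UNET, kernel_size=1), nn.ReLU(),
    nn.Conv2d(C_UNET, C_UNET, kernel_size=1), nn.ReLU(),
    nn.Conv2d(C_UNET, NUM_CLASSES, kernel_size=1),
)

def sample_segmentations(unet_features, mu_prior, sigma_prior, m=16):
    """Draw m segmentation hypotheses for one batch of images.

    unet_features: (B, C_UNET, H, W) last activation map of the U-Net
    mu_prior, sigma_prior: (B, N) outputs of the prior net for the same images
    """
    B, _, H, W = unet_features.shape
    outputs = []
    for _ in range(m):
        # z_i ~ N(mu_prior, diag(sigma_prior^2)), one latent vector per image
        z = mu_prior + sigma_prior * torch.randn_like(sigma_prior)   # (B, N)
        # Broadcast z_i to an N-channel map with the segmentation's spatial shape
        z_map = z[:, :, None, None].expand(B, z.shape[1], H, W)
        # Concatenate and fuse with the 1x1 convolutions; only f_comb is re-run
        outputs.append(f_comb(torch.cat([unet_features, z_map], dim=1)))
    return torch.stack(outputs, dim=0)   # (m, B, NUM_CLASSES, H, W)

# Usage with dummy tensors standing in for the U-Net and prior-net outputs:
feats = torch.randn(1, C_UNET, 64, 64)
mu, sigma = torch.zeros(1, N), torch.ones(1, N)
print(sample_segmentations(feats, mu, sigma, m=4).shape)  # (4, 1, 2, 64, 64)
```

Because the U-Net features and the prior parameters are computed once per image, each additional hypothesis only costs another pass through the three 1×1 convolutions.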
1.2. (b) Training
- A ‘posterior net’, parametrized by weights ν, is introduced to learn to recognize a segmentation variant (given the raw image X and the ground truth segmentation Y) and to map it to a position μpost(X, Y; ν) with some uncertainty σpost(X, Y; ν) in the latent space. The resulting distribution is denoted the posterior distribution Q.
- A sample z is drawn from this distribution:
z ~ Q(·|X, Y) = N(μpost(X, Y; ν), diag(σpost(X, Y; ν)))
- The networks are trained with the standard training procedure for conditional VAEs, by minimizing the variational lower bound:
L(Y, X) = E_z~Q(·|X,Y)[−log Pc(Y|S(X, z))] + β · DKL(Q(z|X, Y) || P(z|X))
where a cross-entropy loss is used to penalize differences between the predicted segmentation S and the ground truth segmentation Y.
A Kullback–Leibler divergence term penalizes differences between the posterior distribution Q and the prior distribution P. During training, this KL loss “pulls” the posterior distribution (which encodes a segmentation variant) and the prior distribution towards each other (a sketch of this objective follows below).
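A minimal sketch of this objective, assuming a PyTorch setup where the posterior and prior nets have already produced their Gaussian parameters and the segmentation logits come from a posterior sample (the function name, shapes, and β weighting here are illustrative assumptions, not the authors' code):

```python
import torch
import torch.nn.functional as F
from torch.distributions import Independent, Normal, kl_divergence

def probabilistic_unet_loss(logits, target, mu_post, sigma_post,
                            mu_prior, sigma_prior, beta=1.0):
    """Variational objective for one training step.

    logits: (B, classes, H, W) segmentation obtained from a posterior sample z ~ Q
            (during training z is drawn with the reparameterization trick, e.g.
            z = q.rsample(), so gradients flow into the posterior net)
    target: (B, H, W) integer ground-truth labels Y
    mu/sigma_{post,prior}: (B, N) parameters of the axis-aligned Gaussians Q and P
    """
    q = Independent(Normal(mu_post, sigma_post), 1)    # posterior Q(.|X, Y)
    p = Independent(Normal(mu_prior, sigma_prior), 1)  # prior P(.|X)

    # Cross-entropy penalizes differences between prediction S and ground truth Y
    ce = F.cross_entropy(logits, target)
    # KL(Q || P) pulls the posterior and prior distributions towards each other
    kl = kl_divergence(q, p).mean()
    return ce + beta * kl

# Usage with dummy tensors (B=2, N=6, 3 classes, 32x32 maps):
B, N = 2, 6
logits = torch.randn(B, 3, 32, 32)
target = torch.randint(0, 3, (B, 32, 32))
mu_q, s_q = torch.randn(B, N), torch.rand(B, N) + 0.1
mu_p, s_p = torch.randn(B, N), torch.rand(B, N) + 0.1
print(probabilistic_unet_loss(logits, target, mu_q, s_q, mu_p, s_p))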
2. Results
2.1. Metric
- The generalized energy distance, which leverages distances between observations, is used:
D²GED(Pgt, Pout) = 2 E[d(S, Y)] − E[d(S, S′)] − E[d(Y, Y′)]
- where d is a distance measure, Y and Y’ are independent samples from the ground truth distribution Pgt, and similarly, S and S’ are independent samples from the predicted distribution Pout.
- d(x, y) = 1 − IoU(x, y) is used as the distance measure (see the sketch below).
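A minimal sketch of this metric for binary masks, assuming the empirical expectations are simply averaged over all pairs of the given samples (function names and inputs are illustrative):

```python
import numpy as np

def iou_distance(a, b):
    """d(x, y) = 1 - IoU(x, y) for binary masks."""
    a, b = a.astype(bool), b.astype(bool)
    union = np.logical_or(a, b).sum()
    if union == 0:          # both masks empty -> treat as identical
        return 0.0
    return 1.0 - np.logical_and(a, b).sum() / union

def generalized_energy_distance(samples, ground_truths):
    """D^2_GED = 2 E[d(S, Y)] - E[d(S, S')] - E[d(Y, Y')].

    samples: predicted masks drawn from P_out
    ground_truths: masks drawn from P_gt (e.g. the available grader annotations)
    """
    cross = np.mean([iou_distance(s, y) for s in samples for y in ground_truths])
    within_s = np.mean([iou_distance(s, s2) for s in samples for s2 in samples])
    within_y = np.mean([iou_distance(y, y2) for y in ground_truths for y2 in ground_truths])
    return 2 * cross - within_s - within_y

# Usage with random binary masks standing in for predictions and graders:
rng = np.random.default_rng(0)
preds = [rng.random((64, 64)) > 0.5 for _ in range(4)]
gts = [rng.random((64, 64)) > 0.5 for _ in range(4)]
print(generalized_energy_distance(preds, gts))
```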
2.2. Baselines
- (a) Dropout U-Net: Dropout with probability p=0.5 is applied to the incoming layers of the three inner-most encoder and decoder blocks; multiple predictions are sampled by keeping dropout active at test time (a test-time sampling sketch follows this list).
- (b) U-Net Ensemble: An ensemble of separately trained U-Nets.
- (c) M-Heads: M heads are branched off after the last layer of a deep net.
- (d) Image2Image VAE: Employs a prior that is not conditioned on the input image (a fixed normal distribution) and a posterior net that is not conditioned on the input image either.
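For baseline (a), a minimal sketch of how hypotheses are drawn by keeping dropout stochastic at inference time; the tiny convolution stack below is only a stand-in for the actual Dropout U-Net and its layer placement.

```python
import torch
import torch.nn as nn

# A tiny convolution stack stands in for the real Dropout U-Net.
model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.Dropout2d(p=0.5),
    nn.Conv2d(16, 2, kernel_size=1),
)

def dropout_samples(model, image, m=8):
    """Draw m hypotheses by keeping dropout stochastic at inference time."""
    model.train()            # keep dropout layers active
    with torch.no_grad():
        return torch.stack([model(image) for _ in range(m)], dim=0)

print(dropout_samples(model, torch.randn(1, 1, 64, 64), m=4).shape)  # (4, 1, 2, 64, 64)
```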
2.3. Qualitative Results
2.4. Quantitative Results
- Left: The energy distance on the lung abnormalities test set (1992 images) decreases for all models as more samples are drawn.
- The Probabilistic U-Net outperforms all baselines when sampling 4, 8 and 16 times. Its performance at 16 samples is significantly better than that of the baselines.
- Right: The Probabilistic U-Net on the Cityscapes task outperforms the baseline methods when sampling 4, 8 and 16 times in terms of the energy distance.
Reference
[2018 NeurIPS] [Probabilistic U-Net]
A Probabilistic U-Net for Segmentation of Ambiguous Images
1.6. Semantic Segmentation / Scene Parsing
2015 … 2018 [Probabilistic U-Net] … 2021 [PVT, PVTv1] [SETR] 2022 [PVTv2]
4.2. Biomedical Image Segmentation
2015 … 2018 [Probabilistic U-Net] … 2020 [MultiResUNet] [UNet 3+] [Dense-Gated U-Net (DGNet)]