AdwU-Net: Adaptive Depth and Width U-Net for Medical Image Segmentation by Differentiable Neural Architecture Search

Neural Architecture Search (NAS) for Depth & Width in U-Net

Sik-Ho Tsang
5 min read · May 29, 2023

AdwU-Net: Adaptive Depth and Width U-Net for Medical Image Segmentation by Differentiable Neural Architecture Search,
AdwU-Net, by Shanghai Jiao Tong University, and National Medical Products Administration,
2022 MIDL


  • AdwU-Net is proposed, which is an efficient neural architecture search (NAS) framework to search the optimal task-specific depth and width in the U-Net backbone.
  • In each block, the optimal number of convolutional layers and channels in each layer are directly learned from data. To reduce the computational costs and alleviate the memory pressure, an efficient architecture search is used and the network weights are reused.


  1. AdwU-Net
  2. Results

1. AdwU-Net


1.1. Overall Idea

  • Each AdwBlock consists of three adaptive width blocks (AwBlock).
  • The AdwBlock is designed to search for the optimal number of AwBlocks, which is also the optimal depth of each block. The AwBlock is designed to search for the optimal channel number of its convolutional layer.
  • At the depth level, each block can choose from 1 to 3 convolutional layers.
  • For the width level, each convolutional layer can have 5 filter number options. Therefore, each AdwBlock has 5+5²+5³=155 different candidate architectures.
  • Consider the U-Net backbone with 11 blocks, then the AdwU-Net has 155¹¹≈10²⁴ candidate architectures which are impossible to explore manually.
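
The counts above can be checked with a few lines of Python. This is only a sanity check of the arithmetic; the constant and function names are illustrative and do not come from the paper.

```python
# Count candidate architectures in the depth/width search space described above.
WIDTH_OPTIONS = 5   # candidate channel numbers per convolutional layer
MAX_DEPTH = 3       # a block may use 1, 2, or 3 convolutional layers
NUM_BLOCKS = 11     # blocks in the U-Net backbone

def candidates_per_block(width_options=WIDTH_OPTIONS, max_depth=MAX_DEPTH):
    # A block with d layers has width_options ** d width configurations.
    return sum(width_options ** d for d in range(1, max_depth + 1))

per_block = candidates_per_block()
total = per_block ** NUM_BLOCKS

print(per_block)       # 155
print(f"{total:.2e}")  # on the order of 1e24
```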

1.2. Depth Search (Figure 1(a))

  • In the search procedure, the output of each resolution stage is the weighted sum of the outputs of different depth options.
  • A naive implementation would construct three independent parallel paths, but this increases GPU memory quadratically.
  • To avoid redundancy, only the deepest path is kept and weighted skip connections are added from the output of preceding layers to the sink point at the end of each block as shown in Figure 1(a).
  • The deeper path reuses the convolution weights of the shallower path.
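  • Let $\alpha_l^s$ be the architecture parameter of the lth depth option in stage s. Gumbel Softmax is employed to turn these parameters into weights $g_l^s$:

      $$g_l^s = \frac{\exp\big((\alpha_l^s + \epsilon_l^s)/\tau\big)}{\sum_{l'} \exp\big((\alpha_{l'}^s + \epsilon_{l'}^s)/\tau\big)}$$

  • where $\epsilon_l^s \sim \mathrm{Gumbel}(0, 1)$ is random noise following the Gumbel distribution and $\tau$ is a temperature parameter.
  • Given the input $x^s$, the output $y^s$ of stage s is the weighted sum of the output of each convolutional layer:

      $$y^s = \sum_{l} g_l^s \, o_l^s$$

  • where $o_l^s$ is the output of the lth layer in stage s.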

By doing so, the computational budget of the whole block during the search is roughly the same as computing the feature maps of the deepest path only once.
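
The depth search described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the names `sample_gumbel`, `gumbel_softmax`, and `adw_block_forward` are made up here, feature maps are stood in for by short lists of floats, and the "layers" are toy functions assumed to preserve the feature shape.

```python
import math
import random

def sample_gumbel(rng):
    # Gumbel(0, 1) noise via inverse transform: -log(-log(U)), U ~ Uniform(0, 1).
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep both logs finite
    return -math.log(-math.log(u))

def gumbel_softmax(alpha, tau, rng):
    # Softmax over (alpha + noise) / tau, shifted by the max for stability.
    logits = [(a + sample_gumbel(rng)) / tau for a in alpha]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def adw_block_forward(x, layers, alpha, tau, rng):
    # Only the deepest path is run, once. Each intermediate output reaches the
    # sink through a weighted skip connection, so the shallower depth options
    # reuse the same layers (and weights) as the deeper ones.
    weights = gumbel_softmax(alpha, tau, rng)
    out = [0.0] * len(x)
    h = x
    for g, layer in zip(weights, layers):
        h = layer(h)
        out = [o + g * v for o, v in zip(out, h)]
    return out, weights

# Toy stand-ins for three convolutional layers (assumption: each adds 1).
layers = [lambda h: [v + 1.0 for v in h]] * 3
rng = random.Random(0)
y, g = adw_block_forward([0.0, 0.0], layers, alpha=[0.0, 0.0, 0.0], tau=1.0, rng=rng)
print(g, y)
```

With these toy layers the three depth options output 1, 2, and 3, so each entry of `y` equals `g[0] + 2*g[1] + 3*g[2]` while the layer stack is evaluated only once.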

1.3. Width Search (Figure 1(b))

  • Inspired by FBNetV2 (Wan et al., 2020), convolutions with varying channel numbers are represented by convolutions with equal channel numbers multiplied by different channel masks. The weights of the different convolutions are thereby shared, which reduces computational cost and GPU memory consumption.
  • The output of the preceding layer, $o_{l-1}^s$, is the input to the lth layer in stage s.

The convolution is run only once, and its output is then multiplied by the weighted sum of the masks. Instance normalization (IN) and Leaky ReLU are applied after the convolutional layer.

  • Here, $M_{l,i}^s$ is a column vector that has ones in the leading i entries and zeros at the end, $g_{l,i}^s$ is the Gumbel weight parameter of the ith mask in the lth layer, and $\sigma$ denotes Leaky ReLU.
Width search space for each resolution stage.
  • There are 5 candidate channel numbers in each convolutional layer. The channel number at the first stage ranges from 16 to 48 with step 8.
  • When the resolution is reduced to half, all of the 5 candidate channel numbers double. The channel configurations in stages greater than 4 still follow the configuration of stage 4.
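
The width search space and the mask trick can be sketched as follows. This is a hedged illustration under stated assumptions: stages are numbered from 1, the function names are hypothetical, each channel is represented by a single float rather than a feature map, and instance normalization and Leaky ReLU are left out.

```python
def candidate_channels(stage, base=(16, 24, 32, 40, 48), cap_stage=4):
    # Candidates double each time the resolution halves, up to stage 4;
    # deeper stages reuse the stage-4 configuration.
    factor = 2 ** (min(stage, cap_stage) - 1)
    return [c * factor for c in base]

def channel_mask(num_active, max_channels):
    # Ones in the leading entries, zeros at the end.
    return [1.0] * num_active + [0.0] * (max_channels - num_active)

def weighted_mask(gumbel_weights, candidates):
    # Weighted sum of the candidate masks, one weight per candidate width.
    max_c = max(candidates)
    mask = [0.0] * max_c
    for g, c in zip(gumbel_weights, candidates):
        for j, m in enumerate(channel_mask(c, max_c)):
            mask[j] += g * m
    return mask

def apply_width_mask(features, mask):
    # The convolution is computed once at the maximum width; the mask then
    # scales each output channel (here one float stands in for a channel).
    return [f * m for f, m in zip(features, mask)]

print(candidate_channels(1))  # [16, 24, 32, 40, 48]
print(candidate_channels(4))  # [128, 192, 256, 320, 384]

# A one-hot weight vector selects exactly one width (24 channels here).
mask = weighted_mask([0.0, 1.0, 0.0, 0.0, 0.0], candidate_channels(1))
masked = apply_width_mask([1.0] * 48, mask)
print(sum(masked))  # 24.0
```

During the search the weights are soft, so the trailing channels are scaled down rather than removed outright; a one-hot weight recovers a single candidate width.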

1.4. Optimization

In each iteration, the network weights w are first fixed and the architecture parameters α and β are updated using trainA and trainB in succession.

Then the architecture parameters α and β are fixed and the network weight w is updated using trainA.
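
The alternating schedule can be written as a small skeleton. This is one possible reading of "in succession" (an architecture update on a trainA batch followed by one on a trainB batch), not a confirmed detail of the paper; `search_iteration`, `update_arch`, and `update_weights` are hypothetical names, and the update functions here only record the order of the calls.

```python
def search_iteration(batch_a, batch_b, update_arch, update_weights):
    # Step 1: weights w are frozen; the architecture parameters (alpha, beta)
    # are updated on a trainA batch and then on a trainB batch (assumed order).
    update_arch(batch_a)
    update_arch(batch_b)
    # Step 2: alpha and beta are frozen; weights w are updated on trainA.
    update_weights(batch_a)

# Stand-in update functions that just log what would be updated, and on what.
log = []
search_iteration(
    "trainA[0]",
    "trainB[0]",
    update_arch=lambda batch: log.append(("arch", batch)),
    update_weights=lambda batch: log.append(("weights", batch)),
)
print(log)
```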

  • The sum of Dice loss and cross-entropy loss is used as the loss function.
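
A minimal sketch of such a combined loss is given below. It is simplified to the binary (foreground/background) case on flat lists of per-voxel probabilities; the smoothing constants, the equal weighting of the two terms, and the function names are assumptions, since the text only states that the two losses are summed.

```python
import math

def soft_dice_loss(probs, targets, eps=1e-6):
    # 1 - Dice coefficient, computed on soft (probabilistic) predictions.
    inter = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)

def binary_cross_entropy(probs, targets, eps=1e-12):
    # Mean binary cross-entropy; probabilities are clamped to avoid log(0).
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)

def dice_ce_loss(probs, targets):
    # Unweighted sum of the two terms (assumed equal weighting).
    return soft_dice_loss(probs, targets) + binary_cross_entropy(probs, targets)

targets = [1, 0, 1]
print(dice_ce_loss([0.9, 0.1, 0.9], targets))
print(dice_ce_loss([0.6, 0.4, 0.6], targets))  # less confident, larger loss
```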

After searching, the optimal depth and width for each stage are obtained by argmax operation.
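
The argmax step can be sketched for a single block as follows. The parameter values are invented for illustration, `derive_block` is a hypothetical helper, and treating β as the per-layer width parameters is an assumption based on the notation above.

```python
def argmax(values):
    # Index of the largest value.
    return max(range(len(values)), key=lambda i: values[i])

def derive_block(alpha, beta, candidates):
    # alpha: one score per depth option (option k means k + 1 layers).
    # beta: one list of scores per layer, one score per candidate channel number.
    depth = argmax(alpha) + 1
    widths = [candidates[argmax(b)] for b in beta[:depth]]
    return depth, widths

depth, widths = derive_block(
    alpha=[0.1, 1.3, 0.4],
    beta=[
        [0.0, 0.2, 0.9, 0.1, 0.0],
        [1.1, 0.2, 0.0, 0.3, 0.1],
        [0.0, 0.0, 0.0, 0.0, 2.0],
    ],
    candidates=[16, 24, 32, 40, 48],
)
print(depth, widths)  # 2 [32, 16]
```

Only the first `depth` layers keep a width, because layers beyond the selected depth are dropped from the final network.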

  • The search procedure takes 2 days on 1 NVIDIA V100 GPU with 32GB memory. After searching, the network is retrained with the searched depth and width for 1000 epochs for validation.

2. Results

2.1. MSD Test Set

Comparison with state-of-the-art methods on the MSD challenge test set

AdwU-Net achieves the best performance in 6 of 10 tasks: Heart, Liver, Hippocampus, Lung, Pancreas, and Hepatic Vessel.

Overall, AdwU-Net achieves the best average Dice of 0.7803 among all methods without pre-training on the MSD leaderboard.

2.2. Ablation Studies

Effectiveness of Depth Search and Width Search

Using both depth search and width search obtains the best results.

Scaling U-Net

With lower computational cost, the searched models outperform the scaled models, which shows the effectiveness and efficiency of the proposed method.


