Review — 3-D RoI-Aware U-Net for Accurate and Efficient Colorectal Tumor Segmentation

3-D RU-Net, Segmentation for MR Slices with Colorectal Cancer

Sik-Ho Tsang
Feb 9, 2023
Typical examples of MR slices with colorectal cancer. The target areas lack characteristic shapes, intensity specificity, and positional priors.

3-D RoI-Aware U-Net for Accurate and Efficient Colorectal Tumor Segmentation,
3-D RU-Net, by Shanghai Jiao Tong University, Chinese University of Hong Kong, Sun Yat-sen University Cancer Center, and Imsight Medical Technology Company Ltd.
2021 TCYB, Over 40 Citations (Sik-Ho Tsang @ Medium)
Medical Imaging, Medical Image Analysis, Image Segmentation, U-Net

  • 3-D RoI-aware U-Net (3-D RU-Net) is proposed, which fully utilizes the global contexts covering large effective receptive fields.
  • Specifically, the proposed model consists of a global image encoder, which localizes RoIs based on global understanding, and a local region decoder, which operates on pyramid-shaped in-region global features, so that predictions can be made on large 3-D whole volumes.


  1. 3-D RU-Net Model Architecture
  2. Loss Functions, Ensemble, and Other Details
  3. Results

1. 3-D RU-Net Model Architecture

Illustration of 3-D RU-Net.
  • There are 4 main parts as shown above: Global Image Encoder, RoI Locator, RoI Pyramid Layer, and Local Region Decoder.

1.1. Global Image Encoder

  • Due to limited GPU memory, it is essential to carefully design the 3-D backbone feature extractor to avoid GPU memory overflow and overfitting.
  • Specifically, the encoder employs a stack of ResBlocks, as in ResNet, and MaxPooling layers to encode whole volume images.
  • Each residual block has three convolutional layers, three normalization layers, three ReLU layers, and a skip connection for better gradient flow.
  • Batch size of 1 and instance normalization are used.
  • Feature maps (FI, FII, FIII) are generated.

1.2. RoI Locator

  • The locator is designed as a module taking feature map FIII as input, consisting of a convolutional layer with kernel size 1 and the Sigmoid activation function.
  • To tackle the extremely imbalanced foreground-to-background ratio, the locator is trained toward a global Dice loss (DL), instead of sampling a fixed proportion of foreground and background or employing online hard example mining (OHEM).
  • A fast 3-D connectivity analysis is performed to compute desired bounding boxes formulated as BboxIII=(z3, y3, x3, d3, h3, w3), where (z3, y3, x3) denotes the starting coordinates and (d3, h3, w3) denotes the depth, height, and width of BboxIII in feature map FIII.
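The connectivity analysis step can be sketched as follows. This is only a minimal illustration, not the authors' implementation: it assumes the locator output is a nested list of probabilities indexed as prob[z][y][x], and the function name roi_bounding_boxes, the 0.5 threshold, and the 6-connectivity are all assumptions made for the example.

```python
from collections import deque

def roi_bounding_boxes(prob, threshold=0.5):
    """Return one (z, y, x, d, h, w) box per 6-connected foreground component.

    `prob` is a nested list indexed as prob[z][y][x]; the threshold and the
    6-connectivity are assumptions of this sketch.
    """
    D, H, W = len(prob), len(prob[0]), len(prob[0][0])
    seen = [[[False] * W for _ in range(H)] for _ in range(D)]
    neighbors = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    boxes = []
    for z in range(D):
        for y in range(H):
            for x in range(W):
                if seen[z][y][x] or prob[z][y][x] < threshold:
                    continue
                # Breadth-first flood fill over one connected component,
                # tracking the minimum and maximum coordinate on each axis.
                lo, hi = [z, y, x], [z, y, x]
                queue = deque([(z, y, x)])
                seen[z][y][x] = True
                while queue:
                    cz, cy, cx = queue.popleft()
                    lo = [min(lo[0], cz), min(lo[1], cy), min(lo[2], cx)]
                    hi = [max(hi[0], cz), max(hi[1], cy), max(hi[2], cx)]
                    for dz, dy, dx in neighbors:
                        nz, ny, nx = cz + dz, cy + dy, cx + dx
                        if (0 <= nz < D and 0 <= ny < H and 0 <= nx < W
                                and not seen[nz][ny][nx]
                                and prob[nz][ny][nx] >= threshold):
                            seen[nz][ny][nx] = True
                            queue.append((nz, ny, nx))
                # Starting coordinates followed by depth, height, and width.
                boxes.append((lo[0], lo[1], lo[2],
                              hi[0] - lo[0] + 1, hi[1] - lo[1] + 1, hi[2] - lo[2] + 1))
    return boxes
```

Each returned tuple follows the (z3, y3, x3, d3, h3, w3) layout described above. Because FIII is a heavily downsampled map, even a plain flood fill like this one touches relatively few voxels.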

1.3. RoI Pyramid Layer

  • Rather than pooling local RoI tensors from a heuristically selected single-scale feature map, it extracts pyramid-shaped in-region features, forming tensor groups (fI, fII, fIII) from each scale of the globally encoded tensors (FI, FII, FIII) produced by the global image encoder.
  • Specifically, to extract a feature group for a detected target, the detected bounding box BboxIII=(z3, y3, x3, d3, h3, w3) is passed to its former feature scales, constructing a pyramid-shaped bounding box set (BboxI, BboxII, BboxIII).
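  • The bounding box set is computed iteratively by inverting the MaxPooling strides, i.e., the coordinates and sizes of each box are scaled up by the stride of the pooling layer between adjacent feature scales.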

Given the bounding box set (BboxI, BboxII, BboxIII), the raw in-region features (fI, fII, fIII) can be cropped from whole volume feature maps FI, FII, and FIII.
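A rough sketch of the box propagation and cropping is shown below. The stride values are placeholders (the actual pooling strides are not given in this review), the helper names bbox_pyramid and crop are hypothetical, and feature maps are represented as nested lists indexed as volume[z][y][x].

```python
def bbox_pyramid(bbox3, strides=((2, 2, 2), (2, 2, 2))):
    """Map a box on the coarsest feature map back to the finer ones.

    strides[0] is the pooling stride between F_I and F_II and strides[1]
    the stride between F_II and F_III; both defaults are assumed values.
    Returns [Bbox_I, Bbox_II, Bbox_III].
    """
    boxes = [tuple(bbox3)]
    for sz, sy, sx in reversed(strides):
        z, y, x, d, h, w = boxes[0]
        # Inverting a pooling layer scales both the start and the extent.
        boxes.insert(0, (z * sz, y * sy, x * sx, d * sz, h * sy, w * sx))
    return boxes

def crop(volume, bbox):
    """Crop the in-region features from a nested-list volume[z][y][x]."""
    z, y, x, d, h, w = bbox
    return [[row[x:x + w] for row in plane[y:y + h]] for plane in volume[z:z + d]]
```

For example, with an assumed stride of (1, 2, 2) between FI and FII and (2, 2, 2) between FII and FIII, the box (1, 2, 3, 2, 2, 2) on FIII maps to (2, 4, 6, 4, 4, 4) on FII and (2, 8, 12, 4, 8, 8) on FI.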

1.4. Local Region Decoder

  • With the in-region feature set (fI, fII, fIII) cropped from the encoder path, a multilevel subnetwork for in-region segmentation, called the local region decoder, is constructed using the multilevel feature fusion mechanism.
  • The decoder is roughly symmetrical to the encoder, with skip connections that fuse feature maps of corresponding scales. The key difference is that the decoder branch’s feature tensors are much smaller.

2. Loss Functions, Ensemble, and Other Details

2.1. Loss Functions

  • A Dice-Based Multitask Hybrid Loss (MHL) Function is used.
  • Dice Loss (DL) for global RoI localization is employed:
  • where Pg and Gg denote predictions of RoILocator and downsampled annotations, respectively.
  • The DL helps the global image encoder branch learn to better discriminate foreground regions from the background and reduces the influence of class imbalance.
  • Another is Dice-Based Contour-Aware Loss for Local Segmentation.
  • Practically, an extra output head called SegHead2 is added at the output terminal of the Local Region Decoder to predict the contour voxels, trained in parallel with the region segmentation head SegHead1.
  • A weighted sum of the losses is used:
  • where λc=0.5 is chosen by grid search.
  • Finally, the overall loss function is
  • where β=10^(-5) denotes the balance of the weight decay term.
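To make the Dice-based terms above more concrete, here is a minimal sketch on flattened voxel lists. The exact formulation in the paper may differ: the smoothing constant eps, the plain (non-squared) denominator, and the way λc weights the contour term in local_loss are all assumptions of this example, and the function names are hypothetical.

```python
def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss between flat lists of predictions and labels.

    `eps` is an assumed smoothing constant that avoids division by zero.
    """
    intersection = sum(p * g for p, g in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 - (2.0 * intersection + eps) / (total + eps)

def local_loss(region_pred, region_gt, contour_pred, contour_gt, lambda_c=0.5):
    """Region Dice loss plus a contour Dice loss weighted by lambda_c.

    Combining the two heads as a weighted sum is an assumption here.
    """
    return (dice_loss(region_pred, region_gt)
            + lambda_c * dice_loss(contour_pred, contour_gt))
```

Because Dice is a ratio of overlap to total foreground mass, the large number of background voxels does not dominate the loss, which is why it copes with the class imbalance mentioned above.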

2.2. Multiple Receptive Field Model Ensemble

Parameters and GPU Memory Footprint.
  • To cover contexts of different scales, the proposed 3-D RU-Nets are constructed with different receptive fields by adding dilation to the convolutional layer.
  • Specifically, as illustrated above, an original 3-D RU-Net with a receptive field of 26×64×64, called 3-D RU-Net (RF64), is first constructed.
  • Next, the dilation rate of ResBlock3 is set to 2, enlarging the receptive field to 26×88×88 and forming 3-D RU-Net (RF88).
  • Finally, the dilation rates of ResBlock2, ResBlock3, and ResBlock4 are all set to 2, giving a 3-D RU-Net with a receptive field of 26×112×112, called 3-D RU-Net (RF112).
3-D RU-Net (RF64), 3-D RU-Net (RF88), and 3-D RU-Net (RF112) are of different dilation rates. The green, blue, and red spheres of different sizes indicate receptive fields of 26×64×64, 26×88×88, and 26×112×112, respectively. In the output end, their predictions are averaged.
  • In the inference stage, three networks’ outputs are averaged to generate the final prediction.
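The averaging step is simple enough to sketch directly. The example below assumes each network outputs a flat list of per-voxel probabilities of the same length; the function name average_ensemble is hypothetical.

```python
def average_ensemble(predictions):
    """Voxel-wise mean of equally sized flat probability lists."""
    n = len(predictions)
    return [sum(values) / n for values in zip(*predictions)]

# Illustrative outputs for two voxels from the RF64, RF88, and RF112 models.
rf64, rf88, rf112 = [0.9, 0.2], [0.6, 0.4], [0.3, 0.0]
averaged = average_ensemble([rf64, rf88, rf112])
```

Here averaged is approximately [0.6, 0.2]; a final mask could then be obtained by thresholding, with the threshold being a separate design choice not covered in this review.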

2.3. Other Details

  • The dataset contains a total of 64 3-D MR images of the pelvic cavity in the T2 modality. 4-fold cross-validation was conducted.
  • By Otsu thresholding [59], body masks M are extracted, where mi=1 for in-body voxels and mi=0 otherwise. The in-body mean intensity μM and standard deviation σM are computed according to the following formulas:
  • Then, the image is normalized using μM and σM according to standard normalization criterion.
  • The backbone network is initialized using the criterion proposed in [61], then pretrained using the patchwise HL-FCN from the authors' previous work [52].
  • The RoI locator is first trained with loss Lg until the evaluation loss no longer decreases; then the full model is jointly trained with loss L.
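The in-body normalization described above can be sketched as follows. This assumes the body mask has already been obtained (the Otsu step is not reproduced here), that image and mask are flat lists of equal length, and that the standard deviation is the population form (divided by the number of in-body voxels); the function name normalize_in_body is hypothetical.

```python
import math

def normalize_in_body(image, mask):
    """Standardize `image` with the mean and std of voxels where mask == 1.

    Returns the normalized image together with the in-body mean and std.
    Using the population standard deviation is an assumption of this sketch.
    """
    n = sum(mask)
    mean = sum(x * m for x, m in zip(image, mask)) / n
    variance = sum(m * (x - mean) ** 2 for x, m in zip(image, mask)) / n
    std = math.sqrt(variance)
    return [(x - mean) / std for x in image], mean, std
```

Computing the statistics only inside the body mask keeps the large, dark background of an MR volume from skewing the mean and standard deviation.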

3. Results

Comparison of Accuracy and Efficiency
Illustrations inside a chosen RoI of (1) cancerous region, (2) expert delineation, (3) proposed method (predicted regions), (4) proposed method (predicted contours), (5) 3D U-Net+ DL (V-Net) [10] (ensemble), (6) 3D U-Net [9], (7) 3-D FCN + 3D U-Net, (8) 3-D Mask R-CNN [41], (9) supervoxel clustering [5], and (10) 2-D kU-Net + LSTM [17].

The proposed method responds sensitively to boundary details while retaining general correctness, whereas competing methods show inferior boundary details or limited correctness.

  • With an input volume of size 160 × 256 × 320 mm, the 3-D RU-Net takes 9.7 GB while the 3D U-Net takes about 18.9 GB (18874.81 MB).
  • By enabling in-place computing, the ReLU activations become memory free. The memory footprint of standard U-Net further drops from 18.9 to 13.3 GB and the footprint of 3-D RU-Net drops from 9.7 to 6.5 GB.
Illustration of selected 2-D key slices and 3-D segmentation results from different patient cases numbered from (1) to (8). (1) cancerous region, (2) expert delineation, (3) proposed method (predicted regions), (4) proposed method (predicted contours), (5) 3D U-Net + DL (V-Net) [10] (ensemble), (6) 3D U-Net [9], (7) 3-D FCN + 3D U-Net, (8) 3-D Mask R-CNN [41]
  • The 3-D rendering module of SimpleITK [63] is used.
  • Despite the background complexity, the proposed method correctly located and segmented targets without being significantly misguided by nearby distractions thanks to the fully utilized global contexts.


