Review: VoxResNet — Deep Voxelwise Residual Networks for Volumetric Brain Segmentation (Biomedical Image Segmentation)

First place in the MICCAI MRBrainS challenge leaderboard out of 37 competitors

Sik-Ho Tsang
Sep 30, 2019 · 6 min read
First Row: Brain Images, Second Row: Segmentation annotations by experts

In this story, VoxResNet, a novel voxelwise residual network, by The Chinese University of Hong Kong (CUHK), The Hong Kong Polytechnic University (PolyU), and Chinese Academy of Sciences (Shenzhen), is reviewed. Segmentation of key brain tissues from 3D medical images is of great significance for brain disease diagnosis, progression assessment and monitoring of neurologic conditions.

  • It is built with 25 layers, and hence can generate more representative features to deal with the large variations of brain tissues.
  • Multi-modality and multi-level contextual information are integrated into the network, so that the complementary information of different modalities can be harnessed and features of different scales can be exploited.
  • The segmentation performance is further improved by combining the low-level image appearance features, implicit shape information, and high-level context together.

It first appeared as a 2016 arXiv tech report with more than 70 citations, and was later published in 2018 NeuroImage (Impact Factor 5.812) with more than 170 citations. (Sik-Ho Tsang @ Medium)

It also achieved first place out of 37 competitors in the well-known MRBrainS benchmark challenge.

Outline

  1. VoxResNet Architecture
  2. Multi-Modality Inputs
  3. Auto-Context VoxResNet
  4. Ablation Study
  5. Comparison with other methods

1. VoxResNet Architecture

(a) VoxResNet Architecture, (b) VoxRes module

1.1. VoxRes Module

  • Generally, a residual unit in ResNet can be expressed as: x_{l+1} = x_l + F_l(x_l).
  • F_l denotes the residual function, i.e., a stack of two convolutional layers with batch normalization (BN).
  • By unfolding the above equation recursively: x_L = x_l + Σ_{i=l}^{L-1} F_i(x_i).
  • The feature x_L of any deeper unit L can thus be represented as the feature x_l of a shallower unit l plus the summed residual functions.
  • During backpropagation, the chain rule gives: ∂loss/∂x_l = ∂loss/∂x_L · (1 + ∂(Σ_{i=l}^{L-1} F_i(x_i))/∂x_l).
  • This shows that the residual unit mechanism lets information propagate smoothly through the entire network in both the forward and backward passes. (A minimal code sketch of the VoxRes module follows this list.)
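
For concreteness, here is a minimal PyTorch sketch of a VoxRes module. It assumes a pre-activation ordering (BN-ReLU-Conv), 3 × 3 × 3 kernels, and a fixed channel count of 64; the class name VoxResModule and these hyperparameters are illustrative, not taken from the authors' code.

```python
import torch
import torch.nn as nn

class VoxResModule(nn.Module):
    """Sketch of a VoxRes module: two BN-ReLU-Conv3d stages whose output
    is added to the identity skip, i.e. x_{l+1} = x_l + F_l(x_l)."""

    def __init__(self, channels: int = 64):
        super().__init__()
        self.residual = nn.Sequential(
            nn.BatchNorm3d(channels),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm3d(channels),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.residual(x)  # identity skip + residual function F_l

# quick shape check on a small sub-volume (batch, channels, D, H, W)
x = torch.randn(1, 64, 16, 32, 32)
print(VoxResModule(64)(x).shape)  # torch.Size([1, 64, 16, 32, 32])
```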

1.2. VoxResNet

  • VoxResNet architecture consists of stacked residual modules (i.e., VoxRes module) with a total of 25 volumetric convolutional/deconvolutional layers.
  • Small convolutional kernels (i.e., 1 × 3 × 3 or 3 × 3 × 3) are employed in the convolutional layers, which have demonstrated clear advantages in computational efficiency and representation capability.
  • In order to handle the large variation in the size of brain structures, multi-level contextual information (i.e., 4 auxiliary classifiers C1-C4 in the above figure) is fused with deep supervision in the network.
  • The whole network is trained by minimizing the following objective function with standard back-propagation: L = λ‖W‖² + Σ_a L_aux_a(x, y; W_a) + L_target(x, y; W).
  • First term: regularization term using the L2 norm of the weights.
  • Latter terms: the fidelity term, consisting of the auxiliary classifiers and the final target classifier.
  • This design is similar to FCN and CUMedVision1. (A minimal sketch of the deeply supervised loss follows this list.)
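
Below is a minimal sketch of such a deeply supervised objective, assuming voxelwise cross-entropy losses for both the auxiliary classifiers and the final classifier; the auxiliary weights aux_weights are illustrative, and the L2 regularization term is typically supplied via the optimizer's weight_decay rather than written explicitly.

```python
import torch.nn.functional as F

def deep_supervision_loss(aux_logits, final_logits, target,
                          aux_weights=(0.25, 0.5, 0.75, 1.0)):
    """Weighted sum of cross-entropy terms from the auxiliary classifiers
    C1-C4 and the final target classifier (voxelwise, over 3D volumes)."""
    # final_logits / aux_logits[i]: (N, num_classes, D, H, W); target: (N, D, H, W)
    loss = F.cross_entropy(final_logits, target)
    for w, logits in zip(aux_weights, aux_logits):
        loss = loss + w * F.cross_entropy(logits, target)
    return loss

# usage: the L2 term is provided through the optimizer, e.g.
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=1e-4)
```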
Multi-Modality Input & Auto-Context VoxResNet

2. Multi-Modality Inputs

  • The volumetric data is usually acquired with multiple imaging modalities for robustly examining different tissue structures.
  • In this paper, three imaging modalities including T1, T1-IR, and T2-FLAIR are available in the brain structure segmentation task.
  • The main reason for acquiring multi-modality images is that the information from multi-modality dataset can be complementary, which provides more robust diagnosis results.
  • Inspired by this clinical observation, these multi-modality data are concatenated as input channels into the neural network, as sketched below.
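
As an illustration, here is a sketch of the channel-wise concatenation, assuming the three modalities are already co-registered and resampled to the same grid; the volume dimensions are illustrative.

```python
import torch

# Each modality is one volume (D, H, W); stacking along a new channel axis
# yields the 3-channel input volume that VoxResNet consumes.
t1       = torch.randn(48, 240, 240)   # T1
t1_ir    = torch.randn(48, 240, 240)   # T1-IR
t2_flair = torch.randn(48, 240, 240)   # T2-FLAIR

multi_modal = torch.stack([t1, t1_ir, t2_flair], dim=0)  # (3, D, H, W)
print(multi_modal.shape)  # torch.Size([3, 48, 240, 240])
```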

3. Auto-Context VoxResNet

  • First, a VoxResNet classifier is trained on the original training sub-volumes with image appearance information.
  • Then, the discriminative probability maps generated by VoxResNet are used as context information: together with the original volumes (i.e., appearance information), they form the input to train a new classifier, Auto-context VoxResNet, which further refines the semantic segmentation results and removes outliers (a sketch of the input construction follows this list).
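
A sketch of how the auto-context input could be assembled, assuming the first-stage network outputs softmax probability maps; the function and variable names are illustrative.

```python
import torch

def build_auto_context_input(volume: torch.Tensor, prob_maps: torch.Tensor) -> torch.Tensor:
    """Concatenate the original multi-modality volume (appearance) with the
    first-stage probability maps (context) along the channel axis; the result
    is the input used to train the second-stage Auto-context VoxResNet."""
    # volume:    (N, 3, D, H, W) -- T1, T1-IR, T2-FLAIR channels
    # prob_maps: (N, K, D, H, W) -- softmax output of the first-stage network
    return torch.cat([volume, prob_maps], dim=1)  # (N, 3 + K, D, H, W)
```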

4. Ablation Study

4.1. Metrics

  • Dice coefficient (DC): measures the spatial overlap between the segmentation result and the ground truth; a larger value denotes higher segmentation accuracy.
  • 95th-percentile Hausdorff distance (HD): measures the distance between the segmentation result and the ground truth; a smaller value of HD(G, S) denotes higher proximity between ground truth and segmentation result.
  • Absolute volume difference (AVD): a smaller value of AVD(G, S) denotes better segmentation accuracy. (Minimal sketches of DC and AVD follow this list.)
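
Minimal NumPy sketches of DC and AVD on binary masks (one tissue class at a time); the 95th-percentile Hausdorff distance is usually computed from boundary distance transforms (e.g., with SciPy or MedPy) and is omitted here for brevity.

```python
import numpy as np

def dice_coefficient(seg: np.ndarray, gt: np.ndarray) -> float:
    """DC = 2|S ∩ G| / (|S| + |G|) for binary masks of one tissue class."""
    seg, gt = seg.astype(bool), gt.astype(bool)
    denom = seg.sum() + gt.sum()
    return 2.0 * np.logical_and(seg, gt).sum() / denom if denom else 1.0

def absolute_volume_difference(seg: np.ndarray, gt: np.ndarray) -> float:
    """AVD as a percentage of the ground-truth volume: 100 * |V_S - V_G| / V_G."""
    v_s, v_g = float(seg.sum()), float(gt.sum())
    return 100.0 * abs(v_s - v_g) / v_g
```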

4.2. Multi-Modality Inputs

Cross-validation results of MRI brain segmentation using different image modalities
  • When combining the multi-modality information from all available image modalities, the segmentation performance is clearly improved on almost all evaluation metrics compared with any single image modality, especially on the DC metric.
  • It is also observed in the table that by integrating the auto-context information, the performance of DC can be further improved.
The example results of validation data using different image modalities
  • (a)-(c): Original T1, T1-IR, and T2-FLAIR MR images.
  • (d): Ground-truth label.
  • (e)-(g): Corresponding segmentation results using a single image modality.
  • (h): The result using all image modalities without auto-context information.
  • The results using all image modalities are visually more accurate than those of single image modality.
The qualitative results of brain segmentation with or without auto-context information
  • (a): Original T1 MR images.
  • (b): Results of VoxResNet.
  • (c): Results of Auto-context VoxResNet.
  • (d): Ground-truth labels.
  • Fusing auto-context information generates more accurate results than the network without it.

4.3. Multiple Classifiers Fusion

Results of MRI brain segmentation using different levels of contextual information
  • C1-C4: Fusing all levels of contextual information (C1-C4) achieved close to the best performance across the different tissue classes.

5. Comparison with other methods

Results of MICCAI MRBrainS challenge of different methods
  • CU_DL: VoxResNet.
  • CU_DL2: Auto-Context VoxResNet.
  • Overall, the proposed methods achieved first place on the challenge leaderboard out of 37 competitors, outperforming the other methods on most evaluation metrics.

References

[2016 arXiv] [VoxResNet]
VoxResNet: Deep Voxelwise Residual Networks for Volumetric Brain Segmentation

[2018 NeuroImage] [VoxResNet]
VoxResNet: Deep voxelwise residual networks for brain segmentation from 3D MR images

My Previous Reviews

Image Classification [LeNet] [AlexNet] [Maxout] [NIN] [ZFNet] [VGGNet] [Highway] [SPPNet] [PReLU-Net] [STN] [DeepImage] [SqueezeNet] [GoogLeNet / Inception-v1] [BN-Inception / Inception-v2] [Inception-v3] [Inception-v4] [Xception] [MobileNetV1] [ResNet] [Pre-Activation ResNet] [RiR] [RoR] [Stochastic Depth] [WRN] [ResNet-38] [Shake-Shake] [FractalNet] [Trimps-Soushen] [PolyNet] [ResNeXt] [DenseNet] [PyramidNet] [DRN] [DPN] [Residual Attention Network] [DMRNet / DFN-MR] [IGCNet / IGCV1] [MSDNet] [ShuffleNet V1] [SENet] [NASNet] [MobileNetV2]

Object Detection [OverFeat] [R-CNN] [Fast R-CNN] [Faster R-CNN] [MR-CNN & S-CNN] [DeepID-Net] [CRAFT] [R-FCN] [ION] [MultiPathNet] [NoC] [Hikvision] [GBD-Net / GBD-v1 & GBD-v2] [G-RMI] [TDM] [SSD] [DSSD] [YOLOv1] [YOLOv2 / YOLO9000] [YOLOv3] [FPN] [RetinaNet] [DCN]

Semantic Segmentation [FCN] [DeconvNet] [DeepLabv1 & DeepLabv2] [CRF-RNN] [SegNet] [ParseNet] [DilatedNet] [DRN] [RefineNet] [GCN] [PSPNet] [DeepLabv3] [ResNet-38] [ResNet-DUC-HDC] [LC] [FC-DenseNet] [IDW-CNN] [DIS] [SDN] [DeepLabv3+]

Biomedical Image Segmentation [CUMedVision1] [CUMedVision2 / DCAN] [U-Net] [CFS-FCN] [U-Net+ResNet] [MultiChannel] [V-Net] [3D U-Net] [M²FCN] [SA] [QSA+QNT] [3D U-Net+ResNet] [Cascaded 3D U-Net] [VoxResNet] [Attention U-Net] [RU-Net & R2U-Net]

Instance Segmentation [SDS] [Hypercolumn] [DeepMask] [SharpMask] [MultiPathNet] [MNC] [InstanceFCN] [FCIS]

Super Resolution [SRCNN] [FSRCNN] [VDSR] [ESPCN] [RED-Net] [DRCN] [DRRN] [LapSRN & MS-LapSRN] [SRDenseNet] [SR+STN]

Human Pose Estimation [DeepPose] [Tompson NIPS’14] [Tompson CVPR’15] [CPM]

Codec Post-Processing [ARCNN] [Lin DCC’16] [IFCNN] [Li ICME’17] [VRCNN] [DCAD] [DS-CNN]

Generative Adversarial Network [GAN]
