Review: EDSR & MDSR — Enhanced Deep Residual Networks for Single Image Super-Resolution (Super Resolution)

Outperforms SRCNN, VDSR, SRResNet, Won the NTIRE2017 Super-Resolution Challenge

Sik-Ho Tsang
7 min read · Apr 24, 2020

In this story, Enhanced Deep Super-Resolution Network (EDSR) and Multi-Scale Deep Super-Resolution System (MDSR), by Seoul National University, are reviewed.

×4 Super-resolution result of EDSR compared with existing algorithms

Applying the ResNet architecture directly to low-level vision problems like super-resolution, as SRResNet does, can be suboptimal. In this paper:

  • Based on SRResNet architecture, EDSR is optimized by analyzing and removing unnecessary modules to simplify the network architecture.
  • Then the model is further enhanced by supporting multi-scale SR using one single network, i.e. MDSR.

The proposed approach shows superior performance over SOTA approaches and won the NTIRE2017 Super-Resolution Challenge. Together with the novelty of the approach, winning the challenge allowed the work to be published in the CVPR Workshop. This is a paper in 2017 CVPRW with over 1000 citations. (Sik-Ho Tsang @ Medium)

Outline

  1. Modified Residual Blocks
  2. EDSR (Single-Scale Model)
  3. MDSR (Multi-Scale Model)
  4. Some Training & Testing Details
  5. Experimental Results

1. Modified Residual Blocks

Residual Block Variants
  • As we can see above, batch normalization is removed.
  • Since batch normalization layers normalize the features, they remove the range flexibility of the network, so it is better to remove them.
  • GPU memory usage is also substantially reduced.
  • The proposed baseline model without batch normalization layers saves approximately 40% of memory usage during training compared to SRResNet. (A minimal sketch of the modified residual block is shown after this list.)
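
As an illustration, here is a minimal PyTorch-style sketch of the modified residual block (conv → ReLU → conv, with batch normalization removed and an optional constant residual-scaling factor). The paper's implementation uses Torch7, so the names and structure here are assumptions for illustration, not the authors' code.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """EDSR-style residual block: conv -> ReLU -> conv, no batch normalization.
    res_scale is the constant residual-scaling factor (0.1 in EDSR, 1.0 in the baseline)."""
    def __init__(self, n_feats=64, kernel_size=3, res_scale=1.0):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(n_feats, n_feats, kernel_size, padding=kernel_size // 2),
            nn.ReLU(inplace=True),
            nn.Conv2d(n_feats, n_feats, kernel_size, padding=kernel_size // 2),
        )
        self.res_scale = res_scale

    def forward(self, x):
        # Scale the residual branch before adding it back to the identity path.
        return x + self.body(x) * self.res_scale
```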

2. EDSR (Single-Scale Model)

EDSR (Single-Scale Model)
  • A general CNN architecture with depth (the number of layers) B and width (the number of feature channels) F occupies roughly O(BF) memory with O(BF²) parameters.
  • Therefore, increasing F instead of B can maximize the model capacity when considering limited computational resources. For example, doubling F roughly quadruples the parameter count while only doubling the activation memory, whereas doubling B doubles both.
  • However, increasing the number of feature maps makes the training unstable.
  • Residual scaling with factor 0.1 is therefore applied, as suggested in Inception-v4: in each residual block, a constant scaling layer is placed after the last convolution layer, scaling the residual branch before it is added back to the identity path. In the test phase, this layer can be integrated into the previous convolution layer for computational efficiency.
  • There are no ReLU activation layers outside the residual blocks.
  • Baseline model: Residual scaling layers are not used since only 64 feature maps are used for each convolution layer.
  • EDSR: The baseline model is expanded by setting B = 32, F = 256 with a residual scaling factor of 0.1 (see the sketch at the end of this section).
  • When training the model for upsampling factors ×3 and ×4, the model parameters are initialized with the pre-trained ×2 network.
Effect of using pre-trained ×2 network for ×4 model (EDSR)
  • As shown above, for upscaling ×4, if we use a pre-trained scale ×2 model (blue line), the training converges much faster than the one started from random initialization (green line).
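
To make the single-scale architecture concrete, here is a hedged PyTorch-style sketch of EDSR (head convolution, B residual blocks with a long skip connection, and a sub-pixel upsampler), reusing the ResidualBlock sketch above. Layer names and the exact upsampler layout are assumptions rather than the authors' Torch7 code.

```python
import torch.nn as nn

class Upsampler(nn.Sequential):
    """Sub-pixel (PixelShuffle) upsampler for scale 2, 3, or 4."""
    def __init__(self, scale, n_feats):
        layers = []
        if scale in (2, 4):
            for _ in range(scale // 2):  # one x2 stage for scale 2, two for scale 4
                layers += [nn.Conv2d(n_feats, 4 * n_feats, 3, padding=1), nn.PixelShuffle(2)]
        else:  # scale == 3
            layers += [nn.Conv2d(n_feats, 9 * n_feats, 3, padding=1), nn.PixelShuffle(3)]
        super().__init__(*layers)

class EDSR(nn.Module):
    """Single-scale EDSR: B = 32 residual blocks, F = 256 feature maps, res_scale = 0.1."""
    def __init__(self, scale=4, n_resblocks=32, n_feats=256, res_scale=0.1):
        super().__init__()
        self.head = nn.Conv2d(3, n_feats, 3, padding=1)
        self.body = nn.Sequential(
            *[ResidualBlock(n_feats, res_scale=res_scale) for _ in range(n_resblocks)],
            nn.Conv2d(n_feats, n_feats, 3, padding=1),
        )
        self.tail = nn.Sequential(Upsampler(scale, n_feats),
                                  nn.Conv2d(n_feats, 3, 3, padding=1))

    def forward(self, x):
        x = self.head(x)
        x = x + self.body(x)  # long skip connection around the residual blocks
        return self.tail(x)
```

In this layout, initializing the ×3 or ×4 model from a pre-trained ×2 model amounts to copying all weights except those of the scale-specific upsampler.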

3. MDSR (Multi-Scale Model)

MDSR (Multi-Scale Model)
  • Super-resolution at multiple scales consists of inter-related tasks.
  • Baseline Model: A single main branch with B = 16 residual blocks, so that most of the parameters are shared across different scales, as shown above.
  • First, pre-processing modules are located at the head of the network to reduce the variance from input images of different scales. Each pre-processing module consists of two residual blocks with 5×5 kernels. By adopting larger kernels for the pre-processing modules, the scale-specific part can be kept shallow while a larger receptive field is covered in the early stages of the network.
  • At the end of the multi-scale model, scale-specific upsampling modules are located in parallel to handle multi-scale reconstruction.
  • Single-scale baseline models for the 3 different scales have about 1.5M parameters each, totaling 4.5M, whereas the baseline multi-scale model has only 3.2M parameters with comparable performance.
  • MDSR: B = 80 and F = 64. Although the depth is approximately 5 times greater than the baseline multi-scale model, only 2.5 times more parameters are required, since the residual blocks are lighter than the scale-specific parts. (A minimal sketch of the multi-scale layout follows this list.)
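
Below is a hedged PyTorch-style sketch of this multi-scale layout, reusing the ResidualBlock and Upsampler sketches above: a shared main branch plus scale-specific pre-processing and upsampling modules selected per forward pass. Module names and the exact wiring are assumptions for illustration.

```python
import torch.nn as nn

class MDSR(nn.Module):
    """Multi-scale MDSR sketch: shared residual body with scale-specific
    pre-processing (two 5x5 residual blocks) and parallel upsamplers for x2/x3/x4."""
    def __init__(self, scales=(2, 3, 4), n_resblocks=80, n_feats=64):
        super().__init__()
        self.head = nn.Conv2d(3, n_feats, 3, padding=1)
        # Scale-specific pre-processing modules with larger 5x5 kernels.
        self.pre_process = nn.ModuleDict({
            str(s): nn.Sequential(ResidualBlock(n_feats, kernel_size=5),
                                  ResidualBlock(n_feats, kernel_size=5))
            for s in scales
        })
        # Shared main branch: most of the parameters live here.
        self.body = nn.Sequential(
            *[ResidualBlock(n_feats) for _ in range(n_resblocks)],
            nn.Conv2d(n_feats, n_feats, 3, padding=1),
        )
        # Scale-specific upsampling modules located in parallel.
        self.upsample = nn.ModuleDict({str(s): Upsampler(s, n_feats) for s in scales})
        self.tail = nn.Conv2d(n_feats, 3, 3, padding=1)

    def forward(self, x, scale):
        x = self.head(x)
        x = self.pre_process[str(scale)](x)
        x = x + self.body(x)  # long skip connection around the shared body
        return self.tail(self.upsample[str(scale)](x))
```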

4. Some Training & Testing Details

4.1. Datasets

  • The DIV2K dataset is a newly proposed high-quality (2K resolution) image dataset for image restoration tasks. It consists of 800 training images, 100 validation images, and 100 test images. Since the ground truth of the test set is not released, performance is reported and compared on the validation set.
  • Four standard benchmark datasets are also compared: Set5, Set14, B100, and Urban100.

4.2. Training

  • RGB input patches of size 48×48 are extracted from the LR images, together with the corresponding HR patches.
  • All the images are pre-processed by subtracting the mean RGB value of the DIV2K dataset.
  • The training data is augmented with random horizontal flips and 90-degree rotations.
  • For MDSR, each minibatch is constructed with a randomly selected scale among ×2, ×3 and ×4. Only the modules that correspond to the selected scale are enabled and updated.
  • Networks are trained using the L1 loss instead of L2. Minimizing L2 is generally preferred since it maximizes the PSNR; however, it is empirically found that the L1 loss provides better convergence. (A sketch of one training step is given after this list.)
  • Using Torch7 as the framework with NVIDIA Titan X GPUs, it takes 8 days to train EDSR and 4 days to train MDSR.
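
For illustration, here is a minimal sketch of one MDSR training step under these settings (random scale per minibatch, L1 loss). The `batches_by_scale` loader and the `model(lr, scale)` signature are hypothetical names matching the MDSR sketch above, not the authors' training code.

```python
import random
import torch.nn.functional as F

def train_step(model, optimizer, batches_by_scale, scales=(2, 3, 4)):
    """One MDSR update: pick a scale at random and compute the L1 loss on that minibatch."""
    scale = random.choice(scales)                 # randomly selected scale for this update
    lr_patch, hr_patch = batches_by_scale[scale]  # 48x48 LR patches and matching HR patches
    optimizer.zero_grad()
    sr = model(lr_patch, scale)                   # only this scale's modules are used/updated
    loss = F.l1_loss(sr, hr_patch)                # L1 loss instead of L2
    loss.backward()
    optimizer.step()
    return loss.item()
```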

4.3. Geometric Self-ensemble During Testing

  • During testing, the input image I^LR is transformed to generate seven augmented inputs: I^LR_i = T_i(I^LR) for i = 1, …, 7, where {T_i} denotes the 8 geometric transformations (the identity plus flips and 90° rotations).
  • After super-resolution, all outputs are inverse-transformed back to the original geometry: Ĩ^SR_i = T_i⁻¹(I^SR_i).
  • The transformed outputs are averaged together to give the self-ensemble result: I^SR = (1/8) · Σ_{i=1..8} Ĩ^SR_i. (A small sketch of this procedure follows this list.)
  • This self-ensemble method has an advantage over other ensembles as it does not require additional training of separate models. It is beneficial especially when the model size or training time matters.
  • It gives approximately the same performance gain as the conventional model ensemble method that requires individually trained models.
  • The methods using self-ensemble are denoted by adding ’+’ postfix to the method name, i.e. EDSR+/MDSR+.
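
Here is a minimal sketch of geometric self-ensemble for a single-scale model, assuming 4-D image tensors in NCHW layout; the transform bookkeeping follows the description above, but the function itself is an illustrative assumption.

```python
import torch

@torch.no_grad()
def geometric_self_ensemble(model, lr):
    """Average the outputs of `model` over the 8 flip/rotation transforms of the input."""
    outputs = []
    for flip in (False, True):
        for k in range(4):                                   # 0/90/180/270 degree rotations
            x = torch.flip(lr, dims=[3]) if flip else lr     # apply horizontal flip (W axis)
            x = torch.rot90(x, k, dims=[2, 3])               # then rotate in the H-W plane
            y = model(x)
            y = torch.rot90(y, -k, dims=[2, 3])              # undo the rotation
            y = torch.flip(y, dims=[3]) if flip else y       # undo the flip
            outputs.append(y)
    return torch.stack(outputs).mean(dim=0)                  # EDSR+/MDSR+ style average
```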

5. Experimental Results

5.1. Ablation Study on DIV2K

  • Starting from SRResNet, various settings are gradually changed to perform the ablation tests. First, SRResNet is trained as a reference.
  • Then, the loss function is changed from L2 to L1, and the network architecture is reformed as described in the previous sections, summarized as follows:
Summarization for Different Networks
Ablation Study (PSNR and SSIM) on DIV2K Validation Set
  • Evaluation is conducted on 10 images of the DIV2K validation set, with PSNR and SSIM criteria.
  • We can see clear margins of improvement for EDSR, MDSR, EDSR+ and MDSR+.
  • The models require much less GPU memory since they do not have batch normalization layers.

5.2. Set5, Set14, B100, Urban100, & DIV2K

Set5, Set14, B100, and Urban100
  • The models exhibit a significant improvement compared to the other methods such as SRCNN, VDSR and SRResNet.
  • The gaps further increase after performing self-ensemble.
Qualitative Results

5.3. NTIRE2017 SR Challenge

  • The challenge aims to develop a single image super resolution system with the highest PSNR.
  • There are two tracks for different degraders (bicubic, unknown), each with three downsampling scales (×2, ×3, ×4). Input images for the unknown track are not only downscaled but also suffer from severe blurring.
NTIRE2017 SR Challenge
  • EDSR+ and MDSR+ won the first and second places, respectively, with outstanding performances as shown above.
Qualitative Results

During the days of coronavirus, I hope to write 30 stories in this month to give myself a small challenge. This is the 24th story in this month. Thanks for visiting my story…
