Review: SRDenseNet — DenseNet for SR (Super Resolution)

DenseNet Blocks with Skip Connections, Outperforms SRCNN, VDSR and DRCN

Sik-Ho Tsang
Towards Data Science

--

SRDenseNet has much better quality

In this story, SRDenseNet, by Imperial Vision Technology in Fuzhou, China, is reviewed. In SRDenseNet, the dense blocks proposed in DenseNet are used to extract high-level features. Besides, skip connections are added between the dense blocks. Bottleneck and deconvolution layers are used for upscaling before reconstructing the high-resolution (HR) image. It was published in 2017 ICCV with over 70 citations. (Sik-Ho Tsang @ Medium)

  1. Dense Block
  2. SRDenseNet Variants
  3. Deconvolution, Bottleneck and Reconstruction Layers
  4. Ablation Study
  5. Results

1. Dense Block

Dense Block (the paths at the bottom are copied from previous layers to deeper layers)

1.1. Concatenation Rather Than Summation

  • Different from ResNet, the feature maps are concatenated in DenseNet rather than directly summed.
  • Consequently, the i-th layer receives the feature maps of all preceding layers as input: Xi = Hi([X1, X2, …, Xi−1]),
  • where [X1, X2, …, Xi−1] represents the concatenation of the feature maps generated in the preceding convolution layers 1, 2, …, i − 1, and Hi denotes the non-linear transformation of the i-th layer.
  • This DenseNet structure alleviates the vanishing-gradient problem.
  • Reusing feature maps that are already learnt forces the current layer to learn complementary information, thus avoiding the learning of redundant features.
  • In addition, each layer has a short path to the loss in the proposed network, leading to an implicit deep supervision.

1.2. Growth Rate

  • There are 8 convolution layers in one dense block in this work.
  • Since each convolution layer in a dense block produces k feature maps as output, the total number of new feature maps generated by one dense block is 8×k, where k is referred to as the growth rate.
  • The growth rate k regulates how much new information each layer contributes to the final reconstruction.
  • In the above figure, each block consists of 8 convolution layers. To prevent the network from growing too wide, the growth rate is set to 16, so the output of each block has 8×16 = 128 feature maps.
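The concatenation bookkeeping above can be sketched in plain Python (a shape-only illustration; the input channel count c0 = 16 is an assumed placeholder, not a value from the paper):

```python
def dense_block_channels(c0, growth_rate=16, num_layers=8):
    """Track channel counts through a dense block that concatenates
    (rather than sums) the outputs of all preceding layers."""
    channels_in = []   # input channels seen by each conv layer
    total = c0         # running channel count of the concatenation
    for _ in range(num_layers):
        channels_in.append(total)  # layer sees all preceding feature maps
        total += growth_rate       # and adds k new maps to the stack
    new_maps = growth_rate * num_layers  # maps produced by the block itself
    return channels_in, new_maps

channels_in, new_maps = dense_block_channels(c0=16)
print(new_maps)  # 128 new feature maps per block, as stated above
```

Each successive layer sees k more input channels than the last, which is exactly the feature reuse that forces layers to learn complementary rather than redundant features.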

2. SRDenseNet Variants

2.1. SRDenseNet_H

SRDenseNet_H
  • This is the basic SRDenseNet.
  • Eight dense blocks are used to extract high-level features.

2.2. SRDenseNet_HL

SRDenseNet_HL
  • A skip connection is used to concatenate the low-level and high-level features.
  • The concatenated feature maps are then used as input for deconvolution layers.

2.3. SRDenseNet_All

SRDenseNet_All
  • Dense skip connections are used to combine the feature maps produced by all convolution layers for SR reconstruction.
  • A bottleneck layer is also added before the deconvolution layers.
  • SRDenseNet_All has 69 weight layers and 68 activation layers.
  • Since the size of the receptive field is proportional to the depth, a large amount of contextual information in the LR images can be utilized to infer the high-frequency information in the HR images.
  • Due to the use of many ReLU layers, high nonlinearity can be exploited in very deep networks to model the complex mapping between LR and HR images.
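The 69 weight layers quoted above can be accounted for with simple arithmetic (a sketch of one plausible breakdown; the split of the layers outside the dense blocks is my assumption, not spelled out in this review):

```python
initial_conv   = 1      # low-level feature extraction (assumed single conv)
block_convs    = 8 * 8  # 8 dense blocks x 8 convolution layers each
bottleneck     = 1      # 1x1 conv before upscaling (Section 3.1)
deconvolutions = 2      # two 3x3 deconvolution layers (Section 3.2)
reconstruction = 1      # final 3x3 conv (Section 3.3)

total = initial_conv + block_convs + bottleneck + deconvolutions + reconstruction
print(total)  # 69 weight layers, matching SRDenseNet_All
```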

3. Deconvolution, Bottleneck and Reconstruction Layers

3.1. Bottleneck Layer

  • All feature maps in the network are concatenated in SRDenseNet_All, yielding a large number of inputs for the subsequent deconvolution layers.
  • A convolution layer with 1×1 kernel is used as a bottleneck layer to reduce the number of input feature maps.
  • The number of feature maps is reduced to 256 using the 1×1 bottleneck layer.
  • After that, the deconvolution layers transform the 256 feature maps from the LR space to the HR space.
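As a rough illustration of why the 1×1 bottleneck helps, compare the weight counts of a 3×3 layer applied directly to the concatenated maps versus after the reduction to 256 (the 1,024-map input below is a hypothetical figure chosen for illustration, not a number from the paper):

```python
def conv_params(in_ch, out_ch, k):
    """Weight count of a k x k convolution (biases ignored)."""
    return in_ch * out_ch * k * k

n_in = 1024  # hypothetical: e.g. 8 blocks x 128 maps concatenated
direct = conv_params(n_in, 256, 3)  # 3x3 layer straight from the concat
via_bneck = conv_params(n_in, 256, 1) + conv_params(256, 256, 3)  # 1x1 then 3x3

print(direct, via_bneck)  # the bottleneck path needs far fewer weights
```

Under these assumed numbers, the bottleneck path uses roughly a third of the weights, which is the point of squeezing the concatenation down to 256 maps before upscaling.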

3.2. Deconvolution Layers

  • In SRCNN and VDSR, bicubic interpolation is used to upscale low-resolution (LR) images to the HR space before the convolutions.
  • All convolutions are carried out in HR space, which increases the computational complexity for SR.
  • Also, the interpolation approaches do not bring new information for solving the SR problem.
  • Thus, deconvolution layers are employed to learn the upscaling filters after convolutions. There are two advantages.
  • First, it accelerates the SR reconstruction process. After the deconvolution layers are added at the end of networks, the whole computational process is performed in the LR space. If the upscaling factor is r, it will reduce the computational cost by a factor of r².
  • In addition, a large amount of contextual information from the LR images is used to infer the high frequency details.
  • In this work, two successive deconvolution layers with small 3×3 kernels and 256 feature maps are used for upscaling.
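The 4× upscaling from two stride-2 deconvolutions can be checked with the standard transposed-convolution output-size formula (a sketch; stride 2, padding 1, and output padding 1 are my assumed settings for the 3×3 kernels, chosen so that each layer exactly doubles the spatial size):

```python
def deconv_out(size, kernel=3, stride=2, padding=1, output_padding=1):
    """Output size of a transposed convolution (PyTorch convention)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding

lr = 25                            # e.g. a 25x25 low-resolution patch
after_one = deconv_out(lr)         # 50: first deconv doubles the size
after_two = deconv_out(after_one)  # 100: second deconv gives 4x overall
print(after_one, after_two)
```

Since every convolution before these two layers runs at the LR size, the network touches r² = 16 times fewer spatial positions than an HR-space design like VDSR, which is the source of the claimed speedup.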

3.3. Reconstruction Layer

  • The reconstruction layer is a convolution layer with 3×3 kernel and one channel of output.

4. Ablation Study

PSNR/SSIM on Urban100
PSNR/SSIM on 4 Datasets
  • 50,000 images were randomly selected from ImageNet for training.
  • Non-overlapping sub-images with a size of 100×100 were cropped in the HR space. The LR images were obtained by downsampling the HR images using bicubic interpolation with a scale factor of 4×. Only the Y channel was used for training.
  • ReLU is used for all weight layers and Adam optimizer is used.
  • A mini-batch size of 32 is used.
  • During testing, four datasets are used: Set5, Set14, B100 (from the Berkeley segmentation dataset, consisting of 100 natural images), and Urban100, which includes 100 challenging images.
  • A scale factor of 4× between LR and HR images is tested.
  • The PSNR and SSIM were calculated on the Y-channel of images.
  • NVIDIA Titan X GPU is used.
  • SRDenseNet_All has the highest PSNR and SSIM among the SRDenseNet variants.
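The PSNR figures in the tables follow the usual definition from the mean squared error on the Y channel (a minimal sketch; the 8-bit peak value of 255 is the standard assumption):

```python
import math

def psnr(mse, peak=255.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(peak * peak / mse)

# Example: an MSE of 65.025 on 8-bit images corresponds to 30 dB;
# a 1.0 dB gain therefore reflects roughly a 20% reduction in MSE.
print(round(psnr(65.025), 2))  # 30.0
```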

5. Results

5.1. Quantitative Results

PSNR/SSIM on 4 Datasets
  • For SRCNN, the best 9–5–5 image model is used.
  • As for the A+ method, it did not predict image boundaries. For fair comparison, the borders of HR images were cropped so that all the results had the same region.
  • In terms of PSNR, the proposed method achieves an improvement of 0.2 dB to 0.8 dB over state-of-the-art results on the different datasets.
  • On average, an increase of about 1.0 dB is achieved over the 3-layer SRCNN, and an increase of about 0.5 dB over the 20-layer VDSR.
  • The most significant improvement over all approaches, including SRCNN, VDSR and DRCN, is obtained on the very challenging dataset Urban100.

5.2. Qualitative Results

Urban100 img096
Urban100 img099
Urban100 img004
B100 148026
B100 253027
Set14 ppt3
  • For the above images from Urban100, SRDenseNet reconstructs the lines and contours well, while the other methods generate blurry results.
  • For the above images from B100 and Set14, SRDenseNet reconstructs the texture patterns and avoids distortions.
  • An average speed of 36 ms per image is achieved for super-resolving B100 on the Titan X GPU, reaching real-time SR at a scaling factor of 4×.

At the end, the authors mention that the current research trend is to investigate perceptual losses for the SR problem, as in SRGAN, which ‘fakes’ the texture to achieve better perceptual quality to human eyes, though with lower PSNR. They plan to investigate this direction as well.
