Review: RDN — Residual Dense Network (Super Resolution)
DenseNet + ResNet, Outperforms SRCNN, FSRCNN, VDSR, LapSRN, DRRN, SRDenseNet, MemNet, IRCNN, MDSR
In this story, Residual Dense Network (RDN), by Northeastern University and University of Rochester, is reviewed. In this paper:
- Residual Dense Block (RDB) is proposed to extract abundant local features via densely connected convolutional layers.
- RDB further allows direct connections from the state of the preceding RDB to all the layers of the current RDB, leading to a contiguous memory (CM) mechanism.
- Local feature fusion in RDB is then used to adaptively learn more effective features from preceding and current local features, and it stabilizes the training of a wider network.
- Global feature fusion is used to jointly and adaptively learn global hierarchical features in a holistic way.
This is a paper in 2018 CVPR with over 500 citations. (Sik-Ho Tsang @ Medium)
Outline
- RDN Network Architecture
- Residual Dense Block (RDB)
- Dense Feature Fusion (DFF)
- Differences to DenseNet, SRDenseNet, MemNet
- Ablation Study
- State-Of-The-Art (SOTA) Comparisons
1. RDN Network Architecture
- As shown above, RDN mainly consists of four parts: shallow feature extraction net (SFENet), residual dense blocks (RDBs), dense feature fusion (DFF), and finally the up-sampling net (UPNet).
1.1. Shallow Feature Extraction Net (SFENet)
- Two Conv layers are used to extract shallow features. The first Conv layer extracts features F_{-1} from the LR input:
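F_{-1} = H_{SFE1}(I_{LR})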
- This F_{-1} is then used for further shallow feature extraction and for global residual learning. The second Conv layer is then applied to produce F_0:
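F_0 = H_{SFE2}(F_{-1})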
- This F_0, the output of the second shallow feature extraction layer, is used as the input to the residual dense blocks.
1.2. Residual Dense Blocks (RDBs)
- Supposing we have D residual dense blocks, the output F_d of the d-th RDB can be obtained by:
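F_d = H_{RDB,d}(F_{d-1}) = H_{RDB,d}(H_{RDB,d-1}(…(H_{RDB,1}(F_0))…))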
- where H_{RDB,d} denotes the operations of the d-th RDB.
1.3. Dense Feature Fusion (DFF)
- DFF consists of global feature fusion (GFF) and global residual learning (GRL). DFF makes full use of features from all the preceding layers and can be represented as:
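F_{DF} = H_{DFF}(F_{-1}, F_0, F_1, …, F_D)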
- F_{DF} is the output feature-maps of DFF, obtained by applying the composite function H_{DFF}.
1.4. Up-Sampling Net (UPNet)
- ESPCN is used here in UPNet, followed by one Conv layer:
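I_{SR} = H_{RDN}(I_{LR})
- where H_{RDN} denotes the function of the whole RDN.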
- (Please read ESPCN if interested.)
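- As a rough PyTorch sketch of this sub-pixel upsampling idea (the channel sizes and layer count here are illustrative assumptions, not the paper's exact configuration):

```python
import torch.nn as nn

def make_upnet(channels: int = 64, scale: int = 2) -> nn.Sequential:
    """ESPCN-style upsampler: a Conv expands channels by scale^2,
    PixelShuffle rearranges them into a scale-times larger map,
    and a final Conv maps to the 3-channel SR output."""
    return nn.Sequential(
        nn.Conv2d(channels, channels * scale ** 2, kernel_size=3, padding=1),
        nn.PixelShuffle(scale),  # (C*r^2, H, W) -> (C, r*H, r*W)
        nn.Conv2d(channels, 3, kernel_size=3, padding=1),
    )
```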
2. Residual Dense Block (RDB)
2.1. Contiguous Memory (CM)
- RDB contains densely connected layers, local feature fusion (LFF), and local residual learning, leading to a contiguous memory (CM) mechanism.
- This contiguous memory mechanism is realized by passing the state of the preceding RDB to each layer of the current RDB.
- Let F_{d-1} and F_d be the input and output of the d-th RDB respectively, both having G_0 feature-maps. The output of the c-th Conv layer of the d-th RDB can be formulated as:
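F_{d,c} = σ(W_{d,c}[F_{d-1}, F_{d,1}, …, F_{d,c-1}])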
- where σ denotes the ReLU activation function and W_{d,c} is the weights of the c-th Conv layer (the bias term is omitted for simplicity).
- Assume F_{d,c} consists of G (also known as the growth rate, as in DenseNet) feature-maps. [F_{d-1}, F_{d,1}, …, F_{d,c-1}] refers to the concatenation of the feature-maps produced by the (d-1)-th RDB and by Conv layers 1, …, (c-1) in the d-th RDB, giving G_0+(c-1)×G feature-maps in total.
2.2. Local Feature Fusion (LFF)
- Inspired by MemNet, a 1×1 convolutional layer is introduced to adaptively control the output information:
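F_{d,LF} = H^d_{LFF}([F_{d-1}, F_{d,1}, …, F_{d,C}])
- where H^d_{LFF} denotes the function of the 1×1 Conv layer in the d-th RDB and C is the number of Conv layers per RDB.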
- It is found that, as the growth rate G becomes larger, a very deep dense network without LFF would be hard to train.
2.3. Local Residual Learning (LRL)
- Local residual learning is introduced in RDB to further improve the information flow, as there are several convolutional layers in one RDB:
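F_d = F_{d-1} + F_{d,LF}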
- LRL can also further improve the network representation ability, resulting in better performance.
- Because of the dense connectivity and local residual learning, this block architecture is referred to as a residual dense block (RDB).
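- To make the block structure concrete, below is a minimal PyTorch sketch of an RDB using the paper's G_0/G/C notation; the defaults and implementation details are illustrative assumptions, not the authors' code:

```python
import torch
import torch.nn as nn

class RDB(nn.Module):
    """Residual dense block: C densely connected Conv+ReLU layers,
    1x1 local feature fusion (LFF), and local residual learning (LRL)."""
    def __init__(self, g0: int = 64, growth: int = 32, num_convs: int = 6):
        super().__init__()
        # The c-th Conv sees G0 + c*G input channels (all preceding features).
        self.convs = nn.ModuleList([
            nn.Conv2d(g0 + c * growth, growth, kernel_size=3, padding=1)
            for c in range(num_convs)
        ])
        # LFF: a 1x1 Conv adaptively fuses the preceding state and all
        # local features back down to G0 channels.
        self.lff = nn.Conv2d(g0 + num_convs * growth, g0, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = [x]  # state of the preceding RDB (contiguous memory)
        for conv in self.convs:
            features.append(torch.relu(conv(torch.cat(features, dim=1))))
        return x + self.lff(torch.cat(features, dim=1))  # LRL
```

- Note how every Conv layer receives the concatenation of F_{d-1} and all earlier local features, which is exactly the contiguous memory mechanism of Section 2.1.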
3. Dense Feature Fusion (DFF)
- DFF consists of global feature fusion (GFF) and global residual learning.
3.1. Global Feature Fusion (GFF)
- Global feature fusion is proposed to extract the global feature F_{GF} by fusing features from all the RDBs:
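F_{GF} = H_{GFF}([F_1, …, F_D])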
- H_{GFF} is a composite function of 1×1 and 3×3 convolutions.
- The 1×1 convolutional layer is used to adaptively fuse a range of features with different levels. The following 3×3 convolutional layer is introduced to further extract features for global residual learning.
3.2. Global Residual Learning (GRL)
- Global residual learning is then utilized to obtain the feature-maps before conducting up-scaling by:
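F_{DF} = F_{-1} + F_{GF}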
- where F_{-1} denotes the shallow feature-maps mentioned in Section 1.1.
- All the other layers before global feature fusion are fully utilized with the proposed residual dense blocks (RDBs).
- RDBs produce multi-level local dense features, which are further adaptively fused to form F_{GF}. After global residual learning, we obtain the dense feature F_{DF}.
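- Continuing the sketch above with assumed names and sizes, DFF can be wired as follows (in a real model the Convs would be registered as submodules rather than created on the fly):

```python
import torch
import torch.nn as nn

def dense_feature_fusion(f_minus1: torch.Tensor,
                         rdb_outputs: list,
                         g0: int = 64) -> torch.Tensor:
    """GFF fuses the outputs of all D RDBs with 1x1 + 3x3 Convs,
    then GRL adds the shallow features F_{-1}."""
    gff = nn.Sequential(
        nn.Conv2d(len(rdb_outputs) * g0, g0, kernel_size=1),  # fuse levels
        nn.Conv2d(g0, g0, kernel_size=3, padding=1),  # features for GRL
    )
    f_gf = gff(torch.cat(rdb_outputs, dim=1))  # global feature fusion
    return f_minus1 + f_gf                     # global residual learning
```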
4. Differences to DenseNet, SRDenseNet, MemNet
4.1. Differences to DenseNet
- DenseNet is widely used in high-level computer vision tasks while RDN is designed for image SR.
- Batch Normalization (BN) layers are removed here.
- Pooling layers, which could discard some pixel-level information, are also removed.
- In DenseNet, transition layers are placed between two adjacent dense blocks, while in RDN, densely connected layers are combined with local feature fusion (LFF) by using local residual learning.
- Global feature fusion is adopted to fully use hierarchical features, which are neglected in DenseNet.
4.2. Differences to SRDenseNet
- Residual dense block (RDB) improves the dense block of SRDenseNet in three ways:
- The contiguous memory (CM) mechanism is introduced, which allows the state of the preceding RDB to have direct access to each layer of the current RDB.
- RDB allows a larger growth rate by using local feature fusion (LFF), which stabilizes the training of a wide network.
- Local residual learning (LRL) is utilized in RDB to further encourage the flow of information and gradient. There are no dense connections among RDBs. Instead, global feature fusion (GFF) and global residual learning are used to extract global features.
4.3. Differences to MemNet
- In the memory block of MemNet, the preceding layers don’t have direct access to all the subsequent layers. The local feature information is not fully used, limiting the ability of long-term connections.
- In addition, MemNet extracts features in the HR space, increasing computational complexity, while in this paper, local and global features are extracted in the LR space.
5. Ablation Study
5.1. Settings
- DIV2K consists of 800 training images, 100 validation images, and 100 test images. All models are trained with the 800 training images, and 5 validation images are used during training.
- Set5, Set14, B100, Urban100, and Manga109 are used for testing.
5.2. Study of Number of RDB (D), Number of Conv per RDB (C), and Growth Rate (G)
- As shown above, larger D or C would lead to higher performance. This is mainly because the network becomes deeper with larger D or C.
- Since the proposed LFF allows a larger G, a larger G (see (c)) can also contribute to better performance.
5.3. Study of CM, LRL and GFF
- The above 8 networks have the same RDB number (D = 20), Conv number (C = 6) per RDB, and growth rate (G = 32).
- The one without CM, LRL, or GFF acts as the baseline and obtains a very poor result, caused by the difficulty during training.
- Adding any one of them validates that each component can efficiently improve the performance of the baseline, because each component contributes to the flow of information and gradient.
- Any two components perform better than only one, and RDN using all three components performs the best.
- The above convergence curves are consistent with the analyses above and show that CM, LRL, and GFF can further stabilize the training process without obvious performance drop.
6. State-Of-The-Art (SOTA) Comparisons
- Three degradation models are used (a code sketch follows this list):
- BI: Bicubic downsampling.
- BD: Blurring with a 7×7 Gaussian kernel, then downsampling.
- DN: Bicubic downsampling, then adding Gaussian noise with noise level 30.
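- A small NumPy/OpenCV sketch of the three degradations; the Gaussian standard deviation and other details are assumptions for illustration:

```python
import cv2
import numpy as np

def degrade(hr: np.ndarray, scale: int, mode: str = "BI") -> np.ndarray:
    """BI: bicubic downsampling; BD: 7x7 Gaussian blur, then downsampling;
    DN: bicubic downsampling, then Gaussian noise of level 30."""
    if mode == "BD":
        hr = cv2.GaussianBlur(hr, (7, 7), 1.6)  # assumed std of 1.6
    h, w = hr.shape[:2]
    lr = cv2.resize(hr, (w // scale, h // scale),
                    interpolation=cv2.INTER_CUBIC)
    if mode == "DN":
        noise = np.random.normal(0.0, 30.0, lr.shape)  # noise level 30
        lr = np.clip(lr.astype(np.float64) + noise, 0, 255).astype(np.uint8)
    return lr
```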
6.1. Results with BI Degradation Model
- RDN achieves the best average results on most datasets, outperforming SRCNN, LapSRN, DRRN, SRDenseNet, and MemNet.
- Specifically, for the scaling factor ×2, RDN performs the best on all datasets. When the scaling factor becomes larger (e.g., ×3 and ×4), RDN does not hold a similar advantage over MDSR. There are mainly three reasons for this:
- First, MDSR is deeper (160 vs. 128), having about 160 layers to extract features in the LR space.
- Second, MDSR utilizes multi-scale inputs as VDSR does.
- Third, MDSR uses a larger input patch size (65 vs. 32) for training.
- But RDN+ can achieve further improvement with self-ensemble, which is introduced in EDSR & MDSR.
- RDN can recover sharper and clearer edges that are more faithful to the ground truth, because RDN uses hierarchical features through dense feature fusion.
6.2. Results with BD and DN Degradation Models
- RDN and RDN+ perform the best on all the datasets with BD and DN degradation models, outperforming SRCNN, FSRCNN, VDSR, and IRCNN.
- RDN suppresses the blurring artifacts and recovers sharper edges.
- RDN can not only handle the noise efficiently, but also recover more details.
6.3. Super-Resolving Real-World Images
- Two representative real-world images, “chip” (with 244×200 pixels) and “hatc” (with 133×174 pixels), are tested.
- RDN recovers sharper edges and finer details than other state-of-the-art methods.
During the days of coronavirus, I challenged myself to write 30 stories this month, and then extended the challenge to 35 stories. This is the 35th story this month, and also the last one of the month. Thanks for visiting my story…
Reference
[2018 CVPR] [RDN]
Residual Dense Network for Image Super-Resolution
Super Resolution
[SRCNN] [FSRCNN] [VDSR] [ESPCN] [RED-Net] [DnCNN] [DRCN] [DRRN] [LapSRN & MS-LapSRN] [MemNet] [IRCNN] [WDRN / WavResNet] [MWCNN] [SRDenseNet] [SRGAN & SRResNet] [EDSR & MDSR] [MDesNet] [RDN] [SR+STN]