Review: RCAN — Deep Residual Channel Attention Networks (Super Resolution)

ResNet + SENet, Outperforms SRCNN, FSRCNN, VDSR, LapSRN & MS-LapSRN, DRCN, IRCNN, MemNet, EDSR, SRMDNF, D-DBPN, and RDN

Sik-Ho Tsang
9 min read · May 1, 2020

In this story, Deep Residual Channel Attention Networks (RCAN), by Northeastern University, is reviewed. In RCAN:

  • A residual in residual (RIR) structure is proposed to form a very deep network, which consists of several residual groups with long skip connections. Each residual group contains some residual blocks with short skip connections.
  • Furthermore, a channel attention mechanism is proposed to adaptively rescale channel-wise features by considering interdependencies among channels.

This is a paper in 2018 ECCV with about 400 citations. (Sik-Ho Tsang @ Medium)

Outline

  1. Network Architecture
  2. Residual in Residual (RIR)
  3. Channel Attention (CA)
  4. Residual Channel Attention Block (RCAB)
  5. Ablation Study
  6. State-Of-The-Art (SOTA) Comparison

1. Network Architecture

Network architecture of the residual channel attention network (RCAN)
  • As shown in the above figure, RCAN mainly consists of four parts: shallow feature extraction, residual in residual (RIR) deep feature extraction, an upscale module, and a reconstruction part.

1.1. Shallow Feature Extraction

  • Only one convolutional layer (Conv) is used to extract the shallow feature F0 from the LR input: F0 = HSF(ILR), where HSF denotes the convolution operation.

1.2. Residual in Residual (RIR) Deep Feature Extraction

  • F0 is then used for deep feature extraction with the RIR module: FDF = HRIR(F0), where HRIR denotes the proposed very deep residual in residual structure, which contains G residual groups (RGs).
  • The proposed RIR achieves the largest depth so far and provides a very large receptive field.

1.3. Upscale Module

  • The deep feature FDF from RIR is then upscaled via an upscale module: FUP = HUP(FDF), where HUP denotes the upscale module; ESPCN (sub-pixel convolution) is used as the upscale module. (Please read ESPCN if interested.)
  • The upscaled feature is then reconstructed via one Conv layer: ISR = HREC(FUP) = HRCAN(ILR), where HREC and HRCAN denote the reconstruction layer and the function of the whole RCAN respectively.
  • The standard L1 loss over N training pairs is used: L(Θ) = (1/N) Σi ||ISR^i − IHR^i||1.
  • The number of RGs is set to G = 10 in the RIR structure.
  • In each RG, the number of RCABs is set to B = 20.
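To make the pipeline concrete, below is a minimal PyTorch sketch of the four parts under my own naming (RCAN, ResidualGroup, etc. are hypothetical class names, not the authors' official code; the official code stacks ×2 sub-pixel steps for ×4 upscaling, while a single PixelShuffle is used here for brevity). ResidualGroup and RCAB are sketched in Sections 2 and 4.

```python
import torch
import torch.nn as nn

class RCAN(nn.Module):
    """Minimal RCAN skeleton: shallow Conv -> RIR -> sub-pixel upscale -> reconstruction Conv."""
    def __init__(self, n_groups=10, n_blocks=20, channels=64, scale=4):
        super().__init__()
        self.head = nn.Conv2d(3, channels, 3, padding=1)          # HSF: shallow feature extraction
        self.body = nn.Sequential(                                # HRIR: G residual groups + tail Conv
            *[ResidualGroup(channels, n_blocks) for _ in range(n_groups)],
            nn.Conv2d(channels, channels, 3, padding=1))
        self.upscale = nn.Sequential(                             # HUP: ESPCN-style sub-pixel upscaling
            nn.Conv2d(channels, channels * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale))
        self.tail = nn.Conv2d(channels, 3, 3, padding=1)          # HREC: reconstruction

    def forward(self, x):
        f0 = self.head(x)            # F0 = HSF(ILR)
        f_df = f0 + self.body(f0)    # FDF = F0 + WLSC·FG (long skip connection)
        return self.tail(self.upscale(f_df))
```

Training then simply minimizes nn.L1Loss() between the network output and the HR ground truth.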

2. Residual in Residual (RIR)

  • Inspired by previous works such as SRResNet and EDSR, the RIR structure (shown in the figure above) is proposed, which contains G residual groups (RGs) and a long skip connection (LSC).
  • Each RG further contains B residual channel attention blocks (RCABs) with a short skip connection (SSC). Such an RIR structure allows training very deep CNNs (over 400 layers) for image SR with high performance.
  • The residual group (RG) serves as the basic module for deeper networks. The g-th RG is formulated as: Fg = Hg(Fg-1), where Hg denotes the function of the g-th RG, and Fg-1 and Fg are the input and output of the g-th RG.
  • Simply stacking many RGs would fail to achieve better performance. To solve this problem, the long skip connection (LSC) is further introduced in RIR to stabilize the training of the very deep network: FDF = F0 + WLSC·FG, where WLSC is the weight of the Conv layer at the tail of RIR (bias omitted for simplicity).
  • LSC not only eases the flow of information across RGs, but also makes it possible for RIR to learn residual information at a coarse level.
  • B residual channel attention blocks are stacked in each RG. The b-th residual channel attention block (RCAB) in the g-th RG can be formulated as: Fg,b = Hg,b(Fg,b-1), where Fg,b-1 and Fg,b are the input and output of the b-th RCAB in the g-th RG.
  • A short skip connection (SSC) is introduced to obtain the group output: Fg = Fg-1 + Wg·Fg,B, where Wg is the weight of the Conv layer at the tail of the g-th RG.
  • With LSC and SSC, abundant low-frequency information can be bypassed more easily during training.
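Under the same assumptions as the sketch in Section 1 (my class names, not the official code), one RG could look like this; RCAB itself is sketched in Section 4:

```python
import torch.nn as nn

class ResidualGroup(nn.Module):
    """One RG: B stacked RCABs, a tail Conv (Wg), and a short skip connection (SSC)."""
    def __init__(self, channels=64, n_blocks=20):
        super().__init__()
        self.blocks = nn.Sequential(
            *[RCAB(channels) for _ in range(n_blocks)],       # B RCABs
            nn.Conv2d(channels, channels, 3, padding=1))      # Conv at the tail of the RG (Wg)

    def forward(self, x):
        return x + self.blocks(x)    # Fg = Fg-1 + Wg·Fg,B  (SSC)
```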

3. Channel Attention (CA)

Channel attention (CA). ⊗ denotes element-wise product

3.1. Global Average Pooling

  • As shown above, the interdependencies among feature channels are exploited, resulting in a channel attention (CA) mechanism.
  • This mechanism originates from SENet. (If interested, please read SENet.)
  • How to generate different attention for each channel-wise feature is a key step. There are mainly two concerns:
  • First, information in the LR space has abundant low-frequency and valuable high-frequency components. The low-frequency parts seem to be flatter, while the high-frequency components are usually regions full of edges, textures, and other details.
  • Second, each filter in a Conv layer operates with a local receptive field; consequently, the output after convolution is unable to exploit contextual information outside the local region.
  • Based on these analyses, the channel-wise global spatial information is squeezed into a channel descriptor by global average pooling: zc = HGP(xc) = (1/(H×W)) Σi Σj xc(i, j), where xc(i, j) is the value at position (i, j) of the c-th feature map xc of size H×W.
  • Such a channel statistic can be viewed as a collection of local descriptors whose statistics contribute to expressing the whole image.
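For intuition, the channel descriptor is just a per-channel spatial mean; a two-line PyTorch illustration (the tensor shape is my own example):

```python
import torch

x = torch.randn(1, 64, 48, 48)    # a batch of C = 64 feature maps of size H×W = 48×48
z = x.mean(dim=(2, 3))            # zc = (1/(H·W)) Σi Σj xc(i, j); shape (1, 64)
```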

3.2. Gating Mechanism

  • A gating mechanism is then introduced. The gating mechanism should meet two criteria:
  • First, it must be able to learn nonlinear interactions between channels.
  • Second, as multiple channel-wise features can be emphasized opposed to one-hot activation, it must learn a non-mutually-exclusive relationship.
  • Here, the sigmoid function is adopted as a simple gating mechanism: s = f(WU δ(WD z)), where f is the sigmoid gating function, δ is the ReLU function, WD is the weight of the channel-downscaling Conv layer, and WU is the weight of the channel-upscaling Conv layer.
  • The final channel statistic s is then used to rescale the input xc: x̂c = sc · xc, where sc and xc are the scaling factor and feature map of the c-th channel.
  • With channel attention, the residual component in the RCAB is adaptively rescaled.
  • Conv layers in the shallow feature extraction and the RIR structure have C = 64 filters.
  • The channel-downscaling Conv layer has C/r = 4 filters, where the reduction ratio r is set to 16.
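Putting the pooling and gating together, here is a minimal sketch of the CA module (my own naming, following the SENet-style squeeze-and-excitation design described above):

```python
import torch.nn as nn

class ChannelAttention(nn.Module):
    """CA: global average pooling -> Conv down (WD) -> ReLU -> Conv up (WU) -> sigmoid gate."""
    def __init__(self, channels=64, reduction=16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                            # HGP: global average pooling
            nn.Conv2d(channels, channels // reduction, 1),      # WD: channel-downscaling (C/r = 4)
            nn.ReLU(inplace=True),                              # δ
            nn.Conv2d(channels // reduction, channels, 1),      # WU: channel-upscaling
            nn.Sigmoid())                                       # f: sigmoid gating

    def forward(self, x):
        return x * self.gate(x)    # rescale each channel: x̂c = sc · xc
```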

4. Residual Channel Attention Block (RCAB)

Residual channel attention block (RCAB)
  • CA is integrated into the residual block (RB) to form the residual channel attention block (RCAB) (see the above figure). The b-th RCAB in the g-th RG is: Fg,b = Fg,b-1 + Rg,b(Xg,b)·Xg,b,
  • where Rg,b denotes the channel attention function. Fg,b-1 and Fg,b are the input and output of the RCAB, which learns the residual Xg,b from the input. The residual component is mainly obtained by two stacked Conv layers: Xg,b = W2g,b δ(W1g,b Fg,b-1).
  • It is found that the RBs used in EDSR and MDSR can be viewed as special cases of RCAB.
  • For the RB in MDSR, there is no rescaling operation; it is the same as RCAB with Rg,b set to a constant 1.
  • For the RB with constant rescaling (e.g., 0.1) in EDSR, it is the same as RCAB with Rg,b set to 0.1.
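Combining the two Conv layers with the CA module above gives a minimal RCAB sketch (again my own naming, consistent with the earlier snippets):

```python
import torch.nn as nn

class RCAB(nn.Module):
    """Residual channel attention block: Conv -> ReLU -> Conv -> CA, plus an identity shortcut."""
    def __init__(self, channels=64, reduction=16):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),    # W1
            nn.ReLU(inplace=True),                          # δ
            nn.Conv2d(channels, channels, 3, padding=1),    # W2 -> residual Xg,b
            ChannelAttention(channels, reduction))          # Rg,b(Xg,b)·Xg,b

    def forward(self, x):
        return x + self.body(x)    # Fg,b = Fg,b-1 + Rg,b(Xg,b)·Xg,b
```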

5. Ablation Study

5.1. Settings

  • The 800 training images from the DIV2K dataset are used as the training set.
  • For testing, standard benchmark datasets are used: Set5, Set14, B100, Urban100, and Manga109.
  • Experiments are conducted with Bicubic (BI) and blur-downscale (BD) degradation models.
  • PSNR and SSIM on Y channel are measured.
  • Data augmentation is performed on the 800 training images, which are randomly rotated by 90°, 180°, or 270° and flipped horizontally (see the sketch after this list).
  • In each training batch, 16 LR color patches with the size of 48×48 are extracted as input.
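A minimal sketch of that augmentation step (my own helper, assuming (C, H, W) tensors and paired LR/HR patches):

```python
import random
import torch

def augment(lr_patch, hr_patch):
    """Randomly rotate by 0/90/180/270 degrees and horizontally flip, identically for LR and HR."""
    k = random.randint(0, 3)                            # number of 90-degree rotations
    lr_patch = torch.rot90(lr_patch, k, dims=(1, 2))
    hr_patch = torch.rot90(hr_patch, k, dims=(1, 2))
    if random.random() < 0.5:                           # horizontal flip with probability 0.5
        lr_patch = torch.flip(lr_patch, dims=(2,))
        hr_patch = torch.flip(hr_patch, dims=(2,))
    return lr_patch, hr_patch
```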

5.2. Effects of RIR and CA

Investigations of RIR (including LSC and SSC) and CA
  • The number of residual blocks is set to 200, namely 10 residual groups with 20 RCABs each, resulting in very deep networks with over 400 Conv layers.
  • In the above Table, when both LSC and SSC are removed, the PSNR value on Set5 (×2) is relatively low, 37.45dB.
  • After adding RIR, the performance reaches 37.87 dB.
  • When CA is added on top of RIR, the performance reaches 37.90 dB.
  • This indicates that simply stacking residual blocks is not sufficient to achieve very deep and powerful networks for image SR. The performance increases with LSC or SSC, and the best results are obtained by using both of them.
  • From the above table, it is also found that networks with CA perform better than those without CA. Benefiting from the very large network depth, the very deep trainable networks achieve very high performance.
  • Even without RIR, CA can improve the performance from 37.45 dB to 37.52 dB.

6. State-Of-The-Art (SOTA) Comparison

6.1. Results with Bicubic (BI) Degradation Model

Quantitative results with BI degradation model.
  • RCAN+ performs the best on all datasets and all scaling factors, outperforming SRCNN, FSRCNN, VDSR, LapSRN, MemNet, EDSR, SRMDNF, D-DBPN, and RDN.
  • Even without self-ensemble, RCAN also outperforms the other compared methods. Moreover, when the scaling factor becomes larger (e.g., ×8), the gain of RCAN over EDSR also becomes larger.
  • Instead of constantly rescaling the features as in EDSR, RCAN adaptively rescales features with channel attention (CA), which allows the network to focus on more informative features.
Visual comparison for ×4 SR with BI model on Urban100 and Manga109
  • For image “img_004”, it is observed that most of the compared methods cannot recover the lattices and would suffer from blurring artifacts. In contrast, RCAN can alleviate the blurring artifacts better and recover more details.
  • For “img_073”, most of the compared methods produce blurring artifacts along the horizontal lines. Only RCAN produces more faithful results.
  • For “YumeiroCooking”, the cropped part is full of textures. All the compared methods suffer from heavy blurring artifacts and fail to recover the details, while RCAN recovers them clearly, being more faithful to the ground truth.
Visual comparison for ×8 SR with BI model on Urban100 and Manga109
  • For image “img_040”, due to the very large scaling factor, the Bicubic result loses the original structures and produces misleading ones. This wrong pre-scaling also leads some state-of-the-art methods (e.g., SRCNN, VDSR, and MemNet) to generate totally wrong structures, while RCAN recovers them correctly.
  • For smaller details, as in the image “TaiyouNiSmash”, the tiny lines can be lost in the LR image. Most of the compared methods cannot recover them and produce serious blurring artifacts. However, RCAN obtains more useful information and produces finer results.

6.2. Results with Blur-downscale (BD) Degradation Model

Quantitative results with BD degradation model
  • As shown in the above table, RDN already achieves very high performance on each dataset, yet RCAN obtains notable gains over it. Using self-ensemble, RCAN+ achieves even better results, outperforming SRCNN, FSRCNN, VDSR, IRCNN, SRMDNF, and RDN.
Visual comparison for ×3 SR with BD model on Urban100
  • For challenging details in images “img_062” and “img_078”, most methods suffer from heavy blurring artifacts.
  • RDN alleviates it to some degree and can recover more details. In contrast, RCAN obtains much better results by recovering more informative components.

6.3. Object Recognition Performance

ResNet object recognition performance
  • ResNet-50 is used as the evaluation model, with the first 1,000 images from the ImageNet CLS-LOC validation dataset used for evaluation. The original cropped 224×224 images serve as the baseline and are downscaled to 56×56 for the SR methods. The LR images are upscaled by each SR method, and recognition accuracies are then computed.
  • As shown in the table above, RCAN achieves the lowest top-1 and top-5 errors. These comparisons further demonstrate the highly powerful representational ability of RCAN.

6.4. Model Size Analyses

Performance and number of parameters.
  • Although RCAN is the deepest network, it has fewer parameters than EDSR and RDN, while RCAN and RCAN+ achieve higher performance.
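This is easy to sanity-check with the sketch classes from the earlier sections (my own rough count, not the paper's script):

```python
model = RCAN(n_groups=10, n_blocks=20, channels=64, scale=4)
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")   # about 16M for this sketch, vs ~43M for EDSR
```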

Should I have a challenge in this month again …?
