Reading: EDVR — Video Restoration with Enhanced Deformable Convolutional Networks (Video Super Resolution)
1st Place in NTIRE19 Video Restoration and Enhancement Challenges; Utilizes DCNv2; Outperforms RCAN, DUF, VESPCN, RBPN, etc.
Jul 14, 2020
In this story, Video Restoration with Enhanced Deformable Convolutional Networks (EDVR), by The Chinese University of Hong Kong, Nanyang Technological University, and Chinese Academy of Sciences, is presented. In this paper:
- A Pyramid, Cascading and Deformable (PCD) alignment module is used to align frames at the feature level in order to handle large motions.
- A Temporal and Spatial Attention (TSA) fusion module is proposed, in which attention is applied both temporally and spatially to emphasize important features for the subsequent restoration.
- Finally, EDVR wins first place, outperforming the runner-up by a large margin in all four tracks of the NTIRE19 video restoration and enhancement challenges.
This is a paper in 2019 CVPRW with over 50 citations. (Sik-Ho Tsang @ Medium)
Outline
- EDVR: Network Architecture
- Pyramid, Cascading and Deformable (PCD) Alignment Module
- Temporal and Spatial Attention (TSA) Fusion Module
- Two-Stage Restoration
- Ablation Study
- SOTA Comparison
1. EDVR: Network Architecture
- EDVR takes 2N+1 low-resolution frames as inputs and generates a high-resolution output.
- Each neighboring frame is aligned to the reference one by the PCD alignment module at the feature level.
- The TSA fusion module fuses image information of different frames.
- The fused features then pass through a reconstruction module, which is a cascade of residual blocks.
- The upsampling operation is performed at the end of the network to increase the spatial size.
- Finally, the high-resolution frame Ô_t is obtained by adding the predicted image residual to a directly upsampled image. A minimal sketch of this data flow is given below.
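To make this data flow concrete, here is a minimal PyTorch-style sketch of the pipeline. It is an illustrative reconstruction, not the official code: `EDVRSketch` and `ResidualBlock` are hypothetical names, and plain convolutions stand in for the PCD and TSA modules described in the next sections.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Plain residual block used in the reconstruction module."""
    def __init__(self, nf):
        super().__init__()
        self.conv1 = nn.Conv2d(nf, nf, 3, 1, 1)
        self.conv2 = nn.Conv2d(nf, nf, 3, 1, 1)

    def forward(self, x):
        return x + self.conv2(F.relu(self.conv1(x)))

class EDVRSketch(nn.Module):
    """Top-level EDVR data flow (simplified sketch). The real PCD and TSA
    modules are replaced by plain convolutions purely to keep this short."""
    def __init__(self, nf=64, nframes=5, nblocks=10):
        super().__init__()
        self.center = nframes // 2
        self.extract = nn.Conv2d(3, nf, 3, 1, 1)
        self.align = nn.Conv2d(nf * 2, nf, 3, 1, 1)      # stand-in for PCD alignment (Sec. 2)
        self.fuse = nn.Conv2d(nf * nframes, nf, 1)       # stand-in for TSA fusion (Sec. 3)
        self.recon = nn.Sequential(*[ResidualBlock(nf) for _ in range(nblocks)])
        self.up = nn.Sequential(                         # x4 upsampling via two PixelShuffles
            nn.Conv2d(nf, nf * 4, 3, 1, 1), nn.PixelShuffle(2),
            nn.Conv2d(nf, 3 * 4, 3, 1, 1), nn.PixelShuffle(2))

    def forward(self, lrs):                              # lrs: (B, 2N+1, 3, H, W)
        b, t, c, hgt, wid = lrs.shape
        feats = self.extract(lrs.view(-1, c, hgt, wid)).view(b, t, -1, hgt, wid)
        ref = feats[:, self.center]
        aligned = [self.align(torch.cat([feats[:, i], ref], dim=1)) for i in range(t)]
        fused = self.fuse(torch.cat(aligned, dim=1))
        res = self.up(self.recon(fused))                 # predicted HR image residual
        base = F.interpolate(lrs[:, self.center], scale_factor=4,
                             mode='bilinear', align_corners=False)
        return base + res                                # Ô_t = residual + upsampled reference
```

For instance, `EDVRSketch()(torch.randn(1, 5, 3, 32, 32))` would produce a (1, 3, 128, 128) output, i.e. a ×4 super-resolved centre frame.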
2. Pyramid, Cascading and Deformable (PCD) Alignment Module
- Deformable alignment is applied on features of each frame.
- The modulated deformable module in DCNv2, i.e. DConv in the above figure, is used, with pyramidal processing and cascading refinement.
- Specifically, as shown with black dash lines in the above figure, to generate the feature F^l_{t+i} at the l-th level, strided convolution filters are used to downsample the features at the (l−1)-th pyramid level by a factor of 2, obtaining an L-level pyramid of feature representations.
- At the l-th level, offsets and aligned features are predicted also from the ×2 upsampled offsets and aligned features of the upper (l+1)-th level, respectively (purple dash lines in the above figure):

ΔP^l_{t+i} = f([F^l_{t+i}, F^l_t], (ΔP^{l+1}_{t+i})^{↑2})
(F^a_{t+i})^l = g(DConv(F^l_{t+i}, ΔP^l_{t+i}), ((F^a_{t+i})^{l+1})^{↑2})

- where ΔP is the learned offset as in DCNv2, f and g are general functions with several convolution layers, [·,·] denotes concatenation, and (·)^{↑2} denotes ×2 upsampling, implemented with bilinear interpolation.
- The coarse-to-fine design of the PCD module improves the alignment to sub-pixel accuracy; a single-level sketch follows.
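As an illustration of one pyramid level, here is a hedged sketch built on `torchvision.ops.DeformConv2d`, whose `mask` argument gives the modulated (DCNv2-style) deformable convolution. The layer widths and the exact cascading wiring are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import DeformConv2d

class PCDLevel(nn.Module):
    """One pyramid level of PCD-style alignment (illustrative). Offsets are
    predicted from the concatenated neighbor/reference features, refined with
    the x2-upsampled offset features from the coarser (l+1)-th level, and the
    deformably aligned features are refined with the upsampled coarser ones."""
    def __init__(self, nf=64, ksize=3):
        super().__init__()
        self.off1 = nn.Conv2d(nf * 2, nf, 3, 1, 1)       # from [F^l_{t+i}, F^l_t]
        self.off2 = nn.Conv2d(nf * 2, nf, 3, 1, 1)       # merge with coarser offsets
        self.off_head = nn.Conv2d(nf, 2 * ksize * ksize, 3, 1, 1)
        self.mask_head = nn.Conv2d(nf, ksize * ksize, 3, 1, 1)  # DCNv2 modulation mask
        self.dconv = DeformConv2d(nf, nf, ksize, padding=ksize // 2)
        self.feat = nn.Conv2d(nf * 2, nf, 3, 1, 1)       # merge with coarser aligned feat

    def forward(self, feat_nbr, feat_ref, coarser_off=None, coarser_aligned=None):
        off = F.relu(self.off1(torch.cat([feat_nbr, feat_ref], dim=1)))
        if coarser_off is not None:                       # cascade from level l+1
            up = F.interpolate(coarser_off, scale_factor=2, mode='bilinear',
                               align_corners=False) * 2   # rescale for doubled resolution
            off = F.relu(self.off2(torch.cat([off, up], dim=1)))
        aligned = self.dconv(feat_nbr, self.off_head(off),
                             torch.sigmoid(self.mask_head(off)))
        if coarser_aligned is not None:
            up = F.interpolate(coarser_aligned, scale_factor=2, mode='bilinear',
                               align_corners=False)
            aligned = self.feat(torch.cat([aligned, up], dim=1))
        return aligned, off                               # `off` feeds the next finer level
```

Calling this module from the coarsest level L down to the finest, passing each level's `off` and `aligned` into the next call, reproduces the cascaded coarse-to-fine refinement described above.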
3. Temporal and Spatial Attention (TSA) Fusion Module
- Inter-frame temporal relation and intra-frame spatial relation are critical in fusion because:
- different neighboring frames are not equally informative due to occlusion, blurry regions and parallax problems;
- misalignment and unalignment arising from the preceding alignment stage adversely affect the subsequent reconstruction performance.
- The TSA fusion module is proposed to assign pixel-level aggregation weights to each frame. Specifically, temporal and spatial attention are adopted during the fusion process.
- The goal of temporal attention is to compute frame similarity in an embedding space: frames that are more similar to the reference one should receive more attention.
- The similarity distance h can be calculated as:

h(F^a_{t+i}, F^a_t) = sigmoid(θ(F^a_{t+i})^T φ(F^a_t))

- where θ(F^a_{t+i}) and φ(F^a_t) are two embeddings, which can be achieved with simple convolution filters.
- The temporal attention maps are then multiplied in a pixel-wise manner with the original aligned features F^a_{t+i}.
- An extra fusion convolution layer is adopted to aggregate these attention-modulated features ~F^a_{t+i}:

~F^a_{t+i} = F^a_{t+i} ⊙ h(F^a_{t+i}, F^a_t)
F_fusion = Conv([~F^a_{t−N}, …, ~F^a_t, …, ~F^a_{t+N}])

- where ⊙ and [·, …, ·] denote element-wise multiplication and concatenation, respectively.
- Spatial attention masks are then computed from the fused features.
- A pyramid design is employed to increase the attention receptive field. After that, the fused features are modulated by the masks through element-wise multiplication and addition; a simplified sketch of the whole module is shown below.
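Below is a minimal, hedged sketch of TSA-style fusion. The embedding convolutions, the single fusion layer, and the single-scale spatial attention head are simplifications (the paper uses a pyramid for the spatial masks); all layer shapes are assumptions.

```python
import torch
import torch.nn as nn

class TSAFusionSketch(nn.Module):
    """Simplified TSA fusion. Temporal attention weights each aligned frame
    per pixel by its embedded similarity to the reference frame; a fusion conv
    aggregates the weighted frames; spatial attention then modulates the fused
    feature (the paper's pyramid for the spatial masks is omitted here)."""
    def __init__(self, nf=64, nframes=5):
        super().__init__()
        self.center = nframes // 2
        self.emb_ref = nn.Conv2d(nf, nf, 3, 1, 1)        # phi(F^a_t)
        self.emb_nbr = nn.Conv2d(nf, nf, 3, 1, 1)        # theta(F^a_{t+i})
        self.fusion = nn.Conv2d(nf * nframes, nf, 1)     # extra fusion conv layer
        self.sp_mask = nn.Conv2d(nf, nf, 3, 1, 1)        # spatial attention mask
        self.sp_add = nn.Conv2d(nf, nf, 1)               # additive spatial branch

    def forward(self, aligned):                          # aligned: (B, T, C, H, W)
        b, t, c, h, w = aligned.shape
        ref = self.emb_ref(aligned[:, self.center])
        weighted = []
        for i in range(t):
            emb = self.emb_nbr(aligned[:, i])
            # h(F^a_{t+i}, F^a_t) = sigmoid(theta(F^a_{t+i}) . phi(F^a_t)), per pixel
            att = torch.sigmoid((emb * ref).sum(dim=1, keepdim=True))
            weighted.append(aligned[:, i] * att)         # pixel-wise modulation
        fused = self.fusion(torch.cat(weighted, dim=1))
        mask = torch.sigmoid(self.sp_mask(fused))
        return fused * mask + self.sp_add(fused)         # multiply-and-add modulation
```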
4. Two-Stage Restoration
- The restored images from a single stage are not perfect, especially when the input frames are blurry or severely distorted, so a two-stage strategy is employed to further boost the performance.
- Specifically, a similar but shallower EDVR network is cascaded to refine the output frames of the first stage (see the sketch after this list). The benefits are two-fold:
- It effectively removes the severe motion blur that cannot be handled in the preceding model, improving the restoration quality;
- It alleviates the inconsistency among output frames.
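As a rough illustration of the cascade, here is a hypothetical helper that runs a shallower second network over sliding windows of stage-1 outputs; the window radius, the edge handling, and the `stage1`/`stage2` callables are all assumptions rather than the paper's exact procedure.

```python
import torch

def two_stage_restore(lr_clips, stage1, stage2, radius=2):
    """Hedged sketch of two-stage restoration. `stage1` maps an LR clip
    (B, T, 3, H, W) to a restored centre frame (B, 3, sH, sW); `stage2` is a
    similar but shallower network that refines windows of stage-1 outputs at
    the HR scale."""
    first = [stage1(clip) for clip in lr_clips]          # per-frame stage-1 outputs
    refined = []
    for i in range(len(first)):
        # Build a temporal window around frame i, clamping indices at the ends.
        idx = [min(max(j, 0), len(first) - 1)
               for j in range(i - radius, i + radius + 1)]
        window = torch.stack([first[j] for j in idx], dim=1)  # (B, 2r+1, 3, sH, sW)
        refined.append(stage2(window))                   # refined centre frame
    return refined
```

Refining with a temporal window, rather than frame by frame, is what lets the second stage smooth out the inconsistency among output frames.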
5. Ablation Study
5.1. PCD and TSA Modules
- Baseline (Model 1) only adopts one deformable convolution for alignment.
- Model 2 follows the design of TDAN [40] to use four deformable convolutions for alignment, achieving an improvement of 0.2 dB.
- With the proposed PCD module, Model 3 is nearly 0.4 dB better than Model 2 with roughly the same computational cost, demonstrating the effectiveness of PCD alignment module.
- With the TSA attention module added, Model 4 achieves a 0.14 dB performance gain over Model 3 with similar computation.
- Compared with the flow computed without PCD alignment, the flow of the PCD-aligned outputs is much smaller and cleaner, indicating that the PCD module can successfully handle large and complex motions.
- It is observed that the frames and regions with lower flow magnitude tend to have higher attention, indicating that the smaller the motion is, the more informative the corresponding frames and regions are.
5.2. Dataset Bias
- The results show that there exists a large dataset bias.
- The performance decreases by 0.5–1.5 dB when the distributions of the training and testing data mismatch.
6. SOTA Comparison
6.1. Qualitative Comparison
- EDVR obtains much sharper images compared with others.
6.2. Evaluation on REDS Dataset in the NTIRE19 Video Restoration and Enhancement Challenges
- EDVR wins first place, outperforming the runner-up by a large margin in all four tracks.
- At test time, self-ensemble is used: the input is flipped and rotated to generate four augmented inputs for each sample; EDVR is applied to each, the transformation is reversed on the restored outputs, and the results are averaged to give the final result (see the sketch after this list).
- The two-stage restoration is also used.
- ‘+’ and ‘-S2’ denote the self-ensemble strategy and two-stage restoration strategy, respectively.
- The two-stage restoration largely improves the performance around 0.5 dB (EDVR(+) vs. EDVR-S2(+)).
- While the self-ensemble is helpful in the first stage (EDVR vs. EDVR+), it only brings marginal improvement in the second stage (EDVR-S2 vs. EDVR-S2+).
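For concreteness, here is a hedged sketch of such a flip/rotate self-ensemble; the exact set of four transforms is an assumption, since the post only states that flipping and rotation yield four augmented inputs.

```python
import torch

def self_ensemble(model, lr_clip):
    """Sketch of a flip/rotate self-ensemble (the four transforms below are an
    assumption). Each invertible transform is applied to the input clip, the
    model runs on the transformed clip, the inverse is applied to the output,
    and the four restored frames are averaged."""
    tfs = [
        (lambda x: x,                    lambda y: y),                     # identity
        (lambda x: x.flip(-1),           lambda y: y.flip(-1)),            # horizontal flip
        (lambda x: x.flip(-2),           lambda y: y.flip(-2)),            # vertical flip
        (lambda x: x.rot90(1, (-2, -1)), lambda y: y.rot90(-1, (-2, -1))), # 90-deg rotation
    ]
    outs = [inv(model(tf(lr_clip))) for tf, inv in tfs]
    return torch.stack(outs, dim=0).mean(dim=0)
```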
This is the 10th story this month.
Reference
[2019 CVPRW] [EDVR]
EDVR: Video Restoration with Enhanced Deformable Convolutional Networks
Video Super Resolution
[STMC / VESPCN] [VSR-DUF / DUF] [EDVR]