Reading: VSR-DUF / DUF — Dynamic Upsampling Filters Without Explicit Motion Compensation (Video Super Resolution)

With Learned Upsampling Filters, Outperforms STMC / VESPCN & VSRnet

Sik-Ho Tsang
5 min read · Jul 13, 2020
×4 VSR for the scene ferriswheel. Bottom right: VSR-DUF / DUF produces sharper images

In this story, Deep Video Super-Resolution Network Using Dynamic Upsampling Filters Without Explicit Motion Compensation (VSR-DUF / DUF), by Yonsei University, is presented. Some papers call it DUF, while it is called VSR-DUF on GitHub. In this paper:

  • Dynamic upsampling filters (DUF) are computed depending on the local spatio-temporal neighborhood of each pixel to avoid explicit motion compensation.
  • With residual learning, fine details can also be reconstructed.
  • Temporal data augmentation is also proposed.

This is a paper in 2018 CVPR with over 90 citations. (Sik-Ho Tsang @ Medium)

Outline

  1. VSR-DUF / DUF: Overall Scheme
  2. VSR-DUF / DUF: Network Architecture
  3. Temporal Data Augmentation
  4. Experimental Results

1. VSR-DUF / DUF: Overall Scheme

1.1. Dynamic Upsampling Filters

VSR-DUF / DUF: Overall Scheme
  • With Xt−N to Xt+N as the input LR frames going through the network G, we obtain the reconstructed HR frame Ŷt = G(Xt−N:t+N).
  • The input tensor shape for G is T×H×W×C, where T=2N+1, H and W are the height and width of the input LR frames, and C is the number of color channels. The corresponding output tensor shape is 1×rH×rW×C, where r is the upscaling factor.
  • First, a set of input LR frames {Xt−N:t+N} (7 frames in the network: N=3) is fed into the dynamic filter generation network.
  • The trained network outputs a set of r²HW upsampling filters Ft of a certain size (5×5 in the network), which will be used to generate new pixels in the filtered HR frame Ỹt.
  • Finally, each output HR pixel value is created by local filtering on an LR pixel in the input frame Xt with the corresponding filter Ft^{y,x,v,u}, where v, u ∈ [0, r−1], as follows (see the sketch after this list):

Ỹt(yr+v, xr+u) = Σ_{j=−2..2} Σ_{i=−2..2} Ft^{y,x,v,u}(j+2, i+2) · Xt(y+j, x+i)
  • (This is similar to the approaches AdaConv or SepConv, where the filters are learned from the input, but those are for video frame interpolation.)
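To make the filtering concrete, here is a minimal NumPy sketch of applying the dynamic upsampling filters to a single-channel LR frame. All names and shapes (dynamic_upsample, the (H, W, r, r, k, k) filter layout) are illustrative assumptions, not the paper's code; in the actual network, the filters come from the filter generation branch and are softmax-normalized.

```python
import numpy as np

def dynamic_upsample(X_t, F_t, r=4, k=5):
    """Apply per-pixel dynamic upsampling filters (illustrative shapes).
    X_t: (H, W) single-channel LR frame.
    F_t: (H, W, r, r, k, k) filters, one k x k filter per HR sub-position.
    Returns the filtered HR frame of shape (r*H, r*W)."""
    H, W = X_t.shape
    pad = k // 2
    Xp = np.pad(X_t, pad, mode="edge")  # replicate-pad the LR frame
    Y = np.zeros((r * H, r * W), dtype=X_t.dtype)
    for y in range(H):
        for x in range(W):
            patch = Xp[y:y + k, x:x + k]  # k x k LR neighborhood around (y, x)
            for v in range(r):
                for u in range(r):
                    # HR pixel (yr+v, xr+u) is a weighted sum of the LR patch
                    Y[y * r + v, x * r + u] = np.sum(F_t[y, x, v, u] * patch)
    return Y

# Toy usage: random frame and normalized random filters
H, W, r, k = 8, 8, 4, 5
X = np.random.rand(H, W).astype(np.float32)
F = np.random.rand(H, W, r, r, k, k).astype(np.float32)
F /= F.sum(axis=(-2, -1), keepdims=True)  # each filter sums to 1, as with softmax
Y_tilde = dynamic_upsample(X, F, r, k)    # shape (32, 32)
```

The nested loops mirror the equation above directly; a real implementation would vectorize them, but the per-pixel weighted sum is the whole idea.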

1.2. Residual Learning

  • The result after applying the dynamic upsampling filters alone lacks sharpness as it is still a weighted sum of input pixels.
  • To address this, a residual image Rt is additionally estimated to add high-frequency details; it is added to the filtered output Ỹt to produce the final output Ŷt.
Effects of residual learning
  • With residual learning, much sharper images are obtained.
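A toy numeric example of why the residual matters: with softmax-normalized (non-negative, sum-to-one) filters, the filtered value is a convex combination of input pixels and can never leave the local range of the patch, so it cannot create new high-frequency detail on its own. All values below are made up for illustration.

```python
import numpy as np

patch = np.array([0.2, 0.4, 0.6], dtype=np.float32)   # toy LR neighborhood
w = np.array([0.2, 0.5, 0.3], dtype=np.float32)       # toy filter, sums to 1
filtered = float(w @ patch)                           # 0.42, inside [0.2, 0.6]

residual = 0.25                                       # toy high-frequency term
final = filtered + residual                           # 0.67, beyond the patch range
```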

2. VSR-DUF / DUF: Network Architecture

VSR-DUF / DUF: Network Architecture
  • A dense block, as used in DenseNet, is adopted, with weights shared between the filter generation and residual generation branches.
  • 2D convolutional layers are replaced with 3D convolutional layers to learn spatio-temporal features from video data.
  • Each part of the dense block is composed of batch normalization (BN), ReLU, 1×1×1 convolution, BN, ReLU, and 3×3×3 convolution in order.
  • To produce the final output Ŷt, the filtered output Ỹt is added to the generated residual Rt: Ŷt = Ỹt + Rt.
  • Huber loss is used as the cost function between Ŷt and the ground truth; it behaves quadratically for small errors and linearly for large ones (a sketch of a dense block unit and the loss follows below).
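As a rough illustration of these building blocks, below is a minimal PyTorch sketch of one 3D dense block unit and a Huber loss. The channel widths, growth rate, and delta threshold are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DenseUnit3D(nn.Module):
    """One unit of the 3D dense block: BN, ReLU, 1x1x1 conv, BN, ReLU, 3x3x3 conv."""
    def __init__(self, in_ch, growth=16):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm3d(in_ch), nn.ReLU(inplace=True),
            nn.Conv3d(in_ch, 4 * growth, kernel_size=1),
            nn.BatchNorm3d(4 * growth), nn.ReLU(inplace=True),
            nn.Conv3d(4 * growth, growth, kernel_size=3, padding=1),
        )

    def forward(self, x):
        # Dense connectivity: concatenate the input with new features along channels
        return torch.cat([x, self.body(x)], dim=1)

def huber_loss(pred, target, delta=0.01):
    """Huber loss: quadratic below delta, linear above (delta value is a guess)."""
    err = torch.abs(pred - target)
    quad = torch.clamp(err, max=delta)
    return torch.mean(0.5 * quad ** 2 + delta * (err - quad))

# Toy usage on a (batch, channels, T, H, W) video tensor
x = torch.randn(1, 32, 7, 16, 16)
y = DenseUnit3D(32)(x)  # -> (1, 48, 7, 16, 16)
loss = huber_loss(torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64))
```

A full dense block stacks several such units, so the channel count grows by the growth rate after every unit before the network branches into filter and residual generation.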

3. Temporal Data Augmentation

Sampling data from a video with the temporal radius N = 1. Training data with faster or reverse motion can be sampled with the temporal augmentation.
  • Data augmentation is applied along the temporal axis, on top of general spatial augmentations such as random rotation and flipping.
  • A variable TA is introduced that determines the sampling interval of the temporal augmentation.
  • With TA=2, for example, every other frame is sampled to simulate faster motion.
  • Frames are sampled in reverse order when TA is set to a negative value.
  • Using various values of TA (from −3 to 3 in this work), the training data contains rich motion (see the sketch after this list).
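A minimal sketch of this sampling scheme, assuming a clip of consecutive frames and a uniformly random TA; the exclusion of TA = 0 (which would repeat a single frame), the function name, and the index arithmetic are assumptions for illustration.

```python
import random

def sample_clip(num_frames, N=3, TA_range=(-3, 3)):
    """Sample 2N+1 frame indices with a random temporal interval TA.
    |TA| > 1 simulates faster motion; TA < 0 plays the frames in reverse."""
    TA = 0
    while TA == 0:                        # an interval of 0 would repeat one frame
        TA = random.randint(*TA_range)
    span = abs(TA) * 2 * N                # distance covered by the sampled window
    start = random.randint(0, num_frames - 1 - span)
    idx = [start + i * abs(TA) for i in range(2 * N + 1)]
    return idx[::-1] if TA < 0 else idx

print(sample_clip(30))  # e.g. [11, 14, 17, 20, 23, 26, 29] for TA = 3
```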

4. Experimental Results

4.1. Dataset

  • A total of 351 videos with various content, including wildlife, activity, and landscape, are collected from the Internet for training. 160,000 ground-truth training samples with a spatial resolution of 144×144 are created by selecting areas with a sufficient amount of motion.
  • For the validation set, 4 videos from Derf's collection, coastguard, foreman, garden, and husky, are used; this set is named Val4.
  • For the test set, Vid4 from [23] is used to compare with other methods.

4.2. SOTA Comparison

SOTA Comparison
  • Three networks are tested: 16 layers (Ours-16L), 28 layers (Ours-28L), and 52 layers (Ours-52L).
  • The PSNR value of Ours-28L is increased by 0.18dB over Ours-16L with only 0.2M additional parameters.
  • VSR-DUF works well even when stacking up to 52 layers, and the PSNR value for Vid4 is improved to 27.34dB, which is 0.53dB higher than that of Ours-16L.
  • Even Ours-16L outperforms all other methods by a large margin in terms of PSNR and SSIM for all upscale factors. For example, the PSNR of Ours-16L is 0.8dB higher than the second-highest result [34] (r = 4).

4.3. Qualitative Comparisons

  • Increasing the number of layers provides better results.
  • VSR-DUF shows sharper outputs with smoother temporal transitions compared to other works.

There is a section about the visualization of learned filters. Please feel free to read the paper if interested. :)

This is the 9th story this month.
