Reading: VSR-DUF / DUF — Dynamic Upsampling Filters Without Explicit Motion Compensation (Video Super Resolution)

With Learned Upsampling Filters, Outperforms STMC / VESPCN & VSRnet

Sik-Ho Tsang
5 min read · Jul 13, 2020
×4 VSR for the scene ferriswheel. Bottom right: VSR-DUF / DUF produces sharper images

In this story, Deep Video Super-Resolution Network Using Dynamic Upsampling Filters Without Explicit Motion Compensation (VSR-DUF / DUF), by Yonsei University, is presented. Some papers call it DUF, while it is called VSR-DUF on GitHub. In this paper:

  • Dynamic upsampling filters (DUF) are computed depending on the local spatio-temporal neighborhood of each pixel to avoid explicit motion compensation.
  • With residual learning, fine details can also be reconstructed.
  • Temporal data augmentation is also proposed.

This is a paper in 2018 CVPR with over 90 citations. (Sik-Ho Tsang @ Medium)

Outline

  1. VSR-DUF / DUF: Overall Scheme
  2. VSR-DUF / DUF: Network Architecture
  3. Temporal Data Augmentation
  4. Experimental Results

1. VSR-DUF / DUF: Overall Scheme

1.1. Dynamic Upsampling Filters

VSR-DUF / DUF: Overall Scheme
  • With Xt−N to Xt+N as the input LR frames going through the network G, we obtain the reconstructed HR frame Ŷt = G(Xt−N:t+N).
  • The input tensor shape for G is T×H×W×C, where T=2N+1, H and W are the height and width of the input LR frames, and C is the number of color channels. The corresponding output tensor shape is 1×rH×rW×C, where r is the upscaling factor.
  • First, a set of input LR frames {Xt−N:t+N} (7 frames in the network: N=3) is fed into the dynamic filter generation network.
  • The trained network outputs a set of r²HW upsampling filters Ft of a certain size (5×5 in the network), which will be used to generate new pixels in the filtered HR frame Ỹt.
  • Finally, each output HR pixel value is created by local filtering on an LR pixel in the input frame Xt with the corresponding filter Ft^{y,x,v,u}, where v, u ∈ [0, r−1], as follows (see the sketch after this list):

Ỹt(yr+v, xr+u) = Σ_{j=−2..2} Σ_{i=−2..2} Ft^{y,x,v,u}(j+2, i+2) · Xt(y+j, x+i)
  • (This is similar to the approaches AdaConv or SepConv, where the filters are learned from the input, but those are for video frame interpolation.)
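To make the filtering concrete, here is a minimal NumPy sketch of applying the dynamic upsampling filters to a single-channel LR frame. All names and shapes (dynamic_upsample, the (H, W, r, r, k, k) filter layout) are illustrative assumptions, not the paper's code; in the actual network, the filters come from the filter generation branch and are softmax-normalized.

```python
import numpy as np

def dynamic_upsample(X_t, F_t, r=4, k=5):
    """Apply per-pixel dynamic upsampling filters (illustrative shapes).
    X_t: (H, W) single-channel LR frame.
    F_t: (H, W, r, r, k, k) filters, one k x k filter per HR sub-position.
    Returns the filtered HR frame of shape (r*H, r*W)."""
    H, W = X_t.shape
    pad = k // 2
    Xp = np.pad(X_t, pad, mode="edge")  # replicate-pad the LR frame
    Y = np.zeros((r * H, r * W), dtype=X_t.dtype)
    for y in range(H):
        for x in range(W):
            patch = Xp[y:y + k, x:x + k]  # k x k LR neighborhood around (y, x)
            for v in range(r):
                for u in range(r):
                    # HR pixel (yr+v, xr+u) is a weighted sum of the LR patch
                    Y[y * r + v, x * r + u] = np.sum(F_t[y, x, v, u] * patch)
    return Y

# Toy usage: random frame and normalized random filters
H, W, r, k = 8, 8, 4, 5
X = np.random.rand(H, W).astype(np.float32)
F = np.random.rand(H, W, r, r, k, k).astype(np.float32)
F /= F.sum(axis=(-2, -1), keepdims=True)  # each filter sums to 1, as with softmax
Y_tilde = dynamic_upsample(X, F, r, k)    # shape (32, 32)
```

The nested loops mirror the equation above directly; a real implementation would vectorize them, but the per-pixel weighted sum is the whole idea.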

1.2. Residual Learning

  • The result after applying the dynamic upsampling filters alone lacks sharpness as it is still a weighted sum of input pixels.
  • To address this, a residual image Rt is additionally estimated to add high-frequency details; it is added to the filtered output Ỹt to produce the final output Ŷt.
Effects of residual learning
  • With residual learning, much sharper images are obtained.
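A toy numeric example of why the residual matters: with softmax-normalized (non-negative, sum-to-one) filters, the filtered value is a convex combination of input pixels and can never leave the local range of the patch, so it cannot create new high-frequency detail on its own. All values below are made up for illustration.

```python
import numpy as np

patch = np.array([0.2, 0.4, 0.6], dtype=np.float32)   # toy LR neighborhood
w = np.array([0.2, 0.5, 0.3], dtype=np.float32)       # toy filter, sums to 1
filtered = float(w @ patch)                           # 0.42, inside [0.2, 0.6]

residual = 0.25                                       # toy high-frequency term
final = filtered + residual                           # 0.67, beyond the patch range
```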

2. VSR-DUF / DUF: Network Architecture

VSR-DUF / DUF: Network Architecture
  • A dense block, as used in DenseNet, is adopted, with weights shared between the filter generation and residual generation branches.
  • 2D convolutional layers are replaced with 3D convolutional layers to learn spatio-temporal features from video data.
  • Each part of the dense block is composed of batch normalization (BN), ReLU, 1×1×1 convolution, BN, ReLU, and 3×3×3 convolution in order.
  • To produce the final output Ŷt, the filtered output Ỹt is added to the generated residual Rt: Ŷt = Ỹt + Rt.
  • Huber loss is used as the cost function between Ŷt and the ground truth; it behaves quadratically for small errors and linearly for large ones (a sketch of a dense block unit and the loss follows below).
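As a rough illustration of these building blocks, below is a minimal PyTorch sketch of one 3D dense block unit and a Huber loss. The channel widths, growth rate, and delta threshold are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DenseUnit3D(nn.Module):
    """One unit of the 3D dense block: BN, ReLU, 1x1x1 conv, BN, ReLU, 3x3x3 conv."""
    def __init__(self, in_ch, growth=16):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm3d(in_ch), nn.ReLU(inplace=True),
            nn.Conv3d(in_ch, 4 * growth, kernel_size=1),
            nn.BatchNorm3d(4 * growth), nn.ReLU(inplace=True),
            nn.Conv3d(4 * growth, growth, kernel_size=3, padding=1),
        )

    def forward(self, x):
        # Dense connectivity: concatenate the input with new features along channels
        return torch.cat([x, self.body(x)], dim=1)

def huber_loss(pred, target, delta=0.01):
    """Huber loss: quadratic below delta, linear above (delta value is a guess)."""
    err = torch.abs(pred - target)
    quad = torch.clamp(err, max=delta)
    return torch.mean(0.5 * quad ** 2 + delta * (err - quad))

# Toy usage on a (batch, channels, T, H, W) video tensor
x = torch.randn(1, 32, 7, 16, 16)
y = DenseUnit3D(32)(x)  # -> (1, 48, 7, 16, 16)
loss = huber_loss(torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64))
```

A full dense block stacks several such units, so the channel count grows by the growth rate after every unit before the network branches into filter and residual generation.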

3. Temporal Data Augmentation

Sampling data from a video with the temporal radius N = 1. Training data with faster or reverse motion can be sampled with the temporal augmentation.
  • Data augmentation is applied along the temporal axis, on top of general spatial augmentations such as random rotation and flipping.
  • A variable TA is introduced that determines the sampling interval of the temporal augmentation.
  • With TA=2, for example, every other frame is sampled to simulate faster motion.
  • Frames are sampled in reverse order when TA is set to a negative value.
  • Using various values of TA (from −3 to 3 in this work), the training data contains rich motion (see the sketch after this list).
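A minimal sketch of this sampling scheme, assuming a clip of consecutive frames and a uniformly random TA; the exclusion of TA = 0 (which would repeat a single frame), the function name, and the index arithmetic are assumptions for illustration.

```python
import random

def sample_clip(num_frames, N=3, TA_range=(-3, 3)):
    """Sample 2N+1 frame indices with a random temporal interval TA.
    |TA| > 1 simulates faster motion; TA < 0 plays the frames in reverse."""
    TA = 0
    while TA == 0:                        # an interval of 0 would repeat one frame
        TA = random.randint(*TA_range)
    span = abs(TA) * 2 * N                # distance covered by the sampled window
    start = random.randint(0, num_frames - 1 - span)
    idx = [start + i * abs(TA) for i in range(2 * N + 1)]
    return idx[::-1] if TA < 0 else idx

print(sample_clip(30))  # e.g. [11, 14, 17, 20, 23, 26, 29] for TA = 3
```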

4. Experimental Results

4.1. Dataset

  • A total of 351 videos with various content, including wildlife, activity, and landscape, are collected from the Internet for training. 160,000 ground-truth training samples with a spatial resolution of 144×144 are created by selecting areas with a sufficient amount of motion.
  • For the validation set, 4 videos from Derf's collection, coastguard, foreman, garden, and husky, are used; this set is named Val4.
  • For the test set, Vid4 from [23] is used to compare with other methods.

4.2. SOTA Comparison

SOTA Comparison
  • Three networks are tested: 16 layers (Ours-16L), 28 layers (Ours-28L), and 52 layers (Ours-52L).
  • The PSNR value of Ours-28L is increased by 0.18dB over Ours-16L with only 0.2M additional parameters.
  • VSR-DUF works well even when stacking up to 52 layers, and the PSNR value for Vid4 is improved to 27.34dB, which is 0.53dB higher than that of Ours-16L.
  • Even Ours-16L outperforms all other methods by a large margin in terms of PSNR and SSIM for all upscale factors. For example, the PSNR of Ours-16L is 0.8dB higher than the second-highest result [34] (r = 4).

4.3. Qualitative Comparisons

  • Increasing the number of layers provides better results.
  • VSR-DUF shows sharper outputs with smoother temporal transitions compared to other works.

There is a section about the visualization of learned filters. Please feel free to read the paper if interested. :)

This is the 9th story this month.
