Reading: VSR-DUF / DUF — Dynamic Upsampling Filters Without Explicit Motion Compensation (Video Super Resolution)
With Learned Upsampling Filters, Outperforms STMC / VESPCN & VSRnet
In this story, Deep Video Super-Resolution Network Using Dynamic Upsampling Filters Without Explicit Motion Compensation (VSR-DUF / DUF), by Yonsei University, is presented. Some papers call it DUF, while it is called VSR-DUF on GitHub. In this paper:
- Dynamic upsampling filters (DUF) are computed depending on the local spatio-temporal neighborhood of each pixel to avoid explicit motion compensation.
- With residual learning, fine details can also be reconstructed.
- Temporal data augmentation is also proposed.
This is a paper in 2018 CVPR with over 90 citations. (Sik-Ho Tsang @ Medium)
Outline
- VSR-DUF / DUF: Overall Scheme
- VSR-DUF / DUF: Network Architecture
- Temporal Data Augmentation
- Experimental Results
1. VSR-DUF / DUF: Overall Scheme
1.1. Dynamic Upsampling Filters
- With Xt−N to Xt+N as the input LR frames going through the network Gθ, the reconstructed HR frame Ŷt is obtained:

Ŷt = Gθ(Xt−N:t+N)

- The input tensor shape for Gθ is T×H×W×C, where T=2N+1, H and W are the height and width of the input LR frames, and C is the number of color channels. The corresponding output tensor shape is 1×rH×rW×C, where r is the upscaling factor.
- First, a set of input LR frames {X t−N:t+N} (7 frames in the network: N=3) is fed into the dynamic filter generation network.
- The trained network outputs a set of r²HW upsampling filters Ft of a certain size (5×5 in the network), which are used to generate new pixels in the filtered HR frame Ỹt.
- Finally, each output HR pixel value is created by local filtering on an LR pixel in the input frame Xt with the corresponding filter Ft^(y,x,v,u), where 0 ≤ v, u ≤ r−1 (see the sketch after this list):

Ỹt(yr+v, xr+u) = Σj=−2..2 Σi=−2..2 Ft^(y,x,v,u)(j+2, i+2) · Xt(y+j, x+i)
- (This is similar to the approaches AdaConv or SepConv where the filters are learnt from input, but they are for video frame interpolation.)
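The per-pixel filtering step above can be made concrete with a short sketch. Below is a minimal single-channel PyTorch illustration, assuming the r²HW filters are already softmax-normalized and packed as (H, W, r², 25); the function and variable names are hypothetical, not from the authors' code:

```python
import torch
import torch.nn.functional as F

def dynamic_upsampling(x_t, filters, r=4, k=5):
    """Apply per-pixel dynamic upsampling filters to one LR frame.

    x_t:     LR frame of shape (1, H, W), single channel for simplicity
    filters: shape (H, W, r*r, k*k), assumed softmax-normalized over the last axis
    returns: filtered HR frame of shape (1, r*H, r*W)
    """
    _, H, W = x_t.shape
    # Gather the k x k neighborhood of every LR pixel: (H*W, k*k)
    patches = F.unfold(x_t.unsqueeze(0), kernel_size=k, padding=k // 2)
    patches = patches.squeeze(0).transpose(0, 1)
    # Each LR pixel has r*r private filters, one per HR sub-position (v, u)
    f = filters.reshape(H * W, r * r, k * k)
    hr = torch.einsum('nfk,nk->nf', f, patches)  # (H*W, r*r)
    # Scatter into the HR grid: pixel (y, x) with offset (v, u) -> (y*r+v, x*r+u)
    hr = hr.reshape(H, W, r, r).permute(0, 2, 1, 3).reshape(r * H, r * W)
    return hr.unsqueeze(0)
```

For a color frame, the same filters can be applied to each channel independently, since one set of filters is generated per output pixel location rather than per channel.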
1.2. Residual Learning
- The result after applying the dynamic upsampling filters alone lacks sharpness as it is still a weighted sum of input pixels.
- To address this, a residual image Rt is additionally estimated to increase high-frequency details, and it is added to the filtered frame Ỹt to produce the final output (see the toy check after this list).
- With residual learning, much sharper images are obtained.
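A toy check of the claim that filtering alone cannot sharpen: because the dynamic filter weights are softmax-normalized (non-negative, summing to 1, as assumed in the sketch above), each filtered value is a convex combination of its LR neighborhood and can never exceed the local extremes. This is illustrative only:

```python
import torch

# A sharp impulse in the LR patch and an arbitrary softmax-normalized filter.
patch = torch.tensor([0.0, 0.0, 1.0, 0.0, 0.0])
weights = torch.softmax(torch.randn(5), dim=0)

filtered = (weights * patch).sum()
# The weighted average stays inside the hull of the inputs: it can blur,
# but it cannot create new high-frequency extremes on its own.
assert patch.min() <= filtered <= patch.max()
# The residual Rt is unconstrained, so Yhat = Ytilde + Rt can restore sharpness.
```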
2. VSR-DUF / DUF: Network Architecture
- Dense blocks, as used in DenseNet, are employed, with most weights shared between the filter generation and the residual generation branches.
- 2D convolutional layers are replaced with 3D convolutional layers to learn spatio-temporal features from video data.
- Each part of the dense block is composed of batch normalization (BN), ReLU, 1×1×1 convolution, BN, ReLU, and 3×3×3 convolution, in that order (see the sketch after this list).
- To produce the final output Ŷt, the filtered output Ỹt is added to the generated residual Rt: Ŷt = Ỹt + Rt.
- Huber loss is used for stable convergence, applied element-wise to Ŷt − Yt: H(x) = 0.5x² if |x| ≤ δ, and δ(|x| − 0.5δ) otherwise.
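A minimal PyTorch sketch of one dense-block unit with the stated layer ordering, plus the Huber loss; the 4× bottleneck width (a DenseNet convention) and δ = 0.01 are assumptions, and all names are hypothetical:

```python
import torch
import torch.nn as nn

class DenseUnit3D(nn.Module):
    """One dense-block unit: BN -> ReLU -> 1x1x1 conv -> BN -> ReLU -> 3x3x3 conv."""

    def __init__(self, in_ch, growth):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm3d(in_ch),
            nn.ReLU(inplace=True),
            nn.Conv3d(in_ch, 4 * growth, kernel_size=1),              # 1x1x1 bottleneck
            nn.BatchNorm3d(4 * growth),
            nn.ReLU(inplace=True),
            nn.Conv3d(4 * growth, growth, kernel_size=3, padding=1),  # 3x3x3 spatio-temporal
        )

    def forward(self, x):
        # DenseNet-style connectivity: concatenate input with new features
        return torch.cat([x, self.body(x)], dim=1)

def huber_loss(pred, target, delta=0.01):
    """H(x) = 0.5*x^2 for |x| <= delta, else delta*(|x| - 0.5*delta), averaged."""
    x = pred - target
    absx = x.abs()
    return torch.where(absx <= delta,
                       0.5 * x ** 2,
                       delta * (absx - 0.5 * delta)).mean()
```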
3. Temporal Data Augmentation
- Data augmentation is applied along the temporal axis, on top of general data augmentation such as random rotation and flipping.
- A variable TA is introduced that determines the sampling interval of the temporal augmentation.
- With TA=2, for example, every other frame is sampled to simulate faster motion.
- The video is sampled in reverse order when the TA value is negative.
- Using various values of TA (from -3 to 3 in this work), the training data covers rich motion (see the sketch after this list).
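A minimal sketch of the sampling rule (names are hypothetical; the paper only describes the rule itself): choose a TA, then gather 2N+1 frames around the target index with stride TA.

```python
import random

def sample_augmented_clip(frames, center, n=3, ta_values=(-3, -2, -1, 1, 2, 3)):
    """Sample 2N+1 frames around `center` with temporal stride TA.

    TA = 2 picks every other frame to simulate faster motion; a negative TA
    plays the clip in reverse. TA = 0 is excluded here as an assumption,
    since it would repeat the same frame 2N+1 times.
    """
    ta = random.choice(ta_values)
    idx = [center + i * ta for i in range(-n, n + 1)]
    if min(idx) < 0 or max(idx) >= len(frames):
        return None  # clip falls outside the video; caller retries
    return [frames[i] for i in idx]
```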
4. Experimental Results
4.1. Dataset
- A total of 351 videos with various contents, including wildlife, activity, and landscape, are collected from the Internet for training. 160,000 ground-truth training samples with a spatial resolution of 144×144 are obtained by selecting areas with a sufficient amount of motion.
- For the validation set, 4 videos from Derf's collection, coastguard, foreman, garden, and husky, are used, which is named Val4.
- For the test set, Vid4 from [23] is used to compare with other methods.
4.2. SOTA Comparison
- 3 networks are tested: 16 layers (Ours-16L), 28 layers (Ours-28L), and 52 layers (Ours-52L).
- The PSNR value of Ours-28L increases by 0.18dB over Ours-16L with only 0.2M additional parameters.
- VSR-DUF works well even when stacking up to 52 layers, and the PSNR value for Vid4 improves to 27.34dB, which is 0.53dB higher than that of Ours-16L.
- Even Ours-16L outperforms all other methods by a large margin in terms of PSNR and SSIM for all upscale factors. For example, the PSNR of Ours-16L is 0.8dB higher than the second highest result [34] (r = 4).
4.3. Qualitative Comparisons
- Increasing the number of layers provides better results.
- VSR-DUF shows sharper outputs with smoother temporal transitions compared to other works.
There is a section about the visualization of learned filters. Please feel free to read the paper if interested. :)
This is the 9th story this month.
Reference
[2018 CVPR] [VSR-DUF / DUF]
Deep Video Super-Resolution Network Using Dynamic Upsampling Filters Without Explicit Motion Compensation
Video Super Resolution
[STMC / VESPCN] [VSR-DUF / DUF]