Reading: U+DVPN — Upsampled & Downsampled Video Prediction Network for Video Coding (HEVC Inter Prediction)

SepConv With Upsampling & Downsampling, Outperforms SepConv, FRUC+DVRF, FRUC+DVRF+VECNN, DeepFrame

Sik-Ho Tsang
6 min read · May 31, 2020

In this story, “Deep Video Prediction Network Based Inter-Frame Coding in HEVC” (U+DVPN), by Ewha Womans University and Kyungnam University, is presented. I read this because I work on video coding research. In this paper:

  • By including down-sampling and up-sampling operations in SepConv, large motion can also be handled with small-kernel-sized convolutions.
  • The better-quality synthesized frame then serves as a better virtual reference for inter prediction.

This is an early-access paper in 2020 IEEE Access, an open-access journal with a high impact factor of 4.098. (Sik-Ho Tsang @ Medium)

Outline

  1. U+DVPN: Network Architecture
  2. HEVC Implementation
  3. Experimental Results

1. U+DVPN: Network Architecture

U+DVPN: Network Architecture (VPN: Video Prediction Network)

1.1. Up-sampled resolution VPN (UVPN)

  • SepConv was originally developed for interpolation of low-resolution videos. For high-resolution videos, the quality of the generated frames drops significantly, because the fixed kernel size cannot cover large motion when interpolating the frames.
  • In SepConv, the interpolated frame is obtained by applying, at every pixel, two adaptive kernels K1 and K2 to the co-located patches P1 and P2 of the two input frames: ^Xt = K1 * P1 + K2 * P2, where each 2D kernel is approximated by a pair of 1D (vertical and horizontal) kernels.
  • The UVPN output shown above, i.e. ^XUt, is obtained in the same way, before the weight map is applied.
  • In addition, a weight map is generated, which is different from SepConv.
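The per-pixel separable convolution of SepConv can be sketched as follows (a minimal NumPy illustration of the idea, not the paper's code; the patch size and kernel values here are made up):

```python
import numpy as np

def sepconv_pixel(patch1, patch2, kv1, kh1, kv2, kh2):
    """Interpolate one output pixel from two co-located n x n input patches.

    Each 2D kernel K_i is the outer product of a vertical and a horizontal
    1D kernel (the separable trick of SepConv), so only 2n instead of n^2
    coefficients per kernel need to be estimated by the network.
    """
    K1 = np.outer(kv1, kh1)  # K1 = kv1 * kh1^T, an n x n kernel
    K2 = np.outer(kv2, kh2)
    # ^Xt(pixel) = sum(K1 . P1) + sum(K2 . P2)
    return float(np.sum(K1 * patch1) + np.sum(K2 * patch2))

# Example: uniform kernels whose two 2D kernels sum to 0.5 each,
# so two flat patches are simply averaged.
patch1 = np.full((5, 5), 2.0)
patch2 = np.full((5, 5), 4.0)
kv = np.full(5, 0.2)
kh = np.full(5, 0.1)
print(sepconv_pixel(patch1, patch2, kv, kh, kv, kh))  # 3.0
```

With 1D kernels of length n, the kernel support is limited to n pixels, which is why large motion in high-resolution frames falls outside the reachable range.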

1.2. Down-sampled resolution VPN (DVPN)

  • In addition, DVPN is added.
  • DVPN down-samples the input frames of size h×w to h/2×w/2, denoted by XDt1 and XDt2, and produces a down-sampled hypothesis of the prediction, denoted by ^XDt.
  • ^XDt is then up-sampled to ^XDUt at the original size of h×w, concatenated with the two original input frames in chronological order, and fed to the UVPN.
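The DVPN-to-UVPN data flow can be sketched as below (a NumPy sketch under assumptions: average pooling and pixel repetition stand in for the paper's actual resampling filters, and `dvpn`/`uvpn` are placeholder callables for the trained networks):

```python
import numpy as np

def down2(x):
    """2x down-sampling by average pooling (stand-in for the paper's filter)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def up2(x):
    """2x up-sampling by pixel repetition (also a stand-in filter)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def u_dvpn_hypotheses(x_t1, x_t2, dvpn, uvpn):
    """Data flow of U+DVPN: DVPN predicts at half resolution, and its
    up-sampled hypothesis is stacked with the two full-resolution inputs
    in chronological order before being fed to UVPN."""
    xd_hat = dvpn(down2(x_t1), down2(x_t2))   # ^XDt at h/2 x w/2
    xdu_hat = up2(xd_hat)                     # ^XDUt back at h x w
    stacked = np.stack([x_t1, xdu_hat, x_t2]) # UVPN input
    xu_hat = uvpn(stacked)                    # ^XUt at h x w
    return xu_hat, xdu_hat

# Toy stand-ins for the trained networks, just to show the shapes flow through:
dvpn = lambda a, b: (a + b) / 2.0
uvpn = lambda s: s.mean(axis=0)
x1, x2 = np.zeros((4, 4)), np.ones((4, 4))
xu, xdu = u_dvpn_hypotheses(x1, x2, dvpn, uvpn)
print(xu.shape, xdu.shape)  # (4, 4) (4, 4)
```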

1.3. Final Stage

  • The final prediction is the weighted combination of the intermediate results: ^Xt = M · ^XUt + (1 − M) · ^XDUt,
  • where M is a trained weight map ranging from 0 to 1, and the dot operator refers to the element-wise multiplication of pixels in frames.
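The weighted combination described above is a simple per-pixel blend (one-line NumPy sketch):

```python
import numpy as np

def combine(xu_hat, xdu_hat, M):
    """Final prediction ^Xt = M . ^XUt + (1 - M) . ^XDUt, where '.' is the
    element-wise product and the weight map M lies in [0, 1]."""
    return M * xu_hat + (1.0 - M) * xdu_hat

# M = 1 selects the UVPN hypothesis, M = 0 the up-sampled DVPN hypothesis,
# and intermediate values blend the two.
xu = np.full((2, 2), 8.0)
xdu = np.full((2, 2), 4.0)
print(combine(xu, xdu, np.full((2, 2), 0.25)))  # all entries 5.0
```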

1.4. Weight Map

Left: Original Frame, Right: Weight Map
  • Brighter regions of the weight map shown above mean higher activation. The weight map responds to objects in motion.
  • Large motion in high-resolution frames can hardly be managed by the limited kernel sizes of the original VPN. So, DVPN is important for capturing such motion.
  • The DVPN and UVPN share the same network architectures.
  • The loss function considers both outputs, from DVPN and from UVPN.
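A plausible form of this joint loss can be sketched as follows (an assumption: the exact norm and term weighting used in the paper are not reproduced here; an L1 reconstruction term per output is assumed, as is common for SepConv-style networks):

```python
import numpy as np

def down2(x):
    """2x down-sampling by average pooling (stand-in for the paper's filter)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def u_dvpn_loss(xu_hat, xd_hat, x_t):
    """Joint training loss over both outputs (sketch): the UVPN output is
    compared against the full-resolution ground truth, and the DVPN output
    against its down-sampled version, so both branches receive supervision."""
    loss_u = np.abs(xu_hat - x_t).mean()          # full-resolution term
    loss_d = np.abs(xd_hat - down2(x_t)).mean()   # half-resolution term
    return loss_u + loss_d
```

Supervising the DVPN branch directly (rather than only through UVPN) keeps the half-resolution hypothesis meaningful on its own.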

1.5. Performance

  • HD and UHD videos, obtained from YouTube, are used as training set.
  • About 350,000 256×256 video patches are collected.
  • The frame prediction performance of the proposed VPN in video interpolation (VI) and video extrapolation (VX) is shown above.
  • The test videos TV6-TV10 have resolutions lower than 1280×720. In VI, the prediction performance on these is almost the same. However, the difference increases for TV1-TV5, whose resolutions are higher than those of TV6-TV10.
  • This shows the importance of DVPN.
(a) VI (b) VX, Left: Original, Middle: UVPN Only, Right: U+DVPN
  • As shown above, U+DVPN can capture motion in more detail.

2. HEVC Implementation

Decoded Picture Buffer in HEVC Using U+DVPN
  • For the RA configuration, the Virtual Reference Frame (VRF) is applied to the last layer of the hierarchical B-picture coding structure, e.g. t = 1, 3, 5, and 7, because these frames can be generated from input frames at the nearest temporal distance.
  • And the VRF is always placed at the second index in the buffer.
  • For LD configuration, VRF is always placed in the third index in the buffer for t=1, 3, 5, 6, and 7.
  • When using AMVP, the current block may be coded using a VRF, or the col-Pic may be determined as a VRF. In both cases, the motion vector scaling process cannot be conducted, so the temporal motion vector candidate is set to a zero-motion vector.
  • When using merge mode, the temporal block candidate is handled in the same way: its motion vector is set to zero whenever a VRF is involved in the search process.
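The VRF placement and the zero-MV fallback described above can be sketched as below (hypothetical helper functions for illustration, not HM-16.6 code; `build_ref_list` and `temporal_mv_candidate` are made-up names):

```python
def build_ref_list(decoded_refs, vrf, config):
    """Insert the synthesized Virtual Reference Frame (VRF) at a fixed
    position in the reference picture list: the second index for RA,
    the third index for LD, as described above."""
    refs = list(decoded_refs)
    refs.insert(1 if config == "RA" else 2, vrf)  # 0-based positions 1 / 2
    return refs

def temporal_mv_candidate(col_mv, scale, vrf_involved):
    """Temporal MV candidate for AMVP/merge: a VRF has no meaningful
    temporal distance, so MV scaling is impossible and the candidate
    falls back to the zero motion vector whenever a VRF is involved."""
    if vrf_involved:
        return (0, 0)  # zero-MV fallback
    return (col_mv[0] * scale, col_mv[1] * scale)

# Example: RA places the VRF second, LD third.
print(build_ref_list(["r0", "r1", "r2"], "vrf", "RA"))
print(build_ref_list(["r0", "r1", "r2"], "vrf", "LD"))
```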

3. Experimental Results

3.1. BD-Rate

BD-Rate (%) on HEVC Test Sequences Under RA Configuration
  • U+DVPN achieves 5.7% BD-rate reduction which outperforms DeepFrame [28].
BD-Rate (%) on HEVC Test Sequences Under LD Configuration
  • U+DVPN achieves 5.7% BD-rate reduction which outperforms SepConv [26].
BD-Rate (%) on HEVC Test Sequences Under LDP Configuration
  • U+DVPN achieves 7.2% BD-rate reduction under LDP configuration.

3.2. Time Analysis

Time Analysis
  • HM-16.6 is used.
  • When a GPU is used, there is nearly no difference in encoding time, but a 20% to 43% increase in decoding time.
  • When only the CPU is used, both encoding and decoding times increase substantially, as shown above.

3.3. Further Analysis

Motion Vector Difference Distribution
  • The motion vector differences with U+DVPN are more compact and more skewed toward the zero motion vector, compared to the original HEVC.
(a) Original, (b) Residuals by HEVC, (c) Residuals by U+DVPN
  • U+DVPN synthesizes frames that are much closer to the original frame.
Different QP Ranges
  • At higher QPs, i.e. lower-bitrate applications, using U+DVPN obtains larger coding gains.
Placing VRF at Different Indices
  • It is empirically found that placing the VRF at other indices yields lower coding gains.

During the days of coronavirus, a challenge of writing 30/35/40/45/50 stories for this month has been accomplished. This is the 50th story in this month..!!

In these two months, I have written 85 stories, 35 in April and 50 in May. Among these: Image Classification (1), Object Detection (3), Semantic Segmentation (1), Instance Segmentation (2), Biomedical Image Segmentation (1), GAN (5), Super Resolution (8), Image Restoration (3), Video Frame Interpolation (2), Video Coding (64), and Tutorial (1). (Some stories cover more than one topic or more than one paper.)

I have read a lot of video coding papers for my literature review. I will continue this, but I will slow down a bit. (Reading papers and writing wrap-ups takes me quite a lot of time. After looking at these small statistics, I also think I am crazy, lol!!) Indeed, it may be quite annoying for those who receive my new story notifications every day. Anyway, thanks for visiting my story.


Written by Sik-Ho Tsang

PhD, Researcher. I share what I learn. :) Linktree: https://linktr.ee/shtsang for Twitter, LinkedIn, etc.
