Review: Zhang VCIP’17 — VDSR for Fractional Interpolation in HEVC (HEVC Inter Prediction)

Network Similar to VDSR, 0.45% BD-rate Reduction Compared With HEVC

Apr 22, 2020

In this story, Learning a Convolutional Neural Network for Fractional Interpolation in HEVC Inter Coding (Zhang VCIP’17), by Shanghai Jiao Tong University, Cooperative Medianet Innovation Center, and Shanghai University of Electric Power, is briefly reviewed. I read this because I work on video coding research. This is a paper in 2017 VCIP. (Sik-Ho Tsang @ Medium)

Outline

The Use of Fractional Interpolation in Video Coding
CNN Based Interpolation Process & Network Architecture
Experimental Results

1. The Use of Fractional Interpolation in Video Coding

Consecutive video frames are highly correlated, and video codecs exploit this correlation for efficient compression through a process called motion compensated prediction (MCP).
In MCP, for the current block to be coded, the best matching block is searched in previously reconstructed reference frames, and the differences between these two blocks, i.e. residues, are transmitted to the decoder side.
The positional relationship between current block and its corresponding reference block is represented by a motion vector (MV), which also describes the displacement of these blocks.
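A minimal sketch of the block search described above, assuming an integer-pel full search with the sum of absolute differences (SAD) as the matching cost. Real encoders such as HM use faster search patterns and rate-aware costs; the function names `sad` and `full_search` are hypothetical and chosen only for illustration.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n current block at
    (bx, by) and the reference block displaced by (dx, dy)."""
    return sum(
        abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
        for y in range(n)
        for x in range(n)
    )


def full_search(cur, ref, bx, by, n, search_range):
    """Return the best integer motion vector (dx, dy) and the residual block."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            # Skip candidates that fall outside the reference frame.
            if not (0 <= by + dy and by + dy + n <= height):
                continue
            if not (0 <= bx + dx and bx + dx + n <= width):
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, (dx, dy))
    _, (mvx, mvy) = best
    # The residual is what would be transformed, quantized and transmitted.
    residual = [
        [cur[by + y][bx + x] - ref[by + mvy + y][bx + mvx + x] for x in range(n)]
        for y in range(n)
    ]
    return (mvx, mvy), residual
```

When the current frame is an exact shifted copy of the reference, the search recovers that shift as the motion vector and the residual is all zeros.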
The true frame-to-frame displacements of moving objects are not necessarily integer-pel displacements.
Fractional-pel precision motion vectors have to be adopted to describe the continuous motion of objects.
Hence, the reference frame needs to be interpolated.
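To make the link between fractional motion vectors and interpolation concrete, here is a small sketch. It assumes motion vector components are stored in quarter-sample units, as HEVC does for luma; the helper name `split_quarter_pel_mv` is hypothetical.

```python
def split_quarter_pel_mv(mv_x, mv_y):
    """Split a quarter-sample motion vector into integer and fractional parts.

    Each component equals 4 * integer_part + fractional_part, where
    fractional_part is in {0, 1, 2, 3} (units of 1/4 sample).
    """
    # Arithmetic shift and bit mask also behave correctly for negative values.
    int_x, frac_x = mv_x >> 2, mv_x & 3
    int_y, frac_y = mv_y >> 2, mv_y & 3
    return (int_x, frac_x), (int_y, frac_y)


def needs_interpolation(mv_x, mv_y):
    """True if the motion vector points to a fractional sample position."""
    return (mv_x & 3) != 0 or (mv_y & 3) != 0
```

A fractional part of 2 corresponds to a half-sample position, while 1 and 3 correspond to quarter-sample positions. Only when both fractional parts are 0 can the reference block be copied without interpolation.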

2. CNN Based Interpolation Process & Network Architecture

2.1. Conventional Interpolation

**Positions of integer and fractional samples in luma component**

The capital letters A represent integer samples (blue), which actually exist in the corresponding reference block.
Half-sample positions are labeled b, h, j (yellow).
In conventional HEVC, half-samples are interpolated with a symmetric 8-tap DCT-based filter.
The remaining unlabeled positions are quarter-pel samples, which are bilinearly interpolated from the half-pel samples.
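As a rough illustration of the conventional half-sample filter, the sketch below applies the HEVC 8-tap luma half-sample coefficients (-1, 4, -11, 40, 40, -11, 4, -1)/64 along a single row. It is simplified on purpose: it assumes 8-bit samples, single-stage rounding, and edge replication at the row boundaries, whereas a real codec keeps higher intermediate precision for two-dimensional positions. The name `half_pel_sample` is hypothetical.

```python
# HEVC luma half-sample filter taps; they sum to 64.
HALF_PEL_TAPS = (-1, 4, -11, 40, 40, -11, 4, -1)


def half_pel_sample(row, i, bit_depth=8):
    """Interpolate the half-sample located between row[i] and row[i + 1]."""
    last = len(row) - 1
    acc = 0
    for k, tap in enumerate(HALF_PEL_TAPS):
        # The filter covers row[i - 3] .. row[i + 4]; replicate edge samples.
        pos = min(max(i - 3 + k, 0), last)
        acc += tap * row[pos]
    value = (acc + 32) >> 6  # divide by 64 with rounding
    return min(max(value, 0), (1 << bit_depth) - 1)  # clip to the sample range
```

On a flat row the filter returns the same value, and on a linear ramp it returns the midpoint of the two neighbouring integer samples, which is the behaviour expected of a symmetric interpolation filter.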

2.2. Proposed Approach Restrictions

The authors propose to train a deep convolutional neural network to replace the DCT-based interpolation filters and generate the samples at these three half-pel positions.
It is not appropriate to use an already trained super-resolution (SR) network directly, since the integer samples must be kept unchanged.
Training is therefore restricted to the three half-pel positions.

2.3. Network Architecture

The network consists of two parts: a VDSR super-resolution network as the main body, and a constrained weighted mask.
VDSR consists of 20 weight layers, each of which, except the first and the last, consists of 64 filters of size 3×3×64.
By cascading filters in a 20-layer network, a large region of image contextual information is taken into account to predict image details.
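The "large region" can be quantified with a short calculation. The sketch below assumes every layer is a stride-1 convolution without dilation, which matches the 3×3 layers described above; the helper names are hypothetical.

```python
def receptive_field(num_layers, kernel_size=3):
    """Receptive field (in pixels per side) of stacked stride-1 convolutions.

    Each layer widens the field by (kernel_size - 1) pixels.
    """
    return 1 + num_layers * (kernel_size - 1)


def conv_weights(kernel_size, in_channels, out_channels):
    """Number of weights (excluding biases) in one convolutional layer."""
    return kernel_size * kernel_size * in_channels * out_channels
```

With 20 layers of 3×3 filters, `receptive_field(20)` gives 41, so each output sample depends on a 41×41 input neighbourhood, and each middle layer with 64 filters of size 3×3×64 holds `conv_weights(3, 64, 64)` = 36,864 weights.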
Instead of predicting a high-resolution image directly like SRCNN, VDSR predicts the image details, called the residual image.
A residual image is defined as the difference between the HR image and the interpolated LR image:

$$R = Y_H - X_{ILR}$$

where R is the residual image, Y_H is the output high-resolution image, and X_ILR refers to the interpolated low-resolution image (ILR).
Since the input ILR image and the output HR image are highly correlated, most values of the residual image tend to be zero or very small.
The constrained mask is added to the original VDSR structure to restrict the integer positions.
The weights vary with locations. W_INT is for integer positions while the remaining three half-sample positions share the same weight W_H.
A Euclidean loss function is used.
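The interaction between the residual prediction and the position-dependent mask can be sketched as follows. Several things here are assumptions rather than details taken from the paper: the 2× grid layout (integer samples at even row and even column, half-samples elsewhere), the mask being multiplied element-wise with the predicted residual before it is added back to the input, the example weights W_INT = 0 and W_H = 1, and the 0.5 scaling of the loss. All function names are hypothetical.

```python
def build_mask(height, width, w_int, w_half):
    """Weight map: w_int at integer positions, w_half at the half-sample ones.

    Assumes integer samples sit at (even row, even column) of the 2x grid.
    """
    return [
        [w_int if (y % 2 == 0 and x % 2 == 0) else w_half for x in range(width)]
        for y in range(height)
    ]


def apply_masked_residual(x_ilr, residual, mask):
    """Add the mask-weighted residual to the interpolated input image."""
    return [
        [x + m * r for x, r, m in zip(row_x, row_r, row_m)]
        for row_x, row_r, row_m in zip(x_ilr, residual, mask)
    ]


def euclidean_loss(pred, target):
    """Half the sum of squared differences between two images."""
    return 0.5 * sum(
        (p - t) ** 2
        for row_p, row_t in zip(pred, target)
        for p, t in zip(row_p, row_t)
    )
```

With W_INT = 0 in this sketch, the integer samples of the output are identical to those of the input, which is one simple way to honour the requirement that integer positions stay untouched while the three half-sample positions are refined.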

3. Experimental Results

400 natural images are used as the training set, encoded by HM-16.7 using five different quantization parameters (QP): 22, 27, 32, 37 and 42.

**BD-Rate (%) of Proposed Approach Against the Conventional HEVC Interpolation Process DCTIF**

A 0.45% BD-rate reduction is obtained under the low-delay P configuration compared with the conventional HEVC interpolation process DCTIF.

**BD-Rate (%) of VDSR Against the Conventional HEVC Interpolation Process DCTIF**

Without the constrained mask, the network is simply VDSR.
This causes a large performance loss: the BD-rate increases by 0.4% to 2.6%.

**BD-Rate (%) of Proposed Approach by Encoding 5 Frames Only**

When fewer frames are encoded, the BD-rate reduction is smaller; when more frames are encoded, it is larger.
As the number of test frames increases, the performance of the proposed method improves further, since subsequently coded frames benefit from the better interpolation process.

During the days of coronavirus, I hope to write 30 stories in this month to give myself a small challenge. This is the 19th story in this month. Thanks for visiting my story…

Reference

[2017 VCIP] [Zhang VCIP’17]
Learning a Convolutional Neural Network for Fractional Interpolation in HEVC Inter Coding

Codec Prediction

[CNNIF] [Zhang VCIP’17] [Xu VCIP’17] [Song VCIP’17] [IPCNN] [IPFCN] [NNIP] [Li TCSVT’18]