Review — Deep Inter Coding with Interpolated Reference Frame for Hierarchical Coding Structure (HEVC Inter)

Using DSepConv for Video Frame Interpolation, Outperforms DeepFrame

Sik-Ho Tsang
4 min read · May 19, 2021
Block diagram of the proposed inter coding scheme with the architecture of interpolation network from DSepConv [10].

In this story, Deep Inter Coding with Interpolated Reference Frame for Hierarchical Coding Structure (Guo VCIP’20), is briefly reviewed. In this paper:

  • As shown above, a new reference frame Fmid is interpolated by a CNN from the two-sided previously reconstructed frames Fprev and Fnext.
  • The synthesized frame is merged into the reference picture list for motion estimation to further reduce the prediction residual.

This is a paper in 2020 VCIP. (Sik-Ho Tsang @ Medium)

Outline

  1. Generation of Interpolated Reference Frame
  2. Experimental Results

1. Generation of Interpolated Reference Frame

1.1. Hierarchical B Coding Structure

Hierarchical B coding structure with 5 temporal layers
  • In HEVC, the hierarchical B coding structure is adopted to improve the coding efficiency of inter prediction.
  • 5 temporal layers are used under the random access (RA) configuration.

Frames in higher temporal layers can exploit the reconstructions of lower layers as references.
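
To make the layer assignment concrete, here is a minimal Python sketch, assuming the standard dyadic random-access GOP of 16; the temporal_layer helper is illustrative, not from the paper:

```python
# Minimal sketch: temporal layer of a frame in a dyadic GOP of 16 with
# 5 temporal layers (layer 1 holds POC multiples of 16, layer 5 the odd POCs).
def temporal_layer(poc: int, gop: int = 16) -> int:
    """Return the temporal layer (1..5) of a frame given its POC."""
    if poc % gop == 0:
        return 1
    layer, step = 2, gop // 2
    while poc % step != 0:
        layer, step = layer + 1, step // 2
    return layer

assert [temporal_layer(p) for p in range(5)] == [1, 5, 4, 5, 3]
```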

1.2. Interpolation Process using DSepConv

  • Let F(·) denote the interpolation process conducted by DSepConv, and Irp denote the reconstruction of Ip with POC = p, which will be utilized as a reference.
  • For a to-be-coded frame Ip at temporal layer L(Ip) in the above figure, the interpolated frame Ig is generated as Ig = F(Ir(p−d), Ir(p+d)), where the POC distance d = 2^(5−L(Ip)) follows from the dyadic structure with 5 temporal layers.
  • Only a single model is used to handle interpolation task for all layers.
  • DSepConv can be divided into four modules (see the sketch after this list).
  • Given two input pictures, an encoder-decoder structure extracts features, which are fed to three sub-modules that estimate the parameters of the deformable separable convolution: the kernels, offsets, and masks.
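
As a schematic PyTorch sketch of these four modules (the layer widths and head channel counts are illustrative assumptions; the final deformable-separable-convolution synthesis step of [10] is omitted):

```python
import torch
import torch.nn as nn

class DSepConvSketch(nn.Module):
    """Schematic of DSepConv's four modules; sizes are illustrative."""
    def __init__(self, k=5):
        super().__init__()
        # 1) Encoder-decoder: extracts joint features from the two inputs
        #    (collapsed here into two conv layers for brevity).
        self.encoder_decoder = nn.Sequential(
            nn.Conv2d(6, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        # 2)-4) Three sub-modules estimating, per pixel, the separable
        # kernels, the sampling offsets, and the modulation masks.
        self.kernel_head = nn.Conv2d(64, 4 * k, 3, padding=1)
        self.offset_head = nn.Conv2d(64, 4 * k, 3, padding=1)
        self.mask_head = nn.Conv2d(64, 2 * k, 3, padding=1)

    def forward(self, f_prev, f_next):
        feat = self.encoder_decoder(torch.cat([f_prev, f_next], dim=1))
        kernels = self.kernel_head(feat)
        offsets = self.offset_head(feat)
        masks = torch.sigmoid(self.mask_head(feat))
        # The synthesis of Fmid via deformable separable convolution over
        # f_prev and f_next with these parameters is omitted in this sketch.
        return kernels, offsets, masks
```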

In brief, according to the temporal layer, different combinations of Fprev and Fnext are chosen to interpolate the frame Fmid, as the following sketch illustrates.
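
A minimal sketch of this reference-generation step, reusing the temporal_layer helper above and assuming a hypothetical dsepconv(prev, next) wrapper around the network; frames are buffered in a dict keyed by POC:

```python
def interpolated_reference(decoded: dict, poc: int, dsepconv, max_layer: int = 5):
    """Synthesize a reference for the frame at `poc` from the two
    reconstructed neighbours selected by its temporal layer."""
    layer = temporal_layer(poc)
    d = 2 ** (max_layer - layer)       # POC distance to the two references
    f_prev, f_next = decoded[poc - d], decoded[poc + d]
    return dsepconv(f_prev, f_next)    # Fmid = F(Fprev, Fnext)
```

For example, the frame at POC 8 (layer 2) is interpolated from the reconstructions at POC 0 and POC 16, while an odd-POC frame (layer 5) uses its immediate neighbours.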

1.3. Integration of Interpolated Reference Frame

  • There are two integration approaches.
  • One is to replace an existing reference frame with the interpolated frame.
  • The other is to enlarge the reference list and append the interpolated frame to it.

The authors choose the latter: using more reference frames extends the diversity of the reference picture list, and the frame synthesized by the network is inserted at the end of both List0 and List1 (see the sketch below).

  • Both the encoder and the decoder obtain the synthesized frames through the same pipeline.
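
As a sketch of this integration, assuming the hypothetical decoded buffer and helpers from the sketches above (the list contents are purely illustrative; in HM the synthesized picture would be appended after the lists are built as usual):

```python
def build_reference_lists(decoded, poc, dsepconv):
    list0 = [decoded[poc - 1], decoded[poc - 2]]  # past references (illustrative)
    list1 = [decoded[poc + 1]]                    # future reference (illustrative)
    f_mid = interpolated_reference(decoded, poc, dsepconv)
    list0.append(f_mid)  # synthesized frame appended at the end of List0 ...
    list1.append(f_mid)  # ... and at the end of List1
    return list0, list1
```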

2. Experimental Results

2.1. Training

  • The training dataset is Vimeo90K, which consists of 55,095 triplets with a resolution of 448×256.
  • The first and the last frames of each triplet are encoded by HM-16.20 under the all-intra (AI) configuration with QP values ranging from 20 to 44.
  • Moreover, the QPs of the two input frames are selected with a random difference from 0 to 10. As a result, only one model is trained to handle different QP conditions.
  • Small patches of 128×128 are cropped randomly from training samples.
  • The network is further fine-tuned with patches of size 256×256.
  • For data augmentation, patches are randomly rotated or flipped before being fed into the network.
  • The loss function proposed in DSepConv [10] is used, which combines two loss terms measuring the prediction distortion on pixel color and on frame gradient (a sketch follows below).
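
A minimal sketch of such a two-term loss, assuming L1 distances and a weighting factor lam; the exact formulation and weighting follow [10]:

```python
import torch.nn.functional as F

def interpolation_loss(pred, target, lam=1.0):
    """Color loss plus gradient loss, both as L1 distances (assumed)."""
    color = F.l1_loss(pred, target)  # distortion on pixel color
    # Distortion on frame gradient: L1 over horizontal and vertical diffs.
    gx = F.l1_loss(pred[..., :, 1:] - pred[..., :, :-1],
                   target[..., :, 1:] - target[..., :, :-1])
    gy = F.l1_loss(pred[..., 1:, :] - pred[..., :-1, :],
                   target[..., 1:, :] - target[..., :-1, :])
    return color + lam * (gx + gy)
```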

2.2. BD-Rate

BD-Rate (%) on CTC Testing Sequences
  • On average, the proposed method obtains a 4.6% coding gain for the luma component.
  • In particular, the proposed method shows considerable gains on high-resolution sequences: about 8.7% BD-rate reduction is achieved on the sequence PeopleOnStreet.
  • A variant without fine-tuning is also evaluated; on average, a further 0.5% BD-rate reduction is achieved by enabling fine-tuning.

2.3. Comparison with Existing Method

  • The proposed method is also compared with DeepFrame [12].
  • As shown above, the proposed method achieves superior performance on most of the sequences, especially on high-resolution content.
