Review — Deep Inter Coding with Interpolated Reference Frame for Hierarchical Coding Structure (HEVC Inter)

Using DSepConv for Video Frame Interpolation, Outperforms DeepFrame

Sik-Ho Tsang
4 min read · May 19, 2021
Block diagram of the proposed inter coding scheme with the architecture of interpolation network from DSepConv [10].

In this story, Deep Inter Coding with Interpolated Reference Frame for Hierarchical Coding Structure (Guo VCIP’20), is briefly reviewed. In this paper:

  • As shown above, a new reference frame Fmid is interpolated by a CNN from the two-sided previously reconstructed frames Fprev and Fnext.
  • The synthesized frame is merged into the reference picture list for motion estimation to further reduce the prediction residual.

This is a paper in 2020 VCIP. (Sik-Ho Tsang @ Medium)

Outline

  1. Generation of Interpolated Reference Frame
  2. Experimental Results

1. Generation of Interpolated Reference Frame

1.1. Hierarchical B Coding Structure

Hierarchical B coding structure with 5 temporal layers
  • In HEVC, the hierarchical B coding structure is adopted to improve the coding efficiency of inter prediction.
  • 5 temporal layers are used under the random access (RA) configuration.

Frames in higher temporal layers can exploit the reconstructions of lower layers as references.
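
To make the layer assignment concrete, here is a minimal Python sketch, assuming the standard dyadic random-access GOP of 16; the temporal_layer helper is illustrative, not from the paper:

```python
# Minimal sketch: temporal layer of a frame in a dyadic GOP of 16 with
# 5 temporal layers (layer 1 holds POC multiples of 16, layer 5 the odd POCs).
def temporal_layer(poc: int, gop: int = 16) -> int:
    """Return the temporal layer (1..5) of a frame given its POC."""
    if poc % gop == 0:
        return 1
    layer, step = 2, gop // 2
    while poc % step != 0:
        layer, step = layer + 1, step // 2
    return layer

assert [temporal_layer(p) for p in range(5)] == [1, 5, 4, 5, 3]
```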

1.2. Interpolation Process using DSepConv

  • Let F(·) denote the interpolation process conducted by DSepConv, and Irp denote the reconstruction of Ip with POC = p, which will be utilized as a reference.
  • For a to-be-coded frame Ip at temporal layer L(Ip) in the above figure, the interpolated frame Ig is generated as Ig = F(Ir(p−d), Ir(p+d)), where the POC distance d = 2^(5−L(Ip)) follows from the dyadic structure with 5 temporal layers.
  • Only a single model is used to handle interpolation task for all layers.
  • DSepConv can be divided into four modules (see the sketch after this list).
  • Given two input pictures, an encoder-decoder structure extracts features, which are fed to three sub-modules that estimate the parameters of the deformable separable convolution: the kernels, offsets, and masks.
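
As a schematic PyTorch sketch of these four modules (the layer widths and head channel counts are illustrative assumptions; the final deformable-separable-convolution synthesis step of [10] is omitted):

```python
import torch
import torch.nn as nn

class DSepConvSketch(nn.Module):
    """Schematic of DSepConv's four modules; sizes are illustrative."""
    def __init__(self, k=5):
        super().__init__()
        # 1) Encoder-decoder: extracts joint features from the two inputs
        #    (collapsed here into two conv layers for brevity).
        self.encoder_decoder = nn.Sequential(
            nn.Conv2d(6, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        # 2)-4) Three sub-modules estimating, per pixel, the separable
        # kernels, the sampling offsets, and the modulation masks.
        self.kernel_head = nn.Conv2d(64, 4 * k, 3, padding=1)
        self.offset_head = nn.Conv2d(64, 4 * k, 3, padding=1)
        self.mask_head = nn.Conv2d(64, 2 * k, 3, padding=1)

    def forward(self, f_prev, f_next):
        feat = self.encoder_decoder(torch.cat([f_prev, f_next], dim=1))
        kernels = self.kernel_head(feat)
        offsets = self.offset_head(feat)
        masks = torch.sigmoid(self.mask_head(feat))
        # The synthesis of Fmid via deformable separable convolution over
        # f_prev and f_next with these parameters is omitted in this sketch.
        return kernels, offsets, masks
```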

In brief, according to the temporal layer, different combinations of Fprev and Fnext are chosen to interpolate the frame Fmid, as the following sketch illustrates.
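
A minimal sketch of this reference-generation step, reusing the temporal_layer helper above and assuming a hypothetical dsepconv(prev, next) wrapper around the network; frames are buffered in a dict keyed by POC:

```python
def interpolated_reference(decoded: dict, poc: int, dsepconv, max_layer: int = 5):
    """Synthesize a reference for the frame at `poc` from the two
    reconstructed neighbours selected by its temporal layer."""
    layer = temporal_layer(poc)
    d = 2 ** (max_layer - layer)       # POC distance to the two references
    f_prev, f_next = decoded[poc - d], decoded[poc + d]
    return dsepconv(f_prev, f_next)    # Fmid = F(Fprev, Fnext)
```

For example, the frame at POC 8 (layer 2) is interpolated from the reconstructions at POC 0 and POC 16, while an odd-POC frame (layer 5) uses its immediate neighbours.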

1.3. Integration of Interpolated Reference Frame

  • There are two integration approaches.
  • One is to replace an existing reference frame with the interpolated frame.
  • The other is to enlarge the reference list and append the interpolated frame to it.

The authors choose the latter: using more reference frames extends the diversity of the reference picture list, and the frame synthesized by the network is inserted at the end of both List0 and List1 (see the sketch below).

  • Both the encoder and the decoder obtain the synthesized frames through the same pipeline.
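
As a sketch of this integration, assuming the hypothetical decoded buffer and helpers from the sketches above (the list contents are purely illustrative; in HM the synthesized picture would be appended after the lists are built as usual):

```python
def build_reference_lists(decoded, poc, dsepconv):
    list0 = [decoded[poc - 1], decoded[poc - 2]]  # past references (illustrative)
    list1 = [decoded[poc + 1]]                    # future reference (illustrative)
    f_mid = interpolated_reference(decoded, poc, dsepconv)
    list0.append(f_mid)  # synthesized frame appended at the end of List0 ...
    list1.append(f_mid)  # ... and at the end of List1
    return list0, list1
```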

2. Experimental Results

2.1. Training

  • The training dataset is Vimeo90K, which consists of 55,095 triplets with a resolution of 448×256.
  • The first and the last frames of each triplet are encoded by HM-16.20 under the all-intra (AI) configuration with QP values ranging from 20 to 44.
  • Moreover, the QPs of the two input frames are selected with a random difference from 0 to 10. As a result, only one model is trained to handle different QP conditions.
  • Small patches of 128×128 are cropped randomly from training samples.
  • The network is further fine-tuned with patches of size 256×256.
  • For data augmentation, patches are randomly rotated or flipped before being fed into the network.
  • The loss function proposed in DSepConv [10] is used, which combines two loss terms measuring the prediction distortion on pixel color and on frame gradient (a sketch follows below).
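
A minimal sketch of such a two-term loss, assuming L1 distances and a weighting factor lam; the exact formulation and weighting follow [10]:

```python
import torch.nn.functional as F

def interpolation_loss(pred, target, lam=1.0):
    """Color loss plus gradient loss, both as L1 distances (assumed)."""
    color = F.l1_loss(pred, target)  # distortion on pixel color
    # Distortion on frame gradient: L1 over horizontal and vertical diffs.
    gx = F.l1_loss(pred[..., :, 1:] - pred[..., :, :-1],
                   target[..., :, 1:] - target[..., :, :-1])
    gy = F.l1_loss(pred[..., 1:, :] - pred[..., :-1, :],
                   target[..., 1:, :] - target[..., :-1, :])
    return color + lam * (gx + gy)
```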

2.2. BD-Rate

BD-Rate (%) on CTC Testing Sequences
  • On average, the proposed method obtains a 4.6% coding gain for the luma component.
  • In particular, the proposed method shows considerable gains on high-resolution sequences: about 8.7% BD-rate reduction is achieved on the sequence PeopleOnStreet.
  • A variant without fine-tuning is also evaluated; on average, a further 0.5% BD-rate reduction is achieved by enabling fine-tuning.

2.3. Comparison with Existing Method

  • The proposed method is also compared with DeepFrame [12].
  • As shown above, the proposed method achieves superior performance on most of the sequences, especially on high-resolution content.
