Reading: DeepFrame — Deep Frame Prediction for Video Coding (HEVC Inter Prediction)

Enhanced SepConv, Outperforms FRUC+DVRF, 4.4%, 2.4%, 2.3% BD-Rate Reduction in LDP, LD & RA Configurations Respectively

Sik-Ho Tsang
6 min read · May 31, 2020

In this story, “Deep Frame Prediction for Video Coding” (DeepFrame), by Simon Fraser University, is briefly presented. I just call it DeepFrame since the synthesized frame is called Deep Frame in the paper. I read this because I work on video coding research. In this paper:

  • SepConv is enhanced so that it can predict not just the mid-frame, but a frame at any time instant.
  • Uni-directional prediction is also supported, so that the low delay P configuration, where there are only P frames, can be handled as well.

This is a paper in 2019 TCSVT where TCSVT has a high impact factor of 4.046. (Sik-Ho Tsang @ Medium)

Outline

  1. Network Architecture
  2. Loss Function
  3. Some Other Details
  4. Experimental Results

1. Network Architecture

Network Architecture
Network Architecture Details
  • (Since the network is based on SepConv, the main difference is at the beginning of the network. Thus, I will not cover too many details. It is better to read SepConv before reading this paper.)
  • There are two inputs, i.e. two patch tensors ~P_{t-l} and ~P_{t-k} of size N×M×3 (N = M = 128, 3 color channels).
  • If needed, the patches are first converted from YUV420 to YUV444 so that all color channels have the same resolution, and the final frame ^P_t is converted back to YUV420 (a sketch of this conversion follows this list).
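A minimal sketch of the YUV420 → YUV444 conversion step, assuming simple nearest-neighbor chroma upsampling (the paper's exact resampling filter is not reproduced here, and the function name is mine):

    import numpy as np

    def yuv420_to_yuv444(y, u, v):
        # y: N x M luma plane; u, v: (N/2) x (M/2) chroma planes (YUV420).
        # Nearest-neighbor upsampling of the chroma planes to the luma resolution.
        u444 = u.repeat(2, axis=0).repeat(2, axis=1)
        v444 = v.repeat(2, axis=0).repeat(2, axis=1)
        # Stack into the N x M x 3 tensor from which the input patches are taken.
        return np.stack([y, u444, v444], axis=-1)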

1.1. What Is New Compared With SepConv

  • In addition to the color channels, the input tensor contains an additional temporal index channel, which is new compared with SepConv.
  • This channel is a matrix filled with a constant c_i that depends on the temporal index of the patch.
  • The sign of c_i indicates whether the corresponding patch comes from a previous or subsequent frame, and its magnitude indicates the relative distance to the current frame.
  • Convolution B1 is performed separately on each input to fuse the spatial and temporal information before going through a U-Net-like network, as shown above. (A sketch of assembling such an input tensor is given after this list.)
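A minimal sketch of building one input tensor with the temporal index channel. Only the stated properties of c_i are used (its sign distinguishes past from future, its magnitude encodes the distance); the assumption c_i = i − t and the function name are mine:

    import numpy as np

    def build_input_tensor(patch_yuv444, i, t):
        # patch_yuv444: N x M x 3 patch taken from frame i, used to predict frame t.
        n, m, _ = patch_yuv444.shape
        c_i = float(i - t)  # assumed form: negative = past frame, |c_i| = distance to t
        temporal = np.full((n, m, 1), c_i, dtype=np.float32)
        # Concatenate the constant temporal index channel: N x M x 4 input to B1.
        return np.concatenate([patch_yuv444.astype(np.float32), temporal], axis=-1)

    # e.g. uni-directional prediction of frame t from the two previous frames:
    # x1 = build_input_tensor(patch_prev1, t - 1, t)
    # x2 = build_input_tensor(patch_prev2, t - 2, t)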

1.2. Afterwards, Everything Is the Same as SepConv

  • SepConv addresses AdaConv’s memory and complexity problem by estimating a pair of 1D kernels that approximate each 2D kernel, i.e. (f^h_{t-l}, f^v_{t-l}) and (f^h_{t-k}, f^v_{t-k}), so that f_{t-l} is approximated by the outer product f^v_{t-l}(f^h_{t-l})^T and f_{t-k} by f^v_{t-k}(f^h_{t-k})^T, as shown above, near the end of the network.
  • Thus, SepConv can reduce the number of kernel parameters from n² to 2n for each kernel.
  • After estimating the two kernel pairs for the two patches, they are convolved with the patches and the results are fused to obtain the final interpolated patch, which is called the Deep Frame. (A per-pixel sketch is given below.)
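A per-pixel sketch of this separable convolution, assuming the network has produced the four 1D kernels for the n×n neighborhoods around the co-located pixel in the two reference patches (variable names are mine):

    import numpy as np

    def predict_pixel(neigh_l, neigh_k, fv_l, fh_l, fv_k, fh_k):
        # Each 2D kernel is approximated by the outer product of a vertical and a
        # horizontal 1D kernel, reducing parameters from n^2 to 2n per kernel.
        k_l = np.outer(fv_l, fh_l)  # approximates f_{t-l}
        k_k = np.outer(fv_k, fh_k)  # approximates f_{t-k}
        # Convolve each n x n neighborhood with its kernel and fuse the results.
        return float((neigh_l * k_l).sum() + (neigh_k * k_k).sum())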

2. Loss Function

  • The loss function consists of several terms.
  • The first term is the Mean Squared Error (MSE) between the predicted patch ^P_t and the original patch P_t: L_N = ||^P_t − P_t||².
  • The second loss term is the feature reconstruction loss, i.e. the VGG loss: L_F = ||Φ(^P_t) − Φ(P_t)||²,
  • where Φ is the feature extraction function. Herein, the output of the relu4_4 layer of the VGG-19 network is used.
  • This VGG feature provides good global features but does not capture the local structure of the input signal.
  • For this purpose, another loss term that captures more localized information is employed. It is based on geometric features, i.e. the mean absolute difference (MAD) of the gradients of ^P_t and P_t, denoted L_G.
  • Finally, the loss function is: L = λ_N·L_N + λ_F·L_F + λ_G·L_G,
  • where λ_N, λ_F, and λ_G are 2, 2, and 1, respectively. (A sketch of the full loss is given after this list.)
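A minimal PyTorch sketch of the combined loss as reconstructed above. The finite-difference gradient operator in the geometric term is my assumption, and ImageNet normalization before the VGG is omitted for brevity:

    import torch
    import torch.nn.functional as F
    import torchvision

    # VGG-19 feature extractor up to relu4_4 (index 26 in .features, hence [:27]).
    vgg = torchvision.models.vgg19(weights="IMAGENET1K_V1").features[:27].eval()
    for p in vgg.parameters():
        p.requires_grad_(False)

    def gradient_mad(pred, target):
        # MAD between horizontal and vertical finite-difference gradients.
        dh = (pred[..., :, 1:] - pred[..., :, :-1]) - (target[..., :, 1:] - target[..., :, :-1])
        dv = (pred[..., 1:, :] - pred[..., :-1, :]) - (target[..., 1:, :] - target[..., :-1, :])
        return dh.abs().mean() + dv.abs().mean()

    def deepframe_loss(pred, target, lam_n=2.0, lam_f=2.0, lam_g=1.0):
        # pred, target: (B, 3, N, M) tensors in [0, 1].
        l_n = F.mse_loss(pred, target)            # L_N: MSE term
        l_f = F.mse_loss(vgg(pred), vgg(target))  # L_F: VGG relu4_4 feature loss
        l_g = gradient_mad(pred, target)          # L_G: geometric (gradient MAD) term
        return lam_n * l_n + lam_f * l_f + lam_g * l_g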

3. Some Other Details

3.1. HEVC Implementation

HEVC Implementation
  • For each block, an additional flag, DFP flag, is added.
  • If it is 1, the deep frame is used for prediction and no motion vector is utilized; only the residual information needs to be coded.
  • Otherwise, conventional inter prediction is used. (A sketch of the reconstruction logic is given below.)
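A tiny sketch of the reconstruction logic implied by the DFP flag (the function and argument names are illustrative, not the HM-16.20 API):

    def reconstruct_block(dfp_flag, deep_frame_pred, inter_pred, residual):
        # dfp_flag == 1: the prediction is taken from the synthesized deep frame;
        # no motion vector is signaled and only the residual is coded.
        # dfp_flag == 0: conventional HEVC inter prediction is used instead.
        pred = deep_frame_pred if dfp_flag == 1 else inter_pred
        return pred + residual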

3.2. Pre-Training & Fine-Tuning

Ablation Study Performed at Pre-Training Stage
  • HM-16.20 is used.
  • Pre-training is performed using 27 small-resolution video sequences, which are either 352×240 (SIF) or 352×288 (CIF).
  • Ablation study is carried out at the pre-training stage.
  • As seen above, without the correct temporal index (orange), the performance of the network drops by about 2 dB.
  • When the B1 blocks (green) are removed from the proposed network, such that the two input patches are stacked and directly fed to B2, performance degrades by approximately 1 dB.
  • When the skip connections (purple) from the merge point B2 to each of the outputs B10 are removed, performance drops by about 0.5 dB.
  • Without the geometric loss terms (red), i.e. λG = 0, the prediction performance degrades by about 1 dB.
  • After pre-training, fine-tuning is performed using higher resolution videos, ranging from SIF to FullHD.

4. Experimental Results

4.1. BD-Rate

BD-Rate (%) on HEVC Test Sequences
  • Sep.: Two separate models, one for uni-directional, one for bi-directional prediction.
  • Comb.: One single model for both uni and bi-directional predictions.
  • For the LP configuration, the proposed method achieves the largest bit savings of up to 10.1% and 9.8% with the separate model and the combined model, respectively.
  • On average, 2.3% to 4.8% BD-rate reduction on Y is achieved.
  • The encoding time increases by 49% to 63%, and the decoding time increases by 114× to 165×, since the kernels must be estimated and convolved with the patches whenever the Deep Frame is in use.
BD-Rate (%) Using MS-SSIM on HEVC Test Sequences
  • BD-rate (%) using MS-SSIM instead of PSNR is also measured.
  • 2.09% to 4.01% BD-rate reduction is achieved. (A sketch of the BD-rate computation itself is given below.)
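As a side note, BD-rate (the Bjøntegaard delta rate reported in these tables) measures the average bitrate difference between two rate-distortion curves at equal quality. A minimal sketch of the standard computation, with my own function name and a 4-point R-D input format:

    import numpy as np

    def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
        # Fit cubic polynomials of log-rate as a function of PSNR (4 R-D points each).
        p_a = np.polyfit(psnr_anchor, np.log(rate_anchor), 3)
        p_t = np.polyfit(psnr_test, np.log(rate_test), 3)
        # Integrate both fits over the overlapping PSNR interval.
        lo = max(min(psnr_anchor), min(psnr_test))
        hi = min(max(psnr_anchor), max(psnr_test))
        int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
        int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
        # Average log-rate difference -> percentage rate difference.
        avg_diff = (int_t - int_a) / (hi - lo)
        return (np.exp(avg_diff) - 1) * 100  # negative = bit savings vs. the anchor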

4.2. Mode Distribution

BQMall in LP Configuration
  • As shown above, the DNN mode (i.e. DeepFrame) takes over a portion of the inter and skip modes, which improves the coding efficiency.

4.3. SOTA Comparison

BD-Rate (%) on HEVC Test Sequences Under RA Configuration
  • Compared with FRUC+DVRF [28], “Sep.” is the best overall, with the highest average bitrate reduction (3.3%) and the best performance in 9 out of 13 sequences in this test.
  • FRUC+DVRF [28] is the second best overall, with an average bit rate reduction of 3.2% and top performance in 4 out of 13 sequences. It achieves especially good performance on BQSquare, which significantly boosts its average bit saving.
  • “Comb.” comes in third with a slightly lower overall bit rate reduction of 3.1%. However, even the combined DNN provides better coding gain than FRUC+DVRF [28] in 8 out of 13 sequences.
  • Note that the above comparison mainly involves bi-directional prediction since the RA configuration is used, and FRUC+DVRF [28] cannot perform uni-directional prediction.

4.4. Visual Comparison

(a) original, (b) HEVC Inter-frame coding, (c) SepConv, (d) ‘Sep’, (e) ‘Comb’
  • Especially in the second row, SepConv cannot interpolate the basketball well, while the deep frame generated by the proposed approach synthesizes the basketball without any ghosting artifacts.
  • For HEVC inter coding, it is noted that it achieves good quality but needs more bitrate.

During the days of coronavirus, a challenge of writing 30/35/40/45 stories again this month has been accomplished. This is the 49th story this month..!! 1 more to go. Can I finish 50 stories this month (i.e. in less than 7 hrs in my timezone)?? Thanks for visiting my story..


Written by Sik-Ho Tsang

PhD, Researcher. I share what I learn. :) Linktree: https://linktr.ee/shtsang for Twitter, LinkedIn, etc.
