Reading: DeepFrame — Deep Frame Prediction for Video Coding (HEVC Inter Prediction)
Enhanced SepConv, Outperforming FRUC+DVRF, With 4.4%, 2.4% & 2.3% BD-Rate Reductions in the LDP, LD & RA Configurations, Respectively
In this story, “Deep Frame Prediction for Video Coding” (DeepFrame), by Simon Fraser University, is briefly presented. I call it DeepFrame since the synthesized frame is called the “deep frame” in the paper. I read this because I work on video coding research. In this paper:
- SepConv is enhanced so that it can predict not only the mid-frame but frames at any time instant.
- Uni-directional prediction is also supported, so that the low delay P (LDP) configuration, where there are only P frames, can be handled as well.
This is a paper in 2019 TCSVT, where TCSVT has a high impact factor of 4.046. (Sik-Ho Tsang @ Medium)
Outline
- Network Architecture
- Loss Function
- Some Other Details
- Experimental Results
1. Network Architecture
- (Since the network is based on SepConv, the main difference is at the beginning of the network. Thus, I will not cover too many details here. It is better to read SepConv before reading this paper.)
- There are two inputs, i.e. two patch tensors P̃t−l and P̃t−k of size N×M×3 (N=M=128, 3 color channels).
- If needed, the patches are first converted from YUV420 to YUV444 so that all color channels have the same resolution, and the final frame P̂t is converted back to YUV420 (a conversion sketch is given below).
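The paper does not spell out the resampling filters for the 420↔444 conversion; the following is a minimal NumPy sketch under the assumption of nearest-neighbor chroma upsampling and 2×2 averaging for downsampling (the function names are mine):

```python
import numpy as np

def yuv420_to_yuv444(y, u, v):
    """Upsample 4:2:0 chroma to the luma resolution.

    y: (H, W) luma plane; u, v: (H/2, W/2) chroma planes.
    Nearest-neighbor replication is used purely for illustration."""
    u444 = u.repeat(2, axis=0).repeat(2, axis=1)
    v444 = v.repeat(2, axis=0).repeat(2, axis=1)
    return np.stack([y, u444, v444], axis=-1)   # (H, W, 3)

def yuv444_to_yuv420(yuv):
    """Downsample chroma back to 4:2:0 by 2x2 averaging."""
    h, w = yuv.shape[0], yuv.shape[1]
    y = yuv[..., 0]
    u = yuv[..., 1].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    v = yuv[..., 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    return y, u, v
```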
1.1. What’s New Compared to SepConv
- In addition to the color channels, the input tensor contains an additional temporal index channel, which is new compared to SepConv:
- This channel is a matrix filled with a constant ci that depends on the temporal index:
- The sign of ci indicates whether the corresponding patch comes from a previous or subsequent frame, and its magnitude indicates the relative distance to the current frame.
- Convolution B1 is performed on each input separately to fuse its spatial and temporal information before going through a U-Net-like network, as shown above (a sketch of this input format is given below).
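To make the input format concrete, here is a hedged PyTorch sketch of assembling the 4-channel input (3 color channels plus the constant temporal-index channel) and fusing it with a per-input B1 convolution; the exact scaling of ci, the B1 channel count, and whether the two inputs share B1 weights are my assumptions, not taken from the paper:

```python
import torch
import torch.nn as nn

def make_input_tensor(patch, t_offset):
    """Append a constant temporal-index channel to a color patch.

    patch:    (3, N, M) color tensor (YUV444).
    t_offset: signed temporal distance to the current frame, e.g. -2
              for a patch two frames in the past, +1 for one frame in
              the future. The exact scaling of c_i is an assumption."""
    c = torch.full((1, patch.shape[1], patch.shape[2]), float(t_offset))
    return torch.cat([patch, c], dim=0)   # (4, N, M)

# B1 fuses the color and temporal information of each input separately
# before the U-Net-like trunk; the channel count (32) is illustrative.
b1 = nn.Sequential(nn.Conv2d(4, 32, kernel_size=3, padding=1), nn.ReLU())

x_prev = make_input_tensor(torch.rand(3, 128, 128), t_offset=-1)
x_next = make_input_tensor(torch.rand(3, 128, 128), t_offset=+1)
f_prev = b1(x_prev.unsqueeze(0))   # (1, 32, 128, 128)
f_next = b1(x_next.unsqueeze(0))   # both then enter the U-Net trunk
```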
1.2. Afterwards, Everything Is the Same as SepConv
- SepConv addresses AdaConv’s memory and complexity problem by estimating a pair of 1D kernels that approximate each 2D kernel, i.e. (fʰt−l, fᵛt−l) and (fʰt−k, fᵛt−k), so that ft−l is approximated as fʰt−l ∗ fᵛt−l and ft−k as fʰt−k ∗ fᵛt−k, as shown above, near the end of the network.
- Thus, SepConv can reduce the number of kernel parameters from n² to 2n for each kernel.
- After the two kernel pairs are estimated for the two patches, they are convolved with the patches and the results are fused to obtain the final interpolated patch, which is called the deep frame (see the sketch below).
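As a refresher on the SepConv operation itself, the sketch below illustrates how a per-pixel pair of 1D kernels reproduces an n×n 2D kernel as an outer product (n = 51 in the original SepConv; whether DeepFrame keeps this size is not stated here). This is a plain NumPy illustration, not the authors’ implementation:

```python
import numpy as np

def sepconv_1ch(patch, fh, fv, n=51):
    """Adaptive separable convolution (SepConv-style), one channel.

    patch:  (H, W) reference patch.
    fh, fv: (H, W, n) per-pixel horizontal / vertical 1D kernels.
    Each output pixel uses the n-by-n 2D kernel outer(fv, fh), i.e.
    n*n taps described by only 2n parameters."""
    r = n // 2
    padded = np.pad(patch, r)
    out = np.zeros_like(patch, dtype=np.float64)
    for i in range(patch.shape[0]):
        for j in range(patch.shape[1]):
            k2d = np.outer(fv[i, j], fh[i, j])           # (n, n)
            out[i, j] = np.sum(k2d * padded[i:i + n, j:j + n])
    return out

# The deep frame is the fusion (sum) of the two convolved references:
# P_hat_t = K_{t-l} * P~_{t-l} + K_{t-k} * P~_{t-k}
```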
2. Loss Function
- There are several terms for the loss function.
- The first term is the Mean Squared Error (MSE) between the predicted patch P̂t and the original patch Pt:
- The second loss term is based on the feature reconstruction loss, i.e. the VGG loss:
- where Φ is the feature extraction function. Herein, the output of the relu4_4 layer of the VGG-19 network is used.
- This VGG feature provides good global features but does not capture the local structure of the input signal.
- For this purpose, another loss term that captures more localized information is employed. It is based on geometric features, i.e. the mean absolute difference (MAD) of the gradients:
- Finally, the loss function is:
- where λN, λF, and λG are set to 2, 2, and 1, respectively (a minimal sketch of this combined loss is given below).
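Putting the three terms together, a minimal PyTorch sketch of the training loss might look as follows; the relu4_4 slice (layers 0–26 of torchvision’s vgg19().features) is the standard indexing, while the finite-difference gradient operator is my assumption about how the MAD-of-gradients term is computed:

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

# Feature extractor Φ: output of relu4_4, i.e. layers 0..26 of
# torchvision's vgg19().features, frozen during training.
vgg_feat = vgg19(pretrained=True).features[:27].eval()
for p in vgg_feat.parameters():
    p.requires_grad_(False)

def gradients(x):
    """Horizontal and vertical finite-difference gradients (assumed)."""
    return x[..., :, 1:] - x[..., :, :-1], x[..., 1:, :] - x[..., :-1, :]

def deepframe_loss(pred, target, lam_n=2.0, lam_f=2.0, lam_g=1.0):
    # MSE term between the predicted and original patch
    l_n = F.mse_loss(pred, target)
    # VGG feature reconstruction term (global structure)
    l_f = F.mse_loss(vgg_feat(pred), vgg_feat(target))
    # Geometric term: mean absolute difference (MAD) of the gradients
    pdx, pdy = gradients(pred)
    tdx, tdy = gradients(target)
    l_g = (pdx - tdx).abs().mean() + (pdy - tdy).abs().mean()
    return lam_n * l_n + lam_f * l_f + lam_g * l_g
```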
3. Some Other Details
3.1. HEVC Implementation
- For each block, an additional flag, the DFP flag, is added.
- If it is 1, the deep frame is used for prediction; no motion vector is signaled, and only the residual information needs to be coded.
- Otherwise, the conventional inter prediction is used.
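In pseudocode terms, the block-level decision reduces to the following sketch (deep_frame and motion_compensated_prediction are hypothetical placeholders for illustration, not HM-16.20 APIs):

```python
def predict_block(block, dfp_flag, deep_frame):
    """Block-level prediction switch (illustrative pseudocode only)."""
    if dfp_flag == 1:
        # Deep frame prediction: co-located block of the synthesized
        # frame; no motion vector is signaled.
        pred = deep_frame[block.y:block.y + block.h,
                          block.x:block.x + block.w]
    else:
        # Conventional HEVC motion-compensated inter prediction.
        pred = motion_compensated_prediction(block)   # placeholder
    return pred   # only the residual (block - pred) is then coded
```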
3.2. Pre-Training & Fine-Tuning
- HM-16.20 is used.
- Pre-training is performed using 27 small-resolution sequences, either 352×240 (SIF) or 352×288 (CIF).
- An ablation study is carried out at the pre-training stage.
- As seen above, without the correct temporal index (orange), the performance of the network drops by about 2 dB.
- When the B1 blocks (green) are removed from the proposed network, such that the two input patches are stacked and directly fed to B2, performance degrades by approximately 1 dB.
- When the skip connections (purple) from the merge point B2 to each of the outputs B10 are removed, performance drops by about 0.5 dB.
- Without the geometric loss terms (red), i.e. λG = 0, the prediction performance degrades by about 1 dB.
- After pre-training, fine-tuning is performed using higher resolution videos, ranging from SIF to FullHD.
4. Experimental Results
4.1. BD-Rate
- Sep.: Two separate models, one for uni-directional, one for bi-directional prediction.
- Comb.: One single model for both uni and bi-directional predictions.
- For the LDP configuration, the proposed method achieves the largest bit savings, up to 10.1% and 9.8% with the separate and combined models, respectively.
- On average, 2.3% to 4.8% BD-rate reduction on Y is achieved.
- The encoding time increases by 49% to 63%, while the decoding time increases by 114× to 165×, since whenever the deep frame is in use, the decoder must also run the network to estimate the kernels and convolve them with the patches.
- The BD-rate (%) is also measured using MS-SSIM as the quality metric.
- 2.09% to 4.01% BD-rate reduction is achieved.
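For readers unfamiliar with the metric, BD-rate is the standard Bjøntegaard calculation: fit third-order polynomials to the (quality, log-rate) points of the anchor and test codecs, then average the horizontal gap over the overlapping quality range. A compact NumPy version (my own, not from the paper; for the MS-SSIM table, the quality values are MS-SSIM scores instead of PSNR):

```python
import numpy as np

def bd_rate(rate_anchor, q_anchor, rate_test, q_test):
    """Bjoentegaard delta rate (%): average bitrate difference of the
    test codec vs. the anchor at equal quality (4 RD points each).
    Negative values mean bit savings."""
    la, lt = np.log(rate_anchor), np.log(rate_test)
    pa = np.polyfit(q_anchor, la, 3)      # log-rate as cubic in quality
    pt = np.polyfit(q_test, lt, 3)
    lo = max(min(q_anchor), min(q_test))  # overlapping quality range
    hi = min(max(q_anchor), max(q_test))
    ia = np.polyval(np.polyint(pa), hi) - np.polyval(np.polyint(pa), lo)
    it = np.polyval(np.polyint(pt), hi) - np.polyval(np.polyint(pt), lo)
    return (np.exp((it - ia) / (hi - lo)) - 1) * 100
```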
4.2. Mode Distribution
- As shown above, the DNN mode (i.e. DeepFrame) takes over part of the inter and skip modes, which improves the coding efficiency.
4.3. SOTA Comparison
- Compared with FRUC+DVRF [28], “Sep.” is the best overall, with the highest average bitrate reduction (3.3%) and the best performance in 9 out of 13 sequences in this test.
- FRUC+DVRF [28] is the second best overall, with an average bitrate reduction of 3.2% and top performance in 4 out of 13 sequences. It achieves especially good performance on BQSquare, which significantly boosts its average bit saving.
- “Comb.” comes in third with a slightly lower overall bitrate reduction of 3.1%. However, even the combined DNN provides a better coding gain than FRUC+DVRF [28] in 8 out of 13 sequences.
- Note that this comparison mainly involves bi-directional prediction since the RA configuration is used, and FRUC+DVRF [28] cannot perform uni-directional prediction.
4.4. Visual Comparison
- Especially in the second row, SepConv cannot interpolate the basketball well, while the deep frame generated by the proposed approach synthesizes it without any ghosting artifacts.
- As for HEVC, it is noted that it achieves good quality but needs a higher bitrate.
During the days of coronavirus, the challenge of writing 30/35/40/45 stories for this month has been accomplished. This is the 49th story this month..!! 1 more to go. Can I finish 50 stories this month (i.e. in less than 7 hrs in my timezone)?? Thanks for visiting my story..
Reference
[2019 TCSVT] [DeepFrame]
Deep Frame Prediction for Video Coding
Codec Inter Prediction
HEVC [CNNIF] [Zhang VCIP’17] [NNIP] [Ibrahim ISM’18] [VI-CNN] [FRUC+DVRF] [FRUC+DVRF+VECNN] [RSR] [Zhao ISCAS’18 & TCSVT’19] [Ma ISCAS’19] [ES] [CNN-SR & CNN-UniSR & CNN-BiSR] [DeepFrame]
VVC [FRUC+DVRF+VECNN]