Reading: DeepFrame — Deep Frame Prediction for Video Coding (HEVC Inter Prediction)
Enhanced SepConv, Outperforming FRUC+DVRF, With 4.4%, 2.4% & 2.3% BD-Rate Reductions in the LDP, LD & RA Configurations, Respectively
In this story, “Deep Frame Prediction for Video Coding” (DeepFrame), by Simon Fraser University, is briefly presented. I call it DeepFrame since the synthesized frame is called the “deep frame” in the paper. I read this because I work on video coding research. In this paper:
- SepConv is enhanced so that it can predict not only the mid-frame but frames at any time instant.
- Uni-directional prediction is also supported, so that the low delay P (LDP) configuration, where there are only P frames, can be handled as well.
This is a paper in 2019 TCSVT, where TCSVT has a high impact factor of 4.046. (Sik-Ho Tsang @ Medium)
Outline
- Network Architecture
- Loss Function
- Some Other Details
- Experimental Results
1. Network Architecture
- (Since the network is based on SepConv, the main difference is at the beginning of the network. Thus, I will not cover too many details here. It is better to read SepConv before reading this paper.)
- There are two inputs, i.e. two patch tensors P̃t−l and P̃t−k of size N×M×3 (N=M=128, 3 color channels).
- If needed, the patches are first converted from YUV420 to YUV444 so that all color channels have the same resolution, and the final frame P̂t is converted back to YUV420 (a conversion sketch is given below).
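The paper does not spell out the resampling filters for the 420↔444 conversion; the following is a minimal NumPy sketch under the assumption of nearest-neighbor chroma upsampling and 2×2 averaging for downsampling (the function names are mine):

```python
import numpy as np

def yuv420_to_yuv444(y, u, v):
    """Upsample 4:2:0 chroma to the luma resolution.

    y: (H, W) luma plane; u, v: (H/2, W/2) chroma planes.
    Nearest-neighbor replication is used purely for illustration."""
    u444 = u.repeat(2, axis=0).repeat(2, axis=1)
    v444 = v.repeat(2, axis=0).repeat(2, axis=1)
    return np.stack([y, u444, v444], axis=-1)   # (H, W, 3)

def yuv444_to_yuv420(yuv):
    """Downsample chroma back to 4:2:0 by 2x2 averaging."""
    h, w = yuv.shape[0], yuv.shape[1]
    y = yuv[..., 0]
    u = yuv[..., 1].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    v = yuv[..., 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    return y, u, v
```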
1.1. What’s New Compared to SepConv
- In addition to the color channels, the input tensor contains an additional temporal index channel, which is new compared to SepConv:
- This channel is a matrix filled with a constant ci that depends on the temporal index:
- The sign of ci indicates whether the corresponding patch comes from a previous or subsequent frame, and its magnitude indicates the relative distance to the current frame.
- Convolution B1 is performed on each input separately to fuse its spatial and temporal information before going through a U-Net-like network, as shown above (a sketch of this input format is given below).
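To make the input format concrete, here is a hedged PyTorch sketch of assembling the 4-channel input (3 color channels plus the constant temporal-index channel) and fusing it with a per-input B1 convolution; the exact scaling of ci, the B1 channel count, and whether the two inputs share B1 weights are my assumptions, not taken from the paper:

```python
import torch
import torch.nn as nn

def make_input_tensor(patch, t_offset):
    """Append a constant temporal-index channel to a color patch.

    patch:    (3, N, M) color tensor (YUV444).
    t_offset: signed temporal distance to the current frame, e.g. -2
              for a patch two frames in the past, +1 for one frame in
              the future. The exact scaling of c_i is an assumption."""
    c = torch.full((1, patch.shape[1], patch.shape[2]), float(t_offset))
    return torch.cat([patch, c], dim=0)   # (4, N, M)

# B1 fuses the color and temporal information of each input separately
# before the U-Net-like trunk; the channel count (32) is illustrative.
b1 = nn.Sequential(nn.Conv2d(4, 32, kernel_size=3, padding=1), nn.ReLU())

x_prev = make_input_tensor(torch.rand(3, 128, 128), t_offset=-1)
x_next = make_input_tensor(torch.rand(3, 128, 128), t_offset=+1)
f_prev = b1(x_prev.unsqueeze(0))   # (1, 32, 128, 128)
f_next = b1(x_next.unsqueeze(0))   # both then enter the U-Net trunk
```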
1.2. Afterwards, Everything Is the Same as SepConv
- SepConv addresses AdaConv’s memory and complexity problem by estimating a pair of 1D kernels that approximate each 2D kernel, i.e. (fʰt−l, fᵛt−l) and (fʰt−k, fᵛt−k), so that ft−l is approximated as fʰt−l ∗ fᵛt−l and ft−k as fʰt−k ∗ fᵛt−k, as shown above, near the end of the network.
- Thus, SepConv can reduce the number of kernel parameters from n² to 2n for each kernel.
- After the two kernel pairs are estimated for the two patches, they are convolved with the patches and the results are fused to obtain the final interpolated patch, which is called the deep frame (see the sketch below).
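As a refresher on the SepConv operation itself, the sketch below illustrates how a per-pixel pair of 1D kernels reproduces an n×n 2D kernel as an outer product (n = 51 in the original SepConv; whether DeepFrame keeps this size is not stated here). This is a plain NumPy illustration, not the authors’ implementation:

```python
import numpy as np

def sepconv_1ch(patch, fh, fv, n=51):
    """Adaptive separable convolution (SepConv-style), one channel.

    patch:  (H, W) reference patch.
    fh, fv: (H, W, n) per-pixel horizontal / vertical 1D kernels.
    Each output pixel uses the n-by-n 2D kernel outer(fv, fh), i.e.
    n*n taps described by only 2n parameters."""
    r = n // 2
    padded = np.pad(patch, r)
    out = np.zeros_like(patch, dtype=np.float64)
    for i in range(patch.shape[0]):
        for j in range(patch.shape[1]):
            k2d = np.outer(fv[i, j], fh[i, j])           # (n, n)
            out[i, j] = np.sum(k2d * padded[i:i + n, j:j + n])
    return out

# The deep frame is the fusion (sum) of the two convolved references:
# P_hat_t = K_{t-l} * P~_{t-l} + K_{t-k} * P~_{t-k}
```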
2. Loss Function
- There are several terms for the loss function.
- The first term is the Mean Squared Error (MSE) between the predicted patch P̂t and the original patch Pt:
- The second loss term is based on the feature reconstruction loss, i.e. the VGG loss:
- where Φ is the feature extraction function. Herein, the output of the relu4_4 layer of the VGG-19 network is used.
- This VGG feature provides good global features but does not capture the local structure of the input signal.
- For this purpose, another loss term that captures more localized information is employed. It is based on geometric features, i.e. the mean absolute difference (MAD) of the gradients:
- Finally, the loss function is:
- where λN, λF, and λG are set to 2, 2, and 1, respectively (a minimal sketch of this combined loss is given below).
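Putting the three terms together, a minimal PyTorch sketch of the training loss might look as follows; the relu4_4 slice (layers 0–26 of torchvision’s vgg19().features) is the standard indexing, while the finite-difference gradient operator is my assumption about how the MAD-of-gradients term is computed:

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

# Feature extractor Φ: output of relu4_4, i.e. layers 0..26 of
# torchvision's vgg19().features, frozen during training.
vgg_feat = vgg19(pretrained=True).features[:27].eval()
for p in vgg_feat.parameters():
    p.requires_grad_(False)

def gradients(x):
    """Horizontal and vertical finite-difference gradients (assumed)."""
    return x[..., :, 1:] - x[..., :, :-1], x[..., 1:, :] - x[..., :-1, :]

def deepframe_loss(pred, target, lam_n=2.0, lam_f=2.0, lam_g=1.0):
    # MSE term between the predicted and original patch
    l_n = F.mse_loss(pred, target)
    # VGG feature reconstruction term (global structure)
    l_f = F.mse_loss(vgg_feat(pred), vgg_feat(target))
    # Geometric term: mean absolute difference (MAD) of the gradients
    pdx, pdy = gradients(pred)
    tdx, tdy = gradients(target)
    l_g = (pdx - tdx).abs().mean() + (pdy - tdy).abs().mean()
    return lam_n * l_n + lam_f * l_f + lam_g * l_g
```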
3. Some Other Details
3.1. HEVC Implementation
- For each block, an additional flag, the DFP flag, is added.
- If it is 1, the deep frame is used for prediction; no motion vector is signaled, and only the residual information needs to be coded.
- Otherwise, the conventional inter prediction is used.
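In pseudocode terms, the block-level decision reduces to the following sketch (deep_frame and motion_compensated_prediction are hypothetical placeholders for illustration, not HM-16.20 APIs):

```python
def predict_block(block, dfp_flag, deep_frame):
    """Block-level prediction switch (illustrative pseudocode only)."""
    if dfp_flag == 1:
        # Deep frame prediction: co-located block of the synthesized
        # frame; no motion vector is signaled.
        pred = deep_frame[block.y:block.y + block.h,
                          block.x:block.x + block.w]
    else:
        # Conventional HEVC motion-compensated inter prediction.
        pred = motion_compensated_prediction(block)   # placeholder
    return pred   # only the residual (block - pred) is then coded
```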
3.2. Pre-Training & Fine-Tuning
- HM-16.20 is used.
- Pre-training is performed using 27 small-resolution sequences, either 352×240 (SIF) or 352×288 (CIF).
- An ablation study is carried out at the pre-training stage.
- As seen above, without the correct temporal index (orange), the performance of the network drops by about 2 dB.
- When the B1 blocks (green) are removed from the proposed network, such that the two input patches are stacked and directly fed to B2, performance degrades by approximately 1 dB.
- When the skip connections (purple) from the merge point B2 to each of the outputs B10 are removed, performance drops by about 0.5 dB.
- Without the geometric loss terms (red), i.e. λG = 0, the prediction performance degrades by about 1 dB.
- After pre-training, fine-tuning is performed using higher resolution videos, ranging from SIF to FullHD.
4. Experimental Results
4.1. BD-Rate
- Sep.: Two separate models, one for uni-directional, one for bi-directional prediction.
- Comb.: One single model for both uni and bi-directional predictions.
- For the LDP configuration, the proposed method achieves the largest bit savings, up to 10.1% and 9.8% with the separate and combined models, respectively.
- On average, 2.3% to 4.8% BD-rate reduction on Y is achieved.
- The encoding time increases by 49% to 63%, while the decoding time increases by 114× to 165×, since whenever the deep frame is in use, the decoder must also run the network to estimate the kernels and convolve them with the patches.
- The BD-rate (%) is also measured using MS-SSIM as the quality metric.
- 2.09% to 4.01% BD-rate reduction is achieved.
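For readers unfamiliar with the metric, BD-rate is the standard Bjøntegaard calculation: fit third-order polynomials to the (quality, log-rate) points of the anchor and test codecs, then average the horizontal gap over the overlapping quality range. A compact NumPy version (my own, not from the paper; for the MS-SSIM table, the quality values are MS-SSIM scores instead of PSNR):

```python
import numpy as np

def bd_rate(rate_anchor, q_anchor, rate_test, q_test):
    """Bjoentegaard delta rate (%): average bitrate difference of the
    test codec vs. the anchor at equal quality (4 RD points each).
    Negative values mean bit savings."""
    la, lt = np.log(rate_anchor), np.log(rate_test)
    pa = np.polyfit(q_anchor, la, 3)      # log-rate as cubic in quality
    pt = np.polyfit(q_test, lt, 3)
    lo = max(min(q_anchor), min(q_test))  # overlapping quality range
    hi = min(max(q_anchor), max(q_test))
    ia = np.polyval(np.polyint(pa), hi) - np.polyval(np.polyint(pa), lo)
    it = np.polyval(np.polyint(pt), hi) - np.polyval(np.polyint(pt), lo)
    return (np.exp((it - ia) / (hi - lo)) - 1) * 100
```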
4.2. Mode Distribution
- As shown above, the DNN mode (i.e. DeepFrame) takes over part of the inter and skip modes, which improves the coding efficiency.
4.3. SOTA Comparison
- Compared with FRUC+DVRF [28], “Sep.” is the best overall, with the highest average bitrate reduction (3.3%) and the best performance in 9 out of 13 sequences in this test.
- FRUC+DVRF [28] is the second best overall, with an average bitrate reduction of 3.2% and top performance in 4 out of 13 sequences. It achieves especially good performance on BQSquare, which significantly boosts its average bit saving.
- “Comb.” comes in third with a slightly lower overall bitrate reduction of 3.1%. However, even the combined DNN provides a better coding gain than FRUC+DVRF [28] in 8 out of 13 sequences.
- Note that this comparison mainly involves bi-directional prediction since the RA configuration is used, and FRUC+DVRF [28] cannot perform uni-directional prediction.
4.4. Visual Comparison
- Especially in the second row, SepConv cannot interpolate the basketball well, while the deep frame generated by the proposed approach synthesizes it without any ghosting artifacts.
- As for HEVC, it is noted that it achieves good quality but needs a higher bitrate.
During the days of coronavirus, the challenge of writing 30/35/40/45 stories for this month has been accomplished. This is the 49th story this month..!! 1 more to go. Can I finish 50 stories this month (i.e. in less than 7 hrs in my timezone)?? Thanks for visiting my story..
Reference
[2019 TCSVT] [DeepFrame]
Deep Frame Prediction for Video Coding
Codec Inter Prediction
HEVC [CNNIF] [Zhang VCIP’17] [NNIP] [Ibrahim ISM’18] [VI-CNN] [FRUC+DVRF] [FRUC+DVRF+VECNN] [RSR] [Zhao ISCAS’18 & TCSVT’19] [Ma ISCAS’19] [ES] [CNN-SR & CNN-UniSR & CNN-BiSR] [DeepFrame]
VVC [FRUC+DVRF+VECNN]