Reading: Zhao ISCAS’18 & TCSVT’19 — Enhanced Bi-prediction with Convolutional Neural Network (HEVC Inter Prediction)
VDSR-Like Network, 3.0% Average BD-Rate Reduction Compared to Conventional HEVC
In this story, Enhanced Bi-prediction with Convolutional Neural Network for High Efficiency Video Coding (Zhao TCSVT’19), by Peking University, City University of Hong Kong, University of Southern California, and Capital Normal University, is reviewed. I read this because I work on video coding research.
In conventional HEVC, a bi-predictive block is the average of 2 blocks: one block from one reference frame and one block from another reference frame. Therefore, there are 2 assumptions here:
- The motion is linear between 2 frames.
- And only translational motion is considered.
In this paper, a CNN is introduced to tackle the above problems. It was first published at 2018 ISCAS, and with more details at 2019 TCSVT, where TCSVT has a high impact factor of 4.046. Here, I mainly cover the 2019 TCSVT version. (Sik-Ho Tsang @ Medium)
Outline
- Bi-Prediction in Video Coding
- Proposed Network Architecture
- HEVC Implementation
- Experimental Results
1. Bi-Prediction in Video Coding
- In conventional video coding, including HEVC, bi-prediction works as follows: for each current block (in Frame 2), two matching blocks need to be found, one from one reference frame (Frame 1) and another from another reference frame (Frame 3). Both matching blocks are found by minimizing the sum of absolute differences (SAD) with the current block.
- Then, the matching blocks in Frame 1 and Frame 3 are simply averaged to form a bi-directional motion-compensated block.
- The difference between this bi-directional motion-compensated block (averaged from the blocks in Frame 1 and Frame 3) and the current block (in Frame 2), which is also called the prediction error, residual signal, or residue, is encoded into the bitstream.
- However, the simple averaging assumes that the motion is linear across frames.
- And the SAD calculation is pixel-to-pixel, which assumes there is only translational motion between frames.
- To tackle the above problems, a CNN is used such that patch-to-patch calculation is considered instead of pixel-to-pixel.
- Also, with the weights learned by the CNN, the fusion is no longer a simple average.
- Another illustrative figure for comparison:
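The SAD matching and simple averaging described above can be sketched in plain Python. This is a minimal illustration with hypothetical 2×2 blocks, not HEVC's actual integer-pel/sub-pel pipeline:

```python
# Conventional HEVC-style bi-prediction: two matching blocks are found by
# minimizing SAD against the current block, then averaged pixel-by-pixel.
# Blocks are represented as lists of rows of pixel values.

def sad(block_a, block_b):
    """Sum of absolute differences between two equal-sized blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))

def bi_average(pred0, pred1):
    """Simple averaging of the two prediction blocks (integer rounding)."""
    return [[(p0 + p1 + 1) // 2 for p0, p1 in zip(r0, r1)]
            for r0, r1 in zip(pred0, pred1)]

# Example: two hypothetical 2x2 prediction blocks
pred0 = [[100, 102], [104, 106]]
pred1 = [[110, 108], [106, 104]]
print(bi_average(pred0, pred1))  # [[105, 105], [105, 105]]
print(sad(pred0, pred1))         # 10 + 6 + 2 + 2 = 20
```

The paper's point is that this pixel-wise average is exactly what the CNN fusion replaces.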
2. Proposed Network Architecture
- A six-layer CNN is proposed. It is similar to VDSR, but with 6 layers instead of 20.
- The input of the network is a two-channel tensor, where each channel is one of the two bi-prediction blocks. Thus, the network provides patch-based non-linear fusion.
- Each convolution layer consists of 64 convolution kernels with a spatial size of 3×3. With six such layers stacked, the receptive field is 13×13.
- ReLU is used after every layer except the last.
- A skip connection, as originated in ResNet, is used.
- MSE is used as the loss function.
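As a sanity check on the numbers above, the receptive field of stacked stride-1 3×3 convolutions and the MSE loss can be sketched in a few lines (my own arithmetic, not the paper's code):

```python
# Receptive-field arithmetic for stacked stride-1 convolutions:
# each layer widens the field by (kernel_size - 1) pixels.
def receptive_field(num_layers, kernel_size=3):
    return 1 + num_layers * (kernel_size - 1)

# MSE loss over flattened pixel values, as used for training the network.
def mse(pred, target):
    assert len(pred) == len(target)
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

print(receptive_field(6))           # 13 -> matches the 13x13 stated above
print(mse([0.5, 0.5], [0.0, 1.0]))  # 0.25
```

The same formula also explains the later ablation on filter sizes: six 5×5 layers would give a 25×25 field, larger than needed for the motion considered here.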
3. HEVC Implementation
3.1. CNN-Based Motion Compensation (MC)
- First, the first matching block, Prediction Block 0, is found from one reference frame, as in the conventional HEVC.
- This “Prediction Block 0” is pointed by a motion vector MV0.
- Then, the second matching block, Prediction Block 1, is found from another reference frame, as in the conventional HEVC.
- This “Prediction Block 1” is pointed by a motion vector MV1.
- If both MV0 and MV1 exist, the matching blocks/patches pointed to by MV0 and MV1 are input into the CNN.
- Thus, the CNN is not used while finding the matching blocks, which relieves some of the complexity burden.
- That is, the CNN is not used during motion estimation (ME), but only during bi-directional motion compensation (MC).
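The placement described above, conventional ME with the CNN only replacing the averaging step of bi-directional MC, can be sketched as follows; `cnn_fuse` is a hypothetical stand-in for the trained network:

```python
# Sketch of where the CNN sits in the pipeline (assumption: motion
# estimation is unchanged; only the MC averaging step is replaced).

def bi_directional_mc(mv0, mv1, pred0, pred1, cnn_fuse=None):
    if mv0 is not None and mv1 is not None and cnn_fuse is not None:
        # CNN-based fusion: the two prediction blocks form the
        # two-channel input of the network.
        return cnn_fuse(pred0, pred1)
    # Fallback: conventional simple averaging (integer rounding).
    return [[(a + b + 1) // 2 for a, b in zip(r0, r1)]
            for r0, r1 in zip(pred0, pred1)]

# With no network attached, this reduces to conventional bi-prediction:
print(bi_directional_mc((0, 1), (0, -1), [[100, 120]], [[110, 100]]))
# -> [[105, 110]]
```

Keeping ME untouched means the expensive network runs once per bi-predicted block rather than once per candidate position in the search.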
3.2. Some Training Details
- It is found that 64×64, 32×32 and 16×16 Coding Units (CUs) occupy 75% of the whole bi-predicted area. Thus, the CNN is only applied to 64×64, 32×32 and 16×16 CUs.
- Different models are trained for different QPs.
- Data augmentation is performed by down-sampling based on bicubic interpolation and anti-aliasing with a Gaussian filter.
- Also, the frame is shifted by a step smaller than the CTU size to generate more samples, since video coding only works on non-overlapping block partitions.
- HM-16.15 is used.
- The values in the bi-directional prediction blocks are normalized to the unit interval [0, 1] as the input of the network.
- The sequences for training include CampfireParty, Fountains, Marathon, Runners, RushHour, TallBuildings, TrafficFlow and Wood.
- Finally, over 1M samples are generated.
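Two of the data-preparation details above, normalization to [0, 1] and shifted (overlapping) patch extraction, can be sketched like this; the patch and step sizes here are illustrative, not the paper's exact values:

```python
# Normalize pixel values of a block to [0, 1] (assumes 8-bit content
# by default; HEVC also supports higher bit depths).
def normalize(block, bit_depth=8):
    scale = (1 << bit_depth) - 1  # 255 for 8-bit
    return [[p / scale for p in row] for row in block]

# Slide a patch x patch window with a stride smaller than the block size,
# producing overlapping samples the codec itself would never emit.
def extract_patches(frame, patch=4, step=2):
    h, w = len(frame), len(frame[0])
    patches = []
    for y in range(0, h - patch + 1, step):
        for x in range(0, w - patch + 1, step):
            patches.append([row[x:x + patch] for row in frame[y:y + patch]])
    return patches

frame = [[y * 8 + x for x in range(8)] for y in range(8)]
print(len(extract_patches(frame)))  # 9 overlapping 4x4 patches from 8x8
```

Shifting the grid this way is what lets the authors reach over 1M samples from only eight training sequences.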
4. Experimental Results
4.1. Ablation Study
4.1.1. Skip Connection
- With Skip Connection, PSNR is higher.
4.1.2. Filter Numbers
- Better quality can be achieved by increasing the number of filters.
- However, the exponentially increasing number of parameters leads to significant computation time and memory consumption.
- n = 64 is chosen as the trade-off between accuracy and complexity.
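A back-of-envelope count shows why the filter number n dominates cost: each 3×3 layer with n input and n output channels carries n·n·3·3 weights, so doubling n quadruples the parameters. This is my own arithmetic (biases ignored), not a table from the paper:

```python
# Weight count of one conv layer: in_channels * out_channels * k * k.
def conv_params(n_in, n_out, k=3):
    return n_in * n_out * k * k

for n in (32, 64, 128):
    print(n, conv_params(n, n))
# 32  -> 9216
# 64  -> 36864
# 128 -> 147456  (doubling n quadruples the weights)
```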
4.1.3. Filter Sizes
- Filter size 3×3 stacked in six layers provides a large enough receptive field to describe the motion, achieving the same performance as filter size 5×5.
- With more trainable parameters, the CNN becomes harder to optimize, which causes performance degradation for filter size 7×7.
4.1.4. Number of Layers
- Moderately increasing the depth of the proposed model improves performance, but conversely slows convergence.
- 6 layers are chosen to balance the tradeoff.
4.2. Rate-Distortion Performance
- RA and LDB are configurations that encode B-frames, so bi-prediction is performed during the encoding process.
- An average of 3.0% and 1.6% bitrate savings on the luma component is achieved for the RA and LDB configurations, respectively.
- Significant bitrate savings are achieved for the sequence BQSquare, since large prediction errors are generated there by conventional HEVC.
- With the CNN, more large CUs are encoded, which also contributes to the bitrate saving.
4.3. Visual Quality
4.4. Time Complexity
- However, the encoding and decoding time increases are 4470.0% and 164.9% on average, respectively, for RA, measured on an i7–8700K CPU and a GTX-1080 GPU. (But I don’t know if it is in CPU mode or GPU mode.)
During the days of coronavirus, let me take on the challenge of writing 30 stories again this month. Is it good? This is the 10th story this month. Thanks for visiting my story.
References
[2018 ISCAS] [Zhao ISCAS’18]
CNN-Based Bi-Directional Motion Compensation for High Efficiency Video Coding
[2019 TCSVT] [Zhao TCSVT’19]
Enhanced Bi-prediction with Convolutional Neural Network for High Efficiency Video Coding
Codec Inter Prediction
HEVC [Zhang VCIP’17] [NNIP] [Ibrahim ISM’18] [Zhao ISCAS’18 & TCSVT’19]