Review: Yu ICIP’15 / Liu ISCAS’16 / Liu TIP’16 — CNN for Fast Intra Coding (Fast HEVC Prediction)
Network Similar to LeNet, Over 60% Time Reduction Compared With the Conventional HEVC Reference Software HM-12.0
In this story, three papers are reviewed together because they come from the same group of authors and describe essentially the same CNN with only small differences:
- VLSI friendly fast CU/PU mode decision for HEVC intra encoding: Leveraging convolution neural network (Yu ICIP’15)
- CNN Oriented Fast HEVC Intra CU Mode Decision (Liu ISCAS’16)
- CU Partition Mode Decision for HEVC Hardwired Intra Encoder Using Convolution Neural Network (Liu TIP’16)
Video coding is a computationally expensive process. In these papers, a convolutional neural network (CNN) is used to speed up HEVC encoding by skipping some of the time-consuming optimization steps. (For those who want to know what video coding is, please feel free to read Sections 1 & 2 in IPCNN.) (There are also VLSI design aspects in the papers, but I will mainly cover the CNN part.)
They are written by authors from Tsinghua University and Huawei, and were published at 2015 ICIP, 2016 ISCAS, and in 2016 TIP, where TIP has a high impact factor of 6.79. (Sik-Ho Tsang @ Medium)
Outline
- Network & Results in Yu ICIP’15
- Network & Results in Liu ISCAS’16
- Network & Results in Liu TIP’16
1. Network & Results in Yu ICIP’15
1.1. Whole Algorithm
- In HEVC, 64×64, 32×32, 16×16 and 8×8 Coding Units (CUs) are encoded using quadtree coding to find which combination of CUs best encodes one 64×64 Coding Tree Unit (CTU).
- In this paper, simple decisions that do not use a CNN determine whether 64×64 and 16×16 CUs are split or not.
- For 64×64 and 16×16 CUs, a coarse edge-strength analysis (which I will not focus on in this story) detects two special cases: the homogeneous block and the block with a strong edge.
- If it is homogeneous (HOMO), there is no further split.
- If it is with strong edge, it is decided to be split (SPLIT) without any trial of intra prediction at the current CU level.
- Otherwise, it is classified as COMB: intra prediction is tried at the current CU level, and the CU is also split and tried, just like in conventional HEVC.
- For 32×32 and 8×8 CUs, CNNs are applied to decide whether to split or not.
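The per-CU decision flow above can be sketched in Python (the function names and the texture classifier are illustrative stand-ins, not the paper's implementation):

```python
def decide_cu(cu, size, classify_texture, cnn_predicts_split):
    """Sketch of the Yu ICIP'15 per-CU split decision (names are illustrative)."""
    if size in (64, 16):
        # Coarse edge-strength analysis, no CNN involved.
        label = classify_texture(cu)      # returns 'HOMO', 'EDGE', or 'COMB'
        if label == 'HOMO':
            return 'no_split'             # homogeneous: no further split
        if label == 'EDGE':
            return 'split'                # strong edge: split without trying intra here
        return 'try_both'                 # COMB: full check, as in conventional HEVC
    # 32x32 and 8x8 CUs: a binary CNN classifier makes the call.
    return 'split' if cnn_predicts_split(cu) else 'no_split'
```

Only the 64×64 and 16×16 levels use the handcrafted analysis; the CNN handles the remaining two sizes.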
1.2. CNN Network Architecture
- Since the CNN architecture is simple, much like LeNet, it is also a good starting point for understanding CNNs.
- The input is an 8×8 pixel block. The input signals are not normalized; the 8×8 CU is fed directly into the input layer. For the larger 32×32 CU, a 4×4 local averaging and subsampling scheme is applied to generate the 8×8 input matrix.
- The first hidden layer is a convolutional layer with 6 feature maps. Each neuron is connected to a 3×3 receptive field in the input. Each feature map is 6×6, so that the convolution does not fall off the boundary (i.e., valid convolution). The kernels in this layer act as feature extractors. There are 60 trainable parameters.
- The second hidden layer performs local maximum subsampling (max pooling). It is composed of six 3×3 feature maps, with 12 trainable parameters.
- The third layer performs the second convolution, which consists of sixteen 1×1 feature maps. The kernel size is 3×3, so the trainable parameter number is 960.
- The last two hidden layers form a fully connected MLP. The fourth layer consists of 10 units, and the fifth layer contains 2 output units. The numbers of trainable parameters in the fourth and fifth layers are 170 and 22, respectively.
- The output layer contains {O2N, ON}. O2N means not split and ON means split.
- Note that samples belonging to the homogeneous or strong-edge types are not used for training.
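As a sanity check on the shapes above, here is a minimal NumPy sketch of the 4×4 local-averaging step for a 32×32 CU, plus the standard parameter-count formulas for the layers (the sixteen-map 1×1 layer is omitted, since its reported 960 parameters depend on the paper's connection scheme; the one-coefficient-plus-one-bias subsampling count is the LeNet convention):

```python
import numpy as np

# 4x4 local averaging: downsample a 32x32 CU to the 8x8 CNN input.
cu32 = np.arange(32 * 32, dtype=np.float64).reshape(32, 32)
inp8 = cu32.reshape(8, 4, 8, 4).mean(axis=(1, 3))  # block-wise 4x4 mean
assert inp8.shape == (8, 8)

# Trainable-parameter counts reported in the paper, reproduced
# with the usual formulas.
conv1 = 6 * (3 * 3 + 1)        # 6 maps, 3x3 kernel + bias  -> 60
subs2 = 6 * 2                  # 6 maps * (coeff + bias)    -> 12
fc4   = 16 * 10 + 10           # 16 inputs -> 10 units      -> 170
fc5   = 10 * 2 + 2             # 10 inputs -> 2 outputs     -> 22
print(conv1, subs2, fc4, fc5)  # 60 12 170 22
```

The shapes also chain correctly: 8×8 input → 6×6 after the valid 3×3 convolution → 3×3 after 2×2 pooling → 1×1 after the second 3×3 convolution.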
1.3. Experimental Results
- A 3.39% increase in BD-rate (BDBR), with 61.1% time reduction (ΔT), is obtained.
- The proposed approach uses only pixels within the CU to speed up the encoder, which is beneficial for VLSI design. It is therefore fairer to compare against similar approaches that can exploit VLSI parallelism. Compared with [12], the proposed approach obtains a similar time reduction but a lower increase in BD-rate (BDBR).
2. Network & Results in Liu ISCAS’16
2.1. Main Differences From Yu ICIP’15
- The network architecture is similar to the one in Yu ICIP’15.
- But the quantization parameter (QP) is added at the fully connected layer, since QP affects the image quality.
- A higher (lower) QP means lower (higher) image quality and bitrate.
- Thus, with a lower QP, more CUs are decided to be split, since higher-quality coding uses more small-sized CUs.
- For 32×32, 16×16 and 8×8 CUs, CNNs can be enabled individually according to user settings.
- Activation functions differ depending on the range of values and the CU size. This part is quite special, since normally the same activation (e.g., softmax, tanh, or sigmoid) is used over the whole range.
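A minimal sketch of feeding QP into the fully connected part: the scalar QP is appended to the flattened feature vector before the hidden layer. The layer sizes follow the ICIP'15 network; the QP normalization and the tanh activation here are my assumptions for illustration, not details from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sixteen 1x1 feature maps flatten into a 16-dim feature vector.
features = rng.standard_normal(16)

qp = 32                                       # HEVC QP ranges over 0..51
x = np.concatenate([features, [qp / 51.0]])   # append normalized QP (assumed scaling)

# The fully connected layer now takes 17 inputs instead of 16.
W = rng.standard_normal((10, 17))
b = np.zeros(10)
hidden = np.tanh(W @ x + b)
assert x.shape == (17,) and hidden.shape == (10,)
```

This way a single trained network can adapt its split decision to the QP in use, instead of training one network per QP.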
2.2. Experimental Results
- [0/1, 0/1, 0/1] denotes whether the CNN is enabled for the 32×32, 16×16 and 8×8 CU sizes, respectively.
- With [1,0,1], a 2.26% increase in BD-rate (BDBR) with 63.0% time reduction (ΔT) is obtained. This result is already better than Yu ICIP'15.
- With [1,1,1], a 4.79% increase in BD-rate (BDBR) with 73.3% time reduction (ΔT) is obtained.
- Compared with VLSI-friendly algorithms such as [7] and [8], the proposed approach outperforms them with a higher time reduction and a lower BD-rate increase (BDBR).
3. Network & Results in Liu TIP’16
3.1. Main Differences From Liu ISCAS’16
- This is the journal extension of the conference paper Liu ISCAS'16. The network is exactly the same.
- But the activation function is changed, while the thresholds stay the same as in Liu ISCAS'16.
- Of course, there are differences in the VLSI design, such as a finite bit-depth to support fixed-point instead of floating-point calculation. But I will not focus on this in this story.
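The paper's exact bit-depths belong to the VLSI design section, but the general idea of replacing floating point with fixed point can be sketched generically (the Q-format and fractional bit-width below are illustrative choices, not the paper's):

```python
# Quantize a weight to signed fixed point with F fractional bits,
# then recover it; the round-trip error is bounded by 2**-(F+1).
F = 8                       # fractional bits (illustrative choice)

def to_fixed(x, frac_bits=F):
    return int(round(x * (1 << frac_bits)))   # store as an integer

def to_float(q, frac_bits=F):
    return q / (1 << frac_bits)

w = 0.37109                 # an example floating-point weight
q = to_fixed(w)             # integer usable by hardwired arithmetic
err = abs(to_float(q) - w)
assert err <= 2 ** -(F + 1)
```

All multiply-accumulates can then run on integers, which is what makes the network hardwired-encoder friendly.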
3.2. Experimental Results
- The configuration [1,0,1] achieves a good balance between coding efficiency and computational reduction: the BDBR increase is +2.67%, while 61.1% of the encoding complexity is saved.
- Compared with [20], [26] and [32], which are VLSI-friendly approaches, the proposed approach obtains the same or a larger time reduction with a lower increase in BD-rate (BDBR).
References
[2015 ICIP] [Yu ICIP’15]
VLSI friendly fast CU/PU mode decision for HEVC intra encoding: Leveraging convolution neural network
[2016 ISCAS] [Liu ISCAS’16]
CNN Oriented Fast HEVC Intra CU Mode Decision
[2016 TIP] [Liu TIP’16]
CU Partition Mode Decision for HEVC Hardwired Intra Encoder Using Convolution Neural Network