Review — DLT: Deep Learning-Based Nonlinear Transform for HEVC Intra Coding (HEVC Intra)

Deep Learning-Based Transform (DLT) Using an Autoencoder: 0.75% BD-Rate Reduction Is Achieved

Sik-Ho Tsang
5 min read · May 9, 2021
Overview of the Proposed Deep Learning-Based Transform (DLT)

In this story, Deep Learning-Based Nonlinear Transform for HEVC Intra Coding (DLT) is reviewed. In this paper:

  • A convolutional neural network (CNN) model is designed as Deep Learning-Based Transform (DLT) to achieve better decorrelation and energy compaction than the conventional discrete cosine transform (DCT).
  • The intra prediction signal is utilized as side information to reduce the directionality in the residual.
  • A novel loss function is used to characterize the efficiency of the transform during the training.

This is a paper in 2020 VCIP. (Sik-Ho Tsang @ Medium)

Outline

  1. Directionality in Residual Domain
  2. DLT: Network Architecture
  3. Loss Function
  4. Experimental Results

1. Directionality in Residual Domain

The directional information in the prediction block and the residual block after directional intra prediction.
  • When directional intra prediction is applied to predict the block, there is directional information in the prediction block.
  • After taking the difference between the original block and the prediction block, as shown above (right), and applying the DCT and quantization, there is still directional information in the residual domain.
  • That means DCT cannot decorrelate the block well to compact the energy.
  • Thus, CNN is designed to solve this problem, as shown in the first figure at the top of this story.
  • It is noted that the Deep Learning-Based Transform (DLT) is only applied to 8×8 blocks.
  • The flag DLT_Flag is transmitted to the decoder for each 8×8 transform unit (TU) to inform whether the DLT is used or not.
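
As a minimal sketch of what this per-TU signalling implies (the rate-distortion-based selection criterion and the function name are assumptions, not details given in the paper):

```python
def dlt_flag_for_tu(cost_dct: float, cost_dlt: float) -> int:
    """Hypothetical encoder-side decision for one 8x8 TU: pick the transform
    with the lower rate-distortion cost and signal it with DLT_Flag
    (1 = use DLT, 0 = use the conventional DCT)."""
    return int(cost_dlt < cost_dct)
```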

2. DLT: Network Architecture

Top: Directional Module; Bottom: Transform Module

2.1. Directional Module (Top)

To capture the directionality, an additional branch which utilizes the coding information is used.

  • 3 convolutional layers with stride 1 and kernel size 3×3 are used. The numbers of output channels are 128, 64, and 1, respectively.
  • All of these layers use the tanh activation function.
  • Before feeding to the neural network, the mean is subtracted from the prediction.
  • Finally, the directional information Idir is extracted.
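
As a rough illustration, here is a minimal PyTorch sketch of the directional module described above (the framework choice, the padding used to keep the 8×8 spatial size, and all variable names are assumptions):

```python
import torch
import torch.nn as nn

class DirectionalModule(nn.Module):
    """Three 3x3 convolutions (stride 1) with 128, 64 and 1 output channels,
    all followed by tanh, applied to the mean-subtracted intra prediction."""
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(1, 128, kernel_size=3, stride=1, padding=1), nn.Tanh(),
            nn.Conv2d(128, 64, kernel_size=3, stride=1, padding=1), nn.Tanh(),
            nn.Conv2d(64, 1, kernel_size=3, stride=1, padding=1), nn.Tanh(),
        )

    def forward(self, prediction):                                   # prediction: (N, 1, 8, 8)
        x = prediction - prediction.mean(dim=(2, 3), keepdim=True)   # subtract the mean
        return self.layers(x)                                        # directional information Idir
```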

2.2. Transform Module (Bottom)

  • An autoencoder architecture is used as the transform.
  • The encoder and the decoder each consist of a single convolutional layer and are initialized using the DCT basis functions.

2.2.1. Encoder for Forward Transform

  • The encoder of the Autoencoder is adopted to perform the forward transform.
  • To eliminate the directionality in the residuals, the difference between the residual Xres and the extracted directional information Idir is used as the input of the encoder: Z = enc(Xres − Idir), where enc indicates the encoder in the autoencoder and Z represents the transform coefficients.
  • The number of the transform coefficients equals the size of input Xres.

2.2.2. Decoder for Inverse Transform

  • For the inverse transform, the decoder of the Autoencoder takes the transform coefficients Z as the input.
  • To recover the directionality of the residual, the extracted directional information Idir is added back to the output of the decoder.
  • Furthermore, to compact the signal energy into a few coefficients, only K transform coefficients in Z are used to perform the inverse transform. In the implementation, K = 8.
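
A minimal PyTorch sketch of the transform module under the same assumptions. The single-convolution encoder maps the 8×8 block to 64 coefficients and the decoder maps them back; the DCT-basis initialization mentioned above is omitted for brevity, and keeping the K = 8 coefficients with the largest magnitude is an assumption (the selection rule is not spelled out here):

```python
import torch
import torch.nn as nn

class TransformModule(nn.Module):
    def __init__(self, block: int = 8, k: int = 8):
        super().__init__()
        self.k = k
        # Forward transform: one 8x8 convolution turns the 8x8 residual block
        # into 64 transform coefficients (same count as the input samples).
        self.encoder = nn.Conv2d(1, block * block, kernel_size=block)
        # Inverse transform: one transposed convolution maps them back to 8x8.
        self.decoder = nn.ConvTranspose2d(block * block, 1, kernel_size=block)

    def forward(self, x_res, i_dir):
        z = self.encoder(x_res - i_dir)                # Z = enc(Xres - Idir), shape (N, 64, 1, 1)
        z_flat = z.flatten(1)                          # (N, 64)
        topk = z_flat.abs().topk(self.k, dim=1).indices
        mask = torch.zeros_like(z_flat).scatter_(1, topk, 1.0)        # keep only K coefficients
        x_hat = self.decoder((z_flat * mask).view_as(z)) + i_dir      # add Idir back
        return x_hat, z
```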

3. Loss Function

3.1. L2 Loss

  • A pixel-wise L2 loss is used:
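
A minimal reconstruction of this term, assuming the loss is taken between the original residual Xres and the reconstructed residual X̂res over the N pixels of the block:

```latex
\mathcal{L}_{2} = \frac{1}{N}\sum_{i=1}^{N}\left(X_{\mathrm{res},i} - \hat{X}_{\mathrm{res},i}\right)^{2}
```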

3.2. Energy Compact Loss

  • The transform coding gain is widely adopted to measure the amount of energy compaction achieved by the transform.
  • It is defined as the ratio of the arithmetic mean to the geometric mean of the variances of the transform coefficients, where σi² denotes the variance of the i-th coefficient in Z; it is written out below.
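
Written out from this definition (denoted GTC here), with N transform coefficients:

```latex
G_{TC} = \frac{\frac{1}{N}\sum_{i=1}^{N}\sigma_{i}^{2}}{\left(\prod_{i=1}^{N}\sigma_{i}^{2}\right)^{1/N}}
```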

3.3. Total Loss

  • The total loss L is the weighted sum of the L2 loss and the energy compact loss (see the sketch below), where α and β are weights reflecting the influence of the different losses. In the implementation, the authors set α = 1.0 and β = 0.2.
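
A minimal PyTorch sketch of how the combined loss might be computed. The exact form of the energy compaction term is not given above, so the version below minimizes the negative log coding gain of the coefficients, which is an assumption; all names are illustrative:

```python
import torch

def dlt_loss(x_res, x_hat, z, alpha=1.0, beta=0.2, eps=1e-9):
    l2 = torch.mean((x_res - x_hat) ** 2)                      # pixel-wise L2 loss
    var = z.flatten(1).var(dim=0) + eps                        # per-coefficient variance over the batch
    log_gain = torch.log(var.mean()) - torch.log(var).mean()   # log of the coding gain
    return alpha * l2 + beta * (-log_gain)                     # assumed: maximize the coding gain
```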

4. Experimental Results

4.1. Training

  • The Uncompressed Color Image Database (UCID) is used as the training set.
  • HEVC (HM-16.20) is used to compress the images in UCID under the default configuration, and all the 4×4 TUs are marked.
  • Then, the minimum transform size is set to 8 and the images are compressed again.
  • At the same time, the 8×8 TUs whose 4×4 sub-TUs were all marked in the first step are extracted. In this way, the complex 8×8 TUs where the DCT has poor performance can be collected.
  • Only luma is considered, and all the values are normalized to the range [-1, 1] by dividing by 255.
  • Only one model is trained for four QPs {22, 27, 32, 37}.
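
A small sketch of the sample-selection and normalization steps, assuming the training samples are 8×8 luma residual blocks with values in [−255, 255] (function names are illustrative):

```python
import numpy as np

def keep_for_training(marked_4x4_subtus) -> bool:
    # An 8x8 TU is kept only if all four of its 4x4 sub-TUs were marked
    # in the first (default-configuration) encoding pass.
    return all(marked_4x4_subtus)

def normalize_block(block_8x8: np.ndarray) -> np.ndarray:
    # Scale an 8x8 luma residual block (values in [-255, 255]) to [-1, 1].
    return block_8x8.astype(np.float32) / 255.0
```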

4.2. BD-Rate

BD-Rate (%) on CTC Test Sequences
  • The proposed DLT brings 0.75%, 0.1%, and 0.2% BD-rate reductions for the Y, U, and V components, respectively.
  • Higher gains are achieved for test sequences with abundant textures, e.g., ParkScene and RaceHorses.

4.3. Usage

The red blocks represent the 8×8 TUs that use the proposed DLT instead of the DCT
  • The above figure gives further insights into TUs using the proposed DLT.
  • It illustrates that the proposed DLT can capture the non-stationarity in the textured areas of natural video.

--

Sik-Ho Tsang

PhD, Researcher. I share what I learn. :) Linktree: https://linktr.ee/shtsang for Twitter, LinkedIn, etc.