# Review — DLT: Deep Learning-Based Nonlinear Transform for HEVC Intra Coding (HEVC Intra)

## Deep Learning-based Transform (DLT) Using Autoencoder, 0.75% BD-Rate Reduction is Achieved


In this story, **Deep Learning-Based Nonlinear Transform for HEVC Intra Coding**, (DLT), is reviewed. In this paper:

- A convolutional neural network (CNN) model is designed as the **Deep Learning-Based Transform (DLT)** to achieve **better decorrelation and energy compaction** than the conventional discrete cosine transform (DCT).
- **The intra prediction signal** is utilized as **side information** to reduce the directionality in the residual.
- **A novel loss function** is used to characterize the efficiency of the transform during training.

This is a paper in **2020 VCIP**. (Sik-Ho Tsang @ Medium)

# Outline

1. **Directionality in Residual Domain**
2. **DLT: Network Architecture**
3. **Loss Function**
4. **Experimental Results**

# 1. Directionality in Residual Domain

- When directional intra prediction is applied to predict the block, there is directional information in the prediction block.
- After the difference between the original block and the prediction block is obtained, as shown above (right), and is DCT transformed and quantized, **there is still directional information in the residual domain.**
- That means the DCT cannot decorrelate the block well enough to compact the energy.
- Thus, a CNN is designed to solve this problem, as shown in the first figure at the top of this story.
- It is noted that the **Deep Learning-based Transform (DLT)** is only applied to 8×8 blocks.
- The flag *DLT_Flag* is transmitted to the decoder for each 8×8 transform unit (TU) to indicate whether the DLT is used or not.
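To make the energy-compaction point concrete, here is a minimal sketch in plain Python. It applies an orthonormal 8×8 DCT-II to two toy residual blocks, one with a horizontal edge and one with a diagonal edge, and counts the non-negligible coefficients. The two blocks, the threshold `eps`, and all function names are illustrative assumptions and are not taken from the paper.

```python
import math

N = 8  # block size (the DLT is applied to 8x8 blocks)

def dct_matrix(n=N):
    """Orthonormal DCT-II matrix C, so that Y = C * X * C^T."""
    rows = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def dct2(block):
    """Separable 2-D DCT of a square block."""
    c = dct_matrix(len(block))
    return matmul(matmul(c, block), transpose(c))

def count_significant(coeffs, eps=1e-9):
    """Number of coefficients whose magnitude exceeds eps."""
    return sum(1 for row in coeffs for v in row if abs(v) > eps)

# Toy residuals (hypothetical): -1 above the edge, +1 below it.
horizontal = [[1.0 if i >= N // 2 else -1.0 for _ in range(N)] for i in range(N)]
diagonal = [[float((i > j) - (i < j)) for j in range(N)] for i in range(N)]

n_horizontal = count_significant(dct2(horizontal))
n_diagonal = count_significant(dct2(diagonal))
print(n_horizontal, n_diagonal)
```

The horizontal edge lines up with the separable DCT basis and needs only a handful of coefficients, while the diagonal edge spreads its energy over many more, which is the kind of leftover directionality the DLT targets.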

# 2. DLT: Network Architecture

## 2.1. Directional Module (Top)

To capture the directionality, an additional branch which utilizes the coding information is used.

- **3 convolutional layers with stride 1 and kernel size 3×3 are used**. The numbers of output channels are 128, 64, and 1, respectively.
- All of these layers use the tanh activation function.
- Before it is fed to the neural network, the mean is subtracted from the prediction.
- Finally, the directional information, *Idir*, is extracted.
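As a rough size check of this branch, the sketch below counts the learnable parameters of the three 3×3 layers and shows the mean-removal step. The single-channel (luma) input and the presence of bias terms are assumptions, as are the helper names `conv_params` and `remove_mean`; only the output channel counts 128, 64, 1 come from the text above.

```python
def conv_params(in_ch, out_ch, kernel=3, bias=True):
    """Learnable parameters of one 2-D convolutional layer."""
    return in_ch * out_ch * kernel * kernel + (out_ch if bias else 0)

def remove_mean(block):
    """Subtract the block mean from every sample of the prediction block."""
    flat = [v for row in block for v in row]
    mean = sum(flat) / len(flat)
    return [[v - mean for v in row] for row in block]

# Assumption: one input channel (the luma prediction block) and biased layers.
channels = [1, 128, 64, 1]
per_layer = [conv_params(c_in, c_out)
             for c_in, c_out in zip(channels, channels[1:])]
total = sum(per_layer)

centered = remove_mean([[100.0, 102.0], [104.0, 106.0]])
print(per_layer, total, centered)
```

Under these assumptions the branch is small, with almost all of its weights in the 128-to-64 layer.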

## 2.2. Transform Module (Bottom)

- An autoencoder architecture is used as the transform.
- The encoder and decoder are both composed of **1 convolutional layer** and **initialized using the DCT basis functions.**

## 2.2.1. Encoder for Forward Transform

- The **encoder** of the autoencoder is adopted to perform the **forward transform.**
- To eliminate the directionality in the residuals, the difference between the residual *Xres* and the extracted directional information *Idir* is used as the input of the encoder:

$$Z = \mathrm{enc}(X_{res} - I_{dir})$$

- where *enc* indicates the encoder in the autoencoder, and *Z* represents the **transform coefficients**.
- The number of the transform coefficients equals the size of the input *Xres*.

## 2.2.2. Decoder for Inverse Transform

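- For the inverse transform, the decoder of the autoencoder takes the **transform coefficients** *Z* as the **input.**
- To recover the directionality of the residual, the extracted directional information *Idir* is added back to the output of the decoder.
- **Furthermore, to compact the signal energy into a few coefficients, only *K* transform coefficients in *Z* are used to perform the inverse transform.** In the implementation, *K* = 8:

$$\hat{X}_{res} = \mathrm{dec}(Z_K) + I_{dir}$$

- where *dec* indicates the decoder in the autoencoder, *Z_K* denotes *Z* with only *K* coefficients kept, and *X̂res* is the reconstructed residual.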
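The forward and inverse steps can be sketched end to end in plain Python. This is only an illustration under stated assumptions: the learned convolutional encoder and decoder are replaced by the fixed 8×8 DCT basis they are initialized from, the *K* kept coefficients are taken to be the *K* largest in magnitude (the selection rule is not given here), and every function name is hypothetical.

```python
import math

N, K = 8, 8  # 8x8 blocks and K = 8 kept coefficients, as stated above

def dct_basis(n=N):
    """The n*n separable 2-D DCT-II basis functions, each an n x n kernel."""
    def c(k, i):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        return scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
    return [[[c(u, i) * c(v, j) for j in range(n)] for i in range(n)]
            for u in range(n) for v in range(n)]

BASIS = dct_basis()

def enc(block):
    """Stand-in encoder: one inner product per basis kernel (64 coefficients)."""
    return [sum(b[i][j] * block[i][j] for i in range(N) for j in range(N))
            for b in BASIS]

def dec(coeffs):
    """Stand-in decoder: weighted sum of the same basis kernels."""
    return [[sum(z * b[i][j] for z, b in zip(coeffs, BASIS)) for j in range(N)]
            for i in range(N)]

def forward(x_res, i_dir):
    """Z = enc(Xres - Idir)."""
    diff = [[x - d for x, d in zip(rx, rd)] for rx, rd in zip(x_res, i_dir)]
    return enc(diff)

def inverse(z, i_dir, k=K):
    """Reconstruction = dec(Z_K) + Idir."""
    # Assumption: keep the k largest-magnitude coefficients, zero the rest.
    keep = set(sorted(range(len(z)), key=lambda idx: -abs(z[idx]))[:k])
    z_k = [v if idx in keep else 0.0 for idx, v in enumerate(z)]
    out = dec(z_k)
    return [[o + d for o, d in zip(ro, rd)] for ro, rd in zip(out, i_dir)]

def sq_err(a, b):
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))

# Toy data (hypothetical): a diagonal edge plus a flat offset of 0.5.
zeros = [[0.0] * N for _ in range(N)]
edge = [[float((i > j) - (i < j)) for j in range(N)] for i in range(N)]
x_res = [[e + 0.5 for e in row] for row in edge]

err_with = sq_err(x_res, inverse(forward(x_res, edge), edge))
err_without = sq_err(x_res, inverse(forward(x_res, zeros), zeros))
print(err_with, err_without)
```

In this idealized case the side information matches the edge exactly, so what reaches the transform is a flat block that 8 coefficients reconstruct almost perfectly, whereas transforming the raw residual with the same budget leaves a visible error.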

# 3. Loss Function

## 3.1. L2 Loss

- A pixel-wise L2 loss between the residual *Xres* and its reconstruction is used.

## 3.2. Energy Compact Loss

- **The transform coding gain** is widely adopted to measure the amount of energy compaction achieved by the transform.
- It is defined as **the ratio of the arithmetic mean to the geometric mean of the variances of the transform coefficients**:

$$G = \frac{\frac{1}{N}\sum_{i=1}^{N}\sigma_i^2}{\left(\prod_{i=1}^{N}\sigma_i^2\right)^{1/N}}$$

- where *σi*² denotes the variance of the *i*-th coefficient in *Z*, and *N* is the number of coefficients.
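The coding gain is easy to compute from per-coefficient variances. The sketch below is a plain-Python illustration of the definition above; the helper names and the toy variance values are assumptions, and it only evaluates the gain itself, not the exact way the paper turns it into a loss term.

```python
import math

def coefficient_variances(batch):
    """Per-coefficient (population) variance over a batch of coefficient vectors."""
    n = len(batch)
    variances = []
    for values in zip(*batch):
        mean = sum(values) / n
        variances.append(sum((v - mean) ** 2 for v in values) / n)
    return variances

def coding_gain(variances):
    """Arithmetic mean divided by geometric mean; all variances must be > 0."""
    n = len(variances)
    arithmetic = sum(variances) / n
    geometric = math.exp(sum(math.log(v) for v in variances) / n)
    return arithmetic / geometric

# Toy variances (hypothetical) with the same total energy.
flat_gain = coding_gain([1.0, 1.0, 1.0, 1.0])     # energy spread evenly
compact_gain = coding_gain([3.7, 0.1, 0.1, 0.1])  # energy concentrated in one coefficient
print(flat_gain, compact_gain)
```

A gain of 1 means no compaction at all, and the gain grows as the energy concentrates in fewer coefficients.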

## 3.3. Total Loss

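- The total loss *L* is the weighted sum of the L2 loss and the energy compact loss:

$$L = \alpha L_{2} + \beta L_{EC}$$

- where *L2* and *LEC* denote the L2 loss and the energy compact loss, and *α* and *β* are weights reflecting the influence of the different losses. In the implementation, the authors set *α* = 1.0, *β* = 0.2.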
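A minimal sketch of the weighted sum, in plain Python. The averaging inside `l2_loss` is an assumption (a sum over pixels would be equally plausible), the energy compact term is passed in as a plain number because its exact form is not given here, and the toy values are hypothetical.

```python
ALPHA, BETA = 1.0, 0.2  # weights reported above

def l2_loss(x_res, x_rec):
    """Mean squared error over all pixels (the averaging is an assumption)."""
    diffs = [(a - b) ** 2 for ra, rb in zip(x_res, x_rec) for a, b in zip(ra, rb)]
    return sum(diffs) / len(diffs)

def total_loss(l2, energy_compact, alpha=ALPHA, beta=BETA):
    """Weighted sum of the two loss terms."""
    return alpha * l2 + beta * energy_compact

# Toy 2x2 "blocks" and a placeholder energy compact value (both hypothetical).
x_res = [[0.5, -0.5], [0.25, 0.0]]
x_rec = [[0.5, 0.0], [0.25, 0.0]]
l2 = l2_loss(x_res, x_rec)
loss = total_loss(l2, energy_compact=0.5)
print(l2, loss)
```

With *β* = 0.2 the reconstruction error dominates, and the energy compact term acts as a regularizer on how the coefficients are distributed.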

# 4. Experimental Results

## 4.1. Training

- The **Uncompressed Color Image Database (UCID)** is used as the training set.
- **HEVC (HM-16.20)** is used to compress the images in UCID under the default configuration, and all the 4×4 TUs are marked.
- Then, the minimum transform size is set to 8 and the images are compressed again.
- At the same time, the 8×8 TUs whose four 4×4 TUs were all marked in the first step are extracted. In this way, the complex 8×8 TUs, where the DCT performs poorly, can be extracted.
- Only luma is considered, and all the values are normalized to the range [-1, 1] by dividing by 255.
- Only one model is trained for four QPs {22, 27, 32, 37}.
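The block selection described above can be sketched as a small filter. This is an illustration only: it assumes the first encoding pass is summarized as a set of top-left pixel positions of the marked 4×4 TUs and that 8×8 TUs sit on an 8-pixel grid; `select_training_tus` and `normalize` are hypothetical helpers, not HM functions.

```python
def select_training_tus(marked_4x4, width, height):
    """Top-left corners of 8x8 blocks whose four 4x4 quadrants are all marked.

    marked_4x4: set of (x, y) top-left positions of 4x4 TUs from the first pass.
    """
    selected = []
    for y in range(0, height - 7, 8):
        for x in range(0, width - 7, 8):
            quadrants = [(x, y), (x + 4, y), (x, y + 4), (x + 4, y + 4)]
            if all(q in marked_4x4 for q in quadrants):
                selected.append((x, y))
    return selected

def normalize(values):
    """Map values in [-255, 255] to [-1, 1] by dividing by 255."""
    return [v / 255.0 for v in values]

# Toy 16x8 picture: the left 8x8 block is fully covered by marked 4x4 TUs,
# the right one only has its top two quadrants marked.
marked = {(0, 0), (4, 0), (0, 4), (4, 4), (8, 0), (12, 0)}
tus = select_training_tus(marked, width=16, height=8)
scaled = normalize([-255, 0, 255])
print(tus, scaled)
```

Only the fully covered block is kept, which mirrors the idea that an 8×8 area the encoder preferred to split into 4×4 TUs is one where the 8×8 DCT did poorly.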

## 4.2. BD-Rate

- The proposed DLT brings **0.75%, 0.1%, and 0.2% BD-rate reductions for the Y, U, and V components, respectively.**
- **Higher performance can be achieved for test sequences with abundant textures, e.g., ParkScene and RaceHorses.**

## 4.3. Usage

- The above figure gives further insights into TUs using the proposed DLT.
- It illustrates that **the proposed DLT can capture the non-stationarity in the textured areas of natural video**.

## Reference

[2020 VCIP] [DLT]

Deep Learning-Based Nonlinear Transform for HEVC Intra Coding

## Codec Intra Prediction

**JPEG** [MS-ROI] [Baig JVICU’17]

**JPEG-HDR** [Han VCIP’20]

**HEVC** [Xu VCIP’17] [Song VCIP’17] [Li VCIP’17] [Puri EUSIPCO’17] [IPCNN] [IPFCN] [HybridNN, Li ICIP’18] [Liu MMM’18] [CNNAC] [Li TCSVT’18] [Spatial RNN] [PS-RNN] [AP-CNN] [MIP] [Wang VCIP’19] [IntraNN] [CNNAC TCSVT’19] [CNN-CR] [CNNMC Yokoyama ICCE’20] [PNNS] [CNNCP] [Zhu TMM’20] [Sun VCIP’20] [DLT] [Zhong ELECGJ’21]

**VVC** [CNNIF & CNNMC] [Brand PCS’19] [Bonnineau ICASSP’20] [Santamaria ICMEW’20] [Zhu TMM’20]