Review — DLT: Deep Learning-Based Nonlinear Transform for HEVC Intra Coding (HEVC Intra)
Deep Learning-based Transform (DLT) Using an Autoencoder, 0.75% BD-Rate Reduction is Achieved
In this story, Deep Learning-Based Nonlinear Transform for HEVC Intra Coding (DLT) is reviewed. In this paper:
- A convolutional neural network (CNN) model is designed as Deep Learning-Based Transform (DLT) to achieve better decorrelation and energy compaction than the conventional discrete cosine transform (DCT).
- The intra prediction signal is utilized as side information to reduce the directionality in the residual.
- A novel loss function is used to characterize the efficiency of the transform during the training.
This is a paper in 2020 VCIP. (Sik-Ho Tsang @ Medium)
Outline
- Directionality in Residual Domain
- DLT: Network Architecture
- Loss Function
- Experimental Results
1. Directionality in Residual Domain
- When directional intra prediction is applied to predict the block, there is directional information in the prediction block.
- After the difference between the original block and the prediction block is computed, DCT transformed, and quantized, as shown above (right), there is still directional information in the residual domain.
- That means the DCT cannot decorrelate the block well enough to compact the energy.
- Thus, CNN is designed to solve this problem, as shown in the first figure at the top of this story.
- It is noted that the Deep Learning-based Transform (DLT) is only applied to 8×8 blocks.
- The flag DLT_Flag is transmitted to the decoder for each 8×8 transform unit (TU) to indicate whether the DLT is used or not (a mode-decision sketch follows below).
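How the encoder decides the value of DLT_Flag is not detailed in this story; a minimal sketch, assuming a standard rate-distortion comparison between the DCT path and the DLT path (rd_cost_dct and rd_cost_dlt are hypothetical helpers, not from the paper):

```python
# Hypothetical encoder-side mode decision for one 8x8 TU (not from the paper).
def code_8x8_tu(residual, prediction, rd_cost_dct, rd_cost_dlt):
    """Return (dlt_flag, rd_cost): DLT_Flag is set when the CNN transform path is cheaper."""
    cost_dct = rd_cost_dct(residual)               # conventional DCT + quantization path
    cost_dlt = rd_cost_dlt(residual, prediction)   # DLT path, prediction used as side information
    dlt_flag = cost_dlt < cost_dct
    return dlt_flag, min(cost_dct, cost_dlt)
```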
2. DLT: Network Architecture
2.1. Directional Module (Top)
To capture the directionality, an additional branch that utilizes the coding information (here, the intra prediction signal) is used.
- 3 convolutional layers with stride 1 and kernel size 3×3 are used. The numbers of output channels are 128, 64, and 1, respectively.
- All of these layers use the tanh activation function.
- Before feeding to the neural network, the mean is subtracted from the prediction.
- Finally, the directional information Idir is extracted (see the sketch after this list).
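A minimal PyTorch sketch of this directional branch; the zero padding that keeps the 8×8 spatial size and the single-channel luma input are my assumptions, not stated in the story:

```python
import torch
import torch.nn as nn

class DirectionalModule(nn.Module):
    """Directional branch: 3 conv layers (3x3, stride 1), 128/64/1 output channels, tanh after each."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 128, kernel_size=3, stride=1, padding=1), nn.Tanh(),
            nn.Conv2d(128, 64, kernel_size=3, stride=1, padding=1), nn.Tanh(),
            nn.Conv2d(64, 1, kernel_size=3, stride=1, padding=1), nn.Tanh(),
        )

    def forward(self, prediction):
        # The mean is subtracted from the intra prediction before it enters the network.
        x = prediction - prediction.mean(dim=(2, 3), keepdim=True)
        return self.net(x)  # I_dir: the extracted directional information
```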
2.2. Transform Module (Bottom)
- An autoencoder architecture is used as the transform.
- The encoder and decoder each consist of one convolutional layer and are initialized using the DCT basis functions.
2.2.1. Encoder for Forward Transform
- The encoder of the Autoencoder is adopted to perform the forward transform.
- To eliminate the directionality in the residuals, the difference between the residual Xres and the extracted directional information Idir is used as the input of the encoder: Z = enc(Xres - Idir),
- where enc indicates the encoder in the autoencoder, and Z represents the transform coefficients.
- The number of transform coefficients equals the size of the input Xres, i.e., 64 for an 8×8 block (a code sketch follows below).
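A sketch of how such a DCT-initialized forward transform could look in PyTorch, assuming the single encoder convolution has an 8×8 kernel with 64 output channels and stride 8, so that at initialization it is exactly the 2-D DCT of the 8×8 block:

```python
import math
import torch
import torch.nn as nn

def dct2_basis(n=8):
    """Return a (n*n, 1, n, n) tensor of 2-D DCT-II basis functions for conv-weight initialization."""
    k = torch.arange(n, dtype=torch.float32)
    c = torch.full((n,), math.sqrt(2.0 / n)); c[0] = math.sqrt(1.0 / n)
    b1d = c[:, None] * torch.cos(math.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))  # 1-D basis, (n, n)
    return torch.einsum('ux,vy->uvxy', b1d, b1d).reshape(n * n, 1, n, n)

class ForwardTransform(nn.Module):
    """Encoder of the autoencoder: one conv layer acting as the (learnable) forward transform."""
    def __init__(self, n=8):
        super().__init__()
        self.enc = nn.Conv2d(1, n * n, kernel_size=n, stride=n, bias=False)
        with torch.no_grad():
            self.enc.weight.copy_(dct2_basis(n))   # initialized with the DCT basis functions

    def forward(self, x_res, i_dir):
        # Z = enc(Xres - Idir): directionality is removed before the transform
        return self.enc(x_res - i_dir)             # 64 coefficient channels per 8x8 block
```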
2.2.2. Decoder for Inverse Transform
- For the inverse transform, the decoder of the Autoencoder takes the transform coefficients Z as the input.
- To recover the directionality of the residual, the extracted directional information Idir is added back to the output of the decoder.
- Furthermore, to compact the signal energy into a few coefficients, only K transform coefficients in Z are used to perform the inverse transform, i.e., X̂res = dec(ZK) + Idir, where ZK denotes Z with only K coefficients kept. In the implementation, K = 8 (see the sketch below).
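Continuing the sketch above (reusing dct2_basis and the imports), the inverse transform could be a DCT-initialized transposed convolution; keeping the first K coefficient channels is my assumption, as the selection rule for the K coefficients is not described here:

```python
class InverseTransform(nn.Module):
    """Decoder of the autoencoder: one transposed-conv layer acting as the inverse transform."""
    def __init__(self, n=8, k=8):
        super().__init__()
        self.k = k
        self.dec = nn.ConvTranspose2d(n * n, 1, kernel_size=n, stride=n, bias=False)
        with torch.no_grad():
            self.dec.weight.copy_(dct2_basis(n))   # weight shape (in=64, out=1, 8, 8)

    def forward(self, z, i_dir):
        mask = torch.zeros_like(z)
        mask[:, : self.k] = 1.0                    # keep only K coefficients (assumed: first K channels)
        x_hat = self.dec(z * mask)                 # inverse transform from the kept coefficients
        return x_hat + i_dir                       # add the directional information back
```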
3. Loss Function
3.1. L2 Loss
- A pixel-wise L2 loss between the original residual Xres and the reconstructed residual X̂res is used.
3.2. Energy Compact Loss
- The transform coding gain is widely adopted to measure the amount of energy compaction achieved by the transform.
- It is defined as the ratio of the arithmetic mean to the geometric mean of the variances of the transform coefficients: G = (1/N · Σi σi²) / (Πi σi²)^(1/N),
- where σi² denotes the variance of the i-th coefficient in Z and N is the number of coefficients.
3.3. Total Loss
- The total loss L is the weighted sum of the L2 loss and the energy compact loss: L = α·L2 + β·Lenergy,
- where α and β are weights reflecting the influence of the different losses. In the implementation, the authors set α = 1.0 and β = 0.2 (a sketch of the full loss follows below).
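A sketch of the whole training loss, assuming the energy compact loss is realised as the negative log of the coding gain; the exact form used in the paper may differ:

```python
import torch

def dlt_loss(x_res, x_hat, z, alpha=1.0, beta=0.2, eps=1e-9):
    """Total loss = alpha * pixel-wise L2 + beta * energy-compaction term."""
    l2 = ((x_res - x_hat) ** 2).mean()                  # pixel-wise L2 loss

    coeffs = z.transpose(0, 1).reshape(z.shape[1], -1)  # (64, batch * spatial) coefficient samples
    var = coeffs.var(dim=1) + eps                       # variance of each transform coefficient
    coding_gain = var.mean() / torch.exp(torch.log(var).mean())  # arithmetic / geometric mean
    energy = -torch.log(coding_gain)                    # minimizing this maximizes the coding gain

    return alpha * l2 + beta * energy
```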
4. Experimental Results
4.1. Training
- The Uncompressed Color Image Database (UCID) is used as the training set.
- The HEVC reference software (HM-16.20) is used to compress the images in UCID under the default configuration, and all the 4×4 TUs are marked.
- Then, the minimum transform size is set to 8 and the images are compressed again.
- At the same time, the 8×8 TUs whose four 4×4 sub-blocks were all marked in the first step are extracted. In this way, the complex 8×8 TUs, on which the DCT performs poorly, can be extracted (see the selection sketch after this list).
- Only luma is considered, and all the values are normalized to the range [-1, 1] by dividing by 255.
- Only one model is trained for four QPs {22, 27, 32, 37}.
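A rough sketch of the block-selection step, assuming a hypothetical per-sample map tu4_mask from the first HM-16.20 pass that is True wherever a 4×4 TU was chosen:

```python
import numpy as np

def select_complex_8x8_positions(tu4_mask):
    """Keep an 8x8 position only if all four of its 4x4 sub-blocks were 4x4 TUs in the first pass,
    i.e. a 'complex' block on which the DCT performs poorly."""
    positions = []
    h, w = tu4_mask.shape
    for y in range(0, h - 7, 8):
        for x in range(0, w - 7, 8):
            if tu4_mask[y:y + 8, x:x + 8].all():
                positions.append((y, x))
    return positions

# The luma residuals and intra predictions at these positions are then taken from the second pass
# (minimum transform size forced to 8) and divided by 255 so the values lie in [-1, 1].
```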
4.2. BD-Rate
- The proposed DLT brings 0.75%, 0.1%, and 0.2% BD-rate reductions for the Y, U, and V components, respectively.
- Higher gains are achieved for test sequences with abundant textures, e.g., ParkScene and RaceHorses.
4.3. Usage
- The above figure gives further insight into the TUs that use the proposed DLT.
- It illustrates that the proposed DLT can capture the non-stationarity in the textured areas of natural video.
Reference
[2020 VCIP] [DLT]
Deep Learning-Based Nonlinear Transform for HEVC Intra Coding
Codec Intra Prediction
JPEG [MS-ROI] [Baig JVICU’17]
JPEG-HDR [Han VCIP’20]
HEVC [Xu VCIP’17] [Song VCIP’17] [Li VCIP’17] [Puri EUSIPCO’17] [IPCNN] [IPFCN] [HybridNN, Li ICIP’18] [Liu MMM’18] [CNNAC] [Li TCSVT’18] [Spatial RNN] [PS-RNN] [AP-CNN] [MIP] [Wang VCIP’19] [IntraNN] [CNNAC TCSVT’19] [CNN-CR] [CNNMC Yokoyama ICCE’20] [PNNS] [CNNCP] [Zhu TMM’20] [Sun VCIP’20] [DLT] [Zhong ELECGJ’21]
VVC [CNNIF & CNNMC] [Brand PCS’19] [Bonnineau ICASSP’20] [Santamaria ICMEW’20] [Zhu TMM’20]