Review — DLT: Deep Learning-Based Nonlinear Transform for HEVC Intra Coding (HEVC Intra)

Deep Learning-based Transform (DLT) Using Autoencoder, 0.75% BD-Rate Reduction is Achieved

Sik-Ho Tsang
5 min read · May 9, 2021


Overview of the Proposed Deep Learning-Based Transform (DLT)

In this story, Deep Learning-Based Nonlinear Transform for HEVC Intra Coding (DLT) is reviewed. In this paper:

  • A convolutional neural network (CNN) model is designed as Deep Learning-Based Transform (DLT) to achieve better decorrelation and energy compaction than the conventional discrete cosine transform (DCT).
  • The intra prediction signal is utilized as side information to reduce the directionality in the residual.
  • A novel loss function is used to characterize the efficiency of the transform during the training.

This is a paper in 2020 VCIP. (Sik-Ho Tsang @ Medium)


  1. Directionality in Residual Domain
  2. DLT: Network Architecture
  3. Loss Function
  4. Experimental Results

1. Directionality in Residual Domain

The directional information in the prediction block and the residual block after directional intra prediction.
  • When directional intra prediction is applied to predict the block, there is directional information in the prediction block.
  • After taking the difference between the original block and the prediction block, as shown above (right), and applying the DCT and quantization, directional information still remains in the residual domain.
  • That means the DCT cannot decorrelate the block well enough to compact the energy.
  • Thus, a CNN is designed to solve this problem, as shown in the first figure at the top of this story.
  • It is noted that the Deep Learning-based Transform (DLT) is only applied to 8×8 blocks.
  • The flag DLT_Flag is transmitted to the decoder for each 8×8 transform unit (TU) to inform whether the DLT is used or not.

2. DLT: Network Architecture

Top: Directional Module, Bottom: Transform Module

2.1. Directional Module (Top)

To capture the directionality, an additional branch which utilizes the coding information is used.

  • 3 convolutional layers with stride 1 and kernel size 3×3 are used. The numbers of output channels are 128, 64 and 1, respectively.
  • All of these layers use the tanh activation function.
  • Before being fed to the neural network, the prediction has its mean subtracted.
  • Finally, the directional information Idir is extracted.
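The directional branch described above can be sketched in plain Python. This is only an illustration under stated assumptions: zero padding of 1 (so that Idir keeps the size of the input), an 8×8 prediction block as input, and random weights in place of trained ones are all assumptions, and the helper names (conv3x3_tanh, build_layers, directional_module) are hypothetical.

```python
import math
import random

def conv3x3_tanh(x, weights, biases):
    """3x3 convolution, stride 1, followed by tanh.

    x is a list of input feature maps (each H x W); weights[o][c] is the
    3x3 kernel from input channel c to output channel o.
    Zero padding of 1 is ASSUMED so that the output stays H x W.
    """
    h, w = len(x[0]), len(x[0][0])
    out = []
    for o, bias in enumerate(biases):
        fmap = [[bias] * w for _ in range(h)]
        for c, plane in enumerate(x):
            k = weights[o][c]
            for i in range(h):
                for j in range(w):
                    acc = 0.0
                    for di in (-1, 0, 1):
                        ii = i + di
                        if 0 <= ii < h:
                            for dj in (-1, 0, 1):
                                jj = j + dj
                                if 0 <= jj < w:
                                    acc += k[di + 1][dj + 1] * plane[ii][jj]
                    fmap[i][j] += acc
        out.append([[math.tanh(v) for v in row] for row in fmap])
    return out

def build_layers(channels=(128, 64, 1), seed=0, scale=0.05):
    """Random stand-in weights (ASSUMPTION: the real weights are learned)."""
    rng = random.Random(seed)
    layers, c_in = [], 1
    for c_out in channels:
        weights = [[[[rng.uniform(-scale, scale) for _ in range(3)]
                     for _ in range(3)]
                    for _ in range(c_in)]
                   for _ in range(c_out)]
        layers.append((weights, [0.0] * c_out))
        c_in = c_out
    return layers

def directional_module(pred, layers):
    """Subtract the mean of the prediction, then apply the conv layers."""
    count = sum(len(row) for row in pred)
    mean = sum(sum(row) for row in pred) / count
    x = [[[v - mean for v in row] for row in pred]]
    for weights, biases in layers:
        x = conv3x3_tanh(x, weights, biases)
    return x[0]  # the last layer has one output channel: Idir
```

Calling directional_module(pred, build_layers()) uses the 128, 64, 1 channel configuration from the text. It takes a few seconds in pure Python, so a smaller channels tuple is handy for quick experiments. Because the mean is removed first, adding a constant offset to the prediction leaves the output unchanged.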

2.2. Transform Module (Bottom)

  • An autoencoder architecture is used as the transform.
  • The encoder and the decoder each consist of 1 convolutional layer and are initialized using the DCT basis functions.
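To make the DCT initialization concrete, the sketch below generates the 64 orthonormal 2D DCT-II basis images for an 8×8 block. How these map onto the convolution weights is an assumption here (one 8×8 kernel per basis function, giving 64 output channels); the review only states that DCT basis functions are used for initialization. The function names are hypothetical.

```python
import math

def dct_basis(n=8):
    """Return the n*n orthonormal 2D DCT-II basis images (each n x n).

    Basis image number u * n + v corresponds to vertical frequency u
    and horizontal frequency v.
    """
    def b(k, x):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        return scale * math.cos((2 * x + 1) * k * math.pi / (2 * n))

    return [[[b(u, i) * b(v, j) for j in range(n)] for i in range(n)]
            for u in range(n) for v in range(n)]

def inner(a, b):
    """Inner product of two equally sized 2D blocks."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
```

Since the basis is orthonormal, inner(basis[s], basis[t]) is 1 when s equals t and 0 otherwise (up to rounding), so using the same basis for the encoder and the decoder starts training from a perfect-reconstruction transform.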

2.2.1. Encoder for Forward Transform

  • The encoder of the Autoencoder is adopted to perform the forward transform.
  • To eliminate the directionality in the residuals, the difference between the residual Xres and the extracted directional information Idir is used as the input of the encoder:

    Z = enc(Xres − Idir)

  • where enc indicates the encoder in the autoencoder, and Z represents the transform coefficients.
  • The number of the transform coefficients equals the size of input Xres.

2.2.2. Decoder for Inverse Transform

  • For the inverse transform, the decoder of the Autoencoder takes the transform coefficients Z as the input.
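  • To recover the directionality of the residual, the extracted directional information Idir is added back to the output of the decoder.
  • Furthermore, to compact the signal energy into a few coefficients, only K transform coefficients in Z are used to perform the inverse transform. In the implementation, K = 8:

    X̂res = dec(Z_K) + Idir

  • where dec indicates the decoder in the autoencoder, Z_K denotes Z with only K coefficients kept, and X̂res is the reconstructed residual.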
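The forward and inverse steps can be sketched end to end as follows. Several things here are assumptions rather than details from the paper: the encoder and decoder are stood in by their DCT initialization with no nonlinearity, and, since the review does not say which K coefficients are kept, this sketch keeps the K largest in magnitude. The names forward and inverse are hypothetical.

```python
import math

N = 8  # DLT is applied to 8x8 blocks

def dct_basis(n=N):
    """Orthonormal 2D DCT-II basis images, used as stand-in enc/dec weights."""
    def b(k, x):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        return scale * math.cos((2 * x + 1) * k * math.pi / (2 * n))

    return [[[b(u, i) * b(v, j) for j in range(n)] for i in range(n)]
            for u in range(n) for v in range(n)]

BASIS = dct_basis()

def forward(x_res, i_dir):
    """Z = enc(Xres - Idir): project the difference onto each basis image."""
    diff = [[x - d for x, d in zip(rx, rd)] for rx, rd in zip(x_res, i_dir)]
    return [sum(b[i][j] * diff[i][j] for i in range(N) for j in range(N))
            for b in BASIS]

def inverse(z, i_dir, k=8):
    """Reconstruct from K coefficients, then add Idir back.

    ASSUMPTION: the K largest-magnitude coefficients are the ones kept.
    """
    keep = sorted(range(len(z)), key=lambda t: abs(z[t]), reverse=True)[:k]
    out = [row[:] for row in i_dir]  # start from Idir
    for t in keep:
        for i in range(N):
            for j in range(N):
                out[i][j] += z[t] * BASIS[t][i][j]
    return out
```

If Xres − Idir is spanned by at most K basis images, inverse(forward(x_res, i_dir), i_dir, k=8) returns x_res up to rounding. Otherwise the dropped coefficients are exactly the reconstruction error that training has to minimize.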

3. Loss Function

3.1. L2 Loss

  • A pixel-wise L2 loss is used:

3.2. Energy Compact Loss

  • The transform coding gain is widely adopted to measure the amount of energy compaction achieved by the transform.
  • It is defined as the ratio of the arithmetic mean to the geometric mean of the variances of the transform coefficients:

    G = ((1/N) · Σ σi²) / (Π σi²)^(1/N),  i = 1, …, N

  • where σi² denotes the variance of the i-th coefficient in Z and N is the number of coefficients.
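As a worked example of the coding gain, the sketch below estimates per-coefficient variances over a batch of coefficient vectors and forms the ratio of their arithmetic mean to their geometric mean. The small eps guarding against log(0) and the function names are assumptions for illustration, and how the paper turns this gain into a loss term is not reproduced here.

```python
import math

def coefficient_variances(z_samples):
    """Population variance of each coefficient index over a batch of Z vectors."""
    n = len(z_samples)
    dims = len(z_samples[0])
    means = [sum(z[i] for z in z_samples) / n for i in range(dims)]
    return [sum((z[i] - means[i]) ** 2 for z in z_samples) / n
            for i in range(dims)]

def coding_gain(variances, eps=1e-12):
    """Arithmetic mean of the variances divided by their geometric mean.

    The geometric mean is computed in the log domain for numerical stability;
    eps (an ASSUMPTION) avoids log(0) for zero-variance coefficients.
    """
    n = len(variances)
    arithmetic = sum(variances) / n
    log_geometric = sum(math.log(v + eps) for v in variances) / n
    return arithmetic / math.exp(log_geometric)
```

For example, variances of 4 and 1 give an arithmetic mean of 2.5 and a geometric mean of 2, so the gain is 1.25. Equal variances give a gain of 1, meaning no energy compaction at all.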

3.3. Total Loss

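  • The total loss L is the weighted sum of the L2 loss and the energy compact loss:

    L = α · L_L2 + β · L_EC

  • where L_L2 and L_EC denote the L2 loss and the energy compact loss, and α and β are weights reflecting the influence of the different losses. In the implementation, the authors set α = 1.0, β = 0.2.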
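The weighting can be written out as a few lines of Python. This is a minimal sketch: averaging the squared differences over the pixels is an assumption about the exact form of the L2 loss, the energy compact term is passed in as a precomputed number because its exact form is not given in this review, and the function names are hypothetical.

```python
def l2_loss(x, x_hat):
    """Pixel-wise squared error between two blocks.

    ASSUMPTION: the squared differences are averaged over all pixels.
    """
    diffs = [(a - b) ** 2 for ra, rb in zip(x, x_hat) for a, b in zip(ra, rb)]
    return sum(diffs) / len(diffs)

def total_loss(l2, energy_compact, alpha=1.0, beta=0.2):
    """Weighted sum of the two loss terms, with the weights quoted in the text."""
    return alpha * l2 + beta * energy_compact
```

With the quoted weights, the reconstruction term dominates: an L2 loss of 1.0 and an energy compact loss of 0.5 give a total of 1.0 + 0.2 × 0.5 = 1.1.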

4. Experimental Results

4.1. Training

  • The Uncompressed Color Image Database (UCID) is used as the training set.
  • HEVC (HM-16.20) is used to compress the images in UCID under the default configuration, and all the 4×4 TUs are marked.
  • Then, the minimum transform size is set to 8 and the images are compressed again.
  • At the same time, the 8×8 TUs whose 4×4 TUs were all marked in the first step are extracted. In this way, the complex 8×8 TUs, where the DCT performs poorly, can be extracted.
  • Only luma is considered, and all the values are normalized to the range [-1, 1] by dividing by 255.
  • Only one model is trained for four QPs {22, 27, 32, 37}.
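The two-pass sample selection can be sketched as a simple set lookup. The data layout is an assumption: TU positions are taken to be (x, y) luma coordinates of the top-left corner, with the marked 4×4 TUs from the first pass in a set and the 8×8 TUs from the second pass in a list. Both function names are hypothetical, and nothing here talks to HM itself.

```python
def select_complex_tus(tus_8x8, marked_4x4):
    """Keep the 8x8 TUs whose four 4x4 quadrants were all marked in pass one.

    tus_8x8: iterable of (x, y) top-left positions from the second pass.
    marked_4x4: set of (x, y) top-left positions of 4x4 TUs from the first pass.
    """
    selected = []
    for x, y in tus_8x8:
        quadrants = [(x, y), (x + 4, y), (x, y + 4), (x + 4, y + 4)]
        if all(q in marked_4x4 for q in quadrants):
            selected.append((x, y))
    return selected

def normalize(block):
    """Divide by 255 so that 8-bit residuals in [-255, 255] fall in [-1, 1]."""
    return [[v / 255.0 for v in row] for row in block]
```

An 8×8 TU at (8, 0) with only one marked quadrant is rejected, while one at (0, 0) with all four quadrants marked is kept. The intuition is that the encoder split those regions down to 4×4 in the first pass because a single larger DCT did not represent them well.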

4.2. BD-Rate

BD-Rate (%) on CTC Test Sequences
  • The proposed DLT brings 0.75%, 0.1%, 0.2% BD-rate reductions for Y, U, V components respectively.
  • Higher performance can be achieved for test sequences with abundant textures, e.g., ParkScene and RaceHorses.

4.3. Usage

The red blocks represent the 8×8 TUs that use the proposed DLT instead of the DCT
  • The above figure gives further insights into TUs using the proposed DLT.
  • It illustrates that the proposed DLT can capture the non-stationarity in the textured areas of natural video.


