Reading: Zhang ICMEW’20 — CNN-Based Inter Prediction Refinement for AVS3 (AVS3 Inter)

1.88% and 1.17% BD-Rate Reduction Under LD and RA Configurations

Sik-Ho Tsang
5 min read · Aug 1, 2020

In this story, CNN-Based Inter Prediction Refinement for AVS3 (Zhang ICMEW’20), by Harbin Institute of Technology, Peking University, and Peng Cheng Laboratory, is presented. I read this because I work on video coding research. In this paper:

  • A CNN-based inter prediction refinement algorithm is proposed to enhance CUs with different shapes.
  • To make the network robust to various CU shapes, a progressive training scheme is applied.

This is a paper in 2020 ICMEW. (Sik-Ho Tsang @ Medium)


  1. Network Architecture
  2. Training Strategy
  3. AVS3 Integration
  4. Experimental Results

1. Network Architecture

QTBT+EQT partition mode in AVS3
  • Due to the adoption of the QTBT+EQT partition mode in AVS3, as shown in the figure above, AVS3 supports many more CU sizes than AVS2.
  • For the luma component, each CU may be of size M×N (M, N ∈ {4, 8, 16, 32, 64, 128}, ratio(M, N) ≤ 8), giving 30 distinct sizes in total. It is impractical to design an individual network for each size. Therefore, a fully convolutional neural network, which is agnostic to CU shape, is proposed.
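The count of 30 shapes follows directly from the stated constraint, and is easy to verify:

```python
# Count the distinct luma CU sizes allowed by the constraint:
# M, N in {4, 8, 16, 32, 64, 128} with aspect ratio max(M, N) / min(M, N) <= 8.
sizes = [4, 8, 16, 32, 64, 128]
cu_shapes = [(m, n) for m in sizes for n in sizes if max(m, n) / min(m, n) <= 8]
print(len(cu_shapes))  # 30
```

Of the 36 possible (M, N) pairs, six violate the ratio limit (e.g. 4×64, 4×128, 8×128 and their transposes), leaving 30.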
Network Architecture
  • The proposed network consists of five convolutional layers. The inter predictor obtained by the traditional motion compensation process is fed into the network, and the five convolutional layers remove artifacts and refine the inter prediction.
  • The inception structure, originating from GoogLeNet, is employed in the second layer.
  • PReLU is used as the activation after every layer except the last.
  • Residual learning, originating from ResNet, is used to speed up convergence.
  • The details are as follows:
Details of the Network
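The ideas above can be sketched in PyTorch. Note that the channel counts and kernel sizes below are illustrative assumptions, not the exact values from the paper's table; only the overall structure (five layers, an inception-style second layer, PReLU activations, and a residual connection) follows the description:

```python
import torch
import torch.nn as nn

class InterPredRefineNet(nn.Module):
    """Sketch of the 5-layer fully convolutional refinement network.
    Channel counts and kernel sizes are assumed for illustration."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(1, 64, 3, padding=1), nn.PReLU())
        # Inception-style second layer: parallel 1x1 / 3x3 / 5x5 branches,
        # concatenated along the channel axis (the GoogLeNet idea).
        self.branch1 = nn.Sequential(nn.Conv2d(64, 16, 1), nn.PReLU())
        self.branch3 = nn.Sequential(nn.Conv2d(64, 32, 3, padding=1), nn.PReLU())
        self.branch5 = nn.Sequential(nn.Conv2d(64, 16, 5, padding=2), nn.PReLU())
        self.conv3 = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.PReLU())
        self.conv4 = nn.Sequential(nn.Conv2d(64, 32, 3, padding=1), nn.PReLU())
        self.conv5 = nn.Conv2d(32, 1, 3, padding=1)  # no activation on the last layer

    def forward(self, pred):
        x = self.conv1(pred)
        x = torch.cat([self.branch1(x), self.branch3(x), self.branch5(x)], dim=1)
        x = self.conv3(x)
        x = self.conv4(x)
        # Residual learning: predict a correction to the motion-compensated
        # predictor rather than the refined block itself.
        return pred + self.conv5(x)

# Being fully convolutional, the network accepts any CU shape:
net = InterPredRefineNet()
out = net(torch.zeros(1, 1, 32, 64))
print(out.shape)  # torch.Size([1, 1, 32, 64])
```

Because every layer is a convolution with matching padding, the output keeps the input's spatial size, which is what lets one network handle all 30 CU shapes.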

2. Training Strategy

  • Ten 4K video sequences are selected from the SJTU dataset for training.
  • To improve the generalization of the network, these ten 4K sequences are down-sampled to 5 different resolutions (2560×1600, 1920×1080, 1280×720, 832×480, 416×240). These sequences are then compressed by the AVS3 reference software (HPM 5.0) to generate training data at 4 QP values: 27, 32, 38, and 45.
  • Only CUs of size 128×128, 64×64, and 32×32 are extracted from the bitstream and used for training.
  • To ensure the quality of the training data and to prevent over-smooth blocks with simple texture from hindering training, blocks with a standard deviation less than 2 are removed from the dataset.
  • The MSE between the refined predictor and the original block is used as the loss.
  • A progressive training strategy is used: the network is first trained on 128×128 blocks, then fine-tuned on 64×64 blocks, and finally fine-tuned on 32×32 blocks.
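The progressive scheme amounts to one training loop run three times on data of decreasing block size. The sketch below assumes a `loaders` dict mapping block size to a loader of (inter predictor, original block) pairs; the optimizer, epoch counts, and learning-rate schedule are assumptions, not values from the paper:

```python
import torch
import torch.nn as nn

def progressive_train(net, loaders, epochs_per_stage=10, lr=1e-4):
    """Progressive training sketch: train on 128x128 blocks first, then
    fine-tune the same weights on 64x64 and finally 32x32 blocks.
    Hyper-parameters here are illustrative assumptions."""
    criterion = nn.MSELoss()  # MSE between refined predictor and original block
    for size in (128, 64, 32):  # coarse-to-fine stages
        optimizer = torch.optim.Adam(net.parameters(), lr=lr)
        for _ in range(epochs_per_stage):
            for pred, orig in loaders[size]:
                loss = criterion(net(pred), orig)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        lr *= 0.1  # assumed: smaller learning rate for each fine-tuning stage
    return net
```

Starting from the largest blocks gives the network the most spatial context first; the later stages only adapt it to smaller CUs instead of training from scratch.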

3. AVS3 Integration

AVS3 Integration
  • The network is placed after the motion compensation module in both the encoder and the decoder.
  • The inter predictor produced by motion compensation is fed into the fully convolutional network described above.
  • An enhanced block is obtained at the output layer.
  • The same process is also deployed at the decoder side, as shown above.
  • Only the luma component is refined.
  • Since the CNN may not refine every block well, a CU-level flag, determined by rate-distortion optimization, is signaled to indicate whether the CNN-based refinement is applied.
Block occurrence frequency and proportion
  • The block count and the proportion of each size are calculated to draw the figure above.
  • Although blocks of size 64×64 and above account for only 20% of blocks by count, they cover about 90% of the picture area.
  • Therefore, by limiting the CU sizes refined by the proposed method, the encoding time can be greatly reduced at only a small cost in coding efficiency.
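The CU-level flag decision described above can be sketched as a standard rate-distortion comparison. The SSD distortion measure and the one-flag rate term below are common choices assumed for illustration; the paper does not spell out the exact cost formulation:

```python
def decide_cnn_flag(pred, refined, orig, rate_flag_bits, lam):
    """RDO sketch for the CU-level flag: use the CNN-refined predictor only
    when it lowers the rate-distortion cost J = D + lambda * R.
    Inputs are flat lists of pixel values; all names are illustrative."""
    def ssd(a, b):  # sum of squared differences as the distortion D
        return sum((x - y) ** 2 for x, y in zip(a, b))
    cost_off = ssd(pred, orig)                          # flag = 0: no refinement
    cost_on = ssd(refined, orig) + lam * rate_flag_bits  # flag = 1: + signaling cost
    return cost_on < cost_off
```

Because the decoder runs the same network, only this one flag per CU needs to be transmitted.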

4. Experimental Results

4.1. BD-Rate

BD-Rate (%)
  • In total, 12 sequences of different resolutions (4K UHD, 1080p, and 720p) are tested with the proposed method against HPM-5.0.
  • The proposed method achieves an average 1.88% (up to 4.15%) BD-rate saving under the LD configuration, and an average 1.17% (up to 2.64%) BD-rate saving under the RA configuration.
  • Although the network is trained under the LDP configuration, it still performs well under the RA and LD configurations, which demonstrates the strong generalization ability of the network.

4.2. Subjective Quality

  • As shown above, blocking or blurring artifacts can appear in low-bit-rate frames reconstructed by HPM-5.0.
  • After refinement by the proposed method, these distortions are largely removed and the texture within the block becomes smoother, improving subjective quality.

4.3. Computational Complexity

Encoding and decoding time (%) Compared to HPM-5.0
  • The encoding time is less than twice that of HPM-5.0, while the decoding time is about 2.5 times, with the network running on a GPU.

4.4. Block Size Limitation

Coding performance and computational complexity comparison of limiting block size
  • Only 10 frames of the 720p sequences were tested, under the LD configuration.
  • Only blocks of size 64×64 or larger are enhanced in Exp. 1, while Exp. 2 places no limit on block size.
  • As shown in the table above, limiting the size of the enhanced blocks greatly reduces the encoding and decoding time with little degradation in coding efficiency.

This is the 1st story in this month.


