Reading: Pooling-Variable CNN — Adaptive CU Split Decision (Fast VVC Prediction)

33% Coding Time Saving With Only 0.99% Increase in BD-rate

Sik-Ho Tsang
4 min read · May 24, 2020

In this story, Adaptive CU Split Decision with Pooling-variable CNN for VVC Intra Encoding, by Fudan University, is presented. For brevity, I just call it Pooling-Variable CNN. I read this because I work on video coding research.

Versatile Video Coding (VVC) is a very new video coding standard, which is still under development at this moment.

In this paper:

  • An adaptive CU split decision for intra frames is proposed, using a pooling-variable convolutional neural network (CNN) that targets coding units (CUs) of various shapes.
  • Thus, whether to split a CU is decided by a single trained network, with the same architecture and parameters for CUs of every size.

This is a paper in 2019 VCIP. (Sik-Ho Tsang @ Medium)


  1. Quad-Tree plus Multi-type Tree (QTMT) in VVC
  2. Pooling-Variable CNN: Network Architecture
  3. VVC Implementation
  4. Experimental Results

1. Quad-Tree plus Multi-type Tree (QTMT) in VVC

Quad-Tree plus Multi-type Tree (QTMT) in VVC
  • In early development, Quad-Tree plus Binary Tree (QTBT) was introduced first. (If interested, please read Jin VCIP’17 about QTBT.)
  • More recently, QTMT was introduced.
  • Besides QT and BT, there are also horizontal and vertical ternary tree (TT) partitions, shown as dashed lines in the figure above.
  • Therefore, there are five split choices (1 QT, 2 BT and 2 TT), which further increases the encoder complexity.
  • Thus, a fast approach is needed.
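
To make the five split choices concrete, here is a minimal Python sketch that lists the sub-CU sizes each split would produce. The function name `qtmt_splits` and the mode labels are my own, not taken from the paper or from VTM. It also ignores the encoder's real constraints (for example, QT only applies to square CUs, and there are minimum CU sizes), so treat it as an illustration only.

```python
from typing import Dict, List, Tuple

Size = Tuple[int, int]  # (width, height)


def qtmt_splits(width: int, height: int) -> Dict[str, List[Size]]:
    """Sub-CU sizes for the five QTMT split modes (hypothetical helper).

    Assumption: no legality checks are done, every mode is listed.
    """
    return {
        # Quad-tree: four equal quarters.
        "QT": [(width // 2, height // 2)] * 4,
        # Binary tree: two halves, split by a horizontal or a vertical line.
        "BT_HOR": [(width, height // 2)] * 2,
        "BT_VER": [(width // 2, height)] * 2,
        # Ternary tree: three parts in a 1:2:1 ratio.
        "TT_HOR": [(width, height // 4), (width, height // 2), (width, height // 4)],
        "TT_VER": [(width // 4, height), (width // 2, height), (width // 4, height)],
    }


splits = qtmt_splits(32, 32)
for mode, parts in splits.items():
    print(mode, parts)
```

Each of these candidates would normally be tried recursively by the encoder, which is where the extra complexity comes from.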

2. Pooling-Variable CNN: Network Architecture

Pooling-Variable CNN: Network Architecture
  • The residual block (the difference between the original block and the predicted block, not the one in ResNet, lol) is used as the CNN input instead of the original block.
  • After the first 3×3 convolutional layer, there is a shape-adaptive pooling layer. This first pooling layer is designed for CUs whose width or height is no less than 32.
  • The second shape-adaptive pooling layer is for CUs whose width or height is no less than 16.
  • A shape-adaptive CNN is used here: the pooling layers have no parameters to learn, so the pooling size can vary with the input shape, as shown above.
  • For example, if the input CU has width = 32 and height = 16, the first pooling layer size will be [2, 1], and the second pooling layer size will be [2, 2].
  • Note that the number of convolutional feature maps is not just 2; it is 16, 24, and 32.
  • Before the fully connected layer, every input block is reduced to a 4×4 block.
  • With shape-adaptive pooling, a single network can handle all CU sizes.
  • All training samples are extracted by VTM encoder version 5.0 from the test sequences, including “Campfire”, “BasketballDrill”, “PartyScene”, “KristenAndSara” and so on, under four QPs (22, 27, 32, 37), giving more than 60K samples in total.
  • The network is trained by CU size in turn, from 8×8 to 32×32. The 8×8 CUs are trained first; the 16×8 CUs are then fed to the CNN, starting from the parameters already trained on 8×8 rather than from random initialization. The 32×32 CU data is trained last.
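
The pooling rule described above can be sketched in a few lines of Python. This is my own reading of the bullets, not code from the paper: I assume each dimension is pooled by 2 when it reaches the layer's threshold (32 for the first layer, 16 for the second) and left alone otherwise, which reproduces the [2, 1] and [2, 2] example. The names `pool_sizes` and `shape_after_pooling` are hypothetical.

```python
from typing import List, Tuple


def pool_sizes(width: int, height: int) -> List[Tuple[int, int]]:
    """Pooling sizes [w, h] for the two shape-adaptive pooling layers.

    Assumption: a dimension is halved when it is no less than the layer's
    threshold (32, then 16); otherwise the pooling size for it is 1.
    """
    sizes = []
    for threshold in (32, 16):
        pool_w = 2 if width >= threshold else 1
        pool_h = 2 if height >= threshold else 1
        sizes.append((pool_w, pool_h))
    return sizes


def shape_after_pooling(width: int, height: int) -> Tuple[int, int]:
    """Feature-map shape after both pooling layers are applied."""
    for pool_w, pool_h in pool_sizes(width, height):
        width //= pool_w
        height //= pool_h
    return width, height


print(pool_sizes(32, 16))           # [(2, 1), (2, 2)]
print(shape_after_pooling(32, 16))  # (8, 8)
```

Under this assumption, every CU from 8×8 to 32×32 comes out of the two pooling layers with the same 8×8 shape, which is what lets one network serve all sizes. How that is then brought down to the 4×4 block in front of the fully connected layer is not covered by this sketch.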

3. VVC Implementation

3.1. Flowchart

VVC Implementation
  • The proposed adaptive CU split decision flowchart is shown above.
  • For a single CU, whether square or rectangular, the gradient of the residual block is calculated and a pre-decision is made based on it.
  • After that, based on the CNN inference result, either the coding of the current CU is terminated or the sub-CUs are coded via CheckModeSplit.
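
The control flow in these bullets can be sketched roughly as below. This is only my interpretation of the flowchart: `decide_split`, `pre_decision` and `cnn_predict_split` are hypothetical names, and I assume the pre-decision either settles the question on its own or returns `None` to hand the CU to the CNN. In the real encoder the "split" branch would go on to CheckModeSplit.

```python
from typing import Any, Callable, Optional


def decide_split(
    residual: Any,
    qp: int,
    pre_decision: Callable[[Any, int], Optional[bool]],
    cnn_predict_split: Callable[[Any], bool],
) -> bool:
    """Return True to code the sub-CUs, False to stop at the current CU.

    Assumption: pre_decision returns True/False when it can decide alone,
    or None when the CNN should be consulted.
    """
    early = pre_decision(residual, qp)
    if early is not None:
        return early  # decided by the gradient check, CNN is skipped
    return cnn_predict_split(residual)


# Toy usage with stub callables standing in for the real components.
cnn_calls = []


def stub_cnn(residual: Any) -> bool:
    cnn_calls.append(residual)
    return True


print(decide_split("cu_a", 32, lambda r, q: False, stub_cnn))  # False
print(decide_split("cu_b", 32, lambda r, q: None, stub_cnn))   # True
```

Passing the two stages in as callables keeps the sketch independent of how the gradient check and the network are actually implemented.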

3.2. Pre-Decision

  • gx and gy are the gradients in the x and y directions, calculated by the Sobel operator.
  • The sum of the gradients in the CU is compared with the square of the quantization parameter (QP), weighted by parameters, as shown in the equation above.
  • Thus, the pre-decision filters out CUs that do not need the CNN, which saves the extra CNN time and benefits the CNN training.
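
Here is a minimal pure-Python sketch of such a gradient check. Several details are my own assumptions, not taken from the paper: the gradient magnitude is taken as |gx| + |gy|, only interior samples are used (no border padding), and `alpha` is a hypothetical stand-in for the paper's parameters. What the encoder does with the comparison result is not shown here.

```python
from typing import List

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def gradient_sum(block: List[List[int]]) -> int:
    """Sum of |gx| + |gy| over the interior samples of a residual block."""
    height, width = len(block), len(block[0])
    total = 0
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = gy = 0
            for dy in range(3):
                for dx in range(3):
                    sample = block[y + dy - 1][x + dx - 1]
                    gx += SOBEL_X[dy][dx] * sample
                    gy += SOBEL_Y[dy][dx] * sample
            total += abs(gx) + abs(gy)
    return total


def below_threshold(block: List[List[int]], qp: int, alpha: float = 1.0) -> bool:
    """True when the gradient sum is under alpha * QP^2 (alpha is assumed)."""
    return gradient_sum(block) < alpha * qp * qp


flat = [[5] * 4 for _ in range(4)]         # no texture at all
edge = [[0, 0, 10, 10] for _ in range(4)]  # one vertical edge
print(gradient_sum(flat), gradient_sum(edge))  # 0 160
```

A flat residual gives a zero gradient sum, while a block with a strong edge gives a large one, so a QP-dependent threshold can separate the easy CUs from those worth sending to the CNN.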

4. Experimental Results

BD-Rate (%) Compared to VTM-5.0
  • Compared with the anchor, the proposed algorithm achieves 33.41% time saving with only a 0.99% BD-rate increase on average.

During the days of coronavirus, the challenge of writing 30 stories again this month has been accomplished. A new target of 35 stories has now been set. This is the 33rd story this month. Thanks for visiting my story.


