Reading: Amna JRTIP’20 — Fast Intra‑Coding Unit Partition Decision in H.266/FVC (Fast VVC)
35% Average Encoding Time Reduction, With Only 1.7% Increase in BD-Rate
In this story, “Fast intra‑coding unit partition decision in H.266/FVC based on deep learning” (Amna JRTIP’20), by the University of Monastir, the University of Sfax, and King Khalid University, is presented. I read this because I work on video coding research. In this paper:
- Three-level CNN is designed to predict the intra-mode coding unit (CU) partition size.
This is a journal paper in 2020 Springer JRTIP. (Sik-Ho Tsang @ Medium)
Outline
- Statistical Analysis
- Network Architecture
- Experimental Results
1. Statistical Analysis
- In JEM-7.0, coding units (CUs) can be of non-square size as shown above, which is called binary tree (BT) partitioning. But it also increases the encoding complexity considerably.
- (Together with quad-tree (QT), it is called QTBT.)
- With BT disabled in JEM-7.0, the BD-rate increases by 5.4%, but the encoding time decreases by 88%!
- DQT is the depth of QT whereas DBT is the depth of BT.
- Starting from a CU at depth (DQT = 2, DBT = 0), the final selected decisions are: 80% square 16×16, 9% horizontal rectangular division 32×8, 8% vertical rectangular division 8×32, and 3% non-division (the 32×32 size is kept).
- When DQT = 3, only 30% of the initial 16×16 blocks are selected to be divided into 8×8 blocks.
- If non-partitioned square CUs can be predicted early, the RDO process of the considered CUs can be skipped, which greatly speeds up the QTBT partitioning process.
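The early-termination idea above can be sketched as follows. This is a hypothetical illustration, not the paper's actual encoder integration: the split probability would come from the CNN classifiers described later, and the 0.5 threshold is an assumption.

```python
# Hypothetical sketch of CNN-based early termination: when the classifier is
# confident a square CU will NOT be split, the RDO search over its
# sub-partitions is skipped. Names and threshold are illustrative only.

def should_run_partition_rdo(split_prob: float, threshold: float = 0.5) -> bool:
    """Return True if the RDO search over sub-CUs should still be run."""
    # Only skip the costly RDO when the model predicts "non-split";
    # otherwise fall back to the full QTBT search.
    return split_prob >= threshold

def encode_cu(split_prob: float) -> str:
    if should_run_partition_rdo(split_prob):
        return "full RDO over QTBT sub-partitions"
    return "encode CU as-is (RDO of sub-CUs skipped)"
```

Skipping only confidently non-split CUs keeps the quality loss small: uncertain CUs still go through the normal RDO search.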
2. Network Architecture
- The binary label for each CU is split (1) or non-split (0).
- Three CNN models predict the depth range of QT partitioning for 128×128, 64×64 and 32×32 CUs.
- Three separate CNN models, sharing the same deep CNN structure but with different kernel sizes, are trained to obtain classifiers at the three levels.
- The raw CTU is first preprocessed by a mean-removal module, used to reduce the variation of the CTU input samples, and then a down-sampling module is applied.
- The proposed CNN structure is composed of an input layer, three convolutional layers, a concatenation layer and two fully connected layers.
- The first convolutional layer uses 4×4 kernels (8 filters in total) with stride 4.
- The second and third layers sequentially convolve the data with 2×2 kernels and stride 2 (16 filters at the second layer, 24 at the third) to generate higher-level features.
- The vectorized features fed to the concatenation layer are collected from the second and third convolutional layers of the three models. This concatenation yields a mix of both global and local features.
- Next, the concatenated feature vector of each of the three models passes through the fully connected layers: two hidden fully connected layers successively generate feature vectors, and one output layer produces the outputs P1, P2 and P3, containing 1, 4 and 16 binary elements, respectively.
- ReLU is used.
- Since all the labels are binary, the sigmoid function is used to activate all the output layers.
- The cross-entropy function is used as the loss.
- 2000 images with resolution 4928×3264 are randomly selected from the raw image database [24] and divided into training (1700 images), validation (100 images) and test (200 images) sets [25].
- These images are further down-sampled to resolutions 768×512, 1536×1024 and 2880×1920, then coded with the JEM-7.0 software in intra-mode configuration to extract the database.
- Finally, after encoding with 4 QP values and the three square CU sizes, each CU with its corresponding binary label, split (1) or non-split (0), is stored as a database sample.
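To make the feature dimensions above concrete, here is a small sanity check of the convolution output sizes for an assumed 32×32 input with 'valid' (no-padding) convolutions. The paper uses different kernel sizes per model and a down-sampling module, so these numbers illustrate just one plausible configuration.

```python
def conv_out(size: int, kernel: int, stride: int) -> int:
    """Spatial output size of a 'valid' (no-padding) convolution."""
    return (size - kernel) // stride + 1

# Assumed 32x32 input (padding/input size are assumptions, not from the paper).
s1 = conv_out(32, kernel=4, stride=4)  # 1st layer: 4x4 kernels, stride 4 -> 8
s2 = conv_out(s1, kernel=2, stride=2)  # 2nd layer: 2x2 kernels, stride 2 -> 4
s3 = conv_out(s2, kernel=2, stride=2)  # 3rd layer: 2x2 kernels, stride 2 -> 2

# Features from the 2nd (16 filters) and 3rd (24 filters) layers are
# flattened and concatenated before the fully connected layers.
concat_len = s2 * s2 * 16 + s3 * s3 * 24

print(s1, s2, s3, concat_len)  # 8 4 2 352
```

So under these assumptions the fully connected layers would see a 352-element vector combining mid-level (4×4×16) and high-level (2×2×24) features.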
3. Experimental Results
3.1. BD-Rate
- 35% of the encoding time is saved on average compared to JEM-7.0, with only a slight BDBR increase of 1.7% on average.
- To compare with Jin ACCESS’18 [13], the proposed approach is also implemented in JEM-3.1.
- A 49.1% average time reduction is achieved, with a slight BDBR increase of 1%.
- (But it is quite difficult to say which one is better.)
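For readers unfamiliar with BDBR: the Bjøntegaard delta bit-rate measures the average bit-rate difference between two RD curves at equal PSNR. A minimal NumPy sketch of the classic computation (cubic fit of log-rate vs. PSNR, integrated over the overlapping PSNR range); the RD points below are invented for illustration.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjontegaard delta bit-rate (%) of `test` relative to `anchor`."""
    # Fit a cubic polynomial to log10(rate) as a function of PSNR per curve.
    p_a = np.polyfit(psnr_anchor, np.log10(rate_anchor), 3)
    p_t = np.polyfit(psnr_test, np.log10(rate_test), 3)
    # Integrate both fits over the overlapping PSNR range.
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    # Average log-rate gap, converted back to a percentage.
    avg_diff = (int_t - int_a) / (hi - lo)
    return (10 ** avg_diff - 1) * 100

# Invented example: the test codec spends ~2% more bits at equal quality.
rate_a = [1000, 2000, 4000, 8000]      # kbps at 4 QPs
psnr_a = [34.0, 36.5, 39.0, 41.5]      # dB
rate_t = [r * 1.02 for r in rate_a]
print(round(bd_rate(rate_a, psnr_a, rate_t, psnr_a), 2))  # 2.0
```

A positive BDBR means the test method needs more bits for the same quality, which is why the 1.7% figure above is reported as a (small) loss.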
3.2. RD Curves
- A slightly larger loss at low bitrates is observed for the CatRobot and BasketballDrive sequences.
This is the 12th story in this month.
Reference
[2020 JRTIP] [Amna JRTIP’20]
Fast intra‑coding unit partition decision in H.266/FVC based on deep learning
Codec Fast Prediction
H.264 to HEVC [Wei VCIP’17] [H-LSTM]
HEVC [Yu ICIP’15 / Liu ISCAS’16 / Liu TIP’16] [Laude PCS’16] [Li ICME’17] [Katayama ICICT’18] [Chang DCC’18] [ETH-CNN & ETH-LSTM] [Zhang RCAR’19] [Kim TCVST’19] [LFHI & LFSD & LFMD Using AK-CNN] [Yang AICAS’20]
3D-HEVC [AQ-CNN]
VVC [Jin VCIP’17] [Jin PCM’17] [Jin ACCESS’18] [Wang ICIP’18] [Galpin DCC’19] [Pooling-Variable CNN] [Amna JRTIP’20] [DeepQTMT]