Reading: Amna JRTIP’20 — Fast Intra‑Coding Unit Partition Decision in H.266/FVC (Fast VVC)

35% Average Encoding Time Reduction, With Only 1.7% Increase in BD-Rate

Sik-Ho Tsang
5 min read · Jul 17, 2020

In this story, “Fast intra‑coding unit partition decision in H.266/FVC based on deep learning” (Amna JRTIP’20), by University of Monastir, Sfax, and King Khalid University, is presented. I read this because I work on video coding research. In this paper:

  • A three-level CNN is designed to predict the intra-mode coding unit (CU) partition size.

This is a journal paper published in 2020 in the Springer Journal of Real-Time Image Processing (JRTIP). (Sik-Ho Tsang @ Medium)

Outline

  1. Statistical Analysis
  2. Network Architecture
  3. Experimental Results

1. Statistical Analysis

  • In JEM-7.0, coding units (CUs) can take non-square sizes, which is called binary tree (BT) partitioning. However, BT also increases encoding complexity considerably.
  • (Together with quad-tree (QT), it is called QTBT.)
BT disabled in JEM-7.0
  • With BT disabled in JEM-7.0, the BD-rate increases by 5.4%, while the encoding time decreases by 88%!
Block size division distribution for QT depths 2, 3 and 4
  • DQT is the depth of QT whereas DBT is the depth of BT.
  • Starting from a CU at depth (DQT = 2, DBT = 0), the final selected decisions are: 80% square 16×16, 9% horizontal rectangular division 32×8, 8% vertical rectangular division 8×32, and 3% non-division, keeping the 32×32 size.
  • When DQT = 3, only 30% of the initial 16×16 blocks are selected to be divided into 8×8 blocks.
  • If non-partitioned square CUs can be predicted early, the RDO process for those CUs can be skipped, which greatly speeds up the QTBT partitioning process (see the sketch after this list).
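Below is a minimal Python sketch of how such an early non-split prediction could prune the recursive QTBT search. This is illustrative scaffolding rather than the paper's encoder code: `rd_cost` and `predict_split` are hypothetical stand-ins for the real RDO evaluation and the trained CNN classifier.

```python
from dataclasses import dataclass

@dataclass
class CU:
    """A coding unit, described by its position and size (hypothetical)."""
    x: int
    y: int
    w: int
    h: int

    def is_square(self) -> bool:
        return self.w == self.h

def allowed_splits(cu: CU, min_size: int = 8):
    """QTBT split modes available for this CU (simplified rules)."""
    modes = []
    if cu.is_square() and cu.w > min_size:
        modes.append("QT")
    if cu.h > min_size:
        modes.append("BT_H")
    if cu.w > min_size:
        modes.append("BT_V")
    return modes

def split_cu(cu: CU, mode: str):
    """Children produced by a quad or binary split."""
    if mode == "QT":
        hw, hh = cu.w // 2, cu.h // 2
        return [CU(cu.x + dx, cu.y + dy, hw, hh)
                for dy in (0, hh) for dx in (0, hw)]
    if mode == "BT_H":
        hh = cu.h // 2
        return [CU(cu.x, cu.y, cu.w, hh), CU(cu.x, cu.y + hh, cu.w, hh)]
    hw = cu.w // 2
    return [CU(cu.x, cu.y, hw, cu.h), CU(cu.x + hw, cu.y, hw, cu.h)]

def search_qtbt(cu: CU, rd_cost, predict_split) -> float:
    """Recursive QTBT search with CNN-based early termination.

    `rd_cost(cu)` returns the RD cost of coding the CU as-is;
    `predict_split(cu)` is the CNN classifier (1 = split, 0 = non-split).
    """
    # Early termination: if the CNN predicts "non-split" for a square CU,
    # skip the RDO of all child partitions entirely.
    if cu.is_square() and predict_split(cu) == 0:
        return rd_cost(cu)

    # Otherwise, the usual exhaustive search over all allowed splits.
    best = rd_cost(cu)
    for mode in allowed_splits(cu):
        cost = sum(search_qtbt(child, rd_cost, predict_split)
                   for child in split_cu(cu, mode))
        best = min(best, cost)
    return best
```

Each pruned CU removes the full RD evaluation of its subtree, which is why the statistics above translate into large time savings.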

2. Network Architecture

  • The binary label for each CU is split (1) or non-split (0).
  • Three CNN models are used to predict the depth range of QT partitioning for 128×128, 64×64 and 32×32 CUs.
  • The three separate CNN models share the same deep CNN structure with different kernel sizes, and are trained to obtain classifiers at the three levels.
  • The raw CTU is first preprocessed by a mean-removal module, used to reduce the variation of the CTU input samples, and then a down-sampling module is applied.
  • The proposed CNN structure is composed of an input layer, three convolutional layers, a concatenation layer and two fully connected layers.
Details of Network Architecture
  • The first convolutional layer uses 4×4 kernels (8 filters in total) with stride 4.
  • At the second and third layers, the data are sequentially convolved with 2×2 kernels with stride 2 (16 filters in the second layer and 24 filters in the third) to generate higher-level features.
  • The vectorized features of the concatenation layer are collected from the second and third convolutional layers of the three models. This concatenation is computed to obtain a variety of both global and local features.
  • Next, all features in the concatenated vector are processed in the three models by the fully connected stage: two hidden fully-connected layers successively generate feature vectors, and one output layer produces the P1, P2 and P3 outputs, which contain 1, 4 and 16 binary elements, respectively.
  • ReLU is used.
  • Since all the labels are binary, the sigmoid function is used to activate all the output layers. (A PyTorch sketch of the full structure appears at the end of this section.)
  • The cross-entropy loss function is used:
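Since all labels are binary and the outputs are sigmoid activations, this is presumably the standard binary cross-entropy over N training samples:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\Big[y_i\log\hat{y}_i + (1-y_i)\log\big(1-\hat{y}_i\big)\Big]$$

where $y_i \in \{0, 1\}$ is the ground-truth split label of the i-th CU and $\hat{y}_i$ is the corresponding sigmoid output.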
  • 2000 images with resolution 4928×3264 are arbitrarily selected from a raw-image database [24] and randomly divided into training (1700 images), validation (100 images) and test (200 images) sets [25].
  • These images are further down-sampled to resolutions 768×512, 1536×1024 and 2880×1920, then coded with the JEM-7.0 software in intra-mode configuration to build the authors’ own database.
  • Finally, after encoding with 4 QP values and the three square CU size possibilities, each CU, together with its binary label indicating the split (1) or non-split (0) decision, is taken as a database sample. (A sketch of this collection loop also follows below.)
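To tie the numbers above together, here is a minimal PyTorch sketch of one of the three per-level classifiers, assuming a 64×64 luma input. The kernel sizes, strides and filter counts follow the bullets above; the hidden fully-connected widths (64 and 48) are guesses, since they are not stated here, and the cross-model feature concatenation is simplified to a single model.

```python
import torch
import torch.nn as nn

class LevelCNN(nn.Module):
    """Sketch of one per-level split classifier (64x64 luma input assumed)."""

    def __init__(self):
        super().__init__()
        # Layer sizes from the description: 4x4/stride-4, then two 2x2/stride-2.
        self.conv1 = nn.Conv2d(1, 8, kernel_size=4, stride=4)    # -> 8 x 16 x 16
        self.conv2 = nn.Conv2d(8, 16, kernel_size=2, stride=2)   # -> 16 x 8 x 8
        self.conv3 = nn.Conv2d(16, 24, kernel_size=2, stride=2)  # -> 24 x 4 x 4
        self.relu = nn.ReLU()
        feat = 16 * 8 * 8 + 24 * 4 * 4   # concatenated conv2 + conv3 features
        self.fc1 = nn.Linear(feat, 64)   # hidden widths are assumptions
        self.fc2 = nn.Linear(64, 48)
        # Output heads P1, P2, P3 with 1, 4 and 16 binary elements.
        self.p1 = nn.Linear(48, 1)
        self.p2 = nn.Linear(48, 4)
        self.p3 = nn.Linear(48, 16)

    def forward(self, x):
        # Mean removal, reducing the variation of the input samples.
        x = x - x.mean(dim=(2, 3), keepdim=True)
        c1 = self.relu(self.conv1(x))
        c2 = self.relu(self.conv2(c1))
        c3 = self.relu(self.conv3(c2))
        # Concatenate vectorized features of the 2nd and 3rd conv layers
        # to mix features at two scales ("global and local" in the paper).
        f = torch.cat([c2.flatten(1), c3.flatten(1)], dim=1)
        f = self.relu(self.fc2(self.relu(self.fc1(f))))
        # Sigmoid on every output, since all labels are binary.
        return (torch.sigmoid(self.p1(f)),
                torch.sigmoid(self.p2(f)),
                torch.sigmoid(self.p3(f)))

# Quick shape check: a batch of four 64x64 luma CTUs.
p1, p2, p3 = LevelCNN()(torch.randn(4, 1, 64, 64))
print(p1.shape, p2.shape, p3.shape)  # (4, 1), (4, 4), (4, 16)
```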
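Similarly, a compact sketch of the database-collection loop just described. The QP set (22, 27, 32, 37) is an assumption based on common all-intra test conditions, and `encode_and_parse` is a hypothetical helper wrapping the JEM-7.0 encoder and its partition decisions:

```python
import itertools

QPS = (22, 27, 32, 37)             # assumed: the paper only says "4 QP values"
CU_SIZES = (128, 64, 32)           # the three square CU levels
RESOLUTIONS = ((768, 512), (1536, 1024), (2880, 1920))

def build_database(encode_and_parse):
    """Collect (luma_patch, label) samples per CU size.

    `encode_and_parse(resolution, qp)` is a hypothetical helper that runs
    JEM-7.0 in all-intra mode and yields (cu_size, luma_patch, split)
    tuples recovered from the chosen partitioning.
    """
    samples = {size: [] for size in CU_SIZES}
    for resolution, qp in itertools.product(RESOLUTIONS, QPS):
        for size, patch, split in encode_and_parse(resolution, qp):
            samples[size].append((patch, int(split)))  # 1 = split, 0 = non-split
    return samples
```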

3. Experimental Results

3.1. BD-Rate

BD-Rate and Time on FVC/VVC Test Sequences Using JEM-7.0 Under AI
  • 35% of the encoding time is saved on average compared to JEM-7.0, with a slight BDBR increase of 1.7%.
BD-Rate and Time on FVC/VVC Test Sequences Using JEM-3.1 Under AI
  • To compare with Jin ACCESS’18 [13], the proposed approach is also implemented in JEM-3.1.
  • A 49.1% average time reduction is achieved, with a slight BDBR increase of 1%.
  • (But it is quite difficult to say which one is better.)

3.2. RD Curves

RD Curves Using JEM-7.0
  • A slightly larger loss at low-bitrate conditions is observed for the CatRobot and BasketballDrive sequences.

This is the 12th story this month.
