Reading: Amna JRTIP’20 — Fast Intra‑Coding Unit Partition Decision in H.266/FVC (Fast VVC)
35% Average Encoding Time Reduction, With Only 1.7% Increase in BD-Rate
In this story, “Fast intra‑coding unit partition decision in H.266/FVC based on deep learning” (Amna JRTIP’20), by the University of Monastir, the University of Sfax, and King Khalid University, is presented. I read this because I work on video coding research. In this paper:
- Three-level CNN is designed to predict the intra-mode coding unit (CU) partition size.
This is a journal paper in 2020 Springer JRTIP. (Sik-Ho Tsang @ Medium)
Outline
- Statistical Analysis
- Network Architecture
- Experimental Results
1. Statistical Analysis
- In JEM-7.0, coding units (CUs) can be of non-square size as shown above, which is called binary tree (BT) partitioning. But it also increases the encoding complexity considerably.
- (Together with quad-tree (QT), it is called QTBT.)
- With BT disabled in JEM-7.0, the BD-rate increases by 5.4%, but the encoding time decreases by 88%!
- DQT is the depth of QT whereas DBT is the depth of BT.
- Starting from a CU at depth (DQT = 2, DBT = 0), the final selected decisions are: 80% square 16×16, 9% horizontal rectangular division 32×8, 8% vertical rectangular division 8×32, and 3% non-division (the 32×32 size is kept).
- When DQT = 3, only 30% of the initial 16×16 blocks are selected to be divided into 8×8 blocks.
- If non-partitioned square CUs can be predicted early, the RDO process of the considered CUs can be skipped, which greatly speeds up the QTBT partitioning process.
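The early-termination idea above can be sketched as follows. This is a hypothetical illustration, not the paper's actual encoder integration: the split probability would come from the CNN classifiers described later, and the 0.5 threshold is an assumption.

```python
# Hypothetical sketch of CNN-based early termination: when the classifier is
# confident a square CU will NOT be split, the RDO search over its
# sub-partitions is skipped. Names and threshold are illustrative only.

def should_run_partition_rdo(split_prob: float, threshold: float = 0.5) -> bool:
    """Return True if the RDO search over sub-CUs should still be run."""
    # Only skip the costly RDO when the model predicts "non-split";
    # otherwise fall back to the full QTBT search.
    return split_prob >= threshold

def encode_cu(split_prob: float) -> str:
    if should_run_partition_rdo(split_prob):
        return "full RDO over QTBT sub-partitions"
    return "encode CU as-is (RDO of sub-CUs skipped)"
```

Skipping only confidently non-split CUs keeps the quality loss small: uncertain CUs still go through the normal RDO search.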
2. Network Architecture
- The binary label for each CU is split (1) or non-split (0).
- Three CNN models predict the depth range of QT partitioning for 128×128, 64×64 and 32×32 CUs.
- Three separate CNN models, sharing the same deep CNN structure but with different kernel sizes, are trained to obtain classifiers at the three levels.
- The raw CTU is first preprocessed by a mean-removal module, used to reduce the variation of the CTU input samples, and then a down-sampling module is applied.
- The proposed CNN structure is composed of an input layer, three convolutional layers, a concatenation layer and two fully connected layers.
- The first convolutional layer uses 4×4 kernels (8 filters in total) with stride 4.
- The second and third layers sequentially convolve the data with 2×2 kernels and stride 2 (16 filters at the second layer, 24 at the third) to generate higher-level features.
- The vectorized features fed to the concatenation layer are collected from the second and third convolutional layers of the three models. This concatenation yields a mix of both global and local features.
- Next, the concatenated feature vector of each of the three models passes through the fully connected layers: two hidden fully connected layers successively generate feature vectors, and one output layer produces the outputs P1, P2 and P3, containing 1, 4 and 16 binary elements, respectively.
- ReLU is used.
- Since all the labels are binary, the sigmoid function is used to activate all the output layers.
- The cross-entropy function is used as the loss.
- 2000 images with resolution 4928×3264 are randomly selected from the raw image database [24] and divided into training (1700 images), validation (100 images) and test (200 images) sets [25].
- These images are further down-sampled to resolutions 768×512, 1536×1024 and 2880×1920, then coded with the JEM-7.0 software in intra-mode configuration to extract the database.
- Finally, after encoding with 4 QP values and the three square CU sizes, each CU with its corresponding binary label, split (1) or non-split (0), is stored as a database sample.
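To make the feature dimensions above concrete, here is a small sanity check of the convolution output sizes for an assumed 32×32 input with 'valid' (no-padding) convolutions. The paper uses different kernel sizes per model and a down-sampling module, so these numbers illustrate just one plausible configuration.

```python
def conv_out(size: int, kernel: int, stride: int) -> int:
    """Spatial output size of a 'valid' (no-padding) convolution."""
    return (size - kernel) // stride + 1

# Assumed 32x32 input (padding/input size are assumptions, not from the paper).
s1 = conv_out(32, kernel=4, stride=4)  # 1st layer: 4x4 kernels, stride 4 -> 8
s2 = conv_out(s1, kernel=2, stride=2)  # 2nd layer: 2x2 kernels, stride 2 -> 4
s3 = conv_out(s2, kernel=2, stride=2)  # 3rd layer: 2x2 kernels, stride 2 -> 2

# Features from the 2nd (16 filters) and 3rd (24 filters) layers are
# flattened and concatenated before the fully connected layers.
concat_len = s2 * s2 * 16 + s3 * s3 * 24

print(s1, s2, s3, concat_len)  # 8 4 2 352
```

So under these assumptions the fully connected layers would see a 352-element vector combining mid-level (4×4×16) and high-level (2×2×24) features.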
3. Experimental Results
3.1. BD-Rate
- 35% of the encoding time is saved on average compared to JEM-7.0, with only a slight BDBR increase of 1.7% on average.
- To compare with Jin ACCESS’18 [13], the proposed approach is also implemented in JEM-3.1.
- A 49.1% average time reduction is achieved, with a slight BDBR increase of 1%.
- (But it is quite difficult to say which one is better.)
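For readers unfamiliar with BDBR: the Bjøntegaard delta bit-rate measures the average bit-rate difference between two RD curves at equal PSNR. A minimal NumPy sketch of the classic computation (cubic fit of log-rate vs. PSNR, integrated over the overlapping PSNR range); the RD points below are invented for illustration.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjontegaard delta bit-rate (%) of `test` relative to `anchor`."""
    # Fit a cubic polynomial to log10(rate) as a function of PSNR per curve.
    p_a = np.polyfit(psnr_anchor, np.log10(rate_anchor), 3)
    p_t = np.polyfit(psnr_test, np.log10(rate_test), 3)
    # Integrate both fits over the overlapping PSNR range.
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    # Average log-rate gap, converted back to a percentage.
    avg_diff = (int_t - int_a) / (hi - lo)
    return (10 ** avg_diff - 1) * 100

# Invented example: the test codec spends ~2% more bits at equal quality.
rate_a = [1000, 2000, 4000, 8000]      # kbps at 4 QPs
psnr_a = [34.0, 36.5, 39.0, 41.5]      # dB
rate_t = [r * 1.02 for r in rate_a]
print(round(bd_rate(rate_a, psnr_a, rate_t, psnr_a), 2))  # 2.0
```

A positive BDBR means the test method needs more bits for the same quality, which is why the 1.7% figure above is reported as a (small) loss.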
3.2. RD Curves
- A slightly larger loss at low bitrates is observed for the CatRobot and BasketballDrive sequences.
This is the 12th story in this month.
Reference
[2020 JRTIP] [Amna JRTIP’20]
Fast intra‑coding unit partition decision in H.266/FVC based on deep learning
Codec Fast Prediction
H.264 to HEVC [Wei VCIP’17] [H-LSTM]
HEVC [Yu ICIP’15 / Liu ISCAS’16 / Liu TIP’16] [Laude PCS’16] [Li ICME’17] [Katayama ICICT’18] [Chang DCC’18] [ETH-CNN & ETH-LSTM] [Zhang RCAR’19] [Kim TCVST’19] [LFHI & LFSD & LFMD Using AK-CNN] [Yang AICAS’20]
3D-HEVC [AQ-CNN]
VVC [Jin VCIP’17] [Jin PCM’17] [Jin ACCESS’18] [Wang ICIP’18] [Galpin DCC’19] [Pooling-Variable CNN] [Amna JRTIP’20] [DeepQTMT]