Reading: H-FCN — Hierarchical Fully Convolutional Network (VP9 & HEVC Fast Intra)
1.17% Increase in BD-Rate With 69.7% Time Reduction in VP9; On Par With ETH-CNN in HEVC
In this story, Speeding up VP9 Intra Encoder with Hierarchical Deep Learning Based Partition Prediction (H-FCN), is briefly reviewed. I read this paper because I work on video coding research. In this paper:
- A large database of VP9 superblocks and their corresponding partitions is created to train an H-FCN model.
- Subsequently, the trained H-FCN model is integrated into the VP9 encoder to reduce the intra-mode encoding time.
This is a paper posted on 2020 arXiv and accepted by 2020 TIP, where TIP has a high impact factor of 6.79. (Sik-Ho Tsang @ Medium)
Outline
- VP9 Superblock Partitioning
- A Large Database of VP9 Superblocks
- H-FCN: Network Architecture
- Experimental Results
1. VP9 Superblock Partitioning
- In VP9, sizes of prediction blocks are decided by a recursive splitting of non-overlapping spatial units of size 64×64, called superblocks.
- This recursive partition takes place at four hierarchical levels, possibly down to 4×4 blocks, through a search over the possible partitions at each level, guided by a rate-distortion optimization (RDO) process.
- There are four partition choices at each of the four levels of the VP9 partition tree for each block at that level: no split, horizontal split, vertical split and four-quadrant split.
- However, this RDO process largely increases the encoder complexity.
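To get a feel for why this search is expensive, the number of candidate partition trees can be counted with a small recursion. This is an illustrative sketch: in VP9 only the four-quadrant split recurses, while horizontal and vertical splits are terminal.

```python
# Count candidate partition trees an exhaustive RDO search must consider.
def count_partitions(block_size: int) -> int:
    # 8x8 blocks: none, horizontal, vertical, or split into 4x4 leaves.
    if block_size == 8:
        return 4
    # none/horizontal/vertical are terminal; only the four-quadrant
    # split recurses into four half-size sub-blocks.
    return 3 + count_partitions(block_size // 2) ** 4

print(count_partitions(16))   # 259
print(count_partitions(64))   # ~4.1e38 candidate trees per superblock
```

Even though the encoder prunes aggressively and never enumerates all of these, the size of the space explains why a learned predictor can save so much time.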
2. A Large Database of VP9 Superblocks
2.1. Hierarchical Labels
- The partition tree was represented in the form of a set of four matrices, as shown above.
- The content for the database comprises 89 movies and 17 television episodes, which were selected from video sources in the Netflix catalog.
- Each video content was encoded at three different resolutions (1080p, 720p and 540p) using the reference VP9 encoder from the libvpx package.
- The contents were encoded in VP9 Profile 0, using speed level 1 and the good quality configuration.
- The raw pixel data for each superblock was obtained by extracting the luma channels of non-overlapping 64×64 blocks from the source videos downsampled to the encode resolution.
- The database encompasses internal QP values in the range 8–105.
- The table above summarizes the training and validation database.
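As a sketch of this hierarchical label representation (the exact encoding conventions here are my assumptions, not taken from the paper), the four matrices can be held as grids of sizes 1×1, 2×2, 4×4 and 8×8, one entry per block at each level, with labels 0–3 for the four partition choices:

```python
# Hypothetical encoding of one superblock's partition tree as four label
# grids, one per level: 0 = none, 1 = horizontal, 2 = vertical, 3 = split.
# The grid sizes (1x1 .. 8x8) are my reading of the four-matrix idea.
GRID_SIZES = {64: 1, 32: 2, 16: 4, 8: 8}
labels = {bs: [[0] * n for _ in range(n)] for bs, n in GRID_SIZES.items()}

# Example: quad-split the superblock, then horizontally split its
# top-left 32x32 block; everything else stays unsplit.
labels[64][0][0] = 3
labels[32][0][0] = 1

for bs, grid in labels.items():
    print(bs, len(grid), "x", len(grid[0]))
```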
3. H-FCN: Network Architecture
- Architecture of the H-FCN model, which has 26,336 parameters and 54,610 FLOPs.
3.1. Trunk
- As shown above, there is a main trunk (blue) which consists of convolution and max pooling layers.
3.2. Branches
- At certain points along the main trunk, there is a branch for each coding tree level, giving 4 branches in total.
- At each branch, convolutions are performed. There are no fully connected layers, which is why it is called a fully convolutional network (FCN).
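Because every branch is convolutional, the spatial grid each branch emits must match its level's label matrix. A shape-only sketch (the pooling placement is an assumption; only the target grid sizes follow from the hierarchical labels):

```python
# Trace spatial sizes through the trunk: 'same'-padded convs preserve
# the size, each 2x2 max-pool halves it.
def pool(hw: int, stride: int = 2) -> int:
    return hw // stride

hw = 64                      # luma superblock input
for _ in range(3):           # assumed: three conv + max-pool stages
    hw = pool(hw)
# hw is now 8: one prediction cell per 8x8 block (the finest level)

# Each coarser branch needs a proportionally smaller output grid.
branch_grids = {8: hw, 16: hw // 2, 32: hw // 4, 64: hw // 8}
print(branch_grids)          # {8: 8, 16: 4, 32: 2, 64: 1}
```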
3.3. Training
- Categorical cross entropy loss is used:
- The prediction accuracy at each level was evaluated on 10⁵ randomly drawn samples from the training and validation sets:
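For reference, a minimal version of the per-level categorical cross-entropy over the four partition classes (the overall training loss sums this across the four levels; this sketch is mine, not the authors' code):

```python
import math

# probs: one length-4 softmax vector per prediction cell;
# labels: the ground-truth class index (0-3) per cell.
def categorical_cross_entropy(probs, labels):
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)

probs = [[0.7, 0.1, 0.1, 0.1], [0.25, 0.25, 0.25, 0.25]]
labels = [0, 3]
loss = categorical_cross_entropy(probs, labels)
print(round(loss, 4))  # 0.8715
```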
3.4. Inconsistency Correction
- At each level, the model predictions are made independently of all other levels.
- Possible inconsistencies between the predictions of any two levels are corrected by a top-down approach.
- The partitioning predicted by H-FCN is quite close to the ground truth.
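A minimal sketch of such a top-down pass; the override convention here, resetting children of a non-split block to "no split", is my assumption:

```python
SPLIT = 3  # four-quadrant split label

def correct_top_down(coarse, fine):
    """If an n x n coarse-level block is not split, override its four
    children in the 2n x 2n fine-level grid to 'no split' (0)."""
    n = len(coarse)
    for i in range(n):
        for j in range(n):
            if coarse[i][j] != SPLIT:
                for di in (0, 1):
                    for dj in (0, 1):
                        fine[2 * i + di][2 * j + dj] = 0
    return fine

coarse = [[SPLIT, 0], [1, SPLIT]]
fine = [[SPLIT] * 4 for _ in range(4)]
print(correct_top_down(coarse, fine))
```

Only the children of the two quad-split coarse blocks keep their predictions; the rest are forced consistent.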
4. Experimental Results
- The trained model was integrated with the reference VP9 encoder using the TensorFlow C API.
4.1. BD-Rate
- The encoding performance was evaluated on 30 test sequences at 3 resolutions in terms of both BD-rate and speedup (ΔT).
- A 1.17% BD-rate loss with a 69.7% encoding time reduction is achieved.
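BD-rate here is the standard Bjøntegaard delta: log-rate is fitted as a cubic in PSNR for both the anchor and the test encoder, and the average gap between the two fits over the common PSNR range gives the percent rate change. A sketch of the standard method, not the authors' script:

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    # Fit log-rate as a cubic polynomial in PSNR for each encoder.
    p1 = np.polyfit(psnr_anchor, np.log(rate_anchor), 3)
    p2 = np.polyfit(psnr_test, np.log(rate_test), 3)
    # Integrate both fits over the overlapping PSNR interval.
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    int1 = np.polyval(np.polyint(p1), hi) - np.polyval(np.polyint(p1), lo)
    int2 = np.polyval(np.polyint(p2), hi) - np.polyval(np.polyint(p2), lo)
    avg_log_diff = (int2 - int1) / (hi - lo)
    return (np.exp(avg_log_diff) - 1) * 100.0  # percent rate change

# Sanity check: identical RD curves give a 0% BD-rate.
r = [100.0, 200.0, 400.0, 800.0]
q = [30.0, 33.0, 36.0, 39.0]
print(round(bd_rate(r, q, r, q), 6))  # 0.0
```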
4.2. Comparison with VP9 Speed Level 4
- The speedup and BD-rate of the approach were also compared with speed level 4 of the reference VP9 encoder, the highest recommended speed level for the baseline configuration:
- H-FCN outperforms VP9 using speed level 4.
- The speedup achieved by the approach persists across the range of QP values used to train the H-FCN model.
4.3. Comparison with SOTA Approach in HEVC
- H-FCN is implemented in HEVC to compare with ETH-CNN.
- Compared with ETH-CNN, H-FCN obtains a lower BD-rate loss, but also a smaller time reduction. It is difficult for me to say which one is better.
- Using a subset of the JCT-VC test set, H-FCN again obtains a lower BD-rate loss, but still a smaller time reduction.
This is the 25th story this month.
Reference
[2020 arXiv] [H-FCN]
Speeding up VP9 Intra Encoder with Hierarchical Deep Learning Based Partition Prediction
Codec Fast Prediction
H.264 to HEVC [Wei VCIP’17] [H-LSTM]
HEVC [Yu ICIP’15 / Liu ISCAS’16 / Liu TIP’16] [Laude PCS’16] [Li ICME’17] [Katayama ICICT’18] [Chang DCC’18] [ETH-CNN & ETH-LSTM] [Zhang RCAR’19] [Kim TCVST’19] [LFHI & LFSD & LFMD Using AK-CNN] [Yang AICAS’20] [H-FCN]
3D-HEVC [AQ-CNN]
VP9 [H-FCN]
VVC [Jin VCIP’17] [Jin PCM’17] [Jin ACCESS’18] [Wang ICIP’18] [Galpin DCC’19] [Pooling-Variable CNN] [Amna JRTIP’20] [DeepQTMT]