Reading: H-FCN — Hierarchical Fully Convolutional Network (VP9 & HEVC Fast Intra)

1.17% Increase in BD-Rate With 69.7% Time Reduction in VP9; On Par With ETH-CNN in HEVC

Sik-Ho Tsang
5 min read · Jul 28, 2020

In this story, Speeding up VP9 Intra Encoder with Hierarchical Deep Learning Based Partition Prediction (H-FCN) is briefly presented. I read this because I work on video coding research. In this paper:

  • A large database of VP9 superblocks and their corresponding partitions is created to train an H-FCN model.
  • Subsequently, the H-FCN model is integrated into the VP9 encoder to reduce the intra-mode encoding time.

This is a 2020 arXiv paper that has been accepted by 2020 TIP, where TIP has a high impact factor of 6.79. (Sik-Ho Tsang @ Medium)

Outline

  1. VP9 Superblock Partitioning
  2. A Large Database of VP9 Superblocks
  3. H-FCN: Network Architecture
  4. Experimental Results

1. VP9 Superblock Partitioning

VP9 Superblock Partitioning
  • In VP9, sizes of prediction blocks are decided by a recursive splitting of non-overlapping spatial units of size 64×64, called superblocks.
  • This recursive partition takes place at four hierarchical levels, possibly down to 4×4 blocks, through a search over the possible partitions at each level, guided by a rate-distortion optimization (RDO) process.
  • There are four partition choices at each of the four levels of the VP9 partition tree for each block at that level: no split, horizontal split, vertical split and four-quadrant split.
  • However, this RDO process greatly increases the encoder complexity.
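To make the source of this complexity concrete, the recursive search can be sketched as below. This is a toy illustration, not libvpx code: `rd_cost` is a hypothetical stand-in for the encoder's real rate-distortion computation, and the recursion mirrors the four partition choices at each level.

```python
# Toy sketch of the recursive RDO partition search (NOT libvpx code).
# `rd_cost` is a hypothetical stand-in for the real rate + lambda*distortion.

NONE, HORZ, VERT, SPLIT = range(4)  # the four partition choices per level

def rd_cost(x, y, size, mode):
    # Hypothetical deterministic cost, for illustration only.
    return (x + y) % size + mode + 1

def search(x, y, size):
    """Return (best_cost, partition_tree) for the block at (x, y)."""
    costs = {
        NONE: rd_cost(x, y, size, NONE),
        HORZ: rd_cost(x, y, size, HORZ),
        VERT: rd_cost(x, y, size, VERT),
    }
    best_mode = min(costs, key=costs.get)
    best = (costs[best_mode], best_mode)
    if size > 8:  # recurse down the hierarchy (8x8 blocks are the leaves here)
        h = size // 2
        sub = [search(x + dx, y + dy, h) for dy in (0, h) for dx in (0, h)]
        split_cost = rd_cost(x, y, size, SPLIT) + sum(c for c, _ in sub)
        if split_cost < best[0]:
            best = (split_cost, (SPLIT, [t for _, t in sub]))
    return best

cost, tree = search(0, 0, 64)  # exhaustive search over one superblock
```

Even this toy version evaluates every candidate at every level of every sub-block; H-FCN's goal is to predict the whole tree in one shot and skip most of this search.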

2. A Large Database of VP9 Superblocks

2.1. Hierarchical Labels

Hierarchical Labels
  • The partition tree was represented in the form of a set of four matrices, as shown above.
  • The content for the database comprises 89 movies and 17 television episodes, which were selected from video sources in the Netflix catalog.
  • Each video content was encoded at three different resolutions (1080p, 720p and 540p) using the reference VP9 encoder from the libvpx package.
  • The contents were encoded in VP9 Profile 0, using speed level 1 and the good quality configuration.
  • The raw pixel data for each superblock was obtained by extracting the luma channels of non-overlapping 64×64 blocks from the source videos downsampled to the encode resolution.
  • The database encompasses internal QP values in the range 8–105.
  • The above shows the summary of the training and validation database.
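The four-matrix label format can be sketched as follows. This is an assumed encoding for illustration: one matrix per level, level k holding a 2^k × 2^k grid of partition choices (0 = none, 1 = horizontal, 2 = vertical, 3 = split), so level 0 describes the 64×64 superblock and level 3 the 8×8 blocks.

```python
# Illustrative sketch of the hierarchical label format: one matrix per
# level; entry (i, j) at level k is the partition choice (0=none,
# 1=horizontal, 2=vertical, 3=split) of block (i, j) at that level.
labels = {lvl: [[0] * (2 ** lvl) for _ in range(2 ** lvl)] for lvl in range(4)}

# Example: the superblock splits into four 32x32 quadrants, and the
# top-left 32x32 block splits again; everything else stays unsplit.
labels[0][0][0] = 3
labels[1][0][0] = 3
```

One such set of four matrices fully describes a superblock's partition tree, which is what the database pairs with the raw 64×64 luma blocks.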

3. H-FCN: Network Architecture

H-FCN: Network Architecture
  • Architecture of H-FCN model having 26,336 parameters and 54,610 FLOPs.

3.1. Trunk

  • As shown above, there is a main trunk (blue) which consists of convolution and max pooling layers.

3.2. Branches

  • At certain points along the main trunk, there is a branch for each coding tree level, giving 4 branches in total.
  • At each branch, only convolutions are performed; there are no fully connected layers. That is why it is called a fully convolutional network (FCN).
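Because the network is fully convolutional, each branch's output is a spatial grid of per-block class scores rather than a flat vector. A minimal shape sketch (illustrative only; the paper's exact layer configuration differs):

```python
# Illustrative only: how a fully convolutional design maps one 64x64
# luma superblock to one prediction grid per level, with 4 partition
# class scores per block at that level.

def branch_output_shape(level):
    # Level k predicts one choice per (64 / 2**k)-sized block,
    # i.e. a 2**k x 2**k grid; each cell scores the 4 partition classes.
    grid = 2 ** level
    return (grid, grid, 4)

shapes = [branch_output_shape(k) for k in range(4)]
```

These four output grids line up exactly with the four label matrices of the database, which is what allows the model to be trained end-to-end on the hierarchical labels.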

3.3. Training

  • Categorical cross entropy loss is used:
H-FCN loss with training progress.
  • The prediction accuracy at each level was evaluated on 10⁵ randomly drawn samples from the training and validation sets:
Prediction accuracy of H-FCN model
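A minimal sketch of this loss in plain Python (not the paper's TensorFlow code): categorical cross entropy averaged over the blocks at each level, then summed across the four levels. Any per-level weighting the paper uses is omitted here.

```python
import math

# Sketch of the training objective: categorical cross entropy per block,
# averaged within each level, summed over the four levels.

def cross_entropy(probs, label):
    # probs: predicted distribution over the 4 partition classes
    return -math.log(probs[label])

def hfcn_loss(preds, labels):
    # preds[level] / labels[level]: flat lists of per-block predictions/labels
    total = 0.0
    for level in range(4):
        per_block = [cross_entropy(p, y) for p, y in zip(preds[level], labels[level])]
        total += sum(per_block) / len(per_block)
    return total

# One toy superblock: level k has 4**k blocks, all predicted alike.
preds = {k: [[0.7, 0.1, 0.1, 0.1]] * (4 ** k) for k in range(4)}
labels = {k: [0] * (4 ** k) for k in range(4)}
loss = hfcn_loss(preds, labels)
```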

3.4. Inconsistency Correction

  • At each level, the model predictions are made independently of all other levels.
  • Possible inconsistencies between the predictions of any two levels are corrected by a top-down approach.
Top-down inconsistency correction.
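A simplified sketch of such a top-down pass, using the four-matrix format (0 = none, 1 = horizontal, 2 = vertical, 3 = split): whenever a block is not predicted as a four-quadrant split, the predictions of its four children cannot take effect, so they are cleared, and corrections propagate from level 0 downward. The paper's exact correction rule may differ in details.

```python
# Simplified top-down consistency pass over the four prediction matrices.
SPLIT = 3

def correct_top_down(mats):
    # mats[k] is a 2**k x 2**k list-of-lists matrix for level k
    for level in range(3):  # parent levels 0..2
        n = 2 ** level
        for i in range(n):
            for j in range(n):
                if mats[level][i][j] != SPLIT:
                    # Parent did not split: its children's predictions
                    # are meaningless, so clear them to "no split".
                    for di in (0, 1):
                        for dj in (0, 1):
                            mats[level + 1][2 * i + di][2 * j + dj] = 0
    return mats

mats = [[[0] * (2 ** k) for _ in range(2 ** k)] for k in range(4)]
mats[0][0][0] = 3  # superblock splits
mats[1][1][1] = 3  # bottom-right 32x32 splits
mats[2][0][0] = 3  # inconsistent: its parent mats[1][0][0] is "no split"
corrected = correct_top_down(mats)
```

After the pass, the inconsistent level-2 prediction is cleared, while the consistent level-1 split survives.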
Visualization
  • The partitioning using H-FCN is quite close to the ground truth one.

4. Experimental Results

  • The trained model was integrated with the reference VP9 encoder using the TensorFlow C API.

4.1. BD-Rate

  • The encoding performance was evaluated on 30 test sequences at 3 resolutions in terms of both BD-rate and speedup (ΔT).
ΔT (%) and BD-rate (%) on 30 Testing Sequences
  • 1.17% BD-rate loss with 69.7% time reduction is achieved.
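For readers unfamiliar with the metric, the BD-rate numbers above come from the standard Bjøntegaard calculation; a minimal sketch is below. The RD points are invented for illustration, and production implementations differ in interpolation details.

```python
import numpy as np

# Sketch of the Bjontegaard-delta rate (BD-rate): fit log-bitrate as a
# cubic polynomial of PSNR for each encoder, integrate both fits over
# the overlapping PSNR range, and turn the mean log-rate difference
# into a percentage bitrate change at equal quality.

def bd_rate(rates_ref, psnr_ref, rates_test, psnr_test):
    p_ref = np.polyfit(psnr_ref, np.log(rates_ref), 3)
    p_test = np.polyfit(psnr_test, np.log(rates_test), 3)
    lo = max(min(psnr_ref), min(psnr_test))
    hi = min(max(psnr_ref), max(psnr_test))
    int_ref = np.polyval(np.polyint(p_ref), hi) - np.polyval(np.polyint(p_ref), lo)
    int_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)
    avg_diff = (int_test - int_ref) / (hi - lo)
    return (np.exp(avg_diff) - 1) * 100.0  # positive = test needs more bits

# Toy RD curves: the "test" encoder needs ~2% more bitrate everywhere.
psnr = [34.0, 36.0, 38.0, 40.0]
r_ref = [1000.0, 2000.0, 4000.0, 8000.0]
r_test = [r * 1.02 for r in r_ref]
bd = bd_rate(r_ref, psnr, r_test, psnr)
```

So a "1.17% BD-rate loss" means the fast encoder needs about 1.17% more bitrate than the full RDO search for the same quality.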

4.2. Comparison with VP9 Speed Level 4

  • The speedup and BD-rate of the approach were also compared with speed level 4 of the reference VP9 encoder, the highest recommended speed level for the baseline configuration:
ΔT (%) and BD-rate (%) on 30 Testing Sequences
  • H-FCN outperforms VP9 using speed level 4.
ΔT (%) Against QP.
  • The speedup benefit of the approach persists across the range of QP values used to train the H-FCN model.

4.3. Comparison with SOTA Approach in HEVC

  • H-FCN is implemented in HEVC to compare with ETH-CNN.
ΔT (%) and BD-rate (%) Using the Test Set Provided by Authors
  • Compared with ETH-CNN, H-FCN obtains a lower BD-rate loss, but also a smaller time reduction. To me, it is difficult to say which one is better.
ΔT (%) and BD-rate (%) Using the JCT-VC Test Set (Subset only)
  • Using a subset of the JCT-VC test set, H-FCN also obtains a lower BD-rate loss, but again a smaller time reduction.

This is the 25th story this month.
