Reading: H-FCN — Hierarchical Fully Convolutional Network (VP9 & HEVC Fast Intra)

1.17% Increase in BD-Rate With 69.7% Time Reduction in VP9; On Par With ETH-CNN in HEVC

Sik-Ho Tsang
5 min read · Jul 28, 2020

In this story, Speeding up VP9 Intra Encoder with Hierarchical Deep Learning Based Partition Prediction (H-FCN) is briefly presented. I read this because I work on video coding research. In this paper:

  • A large database of VP9 superblocks and their corresponding partitions is created to train an H-FCN model.
  • Subsequently, the H-FCN model is integrated into the VP9 encoder to reduce the intra-mode encoding time.

This is a 2020 arXiv paper that has been accepted by 2020 TIP, where TIP has a high impact factor of 6.79. (Sik-Ho Tsang @ Medium)

Outline

  1. VP9 Superblock Partitioning
  2. A Large Database of VP9 Superblocks
  3. H-FCN: Network Architecture
  4. Experimental Results

1. VP9 Superblock Partitioning

VP9 Superblock Partitioning
  • In VP9, sizes of prediction blocks are decided by a recursive splitting of non-overlapping spatial units of size 64×64, called superblocks.
  • This recursive partition takes place at four hierarchical levels, possibly down to 4×4 blocks, through a search over the possible partitions at each level, guided by a rate-distortion optimization (RDO) process.
  • There are four partition choices at each of the four levels of the VP9 partition tree for each block at that level: no split, horizontal split, vertical split and four-quadrant split.
  • However, this RDO process greatly increases the encoder complexity.
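To make the source of this complexity concrete, the recursive search can be sketched as below. This is a toy illustration, not libvpx code: `rd_cost` is a hypothetical stand-in for the encoder's real rate-distortion computation, and the recursion mirrors the four partition choices at each level.

```python
# Toy sketch of the recursive RDO partition search (NOT libvpx code).
# `rd_cost` is a hypothetical stand-in for the real rate + lambda*distortion.

NONE, HORZ, VERT, SPLIT = range(4)  # the four partition choices per level

def rd_cost(x, y, size, mode):
    # Hypothetical deterministic cost, for illustration only.
    return (x + y) % size + mode + 1

def search(x, y, size):
    """Return (best_cost, partition_tree) for the block at (x, y)."""
    costs = {
        NONE: rd_cost(x, y, size, NONE),
        HORZ: rd_cost(x, y, size, HORZ),
        VERT: rd_cost(x, y, size, VERT),
    }
    best_mode = min(costs, key=costs.get)
    best = (costs[best_mode], best_mode)
    if size > 8:  # recurse down the hierarchy (8x8 blocks are the leaves here)
        h = size // 2
        sub = [search(x + dx, y + dy, h) for dy in (0, h) for dx in (0, h)]
        split_cost = rd_cost(x, y, size, SPLIT) + sum(c for c, _ in sub)
        if split_cost < best[0]:
            best = (split_cost, (SPLIT, [t for _, t in sub]))
    return best

cost, tree = search(0, 0, 64)  # exhaustive search over one superblock
```

Even this toy version evaluates every candidate at every level of every sub-block; H-FCN's goal is to predict the whole tree in one shot and skip most of this search.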

2. A Large Database of VP9 Superblocks

2.1. Hierarchical Labels

Hierarchical Labels
  • The partition tree was represented in the form of a set of four matrices, as shown above.
  • The content for the database comprises 89 movies and 17 television episodes, which were selected from video sources in the Netflix catalog.
  • Each video content was encoded at three different resolutions (1080p, 720p and 540p) using the reference VP9 encoder from the libvpx package.
  • The contents were encoded in VP9 Profile 0, using speed level 1 and the good quality configuration.
  • The raw pixel data for each superblock was obtained by extracting the luma channels of non-overlapping 64×64 blocks from the source videos downsampled to the encode resolution.
  • The database encompasses internal QP values in the range 8–105.
  • The above shows the summary of the training and validation database.
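The four-matrix label format can be sketched as follows. This is an assumed encoding for illustration: one matrix per level, level k holding a 2^k × 2^k grid of partition choices (0 = none, 1 = horizontal, 2 = vertical, 3 = split), so level 0 describes the 64×64 superblock and level 3 the 8×8 blocks.

```python
# Illustrative sketch of the hierarchical label format: one matrix per
# level; entry (i, j) at level k is the partition choice (0=none,
# 1=horizontal, 2=vertical, 3=split) of block (i, j) at that level.
labels = {lvl: [[0] * (2 ** lvl) for _ in range(2 ** lvl)] for lvl in range(4)}

# Example: the superblock splits into four 32x32 quadrants, and the
# top-left 32x32 block splits again; everything else stays unsplit.
labels[0][0][0] = 3
labels[1][0][0] = 3
```

One such set of four matrices fully describes a superblock's partition tree, which is what the database pairs with the raw 64×64 luma blocks.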

3. H-FCN: Network Architecture

H-FCN: Network Architecture
  • Architecture of H-FCN model having 26,336 parameters and 54,610 FLOPs.

3.1. Trunk

  • As shown above, there is a main trunk (blue) which consists of convolution and max pooling layers.

3.2. Branches

  • At certain points along the main trunk, there is a branch for each coding tree level, giving 4 branches in total.
  • At each branch, only convolutions are performed; there are no fully connected layers. That is why it is called a fully convolutional network (FCN).
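Because the network is fully convolutional, each branch's output is a spatial grid of per-block class scores rather than a flat vector. A minimal shape sketch (illustrative only; the paper's exact layer configuration differs):

```python
# Illustrative only: how a fully convolutional design maps one 64x64
# luma superblock to one prediction grid per level, with 4 partition
# class scores per block at that level.

def branch_output_shape(level):
    # Level k predicts one choice per (64 / 2**k)-sized block,
    # i.e. a 2**k x 2**k grid; each cell scores the 4 partition classes.
    grid = 2 ** level
    return (grid, grid, 4)

shapes = [branch_output_shape(k) for k in range(4)]
```

These four output grids line up exactly with the four label matrices of the database, which is what allows the model to be trained end-to-end on the hierarchical labels.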

3.3. Training

  • Categorical cross entropy loss is used:
H-FCN loss with training progress.
  • The prediction accuracy at each level was evaluated on 10⁵ randomly drawn samples from the training and validation sets:
Prediction accuracy of H-FCN model
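A minimal sketch of this loss in plain Python (not the paper's TensorFlow code): categorical cross entropy averaged over the blocks at each level, then summed across the four levels. Any per-level weighting the paper uses is omitted here.

```python
import math

# Sketch of the training objective: categorical cross entropy per block,
# averaged within each level, summed over the four levels.

def cross_entropy(probs, label):
    # probs: predicted distribution over the 4 partition classes
    return -math.log(probs[label])

def hfcn_loss(preds, labels):
    # preds[level] / labels[level]: flat lists of per-block predictions/labels
    total = 0.0
    for level in range(4):
        per_block = [cross_entropy(p, y) for p, y in zip(preds[level], labels[level])]
        total += sum(per_block) / len(per_block)
    return total

# One toy superblock: level k has 4**k blocks, all predicted alike.
preds = {k: [[0.7, 0.1, 0.1, 0.1]] * (4 ** k) for k in range(4)}
labels = {k: [0] * (4 ** k) for k in range(4)}
loss = hfcn_loss(preds, labels)
```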

3.4. Inconsistency Correction

  • At each level, the model predictions are made independently of all other levels.
  • Possible inconsistencies between the predictions of any two levels are corrected by a top-down approach.
Top-down inconsistency correction.
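A simplified sketch of such a top-down pass, using the four-matrix format (0 = none, 1 = horizontal, 2 = vertical, 3 = split): whenever a block is not predicted as a four-quadrant split, the predictions of its four children cannot take effect, so they are cleared, and corrections propagate from level 0 downward. The paper's exact correction rule may differ in details.

```python
# Simplified top-down consistency pass over the four prediction matrices.
SPLIT = 3

def correct_top_down(mats):
    # mats[k] is a 2**k x 2**k list-of-lists matrix for level k
    for level in range(3):  # parent levels 0..2
        n = 2 ** level
        for i in range(n):
            for j in range(n):
                if mats[level][i][j] != SPLIT:
                    # Parent did not split: its children's predictions
                    # are meaningless, so clear them to "no split".
                    for di in (0, 1):
                        for dj in (0, 1):
                            mats[level + 1][2 * i + di][2 * j + dj] = 0
    return mats

mats = [[[0] * (2 ** k) for _ in range(2 ** k)] for k in range(4)]
mats[0][0][0] = 3  # superblock splits
mats[1][1][1] = 3  # bottom-right 32x32 splits
mats[2][0][0] = 3  # inconsistent: its parent mats[1][0][0] is "no split"
corrected = correct_top_down(mats)
```

After the pass, the inconsistent level-2 prediction is cleared, while the consistent level-1 split survives.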
Visualization
  • The partitioning using H-FCN is quite close to the ground truth one.

4. Experimental Results

  • The trained model was integrated with the reference VP9 encoder using the TensorFlow C API.

4.1. BD-Rate

  • The encoding performance was evaluated on 30 test sequences at 3 resolutions in terms of both BD-rate and speedup (ΔT).
ΔT (%) and BD-rate (%) on 30 Testing Sequences
  • 1.17% BD-rate loss with 69.7% time reduction is achieved.
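For readers unfamiliar with the metric, the BD-rate numbers above come from the standard Bjøntegaard calculation; a minimal sketch is below. The RD points are invented for illustration, and production implementations differ in interpolation details.

```python
import numpy as np

# Sketch of the Bjontegaard-delta rate (BD-rate): fit log-bitrate as a
# cubic polynomial of PSNR for each encoder, integrate both fits over
# the overlapping PSNR range, and turn the mean log-rate difference
# into a percentage bitrate change at equal quality.

def bd_rate(rates_ref, psnr_ref, rates_test, psnr_test):
    p_ref = np.polyfit(psnr_ref, np.log(rates_ref), 3)
    p_test = np.polyfit(psnr_test, np.log(rates_test), 3)
    lo = max(min(psnr_ref), min(psnr_test))
    hi = min(max(psnr_ref), max(psnr_test))
    int_ref = np.polyval(np.polyint(p_ref), hi) - np.polyval(np.polyint(p_ref), lo)
    int_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)
    avg_diff = (int_test - int_ref) / (hi - lo)
    return (np.exp(avg_diff) - 1) * 100.0  # positive = test needs more bits

# Toy RD curves: the "test" encoder needs ~2% more bitrate everywhere.
psnr = [34.0, 36.0, 38.0, 40.0]
r_ref = [1000.0, 2000.0, 4000.0, 8000.0]
r_test = [r * 1.02 for r in r_ref]
bd = bd_rate(r_ref, psnr, r_test, psnr)
```

So a "1.17% BD-rate loss" means the fast encoder needs about 1.17% more bitrate than the full RDO search for the same quality.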

4.2. Comparison with VP9 Speed Level 4

  • The speedup and BD-rate of the approach were also compared with speed level 4 of the reference VP9 encoder, the highest recommended speed level for the baseline configuration:
ΔT (%) and BD-rate (%) on 30 Testing Sequences
  • H-FCN outperforms VP9 using speed level 4.
ΔT (%) Against QP.
  • The speedup benefit of the approach persists across the range of QP values used to train the H-FCN model.

4.3. Comparison with SOTA Approach in HEVC

  • H-FCN is implemented in HEVC to compare with ETH-CNN.
ΔT (%) and BD-rate (%) Using the Test Set Provided by Authors
  • Compared with ETH-CNN, H-FCN obtains a lower BD-rate loss, but also a smaller time reduction. To me, it is difficult to say which one is better.
ΔT (%) and BD-rate (%) Using the JCT-VC Test Set (Subset only)
  • Using a subset of the JCT-VC test set, H-FCN also obtains a lower BD-rate loss, but again a smaller time reduction.

This is the 25th story this month.
