Reading: FRCNN — Fractional-Pixel Reference Generation CNN (HEVC Inter Prediction)
VRCNN-Like Network, Outperforms CNNIF, 3.9%, 2.7%, and 1.3% Bits Saving Compared with HEVC, Under LDP, LDB, and RA configurations, Respectively
In this story, Fractional-Pixel Reference Generation CNN (FRCNN), by the University of Science and Technology of China, Microsoft Research Asia (MSRA), and the University of Missouri-Kansas City, is presented. I read this because I work on video coding research. In this paper:
- FRCNN is designed to generate the fractional pixels based on the integer-pel pixels.
- The most important issue for this interpolation problem is how to obtain ground-truth samples for training, since the fractional samples are not actually available in the original videos.
This is a paper in 2019 TCSVT, where TCSVT has a high impact factor of 4.046. (Sik-Ho Tsang @ Medium)
Outline
- FRCNN: Network Architecture
- Sample Collection
- Training & HEVC Implementation
- Experimental Results
1. FRCNN: Network Architecture
1.1. Input & Output
- Integer-Pel Pixels (Left): are Ai,j. Interpolation is performed to obtain ai,j to ni,j, which are sub-pel pixels including half-pel and quarter-pel pixels.
- Traditional Methods (Top-Right): predict all half-pel and quarter-pel pixels altogether based on the reference pixels (Red), i.e. integer-pel pixels.
- FRCNN (Bottom-Right): The input is the reference pixels (Red), i.e. integer-pel pixels. By going through the CNN, the predicted blocks at the different sub-pel positions are output.
- In HEVC, the interpolated block is 4× larger in each of the x and y directions, i.e. 16× larger than the reference block, so there are 15 sub-pel positions besides the integer one. Therefore, 15 FRCNN models are used, one per sub-pel position (see the sketch below).
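Below is a minimal Python sketch (my own illustration, with assumed names such as `models` and `integer_block`, not code from the paper) of how the 15 fractional positions map onto 15 separate models: each model takes the same integer-pel reference block and outputs the block at one specific quarter-pel offset.

```python
# Minimal illustration (assumed names, not the authors' code): one trained
# FRCNN model per fractional position. A quarter-pel offset (dx, dy) with
# dx, dy in {0, 1, 2, 3} gives 16 positions; excluding (0, 0) leaves the
# 15 sub-pel positions a..n, each needing its own model.
FRAC_OFFSETS = [(dx, dy) for dy in range(4) for dx in range(4) if (dx, dy) != (0, 0)]

def generate_fractional_references(models, integer_block):
    """`models` is a hypothetical dict {(dx, dy): trained FRCNN model};
    returns one predicted reference block per fractional position."""
    return {(dx, dy): models[(dx, dy)](integer_block) for (dx, dy) in FRAC_OFFSETS}
```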
1.2. Network Architecture
- FRCNN actually uses the VRCNN network architecture. (Please feel free to read VRCNN if interested.) A sketch of such a network is given below.
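For reference, here is a minimal PyTorch sketch of a VRCNN-like network, assuming the layer configuration reported in the VRCNN paper (a 5×5/64 layer, two variable-filter-size layers built from parallel convolutions, and a 3×3 reconstruction layer with a residual connection); the exact hyper-parameters of the FRCNN models may differ.

```python
import torch
import torch.nn as nn

# A VRCNN-like network: an illustrative sketch only, assuming the widths of
# the original VRCNN (64 -> 16+32 -> 16+32 -> 1) and residual learning.
class VRCNNLike(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 64, kernel_size=5, padding=2)
        # Variable-filter-size layer 1: parallel 5x5 and 3x3 branches.
        self.conv2a = nn.Conv2d(64, 16, kernel_size=5, padding=2)
        self.conv2b = nn.Conv2d(64, 32, kernel_size=3, padding=1)
        # Variable-filter-size layer 2: parallel 3x3 and 1x1 branches.
        self.conv3a = nn.Conv2d(48, 16, kernel_size=3, padding=1)
        self.conv3b = nn.Conv2d(48, 32, kernel_size=1, padding=0)
        self.conv4 = nn.Conv2d(48, 1, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.relu(torch.cat([self.conv2a(out), self.conv2b(out)], dim=1))
        out = self.relu(torch.cat([self.conv3a(out), self.conv3b(out)], dim=1))
        # Residual learning: predict a refinement on top of the input block.
        return self.conv4(out) + x
```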
2. Sample Collection
2.1. FRCNN-U for Uni-Directional Prediction
- The current block extracted from the original video sequence is marked as the target/label Yi, as shown above.
- According to the coded fractional-pixel MV, the “referenced fractional block” in the reference picture is found, depicted by the yellow dashed line.
- Then the corresponding integer block is found by moving the referenced fractional block towards the top-left until it aligns with the nearest integer pixels, depicted by the purple line.
- Next, the corresponding integer block is padded in four directions (up, down, left, right) by a specific width.
- The padding width is determined by the effective kernel size of the FRCNN model. The padded block, depicted by the red line, is extracted from the reconstructed video sequence and marked as the input Xi (a coordinate sketch of this procedure follows this list).
- Since HEVC enables quarter-pixel MV precision, all the training samples are divided into 15 sets, and each set is used to train an individual model.
- When generating training data for FRCNN-U, the Low-Delay P (LDP) configuration is adopted.
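The sample-collection procedure above can be summarized with the following Python sketch (my own illustration with assumed array and variable names, not the paper's code), which derives a training pair (Xi, Yi) and its model index from a PU coded with a quarter-pel MV.

```python
# Illustrative sketch (assumed array and variable names, not the paper's code)
# of deriving a training pair (X_i, Y_i) from a PU coded with a fractional MV,
# assuming quarter-pel MV units and a padding width `pad` set by the network's
# effective receptive field. Picture-boundary handling is omitted.
def collect_sample(orig_cur, recon_ref, pu_x, pu_y, pu_w, pu_h,
                   mv_x_qpel, mv_y_qpel, pad):
    # Target/label Y_i: the current block taken from the original frame.
    y = orig_cur[pu_y:pu_y + pu_h, pu_x:pu_x + pu_w]

    # Split the quarter-pel MV into integer and fractional parts
    # (floor division moves the block towards the top-left).
    int_dx, frac_dx = mv_x_qpel // 4, mv_x_qpel % 4
    int_dy, frac_dy = mv_y_qpel // 4, mv_y_qpel % 4
    if frac_dx == 0 and frac_dy == 0:
        return None  # only blocks with fractional MVs are collected

    # Top-left corner of the "corresponding integer block" in the reference picture.
    ix, iy = pu_x + int_dx, pu_y + int_dy

    # Input X_i: the integer block padded by `pad` pixels on all four sides,
    # taken from the *reconstructed* reference picture.
    x = recon_ref[iy - pad:iy + pu_h + pad, ix - pad:ix + pu_w + pad]

    # The fractional offset selects which of the 15 models this sample trains.
    model_id = frac_dy * 4 + frac_dx - 1  # 0..14
    return x, y, model_id
```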
2.2. FRCNN-B for Bi-Directional Prediction
- Similar to FRCNN-U, but only the blocks coded with bi-directional prediction, where the second MV is fractional, are selected.
- When generating training data for FRCNN-B, the Low-Delay B (LDB) configuration is adopted.
3. Training & HEVC Implementation
- FRCNN-U models are used if the PU is coded with uni-directional mode.
- FRCNN-U and FRCNN-B are used simultaneously (FRCNN-U for list-0 and FRCNN-B for list-1) if the PU is coded with bi-directional mode.
- Block-based fractional reference filter type (FRFT) selection: An additional CU-level flag is added so that DCTIF (DCT Interpolation Filter) and FRCNN compete, and the one with the minimum rate-distortion (RD) cost is selected (see the sketch after this list).
- FRFT Merge: When a CU is coded with merge 2N×2N mode, its FRFT is also merged rather than decided by R-D cost.
- Different FRCNN models are used for different QPs.
- Thus, in total there are 120 models (2 model types: FRCNN-U and FRCNN-B; 4 QPs: 22, 27, 32, 37; 15 fractional positions; 2 × 4 × 15 = 120).
- For the training data, only one video sequence is used, namely BlowingBubbles, which is a common test sequence in HEVC.
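As a rough sketch of the CU-level FRFT decision described above (all function names are hypothetical, not HM reference-software code): the encoder tries both interpolation types, counts the extra flag bit in the rate, and keeps the filter with the lower RD cost.

```python
# Rough sketch of the CU-level FRFT decision (function and variable names are
# hypothetical): try both interpolation types, add the one-bit flag to the
# rate, and keep the filter with the lower rate-distortion cost J = D + lambda * R.
def choose_frft(cu, lambda_rd, interpolate_dctif, interpolate_frcnn,
                encode_with_reference):
    best_cost, best_flag = None, 0
    for frft_flag, interpolate in ((0, interpolate_dctif), (1, interpolate_frcnn)):
        reference = interpolate(cu)                 # fractional-pel reference block
        distortion, bits = encode_with_reference(cu, reference)
        bits += 1                                   # the additional CU-level FRFT flag
        cost = distortion + lambda_rd * bits
        if best_cost is None or cost < best_cost:
            best_cost, best_flag = cost, frft_flag
    return best_flag                                # 0 -> DCTIF, 1 -> FRCNN
```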
4. Experimental Results
4.1. BD-Rate
- 3.9%, 2.7%, and 1.3% BD-rate reduction is obtained compared with HEVC, under LDP, LDB, and RA configurations respectively.
- When only FRCNN-U is used in the LDB configuration, only 2.0% BD-rate reduction is obtained, which shows the importance of FRCNN-B.
4.2. Hitting Ratios
- A certain proportion of CUs choose FRCNN rather than DCTIF.
- The hitting ratio decreases when QP increases.
4.3. RD Curves
- RD curves show that FRCNN is more useful at high-bitrate conditions than at low-bitrate conditions, which is consistent with the hitting-ratio results.
4.4. Visualization
- Pink CUs use FRCNN while blue CUs use DCTIF.
- FRCNN tends to be selected in rich-texture regions, such as water and clothes.
4.5. Comparison with Prior Art CNNIF [12]
FRCNN outperforms the prior art CNNIF. There are a large number of ablation experiments in the paper, which are not shown here. Please read the paper if interested.
This is the 22nd story in this month!
Reference
[2019 TCSVT] [FRCNN]
Convolutional Neural Network-Based Fractional-Pixel Motion Compensation
Codec Inter Prediction
H.264 [DRNFRUC & DRNWCMC]
HEVC [CNNIF] [Zhang VCIP’17] [NNIP] [Ibrahim ISM’18] [VC-LAPGAN] [VI-CNN] [FRUC+DVRF][FRUC+DVRF+VECNN] [RSR] [Zhao ISCAS’18 & TCSVT’19] [Ma ISCAS’19] [Zhang ICIP’19] [ES] [FRCNN] [CNN-SR & CNN-UniSR & CNN-BiSR] [DeepFrame] [U+DVPN] [Multi-Scale CNN]
VVC [FRUC+DVRF+VECNN]