Reading: GVCNN — One-For-All Group Variation Convolutional Neural Network (HEVC Inter)
Outperforms CNNIF & Zhang VCIP’17. 2.2% Average BD-Rate Reduction Under LDP.
In this story, “One-for-All: Grouped Variation Network-Based Fractional Interpolation in Video Coding” (One-For-All GVCNN) is briefly presented, since most of its content has already been covered in the conference version, GVTCNN. I only found this transaction paper recently, and I read it because I work on video coding research. In this paper:
- GVCNN is used to interpolate the sub-pel pixels based on the integer-pel pixels for inter coding.
- One-For-All means that a single CNN interpolates the pixels of all 15 sub-pel positions from the pixels of the integer-pel positions.
This is a paper in 2019 TIP where TIP has a high impact factor of 6.79. (Sik-Ho Tsang @ Medium)
Outline
- Fractional Interpolation in HEVC
- GVCNN: Network Architecture
- Training Data Generation
- Experimental Results
1. Fractional Interpolation in HEVC
- $A_{i,j}$ represents the integer samples.
- $h^k_{i,j}$ ($k \in \{1, 2, 3\}$) and $q^k_{i,j}$ ($k \in \{1, 2, \dots, 12\}$) denote the half-pixel positions and quarter-pixel positions, respectively.
- Given a reference block $I_A$, whose pixels are regarded as integer samples ($A_{i,j}$), the half-pixel blocks $I_{h^k}$ and quarter-pixel blocks $I_{q^k}$ are interpolated from $I_A$.
- For a reference block $I_A$, the variations between its pixels and the half-pixel samples are given by:

$$\Delta I_{h^k} = I_{h^k} - I_A, \quad k \in \{1, 2, 3\}$$

- where $\Delta I_{h^k}$ denotes the variations of the half-pixels.
- Similarly, the variations of the quarter-pixels are constructed as:

$$\Delta I_{q^k} = I_{q^k} - I_A, \quad k \in \{1, 2, \dots, 12\}$$
- Thus, the mapping function to be learned targets the variations:

$$\{\Delta I_{h^k}\}_{k=1}^{3} = f_h(I_A), \qquad \{\Delta I_{q^k}\}_{k=1}^{12} = f_q(I_A)$$

$f_h(\cdot)$ and $f_q(\cdot)$ represent the learned mappings between integer pixels and the grouped variations of the half- and quarter-pixel positions, respectively.
And GVCNN attempts to estimate these mapping functions.
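To make the grouped-variation formulation concrete, here is a minimal NumPy sketch; the block size and the stand-in predictor are illustrative assumptions, not from the paper:

```python
import numpy as np

# Minimal sketch: recover sub-pixel blocks from learned grouped variations.
I_A = np.random.rand(8, 8).astype(np.float32)   # integer-position reference block

def f_h(block):
    """Stand-in for the learned mapping f_h(.): it should predict the 3
    grouped half-pixel variations from the integer block in one pass."""
    return np.zeros((3,) + block.shape, dtype=block.dtype)

delta_I_h = f_h(I_A)                  # Delta I_{h^k} = I_{h^k} - I_A, k = 1..3
I_h = I_A[None, :, :] + delta_I_h     # recover the half-pixel blocks I_{h^k}
print(I_h.shape)                      # (3, 8, 8)
```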
2. GVCNN: Network Architecture
- A 48-channel feature map is first generated from the integer-position sample, followed by 8 convolutional layers with only 10 channels each, which are lightweight and reduce the number of parameters to be stored.
- The 10th layer then derives a 48-channel shared feature map.
- After 9 convolutional layers with 3×3 kernel size, the receptive field of each point in the shared feature map is 19×19.
- The group variational transformation is further performed over the shared feature maps with a specific convolutional layer for each sub-pixel sample.
- Different residual maps are then generated, and the final inferred sub-pixel position samples are obtained by adding the residual maps to the integer-position sample (see the sketch after this list).
- The same network architecture is used in GVTCNN.
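Below is a minimal PyTorch sketch of the topology described above. ReLU activations and size-preserving zero padding are assumptions; the paper's exact hyperparameters may differ. GVCNN-H would use n_pos=3 heads and GVCNN-Q would use n_pos=12:

```python
import torch
import torch.nn as nn

class GVCNN(nn.Module):
    """Sketch of the one-for-all topology: a shared feature extractor plus
    one position-specific 3x3 conv head per sub-pixel sample (n_pos = 3 for
    GVCNN-H, 12 for GVCNN-Q). ReLU and zero padding are assumptions."""
    def __init__(self, n_pos=3):
        super().__init__()
        layers = [nn.Conv2d(1, 48, 3, padding=1), nn.ReLU(True)]  # layer 1: 48 ch
        in_ch = 48
        for _ in range(8):                                        # layers 2-9: 10 ch
            layers += [nn.Conv2d(in_ch, 10, 3, padding=1), nn.ReLU(True)]
            in_ch = 10
        layers += [nn.Conv2d(10, 48, 3, padding=1), nn.ReLU(True)]  # layer 10: 48 ch
        self.shared = nn.Sequential(*layers)
        # Group variational transformation: one conv per sub-pixel position,
        # each emitting a single-channel residual (variation) map.
        self.heads = nn.ModuleList(
            [nn.Conv2d(48, 1, 3, padding=1) for _ in range(n_pos)])

    def forward(self, x):  # x: integer-position sample, shape (N, 1, H, W)
        feat = self.shared(x)
        # Add each residual map back to the integer sample to obtain the
        # inferred sub-pixel position samples.
        return [x + head(feat) for head in self.heads]

# Usage: the half-pel model infers all 3 half-pixel blocks in a single pass.
outs = GVCNN(n_pos=3)(torch.randn(1, 1, 64, 64))
print(len(outs), outs[0].shape)  # 3 torch.Size([1, 1, 64, 64])
```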
3. Training Data Generation
- The training data is first blurred with a Gaussian filter to simulate the correlations between the integer-position sample and sub-pixel position samples.
- Sub-pixel position samples are later sampled from the blurred image.
- As for the input integer-position sample, an intermediate integer sample is first down-sampled from the raw image. This intermediate down-sampled version is then coded by HEVC, so that the input matches the reconstructed (coded) reference frames seen at inference time.
- The blurring operation reduces the aliasing effects caused by the down-sampling.
- Two networks, called GVCNN-H and GVCNN-Q, are trained separately for the 3 half-pixel position samples and the 12 quarter-pixel position samples, to better generate samples at the different sub-pixel levels.
- This part is the same as in GVTCNN (a sketch of the sampling pipeline follows below).
- (There is also a large portion of mathematical analysis about the performance bound for motion estimation. If interested, please read the paper.)
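A minimal sketch of this sampling pipeline for the half-pixel case (factor-2 grid). The Gaussian sigma and the phase-to-position layout are assumptions for illustration, and the HEVC coding step is only a placeholder here:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def hevc_encode_decode(img):
    # Placeholder: the paper codes the down-sampled sample with HEVC so the
    # network input matches reconstructed reference frames. A real pipeline
    # would round-trip img through an HEVC codec (e.g. the HM encoder).
    return img

def make_half_pel_pairs(raw, sigma=0.6):
    """Sketch of training-pair generation for the half-pixel case; sigma
    and the phase layout are illustrative assumptions."""
    blurred = gaussian_filter(raw.astype(np.float32), sigma)  # anti-aliasing blur
    integer = blurred[0::2, 0::2]        # integer-position sample, phase (0, 0)
    halves = [blurred[0::2, 1::2],       # h1: horizontal half-pel phase
              blurred[1::2, 0::2],       # h2: vertical half-pel phase
              blurred[1::2, 1::2]]       # h3: diagonal half-pel phase
    return hevc_encode_decode(integer), halves  # network input, 3 targets

# GVCNN-Q would use a factor-4 grid instead, keeping the 12 quarter-pel phases.
inp, targets = make_half_pel_pairs(np.random.rand(64, 64))
```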
4. Experimental Results
4.1. BD-Rate
- HM-16.4 is used.
- 2.2%, 1.2% and 0.9% average BD-rate savings are achieved under the LDP, LDB and RA conditions, respectively (a sketch of the BD-rate computation follows below).
- For UHD sequences, the gain is much smaller.
- The authors' explanation is that the sampling precision of the high-resolution test sequences is already high enough, so the signals of adjacent pixels are more continuous.
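For reference, the numbers above follow the standard Bjøntegaard delta-rate metric; a minimal sketch of its usual computation (cubic fit of log-rate versus PSNR, integrated over the overlapping quality range), with made-up example numbers:

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjontegaard delta-rate: average bitrate difference (%) between two
    RD curves at equal PSNR; negative values mean bitrate savings."""
    # Fit log10(rate) as a cubic polynomial of PSNR for each curve.
    p_a = np.polyfit(psnr_anchor, np.log10(rate_anchor), 3)
    p_t = np.polyfit(psnr_test, np.log10(rate_test), 3)
    # Integrate both fits over the overlapping PSNR interval.
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_log_diff = (int_t - int_a) / (hi - lo)
    return (10 ** avg_log_diff - 1) * 100

# Example with made-up numbers (four rate/PSNR points per curve):
print(bd_rate([1000, 1800, 3200, 6000], [32.0, 34.1, 36.2, 38.3],
              [ 950, 1700, 3050, 5700], [32.1, 34.2, 36.3, 38.4]))
```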
4.2. RD Curves
- The proposed approach is more efficient at high bitrates than at low bitrates.
4.3. SOTA Comparison
- GVCNN outperforms CNNIF & Zhang VCIP’17.
4.4. GVCNN Separate Models
- GVCNN-Separate: separate models are trained for each sub-pel position, but only a small additional gain of 0.1% is obtained.
4.5. Blurring
- Without blurring during training sample collection, a 0.2% BD-rate loss is observed.
4.6. Hitting Ratios
- There are 3% to 25% of CUs choosing GVCNN.
- A visualization of the CUs choosing GVCNN is also shown in the paper.
This is the 30th story in this month!
Reference
[2019 TIP] [GVCNN]
One-for-All: Grouped Variation Network-Based Fractional Interpolation in Video Coding
Codec Inter Prediction
H.264 [DRNFRUC & DRNWCMC]
HEVC [CNNIF] [Zhang VCIP’17] [NNIP] [GVTCNN] [Ibrahim ISM’18] [VC-LAPGAN] [VI-CNN] [FRUC+DVRF] [FRUC+DVRF+VECNN] [RSR] [Zhao ISCAS’18 & TCSVT’19] [Ma ISCAS’19] [Zhang ICIP’19] [ES] [GVCNN] [FRCNN] [CNN-SR & CNN-UniSR & CNN-BiSR] [DeepFrame] [U+DVPN] [Multi-Scale CNN]
VVC [FRUC+DVRF+VECNN]