Reading: GVTCNN — Group Variational Transformation Convolutional Neural Network (HEVC Inter)
Outperforms CNNIF, 1.9% BD-rate Reduction Obtained Under LDP Configuration
In this story, Group Variational Transformation Convolutional Neural Network (GVTCNN), by Peking University, is briefly presented. I read this because I work on video coding research. In this paper:
- GVTCNN is used to interpolate the sub-pel pixels based on the integer-pel pixels for inter coding.
This is a paper in 2018 DCC. (Sik-Ho Tsang @ Medium)
Outline
- GVTCNN: Network Architecture
- Sample Collection
- Experimental Results
1. GVTCNN: Network Architecture
- The integer-position sample Il is the input of the network.
- All convolutional layers use 3×3 kernels, each followed by a PReLU activation.
- The standard MSE loss function is used for training.
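Just to make the training objective concrete, below is a minimal PyTorch-style sketch of one training step with the MSE loss; the names model, integer_sample and subpel_target are hypothetical placeholders, not from the paper.

```python
import torch.nn as nn

criterion = nn.MSELoss()  # standard MSE loss, as stated above

def training_step(model, optimizer, integer_sample, subpel_target):
    # model: a GVTCNN-like network (hypothetical placeholder)
    # integer_sample: integer-position input patch, shape (N, 1, H, W)
    # subpel_target: ground-truth sub-pixel position sample(s)
    optimizer.zero_grad()
    predicted = model(integer_sample)           # inferred sub-pixel sample(s)
    loss = criterion(predicted, subpel_target)  # MSE between prediction and label
    loss.backward()
    optimizer.step()
    return loss.item()
```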
1.1. Shared Feature Map Extraction Component
- A 48-channel feature map is first generated from the integer-position sample, followed by 8 lightweight convolutional layers with 10 channels each, which keeps the number of learned parameters small.
- The 10th layer then derives a 48-channel shared feature map.
- After 9 convolutional layers with 3×3 kernel size, the receptive field of each point in the shared feature map is 19×19.
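As a rough sketch of this shared component (my own PyTorch approximation based on the layer counts above, assuming a single-channel luma input), it could look like this:

```python
import torch.nn as nn

class SharedFeatureExtraction(nn.Module):
    """Sketch of the shared feature map extraction component:
    1 conv (48 ch) -> 8 lightweight convs (10 ch) -> 1 conv (48 ch),
    all with 3x3 kernels and PReLU activations."""
    def __init__(self):
        super().__init__()
        layers = [nn.Conv2d(1, 48, 3, padding=1), nn.PReLU()]  # initial 48-channel feature map
        in_ch = 48
        for _ in range(8):  # 8 lightweight 10-channel layers
            layers += [nn.Conv2d(in_ch, 10, 3, padding=1), nn.PReLU()]
            in_ch = 10
        layers += [nn.Conv2d(10, 48, 3, padding=1), nn.PReLU()]  # 10th layer: back to 48 channels
        self.body = nn.Sequential(*layers)

    def forward(self, integer_sample):
        # After 9 of these 3x3 layers the receptive field is 19x19, as noted above;
        # the output is the 48-channel shared feature map.
        return self.body(integer_sample)
```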
1.2. Group Variational Transformation
- The group variational transformation is then performed on the shared feature map, using a dedicated convolutional layer for each sub-pixel position sample.
- Different residual maps are thus generated, and the final inferred sub-pixel position samples are obtained by adding these residual maps to the integer-position sample.
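Here is a minimal sketch of how I read the GVT component (hypothetical PyTorch code, one dedicated 3×3 convolution per sub-pixel position producing a residual map):

```python
import torch.nn as nn

class GroupVariationalTransformation(nn.Module):
    """Sketch of the GVT component: a dedicated conv layer per sub-pixel
    position maps the shared feature map to a residual map, which is
    added to the integer-position sample."""
    def __init__(self, num_positions, feat_ch=48):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(feat_ch, 1, 3, padding=1) for _ in range(num_positions)]
        )

    def forward(self, shared_feat, integer_sample):
        # one residual map per sub-pixel position, added to the integer sample
        return [integer_sample + branch(shared_feat) for branch in self.branches]

# Usage sketch: one branch per sub-pixel position to be inferred, e.g.
# gvt = GroupVariationalTransformation(num_positions=3)  # 3 half-pel positions
```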
2. Sample Collection
- The training data is first blurred with a Gaussian filter to simulate the correlations between the integer-position sample and sub-pixel position samples.
- Sub-pixel position samples are later sampled from the blurred image.
- For the input integer-position sample, an intermediate integer sample is first down-sampled from the raw image. This intermediate down-sampled version is then coded by HEVC.
- Two networks, GVTCNN-H and GVTCNN-Q, are trained separately for the 3 half-pixel position samples and the 12 quarter-pixel position samples, respectively, to better generate samples at the different sub-pixel levels.
- For GVTCNN-H, the 200 training images and 200 testing images of BSDS500 are used as training data. 3×3 Gaussian kernels with random standard deviations in the range [0.5, 0.6] are used for blurring.
- For GVTCNN-Q, 10 YUV sequences at resolutions of 1024×768 and 1920×1080 are used to extract 89 high-resolution frames to synthesize the training data. The standard deviations of the Gaussian kernels range over [0.7, 0.8].
- A model is trained for each QP.
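To illustrate my reading of this sample-collection procedure, below is a hypothetical NumPy/SciPy sketch for the half-pel case; the 2:1 phase-shifted sub-sampling for the half-pel labels is my assumption, and the HEVC coding step of the integer input is omitted.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def synthesize_half_pel_pair(raw_frame, sigma=0.55, phase=(0, 1)):
    """Sketch of half-pel training-sample synthesis: blur the raw frame with a
    Gaussian kernel, then take a phase-shifted 2:1 sub-sampling grid of the
    blurred frame as the sub-pixel label.
    phase = (dy, dx) in {(0, 1), (1, 0), (1, 1)} for the 3 half-pel positions;
    sigma is drawn from [0.5, 0.6] in the paper (fixed here for simplicity)."""
    blurred = gaussian_filter(raw_frame.astype(np.float32), sigma=sigma)
    dy, dx = phase
    subpel_label = blurred[dy::2, dx::2]    # sub-pixel position sample
    integer_input = raw_frame[0::2, 0::2]   # intermediate down-sampled sample;
    # in the paper this intermediate sample is additionally coded by HEVC
    return integer_input, subpel_label
```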
3. Experimental Results
- HM-16.15 is used under the LDP configuration.
- A 1.9% BD-rate reduction is obtained.
- Since CNNIF only interpolates half-pel pixels, only GVTCNN-H is used for a fair comparison.
- GVTCNN-H outperforms CNNIF with a 2.4% BD-rate reduction.
- When a model is trained separately for each sub-pel position, the BD-rate reduction is still similar to that of the GVT-based model. (The network architecture of this variant is not entirely clear to me.)
This is the 26th story this month!
Reference
[2018 DCC] [GVTCNN]
A Group Variational Transformation Neural Network for Fractional Interpolation of Video Coding
Codec Inter Prediction
H.264 [DRNFRUC & DRNWCMC]
HEVC [CNNIF] [Zhang VCIP’17] [NNIP] [GVTCNN] [Ibrahim ISM’18] [VC-LAPGAN] [VI-CNN] [FRUC+DVRF] [FRUC+DVRF+VECNN] [RSR] [Zhao ISCAS’18 & TCSVT’19] [Ma ISCAS’19] [Zhang ICIP’19] [ES] [FRCNN] [CNN-SR & CNN-UniSR & CNN-BiSR] [DeepFrame] [U+DVPN] [Multi-Scale CNN]
VVC [FRUC+DVRF+VECNN]