Reading: GVTCNN — Group Variational Transformation Convolutional Neural Network (HEVC Inter)

Outperforms CNNIF, 1.9% BD-rate Reduction Obtained Under LDP Configuration

Sik-Ho Tsang
Jun 17, 2020
Interpolating the sub-pel pixels I_f based on the integer-pel pixels I_l

In this story, Group Variational Transformation Convolutional Neural Network (GVTCNN), by Peking University, is briefly presented. I read this because I work on video coding research. In this paper:

  • GVTCNN is used to interpolate the sub-pel pixels based on the integer-pel pixels for inter coding.

This is a paper in 2018 DCC. (Sik-Ho Tsang @ Medium)

Outline

  1. GVTCNN: Network Architecture
  2. Sample Collection
  3. Experimental Results

1. GVTCNN: Network Architecture

GVTCNN: Network Architecture
  • The integer-position sample Il is the input of the network.
  • A 3×3 kernel size is used for all the convolutional layers, with PReLU as the activation function.
  • The standard MSE loss function is used for training.

1.1. Shared Feature Map Extraction Component

  • A feature map with 48 channels is initially generated from the integer-position sample, followed by 8 lightweight convolutional layers with 10 channels each, which saves learnt parameters.
  • The 10th layer then derives a 48-channel shared feature map.
  • After 9 convolutional layers with 3×3 kernels, the receptive field of each point in the shared feature map is 19×19.
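
A quick check on that receptive field: L stacked 3×3 convolutions with stride 1 enlarge the receptive field by 2 pixels per layer, so RF = 2L + 1; with L = 9 this gives 2×9 + 1 = 19, matching the 19×19 above.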

1.2. Group Variational Transformation

  • The group variational transformation is then performed over the shared feature map, with a specific convolutional layer for each sub-pixel position sample.
  • Different residual maps are generated, and the final inferred sub-pixel position samples are obtained by adding these residual maps to the integer-position sample, as sketched below.
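
To make the two components above concrete, here is a minimal PyTorch-style sketch following the layer sizes described in this section. The padding and the exact placement of the activations are my assumptions, not details confirmed by the paper:

```python
import torch
import torch.nn as nn

class GVTCNN(nn.Module):
    # num_positions = 3 for GVTCNN-H (half-pel) or 12 for GVTCNN-Q (quarter-pel)
    def __init__(self, num_positions=3):
        super().__init__()
        layers = []
        # Shared feature map extraction: 1 -> 48 channels, then 8 lightweight
        # 10-channel layers, then a 10th layer deriving 48 channels.
        channels = [1, 48] + [10] * 8 + [48]
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            layers += [nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                       nn.PReLU(c_out)]
        self.shared = nn.Sequential(*layers)
        # Group variational transformation: one specific convolutional layer
        # per sub-pel position, each inferring a residual map.
        self.branches = nn.ModuleList(
            nn.Conv2d(48, 1, kernel_size=3, padding=1)
            for _ in range(num_positions))

    def forward(self, integer_sample):  # (N, 1, H, W) integer-position sample
        features = self.shared(integer_sample)
        # Sub-pel sample = integer-position sample + inferred residual map.
        return [integer_sample + branch(features) for branch in self.branches]

net = GVTCNN(num_positions=3)              # GVTCNN-H
halves = net(torch.randn(1, 1, 64, 64))    # 3 inferred half-pel sample maps
```

Training would then minimize the MSE between each inferred sub-pel sample and its label.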

2. Sample Collection

Training data generation for GVTCNN
  • The raw training images are first blurred with a Gaussian filter to simulate the correlation between the integer-position sample and the sub-pixel position samples.
  • The sub-pixel position samples are then sampled from the blurred image.
  • As for the input integer-position sample, an intermediate integer sample is first down-sampled from the raw image; this down-sampled version is then coded by HEVC (a sketch of the whole pipeline follows this list).
  • Two networks, called GVTCNN-H and GVTCNN-Q, are trained separately for the 3 half-pixel position samples and the 12 quarter-pixel position samples, to better generate samples at the different sub-pixel levels.
  • For GVTCNN-H, the 200 training images and 200 testing images of BSDS500 are used for training. 3×3 Gaussian kernels with random standard deviations in the range [0.5, 0.6] are used for blurring.
  • For GVTCNN-Q, 89 high-resolution frames are extracted from 10 YUV sequences at 1024×768 and 1920×1080 to synthesize training data. The standard deviations of the Gaussian kernels range within [0.7, 0.8].
  • A model is trained for each QP.
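
Below is a minimal Python sketch of this blur-then-sample synthesis for the half-pel case. The stride, phase convention, and the hevc_round_trip stub are my assumptions (hypothetical helper); the actual pipeline codes the down-sampled frame with the HEVC reference software at each QP:

```python
import numpy as np
import cv2

def hevc_round_trip(img, qp):
    # Hypothetical stand-in for coding the integer sample with HEVC;
    # a real pipeline would run the HM encoder/decoder at the given QP.
    return img

def make_training_pair(raw, sigma=0.55, stride=2, phase=(1, 1), qp=32):
    # 1. Blur the raw image with a 3x3 Gaussian kernel to simulate the
    #    correlation between integer- and sub-pixel-position samples.
    blurred = cv2.GaussianBlur(raw.astype(np.float32), (3, 3), sigma)
    # 2. Sub-pel label: sample the blurred image at an offset phase
    #    (phases (0,1), (1,0), (1,1) give the 3 half-pel positions).
    dy, dx = phase
    label = blurred[dy::stride, dx::stride]
    # 3. Integer-position input: down-sample the raw image at phase (0, 0),
    #    then code the down-sampled version with HEVC (stubbed above).
    integer = hevc_round_trip(raw[0::stride, 0::stride].astype(np.float32), qp)
    return integer, label

# Example: one half-pel training pair from a random 64x64 "frame".
inp, lab = make_training_pair(np.random.rand(64, 64) * 255)
```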

3. Experimental Results

BD-Rate (%) on HEVC Test Sequences (overall GVTCNN results)
  • HM-16.15 is used under LDP configuration.
  • 1.9% BD-rate reduction is obtained.
BD-Rate (%) on HEVC Test Sequences (GVTCNN-H vs. CNNIF)
  • CNNIF only interpolates half-pel pixels, thus GVTCNN-H is used for fair comparison.
  • GVTCNN-H outperforms CNNIF with 2.4% BD-rate reduction.
BD-Rate (%) on HEVC Test Sequences (separate per-position models vs. GVT)
  • By training a model separately for each sub-pel position, the BD-rate reduction is still similar to that of the GVT design. (The exact network architecture of this variant is not so clear to me.)

This is the 26th story in this month!
