Reading: GVCNN — One-For-All Group Variation Convolutional Neural Network (HEVC Inter)

Outperforms CNNIF & Zhang VCIP’17. 2.2% Average BD-Rate Reduction Under LDP.

Sik-Ho Tsang
5 min read · Jun 21, 2020

In this story, “One-for-All: Grouped Variation Network-Based Fractional Interpolation in Video Coding” (One-For-All GVCNN) is briefly presented, since most of its content was already covered in the conference version, GVTCNN. I only found this journal version recently, and I read it because I work on video coding research. In this paper:

  • GVCNN is used to interpolate the sub-pel pixels based on the integer-pel pixels for inter coding.
  • One-For-All means that, instead of training one model per sub-pel position, a single network infers a whole group of sub-pel positions from the integer-pel pixels at once (in practice, one network for the 3 half-pel positions and one for the 12 quarter-pel positions, as described in Section 3).

This is a paper in 2019 TIP, a journal with a high impact factor of 6.79. (Sik-Ho Tsang @ Medium)

Outline

  1. Fractional Interpolation in HEVC
  2. GVCNN: Network Architecture
  3. Training Data Generation
  4. Experimental Results

1. Fractional Interpolation in HEVC

Positions of different fractional pixels
  • A_{i,j} represents the integer samples.
  • h^k_{i,j} (k ∈ {1, 2, 3}) and q^k_{i,j} (k ∈ {1, 2, ..., 12}) denote the half-pixel and quarter-pixel positions, respectively.
  • Given a reference block I_A, whose pixels are regarded as integer samples A_{i,j}, the half-pixel blocks I_{h^k} and quarter-pixel blocks I_{q^k} are interpolated from I_A.
  • For a reference block I_A, the variations between its pixels and the sub-pixels are the differences between each sub-pixel block and I_A.
  • ΔI_{h^k} denotes the variations of the half-pixels.
  • Similarly, the variations ΔI_{q^k} of the quarter-pixels are constructed in the same way.
  • Thus, the mapping function to be learned maps the integer samples to these grouped variations; see the equations below.
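In equation form, following the definitions above:

```latex
\Delta I_{h^k} = I_{h^k} - I_A, \qquad k \in \{1, 2, 3\}

\Delta I_{q^k} = I_{q^k} - I_A, \qquad k \in \{1, 2, \dots, 12\}

\{\Delta I_{h^k}\}_{k=1}^{3} = f_h(I_A), \qquad \{\Delta I_{q^k}\}_{k=1}^{12} = f_q(I_A)
```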

f_h(·) and f_q(·) represent the learned mappings between the integer pixels and the grouped variations of the half- and quarter-pixel positions, respectively.

And GVCNN attempts to estimate the mapping function.

2. GVCNN: Network Architecture

GVCNN: Network Architecture
  • A feature map with 48 channels is first generated from the integer-position sample, followed by eight lightweight convolutional layers with 10 channels each, which keeps the number of learned parameters small.
  • The 10th layer then derives a 48-channel shared feature map.
  • After nine convolutional layers with 3×3 kernels, the receptive field of each point in the shared feature map is 19×19.
  • The group variational transformation is further performed over the shared feature maps with a specific convolutional layer for each sub-pixel sample.
  • Different residual maps are then generated, and the final sub-pixel position samples are obtained by adding each residual map to the integer-position sample.
  • The same network is used in GVTCNN; a rough sketch of the architecture follows this list.
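A minimal PyTorch sketch of the architecture described above. The layer counts and channel widths follow the bullets; zero padding and ReLU activations are my assumptions, since the post does not specify them:

```python
import torch
import torch.nn as nn

class GVCNN(nn.Module):
    """Sketch of GVCNN: shared feature extraction + one head per sub-pixel."""
    def __init__(self, num_subpel=3):  # 3 for GVCNN-H, 12 for GVCNN-Q
        super().__init__()
        layers = [nn.Conv2d(1, 48, 3, padding=1), nn.ReLU(inplace=True)]
        ch = 48
        for _ in range(8):  # eight lightweight 10-channel layers
            layers += [nn.Conv2d(ch, 10, 3, padding=1), nn.ReLU(inplace=True)]
            ch = 10
        # the 10th layer derives the 48-channel shared feature map
        layers += [nn.Conv2d(10, 48, 3, padding=1), nn.ReLU(inplace=True)]
        self.shared = nn.Sequential(*layers)
        # group variational transformation: one conv head per sub-pixel position
        self.heads = nn.ModuleList(
            nn.Conv2d(48, 1, 3, padding=1) for _ in range(num_subpel)
        )

    def forward(self, x):  # x: integer-position sample, shape (N, 1, H, W)
        feat = self.shared(x)
        # each head predicts a variation (residual) map; add it back to x
        return [x + head(feat) for head in self.heads]
```

During training, each head's output x + head(feat) would be regressed against the corresponding ground-truth sub-pixel sample (equivalently, head(feat) against the variation ΔI).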

3. Training Data Generation

Training Data Generation
  • The training data is first blurred with a Gaussian filter to simulate the correlations between the integer-position sample and sub-pixel position samples.
  • Sub-pixel position samples are later sampled from the blurred image.
  • As for the input integer-position sample, an intermediate integer sample is first down-sampled from the raw image. This down-sampled version is then coded by HEVC, so that the input matches the compressed reference pictures seen at inference time.
  • The blurring operation is to reduce aliasing effects due to the down-sampling.
  • Two networks are trained separately for 3 half-pixel position samples and 12 quarter-pixel position samples to better generate the samples at different sub-pixel levels, called GVCNN-H and GVCNN-Q respectively.
  • This part is the same as in GVTCNN; a rough sketch of the pipeline follows this list.
  • (The paper also devotes a large portion of its text to a mathematical analysis of the performance bound for motion estimation. If interested, please read the paper.)
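A rough sketch of this training-pair generation pipeline, under assumptions the post leaves open: a down-sampling factor of 2 (matching the half-pixel grid), a Gaussian sigma of 1.0, and a hypothetical hevc_round_trip stub standing in for the HEVC coding step (which needs an external encoder such as HM):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def hevc_round_trip(img):
    # hypothetical stub: encode/decode img with an HEVC codec (e.g. HM) so the
    # network input matches the compressed reference pictures seen at test time
    return img

def make_half_pel_pairs(raw, sigma=1.0):
    """Build (input, labels) for GVCNN-H from a raw grayscale image."""
    blurred = gaussian_filter(raw.astype(np.float64), sigma)  # reduce aliasing
    # phase (0, 0) of the 2x grid plays the role of the integer positions;
    # the other three phases act as the half-pixel ground-truth samples
    labels = [blurred[0::2, 1::2], blurred[1::2, 0::2], blurred[1::2, 1::2]]
    integer = hevc_round_trip(raw[0::2, 0::2])  # integer-position input
    return integer, labels
```

GVCNN-Q would be built analogously, with twelve quarter-pixel phases sampled from a finer grid.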

4. Experimental Results

4.1. BD-Rate

BD-Rate (%) on HEVC Test Sequences
  • HM-16.4 is used.
  • 2.2%, 1.2%, and 0.9% average BD-rate savings are achieved under the LDP, LDB, and RA conditions, respectively (a sketch of how BD-rate is computed is given at the end of this subsection).
BD-Rate (%) on UHD Sequences
  • For UHD sequences, the gain is much smaller.
  • The authors' explanation is that the sampling precision of these high-resolution sequences is already high enough, and the signals of adjacent pixels are more continuous, leaving less room for learned interpolation.
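For readers unfamiliar with the metric, here is a minimal sketch of the standard Bjøntegaard delta-rate computation (a cubic fit of log-rate over PSNR for each codec, then the average gap between the two curves); this is generic, not code from the paper:

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjontegaard delta-rate (%) between two RD curves (e.g. 4 QP points)."""
    p_a = np.polyfit(psnr_anchor, np.log(rate_anchor), 3)  # PSNR -> log(rate)
    p_t = np.polyfit(psnr_test, np.log(rate_test), 3)
    lo = max(min(psnr_anchor), min(psnr_test))  # overlapping PSNR range
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_log_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_log_diff) - 1) * 100  # negative means bitrate saving
```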

4.2. RD Curves

RD Curves for 3 Sequences
  • The proposed approach is more efficient at high bitrates than at low bitrates.

4.3. SOTA Comparison

BD-Rate (%) on HEVC Test Sequences
  • GVCNN outperforms both CNNIF and Zhang VCIP’17, as shown above.

4.4. GVCNN Separate Models

GVCNN Separate Models
  • GVCNN-Separate: separate models are trained for each sub-pel position. However, only a marginal extra gain of 0.1% is obtained over the one-for-all model.

4.5. Blurring

Blurring
  • Without blurring during training sample collection, a 0.2% BD-rate loss is observed.

4.6. Hitting Ratios

Red: GVCNN, Blue: DCTIF
Hitting Ratios
  • Between 3% and 25% of CUs choose GVCNN (red) over DCTIF (blue), as shown above.
  • A visualization of the CUs that choose GVCNN is also shown.

This is the 30th story this month!
