Reading: Pham ACCESS’19 — Deep Learning-Based Luma and Chroma Fractional Interpolation (HEVC Inter)
Outperforms GVTCNN, Xia ISCAS’19, CNNIF & Zhang VCIP’17. 2.9%, 0.3%, 0.6% Y, U, V BD-Rate Reduction, Respectively, Under LDP Configuration.
In this story, “Deep Learning-Based Luma and Chroma Fractional Interpolation in Video Coding” (Pham ACCESS’19), by Hosei University, is briefly presented. I read this because I work on video coding research. In this paper:
- CNN-based fractional interpolation is proposed for the Luminance (Luma) and Chrominance (Chroma) components in motion-compensated prediction to improve coding efficiency.
- Two syntax elements are introduced to switch the CNNs on/off for Luma and Chroma, respectively.
This paper was published in 2019 in IEEE Access, an open access journal with a high impact factor of 4.098.
Outline
- Overall Framework & Network Architecture
- Sample Collection
- Experimental Results
1. Overall Framework & Network Architecture
1.1. Overall Framework
- Two flags are encoded to indicate the fractional interpolation methods for the Luma and Chroma components, by choosing the combination with the smallest RD cost among four possible interpolation methods:
- DCTIF and DCTIF, DCTIF and CNN, CNN and CNN, and CNN and DCTIF for Luma and Chroma, respectively. (As shown in the figure above; a minimal sketch of this mode decision follows below.)
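The selection logic can be illustrated with a minimal Python sketch. This is not the paper’s code: cu and rd_cost() are hypothetical stand-ins for a coding unit and for the RD cost computation that the paper performs inside the HM encoder.

```python
from itertools import product

# Hypothetical sketch of the encoder-side mode decision described above:
# try all four (Luma method, Chroma method) combinations and keep the one
# with the smallest RD cost, then signal the choice with two flags.
METHODS = ("DCTIF", "CNN")

def choose_interpolation(cu, rd_cost):
    """rd_cost(cu, luma_method, chroma_method) -> float (assumed helper)."""
    best = min(product(METHODS, METHODS),
               key=lambda pair: rd_cost(cu, *pair))
    luma_flag = best[0] == "CNN"    # flag signaled for Luma
    chroma_flag = best[1] == "CNN"  # flag signaled for Chroma
    return luma_flag, chroma_flag
```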
1.2. Network Architecture
- A reconstructed frame is first interpolated by DCTIF to get the 15 fractional samples.
- The proposed models then take one DCTIF-interpolated fractional sample as input and output the corresponding refined fractional sample.
- Only one model is trained for all 15 fractional samples at each QP.
- One model handles Y component interpolation, and a shared Chroma model handles U and V component interpolation.
- In total, eight models are trained: four QPs × two models (Luma and Chroma).
- The model is inspired by VDSR, as the authors mention.
- The network architecture consists of 20 convolutional layers.
- Each layer applies 64 3×3 filters with a stride of one.
- ReLU follows every layer except the last one.
- Training runs for 50 epochs with a batch size of 128.
- The standard MSE loss function is used (written out below, together with a sketch of the network):
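With N training pairs (X_i, Y_i) of DCTIF-interpolated inputs and ground-truth fractional samples, and F the network with parameters θ (notation mine), the standard MSE loss reads:

```latex
L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left\| F(X_i; \theta) - Y_i \right\|^2
```

Below is a minimal PyTorch sketch of the VDSR-style network as described (20 conv layers, 64 3×3 filters, stride 1, ReLU after all but the last layer). The single-channel input/output and the global residual connection carried over from VDSR are my assumptions, not details confirmed by this summary:

```python
import torch
import torch.nn as nn

class FractionalCNN(nn.Module):
    """VDSR-style refinement network: 20 conv layers of 64 3x3 filters,
    stride 1, ReLU after every layer except the last."""
    def __init__(self, depth: int = 20, channels: int = 64):
        super().__init__()
        layers = [nn.Conv2d(1, channels, 3, stride=1, padding=1),
                  nn.ReLU(inplace=True)]
        for _ in range(depth - 2):
            layers += [nn.Conv2d(channels, channels, 3, stride=1, padding=1),
                       nn.ReLU(inplace=True)]
        layers.append(nn.Conv2d(channels, 1, 3, stride=1, padding=1))  # no ReLU
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        # Global residual learning as in VDSR (an assumption here): the
        # network predicts a correction to the DCTIF-interpolated input.
        return x + self.body(x)

model = FractionalCNN()
criterion = nn.MSELoss()  # the loss above
```

At inference time, each of the 15 DCTIF-interpolated planes would be passed through the single per-QP model to obtain the refined fractional samples.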
2. Sample Collection
- (1): First, extract the integer- and fractional-position videos by treating the pixels of every 4×4 non-overlapping block of each frame as one integer pixel plus 15 fractional pixels, as shown above. (A minimal sketch of this step follows after this list.)
- This yields a low-resolution video of integer pixels (the integer-position video) and 15 low-resolution videos, namely three half- and 12 quarter-position videos, corresponding to the 15 fractional samples.
- (2): Encode the low-resolution integer-position video under the low delay P configuration with QPs of 22, 27, 32, and 37 to get the reconstructed down-sampled video.
- (3): Extract the Y component from the reconstructed frames and interpolate it to the 15 fractional samples by DCTIF. These 15 fractional samples are used as the training input for the proposed CNN.
- (4): Extract the Y component from each fractional-position video frame to serve as the CNN ground-truth label for training.
- All the processes for generating the Chroma training set are the same as for the Luma (Y) component.
- HM-16.18 is used.
- The training sets for the QP 32 and QP 37 models are acquired from three sequences: Pedestrian, Traffic, and PeopleOnTheStreet. For the QP 22 and QP 27 models, the training set comes from Traffic and PeopleOnTheStreet only.
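Step (1) amounts to splitting each frame into 16 co-located low-resolution phase images; here is a minimal numpy sketch (my illustration, with offset (0, 0) taken as the integer position):

```python
import numpy as np

def split_phases(frame: np.ndarray, block: int = 4):
    """Split a frame into block*block low-resolution phase images.
    Offset (0, 0) is taken as the integer-position image; the other
    15 offsets give the half-/quarter-position images."""
    phases = {(dy, dx): frame[dy::block, dx::block]
              for dy in range(block) for dx in range(block)}
    integer_pos = phases.pop((0, 0))
    return integer_pos, phases

# A 64x64 frame yields one integer-position and 15 fractional-position
# images of size 16x16 each.
frame = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
integer_img, fractional_imgs = split_phases(frame)
assert integer_img.shape == (16, 16) and len(fractional_imgs) == 15
```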
3. Experimental Results
3.1. Hitting Ratios
- Cyan blocks indicate CUs that choose DCTIF for interpolating all components.
- Magenta blocks indicate CUs that choose DCTIF for Y and CNN for UV components.
- Yellow blocks indicate CUs that choose CNN for Y and DCTIF for UV components.
- Red blocks indicate CUs that choose CNN for interpolating all components.
- The remaining parts are CUs coded with integer motion vectors or intra coding.
CUs in the static background tend to choose DCTIF for fractional interpolation, while CUs on moving objects tend to choose CNN.
- Class F, where the Luma and Chroma components rarely choose CNN for fractional interpolation, obtains the lowest bitrate saving compared with other classes.
3.2. BD-Rate
- The proposed CNN obtains a 3.7% BD-rate reduction, outperforming GVTCNN [18] and Xia ISCAS’19 [19].
- CNNIF [15] and Zhang VCIP’17 [16] were proposed to interpolate half-pel pixels only.
- With a 1.1% BD-rate reduction, the proposed CNN also outperforms CNNIF [15] and Zhang VCIP’17 [16].
- 2.9%, 2.0%, and 1.2% Y BD-rate reductions are obtained under the LDP, LDB, and RA configurations, respectively.
- Separate models for U and V can achieve larger BD-rate reductions for U and V, but at the cost of a BD-rate loss for Y.
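As background on the metric: BD-rate compares two RD curves by fitting log-rate as a cubic polynomial of PSNR, integrating both fits over the overlapping PSNR range, and converting the average log-rate gap into a percentage. A minimal sketch, assuming four (rate, PSNR) points per curve as in the QP 22/27/32/37 setup:

```python
import numpy as np

def bd_rate(rates_anchor, psnr_anchor, rates_test, psnr_test):
    """Bjontegaard delta rate (%). Negative means the test codec saves
    bitrate versus the anchor. Minimal sketch, not the paper's code."""
    la, lt = np.log10(rates_anchor), np.log10(rates_test)
    # Fit log-rate as a cubic polynomial of PSNR for each curve.
    pa = np.polyfit(psnr_anchor, la, 3)
    pt = np.polyfit(psnr_test, lt, 3)
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    # Integrate both fits over the overlapping PSNR range.
    ia = np.polyval(np.polyint(pa), hi) - np.polyval(np.polyint(pa), lo)
    it = np.polyval(np.polyint(pt), hi) - np.polyval(np.polyint(pt), lo)
    avg_log_diff = (it - ia) / (hi - lo)
    return (10 ** avg_log_diff - 1) * 100
```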
3.3. RD Curves
- The PSNR gain of the proposed CNN is larger at high bit rates than at low bit rates.
This is the 28th story in this month!
Reference
[2019 IEEE ACCESS] [Pham ACCESS’19]
Deep Learning-Based Luma and Chroma Fractional Interpolation in Video Coding
Codec Inter Prediction
H.264 [DRNFRUC & DRNWCMC]
HEVC [CNNIF] [Zhang VCIP’17] [NNIP] [GVTCNN] [Ibrahim ISM’18] [VC-LAPGAN] [VI-CNN] [FRUC+DVRF][FRUC+DVRF+VECNN] [RSR] [Zhao ISCAS’18 & TCSVT’19] [Ma ISCAS’19] [Xia ISCAS’19] [Zhang ICIP’19] [ES] [FRCNN] [Pham ACCESS’19] [CNN-SR & CNN-UniSR & CNN-BiSR] [DeepFrame] [U+DVPN] [Multi-Scale CNN]
VVC [FRUC+DVRF+VECNN]