Review: CNNIF — CNN-based Interpolation Filter (HEVC Inter Prediction)
Network Similar to SRCNN, Up to 3.2%, Average 0.9% BD-Rate Reduction Compared to the Conventional HEVC
Apr 20, 2020
In this story, CNN-based Interpolation Filter (CNNIF), by University of Science and Technology of China, is reviewed. A network similar to SRCNN is used for frame interpolation in HEVC inter coding. I read this because I work on video coding research. This is a paper in 2017 ISCAS. (Sik-Ho Tsang @ Medium)
Outline
- The Use of Fractional Interpolation in Video Coding
- CNN Based Interpolation Process & Network Architecture
- Experimental Results
1. The Use of Fractional Interpolation in Video Coding
- Frames are highly correlated, and video codecs exploit this for efficient compression through a process called motion-compensated prediction (MCP).
- In MCP, for the current block to be coded, the best-matching block is searched in previously reconstructed reference frames, and the difference between these two blocks (the residual) is transmitted to the decoder side.
- The positional offset between the current block and its corresponding reference block is represented by a motion vector (MV).
- The true frame-to-frame displacements of moving objects are not necessarily an integer number of pixels.
- Fractional-pel precision motion vectors have to be adopted to describe the continuous motion of objects.
- Hence, the reference frame needs to be interpolated.
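The MCP search described above can be sketched as a simple full-search block matcher. This is a minimal illustration, not the paper's encoder: the function name, SAD criterion, and search range are assumptions for the sketch.

```python
import numpy as np

def motion_search(cur_block, ref_frame, top, left, search_range=8):
    """Full search for the best-matching block at integer-pel precision,
    using the sum of absolute differences (SAD) as the matching cost."""
    bh, bw = cur_block.shape
    best_sad, best_mv = float("inf"), (0, 0)
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = top + dy, left + dx
            # Skip candidates that fall outside the reference frame.
            if y < 0 or x < 0 or y + bh > ref_frame.shape[0] or x + bw > ref_frame.shape[1]:
                continue
            cand = ref_frame[y:y + bh, x:x + bw]
            sad = np.abs(cur_block.astype(np.int64) - cand.astype(np.int64)).sum()
            if sad < best_sad:
                best_sad, best_mv = sad, (dy, dx)
    return best_mv, best_sad
```

When the true motion is fractional rather than integer, no candidate in this search matches exactly — which is why the reference frame must be interpolated to fractional positions first.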
2. CNN Based Interpolation Process & Network Architecture
2.1. Conventional Interpolation
- The capitalized A's represent integer samples (blue), which actually exist in the corresponding reference block.
- Half-sample positions are labeled as b,h,j (White but with letters).
- b: Horizontal half-pel position
- h: Vertical half-pel position
- j: Diagonal half-pel position
- In conventional HEVC, half-samples are interpolated with a symmetric 8-tap DCT-based filter. Since this filter uses fixed coefficients, it cannot adapt to all kinds of pixel blocks and can only obtain sub-optimal results.
- In this paper, the authors train 3 SRCNN-like models to interpolate b, h and j.
- The remaining positions without letters are quarter-pel pixels, which are bilinearly interpolated from the half-pel pixels.
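For context, HEVC's conventional luma half-sample filter uses the fixed 8-tap coefficients {-1, 4, -11, 40, 40, -11, 4, -1}, normalized by 64. The sketch below applies it horizontally to one row; border handling is simplified by only producing half-pel samples that have 3 integer samples to the left and 4 to the right (the function name and interface are my own).

```python
import numpy as np

# HEVC 8-tap DCT-based luma half-sample filter (sums to 64)
HEVC_HALF_TAPS = np.array([-1, 4, -11, 40, 40, -11, 4, -1])

def interp_half_horizontal(row):
    """Horizontal half-pel samples (position b) between integer samples
    of one row, skipping positions too close to the borders."""
    row = row.astype(np.int64)
    out = []
    for i in range(3, len(row) - 4):
        val = int(HEVC_HALF_TAPS @ row[i - 3:i + 5])
        # Round, normalize by 64, and clip to the 8-bit sample range.
        out.append(np.clip((val + 32) >> 6, 0, 255))
    return np.array(out)
```

Because the taps are fixed for every block, the filter is the same regardless of local content — the adaptivity gap that the CNN models aim to close.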
2.2. Proposed Network Architecture
- SRCNN consists of 3 convolutional layers.
- The first layer is used for patch extraction and representation, extracting the features from low-resolution image. Here, W1 is of size 9×9×64 and B1 is a 64-dimensional vector.
- The second layer can be seen as non-linear mapping, which converts the features of low-resolution image to those of high-resolution. Here, W2 is of size 1×1×32 and B2 is a 32-dimensional vector.
- The third layer, where W3 is of size 5×5×1, is used to recover the high-resolution image from the high-resolution features.
- SRCNN is reused, but with some differences in how the input and output labels are prepared.
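The 9-1-5 layer sizes above imply a very small model. A quick parameter count for the three layers (assuming a single-channel, grayscale input as in SRCNN):

```python
def srcnn_layer_params(kernel, in_ch, out_ch):
    """Weights plus biases of one convolutional layer."""
    return kernel * kernel * in_ch * out_ch + out_ch

# Layer sizes as described for CNNIF (an SRCNN-style 9-1-5 network):
layers = [
    (9, 1, 64),   # patch extraction: W1 is 9x9x64, B1 has 64 entries
    (1, 64, 32),  # non-linear mapping: W2 is 1x1x32, B2 has 32 entries
    (5, 32, 1),   # reconstruction: W3 is 5x5x1, single-channel output
]
total = sum(srcnn_layer_params(k, i, o) for k, i, o in layers)
print(total)  # 8129 parameters in total
```

At roughly 8K parameters per model, even the full set of 12 trained models remains lightweight.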
2.3. Training Data Preparation
- The input of the network is the image consisting of integer-position pixels, denoted by Y_int, and the output is the interpolated image of fractional positions, F_h, which has the same size as the input image.
- Firstly, the training images are blurred with a low-pass filter to simulate the process of sampling.
- Then, input and labels for training are extracted as shown above.
- Input: Integer pixels (red).
- Labels: Horizontal half pixels (black), vertical half pixels (green), and diagonal half pixels (purple).
- Therefore, there are 3 network models called CNNIF_H, CNNIF_V and CNNIF_D for horizontal half pixels (black), vertical half pixels (green), and diagonal half pixels (purple) respectively.
- The interpolated pixel values are sensitive to the quantization parameter (QP). To cover different QPs, i.e. different video qualities, a separate network is trained for each of 4 QPs. Thus, 12 models are trained in total.
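The extraction of input/label pairs can be sketched as follows: from a blurred training image, every other pixel is taken as an "integer" sample, and the pixel at the matching half-pel offset becomes its ground truth. This is a hypothetical illustration of the described procedure; the function name and offsets are my own.

```python
import numpy as np

def make_training_pair(blurred, position):
    """Split a blurred training image into an integer-pel input (Y_int)
    and a half-pel label (F_h) for one of the three filter positions:
    'H' (horizontal), 'V' (vertical), or 'D' (diagonal)."""
    off = {"H": (0, 1), "V": (1, 0), "D": (1, 1)}[position]
    y_int = blurred[0::2, 0::2]               # network input Y_int
    label = blurred[off[0]::2, off[1]::2]     # ground-truth F_h
    # Crop both to a common size so input and label align exactly.
    h = min(y_int.shape[0], label.shape[0])
    w = min(y_int.shape[1], label.shape[1])
    return y_int[:h, :w], label[:h, :w]
```

Running this with 'H', 'V', and 'D' yields the three training sets for CNNIF_H, CNNIF_V, and CNNIF_D; repeating it on images compressed at each of the 4 QPs yields all 12 training sets.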
3. Experimental Results
- HM-16.7 is used for encoding under low delay P configuration.
- The proposed method achieves on average 0.9% BD-rate reduction.
- For the test sequence BQTerrace, the BD-rate reduction can be as high as 3.2%, 1.6%, 1.6% for Y, U, V components, respectively.
- By contrast, just using a pre-trained SRCNN for interpolation makes all the sequences suffer significant loss. For the test sequence BQSquare, the loss can be as high as 8.2% for the luma component.
Reference
[2017 ISCAS] [CNNIF]
A Convolutional Neural Network Approach for Half-Pel Interpolation in Video Coding
Codec Prediction
[CNNIF] [IPCNN] [IPFCN] [NNIP] [Li TCSVT’18]