Reading: ScratchCNN — Low Complexity Learned Sub-Pixel Motion Compensation (VVC Inter)
Outperforms SRCNN-Like Network, Up to 4.5% BD-Rate Reduction
In this story, Interpreting CNN for Low Complexity Learned Sub-Pixel Motion Compensation in Video Coding (ScratchCNN), by BBC Research and Development and Dublin City University, is presented. I read this because I work on video coding research. In this paper:
- A CNN is used to improve the interpolation of reference samples needed for fractional-precision motion compensation.
- The complexity of the CNN is reduced by interpreting the interpolation filters learned by the networks.
This is a paper in 2020 ICIP, which will be held in October; the authors have already made the paper available on arXiv. (Sik-Ho Tsang @ Medium)
(For background on fractional interpolation in video coding, please feel free to read CNNIF.)
Outline
- Proposed ScratchCNN
- Experimental Results
1. Proposed ScratchCNN
1.1. ScratchCNN Network Architecture
- The ScratchCNN network architecture is similar to that of SRCNN (a minimal sketch follows this list).
- It contains 64 individual 9×9 convolutional kernels in the first layer, 32 individual 1×1 kernels in the second layer, and 32 individual 5×5 kernels in the final layer.
- Residual learning is used.
- One special point is that the ReLU activations are removed.
- The biases are also removed.
- No padding is applied.
- SAD is used as the loss function.
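Based on the bullets above, here is a minimal PyTorch sketch of such a network. The layer shapes, the single-channel input, and the way the residual connection is wired (adding the cropped input back to the output) are my assumptions from the description, not the authors' code:

```python
import torch
import torch.nn as nn

class ScratchCNN(nn.Module):
    """Sketch: 3 convolutional layers, no ReLU, no bias, no padding."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=9, bias=False),
            nn.Conv2d(64, 32, kernel_size=1, bias=False),
            nn.Conv2d(32, 1, kernel_size=5, bias=False),
        )

    def forward(self, x):
        # Residual learning (assumed form): predict the difference from
        # the reference samples; crop the input to match the valid output.
        crop = 6  # valid convolutions shrink each side by (8 + 0 + 4) / 2
        return x[..., crop:-crop, crop:-crop] + self.net(x)

# SAD loss: sum of absolute differences (L1 summed, not averaged).
def sad_loss(pred, target):
    return (pred - target).abs().sum()

model = ScratchCNN()
x = torch.randn(1, 1, 32, 32)        # toy reference block
target = torch.randn(1, 1, 20, 20)   # toy ground-truth sub-pixel block
sad_loss(model(x), target).backward()
```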
1.2. ScratchCNN Simplification for Complexity Reduction
- With the ReLU activations and biases removed, every layer is linear, so the three convolutions collapse into a single non-separable 2D filter M of size 13×13 (9 + 1 + 5 − 2 = 13), obtained by convolving the learned kernels of the three layers together (summing over channels).
- Therefore, during inference only M ∗ X is computed; X does not need to go through the 3 layers. As a result, the interpolation process is sped up (see the check below).
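To make the collapse concrete, here is a small PyTorch check. The layer shapes follow the sketch above (without the residual connection); recovering M from an impulse response is just one convenient way to compute it, not necessarily how the authors derive it:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Linear 3-layer stack: no bias, no ReLU, no padding.
net = nn.Sequential(
    nn.Conv2d(1, 64, 9, bias=False),
    nn.Conv2d(64, 32, 1, bias=False),
    nn.Conv2d(32, 1, 5, bias=False),
)

with torch.no_grad():
    # Feed a centred unit impulse through the network; flipping the
    # 13x13 response gives the equivalent single filter M.
    impulse = torch.zeros(1, 1, 25, 25)
    impulse[0, 0, 12, 12] = 1.0
    M = torch.flip(net(impulse), dims=(-2, -1))  # shape (1, 1, 13, 13)

    # One convolution with M reproduces the whole 3-layer network.
    x = torch.randn(1, 1, 32, 32)
    assert torch.allclose(net(x), F.conv2d(x, M), atol=1e-5)

print(M.numel())  # 169 parameters
```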
1.3. VVC Implementation
- VTM-6.0 is used.
- To support different CU sizes and shapes, 60 networks are trained.
- The selection between the conventional VVC filters and the 13×13 filters is performed at the CU level (a hypothetical sketch of such a decision follows this list).
- The filters are applied only to luma samples.
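The story does not spell out the decision rule, but a CU-level selection of this kind typically compares the cost of each predictor and signals a flag. A minimal sketch, where the function names, the SAD-based cost, and the 1-bit flag model are all illustrative assumptions rather than the actual VTM implementation:

```python
import numpy as np

def sad(a, b):
    """Sum of absolute differences between two blocks."""
    return np.abs(a.astype(np.int64) - b.astype(np.int64)).sum()

def choose_interpolation(orig, pred_vvc, pred_cnn, lambda_rd=1.0, flag_bits=1):
    """Pick the cheaper predictor for this CU and return a 1-bit flag
    that would be signalled in the bitstream (illustrative cost model)."""
    cost_vvc = sad(orig, pred_vvc)
    cost_cnn = sad(orig, pred_cnn) + lambda_rd * flag_bits
    use_cnn = cost_cnn < cost_vvc
    return (pred_cnn if use_cnn else pred_vvc), int(use_cnn)

# Toy usage with random blocks standing in for the two predictions:
rng = np.random.default_rng(0)
orig = rng.integers(0, 256, (16, 16))
pred, flag = choose_interpolation(
    orig, rng.integers(0, 256, (16, 16)), rng.integers(0, 256, (16, 16))
)
```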
2. Experimental Results
2.1. Ablation Study
- Using ScratchCNN with SAD as the loss function and with no padding, a BD-rate reduction is obtained.
- Also, the encoding and decoding times are much lower than with SRCNN.
- Rather than integrating deep learning software within VTM, all weights and biases (8129 parameters in total) are extracted from each of the 15 trained SRCNNs and implemented in VTM as a series of matrix multiplications.
- In contrast, each trained ScratchCNN model is condensed into a single 2D matrix that contains 169 parameters (see the quick check below).
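Both parameter counts can be verified from the layer shapes given earlier (assuming single-channel input):

```python
# SRCNN-like model: weights plus biases for each of the 3 layers.
srcnn = (64 * 1 * 9 * 9 + 64) + (32 * 64 * 1 * 1 + 32) + (1 * 32 * 5 * 5 + 1)
# ScratchCNN after simplification: one 13x13 filter, no biases.
scratchcnn = 13 * 13
print(srcnn, scratchcnn)  # 8129 169
```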
2.2. BD-Rate (%)
- Using ScratchCNN, up to 4.54% BD-rate reduction can be obtained.
2.3. Hit Ratio
- ScratchCNN is selected 70% to 80% of the time, showing that it is useful.
This is the 2nd story this month.
Reference
[2020 ICIP] [ScratchCNN]
Interpreting CNN for Low Complexity Learned Sub-Pixel Motion Compensation in Video Coding
Codec Inter Prediction
H.264 [DRNFRUC & DRNWCMC]
HEVC [CNNIF] [Zhang VCIP’17] [NNIP] [GVTCNN] [Ibrahim ISM’18] [VC-LAPGAN] [VI-CNN] [CNNMCR] [FRUC+DVRF] [FRUC+DVRF+VECNN] [RSR] [Zhao ISCAS’18 & TCSVT’19] [Ma ISCAS’19] [Xia ISCAS’19] [Zhang ICIP’19] [ES] [GVCNN] [FRCNN] [Pham ACCESS’19] [CNNInvIF / InvIF] [CNN-SR & CNN-UniSR & CNN-BiSR] [DeepFrame] [U+DVPN] [Multi-Scale CNN]
VVC [FRUC+DVRF+VECNN] [ScratchCNN]