Reading: ADCNN — Attention-Based Dual-Scale CNN for VVC (Codec Filtering)

6.54%, 13.27%, and 15.72% BD-rate savings under AI, and 2.81%, 7.86%, and 8.60% BD-rate savings under RA, for Y, U, and V, respectively

In this story, Attention-Based Dual-Scale CNN (ADCNN), by Northwestern Polytechnical University and Royal Melbourne Institute of Technology, is briefly presented. I read this because I work on video coding research. In this paper:

  • An attention-based processing block is used to reduce the artifacts of I frames and B frames, taking advantage of informative priors such as the quantization parameter (QP) and partitioning information.

Outline

  1. ADCNN: Network Architecture
  2. Self-Attention Block
  3. CU Map & QP Map & Loss Function & VVC Implementation
  4. Ablation Study
  5. Experimental Results

1. ADCNN: Network Architecture

ADCNN: Network Architecture

1.1. First Stage

  • In the first stage, a dual-scale pipeline is implemented. The high-resolution branch (i.e., luma branch) takes the reconstructed luma component as its input, and the low-resolution branch (i.e., chroma branch) takes the two concatenated reconstructed chroma components as its input.
  • Each branch is processed by 4 basic blocks, with feature exchange and fusion between the two branches. (The feature maps from the luma branch are sent to the chroma branch after a 3×3 convolution (channel=16) with stride=2; meanwhile, the feature maps from the chroma branch are sent to the luma branch after a 3×3 convolution (channel=16) and upsampling.)
  • The exchanged feature maps are fused into the corresponding branch by concatenation and a 1×1 convolution (channel=64). (A sketch follows this list.)
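
A minimal PyTorch sketch of the first-stage cross-branch exchange under the figures given in the bullets above (3×3 stride-2 conv luma→chroma, 3×3 conv plus ×2 upsampling chroma→luma, then concatenation and a 1×1 fusion conv). The module names and the nearest-neighbor upsampling mode are assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class DualScaleExchange(nn.Module):
    """First-stage cross-branch feature exchange (illustrative sketch)."""
    def __init__(self, channels=64):
        super().__init__()
        # luma -> chroma: 3x3 conv, 16 channels, stride 2 (down to chroma resolution)
        self.luma_to_chroma = nn.Conv2d(channels, 16, kernel_size=3, stride=2, padding=1)
        # chroma -> luma: 3x3 conv, 16 channels, then x2 upsampling (nearest is an assumption)
        self.chroma_to_luma = nn.Sequential(
            nn.Conv2d(channels, 16, kernel_size=3, padding=1),
            nn.Upsample(scale_factor=2, mode='nearest'),
        )
        # fuse concatenated features back to 64 channels with a 1x1 conv
        self.fuse_luma = nn.Conv2d(channels + 16, channels, kernel_size=1)
        self.fuse_chroma = nn.Conv2d(channels + 16, channels, kernel_size=1)

    def forward(self, f_luma, f_chroma):
        # f_luma: (N, 64, H, W), f_chroma: (N, 64, H/2, W/2) for 4:2:0 content
        to_chroma = self.luma_to_chroma(f_luma)
        to_luma = self.chroma_to_luma(f_chroma)
        f_luma = self.fuse_luma(torch.cat([f_luma, to_luma], dim=1))
        f_chroma = self.fuse_chroma(torch.cat([f_chroma, to_chroma], dim=1))
        return f_luma, f_chroma
```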

1.2. Second Stage

  • In the second stage, the chroma branch from the 1.1. First Stage is split into two branches, one for U and another for V, giving three branches in total (Y, U, and V).
  • Each of the three branches is fused with its own Coding Unit map (CUmap) and QPmap, which will be introduced later (see the fusion sketch after this list).
  • Then 4 basic processing blocks are applied in each branch to generate the final residual image, followed by a global skip connection to the reconstructed image.
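
A rough sketch of how each second-stage branch might fuse its features with the CUmap and QPmap; concatenation followed by a 1×1 convolution is an assumption for illustration, since the fusion operator is not restated above. The branch then produces a residual that is added to the reconstructed Y, U, or V plane through the global skip connection.

```python
import torch
import torch.nn as nn

class MapFusion(nn.Module):
    """Fuse one branch's features with its CUmap and QPmap (concatenation assumed)."""
    def __init__(self, channels=64):
        super().__init__()
        # features (C) + CUmap (1) + QPmap (1) -> reduced back to C channels
        self.fuse = nn.Conv2d(channels + 2, channels, kernel_size=1)

    def forward(self, feat, cu_map, qp_map):
        # cu_map, qp_map: (N, 1, H, W), same spatial size as feat
        return self.fuse(torch.cat([feat, cu_map, qp_map], dim=1))
```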

2. Self-Attention Block

Self-Attention Block
  • Wide-activated convolutional layers (wconv): wider features before the Rectified Linear Unit (ReLU) activation have significantly better performance for single-image super-resolution and restoration.
  • The output feature maps of the first convolution layer have a wider channel dimension, r times that of the input feature maps, where r denotes the expansion factor (r = 1.5 here). Hence, the number of channels of Y1 is r×C. The output channels of the second convolution layer are then reduced back to C.
  • Spatial Attention Module: reduces the number of channels by applying two convolutional layers followed by a sigmoid activation to generate a spatial-attention map (SAmap) for every spatial pixel (a sketch of the whole block follows this list):
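
A minimal PyTorch sketch of a basic processing block, combining the wide-activated convolution (expansion factor r = 1.5) with channel attention and spatial attention (the ablation below also refers to a CAmap). The squeeze-and-excitation style channel attention, the reduction ratio, the kernel sizes inside the attention modules, and the local residual connection are assumptions for illustration, not the authors' exact design.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Wide-activated conv + channel attention + spatial attention (sketch)."""
    def __init__(self, channels=64, r=1.5):
        super().__init__()
        wide = int(channels * r)  # r x C channels before ReLU (Y1)
        self.wconv = nn.Sequential(
            nn.Conv2d(channels, wide, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(wide, channels, kernel_size=3, padding=1),  # back to C channels
        )
        # channel attention producing a per-channel CAmap (SE-style, an assumption)
        self.ca = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 4, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # spatial attention: two convs + sigmoid produce a per-pixel SAmap
        self.sa = nn.Sequential(
            nn.Conv2d(channels, channels // 4, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, 1, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        y = self.wconv(x)
        y = y * self.ca(y)   # scale each channel by its CAmap weight
        y = y * self.sa(y)   # scale each pixel by its SAmap weight
        return x + y         # local residual connection (assumption)
```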

3. CU Map & QP Map & Loss Function & VVC Implementation

3.1. CU Map

(a) Y. (b) U. (c) V. (d) CUmap for luma. (e) CUmap for chroma.
  • This is useful since blocking artifacts occur at block boundaries (an illustrative construction sketch follows).
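
The exact values used to fill the CUmap are not restated above. Purely as an illustrative sketch, one can build a map of the component's size that marks coding-unit boundary pixels, given the partitioning as a list of CU rectangles; the fill values (1 on boundaries, 0 inside) and the function name are assumptions.

```python
import numpy as np

def build_cu_map(height, width, cu_rects):
    """Mark CU boundary pixels in a map with the component's size.

    cu_rects: list of (x, y, w, h) coding-unit rectangles from the partitioning.
    The fill values (1 on boundaries, 0 inside) are an assumption for illustration.
    """
    cu_map = np.zeros((height, width), dtype=np.float32)
    for x, y, w, h in cu_rects:
        cu_map[y, x:x + w] = 1.0                             # top edge
        cu_map[min(y + h - 1, height - 1), x:x + w] = 1.0    # bottom edge
        cu_map[y:y + h, x] = 1.0                             # left edge
        cu_map[y:y + h, min(x + w - 1, width - 1)] = 1.0     # right edge
    return cu_map
```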

3.2. QP Map

  • A feature map named QPmap, with the same spatial size as the input component, is filled with the normalized QP value of the current component (see the small sketch below):
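
A small sketch of the QPmap, assuming normalization by the maximum VVC QP of 63; the exact normalization constant is not restated above, so treat it as an assumption.

```python
import numpy as np

def build_qp_map(height, width, qp, qp_max=63):
    """Constant map filled with the normalized QP of the current component (qp_max=63 assumed)."""
    return np.full((height, width), qp / qp_max, dtype=np.float32)
```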

3.3. Loss Function

  • If the MSE loss is used, frames with smaller QP (higher quality, hence smaller errors) contribute a lower proportion of the loss in a mini-batch, because squaring shrinks their already-small errors even further. Thus, MAE is used (see the snippet below):
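
In PyTorch this is simply the built-in L1 loss; the tensor shapes below are placeholders.

```python
import torch
import torch.nn as nn

criterion = nn.L1Loss()               # mean absolute error over the mini-batch
filtered = torch.rand(4, 1, 64, 64)   # network output (placeholder)
original = torch.rand(4, 1, 64, 64)   # ground-truth frame (placeholder)
loss = criterion(filtered, original)
```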

3.4. VVC Implementation

  • There are frame-level and CTU-level flags.
  • Frame-level and CTU-level RD comparisons are performed to select the best option (a simplified sketch of this selection follows the list).
  • ADCNN replaces the conventional in-loop filters DF, SAO, and ALF, rather than being built on top of them.
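
A simplified sketch of the flag-based selection, using plain squared-error distortion as a stand-in for the true rate-distortion cost and ignoring the bits needed to signal the flags; the function name and the CTU size default are assumptions for illustration.

```python
import numpy as np

def select_ctu_flags(reconstructed, filtered, original, ctu_size=128):
    """Per-CTU and frame-level on/off decision based on a squared-error proxy for RD cost."""
    h, w = original.shape
    output = reconstructed.copy()
    ctu_flags = []
    for y in range(0, h, ctu_size):
        for x in range(0, w, ctu_size):
            sl = np.s_[y:y + ctu_size, x:x + ctu_size]
            d_rec = np.sum((original[sl] - reconstructed[sl]) ** 2)
            d_cnn = np.sum((original[sl] - filtered[sl]) ** 2)
            use_cnn = d_cnn < d_rec
            ctu_flags.append(use_cnn)
            if use_cnn:
                output[sl] = filtered[sl]
    # frame-level flag: keep the CTU-selected output only if it beats no filtering overall
    frame_flag = np.sum((original - output) ** 2) < np.sum((original - reconstructed) ** 2)
    return frame_flag, ctu_flags, (output if frame_flag else reconstructed)
```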

4. Ablation Study

  • Training: the DIV2K dataset, which consists of 900 2K-resolution PNG pictures (800 images for training and 100 images for validation).
  • VTM-4.0 is used.
  • The model size is 9.11 MB.
Ablation experiments on the validation dataset.
  • Similarly for the other tests: without wconv, CAmap, SAmap, QPmap, or CUmap, the PSNR drops.
  • Finally, with all components equipped, i.e. the full ADCNN, the obtained PSNR is the highest.
The convergence curves of the different tests.
(a) Ground truth frame. (b) SAmap.
  • The scaling factors in the SAmap coincide well with the distribution area of the real compensation frame, leading to local adaptability and good filtering quality.
Different QP bands
  • The model with the QPmap adapts well to other QP values even though it is not trained with those kinds of samples, especially in the lower QP bands.

5. Experimental Results

5.1. BD-Rate Compared with VVC

Comparison with VVC under AI configuration
Comparison with VVC under RA configuration
  • With a GPU, there is no additional encoding-time cost.

5.2. Subjective Comparison

(a) Original (b) VVC without in-loop filter (c) VVC (d) Proposed ADCNN
(a) Original (b) VVC without in-loop filter (c) VVC (d) Proposed ADCNN

5.3. SOTA Comparison

Comparison with other methods under RA configuration
  • On the other hand, the proposed ADCNN completely replaces the current DBF, SAO, and ALF, while still outperforming all the compared methods, although NN-based filters commonly have higher codec complexity.

PhD, Researcher. I share what I've learnt and done. :) My LinkedIn: https://www.linkedin.com/in/sh-tsang/, My Paper Reading List: https://bit.ly/33TDhxG