Reading: ADCNN — Attention-Based Dual-Scale CNN for VVC (Codec Filtering)

6.54%, 13.27%, 15.72% BD-rate savings under AI & 2.81%, 7.86%, 8.60% BD-rate savings under RA, for Y, U, V, respectively

Sik-Ho Tsang
Jun 13, 2020 · 6 min read

In this story, Attention-Based Dual-Scale CNN (ADCNN), by Northwestern Polytechnical University and Royal Melbourne Institute of Technology, is briefly presented. I read this because I work on video coding research. In this paper:

  • A single-model solution is proposed for different QPs, different frame types, and all components (Y, U, and V).
  • An attention-based processing block is used to reduce the artifacts of I frames and B frames, taking advantage of informative priors such as the quantization parameter (QP) and partitioning information.

This is a paper in 2019 IEEE Access, an open-access journal with a high impact factor of 4.098. (Sik-Ho Tsang @ Medium)

Outline

  1. ADCNN: Network Architecture
  2. Self-Attention Block
  3. CU Map & QP Map & Loss Function & VVC Implementation
  4. Ablation Study
  5. Experimental Results

1. ADCNN: Network Architecture

ADCNN: Network Architecture

1.1. First Stage

  • In the first stage, a dual-scale pipeline is implemented. The high-resolution branch (i.e., the luma branch) takes the reconstructed luma component as input, and the low-resolution branch (i.e., the chroma branch) takes the two concatenated reconstructed chroma components as input.
  • Each branch is processed by 4 basic blocks, with feature exchange and fusion between the two branches. (The feature maps from the luma branch are sent to the chroma branch after a 3×3 convolution (channel = 16) with stride = 2; meanwhile, the feature maps from the chroma branch are sent to the luma branch after a 3×3 convolution (channel = 16) and upsampling.)
  • The exchanged feature maps are fused into the corresponding branch by concatenation and a 1×1 convolution (channel = 64), as sketched below.
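
A minimal PyTorch-style sketch of this cross-branch exchange and fusion, assuming 64-channel features in each branch, 4:2:0 chroma subsampling, and nearest-neighbour upsampling for the chroma-to-luma path (the upsampling method and the layer names are assumptions, not taken from the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossBranchFusion(nn.Module):
    """Sketch of the first-stage feature exchange between the luma and chroma branches."""

    def __init__(self, channels=64, exchange_channels=16):
        super().__init__()
        # Luma -> chroma: 3x3 conv (16 channels) with stride 2 to match the chroma resolution.
        self.luma_to_chroma = nn.Conv2d(channels, exchange_channels, 3, stride=2, padding=1)
        # Chroma -> luma: 3x3 conv (16 channels), followed by 2x upsampling in forward().
        self.chroma_to_luma = nn.Conv2d(channels, exchange_channels, 3, padding=1)
        # Fusion: concatenate the exchanged maps and reduce back to 64 channels with a 1x1 conv.
        self.fuse_luma = nn.Conv2d(channels + exchange_channels, channels, 1)
        self.fuse_chroma = nn.Conv2d(channels + exchange_channels, channels, 1)

    def forward(self, f_luma, f_chroma):
        to_chroma = self.luma_to_chroma(f_luma)                  # downsampled luma features
        to_luma = F.interpolate(self.chroma_to_luma(f_chroma),
                                scale_factor=2, mode='nearest')  # upsampled chroma features
        f_luma = self.fuse_luma(torch.cat([f_luma, to_luma], dim=1))
        f_chroma = self.fuse_chroma(torch.cat([f_chroma, to_chroma], dim=1))
        return f_luma, f_chroma
```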

1.2. Second Stage

  • In the second stage, the fused feature maps from the first stage are split into three branches: one for Y, one for U, and another for V.
  • Each of the three branches is fused with its own Coding Unit map (CUmap) and QPmap, which will be introduced later.
  • Then 4 basic processing blocks are applied to each branch to generate the final residual image, which is added to the reconstructed image through a global skip connection (see the sketch after this list).
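
A rough sketch of one second-stage component branch under the same assumptions; make_block is a stand-in factory for the basic processing block (the attention block described in Section 2), and the 1×1 fusion layer for the CUmap/QPmap is an assumption:

```python
import torch
import torch.nn as nn

class ComponentBranch(nn.Module):
    """Sketch of one second-stage branch (for Y, U, or V)."""

    def __init__(self, make_block, channels=64, num_blocks=4):
        super().__init__()
        # Fuse the incoming features with the 1-channel CUmap and 1-channel QPmap.
        self.fuse = nn.Conv2d(channels + 2, channels, 1)
        # make_block(channels) builds one basic processing block (Section 2's attention block).
        self.blocks = nn.Sequential(*[make_block(channels) for _ in range(num_blocks)])
        self.to_residual = nn.Conv2d(channels, 1, 3, padding=1)

    def forward(self, features, cu_map, qp_map, reconstructed):
        x = self.fuse(torch.cat([features, cu_map, qp_map], dim=1))
        residual = self.to_residual(self.blocks(x))
        # Global skip connection: only the compensation residual is learned.
        return reconstructed + residual
```

Three such branches (one each for Y, U, and V) would be instantiated, e.g. ComponentBranch(make_block=AttentionBlock) with the attention block sketched in the next section.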

2. Self-Attention Block

Self-Attention Block
  • Wide-activated convolutional layers, spatial attention, channel attention, and a local skip connection are used, as shown above.
  • Wide-activated convolutional layers (wconv): wider features before the Rectified Linear Unit (ReLU) activation give significantly better performance for single-image super-resolution and restoration.
  • The output feature maps of the first convolution layer have a wider channel dimension, r times that of the input feature maps, where r denotes the expansion factor (r = 1.5 here). Hence, the number of channels of Y1 is r×C. The output channels of the second convolution layer are then reduced back to C.
  • Attention operations are used to adaptively generate scaling factors at every pixel of the feature maps. Two attention modules are introduced:
  • Spatial attention module: reduces the number of channels with two convolutional layers followed by a sigmoid activation to generate a spatial-attention map (SAmap) for every spatial position.
  • Channel attention module: like SENet, it reduces the spatial size with global average pooling (GAP) and applies two fully-connected layers followed by a sigmoid activation to generate a channel-attention map (CAmap) for every channel.
  • The SAmap and the CAmap are then point-wise multiplied with the feature maps.
  • A skip connection is added directly from the input to the output so that the block learns a residual, which also contributes to fast convergence. (A minimal sketch of the whole block follows this list.)
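
A minimal PyTorch-style sketch of this attention block, assuming a channel-reduction factor of 4 inside the two attention modules and applying the SAmap and CAmap in sequence (the exact kernel sizes, reduction factor, and ordering are not restated in this summary, so treat them as placeholders):

```python
import torch.nn as nn
import torch.nn.functional as F

class AttentionBlock(nn.Module):
    """Sketch of the self-attention processing block: wconv + SAmap + CAmap + local skip."""

    def __init__(self, channels=64, r=1.5, reduction=4):
        super().__init__()
        wide = int(channels * r)
        # Wide-activated convolutions: expand to r*C channels before ReLU, then reduce to C.
        self.wconv1 = nn.Conv2d(channels, wide, 3, padding=1)
        self.wconv2 = nn.Conv2d(wide, channels, 3, padding=1)
        # Spatial attention: two convs shrinking the channel dimension, then a sigmoid
        # produces a 1-channel SAmap with one scaling factor per spatial position.
        self.sa = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels // reduction, 1, 3, padding=1), nn.Sigmoid())
        # Channel attention (SENet-style): GAP, two FC layers (as 1x1 convs), then a sigmoid
        # produces a C-channel CAmap with one scaling factor per channel.
        self.ca = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())

    def forward(self, x):
        y = self.wconv2(F.relu(self.wconv1(x)))   # Y1 has r*C channels before ReLU
        y = y * self.sa(y)                        # point-wise multiply by the SAmap
        y = y * self.ca(y)                        # scale each channel by the CAmap
        return x + y                              # local skip connection (residual learning)
```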

3. CU Map & QP Map & Loss Function & VVC Implementation

3.1. CU Map

(a) Y. (b) U. (c) V. (d) CUmap for luma. (e) CUmap for chroma.
  • A feature map named CUmap is constructed, with the positions of block boundaries filled with 1 and the other positions with 0.5, as shown above.
  • This is useful since blocking artifacts occur at block boundaries (a small construction sketch follows).
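
A small NumPy sketch of how such a map could be built, assuming the partitioning information is available as a list of CU rectangles (the helper name and rectangle format are hypothetical):

```python
import numpy as np

def build_cu_map(height, width, cu_rects):
    """Hypothetical helper: CUmap with CU-boundary pixels set to 1 and all others to 0.5.

    cu_rects is assumed to be a list of (x, y, w, h) coding-unit rectangles taken from
    the partitioning information of the current frame.
    """
    cu_map = np.full((height, width), 0.5, dtype=np.float32)
    for x, y, w, h in cu_rects:
        cu_map[y, x:x + w] = 1.0            # top boundary
        cu_map[y + h - 1, x:x + w] = 1.0    # bottom boundary
        cu_map[y:y + h, x] = 1.0            # left boundary
        cu_map[y:y + h, x + w - 1] = 1.0    # right boundary
    return cu_map
```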

3.2. QP Map

  • A feature map named QPmap, with the same size as the input component, is filled with the normalized QP value of the current component: QPmap = QP / MAXQP,
  • where MAXQP in VVC is 63. This guides the network to convert different QP values into compensation values of different amplitudes through the attention blocks.
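
A tiny NumPy sketch of the QPmap construction (build_qp_map is a hypothetical helper name):

```python
import numpy as np

def build_qp_map(height, width, qp, max_qp=63):
    """Hypothetical helper: QPmap filled with the normalized QP of the current component."""
    return np.full((height, width), qp / max_qp, dtype=np.float32)

# e.g. a component coded at QP 37 gets a map filled with 37 / 63 ≈ 0.587
qp_map = build_qp_map(64, 64, qp=37)
```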

3.3. Loss Function

  • If the MSE loss were used, frames with smaller QP would contribute a lower proportion of the loss in a mini-batch. Thus the MAE (L1) loss is used instead (sketched below).
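
A minimal PyTorch sketch of this loss, simply the L1 distance between the filtered frame and the uncompressed original (the function name is a placeholder):

```python
import torch.nn.functional as F

def adcnn_loss(filtered, original):
    """MAE (L1) loss between the CNN-filtered frame and the uncompressed original frame."""
    return F.l1_loss(filtered, original)
```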

3.4. VVC Implementation

  • There are frame-level and CTU-level flags.
  • Frame-level and CTU-level rate-distortion (RD) comparisons are performed to select the best option.
  • ADCNN replaces the conventional filters DF, SAO, and ALF; it is not built on top of them. (A hedged sketch of the flag decision follows this list.)
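
A heavily hedged sketch of one plausible reading of the flag decision; the actual interaction between the frame-level and CTU-level flags in the paper may differ, and all names here are hypothetical:

```python
def decide_adcnn_flags(ctu_rd_cost_off, ctu_rd_cost_on):
    """Hypothetical sketch: enable ADCNN only where it lowers the RD cost.

    ctu_rd_cost_off[i] / ctu_rd_cost_on[i] are the RD costs (distortion + lambda * rate,
    including flag bits) of CTU i without and with ADCNN filtering, as computed by the
    encoder; only the comparison step is sketched here.
    """
    # CTU-level flags: switch ADCNN on per CTU only when it reduces the RD cost.
    ctu_flags = [on < off for on, off in zip(ctu_rd_cost_on, ctu_rd_cost_off)]
    # Frame-level flag: disable the filter for the whole frame if no CTU benefits.
    frame_flag = any(ctu_flags)
    return frame_flag, ctu_flags
```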

4. Ablation Study

  • Training: the DIV2K dataset, which consists of 900 2K-resolution PNG pictures (800 images for training and 100 images for validation).
  • VTM-4.0 is used.
  • Model size is 9.11MB.
Ablation experiments on validation dataset.
  • Test 1: When the feature exchange between the luma branch and the chroma branch is removed, the PSNR of all three components declines.
  • Similarly for the other tests: without wconv, CAmap, SAmap, QPmap, or CUmap, the PSNR drops.
  • Finally, with all components equipped, i.e. the full ADCNN, the obtained PSNR is the highest.
The convergence curves of the different tests.
  • As shown above, ADCNN converges with the fastest speed.
(a) Ground truth frame. (b) SAmap.
  • The above shows the SAmap of the first attention block at the second stage.
  • The scaling factors in the SAmap coincide well with the distribution area of the real compensation frame, leading to local adaptability and good filtering quality.
Different QP bands
  • Only one QP band (QP in [28, 36]) of the training dataset is used to train the model, which is then tested on three other QP bands.
  • The model with the QPmap adapts well to other QP values even though it is not trained with such samples, especially on lower QP bands.

5. Experimental Results

5.1. BD-Rate Compared with VVC

Comparison with VVC under AI configuration
Comparison with VVC under RA configuration
  • 6.54%, 13.27%, 15.72% BD-rate savings under AI & 2.81%, 7.86%, 8.60% BD-rate savings under RA, are obtained for Y, U, V, respectively.
  • With a GPU, there is no additional encoding time cost.

5.2. Subjective Comparison

(a) Original (b) VVC without in-loop filter (c) VVC (d) Proposed ADCNN
(a) Original (b) VVC without in-loop filter (c) VVC (d) Proposed ADCNN
  • The proposed network can efficiently remove different kinds of artifacts and outperforms the conventional filters of VVC.

5.3. SOTA Comparison

Comparison with other methods under RA configuration
  • It should be noted that the compared filters are all hybrid: the corresponding CNN-based filter is used as an additional filter on top of the conventional ones.
  • In contrast, the proposed ADCNN completely replaces the current DBF, SAO, and ALF, while still outperforming all the compared methods, although NN-based filters commonly have higher codec complexity.

This is the 17th story in this month!
