Reading: DIA_Net — Dense Inception Attention Neural Network for In-Loop Filter (Codec Filtering)

Combining Ideas From DenseNet, GoogLeNet / Inception-v1, Residual Attention Network & SENet: 8.2% & 5.6% BD-Rate Reduction in AI & RA Configurations Respectively

Sik-Ho Tsang
5 min read · Jun 19, 2020
The Process of Channel Attention Layer

In this story, Dense Inception Attention Network (DIA_Net), by Huazhong Univ. of Sci. & Tech. and ZTE Corporation, is presented. I read this because I work on video coding research.

  • As the network name suggests, ideas of the dense block from DenseNet, the Inception block from GoogLeNet / Inception-v1, and attention layers from Residual Attention Network & SENet are combined to form a powerful network for in-loop filtering.
  • DIA_Net is embedded in the in-loop filter, placed after SAO, to further remove blocking artifacts and compensate for information lost during compression.

This is a paper in 2019 PCS. (Sik-Ho Tsang @ Medium)

Outline

  1. DIA_Net: Network Architecture
  2. Experimental Results

1. DIA_Net: Network Architecture

DIA_Net: Network Architecture

1.1. Overall Framework

  • The DIA_Net mainly includes three parts: 1. inception structure, 2. attention mechanism, 3. residual dense structure.
  • As shown in the figure above, the input frame fi is first fed to a feature extraction layer; the extracted features then pass through dense blocks, each containing several residual blocks followed in order by a spatial attention layer and a channel attention layer.
  • Finally, the features from the dense blocks are sent to a global fusion layer and combined with the input via a long skip connection.
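The overall data flow can be sketched in PyTorch as follows. This is a minimal skeleton only: the layer widths, block count, single-channel luma input, and the plain conv stand-in for the dense blocks are all assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class DIANetSketch(nn.Module):
    """Skeleton of the overall framework: feature extraction ->
    dense blocks -> global fusion -> long skip from the input."""
    def __init__(self, channels=64, num_blocks=4):
        super().__init__()
        self.feature_extract = nn.Conv2d(1, channels, 3, padding=1)
        # Placeholder for the dense residual + attention blocks described below.
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU())
             for _ in range(num_blocks)])
        # Global fusion maps the concatenated block outputs to a residual frame.
        self.global_fusion = nn.Conv2d(channels * num_blocks, 1, 1)

    def forward(self, frame):
        feat = self.feature_extract(frame)
        outputs = []
        for blk in self.blocks:
            feat = blk(feat)
            outputs.append(feat)
        fused = self.global_fusion(torch.cat(outputs, dim=1))
        return frame + fused  # long skip connection from the input
```

The long skip connection means the network only has to learn the restoration residual, which is the standard design for enhancement networks of this kind.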

1.2. Inception Structure

  • At the bottom of the above figure, the Inception structure is shown.
  • Kernels of different sizes capture information at different scales.
  • Specifically, small kernels focus more on fine details such as dense contours, while large kernels are better suited to processing coarse outlines.
  • Therefore, kernel sizes of 1×1, 3×3, 5×5, and 7×7 are used. The outputs of all convolution layers are then concatenated and sent to a local fusion layer.
  • (Please feel free to read GoogLeNet / Inception-v1.)
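A sketch of this multi-kernel structure, assuming each branch keeps the input channel width and the local fusion is a 1×1 convolution (both assumptions, since the paper's exact widths are not given here):

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Parallel 1x1 / 3x3 / 5x5 / 7x7 convolutions, concatenated,
    then fused back to the base width by a 1x1 local fusion conv."""
    def __init__(self, channels=64):
        super().__init__()
        # padding=k//2 keeps the spatial size identical across branches.
        self.branches = nn.ModuleList(
            [nn.Conv2d(channels, channels, k, padding=k // 2) for k in (1, 3, 5, 7)])
        self.local_fusion = nn.Conv2d(channels * 4, channels, 1)

    def forward(self, x):
        multi_scale = torch.cat([b(x) for b in self.branches], dim=1)
        return self.local_fusion(multi_scale)
```

Because all branches preserve the spatial resolution, concatenation along the channel dimension is well defined and the 1×1 fusion lets the network weight the scales against each other.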

1.3. Attention Mechanism

  • There are 2 parts: channel attention (CA), spatial attention (SA).
CA Layer
  • The introduction of CA is inspired by the observation that some channels in the model contribute little to the final output, while other channels have a strong impact.
  • Average pooling and max pooling are employed to extract weights for the C channels, where xc denotes the input and y the output of the CA layer; σ is the sigmoid activation, and zc, ẑc denote the outputs of average and max pooling respectively.
SA Layer
  • In the SA layer, the input is first fed through two convolution + ReLU layers, and the activated output is then element-wise multiplied with the input, where x is the input, MSA denotes the SA layer, and y the output.
  • (Please feel free to read Residual Attention Network & SENet.)
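Both attention layers can be sketched as below. The shared two-layer MLP with a reduction ratio in CA is an assumption following the SENet/CBAM convention the description resembles; the paper's exact layer shapes may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """CA layer: average- and max-pooled channel descriptors pass through a
    shared MLP; their sum is squashed by sigmoid into per-channel weights."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        # Shared two-layer MLP as 1x1 convs; the reduction ratio is an assumption.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1))

    def forward(self, x):
        z_avg = F.adaptive_avg_pool2d(x, 1)           # zc: average pooling
        z_max = F.adaptive_max_pool2d(x, 1)           # ẑc: max pooling
        w = torch.sigmoid(self.mlp(z_avg) + self.mlp(z_max))  # σ = sigmoid
        return x * w                                  # y: reweighted channels

class SpatialAttention(nn.Module):
    """SA layer: two conv+ReLU layers produce an activated map that is
    element-wise multiplied with the input (y = x ⊙ MSA(x))."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU())

    def forward(self, x):
        return x * self.body(x)
```

CA reweights *which* feature maps matter; SA reweights *where* in the frame the features matter, which suits localized compression artifacts such as block boundaries.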

1.4. Dense Residual Structure

  • Each dense residual (DR) block takes the features from the previous DR block as input and contains M residual blocks.
  • Since a fully connected dense network consumes much memory, only the outputs of the DR blocks are concatenated, rather than the output of every residual block.
  • The concatenated features are then fed to the next DR block.
  • The whole DIA_Net contains N DR blocks.
  • (Please feel free to read DenseNet.)
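The structure above can be sketched as follows. The 1×1 fusion conv that maps the growing concatenation back to the base width before the next DR block is an assumed detail needed to keep channel counts consistent; M and N are left as parameters.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Plain residual block: conv-ReLU-conv with an identity skip."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

class DRBlock(nn.Module):
    """One dense residual (DR) block: M residual blocks in sequence."""
    def __init__(self, channels, m=2):
        super().__init__()
        self.blocks = nn.Sequential(*[ResidualBlock(channels) for _ in range(m)])

    def forward(self, x):
        return self.blocks(x)

class DRStack(nn.Module):
    """N DR blocks, densely connected only at DR-block granularity."""
    def __init__(self, channels, n=3, m=2):
        super().__init__()
        self.dr_blocks = nn.ModuleList([DRBlock(channels, m) for _ in range(n)])
        # Hypothetical 1x1 fusion to restore the base width before each next DR block.
        self.fuse = nn.ModuleList(
            [nn.Conv2d(channels * (i + 2), channels, 1) for i in range(n - 1)])

    def forward(self, x):
        feat, concat = x, [x]
        for i, dr in enumerate(self.dr_blocks):
            feat = dr(feat)
            concat.append(feat)  # concatenate DR outputs, not every residual block
            if i < len(self.fuse):
                feat = self.fuse[i](torch.cat(concat, dim=1))
        return torch.cat(concat, dim=1)  # passed on to the global fusion layer
```

Concatenating only at DR granularity keeps the memory growth linear in N instead of in the total number of residual blocks, which is the trade-off the bullet points describe.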

2. Experimental Results

BD-Rate (BDBR) & Run Time
  • HM16.20 is used.
  • 29 training sequences and 1 evaluation sequence are selected.
  • One model is trained for intra frames and one for inter frames.
  • The first model, for QP 37 and I frames, is trained from scratch.
  • Models for QPs 32, 27, and 22 and for intra/inter frames are then fine-tuned from this first model.
  • The testing set is obtained from the "Picture Coding Symposium 2019 Grand Challenge on Short Video Coding", which contains 8 video sequences.
  • The reconstructed frame is first processed by DF and SAO blocks and then fed to DIA_Net for enhancement.
  • On average, the BD-rate decreases by 8.2% under the AI configuration and 5.6% under the RA configuration.
  • The encoding run time increases by 3.3× in AI and 1.26× in RA, since the network is run on CPU.
  • The decoding run time increases by 217× in AI and 131× in RA, for the same reason.
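For readers unfamiliar with the metric, BD-rate is the Bjøntegaard delta rate: fit log-bitrate as a polynomial function of PSNR for the anchor and the test codec, integrate both fits over the shared PSNR range, and convert the average log-rate gap into a percentage. A minimal NumPy sketch of the common cubic-fit formulation (this is a generic illustration, not code from the paper):

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Average bitrate difference (%) at equal quality; negative = bitrate saving."""
    # Fit log-rate as a cubic polynomial of PSNR for each codec.
    p_a = np.polyfit(psnr_anchor, np.log(rate_anchor), 3)
    p_t = np.polyfit(psnr_test, np.log(rate_test), 3)
    # Integrate both fits over the overlapping PSNR interval.
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_log_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_log_diff) - 1) * 100
```

For example, a test codec that achieves every PSNR point at 90% of the anchor's bitrate yields a BD-rate of −10%.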
Subjective Results
  • The proposed DIA Net can alleviate the blocking effect and produce images with higher fidelity.

This is the 29th story in this month!
