Review — Sun VCIP’20: Fully Neural Network Mode Based Intra Prediction of Variable Block Size (HEVC Intra)

FCN for Small Blocks, CNN for Large Blocks, Outperforms IPFCN, PNNS and PS-RNN With Lower Complexity

In this story, Fully Neural Network Mode Based Intra Prediction of Variable Block Size (Sun VCIP’20), by Waseda University, JST PRESTO, and Zhejiang University, is reviewed. In this paper:

  • For small blocks (4×4 and 8×8), fully connected networks are used, while for large blocks (16×16 and 32×32), convolutional neural networks are exploited.
  • This is the first work to explore a fully neural-network-mode (NM) based framework for intra prediction.

This is a paper in 2020 VCIP. (Sik-Ho Tsang @ Medium)

Outline

  1. Fully Connected (FC) Networks for Small Blocks 4×4 and 8×8
  2. Convolutional Neural Networks (CNN) for Large Blocks 16×16 and 32×32
  3. Coding Framework with Fully Neural Network Modes (NM)
  4. Experimental Results

1. Fully Connected (FC) Networks for Small Blocks 4×4 and 8×8

Network for blocks 4×4 and 8×8. N is the block size.
  • First, the neighboring reference blocks are flattened into a one-dimensional vector with (4N+8)×8 elements.
  • After passing through four FC layers, the one-dimensional vector is reshaped into a two-dimensional N×N block.
The number of nodes/filters is chosen based on the trade-off between coding gain (PSNR) and complexity (FLOPs):
  • A heavy baseline model with 512 nodes is trained first, and the number of nodes is then halved step by step.
  • When the number of nodes is reduced to 256 and 128, the coding loss is small.
  • However, when the dimension is further reduced to 64, there is an obvious coding loss of 0.21 dB.
  • Thus, 128 nodes are selected.
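The FC pipeline above (flatten references → four FC layers → reshape to N×N) can be sketched as follows. This is a minimal numpy forward pass, not the authors' code: the random initialization is hypothetical, and the hidden-layer nonlinearity is assumed to be a ReLU-style activation (the paper only specifies PReLU for the CNN variant).

```python
import numpy as np

def fc_intra_predict(refs, weights, N):
    """Forward pass of the small-block FC prediction network (sketch).

    refs    : flattened reference samples, shape ((4*N + 8) * 8,)
    weights : list of (W, b) pairs for the four FC layers
    N       : block size (4 or 8)
    """
    x = refs
    for i, (W, b) in enumerate(weights):
        x = W @ x + b
        if i < len(weights) - 1:      # hidden layers: ReLU-style activation (assumed)
            x = np.maximum(0.0, x)
    return x.reshape(N, N)            # reshape the 1-D output to the N x N block

# Layer widths per the paper: 128 hidden nodes, N*N output nodes
N = 4
dims = [(4 * N + 8) * 8, 128, 128, 128, N * N]
rng = np.random.default_rng(0)
weights = [(rng.standard_normal((o, i)) * 0.01, np.zeros(o))
           for i, o in zip(dims[:-1], dims[1:])]

pred = fc_intra_predict(rng.standard_normal(dims[0]), weights, N)
print(pred.shape)  # → (4, 4)
```

For N=4 the input vector has (4·4+8)×8 = 192 elements, so the 128-node hidden layers already compress the reference information before producing the 16 predicted samples.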

2. Convolutional Neural Networks (CNN) for Large Blocks 16×16 and 32×32

  • To keep the spatial information, the three above blocks and the two left blocks are sent to two separate convolutional paths.
  • In each path, down-sampling is performed to obtain the latent information, which is then flattened into a one-dimensional vector.
  • The two vectors are concatenated and then passed through an FC layer.
  • The number of output nodes of the FC layer is 1/5 of the number of input nodes.
  • Finally, the vector is reshaped into two dimensions and deconvolved (up-sampled) back to the original block size N×N.
Convolutional layer structures for 16×16 and 32×32
  • Four and five convolutional layers are used for 16×16 and 32×32, respectively.
  • PReLU is used.
  • The number of filters F is selected as 16 for both 16×16 and 32×32 as a trade-off between coding gain and complexity.
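The tensor shapes flowing through this two-path CNN can be traced as below. The region geometry (three above blocks as an N×3N strip, two left blocks as a 2N×N strip) and the per-layer 2× down-sampling are my assumptions for illustration; only F = 16, the layer counts, and the 1/5 FC ratio come from the paper.

```python
import numpy as np

def shape_trace(N, n_conv, F=16):
    """Trace tensor shapes through the large-block CNN (sketch, assumed geometry).

    Assumes the above region is N x 3N (three blocks), the left region is
    2N x N (two blocks), and each of the n_conv layers halves the spatial
    resolution while producing F feature maps.
    """
    above, left = (N, 3 * N), (2 * N, N)
    down = 2 ** n_conv                                # total down-sampling factor
    a = (F, above[0] // down, above[1] // down)       # latent above-path tensor
    l = (F, left[0] // down, left[1] // down)         # latent left-path tensor
    flat = a[0] * a[1] * a[2] + l[0] * l[1] * l[2]    # concatenated 1-D vector
    fc_out = flat // 5                                # FC output: 1/5 of input nodes
    return {"above": a, "left": l, "concat": flat,
            "fc_out": fc_out, "pred": (N, N)}         # deconv restores N x N

print(shape_trace(16, n_conv=4))   # 16x16 block: four conv layers
print(shape_trace(32, n_conv=5))   # 32x32 block: five conv layers
```

Under these assumptions both block sizes end up with an 80-element concatenated latent vector, which the FC layer compresses to 16 nodes before the deconvolution restores the N×N prediction.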

3. Coding Framework with Fully Neural Network Modes (NM)

Mode Signaling for 35 Modes
  • There are 35 NMs in total, and the best NM is selected. (Thus, 35 networks need to be trained for each block size.)

The 35 conventional intra modes are abandoned. Thus, the authors claim that this is the first work to explore a fully neural-network-mode (NM) based framework for intra prediction.

  • First, several candidate modes are selected by the sum of absolute transformed differences (SATD) cost. Eight candidates are selected for blocks 4×4 and 8×8, while three candidates are chosen for the larger blocks.
  • This is similar to the Rough Mode Decision (RMD) in the conventional HEVC intra prediction.
  • In addition to the candidate modes selected by SATD, Most Probable Modes (MPMs) are also appended in the candidate mode list.
  • (I think that the same strategy as the conventional one is used to derive MPMs.)
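The RMD-style candidate selection can be sketched as follows: compute the SATD cost of each NM's prediction with a Hadamard transform and keep the cheapest few. This is a simplified sketch; the real HEVC RMD cost also includes an estimate of the mode-signaling bits, which is omitted here.

```python
import numpy as np

def hadamard(n):
    """n x n Hadamard matrix, n a power of two (Sylvester construction)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def satd(orig, pred):
    """Sum of absolute transformed differences via the Hadamard transform."""
    H = hadamard(orig.shape[0])
    return np.abs(H @ (orig - pred) @ H.T).sum()

def rough_mode_decision(orig, predictions, n_candidates):
    """Pick the n_candidates modes with the lowest SATD cost (RMD-style sketch)."""
    costs = [(satd(orig, p), m) for m, p in enumerate(predictions)]
    costs.sort(key=lambda t: t[0])
    return [m for _, m in costs[:n_candidates]]

rng = np.random.default_rng(0)
orig = rng.standard_normal((4, 4))
preds = [rng.standard_normal((4, 4)) for _ in range(35)]   # 35 hypothetical NM outputs
cands = rough_mode_decision(orig, preds, 8)                # 8 candidates for 4x4/8x8
print(cands)
```

The selected candidates (plus the appended MPMs) would then go through full rate-distortion optimization, exactly as in the conventional HEVC mode decision.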
Training Strategy
  • The New York City Library dataset is used as the training set. Each image is encoded with four QPs (22, 27, 32, 37), and the batch size M is 16.

A baseline model based on all the training set is trained.

Then, for each mode, the samples encoded with that mode are extracted from the training set to form a smaller, mode-specific training set used for fine-tuning.

  • MSE with a weight-decay (L2 regularization) term is used as the loss function.
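The form of this loss (the equation itself is not reproduced in the text) can be sketched as MSE plus an L2 penalty on the weights; the decay coefficient `lam` here is a hypothetical value, not taken from the paper.

```python
import numpy as np

def loss(pred, target, weights, lam=1e-4):
    """MSE with an L2 weight-decay term (form assumed; lam is hypothetical)."""
    mse = np.mean((pred - target) ** 2)                     # data term
    decay = lam * sum(np.sum(W ** 2) for W in weights)      # regularization term
    return mse + decay

rng = np.random.default_rng(0)
pred, target = rng.standard_normal((4, 4)), rng.standard_normal((4, 4))
weights = [rng.standard_normal((128, 192))]                 # example FC weight matrix
print(loss(pred, target, weights))
```

The weight-decay term keeps the per-mode fine-tuned networks from over-fitting the small mode-specific training subsets.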

4. Experimental Results

4.1. BD-Rate

BD-Rate (%)
  • HM-16.9 with all intra configuration is used.
  • On average, BD-rate savings of 3.55%, 3.03% and 3.27% are achieved on Y, U and V, respectively, compared with the anchor.
  • Compared with IPFCN [4], a large BD-rate reduction is achieved for all the three channels.
BD-Rate (%) using Y
  • With the proposed model, the best coding gain among all the compared works (IPFCN [4], PNNS [5], PS-RNN [6]) is achieved on Classes B and E.
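For readers unfamiliar with the metric, BD-rate is the standard Bjøntegaard delta-rate: the average bitrate change at equal PSNR, computed from the four (rate, PSNR) points at QP 22/27/32/37. A minimal sketch of the computation (the standard method, not the paper's code; the sample numbers are made up):

```python
import numpy as np

def bd_rate(rates_anchor, psnr_anchor, rates_test, psnr_test):
    """Bjontegaard delta-rate (%): average bitrate change at equal quality."""
    lr1, lr2 = np.log(rates_anchor), np.log(rates_test)
    p1 = np.polyfit(psnr_anchor, lr1, 3)          # cubic fit: log-rate vs PSNR
    p2 = np.polyfit(psnr_test, lr2, 3)
    lo = max(min(psnr_anchor), min(psnr_test))    # overlapping PSNR interval
    hi = min(max(psnr_anchor), max(psnr_test))
    int1 = np.polyval(np.polyint(p1), hi) - np.polyval(np.polyint(p1), lo)
    int2 = np.polyval(np.polyint(p2), hi) - np.polyval(np.polyint(p2), lo)
    avg_diff = (int2 - int1) / (hi - lo)          # mean log-rate difference
    return (np.exp(avg_diff) - 1) * 100.0         # negative = bitrate saving

# Hypothetical RD points at the four QPs (illustration only)
rates_anchor = np.array([100.0, 200.0, 400.0, 800.0])
psnr_anchor = np.array([32.0, 35.0, 38.0, 41.0])
rates_test = 0.9 * rates_anchor                   # 10% cheaper at equal PSNR
print(bd_rate(rates_anchor, psnr_anchor, rates_test, psnr_anchor))  # → ≈ -10.0
```

A negative BD-rate (e.g. the paper's −3.55% on Y) thus means the codec needs that much less bitrate for the same objective quality.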

4.2. RD Comparison

R-D comparison at low bitrate (QP 37).
  • At the same PSNR, bitrate is saved compared with the anchor.
  • (However, full R-D curves are not plotted.)

4.3. Computational Complexity

  • The time is measured under the CPU platform.
  • With the proposed model, encoding and decoding times are 36× and 174× those of the anchor, respectively.
  • Compared with IPFCN [4], encoding and decoding complexity are reduced by 60% and 24%, respectively.
  • Compared with PNNS [5], encoding and decoding complexity are reduced by 29% and 9%, respectively.
  • Compared with PS-RNN [6], decoding complexity is reduced by 16%.
