Reading: Spatial RNN — CNN Guided Spatial RNN for Video Coding (HEVC Intra Prediction)

1.2% BD-Rate Reduction Compared to the Conventional HEVC

May 31, 2020

In this story, “Optimized Spatial Recurrent Network for Intra Prediction in Video Coding” (Spatial RNN), by Peking University, is briefly presented. I read this because I work on video coding research. In this paper:

A GRU is used as the spatial RNN to predict the pixels of the coding block.
A CNN is used to guide (generate) the GRU weights.

This was published at 2018 DCC as well as 2018 VCIP. (Sik-Ho Tsang @ Medium)

Outline

Reference Pixels
Spatial RNN: Network Architecture
Experimental Results

1. Reference Pixels

**Reference Pixels — Left: Conventional HEVC, Right: Spatial RNN**

In conventional HEVC, only the nearest neighbor reference pixels are used to predict the missing pixels.
In this paper, 5 blocks of pixels, as shown at the right of the figure above, are used to predict the missing pixels at the center.
In some cases, the pixels of the bottom-left or top-right blocks are not available. Then only 3 blocks, i.e. the top, left, and top-left blocks, are used.
Thus, two models are trained for the implementation: one uses 5 blocks and the other uses 3 blocks.
One additional flag is needed for each PU to signal whether the deep learning model or the original HEVC intra prediction is used.
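To make the 5-block / 3-block choice concrete, here is a minimal sketch of how the reference blocks could be gathered. The 3×3 grid layout of 8×8 cells, the function name `build_context`, and the model names are my own assumptions for illustration; they are not taken from the paper.

```python
import numpy as np

BLOCK = 8  # PU size used in the experiments of this paper

def build_context(recon, y, x, bottom_left_ok, top_right_ok):
    """Gather the reference blocks around the 8x8 block at (y, x).

    Assumed layout: a 3x3 grid of 8x8 cells with the current block in the
    middle. Cells that are not available stay zero and are masked out.
    """
    size = 3 * BLOCK
    ctx = np.zeros((size, size), dtype=np.float32)
    mask = np.zeros((size, size), dtype=bool)

    cells = [(0, 0), (0, 1), (1, 0)]          # top-left, top, left
    use_five = bottom_left_ok and top_right_ok
    if use_five:
        cells += [(0, 2), (2, 0)]             # top-right, bottom-left

    for r, c in cells:
        sy, sx = y + (r - 1) * BLOCK, x + (c - 1) * BLOCK
        rows = slice(r * BLOCK, (r + 1) * BLOCK)
        cols = slice(c * BLOCK, (c + 1) * BLOCK)
        ctx[rows, cols] = recon[sy:sy + BLOCK, sx:sx + BLOCK]
        mask[rows, cols] = True

    # Hypothetical names for the two trained models
    model = "model_5_blocks" if use_five else "model_3_blocks"
    return ctx, mask, model

# Toy frame standing in for reconstructed pixels
frame = np.arange(32 * 32, dtype=np.float32).reshape(32, 32)
ctx5, mask5, model5 = build_context(frame, 8, 8, True, True)
ctx3, mask3, model3 = build_context(frame, 8, 8, True, False)
```

If either the bottom-left or the top-right block is missing, the sketch falls back to the 3-block model, following the rule described above.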

2. Spatial RNN: Network Architecture

Guiding CNN (Top Path): We first map the input pixels to feature space with convolutional layers. As these layers are shallow and the sizes of the kernels are relatively small, global interference of the large missing area is not significant.
The CNN is used only for feature extraction rather than pixel level mapping.
Re-sampling (Bottom Path): The feature maps are re-sampled to several scales, so that the network can handle the varying content scales in videos.
After the feature maps of all scales are concatenated, the network progressively generates predictions for the feature maps.
In Spatial RNN, Gated Recurrent Units (GRUs) are used as the recurrent units.

In a conventional GRU, the parameters are learned during training and fixed after training ends. In this paper, by contrast, a guiding CNN is exploited to generate the parameters.
Finally, an element-wise max operation followed by convolutional layers is performed to predict the missing pixels at the center.
MSE is used as the loss function.
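The idea of a GRU whose parameters are supplied from outside can be sketched as below. This uses the standard GRU update (the paper's exact spatial variant may differ), and `toy_guide` is a hypothetical stand-in for the guiding CNN. Fusing two scan directions with the element-wise max is also my assumption about what is being merged.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x, h, p):
    """One standard GRU update; p holds the weight matrices for this position."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h)             # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h)             # reset gate
    h_cand = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h))  # candidate state
    return (1.0 - z) * h + z * h_cand

def toy_guide(x, base):
    """Hypothetical stand-in for the guiding CNN: rescales base weights per position."""
    scale = 1.0 + 0.1 * np.tanh(x.mean())
    return {name: w * scale for name, w in base.items()}

def scan_row(features, params_per_pos):
    """Scan one row of feature vectors (W, C), one parameter set per position."""
    h = np.zeros(features.shape[1])
    out = []
    for x, p in zip(features, params_per_pos):
        h = gru_step(x, h, p)
        out.append(h)
    return np.stack(out)

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

rng = np.random.default_rng(0)
C, W = 4, 6
names = ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh")
base = {k: 0.5 * rng.standard_normal((C, C)) for k in names}
feats = rng.standard_normal((W, C))
params = [toy_guide(f, base) for f in feats]

fwd = scan_row(feats, params)                    # left-to-right scan
bwd = scan_row(feats[::-1], params[::-1])[::-1]  # right-to-left scan
fused = np.maximum(fwd, bwd)                     # element-wise max fusion
loss = mse(fused, np.zeros_like(fused))          # toy target, illustration only
```

The only point of the sketch is that the weights passed to `gru_step` change from position to position instead of being fixed after training.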

3. Experimental Results

The training data is generated from high-resolution images provided in DIV2K [13]. The model is trained using reconstructed data.
The images are cropped and downsampled to 3 scales, namely 1792×1024, 1344×768, and 896×512.
The images are encoded using HEVC with the Quantization Parameter (QP) set to 22, 27, 32, and 37, and the reconstructed blocks in the decoding process are used to form the training pairs.
About 3,000,000 samples are randomly selected for training.
HM-16.15 is used.
Both the anchor and the proposed method only allow a CU size of 16×16, and each CU is forced to split. That is, each PU is restricted to a size of 8×8. (Does that mean the method may not work when there is no constraint on the CU sizes?)
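A rough sketch of how such training pairs could be drawn is shown below. I assume the network input is the reconstructed neighborhood and the target is the original 8×8 block. That pairing, the 24×24 context layout, and the name `sample_pairs` are my assumptions, not details confirmed by the paper.

```python
import numpy as np

BLOCK = 8  # each PU is restricted to 8x8 in the experiments

def sample_pairs(recon, orig, n, rng):
    """Randomly draw n (context, target) pairs on the 8x8 block grid.

    context: 24x24 reconstructed neighborhood with the not-yet-coded
             region (current, right, bottom, bottom-right cells) zeroed.
    target:  8x8 original block at the center (assumed training target).
    """
    h, w = recon.shape
    contexts, targets = [], []
    for _ in range(n):
        # Keep one full block of margin on every side of the current block
        by = int(rng.integers(1, h // BLOCK - 1))
        bx = int(rng.integers(1, w // BLOCK - 1))
        y, x = by * BLOCK, bx * BLOCK
        ctx = recon[y - BLOCK:y + 2 * BLOCK, x - BLOCK:x + 2 * BLOCK].copy()
        ctx[BLOCK:, BLOCK:] = 0  # hide pixels that are unavailable at decoding time
        contexts.append(ctx)
        targets.append(orig[y:y + BLOCK, x:x + BLOCK].copy())
    return np.stack(contexts), np.stack(targets)

rng = np.random.default_rng(0)
orig = rng.integers(0, 256, size=(64, 64)).astype(np.float32)  # toy "original"
recon = np.ones((64, 64), dtype=np.float32)                    # toy "reconstruction"
X, Y = sample_pairs(recon, orig, 10, rng)
```

In a real pipeline the `recon` array would come from the HM decoder output at one of the four QPs; here both arrays are toy data.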

A 1.2% BD-rate reduction is achieved compared to the conventional HEVC.
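For readers new to BD-rate, below is a minimal sketch of the usual Bjøntegaard-style calculation: fit a cubic polynomial of log-rate against PSNR for each codec, then average the gap over the overlapping PSNR range. The four rate/PSNR points are made-up numbers standing in for the four QPs; they are not the paper's results.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Average bitrate difference (%) of test vs. anchor at equal PSNR.

    Negative values mean the test codec needs fewer bits.
    """
    log_a = np.log(np.asarray(rate_anchor, dtype=float))
    log_t = np.log(np.asarray(rate_test, dtype=float))

    # Cubic fit of log-rate as a function of PSNR (4 points -> exact fit)
    poly_a = np.polyfit(psnr_anchor, log_a, 3)
    poly_t = np.polyfit(psnr_test, log_t, 3)

    # Integrate both fits over the overlapping PSNR interval
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(poly_a), hi) - np.polyval(np.polyint(poly_a), lo)
    int_t = np.polyval(np.polyint(poly_t), hi) - np.polyval(np.polyint(poly_t), lo)

    avg_log_diff = (int_t - int_a) / (hi - lo)
    return float((np.exp(avg_log_diff) - 1.0) * 100.0)

# Hypothetical points for QP 37, 32, 27, 22 (kbps, dB)
anchor_rate = [1000.0, 2000.0, 4000.0, 8000.0]
anchor_psnr = [32.0, 35.0, 38.0, 41.0]
test_rate = [0.9 * r for r in anchor_rate]  # toy codec: 10% fewer bits everywhere

saving = bd_rate(anchor_rate, anchor_psnr, test_rate, anchor_psnr)
```

With this toy data, the test codec uses 10% fewer bits at every PSNR, so `saving` comes out at about -10.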

During the days of coronavirus, the challenge of writing 30/35/40/45 stories this month has been accomplished again. This is the 47th story this month! Can I write 50 stories this month (i.e. by today)? Thanks for visiting my story.

References

[2018 DCC] [Spatial RNN]
Enhanced Intra Prediction with Recurrent Neural Network in Video Coding

[2018 VCIP] [Spatial RNN]
Optimized Spatial Recurrent Network for Intra Prediction in Video Coding

Codec Intra Prediction

HEVC [Xu VCIP’17] [Song VCIP’17] [IPCNN] [IPFCN] [CNNAC] [Li TCSVT’18] [Spatial RNN] [AP-CNN] [MIP] [Wang VCIP’19] [IntraNN] [CNNMC Yokoyama ICCE’20]
VVC [CNNIF & CNNMC] [Brand PCS’19]