Reading: PS-RNN — Progressive Spatial Recurrent Neural Network for Intra Prediction (HEVC Intra)

Using GRU, 2.65% BD-Rate Reduction, Outperforms IPFCN

Sik-Ho Tsang
6 min read · May 31, 2020

In this story, “Progressive Spatial Recurrent Neural Network for Intra Prediction” (PS-RNN), by Peking University, is presented. I read this because I work on video coding research. Previously proposed CNN- and FC-based approaches have drawbacks: they either fail to fully utilize spatial correlations, or do not handle the asymmetric shape of the context well. In this paper, PS-RNN handles the intra-prediction task as follows:

  • 2-D spatial modeling is realized by stacking spatial RNNs in two orthogonal directions.
  • Long-term and complex spatial dependency is captured by stacking hierarchical spatial RNNs.

This is a paper in 2019 TMM, a journal with a high impact factor of 5.452. (Sik-Ho Tsang @ Medium)

Outline

  1. PS-RNN: Network Architecture
  2. Experimental Results

1. PS-RNN: Network Architecture

1.1. Overall Architecture

PS-RNN: Network Architecture
  • Input: The grey area contains the missing pixels that we want to predict at the output. The area bounded by yellow lines contains the available reconstructed pixels, which are used for predicting the grey area.
  • Left — Preprocessing Convolutional Layers: In the first stage, convolutional layers extract local features of the input context block and transform the image to feature space.
  • As the pixels are filtered and abstracted into features, the network is able to reduce quantization noise in the reference pixels.
  • Though CNNs are used, the kernels in these convolutional layers are small and the layers are designed to extract local features. Thus, they are not affected by the asymmetry of the inputs, which disturbs the reconstruction process of CNN-based models.
  • The progressiveness of the PS-RNN unit helps mitigate the asymmetry problem.
  • Middle — Cascaded PS-RNN Units: In the second stage, cascaded PS-RNN units are exploited to generate the prediction of the feature vectors.
  • Right — Reconstruction Layers: At last, two convolutional layers map the predicted feature vectors back to pixels, which finally form the prediction signals.

Different from Spatial RNN, there is no guiding CNN that modulates the GRU weights.
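To make the three stages concrete, here is a minimal PyTorch sketch of the pipeline. The channel width, kernel sizes, and layer counts are my assumptions for illustration; PSRNNUnit is stubbed here and sketched more fully in Section 1.2.

```python
import torch.nn as nn

class PSRNNUnit(nn.Module):
    """Stub for one PS-RNN unit; see the fuller sketch in Section 1.2."""
    def __init__(self, channels):
        super().__init__()
    def forward(self, x):
        return x

class PSRNNIntraNet(nn.Module):
    """Three-stage PS-RNN pipeline: local feature extraction (convs),
    progressive prediction (cascaded PS-RNN units), reconstruction (convs)."""
    def __init__(self, channels=32, num_units=3):
        super().__init__()
        # Stage 1: small-kernel convs extract local features from the
        # context block and help suppress quantization noise.
        self.extract = nn.Sequential(
            nn.Conv2d(1, channels, 3, padding=1), nn.PReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.PReLU())
        # Stage 2: cascaded PS-RNN units predict the feature vectors.
        self.units = nn.Sequential(
            *[PSRNNUnit(channels) for _ in range(num_units)])
        # Stage 3: two convs map predicted features back to pixel space.
        self.reconstruct = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.PReLU(),
            nn.Conv2d(channels, 1, 3, padding=1))

    def forward(self, context):  # context: (B, 1, N, N) pixels
        return self.reconstruct(self.units(self.extract(context)))
```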

1.2. PS-RNN Units

PS-RNN Units
  • Suppose the shape of the feature tensor X is (n, n, c), with c being the number of channels.
  • It is split into a horizontal sequence Xh = {X_{0··}, X_{1··}, …, X_{(n−1)··}} and a vertical sequence Xv = {X_{·0·}, X_{·1·}, …, X_{·(n−1)·}}.
  • Each element in Xh or Xv is a feature map of shape (n, c). To conduct recurrent learning, each element of shape (n, c) in the sequence is flattened to a vector of n × c dimensions.
  • The t-th feature vector in the sequence is denoted x_t. Taking the horizontal form as an example, the definition of Xh can be simplified as Xh = {x_0, x_1, …, x_{n−1}}, which is input to the GRU:
Gated Recurrent Unit (GRU)
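(The equation image is not reproduced here; for reference, a standard GRU formulation, applied to each flattened feature vector x_t, is shown below. The paper's variant may differ in the exact choice of σ.)

```latex
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1}) \\
r_t &= \sigma(W_r x_t + U_r h_{t-1}) \\
\tilde{h}_t &= \tanh\big(W_h x_t + U_h (r_t \odot h_{t-1})\big) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
```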
  • where ⊙ denotes element-wise multiplication of tensors and σ is the non-linear activation function tanh.
  • The combined horizontal and vertical RNNs can handle complex textures.
  • The GRU includes reset and update gates, which enable it to capture both short-term and long-term spatial dependencies. Thus, it is good at modelling piecewise smooth regions and non-stationary textures.
  • By stacking multiple spatial RNNs, PS-RNN can perceive context information from a large region and is capable of predicting textures along any direction, even with rather complex structures.
  • The feature vectors are concatenated back into feature maps and merged using a convolutional layer.
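Below is a minimal PyTorch sketch of one PS-RNN unit under these definitions, replacing the stub above. The use of nn.GRU with hidden size n × c, a fixed feature-map side n, and a 1×1 merge convolution are my assumptions.

```python
import torch
import torch.nn as nn

class PSRNNUnit(nn.Module):
    """One PS-RNN unit: a GRU pass over row slices and one over column
    slices, concatenated and merged by a 1x1 convolution."""
    def __init__(self, channels=32, n=8):
        super().__init__()
        # Each sequence element is a flattened (n, c) slice -> n*c dims.
        self.gru_h = nn.GRU(n * channels, n * channels, batch_first=True)
        self.gru_v = nn.GRU(n * channels, n * channels, batch_first=True)
        self.merge = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x):  # x: (B, c, n, n)
        b, c, n, _ = x.shape
        # Horizontal form Xh = {x_0, ..., x_{n-1}}: one slice per step.
        xh = x.permute(0, 2, 3, 1).reshape(b, n, n * c)
        yh, _ = self.gru_h(xh)
        yh = yh.reshape(b, n, n, c).permute(0, 3, 1, 2)
        # Vertical form Xv: slices taken along the orthogonal direction.
        xv = x.permute(0, 3, 2, 1).reshape(b, n, n * c)
        yv, _ = self.gru_v(xv)
        yv = yv.reshape(b, n, n, c).permute(0, 3, 2, 1)
        # Concatenate both directions and merge with a 1x1 conv.
        return self.merge(torch.cat([yh, yv], dim=1))
```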

1.3. Some Other Details

  • In the implementation, 3 PS-RNN units are used.
  • For the first PS-RNN unit, 8 cells are used for the vertical and the horizontal RNNs. For the other two units, 4 cells are used in the GRU.
  • PReLU is utilized as the activation function for the convolutional layers.
  • SATD is used as the loss function, where SATD (Sum of Absolute Transformed Differences) is the SAD computed on transformed coefficients, a metric commonly used within the codec (see the sketch after this list).
BD-Rate (%) At Each Checkpoint for SATD Loss and MSE Loss
  • As seen above, using the SATD loss obtains a larger BD-rate reduction than using the MSE loss.
  • 5 blocks of pixels as shown at the right of the figure above are used to predict the missing pixels at the center.
  • For some cases, the pixels of the bottom-left or top-right blocks are not available. In this case, only 3 blocks, i.e. top, left, and top-left blocks are used.
  • Thus, similar to Spatial RNN, two models are trained: one uses 5 blocks, the other uses 3 blocks.
  • One additional flag is needed for each PU to signal whether to use the deep-learning model or the original HEVC intra prediction.
  • DIV2K is used as the training set. The images are cropped and downsampled to 3 scales, namely 1792×1024, 1344×768, and 896×512.
  • The images are encoded using HEVC (HM-16.15) with Quantization Parameter (QP) set to 22, 27, 32, and 37, and the reconstructed blocks in the decoding process are used to form the training pairs.
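As referenced in the list above, here is a minimal sketch of an SATD loss. The 8×8 Hadamard transform and the mean reduction are my assumptions; codecs typically compute SATD on 4×4 or 8×8 sub-blocks.

```python
import torch

def hadamard(n, dtype=torch.float32):
    """Build an n x n (unnormalized) Hadamard matrix; n a power of two."""
    H = torch.ones(1, 1, dtype=dtype)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], dim=1),
                       torch.cat([H, -H], dim=1)], dim=0)
    return H

def satd_loss(pred, target):
    """Mean absolute value of the Hadamard-transformed residual.
    pred, target: (B, 1, 8, 8) blocks (8x8 transform assumed)."""
    H = hadamard(pred.shape[-1]).to(pred)
    tdiff = H @ (pred - target) @ H.T  # 2-D transform: H * D * H^T
    return tdiff.abs().mean()
```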

2. Experimental Results

2.1. PU Size Set to 8×8

BD-Rate (%) on HEVC Test Sequences
  • PS-RNN-SATD: PS-RNN using the SATD loss outperforms the one using the MSE loss, i.e. PS-RNN-MSE.
  • It also outperforms the conference version of IPFCN [17], whether the SATD or the MSE loss is used.
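(Side note: the BD-rate numbers in these tables follow Bjøntegaard's method: fit log-bitrate as a cubic polynomial of PSNR for both anchor and test, then average the gap over the overlapping PSNR range. A minimal numpy sketch, with four rate/PSNR points per codec as given by QP 22/27/32/37:)

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjontegaard delta rate (%) of a test codec against an anchor."""
    # Fit 3rd-order polynomials of log-rate as a function of PSNR.
    p_a = np.polyfit(psnr_anchor, np.log(rate_anchor), 3)
    p_t = np.polyfit(psnr_test, np.log(rate_test), 3)
    # Integrate both fits over the overlapping PSNR interval.
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    # Average log-rate difference, converted to a percentage.
    return (np.exp((int_t - int_a) / (hi - lo)) - 1) * 100  # negative = saving
```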

2.2. Variable PU Size

BD-Rate (%) on HEVC Test Sequences
  • In this experiment, the scale of Coding Units (CUs) is set to be up to 32×32. The PU size ranges from 4×4 to 32×32 and is adaptively decided by RDO.
  • PS-RNN+: The base network is a PS-RNN trained with 8×8 blocks as its input. When the block size is larger than 8×8, it is downsampled to 8×8 before going into the network. After the prediction, the output of the last PS-RNN unit of the base network is up-sampled to the original size by a post-processing network (see the sketch after this list).
  • (For block sizes smaller than 8×8, the authors mention no upsampling is applied; it is not so clear whether another model is trained for 4×4.)
  • Finally, PS-RNN+ outperforms PS-RNN, achieving a 2.65% BD-rate reduction.
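A sketch of this inference path, assuming bicubic downsampling; base_net and upsampler are hypothetical handles to the trained base network and the post-processing upsampler.

```python
import torch.nn.functional as F

def psrnn_plus_predict(base_net, upsampler, context, block_size):
    """PS-RNN+ style inference sketch: blocks larger than 8x8 are
    downsampled to the base network's 8x8 training scale, predicted,
    then mapped back to full resolution by a post-processing network."""
    if block_size > 8:
        small = F.interpolate(context, scale_factor=8.0 / block_size,
                              mode='bicubic', align_corners=False)
        feats = base_net(small)   # last PS-RNN unit output at 8x8 scale
        return upsampler(feats)   # learned upsampling back to block_size
    return base_net(context)      # 8x8 blocks go straight through
```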
BD-Rate (%) on HEVC Test Sequences
  • There is also PS-RNN Full; the exact setting is also not so clear, but in the end it performs similarly to the transactions (journal) version of IPFCN-D [17] and PNNS Full [18]. (Hope I can present PNNS later.)

2.3. Visual Comparison

Visual Comparison
  • As seen above, PS-RNN can produce predicted blocks that are much more similar to the ground-truth ones.

During the days of coronavirus, A challenge of writing 30/35/40/45 stories again for this month has been accomplished. This is the 48th story in this month..!! Can I write 50 stories in this month (i.e. less than 11 hrs in my timezone)?? Thanks for visiting my story..


Sik-Ho Tsang

PhD, Researcher. I share what I learn. :) Linktree: https://linktr.ee/shtsang for Twitter, LinkedIn, etc.