Review: Brand PCS’19 — Intra Frame Prediction Using Conditional Autoencoder (VVC Intra Prediction)

Up to 0.85 dB Increase in PSNR, or More Than 2 Bits Saved per Prediction Unit at Similar PSNR

Sik-Ho Tsang
6 min read · May 6, 2020

In this story, Intra Frame Prediction for Video Coding Using a Conditional Autoencoder Approach (Brand PCS’19), by Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), is reviewed. I read this because I work on video coding research.

Previously published approaches usually add additional ANN-based modes or replace all modes by training several networks. In this paper, a single autoencoder network is used: it first compresses the original block, with the help of already transmitted pixels, into four parameters. These parameters, together with the support area, are then used to generate a prediction for the block. This way, all angular intra modes are replaced by a single ANN. This is a paper in 2019 PCS. (Sik-Ho Tsang @ Medium)

Outline

  1. Versatile Video Coding (VVC) Intra Coding
  2. Concept of Proposed Approach
  3. Proposed Conditional AutoEncoder
  4. Experimental Results

1. Versatile Video Coding (VVC) Intra Coding

67 Intra Prediction Modes in Versatile Video Coding (VVC)
  • For each CU in intra prediction, there are 67 prediction modes, as shown above, which is much more than the 35 modes in the previous coding standard, High Efficiency Video Coding (HEVC).
  • With more prediction modes, more accurate prediction can be obtained to lower the prediction error (residue) and thus reduce the coding bits.
  • However, more coding bits are needed to encode the prediction modes.
  • As long as the reduction in residue outweighs these extra bits, the overall coding scheme becomes more efficient.
  • Neighbor reference samples are used to predict the current CU.
  • 0: planar, to predict smooth gradual change within the CU.
  • 1: DC, using the average value of the reference samples to fill in the CU as prediction (see the sketch after this list).
  • 2–66: Angular, using different angles to predict the current CU.
  • (If interested, please read IPCNN for Sections 1 and 2 about the importance of video coding and the conventional HEVC intra coding.)
  • In this paper, an autoencoder is used to replace the above 67 intra prediction modes.
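
To make the mode idea concrete, below is a minimal, runnable sketch of DC prediction: the block is simply filled with the average of the already decoded neighboring reference samples. This is a simplification for illustration (real VVC additionally filters and pads reference samples); the function name and the 4×4 setup are assumptions, not from the paper.

```python
import numpy as np

def dc_prediction(top_refs: np.ndarray, left_refs: np.ndarray, n: int) -> np.ndarray:
    """DC mode (simplified): fill the n x n block with the mean of the
    neighboring reference samples from the row above and the column left."""
    dc = np.concatenate([top_refs, left_refs]).mean()
    return np.full((n, n), dc)

# Predict a 4x4 CU from already-decoded neighbors.
top = np.array([100, 102, 104, 106], dtype=float)
left = np.array([98, 99, 101, 103], dtype=float)
print(dc_prediction(top, left, 4))  # 4x4 block filled with ~101.6
```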

2. Concept of Proposed Approach

The support area
  • Let x denote a vector containing the pixels of the support area and ŷ the predicted area. In the classical mode-based intra prediction with M modes, we can obtain the prediction signal by:

ŷ = f_m(x), m ∈ {0, 1, …, M−1}

  • where f_m denotes the prediction function corresponding to mode m. The mode has to be transmitted.
  • In this paper, no mode is transmitted; instead, a four-dimensional parameter vector p, with all elements between -1 and 1, is transmitted.
  • p needs to be quantized for transmission, which yields the quantized parameter vector p̂. Then we can compute the prediction signal with a single ANN-based function f:

ŷ = f(x, p̂)
  • We transmit only the quantized parameter vector p̂. On the encoder side, we obtain p with another neural network g, which makes use of the original signal y and the support area x:

p = g(x, y)
  • The network g essentially compresses the high dimensional vector y to a low dimensional representation p before f tries to restore the original again. This structure is equivalent to an autoencoder.
  • As the parameter p is a latent variable that describes the block under the condition of the support area x, the structure can be interpreted as a conditional autoencoder (a toy end-to-end sketch follows this list).
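
To see how the pieces fit together, here is a toy, runnable sketch of the transmission chain. The functions g and f below are hypothetical stand-ins (simple closed-form formulas, not trained networks), the 64-entry codebook is random, and the block and support sizes are arbitrary; only the data flow — p = g(x, y) at the encoder, transmit the index i = Q(p), then ŷ = f(x, p̂) at the decoder — mirrors the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def g(x, y):
    """Stand-in for the encoder network g(x, y): compresses the block y,
    given the support x, into a 4-d parameter vector p in [-1, 1]."""
    feats = np.array([y.mean() - x.mean(), y.std(), y[0], y[-1]])
    return np.tanh(feats / 128.0)

def f(x, p_hat):
    """Stand-in for the decoder network f(x, p_hat): predicts the block
    from the support area and the de-quantized parameters."""
    base = np.clip(x.mean() + 128.0 * p_hat.mean(), 0, 255)
    return np.full(16, base)

codebook = rng.uniform(-1.0, 1.0, size=(64, 4))  # q = 64 codewords for p

x = rng.uniform(0, 255, size=33)  # support area (already transmitted pixels)
y = rng.uniform(0, 255, size=16)  # original 4x4 block (encoder side only)

p = g(x, y)                                       # encoder: p = g(x, y)
i = int(np.argmin(((codebook - p) ** 2).sum(1)))  # transmit only this index
p_hat = codebook[i]                               # decoder: p_hat = Q^-1(i)
y_hat = f(x, p_hat)                               # decoder: y_hat = f(x, p_hat)
```

Note that the decoder never sees y; everything it needs is in x and the transmitted index, which is exactly what makes the four parameters cheap side-information.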

3. Proposed Conditional AutoEncoder

Proposed Conditional AutoEncoder

3.1. Encoder

  • The encoder network mainly consists of fully connected layers (FCL), where N denotes the number of neurons, which decreases after each layer.
  • All layers except the last one are followed by a parametric rectified linear unit (PReLU), which is a piecewise linear function and yields good results and a stable gradient.
  • The last layer has a hyperbolic tangent (TanH) activation function to ensure that the parameters only assume values between -1 and 1 (a sketch of such an encoder follows this list).
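
A minimal PyTorch sketch of such an encoder follows. The layer widths (512 → 128 → 32) are assumptions for illustration, not the paper's exact configuration; only the pattern — FCLs of decreasing width, PReLU after every layer except the last, and a final TanH — follows the description above.

```python
import torch
import torch.nn as nn

class ParamEncoder(nn.Module):
    """Encoder g: maps (support area x, original block y) to the
    4-d parameter vector p with entries in [-1, 1]."""
    def __init__(self, support_dim: int, block_dim: int, p_dim: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(support_dim + block_dim, 512), nn.PReLU(),
            nn.Linear(512, 128), nn.PReLU(),  # N decreases after each layer
            nn.Linear(128, 32), nn.PReLU(),
            nn.Linear(32, p_dim), nn.Tanh(),  # last layer: TanH, p in [-1, 1]
        )

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # The encoder sees both the support area and the original block.
        return self.net(torch.cat([x, y], dim=-1))
```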

3.2. Quantizer

  • The vector quantizer Q quantizes the vector p and maps it to an index i, which we can transmit.
  • Q^-1 maps the index back to the quantized parameter vector p̂.
  • The LBG (Linde-Buzo-Gray) vector quantizer [19] is used (see the sketch after this list).
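
Below is a minimal sketch of an LBG-style vector quantizer: the codebook is grown by splitting each codeword into a perturbed pair and refined with k-means-style updates, and Q / Q^-1 map between p and the transmitted index. The additive split, iteration count, and function names are illustrative choices, not the paper's exact procedure.

```python
import numpy as np

def lbg_codebook(samples: np.ndarray, size: int, iters: int = 20,
                 eps: float = 0.01) -> np.ndarray:
    """LBG sketch: start from the global centroid, double the codebook by
    perturbation splits, refine each stage with k-means-style updates.
    `size` should be a power of two."""
    book = samples.mean(axis=0, keepdims=True)
    while len(book) < size:
        book = np.concatenate([book + eps, book - eps])  # split step
        for _ in range(iters):                           # refinement step
            idx = ((samples[:, None, :] - book[None, :, :]) ** 2).sum(-1).argmin(1)
            for k in range(len(book)):
                if np.any(idx == k):
                    book[k] = samples[idx == k].mean(axis=0)
    return book

def quantize(p: np.ndarray, book: np.ndarray) -> int:
    """Q: map p to the index of its nearest codeword; only this is transmitted."""
    return int(((book - p) ** 2).sum(axis=1).argmin())

def dequantize(i: int, book: np.ndarray) -> np.ndarray:
    """Q^-1: map the received index back to the quantized parameter p_hat."""
    return book[i]

# e.g., train a q = 64 codebook from encoder outputs p collected on a dataset:
# book = lbg_codebook(p_samples, size=64)
```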

3.3. Decoder

  • After de-quantization, we feed the parameter together with the support area into the decoder network.
  • This network also consists of FCLs with PReLU activation functions. All hidden layers have the same number of neurons, independent of the block size n; only the last layer, which produces the final output, has n×n neurons.
  • The number of hidden neurons is kept constant because the image is split into blocks of different sizes according to their content: blocks containing complex structure tend to be split small, so the network needs more hidden neurons (relative to the number of pixels) to cope with the complex structure.
  • The mean squared error (MSE) between the original block y and the prediction ŷ is used to train the network (a sketch follows this list):

L = (1/n²) · ‖y − f(x, g(x, y))‖²
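
A sketch of the decoder and one MSE training step is given below, under the same illustrative assumptions as before (the constant hidden width of 256 is my choice, and quantization is simply bypassed during training in this sketch).

```python
import torch
import torch.nn as nn

class BlockDecoder(nn.Module):
    """Decoder f: predicts the n x n block from the support area and the
    (de-quantized) parameter vector. All hidden layers share one width,
    independent of the block size n; only the last layer has n*n neurons."""
    def __init__(self, support_dim: int, n: int, p_dim: int = 4, width: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(support_dim + p_dim, width), nn.PReLU(),
            nn.Linear(width, width), nn.PReLU(),
            nn.Linear(width, n * n),  # only this layer depends on n
        )

    def forward(self, x: torch.Tensor, p_hat: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([x, p_hat], dim=-1))

def train_step(g: nn.Module, f: nn.Module, x, y, opt) -> float:
    """One step minimizing the MSE between the block y and f(x, g(x, y))."""
    p = g(x, y)                              # latent parameters (no quantizer)
    y_hat = f(x, p)                          # prediction of the block
    loss = nn.functional.mse_loss(y_hat, y)  # L = (1/n^2) * ||y - y_hat||^2
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```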

4. Experimental Results

  • The DIV2K dataset is used, in which images 1–800 are used for training and images 801–900 for testing.
  • Additionally, images 51–100 from the TECNICK dataset are used to train the quantizer.
  • Versatile Video Coding (VVC) Test Model: VTM-4.0.1 is used.
The average PSNR over all 100 test frames for several predictors.
  • A value of q = ∞ denotes the autoencoder without quantization.
  • The network outperforms VTM (the purple line) for all values of q > 64, for each trained qtrain.
PSNR (dB) for Different Block Sizes Against VTM
  • While the absolute PSNR decreases with larger block sizes, the gain over VTM increases.
  • This is because blocks with simple structures, which are well captured by the VTM modes, become less likely for larger blocks.
  • On the other hand, the autoencoder is capable of creating more complex structures.
Example images of blocks predicted with VTM and autoencoders with different q
  • (a): shows one block from the frame that yields the worst results in all of the above tests. The network is apparently not able to recreate sharp, straight edges, which is exactly the strength of angular prediction.
  • (b): The autoencoder is able to predict changes that are not apparent from the support area, such as the darker area in the lower right corner. VTM is not able to do that.
  • (c): While VTM predicts the upper part of the edge accurately, the lower part deviates strongly from the original. The autoencoder predicts the edge less sharply, but follows its shape better.
Prediction PSNR over Side-Information for different predictors
  • For qtrain = 64, only a little side-information is saved but the PSNR is increased by up to 0.85 dB.
  • On the other hand, with qtrain = 16, a comparable PSNR is achieved while saving more than 2 bits per block.

During the days of coronavirus, let me take on the challenge of writing 30 stories again this month. Is it good? This is the 7th story this month. Thanks for visiting my story!
