Review: NNIP — Neural Network based Inter Prediction for HEVC (HEVC Inter Prediction)
Average 1.7% (up to 8.6%) BD-rate reduction in the low delay P test condition, compared with HM 16.9
In this story, Neural Network based Inter Prediction (NNIP), by Harbin Institute of Technology and Peking University, is briefly reviewed. I review this because I work on video coding research. To understand this story, it is better to have some background in video coding, especially HEVC inter coding. This is a paper in 2018 ICME. (Sik-Ho Tsang @ Medium)
- Conventional Inter Coding at Encoder
- Neural Network based Inter Prediction (NNIP)
- Experimental Results
1. Conventional Inter Coding at Encoder
- First, motion estimation and compensation are performed to find the best-matched block in the previous frame for the current block. This matched block is called the predicted block P.
- The difference between this predicted block P and the original block C is then computed; it is called the residue (as shown in the figure). The residue should be small if a very similar block can be found in the previous frame.
- This residue is then transformed and quantized according to what video quality or bitrate we want.
- Finally, the quantized residue is entropy coded and sent to the decoder side.
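The steps above can be sketched with a toy example, assuming a simple full-search motion estimation with a SAD metric (the search function, block size, and quantization step here are illustrative simplifications, not HM's actual algorithms):

```python
import numpy as np

def motion_search(cur_block, ref_frame, y, x, search_range=4):
    """Full search: find the best-matching block in the reference frame (SAD metric)."""
    h, w = cur_block.shape
    best_sad, best_mv = np.inf, (0, 0)
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            ry, rx = y + dy, x + dx
            if ry < 0 or rx < 0 or ry + h > ref_frame.shape[0] or rx + w > ref_frame.shape[1]:
                continue
            cand = ref_frame[ry:ry + h, rx:rx + w]
            sad = np.abs(cur_block - cand).sum()
            if sad < best_sad:
                best_sad, best_mv = sad, (dy, dx)
    return best_mv

# Toy 8x8 example: the current frame is the reference shifted by (1, 2)
rng = np.random.default_rng(0)
ref = rng.integers(0, 256, (32, 32))
cur = np.roll(ref, shift=(1, 2), axis=(0, 1))
y, x = 8, 8
C = cur[y:y + 8, x:x + 8]                       # original current block C
dy, dx = motion_search(C, ref, y, x)
P = ref[y + dy:y + dy + 8, x + dx:x + dx + 8]   # predicted block P
residue = C - P                                 # residue sent to transform + quantization

# Scalar quantization sketch: a coarser step means lower bitrate and lower quality
q_step = 10
quantized = np.round(residue / q_step).astype(int)
```

Here the motion search recovers the exact shift, so the residue is all zeros; on real video the match is only approximate and the residue carries the remaining signal.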
2. Neural Network based Inter Prediction (NNIP)
2.1. Placement of NNIP
- However, the residue obtained by conventional inter prediction may not be small enough.
- The authors proposed NNIP to further reduce the residue.
- NNIP is placed after inter prediction and before the transform and quantization steps.
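Conceptually, the placement can be sketched as follows; `nnip_refine` is just a placeholder standing in for the network, and the real encoder also runs rate-distortion checks around this step:

```python
import numpy as np

def nnip_refine(P, S, T):
    """Placeholder for the NNIP network; here it returns P unchanged."""
    return P

def encode_block(C, P, S, T, use_nnip=True):
    """NNIP sits between inter prediction and transform/quantization:
    it refines the predicted block P before the residue is formed."""
    if use_nnip:
        P = nnip_refine(P, S, T)
    residue = C - P          # this (hopefully smaller) residue goes on to transform + quantization
    return residue

C = np.full((8, 8), 100.0)   # original block
P = np.full((8, 8), 98.0)    # inter-predicted block
S = T = np.zeros(4 * 8)      # neighboring pixels (shapes are placeholders)
res = encode_block(C, P, S, T)
```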
2.2. Network Architecture
- To do this, the neighboring pixels T of the predicted block P and the neighboring pixels S of the original current block C are used, as shown above.
- First, the neighboring pixels S and T are input to a fully connected network (FCN) to obtain a feature map which contains the relation information between P and C.
- There are d = 4 layers in the FCN. The dimension of each hidden layer is twice that of the input layer. The number of external lines L is set to 4.
- The feature map is then added to P, and the result is input into a convolutional neural network (CNN) to obtain a smaller residue.
- VRCNN network architecture is used here. (Please read my review on VRCNN if interested.)
- This residue is then transformed, quantized and entropy coded as usual.
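A rough shape-level sketch of the FCN branch, assuming the neighbors of P and C are flattened and concatenated; the exact input layout, activation, layer-counting convention, and weight scale are assumptions, and the VRCNN-style CNN stage is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 8, 4                      # CU size 8x8, L = 4 external reference lines
n_in = 2 * (2 * L * N + L * L)   # neighbors of both P (T) and C (S); layout is an assumption
x = rng.standard_normal(n_in)    # flattened neighboring pixels S and T

# d = 4 layers; each hidden layer is twice the input width (per the paper)
dims = [n_in, 2 * n_in, 2 * n_in, N * N]
h = x
for i in range(len(dims) - 1):
    W = rng.standard_normal((dims[i + 1], dims[i])) * 0.01
    h = np.maximum(W @ h, 0)     # ReLU (activation choice is an assumption)

feat = h.reshape(N, N)           # feature map carrying the relation between P and C
P = rng.standard_normal((N, N))  # conventional inter-predicted block
refined = P + feat               # added to P, then fed to a VRCNN-style CNN (omitted here)
```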
2.3. Training
- To generate the training data, the authors first compress three video sequences from the HEVC common test conditions (BasketballDrive, BQMall, and BlowingBubbles) using HM 16.9 with the low delay P (LDP) configuration.
- All frames of these three sequences are encoded with different quantization parameters (QP = 22, 27, 32, and 37).
- The loss function used is the mean squared error (MSE) between the network output and the original block: MSE = (1/N) Σᵢ (xᵢ − x̂ᵢ)², where x is the original block and x̂ is the network output.
- Different models are trained for different sizes of CU, which varies from 8×8 to 64×64.
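A minimal sketch of this training setup, with an MSE loss and one model slot per CU size; whether a separate model is also kept per QP is my assumption:

```python
import numpy as np

def mse_loss(pred, target):
    """Mean squared error between the network output and the original block."""
    return np.mean((pred - target) ** 2)

# Separate models per CU size (8x8 .. 64x64); a per-QP split is an assumption
cu_sizes = [8, 16, 32, 64]
qps = [22, 27, 32, 37]
models = {(size, qp): None for size in cu_sizes for qp in qps}  # placeholder model store

# Toy loss evaluation on a random 8x8 block with a constant 0.5 error everywhere
rng = np.random.default_rng(0)
pred = rng.standard_normal((8, 8))
loss = mse_loss(pred, pred + 0.5)   # constant error of 0.5 -> MSE of 0.25
```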
3. Experimental Results
3.1. BD-Rate
- The average coding gain is about 1.7% (up to 8.6%) for the luma component, which demonstrates the efficiency of the proposed NNIP method.
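The 1.7% figure is a Bjøntegaard delta rate (BD-rate). A common way to compute it, sketched here with NumPy: fit a cubic polynomial of log-rate versus PSNR for each codec, then integrate the gap over the overlapping PSNR range:

```python
import numpy as np

def bd_rate(rate_ref, psnr_ref, rate_test, psnr_test):
    """Bjontegaard delta rate: average bitrate difference (%) at equal PSNR.
    Negative values mean the test codec saves bitrate."""
    lr_ref, lr_test = np.log(rate_ref), np.log(rate_test)
    # fit log(rate) as a cubic function of PSNR for both codecs
    p_ref = np.polyfit(psnr_ref, lr_ref, 3)
    p_test = np.polyfit(psnr_test, lr_test, 3)
    lo = max(min(psnr_ref), min(psnr_test))
    hi = min(max(psnr_ref), max(psnr_test))
    # integrate both fits over the overlapping PSNR range
    int_ref = np.polyval(np.polyint(p_ref), hi) - np.polyval(np.polyint(p_ref), lo)
    int_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)
    avg_diff = (int_test - int_ref) / (hi - lo)
    return (np.exp(avg_diff) - 1) * 100

# Toy example: the test codec needs 2% less rate at every quality point
rates = np.array([100.0, 300.0, 900.0, 2700.0])
psnrs = np.array([30.0, 33.0, 36.0, 39.0])
saving = bd_rate(rates, psnrs, rates * 0.98, psnrs)   # ≈ -2.0
```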
3.2. Visual Quality
- With NNIP, PSNR is increased and the visual quality is enhanced.
3.3. Complexity
- The increase in encoding time is about 3444% on average.
- The increase in decoding time is about 2022% on average.
- The high complexity mainly comes from two reasons. The first is that rate-distortion optimization must be performed for all CU sizes and every inter prediction mode.
- The second is the forward operation of the proposed network.