Review: IPFCN — Intra Prediction Using Fully Connected Network (HEVC Intra Prediction)
Deep Learning Based Intra Prediction, Outperforming HEVC Intra Prediction
--
In this paper, Intra Prediction using Fully Connected Network (IPFCN), by Peking University and Microsoft Research Asia (MSRA), is briefly reviewed. I review this because I work on video coding research. The proposed IPFCN was first published at 2017 ICIP. The authors then enhanced IPFCN into two variants, IPFCN-S and IPFCN-D, published in 2018 TIP, where TIP has a high impact factor of 6.79. (Sik-Ho Tsang @ Medium)
Outline
- Conventional HEVC Intra Prediction
- IPFCN (2017 ICIP)
- Enhanced IPFCN (2018 TIP)
1. Conventional HEVC Intra Prediction
1.1. HEVC Video Coding
- A video is composed of a sequence of frames. In HEVC, each frame is encoded one by one.
- A frame is divided into non-overlapping blocks called Coding Tree Units (CTUs). Each CTU has a size of 64×64. CTUs are encoded from the top left to the bottom right in raster scan order.
- For each CTU, quad-tree coding is applied to recursively divide the CTU into four smaller square Coding Units (CUs), from 64×64 through 32×32 and 16×16 down to 8×8. By comparing the costs of the CUs at each level, different sizes of CUs are chosen to encode each CTU (a toy sketch of this recursive decision follows this list).
- (An 8×8 CU can be further divided into four 4×4 Prediction Units (PUs), but this is not the focus of this story.)
- Each CU is encoded by different approaches, such as intra prediction and inter prediction.
- In this paper, authors focus on intra prediction only.
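To make the quad-tree split decision concrete, below is a toy Python sketch of the recursive split-vs-no-split comparison. This is my own illustration, not HM reference code, and rd_cost is a hypothetical placeholder for the encoder's rate-distortion evaluation.

```python
def rd_cost(x, y, size):
    """Hypothetical placeholder for the RD cost of coding the size x size block at (x, y)."""
    return float(size * size)  # a real encoder measures rate + lambda * distortion

def split_cu(x, y, size, min_size=8):
    """Return the leaf CUs chosen by the quad-tree decision and their total cost."""
    whole_cost = rd_cost(x, y, size)
    if size == min_size:                  # smallest CU: no further split allowed
        return [(x, y, size)], whole_cost
    half = size // 2
    leaves, split_cost = [], 0.0
    for dx, dy in ((0, 0), (half, 0), (0, half), (half, half)):
        sub_leaves, sub_cost = split_cu(x + dx, y + dy, half, min_size)
        leaves += sub_leaves
        split_cost += sub_cost
    if split_cost < whole_cost:           # keep the split only if it is cheaper
        return leaves, split_cost
    return [(x, y, size)], whole_cost

cu_leaves, total_cost = split_cu(0, 0, 64)  # one 64x64 CTU
print(len(cu_leaves), "leaf CU(s), total cost", total_cost)
```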
1.2. 35 Intra Predictions in HEVC
- For each CU in intra prediction, there are 35 prediction modes, as shown above.
- Neighboring reference samples are used to predict the current CU.
- 0: planar, to predict smooth gradual change within the CU.
- 1: DC, using the average value to fill in the CU as prediction.
- 2–34: Angular, using different angles to predict the current CU.
- Some examples are shown at the right of the figure. A toy sketch of the DC mode follows this list.
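As a concrete example of the non-angular modes, here is a minimal NumPy sketch of DC prediction (mode 1); the function and variable names are mine, and HEVC's boundary smoothing and unavailable-reference substitution are omitted.

```python
import numpy as np

def dc_prediction(above, left, n):
    """Simplified HEVC DC mode (mode 1): fill the n x n block with the average
    of the n above and n left neighboring reference samples."""
    # HEVC computes (sum_above + sum_left + n) >> (log2(n) + 1) in integer math
    dc = (int(above[:n].sum()) + int(left[:n].sum()) + n) >> (int(np.log2(n)) + 1)
    return np.full((n, n), dc, dtype=np.int32)

above = np.array([100, 101, 102, 103, 104, 105, 106, 107])  # 2n references above
left = np.array([90, 91, 92, 93, 94, 95, 96, 97])           # 2n references to the left
print(dc_prediction(above, left, 4))  # a 4x4 block filled with the DC value 97
```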
2. IPFCN (2017 ICIP)
2.1. Network Architecture
- The idea of IPFCN is to feed the neighboring reference samples (Orange) — an L×L top-left corner region plus 2N×L samples above and L×2N samples to the left of the current block — into the neural network, and output the N×N predicted samples.
- The neural network is composed of fully connected (FC) layers only, i.e., it is a multi-layer perceptron (MLP).
- PReLU (Parametric Rectified Linear Unit) is used as activation function.
- The mean squared error (MSE) between the predicted samples and the original samples is used as the loss function; a minimal sketch of this architecture follows.
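Below is a minimal PyTorch sketch of such a network, under my own assumptions about the sizes (N = 8 output block, L = 8 reference lines, 128-dimensional hidden layers as validated next); the class and variable names are mine.

```python
import torch
import torch.nn as nn

class IPFCN(nn.Module):
    """Sketch of the fully connected intra predictor: flattened neighboring
    reference samples in, flattened N x N predicted block out."""
    def __init__(self, n=8, l=8, hidden=128, num_layers=3):
        super().__init__()
        in_dim = l * l + 2 * (2 * n * l)  # L x L corner + 2N x L above + L x 2N left
        layers = []
        for _ in range(num_layers):       # hidden FC layers with PReLU activations
            layers += [nn.Linear(in_dim, hidden), nn.PReLU()]
            in_dim = hidden
        layers.append(nn.Linear(in_dim, n * n))  # output layer: the predicted samples
        self.net = nn.Sequential(*layers)

    def forward(self, refs):
        return self.net(refs)

model = IPFCN()
loss_fn = nn.MSELoss()                       # MSE between prediction and original block
refs = torch.randn(16, 8 * 8 + 4 * 8 * 8)    # a dummy batch of flattened references
target = torch.randn(16, 64)                 # the corresponding original 8 x 8 blocks
loss_fn(model(refs), target).backward()
```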
2.2. Validation
- All IPFCNs are trained on the same training data set, which is extracted from Netflix sequences.
- The validation set consists of BasketballDrill, FourPeople, BQSquare, ParkScene, and Traffic.
- To determine how many layers (i.e., how deep the network should be), the validation set is used:
- The 3-layer model outperforms the 2-layer model by a relatively large margin. However, deeper models cannot further improve the performance.
- The 8-layer model even suffers a performance loss.
- To determine how many dimensions are used, the validation set is used again:
- Finally, the 3-layer IPFCN with 128 dimensions is used.
- A binary flag is transmitted to the decoder to indicate whether IPFCN or the conventional HEVC intra prediction is used, as sketched below.
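The flag can be pictured as the outcome of an encoder-side rate-distortion comparison; the sketch below is illustrative only, with hypothetical stub costs rather than HM code.

```python
def rd_cost_hevc_intra(cu):
    return 100.0  # stub: best RD cost over the 35 conventional intra modes

def rd_cost_ipfcn(cu):
    return 90.0   # stub: RD cost when the block is predicted by the network

def choose_intra_predictor(cu, bitstream):
    """Pick the cheaper predictor and signal the choice with one binary flag."""
    use_ipfcn = rd_cost_ipfcn(cu) < rd_cost_hevc_intra(cu)
    bitstream.append(int(use_ipfcn))  # one flag per CU, entropy-coded in practice
    return "IPFCN" if use_ipfcn else "HEVC intra"

bits = []
print(choose_intra_predictor(cu=None, bitstream=bits), bits)  # -> IPFCN [1]
```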
2.3. Results
- The proposed IPFCN achieves an average of 1.1% bitrate saving on the luma component.
- The maximum saving is 3.3%, on the Tango sequence.
- At the same time, the two chroma components each obtain a 1.6% bitrate saving.
- For encoding and decoding time, the proposed method brings an additional 48% and 190% cost, respectively. This mainly comes from the forward computation of IPFCN.
- The network parameters are in floating-point precision, which is not computationally friendly for video coding.
3. Enhanced IPFCN (2018 TIP)
3.1. Major Differences from IPFCN (2017 ICIP)
- In contrast to the conference version, the MSE loss function is used together with a regularization term on the network parameters to reduce overfitting.
- Also, there are two IPFCN variants: IPFCN-S and IPFCN-D.
- IPFCN-S: Single model, just like the conference version.
- IPFCN-D: Dual models. The training data is classified into two groups. One group contains blocks coded with the angular modes (2–34), i.e., directional blocks. The other contains blocks coded with the non-angular modes (DC and planar), i.e., homogeneous blocks. The two groups of training data have different attributes (see the grouping sketch after this list).
- By training on the two groups separately, more specialized and accurate models can be obtained, which is expected to bring further bitrate reduction.
- Thus, for IPFCN-D, one more bin is introduced to indicate which of the dual IPFCN models is used.
- In HEVC, the CU size varies from 8×8 to 64×64. Considering that the 64×64 CU is seldom chosen in intra coding, the proposed IPFCN is not enabled for 64×64 CUs.
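A minimal sketch of how the IPFCN-D training-data split could look; the mode numbering follows HEVC (0 = planar, 1 = DC, 2–34 = angular), while the sample layout (a dict with a best_mode field) is my own convention for illustration.

```python
ANGULAR_MODES = range(2, 35)  # HEVC angular intra modes 2..34

def split_training_data(samples):
    """Split blocks into the two IPFCN-D training groups by their best
    conventional HEVC intra mode: angular -> directional, DC/planar -> homogeneous."""
    directional, homogeneous = [], []
    for s in samples:
        (directional if s["best_mode"] in ANGULAR_MODES else homogeneous).append(s)
    return directional, homogeneous

samples = [{"best_mode": 0}, {"best_mode": 10}, {"best_mode": 1}, {"best_mode": 26}]
d, h = split_training_data(samples)
print(len(d), "directional,", len(h), "homogeneous")  # -> 2 directional, 2 homogeneous
```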
3.2. Validation
- Left: The 4-layer model is found to have the lowest loss, so the 4-layer model is chosen.
- Right: Taking 8×8 blocks as an example, the result of the 1024-dimensional model gets very close to that of the 2048-dimensional model. For this reason, 1024 dimensions are used.
- The dimensions are set to 512, 1024, 1024, and 2048 for the 4×4, 8×8, 16×16, and 32×32 block sizes, respectively, as configured in the sketch below.
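These per-size widths map naturally to one network per block size. The dictionary below uses the dimensions from the text; constructing the models by reusing the IPFCN class sketched in Section 2.1 (and setting L = N) is my own assumption.

```python
# One 4-layer network per block size, with the hidden widths validated above.
HIDDEN_DIM = {4: 512, 8: 1024, 16: 1024, 32: 2048}

# Assumes the IPFCN class from the Section 2.1 sketch is in scope; L = N is assumed.
models = {n: IPFCN(n=n, l=n, hidden=HIDDEN_DIM[n], num_layers=4) for n in HIDDEN_DIM}
```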
3.3. Results
3.3.1 BD-rate
- On average, IPFCN-D outperforms IPFCN-S for all three Y, U and V components.
- For both models, larger BD-rate reduction can be obtained for large QPs.
3.3.2 Complexity
- The test is done on an Intel Xeon E7-4870 CPU.
- “L” denotes a light model using fewer dimensions within the network: 64, 128, 128, and 256 for the 4×4 to 32×32 block sizes, respectively.
- The encoding time of IPFCN-S-L is about 3 times the HEVC anchor, and the decoding time is about 8 times.
- At the same time, IPFCN-S-L still achieves 2.3% bitrate reduction on average, and 3.2% bitrate reduction on 4K sequences.
3.3.3 Some Qualitative Results
- As can be observed, the network is clearly capable of producing more accurate prediction when handling these complex blocks.
- Irregular shapes can also be predicted.
3.3.4 Some Analyses
- IPFCN-D can reduce 4.7% prediction error for 8×8 block.
- Similarly, the prediction error of 16×16 block decreases by 4.4%.
- The percentages of IPFCN-D usage in the Rollercoaster sequence are 78%, 69%, and 78% for 8×8, 16×16, and 32×32 CUs, respectively.
- The other sequences also have remarkable numbers. These results verify the effectiveness of the proposed network.
References
[2017 ICIP] [IPFCN]
Intra Prediction Using Fully Connected Network for Video Coding
[2018 TIP] [IPFCN-S/IPFCN-D]
Fully Connected Network-Based Intra Prediction for Image Coding