Review: IPFCN — Intra Prediction Using Fully Connected Network (HEVC Intra Prediction)

Deep Learning Based Intra Prediction That Outperforms HEVC Intra Prediction

Sik-Ho Tsang
6 min read · Apr 11, 2020

In this paper, Intra Prediction using Fully Connected Network (IPFCN), by Peking University and Microsoft Research Asia (MSRA), is briefly reviewed. I am reviewing this because I work on video coding research. IPFCN was first published at 2017 ICIP. The authors then enhanced it into IPFCN-S and IPFCN-D, published in 2018 TIP, a journal with a high impact factor of 6.79. (Sik-Ho Tsang @ Medium)

Outline

  1. Conventional HEVC Intra Prediction
  2. IPFCN (2017 ICIP)
  3. Enhanced IPFCN (2018 TIP)

1. Conventional HEVC Intra Prediction

1.1. HEVC Video Coding

  • A video is composed of a sequence of frames. In HEVC, frames are encoded one by one.
  • A frame is divided into non-overlapping blocks called Coding Tree Units (CTUs). Each CTU has a size of 64×64. CTUs are encoded from top-left to bottom-right in raster scan order.
Figure: Quad-Tree Coding
  • For each CTU, quad-tree coding is applied: a CTU can be recursively divided into 4 smaller square coding units (CUs), from 64×64 through 32×32 and 16×16 down to 8×8. By comparing the costs at each CU level, different sizes of CUs are chosen to encode each CTU (a toy sketch of this recursive split decision follows this list).
  • (An 8×8 CU can be further divided into four 4×4 Prediction Units (PUs), but this is not the focus of this story.)
  • Each CU is encoded by different approaches, such as intra prediction and inter prediction.
  • In this paper, authors focus on intra prediction only.
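To make the recursive split decision concrete, here is a minimal toy sketch of quad-tree partitioning. It is an illustration, not HM encoder code: `rd_cost(x, y, size)` is a hypothetical stand-in for the encoder's rate-distortion cost of coding one CU as a whole.

```python
# Toy sketch of HEVC-style quad-tree CU partitioning (not actual HM code).
# rd_cost(x, y, size) is a hypothetical callback returning the RD cost of
# coding the CU at (x, y) with the given size, without splitting it.

MIN_CU = 8  # smallest CU size considered here (4x4 PUs are out of scope)

def best_partition(x, y, size, rd_cost):
    """Return (cost, tree) for the CU at (x, y); tree is either a leaf
    (x, y, size) or a list of four child trees."""
    whole_cost = rd_cost(x, y, size)
    if size == MIN_CU:
        return whole_cost, (x, y, size)
    half = size // 2
    split_cost, children = 0.0, []
    for dx, dy in ((0, 0), (half, 0), (0, half), (half, half)):
        child_cost, child_tree = best_partition(x + dx, y + dy, half, rd_cost)
        split_cost += child_cost
        children.append(child_tree)
    # Keep the four-way split only if it is cheaper than coding the CU whole.
    if split_cost < whole_cost:
        return split_cost, children
    return whole_cost, (x, y, size)

# Partitioning one 64x64 CTU: best_partition(0, 0, 64, rd_cost)
```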

1.2. 35 Intra Predictions in HEVC

Figure: 35 Intra Predictions in HEVC (Left), Some Examples (Right)
  • For each CU in intra prediction, there are 35 prediction modes, as shown above.
  • Neighboring reference samples are used to predict the current CU.
  • 0: Planar, to predict a smooth, gradual change within the CU.
  • 1: DC, using the average value of the reference samples to fill the CU as prediction.
  • 2–34: Angular, using different angles to predict the current CU.
  • Some examples are shown at the right of the figure; a simplified sketch of the two non-angular modes follows this list.
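For concreteness, here is a minimal NumPy sketch of the two non-angular modes. It is simplified relative to the HEVC specification: reference-sample availability, substitution, and boundary smoothing are all ignored.

```python
import numpy as np

def dc_prediction(top, left):
    """DC mode (simplified): fill the N x N block with the average of the
    N top and N left neighboring reference samples."""
    n = len(top)
    dc = (int(top.sum()) + int(left.sum()) + n) >> (int(np.log2(n)) + 1)
    return np.full((n, n), dc, dtype=np.int32)

def planar_prediction(top, left, top_right, bottom_left):
    """Planar mode (simplified): average of a horizontal interpolation
    (towards the top-right reference sample) and a vertical interpolation
    (towards the bottom-left reference sample)."""
    n = len(top)
    shift = int(np.log2(n)) + 1
    pred = np.empty((n, n), dtype=np.int32)
    for y in range(n):
        for x in range(n):
            hor = (n - 1 - x) * left[y] + (x + 1) * top_right
            ver = (n - 1 - y) * top[x] + (y + 1) * bottom_left
            pred[y, x] = (hor + ver + n) >> shift
    return pred
```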

2. IPFCN (2017 ICIP)

2.1. Network Architecture

Figure: Intra Prediction using Fully Connected Network (IPFCN)
  • The idea of IPFCN is to feed the neighboring reconstructed reference samples (orange) into the neural network and output the N×N predicted samples. With L reference lines, the context consists of L×L + 2N×L + 2N×L samples: L rows above, L columns to the left, and the L×L top-left corner.
  • The neural network is composed of fully connected (FC) layers only, i.e., a multi-layer perceptron (MLP).
  • PReLU (Parametric Rectified Linear Unit) is used as the activation function.
  • The mean squared error (MSE) loss function is used for training; a sketch of the network and loss follows this list.
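Below is a minimal PyTorch sketch of such a predictor. The 3-layer, 128-dimensional configuration follows the validation result reported in the next subsection; the exact input size, the layer counting (three Linear layers here), and the training setup are my assumptions for illustration, not the authors' released code. In symbols, the loss is presumably the plain MSE between the predicted and the original block samples, L(θ) = (1/m) Σᵢ ‖f(xᵢ; θ) − yᵢ‖².

```python
import torch
import torch.nn as nn

class IPFCN(nn.Module):
    """Sketch of the fully connected intra predictor.
    context_dim: number of flattened neighboring reference samples;
    n: block size, so the output has n*n predicted samples."""
    def __init__(self, context_dim, n=8, hidden=128, num_layers=3):
        super().__init__()
        layers, in_dim = [], context_dim
        for _ in range(num_layers - 1):
            layers += [nn.Linear(in_dim, hidden), nn.PReLU()]
            in_dim = hidden
        layers.append(nn.Linear(in_dim, n * n))  # final projection to the block
        self.net = nn.Sequential(*layers)
        self.n = n

    def forward(self, context):  # context: (batch, context_dim)
        return self.net(context).view(-1, self.n, self.n)

# Training with plain MSE between prediction and the original block:
# model = IPFCN(context_dim=..., n=8)
# loss = nn.MSELoss()(model(context), original_block)
```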

2.2. Validation

  • All IPFCNs are trained on the same training data set, which is extracted from Netflix sequences.
  • The validation set consists of BasketballDrill, FourPeople, BQSquare, ParkScene, and Traffic.
  • To determine how many layers (how deep) to use, the validation set is consulted:
Figure: Different number of layers
  • The 3-layer model outperforms the 2-layer model by a relatively large margin. However, deeper models cannot further improve the performance.
  • The 8-layer model even suffers a performance loss.
  • To determine how many dimensions (neurons) to use, the validation set is consulted again:
Figure: Different dimensions (neurons) with 3-layer models
  • Finally, a 3-layer IPFCN with 128 dimensions is used.
  • A binary flag is transmitted to the decoder to indicate whether IPFCN or the conventional HEVC intra prediction is used (a toy sketch of this encoder-side decision follows).
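Conceptually, the encoder performs a rate-distortion comparison and then signals one bin. The sketch below is hypothetical pseudocode for that decision; `rd_cost_hevc_intra` and `rd_cost_ipfcn` are stand-ins for the encoder's actual RD evaluations.

```python
# Hypothetical encoder-side decision between IPFCN and the 35 HEVC modes.
# Both rd_cost_* callbacks are stand-ins for real RD evaluations.

def choose_intra_tool(block, context, rd_cost_hevc_intra, rd_cost_ipfcn):
    hevc_cost, best_mode = rd_cost_hevc_intra(block)  # best of the 35 modes
    ipfcn_cost = rd_cost_ipfcn(block, context)        # network-based prediction
    use_ipfcn = ipfcn_cost < hevc_cost
    # What reaches the decoder: the one-bin use_ipfcn flag; the HEVC mode
    # index is only signalled when the flag is 0.
    return use_ipfcn, None if use_ipfcn else best_mode
```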

2.3. Results

Table: BD-rate (%) using HM-16.9 as anchor for each sequence
Table: Average BD-rate (%)
  • The proposed IPFCN achieves an average of 1.1% bitrate saving on the luma component.
  • The maximum saving is 3.3%, for Tango.
  • At the same time, the two chroma components both have 1.6% bitrate saving.
  • For encoding and decoding, the proposed method brings additional time costs of 48% and 190%, respectively. This mainly comes from the forward computation of IPFCN.
  • The parameters are in floating-point precision, which is not computationally friendly for video coding.

3. Enhanced IPFCN (2018 TIP)

3.1. Major Differences from IPFCN (2017 ICIP)

  • In contrast to the conference version, the MSE loss function is combined with a regularization term to reduce overfitting; a sketch of this loss follows this list.
  • Also, there are two IPFCN variants: IPFCN-S and IPFCN-D.
  • IPFCN-S: Single model, just like the conference version.
  • IPFCN-D: Dual models. The training data is classified into two groups. One group contains blocks coded with the angular modes (modes 2–34), i.e., directional blocks. The other group contains blocks coded with the non-angular modes (DC and planar), i.e., homogeneous blocks. The two groups of training data have different characteristics.
  • By training on the two groups separately, more accurate models can be obtained, and more bitrate reduction is expected (a sketch of this grouping also follows this list).
  • Thus, for IPFCN-D, one more bin is introduced to indicate which of the dual IPFCN models is used.
  • In HEVC, the CU size varies from 8×8 to 64×64. Considering that 64×64 CUs are seldom chosen in intra coding, the proposed IPFCN is not enabled for 64×64 CUs.
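The regularized loss appears as an image in the original article and is not reproduced here; a typical form, assuming an L2 weight-decay regularizer (the paper's exact term may differ), is:

```latex
% MSE plus an L2 weight-decay term; lambda and the exact regularizer are assumptions
L(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left\| f(x_i;\theta) - y_i \right\|_2^2
          + \lambda \left\| \theta \right\|_2^2
```

And here is a minimal sketch of how the training data could be split for the dual models, keyed on the HEVC mode chosen for each block (the data layout is my assumption):

```python
# Hypothetical split of training samples for IPFCN-D: blocks coded with
# angular modes (2-34) train the directional model; blocks coded with
# planar (0) or DC (1) train the homogeneous model.

ANGULAR_MODES = frozenset(range(2, 35))

def split_training_data(samples):
    """samples: iterable of (context, block, hevc_mode) triples."""
    directional, homogeneous = [], []
    for context, block, mode in samples:
        group = directional if mode in ANGULAR_MODES else homogeneous
        group.append((context, block))
    return directional, homogeneous
```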

3.2. Validation

Figure: Different number of layers (Left); Different dimensions (neurons) with 4-layer models for 8×8 blocks (Right)
  • Left: the 4-layer model is found to have the lowest loss, so 4 layers are chosen.
  • Right: as an example for 8×8 blocks, the result of the 1024-dimensional model gets very close to that of the 2048-dimensional model, so 1024 dimensions are used.
  • Overall, the dimensions are set as 512, 1024, 1024, and 2048 for the 4×4, 8×8, 16×16, and 32×32 block sizes, respectively (collected in code form below).
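Restating those choices in code form (the dictionary layout is mine; the numbers are from the paper, including the light "-L" dimensions reported in the complexity section below):

```python
# Hidden-layer dimensions per block size in the TIP version
# (IPFCN is disabled for 64x64 CUs).
HIDDEN_DIM = {4: 512, 8: 1024, 16: 1024, 32: 2048}

# The light ("-L") variants use smaller dimensions:
LIGHT_HIDDEN_DIM = {4: 64, 8: 128, 16: 128, 32: 256}
```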

3.3. Results

3.3.1 BD-rate

Table: BD-rate (%) using HM-16.9 as anchor for each sequence
  • On average, IPFCN-D outperforms IPFCN-S for all three Y, U and V components.
Table: BD-rate (%) for different ranges of QPs
  • For both models, larger BD-rate reduction can be obtained for large QPs.

3.3.2 Complexity

Table: Complexity for Different Models
  • The test is done with an Intel Xeon E7-4870 CPU.
  • “L” means a light model using fewer dimensions within the network. The dimensions are set as 64, 128, 128, and 256 for the 4×4, 8×8, 16×16, and 32×32 block sizes, respectively.
  • The encoding time of IPFCN-S-L is about 3 times the HEVC anchor, and the decoding time is about 8 times.
  • At the same time, IPFCN-S-L still achieves 2.3% bitrate reduction on average, and 3.2% bitrate reduction on 4K sequences.

3.3.3 Some qualitative results

  • As can be observed, the network is clearly capable of producing more accurate predictions when handling these complex blocks.
  • Irregular shapes can also be predicted.

3.3.4. Some Analyses

Table: Prediction Error Reduction (%)
  • IPFCN-D reduces the prediction error by 4.7% for 8×8 blocks.
  • Similarly, the prediction error of 16×16 blocks decreases by 4.4%.
Figure: The percentage distribution of CU modes for the sequence Rollercoaster, with the IPFCN-D model
  • The percentages of IPFCN-D usage in the Rollercoaster sequence are 78%, 69%, and 78% for 8×8, 16×16, and 32×32 CUs, respectively.
  • The other sequences also show remarkable usage percentages. These results verify the effectiveness of the proposed network.
