Review: IPFCN — Intra Prediction Using Fully Connected Network (HEVC Intra Prediction)

Deep Learning Based Intra Prediction, Outperforms HEVC Intra Prediction

In this paper, Intra Prediction using Fully Connected Network (IPFCN), by Peking University and Microsoft Research Asia (MSRA), is briefly reviewed. I review this because I work on video coding research. The proposed IPFCN was first published at 2017 ICIP. The authors then enhanced IPFCN into IPFCN-S and IPFCN-D, which were published in 2018 TIP, a journal with a high impact factor of 6.79. (Sik-Ho Tsang @ Medium)

Outline

  1. Conventional HEVC Intra Prediction
  2. IPFCN (2017 ICIP)
  3. Enhanced IPFCN (2018 TIP)

1. Conventional HEVC Intra Prediction

1.1. HEVC Video Coding

A video is composed of a sequence of frames
  • In HEVC, each frame of the video is encoded one by one.
  • A frame is divided into non-overlapping blocks, called Coding Tree Units (CTUs). Each CTU has the size of 64×64. CTUs are encoded from top left to bottom right using raster scan order.
Quad-Tree Coding
  • For each CTU, quad-tree coding is applied to recursively divide the CTU into four smaller square Coding Units (CUs), from 64×64, 32×32, and 16×16 down to 8×8. By comparing the cost of CUs at each CU level, different sizes of CUs are chosen to encode each CTU.
  • (An 8×8 CU can be further divided into four 4×4 Prediction Units (PUs), but this is not the focus of this story.)
  • Each CU is encoded by different approaches, such as intra prediction and inter prediction.
  • In this paper, authors focus on intra prediction only.
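The recursive quad-tree split decision described above can be sketched as follows. This is a toy illustration, not encoder code: `rd_cost` is a stub standing in for the encoder's real rate-distortion cost, which would measure distortion plus λ·rate for the chosen coding mode.

```python
# Toy sketch of HEVC's recursive quad-tree CU split decision.
MIN_CU = 8  # smallest CU size considered here

def rd_cost(x, y, size):
    """Stand-in for the real rate-distortion cost of coding one CU whole.
    A real encoder would evaluate distortion + lambda * rate here."""
    return float(size)  # placeholder cost

def best_split(x, y, size):
    """Return (cost, tree): code the CU whole, or split it into four
    quadrants recursively, whichever is cheaper."""
    whole = rd_cost(x, y, size)
    if size == MIN_CU:
        return whole, (x, y, size)
    half = size // 2
    sub = [best_split(x + dx, y + dy, half)
           for dx in (0, half) for dy in (0, half)]
    split_cost = sum(c for c, _ in sub)
    if split_cost < whole:
        return split_cost, [t for _, t in sub]
    return whole, (x, y, size)

cost, tree = best_split(0, 0, 64)  # decide the partition of one 64x64 CTU
```

With the placeholder cost, splitting never pays off, so the whole 64×64 CU is kept; with a real cost function, detailed regions would recurse down to smaller CUs.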

1.2. 35 Intra Predictions in HEVC

35 Intra Predictions in HEVC (Left), Some Examples (Right)
  • For each CU coded with intra prediction, there are 35 prediction modes, as shown above.
  • Neighboring reference samples are used to predict the current CU.
  • 0: planar, to predict smooth gradual change within the CU.
  • 1: DC, using the average value to fill in the CU as prediction.
  • 2–34: Angular, using different angles to predict the current CU.
  • Some examples are shown at the right of the figure.
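Two of these modes are simple enough to sketch directly. This is a toy illustration that omits HEVC's reference-sample filtering and fractional-sample interpolation; mode 10 is the pure horizontal angular mode, which needs no interpolation.

```python
import numpy as np

def dc_predict(top, left):
    """Mode 1 (DC): fill the block with the mean of the neighboring
    reference samples above (top) and to the left (left)."""
    n = len(top)
    val = int(round((top.sum() + left.sum()) / (2 * n)))
    return np.full((n, n), val, dtype=top.dtype)

def horizontal_predict(left):
    """Mode 10 (horizontal angular): copy each left reference sample
    across its entire row."""
    n = len(left)
    return np.tile(left.reshape(n, 1), (1, n))

top = np.array([100, 102, 104, 106])   # reference row above the block
left = np.array([100, 110, 120, 130])  # reference column left of the block
pred_dc = dc_predict(top, left)        # flat block at the mean value
pred_h = horizontal_predict(left)      # horizontal stripes
```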

2. IPFCN (2017 ICIP)

2.1. Network Architecture

Intra Prediction using Fully Connected Network (IPFCN)
  • The idea of IPFCN is to feed the L×L + 2N + 2N neighboring reference samples (orange) into the neural network and output the N×N predicted samples.
  • The neural network is composed of fully connected (FC) layers only, i.e., it is a multi-layer perceptron (MLP).
  • PReLU (Parametric Rectified Linear Unit) is used as the activation function.
  • The mean squared error (MSE) loss function is used for training.
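The architecture can be sketched with plain numpy as below. This is a minimal sketch, not the authors' code: the input size of 136, the random weight initialization, and the fixed PReLU slope are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8                                      # predicted block is N x N
in_dim, hidden, out_dim = 136, 128, N * N  # 136 is an illustrative input size

def prelu(x, a=0.25):
    """Parametric ReLU: identity for x > 0, slope a otherwise.
    (In training, a is a learned parameter per layer.)"""
    return np.where(x > 0, x, a * x)

# Three FC layers with 128 hidden dimensions, as chosen on the validation set.
W1 = rng.normal(0, 0.1, (in_dim, hidden)); b1 = np.zeros(hidden)
W2 = rng.normal(0, 0.1, (hidden, hidden)); b2 = np.zeros(hidden)
W3 = rng.normal(0, 0.1, (hidden, out_dim)); b3 = np.zeros(out_dim)

def ipfcn_forward(refs):
    """Map flattened neighboring reference samples to an N x N prediction."""
    h = prelu(refs @ W1 + b1)
    h = prelu(h @ W2 + b2)
    return (h @ W3 + b3).reshape(N, N)

def mse_loss(pred, target):
    """MSE between the predicted block and the original block."""
    return np.mean((pred - target) ** 2)

refs = rng.uniform(0, 255, in_dim)  # dummy reference samples
pred = ipfcn_forward(refs)
```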

2.2. Validation

  • All IPFCNs are trained on the same training set, which is extracted from Netflix sequences.
  • The validation set consists of BasketballDrill, FourPeople, BQSquare, ParkScene, and Traffic.
  • To determine how many layers (how deep) should be used, the validation set is used:
Different number of layers
  • The 3-layer model outperforms the 2-layer model by a relatively large margin. However, deeper models cannot further improve the performance.
  • The 8-layer model even shows a performance loss.
  • To determine how many dimensions (neurons) should be used, the validation set is again used:
Different dimensions (neurons) with 3-layer models
  • Finally, the 3-layer, 128-dimensional IPFCN is used.
  • A binary flag is transmitted to the decoder to indicate whether IPFCN or conventional HEVC intra prediction is used.

2.3. Results

BD-rate (%) using HM-16.9 as anchor for each sequence
Average BD-rate (%)
  • The proposed IPFCN can achieve an average of 1.1% bitrate saving on luma component.
  • The maximum one is 3.3% for Tango.
  • At the same time, the two chroma components both have 1.6% bitrate saving.
  • For encoding and decoding time, the proposed method brings an additional 48% and 190% cost, respectively. This mainly comes from the forward computation of IPFCN.
  • The network parameters are stored in floating-point precision, which is not computationally friendly for video coding.

3. Enhanced IPFCN (2018 TIP)

3.1. Major Differences from IPFCN (2017 ICIP)

  • In contrast to the conference version, the MSE loss function is combined with a regularization term to reduce overfitting.
  • Also, there are two IPFCN variants: IPFCN-S and IPFCN-D.
  • IPFCN-S: Single model, just like the conference version.
  • IPFCN-D: Dual models. The training data is classified into two groups. One group contains blocks coded with the angular modes (modes 2–34), i.e., directional blocks. The other group contains blocks coded with the non-angular modes (DC and planar), i.e., homogeneous blocks. The two groups of training data have different attributes.
  • By training on the two groups of data separately, more accurate models can be obtained, which is expected to bring more bitrate reduction.
  • Thus, for IPFCN-D, one more bin is introduced to indicate the dual IPFCN models.
  • In HEVC, the CU size varies from 8×8 to 64×64. Since the 64×64 CU is seldom chosen in intra coding, the proposed IPFCN is not enabled for 64×64 CUs.
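The grouping of training data for IPFCN-D can be sketched as follows. The `(block, mode)` sample format is a hypothetical simplification for illustration.

```python
# Sketch of splitting training data by the HEVC intra mode chosen for each
# block: angular modes 2-34 train the directional model; planar (0) and
# DC (1) train the model for homogeneous blocks.

def split_training_data(samples):
    """samples: iterable of (block, hevc_mode) pairs (hypothetical format)."""
    directional, homogeneous = [], []
    for block, mode in samples:
        if 2 <= mode <= 34:        # angular modes -> directional group
            directional.append(block)
        else:                      # mode 0 (planar) or 1 (DC) -> homogeneous
            homogeneous.append(block)
    return directional, homogeneous

samples = [("b0", 0), ("b1", 26), ("b2", 1), ("b3", 10)]
directional, homogeneous = split_training_data(samples)
```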

3.2. Validation

Different number of layers (Left), Different dimensions (neurons) with 4-layer models for 8×8 blocks (Right)
  • Left: The 4-layer model is found to have the lowest loss, so the 4-layer model is chosen.
  • Right: As an example for 8×8 blocks, the result of the 1024-dimensional model gets very close to that of the 2048-dimensional model. For this reason, 1024 dimensions are used.
  • The dimensions are set to 512, 1024, 1024, and 2048 for the 4×4, 8×8, 16×16, and 32×32 block sizes, respectively.
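These per-block-size settings amount to a simple lookup table. The names below are illustrative; the smaller dimensions of the light "-L" variants reported later in the complexity results are included for comparison.

```python
# Hidden-layer dimension per block size in the enhanced IPFCN (2018 TIP).
# 64x64 CUs are excluded because IPFCN is not enabled for them.
HIDDEN_DIM = {4: 512, 8: 1024, 16: 1024, 32: 2048}

# The light "-L" variants shrink the dimensions to reduce complexity.
HIDDEN_DIM_LIGHT = {4: 64, 8: 128, 16: 128, 32: 256}

def hidden_dim(block_size, light=False):
    """Look up the hidden-layer dimension for a given block size."""
    table = HIDDEN_DIM_LIGHT if light else HIDDEN_DIM
    return table[block_size]
```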

3.3. Results

3.3.1 BD-rate

BD-rate (%) using HM-16.9 as anchor for each sequence
  • On average, IPFCN-D outperforms IPFCN-S for all three Y, U and V components.
BD-rate (%) for different ranges of QPs
  • For both models, larger BD-rate reduction can be obtained for large QPs.

3.3.2 Complexity

Complexity for Different Models
  • The test is done on an Intel Xeon E7-4870 CPU.
  • “L” means a light model using fewer dimensions within the network. The dimensions are set to 64, 128, 128, and 256 for the 4×4, 8×8, 16×16, and 32×32 block sizes, respectively.
  • The encoding time of IPFCN-S-L is about 3 times the HEVC anchor, and the decoding time is about 8 times.
  • At the same time, IPFCN-S-L still achieves 2.3% bitrate reduction on average, and 3.2% bitrate reduction on 4K sequences.

3.3.3 Some qualitative results

  • As can be observed, the network is clearly capable of producing more accurate prediction when handling these complex blocks.
  • Irregular shapes can also be predicted.

3.3.4. Some Analyses

Prediction Error Reduction (%)
  • IPFCN-D can reduce the prediction error by 4.7% for 8×8 blocks.
  • Similarly, the prediction error of 16×16 blocks decreases by 4.4%.
The percentage distribution of CU mode for sequence Rollercoaster. The model is IPFCN-D.
  • The percentages of CUs using IPFCN-D in the Rollercoaster sequence are 78%, 69%, and 78% for 8×8, 16×16, and 32×32 CUs, respectively.
  • The other sequences also show remarkable percentages. These results verify the effectiveness of the proposed network.
