Reading: Kim TCSVT’19 — Fast CU Depth Decision Using Neural Network (Fast HEVC Prediction)
LeNet-Like Architecture, 61.77% Average Time Reduction, 3.91% Increase in BD-Rate, Outperforms Liu ISCAS’16
In this story, “Fast CU Depth Decision for HEVC Using Neural Networks” (Kim TCSVT’19), by Yonsei University, is presented. I read this because I work on video coding research. In this paper:
- A LeNet-like architecture is used, which consists of convolution and pooling layers for analyzing the image properties of the CU.
- The resulting feature map is then concatenated with the vector data at the fully connected layers in order to analyze the encoding properties of the CU.
This is a paper in 2019 TCSVT, a journal with a high impact factor of 4.046. (Sik-Ho Tsang @ Medium)
Outline
- Database Construction
- Network Architecture
- HEVC Implementation
- Experimental Results
1. Database Construction
- Training set and test set are non-overlapping.
- A total of 240k samples are used for training, with 10k training samples for each sequence.
- The image data for training and testing is converted into the lightning memory-mapped database (LMDB) format, which is more efficient in memory usage.
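As a rough illustration of this data pipeline (not the authors' code; the key format and serialization are my own assumptions), a minimal Python sketch of writing (image patch, vector, label) samples into LMDB could look like this:

```python
import pickle
import lmdb
import numpy as np

# Minimal sketch, assuming a pickle-based serialization and integer keys
# (the paper only states that LMDB is used, not the exact storage layout).
env = lmdb.open("cu_train_lmdb", map_size=1 << 32)  # ~4 GB map size

def write_sample(txn, idx, cu_patch, vec, label):
    """Store one sample: CU luma patch, 5-integer vector, SPLIT/NON-SPLIT label."""
    payload = pickle.dumps({"image": cu_patch.astype(np.uint8),
                            "vector": np.asarray(vec, dtype=np.int32),
                            "label": int(label)})            # 0 = NON-SPLIT, 1 = SPLIT
    txn.put(f"{idx:08d}".encode(), payload)

with env.begin(write=True) as txn:
    dummy_patch = np.zeros((64, 64), dtype=np.uint8)         # e.g. a 64x64 luma CU
    dummy_vec = [0, 12, 30, 12, 30]                          # PU mode + duplicated angles
    write_sample(txn, 0, dummy_patch, dummy_vec, label=0)
```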
1.1. Image Data
- The image data is stored as NON-SPLIT if the corresponding CU is not divided into a lower depth, based on the encoding information.
- If the CU is divided even once, its image data is stored as SPLIT.
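A one-line way to express this labeling rule (my own reading; best_depth would come from the encoder's final partitioning decision):

```python
def cu_label(current_depth: int, best_depth: int) -> int:
    """1 = SPLIT if the encoder's final partitioning goes deeper than this CU,
    otherwise 0 = NON-SPLIT."""
    return 1 if best_depth > current_depth else 0
```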
1.2. Vector Data
- The vector data is a vector of 5 integers for both INTRA and INTER CUs.
- In the case of an INTRA CU with NON-SPLIT data, the PU mode in Vec[0], the luminance angle in Vec[1,3], and the chrominance angle in Vec[2,4] are used as the vector data.
- In order to match the size of the vector, the luminance and chrominance angles are each stored twice.
- In the case of an INTER CU, the PU mode in Vec[0] and the bidirectional MVs in Vec[1–4] are stored as the vector data.
- If the MV exists in only one direction, the horizontal MV in Vec[1,3] and the vertical MV in Vec[2,4] are stored twice. Otherwise, with bidirectional MVs, the horizontal and vertical components of the first MV are stored in Vec[1,2], and the components of the second MV in Vec[3,4] (see the sketch below).
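To make the Vec[0..4] layout concrete, here is a minimal sketch of how the 5-integer vectors could be assembled (the exact numerical encoding of PU modes, angles, and MVs is an assumption):

```python
def intra_vector(pu_mode, luma_angle, chroma_angle):
    # Vec[0] = PU mode; the luma angle is duplicated in Vec[1], Vec[3]
    # and the chroma angle in Vec[2], Vec[4] to keep the vector length at 5.
    return [pu_mode, luma_angle, chroma_angle, luma_angle, chroma_angle]

def inter_vector(pu_mode, mv0, mv1=None):
    # mv0 / mv1: (horizontal, vertical) motion vectors of the two directions.
    if mv1 is None:                          # unidirectional: store the single MV twice
        h, v = mv0
        return [pu_mode, h, v, h, v]
    (h0, v0), (h1, v1) = mv0, mv1            # bidirectional: first MV in Vec[1,2], second in Vec[3,4]
    return [pu_mode, h0, v0, h1, v1]
```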
2. Network Architecture
- A LeNet-like architecture is used, with the additional vector data input at the fully connected layer.
- After the convolutional layer, a max-pooling layer is implemented to reduce the number of nodes by eliminating ambiguous local data. The convolutional layer and the max-pooling layer are repeated in order to obtain sufficient feature maps and reduce the number of nodes for image data.
- At the end of the convolution and max-pooling of the image data, the vector data is concatenated and the fully connected layers are applied.
- After the PU search for the current CU (search for the prediction mode and angles or MVs) is completed, the corresponding information is used as the input for the inference.
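For intuition, a minimal PyTorch sketch of such a network is shown below; the layer counts, channel widths, and the 64×64 input size are assumptions rather than the paper's exact configuration (the LMDB pipeline suggests the authors actually worked in Caffe):

```python
import torch
import torch.nn as nn

class CUNet(nn.Module):
    """LeNet-like CNN on the CU luma patch, with the 5-integer vector
    concatenated to the flattened feature map before the FC layers."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(                  # conv + max-pool, repeated
            nn.Conv2d(1, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),                            # 64x64 -> 32x32
            nn.Conv2d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),                            # 32x32 -> 16x16
        )
        self.classifier = nn.Sequential(                # FC layers on [feature map, vector]
            nn.Linear(32 * 16 * 16 + 5, 64), nn.ReLU(),
            nn.Linear(64, 2),                           # SPLIT vs. NON-SPLIT
        )

    def forward(self, cu_patch, vec):
        x = self.features(cu_patch).flatten(1)          # (N, 32*16*16)
        x = torch.cat([x, vec], dim=1)                  # append the 5-integer vector
        return self.classifier(x)

# Example: one 64x64 CU patch and its vector data.
logits = CUNet()(torch.zeros(1, 1, 64, 64), torch.zeros(1, 5))
```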
3. HEVC Implementation
- When the network inference is executed, it is performed in parallel using general-purpose computing on graphics processing units (GPGPU).
- In the meantime, the encoding part performs entropy coding concurrently on the CPU for the already compressed CTUs.
- (It is unclear whether the authors use this parallelism when comparing with SOTA approaches. If so, the comparison may be somewhat unfair.)
- If the inference result is determined to be NON-SPLIT, no further operation is performed at the lower depths.
- However, when it is determined as SPLIT, the encoder moves on to the lower depth (see the sketch after this list).
- HM-15.0 is used under RA and LD configurations.
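The sketch below shows one way the inference could steer the recursive depth search; all helper names are hypothetical stand-ins for HM-15.0 internals, not the authors' implementation:

```python
import random

# Hypothetical stand-ins for HM-15.0 internals, just so the sketch runs.
def run_pu_search(cu):      pass                     # PU search fills modes/angles or MVs
def extract_inputs(cu):     return None, None        # CU image patch + 5-integer vector
def net_infer(patch, vec):  return random.random()   # SPLIT probability from the network
def encode_cu(cu, depth):   print(f"encode CU at depth {depth}")
def split_into_four(cu):    return [cu] * 4          # the four sub-CUs of the next depth

def compress_cu(cu, depth=0, max_depth=3):
    """Recursive CU depth decision driven by the network inference."""
    run_pu_search(cu)                        # the inference input is ready after the PU search
    p_split = net_infer(*extract_inputs(cu))
    if p_split < 0.5 or depth == max_depth:
        encode_cu(cu, depth)                 # NON-SPLIT: no further work at lower depths
        return
    for sub_cu in split_into_four(cu):       # SPLIT: move on to the lower depth
        compress_cu(sub_cu, depth + 1, max_depth)

compress_cu(cu=object())
```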
4. Experimental Results
4.1. BD-Rate
- Several versions of the proposed approach are evaluated.
- Proposaldef: The proposed neural-network inference is used to determine all CU depths; this is the most aggressive version, in order to maximize the encoding speed-up.
- Proposald0d1: The inference decision is applied only at the 64×64 and 32×32 depths; similarly, Proposald0 applies it only at the 64×64 depth.
- Proposalθ=0.7, Proposalθ=0.8, Proposalθ=0.9: The CU is declared NON-SPLIT only if the inferred NON-SPLIT probability exceeds the threshold θ. For example, when the network inference outputs 0.18 for SPLIT and 0.82 for NON-SPLIT, the CU is determined to be a NON-SPLIT CU in Proposalθ=0.7 and Proposalθ=0.8, but is considered SPLIT in Proposalθ=0.9 (see the sketch after this list).
- Proposaldef achieves a time saving (TS) of 61.77% on average, whereas Proposald0d1 and Proposald0 achieve TS of 56.25% and 50.08%, respectively. The corresponding BD-rate degradations are 3.91%, 3.64%, and 2.75%, respectively.
- In addition, Proposalθ=0.7, Proposalθ=0.8, and Proposalθ=0.9 obtain TS of 55.51%, 51.82%, and 47.32%, respectively.
- Similar performance is observed under the LD configuration.
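The θ variants reduce to a simple thresholding rule on the inference output; a sketch of my reading of the description above (not the authors' code):

```python
def decide(p_non_split: float, theta: float) -> str:
    """Declare NON-SPLIT only when the network is confident enough,
    otherwise keep splitting and also check the lower depth."""
    return "NON-SPLIT" if p_non_split >= theta else "SPLIT"

# The example from the text: inference gives 0.82 for NON-SPLIT (0.18 for SPLIT).
for theta in (0.7, 0.8, 0.9):
    print(theta, decide(0.82, theta))        # NON-SPLIT, NON-SPLIT, SPLIT
```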
4.2. Comparison with SOTA Approaches
- Both Proposald0 and Proposalθ=0.9 outperform Liu ISCAS’16 [20], which is a CNN-based approach.
Reference
[2019 TCSVT] [Kim TCSVT’19]
Fast CU Depth Decision for HEVC Using Neural Networks
Codec Fast Prediction
H.264 to HEVC [Wei VCIP’17] [H-LSTM]
HEVC [Yu ICIP’15 / Liu ISCAS’16 / Liu TIP’16] [Laude PCS’16] [Li ICME’17] [Katayama ICICT’18] [Chang DCC’18] [ETH-CNN & ETH-LSTM] [Zhang RCAR’19] [Kim TCSVT’19]
3D-HEVC [AQ-CNN]
VVC [Jin VCIP’17] [Jin PCM’17] [Wang ICIP’18] [Pooling-Variable CNN]