Reading: Zhang RCAR’19 — CNN-Based Coding Unit Partition (Fast HEVC Prediction)

54.08% Time Reduction With Only 1.53% Increase in BD-Rate

Sik-Ho Tsang
5 min read · May 28, 2020

In this story, “A CNN-based Coding Unit Partition in HEVC for Video Processing” (Zhang RCAR’19) is briefly presented. I read this because I work on video coding research. The goal of this paper is to speed up the coding unit (CU) partitioning process in HEVC. (To know more about video coding and HEVC intra coding partitioning, please read IPCNN.) This is a paper in 2019 RCAR. (Sik-Ho Tsang @ Medium)

Outline

  1. Network Architecture
  2. Three Fast CU Partitioning Variants
  3. Some Implementation Details
  4. Experimental Results

1. Network Architecture

  • The proposed CNN architecture consists of one input layer, four convolutional (Conv) layers, one concatenation (Concat) layer, one fully connected (FC) layer and one softmax layer.
  • Three models are trained, one each for the CU sizes 64×64, 32×32 and 16×16.
Details of the Network
  • The first convolutional layer (Conv1) contains 64 filters of size 5×5 with stride 1.
  • Max pooling is performed with size 3×3 and stride 2.
  • The output feature map after this pooling layer has size 64×32×32.
  • The second convolutional layer (Conv2) and its pooling layer use the same parameters as the first. The output feature map has size 64×16×16.
  • The third (Conv3) and fourth (Conv4) convolutional layers each contain 64 filters of size 3×3 with stride 1.
  • The output feature maps of Conv3 and Conv4 are concatenated (Concat5), giving a feature map of size 128×16×16.
  • Max pooling with size 3×3 and stride 2 then reduces the feature dimension, giving a final feature map of size 128×8×8.
  • One fully connected layer (FC6) with 64 hidden units follows the pooled Concat5 output. Dropout with ratio 0.5 is applied on FC6.
  • The output (softmax) layer has 2 units, which decide whether to partition the input CU or not. 0: Not Split, 1: Split. (A minimal sketch of this stack follows the list.)
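
To make the layer arithmetic concrete, here is a minimal sketch of one such model in PyTorch (the paper uses Caffe; the padding, ReLU placement, ceil-mode pooling, and the wiring of Conv4 after Conv3 are my assumptions, chosen so that the feature-map sizes stated above work out for a 64×64 luma input):

```python
import torch
import torch.nn as nn

class CUPartitionNet(nn.Module):
    """Split/no-split classifier for one CU size (64x64 input assumed)."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 64, kernel_size=5, stride=1, padding=2)
        self.conv2 = nn.Conv2d(64, 64, kernel_size=5, stride=1, padding=2)
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1)
        self.conv4 = nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1)
        # ceil_mode reproduces the stated sizes: 64 -> 32 -> 16 and 16 -> 8
        self.pool = nn.MaxPool2d(kernel_size=3, stride=2, ceil_mode=True)
        self.fc6 = nn.Linear(128 * 8 * 8, 64)      # pooled Concat5: 128 x 8 x 8
        self.drop = nn.Dropout(p=0.5)
        self.out = nn.Linear(64, 2)                # 0: not split, 1: split

    def forward(self, x):                          # x: (N, 1, 64, 64) luma CU
        x = self.pool(torch.relu(self.conv1(x)))   # -> (N, 64, 32, 32)
        x = self.pool(torch.relu(self.conv2(x)))   # -> (N, 64, 16, 16)
        c3 = torch.relu(self.conv3(x))             # -> (N, 64, 16, 16)
        c4 = torch.relu(self.conv4(c3))            # -> (N, 64, 16, 16)
        x = torch.cat([c3, c4], dim=1)             # Concat5 -> (N, 128, 16, 16)
        x = self.pool(x).flatten(1)                # -> (N, 128*8*8)
        x = self.drop(torch.relu(self.fc6(x)))
        return self.out(x)                         # logits; softmax in the loss

logits = CUPartitionNet()(torch.randn(4, 1, 64, 64))  # sanity check: (4, 2)
```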

2. Three Fast CU Partitioning Variants

(Figure: the whole partition process of a CTU)
  • The three variants are named Pro-A, Pro-B and Pro-C.

2.1. Pro-A

  • If the split flag of the 64×64 CTU is 0, the depth of the whole CTU is 0. Otherwise, if the split flag of a 32×32 sub-CU is 1, and the split flag of its 16×16 sub-CU is also 1, the depth of that 16×16 CU is fixed at 3. The same rule applies to all the 16×16 CUs.

2.2. Pro-B

  • Similar to Pro-A, except that the depth search range is changed when the partition flags of the CTU and a 32×32 CU are both 1.
  • In that case the depth search range is narrowed from “2 to 3” to the fixed value “2”. That means all smaller CU sizes are skipped.

2.3. Pro-C

  • On the basis of Pro-B, the depth search range is also changed when the partition flag of the CTU is 1 but the partition flag of a 32×32 CU is 0.
  • In that case the depth search range is narrowed from “1 to 2” to the fixed value “1”. A combined sketch of the three decision rules follows.
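
Putting the three variants together, the flag-to-depth-range mapping can be summarized in one small function. This is my own reconstruction from the descriptions above, not code from the paper; in particular, the ranges for the cases the text leaves implicit (Pro-A with flag16 = 0, and Pro-A/B with flag32 = 0) are inferred from the ranges that Pro-B and Pro-C say they narrow:

```python
def depth_search_range(variant, flag64, flag32, flag16=None):
    """Inclusive (min_depth, max_depth) RDO search range for one 32x32 sub-CU.

    flag64, flag32, flag16 are the CNN split predictions (0 or 1) for the
    enclosing 64x64 CTU, the 32x32 sub-CU, and one of its 16x16 sub-CUs.
    """
    if flag64 == 0:
        return (0, 0)                  # whole CTU is left unsplit (depth 0)
    if flag32 == 1:
        if variant in ("Pro-B", "Pro-C"):
            return (2, 2)              # fixed depth 2: skip all smaller CUs
        # Pro-A consults the 16x16 model; the flag16 == 0 case is inferred
        return (3, 3) if flag16 == 1 else (2, 2)
    if variant == "Pro-C":
        return (1, 1)                  # fixed depth 1 when flag32 == 0
    return (1, 2)                      # Pro-A/B still search depths 1 to 2
```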

3. Some Implementation Details

3.1. GPU Implementation

  • If integrated naively into HEVC, each CTU would be split into sub-CUs and each of them copied separately from Central Processing Unit (CPU) memory to the GPU device. The large number of CUs makes this data transfer highly redundant.
  • To solve this problem, the whole frame is copied to the GPU device once, and then split on-device into blocks of 64×64, 32×32 and 16×16 (sketched below).
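
A minimal sketch of this frame-level transfer in PyTorch (my own illustration, not the paper's code; the frame size and the stand-in luma plane are assumptions, and the frame is assumed padded to a multiple of the 64×64 CTU size):

```python
import numpy as np
import torch

H, W = 1088, 1920                                   # assumed padded frame size
luma = np.random.randint(0, 256, (H, W), dtype=np.uint8)  # stand-in Y plane

device = "cuda" if torch.cuda.is_available() else "cpu"
frame = torch.from_numpy(luma).float().to(device)   # one host-to-device copy
x = frame.unsqueeze(0).unsqueeze(0)                 # (1, 1, H, W)

def tile(x, size):
    """Split the on-device frame into non-overlapping size x size CU blocks."""
    b = x.unfold(2, size, size).unfold(3, size, size)  # (1, 1, H/s, W/s, s, s)
    return b.reshape(-1, 1, size, size)                # batch of CUs

cus_64, cus_32, cus_16 = tile(x, 64), tile(x, 32), tile(x, 16)
```

Each batch is then fed to the model of the matching CU size, so no per-CU CPU-to-GPU copies are needed.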

3.2. Training

  • Only the luminance channel is used to train the CNN models.
  • During the training stage, to test the videos of one class, the videos of all the other classes are used for training (see the sketch after this list).
  • For example, when the videos in Class A are tested, the videos in Classes B, C, D and E at QP 32 are used as training data.
  • Caffe is used as the deep learning framework.
  • HM-13.0 is used as the HEVC reference software.
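
The leave-one-class-out protocol is simple enough to state in code (the class-to-sequence mapping below is a hypothetical placeholder; the actual JCT-VC sequence lists are in the paper's experimental setup):

```python
# Hypothetical placeholder mapping from test class to sequence names.
seqs_by_class = {
    "A": ["A_seq1", "A_seq2"],
    "B": ["B_seq1", "B_seq2"],
    "C": ["C_seq1"],
    "D": ["D_seq1"],
    "E": ["E_seq1"],
}

def leave_one_class_out(test_class, seqs_by_class):
    """Train on every class except the one under test (QP 32 in the paper)."""
    train = [s for c, seqs in seqs_by_class.items()
             if c != test_class for s in seqs]
    return train, list(seqs_by_class[test_class])

train_seqs, test_seqs = leave_one_class_out("A", seqs_by_class)
```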

4. Experimental Results

4.1. Three Individual Models for Different CU Sizes

  • For Pro-A, 54.08% average time reduction is obtained with only 1.53% increase in BD-rate.
  • For Pro-B, 61.71% average time reduction is obtained with only 2.20% increase in BD-rate.
  • For Pro-C, 63.19% average time reduction is obtained with only 2.62% increase in BD-rate.

4.2. One 32×32 CU Model for Different CU Sizes

  • For Pro-A, 52.24% average time reduction is obtained with only 1.01% increase in BD-rate.
  • For Pro-B, 66.01% average time reduction is obtained with only 2.86% increase in BD-rate.
  • For Pro-C, 67.69% average time reduction is obtained with only 3.61% increase in BD-rate.
  • The performance is worse than that of the three individual models in Section 4.1, but still reasonable.

4.3. SOTA Comparison

  • Pro-C-3 denotes the results of Pro-C using three models: 63.19% coding time saving with a 2.62% BD-rate increase.
  • Pro-B-1 denotes the results of Pro-B using one model (32×32): 66.01% coding time saving with a 2.86% BD-rate increase.
  • Compared with other SOTA approaches, the proposed approach saves much more time at a slightly higher BD-rate ([13] is a CNN-based approach that I hope to read in the future).

During the days of coronavirus, the challenges of writing 30 and 35 stories for this month have been accomplished. This is the 40th story this month. Let me challenge 45 stories!! Thanks for visiting my story.
