Reading: Kuanar JCCSP’19 — Object Detection Approach for Video Coding (Fast HEVC Prediction)

Encoding Time Saving of Up to 66.89%, BD-Rate Loss as Low as 1.31%, Outperforms Li ICME’17

Sik-Ho Tsang
5 min read · May 30, 2020

In this paper, “Adaptive CU Mode Selection in HEVC Intra Prediction: A Deep Learning Approach” (Kuanar JCCSP’19), by the University of Texas at Arlington and the University of Texas at Dallas, is briefly presented. In this paper:

  • A CNN-based object detection network, which includes a Region Proposal Network (RPN), is used.
  • Regions are classified as homogeneous object, big object, granular texture, or small object.
  • According to the object type, different fast encoding approaches are applied.

This is a paper in 2019 JCCSP (the Springer journal Circuits, Systems, and Signal Processing) with an impact factor of 1.922. (Sik-Ho Tsang @ Medium)

Outline

  1. Overall Framework
  2. Network Architecture
  3. Experimental Results

1. Overall Framework

Overall Framework
  • Each CTU and its sub-CUs are input to the CNN-based classifier, as shown in the flowchart above.
  • Each CU is classified as homogeneous object (class 0), big object (class 1), granular texture (class 2), or small object (class 3).
  • In conventional HEVC, there are 35 intra prediction modes in total. (Please read Sections 1 & 2 for HEVC intra coding in IPCNN if interested.) According to the object type, a much smaller set of intra prediction modes is checked, as shown in Algorithm I (PUMS) above and in the sketch after this list. Therefore, encoding time is reduced.
  • For class 0, no further CU partitioning is applied since such CUs are homogeneous, which brings a large additional time saving.
  • To classify the CU, a CNN is used, described in the next section.
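
To make this concrete, below is a minimal Python sketch of class-dependent fast mode selection. The candidate-mode sets per class and the `classify`/`rd_cost` helpers are hypothetical placeholders; the actual reduced mode lists are defined in the paper’s Algorithm I (PUMS).

```python
# Hypothetical sketch of class-dependent fast intra mode selection.
# HEVC intra modes: 0 = Planar, 1 = DC, 2..34 = angular.
ALL_MODES = list(range(35))

# Placeholder reduced candidate sets per predicted CU class
# (the paper's Algorithm I defines the actual lists).
CANDIDATE_MODES = {
    0: [0, 1],             # homogeneous object: Planar/DC only
    1: [0, 1, 10, 26],     # big object: + pure horizontal/vertical
    2: ALL_MODES,          # granular texture: keep a wide search
    3: [0, 1, 2, 18, 34],  # small object: a few diagonal candidates
}

def fast_intra_decision(cu, classify, rd_cost):
    """classify(cu) -> class id in {0,1,2,3}; rd_cost(cu, mode) -> RD cost.

    Returns the best mode among the reduced candidates and whether the
    CU may be split further (class 0 terminates partitioning early).
    """
    cu_class = classify(cu)
    candidates = CANDIDATE_MODES[cu_class]
    best_mode = min(candidates, key=lambda m: rd_cost(cu, m))
    return best_mode, cu_class != 0
```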

2. Network Architecture

Network Architecture (Upper: RPN, Lower: Object Detection Network)

2.1. Object Detection Network

Object Detection Network
  • A modified ZFNet is used as the object detection network, with 5 convolutional layers and 3 fully connected layers (a rough sketch follows below).
  • The layers learn a series of features, from low-level ones (edges, blobs) to high-level ones (object structures).
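
As a rough picture of that backbone, here is a minimal PyTorch sketch of a ZFNet-like network with 5 convolutional and 3 fully connected layers. The filter counts follow the original ZFNet and the 1024-d FC-6 follows the text; the remaining sizes are assumptions about the “modified” variant.

```python
import torch.nn as nn

class ZFNetLike(nn.Module):
    """ZFNet-style backbone sketch: 5 conv + 3 FC layers (sizes assumed)."""
    def __init__(self, num_classes=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, 7, stride=2, padding=3), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(96, 256, 5, stride=2, padding=2), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(256, 384, 3, padding=1), nn.ReLU(),  # Conv3: FV branch taps here
            nn.Conv2d(384, 384, 3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 256, 3, padding=1), nn.ReLU(),  # ConvNet5: RPN slides here
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(1024), nn.ReLU(),    # FC-6 (1024-d per the text)
            nn.Linear(1024, 1024), nn.ReLU(),  # FC-7
            nn.Linear(1024, num_classes),      # FC-8: 4 object classes
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```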

2.2. Region Proposal Network

ROI Anchor Boxes
  • A two-stage object detection approach, just like Faster R-CNN, is used here.
  • Anchor box scales range over 16×16, 32×32, 64×64, and 128×128, as shown above.
  • Each anchor box is paired with four aspect ratios {(0.5, 0.5), (1, 2), (2, 1), (1, 1)}.
  • A 2×2 sliding window is used over all positions on the ConvNet5 feature maps of the object detection network.
  • Each window is then mapped to a 128-dimensional vector, which comprises 64 positive anchors and 64 negative anchors.
  • The above 128-d vectors are then connected to the FC-6 and FC-7 layers.
  • The last FC-7 layer feeds into (1) a box classification layer with two probability scores, and (2) a bounding box regression layer with four coordinates.
  • The above process is repeated 16 times (4 box scales × 4 aspect ratios) at a single position and continued over all 20×30 ConvNet5 feature locations.
  • Overall, the prediction process generates 20×30×16 possible anchor boxes. The total is reduced to 1,000 by applying non-maximum suppression (NMS) and ignoring boxes that cross the image boundary; a sketch follows this list.
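
The anchor bookkeeping above fits in a few lines of NumPy. The scales, aspect ratios, 20×30 grid, and 1,000-box cap come from the text; the feature-map stride and the NMS IoU threshold below are assumptions.

```python
import numpy as np

SCALES = [16, 32, 64, 128]                            # anchor box scales
ASPECT_RATIOS = [(0.5, 0.5), (1, 2), (2, 1), (1, 1)]  # (h, w) multipliers
FEAT_H, FEAT_W, STRIDE = 20, 30, 16                   # stride is an assumption

def generate_anchors():
    """Enumerate 16 anchors per position over the 20x30 grid -> 9600 boxes."""
    boxes = []
    for i in range(FEAT_H):
        for j in range(FEAT_W):
            cy, cx = (i + 0.5) * STRIDE, (j + 0.5) * STRIDE
            for s in SCALES:
                for rh, rw in ASPECT_RATIOS:
                    h, w = s * rh, s * rw
                    boxes.append([cx - w/2, cy - h/2, cx + w/2, cy + h/2])
    return np.array(boxes)  # (9600, 4) in x1, y1, x2, y2

def nms(boxes, scores, iou_thresh=0.7, keep_max=1000):
    """Greedy non-maximum suppression, capped at keep_max proposals."""
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0 and len(keep) < keep_max:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]
    return keep

anchors = generate_anchors()
scores = np.random.rand(len(anchors))  # placeholder objectness scores
inside = ((anchors[:, 0] >= 0) & (anchors[:, 1] >= 0)
          & (anchors[:, 2] <= FEAT_W * STRIDE) & (anchors[:, 3] <= FEAT_H * STRIDE))
keep = nms(anchors[inside], scores[inside])  # at most 1,000 surviving boxes
```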

2.3. Texture Feature Calculation

  • Also, as seen in the Object Detection Network figure, there is an additional branch for texture features.
  • Textures play an important role in characterizing many natural objects, particularly those best described by a pattern, such as granular, static, or dynamic textures.
  • The Fisher Vector (FV) pools local features densely within the described regions and is therefore more apt at describing textures.
  • (FV has been widely used to aggregate the local descriptors of an image into a global representation in large-scale image retrieval.)
  • So, on top of the CNN Conv3 layer, Fisher vector (FV) pooling is applied, as is commonly done in the bag-of-words approach, at a single scale of 26×26. Specifically, the FV densely pools local SIFT features within the described regions.
  • 256-dimensional local features are generated for the Fisher vector computation and pooled into a representation with 32 Gaussian components.
  • Finally, this results in a 16K-dimensional descriptor, which is far larger than the FC-6 layer (1024-d). So the FV is projected to 1024-d using PCA and merged at the FC-6 layer; see the sketch below.
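
The dimension arithmetic checks out: an FV built over D-dimensional descriptors with a K-component GMM stacks gradients with respect to the means and variances, giving 2KD = 2 × 32 × 256 = 16,384 ≈ 16K dimensions. Below is a small sketch of the projection step, using scikit-learn PCA on placeholder data:

```python
import numpy as np
from sklearn.decomposition import PCA

D, K = 256, 32            # local descriptor dim, Gaussian components
fv_dim = 2 * K * D        # mean + variance gradients per component
assert fv_dim == 16384    # the ~16K-d descriptor mentioned above

# Placeholder FVs; real ones come from dense SIFT + a 32-component GMM.
fvs = np.random.randn(1200, fv_dim)
pca = PCA(n_components=1024).fit(fvs)  # learn the 16K -> 1024-d projection
fv_1024 = pca.transform(fvs[:1])       # one 1024-d vector, merged at FC-6
print(fv_1024.shape)                   # (1, 1024)
```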

2.4. Loss Function

  • The objective loss function includes both feature classification and bounding box regression, similar to typical object detection networks, together with the FV loss; a generic form is sketched after this list.
  • (There are more details about the loss function and the training procedure. Please read the paper if interested. Here, I just want to focus on how the authors make use of an object detection network to speed up the video coding process.)
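
The paper’s exact equation is not reproduced here; its general shape, though, is the familiar Faster R-CNN multi-task loss plus an FV term, sketched below in that generic form (λ and μ are assumed balancing weights):

```latex
% Generic Faster R-CNN-style multi-task loss with an added FV term
% (the paper's exact notation and weighting may differ).
% p_i / t_i: predicted class probability / box offsets for anchor i;
% starred quantities are ground truth; L_reg is typically smooth-L1.
L(\{p_i\},\{t_i\}) =
    \frac{1}{N_{\mathrm{cls}}} \sum_i L_{\mathrm{cls}}(p_i, p_i^{*})
  + \lambda \, \frac{1}{N_{\mathrm{reg}}} \sum_i p_i^{*} \, L_{\mathrm{reg}}(t_i, t_i^{*})
  + \mu \, L_{\mathrm{FV}}
```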

3. Experimental Results

BD-Rate (%) on Class A to Class E HEVC Test Sequences
  • Using the proposed approach, a 66.89% encoding time reduction is achieved with only a 1.31% BD-rate loss, which outperforms Li ICME’17 [20].
BD-Rate (%) on Class F HEVC Test Sequences
  • For the class F screen content sequences, a 71.72% encoding time reduction is achieved with only a 1.84% BD-rate loss, again outperforming Li ICME’17 [20].
Mode Reduction (%)
  • Compared with conventional HEVC, the proposed approach reduces the mode checking by up to 48.325%. (For reference, the BD-rate metric used in the tables above is sketched below.)
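
BD-rate is the standard Bjøntegaard delta metric: fit log-bitrate as a cubic in PSNR for the anchor and the test encoder, integrate both fits over the overlapping PSNR range, and convert the mean log-rate gap into a percentage. A minimal NumPy sketch with made-up RD points:

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjontegaard delta rate in percent (positive = bitrate increase)."""
    p_a = np.polyfit(psnr_anchor, np.log(rate_anchor), 3)  # log-rate vs PSNR
    p_t = np.polyfit(psnr_test, np.log(rate_test), 3)
    lo = max(min(psnr_anchor), min(psnr_test))  # overlapping PSNR range
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_log_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_log_diff) - 1) * 100

# Made-up RD points at the usual four QPs (22, 27, 32, 37):
print(bd_rate([1000, 600, 350, 200], [40.0, 38.0, 36.0, 34.0],
              [1010, 606, 353, 202], [39.98, 37.98, 35.98, 33.98]))
```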

During the days of coronavirus, a challenge of writing 30/35/40/45 stories again for this month has been accomplished. This is the 46th story in this month..!! Can I write 50 stories in this month?? Thanks for visiting my story..
