Reading: Kuanar JCCSP’19 — Object Detection Approach for Video Coding (Fast HEVC Prediction)

Encoding Time Saving of Up to 66.89%, BD-Rate Loss as Low as 1.31%, Outperforms Li ICME’17

Sik-Ho Tsang
5 min read · May 30, 2020

In this paper, “Adaptive CU Mode Selection in HEVC Intra Prediction: A Deep Learning Approach” (Kuanar JCCSP’19), by the University of Texas at Arlington and the University of Texas at Dallas, is briefly presented. In this paper:

  • A CNN-based object detection network, which includes a Region Proposal Network (RPN), is used.
  • Regions are classified as homogeneous object, big object, granular texture, or small object.
  • According to the object type, different fast encoding approaches are applied.

This is a paper in 2019 JCCSP (the Springer journal Circuits, Systems, and Signal Processing) with an impact factor of 1.922. (Sik-Ho Tsang @ Medium)

Outline

  1. Overall Framework
  2. Network Architecture
  3. Experimental Results

1. Overall Framework

Overall Framework
  • Each CTU and its sub-CUs are input to the CNN-based classifier, as shown in the flowchart above.
  • Each CU is classified as homogeneous object (class 0), big object (class 1), granular texture (class 2), or small object (class 3).
  • In conventional HEVC, there are 35 intra prediction modes in total. (Please read Sections 1 & 2 for HEVC intra coding in IPCNN if interested.) According to the object type, a much smaller set of intra prediction modes is checked, as shown in Algorithm I (PUMS) above and in the sketch after this list. Therefore, encoding time is reduced.
  • For class 0, no further CU partitioning is applied since such CUs are homogeneous, which brings a large additional time saving.
  • To classify the CU, a CNN is used, described in the next section.
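
To make this concrete, below is a minimal Python sketch of class-dependent fast mode selection. The candidate-mode sets per class and the `classify`/`rd_cost` helpers are hypothetical placeholders; the actual reduced mode lists are defined in the paper’s Algorithm I (PUMS).

```python
# Hypothetical sketch of class-dependent fast intra mode selection.
# HEVC intra modes: 0 = Planar, 1 = DC, 2..34 = angular.
ALL_MODES = list(range(35))

# Placeholder reduced candidate sets per predicted CU class
# (the paper's Algorithm I defines the actual lists).
CANDIDATE_MODES = {
    0: [0, 1],             # homogeneous object: Planar/DC only
    1: [0, 1, 10, 26],     # big object: + pure horizontal/vertical
    2: ALL_MODES,          # granular texture: keep a wide search
    3: [0, 1, 2, 18, 34],  # small object: a few diagonal candidates
}

def fast_intra_decision(cu, classify, rd_cost):
    """classify(cu) -> class id in {0,1,2,3}; rd_cost(cu, mode) -> RD cost.

    Returns the best mode among the reduced candidates and whether the
    CU may be split further (class 0 terminates partitioning early).
    """
    cu_class = classify(cu)
    candidates = CANDIDATE_MODES[cu_class]
    best_mode = min(candidates, key=lambda m: rd_cost(cu, m))
    return best_mode, cu_class != 0
```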

2. Network Architecture

Network Architecture (Upper: RPN, Lower: Object Detection Network)

2.1. Object Detection Network

Object Detection Network
  • A modified ZFNet is used as the object detection network, with 5 convolutional layers and 3 fully connected layers (a rough sketch follows below).
  • The layers learn a series of features, from low-level ones (edges, blobs) to high-level ones (object structures).
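
As a rough picture of that backbone, here is a minimal PyTorch sketch of a ZFNet-like network with 5 convolutional and 3 fully connected layers. The filter counts follow the original ZFNet and the 1024-d FC-6 follows the text; the remaining sizes are assumptions about the “modified” variant.

```python
import torch.nn as nn

class ZFNetLike(nn.Module):
    """ZFNet-style backbone sketch: 5 conv + 3 FC layers (sizes assumed)."""
    def __init__(self, num_classes=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, 7, stride=2, padding=3), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(96, 256, 5, stride=2, padding=2), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(256, 384, 3, padding=1), nn.ReLU(),  # Conv3: FV branch taps here
            nn.Conv2d(384, 384, 3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 256, 3, padding=1), nn.ReLU(),  # ConvNet5: RPN slides here
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(1024), nn.ReLU(),    # FC-6 (1024-d per the text)
            nn.Linear(1024, 1024), nn.ReLU(),  # FC-7
            nn.Linear(1024, num_classes),      # FC-8: 4 object classes
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```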

2.2. Region Proposal Network

ROI Anchor Boxes
  • A two-stage object detection approach, just like Faster R-CNN, is used here.
  • Anchor box scales range over 16×16, 32×32, 64×64, and 128×128, as shown above.
  • Each anchor box is paired with four aspect ratios {(0.5, 0.5), (1, 2), (2, 1), (1, 1)}.
  • A 2×2 sliding window is used over all positions on the ConvNet5 feature maps of the object detection network.
  • Each window is then mapped to a 128-dimensional vector, which comprises 64 positive anchors and 64 negative anchors.
  • The above 128-d vectors are then connected to the FC-6 and FC-7 layers.
  • The last FC-7 layer feeds into (1) a box classification layer with two probability scores, and (2) a bounding box regression layer with four coordinates.
  • The above process is repeated 16 times (4 box scales × 4 aspect ratios) at a single position and continued over all 20×30 ConvNet5 feature locations.
  • Overall, the prediction process generates 20×30×16 possible anchor boxes. The total is reduced to 1,000 by applying non-maximum suppression (NMS) and ignoring boxes that cross the image boundary; a sketch follows this list.
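
The anchor bookkeeping above fits in a few lines of NumPy. The scales, aspect ratios, 20×30 grid, and 1,000-box cap come from the text; the feature-map stride and the NMS IoU threshold below are assumptions.

```python
import numpy as np

SCALES = [16, 32, 64, 128]                            # anchor box scales
ASPECT_RATIOS = [(0.5, 0.5), (1, 2), (2, 1), (1, 1)]  # (h, w) multipliers
FEAT_H, FEAT_W, STRIDE = 20, 30, 16                   # stride is an assumption

def generate_anchors():
    """Enumerate 16 anchors per position over the 20x30 grid -> 9600 boxes."""
    boxes = []
    for i in range(FEAT_H):
        for j in range(FEAT_W):
            cy, cx = (i + 0.5) * STRIDE, (j + 0.5) * STRIDE
            for s in SCALES:
                for rh, rw in ASPECT_RATIOS:
                    h, w = s * rh, s * rw
                    boxes.append([cx - w/2, cy - h/2, cx + w/2, cy + h/2])
    return np.array(boxes)  # (9600, 4) in x1, y1, x2, y2

def nms(boxes, scores, iou_thresh=0.7, keep_max=1000):
    """Greedy non-maximum suppression, capped at keep_max proposals."""
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0 and len(keep) < keep_max:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]
    return keep

anchors = generate_anchors()
scores = np.random.rand(len(anchors))  # placeholder objectness scores
inside = ((anchors[:, 0] >= 0) & (anchors[:, 1] >= 0)
          & (anchors[:, 2] <= FEAT_W * STRIDE) & (anchors[:, 3] <= FEAT_H * STRIDE))
keep = nms(anchors[inside], scores[inside])  # at most 1,000 surviving boxes
```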

2.3. Texture Feature Calculation

  • Also, as seen in the Object Detection Network figure, there is an additional branch for texture features.
  • Textures play an important role in characterizing many natural objects, particularly those best described by a pattern, such as granular, static, or dynamic textures.
  • The Fisher Vector (FV) pools local features densely within the described regions and is therefore more apt at describing textures.
  • (FV has been widely used to aggregate the local descriptors of an image into a global representation in large-scale image retrieval.)
  • So, on top of the CNN Conv3 layer, Fisher vector (FV) pooling is applied, as is commonly done in the bag-of-words approach, at a single scale of 26×26. Specifically, the FV densely pools local SIFT features within the described regions.
  • 256-dimensional local features are generated for the Fisher vector computation and pooled into a representation with 32 Gaussian components.
  • Finally, this results in a 16K-dimensional descriptor, which is far larger than the FC-6 layer (1024-d). So the FV is projected to 1024-d using PCA and merged at the FC-6 layer; see the sketch below.
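
The dimension arithmetic checks out: an FV built over D-dimensional descriptors with a K-component GMM stacks gradients with respect to the means and variances, giving 2KD = 2 × 32 × 256 = 16,384 ≈ 16K dimensions. Below is a small sketch of the projection step, using scikit-learn PCA on placeholder data:

```python
import numpy as np
from sklearn.decomposition import PCA

D, K = 256, 32            # local descriptor dim, Gaussian components
fv_dim = 2 * K * D        # mean + variance gradients per component
assert fv_dim == 16384    # the ~16K-d descriptor mentioned above

# Placeholder FVs; real ones come from dense SIFT + a 32-component GMM.
fvs = np.random.randn(1200, fv_dim)
pca = PCA(n_components=1024).fit(fvs)  # learn the 16K -> 1024-d projection
fv_1024 = pca.transform(fvs[:1])       # one 1024-d vector, merged at FC-6
print(fv_1024.shape)                   # (1, 1024)
```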

2.4. Loss Function

  • The objective loss function includes both feature classification and bounding box regression, similar to typical object detection networks, together with the FV loss; a generic form is sketched after this list.
  • (There are more details about the loss function and the training procedure. Please read the paper if interested. Here, I just want to focus on how the authors make use of an object detection network to speed up the video coding process.)
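
The paper’s exact equation is not reproduced here; its general shape, though, is the familiar Faster R-CNN multi-task loss plus an FV term, sketched below in that generic form (λ and μ are assumed balancing weights):

```latex
% Generic Faster R-CNN-style multi-task loss with an added FV term
% (the paper's exact notation and weighting may differ).
% p_i / t_i: predicted class probability / box offsets for anchor i;
% starred quantities are ground truth; L_reg is typically smooth-L1.
L(\{p_i\},\{t_i\}) =
    \frac{1}{N_{\mathrm{cls}}} \sum_i L_{\mathrm{cls}}(p_i, p_i^{*})
  + \lambda \, \frac{1}{N_{\mathrm{reg}}} \sum_i p_i^{*} \, L_{\mathrm{reg}}(t_i, t_i^{*})
  + \mu \, L_{\mathrm{FV}}
```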

3. Experimental Results

BD-Rate (%) on Class A to Class E HEVC Test Sequences
  • Using the proposed approach, a 66.89% encoding time reduction is achieved with only a 1.31% BD-rate loss, which outperforms Li ICME’17 [20].
BD-Rate (%) on Class F HEVC Test Sequences
  • For the class F screen content sequences, a 71.72% encoding time reduction is achieved with only a 1.84% BD-rate loss, again outperforming Li ICME’17 [20].
Mode Reduction (%)
  • Compared with conventional HEVC, the proposed approach reduces the mode checking by up to 48.325%. (For reference, the BD-rate metric used in the tables above is sketched below.)
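
BD-rate is the standard Bjøntegaard delta metric: fit log-bitrate as a cubic in PSNR for the anchor and the test encoder, integrate both fits over the overlapping PSNR range, and convert the mean log-rate gap into a percentage. A minimal NumPy sketch with made-up RD points:

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjontegaard delta rate in percent (positive = bitrate increase)."""
    p_a = np.polyfit(psnr_anchor, np.log(rate_anchor), 3)  # log-rate vs PSNR
    p_t = np.polyfit(psnr_test, np.log(rate_test), 3)
    lo = max(min(psnr_anchor), min(psnr_test))  # overlapping PSNR range
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_log_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_log_diff) - 1) * 100

# Made-up RD points at the usual four QPs (22, 27, 32, 37):
print(bd_rate([1000, 600, 350, 200], [40.0, 38.0, 36.0, 34.0],
              [1010, 606, 353, 202], [39.98, 37.98, 35.98, 33.98]))
```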

During the days of coronavirus, a challenge of writing 30/35/40/45 stories again for this month has been accomplished. This is the 46th story in this month..!! Can I write 50 stories in this month?? Thanks for visiting my story..
