Reading: VC-LAPGAN — Video Coding oriented LAplacian Pyramid of Generative Adversarial Networks (HEVC Inter Prediction)

LAPGAN-Like Network, 2.0% BD-Rate Reduction Under LDP Configuration

Jun 14, 2020

In this story, Video Coding oriented LAplacian Pyramid of Generative Adversarial Networks (VC-LAPGAN), by University of Science and Technology of China, is briefly presented. I read this because I work on video coding research. In this paper:

By applying LAPGAN to video coding, an extrapolated frame is generated.
This frame is used as a reference frame to improve the coding efficiency.

This is a paper in 2018 VCIP. (Sik-Ho Tsang @ Medium)

Outline

VC-LAPGAN: Network Architecture
Experimental Results

1. VC-LAPGAN: Network Architecture

**VC-LAPGAN: Network Architecture** (The original image is blurred like this..)

1.1. Overall Framework

The idea of LAPGAN is used in VC-LAPGAN.
There are in total 4 scales. Let s1, . . . , s4 be the sizes of the inputs of the network. Typically, in the experiments, s1=16×16, s2=32×32, s3=64×64, s4=128×128 during training.
Four previous frames are used to extrapolate the current frame. During training, collocated patches are sampled from the frames.
These patches sampled from previous frames are indexed by i ∈ {1, 2, 3, 4}, and are down-sampled three times to obtain the four scales. Let k ∈ {1, 2, 3, 4} denote the k-th scale.
Then let U be the upscaling operator. Let Xik be the i-th previous patch at the k-th scale, and Yk be the expected output (i.e. the ground-truth of the current patch) at the k-th scale, and Gk be a network that operates at the k-th scale.
Gk then makes a prediction ˆYk of Yk from the patches Xik and the up-scaled prediction U(ˆYk−1) of the previous scale:

where U(ˆY0) is assumed to be 0 and omitted. A series of predictions is made, starting from the lowest resolution.
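The coarse-to-fine procedure can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the recursion is taken to be the LAPGAN-style residual form ˆYk = U(ˆYk−1) + Gk(X1k, ..., X4k, U(ˆYk−1)), down-sampling is 2×2 averaging, U is pixel repetition, and `toy_generator` is a hypothetical stand-in for a trained Gk. None of these choices are confirmed by the paper.

```python
def downsample2(img):
    """Halve both dimensions by averaging each non-overlapping 2x2 block."""
    return [[(img[y][x] + img[y][x + 1] + img[y + 1][x] + img[y + 1][x + 1]) / 4.0
             for x in range(0, len(img[0]), 2)]
            for y in range(0, len(img), 2)]


def upsample2(img):
    """U: double both dimensions by pixel repetition (assumed operator)."""
    out = []
    for row in img:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def build_pyramid(patch, num_scales=4):
    """Return the patch at scales 1..num_scales, coarsest first."""
    levels = [patch]
    for _ in range(num_scales - 1):
        levels.append(downsample2(levels[-1]))
    return levels[::-1]


def toy_generator(prev_patches, upscaled_guess):
    """Hypothetical stand-in for G_k: returns the residual between the
    mean of the previous patches and the up-scaled coarse prediction."""
    n = len(prev_patches)
    return [[sum(p[y][x] for p in prev_patches) / n - upscaled_guess[y][x]
             for x in range(len(upscaled_guess[0]))]
            for y in range(len(upscaled_guess))]


def extrapolate(prev_frames, generators):
    """Predict the current patch scale by scale, coarsest first."""
    pyramids = [build_pyramid(f, len(generators)) for f in prev_frames]
    prediction = None
    for k, g in enumerate(generators):
        inputs = [pyr[k] for pyr in pyramids]
        if prediction is None:
            # U(Y_hat_0) is taken to be 0 at the coarsest scale.
            base = [[0.0] * len(inputs[0][0]) for _ in range(len(inputs[0]))]
        else:
            base = upsample2(prediction)
        residual = g(inputs, base)
        prediction = [[b + r for b, r in zip(brow, rrow)]
                      for brow, rrow in zip(base, residual)]
    return prediction


# Four constant 128x128 "previous patches" as toy input.
frames = [[[v] * 128 for _ in range(128)] for v in (10.0, 20.0, 30.0, 40.0)]
pred = extrapolate(frames, [toy_generator] * 4)
```

With a 128×128 input, the four scales are 16×16, 32×32, 64×64 and 128×128, matching s1 to s4 above.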

1.2. Discriminative & Generative Nets

There are 4 discriminative nets at different scales.
Each net Dk takes the generated patch ˆYk and the real patch Yk as inputs. Dk is trained to indicate whether the patch is real or generated, which is a binary classification problem.
In VC-LAPGAN, each of the generative nets {G1, . . . , G4} is fully convolutional, i.e. composed of several convolutional layers exclusively.
Each convolutional layer is followed by a ReLU.
Each of the discriminative nets {D1, . . . , D4} is composed of several convolutional layers followed by several fully-connected layers followed by a logistic regression layer.
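One practical consequence of a fully convolutional generator is that the same weights can be applied to inputs of any size, for example small patches during training and larger pictures afterwards. The sketch below illustrates this with a single-channel convolution followed by ReLU, in plain Python. The kernels, layer count, and zero padding are illustrative assumptions, not the paper's configuration.

```python
def conv2d_same(img, kernel, bias=0.0):
    """Single-channel 2-D convolution (cross-correlation) with zero
    padding, so the output has the same size as the input."""
    h, w = len(img), len(img[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            acc = bias
            for dy in range(kh):
                for dx in range(kw):
                    yy, xx = y + dy - ph, x + dx - pw
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[dy][dx] * img[yy][xx]
            row.append(acc)
        out.append(row)
    return out


def relu(img):
    """Element-wise max(0, v)."""
    return [[max(0.0, v) for v in row] for row in img]


def tiny_generator(img, kernels):
    """Stack of conv layers, each followed by a ReLU."""
    for k in kernels:
        img = relu(conv2d_same(img, k))
    return img


# Hypothetical 3x3 kernels; the same list is reused for both input sizes.
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
small = [[float(x - y) for x in range(16)] for y in range(16)]
large = [[float(x - y) for x in range(64)] for y in range(64)]
out_small = tiny_generator(small, [identity, identity])
out_large = tiny_generator(large, [identity, identity])
```

A discriminator with fully-connected layers, by contrast, is tied to a fixed input size, which fits the description of one Dk per scale.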

1.3. Training Data

The CDVL (Consumer Digital Video Library) dataset is used, with RGB converted to YUV first.
100 videos are used, with resolutions of either 720p or 1080p, and the patch size is set to 128×128.
700,000 samples are randomly selected. Samples with larger motion are more likely to be chosen than those with smaller motion. Samples with little texture are also not very useful for training the network.
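The exact selection rule is not given here, so the following is only a sketch of one way to bias sampling toward motion and texture. It assumes motion is measured as the mean absolute difference between collocated patches, texture as pixel variance, and that low-texture candidates are simply dropped. The function names and the `min_texture` threshold are hypothetical.

```python
import random


def flatten(patch):
    return [v for row in patch for v in row]


def motion_score(prev_patch, cur_patch):
    """Mean absolute difference between collocated patches (assumed measure)."""
    a, b = flatten(prev_patch), flatten(cur_patch)
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def texture_score(patch):
    """Variance of the pixel values (assumed measure)."""
    vals = flatten(patch)
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)


def select_samples(candidates, num_samples, min_texture=1.0, rng=None):
    """candidates: list of (prev_patch, cur_patch) pairs.
    Drops low-texture pairs, then draws with replacement with a
    probability proportional to the motion score."""
    rng = rng or random.Random()
    kept = [c for c in candidates if texture_score(c[1]) >= min_texture]
    if not kept:
        return []
    # Small offset so static patches keep a non-zero chance.
    weights = [motion_score(prev, cur) + 1e-6 for prev, cur in kept]
    return rng.choices(kept, weights=weights, k=num_samples)


# Toy 2x2 candidates: flat, textured but static, textured and moving.
flat = ([[5, 5], [5, 5]], [[5, 5], [5, 5]])
static = ([[0, 9], [9, 0]], [[0, 9], [9, 0]])
moving = ([[0, 9], [9, 0]], [[9, 0], [0, 9]])
picks = select_samples([flat, static, moving], 200, rng=random.Random(0))
```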

1.4. Loss Functions

The loss function for generative nets is:

where the λ’s are weights; all λ = 1 in the paper.
lGadv is the adversarial loss, as in the general GAN.
lG1 is the loss on the difference between the predicted ˆYk and the real Yk.
lGgdl is also a loss on the difference between the prediction and the real patch, but calculated in the gradient domain. The gradient-domain loss helps make the generated images sharper.
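To see why a gradient-domain term favors sharp outputs, here is a small sketch of the three generator terms. It rests on assumptions: lG1 is taken as a sum of absolute pixel differences (suggested by the subscript), lGgdl as the gradient difference loss in its common form with exponent 1, and lGadv as −log D(ˆYk). The paper may normalize or define these differently.

```python
import math


def l1_loss(pred, real):
    """Sum of absolute pixel differences."""
    return sum(abs(p - r)
               for prow, rrow in zip(pred, real)
               for p, r in zip(prow, rrow))


def gdl_loss(pred, real):
    """Gradient difference loss: compares the magnitudes of vertical
    and horizontal neighbor differences of the two patches."""
    total = 0.0
    for y in range(len(real)):
        for x in range(len(real[0])):
            if y > 0:
                total += abs(abs(real[y][x] - real[y - 1][x])
                             - abs(pred[y][x] - pred[y - 1][x]))
            if x > 0:
                total += abs(abs(real[y][x] - real[y][x - 1])
                             - abs(pred[y][x] - pred[y][x - 1]))
    return total


def adversarial_loss_g(d_score_on_pred):
    """-log D(Y_hat): small when the discriminator is fooled."""
    return -math.log(max(d_score_on_pred, 1e-12))


def generator_loss(pred, real, d_score_on_pred,
                   lam_adv=1.0, lam_1=1.0, lam_gdl=1.0):
    """Weighted sum of the three terms; all weights default to 1."""
    return (lam_adv * adversarial_loss_g(d_score_on_pred)
            + lam_1 * l1_loss(pred, real)
            + lam_gdl * gdl_loss(pred, real))


real = [[0.0, 10.0], [0.0, 10.0]]      # patch with a sharp vertical edge
blurred = [[5.0, 5.0], [5.0, 5.0]]     # edge averaged away
offset = [[5.0, 15.0], [5.0, 15.0]]    # edge kept, brightness shifted
```

In this toy case `blurred` and `offset` have the same pixel-domain loss, but only `blurred` is penalized by the gradient-domain term, because it has lost the edge.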
The loss function for discriminative nets is:

where lDadv is the adversarial loss in the general GAN.
As is common practice, stochastic gradient descent with back-propagation is used to train the generative nets and the discriminative nets alternately.
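Written out per scale k, the two losses described in this subsection would take roughly the following form. This is a reconstruction from the description, assuming the standard binary cross-entropy GAN formulation. The symbol names and the per-scale arrangement are assumptions, not taken verbatim from the paper.

```latex
% Generator loss at scale k: weighted sum of the three terms
\mathcal{L}^{G}_{k} = \lambda_{adv}\, l^{G}_{adv}(\hat{Y}_k)
                    + \lambda_{1}\, l^{G}_{1}(\hat{Y}_k, Y_k)
                    + \lambda_{gdl}\, l^{G}_{gdl}(\hat{Y}_k, Y_k)

% Discriminator loss at scale k: real patches labeled 1, generated patches labeled 0
l^{D}_{adv} = L_{bce}\big(D_k(Y_k), 1\big) + L_{bce}\big(D_k(\hat{Y}_k), 0\big),
\qquad
L_{bce}(p, t) = -\big[\, t \log p + (1 - t) \log (1 - p) \,\big]
```

Under the same assumption, the generator's adversarial term is l^{G}_{adv} = L_{bce}(D_k(\hat{Y}_k), 1), which pushes Gk to produce patches that Dk classifies as real.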

1.5. HEVC Implementation

An extrapolated frame is generated by VC-LAPGAN.
It is treated as a long-term reference frame, so as to avoid the ambiguity of its picture order count (POC).

2. Experimental Results

2.1. BD-Rate

HM-12.0 is used with the low-delay P (LDP) configuration.
2.2% BD-rate reduction is obtained against HEVC.
By using the immediately previous frames as reference frames, 2.0% BD-rate reduction is obtained.
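For readers unfamiliar with the metric, BD-rate (Bjøntegaard delta rate) is the average bit-rate difference between two R-D curves at equal quality, where a negative value means bit-rate saving. A minimal sketch of the usual calculation follows: the logarithm of the bit-rate is fitted as a cubic function of PSNR for each codec, and the gap is averaged over the overlapping PSNR range. The sketch assumes exactly four R-D points per curve, so the cubic fit is an exact interpolation, and the R-D points are hypothetical, not taken from the paper.

```python
import math


def lagrange_eval(xs, ys, x):
    """Evaluate the polynomial passing through (xs[i], ys[i]) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def integrate_cubic(xs, ys, lo, hi):
    """Integrate the cubic through four points over [lo, hi].
    Simpson's rule is exact for polynomials up to degree 3."""
    mid = 0.5 * (lo + hi)
    f = lambda x: lagrange_eval(xs, ys, x)
    return (hi - lo) / 6.0 * (f(lo) + 4.0 * f(mid) + f(hi))


def bd_rate(anchor, test):
    """anchor, test: four (bitrate, psnr) pairs each.
    Returns the average bit-rate difference in percent."""
    psnr_a = [p for _, p in anchor]
    psnr_t = [p for _, p in test]
    log_a = [math.log10(r) for r, _ in anchor]
    log_t = [math.log10(r) for r, _ in test]
    # Only the PSNR range covered by both curves is compared.
    lo = max(min(psnr_a), min(psnr_t))
    hi = min(max(psnr_a), max(psnr_t))
    avg_diff = (integrate_cubic(psnr_t, log_t, lo, hi)
                - integrate_cubic(psnr_a, log_a, lo, hi)) / (hi - lo)
    return (10.0 ** avg_diff - 1.0) * 100.0


# Hypothetical R-D points (kbps, dB); the test codec needs 2% fewer bits
# at every quality level, so the result is approximately -2.0.
anchor_points = [(1000.0, 32.0), (2000.0, 35.0), (4000.0, 38.0), (8000.0, 41.0)]
test_points = [(r * 0.98, p) for r, p in anchor_points]
saving = bd_rate(anchor_points, test_points)
```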

2.2. RD Curves

**R-D curves of the sequences FourPeople (a) and Kimono (b).** (The original image is blurred like this..)

The PSNR improvements are almost the same for the entire bit-rate range, showing that the method is not sensitive to the quantization error in video coding.

2.3. Visual Quality

(The original image is blurred like this..)

(a) The predicted frame by using the HEVC anchor
(b) The extrapolated frame by VC-LAPGAN. Note the zoomed-in portions shown in (a) and (b) indicated by green and blue blocks.
(c) The original frame.
(d) The blocks that chose the extrapolated frame (b) as reference are shown in green.
The blocks that correspond to regions with complex motion tend to choose the extrapolated frame as reference.

(a): Extrapolated frame generated by [6]. Because [6] is trained on RGB, color distortion is noticeable.
(b): VC-LAPGAN has far fewer parameters than the network in [6], and thus incurs lower computational complexity.

This is the 19th story in this month!

Reference

[2018 VCIP] [VC-LAPGAN]
Generative Adversarial Network-Based Frame Extrapolation for Video Coding

Generative Adversarial Network

Image Synthesis [GAN] [CGAN] [LAPGAN] [DCGAN]
Super Resolution [SRGAN & SRResNet] [ESRGAN]
Video Coding [VC-LAPGAN]

Codec Inter Prediction

H.264 [DRNFRUC & DRNWCMC]
HEVC [CNNIF] [Zhang VCIP’17] [NNIP] [Ibrahim ISM’18] [VC-LAPGAN] [VI-CNN] [CNNMCR] [FRUC+DVRF] [FRUC+DVRF+VECNN] [RSR] [Zhao ISCAS’18 & TCSVT’19] [Ma ISCAS’19] [ES] [CNN-SR & CNN-UniSR & CNN-BiSR] [DeepFrame] [U+DVPN] [Multi-Scale CNN]
VVC [FRUC+DVRF+VECNN]