Reading: GVTCNN — Group Variational Transformation Convolutional Neural Network (HEVC Inter)
Outperforms CNNIF, 1.9% BD-rate Reduction Obtained Under LDP Configuration
In this story, Group Variational Transformation Convolutional Neural Network (GVTCNN), by Peking University, is briefly presented. I read this because I work on video coding research. In this paper:
- GVTCNN is used to interpolate the sub-pel pixels based on the integer-pel pixels for inter coding.
This is a paper in 2018 DCC. (Sik-Ho Tsang @ Medium)
- GVTCNN: Network Architecture
- Sample Collection
- Experimental Results
1. GVTCNN: Network Architecture
- The integer-position sample Il is the input of the network.
- 3×3 kernel size for all the convolutional layers with PReLU used.
- Standard MSE loss function is used.
1.1. Shared Feature Map Extraction Component
- A 48-channel feature map is first generated from the integer-position sample, followed by 8 lightweight convolutional layers with 10 channels each, which keeps the number of learned parameters small.
- The 10th layer then derives a 48-channel shared feature map.
- After 9 convolutional layers with 3×3 kernel size, the receptive field of each point in the shared feature map is 19×19.
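The 19×19 figure can be checked with a quick calculation: each stride-1 3×3 convolution grows the receptive field by 2 pixels in each dimension, so 9 stacked layers give 1 + 9×2 = 19.

```python
# Receptive field of a stack of stride-1 convolutions:
# each layer adds (kernel_size - 1) to the receptive field.
def receptive_field(num_layers, kernel_size=3):
    rf = 1
    for _ in range(num_layers):
        rf += kernel_size - 1
    return rf

print(receptive_field(9))  # 19, matching the 19x19 receptive field above
```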
1.2. Group Variational Transformation
- The group variational transformation is further performed over the shared feature maps with a specific convolutional layer for each sub-pixel sample.
- Different residual maps are then generated, and the final inferred sub-pixel position samples are obtained by adding the residual maps to the integer-position sample.
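A minimal NumPy sketch of this idea (shapes and weights below are illustrative, not the paper's trained parameters): the shared feature map passes through one position-specific 3×3 convolution per sub-pixel position, and each resulting residual map is added back to the integer-position sample.

```python
import numpy as np

def conv3x3(x, w):
    """'Same' 3x3 convolution: x is (C_in, H, W), w is (C_out, C_in, 3, 3)."""
    c_out, c_in = w.shape[0], w.shape[1]
    _, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))  # zero-pad spatial dims
    out = np.zeros((c_out, h, wd))
    for o in range(c_out):
        for i in range(c_in):
            for dy in range(3):
                for dx in range(3):
                    out[o] += w[o, i, dy, dx] * xp[i, dy:dy + h, dx:dx + wd]
    return out

rng = np.random.default_rng(0)
features = rng.standard_normal((48, 8, 8))       # shared 48-channel feature map
integer_sample = rng.standard_normal((8, 8))     # integer-position input

# One 3x3 conv head per sub-pixel position (3 half-pel positions shown);
# each head maps the shared features to a residual, added to the input.
heads = [rng.standard_normal((1, 48, 3, 3)) * 0.01 for _ in range(3)]
sub_pel = [integer_sample + conv3x3(features, w)[0] for w in heads]
```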
2. Sample Collection
- The training data is first blurred with a Gaussian filter to simulate the correlations between the integer-position sample and sub-pixel position samples.
- Sub-pixel position samples are later sampled from the blurred image.
- As for the input integer-position sample, an intermediate integer sample is first down-sampled from the raw image; this down-sampled version is then coded by HEVC, so the network input carries realistic compression artifacts.
- Two networks, called GVTCNN-H and GVTCNN-Q respectively, are trained separately for the 3 half-pixel position samples and the 12 quarter-pixel position samples, to better generate samples at the two sub-pixel levels.
- For GVTCNN-H, the 200 training images and 200 testing images of BSDS500 are used for training. 3×3 Gaussian kernels with random standard deviations in the range [0.5, 0.6] are used for blurring.
- For GVTCNN-Q, 10 YUV sequences at sizes 1024×768 and 1920×1080 are used to extract 89 high-resolution frames for synthesizing training data. The standard deviations of the Gaussian kernels range over [0.7, 0.8].
- A model is trained for each QP.
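The data-synthesis step can be roughly sketched as follows, assuming a 2× sampling grid for half-pel phases (the HEVC compression of the integer-position input described above is omitted here for brevity):

```python
import numpy as np

def gaussian_kernel3(sigma):
    """Normalized 3x3 Gaussian kernel with the given standard deviation."""
    ax = np.arange(-1, 2)
    g = np.exp(-ax**2 / (2 * sigma**2))
    k = np.outer(g, g)
    return k / k.sum()

def blur3(img, k):
    """'Same' 3x3 filtering with edge padding."""
    h, w = img.shape
    p = np.pad(img, 1, mode='edge')
    out = np.zeros_like(img, dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += k[dy, dx] * p[dy:dy + h, dx:dx + w]
    return out

rng = np.random.default_rng(1)
raw = rng.random((16, 16))

# Blur with a 3x3 Gaussian; sigma drawn from [0.5, 0.6] for half-pel data.
blurred = blur3(raw, gaussian_kernel3(0.55))

# On a 2x grid, even coordinates give the integer-position sample and the
# odd/even phase combinations give the sub-pixel labels; the (half, half)
# phase is shown, the other phases are sampled analogously.
integer_sample = blurred[0::2, 0::2]
half_pel_hh = blurred[1::2, 1::2]
```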
3. Experimental Results
- HM-16.15 is used under LDP configuration.
- 1.9% BD-rate reduction is obtained.
- Since CNNIF only interpolates half-pel pixels, GVTCNN-H is used for a fair comparison.
- GVTCNN-H outperforms CNNIF with a 2.4% BD-rate reduction.
- When a model is trained separately for each sub-pel position, the BD-rate reduction is still similar to the one using GVT, i.e. the shared GVT design matches per-position models with far fewer networks. (The exact network architecture here is not entirely clear to me.)
This is the 26th story in this month!
Codec Inter Prediction
H.264 [DRNFRUC & DRNWCMC]
HEVC [CNNIF] [Zhang VCIP’17] [NNIP] [GVTCNN] [Ibrahim ISM’18] [VC-LAPGAN] [VI-CNN] [FRUC+DVRF][FRUC+DVRF+VECNN] [RSR] [Zhao ISCAS’18 & TCSVT’19] [Ma ISCAS’19] [Zhang ICIP’19] [ES] [FRCNN] [CNN-SR & CNN-UniSR & CNN-BiSR] [DeepFrame] [U+DVPN] [Multi-Scale CNN]