Reading: Ma ISCAS’19 — Neural Network-Based Arithmetic Coding for Inter Prediction (HEVC Inter)
DenseNet-Like Network, Up to 0.5% and on average 0.3% BD-rate reduction in LDP configuration
In this story, Neural Network-Based Arithmetic Coding for Inter Prediction Information in HEVC (Ma ISCAS’19), by University of Science and Technology of China, and Microsoft Research Asia (MSRA), is briefly reviewed. Before this paper, the authors had already proposed neural network-based arithmetic coding for intra coding in Song VCIP’17 and CNNAC.
In this paper, neural network-based arithmetic coding is proposed for inter coding. Instead of predicting the block itself, the network predicts the syntax elements (SEs): MergeFlag, MergeIdx, RefIdx, MVD_X, MVD_Y, and MVPIdx. These SEs are the intermediate symbols that are encoded into the bitstream.
This paper was published at ISCAS 2019. By the way, ISCAS this year (2020) has been postponed from May to October. (Sik-Ho Tsang @ Medium)
Outline
- HEVC Inter Coding
- Proposed Network Architecture
- Experimental Results
1. HEVC Inter Coding
- In HEVC, the coding unit (CU) size varies from 64×64 to 8×8.
- In inter coding, a CU can be divided into prediction units (PUs) using the split modes N×N, N/2×N, N×N/2, N/2×N/2, N/4×N(L), N/4×N(R), N×N/4(U) and N×N/4(D), as shown above, where N is the CU size.
- The reason is that there may be different motions within a CU. A finer division helps to find the correct motions and therefore more accurate matching blocks. With a more accurate matching block, the difference between the current block and the matching block (the residue) is reduced, and bitrate is saved.
- (Please visit Sections 1 & 2 of IPCNN for the importance of video coding and the introduction of HEVC quad tree coding.)
- As mentioned, the network predicts the syntax elements (SEs): MergeFlag, MergeIdx, RefIdx, MVD_X, MVD_Y, and MVPIdx.
- Simply speaking, MergeFlag: A flag to indicate the use of merge mode.
- MergeIdx: An index to indicate which candidate is used as the motion vector when merge mode is used.
- RefIdx: An index to indicate which reference frame is used.
- MVD_X (MVD_Y): The difference between the motion vector and motion vector predictor in x- (y-) direction.
- MVPIdx: An index to indicate which motion vector predictor is used.
- These SEs are encoded when a matching block from a reference frame is used to predict the current block.
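As a rough illustration of the split modes listed above, the sketch below returns the PU sizes (width, height) for a CU of size N under each mode. The mode names are just labels taken from the list above, and the function name `pu_sizes` is hypothetical. It ignores any restriction HEVC places on which modes are allowed at which CU size.

```python
def pu_sizes(n, mode):
    """Return the (width, height) of each PU for an n x n CU under `mode`."""
    half, quarter = n // 2, n // 4
    table = {
        "NxN": [(n, n)],                                   # no split
        "N/2xN": [(half, n), (half, n)],                   # left / right halves
        "NxN/2": [(n, half), (n, half)],                   # top / bottom halves
        "N/2xN/2": [(half, half)] * 4,                     # four quadrants
        "N/4xN(L)": [(quarter, n), (n - quarter, n)],      # narrow PU on the left
        "N/4xN(R)": [(n - quarter, n), (quarter, n)],      # narrow PU on the right
        "NxN/4(U)": [(n, quarter), (n, n - quarter)],      # thin PU on top
        "NxN/4(D)": [(n, n - quarter), (n, quarter)],      # thin PU at the bottom
    }
    return table[mode]


ALL_MODES = ["NxN", "N/2xN", "NxN/2", "N/2xN/2",
             "N/4xN(L)", "N/4xN(R)", "NxN/4(U)", "NxN/4(D)"]
```

For example, `pu_sizes(32, "N/4xN(L)")` gives one 8×32 PU and one 24×32 PU. Whatever the mode, the PUs always cover the whole CU.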
Basically, these are numerical values to be encoded when the CU is inter coded. A CNN is now used to predict the probability distribution of each value so that fewer bits are needed to code it.
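Why does a better prediction save bits? An arithmetic coder spends roughly -log2(p) bits on a symbol whose estimated probability is p, so the more probability the model puts on the value that actually occurs, the cheaper that value is to code. Here is a minimal sketch of this cost. The function name `ideal_bits` and the probabilities are made up for illustration and are not taken from the paper.

```python
import math


def ideal_bits(probabilities, symbol):
    """Ideal arithmetic-coding cost, in bits, of `symbol` under `probabilities`."""
    return -math.log2(probabilities[symbol])


# Hypothetical distributions for a binary SE such as MergeFlag.
uniform = [0.5, 0.5]      # no knowledge: 1 bit per flag
predicted = [0.9, 0.1]    # a confident (made-up) model output

cost_uniform = ideal_bits(uniform, 0)
cost_when_right = ideal_bits(predicted, 0)   # model favoured the true value
cost_when_wrong = ideal_bits(predicted, 1)   # model favoured the other value
```

A confident and correct prediction costs well under 1 bit, while a confident but wrong one costs more than 1 bit, so the gain depends on how often the network is right.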
2. Proposed Network Architecture
2.1. Input
- Because many different PU sizes must be supported, 25 models would be needed. In the paper, however, the different PU sizes are unified so that only one model is trained.
- Each PU (shown in red above, and possibly of various sizes) is extended to a square area: by 20 units to the top and to the left, and by 2 units to the right and to the bottom. A 22×22 unit area in the current picture is thus obtained.
- Since 4 pixels is the smallest PU width/height, each unit represents 4 pixels. Therefore, the 22×22 unit area corresponds to an 88×88 pixel area, which is input into the CNN.
- For the SEs MergeFlag, MergeIdx, MVD_X, MVD_Y and MVPIdx, the input has 9 channels.
- 5 channels come from the current frame and 4 channels from the reference frame.
- Both frames contribute the RefIdx, MV_X, MV_Y, and the encoded values of the syntax element (i.e. MergeFlag, MergeIdx, MVD or MVPIdx), which gives 4 channels from the current frame and 4 channels from the reference frame.
- The last channel is a mask, where 1 represents the area that is already encoded, 2 represents the area of the current prediction unit, and 0 represents the area that is not yet encoded.
- For the SE RefIdx, the input has 7 channels, since the RefIdx channel is omitted for both the current and reference pictures.
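The channel bookkeeping above can be sketched as follows. The channel names, their order, and the helper names `input_channels` and `build_mask` are assumptions for illustration; only the counts (9 and 7), the 22×22 window, the 4-pixel unit and the mask values come from the description above. Where the PU sits inside the window is left to the caller rather than guessed.

```python
UNIT_PIXELS = 4     # each unit represents 4 pixels
WINDOW_UNITS = 22   # the input window is 22x22 units

# Mask values as described in the text.
MASK_NOT_CODED, MASK_CODED, MASK_CURRENT_PU = 0, 1, 2


def input_channels(se):
    """List (hypothetical) channel names of the network input for syntax element `se`."""
    per_frame = ["RefIdx", "MV_X", "MV_Y", "coded_SE_values"]
    if se == "RefIdx":
        # The RefIdx channel is omitted for both frames when RefIdx is predicted.
        per_frame = [c for c in per_frame if c != "RefIdx"]
    channels = [f"cur_{c}" for c in per_frame]     # current frame
    channels += [f"ref_{c}" for c in per_frame]    # reference frame
    channels.append("mask")                        # mask belongs to the current frame
    return channels


def build_mask(coded_units, current_pu_units):
    """Build the 22x22 mask from sets of (row, col) unit coordinates."""
    mask = [[MASK_NOT_CODED] * WINDOW_UNITS for _ in range(WINDOW_UNITS)]
    for r, c in coded_units:
        mask[r][c] = MASK_CODED
    for r, c in current_pu_units:
        mask[r][c] = MASK_CURRENT_PU   # the current PU overrides any other label
    return mask
```

With this layout, `len(input_channels("MergeFlag"))` is 9, `len(input_channels("RefIdx"))` is 7, and `WINDOW_UNITS * UNIT_PIXELS` gives the 88-pixel side of the area.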
2.2. DenseNet-Like Network Architecture
- The entire network of DenseNet is divided into multiple densely connected (dense) blocks.
- In dense blocks, the basic layer consists of one convolutional layer followed by ReLU and batch normalization (BN).
- Between dense blocks, a transition layer is used to down-sample the feature maps.
- At the end of the last dense block, a softmax layer is attached to predict the probability distribution of every candidate.
- Through validation, it is found that different network configurations work best for different SEs at different QPs, as shown above.
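To get a feel for the structure, here is a small sketch that only tracks feature-map shapes through such a network, plus the final softmax. It relies on the standard DenseNet idea that each layer's output is concatenated with everything before it, so a block adds `growth_rate` channels per layer. The growth rate, the number of layers per block, the 2× down-sampling in the transition layer and the unchanged channel count there are all assumptions for illustration, not the configurations reported in the paper.

```python
import math


def dense_block_channels(in_channels, num_layers, growth_rate):
    """Channels after a dense block: every layer appends `growth_rate` channels."""
    return in_channels + num_layers * growth_rate


def trace_shapes(in_channels, size, blocks, growth_rate):
    """Return the (channels, height, width) after each dense block."""
    shapes = []
    channels = in_channels
    for i, num_layers in enumerate(blocks):
        channels = dense_block_channels(channels, num_layers, growth_rate)
        shapes.append((channels, size, size))
        if i < len(blocks) - 1:
            size //= 2   # assumed: the transition layer halves the spatial size
    return shapes


def softmax(scores):
    """Numerically stable softmax turning scores into a probability distribution."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

For instance, `trace_shapes(9, 22, [3, 3], 12)` starts from the 9-channel 22×22 input and ends with 81 channels at 11×11. The softmax output is the distribution that is handed to the arithmetic coder.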
2.3. Arithmetic Coding
- After obtaining the probability distribution from the neural network, the value of the syntax element together with its probability distribution is fed into an arithmetic coding engine.
- For MergeFlag, MergeIdx, RefIdx and MVPIdx, 2-, 5-, 4- and 2-level arithmetic codecs are used, respectively.
- For MVD_X (MVD_Y), a series of binary decisions is used:
- Whether the value of the motion vector difference is zero.
- If it is not zero, whether it is larger than zero.
- Whether its absolute value is larger than one, . . . .
- After that, a sequence of bins is encoded with binary arithmetic coding.
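The first binary decisions for one MVD component can be sketched as below. Only the three decisions named above are modelled; the further decisions are not spelled out here, so the sketch stops there rather than guessing them. The function name `leading_mvd_bins` and the bin labels are hypothetical.

```python
def leading_mvd_bins(mvd):
    """Return the first binary decisions for one MVD component as (label, answer) pairs."""
    bins = [("is_zero", mvd == 0)]
    if mvd != 0:
        # Only asked when the value is non-zero, in the order given in the text.
        bins.append(("greater_than_zero", mvd > 0))
        bins.append(("abs_greater_than_one", abs(mvd) > 1))
    return bins
```

For example, `leading_mvd_bins(0)` yields a single bin, while `leading_mvd_bins(-3)` yields three bins (not zero, not positive, magnitude above one). Each bin would then be coded with the probability the network predicts for it.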
3. Experimental Results
- HM-12.0 is used with LDP configuration.
- LDP (Low-Delay P) means there are no B-frames, mainly P-frames.
- CDVL and SJTU databases are used as training data.
- 1M samples are used for training and 50K samples for validation.
- With the proposed approach, the bits for inter prediction information are reduced by 13.9%, 13.5% and 13.5% for Y, U and V, respectively.
- An average BD-rate reduction of 0.3%, 0.3% and 0.3% is obtained for Y, U and V, respectively.
Reference
[2019 ISCAS] [Ma ISCAS’19]
Neural Network-Based Arithmetic Coding for Inter Prediction Information in HEVC
Codec Inter Prediction
HEVC [Zhang VCIP’17] [NNIP] [Ibrahim ISM’18] [Zhao ISCAS’18 & TCSVT’19] [Ma ISCAS’19]