# Review — SepConv++: Revisiting Adaptive Convolutions for Video Frame Interpolation

**SepConv++: A Set of Small Improvements to Adaptive Separable Convolutions That Achieves SOTA Performance**

--

In this story, **Revisiting Adaptive Convolutions for Video Frame Interpolation**, by Adobe Research, is reviewed. In this paper:

- **A network using adaptive separable convolutions is improved by a subtle set of low-level improvements.**
- These improvements are **delayed padding** (+0.37 dB), **input normalization** (+0.30 dB), **network improvements** (+0.42 dB), **kernel normalization** (+0.52 dB), **contextual training** (+0.18 dB), and **self-ensembling** (+0.18 dB).

With all these improvements, it is a paper published in **2021 WACV**. (Sik-Ho Tsang @ Medium)

# Outline

1. **Proposed Video Frame Interpolation Framework**
2. **Delayed Padding (+0.37 dB)**
3. **Input Normalization (+0.30 dB)**
4. **Network Improvements (+0.42 dB)**
5. **Kernel Normalization (+0.52 dB)**
6. **Contextual Training (+0.18 dB)**
7. **Self-Ensembling (+0.18 dB)**
8. **Experimental Results**

# 1. Proposed Video Frame Interpolation Framework

- Given two consecutive frames $I_1$ and $I_2$ from a video, the target is to **synthesize the intermediate frame $\hat{I}$** that is temporally centered.
- SepConv leverages adaptive separable convolutions by having a neural network predict a set of pixel-wise, spatially-varying one-dimensional filter kernels $\{K_{1,h}, K_{1,v}, K_{2,h}, K_{2,v}\}$.
- $I_1$ is filtered with the separable filters $K_{1,h}$, $K_{1,v}$, while $I_2$ is filtered with the separable filters $K_{2,h}$, $K_{2,v}$.
- These spatially-varying kernels capture motion and resampling information, which makes for an effective image formation model for frame interpolation.
- To be able to account for large motion, the kernels should be as large as possible. However, with larger kernels it is more difficult to estimate all coefficients.
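To make the image formation model concrete, here is a minimal NumPy sketch of applying per-pixel separable kernels to two grayscale frames. It assumes the usual SepConv formulation in which the 2D kernel at each pixel is the outer product of its vertical and horizontal 1D kernels and the two filtered frames are summed. The function names `sepconv_apply` and `interpolate` and the edge-replication padding are illustrative assumptions, not the authors' code.

```python
import numpy as np

def sepconv_apply(image, k_v, k_h):
    """Filter a grayscale image with per-pixel separable kernels.

    image: (H, W) array.
    k_v, k_h: (H, W, n) arrays; the 2D kernel at pixel (y, x) is the
              outer product of k_v[y, x] and k_h[y, x].
    """
    height, width = image.shape
    n = k_v.shape[-1]
    # Assumption: replicate the border so every pixel has a full n x n patch.
    padded = np.pad(image, n // 2, mode="edge")
    out = np.zeros((height, width), dtype=np.float64)
    for y in range(height):
        for x in range(width):
            patch = padded[y:y + n, x:x + n]
            # sum_ij k_v[i] * patch[i, j] * k_h[j]
            out[y, x] = k_v[y, x] @ patch @ k_h[y, x]
    return out

def interpolate(i1, i2, kernels):
    """Sum of the two adaptively filtered input frames."""
    k1v, k1h, k2v, k2h = kernels
    return sepconv_apply(i1, k1v, k1h) + sepconv_apply(i2, k2v, k2h)
```

With one-hot kernels the operation reduces to picking a single source pixel per output pixel, which is how these kernels can encode motion: shifting the non-zero tap shifts where the output samples from.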

# 2. Delayed Padding

- Specifically, the original SepConv pads the input frames by 25 pixels before estimating the adaptive kernel coefficients via a neural network.

- In contrast, the authors propose not to pad the input images when they are given to the network, but instead to pad them when the predicted kernels are applied to the input images.

- This delayed padding has two positive effects.

- First, it **improves the computational efficiency**. The **original SepConv takes 0.027 seconds** to interpolate a frame at a resolution of 512 × 512 pixels. In comparison, **it takes 0.018 seconds with the delayed padding.**
- Second, it **improves the quality** of the interpolated results, since the neural network **does not have to deal with large padded boundaries that are outside of the manifold of natural images.**
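As a rough back-of-the-envelope illustration of the efficiency point, the sketch below counts how many pixels the kernel-prediction network has to process per frame in each scheme. It assumes the 25-pixel padding is applied on every side of the frame; `network_input_pixels` is a hypothetical helper, and the count covers input pixels only, so it is not meant to reproduce the timings quoted above.

```python
def network_input_pixels(height, width, pad, delayed):
    """Pixels the kernel-prediction network processes for one frame."""
    if delayed:
        # Delayed padding: the network sees the unpadded frame.
        return height * width
    # Early padding: the frame is enlarged before it reaches the network.
    return (height + 2 * pad) * (width + 2 * pad)

early = network_input_pixels(512, 512, 25, delayed=False)
late = network_input_pixels(512, 512, 25, delayed=True)
ratio = early / late  # how much more input the early-padding network sees
```

Under these assumptions the early-padding network processes roughly a fifth more pixels for a 512 × 512 frame, all of them synthetic border content.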

# 3. Input Normalization

- The contrast and brightness of the input frames should not affect the quality of the synthesized results.
- To normalize the input frames, the intensity values are shifted and rescaled to have zero mean and unit standard deviation.
- There are multiple possibilities to do so. It is found that normalizing the two images jointly, while treating each color channel separately, works well. That is, for each color channel, the mean and standard deviation of $I_1$ and $I_2$ are computed as if they were one image.
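The joint, per-channel normalization described above can be sketched in a few lines of NumPy. The helper names, the `eps` guard against division by zero, and the `denormalize` step that maps the synthesized frame back to the original intensity range are assumptions added for illustration.

```python
import numpy as np

def normalize_jointly(i1, i2, eps=1e-8):
    """Normalize two (H, W, C) frames with shared per-channel statistics."""
    both = np.stack([i1, i2])            # (2, H, W, C): treat as one image
    mean = both.mean(axis=(0, 1, 2))     # one mean per color channel
    scale = both.std(axis=(0, 1, 2)) + eps
    return (i1 - mean) / scale, (i2 - mean) / scale, mean, scale

def denormalize(frame, mean, scale):
    """Undo the normalization on a synthesized frame (assumed step)."""
    return frame * scale + mean
```

Because both frames share the same statistics, a global brightness or contrast change applied to both inputs leaves the normalized frames essentially unchanged, which is the invariance the section asks for.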

# 4. Network Improvements

- As shown in the figure above, **residual blocks** are added to the skip connections that join the two halves of the U-Net.
- The activation function is changed to **parametric rectified linear units (PReLU)**.
- The average pooling is replaced with **strided convolutions**.
- **A Kahan sum** is used within the adaptive separable convolutions.
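For readers unfamiliar with the Kahan sum mentioned above, here is a plain-Python sketch of compensated summation next to a naive running sum. It only illustrates the numerical technique; how it is wired into the authors' convolution layer is not shown here, and the example values are made up.

```python
import math

def naive_sum(values):
    """Plain left-to-right floating-point accumulation."""
    total = 0.0
    for v in values:
        total += v
    return total

def kahan_sum(values):
    """Compensated summation: carry the rounding error into the next step."""
    total = 0.0
    compensation = 0.0  # running estimate of the low-order bits lost so far
    for v in values:
        y = v - compensation
        t = total + y
        compensation = (t - total) - y  # what was lost when forming t
        total = t
    return total

# One large term followed by many terms too small to register on their own.
values = [1.0] + [1e-16] * 100000
reference = math.fsum(values)  # correctly rounded reference sum
```

In this example the naive sum never moves away from 1.0 because each small term is rounded off, while the compensated sum stays close to the reference. Adaptive kernels with many small coefficients accumulate in a similar way.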

# 5. Kernel Normalization

- The neural network that predicts **the kernel coefficients needs to take great care not to alter the apparent brightness of a synthesized pixel.**
- A simple normalization step that can be applied to any kernel-based image formation model is proposed.

- This simple kernel normalization step **improves the quality** of the synthesis results and greatly **helps with the convergence** of the model during training.
- This kernel normalization has the most significant impact on the quality of the synthesized results.
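One way to realize such a normalization, sketched below under stated assumptions, is to divide the filtered result by what the same kernels produce on an all-ones image, so that the effective weights at each pixel sum to one. The function names, the edge padding, and the `eps` guard are illustrative choices rather than details taken from the paper.

```python
import numpy as np

def sepconv_apply(image, k_v, k_h):
    """Apply per-pixel separable kernels; k_v and k_h have shape (H, W, n)."""
    n = k_v.shape[-1]
    padded = np.pad(image, n // 2, mode="edge")
    # windows[y, x] is the n x n patch centred on pixel (y, x)
    windows = np.lib.stride_tricks.sliding_window_view(padded, (n, n))
    return np.einsum("yxi,yxij,yxj->yx", k_v, windows, k_h)

def interpolate_normalized(i1, i2, k1v, k1h, k2v, k2h, eps=1e-8):
    """Kernel-normalized synthesis: divide by the total kernel weight."""
    ones = np.ones_like(i1)
    numerator = sepconv_apply(i1, k1v, k1h) + sepconv_apply(i2, k2v, k2h)
    denominator = sepconv_apply(ones, k1v, k1h) + sepconv_apply(ones, k2v, k2h)
    return numerator / (denominator + eps)
```

With this division in place, two constant-brightness input frames yield the same constant brightness at the output whatever scale the predicted coefficients have, so the network no longer has to learn that constraint itself.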

# 6. Contextual Training

- Originally, **there is no constraint that forces the kernel prediction network to estimate coefficients that account for the true motion.** Instead, the kernel prediction network may simply index pixels that have the desired color. This may **hurt the generalizability.**
- **Contextual training forces it to predict coefficients that agree with the true motion through a contextual loss.**
- Specifically, the context is obtained from relu1_2 of a pre-trained VGGNet *ψ*, with α = 0.1.
- The loss function now minimizes the difference between the prediction and the ground truth not just in color space but also in the contextual space.

- Since each pixel in the contextual space not only describes the color of a single pixel but also encodes its local neighborhood, **this loss effectively prevents the kernel prediction network from simply indexing pixels based on their color.**
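The idea can be sketched as follows, with several loud assumptions: `toy_context` is a 3 × 3 local mean standing in for the VGG relu1_2 features, `synthesize` is a hypothetical callable that applies the predicted kernels to a pair of inputs, both terms use a mean absolute (L1) difference, and α simply scales the context features. The exact form of the loss in the paper may differ.

```python
import numpy as np

ALPHA = 0.1  # context weight quoted in the text

def toy_context(image):
    """Hypothetical stand-in for VGG relu1_2 features: a 3x3 local mean."""
    padded = np.pad(image, 1, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, (3, 3))
    return windows.mean(axis=(-2, -1))

def contextual_loss(i1, i2, gt, synthesize, psi=toy_context, alpha=ALPHA):
    """Color-space L1 term plus context-space L1 term.

    synthesize(a, b) is assumed to apply the same predicted kernels to
    whatever pair of inputs it receives (frames or their context maps).
    """
    pred_color = synthesize(i1, i2)
    pred_context = synthesize(alpha * psi(i1), alpha * psi(i2))
    color_term = np.abs(pred_color - gt).mean()
    context_term = np.abs(pred_context - alpha * psi(gt)).mean()
    return color_term + context_term
```

The point of the second term is that the same kernels must also move the neighborhood descriptors to the right place, which a kernel that merely picks a pixel of the right color somewhere else would fail to do.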

# 7. Self-Ensembling

- In image classification and super-resolution, a single prediction is often enhanced by combining the predictions of multiple transformed versions of the same input. Such transforms include **rotations**, **mirroring**, or **cropping**.
- Surprisingly, there is no study of the effect of such a self-ensembling approach in the context of frame interpolation.
- The proposed SepConv++ considers taking the **mean** and taking the **median** of up to **sixteen predictions**, with transforms based on **reversing** the input frames, **flipping** them, **mirroring** them, and applying **rotations** by ninety degrees.
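A minimal sketch of self-ensembling for frame interpolation is given below. It assumes the sixteen predictions come from two frame orders times the eight flip and 90-degree rotation combinations; the exact transform set used by the authors is an assumption here, and `model` is any hypothetical callable that maps two frames to the intermediate frame.

```python
import numpy as np

def self_ensemble(model, i1, i2, reduce="mean"):
    """Combine predictions over 2 frame orders x 2 flips x 4 rotations."""
    predictions = []
    for reverse in (False, True):
        # Swapping the inputs is valid because the target is the middle frame.
        a, b = (i2, i1) if reverse else (i1, i2)
        for flip in (False, True):
            for k in range(4):
                def forward(img):
                    img = np.flip(img, axis=1) if flip else img
                    return np.rot90(img, k, axes=(0, 1))
                out = model(forward(a), forward(b))
                # Undo the transforms in reverse order: rotation, then flip.
                out = np.rot90(out, -k, axes=(0, 1))
                out = np.flip(out, axis=1) if flip else out
                predictions.append(out)
    stack = np.stack(predictions)  # (16, H, W, ...)
    if reduce == "median":
        return np.median(stack, axis=0)
    return stack.mean(axis=0)
```

Each transformed prediction costs one full forward pass, so the runtime grows linearly with the number of ensemble members.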

# 8. Experimental Results

With all the improvements added together, the proposed SepConv++ obtains the highest PSNR on multiple datasets.

- **All methods benefit from self-ensembling** across all datasets.
- However, by combining eight independent predictions, the processing time can become **tens of minutes to process a single second of high-resolution footage, which is beyond the threshold of being practical for many applications.**

- The above figure demonstrates the efficacy of the proposed SepConv++ over SepConv.

SepConv++ is proposed by fixing the problems in SepConv & also using a bunch of techniques.

## Reference

[2021 WACV] [SepConv++]

Revisiting Adaptive Convolutions for Video Frame Interpolation

## Video Frame Interpolation

**2017** [AdaConv] [SepConv] **2020** [DSepConv] **2021** [SepConv++]