Review — SepConv++: Revisiting Adaptive Convolutions for Video Frame Interpolation

SepConv++: A Bunch of Small Improvements for Adaptive Separable Convolutions, Achieving SOTA Performance

Sik-Ho Tsang
5 min read · Sep 19, 2021
Kernel-Based Interpolation with Spatially-Varying Kernels

In this story, Revisiting Adaptive Convolutions for Video Frame Interpolation, by Adobe Research, is reviewed. In this paper:

  • The network using adaptive separable convolutions is improved by a subtle set of low-level improvements.
  • These improvements are delayed padding (+0.37 dB), input normalization (+0.30 dB), network improvements (+0.42 dB), kernel normalization (+0.52 dB), contextual training (+0.18 dB), self-ensembling (+0.18 dB).

With all these improvements, SepConv++ achieves state-of-the-art performance. This is a paper published in 2021 WACV. (Sik-Ho Tsang @ Medium)

Outline

  1. Proposed Video Frame Interpolation Framework
  2. Delayed Padding (+0.37 dB)
  3. Input Normalization (+0.30 dB)
  4. Network Improvements (+0.42 dB)
  5. Kernel Normalization (+0.52 dB)
  6. Contextual Training (+0.18 dB)
  7. Self-Ensembling (+0.18 dB)
  8. Experimental Results

1. Proposed Video Frame Interpolation Framework

Proposed Video Frame Interpolation Framework (φ denotes the adaptive separable convolution operator)
  • Given two consecutive frames I1 and I2 from a video, the goal is to synthesize the intermediate frame Î that is temporally centered.
  • SepConv leverages adaptive separable convolutions by having a neural network predict a set of pixel-wise spatially-varying one-dimensional filter kernels {K1,h, K1,v, K2,h, K2,v} from the two input frames.
  • I1 is filtered with the separable filters K1,h, K1,v while I2 is filtered with the separable filters K2,h, K2,v, and the two filtered frames are summed to form the prediction (a sketch follows at the end of this section): Î = φ(I1, K1,h, K1,v) + φ(I2, K2,h, K2,v).

These spatially-varying kernels capture motion and resampling information, which makes for an effective image formation model for frame interpolation.

To be able to account for large motion, the kernels should be as large as possible. However, with larger kernels it is more difficult to estimate all coefficients.
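To make this image formation model concrete, here is a minimal PyTorch sketch of the adaptive separable convolution operator φ. It is a naive unfold-based version for illustration only; the paper uses an efficient CUDA implementation, and the function name and tensor shapes here are my own assumptions.

```python
# A minimal sketch of the adaptive separable convolution operator: each
# output pixel is a KxK weighted sum of its neighborhood, where the 2D
# kernel is the outer product of per-pixel vertical and horizontal 1D kernels.
import torch
import torch.nn.functional as F

def sepconv(frame, k_v, k_h):
    """frame: (B, C, H, W); k_v, k_h: (B, K, H, W) per-pixel 1D kernels."""
    B, C, H, W = frame.shape
    K = k_v.shape[1]
    # Gather the KxK neighborhood of every pixel: (B, C, K, K, H, W).
    patches = F.unfold(frame, kernel_size=K, padding=K // 2)
    patches = patches.view(B, C, K, K, H, W)
    # Outer product of the two 1D kernels gives the 2D kernel per pixel.
    kernel = k_v.unsqueeze(2) * k_h.unsqueeze(1)  # (B, K, K, H, W)
    return (patches * kernel.unsqueeze(1)).sum(dim=(2, 3))
```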

2. Delayed Padding

  • Specifically, the original SepConv pads the input frames by 25 pixels before estimating the adaptive kernel coefficients via a neural network.
  • In contrast, the authors propose not to pad the input images when they are given to the kernel prediction network, but instead to pad them when the predicted kernels are applied to the input images, as sketched at the end of this section.
  • This delayed padding has two positive effects.
  1. First, it improves the computational efficiency. The original SepConv implementation takes 0.027 seconds to interpolate a frame at a resolution of 512 × 512 pixels. In comparison, it takes 0.018 seconds with the delayed padding.
  2. Second, it improves the quality of the interpolated results since the neural network does not have to deal with large padded boundaries that are outside of the manifold of natural images.
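A sketch of how delayed padding changes the pipeline, assuming a hypothetical kernel-prediction network `net` and reusing the unfold approach from above; the replication padding mode is my assumption, not confirmed by the review.

```python
import torch.nn.functional as F

def sepconv_valid(frame_padded, k_v, k_h):
    """Like sepconv() above, but without internal padding: the frame is
    already padded by K//2, so unfold yields one patch per output pixel."""
    B, C = frame_padded.shape[:2]
    K, H, W = k_v.shape[1], k_v.shape[2], k_v.shape[3]
    patches = F.unfold(frame_padded, kernel_size=K).view(B, C, K, K, H, W)
    kernel = (k_v.unsqueeze(2) * k_h.unsqueeze(1)).unsqueeze(1)
    return (patches * kernel).sum(dim=(2, 3))

def interpolate_delayed_padding(net, I1, I2, K=51):
    pad = K // 2  # 25 pixels for 51-tap kernels, as in the paper
    # Kernels are predicted from the *unpadded* frames...
    k1_v, k1_h, k2_v, k2_h = net(I1, I2)
    # ...and padding is applied only when the kernels are used.
    I1p = F.pad(I1, [pad] * 4, mode='replicate')  # padding mode is an assumption
    I2p = F.pad(I2, [pad] * 4, mode='replicate')
    return sepconv_valid(I1p, k1_v, k1_h) + sepconv_valid(I2p, k2_v, k2_h)
```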

3. Input Normalization

  • The contrast and brightness of the input frames should not affect the quality of the synthesized results.
  • To normalize the input frames, the intensity values are shifted and rescaled to have zero mean and unit standard deviation.
  • There are multiple ways to do so; it was found that normalizing the two images jointly while treating each color channel separately works well. That is, for each color channel, the mean and standard deviation of I1 and I2 are computed as if they were one image.
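A minimal sketch of this joint, per-channel normalization in PyTorch; the helper name and the eps safeguard are my own.

```python
import torch

def normalize_jointly(I1, I2, eps=1e-8):
    both = torch.cat([I1, I2], dim=3)            # side by side, as one image
    mean = both.mean(dim=(2, 3), keepdim=True)   # per sample, per color channel
    std = both.std(dim=(2, 3), keepdim=True) + eps
    return (I1 - mean) / std, (I2 - mean) / std, mean, std

# The network runs on the normalized frames; the synthesized frame is then
# de-normalized with the same statistics: I_hat = model(I1n, I2n) * std + mean
```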

4. Network Improvements

  • As shown in the figure above, residual blocks are added to the skip connections that join the two halves of the U-Net.
  • The activation function is changed to parametric rectified linear units (PReLU).
  • The average pooling is replaced with strided convolutions, and a Kahan summation is employed within the adaptive separable convolutions to reduce floating-point error when accumulating the weighted samples (sketched below).
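For reference, a generic Kahan (compensated) summation in plain Python; the paper applies this idea inside its separable-convolution kernel, so this standalone version only illustrates the principle.

```python
# Kahan summation: a running compensation term recovers the low-order bits
# that naive floating-point addition drops when accumulating many small terms.
def kahan_sum(values):
    total = 0.0
    compensation = 0.0          # running estimate of the lost low-order bits
    for v in values:
        y = v - compensation    # add back what was lost on the last step
        t = total + y           # low-order bits of y are lost here...
        compensation = (t - total) - y  # ...and recovered here
        total = t
    return total
```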

5. Kernel Normalization

  • The neural network that predicts the kernel coefficients needs to take great care not to alter the apparent brightness of a synthesized pixel.
  • A simple normalization step that can be applied to any kernel-based image formation model is proposed: the synthesized frame is divided by the result of applying the same kernels to an image of ones, so that the coefficients contributing to each output pixel effectively sum to one (see the sketch after this list).
  • This simple kernel normalization step improves the quality of the synthesis results and greatly helps with the convergence of the model during training.
  • This kernel normalization has the most significant impact on the quality of the synthesized results.
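A sketch of kernel normalization, reusing the sepconv() helper from above; the eps safeguard is my own addition.

```python
import torch

def interpolate_normalized(I1, I2, k1_v, k1_h, k2_v, k2_h, eps=1e-8):
    ones = torch.ones_like(I1[:, :1])  # single-channel image of ones
    numerator = sepconv(I1, k1_v, k1_h) + sepconv(I2, k2_v, k2_h)
    # Filtering an all-ones image yields, per pixel, the sum of the kernel
    # coefficients; dividing by it normalizes the coefficients to sum to one.
    denominator = sepconv(ones, k1_v, k1_h) + sepconv(ones, k2_v, k2_h)
    return numerator / (denominator + eps)  # eps avoids division by zero
```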

6. Contextual Training

  • Originally, there is no constraint that forces the kernel prediction network to estimate coefficients that account for the true motion. Instead, the kernel prediction network may simply index pixels that have the desired color. This may hurt the generalizability.
  • Contextual training forces it to predict coefficients that agree with the true motion through a contextual loss.
  • Specifically, the context is obtained from relu1_2 of a pre-trained VGGNet ψ, with α = 0.1.
  • The loss function now not only minimizes the difference between the prediction and the ground truth in color space but also in contextual space: the predicted kernels are applied to the contexts ψ(I1) and ψ(I2) as well, and the filtered context is compared to the ground-truth context ψ(Igt), weighted by α.
  • Since each pixel in the contextual space not only describes the color of a single pixel but also encodes its local neighborhood, this loss effectively prevents the kernel prediction network from simply indexing pixels based on their color.
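A minimal sketch of contextual training, reusing sepconv() from above. The choice of VGG-19, the L1 distance, and the helper names are my assumptions; the review only specifies relu1_2 features and α = 0.1 (ImageNet input normalization is omitted for brevity).

```python
import torch
import torchvision

# relu1_2 features of a pre-trained VGG: full-resolution, 64-channel context.
vgg = torchvision.models.vgg19(weights='IMAGENET1K_V1').features[:4].eval()

def psi(frame):
    with torch.no_grad():  # the context extractor itself is not trained
        return vgg(frame)

def contextual_loss(I1, I2, Igt, k1_v, k1_h, k2_v, k2_h, alpha=0.1):
    # Color-space prediction and its contextual counterpart, filtered with
    # the same per-pixel kernels.
    pred = sepconv(I1, k1_v, k1_h) + sepconv(I2, k2_v, k2_h)
    pred_ctx = sepconv(psi(I1), k1_v, k1_h) + sepconv(psi(I2), k2_v, k2_h)
    loss_color = (pred - Igt).abs().mean()       # L1 distance is an assumption
    loss_ctx = (pred_ctx - psi(Igt)).abs().mean()
    return loss_color + alpha * loss_ctx
```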

7. Self-Ensembling

  • In image classification and super-resolution, a single prediction is often enhanced by combining the predictions of multiple transformed versions of the same input. Such transforms include rotations, mirroring, or cropping.
  • Surprisingly, there is no study of the effect of such a self-ensembling approach in the context of frame interpolation.

The proposed SepConv++ considers taking either the mean or the median of up to sixteen predictions, with transforms based on reversing the input frames, flipping them, mirroring them, and applying rotations by ninety degrees.
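A minimal sketch of self-ensembling with eight transforms (temporal reversal, vertical flip, horizontal mirror), assuming a black-box `model(I1, I2)` interface; adding the ninety-degree rotations mentioned above would double this to sixteen. Reversing the frame order needs no inverse transform since the target frame is temporally centered.

```python
import torch

def self_ensemble(model, I1, I2, use_median=False):
    preds = []
    for reverse in (False, True):             # temporal order of the inputs
        for flip in (False, True):            # vertical flip
            for mirror in (False, True):      # horizontal mirror
                a, b = (I2, I1) if reverse else (I1, I2)
                dims = ([2] if flip else []) + ([3] if mirror else [])
                if dims:
                    a, b = a.flip(dims), b.flip(dims)
                out = model(a, b)
                # Undo the spatial transform on the prediction.
                preds.append(out.flip(dims) if dims else out)
    stack = torch.stack(preds)
    return stack.median(dim=0).values if use_median else stack.mean(dim=0)
```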

8. Experimental Results

Ablation experiments quantitatively analyzing the effects of the proposed techniques

With all the improvements added together, proposed SepConv++ obtains the highest PSNR on multiple datasets.

Effect of combining the mean of eight independent predictions for several video frame interpolation methods
  • All methods benefit from self-ensembling across all datasets.
  • However, combining eight independent predictions multiplies the processing time accordingly; it can take tens of minutes to process a single second of high-resolution footage, which is beyond the threshold of being practical for many applications.
Qualitative Comparison of SepConv++ Over SepConv
  • The above figure demonstrates the efficacy of the proposed SepConv++ over SepConv.

SepConv++ is proposed by fixing the problems in SepConv and applying a bunch of small, low-level improvements.
