# Review — Deep Multi-Scale Video Prediction Beyond Mean Square Error

## Video Frame Extrapolation, Generate Future Frames


**Deep Multi-Scale Video Prediction Beyond Mean Square Error**, Mathieu ICLR’16, by New York University and Facebook Artificial Intelligence Research. 2016 ICLR, Over 1700 Citations (Sik-Ho Tsang @ Medium)

Video Frame Extrapolation

- A CNN is proposed to **generate future frames** given an input sequence.
- To deal with the inherently blurry predictions obtained from the standard Mean Squared Error (MSE) loss function, **three feature learning strategies** are proposed: **a multi-scale architecture**, **an adversarial training method**, and **an image gradient difference loss function**.
- This is a paper from Prof. Yann LeCun's research group.

# Outline

1. **Proposed Network Architecture**
2. **Multi-Scale Architecture**
3. **Adversarial Training**
4. **Image Gradient Difference Loss (GDL)**
5. **Experimental Results**

**1. Proposed Network Architecture**

- Let *Y* = {*Y*1, …, *Yn*} be a sequence of frames to predict from input frames *X* = {*X*1, …, *Xm*}.
- A LeNet-like network architecture *G* is trained to predict one or several concatenated frames *Y* from the concatenated frames *X* by minimizing a distance between the predicted frames and the true frames.
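The distance itself is not reproduced in the text here. As I read it in the paper, it is the *lp* distance between prediction and ground truth, with *p* = 1 or *p* = 2 (written in LaTeX below):

```latex
\mathcal{L}_p(X, Y) = \ell_p\big(G(X), Y\big) = \lVert G(X) - Y \rVert_p^p
```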

- Here, *G*(*X*) is the predicted frame, and *lp* can be the *l*1 or *l*2 norm.

However, convolutions only account for **short-range dependencies**. Also, using an *l*2 loss, and to a lesser extent *l*1, produces **blurry predictions**.

**2. Multi-Scale Architecture**

- A **multi-scale** version of the model is used.
- Let *s*1, …, *sN* be the sizes of the network inputs, where *N* is the number of scales. Typically, in the experiments, *s*1 = 4×4, *s*2 = 8×8, *s*3 = 16×16 and *s*4 = 32×32.
- Let *uk* be the upscaling operator toward size *sk*, and let *Xik*, *Yik* denote the downscaled versions of *Xi* and *Yi* of size *sk*.
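For reference, the paper defines the multi-scale generator recursively. The symbols below are taken from the paper rather than from the text above: *G'k* is a network that learns to predict the residual *Yk* − *uk*(*Yk*−1) from *Xk* and a coarse guess of *Yk*, and *Ŷk* is the prediction at size *sk*. To the best of my reading, the recursion is:

```latex
\hat{Y}_k = G_k(X) = u_k\big(\hat{Y}_{k-1}\big) + G'_k\big(X_k,\; u_k(\hat{Y}_{k-1})\big)
```

At the coarsest scale there is no earlier prediction to upsample, so that term is presumably absent there.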

This addresses the problem of short-range dependencies.
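A minimal NumPy sketch of the coarse-to-fine idea, under several assumptions of mine: a single 32×32 grayscale frame stands in for the concatenated input frames, average pooling and nearest-neighbour repetition stand in for the down/upscaling operators, and `residual_net` is a hypothetical placeholder for the per-scale CNN (it simply corrects the guess toward the downscaled input, i.e. a "copy the input" predictor). The point is only the control flow: each scale upsamples the previous prediction and adds a residual to it.

```python
import numpy as np

SCALES = [4, 8, 16, 32]  # side lengths of s1..s4 from the text


def downscale(img, size):
    """Average-pool a square (H, W) image down to (size, size)."""
    f = img.shape[0] // size
    return img.reshape(size, f, size, f).mean(axis=(1, 3))


def upscale(img, size):
    """Nearest-neighbour upscaling to (size, size); stands in for u_k."""
    f = size // img.shape[0]
    return np.repeat(np.repeat(img, f, axis=0), f, axis=1)


def residual_net(x_k, coarse_guess):
    """Hypothetical placeholder for the per-scale network.

    A real model would be a small trained CNN. This stand-in returns the
    residual that turns the coarse guess into the downscaled input.
    """
    return x_k - coarse_guess


def multi_scale_predict(x_full):
    """Coarse-to-fine prediction: upsample the previous guess, add a residual."""
    y_hat = None
    for s in SCALES:
        x_k = downscale(x_full, s)
        # Coarsest scale: no earlier prediction, so start from zeros (assumption).
        coarse = np.zeros((s, s)) if y_hat is None else upscale(y_hat, s)
        y_hat = coarse + residual_net(x_k, coarse)
    return y_hat
```

Because the 4×4 scale sees the whole frame at once, even small kernels at that scale cover the full image, which is how the pyramid widens the effective context.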

**3. Adversarial Training**

- (Please feel free to read GAN for adversarial training.)
- **The generator** *G* is trained with **a combined loss composed of the adversarial loss and the *Lp* loss**, where *LGadv* denotes the generator's adversarial loss.
- The discriminator *D* is trained to classify the input (*X*, *Y*) into class 1 and the input (*X*, *G*(*X*)) into class 0.
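The loss formulas for this section are not shown in the text. As I recall them from the paper, with *λadv* and *λlp* as weighting coefficients, *Lbce* the binary cross-entropy, and the sums running over the scales of the multi-scale model, they read roughly as follows (the discriminator loss *LDadv* is included for completeness):

```latex
\begin{aligned}
\mathcal{L}(X, Y) &= \lambda_{adv}\,\mathcal{L}^{G}_{adv}(X, Y) + \lambda_{\ell_p}\,\mathcal{L}_p(X, Y) \\
\mathcal{L}^{G}_{adv}(X, Y) &= \sum_{k=1}^{N_{scales}} L_{bce}\big(D_k(X_k, G_k(X)),\, 1\big) \\
\mathcal{L}^{D}_{adv}(X, Y) &= \sum_{k=1}^{N_{scales}} L_{bce}\big(D_k(X_k, Y_k),\, 1\big) + L_{bce}\big(D_k(X_k, G_k(X)),\, 0\big)
\end{aligned}
```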

This addresses the problem of blurry predictions.
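To make the class-1 / class-0 setup concrete, here is a small sketch using binary cross-entropy on scalar discriminator outputs. It is a simplification under my own assumptions: a single scale, one probability per sample, and illustrative default weights (`lambda_adv`, `lambda_lp`) that are placeholders rather than values taken from the text.

```python
import math


def bce(prob, target):
    """Binary cross-entropy for one predicted probability and a 0/1 target."""
    eps = 1e-12  # keeps log() finite when prob is exactly 0 or 1
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1 - target) * math.log(1.0 - prob))


def discriminator_loss(d_real, d_fake):
    """D should output 1 for the pair (X, Y) and 0 for the pair (X, G(X))."""
    return bce(d_real, 1) + bce(d_fake, 0)


def generator_adv_loss(d_fake):
    """G is rewarded when D classifies (X, G(X)) as class 1."""
    return bce(d_fake, 1)


def generator_loss(d_fake, lp_loss, lambda_adv=0.05, lambda_lp=1.0):
    """Combined generator loss: weighted adversarial term plus Lp term.

    The default weights are illustrative assumptions.
    """
    return lambda_adv * generator_adv_loss(d_fake) + lambda_lp * lp_loss
```

The adversarial term falls as the discriminator becomes more convinced that the generated frame is real, while the *Lp* term keeps the prediction anchored to the ground truth.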

**4. Image Gradient Difference Loss (GDL)**

- GDL is proposed to **sharpen the image prediction by penalizing the differences of image gradient predictions** in the generative loss function.
- The **total loss** is then the weighted sum of the adversarial loss, the *Lp* loss, and the GDL.
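The two formulas for this section are not shown in the text. As I recall them from the paper, with *Ŷ* = *G*(*X*), an exponent *α* ≥ 1, and *λgdl* as the weight of the new term, they are approximately:

```latex
\mathcal{L}_{gdl}(X, Y) = \mathcal{L}_{gdl}(\hat{Y}, Y) = \sum_{i,j}
  \Big|\, |Y_{i,j} - Y_{i-1,j}| - |\hat{Y}_{i,j} - \hat{Y}_{i-1,j}| \,\Big|^{\alpha}
  + \Big|\, |Y_{i,j-1} - Y_{i,j}| - |\hat{Y}_{i,j-1} - \hat{Y}_{i,j}| \,\Big|^{\alpha}

\mathcal{L}(X, Y) = \lambda_{adv}\,\mathcal{L}^{G}_{adv}(X, Y)
  + \lambda_{\ell_p}\,\mathcal{L}_p(X, Y)
  + \lambda_{gdl}\,\mathcal{L}_{gdl}(X, Y)
```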

This also addresses the problem of blurry predictions.
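A short NumPy sketch of a gradient difference loss on single-channel frames, assuming 2-D arrays of shape (H, W) and neighbour differences as the image gradient. The function name and the `alpha` default are my own choices for illustration.

```python
import numpy as np


def gradient_difference_loss(y_true, y_pred, alpha=1):
    """GDL between two (H, W) frames.

    Compares the absolute differences between neighbouring pixels of the
    target with those of the prediction, vertically and horizontally.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Absolute vertical gradients, e.g. |Y[i, j] - Y[i-1, j]|
    dy_true = np.abs(np.diff(y_true, axis=0))
    dy_pred = np.abs(np.diff(y_pred, axis=0))
    # Absolute horizontal gradients, e.g. |Y[i, j] - Y[i, j-1]|
    dx_true = np.abs(np.diff(y_true, axis=1))
    dx_pred = np.abs(np.diff(y_pred, axis=1))
    return (np.sum(np.abs(dy_true - dy_pred) ** alpha)
            + np.sum(np.abs(dx_true - dx_pred) ** alpha))


# A sharp vertical edge versus a blurred version of the same edge.
sharp = np.tile([0.0, 0.0, 1.0, 1.0], (4, 1))
blurred = np.tile([0.0, 0.25, 0.75, 1.0], (4, 1))
print(gradient_difference_loss(sharp, sharp))    # 0.0
print(gradient_difference_loss(sharp, blurred))  # 4.0
```

The blurred edge is penalized because its gradient is spread over three pixel pairs instead of concentrated in one. A uniform brightness shift, by contrast, leaves the gradients unchanged, which is one reason the GDL is combined with an *Lp* term rather than used alone.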

# 5. Experimental Results

**Two configurations** are evaluated:

- **4 input frames to predict one future frame.** To generate further in the future, the model is applied recursively by using the newly generated frame as an input.
- **8 input frames are used to produce 8 frames** simultaneously.
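The recursive use of the first configuration can be sketched as a sliding window over frames. This is only an illustration: `predict_next` is a hypothetical stand-in for the trained model, and dropping the oldest frame so the window length stays fixed is my assumption about how the new frame is fed back.

```python
from collections import deque


def rollout(predict_next, input_frames, n_future):
    """Predict n_future frames by feeding each prediction back as input."""
    # A bounded deque keeps the window at the original number of input frames.
    window = deque(input_frames, maxlen=len(input_frames))
    predictions = []
    for _ in range(n_future):
        frame = predict_next(list(window))
        predictions.append(frame)
        window.append(frame)  # the oldest frame is dropped automatically
    return predictions


# Toy "model" on scalar frames: linear extrapolation from the last two frames.
def extrapolate(frames):
    return 2 * frames[-1] - frames[-2]


print(rollout(extrapolate, [1, 2, 3, 4], 3))  # [5, 6, 7]
```

Since every later frame is predicted from earlier predictions, errors can accumulate as the rollout gets longer.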

The **GDL** and **adversarial predictions** lead to further gains, and finally **the combination of the multi-scale, *l*1 norm, GDL and adversarial training** achieves the **best** PSNR, SSIM and Sharpness difference measure.
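For readers unfamiliar with the first of these metrics, PSNR can be computed as below. This is the standard definition, assuming frames scaled to [0, 1] (hence `max_value=1.0`). It does not reproduce any region masking or other evaluation details the paper may apply.

```python
import numpy as np


def psnr(y_true, y_pred, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two frames of equal shape."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = np.mean((y_true - y_pred) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(max_value ** 2 / mse)
```

A higher PSNR means a lower mean squared error, so it tends to favour smooth predictions. That is presumably why it is reported alongside SSIM and a sharpness measure.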

As seen, the proposed **Adversarial+GDL** obtains clearer results.

- More results in https://cs.nyu.edu/~mathieu/iclr2016.html

## References

[2016 ICLR] [Mathieu ICLR’16]

Deep Multi-Scale Video Prediction Beyond Mean Square Error

https://cs.nyu.edu/~mathieu/iclr2016.html

## Video Frame Interpolation/Extrapolation

**2016** [Mathieu ICLR’16] **2017** [AdaConv] [SepConv] **2020** [DSepConv] **2021** [SepConv++]