Review — Deep Multi-Scale Video Prediction Beyond Mean Square Error

Video Frame Extrapolation, Generate Future Frames

  • A CNN is proposed to generate future frames given an input sequence.
  • To deal with the inherently blurry predictions obtained from the standard Mean Squared Error (MSE) loss function, three feature learning strategies are proposed: a multi-scale architecture, an adversarial training method, and an image gradient difference loss function.
  • This is a paper from Prof. Yann LeCun's research group.


  1. Proposed Network Architecture
  2. Multi-Scale Architecture
  3. Adversarial Training
  4. Image Gradient Difference Loss (GDL)
  5. Experimental Results

1. Proposed Network Architecture

Proposed Network Architecture
  • Let Y={Y1, …, Yn} be a sequence of frames to predict from input frames X={X1, …, Xm}.
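  • A LeNet-like network architecture G is trained to predict one or several concatenated frames Y from the concatenated frames X by minimizing a distance between the predicted frame and the true frame:
      L_p(X, Y) = \ell_p(G(X), Y) = \| G(X) - Y \|_p^p
  • where G(X) is the predicted frame and lp can be the l1 or the l2 norm (p = 1 or p = 2).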
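The distance above can be sketched in a few lines. This is a minimal illustration, assuming the loss is the sum of |G(X) − Y|^p over all pixels and that a frame is a plain nested list of floats; the function name `lp_loss` and the toy frames are made up for the example.

```python
def lp_loss(pred, target, p=2):
    """Sum of |pred - target|**p over all pixels (p = 1 or p = 2)."""
    if p not in (1, 2):
        raise ValueError("p is expected to be 1 or 2")
    total = 0.0
    for pred_row, target_row in zip(pred, target):
        for a, b in zip(pred_row, target_row):
            total += abs(a - b) ** p
    return total


# Toy 2x2 "frames" (hypothetical values, only for illustration).
pred = [[0.0, 0.5], [1.0, 0.25]]
target = [[0.0, 1.0], [0.5, 0.25]]
print(lp_loss(pred, target, p=1))  # 1.0
print(lp_loss(pred, target, p=2))  # 0.5
```

With p = 2 this is the squared-error criterion that tends to average over plausible futures, which is the source of the blur the paper sets out to reduce.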

2. Multi-Scale Architecture

Multi-Scale Architecture
  • A multi-scale version of the model is used.
  • Let s1, …, sN be the sizes of the inputs of the network. Typically, in the experiments, s1=4×4, s2=8×8, s3=16×16 and s4=32×32.
  • Let uk be the upscaling operator toward size sk, and let Xik, Yik denote the downscaled versions of Xi and Yi of size sk.
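In the paper, the prediction at scale sk is built recursively: the coarser prediction is upscaled with uk, and a per-scale network adds a correction computed from the downscaled input and that upscaled guess. The sketch below illustrates only this coarse-to-fine loop. It assumes a single square frame stored as a nested list, nearest-neighbour upscaling, average-pool downscaling, and an all-zero guess at the coarsest scale; `copy_input_net` is a hypothetical stand-in for a trained per-scale network.

```python
def upscale2x(frame):
    """Nearest-neighbour 2x upscaling (a stand-in for the operator u_k)."""
    out = []
    for row in frame:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def downscale(frame, size):
    """Average-pool a square frame down to size x size."""
    factor = len(frame) // size
    out = []
    for i in range(size):
        row = []
        for j in range(size):
            block = [frame[i * factor + di][j * factor + dj]
                     for di in range(factor) for dj in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def multiscale_predict(x, sizes, residual_nets):
    """Coarse to fine: each scale adds a residual to the upscaled coarser guess."""
    prediction = None
    for size, net in zip(sizes, residual_nets):
        x_k = downscale(x, size)
        if prediction is None:
            # Assumption: no coarser guess exists at the first scale, so start from zeros.
            coarse = [[0.0] * size for _ in range(size)]
        else:
            coarse = upscale2x(prediction)
        residual = net(x_k, coarse)
        prediction = [[c + r for c, r in zip(c_row, r_row)]
                      for c_row, r_row in zip(coarse, residual)]
    return prediction


def copy_input_net(x_k, coarse):
    """Hypothetical residual net: corrects the coarse guess toward the downscaled input."""
    return [[xv - cv for xv, cv in zip(x_row, c_row)]
            for x_row, c_row in zip(x_k, coarse)]


frame = [[float(i + j) for j in range(32)] for i in range(32)]
sizes = [4, 8, 16, 32]  # s1..s4 as listed in the review
pred = multiscale_predict(frame, sizes, [copy_input_net] * len(sizes))
print(len(pred), len(pred[0]))  # 32 32
```

Because each scale only has to predict a residual on top of the upscaled coarser output, the fine scales can concentrate on detail rather than on the overall layout.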

3. Adversarial Training

  • (Please feel free to read GAN for adversarial training.)
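  • The generator G is trained with a combined loss composed of the adversarial loss and the Lp loss:
      L^G(X, Y) = \lambda_{adv} L^G_{adv}(X, Y) + \lambda_{\ell_p} L_p(X, Y)
  • where LGadv is:
      L^G_{adv}(X, Y) = \sum_{k=1}^{N} L_{bce}(D_k(X_k, G_k(X)), 1)
  • Here Lbce is the binary cross-entropy loss, and Dk and Gk(X) are the discriminator and the generator prediction at scale sk.
  • The discriminator D is trained to classify the input (X, Y) into class 1 and the input (X, G(X)) into class 0.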
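The two objectives can be sketched as follows. This is a minimal illustration, assuming a binary cross-entropy loss summed over scales (the usual GAN formulation) and discriminator outputs already reduced to one probability per scale; the function names, the example probabilities, and the weights `lambda_adv` and `lambda_lp` are placeholders, not values taken from the paper.

```python
import math


def bce(prob, label):
    """Binary cross-entropy for one probability and a label in {0, 1}."""
    eps = 1e-12
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def discriminator_loss(d_real, d_fake):
    """D should output 1 on (X, Y) and 0 on (X, G(X)), summed over scales."""
    return sum(bce(r, 1) + bce(f, 0) for r, f in zip(d_real, d_fake))


def generator_adv_loss(d_fake):
    """G is rewarded when D outputs 1 on (X, G(X)), summed over scales."""
    return sum(bce(f, 1) for f in d_fake)


def generator_loss(d_fake, lp, lambda_adv, lambda_lp):
    """Combined generator loss: weighted adversarial term plus weighted Lp term."""
    return lambda_adv * generator_adv_loss(d_fake) + lambda_lp * lp


# Hypothetical discriminator outputs at two scales.
d_real = [0.9, 0.8]   # D(X_k, Y_k)
d_fake = [0.2, 0.3]   # D(X_k, G_k(X))
print(discriminator_loss(d_real, d_fake))
print(generator_loss(d_fake, lp=0.5, lambda_adv=0.1, lambda_lp=1.0))
```

The Lp term keeps the generated frame close to the ground truth, while the adversarial term pushes it toward frames the discriminator cannot tell apart from real ones.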

4. Image Gradient Difference Loss (GDL)

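  • GDL is proposed to sharpen the image prediction by penalizing, in the generative loss function, the differences between the image gradients of the prediction and those of the ground truth:
      L_{gdl}(X, Y) = L_{gdl}(\hat{Y}, Y) = \sum_{i,j} \big| |Y_{i,j} - Y_{i-1,j}| - |\hat{Y}_{i,j} - \hat{Y}_{i-1,j}| \big|^{\alpha} + \big| |Y_{i,j-1} - Y_{i,j}| - |\hat{Y}_{i,j-1} - \hat{Y}_{i,j}| \big|^{\alpha}
  • where \hat{Y} = G(X) is the prediction, Y_{i,j} is the pixel at row i and column j, and α is an integer greater than or equal to 1.
  • And the total loss is:
      L(X, Y) = \lambda_{adv} L^G_{adv}(X, Y) + \lambda_{\ell_p} L_p(X, Y) + \lambda_{gdl} L_{gdl}(X, Y)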
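A small sketch makes the effect of GDL concrete. It assumes single-channel frames stored as nested lists, finite differences between neighbouring pixels as the image gradient, and border pixels that simply have no neighbour term; `gdl_loss`, `total_loss`, and the lambda weights are illustrative names and values.

```python
def gdl_loss(pred, target, alpha=1):
    """Penalize mismatches between absolute neighbour differences of pred and target."""
    height, width = len(target), len(target[0])
    total = 0.0
    for i in range(height):
        for j in range(width):
            if i > 0:  # vertical neighbour
                g_true = abs(target[i][j] - target[i - 1][j])
                g_pred = abs(pred[i][j] - pred[i - 1][j])
                total += abs(g_true - g_pred) ** alpha
            if j > 0:  # horizontal neighbour
                g_true = abs(target[i][j - 1] - target[i][j])
                g_pred = abs(pred[i][j - 1] - pred[i][j])
                total += abs(g_true - g_pred) ** alpha
    return total


def total_loss(adv, lp, gdl, lambda_adv, lambda_lp, lambda_gdl):
    """Weighted sum of the adversarial, Lp and GDL terms."""
    return lambda_adv * adv + lambda_lp * lp + lambda_gdl * gdl


# A sharp vertical edge versus a uniformly grey (blurry) prediction.
sharp = [[0.0, 1.0], [0.0, 1.0]]
blurry = [[0.5, 0.5], [0.5, 0.5]]
print(gdl_loss(blurry, sharp))  # 2.0
print(gdl_loss(sharp, sharp))   # 0.0
```

The grey prediction has no gradients at all, so every edge in the target contributes to the penalty, which is how the term discourages blur.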

5. Experimental Results

  • Two configurations are evaluated:
  1. 4 input frames are used to predict one future frame. To generate frames further in the future, the model is applied recursively, using each newly generated frame as an input.
  2. 8 input frames are used to produce 8 frames simultaneously.
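The recursive use of the first configuration can be sketched as a sliding window over frames. This is a minimal illustration, assuming the model is any callable that maps a list of input frames to one predicted frame; `rollout` and the toy `linear_model` are hypothetical, and scalars stand in for frames to keep the example short.

```python
def rollout(model, frames, n_inputs=4, n_future=3):
    """Predict n_future frames by feeding each prediction back as the newest input."""
    window = list(frames[-n_inputs:])
    predictions = []
    for _ in range(n_future):
        next_frame = model(window)
        predictions.append(next_frame)
        window = window[1:] + [next_frame]  # drop the oldest frame, append the new one
    return predictions


def linear_model(window):
    """Toy stand-in for the trained network: linear extrapolation of the last two frames."""
    return 2 * window[-1] - window[-2]


print(rollout(linear_model, [1, 2, 3, 4], n_inputs=4, n_future=3))  # [5, 6, 7]
```

Since every later prediction depends on earlier generated frames, errors can accumulate over the rollout, which is the trade-off against predicting 8 frames in one pass.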
Comparison of the accuracy of the predictions on 10% of the UCF101 test images
Results on 3 video clips from Sports1m


[2016 ICLR] [Mathieu ICLR’16]
Deep Multi-Scale Video Prediction Beyond Mean Square Error

Video Frame Interpolation/Extrapolation

2016 [Mathieu ICLR’16] 2017 [AdaConv] [SepConv] 2020 [DSepConv] 2021 [SepConv++]
