Review — MViT: Multiscale Vision Transformers

MViT/MViTv1, Proposed Pooling Attention, Reduce Dimensions

PySlowFast is an open-source video understanding codebase from Meta AI (formerly FAIR) that includes MViT.
  • Multiscale Vision Transformer (MViT) is proposed for video classification; it builds a multiscale pyramid of features.
  • Early layers operate at high spatial resolution to model simple, low-level visual information, while deeper layers operate at coarse spatial resolution to model complex, high-dimensional features.
  • When the number of input frames is reduced to 1, the model can also be used for image classification.


  1. Multi Head Pooling Attention (MHPA)
  2. Multiscale Vision Transformer (MViT)
  3. Video Classification Results
  4. Image Classification Results

1. Multi Head Pooling Attention (MHPA)

Pooling Attention
  • MHPA pools the sequence of latent tensors to reduce the sequence length (resolution) of the attended input.
  • Following ViT, MHPA projects the input X into an intermediate query tensor ^Q, key tensor ^K and value tensor ^V: ^Q = X W_Q, ^K = X W_K, ^V = X W_V.
  • Then, ^Q, ^K, ^V are pooled with the pooling operator P(·; Θ), which is the cornerstone of MHPA: Q = P(^Q; Θ_Q), K = P(^K; Θ_K), V = P(^V; Θ_V).
  • Attention is now computed on these shortened tensors: Attention(Q, K, V) = Softmax(Q K^T / √d) V.
Multiscale Vision Transformers (MViTs) learn a hierarchy from dense (in space) and simple (in channels) to coarse and complex features.
  • (There are other implementation details; please refer to the paper for them.)
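
The pooling attention steps above can be sketched in a few lines of NumPy. This is a minimal, single-head illustration, not the PySlowFast implementation: the helper names (`pool_thw`, `pooling_attention`), the use of non-overlapping max pooling as the operator P, and the toy tensor sizes are all assumptions made for the example. The class token and any residual connections are left out.

```python
import numpy as np

def pool_thw(x, thw, stride):
    """Max-pool a (T*H*W, d) token sequence over its T x H x W grid.

    Assumption: kernel size equals stride (non-overlapping windows).
    """
    T, H, W = thw
    st, sh, sw = stride
    assert T % st == 0 and H % sh == 0 and W % sw == 0
    d = x.shape[-1]
    # Split each axis into (output index, position inside the window).
    grid = x.reshape(T // st, st, H // sh, sh, W // sw, sw, d)
    pooled = grid.max(axis=(1, 3, 5))
    new_thw = (T // st, H // sh, W // sw)
    return pooled.reshape(-1, d), new_thw

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def pooling_attention(x, thw, w_q, w_k, w_v, stride_q, stride_kv):
    # 1) Linear projections to the intermediate tensors ^Q, ^K, ^V.
    q_hat, k_hat, v_hat = x @ w_q, x @ w_k, x @ w_v
    # 2) Pool each tensor to shorten its sequence length.
    q, thw_q = pool_thw(q_hat, thw, stride_q)
    k, _ = pool_thw(k_hat, thw, stride_kv)
    v, _ = pool_thw(v_hat, thw, stride_kv)
    # 3) Scaled dot-product attention on the shortened tensors.
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d))
    return attn @ v, thw_q

# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
thw, d = (4, 8, 8), 16
x = rng.standard_normal((4 * 8 * 8, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

out, out_thw = pooling_attention(
    x, thw, w_q, w_k, w_v, stride_q=(1, 2, 2), stride_kv=(1, 4, 4)
)
print(x.shape, "->", out.shape, "grid:", out_thw)
```

The output length follows the pooled query, so pooling Q is what reduces the resolution passed to the next block, while pooling K and V only shrinks the attention matrix.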

2. Multiscale Vision Transformer (MViT)

Left: ViT-B, Right: MViT-B
  • A scale stage is defined as a set of N Transformer blocks that operate on the same scale.
  • ViT (Left): uses a single scale (scale2) throughout the network.
  • MViT (Right): uses multiple scale stages that progressively downsize the tensors used for attention, which makes it more memory-efficient and computationally efficient.
Comparing ViT-B to two instantiations of MViT with varying complexity
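
To make the effect of scale stages concrete, the sketch below tracks how the token count shrinks as resolution drops from stage to stage. The function name `stage_shapes`, the starting grid of 8×56×56 with 96 channels, and the rule "halve the spatial resolution and double the channels at every stage transition" are assumptions chosen for illustration; the exact per-stage configuration is in the paper's table.

```python
def stage_shapes(thw, channels, num_stages, spatial_stride=2, channel_mult=2):
    """List the token grid, channel width and token count of each scale stage.

    Assumption: every stage transition divides H and W by `spatial_stride`
    and multiplies the channel width by `channel_mult`.
    """
    t, h, w = thw
    shapes = []
    for stage in range(num_stages):
        tokens = t * h * w
        shapes.append({
            "stage": f"scale{stage + 2}",
            "thw": (t, h, w),
            "channels": channels,
            "tokens": tokens,
            # Unpooled self-attention compares every token with every other token.
            "attention_pairs": tokens ** 2,
        })
        h, w = h // spatial_stride, w // spatial_stride
        channels *= channel_mult
    return shapes

shapes = stage_shapes(thw=(8, 56, 56), channels=96, num_stages=4)
for s in shapes:
    print(s["stage"], s["thw"], s["channels"], s["tokens"])
```

Because attention cost grows with the square of the token count, each 2×2 spatial reduction cuts the number of query-key pairs by a factor of 16 in this sketch, which is where most of the savings over a single-scale design come from.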

3. Video Classification Results

3.1. Kinetics-400

Accuracy/complexity trade-off on Kinetics-400 for varying # of inference clips per video shown in MViT curves.
  • T×τ: a clip of T frames sampled from the full-length video with a temporal stride of τ.
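
As a small illustration of the T×τ notation, the sketch below computes which frame indices a clip covers. The function name `sample_clip_indices`, the `start` parameter and the 300-frame video length are hypothetical and only serve the example.

```python
def sample_clip_indices(num_video_frames, T, tau, start=0):
    """Return the frame indices of a T x tau clip beginning at `start`."""
    span = (T - 1) * tau + 1  # number of raw frames the clip stretches over
    if start < 0 or start + span > num_video_frames:
        raise ValueError("clip does not fit inside the video")
    return [start + i * tau for i in range(T)]

# A 16x4 clip: 16 frames, taking every 4th frame of the video.
indices = sample_clip_indices(num_video_frames=300, T=16, tau=4, start=10)
print(indices)
```

A 16×4 clip therefore feeds 16 frames to the model but spans 61 raw frames of the video.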

3.2. Kinetics-600

Comparison with previous work on Kinetics-600.

3.3. Something-Something-v2 (SSv2)

Comparison with previous work on SSv2.

3.4. Charades

Comparison with previous work on Charades.

3.5. AVA v2.2

Comparison with previous work on AVA v2.2.
  • (There are other ablation experiments; please refer to the paper for them.)

4. Image Classification Results

Comparison to prior work on ImageNet.
  • With a single frame as input, MViT becomes an image classification model.

