Review — Revisiting the Effectiveness of Off-the-shelf Temporal Modeling Approaches for Large-scale Video Classification (Video Classification)

1st Place in ActivityNet Kinetics Challenge, Using 4 Temporal Modeling Approaches for Video Classification/Action Recognition

Sik-Ho Tsang
5 min read · Jun 26, 2021

In this story, Revisiting the Effectiveness of Off-the-shelf Temporal Modeling Approaches for Large-scale Video Classification, by Baidu IDL & Tsinghua University, is reviewed. In this paper:

  • Four temporal modeling approaches are investigated: Multi-group Shifting Attention Network, Temporal Xception Network, Multi-stream sequence Model, and Fast-Forward Sequence Model.
  • Ensemble of the above approaches brings significant accuracy improvement.

This is a 2017 arXiv paper with over 40 citations. (Sik-Ho Tsang @ Medium)


  1. Visual Feature and Acoustic Feature
  2. Shifting Attention Network
  3. Temporal Xception Network
  4. Experimental Results

1. Visual Feature and Acoustic Feature

  • Videos are naturally multimodal because a video can be decomposed into visual and acoustic components.

1.1. Visual Feature

  • The visual component can be further divided into spatial and temporal parts: RGB images are used for spatial feature extraction, and stacked optical flow fields for temporal feature extraction.
  • Inception-ResNet-v2 is found to be good for both spatial and temporal components.
  • The RGB model is initialized with a model pre-trained on ImageNet and fine-tuned on the Kinetics dataset.
  • The flow model is initialized from the fine-tuned RGB model.
  • Following TSN, three segments are sampled from each trimmed video for video-level training.
  • During testing, features are densely extracted for every frame in the video.
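The segment sampling used for training can be sketched as follows. This is a minimal illustration of TSN-style sparse sampling, not the authors' code: the function name, the choice of a random frame per segment during training, and the centre frame otherwise are all assumptions made here.

```python
import random


def sample_segment_indices(num_frames, num_segments=3, training=True, rng=None):
    """Pick one frame index from each of `num_segments` equal-length segments."""
    if num_frames < num_segments:
        raise ValueError("video has fewer frames than segments")
    rng = rng or random.Random()
    indices = []
    for k in range(num_segments):
        # Integer segment boundaries: [start, end) covers the k-th slice of the video.
        start = k * num_frames // num_segments
        end = (k + 1) * num_frames // num_segments
        if training:
            # Assumed: a random frame inside the segment during training.
            indices.append(rng.randrange(start, end))
        else:
            # Assumed: the centre frame of the segment otherwise.
            indices.append((start + end - 1) // 2)
    return indices
```

For a 90-frame clip, `sample_segment_indices(90, training=False)` picks the centre frames of the three 30-frame segments.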

1.2. Acoustic Feature

  • The audio is divided into 960ms frames, and the frames are processed with Fourier transformation, histogram integration and logarithm transformation.
  • The resulting frame can be seen as a 96×64 image that forms the input of a VGG16.
  • Similar to the visual feature, the acoustic feature is trained in the TSN framework.
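The arithmetic behind the 96×64 input can be sketched as below. Only the 960 ms frame length and the 96×64 shape come from the text; the 16 kHz sample rate, the 10 ms hop between spectrogram columns and the 64 frequency bands are assumptions chosen because they reproduce that shape.

```python
def split_into_frames(num_samples, sample_rate=16000, frame_ms=960):
    """Return (start, end) sample ranges of non-overlapping frames.

    A trailing partial frame is dropped. The 16 kHz default is an assumption.
    """
    frame_len = sample_rate * frame_ms // 1000
    return [(s, s + frame_len)
            for s in range(0, num_samples - frame_len + 1, frame_len)]


def patch_shape(frame_ms=960, hop_ms=10, num_bands=64):
    """Shape of the log-spectrogram patch for one frame: (time steps, bands).

    hop_ms and num_bands are assumed values, not taken from the paper.
    """
    return (frame_ms // hop_ms, num_bands)
```

Under these assumptions, three seconds of audio yield three full frames, and each frame becomes one 96×64 patch.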

2. Shifting Attention Network

2.1. Shifting Attention

  • An attention function can be considered as mapping a set of input features to a single output, where the input and output are both matrices that concatenate feature vectors.
  • (The attention function discussed here is the one popularly used in NLP.)
  • The output of the shifting attention SATT(X) is calculated through a shifting operation based on a weighted sum of the features: SATT(X) = (a · λX + b) / ‖a · λX + b‖₂
  • where λ is a weight vector calculated as: λ = softmax(α · wXᵀ)
  • w is a learnable vector, a and b are learnable scalars, and α is a hyper-parameter that controls the sharpness of the distribution.
  • The shifting operation actually shifts the weighted sum and at the same time ensures scale-invariance.
  • This lays the foundation for Multi-SATT.
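A minimal pure-Python sketch of one shifting attention may help make the description concrete. The form used in the code, L2-normalising a · λX + b with λ = softmax(α · wXᵀ), should be read as an assumption consistent with the description above rather than a quoted definition; the function names and default values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def shifting_attention(X, w, a=1.0, b=0.0, alpha=1.0):
    """X: list of T feature vectors of length D. Returns one length-D vector."""
    # Attention weights: one scalar per time step, sharpened by alpha.
    lam = softmax([alpha * sum(wi * xi for wi, xi in zip(w, x)) for x in X])
    dim = len(X[0])
    # Weighted sum of the features.
    pooled = [sum(l * x[d] for l, x in zip(lam, X)) for d in range(dim)]
    # Shift with the scalars a and b, then L2-normalise (assumed form).
    shifted = [a * p + b for p in pooled]
    norm = math.sqrt(sum(v * v for v in shifted)) or 1.0
    return [v / norm for v in shifted]
```

Because of the final normalisation, the output always has unit length, and with b = 0 rescaling a leaves it unchanged.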

2.2. Multi-Group Shifting Attention Network

Multi-group Shifting Attention Network
  • A variety of different features, such as appearance (RGB), motion (flow) and audio signals, are extracted. Yet, it is unrealistic to merge all multimodal feature sets within one attention model.
  • Multi-Group Shifting Attention Networks are proposed for training multiple groups of attentions simultaneously.
  1. First, multiple feature sets are extracted from the video.
  2. For each feature set Xi, Ni different shifting attentions are applied, which is called one attention group.
  3. Then, the outputs are concatenated.
  4. Next, the outputs of different attention groups are normalized separately and concatenated to form a global representation vector for the video.
  5. Finally, the representation vector is used for classification through a fully-connected layer.
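Steps 1 to 4 can be sketched as below; the final fully-connected classifier is left out. The attention form (softmax weights, shift by a and b, L2 normalisation) is an assumption, as are all function and parameter names, and real parameters would be learned rather than passed in by hand.

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def shifting_attention(X, w, a, b, alpha):
    """One attention over the T x D feature set X (assumed form)."""
    scores = [alpha * sum(wi * xi for wi, xi in zip(w, x)) for x in X]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    pooled = [sum(e / total * x[d] for e, x in zip(exps, X))
              for d in range(len(X[0]))]
    return l2_normalize([a * p + b for p in pooled])


def multi_group_representation(feature_sets, group_params, alpha=1.0):
    """feature_sets[i] is X_i; group_params[i] lists one (w, a, b) per attention."""
    groups = []
    for X, params in zip(feature_sets, group_params):
        group = []
        for w, a, b in params:
            # N_i attentions on the same feature set, outputs concatenated.
            group.extend(shifting_attention(X, w, a, b, alpha))
        # Each attention group is normalised separately.
        groups.append(l2_normalize(group))
    # Concatenate all groups into the global video representation.
    return [v for g in groups for v in g]
```

With two attentions on a 2-dimensional RGB feature set and one on a 3-dimensional audio feature set, the global vector has 2·2 + 3 = 7 entries, and each group contributes a unit-length block.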

3. Temporal Xception Network

Temporal Xception Network
  • Recently, convolutional sequence-to-sequence networks have been successfully applied to machine translation tasks.
  • The depthwise separable convolution families are applied to the temporal dimension. (I believe it is the one used in Xception.)
  1. Multimodal features are zero-padded to a fixed length for each stream.
  2. Adaptive temporal max pooling is applied to obtain n segments for each video.
  3. The video segment features are fed into a Temporal Convolutional block, which consists of a stack of two separable convolutional layers followed by batch norm and activation, with a shortcut connection.
  4. Finally, the outputs of three stream features are concatenated and fed into the fully-connected layer for classification.
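The building blocks of steps 1 to 3 can be sketched in plain Python as follows. This is an illustration under assumptions, not the paper's implementation: the pooling bin boundaries (floor for the start, ceiling for the end), the "same" zero padding in the convolution, and every function name are choices made here, and batch norm, activation and the shortcut are omitted.

```python
def zero_pad(features, length):
    """Pad (or truncate) a T x C feature list to exactly `length` time steps."""
    dim = len(features[0])
    return features[:length] + [[0.0] * dim for _ in range(length - len(features))]


def adaptive_temporal_max_pool(features, num_segments):
    """Max-pool T time steps down to `num_segments`, channel by channel."""
    T, dim = len(features), len(features[0])
    pooled = []
    for k in range(num_segments):
        start = (k * T) // num_segments
        end = -(-((k + 1) * T) // num_segments)  # ceiling division
        window = features[start:end]
        pooled.append([max(f[d] for f in window) for d in range(dim)])
    return pooled


def separable_temporal_conv(x, depthwise, pointwise):
    """Depthwise separable 1-D convolution over time.

    x: T x C; depthwise: C kernels of odd length K; pointwise: C_out x C.
    """
    T, C = len(x), len(x[0])
    K = len(depthwise[0])
    pad = K // 2
    # Depthwise step: each channel is filtered along time independently.
    dw = [[sum(depthwise[c][j] * x[t + j - pad][c]
               for j in range(K) if 0 <= t + j - pad < T)
           for c in range(C)] for t in range(T)]
    # Pointwise step: a 1x1 convolution mixes channels at every time step.
    return [[sum(wc * v for wc, v in zip(row, step)) for row in pointwise]
            for step in dw]
```

Splitting the convolution into a per-channel temporal filter and a channel-mixing step is what makes it "separable", and it needs far fewer weights than a full temporal convolution with the same kernel length.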

4. Experimental Results

4.1. Dataset

  • The challenging Kinetics dataset contains 246,535 training videos, 19,907 validation videos and 38,685 testing videos. Each video is in one of 400 categories.

4.2. Results

Kinetics validation results.
  • The Multi-stream Sequence Model and the Fast-Forward Sequence Model (Fast-forward LSTM) are the ones proposed in Temporal Modeling Approaches for Large-scale Youtube-8M Video Understanding.
  • The authors draw 3 key observations:
  1. Temporal modeling approaches with multimodal features are more effective than naively combining the classification scores of different modality networks (late fusion) for video classification.
  2. The proposed Shifting Attention Network and Temporal Xception Network can achieve comparable or even better results than the traditional sequence models (e.g. LSTM), which indicates they might serve as alternative temporal modeling approaches in the future.
  3. Different temporal modeling approaches are complementary to each other: after ensembling, top-1 accuracy improves noticeably, reaching 81.5%.
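Score-level combination, whether late fusion of modality networks or an ensemble of temporal models, can be sketched as a weighted average of per-class scores. How the authors actually combine their models is not stated here, so the plain averaging below, the function names and the toy scores are all assumptions.

```python
def average_scores(score_lists, weights=None):
    """Weighted average of per-class score lists, one list per model."""
    n = len(score_lists)
    weights = weights or [1.0 / n] * n  # uniform weights by default (assumed)
    num_classes = len(score_lists[0])
    return [sum(w * s[c] for w, s in zip(weights, score_lists))
            for c in range(num_classes)]


def top1(scores):
    """Index of the highest-scoring class."""
    return max(range(len(scores)), key=lambda c: scores[c])


# Toy scores for three hypothetical models over three classes.
model_scores = [
    [0.6, 0.4, 0.0],
    [0.1, 0.5, 0.4],
    [0.2, 0.5, 0.3],
]
fused = average_scores(model_scores)
```

In this toy case the first model alone picks class 0, while the averaged scores pick class 1, which illustrates how complementary models can correct each other.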

Finally, the approach in this paper ranked 1st in the ActivityNet Kinetics challenge.


