Review — Temporal Modeling Approaches for Large-scale Youtube-8M Video Understanding (Video Classification)

3rd Place out of 650 Teams, Using Ensembles of 3 Temporal Modeling Approaches: Two-stream Sequence Model, Fast-Forward Sequence Model, Temporal Residual Neural Network

Sik-Ho Tsang
Jun 19, 2021

In this story, Temporal Modeling Approaches for Large-scale Youtube-8M Video Understanding, by Baidu IDL & Tsinghua University, is reviewed. In this paper:

  • 3 Temporal Modeling Approaches are investigated: Two-stream Sequence Model, Fast-Forward Sequence Model, and Temporal Residual Neural Network.
  • Finally, an ensemble of these 3 temporal modeling approaches ranked 3rd in the Google Cloud & YouTube-8M Video Understanding Challenge.

This is a paper in 2017 CVPRW with over 40 citations. (Sik-Ho Tsang @ Medium)


  1. Proposed Two-stream Sequence Model
  2. Proposed Fast-Forward Sequence Model
  3. Proposed Temporal Residual Neural Network
  4. Experimental Results

1. Proposed Two-stream Sequence Model

Two-stream LSTM
  • The original Two-Stream ConvNet trains CNNs with RGB and optical flow features separately.
  • Here, the two-stream model is built upon bidirectional LSTMs, one for RGB features and one for audio features.
  • Attention layers, of the kind well known in NLP, are inserted after the sequence models, and the attended feature vectors from the two modalities are then concatenated.
  • Finally, the concatenated feature vector is fed into two fully-connected layers and a sigmoid layer sequentially for multi-label classification.
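
The fusion step described above can be sketched in NumPy. This is only a minimal illustration, not the authors' implementation: the bidirectional LSTM outputs are replaced by random arrays, the attention scoring is a single learned vector per modality, the ReLU between the two fully-connected layers is an assumption, and all names and sizes (`two_stream_head`, `w_rgb`, `HIDDEN`, etc.) are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_weights(h, w):
    """Softmax weights over time for a (T, D) sequence; w is a (D,) scoring vector."""
    return softmax(h @ w)

def attention_pool(h, w):
    """Weighted sum of the T hidden states, giving one (D,) vector."""
    return attention_weights(h, w) @ h

def two_stream_head(h_rgb, h_audio, p):
    # Attend over each modality separately, then concatenate the two vectors.
    fused = np.concatenate([attention_pool(h_rgb, p["w_rgb"]),
                            attention_pool(h_audio, p["w_audio"])])
    hidden = np.maximum(0.0, p["W1"] @ fused + p["b1"])  # FC 1 (ReLU is an assumption)
    logits = p["W2"] @ hidden + p["b2"]                  # FC 2
    return 1.0 / (1.0 + np.exp(-logits))                 # independent sigmoid per label

rng = np.random.default_rng(0)
T, D_RGB, D_AUDIO, HIDDEN, NUM_LABELS = 5, 8, 4, 6, 3    # toy sizes, not the paper's
h_rgb = rng.normal(size=(T, D_RGB))      # stand-in for BiLSTM outputs on RGB
h_audio = rng.normal(size=(T, D_AUDIO))  # stand-in for BiLSTM outputs on audio
params = {
    "w_rgb": rng.normal(size=D_RGB),
    "w_audio": rng.normal(size=D_AUDIO),
    "W1": rng.normal(size=(HIDDEN, D_RGB + D_AUDIO)),
    "b1": np.zeros(HIDDEN),
    "W2": rng.normal(size=(NUM_LABELS, HIDDEN)),
    "b2": np.zeros(NUM_LABELS),
}
probs = two_stream_head(h_rgb, h_audio, params)  # one probability per label
```

A sigmoid rather than a softmax is used at the output because a video can carry several labels at once, so each label gets its own independent probability.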

2. Proposed Fast-Forward Sequence Model

Fast-Forward Sequence Models
  • Naively increasing the depth of an LSTM or GRU leads to overfitting and optimization difficulties.
  • A novel deep LSTM/GRU architecture is explored by adding fast-forward connections to the sequence models. These connections play an essential role in building a sequence model with 7 bidirectional LSTMs.
  • The RGB and audio features of each frame are first concatenated and then fed into the fast-forward sequence model.
  • The fast-forward connections are added between two feed-forward computation blocks of adjacent recurrent layers.
  • Each fast-forward connection takes the outputs of the previous fast-forward and recurrent layers as input, and a fully-connected layer is used to embed them.
  • The fast-forward connection provides a fast path for information to propagate. The idea is similar to the skip connection in ResNet.
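
The wiring of the fast-forward connections can be sketched as below. This is a rough illustration under several assumptions rather than the paper's model: a plain unidirectional tanh recurrence stands in for the bidirectional LSTM/GRU layers, the fast-forward path is kept linear, and every name and size (`fast_forward_stack`, `W_ff`, `DEPTH`, etc.) is hypothetical.

```python
import numpy as np

def simple_rnn(x, W_in, W_rec):
    """Plain tanh recurrence over a (T, F) sequence; a stand-in for an LSTM/GRU layer."""
    h = np.zeros(W_rec.shape[0])
    out = np.zeros((x.shape[0], W_rec.shape[0]))
    for t in range(x.shape[0]):
        h = np.tanh(W_in @ x[t] + W_rec @ h)
        out[t] = h
    return out

def fast_forward_stack(x, layers):
    # First feed-forward block embeds the raw per-frame input.
    f = x @ layers[0]["W_ff"].T
    h = simple_rnn(f, layers[0]["W_in"], layers[0]["W_rec"])
    for layer in layers[1:]:
        # Fast-forward connection: a fully-connected layer over the previous
        # fast-forward output and the previous recurrent output, concatenated.
        f = np.concatenate([f, h], axis=1) @ layer["W_ff"].T
        h = simple_rnn(f, layer["W_in"], layer["W_rec"])
    return f, h

def make_layer(rng, d_in, f_dim, h_dim):
    return {"W_ff": rng.normal(scale=0.1, size=(f_dim, d_in)),
            "W_in": rng.normal(scale=0.1, size=(h_dim, f_dim)),
            "W_rec": rng.normal(scale=0.1, size=(h_dim, h_dim))}

rng = np.random.default_rng(0)
T, D_IN, F, H, DEPTH = 6, 10, 8, 8, 7   # toy sizes; only the depth of 7 follows the text
layers = [make_layer(rng, D_IN, F, H)]
layers += [make_layer(rng, F + H, F, H) for _ in range(DEPTH - 1)]
x = rng.normal(size=(T, D_IN))          # stand-in for concatenated RGB + audio frames
f_top, h_top = fast_forward_stack(x, layers)
```

Because `f` is passed from layer to layer without going through the recurrence, information from the lower layers can reach the top of the stack along a short path, which is the ResNet-like effect the bullets describe.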

3. Proposed Temporal Residual Neural Network

Temporal Residual CNNs
  • Convolutional and recurrent neural networks are combined to take advantage of both models.
  • The temporal convolutional neural networks are utilized to transform the original frame-level features into a more discriminative feature sequence, and LSTMs are used for final classification.
  • RGB and audio features in each frame are concatenated, and the sequences are zero-padded to a fixed length.
  • The size of the resulting input data is 4000×1152×300, where 4000, 1152, and 300 indicate the mini-batch size, the number of channels, and the number of frames, respectively.
  • The batch data is propagated into a Temporal ResNet, which is a stack of 9 Temporal ResNet Blocks (TRB).
  • Each TRB consists of two temporal convolutional layers (followed by batch norm and activation), and a shortcut connection.
  • 1024 3×1 filters are used for all the temporal convolution layers.
  • The output of the temporal CNN is then fed into a bidirectional LSTM with attention.
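
The padding step and a single TRB can be sketched in NumPy as follows. This is a simplified illustration, not the authors' code: batch norm has no learnable scale or shift, the placement of the final activation after the shortcut follows the usual ResNet layout and is an assumption, and the channel count is kept equal throughout (how the 1152 input channels are mapped to the 1024 filters is not covered in this review). All names and sizes are hypothetical.

```python
import numpy as np

def pad_to_length(frames, length):
    """Zero-pad (or truncate) a (T, C) feature sequence to (length, C)."""
    out = np.zeros((length, frames.shape[1]))
    n = min(length, frames.shape[0])
    out[:n] = frames[:n]
    return out

def temporal_conv(x, W):
    """'Same'-padded convolution over time. x: (N, C_in, T), W: (C_out, C_in, K)."""
    k = W.shape[2]
    T = x.shape[2]
    xp = np.pad(x, ((0, 0), (0, 0), (k // 2, k // 2)))
    out = np.zeros((x.shape[0], W.shape[0], T))
    for i in range(k):
        out += np.einsum("oc,nct->not", W[:, :, i], xp[:, :, i:i + T])
    return out

def batch_norm(x, eps=1e-5):
    """Normalize each channel over the batch and time axes (no learnable parameters)."""
    mean = x.mean(axis=(0, 2), keepdims=True)
    var = x.var(axis=(0, 2), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def temporal_res_block(x, W1, W2):
    y = np.maximum(0.0, batch_norm(temporal_conv(x, W1)))  # conv, batch norm, ReLU
    y = batch_norm(temporal_conv(y, W2))                   # conv, batch norm
    return np.maximum(0.0, y + x)                          # shortcut, then ReLU

rng = np.random.default_rng(0)
C, MAX_LEN, NUM_BLOCKS = 6, 10, 9   # toy sizes; the text uses 1152 channels, 300 frames
videos = [rng.normal(size=(7, C)), rng.normal(size=(10, C))]   # variable-length inputs
batch = np.stack([pad_to_length(v, MAX_LEN).T for v in videos])  # (N, C, T)
blocks = [(rng.normal(scale=0.1, size=(C, C, 3)),
           rng.normal(scale=0.1, size=(C, C, 3))) for _ in range(NUM_BLOCKS)]
features = batch
for W1, W2 in blocks:
    features = temporal_res_block(features, W1, W2)
```

The output keeps the same (batch, channels, frames) layout as the input, so it can be handed to a sequence model such as the bidirectional LSTM with attention mentioned above.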

4. Experimental Results

4.1. Dataset

  • The YouTube-8M dataset contains around 7 million YouTube videos. Each video is annotated with one or multiple tags. In the competition, visual and audio features are pre-extracted and provided with the dataset for each second of the video.
  • Visual features are obtained by the Google Inception CNN pre-trained on ImageNet, followed by PCA compression into a 1024-dimensional vector.
  • Audio features are extracted from a pre-trained VGG.
  • In the official split, the dataset is divided into three parts: 70% for training, 20% for validation, and 10% for testing.
  • In practice, the authors keep only 60K videos from the official validation set for cross-validating the parameters. The other videos in the validation set are added to the training set.
  • Results are evaluated using the Global Average Precision (GAP) metric at top 20, as used in the YouTube-8M Kaggle competition.
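
For readers unfamiliar with GAP, a small pure-Python sketch is given below. It follows my understanding of the challenge definition: each video's top-k predictions are pooled into one list, sorted by confidence, and precision is accumulated at every correct prediction, then divided by the total number of ground-truth labels. The function name `gap_at_k` and the toy data are hypothetical.

```python
def gap_at_k(predictions, labels, k=20):
    """predictions: one {label: confidence} dict per video.
    labels: one set of ground-truth labels per video."""
    pooled = []
    total_positives = 0
    for scores, truth in zip(predictions, labels):
        total_positives += len(truth)
        # Keep only the k most confident predictions of this video.
        top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
        pooled.extend((conf, label in truth) for label, conf in top)
    if total_positives == 0:
        return 0.0
    # Rank all pooled predictions across videos by confidence.
    pooled.sort(key=lambda pair: pair[0], reverse=True)
    hits = 0
    gap = 0.0
    for rank, (_, is_hit) in enumerate(pooled, start=1):
        if is_hit:
            hits += 1
            gap += hits / rank   # precision at this rank
    return gap / total_positives  # each hit adds 1 / total_positives of recall

# Toy example with two videos.
predictions = [{"cat": 0.9, "dog": 0.6, "car": 0.2},
               {"music": 0.8, "cat": 0.3}]
labels = [{"cat", "car"}, {"music"}]
gap_top2 = gap_at_k(predictions, labels, k=2)    # "car" falls outside the top 2
gap_top20 = gap_at_k(predictions, labels, k=20)  # every prediction is kept
```

In the toy example, cutting to the top 2 drops the correct but low-confidence "car" prediction, so `gap_top2` is lower than `gap_top20`.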

4.2. Results

Comparison results on Youtube8M test set
  • Two-stream sequence models and fast-forward sequence models achieve significantly better results than previous video pooling approaches.
  • The fast-forward LSTM model with depth 7 improves on the shallow sequence model by around 0.5% in terms of GAP.
  • The final submission is an ensemble of 57 models with different numbers of hidden cells and depths, and ranks 3rd out of 650 teams in the challenge.


