Review — Temporal Modeling Approaches for Large-scale Youtube-8M Video Understanding (Video Classification)

3rd Place out of 650 Teams, Using Ensembles of 3 Temporal Modeling Approaches: Two-stream Sequence Model, Fast-Forward Sequence Model, Temporal Residual Neural Network

Jun 19, 2021

In this story, Temporal Modeling Approaches for Large-scale Youtube-8M Video Understanding, by Baidu IDL & Tsinghua University, is reviewed. In this paper:

3 Temporal Modeling Approaches are investigated: Two-stream Sequence Model, Fast-Forward Sequence Model, and Temporal Residual Neural Network.
Finally, an ensemble of these 3 temporal modeling approaches ranked 3rd in the Google Cloud & YouTube-8M Video Understanding Challenge.

This is a paper in 2017 CVPRW with over 40 citations. (Sik-Ho Tsang @ Medium)

Outline

1. Proposed Two-stream Sequence Model
2. Proposed Fast-Forward Sequence Model
3. Proposed Temporal Residual Neural Network
4. Experimental Results

1. Proposed Two-stream Sequence Model

The original Two-Stream ConvNet trains CNNs with RGB and optical flow features separately.
Here, the two-stream model is built upon bidirectional LSTMs, one for RGB and one for audio.
Attention layers, well known from NLP, are inserted after the sequence models, and the attended feature vectors from the two modalities are then concatenated.
Finally, the concatenated feature vector is fed into two fully-connected layers and a sigmoid layer sequentially for multi-label classification.
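The fusion step can be illustrated with a small, dependency-free sketch. It assumes the bidirectional LSTM outputs are already available as lists of per-frame vectors, and it uses a simple dot-product attention with a learned `query` vector. The helper names, the toy weights, and the ReLU between the two fully-connected layers are assumptions for illustration, not details from the paper.

```python
import math

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(states, query):
    """Collapse T per-frame vectors into one vector via attention weights."""
    scores = [sum(h * q for h, q in zip(state, query)) for state in states]
    weights = softmax(scores)
    return [sum(w * state[d] for w, state in zip(weights, states))
            for d in range(len(states[0]))]

def dense(x, weight, bias):
    """Fully-connected layer: one row of `weight` per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy stand-ins for the per-frame outputs of the RGB and audio sequence models.
rgb_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio_states = [[0.5], [0.1], [0.9]]

# Attend within each modality, then concatenate the two attended vectors.
rgb_vec = attention_pool(rgb_states, query=[0.3, -0.2])
audio_vec = attention_pool(audio_states, query=[1.0])
fused = rgb_vec + audio_vec

# Two fully-connected layers (ReLU in between is an assumption), then sigmoid.
hidden = [max(0.0, v) for v in
          dense(fused, [[0.5, -0.5, 1.0], [-1.0, 0.2, 0.3]], [0.0, 0.1])]
logits = dense(hidden, [[1.0, -1.0], [0.5, 0.5], [-0.3, 0.8]], [0.0, 0.0, 0.0])
label_probs = [sigmoid(z) for z in logits]  # one independent probability per label
```

Using a sigmoid per label rather than a softmax over labels is what lets a single video receive several tags at once.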

2. Proposed Fast-Forward Sequence Model

Naively increasing the depth of the LSTM and GRU leads to overfitting and optimization difficulties.
A novel deep LSTM/GRU architecture is explored by adding fast-forward connections to the sequence models. These connections play an essential role in building a sequence model with 7 bidirectional LSTMs.
The RGB and audio features of each frame are first concatenated and then fed into the fast-forward sequence model.
The fast-forward connections are added between two feed-forward computation blocks of adjacent recurrent layers.
Each fast-forward connection takes the outputs of the previous fast-forward connection and recurrent layer as input, and a fully-connected layer is used to embed them.
The fast-forward connection provides a fast path for information to propagate. The idea is similar to the skip connection in ResNet.
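The wiring described above can be sketched roughly as follows. This is a simplified illustration under several assumptions: a plain unidirectional tanh RNN stands in for the bidirectional LSTM/GRU layers, the first layer reads the frame features directly, and all names (`simple_rnn`, `make_layers`, `fast_forward_stack`) and the random weights are hypothetical.

```python
import math
import random

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def dense(x, weight, bias):
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def simple_rnn(inputs, w_in, w_rec):
    """Stand-in recurrent layer (plain tanh RNN, not an LSTM or GRU)."""
    size = len(w_rec)
    hidden = [0.0] * size
    outputs = []
    for x in inputs:
        from_input = dense(x, w_in, [0.0] * size)
        from_state = dense(hidden, w_rec, [0.0] * size)
        hidden = [math.tanh(a + b) for a, b in zip(from_input, from_state)]
        outputs.append(hidden)
    return outputs

def make_layers(in_dim, hidden, depth, rng):
    layers, ff_dim = [], in_dim
    for k in range(depth):
        layer = {}
        if k > 0:
            # Fully-connected embedding of [previous fast-forward ; previous recurrent].
            layer["ff_w"] = rand_matrix(hidden, ff_dim + hidden, rng)
            layer["ff_b"] = [0.0] * hidden
            ff_dim = hidden
        layer["w_in"] = rand_matrix(hidden, ff_dim, rng)
        layer["w_rec"] = rand_matrix(hidden, hidden, rng)
        layers.append(layer)
    return layers

def fast_forward_stack(frames, layers):
    ff, rec = frames, None  # the fast-forward path starts at the frame features
    for layer in layers:
        if "ff_w" in layer:
            # The fast path bypasses the recurrence: it is a per-frame linear map.
            ff = [dense(f + h, layer["ff_w"], layer["ff_b"]) for f, h in zip(ff, rec)]
        rec = simple_rnn(ff, layer["w_in"], layer["w_rec"])
    return rec

rng = random.Random(0)
frames = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(5)]  # 5 toy frames
layers = make_layers(in_dim=4, hidden=3, depth=7, rng=rng)
outputs = fast_forward_stack(frames, layers)  # one vector per frame from layer 7
```

Because the `ff` path never passes through the recurrent nonlinearity, information and gradients can travel from the input to the top layer through a chain of linear maps, which is the sense in which it resembles a skip connection.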

3. Proposed Temporal Residual Neural Network

Convolutional and recurrent neural networks are combined to take advantage of both models.
Temporal convolutional neural networks are utilized to transform the original frame-level features into a more discriminative feature sequence, and LSTMs are used for final classification.
RGB and audio features in each frame are concatenated, and zero-valued features are padded to obtain fixed-length data.
The size of the resulting input data is 4000×1152×300, where 4000, 1152, and 300 indicate the mini-batch size, channel number, and number of frames, respectively.
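As a quick check on these shapes, here is a minimal sketch of how one video could be turned into a 1152×300 array. The 128-dimensional audio size is inferred from 1152 − 1024 (the visual features are 1024-dimensional), and truncating videos longer than 300 frames is an assumption; the function name `pad_video` is hypothetical.

```python
RGB_DIM = 1024    # PCA-compressed visual feature size
AUDIO_DIM = 128   # assumed: inferred from 1152 - 1024
MAX_FRAMES = 300

def pad_video(rgb_frames, audio_frames, max_frames=MAX_FRAMES):
    """Concatenate per-frame RGB and audio features, zero-pad along time,
    and return a channels-first array indexed as result[channel][frame]."""
    frames = [rgb + audio for rgb, audio in zip(rgb_frames, audio_frames)]
    frames = frames[:max_frames]  # assumption: longer videos are truncated
    channels = len(frames[0]) if frames else RGB_DIM + AUDIO_DIM
    padding = [[0.0] * channels for _ in range(max_frames - len(frames))]
    padded = frames + padding                            # max_frames x channels
    return [list(channel) for channel in zip(*padded)]   # channels x max_frames

# A toy 3-frame video: real frames come first, zeros fill the remaining 297 slots.
video = pad_video([[1.0] * RGB_DIM] * 3, [[2.0] * AUDIO_DIM] * 3)
print(len(video), len(video[0]))  # 1152 300
```

Stacking 4000 such arrays gives the 4000×1152×300 mini-batch.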
The batch data is propagated into a Temporal ResNet, which is a stack of 9 Temporal ResNet Blocks (TRB).
Each TRB consists of two temporal convolutional layers (followed by batch norm and activation), and a shortcut connection.
1024 3×1 filters are used for all the temporal convolution layers.
The output of the temporal CNN is then fed into a bidirectional LSTM with attention.
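A single TRB can be sketched in plain Python as below. This is a toy version under stated assumptions: batch norm is omitted, the activation is taken to be ReLU, "same" zero padding keeps the sequence length fixed, and the input and output channel counts are equal so the shortcut can be a plain addition. Function names are hypothetical.

```python
def temporal_conv(x, kernels, bias):
    """3-tap convolution along time with 'same' zero padding.

    x:       channels-first input, x[channel][frame].
    kernels: kernels[out][inp] holds the 3 taps applied to frames t-1, t, t+1.
    """
    length = len(x[0])
    out = []
    for o, per_input in enumerate(kernels):
        row = []
        for t in range(length):
            acc = bias[o]
            for c, taps in enumerate(per_input):
                for k, w in enumerate(taps):
                    s = t + k - 1
                    if 0 <= s < length:  # frames outside the clip count as zero
                        acc += w * x[c][s]
            row.append(acc)
        out.append(row)
    return out

def relu(x):
    return [[max(0.0, v) for v in row] for row in x]

def temporal_res_block(x, conv1, conv2):
    """Two temporal convolutions plus a shortcut (batch norm omitted here)."""
    y = relu(temporal_conv(x, *conv1))
    y = relu(temporal_conv(y, *conv2))
    return [[a + b for a, b in zip(row_x, row_y)] for row_x, row_y in zip(x, y)]

# One channel, three frames: conv1 shifts the signal by one frame, conv2 copies it.
x = [[1.0, 2.0, 3.0]]
shift = ([[[1.0, 0.0, 0.0]]], [0.0])
copy = ([[[0.0, 1.0, 0.0]]], [0.0])
print(temporal_res_block(x, shift, copy))  # [[1.0, 3.0, 5.0]]
```

In the toy call the residual branch produces `[0.0, 1.0, 2.0]`, and the shortcut adds the unchanged input on top. A stack of 9 blocks would apply `temporal_res_block` repeatedly, each with its own pair of convolutions.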

4. Experimental Results

4.1. Dataset

The YouTube-8M dataset contains around 7 million YouTube videos. Each video is annotated with one or multiple tags. In the competition, visual and audio features are pre-extracted and provided with the dataset for each second of the video.
Visual features are obtained by the Google Inception CNN pre-trained on ImageNet, followed by PCA compression into a 1024-dimensional vector.
Audio features are extracted from a pre-trained VGG.
In the official split, the dataset is divided into three parts: 70% for training, 20% for validation, and 10% for testing.
In practice, the authors keep only 60K videos from the official validation set for cross-validating the parameters. The other videos in the validation set are added to the training set.
Results are evaluated using the Global Average Precision (GAP) metric at top 20 as used in the Youtube-8M Kaggle competition.
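For readers unfamiliar with the metric, here is a minimal sketch of GAP at top 20: the 20 highest-scoring labels of every video are pooled into one list, sorted by confidence, and average precision is computed over that single list. The function name and input layout are hypothetical, and normalizing by the total number of ground-truth labels is an assumption about how recall is defined.

```python
def gap_at_k(predictions, labels, k=20):
    """Global Average Precision over the pooled top-k predictions.

    predictions: one {label: score} dict per video.
    labels:      one set of ground-truth labels per video.
    """
    pooled = []
    for scores, truth in zip(predictions, labels):
        top = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]
        pooled.extend((score, label in truth) for label, score in top)

    # Rank every (video, label) prediction globally by confidence.
    pooled.sort(key=lambda pair: pair[0], reverse=True)

    total_positives = sum(len(truth) for truth in labels)  # assumed denominator
    if total_positives == 0:
        return 0.0

    hits, gap = 0, 0.0
    for rank, (_, is_hit) in enumerate(pooled, start=1):
        if is_hit:
            hits += 1
            gap += hits / rank  # precision at this rank; recall rises by 1/total
    return gap / total_positives

# The only true label is ranked second, so precision at the hit is 1/2.
print(gap_at_k([{"a": 0.9, "b": 0.8}], [{"b"}]))  # 0.5
```

Because all videos share one ranked list, a confident wrong prediction on one video lowers the precision credited to correct predictions on every other video.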

4.2. Results

**Comparison results on Youtube8M test set**

Two-stream sequence models and fast-forward sequence models achieve significantly better results than previous video pooling approaches.
The fast-forward LSTM model with depth 7 improves on the shallow sequence model by around 0.5% in terms of GAP.
The final submission ensembles 57 models with different hidden cells and depths, and ranks 3rd out of 650 teams in the challenge.

Reference

[2017 CVPRW] [Temporal Modeling Approaches]
Temporal Modeling Approaches for Large-scale Youtube-8M Video Understanding

Video Classification

2014 [Deep Video] [Two-Stream ConvNet] 2015 [DevNet] [C3D] 2017 [Temporal Modeling Approaches] [P3D]