Review — Temporal Modeling Approaches for Large-scale Youtube-8M Video Understanding (Video Classification)
3rd Place out of 650 Teams, Using Ensembles of 3 Temporal Modeling Approaches: Two-stream Sequence Model, Fast-Forward Sequence Model, Temporal Residual Neural Network
In this story, Temporal Modeling Approaches for Large-scale Youtube-8M Video Understanding, by Baidu IDL & Tsinghua University, is reviewed. In this paper:
- 3 Temporal Modeling Approaches are investigated: Two-stream Sequence Model, Fast-Forward Sequence Model, and Temporal Residual Neural Network.
- Finally, ensembles of these 3 Temporal Modeling Approaches ranked 3rd place in the Google Cloud & YouTube-8M Video Understanding Challenge.
This is a paper in 2017 CVPRW with over 40 citations. (Sik-Ho Tsang @ Medium)
- Proposed Two-stream Sequence Model
- Proposed Fast-Forward Sequence Model
- Proposed Temporal Residual Neural Network
- Experimental Results
1. Proposed Two-stream Sequence Model
- The original Two-Stream ConvNet trains CNNs with RGB and optical flow features separately.
- Here, the two-stream model is built upon bidirectional LSTMs, with one stream for RGB features and one for audio features.
- Attention layers, popularized in NLP, are inserted after the sequence models, and the attended feature vectors from the two modalities are then concatenated.
- Finally, the concatenated feature vector is fed sequentially into two fully-connected layers and a sigmoid layer for multi-label classification.
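The attend-concatenate-classify head described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the bidirectional LSTM encoders are omitted, so `rgb_seq` and `audio_seq` stand for their per-frame hidden states, and all function and parameter names are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_pool(seq, w):
    """Soft attention over time: score each frame with a learned
    vector w, then return the attention-weighted sum of frames.
    seq: (T, D) hidden states from a (bi-)LSTM; w: (D,)."""
    scores = softmax(seq @ w)            # (T,) attention weights
    return scores @ seq                  # (D,) attended feature

def two_stream_head(rgb_seq, audio_seq, w_rgb, w_audio, fc1, fc2):
    """Attend each modality separately, concatenate the attended
    vectors, then apply two fully-connected layers and a sigmoid
    for multi-label classification (shapes are illustrative)."""
    fused = np.concatenate([attention_pool(rgb_seq, w_rgb),
                            attention_pool(audio_seq, w_audio)])
    hidden = np.maximum(fused @ fc1, 0)            # FC + ReLU
    return 1.0 / (1.0 + np.exp(-(hidden @ fc2)))   # per-class sigmoid
```

The sigmoid (rather than softmax) output is what makes the head multi-label: each class probability is predicted independently.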
2. Proposed Fast-Forward Sequence Model
- Naively increasing the depth of LSTMs and GRUs leads to overfitting and optimization difficulties.
- A novel deep LSTM/GRU architecture is explored by adding fast-forward connections to the sequence models; these connections play an essential role in building a sequence model with 7 bidirectional LSTM layers.
- The RGB and audio features of each frame are firstly concatenated together and then fed into the fast-forward sequence model.
- The fast-forward connections are added between two feed-forward computation blocks of adjacent recurrent layers.
- Each fast-forward connection takes the outputs of the previous fast-forward connection and the previous recurrent layer as input, and a fully-connected layer is used to embed them.
- The fast-forward connections provide a fast path for information to propagate; the idea is similar to the skip connections in ResNet.
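The dataflow of the fast-forward connections can be sketched as below. This is a shape-level NumPy sketch under assumptions: `rnn_blocks` are stand-ins for the paper's bidirectional LSTM/GRU computation blocks, and the exact wiring of the fast-forward path in the paper may differ in detail.

```python
import numpy as np

def fast_forward_stack(x, rnn_blocks, ff_weights):
    """Sketch of fast-forward connections between stacked recurrent
    layers: each connection concatenates the previous fast-forward
    output with the current recurrent output and embeds them with a
    fully-connected layer, giving gradients a short path through the
    deep stack (similar in spirit to ResNet skip connections).
    x: (T, D) input sequence;
    rnn_blocks[i]: callable (T, D) -> (T, D), stand-in for a bi-LSTM;
    ff_weights[i]: (2D, D) fully-connected embedding weights."""
    ff = x                                        # initial fast-forward path
    for rnn, w in zip(rnn_blocks, ff_weights):
        h = rnn(ff)                               # recurrent computation block
        ff = np.concatenate([ff, h], axis=1) @ w  # fast-forward connection
    return ff
```

Because each layer's output mixes the untransformed fast-forward path with the recurrent output, information need not pass through every recurrent transformation to reach the top of the stack.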
3. Proposed Temporal Residual Neural Network
- Convolutional and recurrent neural networks are combined to take advantage of both models.
- Temporal convolutional neural networks are used to transform the original frame-level features into a more discriminative feature sequence, and LSTMs are used for the final classification.
- The RGB and audio features of each frame are concatenated, and zero-valued features are padded to obtain fixed-length sequences.
- The size of the resulting input data is 4000×1152×300, where 4000, 1152, and 300 denote the mini-batch size, the channel number, and the number of frames, respectively.
- The batch data is propagated through a Temporal ResNet, which is a stack of 9 Temporal ResNet Blocks (TRBs).
- Each TRB consists of two temporal convolutional layers (followed by batch norm and activation), and a shortcut connection.
- 1024 3×1 filters are used for all the temporal convolution layers.
- The output of the temporal CNN is then fed into a bidirectional LSTM with attention.
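A TRB as described above can be sketched in NumPy. This is an illustrative sketch, not the paper's code: batch norm is omitted for brevity, the channel/time sizes in the test are shrunk from the paper's 1152 channels and 300 frames, and the naive convolution loop is for clarity rather than speed.

```python
import numpy as np

def temporal_conv(x, w):
    """1-D convolution over the time axis with 'same' zero padding.
    x: (C_in, T) feature sequence; w: (C_out, C_in, 3) bank of 3x1
    temporal filters (the paper uses 1024 such filters per layer)."""
    c_out, c_in, k = w.shape
    pad = np.pad(x, ((0, 0), (k // 2, k // 2)))
    out = np.empty((c_out, x.shape[1]))
    for t in range(x.shape[1]):
        # correlate every output filter with the k-frame window at t
        out[:, t] = np.tensordot(w, pad[:, t:t + k], axes=([1, 2], [0, 1]))
    return out

def trb(x, w1, w2):
    """Temporal ResNet Block sketch: two temporal convolutions with
    ReLU activations (batch norm omitted), plus the identity shortcut,
    mirroring a ResNet basic block along the time axis."""
    h = np.maximum(temporal_conv(x, w1), 0)
    return np.maximum(temporal_conv(h, w2) + x, 0)  # shortcut connection
```

Stacking 9 such blocks, as the paper does, preserves the (channels, time) shape, so the refined sequence can be handed directly to the bidirectional LSTM with attention.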
4. Experimental Results
- The YouTube-8M dataset contains around 7 million YouTube videos. Each video is annotated with one or multiple tags. In the competition, visual and audio features are pre-extracted and provided with the dataset for each second of the video.
- Visual features are obtained from a Google Inception CNN pre-trained on ImageNet, followed by PCA compression into a 1024-dimensional vector.
- Audio features are extracted from a pre-trained VGG-style audio network.
- In the official split, the dataset is divided into three parts: 70% for training, 20% for validation, and 10% for testing.
- In practice, the authors keep only 60K videos from the official validation set to cross-validate hyperparameters; the remaining validation videos are added to the training set.
- Results are evaluated using the Global Average Precision (GAP) metric at top 20, as used in the YouTube-8M Kaggle competition.
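The GAP@20 metric can be computed as follows. This sketch assumes the commonly used formulation from the Kaggle competition: pool the top-k (confidence, correctness) pairs from every video, sort them globally by confidence, and compute average precision over the pooled list.

```python
import numpy as np

def gap_at_k(predictions, labels, k=20):
    """Global Average Precision at top k.
    predictions: list of (C,) score arrays, one per video;
    labels: list of (C,) binary label arrays, one per video."""
    conf, correct = [], []
    for p, l in zip(predictions, labels):
        top = np.argsort(p)[::-1][:k]     # indices of the k highest scores
        conf.extend(p[top])
        correct.extend(l[top])
    order = np.argsort(conf)[::-1]        # global sort by confidence
    correct = np.asarray(correct, dtype=float)[order]
    total_positives = sum(l.sum() for l in labels)
    # precision at each rank of the pooled, globally sorted list
    precisions = np.cumsum(correct) / (np.arange(len(correct)) + 1)
    return float(np.sum(precisions * correct) / total_positives)
```

Note that GAP rewards ranking every true label above every false prediction across the whole test set, not just within each video.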
- Two-stream sequence models and fast-forward sequence models achieve significantly better results than previous video pooling approaches.
- The fast-forward LSTM model with depth 7 boosts the shallow sequence model by around 0.5% in terms of GAP.
- The final submission ensembles 57 models with different numbers of hidden cells and depths, and ranked 3rd place out of 650 teams in the challenge.
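One common way to combine such a pool of models is a weighted average of their per-class probabilities; the paper's exact weighting scheme is not detailed here, so the sketch below is an assumption, with uniform weights as the default.

```python
import numpy as np

def ensemble(prob_list, weights=None):
    """Weighted average of per-model probability matrices.
    prob_list: M arrays of shape (V videos, C classes);
    weights: (M,) mixing weights, uniform if not given."""
    probs = np.stack(prob_list)                   # (M, V, C)
    if weights is None:
        weights = np.full(len(prob_list), 1.0 / len(prob_list))
    return np.tensordot(weights, probs, axes=1)   # (V, C) fused scores
```

In practice the mixing weights could be tuned on the held-out 60K validation videos to maximize GAP.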